diff mbox

Filesystem writes on RAID5 too slow

Message ID 20131121234116.GD6502@dastard
State New, archived
Headers show

Commit Message

Dave Chinner Nov. 21, 2013, 11:41 p.m. UTC
On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
> $ uname -a
> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
> i686 GNU/Linux

Oh, it's 32 bit system. Things you don't know from the obfuscating
codenames everyone uses these days...

> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
> $ mount -t xfs /dev/md0 /tmp/diskmnt/
> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
....
> $ cat /proc/mounts
> (...)
> /dev/md0 /tmp/diskmnt xfs
> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0

sunit/swidth is 512k/1MB

> # same layout for other disks
> $ fdisk -c -u /dev/sda
....
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1            2048    20565247    10281600   83  Linux

Aligned to 1 MB.

> /dev/sda2        20565248  1953525167   966479960   83  Linux

And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
aligned to 4k, though, so there shouldn't be any hardware RMW
cycles.

> $ xfs_info /dev/md0
> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>          =                       sectsz=4096  attr=2
> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>          =                       sunit=12

sunit/swidth of 512k/1MB, so it matches the MD device.

> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
> /tmp/diskmnt/filewr.zero:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> # this does not look good, does it?

Yup, looks broken.

/me digs through git. 

Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
the code that sets stripe unit alignment for the initial allocation
way back in 3.2.

[ Hmmm, that would explain the very occasional failure that
generic/223 throws outi (maybe once a month I see it fail). ]

Which means MD is doing RMW cycles for it's parity calculations, and
that's where performance is going south.

Current code:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
   0: [0..2097151]:    1056..2098207     0 (1056..2098207)  2097152 11111
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
$

Which indicates that even if we take direct IO based allocation out
of the picture, the allocation does not get aligned properly. This
in on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.

With a fixed kernel:

$ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    6293504..8390655  0 (6293504..8390655) 2097152 10000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
$

It;s clear we have completely stripe swidth aligned allocation and it's 25% faster.

Take fallocate out of the picture so the direct IO does the
allocation:

$ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
testfile:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
   0: [0..2097151]:    2099200..4196351  0 (2099200..4196351) 2097152 00000
 FLAG Values:
    010000 Unwritten preallocated extent
    001000 Doesn't begin on stripe unit
    000100 Doesn't end   on stripe unit
    000010 Doesn't begin on stripe width
    000001 Doesn't end   on stripe width

It's slower than with preallocation (no surprise - no allocation
overhead per write(2) call after preallocation is done) but the
allocation is still correctly aligned.

The patch below should fix the unaligned allocation problem you are
seeing, but because XFS defaults to stripe unit alignment for large
allocations, you might still see RMW cycles when it aligns to a
stripe unit that is not the first in a MD stripe. I'll have a quick
look at fixing that behaviour when the swalloc mount option is
specified....

Cheers,

Dave.

Comments

Christoph Hellwig Nov. 22, 2013, 9:21 a.m. UTC | #1
> From: Dave Chinner <dchinner@redhat.com>
> 
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
> 
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is cause write performance to drop in
> alignment sensitive configurations.
> 
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Ooops.  The fix looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


Might be worth cooking up a test for this, scsi_debug can expose
geometry, and we already have it wired to to large sector size
testing in xfstests.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Martin Boutin Nov. 22, 2013, 1:33 p.m. UTC | #2
Dave, I just applied your patch in my vanilla 3.10.10 Linux. Here are
the new performance figures for XFS:

$ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 4.95292 s, 212 MB/s

: )
So things make more sense now... I hit a bug in XFS and ext3 and ufs
do not support some kind of multiblock allocation.

Thank you all,
- Martin

On Thu, Nov 21, 2013 at 6:41 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Nov 21, 2013 at 08:31:38AM -0500, Martin Boutin wrote:
>> $ uname -a
>> Linux haswell1 3.10.10 #1 SMP PREEMPT Wed Oct 2 11:22:22 CEST 2013
>> i686 GNU/Linux
>
> Oh, it's 32 bit system. Things you don't know from the obfuscating
> codenames everyone uses these days...
>
>> $ mkfs.xfs -s size=4096 -f -l size=32m /dev/md0
>> $ mount -t xfs /dev/md0 /tmp/diskmnt/
>> $ dd if=/dev/zero of=/tmp/diskmnt/filewr.zero bs=1M count=1000 oflag=direct
>> 1000+0 records in
>> 1000+0 records out
>> 1048576000 bytes (1.0 GB) copied, 28.0304 s, 37.4 MB/s
> ....
>> $ cat /proc/mounts
>> (...)
>> /dev/md0 /tmp/diskmnt xfs
>> rw,relatime,attr2,inode64,sunit=1024,swidth=2048,noquota 0 0
>
> sunit/swidth is 512k/1MB
>
>> # same layout for other disks
>> $ fdisk -c -u /dev/sda
> ....
>>    Device Boot      Start         End      Blocks   Id  System
>> /dev/sda1            2048    20565247    10281600   83  Linux
>
> Aligned to 1 MB.
>
>> /dev/sda2        20565248  1953525167   966479960   83  Linux
>
> And that isn't aligned to 1MB. 20565248 / 2048 = 10041.625. It is
> aligned to 4k, though, so there shouldn't be any hardware RMW
> cycles.
>
>> $ xfs_info /dev/md0
>> meta-data=/dev/md0               isize=256    agcount=32, agsize=15101312 blks
>>          =                       sectsz=4096  attr=2
>> data     =                       bsize=4096   blocks=483239168, imaxpct=5
>>          =                       sunit=12
>
> sunit/swidth of 512k/1MB, so it matches the MD device.
>
>> $ xfs_bmap -vvp /tmp/diskmnt/filewr.zero
>> /tmp/diskmnt/filewr.zero:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>>    0: [0..2047999]:    2049056..4097055  0 (2049056..4097055) 2048000 01111
>>  FLAG Values:
>>     010000 Unwritten preallocated extent
>>     001000 Doesn't begin on stripe unit
>>     000100 Doesn't end   on stripe unit
>>     000010 Doesn't begin on stripe width
>>     000001 Doesn't end   on stripe width
>> # this does not look good, does it?
>
> Yup, looks broken.
>
> /me digs through git.
>
> Yup, commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") broke
> the code that sets stripe unit alignment for the initial allocation
> way back in 3.2.
>
> [ Hmmm, that would explain the very occasional failure that
> generic/223 throws outi (maybe once a month I see it fail). ]
>
> Which means MD is doing RMW cycles for it's parity calculations, and
> that's where performance is going south.
>
> Current code:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET          TOTAL FLAGS
>    0: [0..2097151]:    1056..2098207     0 (1056..2098207)  2097152 11111
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1024 ops; 0:00:02.00 (343.815 MiB/sec and 268.6054 ops/sec)
> $
>
> Which indicates that even if we take direct IO based allocation out
> of the picture, the allocation does not get aligned properly. This
> in on a 3.5TB 12 SAS disk MD RAID6 with sunit=64k,swidth=640k.
>
> With a fixed kernel:
>
> $ xfs_io -fd -c "truncate 0" -c "falloc 0 1g" -c "bmap -vvp" -c "pwrite 0 1g -b 1280k" testfile
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2097151]:    6293504..8390655  0 (6293504..8390655) 2097152 10000
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (415.192 MiB/sec and 332.4779 ops/sec)
> $
>
> It;s clear we have completely stripe swidth aligned allocation and it's 25% faster.
>
> Take fallocate out of the picture so the direct IO does the
> allocation:
>
> $ xfs_io -fd -c "truncate 0" -c "pwrite 0 1g -b 1280k" -c "bmap -vvp" testfile
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 820 ops; 0:00:02.00 (368.241 MiB/sec and 294.8807 ops/sec)
> testfile:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET            TOTAL FLAGS
>    0: [0..2097151]:    2099200..4196351  0 (2099200..4196351) 2097152 00000
>  FLAG Values:
>     010000 Unwritten preallocated extent
>     001000 Doesn't begin on stripe unit
>     000100 Doesn't end   on stripe unit
>     000010 Doesn't begin on stripe width
>     000001 Doesn't end   on stripe width
>
> It's slower than with preallocation (no surprise - no allocation
> overhead per write(2) call after preallocation is done) but the
> allocation is still correctly aligned.
>
> The patch below should fix the unaligned allocation problem you are
> seeing, but because XFS defaults to stripe unit alignment for large
> allocations, you might still see RMW cycles when it aligns to a
> stripe unit that is not the first in a MD stripe. I'll have a quick
> look at fixing that behaviour when the swalloc mount option is
> specified....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
> xfs: align initial file allocations correctly.
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
>
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is cause write performance to drop in
> alignment sensitive configurations.
>
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 3ef11b2..8401f11 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -1635,7 +1635,7 @@ xfs_bmap_last_extent(
>   * blocks at the end of the file which do not start at the previous data block,
>   * we will try to align the new blocks at stripe unit boundaries.
>   *
> - * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
> + * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
>   * at, or past the EOF.
>   */
>  STATIC int
> @@ -1650,9 +1650,14 @@ xfs_bmap_isaeof(
>         bma->aeof = 0;
>         error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
>                                      &is_empty);
> -       if (error || is_empty)
> +       if (error)
>                 return error;
>
> +       if (is_empty) {
> +               bma->aeof = 1;
> +               return 0;
> +       }
> +
>         /*
>          * Check if we are allocation or past the last extent, or at least into
>          * the last delayed allocated extent.
Dave Chinner Nov. 22, 2013, 10:40 p.m. UTC | #3
On Fri, Nov 22, 2013 at 01:21:36AM -0800, Christoph Hellwig wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> > 
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is cause write performance to drop in
> > alignment sensitive configurations.
> > 
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Ooops.  The fix looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> 
> Might be worth cooking up a test for this, scsi_debug can expose
> geometry, and we already have it wired to to large sector size
> testing in xfstests.

We don't need to screw around with the sector size - that is
irrelevant to the problem, and we have an allocation alignment
test that is supposed to catch these issues: generic/223.

As I said, I have seen occasional failures of that test (once a
month, on average) as a result of this bug. It was simply not often
enough - running in a hard loop didn't increase the frequency of
failures - to be able debug it or to reach my "there's a regression
I need to look at" threshold. Perhaps we need to revisit that test
and see if we can make it more likely to trigger failures...

Cheers,

Dave.
Christoph Hellwig Nov. 23, 2013, 8:41 a.m. UTC | #4
On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > geometry, and we already have it wired to to large sector size
> > testing in xfstests.
> 
> We don't need to screw around with the sector size - that is
> irrelevant to the problem, and we have an allocation alignment
> test that is supposed to catch these issues: generic/223.

It didn't imply we need large sector sizes, but the same mechanism
to expodse a large sector size can also be used to present large
stripe units/width.

> As I said, I have seen occasional failures of that test (once a
> month, on average) as a result of this bug. It was simply not often
> enough - running in a hard loop didn't increase the frequency of
> failures - to be able debug it or to reach my "there's a regression
> I need to look at" threshold. Perhaps we need to revisit that test
> and see if we can make it more likely to trigger failures...

Seems like 233 should have cought it regularly with the explicit
alignment options on mkfs time.  Maybe we also need a test mirroring
the plain dd more closely?

I've not seen 233 fail for a long time..
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Nov. 24, 2013, 11:21 p.m. UTC | #5
On Sat, Nov 23, 2013 at 12:41:06AM -0800, Christoph Hellwig wrote:
> On Sat, Nov 23, 2013 at 09:40:38AM +1100, Dave Chinner wrote:
> > > geometry, and we already have it wired to to large sector size
> > > testing in xfstests.
> > 
> > We don't need to screw around with the sector size - that is
> > irrelevant to the problem, and we have an allocation alignment
> > test that is supposed to catch these issues: generic/223.
> 
> It didn't imply we need large sector sizes, but the same mechanism
> to expodse a large sector size can also be used to present large
> stripe units/width.
> 
> > As I said, I have seen occasional failures of that test (once a
> > month, on average) as a result of this bug. It was simply not often
> > enough - running in a hard loop didn't increase the frequency of
> > failures - to be able debug it or to reach my "there's a regression
> > I need to look at" threshold. Perhaps we need to revisit that test
> > and see if we can make it more likely to trigger failures...
> 
> Seems like 233 should have cought it regularly with the explicit
> alignment options on mkfs time.  Maybe we also need a test mirroring
> the plain dd more closely?

Preallocation showed the problem, too, so we probably don't even
need dd to check whether allocation alignment is working properly.
We should probably write a test that spefically checks all the
different anlignment/extent size combinations we can use.

Preallocation should behave very similarly to direct IO, but I'm
pretty sure that it won't do things like round up allocations to
stripe unit/widths like direct IO does. The fact that we do
allocation sunit/swidth size alignment for direct Io outside the
allocator and sunit/swidth offset alignment inside the allocation is
kinda funky....

> I've not seen 233 fail for a long time..

Not surprising, it is a one in several hundred test runs occurrence
here...

Cheers,

Dave.
Christoph Hellwig Dec. 10, 2013, 7:18 p.m. UTC | #6
> xfs: align initial file allocations correctly.
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The function xfs_bmap_isaeof() is used to indicate that an
> allocation is occurring at or past the end of file, and as such
> should be aligned to the underlying storage geometry if possible.
> 
> Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> behaviour of this function for empty files - it turned off
> allocation alignment for this case accidentally. Hence large initial
> allocations from direct IO are not getting correctly aligned to the
> underlying geometry, and that is cause write performance to drop in
> alignment sensitive configurations.
> 
> Fix it by considering allocation into empty files as requiring
> aligned allocation again.

Seems like this one didn't get picked up yet?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Dec. 11, 2013, 12:27 a.m. UTC | #7
On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > xfs: align initial file allocations correctly.
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The function xfs_bmap_isaeof() is used to indicate that an
> > allocation is occurring at or past the end of file, and as such
> > should be aligned to the underlying storage geometry if possible.
> > 
> > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > behaviour of this function for empty files - it turned off
> > allocation alignment for this case accidentally. Hence large initial
> > allocations from direct IO are not getting correctly aligned to the
> > underlying geometry, and that is cause write performance to drop in
> > alignment sensitive configurations.
> > 
> > Fix it by considering allocation into empty files as requiring
> > aligned allocation again.
> 
> Seems like this one didn't get picked up yet?

I'm about to resend all my outstanding patches...

Cheers,

Dave.
Ben Myers Dec. 11, 2013, 7:09 p.m. UTC | #8
Hi,

On Wed, Dec 11, 2013 at 11:27:53AM +1100, Dave Chinner wrote:
> On Tue, Dec 10, 2013 at 11:18:03AM -0800, Christoph Hellwig wrote:
> > > xfs: align initial file allocations correctly.
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The function xfs_bmap_isaeof() is used to indicate that an
> > > allocation is occurring at or past the end of file, and as such
> > > should be aligned to the underlying storage geometry if possible.
> > > 
> > > Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the
> > > behaviour of this function for empty files - it turned off
> > > allocation alignment for this case accidentally. Hence large initial
> > > allocations from direct IO are not getting correctly aligned to the
> > > underlying geometry, and that is cause write performance to drop in
> > > alignment sensitive configurations.
> > > 
> > > Fix it by considering allocation into empty files as requiring
> > > aligned allocation again.
> > 
> > Seems like this one didn't get picked up yet?
> 
> I'm about to resend all my outstanding patches...

Sorry I didn't see that one.  If you stick the keyword 'patch' in the subject I
tend to do a bit better.

Regards,
	Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 3ef11b2..8401f11 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -1635,7 +1635,7 @@  xfs_bmap_last_extent(
  * blocks at the end of the file which do not start at the previous data block,
  * we will try to align the new blocks at stripe unit boundaries.
  *
- * Returns 0 in bma->aeof if the file (fork) is empty as any new write will be
+ * Returns 1 in bma->aeof if the file (fork) is empty as any new write will be
  * at, or past the EOF.
  */
 STATIC int
@@ -1650,9 +1650,14 @@  xfs_bmap_isaeof(
 	bma->aeof = 0;
 	error = xfs_bmap_last_extent(NULL, bma->ip, whichfork, &rec,
 				     &is_empty);
-	if (error || is_empty)
+	if (error)
 		return error;
 
+	if (is_empty) {
+		bma->aeof = 1;
+		return 0;
+	}
+
 	/*
 	 * Check if we are allocation or past the last extent, or at least into
 	 * the last delayed allocated extent.