[RFC,V2] ext4: flush delalloc blocks when space is low

Message ID 4ADF6628.9080105@redhat.com
State Accepted, archived

Commit Message

Eric Sandeen Oct. 21, 2009, 7:51 p.m. UTC
Creating many small files in rapid succession on a small
filesystem can lead to spurious ENOSPC; on a 104MB filesystem:

for i in `seq 1 22500`; do
    echo -n > $SCRATCH_MNT/$i
    echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
done

leads to ENOSPC even though after a sync, 40% of the fs is free
again.

This is because we reserve worst-case metadata for delalloc writes,
and when the data is actually allocated, that worst-case reservation
turns out not to have been needed.

I've added 2 flushers here:

 * when free space is low compared to dirty blocks, do an async flush
 * when we get a hard ENOSPC, do a sync flush before retry
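
To make the async trigger concrete, here is a minimal userspace sketch of
the threshold checks that the ext4_nonda_switch() hunk in this patch
performs. It is a model, not kernel code: classify(), enum da_action and
the WATERMARK value are hypothetical names and numbers chosen for
illustration (the real EXT4_FREEBLOCKS_WATERMARK is computed by the
kernel and is not stated in this message).

```c
/* Hypothetical stand-in for EXT4_FREEBLOCKS_WATERMARK (assumed value). */
#define WATERMARK 64

/* Possible outcomes of the free-space check (names are illustrative). */
enum da_action {
	DA_CONTINUE,        /* plenty of space: keep using delalloc */
	DA_START_WRITEBACK, /* nearing capacity: kick off an async flush */
	DA_SWITCH_NONDA     /* too tight: fall back to non-delalloc */
};

/*
 * Model of the checks in ext4_nonda_switch() after this patch:
 *  - switch to non-delalloc when free blocks drop below 150% of dirty
 *    blocks, or below dirty blocks plus the watermark;
 *  - otherwise start writeback once free blocks drop below twice the
 *    dirty blocks;
 *  - otherwise do nothing.
 */
static enum da_action classify(long long free_blocks, long long dirty_blocks)
{
	if (2 * free_blocks < 3 * dirty_blocks ||
	    free_blocks < dirty_blocks + WATERMARK)
		return DA_SWITCH_NONDA;

	if (free_blocks < 2 * dirty_blocks)
		return DA_START_WRITEBACK;

	return DA_CONTINUE;
}
```

With these assumed numbers, 1000 dirty blocks and 1800 free blocks land in
the window where writeback is started but delalloc stays enabled; at 1400
free blocks the 150% check already forces the non-delalloc path.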

This resolves the testcase for me, and survives all 4 generic
ENOSPC tests in xfstests.

V2: don't try to sync if we're still in a (probably nested) transaction.

Thanks to Josef for pointing out that possibility.
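
The retry decision, including the V2 check for a running transaction, can
be sketched as a small userspace model of the ext4_should_retry_alloc()
hunk in this patch. Everything here is illustrative: struct alloc_state,
enum retry_action and should_retry() are hypothetical names, and the
boolean fields stand in for ext4_has_free_blocks(), the s_journal pointer
and ext4_journal_current_handle().

```c
#include <stdbool.h>

/* Outcomes of the retry decision (names are illustrative). */
enum retry_action {
	RETRY_GIVE_UP,      /* return 0: ENOSPC goes back to the caller */
	RETRY_SYNC_FLUSH,   /* flush delalloc data synchronously, then retry */
	RETRY_FORCE_COMMIT  /* fall through to the forced journal commit */
};

/* Hypothetical snapshot of the state the real function inspects. */
struct alloc_state {
	bool has_free_blocks;   /* models ext4_has_free_blocks(sbi, 1) */
	long long dirty_blocks; /* models the delalloc dirty-block counter */
	int retries;            /* models *retries */
	bool has_journal;       /* models EXT4_SB(sb)->s_journal != NULL */
	bool in_transaction;    /* models ext4_journal_current_handle() */
};

static enum retry_action should_retry(struct alloc_state *s)
{
	bool no_space = !s->has_free_blocks;

	/* Give up if nothing can be reclaimed, after too many attempts,
	 * or when there is no journal. As in the patch, the retry count
	 * is only bumped if the first condition did not short-circuit. */
	if ((no_space && s->dirty_blocks == 0) ||
	    s->retries++ > 3 ||
	    !s->has_journal)
		return RETRY_GIVE_UP;

	/* V2: only attempt the sync flush outside a running transaction. */
	if (no_space && s->dirty_blocks > 0 && !s->in_transaction)
		return RETRY_SYNC_FLUSH;

	return RETRY_FORCE_COMMIT;
}
```

In the patch itself the last case returns whatever
jbd2_journal_force_commit_nested() reports, so it may still end without a
retry; the model only shows which path is taken.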

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jan Kara Nov. 5, 2009, 2:09 p.m. UTC | #1
> Creating many small files in rapid succession on a small
> filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
> 
> for i in `seq 1 22500`; do
>     echo -n > $SCRATCH_MNT/$i
>     echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
> done
> 
> leads to ENOSPC even though after a sync, 40% of the fs is free
> again.
> 
> This is because we reserve worst-case metadata for delalloc writes,
> and when data is allocated that worst-case reservation was not
> needed.
> 
> I've added 2 flushers here:
> 
>  * when free space is low compared to dirty blocks, do an async flush
>  * when we get a hard ENOSPC, do a sync flush before retry
> 
> This resolves the testcase for me, and survives all 4 generic
> ENOSPC tests in xfstests.
> 
> V2: don't try to sync if we're still in a (probably nested) transaction.
> 
> Thanks to Josef for pointing out that possibility.
  I still think it's deadlockable... See below.

> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 1d04189..28bde58 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -605,11 +605,27 @@ int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
>   */
>  int ext4_should_retry_alloc(struct super_block *sb, int *retries)
>  {
> -	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
> +	s64 dirtyblocks = 0;
> +	struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
> +
> +	if (test_opt(sb, DELALLOC))
> +		dirtyblocks = percpu_counter_read_positive(dbc);
> +
> +	if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
>  	    (*retries)++ > 3 ||
>  	    !EXT4_SB(sb)->s_journal)
>  		return 0;
>  
> +	/* try a sync to flush delalloc space & free resvd metadata */
> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> +		if (!ext4_journal_current_handle()) {
> +			down_read(&sb->s_umount);
> +			sync_inodes_sb(sb);
> +			up_read(&sb->s_umount);
  ext4_should_retry_alloc() is called quite deep from the filesystem. In
particular we can hold i_mutex of some inodes etc. So I'd almost bet
that taking s_umount sem here violates lock ranking in some code paths
(an easy check would be to enable lockdep and stress the filesystem a
bit).
  Also, calling sync_inodes_sb() with i_mutex held just seems like a bad
thing to do, although I don't see where it could deadlock, so it's
probably just a matter of taste...
  If we start writeback from ext4_nonda_switch as you do below, I think
that we should get decent results even without synchronous writeback in
the allocation path (maybe we'd need to tweak the logic in
ext4_nonda_switch a bit to give the writeback thread more time to
catch up).

								Honza

> +			return 1;
> +		}
> +	}
> +
>  	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
>  
>  	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5c5bc5d..27c8b9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,11 +3024,18 @@ static int ext4_nonda_switch(struct super_block *sb)
>  	if (2 * free_blocks < 3 * dirty_blocks ||
>  		free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
>  		/*
> -		 * free block count is less that 150% of dirty blocks
> -		 * or free blocks is less that watermark
> +		 * free block count is less than 150% of dirty blocks
> +		 * or free blocks is less than watermark
>  		 */
>  		return 1;
>  	}
> +	/*
> +	 * Even if we don't switch but are nearing capacity,
> +	 * start pushing delalloc when 1/2 of free blocks are dirty.
> +	 */
> +	if (free_blocks < 2 * dirty_blocks)
> +		writeback_inodes_sb(sb);
> +
>  	return 0;
>  }
Eric Sandeen Nov. 5, 2009, 3:45 p.m. UTC | #2
Jan Kara wrote:
...

>> +	/* try a sync to flush delalloc space & free resvd metadata */
>> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
>> +		if (!ext4_journal_current_handle()) {
>> +			down_read(&sb->s_umount);
>> +			sync_inodes_sb(sb);
>> +			up_read(&sb->s_umount);
>   ext4_should_retry_alloc() is called quite deep from the filesystem. In
> particular we can hold i_mutex of some inodes etc. So I'd almost bet
> that taking s_umount sem here violates lock ranking in some code paths
> (an easy check would be to enable lockdep and stress the filesystem a
> bit).
>   Also, calling sync_inodes_sb() with i_mutex held just seems like a bad
> thing to do, although I don't see where it could deadlock, so it's
> probably just a matter of taste...

Well, to be honest I agree with you ;)  It does still feel like a hack.

>   If we start writeback from ext4_nonda_switch as you do below, I think
> that we should get decent results even without synchronous writeback in
> the allocation path (maybe we'd need to tweak the logic in
> ext4_nonda_switch a bit to give the writeback thread more time to
> catch up).

I think starting writeback helps a lot, but it seems that in the end we
still need a synchronous attempt when we hit a real ENOSPC... after I
finish dealing with this corruption thing I'll come back and look at this.

Maybe we should put the writeback in for now, and worry about the
synchronous sync-up later?

Thanks for the review,

-Eric

> 								Honza
Jan Kara Nov. 5, 2009, 4:05 p.m. UTC | #3
> Jan Kara wrote:
> ...
> 
> >> +	/* try a sync to flush delalloc space & free resvd metadata */
> >> +	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
> >> +		if (!ext4_journal_current_handle()) {
> >> +			down_read(&sb->s_umount);
> >> +			sync_inodes_sb(sb);
> >> +			up_read(&sb->s_umount);
> >   ext4_should_retry_alloc() is called quite deep from the filesystem. In
> > particular we can hold i_mutex of some inodes etc. So I'd almost bet
> > that taking s_umount sem here violates lock ranking in some code paths
> > (an easy check would be to enable lockdep and stress the filesystem a
> > bit).
> >   Also, calling sync_inodes_sb() with i_mutex held just seems like a bad
> > thing to do, although I don't see where it could deadlock, so it's
> > probably just a matter of taste...
> 
> Well, to be honest I agree with you ;)  It does still feel like a hack.
> 
> >   If we start writeback from ext4_nonda_switch as you do below, I think
> > that we should get decent results even without synchronous writeback in
> > the allocation path (maybe we'd need to tweak the logic in
> > ext4_nonda_switch a bit to give the writeback thread more time to
> > catch up).
> 
> I think starting writeback helps a lot, but it seems that in the end we
> still need a synchronous attempt when we hit a real ENOSPC... after I
> finish dealing with this corruption thing I'll come back and look at this.
  Without the synchronous attempt, it will never be perfect, that is
correct. But it could be quite close to perfect...

> Maybe we should put the writeback in for now, and worry about the
> synchronous sync-up later?
  Yes, I'd do that for now.

									Honza

Patch

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1d04189..28bde58 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -605,11 +605,27 @@  int ext4_claim_free_blocks(struct ext4_sb_info *sbi,
  */
 int ext4_should_retry_alloc(struct super_block *sb, int *retries)
 {
-	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) ||
+	s64 dirtyblocks = 0;
+	struct percpu_counter *dbc = &EXT4_SB(sb)->s_dirtyblocks_counter;
+
+	if (test_opt(sb, DELALLOC))
+		dirtyblocks = percpu_counter_read_positive(dbc);
+
+	if ((!ext4_has_free_blocks(EXT4_SB(sb), 1) && !dirtyblocks) ||
 	    (*retries)++ > 3 ||
 	    !EXT4_SB(sb)->s_journal)
 		return 0;
 
+	/* try a sync to flush delalloc space & free resvd metadata */
+	if (!ext4_has_free_blocks(EXT4_SB(sb), 1) && dirtyblocks) {
+		if (!ext4_journal_current_handle()) {
+			down_read(&sb->s_umount);
+			sync_inodes_sb(sb);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+
 	jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
 
 	return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c5bc5d..27c8b9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,11 +3024,18 @@  static int ext4_nonda_switch(struct super_block *sb)
 	if (2 * free_blocks < 3 * dirty_blocks ||
 		free_blocks < (dirty_blocks + EXT4_FREEBLOCKS_WATERMARK)) {
 		/*
-		 * free block count is less that 150% of dirty blocks
-		 * or free blocks is less that watermark
+		 * free block count is less than 150% of dirty blocks
+		 * or free blocks is less than watermark
 		 */
 		return 1;
 	}
+	/*
+	 * Even if we don't switch but are nearing capacity,
+	 * start pushing delalloc when 1/2 of free blocks are dirty.
+	 */
+	if (free_blocks < 2 * dirty_blocks)
+		writeback_inodes_sb(sb);
+
 	return 0;
 }