Patchwork fix up lock order reversal in writeback

login
register
mail settings
Submitter Nick Piggin
Date Nov. 16, 2010, 11 a.m.
Message ID <20101116110058.GA4298@amd>
Download mbox | patch
Permalink /patch/71383/
State Not Applicable
Headers show

Comments

Nick Piggin - Nov. 16, 2010, 11 a.m.
I saw a lock order warning on ext4 trigger. This should solve it.
Raciness shouldn't matter much, because writeback can stop just
after we make the test and return anyway (so the API is racy anyway).

Signed-off-by: Nick Piggin <npiggin@kernel.dk>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 16, 2010, 1:01 p.m.
On Tue 16-11-10 22:00:58, Nick Piggin wrote:
> I saw a lock order warning on ext4 trigger. This should solve it.
> Raciness shouldn't matter much, because writeback can stop just
> after we make the test and return anyway (so the API is racy anyway).
  Hmm, for now the fix is OK. Ultimately, we probably want to call
writeback_inodes_sb() directly from all the callers. They all just want to
reduce uncertainty of delayed allocation reservations by writing delayed
data and actually wait for some of the writeback to happen before they
retry again the allocation.

Although the callers generally cannot get umount_sem because they hold
other locks, they have the superblock well pinned so grabbing umount_sem
makes sense mostly to make assertions happy. But as I'm thinking about it,
trylock *is* maybe the right answer to this anyway...

So
Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb(sb);
> -		up_read(&sb->s_umount);
> -		return 1;
> -	} else
> -		return 0;
> +		if (down_read_trylock(&sb->s_umount)) {
> +			writeback_inodes_sb(sb);
> +			up_read(&sb->s_umount);
> +			return 1;
> +		}
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>  
> @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>  				   unsigned long nr)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb_nr(sb, nr);
> -		up_read(&sb->s_umount);
> -		return 1;
> -	} else
> -		return 0;
> +		if (down_read_trylock(&sb->s_umount)) {
> +			writeback_inodes_sb_nr(sb, nr);
> +			up_read(&sb->s_umount);
> +			return 1;
> +		}
> +	}
> +	return 0;
>  }
>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 16, 2010, 8:32 p.m.
On Tue, 16 Nov 2010 22:00:58 +1100
Nick Piggin <npiggin@kernel.dk> wrote:

> I saw a lock order warning on ext4 trigger. This should solve it.

Send us the trace, please.

The code comment implies that someone is calling down_read() under
i_lock?  That would be bad, and I'd expect it to have produced a
might_sleep() warning, not a lockdep trace.  

And I don't see how we can call writeback_inodes_sb() under i_lock
anyway, so I don't really have a clue what's going on here!

> Raciness shouldn't matter much, because writeback can stop just
> after we make the test and return anyway (so the API is racy anyway).
> 
> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>   *
>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>   * Returns 1 if writeback was started, 0 if not.
> + *
> + * May be called inside i_lock. May not start writeback if locks cannot
> + * be acquired.
>   */
>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>  {
>  	if (!writeback_in_progress(sb->s_bdi)) {
> -		down_read(&sb->s_umount);
> -		writeback_inodes_sb(sb);
> -		up_read(&sb->s_umount);
> -		return 1;
> -	} else
> -		return 0;
> +		if (down_read_trylock(&sb->s_umount)) {
> +			writeback_inodes_sb(sb);
> +			up_read(&sb->s_umount);
> +			return 1;
> +		}
> +	}
> +	return 0;

And it's pretty generous to describe a s/down_read/down_read_trylock/
as a "fix".  Terms like "bandaid" and "workaround" come to mind.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 17, 2010, 3:56 a.m.
On Tue, Nov 16, 2010 at 12:32:52PM -0800, Andrew Morton wrote:
> On Tue, 16 Nov 2010 22:00:58 +1100
> Nick Piggin <npiggin@kernel.dk> wrote:
> 
> > I saw a lock order warning on ext4 trigger. This should solve it.
> 
> Send us the trace, please.

I lost it, sorry.

 
> The code comment implies that someone is calling down_read() under
> i_lock?  That would be bad, and I'd expect it to have produced a
> might_sleep() warning, not a lockdep trace.  

Sorry not i_lock, i_mutex. writeback_inodes_sb_if_idle is called by
ext4's write_begin function which is called with i_mutex held from
generic_file_buffered_write, I believe is the trace.

 
> And I don't see how we can call writeback_inodes_sb() under i_lock
> anyway, so I don't really have a clue what's going on here!
> 
> > Raciness shouldn't matter much, because writeback can stop just
> > after we make the test and return anyway (so the API is racy anyway).

> > 
> > Signed-off-by: Nick Piggin <npiggin@kernel.dk>
> > 
> > Index: linux-2.6/fs/fs-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
> > +++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
> > @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
> >   *
> >   * Invoke writeback_inodes_sb if no writeback is currently underway.
> >   * Returns 1 if writeback was started, 0 if not.
> > + *
> > + * May be called inside i_lock. May not start writeback if locks cannot
> > + * be acquired.
> >   */
> >  int writeback_inodes_sb_if_idle(struct super_block *sb)
> >  {
> >  	if (!writeback_in_progress(sb->s_bdi)) {
> > -		down_read(&sb->s_umount);
> > -		writeback_inodes_sb(sb);
> > -		up_read(&sb->s_umount);
> > -		return 1;
> > -	} else
> > -		return 0;
> > +		if (down_read_trylock(&sb->s_umount)) {
> > +			writeback_inodes_sb(sb);
> > +			up_read(&sb->s_umount);
> > +			return 1;
> > +		}
> > +	}
> > +	return 0;
> 
> And it's pretty generous to describe a s/down_read/down_read_trylock/
> as a "fix".  Terms like "bandaid" and "workaround" come to mind.

As much as the writeback_inodes_sb_if_idle API itself is a bandaid,
I suppose. (it doesn't do any rate limiting of the dirtier, it's racy,
it doesn't specify how much to writeback, it's synchronous, etc).

Anyway, I don't know, there's not much other option for 2.6.37 AFAIKS.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 17, 2010, 4:30 a.m.
On 11/16/10 7:01 AM, Jan Kara wrote:
> On Tue 16-11-10 22:00:58, Nick Piggin wrote:
>> I saw a lock order warning on ext4 trigger. This should solve it.
>> Raciness shouldn't matter much, because writeback can stop just
>> after we make the test and return anyway (so the API is racy anyway).
>   Hmm, for now the fix is OK. Ultimately, we probably want to call
> writeback_inodes_sb() directly from all the callers. They all just want to
> reduce uncertainty of delayed allocation reservations by writing delayed
> data and actually wait for some of the writeback to happen before they
> retry again the allocation.

For ext4, at least, it's just best-effort.  We're not actually out of
space yet when this starts pushing.  But it helps us avoid enospc:

commit c8afb44682fcef6273e8b8eb19fab13ddd05b386
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Wed Dec 23 07:58:12 2009 -0500

    ext4: flush delalloc blocks when space is low
    
    Creating many small files in rapid succession on a small
    filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
    
    for i in `seq 1 22500`; do
        echo -n > $SCRATCH_MNT/$i
        echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
    done
    
    leads to ENOSPC even though after a sync, 40% of the fs is free
    again.

    <snip>

We don't need it to be synchronous - in fact I didn't think it was ...

ext4 should probably use btrfs's new variant and just get rid of the
one I put in, for a very large system/filesystem it could end up doing
a rather insane amount of IO when the fs starts to get full.

as for the locking problems ... sorry about that!

-Eric

> Although the callers generally cannot get umount_sem because they hold
> other locks, they have the superblock well pinned so grabbing umount_sem
> makes sense mostly to make assertions happy. But as I'm thinking about it,
> trylock *is* maybe the right answer to this anyway...
> 
> So
> Acked-by: Jan Kara <jack@suse.cz>
> 
> 								Honza
>>
>> Signed-off-by: Nick Piggin <npiggin@kernel.dk>
>>
>> Index: linux-2.6/fs/fs-writeback.c
>> ===================================================================
>> --- linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
>> +++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
>> @@ -1125,16 +1125,20 @@ EXPORT_SYMBOL(writeback_inodes_sb);
>>   *
>>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>>   * Returns 1 if writeback was started, 0 if not.
>> + *
>> + * May be called inside i_lock. May not start writeback if locks cannot
>> + * be acquired.
>>   */
>>  int writeback_inodes_sb_if_idle(struct super_block *sb)
>>  {
>>  	if (!writeback_in_progress(sb->s_bdi)) {
>> -		down_read(&sb->s_umount);
>> -		writeback_inodes_sb(sb);
>> -		up_read(&sb->s_umount);
>> -		return 1;
>> -	} else
>> -		return 0;
>> +		if (down_read_trylock(&sb->s_umount)) {
>> +			writeback_inodes_sb(sb);
>> +			up_read(&sb->s_umount);
>> +			return 1;
>> +		}
>> +	}
>> +	return 0;
>>  }
>>  EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
>>  
>> @@ -1145,17 +1149,21 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idl
>>   *
>>   * Invoke writeback_inodes_sb if no writeback is currently underway.
>>   * Returns 1 if writeback was started, 0 if not.
>> + *
>> + * May be called inside i_lock. May not start writeback if locks cannot
>> + * be acquired.
>>   */
>>  int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
>>  				   unsigned long nr)
>>  {
>>  	if (!writeback_in_progress(sb->s_bdi)) {
>> -		down_read(&sb->s_umount);
>> -		writeback_inodes_sb_nr(sb, nr);
>> -		up_read(&sb->s_umount);
>> -		return 1;
>> -	} else
>> -		return 0;
>> +		if (down_read_trylock(&sb->s_umount)) {
>> +			writeback_inodes_sb_nr(sb, nr);
>> +			up_read(&sb->s_umount);
>> +			return 1;
>> +		}
>> +	}
>> +	return 0;
>>  }
>>  EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);
>>  
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 17, 2010, 4:38 a.m.
On Tue, Nov 16, 2010 at 10:30:37PM -0600, Eric Sandeen wrote:
> On 11/16/10 7:01 AM, Jan Kara wrote:
> > On Tue 16-11-10 22:00:58, Nick Piggin wrote:
> >> I saw a lock order warning on ext4 trigger. This should solve it.
> >> Raciness shouldn't matter much, because writeback can stop just
> >> after we make the test and return anyway (so the API is racy anyway).
> >   Hmm, for now the fix is OK. Ultimately, we probably want to call
> > writeback_inodes_sb() directly from all the callers. They all just want to
> > reduce uncertainty of delayed allocation reservations by writing delayed
> > data and actually wait for some of the writeback to happen before they
> > retry again the allocation.
> 
> For ext4, at least, it's just best-effort.  We're not actually out of
> space yet when this starts pushing.  But it helps us avoid enospc:
> 
> commit c8afb44682fcef6273e8b8eb19fab13ddd05b386
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Wed Dec 23 07:58:12 2009 -0500
> 
>     ext4: flush delalloc blocks when space is low
>     
>     Creating many small files in rapid succession on a small
>     filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
>     
>     for i in `seq 1 22500`; do
>         echo -n > $SCRATCH_MNT/$i
>         echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
>     done
>     
>     leads to ENOSPC even though after a sync, 40% of the fs is free
>     again.
> 
>     <snip>
> 
> We don't need it to be synchronous - in fact I didn't think it was ...

By synchronous, I just mean that the caller is the one who pushes
the data into writeout. It _may_ be better if it was done by background
writeback, with a feedback loop to throttle the caller (preferably
placed outside any locks it is holding).

To be pragmatic, I think the thing is fine to actually solve the
problem at hand. I was just saying that it has a tiny little hackish
feeling anyway, so a trylock will be right at home there :)

 
> ext4 should probably use btrfs's new variant and just get rid of the
> one I put in, for a very large system/filesystem it could end up doing
> a rather insane amount of IO when the fs starts to get full.
> 
> as for the locking problems ... sorry about that!

That's no problem. So is that an ack? :)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 17, 2010, 5:05 a.m.
On 11/16/10 10:38 PM, Nick Piggin wrote:
> On Tue, Nov 16, 2010 at 10:30:37PM -0600, Eric Sandeen wrote:
>> On 11/16/10 7:01 AM, Jan Kara wrote:
>>> On Tue 16-11-10 22:00:58, Nick Piggin wrote:
>>>> I saw a lock order warning on ext4 trigger. This should solve it.
>>>> Raciness shouldn't matter much, because writeback can stop just
>>>> after we make the test and return anyway (so the API is racy anyway).
>>>   Hmm, for now the fix is OK. Ultimately, we probably want to call
>>> writeback_inodes_sb() directly from all the callers. They all just want to
>>> reduce uncertainty of delayed allocation reservations by writing delayed
>>> data and actually wait for some of the writeback to happen before they
>>> retry again the allocation.
>>
>> For ext4, at least, it's just best-effort.  We're not actually out of
>> space yet when this starts pushing.  But it helps us avoid enospc:
>>
>> commit c8afb44682fcef6273e8b8eb19fab13ddd05b386
>> Author: Eric Sandeen <sandeen@redhat.com>
>> Date:   Wed Dec 23 07:58:12 2009 -0500
>>
>>     ext4: flush delalloc blocks when space is low
>>     
>>     Creating many small files in rapid succession on a small
>>     filesystem can lead to spurious ENOSPC; on a 104MB filesystem:
>>     
>>     for i in `seq 1 22500`; do
>>         echo -n > $SCRATCH_MNT/$i
>>         echo XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > $SCRATCH_MNT/$i
>>     done
>>     
>>     leads to ENOSPC even though after a sync, 40% of the fs is free
>>     again.
>>
>>     <snip>
>>
>> We don't need it to be synchronous - in fact I didn't think it was ...
> 
> By synchronous, I just mean that the caller is the one who pushes
> the data into writeout. It _may_ be better if it was done by background
> writeback, with a feedback loop to throttle the caller (preferably
> placed outside any locks it is holding).
> 
> To be pragmatic, I think the thing is fine to actually solve the
> problem at hand. I was just saying that it has a tiny little hackish
> feeling anyway, so a trylock will be right at home there :)
> 
>  
>> ext4 should probably use btrfs's new variant and just get rid of the
>> one I put in, for a very large system/filesystem it could end up doing
>> a rather insane amount of IO when the fs starts to get full.
>>
>> as for the locking problems ... sorry about that!
> 
> That's no problem. So is that an ack? :)
> 

I'd like to test it with the original case it was supposed to solve; will
do that tomorrow.

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 17, 2010, 6:10 a.m.
On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> On 11/16/10 10:38 PM, Nick Piggin wrote:
> >> as for the locking problems ... sorry about that!
> > 
> > That's no problem. So is that an ack? :)
> > 
> 
> I'd like to test it with the original case it was supposed to solve; will
> do that tomorrow.

OK, but it shouldn't make much difference, unless there is a lot of
strange activity happening on the sb (like mount / umount / remount /
freeze / etc).

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Nov. 18, 2010, 3:06 a.m.
On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > >> as for the locking problems ... sorry about that!
> > > 
> > > That's no problem. So is that an ack? :)
> > > 
> > 
> > I'd like to test it with the original case it was supposed to solve; will
> > do that tomorrow.
> 
> OK, but it shouldn't make much difference, unless there is a lot of
> strange activity happening on the sb (like mount / umount / remount /
> freeze / etc).

This makes sense to me as well.

Acked-by: "Theodore Ts'o" <tytso@mit.edu>

So how do we want to send this patch to Linus?  It's a writeback
change, so through some mm tree?  Or it lives in fs/fs-writeback.c
(which I always thought was weird; why is it not in mm/?), so maybe
through the VFS tree, even though I don't think Al would really care
about this patch.

Or I can push it to Linus through the ext4 tree....

     	      	    	  	      	   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 18, 2010, 3:18 a.m.
On 11/17/10 12:10 AM, Nick Piggin wrote:
> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
>> On 11/16/10 10:38 PM, Nick Piggin wrote:
>>>> as for the locking problems ... sorry about that!
>>>
>>> That's no problem. So is that an ack? :)
>>>
>>
>> I'd like to test it with the original case it was supposed to solve; will
>> do that tomorrow.
> 
> OK, but it shouldn't make much difference, unless there is a lot of
> strange activity happening on the sb (like mount / umount / remount /
> freeze / etc).
> 

Ok, good point.  And since I failed in my promise anyway, I won't
hold you up;

Acked-by: Eric Sandeen <sandeen@redhat.com>

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 3:29 a.m.
On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o" <tytso@mit.edu> wrote:

> On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > >> as for the locking problems ... sorry about that!
> > > > 
> > > > That's no problem. So is that an ack? :)
> > > > 
> > > 
> > > I'd like to test it with the original case it was supposed to solve; will
> > > do that tomorrow.
> > 
> > OK, but it shouldn't make much difference, unless there is a lot of
> > strange activity happening on the sb (like mount / umount / remount /
> > freeze / etc).
> 
> This makes sense to me as well.
> 
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> So how do we want to send this patch to Linus?  It's a writeback
> change, so through some mm tree?

It's in my todo pile.  Even though the patch sucks, but not as much as
its changelog does.  Am not particularly happy merging an alleged
bugfix where the bug is, and I quote, "I saw a lock order warning on
ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
based on that??  Or to understand it a year from now?

When I get to it I'll troll this email thread and might be able to
kludge together a description which might be able to fool people into
thinking it makes sense.

And someone at least needs to fix the comment: s/i_lock/i_mutex/.  I
guess that would be me.

>  Or it lives in fs/fs-writeback.c
> (which I always thought was weird; why is it not in mm/?),

The idea was that the writeback walk goes
superblocks->inodes->address_spaces->pages->bios->drivers, so that code
needs to be spread over fs->mm->block->drivers.  The dividing line
between fs and block is approximately at the "address_spaces" mark. 
Yes, that line is a bit smeary.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 18, 2010, 6 a.m.
On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o" <tytso@mit.edu> wrote:
> 
> > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > > >> as for the locking problems ... sorry about that!
> > > > > 
> > > > > That's no problem. So is that an ack? :)
> > > > > 
> > > > 
> > > > I'd like to test it with the original case it was supposed to solve; will
> > > > do that tomorrow.
> > > 
> > > OK, but it shouldn't make much difference, unless there is a lot of
> > > strange activity happening on the sb (like mount / umount / remount /
> > > freeze / etc).
> > 
> > This makes sense to me as well.
> > 
> > Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> > 
> > So how do we want to send this patch to Linus?  It's a writeback
> > change, so through some mm tree?
> 
> It's in my todo pile.  Even though the patch sucks, but not as much as
> its changelog does.  Am not particularly happy merging an alleged
> bugfix where the bug is, and I quote, "I saw a lock order warning on
> ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
> based on that??  Or to understand it a year from now?

Sorry bout the confusion, it was supposed to be "i_mutex", and then it
would have been a bit more obvious.


> When I get to it I'll troll this email thread and might be able to
> kludge together a description which might be able to fool people into
> thinking it makes sense.

"Lock order reversal between s_umount and i_mutex".

i_mutex nests inside s_umount in some writeback paths (it was the end
io handler to convert unwritten extents IIRC). But hmm, wouldn't that
be a bug? We aren't allowed to take i_mutex inside writeback, are we?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 6:28 a.m.
On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote:

> On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:
> > On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o" <tytso@mit.edu> wrote:
> > 
> > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > > > >> as for the locking problems ... sorry about that!
> > > > > > 
> > > > > > That's no problem. So is that an ack? :)
> > > > > > 
> > > > > 
> > > > > I'd like to test it with the original case it was supposed to solve; will
> > > > > do that tomorrow.
> > > > 
> > > > OK, but it shouldn't make much difference, unless there is a lot of
> > > > strange activity happening on the sb (like mount / umount / remount /
> > > > freeze / etc).
> > > 
> > > This makes sense to me as well.
> > > 
> > > Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> > > 
> > > So how do we want to send this patch to Linus?  It's a writeback
> > > change, so through some mm tree?
> > 
> > It's in my todo pile.  Even though the patch sucks, but not as much as
> > its changelog does.  Am not particularly happy merging an alleged
> > bugfix where the bug is, and I quote, "I saw a lock order warning on
> > ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
> > based on that??  Or to understand it a year from now?
> 
> Sorry bout the confusion, it was supposed to be "i_mutex", and then it
> would have been a bit more obvious.
> 
> 
> > When I get to it I'll troll this email thread and might be able to
> > kludge together a description which might be able to fool people into
> > thinking it makes sense.
> 
> "Lock order reversal between s_umount and i_mutex".
> 
> i_mutex nests inside s_umount in some writeback paths (it was the end
> io handler to convert unwritten extents IIRC). But hmm, wouldn't that
> be a bug? We aren't allowed to take i_mutex inside writeback, are we?

I'm not sure that s_umount versus i_mutex has come up before.

Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
is a per-superblock thing, and i_mutex is a per-file thing, and files
live under superblocks.  Nesting s_umount outside i_mutex creates
complex deadlock graphs between the various i_mutexes, I think.

Someone tell me if btrfs has the same bug, via its call to
writeback_inodes_sb_nr_if_idle()?

I don't see why these functions need s_umount at all, if they're called
from within ->write_begin against an inode on that superblock.  If the
superblock can get itself disappeared while we're running ->write_begin
on it, we have problems, no?

In which case I'd suggest just removing the down_read(s_umount) and
specifying that the caller must pin the superblock via some means.

Only we can't do that because we need to hold s_umount until the
bdi_queue_work() worker has done its work.

The fact that a call to ->write_begin can randomly return with s_umount
held, to be randomly released at some random time in the future is a
bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
thing.

It'd be better if we pinned these superblocks via refcounting, not via
holding s_umount but even then, having ->write_begin randomly bump sb
refcounts for random periods of time is still pretty ugly.

What a pickle.

Can we just delete writeback_inodes_sb_nr_if_idle() and
writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
pretty handwavy - do we know that deleting these things would make a
jot of difference?

And why _do_ we need to hold s_umount during the bdi_queue_work()
handover?  Would simply bumping s_count suffice?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 18, 2010, 8:18 a.m.
On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> 
> > On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o" <tytso@mit.edu> wrote:
> > > 
> > > > On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> > > > > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > > > > > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > > > > > >> as for the locking problems ... sorry about that!
> > > > > > > 
> > > > > > > That's no problem. So is that an ack? :)
> > > > > > > 
> > > > > > 
> > > > > > I'd like to test it with the original case it was supposed to solve; will
> > > > > > do that tomorrow.
> > > > > 
> > > > > OK, but it shouldn't make much difference, unless there is a lot of
> > > > > strange activity happening on the sb (like mount / umount / remount /
> > > > > freeze / etc).
> > > > 
> > > > This makes sense to me as well.
> > > > 
> > > > Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> > > > 
> > > > So how do we want to send this patch to Linus?  It's a writeback
> > > > change, so through some mm tree?
> > > 
> > > It's in my todo pile.  Even though the patch sucks, but not as much as
> > > its changelog does.  Am not particularly happy merging an alleged
> > > bugfix where the bug is, and I quote, "I saw a lock order warning on
> > > ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
> > > based on that??  Or to understand it a year from now?
> > 
> > Sorry bout the confusion, it was supposed to be "i_mutex", and then it
> > would have been a bit more obvious.
> > 
> > 
> > > When I get to it I'll troll this email thread and might be able to
> > > kludge together a description which might be able to fool people into
> > > thinking it makes sense.
> > 
> > "Lock order reversal between s_umount and i_mutex".
> > 
> > i_mutex nests inside s_umount in some writeback paths (it was the end
> > io handler to convert unwritten extents IIRC). But hmm, wouldn't that
> > be a bug? We aren't allowed to take i_mutex inside writeback, are we?
> 
> I'm not sure that s_umount versus i_mutex has come up before.

Well, we appear to have i_mutex inside s_umount in ext4 as well, from
unwritten extent conversion in writeback completion (that's where the
lock order was recorded in my trace, I'll try to get another one).

I don't know if that's a good idea. Definitely we can't take i_mutex
inside writeback if it can be triggered recursively from something
like memory allocation in buffered write path.

So long as it is asynchronous, on the completion path, it is probably
OK.

 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.

You mean i_mutex outside s_umount?

 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?
> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
> 
> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
> 
> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.

Yeah that whole writeback_inodes_if_idle is nasty

 
> It'd be better if we pinned these superblocks via refcounting, not via
> holding s_umount but even then, having ->write_begin randomly bump sb
> refcounts for random periods of time is still pretty ugly.
> 
> What a pickle.
> 
> Can we just delete writeback_inodes_sb_nr_if_idle() and
> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> pretty handwavy - do we know that deleting these things would make a
> jot of difference?
> 
> And why _do_ we need to hold s_umount during the bdi_queue_work()
> handover?  Would simply bumping s_count suffice?

s_count just prevents it from going away, but s_umount is still needed
to keep umount, remount,ro, freezing etc activity away. I don't think
there is an easy way to do it.

Perhaps filesystem should have access to the dirty throttling path, kick
writeback or delayed allocation etc as needed, and throttle against
outstanding work that needs to be done, going through the normal
writeback paths?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Nov. 18, 2010, 10:51 a.m.
On Nov 18, 2010, at 3:18 AM, Nick Piggin wrote:

> s_count just prevents it from going away, but s_umount is still needed
> to keep umount, remount,ro, freezing etc activity away. I don't think
> there is an easy way to do it.

Hmm.... what about encoding the fact that we are in the process of unmounting the file system as a flag to keep remount, freeing, etc. away?    The equivalent of the inode's I_FREEING flag?

After all, it's not like we want freeze to wait until the umount is finished, and then continue on its merry way with the remount operation.  We want it to fail, because we're unmounting the file system.   So maybe the real problem is a mutex really isn't the right abstraction?

-- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 18, 2010, 2:55 p.m.
On 11/18/10 12:28 AM, Andrew Morton wrote:
> On Thu, 18 Nov 2010 17:00:00 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> 
>> On Wed, Nov 17, 2010 at 07:29:00PM -0800, Andrew Morton wrote:
>>> On Wed, 17 Nov 2010 22:06:13 -0500 "Ted Ts'o" <tytso@mit.edu> wrote:
>>>
>>>> On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
>>>>> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
>>>>>> On 11/16/10 10:38 PM, Nick Piggin wrote:
>>>>>>>> as for the locking problems ... sorry about that!
>>>>>>>
>>>>>>> That's no problem. So is that an ack? :)
>>>>>>>
>>>>>>
>>>>>> I'd like to test it with the original case it was supposed to solve; will
>>>>>> do that tomorrow.
>>>>>
>>>>> OK, but it shouldn't make much difference, unless there is a lot of
>>>>> strange activity happening on the sb (like mount / umount / remount /
>>>>> freeze / etc).
>>>>
>>>> This makes sense to me as well.
>>>>
>>>> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
>>>>
>>>> So how do we want to send this patch to Linus?  It's a writeback
>>>> change, so through some mm tree?
>>>
>>> It's in my todo pile.  Even though the patch sucks, but not as much as
>>> its changelog does.  Am not particularly happy merging an alleged
>>> bugfix where the bug is, and I quote, "I saw a lock order warning on
>>> ext4 trigger".  I mean, wtf?  How is anyone supposed to review the code
>>> based on that??  Or to understand it a year from now?
>>
>> Sorry bout the confusion, it was supposed to be "i_mutex", and then it
>> would have been a bit more obvious.
>>
>>
>>> When I get to it I'll troll this email thread and might be able to
>>> kludge together a description which might be able to fool people into
>>> thinking it makes sense.
>>
>> "Lock order reversal between s_umount and i_mutex".
>>
>> i_mutex nests inside s_umount in some writeback paths (it was the end
>> io handler to convert unwritten extents IIRC). But hmm, wouldn't that
>> be a bug? We aren't allowed to take i_mutex inside writeback, are we?
> 
> I'm not sure that s_umount versus i_mutex has come up before.
> 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.
> 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?
> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
> 
> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
> 
> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.
> 
> It'd be better if we pinned these superblocks via refcounting, not via
> holding s_umount but even then, having ->write_begin randomly bump sb
> refcounts for random periods of time is still pretty ugly.
> 
> What a pickle.
> 
> Can we just delete writeback_inodes_sb_nr_if_idle() and
> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> pretty handwavy - do we know that deleting these things would make a
> jot of difference?

Really?  I thought it was pretty decent ;)

Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
shows the problem clearly, IIRC.  I should have included that in the
changelog, I suppose, sorry.

-Eric
 
> And why _do_ we need to hold s_umount during the bdi_queue_work()
> handover?  Would simply bumping s_count suffice?
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 5:10 p.m.
On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:

> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> 
> Really?  I thought it was pretty decent ;)
> 
> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> shows the problem clearly, IIRC.  I should have included that in the
> changelog, I suppose, sorry.

Your email didn't really impart any information :(

I suppose I could accidentally delete those nasty little functions in a
drivers/parport patch then wait and see if anyone notices.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 5:58 p.m.
On Thu, 18 Nov 2010 19:18:22 +1100 Nick Piggin <npiggin@kernel.dk> wrote:

> On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote:
>  
> > Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> > is a per-superblock thing, and i_mutex is a per-file thing, and files
> > live under superblocks.  Nesting s_umount outside i_mutex creates
> > complex deadlock graphs between the various i_mutexes, I think.
> 
> You mean i_mutex outside s_umount?
> 

Nope.  i_mutex should nest inside s_umount.  Just as inodes nest inside
superblocks!  Seems logical to me ;)

> > Someone tell me if btrfs has the same bug, via its call to
> > writeback_inodes_sb_nr_if_idle()?
> > 
> > I don't see why these functions need s_umount at all, if they're called
> > from within ->write_begin against an inode on that superblock.  If the
> > superblock can get itself disappeared while we're running ->write_begin
> > on it, we have problems, no?
> > 
> > In which case I'd suggest just removing the down_read(s_umount) and
> > specifying that the caller must pin the superblock via some means.
> > 
> > Only we can't do that because we need to hold s_umount until the
> > bdi_queue_work() worker has done its work.
> > 
> > The fact that a call to ->write_begin can randomly return with s_umount
> > held, to be randomly released at some random time in the future is a
> > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > thing.
> 
> Yeah that whole writeback_inodes_if_idle is nasty
> 
>  
> > It'd be better if we pinned these superblocks via refcounting, not via
> > holding s_umount but even then, having ->write_begin randomly bump sb
> > refcounts for random periods of time is still pretty ugly.
> > 
> > What a pickle.
> > 
> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> > 
> > And why _do_ we need to hold s_umount during the bdi_queue_work()
> > handover?  Would simply bumping s_count suffice?
> 
> s_count just prevents it from going away, but s_umount is still needed
> to keep umount, remount,ro, freezing etc activity away. I don't think
> there is an easy way to do it.
> 
> Perhaps filesystem should have access to the dirty throttling path, kick
> writeback or delayed allocation etc as needed, and throttle against
> outstanding work that needs to be done, going through the normal
> writeback paths?

I just cannot believe that we need s_mount inside ->write_begin.  Is it
really the case that someone can come along and unmount or remount or
freeze our filesystem while some other process is down performing a
->write_begin against one of its files?  Kidding?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 18, 2010, 6:04 p.m.
On 11/18/10 11:10 AM, Andrew Morton wrote:
> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> 
>>> Can we just delete writeback_inodes_sb_nr_if_idle() and
>>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
>>> pretty handwavy - do we know that deleting these things would make a
>>> jot of difference?
>>
>> Really?  I thought it was pretty decent ;)
>>
>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
>> shows the problem clearly, IIRC.  I should have included that in the
>> changelog, I suppose, sorry.
> 
> Your email didn't really impart any information :(
> 
> I suppose I could accidentally delete those nasty little functions in a
> drivers/parport patch then wait and see if anyone notices.
> 

Um, ok, then, to answer the question directly :

No, please don't delete those functions, it will break ENOSPC handling
in ext4 as shown by xfstests regression test #204 ...

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 18, 2010, 6:24 p.m.
On 11/18/10 12:04 PM, Eric Sandeen wrote:
> On 11/18/10 11:10 AM, Andrew Morton wrote:
>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
>>
>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and
>>>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
>>>> pretty handwavy - do we know that deleting these things would make a
>>>> jot of difference?
>>>
>>> Really?  I thought it was pretty decent ;)
>>>
>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
>>> shows the problem clearly, IIRC.  I should have included that in the
>>> changelog, I suppose, sorry.
>>
>> Your email didn't really impart any information :(
>>
>> I suppose I could accidentally delete those nasty little functions in a
>> drivers/parport patch then wait and see if anyone notices.
>>
> 
> Um, ok, then, to answer the question directly :
> 
> No, please don't delete those functions, it will break ENOSPC handling
> in ext4 as shown by xfstests regression test #204 ...

Further - 

What is going on here is that with delayed allocation, ext4 takes reservations
against free blocks based on the data blocks it must write out, and the
worst-case metadata that the writeout may take.  Getting writeback failing
with ENOSPC would be bad.

But then we wind up with a bunch of unflushed writes sitting on huge
metadata reservations, and start hitting ENOSPC due to that worst-case
reservation.  After a sync we have tons of free space again, because
the worst-case space reservations turned into usually best-case
reality.

That's what the function is used for; once we start filling up the
fs, we proactively flush data to free up the worst-case metadata
reservations.

Dropping it will put us back in the bad situation.

If there are other ideas to fix it, I'm all ears, but this worked.

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason - Nov. 18, 2010, 6:33 p.m.
Excerpts from Andrew Morton's message of 2010-11-18 01:28:34 -0500:
> I'm not sure that s_umount versus i_mutex has come up before.
> 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.
> 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?

Btrfs is using the call differently, we kick off delalloc at transaction
start time when many fewer locks are held.

Since transaction start can happen with the inode mutex held, we'll end
up taking the s_unmount with the inode mutex held too.

But, we never take the inode lock internally in the writeback paths.

> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
> 
> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
> 
> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.
> 
> It'd be better if we pinned these superblocks via refcounting, not via
> holding s_umount but even then, having ->write_begin randomly bump sb
> refcounts for random periods of time is still pretty ugly.
> 
> What a pickle.
> 
> Can we just delete writeback_inodes_sb_nr_if_idle() and
> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> pretty handwavy - do we know that deleting these things would make a
> jot of difference?
> 
> And why _do_ we need to hold s_umount during the bdi_queue_work()
> handover?  Would simply bumping s_count suffice?
> 

We don't need to keep the super in ram, we need to keep the FS mounted
so that writepage and friends continue to do useful things.  s_count
isn't enough for that...but when the bdi stuff is passed an SB from
something that has the SB explicitly pinned, we should be able to safely
skip the locking.

Since these functions are only used in that context it makes good sense
to try_lock them or drop the lock completely.

I think the only reason we need the lock:

void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
{
...
        WARN_ON(!rwsem_is_locked(&sb->s_umount));

We're not going to go from rw to ro with dirty pages unless something
horrible has gone wrong (eios all around in the FS), so I'm not sure why
we need the lock at all?

-chris




--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 6:36 p.m.
On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote:

> On 11/18/10 11:10 AM, Andrew Morton wrote:
> > On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> > 
> >>> Can we just delete writeback_inodes_sb_nr_if_idle() and
> >>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> >>> pretty handwavy - do we know that deleting these things would make a
> >>> jot of difference?
> >>
> >> Really?  I thought it was pretty decent ;)
> >>
> >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> >> shows the problem clearly, IIRC.  I should have included that in the
> >> changelog, I suppose, sorry.
> > 
> > Your email didn't really impart any information :(
> > 
> > I suppose I could accidentally delete those nasty little functions in a
> > drivers/parport patch then wait and see if anyone notices.
> > 
> 
> Um, ok, then, to answer the question directly :
> 
> No, please don't delete those functions, it will break ENOSPC handling
> in ext4 as shown by xfstests regression test #204 ...
> 

If those functions "fix" a testcase then it was by sheer luck, and the
fs's ENOSPC handling is still busted.

For a start writeback_inodes_sb_if_idle() is a no-op if the device
isn't idle!  Secondly, if the device _was_ idle,
writeback_inodes_sb_if_idle() uses a work handoff to another thread,
which means that the work might not get executed for another six weeks.

So no, your ENOSPC handling is still busted and I'll be doing you a
favour when I send that parport patch.

No?

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason - Nov. 18, 2010, 6:39 p.m.
Excerpts from Eric Sandeen's message of 2010-11-18 13:24:57 -0500:
> > Um, ok, then, to answer the question directly :
> > 
> > No, please don't delete those functions, it will break ENOSPC handling
> > in ext4 as shown by xfstests regression test #204 ...
> 
> Further - 
> 
> What is going on here is that with delayed allocation, ext4 takes reservations
> against free blocks based on the data blocks it must write out, and the
> worst-case metadata that the writeout may take.  Getting writeback failing
> with ENOSPC would be bad.
> 
> But then we wind up with a bunch of unflushed writes sitting on huge
> metadata reservations, and start hitting ENOSPC due to that worst-case
> reservation.  After a sync we have tons of free space again, because
> the worst-case space reservations turned into usually best-case
> reality.
> 
> That's what the function is used for; once we start filling up the
> fs, we proactively flush data to free up the worst-case metadata
> reservations.
> 
> Dropping it will put us back in the bad situation.
> 
> If there are other ideas to fix it, I'm all ears, but this worked.

s/ext4/btrfs/g  We do the accounting and kick IO in different places,
but the idea is pretty much the same.  Some of the reservations are
freed when we start the IO and some are freed when the IO is done.

I understand that XFS is similar but does the writeback from its own
internal radix tree in the sky.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason - Nov. 18, 2010, 6:51 p.m.
Excerpts from Andrew Morton's message of 2010-11-18 13:36:38 -0500:
> On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> 
> > On 11/18/10 11:10 AM, Andrew Morton wrote:
> > > On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> > > 
> > >>> Can we just delete writeback_inodes_sb_nr_if_idle() and
> > >>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > >>> pretty handwavy - do we know that deleting these things would make a
> > >>> jot of difference?
> > >>
> > >> Really?  I thought it was pretty decent ;)
> > >>
> > >> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> > >> shows the problem clearly, IIRC.  I should have included that in the
> > >> changelog, I suppose, sorry.
> > > 
> > > Your email didn't really impart any information :(
> > > 
> > > I suppose I could accidentally delete those nasty little functions in a
> > > drivers/parport patch then wait and see if anyone notices.
> > > 
> > 
> > Um, ok, then, to answer the question directly :
> > 
> > No, please don't delete those functions, it will break ENOSPC handling
> > in ext4 as shown by xfstests regression test #204 ...
> > 
> 
> If those functions "fix" a testcase then it was by sheer luck, and the
> fs's ENOSPC handling is still busted.
> 
> For a start writeback_inodes_sb_if_idle() is a no-op if the device
> isn't idle!  Secondly, if the device _was_ idle,
> writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> which means that the work might not get executed for another six weeks.
> 
> So no, your ENOSPC handling is still busted and I'll be doing you a
> favour when I send that parport patch.

Btrfs uses it with this cool looping construct. It's an innovative
combination of while, 1, schedule_timeout(),  and if all goes well, break.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro - Nov. 18, 2010, 6:53 p.m.
On Wed, Nov 17, 2010 at 10:06:13PM -0500, Ted Ts'o wrote:

> This makes sense to me as well.
> 
> Acked-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> So how do we want to send this patch to Linus?  It's a writeback
> change, so through some mm tree?  Or it lives in fs/fs-writeback.c
> (which I always thought was weird; why is it not in mm/?), so maybe
> through the VFS tree, even though I don't think Al would really care
> about this patch.

As in "don't really like", TBH.  I'll take it (with saner commit message
and comment in the source), but I really wonder if we are just asking for
more trouble down the road.

Specifically, I *really* want to see locking rules for that sucker
spelled out.  What can be held by caller?
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - Nov. 18, 2010, 7:02 p.m.
On 11/18/10 12:36 PM, Andrew Morton wrote:
> On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> 
>> On 11/18/10 11:10 AM, Andrew Morton wrote:
>>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
>>>
>>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and
>>>>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
>>>>> pretty handwavy - do we know that deleting these things would make a
>>>>> jot of difference?
>>>>
>>>> Really?  I thought it was pretty decent ;)
>>>>
>>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
>>>> shows the problem clearly, IIRC.  I should have included that in the
>>>> changelog, I suppose, sorry.
>>>
>>> Your email didn't really impart any information :(
>>>
>>> I suppose I could accidentally delete those nasty little functions in a
>>> drivers/parport patch then wait and see if anyone notices.
>>>
>>
>> Um, ok, then, to answer the question directly :
>>
>> No, please don't delete those functions, it will break ENOSPC handling
>> in ext4 as shown by xfstests regression test #204 ...
>>
> 
> If those functions "fix" a testcase then it was by sheer luck, and the
> fs's ENOSPC handling is still busted.
> 
> For a start writeback_inodes_sb_if_idle() is a no-op if the device
> isn't idle!  

so writeback is already in progress and it's already doing what we need,
right?  Space is being freed up as we speak in that case.

> Secondly, if the device _was_ idle,
> writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> which means that the work might not get executed for another six weeks.

We start it quite early, before things are critical.

Yeah, it's not bulletproof but it is tons better.

-Eric

> So no, your ENOSPC handling is still busted and I'll be doing you a
> favour when I send that parport patch.
> 
> No?
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 8:17 p.m.
On Thu, 18 Nov 2010 13:02:43 -0600
Eric Sandeen <sandeen@redhat.com> wrote:

> On 11/18/10 12:36 PM, Andrew Morton wrote:
> > On Thu, 18 Nov 2010 12:04:21 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> > 
> >> On 11/18/10 11:10 AM, Andrew Morton wrote:
> >>> On Thu, 18 Nov 2010 08:55:18 -0600 Eric Sandeen <sandeen@redhat.com> wrote:
> >>>
> >>>>> Can we just delete writeback_inodes_sb_nr_if_idle() and
> >>>>> writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> >>>>> pretty handwavy - do we know that deleting these things would make a
> >>>>> jot of difference?
> >>>>
> >>>> Really?  I thought it was pretty decent ;)
> >>>>
> >>>> Anyway, xfstests 204, "Test out ENOSPC flushing on small filesystems."
> >>>> shows the problem clearly, IIRC.  I should have included that in the
> >>>> changelog, I suppose, sorry.
> >>>
> >>> Your email didn't really impart any information :(
> >>>
> >>> I suppose I could accidentally delete those nasty little functions in a
> >>> drivers/parport patch then wait and see if anyone notices.
> >>>
> >>
> >> Um, ok, then, to answer the question directly :
> >>
> >> No, please don't delete those functions, it will break ENOSPC handling
> >> in ext4 as shown by xfstests regression test #204 ...
> >>
> > 
> > If those functions "fix" a testcase then it was by sheer luck, and the
> > fs's ENOSPC handling is still busted.
> > 
> > For a start writeback_inodes_sb_if_idle() is a no-op if the device
> > isn't idle!  
> 
> so writeback is already in progress and it's already doing what we need,
> right?  Space is being freed up as we speak in that case.

With no guarantee that it's being freed at a sufficient rate.

> > Secondly, if the device _was_ idle,
> > writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> > which means that the work might not get executed for another six weeks.
> 
> We start it quite early, before things are critical.
> 
> Yeah, it's not bulletproof but it is tons better.

Translation: "it papers over a bug".

Look, if this was a little best-effort poke-writeback-now performance
tweak then fine.  But as an attempt to prevent a synchronous and bogus
ENOSPC error it's just hopeless.

Guys, fix the thing for real, and take that hack out.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andrew Morton - Nov. 18, 2010, 8:22 p.m.
On Thu, 18 Nov 2010 13:51:15 -0500
Chris Mason <chris.mason@oracle.com> wrote:

> > If those functions "fix" a testcase then it was by sheer luck, and the
> > fs's ENOSPC handling is still busted.
> > 
> > For a start writeback_inodes_sb_if_idle() is a no-op if the device
> > isn't idle!  Secondly, if the device _was_ idle,
> > writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> > which means that the work might not get executed for another six weeks.
> > 
> > So no, your ENOSPC handling is still busted and I'll be doing you a
> > favour when I send that parport patch.
> 
> Btrfs uses it with this cool looping construct. It's an innovative
> combination of while, 1, schedule_timeout(),  and if all goes well, break.

If the calling code can do that then it doesn't need to pass the work
off to another thread at all.  Just sychronously call
writeback_inodes_sb(), then bye-bye goes writeback_inodes_sb_if_idle(),
to great sighs of relief.

And again: as btrfs is effectively making a synchronous call to
writeback_inodes_sb() via schedule(), then surely it does not need to
take s_umount to protect its own darn superblock!!

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Chris Mason - Nov. 18, 2010, 8:36 p.m.
Excerpts from Andrew Morton's message of 2010-11-18 15:22:38 -0500:
> On Thu, 18 Nov 2010 13:51:15 -0500
> Chris Mason <chris.mason@oracle.com> wrote:
> 
> > > If those functions "fix" a testcase then it was by sheer luck, and the
> > > fs's ENOSPC handling is still busted.
> > > 
> > > For a start writeback_inodes_sb_if_idle() is a no-op if the device
> > > isn't idle!  Secondly, if the device _was_ idle,
> > > writeback_inodes_sb_if_idle() uses a work handoff to another thread,
> > > which means that the work might not get executed for another six weeks.
> > > 
> > > So no, your ENOSPC handling is still busted and I'll be doing you a
> > > favour when I send that parport patch.
> > 
> > Btrfs uses it with this cool looping construct. It's an innovative
> > combination of while, 1, schedule_timeout(),  and if all goes well, break.
> 
> If the calling code can do that then it doesn't need to pass the work
> off to another thread at all.  Just sychronously call
> writeback_inodes_sb(), then bye-bye goes writeback_inodes_sb_if_idle(),
> to great sighs of relief.

I think if_idle() is a different discussion than the lock.

writeback_inodes_sb will not actually writeback inodes.  It calls
bdi_queue_task, which puts a work request into the list of things the
bdi flusher thread is already doing, and then it waits for that work to
finish.

So, if the bdi flusher thread is already doing work on this bdi, it will
finish that work (at which point our ENOSPC problems may be solved) and
then it will do the work we've added.

The btrfs enospc code is calling this while we start transactions, so it
might happen once per process calling write().  So if you have 50 procs,
we don't want 50 work requests pigging our bdi work queue.  We just want
to let the flusher do its thing and wait for it to make progress.

schedule_timeout isn't a very pretty way to wait for that progress, but
we haven't had a pretty way to wait for IO...ever (don't say
congestion_wait, it really just broke down to schedule_timeout()).

Without the if_idle variant, we'll end up doing 50 calls to
writeback_inodes_sb, one after the other, even if the 1st call was
enough.

> 
> And again: as btrfs is effectively making a synchronous call to
> writeback_inodes_sb() via schedule(), then surely it does not need to
> take s_umount to protect its own darn superblock!!
> 

Definitely agree here.  I'd love to get rid of the s_umount lock.
Outside of the WARN_ON in __writeback_inodes_sb, I don't think we need
it.

Christoph added the s_umount to protect against mount -o remount,ro, but
I think we can deal with that via an extra check while actually walking
the inodes/pages?

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 18, 2010, 11:58 p.m.
On Thu 18-11-10 13:33:54, Chris Mason wrote:
> > Can we just delete writeback_inodes_sb_nr_if_idle() and
> > writeback_inodes_sb_if_idle()?  The changelog for 17bd55d037a02 is
> > pretty handwavy - do we know that deleting these things would make a
> > jot of difference?
> > 
> > And why _do_ we need to hold s_umount during the bdi_queue_work()
> > handover?  Would simply bumping s_count suffice?
> > 
> 
> We don't need to keep the super in ram, we need to keep the FS mounted
> so that writepage and friends continue to do useful things.  s_count
> isn't enough for that...but when the bdi stuff is passed an SB from
> something that has the SB explicitly pinned, we should be able to safely
> skip the locking.
> 
> Since these functions are only used in that context it makes good sense
> to try_lock them or drop the lock completely.
> 
> I think the only reason we need the lock:
> 
> void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr)
> {
> ...
>         WARN_ON(!rwsem_is_locked(&sb->s_umount));
  The above is the only reason why we need the lock in the call from
->write_begin(). Yes. But:

> We're not going to go from rw to ro with dirty pages unless something
> horrible has gone wrong (eios all around in the FS), so I'm not sure why
> we need the lock at all?
  One of the nasty cases is for example: Writeback thread decides inode
needs writeout, so the thread gets inode reference. Then someone calls
umount and gets EBUSY because writeback thread is working. Kind of
unexpected... So generally we *do* need a serialization of writeback thread
and umount / remount. Just in that particular case where we call the
function from ->write_begin(), s_umount is overkill...

								Honza
Jan Kara - Nov. 19, 2010, 12:45 a.m.
On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> I'm not sure that s_umount versus i_mutex has come up before.
> 
> Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> is a per-superblock thing, and i_mutex is a per-file thing, and files
> live under superblocks.  Nesting s_umount outside i_mutex creates
> complex deadlock graphs between the various i_mutexes, I think.
> 
> Someone tell me if btrfs has the same bug, via its call to
> writeback_inodes_sb_nr_if_idle()?
> 
> I don't see why these functions need s_umount at all, if they're called
> from within ->write_begin against an inode on that superblock.  If the
> superblock can get itself disappeared while we're running ->write_begin
> on it, we have problems, no?
  As I wrote to Chris, the function just needs exclusion from umount /
remount happening (and want to stop umount from returning EBUSY when
writeback thread is writing something out). When the function is called
from ->write_begin this is no issue as you properly noted so s_umount is
not needed in that particular case.

> In which case I'd suggest just removing the down_read(s_umount) and
> specifying that the caller must pin the superblock via some means.
  Possibly, but currently the advantage is that we can have WARN_ON in the
writeback code that complains if someone starts writeback without properly
pinned superblock and we cannot easily do that with your general rule. I'm
not saying that should stop us from changing the rule but it was kind of
nice.

> Only we can't do that because we need to hold s_umount until the
> bdi_queue_work() worker has done its work.
> 
> The fact that a call to ->write_begin can randomly return with s_umount
> held, to be randomly released at some random time in the future is a
> bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> thing.
  I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
variants) does:
        bdi_queue_work(sb->s_bdi, &work);
        wait_for_completion(&done);
  So we return only after all the IO has been submitted and unlock s_umount
in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
because we are holding i_mutex and we need to get and put references
to other inodes while doing writeback (those would be really horrible lock
dependencies - writeback thread can put the last reference to an unlinked
inode...).

In fact, as I'm speaking about it, pushing things to writeback thread and
waiting on the work does not help a bit with the locking issues (we didn't
wait for the work previously but that had other issues). Bug, sigh.

What might be better interface for usecases like above is to allow
filesystem to kick flusher thread to start doing background writeback
(regardless of dirty limits). Then the caller can wait for some delayed
allocation reservations to get freed (easy enough to check in
->writepage() and wake the waiters) - possibly with a reasonable timeout
so that we don't stall forever.

								Honza
Nick Piggin - Nov. 19, 2010, 5:10 a.m.
On Thu, Nov 18, 2010 at 09:58:31AM -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 19:18:22 +1100 Nick Piggin <npiggin@kernel.dk> wrote:
> 
> > On Wed, Nov 17, 2010 at 10:28:34PM -0800, Andrew Morton wrote:
> >  
> > > Logically I'd expect i_mutex to nest inside s_umount.  Because s_umount
> > > is a per-superblock thing, and i_mutex is a per-file thing, and files
> > > live under superblocks.  Nesting s_umount outside i_mutex creates
> > > complex deadlock graphs between the various i_mutexes, I think.
> > 
> > You mean i_mutex outside s_umount?
> > 
> 
> Nope.  i_mutex should nest inside s_umount.  Just as inodes nest inside
> superblocks!  Seems logical to me ;)

Right, but your last sentence seemed to suggest that the natural
ordering creates deadlocks :)

 
> > > And why _do_ we need to hold s_umount during the bdi_queue_work()
> > > handover?  Would simply bumping s_count suffice?
> > 
> > s_count just prevents it from going away, but s_umount is still needed
> > to keep umount, remount,ro, freezing etc activity away. I don't think
> > there is an easy way to do it.
> > 
> > Perhaps filesystem should have access to the dirty throttling path, kick
> > writeback or delayed allocation etc as needed, and throttle against
> > outstanding work that needs to be done, going through the normal
> > writeback paths?
> 
> I just cannot believe that we need s_mount inside ->write_begin.  Is it
> really the case that someone can come along and unmount or remount or
> freeze our filesystem while some other process is down performing a
> ->write_begin against one of its files?  Kidding?

Not for the work handoff either? If that is all waited on synchronously
before ->write_end returns, then no we shouldn't need any more locks
of course.

But asynch writeout needs a mutex rather than refcount so the umount
has something to block against and not just fail.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 19, 2010, 5:16 a.m.
On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > The fact that a call to ->write_begin can randomly return with s_umount
> > held, to be randomly released at some random time in the future is a
> > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > thing.
>   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> variants) does:
>         bdi_queue_work(sb->s_bdi, &work);
>         wait_for_completion(&done);
>   So we return only after all the IO has been submitted and unlock s_umount
> in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
> because we are holding i_mutex and we need to get and put references
> to other inodes while doing writeback (those would be really horrible lock
> dependencies - writeback thread can put the last reference to an unlinked
> inode...).

But if we're waiting for it, with the lock held, then surely it can
deadlock just the same as if we submit it ourself?

BTW. are you taking i_mutex inside writeback? I mutex can be held
while entering page reclaim, and thus writepage... so it could be a
bug too.
 

> In fact, as I'm speaking about it, pushing things to writeback thread and
> waiting on the work does not help a bit with the locking issues (we didn't
> wait for the work previously but that had other issues). Bug, sigh.
> 
> What might be better interface for usecases like above is to allow
> filesystem to kick flusher thread to start doing background writeback
> (regardless of dirty limits). Then the caller can wait for some delayed
> allocation reservations to get freed (easy enough to check in
> ->writepage() and wake the waiters) - possibly with a reasonable timeout
> so that we don't stall forever.

We really need to throttle the producer without any locks held, no?
So the filesystem would like to to hook into dirtying path somewhere
without i_mutex held (and without implementing your own .write). Eg.
around the dirty throttling calls somewhere I suppose.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Nov. 19, 2010, 12:07 p.m.
On Nov 19, 2010, at 12:10 AM, Nick Piggin wrote:

> But asynch writeout needs a mutex rather than refcount so the umount
> has something to block against and not just fail.

Or we use a completion handler instead of a mutex for umount?

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 22, 2010, 6:16 p.m.
On Fri 19-11-10 16:16:19, Nick Piggin wrote:
> On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> > On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > > The fact that a call to ->write_begin can randomly return with s_umount
> > > held, to be randomly released at some random time in the future is a
> > > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > > thing.
> >   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> > variants) does:
> >         bdi_queue_work(sb->s_bdi, &work);
> >         wait_for_completion(&done);
> >   So we return only after all the IO has been submitted and unlock s_umount
> > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
> > because we are holding i_mutex and we need to get and put references
> > to other inodes while doing writeback (those would be really horrible lock
> > dependencies - writeback thread can put the last reference to an unlinked
> > inode...).
> 
> But if we're waiting for it, with the lock held, then surely it can
> deadlock just the same as if we submit it ourself?
  Yes, that's what I realized as well and what needs fixing. It was an
unintentional consequence of locking changes Christoph did to the writeback
code to fix other races.

> BTW. are you taking i_mutex inside writeback? I mutex can be held
> while entering page reclaim, and thus writepage... so it could be a
> bug too.
  No, writeback does not need i_mutex.

> > In fact, as I'm speaking about it, pushing things to writeback thread and
> > waiting on the work does not help a bit with the locking issues (we didn't
> > wait for the work previously but that had other issues). Bug, sigh.
> > 
> > What might be better interface for usecases like above is to allow
> > filesystem to kick flusher thread to start doing background writeback
> > (regardless of dirty limits). Then the caller can wait for some delayed
> > allocation reservations to get freed (easy enough to check in
> > ->writepage() and wake the waiters) - possibly with a reasonable timeout
> > so that we don't stall forever.
> 
> We really need to throttle the producer without any locks held, no?
> So the filesystem would like to to hook into dirtying path somewhere
> without i_mutex held (and without implementing your own .write). Eg.
> around the dirty throttling calls somewhere I suppose.
  Well, the particular case where we need to do delayed allocation but the
filesystem has already all blocks reserved is hard to detect without any
locks. At least we need some locks to find out how many blocks do we need
to reserve for that write(). So for filesystems it would solve the locking
issues if we could bail out of write() path up to the point where we hold
no locks, wait there / do necessary writeback and then restart the write...

								Honza
Andrew Morton - Nov. 22, 2010, 11:43 p.m.
On Wed, 17 Nov 2010 21:18:13 -0600
Eric Sandeen <sandeen@redhat.com> wrote:

> On 11/17/10 12:10 AM, Nick Piggin wrote:
> > On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> >> On 11/16/10 10:38 PM, Nick Piggin wrote:
> >>>> as for the locking problems ... sorry about that!
> >>>
> >>> That's no problem. So is that an ack? :)
> >>>
> >>
> >> I'd like to test it with the original case it was supposed to solve; will
> >> do that tomorrow.
> > 
> > OK, but it shouldn't make much difference, unless there is a lot of
> > strange activity happening on the sb (like mount / umount / remount /
> > freeze / etc).
> > 
> 
> Ok, good point.  And since I failed in my promise anyway, I won't
> hold you up;
> 
> Acked-by: Eric Sandeen <sandeen@redhat.com>

That patch still sucks!

We all breathlessly await v2, which will have a complete changelog and
approximately correct comments.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 23, 2010, 8:07 a.m.
On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote:
> On Fri 19-11-10 16:16:19, Nick Piggin wrote:
> > On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> > > On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > > > The fact that a call to ->write_begin can randomly return with s_umount
> > > > held, to be randomly released at some random time in the future is a
> > > > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > > > thing.
> > >   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> > > variants) does:
> > >         bdi_queue_work(sb->s_bdi, &work);
> > >         wait_for_completion(&done);
> > >   So we return only after all the IO has been submitted and unlock s_umount
> > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
> > > because we are holding i_mutex and we need to get and put references
> > > to other inodes while doing writeback (those would be really horrible lock
> > > dependencies - writeback thread can put the last reference to an unlinked
> > > inode...).
> > 
> > But if we're waiting for it, with the lock held, then surely it can
> > deadlock just the same as if we submit it ourself?
>   Yes, that's what I realized as well and what needs fixing. It was an
> unintentional consequence of locking changes Christoph did to the writeback
> code to fix other races.
> 
> > BTW. are you taking i_mutex inside writeback? I mutex can be held
> > while entering page reclaim, and thus writepage... so it could be a
> > bug too.
>   No, writeback does not need i_mutex.

I did in fact see i_mutex from writeback, which is how the lock order
reversal was noticed in the first place. I haven't had much luck in
reproducing it yet. It did come from end_io_work, I believe.

There is another deadlock in here, by the looks (although this is not
the one which triggers lockdep -- the workqueue coupling means it
defeats lockdep detection).

Buffered write holds i_lock, and waits for inode writeback submission,
but the work queue seems to be blocked on i_mutex (ext4_end_io_work),
and so it deadlocks.

Note that this is an AA deadlock, different from ABBA one relating to
s_umount lock (but very similar call chains IIRC).


[  748.406457] SysRq : Show Blocked State
[  748.406685]   task                        PC stack   pid father
[  748.406868] kworker/9:1   D 0000000000000000  6296   118      2
0x00000000
[  748.407173]  ffff88012acb3c90 0000000000000046 0000000000000000
ffff88012b1cec80
[  748.407686]  0000000000000002 ffff88012acb2000 ffff88012acb2000
000000000000d6c8
[  748.408200]  ffff88012acb2010 ffff88012acb3fd8 ffff880129c7dd00
ffff88012b1cec80
[  748.408711] Call Trace:
[  748.408866]  [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
[  748.409033]  [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480
[  748.409205]  [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
[  748.409379]  [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
[  748.409546]  [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
[  748.409717]  [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4]
[  748.409883]  [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
[  748.410046]  [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0
[  748.410219]  [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4]
[  748.410386]  [<ffffffff81069675>] worker_thread+0x175/0x3a0
[  748.410549]  [<ffffffff81069500>] ? worker_thread+0x0/0x3a0
[  748.410712]  [<ffffffff8106e246>] kthread+0x96/0xa0
[  748.410874]  [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[  748.411038]  [<ffffffff81039878>] ? finish_task_switch+0x78/0x110
[  748.411202]  [<ffffffff816036c0>] ? restore_args+0x0/0x30
[  748.411364]  [<ffffffff8106e1b0>] ? kthread+0x0/0xa0
[  748.411524]  [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10
[  748.606853] dbench        D 0000000000000000  2872  2916      1
0x00000004
[  748.607154]  ffff880129ec58b8 0000000000000046 ffff880129ec5838
ffffffff810806fd
[  748.607665]  ffff880129ec5898 ffff880129ec4000 ffff880129ec4000
000000000000d6c8
[  748.608176]  ffff880129ec4010 ffff880129ec5fd8 ffff88010a2d0000
ffff88011ff69f00
[  748.608686] Call Trace:
[  748.608837]  [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10
[  748.609001]  [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350
[  748.609164]  [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0
[  748.609328]  [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60
[  748.609493]  [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
[  748.609656]  [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
[  748.609820]  [<ffffffff815ffacd>] wait_for_common+0x10d/0x190
[  748.609984]  [<ffffffff810426e0>] ? default_wake_function+0x0/0x10
[  748.610149]  [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40
[  748.610314]  [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20
[  748.610479]  [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120
[  748.610646]  [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60
[  748.610811]  [<ffffffff8113d6d2>] ?
writeback_inodes_sb_if_idle+0x32/0x60
[  748.610978]  [<ffffffff8113d6da>]
writeback_inodes_sb_if_idle+0x3a/0x60
[  748.611153]  [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310
[ext4]
[  748.611321]  [<ffffffff810cbbf4>]
generic_file_buffered_write+0x114/0x2a0



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nick Piggin - Nov. 23, 2010, 8:15 a.m.
Here is another one. Both i_mutex in writeback and s_umount in write(2)
underneath i_mutex seem like an interesting idea. The former you can
probably get away with (provided you solve the previous AA deadlock),
but the latter seems too problematic. I think my trylock patch solves
it.

[  409.479214] =======================================================
[  409.479527] [ INFO: possible circular locking dependency detected ]
[  409.479689] 2.6.37-rc3+ #26[  409.479837]
-------------------------------------------------------
[  409.479998] umount/4020 is trying to acquire lock:[  409.480155]
(ext4-dio-unwritten){+.+...}, at: [<ffffffff81067a80>]
flush_workqueue+0x0/0x540
[  409.480178] [  409.480178] but task is already holding lock:
[  409.480178]  (&type->s_umount_key#23){++++..}, at:
[<ffffffff8111b40d>] deactivate_super+0x3d
/0x60
[  409.480178] [  409.480178] which lock already depends on the new
lock.
[  409.480178] 
[  409.480178] 
[  409.480178] the existing dependency chain (in reverse order) is:
[  409.480178] 
[  409.480178] -> #3 (&type->s_umount_key#23){++++..}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81601ba2>] down_read+0x42/0x60
[  409.480178]        [<ffffffff8113d6d2>]
writeback_inodes_sb_if_idle+0x32/0x60
[  409.480178]        [<ffffffffa003fef8>]
ext4_da_write_begin+0x208/0x2d0 [ext4]
[  409.480178]        [<ffffffff810cbbf4>]
generic_file_buffered_write+0x114/0x2a0
[  409.480178]        [<ffffffff810cc5e0>]
__generic_file_aio_write+0x240/0x470
[  409.480178]        [<ffffffff810cc876>]
generic_file_aio_write+0x66/0xd0
[  409.480178]        [<ffffffffa0034fad>] ext4_file_write+0x3d/0xd0
[ext4]
[  409.480178]        [<ffffffff81117702>] do_sync_write+0xd2/0x110
[  409.480178]        [<ffffffff811179c8>] vfs_write+0xc8/0x190
[  409.480178]        [<ffffffff8111851a>] sys_pwrite64+0x7a/0x90
[  409.480178]        [<ffffffff8100312b>]
system_call_fastpath+0x16/0x1b
[  409.480178] 
[  409.480178] -> #2 (&sb->s_type->i_mutex_key#13){+.+.+.}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81601329>]
__mutex_lock_common+0x59/0x480
[  409.480178]        [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
[  409.480178]        [<ffffffffa0042107>] ext4_end_io_work+0x37/0xb0
[ext4]
[  409.480178]        [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
[  409.480178]        [<ffffffff81069675>] worker_thread+0x175/0x3a0
[  409.480178]        [<ffffffff8106e246>] kthread+0x96/0xa0
[  409.480178]        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[  409.480178] 
[  409.480178] -> #1 ((&io->work)){+.+...}:
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81068364>] process_one_work+0x1a4/0x5a0
[  409.480178]        [<ffffffff81069675>] worker_thread+0x175/0x3a0
[  409.480178]        [<ffffffff8106e246>] kthread+0x96/0xa0
[  409.480178]        [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
[  409.480178] 
[  409.480178] -> #0 (ext4-dio-unwritten){+.+...}:
[  409.480178]        [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
[  409.480178]        [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]        [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
[  409.480178]        [<ffffffffa004f5db>] ext4_sync_fs+0x3b/0x100
[ext4]
[  409.480178]        [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
[  409.480178]        [<ffffffff81143132>] sync_filesystem+0x32/0x60
[  409.480178]        [<ffffffff8111a97f>]
generic_shutdown_super+0x2f/0x100
[  409.480178]        [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
[  409.480178]        [<ffffffff8111b1e5>]
deactivate_locked_super+0x45/0x60
[  409.480178]        [<ffffffff8111b415>] deactivate_super+0x45/0x60
[  409.480178]        [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
[  409.480178]        [<ffffffff811376a9>] sys_umount+0x79/0x3a0
[  409.480178]        [<ffffffff8100312b>]
system_call_fastpath+0x16/0x1b
[  409.480178] [  409.480178] other info that might help us debug this:
[  409.480178] 
[  409.480178] 1 lock held by umount/4020:
[  409.480178]  #0:  (&type->s_umount_key#23){++++..}, at:
[<ffffffff8111b40d>] deactivate_super
+0x3d/0x60
[  409.480178] 
[  409.480178] stack backtrace:
[  409.480178] Pid: 4020, comm: umount Not tainted 2.6.37-rc3+ #26
[  409.480178] Call Trace:
[  409.480178]  [<ffffffff81082c39>] print_circular_bug+0xe9/0xf0
[  409.480178]  [<ffffffff81085122>] __lock_acquire+0x1382/0x1490
[  409.480178]  [<ffffffff810852c5>] lock_acquire+0x95/0x1b0
[  409.480178]  [<ffffffff81067a80>] ? flush_workqueue+0x0/0x540
[  409.480178]  [<ffffffff81067bc8>] flush_workqueue+0x148/0x540
[  409.480178]  [<ffffffff81067a80>] ? flush_workqueue+0x0/0x540
[  409.480178]  [<ffffffffa004f5db>] ext4_sync_fs+0x3b/0x100 [ext4]
[  409.480178]  [<ffffffff8113d68d>] ? writeback_inodes_sb+0x4d/0x60
[  409.480178]  [<ffffffff8114304e>] __sync_filesystem+0x5e/0x90
[  409.480178]  [<ffffffff81143132>] sync_filesystem+0x32/0x60
[  409.480178]  [<ffffffff8111a97f>] generic_shutdown_super+0x2f/0x100
[  409.480178]  [<ffffffff8111aa7c>] kill_block_super+0x2c/0x50
[  409.480178]  [<ffffffff8111b1e5>] deactivate_locked_super+0x45/0x60
[  409.480178]  [<ffffffff8111b415>] deactivate_super+0x45/0x60
[  409.480178]  [<ffffffff81136430>] mntput_no_expire+0xf0/0x190
[  409.480178]  [<ffffffff811376a9>] sys_umount+0x79/0x3a0
[  409.480178]  [<ffffffff8100312b>] system_call_fastpath+0x16/0x1b

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Nov. 23, 2010, 1:32 p.m.
On Tue 23-11-10 19:07:58, Nick Piggin wrote:
> On Mon, Nov 22, 2010 at 07:16:55PM +0100, Jan Kara wrote:
> > On Fri 19-11-10 16:16:19, Nick Piggin wrote:
> > > On Fri, Nov 19, 2010 at 01:45:52AM +0100, Jan Kara wrote:
> > > > On Wed 17-11-10 22:28:34, Andrew Morton wrote:
> > > > > The fact that a call to ->write_begin can randomly return with s_umount
> > > > > held, to be randomly released at some random time in the future is a
> > > > > bit ugly, isn't it?  write_begin is a pretty low-level, per-inode
> > > > > thing.
> > > >   I guess you missed that writeback_inodes_sb_nr() (called from _if_idle
> > > > variants) does:
> > > >         bdi_queue_work(sb->s_bdi, &work);
> > > >         wait_for_completion(&done);
> > > >   So we return only after all the IO has been submitted and unlock s_umount
> > > > in writeback_inodes_sb_if_idle(). And we cannot really submit the IO ourselves
> > > > because we are holding i_mutex and we need to get and put references
> > > > to other inodes while doing writeback (those would be really horrible lock
> > > > dependencies - writeback thread can put the last reference to an unlinked
> > > > inode...).
> > > 
> > > But if we're waiting for it, with the lock held, then surely it can
> > > deadlock just the same as if we submit it ourself?
> >   Yes, that's what I realized as well and what needs fixing. It was an
> > unintentional consequence of locking changes Christoph did to the writeback
> > code to fix other races.
> > 
> > > BTW. are you taking i_mutex inside writeback? I mutex can be held
> > > while entering page reclaim, and thus writepage... so it could be a
> > > bug too.
> >   No, writeback does not need i_mutex.
> 
> I did in fact see i_mutex from writeback, which is how the lock order
> reversal was noticed in the first place. I haven't had much luck in
> reproducing it yet. It did come from end_io_work, I believe.
> 
> There is another deadlock in here, by the looks (although this is not
> the one which triggers lockdep -- the workqueue coupling means it
> defeats lockdep detection).
> 
> Buffered write holds i_lock, and waits for inode writeback submission,
> but the work queue seems to be blocked on i_mutex (ext4_end_io_work),
> and so it deadlocks.
  Ah, I see. That's a new thing introduced by Ted's rewrite of
ext4_writepages() - bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc. And indeed it
introduced a deadlock as you describe...

								Honza

> Note that this is an AA deadlock, different from ABBA one relating to
> s_umount lock (but very similar call chains IIRC).
> 
> 
> [  748.406457] SysRq : Show Blocked State
> [  748.406685]   task                        PC stack   pid father
> [  748.406868] kworker/9:1   D 0000000000000000  6296   118      2
> 0x00000000
> [  748.407173]  ffff88012acb3c90 0000000000000046 0000000000000000
> ffff88012b1cec80
> [  748.407686]  0000000000000002 ffff88012acb2000 ffff88012acb2000
> 000000000000d6c8
> [  748.408200]  ffff88012acb2010 ffff88012acb3fd8 ffff880129c7dd00
> ffff88012b1cec80
> [  748.408711] Call Trace:
> [  748.408866]  [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [  748.409033]  [<ffffffff8160145c>] __mutex_lock_common+0x18c/0x480
> [  748.409205]  [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409379]  [<ffffffffa0042147>] ? ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409546]  [<ffffffff8160182e>] mutex_lock_nested+0x3e/0x50
> [  748.409717]  [<ffffffffa0042147>] ext4_end_io_work+0x37/0xb0 [ext4]
> [  748.409883]  [<ffffffff81068378>] process_one_work+0x1b8/0x5a0
> [  748.410046]  [<ffffffff8106830e>] ? process_one_work+0x14e/0x5a0
> [  748.410219]  [<ffffffffa0042110>] ? ext4_end_io_work+0x0/0xb0 [ext4]
> [  748.410386]  [<ffffffff81069675>] worker_thread+0x175/0x3a0
> [  748.410549]  [<ffffffff81069500>] ? worker_thread+0x0/0x3a0
> [  748.410712]  [<ffffffff8106e246>] kthread+0x96/0xa0
> [  748.410874]  [<ffffffff81003ed4>] kernel_thread_helper+0x4/0x10
> [  748.411038]  [<ffffffff81039878>] ? finish_task_switch+0x78/0x110
> [  748.411202]  [<ffffffff816036c0>] ? restore_args+0x0/0x30
> [  748.411364]  [<ffffffff8106e1b0>] ? kthread+0x0/0xa0
> [  748.411524]  [<ffffffff81003ed0>] ? kernel_thread_helper+0x0/0x10
> [  748.606853] dbench        D 0000000000000000  2872  2916      1
> 0x00000004
> [  748.607154]  ffff880129ec58b8 0000000000000046 ffff880129ec5838
> ffffffff810806fd
> [  748.607665]  ffff880129ec5898 ffff880129ec4000 ffff880129ec4000
> 000000000000d6c8
> [  748.608176]  ffff880129ec4010 ffff880129ec5fd8 ffff88010a2d0000
> ffff88011ff69f00
> [  748.608686] Call Trace:
> [  748.608837]  [<ffffffff810806fd>] ? trace_hardirqs_off+0xd/0x10
> [  748.609001]  [<ffffffff81600bb5>] schedule_timeout+0x1f5/0x350
> [  748.609164]  [<ffffffff8108380b>] ? mark_held_locks+0x6b/0xa0
> [  748.609328]  [<ffffffff8160314b>] ? _raw_spin_unlock_irq+0x2b/0x60
> [  748.609493]  [<ffffffff81039431>] ? get_parent_ip+0x11/0x50
> [  748.609656]  [<ffffffff81606c3d>] ? sub_preempt_count+0x9d/0xd0
> [  748.609820]  [<ffffffff815ffacd>] wait_for_common+0x10d/0x190
> [  748.609984]  [<ffffffff810426e0>] ? default_wake_function+0x0/0x10
> [  748.610149]  [<ffffffff81602ec9>] ? _raw_spin_unlock_bh+0x39/0x40
> [  748.610314]  [<ffffffff815ffbf8>] wait_for_completion+0x18/0x20
> [  748.610479]  [<ffffffff8113d1e7>] writeback_inodes_sb_nr+0xf7/0x120
> [  748.610646]  [<ffffffff8113d68d>] writeback_inodes_sb+0x4d/0x60
> [  748.610811]  [<ffffffff8113d6d2>] ?
> writeback_inodes_sb_if_idle+0x32/0x60
> [  748.610978]  [<ffffffff8113d6da>]
> writeback_inodes_sb_if_idle+0x3a/0x60
> [  748.611153]  [<ffffffffa003fefd>] ext4_da_write_begin+0x20d/0x310
> [ext4]
> [  748.611321]  [<ffffffff810cbbf4>]
> generic_file_buffered_write+0x114/0x2a0
> 
> 
>

Patch

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2010-11-16 21:44:32.000000000 +1100
+++ linux-2.6/fs/fs-writeback.c	2010-11-16 21:49:37.000000000 +1100
@@ -1125,16 +1125,20 @@  EXPORT_SYMBOL(writeback_inodes_sb);
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * May be called inside i_lock. May not start writeback if locks cannot
+ * be acquired.
  */
 int writeback_inodes_sb_if_idle(struct super_block *sb)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb(sb);
-		up_read(&sb->s_umount);
-		return 1;
-	} else
-		return 0;
+		if (down_read_trylock(&sb->s_umount)) {
+			writeback_inodes_sb(sb);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
 
@@ -1145,17 +1149,21 @@  EXPORT_SYMBOL(writeback_inodes_sb_if_idl
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
+ *
+ * May be called inside i_lock. May not start writeback if locks cannot
+ * be acquired.
  */
 int writeback_inodes_sb_nr_if_idle(struct super_block *sb,
 				   unsigned long nr)
 {
 	if (!writeback_in_progress(sb->s_bdi)) {
-		down_read(&sb->s_umount);
-		writeback_inodes_sb_nr(sb, nr);
-		up_read(&sb->s_umount);
-		return 1;
-	} else
-		return 0;
+		if (down_read_trylock(&sb->s_umount)) {
+			writeback_inodes_sb_nr(sb, nr);
+			up_read(&sb->s_umount);
+			return 1;
+		}
+	}
+	return 0;
 }
 EXPORT_SYMBOL(writeback_inodes_sb_nr_if_idle);