[BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Message ID 20110216081746.54d146d1.toshi.okajima@jp.fujitsu.com
State New, archived

Commit Message

Toshiyuki Okajima Feb. 15, 2011, 11:17 p.m. UTC
Hi.

On Tue, 15 Feb 2011 18:29:54 +0100
Jan Kara <jack@suse.cz> wrote:
> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > under s_umount semaphore, we are prone to deadlock like the one you
> > > describe above.
> > 
> > One of the fundamental problems here is that the freeze and thaw
> > routines are using down_write(&sb->s_umount) for two purposes.  The
> > first is to prevent the resume/thaw from racing with a umount (which
> > it could do just as well by taking a read lock), but the second is to
> > prevent the resume/thaw code from racing with itself.  That's the core
> > fundamental problem here.
> > 
> > So I think we can solve this by introducing a new mutex, s_freeze, and
> > having the resume/thaw first take the s_freeze mutex and then
> > second take a read lock on the s_umount.
>   Sadly this does not quite work because even down_read(&sb->s_umount)
> in thaw_super() can block if there is another process that tries to acquire
> s_umount for writing - a situation like:
>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> down_read(&sb->s_umount)
>   block on s_frozen
> 				down_write(&sb->s_umount)
> 				  -blocked
> 								down_read(&sb->s_umount)
> 								  -blocked
> behind the write access...
> 
> The only working solution I see is to check for frozen filesystem before
> taking s_umount semaphore which seems rather ugly (but might be bearable if
> we did so in some well described wrapper).
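
For illustration, the three-task ordering in the table above can be replayed
with a small single-threaded model. This is only a sketch: model_rwsem and
the functions below are hypothetical userspace stand-ins, not kernel code,
and the model assumes only what the quoted text states, namely that a new
reader queues behind a waiting writer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a reader/writer semaphore in which new readers
 * queue behind a waiting writer. Nothing here sleeps: each call reports
 * whether the lock would be granted or the caller would block. */
struct model_rwsem {
	int  readers;          /* tasks currently holding the lock for read */
	bool writer_active;    /* a task currently holds the lock for write */
	int  writers_waiting;  /* tasks queued for write access */
};

/* Returns true if the read lock is granted, false if the caller would block. */
bool model_down_read(struct model_rwsem *sem)
{
	if (sem->writer_active || sem->writers_waiting > 0)
		return false;
	sem->readers++;
	return true;
}

/* Returns true if the write lock is granted; otherwise the caller is queued. */
bool model_down_write(struct model_rwsem *sem)
{
	if (sem->writer_active || sem->readers > 0) {
		sem->writers_waiting++;
		return false;
	}
	sem->writer_active = true;
	return true;
}

/* Drops one read hold. */
void model_up_read(struct model_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* Replays the ordering from the table. Returns true if TASK 3 (unfreeze)
 * would block, which leaves nobody able to thaw the filesystem. */
bool unfreeze_would_block(void)
{
	struct model_rwsem s_umount = { 0, false, 0 };

	/* TASK 1 (flusher): gets the read lock, then waits for the thaw. */
	bool t1_granted = model_down_read(&s_umount);
	/* TASK 2 (remount): queues for write access behind TASK 1. */
	bool t2_granted = model_down_write(&s_umount);
	/* TASK 3 (unfreeze): its read request lands behind the waiting writer. */
	bool t3_granted = model_down_read(&s_umount);

	return t1_granted && !t2_granted && !t3_granted;
}
```

In this model unfreeze_would_block() returns true: TASK 1 never drops its
read hold because it is waiting for a thaw that TASK 3 cannot perform.
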
I wrote a patch along the lines you describe yesterday.

I got a reproducer from Mizuma-san yesterday and ran it on a kernel
without the fix. After an hour, I confirmed that the deadlock occurred.

However, on a kernel with the fix applied, the deadlock has still not
occurred after 12 hours.

The patch for linux-2.6.38-rc4 is as follows:
---
 fs/fs-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Comments

Jan Kara Feb. 16, 2011, 2:56 p.m. UTC | #1
On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> On Tue, 15 Feb 2011 18:29:54 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > under s_umount semaphore, we are prone to deadlock like the one you
> > > > describe above.
> > > 
> > > One of the fundamental problems here is that the freeze and thaw
> > > routines are using down_write(&sb->s_umount) for two purposes.  The
> > > first is to prevent the resume/thaw from racing with a umount (which
> > > it could do just as well by taking a read lock), but the second is to
> > > prevent the resume/thaw code from racing with itself.  That's the core
> > > fundamental problem here.
> > > 
> > > So I think we can solve this by introducing a new mutex, s_freeze, and
> > > having the resume/thaw first take the s_freeze mutex and then
> > > second take a read lock on the s_umount.
> >   Sadly this does not quite work because even down_read(&sb->s_umount)
> > in thaw_super() can block if there is another process that tries to acquire
> > s_umount for writing - a situation like:
> >   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > down_read(&sb->s_umount)
> >   block on s_frozen
> > 				down_write(&sb->s_umount)
> > 				  -blocked
> > 								down_read(&sb->s_umount)
> > 								  -blocked
> > behind the write access...
> > 
> > The only working solution I see is to check for frozen filesystem before
> > taking s_umount semaphore which seems rather ugly (but might be bearable if
> > we did so in some well described wrapper).
> I created the patch that you imagine yesterday.
>  
> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> without a fixed patch. After an hour, I confirmed that this deadlock happened.
> 
> However, on the kernel with a fixed patch, this deadlock doesn't still happen 
> after 12 hours passed.
> 
> The patch for linux-2.6.38-rc4 is as follows:
> ---
>  fs/fs-writeback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 59c6e49..1c9a05e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>         spin_unlock(&sb_lock);
> 
>         if (down_read_trylock(&sb->s_umount)) {
> -               if (sb->s_root)
> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>                         return true;
>                 up_read(&sb->s_umount);
  So this is something along the lines I thought but it actually won't work
for example if sync(1) is run while the filesystem is frozen (that takes
s_umount semaphore in a different place). And generally, I'm not convinced
there are not other places that try to do IO while holding s_umount
semaphore...

									Honza
Toshiyuki Okajima Feb. 17, 2011, 3:50 a.m. UTC | #2
(2011/02/16 23:56), Jan Kara wrote:
> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>> On Tue, 15 Feb 2011 18:29:54 +0100
>> Jan Kara <jack@suse.cz> wrote:
>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>> describe above.
>>>>
>>>> One of the fundamental problems here is that the freeze and thaw
>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>> it could do just as well by taking a read lock), but the second is to
>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>> fundamental problem here.
>>>>
>>>> So I think we can solve this by introducing a new mutex, s_freeze, and
>>>> having the resume/thaw first take the s_freeze mutex and then
>>>> second take a read lock on the s_umount.
>>>    Sadly this does not quite work because even down_read(&sb->s_umount)
>>> in thaw_super() can block if there is another process that tries to acquire
>>> s_umount for writing - a situation like:
>>>    TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>> down_read(&sb->s_umount)
>>>    block on s_frozen
>>> 				down_write(&sb->s_umount)
>>> 				  -blocked
>>> 								down_read(&sb->s_umount)
>>> 								  -blocked
>>> behind the write access...
>>>
>>> The only working solution I see is to check for frozen filesystem before
>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>> we did so in some well described wrapper).
>> I created the patch that you imagine yesterday.
>>
>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>
>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>> after 12 hours passed.
>>
>> The patch for linux-2.6.38-rc4 is as follows:
>> ---
>>   fs/fs-writeback.c |    2 +-
>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 59c6e49..1c9a05e 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>          spin_unlock(&sb_lock);
>>
>>          if (down_read_trylock(&sb->s_umount)) {
>> -               if (sb->s_root)
>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>                          return true;
>>                  up_read(&sb->s_umount);

>    So this is something along the lines I thought but it actually won't work
> for example if sync(1) is run while the filesystem is frozen (that takes
> s_umount semaphore in a different place). And generally, I'm not convinced
> there are not other places that try to do IO while holding s_umount
> semaphore...
OK. I understand.

This patch only fixes the deadlock on the following path:
writeback_inodes_wb
-> ext4_da_writepages
    -> ext4_journal_start_sb
       -> vfs_check_frozen
It does not fix the other cases.

Do we have to modify the filesystem-specific code in order to fix all cases?

Regards,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger Feb. 17, 2011, 5:13 a.m. UTC | #3
On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
>> 
>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>> 
>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>> after 12 hours passed.
>>> 
>>> The patch for linux-2.6.38-rc4 is as follows:
>>> ---
>>>  fs/fs-writeback.c |    2 +-
>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>> 
>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>> index 59c6e49..1c9a05e 100644
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>         spin_unlock(&sb_lock);
>>> 
>>>         if (down_read_trylock(&sb->s_umount)) {
>>> -               if (sb->s_root)
>>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
>>>                         return true;
>>>                 up_read(&sb->s_umount);

This seems like a very low-risk fix.

>>   So this is something along the lines I thought but it actually won't work
>> for example if sync(1) is run while the filesystem is frozen (that takes
>> s_umount semaphore in a different place). And generally, I'm not convinced
>> there are not other places that try to do IO while holding s_umount
>> semaphore...
> 
> OK. I understand.
> 
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
>   -> ext4_journal_start_sb
>      -> vfs_check_frozen
> But, the code doesn't fix the other cases.
> 
> We must modify the local filesystem part in order to fix all cases...?

It seems worthwhile to implement the low-risk fix that covers the common case, and if/when someone hits the rare 3-process case and/or submits a patch for it then that one will be fixed also.

Cheers, Andreas
Jan Kara Feb. 17, 2011, 10:41 a.m. UTC | #4
On Wed 16-02-11 22:13:53, Andreas Dilger wrote:
> On 2011-02-16, at 20:50, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> >> 
> >>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>> 
> >>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>> after 12 hours passed.
> >>> 
> >>> The patch for linux-2.6.38-rc4 is as follows:
> >>> ---
> >>>  fs/fs-writeback.c |    2 +-
> >>>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>> 
> >>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>> index 59c6e49..1c9a05e 100644
> >>> --- a/fs/fs-writeback.c
> >>> +++ b/fs/fs-writeback.c
> >>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >>>         spin_unlock(&sb_lock);
> >>> 
> >>>         if (down_read_trylock(&sb->s_umount)) {
> >>> -               if (sb->s_root)
> >>> +               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >>>                         return true;
> >>>                 up_read(&sb->s_umount);
> 
> This seems like a very low-risk fix.
> 
> >>   So this is something along the lines I thought but it actually won't work
> >> for example if sync(1) is run while the filesystem is frozen (that takes
> >> s_umount semaphore in a different place). And generally, I'm not convinced
> >> there are not other places that try to do IO while holding s_umount
> >> semaphore...
> > 
> > OK. I understand.
> > 
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> >   -> ext4_journal_start_sb
> >      -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> > 
> > We must modify the local filesystem part in order to fix all cases...?
> 
> It seems worthwhile to implement the low-risk fix that covers the common
> case, and if/when someone hits the rare 3-process case and/or submits a
> patch for it then that one will be fixed also.
  Yes, the fix is simple enough that I won't oppose it getting in as a
band aid and if we add this band aid to fs/sync.c:sync_one_sb(), it would
even be a reasonably reliable band aid. But that doesn't change the fact
that the locking is simply broken ;).

								Honza
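
As a rough illustration of the band aid suggested above for
fs/sync.c:sync_one_sb(), the same frozen check can be modelled in userspace.
Everything in this sketch is an assumption made for illustration: struct
sync_sb, model_sync_one_sb() and model_sync_all() are hypothetical names,
and the thread does not show the real function's signature or where
s_umount is taken around it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of a superblock that matter here. */
struct sync_sb {
	bool read_only;   /* nothing to write back on a read-only filesystem */
	bool frozen;      /* stand-in for sb->s_frozen != SB_UNFROZEN */
	int  sync_calls;  /* counts how often writeback was started */
};

/* Stand-in for the per-superblock sync callback. */
void model_sync_one_sb(struct sync_sb *sb)
{
	/* The band aid: never start IO on a frozen filesystem, so the
	 * caller cannot end up waiting for a thaw while holding s_umount. */
	if (sb->frozen)
		return;
	if (!sb->read_only)
		sb->sync_calls++;
}

/* Walks all superblocks the way a sync(1)-style pass would.
 * Returns how many superblocks were skipped because they were frozen. */
size_t model_sync_all(struct sync_sb *sbs, size_t count)
{
	size_t skipped = 0;

	assert(sbs != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (sbs[i].frozen)
			skipped++;
		model_sync_one_sb(&sbs[i]);
	}
	return skipped;
}
```

The trade-off in this sketch is that a frozen filesystem is silently skipped
rather than synced later, which is why it is only a band aid.
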
Jan Kara Feb. 17, 2011, 10:45 a.m. UTC | #5
On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> (2011/02/16 23:56), Jan Kara wrote:
> >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>On Tue, 15 Feb 2011 18:29:54 +0100
> >>Jan Kara <jack@suse.cz> wrote:
> >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> >>>>>describe above.
> >>>>
> >>>>One of the fundamental problems here is that the freeze and thaw
> >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> >>>>first is to prevent the resume/thaw from racing with a umount (which
> >>>>it could do just as well by taking a read lock), but the second is to
> >>>>prevent the resume/thaw code from racing with itself.  That's the core
> >>>>fundamental problem here.
> >>>>
> >>>>So I think we can solve this by introducing a new mutex, s_freeze, and
> >>>>having the resume/thaw first take the s_freeze mutex and then
> >>>>second take a read lock on the s_umount.
> >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> >>>in thaw_super() can block if there is another process that tries to acquire
> >>>s_umount for writing - a situation like:
> >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> >>>down_read(&sb->s_umount)
> >>>   block on s_frozen
> >>>				down_write(&sb->s_umount)
> >>>				  -blocked
> >>>								down_read(&sb->s_umount)
> >>>								  -blocked
> >>>behind the write access...
> >>>
> >>>The only working solution I see is to check for frozen filesystem before
> >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> >>>we did so in some well described wrapper).
> >>I created the patch that you imagine yesterday.
> >>
> >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> >>
> >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> >>after 12 hours passed.
> >>
> >>The patch for linux-2.6.38-rc4 is as follows:
> >>---
> >>  fs/fs-writeback.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> >>index 59c6e49..1c9a05e 100644
> >>--- a/fs/fs-writeback.c
> >>+++ b/fs/fs-writeback.c
> >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> >>         spin_unlock(&sb_lock);
> >>
> >>         if (down_read_trylock(&sb->s_umount)) {
> >>-               if (sb->s_root)
> >>+               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> >>                         return true;
> >>                 up_read(&sb->s_umount);
> 
> >   So this is something along the lines I thought but it actually won't work
> >for example if sync(1) is run while the filesystem is frozen (that takes
> >s_umount semaphore in a different place). And generally, I'm not convinced
> >there are not other places that try to do IO while holding s_umount
> >semaphore...
> OK. I understand.
> 
> This code only fixes the case for the following path:
> writeback_inodes_wb
> -> ext4_da_writepages
>    -> ext4_journal_start_sb
>       -> vfs_check_frozen
> But, the code doesn't fix the other cases.
> 
> We must modify the local filesystem part in order to fix all cases...?
  Yes, possibly. But most importantly we should first find clear locking
rules for frozen filesystem that avoid deadlocks like the one above. And
the freezing / unfreezing code might become subtle for that reason, that's
fine, but it would be really good to avoid any complicated things for the
code in the rest of the VFS / filesystems.

								Honza
Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59c6e49..1c9a05e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -456,7 +456,7 @@  static bool pin_sb_for_writeback(struct super_block *sb)
        spin_unlock(&sb_lock);

        if (down_read_trylock(&sb->s_umount)) {
-               if (sb->s_root)
+               if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
                        return true;
                up_read(&sb->s_umount);
        }
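
The effect of the one-line change can be sketched in userspace as follows.
This is a minimal single-threaded model, not the kernel function: struct
model_sb, the MODEL_SB_* constants and the model_* helpers are hypothetical
stand-ins, and the sb_lock handling visible in the hunk is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's freeze states. */
enum model_freeze_state { MODEL_SB_UNFROZEN = 0, MODEL_SB_FROZEN = 1 };

/* Hypothetical stand-in for struct super_block. */
struct model_sb {
	int  umount_readers;     /* read holders of the s_umount stand-in */
	bool umount_write_held;  /* true while a writer holds s_umount */
	enum model_freeze_state s_frozen;
	const void *s_root;      /* NULL when there is no mounted root */
};

/* Non-blocking read acquire: fails instead of waiting for a writer. */
bool model_down_read_trylock(struct model_sb *sb)
{
	if (sb->umount_write_held)
		return false;
	sb->umount_readers++;
	return true;
}

/* Drops one read hold. */
void model_up_read(struct model_sb *sb)
{
	assert(sb->umount_readers > 0);
	sb->umount_readers--;
}

/* Mirrors the patched condition: pin the superblock only if it is both
 * unfrozen and still has a root. On success the caller keeps the read
 * hold; on failure the hold is dropped before returning. */
bool model_pin_sb_for_writeback(struct model_sb *sb)
{
	if (model_down_read_trylock(sb)) {
		if (sb->s_frozen == MODEL_SB_UNFROZEN && sb->s_root)
			return true;
		model_up_read(sb);
	}
	return false;
}
```

With the extra check, a flusher in this model backs off from a frozen
superblock without keeping a read hold on s_umount, so it cannot be the
TASK 1 of the scenario discussed in the thread.
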