diff mbox

[RFC] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Message ID 20110328170628.ffe314fb.toshi.okajima@jp.fujitsu.com
State New, archived
Headers show

Commit Message

Toshiyuki Okajima March 28, 2011, 8:06 a.m. UTC
Hi.

On Thu, 17 Feb 2011 11:45:52 +0100
Jan Kara <jack@suse.cz> wrote:
> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > (2011/02/16 23:56), Jan Kara wrote:
> > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > >>Jan Kara<jack@suse.cz>  wrote:
> > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > >>>>>describe above.
> > >>>>
> > >>>>One of the fundamental problems here is that the freeze and thaw
> > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > >>>>it could do just as well by taking a read lock), but the second is to
> > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > >>>>fundamental problem here.
> > >>>>
> > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > >>>>second take a read lock on the s_umount.
> > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > >>>in thaw_super() can block if there is another process that tries to acquire
> > >>>s_umount for writing - a situation like:
> > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > >>>down_read(&sb->s_umount)
> > >>>   block on s_frozen
> > >>>				down_write(&sb->s_umount)
> > >>>				  -blocked
> > >>>								down_read(&sb->s_umount)
> > >>>								  -blocked
> > >>>behind the write access...
> > >>>
> > >>>The only working solution I see is to check for frozen filesystem before
> > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > >>>we did so in some well described wrapper).
> > >>I created the patch that you imagine yesterday.
> > >>
> > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > >>
> > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > >>after 12 hours passed.
> > >>
> > >>The patch for linux-2.6.38-rc4 is as follows:
> > >>---
> > >>  fs/fs-writeback.c |    2 +-
> > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > >>
> > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > >>index 59c6e49..1c9a05e 100644
> > >>--- a/fs/fs-writeback.c
> > >>+++ b/fs/fs-writeback.c
> > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > >>         spin_unlock(&sb_lock);
> > >>
> > >>         if (down_read_trylock(&sb->s_umount)) {
> > >>-               if (sb->s_root)
> > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > >>                         return true;
> > >>                 up_read(&sb->s_umount);
> > 
> > >   So this is something along the lines I thought but it actually won't work
> > >for example if sync(1) is run while the filesystem is frozen (that takes
> > >s_umount semaphore in a different place). And generally, I'm not convinced
> > >there are not other places that try to do IO while holding s_umount
> > >semaphore...
> > OK. I understand.
> > 
> > This code only fixes the case for the following path:
> > writeback_inodes_wb
> > -> ext4_da_writepages
> >    -> ext4_journal_start_sb
> >       -> vfs_check_frozen
> > But, the code doesn't fix the other cases.
> > 
> > We must modify the local filesystem part in order to fix all cases...?
>   Yes, possibly. But most importantly we should first find clear locking
> rules for frozen filesystem that avoid deadlocks like the one above. And
> the freezing / unfreezing code might become subtle for that reason, that's
> fine, but it would be really good to avoid any complicated things for the
> code in the rest of the VFS / filesystems.
I have deeply continued to examined the root cause of this problem, then 
I found it.

It is that we can write a memory which is mmaped to a file. Then the memory 
becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
"writeback" the memory. 

Therefore, the root cause of this hangup is not only ext4 component (with
delayed allocation feature) but also writeback mechanism for mmap. If you 
use the other filesystem, you can write something to the filesystem though 
you have freezed the filesystem.

A sample problem is attached on this mail.  Try to execute it then you can 
confirm that we can write some data to your filesystem while freezing the 
filesystem.
(If you change FS variable in go.sh from ext3 to ext4 and you execute
"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)

I think the best approach to fix this problem is to let users not to write
memory which is mapped to a certain file while the filesystem is freezing. 
However, it is very difficult to control users not to write memory which has 
been already mapped to the file.

Therefore, I think there is only actual method that we stop writeback thread 
to resolve the mmap problem. Also, by this fix, the original problem 
(ext4 delayed write vs unfreeze) can be solved.

I created a patch for this problem. Please confirm it.

------------------------------------------------------------------------------
----------
reproducer
----------
[run script] go.sh
#!/bin/sh

FS=ext3
gcc -o ./write ./write.c
dd if=/dev/zero of=/tmp/loop.$$ bs=1k seek=64k count=1 > /dev/null 2>&1
/sbin/mkfs.$FS -Fq /tmp/loop.$$
/sbin/losetup /dev/loop7 /tmp/loop.$$
mkdir -p mnt
mount -t $FS /dev/loop7 mnt
dd if=/dev/zero of=mnt/file bs=4k count=100 > /dev/null 2>&1
./write mnt/file &
pid=$!
# write 0 then 1
/bin/kill -SIGUSR1 $pid 
/bin/kill -SIGUSR1 $pid 
/sbin/fsfreeze -f mnt
cp /tmp/loop.$$ /tmp/loop.$$.pre
/bin/kill -SIGUSR1 $pid
sync
sleep 30
cp /tmp/loop.$$ /tmp/loop.$$.post
/sbin/fsfreeze -u mnt
/bin/kill -SIGTERM $pid
umount mnt
/sbin/losetup -d /dev/loop7
/usr/bin/cmp /tmp/loop.$$.pre /tmp/loop.$$.post > /dev/null 2>&1
if [ $? -ne 0 ]; then
        echo "freeze doesn't work correctly!"
else
        echo "freeze works correctly!"
fi
rm -f /tmp/loop.$$* 
exit 0

[program] write.c
#define LARGEFILE64_SOURCE

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <signal.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int counter = 0;
char *mmap_addr;
int fd;

#define LOOP 100
#define UNIT 4096
#define MMAPSZ  (UNIT*LOOP)
#define FILENAME "./mnt/file"

void
write_inc(int sig)
{
        int i;

        for (i = 0; i < LOOP; i++) 
                *((int*)(mmap_addr + UNIT*i)) = counter;
        counter ++;
}

void
main_exit(int sig)
{
        munmap(mmap_addr, MMAPSZ);
        close(fd);
        exit(0);
}

int main(int argc, char *argv[])
{
        char *file = FILENAME;

        if ((fd = open(file, O_RDWR)) < 0) {
                perror("open error");
                exit(1);
        }
        if ((mmap_addr = mmap(0, MMAPSZ, PROT_WRITE, MAP_SHARED, fd, 0)) ==
MAP_FAILED) {
                perror("mmap error");
                close(fd);
                exit(2);
        }
        sigset(SIGTERM, (void *)main_exit);
        sigset(SIGUSR1, (void *)write_inc);
        while (1) 
                pause();
}

[step to rerproduce]
# sh ./go.sh 
------------------------------------------------------------------------------

[patch]
Now, we can write the memory which is mapped to a file while 
the filesystem to which it belongs is being freezed.
Therefore, the filesystem can modify even if it is being freezed.
This fix prevents the flusher thread from updating the filesystem.

Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
---
 fs/fs-writeback.c   |    2 +-
 fs/super.c          |    7 ++++++-
 mm/page-writeback.c |    2 ++
 3 files changed, 9 insertions(+), 2 deletions(-)

Comments

Dave Chinner March 31, 2011, 11:40 p.m. UTC | #1
On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> Hi.
> 
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<jack@suse.cz>  wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>>   block on s_frozen
> > > >>>				down_write(&sb->s_umount)
> > > >>>				  -blocked
> > > >>>								down_read(&sb->s_umount)
> > > >>>								  -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >>  fs/fs-writeback.c |    2 +-
> > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >>         spin_unlock(&sb_lock);
> > > >>
> > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > >>-               if (sb->s_root)
> > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > >>                         return true;
> > > >>                 up_read(&sb->s_umount);
> > > 
> > > >   So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > > 
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > >    -> ext4_journal_start_sb
> > >       -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > > 
> > > We must modify the local filesystem part in order to fix all cases...?
> >   Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have deeply continued to examined the root cause of this problem, then 
> I found it.
> 
> It is that we can write a memory which is mmaped to a file. Then the memory 
> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> "writeback" the memory. 

Then surely the issue is that .page_mkwrite is not checking that the
filesystem is frozen before allowing the page fault to continue and
dirty the page?

> I think the best approach to fix this problem is to let users not to write
> memory which is mapped to a certain file while the filesystem is freezing. 
> However, it is very difficult to control users not to write memory which has 
> been already mapped to the file.

If you don't allow the page to be dirtied in the fist place, then
nothing needs to be done to the writeback path because there is
nothing dirty for it to write back.

Cheers,

Dave.
Eric Sandeen March 31, 2011, 11:53 p.m. UTC | #2
On 3/31/11 6:40 PM, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara <jack@suse.cz> wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>>>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>>   Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>>   block on s_frozen
>>>>>>> 				down_write(&sb->s_umount)
>>>>>>> 				  -blocked
>>>>>>> 								down_read(&sb->s_umount)
>>>>>>> 								  -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>>  fs/fs-writeback.c |    2 +-
>>>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>>         spin_unlock(&sb_lock);
>>>>>>
>>>>>>         if (down_read_trylock(&sb->s_umount)) {
>>>>>> -               if (sb->s_root)
>>>>>> +               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
>>>>>>                         return true;
>>>>>>                 up_read(&sb->s_umount);
>>>>
>>>>>   So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> -> ext4_da_writepages
>>>>    -> ext4_journal_start_sb
>>>>       -> vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>>   Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then 
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory 
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory. 
> 
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
> 
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing. 
>> However, it is very difficult to control users not to write memory which has 
>> been already mapped to the file.
> 
> If you don't allow the page to be dirtied in the fist place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.

I floated 

[PATCH, RFC] check for frozen filesystems in the mmap path

a long time ago, but it went nowhere; maybe time to revive that approach.

-Eric

> Cheers,
> 
> Dave.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 1, 2011, 2:08 p.m. UTC | #3
On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > On Thu, 17 Feb 2011 11:45:52 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > >>>>>describe above.
> > > > >>>>
> > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > >>>>fundamental problem here.
> > > > >>>>
> > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > >>>>second take a read lock on the s_umount.
> > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > >>>s_umount for writing - a situation like:
> > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > >>>down_read(&sb->s_umount)
> > > > >>>   block on s_frozen
> > > > >>>				down_write(&sb->s_umount)
> > > > >>>				  -blocked
> > > > >>>								down_read(&sb->s_umount)
> > > > >>>								  -blocked
> > > > >>>behind the write access...
> > > > >>>
> > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > >>>we did so in some well described wrapper).
> > > > >>I created the patch that you imagine yesterday.
> > > > >>
> > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > >>
> > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > >>after 12 hours passed.
> > > > >>
> > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > >>---
> > > > >>  fs/fs-writeback.c |    2 +-
> > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > >>
> > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > >>index 59c6e49..1c9a05e 100644
> > > > >>--- a/fs/fs-writeback.c
> > > > >>+++ b/fs/fs-writeback.c
> > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > >>         spin_unlock(&sb_lock);
> > > > >>
> > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > >>-               if (sb->s_root)
> > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > >>                         return true;
> > > > >>                 up_read(&sb->s_umount);
> > > > 
> > > > >   So this is something along the lines I thought but it actually won't work
> > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > >there are not other places that try to do IO while holding s_umount
> > > > >semaphore...
> > > > OK. I understand.
> > > > 
> > > > This code only fixes the case for the following path:
> > > > writeback_inodes_wb
> > > > -> ext4_da_writepages
> > > >    -> ext4_journal_start_sb
> > > >       -> vfs_check_frozen
> > > > But, the code doesn't fix the other cases.
> > > > 
> > > > We must modify the local filesystem part in order to fix all cases...?
> > >   Yes, possibly. But most importantly we should first find clear locking
> > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > the freezing / unfreezing code might become subtle for that reason, that's
> > > fine, but it would be really good to avoid any complicated things for the
> > > code in the rest of the VFS / filesystems.
> > I have deeply continued to examined the root cause of this problem, then 
> > I found it.
> > 
> > It is that we can write a memory which is mmaped to a file. Then the memory 
> > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > "writeback" the memory. 
> 
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
  And is this a bug? That isn't clear to me...

> > I think the best approach to fix this problem is to let users not to write
> > memory which is mapped to a certain file while the filesystem is freezing. 
> > However, it is very difficult to control users not to write memory which has 
> > been already mapped to the file.
> 
> If you don't allow the page to be dirtied in the fist place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
  Sure but that's only the problem he was able to hit. But generally,
there's a problem with needing s_umount for unfreezing because it isn't
clear there aren't other code paths which can block with s_umount held
waiting for fs to get unfrozen. And these code paths would cause the same
deadlock. That's why I chose to get rid of s_umount during thawing.

									Honza
Toshiyuki Okajima April 5, 2011, 10:44 a.m. UTC | #4
Hi.

(2011/04/01 8:40), Dave Chinner wrote:
> On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
>> Hi.
>>
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<jack@suse.cz>  wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>>>> Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
>>>>>>>>> under s_umount semaphore, we are prone to deadlock like the one you
>>>>>>>>> describe above.
>>>>>>>>
>>>>>>>> One of the fundamental problems here is that the freeze and thaw
>>>>>>>> routines are using down_write(&sb->s_umount) for two purposes.  The
>>>>>>>> first is to prevent the resume/thaw from racing with a umount (which
>>>>>>>> it could do just as well by taking a read lock), but the second is to
>>>>>>>> prevent the resume/thaw code from racing with itself.  That's the core
>>>>>>>> fundamental problem here.
>>>>>>>>
>>>>>>>> So I think we can solve this by introduce a new mutex, s_freeze, and
>>>>>>>> having the the resume/thaw first take the s_freeze mutex and then
>>>>>>>> second take a read lock on the s_umount.
>>>>>>>    Sadly this does not quite work because even down_read(&sb->s_umount)
>>>>>>> in thaw_super() can block if there is another process that tries to acquire
>>>>>>> s_umount for writing - a situation like:
>>>>>>>    TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
>>>>>>> down_read(&sb->s_umount)
>>>>>>>    block on s_frozen
>>>>>>> 				down_write(&sb->s_umount)
>>>>>>> 				  -blocked
>>>>>>> 								down_read(&sb->s_umount)
>>>>>>> 								  -blocked
>>>>>>> behind the write access...
>>>>>>>
>>>>>>> The only working solution I see is to check for frozen filesystem before
>>>>>>> taking s_umount semaphore which seems rather ugly (but might be bearable if
>>>>>>> we did so in some well described wrapper).
>>>>>> I created the patch that you imagine yesterday.
>>>>>>
>>>>>> I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
>>>>>> without a fixed patch. After an hour, I confirmed that this deadlock happened.
>>>>>>
>>>>>> However, on the kernel with a fixed patch, this deadlock doesn't still happen
>>>>>> after 12 hours passed.
>>>>>>
>>>>>> The patch for linux-2.6.38-rc4 is as follows:
>>>>>> ---
>>>>>>   fs/fs-writeback.c |    2 +-
>>>>>>   1 files changed, 1 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>>>>>> index 59c6e49..1c9a05e 100644
>>>>>> --- a/fs/fs-writeback.c
>>>>>> +++ b/fs/fs-writeback.c
>>>>>> @@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
>>>>>>          spin_unlock(&sb_lock);
>>>>>>
>>>>>>          if (down_read_trylock(&sb->s_umount)) {
>>>>>> -               if (sb->s_root)
>>>>>> +               if (sb->s_frozen == SB_UNFROZEN&&   sb->s_root)
>>>>>>                          return true;
>>>>>>                  up_read(&sb->s_umount);
>>>>
>>>>>    So this is something along the lines I thought but it actually won't work
>>>>> for example if sync(1) is run while the filesystem is frozen (that takes
>>>>> s_umount semaphore in a different place). And generally, I'm not convinced
>>>>> there are not other places that try to do IO while holding s_umount
>>>>> semaphore...
>>>> OK. I understand.
>>>>
>>>> This code only fixes the case for the following path:
>>>> writeback_inodes_wb
>>>> ->  ext4_da_writepages
>>>>     ->  ext4_journal_start_sb
>>>>        ->  vfs_check_frozen
>>>> But, the code doesn't fix the other cases.
>>>>
>>>> We must modify the local filesystem part in order to fix all cases...?
>>>    Yes, possibly. But most importantly we should first find clear locking
>>> rules for frozen filesystem that avoid deadlocks like the one above. And
>>> the freezing / unfreezing code might become subtle for that reason, that's
>>> fine, but it would be really good to avoid any complicated things for the
>>> code in the rest of the VFS / filesystems.
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>
> Then surely the issue is that .page_mkwrite is not checking that the
> filesystem is frozen before allowing the page fault to continue and
> dirty the page?
>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>

> If you don't allow the page to be dirtied in the fist place, then
> nothing needs to be done to the writeback path because there is
> nothing dirty for it to write back.
OK. We can block the write operation by not allowing the page to be
dirtied in the first place.

But we can not easily stop writing the page which is *already mapped*
in the next place. Therefore I think writing back such pages can
be blocked only in the flusher thread.

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 6, 2011, 5:40 a.m. UTC | #5
On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > Jan Kara <jack@suse.cz> wrote:
> > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > >>>>>describe above.
> > > > > >>>>
> > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > > >>>>fundamental problem here.
> > > > > >>>>
> > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > > >>>>second take a read lock on the s_umount.
> > > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > >>>s_umount for writing - a situation like:
> > > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > > >>>down_read(&sb->s_umount)
> > > > > >>>   block on s_frozen
> > > > > >>>				down_write(&sb->s_umount)
> > > > > >>>				  -blocked
> > > > > >>>								down_read(&sb->s_umount)
> > > > > >>>								  -blocked
> > > > > >>>behind the write access...
> > > > > >>>
> > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > >>>we did so in some well described wrapper).
> > > > > >>I created the patch that you imagine yesterday.
> > > > > >>
> > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > >>
> > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > >>after 12 hours passed.
> > > > > >>
> > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > >>---
> > > > > >>  fs/fs-writeback.c |    2 +-
> > > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > >>
> > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > >>index 59c6e49..1c9a05e 100644
> > > > > >>--- a/fs/fs-writeback.c
> > > > > >>+++ b/fs/fs-writeback.c
> > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > >>         spin_unlock(&sb_lock);
> > > > > >>
> > > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > > >>-               if (sb->s_root)
> > > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > > >>                         return true;
> > > > > >>                 up_read(&sb->s_umount);
> > > > > 
> > > > > >   So this is something along the lines I thought but it actually won't work
> > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > >there are not other places that try to do IO while holding s_umount
> > > > > >semaphore...
> > > > > OK. I understand.
> > > > > 
> > > > > This code only fixes the case for the following path:
> > > > > writeback_inodes_wb
> > > > > -> ext4_da_writepages
> > > > >    -> ext4_journal_start_sb
> > > > >       -> vfs_check_frozen
> > > > > But, the code doesn't fix the other cases.
> > > > > 
> > > > > We must modify the local filesystem part in order to fix all cases...?
> > > >   Yes, possibly. But most importantly we should first find clear locking
> > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > fine, but it would be really good to avoid any complicated things for the
> > > > code in the rest of the VFS / filesystems.
> > > I have deeply continued to examined the root cause of this problem, then 
> > > I found it.
> > > 
> > > It is that we can write a memory which is mmaped to a file. Then the memory 
> > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > "writeback" the memory. 
> > 
> > Then surely the issue is that .page_mkwrite is not checking that the
> > filesystem is frozen before allowing the page fault to continue and
> > dirty the page?
>   And is this a bug? That isn't clear to me...

Given the semantics of a frozen filesystem, letting any object be
dirtied while frozen (be it an inode, a page, a metadata block, etc)
is definitely a bug.

The way the freeze code is architected is that incoming dirtying
events are prevented so that the writeback side does not need to
care about the frozen state of the filesystem at all. The freeze
operation is supposed to block new dirtiers, then flush all dirty
objects resulting in everything being clean in the filesystem.

Hence if no objects are being dirtied, then there should never be
any need to block writeback threads due to the filesytem being
frozen because, by definition, there should be no work for them to
do. Hence if objects are being dirtied while the filesystem is
frozen, then that is a bug.

> > > I think the best approach to fix this problem is to let users not to write
> > > memory which is mapped to a certain file while the filesystem is freezing. 
> > > However, it is very difficult to control users not to write memory which has 
> > > been already mapped to the file.
> > 
> > If you don't allow the page to be dirtied in the fist place, then
> > nothing needs to be done to the writeback path because there is
> > nothing dirty for it to write back.
>   Sure but that's only the problem he was able to hit. But generally,
> there's a problem with needing s_umount for unfreezing because it isn't
> clear there aren't other code paths which can block with s_umount held
> waiting for fs to get unfrozen. And these code paths would cause the same
> deadlock. That's why I chose to get rid of s_umount during thawing.

Holding the s_umount lock while checking if frozen and sleeping
is essentially an ABBA lock inversion bug that can bite in many more
places that just thawing the filesystem. . Any where this is done
should be fixed, so I don't think just removing the s_umount lock
from the thaw path is sufficient to avoid problems.

Cheers,

Dave.
Jan Kara April 6, 2011, 6:18 a.m. UTC | #6
On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > On Mon, Mar 28, 2011 at 05:06:28PM +0900, Toshiyuki Okajima wrote:
> > > > On Thu, 17 Feb 2011 11:45:52 +0100
> > > > Jan Kara <jack@suse.cz> wrote:
> > > > > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > > > > (2011/02/16 23:56), Jan Kara wrote:
> > > > > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > > > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > > > > >>Jan Kara<jack@suse.cz>  wrote:
> > > > > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > > > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > > > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > > > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > > > > >>>>>describe above.
> > > > > > >>>>
> > > > > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > > > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > > > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > > > > >>>>it could do just as well by taking a read lock), but the second is to
> > > > > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > > > > >>>>fundamental problem here.
> > > > > > >>>>
> > > > > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > > > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > > > > >>>>second take a read lock on the s_umount.
> > > > > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > > > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > > > > >>>s_umount for writing - a situation like:
> > > > > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > > > > >>>down_read(&sb->s_umount)
> > > > > > >>>   block on s_frozen
> > > > > > >>>				down_write(&sb->s_umount)
> > > > > > >>>				  -blocked
> > > > > > >>>								down_read(&sb->s_umount)
> > > > > > >>>								  -blocked
> > > > > > >>>behind the write access...
> > > > > > >>>
> > > > > > >>>The only working solution I see is to check for frozen filesystem before
> > > > > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > > > > >>>we did so in some well described wrapper).
> > > > > > >>I created the patch that you imagine yesterday.
> > > > > > >>
> > > > > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > > > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > > > > >>
> > > > > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > > > > >>after 12 hours passed.
> > > > > > >>
> > > > > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > > > > >>---
> > > > > > >>  fs/fs-writeback.c |    2 +-
> > > > > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > > > > >>
> > > > > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > > > >>index 59c6e49..1c9a05e 100644
> > > > > > >>--- a/fs/fs-writeback.c
> > > > > > >>+++ b/fs/fs-writeback.c
> > > > > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > > > > >>         spin_unlock(&sb_lock);
> > > > > > >>
> > > > > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > > > > >>-               if (sb->s_root)
> > > > > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > > > > >>                         return true;
> > > > > > >>                 up_read(&sb->s_umount);
> > > > > > 
> > > > > > >   So this is something along the lines I thought but it actually won't work
> > > > > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > > > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > > > > >there are not other places that try to do IO while holding s_umount
> > > > > > >semaphore...
> > > > > > OK. I understand.
> > > > > > 
> > > > > > This code only fixes the case for the following path:
> > > > > > writeback_inodes_wb
> > > > > > -> ext4_da_writepages
> > > > > >    -> ext4_journal_start_sb
> > > > > >       -> vfs_check_frozen
> > > > > > But, the code doesn't fix the other cases.
> > > > > > 
> > > > > > We must modify the local filesystem part in order to fix all cases...?
> > > > >   Yes, possibly. But most importantly we should first find clear locking
> > > > > rules for frozen filesystem that avoid deadlocks like the one above. And
> > > > > the freezing / unfreezing code might become subtle for that reason, that's
> > > > > fine, but it would be really good to avoid any complicated things for the
> > > > > code in the rest of the VFS / filesystems.
> > > > I have deeply continued to examined the root cause of this problem, then 
> > > > I found it.
> > > > 
> > > > It is that we can write a memory which is mmaped to a file. Then the memory 
> > > > becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> > > > "writeback" the memory. 
> > > 
> > > Then surely the issue is that .page_mkwrite is not checking that the
> > > filesystem is frozen before allowing the page fault to continue and
> > > dirty the page?
> >   And is this a bug? That isn't clear to me...
> 
> Given the semantics of a frozen filesystem, letting any object be
> dirtied while frozen (be it an inode, a page, a metadata block, etc)
> is definitely a bug.
>
> The way the freeze code is architected is that incoming dirtying
> events are prevented so that the writeback side does not need to
> care about the frozen state of the filesystem at all. The freeze
> operation is supposed to block new dirtiers, then flush all dirty
> objects resulting in everything being clean in the filesystem.
> 
> Hence if no objects are being dirtied, then there should never be
> any need to block writeback threads due to the filesytem being
> frozen because, by definition, there should be no work for them to
> do. Hence if objects are being dirtied while the filesystem is
> frozen, then that is a bug.
  OK, after some thought I start to agree with you that it would be nice
if we didn't allow the pages to be dirtied at the first place. Otherwise
things get a bit fragile as writing a data block does *not* need a
transaction start as such (we just happen to do it in all code paths)...

> > > > I think the best approach to fix this problem is to let users not to write
> > > > memory which is mapped to a certain file while the filesystem is freezing. 
> > > > However, it is very difficult to control users not to write memory which has 
> > > > been already mapped to the file.
> > > 
> > > If you don't allow the page to be dirtied in the fist place, then
> > > nothing needs to be done to the writeback path because there is
> > > nothing dirty for it to write back.
> >   Sure but that's only the problem he was able to hit. But generally,
> > there's a problem with needing s_umount for unfreezing because it isn't
> > clear there aren't other code paths which can block with s_umount held
> > waiting for fs to get unfrozen. And these code paths would cause the same
> > deadlock. That's why I chose to get rid of s_umount during thawing.
> 
> Holding the s_umount lock while checking if frozen and sleeping
> is essentially an ABBA lock inversion bug that can bite in many more
> places that just thawing the filesystem.  Any where this is done should
> be fixed, so I don't think just removing the s_umount lock from the thaw
> path is sufficient to avoid problems.
  That's easily said but hard to do - any transaction start in ext3/4 may
block on filesystem being frozen (this seems to be similar for XFS as I'm
looking into the code) and transaction start traditionally nests inside
s_umount (and basically there's no way around that since sync() calls your
fs code with s_umount held). So I'm afraid we are not going to get rid of
this ABBA dependency unless we declare that s_umount ranks above filesystem
being frozen - but surely I'm open to suggestions.

Another possibility is just to hide the problem e.g. by checking for frozen
filesystem whenever we try to get s_umount. But that looks a bit ugly to
me.

									Honza
Dave Chinner April 6, 2011, 11:21 a.m. UTC | #7
On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > If you don't allow the page to be dirtied in the fist place, then
> > > > nothing needs to be done to the writeback path because there is
> > > > nothing dirty for it to write back.
> > >   Sure but that's only the problem he was able to hit. But generally,
> > > there's a problem with needing s_umount for unfreezing because it isn't
> > > clear there aren't other code paths which can block with s_umount held
> > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > 
> > Holding the s_umount lock while checking if frozen and sleeping
> > is essentially an ABBA lock inversion bug that can bite in many more
> > places that just thawing the filesystem.  Any where this is done should
> > be fixed, so I don't think just removing the s_umount lock from the thaw
> > path is sufficient to avoid problems.
>   That's easily said but hard to do - any transaction start in ext3/4 may
> block on filesystem being frozen (this seems to be similar for XFS as I'm
> looking into the code) and transaction start traditionally nests inside
> s_umount (and basically there's no way around that since sync() calls your
> fs code with s_umount held).

Sure, but the question must be asked - why is ext3/4 even starting a
transaction on a clean filesystem during sync? A frozen filesystem,
by definition, is a clean filesytem, and therefore sync calls of any
kind should not be trying to write to the FS or start transactions.
XFS does this just fine, so I'd consider such behaviour on a frozen
filesystem a bug in ext3/4...

> So I'm afraid we are not going to get rid of
> this ABBA dependency unless we declare that s_umount ranks above filesystem
> being frozen - but surely I'm open to suggestions.

Not sure I understand what you are saying there - this is already
the case, isn't it? i.e. it has to be held exclusive to freeze a
filesystem...

> Another possibility is just to hide the problem e.g. by checking for frozen
> filesystem whenever we try to get s_umount. But that looks a bit ugly to
> me.

And not necessary, AFAICT.

Cheers,

Dave.
Christoph Hellwig April 6, 2011, 1:44 p.m. UTC | #8
On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesytem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

XFS does have one special case for this.  When writing the dummy log
record at the end of the freeze process we use _xfs_alloc_trans to
bypass the frozen filesystem check as we have to write out this record
when the filesystem already is frozen.  But that's after the main
sync with its normal transactions.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 6, 2011, 5:40 p.m. UTC | #9
On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > If you don't allow the page to be dirtied in the fist place, then
> > > > > nothing needs to be done to the writeback path because there is
> > > > > nothing dirty for it to write back.
> > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > clear there aren't other code paths which can block with s_umount held
> > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > 
> > > Holding the s_umount lock while checking if frozen and sleeping
> > > is essentially an ABBA lock inversion bug that can bite in many more
> > > places that just thawing the filesystem.  Any where this is done should
> > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > path is sufficient to avoid problems.
> >   That's easily said but hard to do - any transaction start in ext3/4 may
> > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > looking into the code) and transaction start traditionally nests inside
> > s_umount (and basically there's no way around that since sync() calls your
> > fs code with s_umount held).
> 
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesytem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...
  But by this you are essentially agreeing that the lock inversion is there
in principle. We just hide it by relying on the fact that no code path
trying to change anything with s_umount held (which is the right lock
ordering) gets called while the fs is frozen.  And that is fragile.
Actually, I've looked for a while and if you call quotactl(), it will get
s_umount and then tell filesystem to update quota information which blocks
inside the fs waiting for filesystem being unfrozen => deadlock. We can
change this code path to wait for frozen filesystem before taking s_umount
that essentially it just reinstates my point - it't fragile and IMHO we
need some more consistent way to handle this...

> > So I'm afraid we are not going to get rid of
> > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > being frozen - but surely I'm open to suggestions.
> 
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
  Not really. We freeze the fs under s_umount but freezing essentially
implements trylock semantics while setting s_frozen so that does not really
establish any lock dependency. What establishes lock dependency is the
thawing path which blocks on s_umount while the filesystem is still frozen.
And this dependency is the other way around - i.e., freezing above
s_umount. This is why I was messing with thawing code to fix this...
 
								Honza
Dave Chinner April 6, 2011, 10:54 p.m. UTC | #10
On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > If you don't allow the page to be dirtied in the fist place, then
> > > > > > nothing needs to be done to the writeback path because there is
> > > > > > nothing dirty for it to write back.
> > > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > clear there aren't other code paths which can block with s_umount held
> > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > > 
> > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > places that just thawing the filesystem.  Any where this is done should
> > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > path is sufficient to avoid problems.
> > >   That's easily said but hard to do - any transaction start in ext3/4 may
> > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > looking into the code) and transaction start traditionally nests inside
> > > s_umount (and basically there's no way around that since sync() calls your
> > > fs code with s_umount held).
> > 
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesytem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
>   But by this you are essentially agreeing that the lock inversion is there
> in principle. We just hide it by relying on the fact that no code path
> trying to change anything with s_umount held (which is the right lock
> ordering) gets called while the fs is frozen.  And that is fragile.

It's just another lock ordering rule. i.e. don't sleep on a frozen
filesystem with s_umount held.  It's no more fragile than the many
other lock ordering rules we have.

> Actually, I've looked for a while and if you call quotactl(), it will get
> s_umount and then tell filesystem to update quota information which blocks
> inside the fs waiting for filesystem being unfrozen => deadlock.

Which is a bug according to the above locking rule.

> We can
> change this code path to wait for frozen filesystem before taking s_umount
> that essentially it just reinstates my point - it't fragile and IMHO we
> need some more consistent way to handle this...
> 
> > > So I'm afraid we are not going to get rid of
> > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > being frozen - but surely I'm open to suggestions.
> > 
> > Not sure I understand what you are saying there - this is already
> > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > filesystem...
>   Not really. We freeze the fs under s_umount but freezing essentially
> implements trylock semantics while setting s_frozen so that does not really
> establish any lock dependency. What establishes lock dependency is the
> thawing path which blocks on s_umount while the filesystem is still frozen.
> And this dependency is the other way around - i.e., freezing above
> s_umount. This is why I was messing with thawing code to fix this...

It's just the tip of the iceberg. If we allow s_umount to be held
while waiting on a frozen filesystem, we open ouselves up to all
manner of problems. Such as umount hanging on a frozen fs,
(which means a shutdown with a frozen filesystem will hang), it can
hang sync, it can hang memory reclaim, it can hang in any path that
takes s_umount and hence do all sorts of bad things.

Yes, unthawing the filesystem will get things moving again with your
patch, but my point is that it simply does not address the problems
caused by the bad behaviour that has already occurred while the FS
is frozen. Fixing the thaw code in this way is like shooting the
messenger - it doesn't fix the problems being reported.

Cheers,

Dave.
Dave Chinner April 6, 2011, 10:59 p.m. UTC | #11
On Wed, Apr 06, 2011 at 09:44:28AM -0400, Christoph Hellwig wrote:
> On Wed, Apr 06, 2011 at 09:21:35PM +1000, Dave Chinner wrote:
> > Sure, but the question must be asked - why is ext3/4 even starting a
> > transaction on a clean filesystem during sync? A frozen filesystem,
> > by definition, is a clean filesytem, and therefore sync calls of any
> > kind should not be trying to write to the FS or start transactions.
> > XFS does this just fine, so I'd consider such behaviour on a frozen
> > filesystem a bug in ext3/4...
> 
> XFS does have one special case for this.  When writing the dummy log
> record at the end of the freeze process we use _xfs_alloc_trans to
> bypass the frozen filesystem check as we have to write out this record
> when the filesystem already is frozen.  But that's after the main
> sync with its normal transactions.

Right, that is a special case in the _freeze process_ (i.e. before
we've declared the FS frozen), not a normal operation on a frozen
filesystem.

If you want to list exceptions (i.e. where we explicitly avoid
writes to frozen fs), look for xfs_fs_writeable(), which stops
various write operations from proceeding when the fs is either
frozen, read-only or shut down.

Cheers,

Dave.
Jan Kara April 8, 2011, 9:33 p.m. UTC | #12
On Thu 07-04-11 08:54:01, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 07:40:01PM +0200, Jan Kara wrote:
> > On Wed 06-04-11 21:21:35, Dave Chinner wrote:
> > > On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> > > > On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> > > > > On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> > > > > > On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> > > > > > > If you don't allow the page to be dirtied in the fist place, then
> > > > > > > nothing needs to be done to the writeback path because there is
> > > > > > > nothing dirty for it to write back.
> > > > > >   Sure but that's only the problem he was able to hit. But generally,
> > > > > > there's a problem with needing s_umount for unfreezing because it isn't
> > > > > > clear there aren't other code paths which can block with s_umount held
> > > > > > waiting for fs to get unfrozen. And these code paths would cause the same
> > > > > > deadlock. That's why I chose to get rid of s_umount during thawing.
> > > > > 
> > > > > Holding the s_umount lock while checking if frozen and sleeping
> > > > > is essentially an ABBA lock inversion bug that can bite in many more
> > > > > places that just thawing the filesystem.  Any where this is done should
> > > > > be fixed, so I don't think just removing the s_umount lock from the thaw
> > > > > path is sufficient to avoid problems.
> > > >   That's easily said but hard to do - any transaction start in ext3/4 may
> > > > block on filesystem being frozen (this seems to be similar for XFS as I'm
> > > > looking into the code) and transaction start traditionally nests inside
> > > > s_umount (and basically there's no way around that since sync() calls your
> > > > fs code with s_umount held).
> > > 
> > > Sure, but the question must be asked - why is ext3/4 even starting a
> > > transaction on a clean filesystem during sync? A frozen filesystem,
> > > by definition, is a clean filesytem, and therefore sync calls of any
> > > kind should not be trying to write to the FS or start transactions.
> > > XFS does this just fine, so I'd consider such behaviour on a frozen
> > > filesystem a bug in ext3/4...
> >   But by this you are essentially agreeing that the lock inversion is there
> > in principle. We just hide it by relying on the fact that no code path
> > trying to change anything with s_umount held (which is the right lock
> > ordering) gets called while the fs is frozen.  And that is fragile.
> 
> It's just another lock ordering rule. i.e. don't sleep on a frozen
> filesystem with s_umount held.  It's no more fragile than the many
> other lock ordering rules we have.
  Except that for all the filesystems transaction start => sleep on a
frozen filesystem and in some code paths we have s_umount held while doing
a transaction start. So I don't buy the argument that it's just another
normal locking rule because normally we require that all the code paths
follow correct lock ordering. Now we have some paths (like sync) which do
not follow the correct lock ordering and we just make sure they are not
called if they could cause deadlocks by other means...

> > Actually, I've looked for a while and if you call quotactl(), it will get
> > s_umount and then tell filesystem to update quota information which blocks
> > inside the fs waiting for filesystem being unfrozen => deadlock.
> 
> Which is a bug according to the above locking rule.
  Yes, I was just trying to demonstrate that the locking rule changes
"block until the fs is unfrozen" into "kernel is deadlocked" in an
unexpected places... fsync_bdev() is another case which deadlocks
currently.

> > We can
> > change this code path to wait for frozen filesystem before taking s_umount
> > that essentially it just reinstates my point - it't fragile and IMHO we
> > need some more consistent way to handle this...
> > 
> > > > So I'm afraid we are not going to get rid of
> > > > this ABBA dependency unless we declare that s_umount ranks above filesystem
> > > > being frozen - but surely I'm open to suggestions.
> > > 
> > > Not sure I understand what you are saying there - this is already
> > > the case, isn't it? i.e. it has to be held exclusive to freeze a
> > > filesystem...
> >   Not really. We freeze the fs under s_umount but freezing essentially
> > implements trylock semantics while setting s_frozen so that does not really
> > establish any lock dependency. What establishes lock dependency is the
> > thawing path which blocks on s_umount while the filesystem is still frozen.
> > And this dependency is the other way around - i.e., freezing above
> > s_umount. This is why I was messing with thawing code to fix this...
> 
> It's just the tip of the iceberg. If we allow s_umount to be held
> while waiting on a frozen filesystem, we open ouselves up to all
> manner of problems. Such as umount hanging on a frozen fs,
> (which means a shutdown with a frozen filesystem will hang), it can
> hang sync, it can hang memory reclaim, it can hang in any path that
> takes s_umount and hence do all sorts of bad things.
  I see. The umount hang (especially in the shutdown case) is not nice.
Direct reclaim won't be blocked AFAICS if we stop dirtying pages while the
fs is frozen (which, as I already wrote, I agree is not a good thing to do
after some thought). Since you can block while accessing the frozen
filesystem anyway (because of atime updates or just because of writing
process waiting with i_mutex held for fs to be unfrozen) I'm not sure how
much worse it would be if s_umount lock would be another lock with which
we can wait for fs to get unfrozen...

> Yes, unthawing the filesystem will get things moving again with your
> patch, but my point is that it simply does not address the problems
> caused by the bad behaviour that has already occurred while the FS
> is frozen. Fixing the thaw code in this way is like shooting the
> messenger - it doesn't fix the problems being reported.
  I don't there has been any too bad behavior - you tried to access frozen
filesystem and you got blocked. But OK, I'll invest some more thought into
how to not block with s_umount held without sprinkling frozen checks over
the tree...

								Honza
Surbhi Palande May 2, 2011, 9:07 a.m. UTC | #13
Hi,

On 04/06/2011 02:21 PM, Dave Chinner wrote:
> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>> nothing needs to be done to the writeback path because there is
>>>>> nothing dirty for it to write back.
>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>> clear there aren't other code paths which can block with s_umount held
>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>
>>> Holding the s_umount lock while checking if frozen and sleeping
>>> is essentially an ABBA lock inversion bug that can bite in many more
>>> places that just thawing the filesystem.  Any where this is done should
>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>> path is sufficient to avoid problems.
>>    That's easily said but hard to do - any transaction start in ext3/4 may
>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>> looking into the code) and transaction start traditionally nests inside
>> s_umount (and basically there's no way around that since sync() calls your
>> fs code with s_umount held).
>
> Sure, but the question must be asked - why is ext3/4 even starting a
> transaction on a clean filesystem during sync? A frozen filesystem,
> by definition, is a clean filesytem, and therefore sync calls of any
> kind should not be trying to write to the FS or start transactions.
> XFS does this just fine, so I'd consider such behaviour on a frozen
> filesystem a bug in ext3/4...

I had a look at the xfs code for seeing how this is done.
xfs_file_aio_write()
   xfs_wait_for_freeze()
     vfs_check_frozen()
So xfs_file_aio_write() writes to buffers when the FS is not frozen.

Now, I want to know what stops the following scenario from happening:
--------------------
xfs_file_aio_write()
   xfs_wait_for_freeze()
     vfs_check_frozen()
At this point F.S was not frozen, so the next instruction in the 
xfs_file_aio_write() will be executed next.
However at this point (i.e after checking if F.S is frozen) the write 
process gets pre-empted and say the _freeze_ process gets control.

Now the F.S freezes and the write process gets the control back. And so 
we end up writing to the page cache when the F.S is frozen.
--------------------

Can anyone please enlighten me on how & why this premption is _not_ 
possible?

If this pre-emption is _possible_, then can we use sb->s_umount to 
prevent a freeze from happening while a write to the page cache buffers 
is going on. Eg:

* Before writing to the buffers in the page cache:

   down_write(sb->s_umount)
     if(sb->s_frozen == SB_FREEZE_WRITE) {
       // do not sleep with the sb->s_umount semaphore.
       up_write(s_umount);
       vfs_check_frozen();
       // if you are here then fs is not thawed.
       down_write(sb->s_umount);
     }


Thanks!


Warm Regards,
Surbhi.




>
>> So I'm afraid we are not going to get rid of
>> this ABBA dependency unless we declare that s_umount ranks above filesystem
>> being frozen - but surely I'm open to suggestions.
>
> Not sure I understand what you are saying there - this is already
> the case, isn't it? i.e. it has to be held exclusive to freeze a
> filesystem...
>
>> Another possibility is just to hide the problem e.g. by checking for frozen
>> filesystem whenever we try to get s_umount. But that looks a bit ugly to
>> me.
>
> And not necessary, AFAICT.
>
> Cheers,
>
> Dave.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 10:56 a.m. UTC | #14
On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>nothing needs to be done to the writeback path because there is
> >>>>>nothing dirty for it to write back.
> >>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>clear there aren't other code paths which can block with s_umount held
> >>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>
> >>>Holding the s_umount lock while checking if frozen and sleeping
> >>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>places that just thawing the filesystem.  Any where this is done should
> >>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>path is sufficient to avoid problems.
> >>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>looking into the code) and transaction start traditionally nests inside
> >>s_umount (and basically there's no way around that since sync() calls your
> >>fs code with s_umount held).
> >
> >Sure, but the question must be asked - why is ext3/4 even starting a
> >transaction on a clean filesystem during sync? A frozen filesystem,
> >by definition, is a clean filesytem, and therefore sync calls of any
> >kind should not be trying to write to the FS or start transactions.
> >XFS does this just fine, so I'd consider such behaviour on a frozen
> >filesystem a bug in ext3/4...
> 
> I had a look at the xfs code for seeing how this is done.
> xfs_file_aio_write()
>   xfs_wait_for_freeze()
>     vfs_check_frozen()
> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> 
> Now, I want to know what stops the following scenario from happening:
> --------------------
> xfs_file_aio_write()
>   xfs_wait_for_freeze()
>     vfs_check_frozen()
> At this point F.S was not frozen, so the next instruction in the
> xfs_file_aio_write() will be executed next.
> However at this point (i.e after checking if F.S is frozen) the
> write process gets pre-empted and say the _freeze_ process gets
> control.
> 
> Now the F.S freezes and the write process gets the control back. And
> so we end up writing to the page cache when the F.S is frozen.
> --------------------
> 
> Can anyone please enlighten me on how & why this premption is _not_
> possible?
  XFS works similarly as ext4 in this regard I believe. They have the log
frozen in xfs_freeze() so if the race you describe above happens, either
the writing process gets caught waiting for log to unfreeze or it manages
to start a transaction and then freezing process waits for transaction to
finish before it can proceed with freezing. I'm not sure why is there the
check in xfs_file_aio_write()...

								Honza
Surbhi Palande May 2, 2011, 11:27 a.m. UTC | #15
On 05/02/2011 01:56 PM, Jan Kara wrote:
> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>> nothing dirty for it to write back.
>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>> path is sufficient to avoid problems.
>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>> looking into the code) and transaction start traditionally nests inside
>>>> s_umount (and basically there's no way around that since sync() calls your
>>>> fs code with s_umount held).
>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>> by definition, is a clean filesytem, and therefore sync calls of any
>>> kind should not be trying to write to the FS or start transactions.
>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>> filesystem a bug in ext3/4...
>> I had a look at the xfs code for seeing how this is done.
>> xfs_file_aio_write()
>>    xfs_wait_for_freeze()
>>      vfs_check_frozen()
>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>
>> Now, I want to know what stops the following scenario from happening:
>> --------------------
>> xfs_file_aio_write()
>>    xfs_wait_for_freeze()
>>      vfs_check_frozen()
>> At this point F.S was not frozen, so the next instruction in the
>> xfs_file_aio_write() will be executed next.
>> However at this point (i.e after checking if F.S is frozen) the
>> write process gets pre-empted and say the _freeze_ process gets
>> control.
>>
>> Now the F.S freezes and the write process gets the control back. And
>> so we end up writing to the page cache when the F.S is frozen.
>> --------------------
>>
>> Can anyone please enlighten me on how&  why this premption is _not_
>> possible?
Thanks for your reply.
>    XFS works similarly as ext4 in this regard I believe. They have the log
> frozen in xfs_freeze() so if the race you describe above happens, either
> the writing process gets caught waiting for log to unfreeze
Agreed.
>   or it manages
> to start a transaction and then freezing process waits for transaction to
> finish before it can proceed with freezing. I'm not sure why is there the
> check in xfs_file_aio_write()...
>
> 			
I am sorry, but I don't understand how this will happen - i.e I can't 
understand what stops freeze_super() (or ext4_freeze) from freezing a 
superblock (as the write process stopped just before writing anything 
for this transaction and has not taken any locks?)

Thanks!

Warm Regards,
Surbhi.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 2, 2011, 12:06 p.m. UTC | #16
On 05/02/2011 02:27 PM, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>> nothing dirty for it to write back.
>>>>>>> Sure but that's only the problem he was able to hit. But generally,
>>>>>>> there's a problem with needing s_umount for unfreezing because it
>>>>>>> isn't
>>>>>>> clear there aren't other code paths which can block with s_umount
>>>>>>> held
>>>>>>> waiting for fs to get unfrozen. And these code paths would cause
>>>>>>> the same
>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>> places that just thawing the filesystem. Any where this is done
>>>>>> should
>>>>>> be fixed, so I don't think just removing the s_umount lock from
>>>>>> the thaw
>>>>>> path is sufficient to avoid problems.
>>>>> That's easily said but hard to do - any transaction start in ext3/4
>>>>> may
>>>>> block on filesystem being frozen (this seems to be similar for XFS
>>>>> as I'm
>>>>> looking into the code) and transaction start traditionally nests
>>>>> inside
>>>>> s_umount (and basically there's no way around that since sync()
>>>>> calls your
>>>>> fs code with s_umount held).
>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>> kind should not be trying to write to the FS or start transactions.
>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>> filesystem a bug in ext3/4...
>>> I had a look at the xfs code for seeing how this is done.
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>
>>> Now, I want to know what stops the following scenario from happening:
>>> --------------------
>>> xfs_file_aio_write()
>>> xfs_wait_for_freeze()
>>> vfs_check_frozen()
>>> At this point F.S was not frozen, so the next instruction in the
>>> xfs_file_aio_write() will be executed next.
>>> However at this point (i.e after checking if F.S is frozen) the
>>> write process gets pre-empted and say the _freeze_ process gets
>>> control.
>>>
>>> Now the F.S freezes and the write process gets the control back. And
>>> so we end up writing to the page cache when the F.S is frozen.
>>> --------------------
>>>
>>> Can anyone please enlighten me on how& why this premption is _not_
>>> possible?
> Thanks for your reply.
>> XFS works similarly as ext4 in this regard I believe. They have the log
>> frozen in xfs_freeze() so if the race you describe above happens, either
>> the writing process gets caught waiting for log to unfreeze
> Agreed.
>> or it manages
>> to start a transaction and then freezing process waits for transaction to
>> finish before it can proceed with freezing. I'm not sure why is there the
>> check in xfs_file_aio_write()...
>>
>>
> I am sorry, but I don't understand how this will happen - i.e I can't
> understand what stops freeze_super() (or ext4_freeze) from freezing a
> superblock (as the write process stopped just before writing anything
> for this transaction and has not taken any locks?)

To make myself a little more coherent:

freeze_super()
   ext4_freeze()
     1) jbd2_journal_updates()
     2) jbd2_journal_flush(journal)
     3) jbd2_journal_unlock_updates(journal).
     4) return

Say now the fs write process stopped just after checking that fs is not 
frozen (i.e its thawed). So its ready to write to the page cache. Just 
when it has finished this vfs_check_frozen() and before it starts any 
write (or transactions), say the write process gets pre-empted and then 
the freeze process freezes the superblock. Wont ext4_freeze() simply 
lock the current transactions, flush them to the log and then unlock the 
transactions (so that new handles/transactions can be accepted later?)

So then after the fsfreeze finishes freezing the F.S, say if the write 
process gets the control back. The write process assumes that after its 
out of vfs_check_frozen(), the fs is thawed (or unfrozen) where as in 
this case it is not.

So I don't understand, _what_ stops the writing process from starting a 
transaction in this case when the F.S is frozen already
and what stops the fsfreeze from waiting for the write process (when it 
has not yet started the write)?

Warm Regards,
Surbhi.




>
> Thanks!
>
> Warm Regards,
> Surbhi.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 12:20 p.m. UTC | #17
On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> On 05/02/2011 01:56 PM, Jan Kara wrote:
> >On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>nothing dirty for it to write back.
> >>>>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>places that just thawing the filesystem.  Any where this is done should
> >>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>path is sufficient to avoid problems.
> >>>>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>looking into the code) and transaction start traditionally nests inside
> >>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>fs code with s_umount held).
> >>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>kind should not be trying to write to the FS or start transactions.
> >>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>filesystem a bug in ext3/4...
> >>I had a look at the xfs code for seeing how this is done.
> >>xfs_file_aio_write()
> >>   xfs_wait_for_freeze()
> >>     vfs_check_frozen()
> >>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>
> >>Now, I want to know what stops the following scenario from happening:
> >>--------------------
> >>xfs_file_aio_write()
> >>   xfs_wait_for_freeze()
> >>     vfs_check_frozen()
> >>At this point F.S was not frozen, so the next instruction in the
> >>xfs_file_aio_write() will be executed next.
> >>However at this point (i.e after checking if F.S is frozen) the
> >>write process gets pre-empted and say the _freeze_ process gets
> >>control.
> >>
> >>Now the F.S freezes and the write process gets the control back. And
> >>so we end up writing to the page cache when the F.S is frozen.
> >>--------------------
> >>
> >>Can anyone please enlighten me on how&  why this premption is _not_
> >>possible?
> Thanks for your reply.
> >   XFS works similarly as ext4 in this regard I believe. They have the log
> >frozen in xfs_freeze() so if the race you describe above happens, either
> >the writing process gets caught waiting for log to unfreeze
> Agreed.
> >  or it manages
> >to start a transaction and then freezing process waits for transaction to
> >finish before it can proceed with freezing. I'm not sure why is there the
> >check in xfs_file_aio_write()...
> >
> >			
> I am sorry, but I don't understand how this will happen - i.e I
> can't understand what stops freeze_super() (or ext4_freeze) from
> freezing a superblock (as the write process stopped just before
> writing anything for this transaction and has not taken any locks?)
  So ext4_freeze() does
jbd2_journal_lock_updates(journal)
  which waits for all running transactions to finish and updates
j_barrier_count which stops any news ones from proceeding (check
function start_this_handle()).

								Honza
Surbhi Palande May 2, 2011, 12:30 p.m. UTC | #18
On 05/02/2011 03:20 PM, Jan Kara wrote:
> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>> path is sufficient to avoid problems.
>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>> fs code with s_umount held).
>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>> kind should not be trying to write to the FS or start transactions.
>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>> filesystem a bug in ext3/4...
>>>> I had a look at the xfs code for seeing how this is done.
>>>> xfs_file_aio_write()
>>>>    xfs_wait_for_freeze()
>>>>      vfs_check_frozen()
>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>
>>>> Now, I want to know what stops the following scenario from happening:
>>>> --------------------
>>>> xfs_file_aio_write()
>>>>    xfs_wait_for_freeze()
>>>>      vfs_check_frozen()
>>>> At this point F.S was not frozen, so the next instruction in the
>>>> xfs_file_aio_write() will be executed next.
>>>> However at this point (i.e after checking if F.S is frozen) the
>>>> write process gets pre-empted and say the _freeze_ process gets
>>>> control.
>>>>
>>>> Now the F.S freezes and the write process gets the control back. And
>>>> so we end up writing to the page cache when the F.S is frozen.
>>>> --------------------
>>>>
>>>> Can anyone please enlighten me on how&   why this premption is _not_
>>>> possible?
>> Thanks for your reply.
>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>> the writing process gets caught waiting for log to unfreeze
>> Agreed.
>>>   or it manages
>>> to start a transaction and then freezing process waits for transaction to
>>> finish before it can proceed with freezing. I'm not sure why is there the
>>> check in xfs_file_aio_write()...
>>>
>>> 			
>> I am sorry, but I don't understand how this will happen - i.e I
>> can't understand what stops freeze_super() (or ext4_freeze) from
>> freezing a superblock (as the write process stopped just before
>> writing anything for this transaction and has not taken any locks?)
>    So ext4_freeze() does
> jbd2_journal_lock_updates(journal)
>    which waits for all running transactions to finish and updates
> j_barrier_count which stops any news ones from proceeding (check
> function start_this_handle()).
>
Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal) 
which decrements the  j_barrier_count (which was previously 
updated/incremented in jbd2_journal_lock_updates) ? before it returns. 
So after this call a new transaction/handle can be accepted/started.

A comment in ext4_freeze() says:
/* we rely on s_frozen to stop further updates */
(before calling jbd2_journal_unlock_updates())

Warm Regards,
Surbhi.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 1:16 p.m. UTC | #19
On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
> On 05/02/2011 03:20 PM, Jan Kara wrote:
> >On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
> >>On 05/02/2011 01:56 PM, Jan Kara wrote:
> >>>On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
> >>>>On 04/06/2011 02:21 PM, Dave Chinner wrote:
> >>>>>On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
> >>>>>>On Wed 06-04-11 15:40:05, Dave Chinner wrote:
> >>>>>>>On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
> >>>>>>>>On Fri 01-04-11 10:40:50, Dave Chinner wrote:
> >>>>>>>>>If you don't allow the page to be dirtied in the fist place, then
> >>>>>>>>>nothing needs to be done to the writeback path because there is
> >>>>>>>>>nothing dirty for it to write back.
> >>>>>>>>   Sure but that's only the problem he was able to hit. But generally,
> >>>>>>>>there's a problem with needing s_umount for unfreezing because it isn't
> >>>>>>>>clear there aren't other code paths which can block with s_umount held
> >>>>>>>>waiting for fs to get unfrozen. And these code paths would cause the same
> >>>>>>>>deadlock. That's why I chose to get rid of s_umount during thawing.
> >>>>>>>Holding the s_umount lock while checking if frozen and sleeping
> >>>>>>>is essentially an ABBA lock inversion bug that can bite in many more
> >>>>>>>places that just thawing the filesystem.  Any where this is done should
> >>>>>>>be fixed, so I don't think just removing the s_umount lock from the thaw
> >>>>>>>path is sufficient to avoid problems.
> >>>>>>   That's easily said but hard to do - any transaction start in ext3/4 may
> >>>>>>block on filesystem being frozen (this seems to be similar for XFS as I'm
> >>>>>>looking into the code) and transaction start traditionally nests inside
> >>>>>>s_umount (and basically there's no way around that since sync() calls your
> >>>>>>fs code with s_umount held).
> >>>>>Sure, but the question must be asked - why is ext3/4 even starting a
> >>>>>transaction on a clean filesystem during sync? A frozen filesystem,
> >>>>>by definition, is a clean filesytem, and therefore sync calls of any
> >>>>>kind should not be trying to write to the FS or start transactions.
> >>>>>XFS does this just fine, so I'd consider such behaviour on a frozen
> >>>>>filesystem a bug in ext3/4...
> >>>>I had a look at the xfs code for seeing how this is done.
> >>>>xfs_file_aio_write()
> >>>>   xfs_wait_for_freeze()
> >>>>     vfs_check_frozen()
> >>>>So xfs_file_aio_write() writes to buffers when the FS is not frozen.
> >>>>
> >>>>Now, I want to know what stops the following scenario from happening:
> >>>>--------------------
> >>>>xfs_file_aio_write()
> >>>>   xfs_wait_for_freeze()
> >>>>     vfs_check_frozen()
> >>>>At this point F.S was not frozen, so the next instruction in the
> >>>>xfs_file_aio_write() will be executed next.
> >>>>However at this point (i.e after checking if F.S is frozen) the
> >>>>write process gets pre-empted and say the _freeze_ process gets
> >>>>control.
> >>>>
> >>>>Now the F.S freezes and the write process gets the control back. And
> >>>>so we end up writing to the page cache when the F.S is frozen.
> >>>>--------------------
> >>>>
> >>>>Can anyone please enlighten me on how&   why this premption is _not_
> >>>>possible?
> >>Thanks for your reply.
> >>>   XFS works similarly as ext4 in this regard I believe. They have the log
> >>>frozen in xfs_freeze() so if the race you describe above happens, either
> >>>the writing process gets caught waiting for log to unfreeze
> >>Agreed.
> >>>  or it manages
> >>>to start a transaction and then freezing process waits for transaction to
> >>>finish before it can proceed with freezing. I'm not sure why is there the
> >>>check in xfs_file_aio_write()...
> >>>
> >>>			
> >>I am sorry, but I don't understand how this will happen - i.e I
> >>can't understand what stops freeze_super() (or ext4_freeze) from
> >>freezing a superblock (as the write process stopped just before
> >>writing anything for this transaction and has not taken any locks?)
> >   So ext4_freeze() does
> >jbd2_journal_lock_updates(journal)
> >   which waits for all running transactions to finish and updates
> >j_barrier_count which stops any news ones from proceeding (check
> >function start_this_handle()).
> >
> Yes, but ext4_freeze() also calls
> jbd2_journal_unlock_updates(journal) which decrements the
> j_barrier_count (which was previously updated/incremented in
> jbd2_journal_lock_updates) ? before it returns. So after this call a
> new transaction/handle can be accepted/started.
> 
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())
  Ah, drat, you're right. I've missed this other part. It's the problem
that if you expect to see something, you'll see it regardless of the real
code ;).

The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
it's still racy (although the race window is relatively small) because the
filesystem can become frozen the instant after we check vfs_check_frozen().
Commit 6b0310fb broke it for ext4.

I guess the code was mostly copied from XFS which seems to have the same
problem in xfs_trans_alloc() since the git history beginning. I see two
ways to fix this - either fix ext4/xfs to check s_frozen after starting
a transaction and if the filesystem is being frozen, we stop the
transaction, wait for fs to get unfrozen, and restart. Another option is
to create an analogous logic using a atomic counter of write ops in vfs
that could be used by all filesystems. We'd just have to replace
vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
appropriate places...

Dave, Christoph, any opinions on this?
								Honza
Christoph Hellwig May 2, 2011, 1:22 p.m. UTC | #20
On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> Dave, Christoph, any opinions on this?

The busyloop in xfs_quiesce_attr which waits for all active transactions
to finish is supposed to fix this issue.

Note that XFS traditionally expects a two stage freeze process where
we first freeze new VFS-level writes, then flush the caches and then
stop transactions, wait for them to finish and do the remainder of
the freeze process, but I really messed that process up when moving
the sequence to generic code.  Funnily enough it seems to work
neverless.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 2, 2011, 1:22 p.m. UTC | #21
On 05/02/2011 04:16 PM, Jan Kara wrote:
> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>> fs code with s_umount held).
>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>> filesystem a bug in ext3/4...
>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>> xfs_file_aio_write()
>>>>>>    xfs_wait_for_freeze()
>>>>>>      vfs_check_frozen()
>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>
>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>> --------------------
>>>>>> xfs_file_aio_write()
>>>>>>    xfs_wait_for_freeze()
>>>>>>      vfs_check_frozen()
>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>> xfs_file_aio_write() will be executed next.
>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>> control.
>>>>>>
>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>> --------------------
>>>>>>
>>>>>> Can anyone please enlighten me on how&    why this premption is _not_
>>>>>> possible?
>>>> Thanks for your reply.
>>>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>> the writing process gets caught waiting for log to unfreeze
>>>> Agreed.
>>>>>   or it manages
>>>>> to start a transaction and then freezing process waits for transaction to
>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>> check in xfs_file_aio_write()...
>>>>>
>>>>> 			
>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>> freezing a superblock (as the write process stopped just before
>>>> writing anything for this transaction and has not taken any locks?)
>>>    So ext4_freeze() does
>>> jbd2_journal_lock_updates(journal)
>>>    which waits for all running transactions to finish and updates
>>> j_barrier_count which stops any news ones from proceeding (check
>>> function start_this_handle()).
>>>
>> Yes, but ext4_freeze() also calls
>> jbd2_journal_unlock_updates(journal) which decrements the
>> j_barrier_count (which was previously updated/incremented in
>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>> new transaction/handle can be accepted/started.
>>
>> A comment in ext4_freeze() says:
>> /* we rely on s_frozen to stop further updates */
>> (before calling jbd2_journal_unlock_updates())
>    Ah, drat, you're right. I've missed this other part. It's the problem
> that if you expect to see something, you'll see it regardless of the real
> code ;).
>
> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
> it's still racy (although the race window is relatively small) because the
> filesystem can become frozen the instant after we check vfs_check_frozen().
> Commit 6b0310fb broke it for ext4.
>
> I guess the code was mostly copied from XFS which seems to have the same
> problem in xfs_trans_alloc() since the git history beginning. I see two
> ways to fix this - either fix ext4/xfs to check s_frozen after starting
> a transaction and if the filesystem is being frozen, we stop the
> transaction, wait for fs to get unfrozen, and restart. Another option is
> to create an analogous logic using a atomic counter of write ops in vfs
> that could be used by all filesystems. We'd just have to replace
> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
> appropriate places...
How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
from ext4_unfreeze()?

So that indeed no transactions can be started before unfreeze is called.

This has another advantage, that it rightfully does not let you update 
the access time when the F.S is frozen (touch_atime called from a read 
path when the F.S is frozen) Otherwise we also need to fix this path.

Warm Regards,
Surbhi.

> Dave, Christoph, any opinions on this?
> 								Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig May 2, 2011, 1:24 p.m. UTC | #22
On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> This has another advantage, that it rightfully does not let you
> update the access time when the F.S is frozen (touch_atime called
> from a read path when the F.S is frozen) Otherwise we also need to
> fix this path.

In most filesystens atime updates aren't transactional.  They just
get written into inode->i_atime, and at some later point when the
VFS tries to clean the inode it gets writtent back, either through
a transaction or not.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 2, 2011, 1:27 p.m. UTC | #23
On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
>> This has another advantage, that it rightfully does not let you
>> update the access time when the F.S is frozen (touch_atime called
>> from a read path when the F.S is frozen) Otherwise we also need to
>> fix this path.
> In most filesystens atime updates aren't transactional.  They just
> get written into inode->i_atime, and at some later point when the
> VFS tries to clean the inode it gets writtent back, either through
> a transaction or not.
>
Yes, agreed. But then when a F.S is frozen the inode should not be 
dirtied? Right? So this has to be fixed?
Also, in ext4, I think that updating atime starts a transaction.

Warm Regards,
Surbhi

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 2, 2011, 2:01 p.m. UTC | #24
On 5/2/11 7:30 AM, Surbhi Palande wrote:

...

> Yes, but ext4_freeze() also calls jbd2_journal_unlock_updates(journal) which decrements the  j_barrier_count (which was previously updated/incremented in jbd2_journal_lock_updates) ? before it returns. So after this call a new transaction/handle can be accepted/started.
> 
> A comment in ext4_freeze() says:
> /* we rely on s_frozen to stop further updates */
> (before calling jbd2_journal_unlock_updates())

that was me; 

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun May 16 02:00:00 2010 -0400

    ext4: don't return to userspace after freezing the fs with a mutex held


otherwise we return to userspace holding a mutex :(

-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 2, 2011, 2:04 p.m. UTC | #25
On 5/2/11 8:22 AM, Surbhi Palande wrote:
> On 05/02/2011 04:16 PM, Jan Kara wrote:
>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>    Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>    That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>> fs code with s_umount held).
>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>> filesystem a bug in ext3/4...
>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>> xfs_file_aio_write()
>>>>>>>    xfs_wait_for_freeze()
>>>>>>>      vfs_check_frozen()
>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>
>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>> --------------------
>>>>>>> xfs_file_aio_write()
>>>>>>>    xfs_wait_for_freeze()
>>>>>>>      vfs_check_frozen()
>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>> control.
>>>>>>>
>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>> --------------------
>>>>>>>
>>>>>>> Can anyone please enlighten me on how&    why this premption is _not_
>>>>>>> possible?
>>>>> Thanks for your reply.
>>>>>>    XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>> Agreed.
>>>>>>   or it manages
>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>> check in xfs_file_aio_write()...
>>>>>>
>>>>>>            
>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>> freezing a superblock (as the write process stopped just before
>>>>> writing anything for this transaction and has not taken any locks?)
>>>>    So ext4_freeze() does
>>>> jbd2_journal_lock_updates(journal)
>>>>    which waits for all running transactions to finish and updates
>>>> j_barrier_count which stops any news ones from proceeding (check
>>>> function start_this_handle()).
>>>>
>>> Yes, but ext4_freeze() also calls
>>> jbd2_journal_unlock_updates(journal) which decrements the
>>> j_barrier_count (which was previously updated/incremented in
>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>> new transaction/handle can be accepted/started.
>>>
>>> A comment in ext4_freeze() says:
>>> /* we rely on s_frozen to stop further updates */
>>> (before calling jbd2_journal_unlock_updates())
>>    Ah, drat, you're right. I've missed this other part. It's the problem
>> that if you expect to see something, you'll see it regardless of the real
>> code ;).
>>
>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>> it's still racy (although the race window is relatively small) because the
>> filesystem can become frozen the instant after we check vfs_check_frozen().
>> Commit 6b0310fb broke it for ext4.
>>
>> I guess the code was mostly copied from XFS which seems to have the same
>> problem in xfs_trans_alloc() since the git history beginning. I see two
>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>> a transaction and if the filesystem is being frozen, we stop the
>> transaction, wait for fs to get unfrozen, and restart. Another option is
>> to create an analogous logic using a atomic counter of write ops in vfs
>> that could be used by all filesystems. We'd just have to replace
>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>> appropriate places...
> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> from ext4_unfreeze()?

we used to have that, but holding it locked until then means we exit the kernel
with a mutex held, which is pretty icky.

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    lvcreate/1075 is leaving the kernel with locks still held!
    1 lock held by lvcreate/1075:
     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
    jbd2_journal_lock_updates+0xe1/0xf0


-Eric

> So that indeed no transactions can be started before unfreeze is called.
> 
> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
> 
> Warm Regards,
> Surbhi.
> 
>> Dave, Christoph, any opinions on this?
>>                                 Honza
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 2:20 p.m. UTC | #26
On Mon 02-05-11 09:22:04, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 03:16:19PM +0200, Jan Kara wrote:
> > Dave, Christoph, any opinions on this?
> 
> The busyloop in xfs_quiesce_attr which waits for all active transactions
> to finish is supposed to fix this issue.
  Hmm, but what prevents the following race?

  Thread 1					Thread 2
..
xfs_trans_alloc()
  xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
						freeze_super()
						  ...
						  xfs_fs_freeze()
						    ...
						    xfs_quiesce_attr()
						    ...
  _xfs_trans_alloc()
    atomic_inc(&mp->m_active_trans);
    ... goes on modifying the filesystem

  It seems to be a similar problem as in ext4 - the atomic_inc() and
vfs_check_frozen() are in the wrong order...

> Note that XFS traditionally expects a two stage freeze process where
> we first freeze new VFS-level writes, then flush the caches and then
> stop transactions, wait for them to finish and do the remainder of
> the freeze process, but I really messed that process up when moving
> the sequence to generic code.  Funnily enough it seems to work
> neverless.

								Honza
Jan Kara May 2, 2011, 2:26 p.m. UTC | #27
On Mon 02-05-11 16:27:39, Surbhi Palande wrote:
> On 05/02/2011 04:24 PM, Christoph Hellwig wrote:
> >On Mon, May 02, 2011 at 04:22:45PM +0300, Surbhi Palande wrote:
> >>This has another advantage, that it rightfully does not let you
> >>update the access time when the F.S is frozen (touch_atime called
> >>from a read path when the F.S is frozen) Otherwise we also need to
> >>fix this path.
> >In most filesystens atime updates aren't transactional.  They just
> >get written into inode->i_atime, and at some later point when the
> >VFS tries to clean the inode it gets writtent back, either through
> >a transaction or not.
> >
> Yes, agreed. But then when a F.S is frozen the inode should not be
> dirtied? Right? So this has to be fixed?
> Also, in ext4, I think that updating atime starts a transaction.
  Yes, it does. Any mark_inode_dirty() call causes a transaction update.

								Honza
Christoph Hellwig May 2, 2011, 2:41 p.m. UTC | #28
On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
>   Hmm, but what prevents the following race?
> 
>   Thread 1					Thread 2
> ..
> xfs_trans_alloc()
>   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> 						freeze_super()

                                                  sb->s_frozen = SB_FREEZE_TRANS;

> 						  ...
> 						  xfs_fs_freeze()
> 						    ...
> 						    xfs_quiesce_attr()

						      waits for all active
						      transactions

> 						    ...

   xfs_trans_alloc
     -> blocks in xfs_wait_for_freeze
     (thus doesn't get to _xfs_trans_alloc)

>   _xfs_trans_alloc()
>     atomic_inc(&mp->m_active_trans);
>     ... goes on modifying the filesystem
> 
>   It seems to be a similar problem as in ext4 - the atomic_inc() and
> vfs_check_frozen() are in the wrong order...

I can't see the problem in this scheme.  Note that we want
_xfs_trans_alloc to be able to create a transaction for
xfs_fs_log_dummy, so that we can write the dummy log record after
freezing out all other transactions, so that one is special cased
and doesn't do the xfs_wait_for_freeze.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 2, 2011, 4:23 p.m. UTC | #29
On Mon 02-05-11 10:41:55, Christoph Hellwig wrote:
> On Mon, May 02, 2011 at 04:20:55PM +0200, Jan Kara wrote:
> >   Hmm, but what prevents the following race?
> > 
> >   Thread 1					Thread 2
> > ..
> > xfs_trans_alloc()
> >   xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > 						freeze_super()
> 
>                                                   sb->s_frozen = SB_FREEZE_TRANS;
> > 						  ...
> > 						  xfs_fs_freeze()
> > 						    ...
> > 						    xfs_quiesce_attr()
> 
> 						      waits for all active
> 						      transactions
> 
> > 						    ...
> 
>    xfs_trans_alloc
>      -> blocks in xfs_wait_for_freeze
  But why should it block when xfs_wait_for_freeze() gets called before
freeze_super() gets called? The other thread calls freeze_super() just
after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
called.  Or am I missing some serialization somewhere?

>      (thus doesn't get to _xfs_trans_alloc)
> 
> >   _xfs_trans_alloc()
> >     atomic_inc(&mp->m_active_trans);
> >     ... goes on modifying the filesystem
> > 
> >   It seems to be a similar problem as in ext4 - the atomic_inc() and
> > vfs_check_frozen() are in the wrong order...
> 
> I can't see the problem in this scheme.  Note that we want
> _xfs_trans_alloc to be able to create a transaction for
> xfs_fs_log_dummy, so that we can write the dummy log record after
> freezing out all other transactions, so that one is special cased
> and doesn't do the xfs_wait_for_freeze.
  OK.

								Honza
Christoph Hellwig May 2, 2011, 4:38 p.m. UTC | #30
On Mon, May 02, 2011 at 06:23:34PM +0200, Jan Kara wrote:
>   But why should it block when xfs_wait_for_freeze() gets called before
> freeze_super() gets called? The other thread calls freeze_super() just
> after xfs_wait_for_freeze() in thread 1 and before _xfs_trans_alloc() gets
> called.  Or am I missing some serialization somewhere?

Oh, now I get the race window you mean.  It's the single instruction
window between doing the frozen check and incrementing m_active_trans.

Yes, that one looks real, although very unlikely to hit.  Could be fixed
relatively easily by moving the check after the increment.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 3, 2011, 7:27 a.m. UTC | #31
On 05/02/2011 05:04 PM, Eric Sandeen wrote:
> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>     Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>     That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>> fs code with s_umount held).
>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>> xfs_file_aio_write()
>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>       vfs_check_frozen()
>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>
>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>> --------------------
>>>>>>>> xfs_file_aio_write()
>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>       vfs_check_frozen()
>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>> control.
>>>>>>>>
>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>> --------------------
>>>>>>>>
>>>>>>>> Can anyone please enlighten me on how&     why this premption is _not_
>>>>>>>> possible?
>>>>>> Thanks for your reply.
>>>>>>>     XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>> Agreed.
>>>>>>>    or it manages
>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>> check in xfs_file_aio_write()...
>>>>>>>
>>>>>>>
>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>> freezing a superblock (as the write process stopped just before
>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>     So ext4_freeze() does
>>>>> jbd2_journal_lock_updates(journal)
>>>>>     which waits for all running transactions to finish and updates
>>>>> j_barrier_count which stops any news ones from proceeding (check
>>>>> function start_this_handle()).
>>>>>
>>>> Yes, but ext4_freeze() also calls
>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>> j_barrier_count (which was previously updated/incremented in
>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>> new transaction/handle can be accepted/started.
>>>>
>>>> A comment in ext4_freeze() says:
>>>> /* we rely on s_frozen to stop further updates */
>>>> (before calling jbd2_journal_unlock_updates())
>>>     Ah, drat, you're right. I've missed this other part. It's the problem
>>> that if you expect to see something, you'll see it regardless of the real
>>> code ;).
>>>
>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>> it's still racy (although the race window is relatively small) because the
>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>> Commit 6b0310fb broke it for ext4.
>>>
>>> I guess the code was mostly copied from XFS which seems to have the same
>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>> a transaction and if the filesystem is being frozen, we stop the
>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>> to create an analogous logic using a atomic counter of write ops in vfs
>>> that could be used by all filesystems. We'd just have to replace
>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>> appropriate places...
>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>> from ext4_unfreeze()?
> we used to have that, but holding it locked until then means we exit the kernel
> with a mutex held, which is pretty icky.
>
>      ================================================
>      [ BUG: lock held when returning to user space! ]
>      ------------------------------------------------
>      lvcreate/1075 is leaving the kernel with locks still held!
>      1 lock held by lvcreate/1075:
>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>      jbd2_journal_lock_updates+0xe1/0xf0
>
>
> -Eric
Should this not be reverted? I think that its a lot easier to stop a 
transaction between a freeze and a thaw that way! If you agree, can I 
send a patch for the same?

Thanks!

Warm Regards,
Surbhi.


>> So that indeed no transactions can be started before unfreeze is called.
>>
>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>
>> Warm Regards,
>> Surbhi.
>>
>>> Dave, Christoph, any opinions on this?
>>>                                  Honza
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 3, 2011, 8:14 p.m. UTC | #32
On 5/3/11 2:27 AM, Surbhi Palande wrote:
> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>>     Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>>     That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>>       vfs_check_frozen()
>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>
>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>> --------------------
>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>     xfs_wait_for_freeze()
>>>>>>>>>       vfs_check_frozen()
>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>> control.
>>>>>>>>>
>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>> --------------------
>>>>>>>>>
>>>>>>>>> Can anyone please enlighten me on how&     why this premption is _not_
>>>>>>>>> possible?
>>>>>>> Thanks for your reply.
>>>>>>>>     XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>> Agreed.
>>>>>>>>    or it manages
>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>
>>>>>>>>
>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>>     So ext4_freeze() does
>>>>>> jbd2_journal_lock_updates(journal)
>>>>>>     which waits for all running transactions to finish and updates
>>>>>> j_barrier_count which stops any news ones from proceeding (check
>>>>>> function start_this_handle()).
>>>>>>
>>>>> Yes, but ext4_freeze() also calls
>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>> j_barrier_count (which was previously updated/incremented in
>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>> new transaction/handle can be accepted/started.
>>>>>
>>>>> A comment in ext4_freeze() says:
>>>>> /* we rely on s_frozen to stop further updates */
>>>>> (before calling jbd2_journal_unlock_updates())
>>>>     Ah, drat, you're right. I've missed this other part. It's the problem
>>>> that if you expect to see something, you'll see it regardless of the real
>>>> code ;).
>>>>
>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>> it's still racy (although the race window is relatively small) because the
>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>> Commit 6b0310fb broke it for ext4.
>>>>
>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>> a transaction and if the filesystem is being frozen, we stop the
>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>> to create an analogous logic using a atomic counter of write ops in vfs
>>>> that could be used by all filesystems. We'd just have to replace
>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>> appropriate places...
>>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>> from ext4_unfreeze()?
>> we used to have that, but holding it locked until then means we exit the kernel
>> with a mutex held, which is pretty icky.
>>
>>      ================================================
>>      [ BUG: lock held when returning to user space! ]
>>      ------------------------------------------------
>>      lvcreate/1075 is leaving the kernel with locks still held!
>>      1 lock held by lvcreate/1075:
>>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>      jbd2_journal_lock_updates+0xe1/0xf0
>>
>>
>> -Eric
> Should this not be reverted? I think that its a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?

Only if you want the kernel to start spewing "BUG!" messages again...

-Eric

> Thanks!
> 
> Warm Regards,
> Surbhi.
> 
> 
>>> So that indeed no transactions can be started before unfreeze is called.
>>>
>>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>>
>>> Warm Regards,
>>> Surbhi.
>>>
>>>> Dave, Christoph, any opinions on this?
>>>>                                  Honza
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 4, 2011, 8:26 a.m. UTC | #33
On 05/03/2011 11:14 PM, Eric Sandeen wrote:
> On 5/3/11 2:27 AM, Surbhi Palande wrote:
>> On 05/02/2011 05:04 PM, Eric Sandeen wrote:
>>> On 5/2/11 8:22 AM, Surbhi Palande wrote:
>>>> On 05/02/2011 04:16 PM, Jan Kara wrote:
>>>>> On Mon 02-05-11 15:30:23, Surbhi Palande wrote:
>>>>>> On 05/02/2011 03:20 PM, Jan Kara wrote:
>>>>>>> On Mon 02-05-11 14:27:51, Surbhi Palande wrote:
>>>>>>>> On 05/02/2011 01:56 PM, Jan Kara wrote:
>>>>>>>>> On Mon 02-05-11 12:07:59, Surbhi Palande wrote:
>>>>>>>>>> On 04/06/2011 02:21 PM, Dave Chinner wrote:
>>>>>>>>>>> On Wed, Apr 06, 2011 at 08:18:56AM +0200, Jan Kara wrote:
>>>>>>>>>>>> On Wed 06-04-11 15:40:05, Dave Chinner wrote:
>>>>>>>>>>>>> On Fri, Apr 01, 2011 at 04:08:56PM +0200, Jan Kara wrote:
>>>>>>>>>>>>>> On Fri 01-04-11 10:40:50, Dave Chinner wrote:
>>>>>>>>>>>>>>> If you don't allow the page to be dirtied in the fist place, then
>>>>>>>>>>>>>>> nothing needs to be done to the writeback path because there is
>>>>>>>>>>>>>>> nothing dirty for it to write back.
>>>>>>>>>>>>>>      Sure but that's only the problem he was able to hit. But generally,
>>>>>>>>>>>>>> there's a problem with needing s_umount for unfreezing because it isn't
>>>>>>>>>>>>>> clear there aren't other code paths which can block with s_umount held
>>>>>>>>>>>>>> waiting for fs to get unfrozen. And these code paths would cause the same
>>>>>>>>>>>>>> deadlock. That's why I chose to get rid of s_umount during thawing.
>>>>>>>>>>>>> Holding the s_umount lock while checking if frozen and sleeping
>>>>>>>>>>>>> is essentially an ABBA lock inversion bug that can bite in many more
>>>>>>>>>>>>> places that just thawing the filesystem.  Any where this is done should
>>>>>>>>>>>>> be fixed, so I don't think just removing the s_umount lock from the thaw
>>>>>>>>>>>>> path is sufficient to avoid problems.
>>>>>>>>>>>>      That's easily said but hard to do - any transaction start in ext3/4 may
>>>>>>>>>>>> block on filesystem being frozen (this seems to be similar for XFS as I'm
>>>>>>>>>>>> looking into the code) and transaction start traditionally nests inside
>>>>>>>>>>>> s_umount (and basically there's no way around that since sync() calls your
>>>>>>>>>>>> fs code with s_umount held).
>>>>>>>>>>> Sure, but the question must be asked - why is ext3/4 even starting a
>>>>>>>>>>> transaction on a clean filesystem during sync? A frozen filesystem,
>>>>>>>>>>> by definition, is a clean filesytem, and therefore sync calls of any
>>>>>>>>>>> kind should not be trying to write to the FS or start transactions.
>>>>>>>>>>> XFS does this just fine, so I'd consider such behaviour on a frozen
>>>>>>>>>>> filesystem a bug in ext3/4...
>>>>>>>>>> I had a look at the xfs code for seeing how this is done.
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>>      xfs_wait_for_freeze()
>>>>>>>>>>        vfs_check_frozen()
>>>>>>>>>> So xfs_file_aio_write() writes to buffers when the FS is not frozen.
>>>>>>>>>>
>>>>>>>>>> Now, I want to know what stops the following scenario from happening:
>>>>>>>>>> --------------------
>>>>>>>>>> xfs_file_aio_write()
>>>>>>>>>>      xfs_wait_for_freeze()
>>>>>>>>>>        vfs_check_frozen()
>>>>>>>>>> At this point F.S was not frozen, so the next instruction in the
>>>>>>>>>> xfs_file_aio_write() will be executed next.
>>>>>>>>>> However at this point (i.e after checking if F.S is frozen) the
>>>>>>>>>> write process gets pre-empted and say the _freeze_ process gets
>>>>>>>>>> control.
>>>>>>>>>>
>>>>>>>>>> Now the F.S freezes and the write process gets the control back. And
>>>>>>>>>> so we end up writing to the page cache when the F.S is frozen.
>>>>>>>>>> --------------------
>>>>>>>>>>
>>>>>>>>>> Can anyone please enlighten me on how&      why this premption is _not_
>>>>>>>>>> possible?
>>>>>>>> Thanks for your reply.
>>>>>>>>>      XFS works similarly as ext4 in this regard I believe. They have the log
>>>>>>>>> frozen in xfs_freeze() so if the race you describe above happens, either
>>>>>>>>> the writing process gets caught waiting for log to unfreeze
>>>>>>>> Agreed.
>>>>>>>>>     or it manages
>>>>>>>>> to start a transaction and then freezing process waits for transaction to
>>>>>>>>> finish before it can proceed with freezing. I'm not sure why is there the
>>>>>>>>> check in xfs_file_aio_write()...
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I am sorry, but I don't understand how this will happen - i.e I
>>>>>>>> can't understand what stops freeze_super() (or ext4_freeze) from
>>>>>>>> freezing a superblock (as the write process stopped just before
>>>>>>>> writing anything for this transaction and has not taken any locks?)
>>>>>>>      So ext4_freeze() does
>>>>>>> jbd2_journal_lock_updates(journal)
>>>>>>>      which waits for all running transactions to finish and updates
>>>>>>> j_barrier_count which stops any news ones from proceeding (check
>>>>>>> function start_this_handle()).
>>>>>>>
>>>>>> Yes, but ext4_freeze() also calls
>>>>>> jbd2_journal_unlock_updates(journal) which decrements the
>>>>>> j_barrier_count (which was previously updated/incremented in
>>>>>> jbd2_journal_lock_updates) ? before it returns. So after this call a
>>>>>> new transaction/handle can be accepted/started.
>>>>>>
>>>>>> A comment in ext4_freeze() says:
>>>>>> /* we rely on s_frozen to stop further updates */
>>>>>> (before calling jbd2_journal_unlock_updates())
>>>>>      Ah, drat, you're right. I've missed this other part. It's the problem
>>>>> that if you expect to see something, you'll see it regardless of the real
>>>>> code ;).
>>>>>
>>>>> The fact is we do vfs_check_frozen() in ext4_journal_start_sb() but indeed
>>>>> it's still racy (although the race window is relatively small) because the
>>>>> filesystem can become frozen the instant after we check vfs_check_frozen().
>>>>> Commit 6b0310fb broke it for ext4.
>>>>>
>>>>> I guess the code was mostly copied from XFS which seems to have the same
>>>>> problem in xfs_trans_alloc() since the git history beginning. I see two
>>>>> ways to fix this - either fix ext4/xfs to check s_frozen after starting
>>>>> a transaction and if the filesystem is being frozen, we stop the
>>>>> transaction, wait for fs to get unfrozen, and restart. Another option is
>>>>> to create an analogous logic using a atomic counter of write ops in vfs
>>>>> that could be used by all filesystems. We'd just have to replace
>>>>> vfs_check_frozen() with vfs_start_write() and add vfs_stop_write() at
>>>>> appropriate places...
>>>> How about calling  jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>>>> from ext4_unfreeze()?
>>> we used to have that, but holding it locked until then means we exit the kernel
>>> with a mutex held, which is pretty icky.
>>>
>>>       ================================================
>>>       [ BUG: lock held when returning to user space! ]
>>>       ------------------------------------------------
>>>       lvcreate/1075 is leaving the kernel with locks still held!
>>>       1 lock held by lvcreate/1075:
>>>        #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>>       jbd2_journal_lock_updates+0xe1/0xf0
>>>
>>>
>>> -Eric
>> Should this not be reverted? I think that its a lot easier to stop a transaction between a freeze and a thaw that way! If you agree, can I send a patch for the same?
> Only if you want the kernel to start spewing "BUG!" messages again...
>
> -Eric
But, then you need a much more complicated way to stop accepting the 
transactions and the writes between the freeze and the thaw? (in the 
write path and the read path)? Is this not much simpler?

Warm Regards,
Surbhi.








>> Thanks!
>>
>> Warm Regards,
>> Surbhi.
>>
>>
>>>> So that indeed no transactions can be started before unfreeze is called.
>>>>
>>>> This has another advantage, that it rightfully does not let you update the access time when the F.S is frozen (touch_atime called from a read path when the F.S is frozen) Otherwise we also need to fix this path.
>>>>
>>>> Warm Regards,
>>>> Surbhi.
>>>>
>>>>> Dave, Christoph, any opinions on this?
>>>>>                                   Honza
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen May 4, 2011, 2:30 p.m. UTC | #34
On 5/4/11 3:26 AM, Surbhi Palande wrote:
> On 05/03/2011 11:14 PM, Eric Sandeen wrote:
>> On 5/3/11 2:27 AM, Surbhi Palande wrote:

...

>>> Should this not be reverted? I think that its a lot easier to
>>> stop a transaction between a freeze and a thaw that way! If you
>>> agree, can I send a patch for the same?
>> Only if you want the kernel to start spewing "BUG!" messages
>> again...
>> 
>> -Eric
> But, then you need a much more complicated way to stop accepting the
> transactions and the writes between the freeze and the thaw? (in the
> write path and the read path)? Is this not much simpler?

I just cannot see how a solution which leads to:

>>      ================================================
>>      [ BUG: lock held when returning to user space! ]
>>      ------------------------------------------------
>>      lvcreate/1075 is leaving the kernel with locks still held!
>>      1 lock held by lvcreate/1075:
>>       #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>      jbd2_journal_lock_updates+0xe1/0xf0


can be considered viable.

You are welcome to send the patch, and if other ext4 devs concur with it then I'll be outvoted. :)

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b5ed541..2a60148 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -477,7 +477,7 @@  static bool pin_sb_for_writeback(struct super_block *sb)
 	spin_unlock(&sb_lock);
 
 	if (down_read_trylock(&sb->s_umount)) {
-		if (sb->s_root)
+		if (sb->s_frozen == 0 && sb->s_root)
 			return true;
 		up_read(&sb->s_umount);
 	}
diff --git a/fs/super.c b/fs/super.c
index 8a06881..bac28c4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -432,8 +432,13 @@  void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
 			continue;
 		sb->s_count++;
 		spin_unlock(&sb_lock);
-
+retry:
 		down_read(&sb->s_umount);
+		if (sb->s_frozen > 0) {
+			up_read(&sb->s_umount);
+			cond_resched();
+			goto retry;
+		}
 		if (sb->s_root)
 			f(sb, arg);
 		up_read(&sb->s_umount);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 31f6988..eb19642 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1058,7 +1058,9 @@  EXPORT_SYMBOL(generic_writepages);
 int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
 	int ret;
+	struct super_block *sb = mapping->host->i_sb;
 
+	vfs_check_frozen(sb, SB_FREEZE_TRANS);
 	if (wbc->nr_to_write <= 0)
 		return 0;
 	if (mapping->a_ops->writepages)