diff mbox

[RFC] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Message ID 20110330141205.GC22349@quack.suse.cz
State New, archived
Headers show

Commit Message

Jan Kara March 30, 2011, 2:12 p.m. UTC
Hello,

On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> On Thu, 17 Feb 2011 11:45:52 +0100
> Jan Kara <jack@suse.cz> wrote:
> > On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> > > (2011/02/16 23:56), Jan Kara wrote:
> > > >On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> > > >>On Tue, 15 Feb 2011 18:29:54 +0100
> > > >>Jan Kara<jack@suse.cz>  wrote:
> > > >>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> > > >>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> > > >>>>>Thanks for detailed analysis. Indeed this is a bug. Whenever we do IO
> > > >>>>>under s_umount semaphore, we are prone to deadlock like the one you
> > > >>>>>describe above.
> > > >>>>
> > > >>>>One of the fundamental problems here is that the freeze and thaw
> > > >>>>routines are using down_write(&sb->s_umount) for two purposes.  The
> > > >>>>first is to prevent the resume/thaw from racing with a umount (which
> > > >>>>it could do just as well by taking a read lock), but the second is to
> > > >>>>prevent the resume/thaw code from racing with itself.  That's the core
> > > >>>>fundamental problem here.
> > > >>>>
> > > >>>>So I think we can solve this by introduce a new mutex, s_freeze, and
> > > >>>>having the the resume/thaw first take the s_freeze mutex and then
> > > >>>>second take a read lock on the s_umount.
> > > >>>   Sadly this does not quite work because even down_read(&sb->s_umount)
> > > >>>in thaw_super() can block if there is another process that tries to acquire
> > > >>>s_umount for writing - a situation like:
> > > >>>   TASK 1 (e.g. flusher)		TASK 2	(e.g. remount)		TASK 3 (unfreeze)
> > > >>>down_read(&sb->s_umount)
> > > >>>   block on s_frozen
> > > >>>				down_write(&sb->s_umount)
> > > >>>				  -blocked
> > > >>>								down_read(&sb->s_umount)
> > > >>>								  -blocked
> > > >>>behind the write access...
> > > >>>
> > > >>>The only working solution I see is to check for frozen filesystem before
> > > >>>taking s_umount semaphore which seems rather ugly (but might be bearable if
> > > >>>we did so in some well described wrapper).
> > > >>I created the patch that you imagine yesterday.
> > > >>
> > > >>I got a reproducer from Mizuma-san yesterday, and then I executed it on the kernel
> > > >>without a fixed patch. After an hour, I confirmed that this deadlock happened.
> > > >>
> > > >>However, on the kernel with a fixed patch, this deadlock doesn't still happen
> > > >>after 12 hours passed.
> > > >>
> > > >>The patch for linux-2.6.38-rc4 is as follows:
> > > >>---
> > > >>  fs/fs-writeback.c |    2 +-
> > > >>  1 files changed, 1 insertions(+), 1 deletions(-)
> > > >>
> > > >>diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > >>index 59c6e49..1c9a05e 100644
> > > >>--- a/fs/fs-writeback.c
> > > >>+++ b/fs/fs-writeback.c
> > > >>@@ -456,7 +456,7 @@ static bool pin_sb_for_writeback(struct super_block *sb)
> > > >>         spin_unlock(&sb_lock);
> > > >>
> > > >>         if (down_read_trylock(&sb->s_umount)) {
> > > >>-               if (sb->s_root)
> > > >>+               if (sb->s_frozen == SB_UNFROZEN&&  sb->s_root)
> > > >>                         return true;
> > > >>                 up_read(&sb->s_umount);
> > > 
> > > >   So this is something along the lines I thought but it actually won't work
> > > >for example if sync(1) is run while the filesystem is frozen (that takes
> > > >s_umount semaphore in a different place). And generally, I'm not convinced
> > > >there are not other places that try to do IO while holding s_umount
> > > >semaphore...
> > > OK. I understand.
> > > 
> > > This code only fixes the case for the following path:
> > > writeback_inodes_wb
> > > -> ext4_da_writepages
> > >    -> ext4_journal_start_sb
> > >       -> vfs_check_frozen
> > > But, the code doesn't fix the other cases.
> > > 
> > > We must modify the local filesystem part in order to fix all cases...?
> >   Yes, possibly. But most importantly we should first find clear locking
> > rules for frozen filesystem that avoid deadlocks like the one above. And
> > the freezing / unfreezing code might become subtle for that reason, that's
> > fine, but it would be really good to avoid any complicated things for the
> > code in the rest of the VFS / filesystems.
> I have deeply continued to examined the root cause of this problem, then 
> I found it.
> 
> It is that we can write a memory which is mmaped to a file. Then the memory 
> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> "writeback" the memory. 
> 
> Therefore, the root cause of this hangup is not only ext4 component (with
> delayed allocation feature) but also writeback mechanism for mmap. If you 
> use the other filesystem, you can write something to the filesystem though 
> you have freezed the filesystem.
  Well, you can write something only in the caches, not to the on disk
image. So it's not a problem as such.

> A sample problem is attached on this mail.  Try to execute it then you can 
> confirm that we can write some data to your filesystem while freezing the 
> filesystem.
> (If you change FS variable in go.sh from ext3 to ext4 and you execute
> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> 
> I think the best approach to fix this problem is to let users not to write
> memory which is mapped to a certain file while the filesystem is freezing. 
> However, it is very difficult to control users not to write memory which has 
> been already mapped to the file.
  It is actually possible. In case of ext4, you could add a check (+ wait)
in ext4_page_mkwrite() whether the filesystem is frozen or in the process
of being frozen and if so, wait for it to get unfrozen. The only tough
problem here might be the locking as ext4_page_mkwrite() is called with
mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
But you'd have to fix all filesystems (and all paths possibly creating
dirty data) in this way.
 
> Therefore, I think there is only actual method that we stop writeback thread 
> to resolve the mmap problem. Also, by this fix, the original problem 
> (ext4 delayed write vs unfreeze) can be solved.
  Hmm, I had a look at the code again and think we could fix the issue
cleanly (i.e. all possible users of s_umount) as follows: The lock
ordering will be
  s_umount -> "fs frozen"
and there will be a new mutex s_freeze_mutex protecting changes of
s_frozen.

freeze_bdev() already observes this lock ordering, it will only take
s_freeze_mutex for the changes of s_frozen values. The only other code
that is relevant for the lock ordering is thaw_super() (the freezing
process is not expected to reenter kernel for the frozen filesystem).
In thaw_super() we could take s_freeze_mutex, do all the thawing work,
set s_frozen, release s_freeze_mutex and put superblock reference.

So something like the patch below - it seems to work for me, can you test
it please?

From 0939f4c2fd5d69d7d1bf7ece9a641bb561e9d0dd Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 30 Mar 2011 15:21:44 +0200
Subject: [PATCH] vfs: Fix deadlocks on frozen filesystem

When a filesystem is frozen and the flusher thread decides to do writeback
for the frozen filesystem (e.g. because pages were marked dirty by mmaped
write) we deadlock because we take s_umount semaphore and then try to write
dirty pages which blocks. In this situation there is no way to unfreeze
the filesystem because thawing code requires s_umount semaphore.

Fix the problem removing the need to take s_umount from thawing code. Instead
we introduce new s_freeze_mutex to provide necessary exclusion.

Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
 include/linux/fs.h |    1 +
 2 files changed, 35 insertions(+), 6 deletions(-)

Comments

Yongqiang Yang March 31, 2011, 8:37 a.m. UTC | #1
Hi everyone,

Amir met a deadlock when he tested ext4 with snapshot.  The deadlock
was reported on
https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
 It is difficult to reproduce the deadlock.  There is a deadlock
reported on http://www.spinics.net/lists/linux-ext4/msg23018.html.
Actually, these two deadlocks come from a same source.

Below are my analysis on the 1st one.  Mail is not a good place to
describe parallel processes.  I have submitted the analysis to
https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143.
 It is much more readable.

-- deadlock in ext4 with snapshot
       ext4 with snapshot calls freeze_super() to bring
       a fs be in a clean state when a user takes a snapshot.

     freeze                  truncate              kjournald

                    |  ext4_ext_truncate     |
    freeze_super()  |   starts a handle      |
    sets s_frozen   |                        |
                    |  ext4_ext_truncate     |
                    |  holds i_data_sem      |
  ext4_freeze()     |                        |   commit_transaction()
   wait for updates |                        |   waits for i_data_sem
                    |  ext4_free_blocks      |
                    |  calls dquot_free_block|
                    |                        |
                    |  dquot_free_block call |
                    |  ext4_dirty_inode      |
                    |                        |
                    |  ext4_dirty_inode      |
                    |  trys to start a handle|
                    |                        |
                    |  block due to s_frozen |

in ext3, ext3_freeze() prevents journal from being updated by
lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
unlock_journal_updates().

in ext4, however, before ext4_freeze() returns, it unlock journal, and
ext4 prevents journal from being updated by s_frozen. s_frozen is in
an upper layer, so it is out control of ext4 and deadlock is easy to
happen.

Could someone explain why ext4 does like above but not follow ext3?

Yongqiang.
Yongqiang Yang March 31, 2011, 8:48 a.m. UTC | #2
On Thu, Mar 31, 2011 at 4:37 PM, Yongqiang Yang <xiaoqiangnk@gmail.com> wrote:
> Hi everyone,
>
> Amir met a deadlock when he tested ext4 with snapshot.  The deadlock
> was reported on
> https://github.com/amir73il/ext4-snapshots/commit/56396185d922a73524a091b545e665543abf741a.
>  It is difficult to reproduce the deadlock.  There is a deadlock
> reported on http://www.spinics.net/lists/linux-ext4/msg23018.html.
> Actually, these two deadlocks come from a same source.
>
> Below are my analysis on the 1st one.  Mail is not a good place to
> describe parallel processes.  I have submitted the analysis to
> https://github.com/YANGYongqiang/ext4-snapshots/blob/9e0ae9ae9907125e6bf45aa91db296d4cc041b17/fs/ext4/BUGS#L143.
>  It is much more readable.
>
> -- deadlock in ext4 with snapshot
>       ext4 with snapshot calls freeze_super() to bring
>       a fs be in a clean state when a user takes a snapshot.
>
>     freeze                  truncate              kjournald
>
>                    |  ext4_ext_truncate     |
>    freeze_super()  |   starts a handle      |
>    sets s_frozen   |                        |
>                    |  ext4_ext_truncate     |
>                    |  holds i_data_sem      |
>  ext4_freeze()     |                        |   commit_transaction()
>   wait for updates |                        |   waits for i_data_sem
>                    |  ext4_free_blocks      |
>                    |  calls dquot_free_block|
>                    |                        |
>                    |  dquot_free_block call |
>                    |  ext4_dirty_inode      |
>                    |                        |
>                    |  ext4_dirty_inode      |
>                    |  trys to start a handle|
>                    |                        |
>                    |  block due to s_frozen |
>
> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
>
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.

Virtually,  it is not right to block ext4_journal_start_sb() before we
confirm that current thread has no a active handle. But ext4 does like
that. Deadlock is thus easy to happen.   Right?

>
> Could someone explain why ext4 does like above but not follow ext3?
>
> Yongqiang.
> --
> Best Wishes
> Yongqiang Yang
>
Toshiyuki Okajima March 31, 2011, 12:03 p.m. UTC | #3
Hi, thanks for your reviewing.

(2011/03/30 23:12), Jan Kara wrote:
>    Hello,
>
> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>> On Thu, 17 Feb 2011 11:45:52 +0100
>> Jan Kara<jack@suse.cz>  wrote:
>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
<SNIP>
>> I have deeply continued to examined the root cause of this problem, then
>> I found it.
>>
>> It is that we can write a memory which is mmaped to a file. Then the memory
>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>> "writeback" the memory.
>>
>> Therefore, the root cause of this hangup is not only ext4 component (with
>> delayed allocation feature) but also writeback mechanism for mmap. If you
>> use the other filesystem, you can write something to the filesystem though
>> you have freezed the filesystem.

>    Well, you can write something only in the caches, not to the on disk
> image. So it's not a problem as such.
My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
we can write in not only the caches but also the loopback device. However,
I don't still confirm that we can write to the real device(/dev/sdaX).

>
>> A sample problem is attached on this mail.  Try to execute it then you can
>> confirm that we can write some data to your filesystem while freezing the
>> filesystem.
>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>
>> I think the best approach to fix this problem is to let users not to write
>> memory which is mapped to a certain file while the filesystem is freezing.
>> However, it is very difficult to control users not to write memory which has
>> been already mapped to the file.
>    It is actually possible. In case of ext4, you could add a check (+ wait)
> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> of being frozen and if so, wait for it to get unfrozen. The only tough
> problem here might be the locking as ext4_page_mkwrite() is called with
> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> But you'd have to fix all filesystems (and all paths possibly creating
> dirty data) in this way.
>

>> Therefore, I think there is only actual method that we stop writeback thread
>> to resolve the mmap problem. Also, by this fix, the original problem
>> (ext4 delayed write vs unfreeze) can be solved.
>    Hmm, I had a look at the code again and think we could fix the issue
> cleanly (i.e. all possible users of s_umount) as follows: The lock
> ordering will be
>    s_umount ->  "fs frozen"
> and there will be a new mutex s_freeze_mutex protecting changes of
> s_frozen.
>
> freeze_bdev() already observes this lock ordering, it will only take
> s_freeze_mutex for the changes of s_frozen values. The only other code
> that is relevant for the lock ordering is thaw_super() (the freezing
> process is not expected to reenter kernel for the frozen filesystem).
> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> set s_frozen, release s_freeze_mutex and put superblock reference.
>

> So something like the patch below - it seems to work for me, can you test
> it please?
I think your patch looks good, so, the original problem seems to be solved.
OK, I will test your patch.
This weekend I cannot test it. So, I will reply next week.

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen March 31, 2011, 2:04 p.m. UTC | #4
On 3/31/11 3:37 AM, Yongqiang Yang wrote:

> in ext3, ext3_freeze() prevents journal from being updated by
> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> unlock_journal_updates().
> 
> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> an upper layer, so it is out control of ext4 and deadlock is easy to
> happen.
> 
> Could someone explain why ext4 does like above but not follow ext3?
> 
> Yongqiang.

That was me, I think ...

commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
Author: Eric Sandeen <sandeen@redhat.com>
Date:   Sun May 16 02:00:00 2010 -0400

    ext4: don't return to userspace after freezing the fs with a mutex held

    ext4_freeze() used jbd2_journal_lock_updates() which takes
    the j_barrier mutex, and then returns to userspace.  The
    kernel does not like this:

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    lvcreate/1075 is leaving the kernel with locks still held!
    1 lock held by lvcreate/1075:
     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
    jbd2_journal_lock_updates+0xe1/0xf0

    Use vfs_check_frozen() added to ext4_journal_start_sb() and
    ext4_force_commit() instead.

    Addresses-Red-Hat-Bugzilla: #568503


    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yongqiang Yang March 31, 2011, 2:36 p.m. UTC | #5
On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>
>> in ext3, ext3_freeze() prevents journal from being updated by
>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>> unlock_journal_updates().
>>
>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>> an upper layer, so it is out control of ext4 and deadlock is easy to
>> happen.
>>
>> Could someone explain why ext4 does like above but not follow ext3?
>>
>> Yongqiang.
>
> That was me, I think ...

Thank you, Eric.

I think ext4_journal_start() should check if current thread has an
active handle before  vfs_check_frozen(), if so, current handle will
be returned. Thus, we can avoid deadlocks.

Do you agree with me?  If I am right, I will send a patch.
>
> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
> Author: Eric Sandeen <sandeen@redhat.com>
> Date:   Sun May 16 02:00:00 2010 -0400
>
>    ext4: don't return to userspace after freezing the fs with a mutex held
>
>    ext4_freeze() used jbd2_journal_lock_updates() which takes
>    the j_barrier mutex, and then returns to userspace.  The
>    kernel does not like this:
>
>    ================================================
>    [ BUG: lock held when returning to user space! ]
>    ------------------------------------------------
>    lvcreate/1075 is leaving the kernel with locks still held!
>    1 lock held by lvcreate/1075:
>     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>    jbd2_journal_lock_updates+0xe1/0xf0
>
>    Use vfs_check_frozen() added to ext4_journal_start_sb() and
>    ext4_force_commit() instead.
>
>    Addresses-Red-Hat-Bugzilla: #568503
>
>
>    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
Eric Sandeen March 31, 2011, 3:25 p.m. UTC | #6
On 3/31/11 9:36 AM, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
>> On 3/31/11 3:37 AM, Yongqiang Yang wrote:
>>
>>> in ext3, ext3_freeze() prevents journal from being updated by
>>> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
>>> unlock_journal_updates().
>>>
>>> in ext4, however, before ext4_freeze() returns, it unlock journal, and
>>> ext4 prevents journal from being updated by s_frozen. s_frozen is in
>>> an upper layer, so it is out control of ext4 and deadlock is easy to
>>> happen.
>>>
>>> Could someone explain why ext4 does like above but not follow ext3?
>>>
>>> Yongqiang.
>>
>> That was me, I think ...
> 
> Thank you, Eric.
> 
> I think ext4_journal_start() should check if current thread has an
> active handle before  vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
> 
> Do you agree with me?  If I am right, I will send a patch.

If you have a testcase to test it with, sure.  plus a patch would help me know for sure what you propose :)

Sorry for breaking it (if I did!)  But holding a mutex and returning to userspace was pretty bad, too :(

Thanks,
-Eric

>>
>> commit 6b0310fbf087ad6e9e3b8392adca97cd77184084
>> Author: Eric Sandeen <sandeen@redhat.com>
>> Date:   Sun May 16 02:00:00 2010 -0400
>>
>>    ext4: don't return to userspace after freezing the fs with a mutex held
>>
>>    ext4_freeze() used jbd2_journal_lock_updates() which takes
>>    the j_barrier mutex, and then returns to userspace.  The
>>    kernel does not like this:
>>
>>    ================================================
>>    [ BUG: lock held when returning to user space! ]
>>    ------------------------------------------------
>>    lvcreate/1075 is leaving the kernel with locks still held!
>>    1 lock held by lvcreate/1075:
>>     #0:  (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
>>    jbd2_journal_lock_updates+0xe1/0xf0
>>
>>    Use vfs_check_frozen() added to ext4_journal_start_sb() and
>>    ext4_force_commit() instead.
>>
>>    Addresses-Red-Hat-Bugzilla: #568503
>>
>>
>>    Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>>
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara March 31, 2011, 4:28 p.m. UTC | #7
On Thu 31-03-11 22:36:46, Yongqiang Yang wrote:
> On Thu, Mar 31, 2011 at 10:04 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> > On 3/31/11 3:37 AM, Yongqiang Yang wrote:
> >
> >> in ext3, ext3_freeze() prevents journal from being updated by
> >> lock_journal_updates(), ext3_unfreeze() allow journal to be updated by
> >> unlock_journal_updates().
> >>
> >> in ext4, however, before ext4_freeze() returns, it unlock journal, and
> >> ext4 prevents journal from being updated by s_frozen. s_frozen is in
> >> an upper layer, so it is out control of ext4 and deadlock is easy to
> >> happen.
> >>
> >> Could someone explain why ext4 does like above but not follow ext3?
> >>
> >> Yongqiang.
> >
> > That was me, I think ...
> 
> Thank you, Eric.
> 
> I think ext4_journal_start() should check if current thread has an
> active handle before  vfs_check_frozen(), if so, current handle will
> be returned. Thus, we can avoid deadlocks.
> 
> Do you agree with me?  If I am right, I will send a patch.
  Yes, definitely. This was exactly what I wanted to propose as well.
Thanks for looking into this.

								Honza
Toshiyuki Okajima April 5, 2011, 10:25 a.m. UTC | #8
Hi.

(2011/03/31 21:03), Toshiyuki Okajima wrote:
> Hi, thanks for your reviewing.
>
> (2011/03/30 23:12), Jan Kara wrote:
>> Hello,
>>
>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>> Jan Kara<jack@suse.cz> wrote:
>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>> Jan Kara<jack@suse.cz> wrote:
>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> <SNIP>
>>> I have deeply continued to examined the root cause of this problem, then
>>> I found it.
>>>
>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>> "writeback" the memory.
>>>
>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>> use the other filesystem, you can write something to the filesystem though
>>> you have freezed the filesystem.
>
>> Well, you can write something only in the caches, not to the on disk
>> image. So it's not a problem as such.
> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> we can write in not only the caches but also the loopback device. However,
> I don't still confirm that we can write to the real device(/dev/sdaX).
>
>>
>>> A sample problem is attached on this mail. Try to execute it then you can
>>> confirm that we can write some data to your filesystem while freezing the
>>> filesystem.
>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>
>>> I think the best approach to fix this problem is to let users not to write
>>> memory which is mapped to a certain file while the filesystem is freezing.
>>> However, it is very difficult to control users not to write memory which has
>>> been already mapped to the file.
>> It is actually possible. In case of ext4, you could add a check (+ wait)
>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>> of being frozen and if so, wait for it to get unfrozen. The only tough
>> problem here might be the locking as ext4_page_mkwrite() is called with
>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>> But you'd have to fix all filesystems (and all paths possibly creating
>> dirty data) in this way.
>>
>
>>> Therefore, I think there is only actual method that we stop writeback thread
>>> to resolve the mmap problem. Also, by this fix, the original problem
>>> (ext4 delayed write vs unfreeze) can be solved.
>> Hmm, I had a look at the code again and think we could fix the issue
>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>> ordering will be
>> s_umount -> "fs frozen"
>> and there will be a new mutex s_freeze_mutex protecting changes of
>> s_frozen.
>>
>> freeze_bdev() already observes this lock ordering, it will only take
>> s_freeze_mutex for the changes of s_frozen values. The only other code
>> that is relevant for the lock ordering is thaw_super() (the freezing
>> process is not expected to reenter kernel for the frozen filesystem).
>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>
>
>> So something like the patch below - it seems to work for me, can you test
>> it please?
> I think your patch looks good, so, the original problem seems to be solved.
> OK, I will test your patch.
> This weekend I cannot test it. So, I will reply next week.
I have tested whether Mizuma-san's reproducer can cause to deadlock with your
patch. And then any problems didn't hit while the reproducer was running.

I think your patch solves the original deadlock problem which is reported by
Mizuma-san.

> Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>  include/linux/fs.h |    1 +
>  2 files changed, 35 insertions(+), 6 deletions(-)

However, I think a write which causes the deadlock is from mmapped dirty
pages. So, I guess we also need to fix in the mmap path while fsfreezing.

> I floated
>
> [PATCH, RFC] check for frozen filesystems in the mmap path
>
> a long time ago, but it went nowhere; maybe time to revive that approach.

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 5, 2011, 10:54 p.m. UTC | #9
On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> (2011/03/31 21:03), Toshiyuki Okajima wrote:
> >Hi, thanks for your reviewing.
> >
> >(2011/03/30 23:12), Jan Kara wrote:
> >>Hello,
> >>
> >>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>Jan Kara<jack@suse.cz> wrote:
> >>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>Jan Kara<jack@suse.cz> wrote:
> >>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> ><SNIP>
> >>>I have deeply continued to examined the root cause of this problem, then
> >>>I found it.
> >>>
> >>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>"writeback" the memory.
> >>>
> >>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>use the other filesystem, you can write something to the filesystem though
> >>>you have freezed the filesystem.
> >
> >>Well, you can write something only in the caches, not to the on disk
> >>image. So it's not a problem as such.
> >My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >we can write in not only the caches but also the loopback device. However,
> >I don't still confirm that we can write to the real device(/dev/sdaX).
> >
> >>
> >>>A sample problem is attached on this mail. Try to execute it then you can
> >>>confirm that we can write some data to your filesystem while freezing the
> >>>filesystem.
> >>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>
> >>>I think the best approach to fix this problem is to let users not to write
> >>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>However, it is very difficult to control users not to write memory which has
> >>>been already mapped to the file.
> >>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>problem here might be the locking as ext4_page_mkwrite() is called with
> >>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>But you'd have to fix all filesystems (and all paths possibly creating
> >>dirty data) in this way.
> >>
> >
> >>>Therefore, I think there is only actual method that we stop writeback thread
> >>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>(ext4 delayed write vs unfreeze) can be solved.
> >>Hmm, I had a look at the code again and think we could fix the issue
> >>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>ordering will be
> >>s_umount -> "fs frozen"
> >>and there will be a new mutex s_freeze_mutex protecting changes of
> >>s_frozen.
> >>
> >>freeze_bdev() already observes this lock ordering, it will only take
> >>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>that is relevant for the lock ordering is thaw_super() (the freezing
> >>process is not expected to reenter kernel for the frozen filesystem).
> >>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>
> >
> >>So something like the patch below - it seems to work for me, can you test
> >>it please?
> >I think your patch looks good, so, the original problem seems to be solved.
> >OK, I will test your patch.
> >This weekend I cannot test it. So, I will reply next week.
> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> patch. And then any problems didn't hit while the reproducer was running.
>
> I think your patch solves the original deadlock problem which is reported by
> Mizuma-san.
  Good. Thanks.

> >Reported-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
> >Signed-off-by: Jan Kara <jack@suse.cz>
> >---
> > fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> > include/linux/fs.h |    1 +
> > 2 files changed, 35 insertions(+), 6 deletions(-)
> 
> However, I think a write which causes the deadlock is from mmapped dirty
> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
  Why? If you dirty a page, writeback thread can come and try to write it -
which blocks - but now that does not matter...

								Honza
Toshiyuki Okajima April 6, 2011, 5:09 a.m. UTC | #10
Hi.

(2011/04/06 7:54), Jan Kara wrote:
> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>> Hi, thanks for your reviewing.
>>>
>>> (2011/03/30 23:12), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>> Jan Kara<jack@suse.cz>  wrote:
>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>> <SNIP>
>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>> I found it.
>>>>>
>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>> "writeback" the memory.
>>>>>
>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>> use the other filesystem, you can write something to the filesystem though
>>>>> you have freezed the filesystem.
>>>
>>>> Well, you can write something only in the caches, not to the on disk
>>>> image. So it's not a problem as such.
>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>> we can write in not only the caches but also the loopback device. However,
>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>
>>>>
>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>> filesystem.
>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>
>>>>> I think the best approach to fix this problem is to let users not to write
>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>> However, it is very difficult to control users not to write memory which has
>>>>> been already mapped to the file.
>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>> dirty data) in this way.
>>>>
>>>
>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>> ordering will be
>>>> s_umount ->  "fs frozen"
>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>> s_frozen.
>>>>
>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>
>>>
>>>> So something like the patch below - it seems to work for me, can you test
>>>> it please?
>>> I think your patch looks good, so, the original problem seems to be solved.
>>> OK, I will test your patch.
>>> This weekend I cannot test it. So, I will reply next week.
>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>> patch. And then any problems didn't hit while the reproducer was running.
>>
>> I think your patch solves the original deadlock problem which is reported by
>> Mizuma-san.
>    Good. Thanks.
>
>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>> ---
>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>> include/linux/fs.h |    1 +
>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>

>> However, I think a write which causes the deadlock is from mmapped dirty
>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>    Why? If you dirty a page, writeback thread can come and try to write it -
> which blocks - but now that does not matter...
I have not understood the code around writeback thread very much...
Please explain me the concrete function name which blocks some writes?

Mizuma-san's reproducer also writes the data which maps to the file (mmap).
The original problem happens after the fsfreeze operation is done.
I understand the normal write operation (not mmap) can be blocked while
fsfreezing. So, I guess we don't always block all the write operation
while fsfreezing.

Thanks
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 6, 2011, 5:57 a.m. UTC | #11
On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> (2011/04/06 7:54), Jan Kara wrote:
> >On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>Hi, thanks for your reviewing.
> >>>
> >>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>Jan Kara<jack@suse.cz>  wrote:
> >>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>Jan Kara<jack@suse.cz>  wrote:
> >>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>><SNIP>
> >>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>I found it.
> >>>>>
> >>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>"writeback" the memory.
> >>>>>
> >>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>you have freezed the filesystem.
> >>>
> >>>>Well, you can write something only in the caches, not to the on disk
> >>>>image. So it's not a problem as such.
> >>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>we can write in not only the caches but also the loopback device. However,
> >>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>
> >>>>
> >>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>filesystem.
> >>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>
> >>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>However, it is very difficult to control users not to write memory which has
> >>>>>been already mapped to the file.
> >>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>dirty data) in this way.
> >>>>
> >>>
> >>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>ordering will be
> >>>>s_umount ->  "fs frozen"
> >>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>s_frozen.
> >>>>
> >>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>
> >>>
> >>>>So something like the patch below - it seems to work for me, can you test
> >>>>it please?
> >>>I think your patch looks good, so, the original problem seems to be solved.
> >>>OK, I will test your patch.
> >>>This weekend I cannot test it. So, I will reply next week.
> >>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>patch. And then any problems didn't hit while the reproducer was running.
> >>
> >>I think your patch solves the original deadlock problem which is reported by
> >>Mizuma-san.
> >   Good. Thanks.
> >
> >>>Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
> >>>Signed-off-by: Jan Kara<jack@suse.cz>
> >>>---
> >>>fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> >>>include/linux/fs.h |    1 +
> >>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>
> 
> >>However, I think a write which causes the deadlock is from mmapped dirty
> >>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> >   Why? If you dirty a page, writeback thread can come and try to write it -
> >which blocks - but now that does not matter...
> I have not understood the code around writeback thread very much...
> Please explain me the concrete function name which blocks some writes?
  It would block in ext4_da_writepages() function.

> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> The original problem happens after the fsfreeze operation is done.
> I understand the normal write operation (not mmap) can be blocked while
> fsfreezing. So, I guess we don't always block all the write operation
> while fsfreezing.
  Technically speaking, we block all the transaction starts which means we
end up blocking all the writes from going to disk. But that does not mean
we block all the writes from going to in-memory cache - as you properly
note the mmap case is one of such exceptions.

									Honza
Toshiyuki Okajima April 6, 2011, 7:40 a.m. UTC | #12
Hi.

(2011/04/06 14:57), Jan Kara wrote:
> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>> (2011/04/06 7:54), Jan Kara wrote:
>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>> Hi, thanks for your reviewing.
>>>>>
>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>> Jan Kara<jack@suse.cz>   wrote:
>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>> <SNIP>
>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>> I found it.
>>>>>>>
>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>> "writeback" the memory.
>>>>>>>
>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>> you have freezed the filesystem.
>>>>>
>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>> image. So it's not a problem as such.
>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>> we can write in not only the caches but also the loopback device. However,
>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>
>>>>>>
>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>> filesystem.
>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>
>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>> been already mapped to the file.
>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>> dirty data) in this way.
>>>>>>
>>>>>
>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>> ordering will be
>>>>>> s_umount ->   "fs frozen"
>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>> s_frozen.
>>>>>>
>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>
>>>>>
>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>> it please?
>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>> OK, I will test your patch.
>>>>> This weekend I cannot test it. So, I will reply next week.
>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>
>>>> I think your patch solves the original deadlock problem which is reported by
>>>> Mizuma-san.
>>>    Good. Thanks.
>>>
>>>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>>>> ---
>>>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>>>> include/linux/fs.h |    1 +
>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>
>>
>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>>    Why? If you dirty a page, writeback thread can come and try to write it -
>>> which blocks - but now that does not matter...

>> I have not understood the code around writeback thread very much...
>> Please explain me the concrete function name which blocks some writes?
>    It would block in ext4_da_writepages() function.
In ext4 with delayed allocation case, I understand it blocks.
(Original deadlock problem is just this case.)
But in ext4 without delayed allocation or other filesystems case, which function
can block writing?

>
>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>> The original problem happens after the fsfreeze operation is done.
>> I understand the normal write operation (not mmap) can be blocked while
>> fsfreezing. So, I guess we don't always block all the write operation
>> while fsfreezing.
>    Technically speaking, we block all the transaction starts which means we
> end up blocking all the writes from going to disk. But that does not mean
> we block all the writes from going to in-memory cache - as you properly
> note the mmap case is one of such exceptions.
Hm, I also think we can allow the writes to in-memory cache but we can't allow
the writes to disk while fsfreezing. I am considering that mmap path can
write to disk while fsfreezing because this deadlock problem happens after
fsfreeze operation is done...

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 6, 2011, 5:46 p.m. UTC | #13
Hello,

On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
> (2011/04/06 14:57), Jan Kara wrote:
> >On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
> >>(2011/04/06 7:54), Jan Kara wrote:
> >>>On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
> >>>>(2011/03/31 21:03), Toshiyuki Okajima wrote:
> >>>>>Hi, thanks for your reviewing.
> >>>>>
> >>>>>(2011/03/30 23:12), Jan Kara wrote:
> >>>>>>Hello,
> >>>>>>
> >>>>>>On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
> >>>>>>>On Thu, 17 Feb 2011 11:45:52 +0100
> >>>>>>>Jan Kara<jack@suse.cz>   wrote:
> >>>>>>>>On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
> >>>>>>>>>(2011/02/16 23:56), Jan Kara wrote:
> >>>>>>>>>>On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
> >>>>>>>>>>>On Tue, 15 Feb 2011 18:29:54 +0100
> >>>>>>>>>>>Jan Kara<jack@suse.cz>   wrote:
> >>>>>>>>>>>>On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
> >>>>>>>>>>>>>On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
> >>>>><SNIP>
> >>>>>>>I have deeply continued to examined the root cause of this problem, then
> >>>>>>>I found it.
> >>>>>>>
> >>>>>>>It is that we can write a memory which is mmaped to a file. Then the memory
> >>>>>>>becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
> >>>>>>>"writeback" the memory.
> >>>>>>>
> >>>>>>>Therefore, the root cause of this hangup is not only ext4 component (with
> >>>>>>>delayed allocation feature) but also writeback mechanism for mmap. If you
> >>>>>>>use the other filesystem, you can write something to the filesystem though
> >>>>>>>you have freezed the filesystem.
> >>>>>
> >>>>>>Well, you can write something only in the caches, not to the on disk
> >>>>>>image. So it's not a problem as such.
> >>>>>My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
> >>>>>we can write in not only the caches but also the loopback device. However,
> >>>>>I don't still confirm that we can write to the real device(/dev/sdaX).
> >>>>>
> >>>>>>
> >>>>>>>A sample problem is attached on this mail. Try to execute it then you can
> >>>>>>>confirm that we can write some data to your filesystem while freezing the
> >>>>>>>filesystem.
> >>>>>>>(If you change FS variable in go.sh from ext3 to ext4 and you execute
> >>>>>>>"fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
> >>>>>>>
> >>>>>>>I think the best approach to fix this problem is to let users not to write
> >>>>>>>memory which is mapped to a certain file while the filesystem is freezing.
> >>>>>>>However, it is very difficult to control users not to write memory which has
> >>>>>>>been already mapped to the file.
> >>>>>>It is actually possible. In case of ext4, you could add a check (+ wait)
> >>>>>>in ext4_page_mkwrite() whether the filesystem is frozen or in the process
> >>>>>>of being frozen and if so, wait for it to get unfrozen. The only tough
> >>>>>>problem here might be the locking as ext4_page_mkwrite() is called with
> >>>>>>mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
> >>>>>>But you'd have to fix all filesystems (and all paths possibly creating
> >>>>>>dirty data) in this way.
> >>>>>>
> >>>>>
> >>>>>>>Therefore, I think there is only actual method that we stop writeback thread
> >>>>>>>to resolve the mmap problem. Also, by this fix, the original problem
> >>>>>>>(ext4 delayed write vs unfreeze) can be solved.
> >>>>>>Hmm, I had a look at the code again and think we could fix the issue
> >>>>>>cleanly (i.e. all possible users of s_umount) as follows: The lock
> >>>>>>ordering will be
> >>>>>>s_umount ->   "fs frozen"
> >>>>>>and there will be a new mutex s_freeze_mutex protecting changes of
> >>>>>>s_frozen.
> >>>>>>
> >>>>>>freeze_bdev() already observes this lock ordering, it will only take
> >>>>>>s_freeze_mutex for the changes of s_frozen values. The only other code
> >>>>>>that is relevant for the lock ordering is thaw_super() (the freezing
> >>>>>>process is not expected to reenter kernel for the frozen filesystem).
> >>>>>>In thaw_super() we could take s_freeze_mutex, do all the thawing work,
> >>>>>>set s_frozen, release s_freeze_mutex and put superblock reference.
> >>>>>>
> >>>>>
> >>>>>>So something like the patch below - it seems to work for me, can you test
> >>>>>>it please?
> >>>>>I think your patch looks good, so, the original problem seems to be solved.
> >>>>>OK, I will test your patch.
> >>>>>This weekend I cannot test it. So, I will reply next week.
> >>>>I have tested whether Mizuma-san's reproducer can cause to deadlock with your
> >>>>patch. And then any problems didn't hit while the reproducer was running.
> >>>>
> >>>>I think your patch solves the original deadlock problem which is reported by
> >>>>Mizuma-san.
> >>>   Good. Thanks.
> >>>
> >>>>>Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
> >>>>>Signed-off-by: Jan Kara<jack@suse.cz>
> >>>>>---
> >>>>>fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
> >>>>>include/linux/fs.h |    1 +
> >>>>>2 files changed, 35 insertions(+), 6 deletions(-)
> >>>>
> >>
> >>>>However, I think a write which causes the deadlock is from mmapped dirty
> >>>>pages. So, I guess we also need to fix in the mmap path while fsfreezing.
> >>>   Why? If you dirty a page, writeback thread can come and try to write it -
> >>>which blocks - but now that does not matter...
> 
> >>I have not understood the code around writeback thread very much...
> >>Please explain me the concrete function name which blocks some writes?
> >   It would block in ext4_da_writepages() function.
> In ext4 with delayed allocation case, I understand it blocks.
> (Original deadlock problem is just this case.)
> But in ext4 without delayed allocation or other filesystems case, which function
> can block writing?
  For ext3 or ext4 without delayed allocation we block inside writepage()
function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
get modified to block while minor-faulting the page on frozen fs because
when blocks are already allocated we may skip starting a transaction and so
we could possibly modify the filesystem.

> >>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>The original problem happens after the fsfreeze operation is done.
> >>I understand the normal write operation (not mmap) can be blocked while
> >>fsfreezing. So, I guess we don't always block all the write operation
> >>while fsfreezing.
> >   Technically speaking, we block all the transaction starts which means we
> >end up blocking all the writes from going to disk. But that does not mean
> >we block all the writes from going to in-memory cache - as you properly
> >note the mmap case is one of such exceptions.
> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> the writes to disk while fsfreezing. I am considering that mmap path can
> write to disk while fsfreezing because this deadlock problem happens after
> fsfreeze operation is done...
  I'm sorry I don't understand now - are you speaking about the case above
when writepage() does not wait for filesystem being frozen or something
else?

									Honza
Toshiyuki Okajima April 15, 2011, 1:39 p.m. UTC | #14
Hi, sorry for my late response.

(2011/04/07 2:46), Jan Kara wrote:
>    Hello,
>
> On Wed 06-04-11 16:40:15, Toshiyuki Okajima wrote:
>> (2011/04/06 14:57), Jan Kara wrote:
>>> On Wed 06-04-11 14:09:14, Toshiyuki Okajima wrote:
>>>> (2011/04/06 7:54), Jan Kara wrote:
>>>>> On Tue 05-04-11 19:25:44, Toshiyuki Okajima wrote:
>>>>>> (2011/03/31 21:03), Toshiyuki Okajima wrote:
>>>>>>> Hi, thanks for your reviewing.
>>>>>>>
>>>>>>> (2011/03/30 23:12), Jan Kara wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Mon 28-03-11 17:06:28, Toshiyuki Okajima wrote:
>>>>>>>>> On Thu, 17 Feb 2011 11:45:52 +0100
>>>>>>>>> Jan Kara<jack@suse.cz>    wrote:
>>>>>>>>>> On Thu 17-02-11 12:50:51, Toshiyuki Okajima wrote:
>>>>>>>>>>> (2011/02/16 23:56), Jan Kara wrote:
>>>>>>>>>>>> On Wed 16-02-11 08:17:46, Toshiyuki Okajima wrote:
>>>>>>>>>>>>> On Tue, 15 Feb 2011 18:29:54 +0100
>>>>>>>>>>>>> Jan Kara<jack@suse.cz>    wrote:
>>>>>>>>>>>>>> On Tue 15-02-11 12:03:52, Ted Ts'o wrote:
>>>>>>>>>>>>>>> On Tue, Feb 15, 2011 at 05:06:30PM +0100, Jan Kara wrote:
>>>>>>> <SNIP>
>>>>>>>>> I have deeply continued to examined the root cause of this problem, then
>>>>>>>>> I found it.
>>>>>>>>>
>>>>>>>>> It is that we can write a memory which is mmaped to a file. Then the memory
>>>>>>>>> becomes "DIRTY" so then the flusher thread (ex. wb_do_writeback) tries to
>>>>>>>>> "writeback" the memory.
>>>>>>>>>
>>>>>>>>> Therefore, the root cause of this hangup is not only ext4 component (with
>>>>>>>>> delayed allocation feature) but also writeback mechanism for mmap. If you
>>>>>>>>> use the other filesystem, you can write something to the filesystem though
>>>>>>>>> you have freezed the filesystem.
>>>>>>>
>>>>>>>> Well, you can write something only in the caches, not to the on disk
>>>>>>>> image. So it's not a problem as such.
>>>>>>> My reproducer uses the loopback device(/dev/loopX). By using it, I have confirmed that
>>>>>>> we can write in not only the caches but also the loopback device. However,
>>>>>>> I don't still confirm that we can write to the real device(/dev/sdaX).
>>>>>>>
>>>>>>>>
>>>>>>>>> A sample problem is attached on this mail. Try to execute it then you can
>>>>>>>>> confirm that we can write some data to your filesystem while freezing the
>>>>>>>>> filesystem.
>>>>>>>>> (If you change FS variable in go.sh from ext3 to ext4 and you execute
>>>>>>>>> "fsfreeze -u mnt" manually on other prompt, you can also confirm this deadlock.)
>>>>>>>>>
>>>>>>>>> I think the best approach to fix this problem is to let users not to write
>>>>>>>>> memory which is mapped to a certain file while the filesystem is freezing.
>>>>>>>>> However, it is very difficult to control users not to write memory which has
>>>>>>>>> been already mapped to the file.
>>>>>>>> It is actually possible. In case of ext4, you could add a check (+ wait)
>>>>>>>> in ext4_page_mkwrite() whether the filesystem is frozen or in the process
>>>>>>>> of being frozen and if so, wait for it to get unfrozen. The only tough
>>>>>>>> problem here might be the locking as ext4_page_mkwrite() is called with
>>>>>>>> mmap_sem held and I'm not sure we can take s_umount with mmap_sem held.
>>>>>>>> But you'd have to fix all filesystems (and all paths possibly creating
>>>>>>>> dirty data) in this way.
>>>>>>>>
>>>>>>>
>>>>>>>>> Therefore, I think there is only actual method that we stop writeback thread
>>>>>>>>> to resolve the mmap problem. Also, by this fix, the original problem
>>>>>>>>> (ext4 delayed write vs unfreeze) can be solved.
>>>>>>>> Hmm, I had a look at the code again and think we could fix the issue
>>>>>>>> cleanly (i.e. all possible users of s_umount) as follows: The lock
>>>>>>>> ordering will be
>>>>>>>> s_umount ->    "fs frozen"
>>>>>>>> and there will be a new mutex s_freeze_mutex protecting changes of
>>>>>>>> s_frozen.
>>>>>>>>
>>>>>>>> freeze_bdev() already observes this lock ordering, it will only take
>>>>>>>> s_freeze_mutex for the changes of s_frozen values. The only other code
>>>>>>>> that is relevant for the lock ordering is thaw_super() (the freezing
>>>>>>>> process is not expected to reenter kernel for the frozen filesystem).
>>>>>>>> In thaw_super() we could take s_freeze_mutex, do all the thawing work,
>>>>>>>> set s_frozen, release s_freeze_mutex and put superblock reference.
>>>>>>>>
>>>>>>>
>>>>>>>> So something like the patch below - it seems to work for me, can you test
>>>>>>>> it please?
>>>>>>> I think your patch looks good, so, the original problem seems to be solved.
>>>>>>> OK, I will test your patch.
>>>>>>> This weekend I cannot test it. So, I will reply next week.
>>>>>> I have tested whether Mizuma-san's reproducer can cause to deadlock with your
>>>>>> patch. And then any problems didn't hit while the reproducer was running.
>>>>>>
>>>>>> I think your patch solves the original deadlock problem which is reported by
>>>>>> Mizuma-san.
>>>>>    Good. Thanks.
>>>>>
>>>>>>> Reported-by: Toshiyuki Okajima<toshi.okajima@jp.fujitsu.com>
>>>>>>> Signed-off-by: Jan Kara<jack@suse.cz>
>>>>>>> ---
>>>>>>> fs/super.c         |   40 ++++++++++++++++++++++++++++++++++------
>>>>>>> include/linux/fs.h |    1 +
>>>>>>> 2 files changed, 35 insertions(+), 6 deletions(-)
>>>>>>
>>>>
>>>>>> However, I think a write which causes the deadlock is from mmapped dirty
>>>>>> pages. So, I guess we also need to fix in the mmap path while fsfreezing.
>>>>>    Why? If you dirty a page, writeback thread can come and try to write it -
>>>>> which blocks - but now that does not matter...
>>
>>>> I have not understood the code around writeback thread very much...
>>>> Please explain me the concrete function name which blocks some writes?
>>>    It would block in ext4_da_writepages() function.
>> In ext4 with delayed allocation case, I understand it blocks.
>> (Original deadlock problem is just this case.)
>> But in ext4 without delayed allocation or other filesystems case, which function
>> can block writing?

>    For ext3 or ext4 without delayed allocation we block inside writepage()
> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> get modified to block while minor-faulting the page on frozen fs because
> when blocks are already allocated we may skip starting a transaction and so
> we could possibly modify the filesystem.
OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.

(minor-pagefault)
-> do_wp_page()
    -> page_mkwrite(= ext4_mkwrite())
       => BLOCK!

(major-pagefault)
-> do_liner_fault()
    -> page_mkwrite(= ext4_mkwrite())
       => BLOCK!

>
>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>> The original problem happens after the fsfreeze operation is done.
>>>> I understand the normal write operation (not mmap) can be blocked while
>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>> while fsfreezing.
>>>    Technically speaking, we block all the transaction starts which means we
>>> end up blocking all the writes from going to disk. But that does not mean
>>> we block all the writes from going to in-memory cache - as you properly
>>> note the mmap case is one of such exceptions.
>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>> the writes to disk while fsfreezing. I am considering that mmap path can
>> write to disk while fsfreezing because this deadlock problem happens after
>> fsfreeze operation is done...
>    I'm sorry I don't understand now - are you speaking about the case above
> when writepage() does not wait for filesystem being frozen or something
> else?
Sorry, I didn't understand around the page fault path.
So, I had read the kernel source code around it, then I maybe understand...

I worry whether we can update the file data in mmap case while fsfreezing.
Of course, I understand that we can write to in-memory cache, and it is not a
problem. However, if we can write to disk while fsfreezing, it is a problem.
So, I summarize the cases whether we can write to disk or not.

--------------------------------------------------------------------------
Cases (Whether we can write the data mmapped to the file on the disk
while fsfreezing)

[1] One of the page which has been mmapped is not bound. And
  the page is not allocated yet. (major fault?)

    (1) user dirtys a page
    (2) a page fault occurs (do_page_fault)
    (3) __do_falut is called.
    (4) ext4_page_mkwrite is called
    (5) ext4_write_begin is called
    (6) ext4_journal_start_sb       => We can STOP!

[2] One of the page which has been mmapped is not bound. But
  the page is already allocated, and the buffer_heads of the page
  are not mapped (BH_Mapped).  (minor fault?)

    (1) user dirtys a page
    (2) a page fault occurs (do_page_fault)
    (3) do_wp_page is called.
    (4) ext4_page_mkwrite is called
    (5) ext4_write_begin is called
    (6) ext4_journal_start_sb       => We can STOP!

[3] One of the page which has been mmapped is not bound. But
  the page is already allocated, and the buffer_heads of the page
  are mapped (BH_Mapped).  (minor fault?)

    (1) user dirtys a page
    (2) a page fault occurs (do_page_fault)
    (3) do_wp_page is called.
    (4) ext4_page_mkwrite is called
    * Cannot block the dirty page to be written because all bh is mapped.
    (5) user munmaps the page (munmap)
    (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
    (7) writeback thread writes the page (struct page) to disk
                                            => We cannot STOP!

[4] One of the page which has been mmapped is bound. And
  the page is already allocated.

    (1) user dirtys a page
    ( ) no page fault occurs
    (2) user munmaps the page (munmap)
    (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
    (4) writeback thread writes the page (struct page) to disk
                                            => We cannot STOP!
--------------------------------------------------------------------------

So, we can block the cases [1], [2].
But I think we cannot block the cases [3], [4] now.
If fixing the page_mkwrite, we can also block the case [3].
But the case [4] is not blocked because no page fault occurs
when we dirty the mmapped page.

Therefore, to repair this problem, we need to fix the cases [3], [4].
I think we must modify the writeback thread to fix the case [4].

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 15, 2011, 5:13 p.m. UTC | #15
Hello,

On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >   For ext3 or ext4 without delayed allocation we block inside writepage()
> >function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >get modified to block while minor-faulting the page on frozen fs because
> >when blocks are already allocated we may skip starting a transaction and so
> >we could possibly modify the filesystem.
> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> 
> (minor-pagefault)
> -> do_wp_page()
>    -> page_mkwrite(= ext4_mkwrite())
>       => BLOCK!
> 
> (major-pagefault)
> -> do_liner_fault()
>    -> page_mkwrite(= ext4_mkwrite())
>       => BLOCK!
> 
> >
> >>>>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>The original problem happens after the fsfreeze operation is done.
> >>>>I understand the normal write operation (not mmap) can be blocked while
> >>>>fsfreezing. So, I guess we don't always block all the write operation
> >>>>while fsfreezing.
> >>>   Technically speaking, we block all the transaction starts which means we
> >>>end up blocking all the writes from going to disk. But that does not mean
> >>>we block all the writes from going to in-memory cache - as you properly
> >>>note the mmap case is one of such exceptions.
> >>Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>the writes to disk while fsfreezing. I am considering that mmap path can
> >>write to disk while fsfreezing because this deadlock problem happens after
> >>fsfreeze operation is done...
> >   I'm sorry I don't understand now - are you speaking about the case above
> >when writepage() does not wait for filesystem being frozen or something
> >else?
> Sorry, I didn't understand around the page fault path.
> So, I had read the kernel source code around it, then I maybe understand...
> 
> I worry whether we can update the file data in mmap case while fsfreezing.
> Of course, I understand that we can write to in-memory cache, and it is not a
> problem. However, if we can write to disk while fsfreezing, it is a problem.
> So, I summarize the cases whether we can write to disk or not.
> 
> --------------------------------------------------------------------------
> Cases (Whether we can write the data mmapped to the file on the disk
> while fsfreezing)
> 
> [1] One of the page which has been mmapped is not bound. And
>  the page is not allocated yet. (major fault?)
> 
>    (1) user dirtys a page
>    (2) a page fault occurs (do_page_fault)
>    (3) __do_falut is called.
>    (4) ext4_page_mkwrite is called
>    (5) ext4_write_begin is called
>    (6) ext4_journal_start_sb       => We can STOP!
> 
> [2] One of the page which has been mmapped is not bound. But
>  the page is already allocated, and the buffer_heads of the page
>  are not mapped (BH_Mapped).  (minor fault?)
> 
>    (1) user dirtys a page
>    (2) a page fault occurs (do_page_fault)
>    (3) do_wp_page is called.
>    (4) ext4_page_mkwrite is called
>    (5) ext4_write_begin is called
>    (6) ext4_journal_start_sb       => We can STOP!
> 
> [3] One of the page which has been mmapped is not bound. But
>  the page is already allocated, and the buffer_heads of the page
>  are mapped (BH_Mapped).  (minor fault?)
> 
>    (1) user dirtys a page
>    (2) a page fault occurs (do_page_fault)
>    (3) do_wp_page is called.
>    (4) ext4_page_mkwrite is called
>    * Cannot block the dirty page to be written because all bh is mapped.
>    (5) user munmaps the page (munmap)
>    (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>    (7) writeback thread writes the page (struct page) to disk
>                                            => We cannot STOP!
> 
> [4] One of the page which has been mmapped is bound. And
>  the page is already allocated.
> 
>    (1) user dirtys a page
>    ( ) no page fault occurs
>    (2) user munmaps the page (munmap)
>    (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>    (4) writeback thread writes the page (struct page) to disk
>                                            => We cannot STOP!
> --------------------------------------------------------------------------
> 
> So, we can block the cases [1], [2].
> But I think we cannot block the cases [3], [4] now.
> If fixing the page_mkwrite, we can also block the case [3].
> But the case [4] is not blocked because no page fault occurs
> when we dirty the mmapped page.
> 
> Therefore, to repair this problem, we need to fix the cases [3], [4].
> I think we must modify the writeback thread to fix the case [4].
  The trick here is that when we write a page to disk, we write-protect
the page (you seem to call this that "the page is bound", I'm not sure why).
So we are guaranteed to receive a minor fault (case [3]) if user tries to
modify a page after we finish writeback while freezing the filesystem.
So principially all we need to do is just wait in ext4_page_mkwrite().

								Honza
Eric Sandeen April 15, 2011, 5:17 p.m. UTC | #16
On 4/15/11 12:13 PM, Jan Kara wrote:
>   Hello,
> 
> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>   For ext3 or ext4 without delayed allocation we block inside writepage()
>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>> get modified to block while minor-faulting the page on frozen fs because
>>> when blocks are already allocated we may skip starting a transaction and so
>>> we could possibly modify the filesystem.
>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>
>> (minor-pagefault)
>> -> do_wp_page()
>>    -> page_mkwrite(= ext4_mkwrite())
>>       => BLOCK!
>>
>> (major-pagefault)
>> -> do_liner_fault()
>>    -> page_mkwrite(= ext4_mkwrite())
>>       => BLOCK!
>>
>>>
>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>> while fsfreezing.
>>>>>   Technically speaking, we block all the transaction starts which means we
>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>> note the mmap case is one of such exceptions.
>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>> fsfreeze operation is done...
>>>   I'm sorry I don't understand now - are you speaking about the case above
>>> when writepage() does not wait for filesystem being frozen or something
>>> else?
>> Sorry, I didn't understand around the page fault path.
>> So, I had read the kernel source code around it, then I maybe understand...
>>
>> I worry whether we can update the file data in mmap case while fsfreezing.
>> Of course, I understand that we can write to in-memory cache, and it is not a
>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>> So, I summarize the cases whether we can write to disk or not.
>>
>> --------------------------------------------------------------------------
>> Cases (Whether we can write the data mmapped to the file on the disk
>> while fsfreezing)
>>
>> [1] One of the page which has been mmapped is not bound. And
>>  the page is not allocated yet. (major fault?)
>>
>>    (1) user dirtys a page
>>    (2) a page fault occurs (do_page_fault)
>>    (3) __do_falut is called.
>>    (4) ext4_page_mkwrite is called
>>    (5) ext4_write_begin is called
>>    (6) ext4_journal_start_sb       => We can STOP!
>>
>> [2] One of the page which has been mmapped is not bound. But
>>  the page is already allocated, and the buffer_heads of the page
>>  are not mapped (BH_Mapped).  (minor fault?)
>>
>>    (1) user dirtys a page
>>    (2) a page fault occurs (do_page_fault)
>>    (3) do_wp_page is called.
>>    (4) ext4_page_mkwrite is called
>>    (5) ext4_write_begin is called
>>    (6) ext4_journal_start_sb       => We can STOP!
>>
>> [3] One of the page which has been mmapped is not bound. But
>>  the page is already allocated, and the buffer_heads of the page
>>  are mapped (BH_Mapped).  (minor fault?)
>>
>>    (1) user dirtys a page
>>    (2) a page fault occurs (do_page_fault)
>>    (3) do_wp_page is called.
>>    (4) ext4_page_mkwrite is called
>>    * Cannot block the dirty page to be written because all bh is mapped.
>>    (5) user munmaps the page (munmap)
>>    (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>    (7) writeback thread writes the page (struct page) to disk
>>                                            => We cannot STOP!
>>
>> [4] One of the page which has been mmapped is bound. And
>>  the page is already allocated.
>>
>>    (1) user dirtys a page
>>    ( ) no page fault occurs
>>    (2) user munmaps the page (munmap)
>>    (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>    (4) writeback thread writes the page (struct page) to disk
>>                                            => We cannot STOP!
>> --------------------------------------------------------------------------
>>
>> So, we can block the cases [1], [2].
>> But I think we cannot block the cases [3], [4] now.
>> If fixing the page_mkwrite, we can also block the case [3].
>> But the case [4] is not blocked because no page fault occurs
>> when we dirty the mmapped page.
>>
>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>> I think we must modify the writeback thread to fix the case [4].
>   The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this that "the page is bound", I'm not sure why).
> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So principially all we need to do is just wait in ext4_page_mkwrite().

I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?

no frozen fs should be able to dirty & write pages via mmap, right?

-Eric
 
> 								Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 15, 2011, 5:37 p.m. UTC | #17
On Fri 15-04-11 12:17:06, Eric Sandeen wrote:
> On 4/15/11 12:13 PM, Jan Kara wrote:
> >   Hello,
> > 
> > On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>   For ext3 or ext4 without delayed allocation we block inside writepage()
> >>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>> get modified to block while minor-faulting the page on frozen fs because
> >>> when blocks are already allocated we may skip starting a transaction and so
> >>> we could possibly modify the filesystem.
> >> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>
> >> (minor-pagefault)
> >> -> do_wp_page()
> >>    -> page_mkwrite(= ext4_mkwrite())
> >>       => BLOCK!
> >>
> >> (major-pagefault)
> >> -> do_liner_fault()
> >>    -> page_mkwrite(= ext4_mkwrite())
> >>       => BLOCK!
> >>
> >>>
> >>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>> The original problem happens after the fsfreeze operation is done.
> >>>>>> I understand the normal write operation (not mmap) can be blocked while
> >>>>>> fsfreezing. So, I guess we don't always block all the write operation
> >>>>>> while fsfreezing.
> >>>>>   Technically speaking, we block all the transaction starts which means we
> >>>>> end up blocking all the writes from going to disk. But that does not mean
> >>>>> we block all the writes from going to in-memory cache - as you properly
> >>>>> note the mmap case is one of such exceptions.
> >>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>> the writes to disk while fsfreezing. I am considering that mmap path can
> >>>> write to disk while fsfreezing because this deadlock problem happens after
> >>>> fsfreeze operation is done...
> >>>   I'm sorry I don't understand now - are you speaking about the case above
> >>> when writepage() does not wait for filesystem being frozen or something
> >>> else?
> >> Sorry, I didn't understand around the page fault path.
> >> So, I had read the kernel source code around it, then I maybe understand...
> >>
> >> I worry whether we can update the file data in mmap case while fsfreezing.
> >> Of course, I understand that we can write to in-memory cache, and it is not a
> >> problem. However, if we can write to disk while fsfreezing, it is a problem.
> >> So, I summarize the cases whether we can write to disk or not.
> >>
> >> --------------------------------------------------------------------------
> >> Cases (Whether we can write the data mmapped to the file on the disk
> >> while fsfreezing)
> >>
> >> [1] One of the page which has been mmapped is not bound. And
> >>  the page is not allocated yet. (major fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) __do_falut is called.
> >>    (4) ext4_page_mkwrite is called
> >>    (5) ext4_write_begin is called
> >>    (6) ext4_journal_start_sb       => We can STOP!
> >>
> >> [2] One of the page which has been mmapped is not bound. But
> >>  the page is already allocated, and the buffer_heads of the page
> >>  are not mapped (BH_Mapped).  (minor fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) do_wp_page is called.
> >>    (4) ext4_page_mkwrite is called
> >>    (5) ext4_write_begin is called
> >>    (6) ext4_journal_start_sb       => We can STOP!
> >>
> >> [3] One of the page which has been mmapped is not bound. But
> >>  the page is already allocated, and the buffer_heads of the page
> >>  are mapped (BH_Mapped).  (minor fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) do_wp_page is called.
> >>    (4) ext4_page_mkwrite is called
> >>    * Cannot block the dirty page to be written because all bh is mapped.
> >>    (5) user munmaps the page (munmap)
> >>    (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>    (7) writeback thread writes the page (struct page) to disk
> >>                                            => We cannot STOP!
> >>
> >> [4] One of the page which has been mmapped is bound. And
> >>  the page is already allocated.
> >>
> >>    (1) user dirtys a page
> >>    ( ) no page fault occurs
> >>    (2) user munmaps the page (munmap)
> >>    (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>    (4) writeback thread writes the page (struct page) to disk
> >>                                            => We cannot STOP!
> >> --------------------------------------------------------------------------
> >>
> >> So, we can block the cases [1], [2].
> >> But I think we cannot block the cases [3], [4] now.
> >> If fixing the page_mkwrite, we can also block the case [3].
> >> But the case [4] is not blocked because no page fault occurs
> >> when we dirty the mmapped page.
> >>
> >> Therefore, to repair this problem, we need to fix the cases [3], [4].
> >> I think we must modify the writeback thread to fix the case [4].
> >   The trick here is that when we write a page to disk, we write-protect
> > the page (you seem to call this that "the page is bound", I'm not sure why).
> > So we are guaranteed to receive a minor fault (case [3]) if user tries to
> > modify a page after we finish writeback while freezing the filesystem.
> > So principially all we need to do is just wait in ext4_page_mkwrite().
> 
> I've been kind of absent from this thread, sorry, but why would we wait in ext4_page_mkwrite(), rather than in mm/memory.c prior to any page_mkwrite call on any fs?
> 
> no frozen fs should be able to dirty & write pages via mmap, right?
  I have not put that much thought to it but locking might be kind of
tricky in the generic code. We have to be sure that freezing waits for
the page which is just being faulted. That means we have to take page lock
(now writepage() called during fs_freeze will wait for us), check whether
fs is frozen. If yes, unlock page, do vfs_check_frozen(), and retry. This
call sequence is much better suited for block_page_mkwrite() than for
code in memory.c I think.

								Honza
Toshiyuki Okajima April 18, 2011, 9:05 a.m. UTC | #18
Hi,

(2011/04/16 2:13), Jan Kara wrote:
>    Hello,
>
> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>> get modified to block while minor-faulting the page on frozen fs because
>>> when blocks are already allocated we may skip starting a transaction and so
>>> we could possibly modify the filesystem.
>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>
>> (minor-pagefault)
>> ->  do_wp_page()
>>     ->  page_mkwrite(= ext4_mkwrite())
>>        =>  BLOCK!
>>
>> (major-pagefault)
>> ->  do_liner_fault()
>>     ->  page_mkwrite(= ext4_mkwrite())
>>        =>  BLOCK!
>>
>>>
>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>> while fsfreezing.
>>>>>    Technically speaking, we block all the transaction starts which means we
>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>> note the mmap case is one of such exceptions.
>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>> fsfreeze operation is done...
>>>    I'm sorry I don't understand now - are you speaking about the case above
>>> when writepage() does not wait for filesystem being frozen or something
>>> else?
>> Sorry, I didn't understand around the page fault path.
>> So, I had read the kernel source code around it, then I maybe understand...
>>
>> I worry whether we can update the file data in mmap case while fsfreezing.
>> Of course, I understand that we can write to in-memory cache, and it is not a
>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>> So, I summarize the cases whether we can write to disk or not.
>>
>> --------------------------------------------------------------------------
>> Cases (Whether we can write the data mmapped to the file on the disk
>> while fsfreezing)
>>
>> [1] One of the page which has been mmapped is not bound. And
>>   the page is not allocated yet. (major fault?)
>>
>>     (1) user dirtys a page
>>     (2) a page fault occurs (do_page_fault)
>>     (3) __do_falut is called.
>>     (4) ext4_page_mkwrite is called
>>     (5) ext4_write_begin is called
>>     (6) ext4_journal_start_sb       =>  We can STOP!
>>
>> [2] One of the page which has been mmapped is not bound. But
>>   the page is already allocated, and the buffer_heads of the page
>>   are not mapped (BH_Mapped).  (minor fault?)
>>
>>     (1) user dirtys a page
>>     (2) a page fault occurs (do_page_fault)
>>     (3) do_wp_page is called.
>>     (4) ext4_page_mkwrite is called
>>     (5) ext4_write_begin is called
>>     (6) ext4_journal_start_sb       =>  We can STOP!
>>
>> [3] One of the page which has been mmapped is not bound. But
>>   the page is already allocated, and the buffer_heads of the page
>>   are mapped (BH_Mapped).  (minor fault?)
>>
>>     (1) user dirtys a page
>>     (2) a page fault occurs (do_page_fault)
>>     (3) do_wp_page is called.
>>     (4) ext4_page_mkwrite is called
>>     * Cannot block the dirty page to be written because all bh is mapped.
>>     (5) user munmaps the page (munmap)
>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>     (7) writeback thread writes the page (struct page) to disk
>>                                             =>  We cannot STOP!
>>
>> [4] One of the page which has been mmapped is bound. And
>>   the page is already allocated.
>>
>>     (1) user dirtys a page
>>     ( ) no page fault occurs
>>     (2) user munmaps the page (munmap)
>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>     (4) writeback thread writes the page (struct page) to disk
>>                                             =>  We cannot STOP!
>> --------------------------------------------------------------------------
>>
>> So, we can block the cases [1], [2].
>> But I think we cannot block the cases [3], [4] now.
>> If fixing the page_mkwrite, we can also block the case [3].
>> But the case [4] is not blocked because no page fault occurs
>> when we dirty the mmapped page.
>>
>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>> I think we must modify the writeback thread to fix the case [4].
>    The trick here is that when we write a page to disk, we write-protect
> the page (you seem to call this that "the page is bound", I'm not sure why).
Hm, I want to understand how to write-protect the page under fsfreezing.
But, anyway, I understand we don't need to consider the case [4].

> So we are guaranteed to receive a minor fault (case [3]) if user tries to
> modify a page after we finish writeback while freezing the filesystem.
> So principially all we need to do is just wait in ext4_page_mkwrite().
OK. I understand.
Are there any concrete ideas to fix this?
For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 18, 2011, 10:51 a.m. UTC | #19
On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
> >On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>   For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
> >>>get modified to block while minor-faulting the page on frozen fs because
> >>>when blocks are already allocated we may skip starting a transaction and so
> >>>we could possibly modify the filesystem.
> >>OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
> >>
> >>(minor-pagefault)
> >>->  do_wp_page()
> >>    ->  page_mkwrite(= ext4_mkwrite())
> >>       =>  BLOCK!
> >>
> >>(major-pagefault)
> >>->  do_liner_fault()
> >>    ->  page_mkwrite(= ext4_mkwrite())
> >>       =>  BLOCK!
> >>
> >>>
> >>>>>>Mizuma-san's reproducer also writes the data which maps to the file (mmap).
> >>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>I understand the normal write operation (not mmap) can be blocked while
> >>>>>>fsfreezing. So, I guess we don't always block all the write operation
> >>>>>>while fsfreezing.
> >>>>>   Technically speaking, we block all the transaction starts which means we
> >>>>>end up blocking all the writes from going to disk. But that does not mean
> >>>>>we block all the writes from going to in-memory cache - as you properly
> >>>>>note the mmap case is one of such exceptions.
> >>>>Hm, I also think we can allow the writes to in-memory cache but we can't allow
> >>>>the writes to disk while fsfreezing. I am considering that mmap path can
> >>>>write to disk while fsfreezing because this deadlock problem happens after
> >>>>fsfreeze operation is done...
> >>>   I'm sorry I don't understand now - are you speaking about the case above
> >>>when writepage() does not wait for filesystem being frozen or something
> >>>else?
> >>Sorry, I didn't understand around the page fault path.
> >>So, I had read the kernel source code around it, then I maybe understand...
> >>
> >>I worry whether we can update the file data in mmap case while fsfreezing.
> >>Of course, I understand that we can write to in-memory cache, and it is not a
> >>problem. However, if we can write to disk while fsfreezing, it is a problem.
> >>So, I summarize the cases whether we can write to disk or not.
> >>
> >>--------------------------------------------------------------------------
> >>Cases (Whether we can write the data mmapped to the file on the disk
> >>while fsfreezing)
> >>
> >>[1] One of the page which has been mmapped is not bound. And
> >>  the page is not allocated yet. (major fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) __do_falut is called.
> >>    (4) ext4_page_mkwrite is called
> >>    (5) ext4_write_begin is called
> >>    (6) ext4_journal_start_sb       =>  We can STOP!
> >>
> >>[2] One of the page which has been mmapped is not bound. But
> >>  the page is already allocated, and the buffer_heads of the page
> >>  are not mapped (BH_Mapped).  (minor fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) do_wp_page is called.
> >>    (4) ext4_page_mkwrite is called
> >>    (5) ext4_write_begin is called
> >>    (6) ext4_journal_start_sb       =>  We can STOP!
> >>
> >>[3] One of the page which has been mmapped is not bound. But
> >>  the page is already allocated, and the buffer_heads of the page
> >>  are mapped (BH_Mapped).  (minor fault?)
> >>
> >>    (1) user dirtys a page
> >>    (2) a page fault occurs (do_page_fault)
> >>    (3) do_wp_page is called.
> >>    (4) ext4_page_mkwrite is called
> >>    * Cannot block the dirty page to be written because all bh is mapped.
> >>    (5) user munmaps the page (munmap)
> >>    (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>    (7) writeback thread writes the page (struct page) to disk
> >>                                            =>  We cannot STOP!
> >>
> >>[4] One of the page which has been mmapped is bound. And
> >>  the page is already allocated.
> >>
> >>    (1) user dirtys a page
> >>    ( ) no page fault occurs
> >>    (2) user munmaps the page (munmap)
> >>    (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
> >>    (4) writeback thread writes the page (struct page) to disk
> >>                                            =>  We cannot STOP!
> >>--------------------------------------------------------------------------
> >>
> >>So, we can block the cases [1], [2].
> >>But I think we cannot block the cases [3], [4] now.
> >>If fixing the page_mkwrite, we can also block the case [3].
> >>But the case [4] is not blocked because no page fault occurs
> >>when we dirty the mmapped page.
> >>
> >>Therefore, to repair this problem, we need to fix the cases [3], [4].
> >>I think we must modify the writeback thread to fix the case [4].
> >   The trick here is that when we write a page to disk, we write-protect
> >the page (you seem to call this that "the page is bound", I'm not sure why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
  Look at what page_mkclean() called from clear_page_dirty_for_io() does...

> But, anyway, I understand we don't need to consider the case [4].
  Yes.

> >So we are guaranteed to receive a minor fault (case [3]) if user tries to
> >modify a page after we finish writeback while freezing the filesystem.
> >So principially all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
  Yes.

> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
  Sadly I don't see a simple way to fix this issue for all filesystems at
once. Implementing proper wait in block_page_mkwrite() should fix the issue
for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
have patches for this already for some time but I have to get to properly
testing them in more exotic conditions like 64k pages...

								Honza
Toshiyuki Okajima April 19, 2011, 9:43 a.m. UTC | #20
Hi,

(2011/04/18 19:51), Jan Kara wrote:
> On Mon 18-04-11 18:05:01, Toshiyuki Okajima wrote:
>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>    For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should probably
>>>>> get modified to block while minor-faulting the page on frozen fs because
>>>>> when blocks are already allocated we may skip starting a transaction and so
>>>>> we could possibly modify the filesystem.
>>>> OK. I think ->page_mkwrite() should also block writing the minor-faulting pages.
>>>>
>>>> (minor-pagefault)
>>>> ->   do_wp_page()
>>>>     ->   page_mkwrite(= ext4_mkwrite())
>>>>        =>   BLOCK!
>>>>
>>>> (major-pagefault)
>>>> ->   do_liner_fault()
>>>>     ->   page_mkwrite(= ext4_mkwrite())
>>>>        =>   BLOCK!
>>>>
>>>>>
>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the file (mmap).
>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>> I understand the normal write operation (not mmap) can be blocked while
>>>>>>>> fsfreezing. So, I guess we don't always block all the write operation
>>>>>>>> while fsfreezing.
>>>>>>>    Technically speaking, we block all the transaction starts which means we
>>>>>>> end up blocking all the writes from going to disk. But that does not mean
>>>>>>> we block all the writes from going to in-memory cache - as you properly
>>>>>>> note the mmap case is one of such exceptions.
>>>>>> Hm, I also think we can allow the writes to in-memory cache but we can't allow
>>>>>> the writes to disk while fsfreezing. I am considering that mmap path can
>>>>>> write to disk while fsfreezing because this deadlock problem happens after
>>>>>> fsfreeze operation is done...
>>>>>    I'm sorry I don't understand now - are you speaking about the case above
>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>> else?
>>>> Sorry, I didn't understand around the page fault path.
>>>> So, I had read the kernel source code around it, then I maybe understand...
>>>>
>>>> I worry whether we can update the file data in mmap case while fsfreezing.
>>>> Of course, I understand that we can write to in-memory cache, and it is not a
>>>> problem. However, if we can write to disk while fsfreezing, it is a problem.
>>>> So, I summarize the cases whether we can write to disk or not.
>>>>
>>>> --------------------------------------------------------------------------
>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>> while fsfreezing)
>>>>
>>>> [1] One of the page which has been mmapped is not bound. And
>>>>   the page is not allocated yet. (major fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) __do_falut is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     (5) ext4_write_begin is called
>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>
>>>> [2] One of the page which has been mmapped is not bound. But
>>>>   the page is already allocated, and the buffer_heads of the page
>>>>   are not mapped (BH_Mapped).  (minor fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) do_wp_page is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     (5) ext4_write_begin is called
>>>>     (6) ext4_journal_start_sb       =>   We can STOP!
>>>>
>>>> [3] One of the page which has been mmapped is not bound. But
>>>>   the page is already allocated, and the buffer_heads of the page
>>>>   are mapped (BH_Mapped).  (minor fault?)
>>>>
>>>>     (1) user dirtys a page
>>>>     (2) a page fault occurs (do_page_fault)
>>>>     (3) do_wp_page is called.
>>>>     (4) ext4_page_mkwrite is called
>>>>     * Cannot block the dirty page to be written because all bh is mapped.
>>>>     (5) user munmaps the page (munmap)
>>>>     (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>     (7) writeback thread writes the page (struct page) to disk
>>>>                                             =>   We cannot STOP!
>>>>
>>>> [4] One of the page which has been mmapped is bound. And
>>>>   the page is already allocated.
>>>>
>>>>     (1) user dirtys a page
>>>>     ( ) no page fault occurs
>>>>     (2) user munmaps the page (munmap)
>>>>     (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>>>     (4) writeback thread writes the page (struct page) to disk
>>>>                                             =>   We cannot STOP!
>>>> --------------------------------------------------------------------------
>>>>
>>>> So, we can block the cases [1], [2].
>>>> But I think we cannot block the cases [3], [4] now.
>>>> If fixing the page_mkwrite, we can also block the case [3].
>>>> But the case [4] is not blocked because no page fault occurs
>>>> when we dirty the mmapped page.
>>>>
>>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>>> I think we must modify the writeback thread to fix the case [4].
>>>    The trick here is that when we write a page to disk, we write-protect
>>> the page (you seem to call this that "the page is bound", I'm not sure why).
>> Hm, I want to understand how to write-protect the page under fsfreezing.
>    Look at what page_mkclean() called from clear_page_dirty_for_io() does...
Thanks. I'll read that.

>
>> But, anyway, I understand we don't need to consider the case [4].
>    Yes.
>
>>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>>> modify a page after we finish writeback while freezing the filesystem.
>>> So principially all we need to do is just wait in ext4_page_mkwrite().
>> OK. I understand.
>> Are there any concrete ideas to fix this?
>> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
>    Yes.
>
>> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent it?
>    Sadly I don't see a simple way to fix this issue for all filesystems at
> once. Implementing proper wait in block_page_mkwrite() should fix the issue
> for xfs. Other filesystems like GFS2 or Btrfs will have to be fixed
> separately as ext4. For ext3, we'd have to add ->page_mkwrite() support. I
> have patches for this already for some time but I have to get to properly
> testing them in more exotic conditions like 64k pages...
OK. I understand the current status of your works to fix the problem which
can be written with some data at mmap path while fsfreezing.

Thanks,
Toshiyuki Okajima

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 3, 2011, 11:01 a.m. UTC | #21
On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> Hi,
>
> (2011/04/16 2:13), Jan Kara wrote:
>> Hello,
>>
>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>> probably
>>>> get modified to block while minor-faulting the page on frozen fs
>>>> because
>>>> when blocks are already allocated we may skip starting a transaction
>>>> and so
>>>> we could possibly modify the filesystem.
>>> OK. I think ->page_mkwrite() should also block writing the
>>> minor-faulting pages.
>>>
>>> (minor-pagefault)
>>> -> do_wp_page()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>> (major-pagefault)
>>> -> do_liner_fault()
>>> -> page_mkwrite(= ext4_mkwrite())
>>> => BLOCK!
>>>
>>>>
>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>> file (mmap).
>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>> while
>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>> operation
>>>>>>> while fsfreezing.
>>>>>> Technically speaking, we block all the transaction starts which
>>>>>> means we
>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>> not mean
>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>> properly
>>>>>> note the mmap case is one of such exceptions.
>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>> can't allow
>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>> path can
>>>>> write to disk while fsfreezing because this deadlock problem
>>>>> happens after
>>>>> fsfreeze operation is done...
>>>> I'm sorry I don't understand now - are you speaking about the case
>>>> above
>>>> when writepage() does not wait for filesystem being frozen or something
>>>> else?
>>> Sorry, I didn't understand around the page fault path.
>>> So, I had read the kernel source code around it, then I maybe
>>> understand...
>>>
>>> I worry whether we can update the file data in mmap case while
>>> fsfreezing.
>>> Of course, I understand that we can write to in-memory cache, and it
>>> is not a
>>> problem. However, if we can write to disk while fsfreezing, it is a
>>> problem.
>>> So, I summarize the cases whether we can write to disk or not.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> Cases (Whether we can write the data mmapped to the file on the disk
>>> while fsfreezing)
>>>
>>> [1] One of the page which has been mmapped is not bound. And
>>> the page is not allocated yet. (major fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) __do_falut is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!
>>>
>>> [2] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are not mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> (5) ext4_write_begin is called
>>> (6) ext4_journal_start_sb => We can STOP!

What happens in the case as follows:

Task 1: Mmapped writes
t1)ext4_page_mkwrite()
   t2) ext4_write_begin() (FS is thawed so we proceed)
   t3) ext4_write_end() (journal is stopped now)
-----Pre-empted-----


Task 2: Freeze Task
t4) freezes the super block...
...(continues)....
tn) the page cache is clean and the F.S is frozen. Freeze has completed 
execution.

Task 1: Mmapped writes
tn+1) ext4_page_mkwrite() returns 0.
tn+2) __do_fault() gets control, code gets executed.
tn+3) _do_fault() marks the page dirty if the intent is to write to a 
file based page which faulted.

So you end up dirtying the page cache when the F.S is frozen? No?


Warm Regards,
Surbhi.







>>>
>>> [3] One of the page which has been mmapped is not bound. But
>>> the page is already allocated, and the buffer_heads of the page
>>> are mapped (BH_Mapped). (minor fault?)
>>>
>>> (1) user dirtys a page
>>> (2) a page fault occurs (do_page_fault)
>>> (3) do_wp_page is called.
>>> (4) ext4_page_mkwrite is called
>>> * Cannot block the dirty page to be written because all bh is mapped.
>>> (5) user munmaps the page (munmap)
>>> (6) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (7) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>>
>>> [4] One of the page which has been mmapped is bound. And
>>> the page is already allocated.
>>>
>>> (1) user dirtys a page
>>> ( ) no page fault occurs
>>> (2) user munmaps the page (munmap)
>>> (3) zap_pte_range dirtys the page (struct page) which is pte_dirtyed.
>>> (4) writeback thread writes the page (struct page) to disk
>>> => We cannot STOP!
>>> --------------------------------------------------------------------------
>>>
>>>
>>> So, we can block the cases [1], [2].
>>> But I think we cannot block the cases [3], [4] now.
>>> If fixing the page_mkwrite, we can also block the case [3].
>>> But the case [4] is not blocked because no page fault occurs
>>> when we dirty the mmapped page.
>>>
>>> Therefore, to repair this problem, we need to fix the cases [3], [4].
>>> I think we must modify the writeback thread to fix the case [4].
>> The trick here is that when we write a page to disk, we write-protect
>> the page (you seem to call this that "the page is bound", I'm not sure
>> why).
> Hm, I want to understand how to write-protect the page under fsfreezing.
> But, anyway, I understand we don't need to consider the case [4].
>
>> So we are guaranteed to receive a minor fault (case [3]) if user tries to
>> modify a page after we finish writeback while freezing the filesystem.
>> So principially all we need to do is just wait in ext4_page_mkwrite().
> OK. I understand.
> Are there any concrete ideas to fix this?
> For ext4, we can rescue from the case [3] by modifying ext4_page_mkwrite().
> But for ext3 or other FSs, we must implement ->page_mkwrite() to prevent
> it?
>
> Thanks,
> Toshiyuki Okajima
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 3, 2011, 1:08 p.m. UTC | #22
On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
Toshiyuki pointed out.

zap_pte_range()
  mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  

So, I think that it is here that we should do the checking for a ext4 F.S
frozen state and also prevent a parallel ext4 F.S freeze from happening.

Attaching a patch for initial review. Please do let me know your thoughts! 

Thanks a lot!

Warm Regards,
Surbhi.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 3, 2011, 1:46 p.m. UTC | #23
On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> Toshiyuki pointed out.
> 
> zap_pte_range()
>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)  
> 
> So, I think that it is here that we should do the checking for a ext4 F.S
> frozen state and also prevent a parallel ext4 F.S freeze from happening.
> 
> Attaching a patch for initial review. Please do let me know your thoughts! 
  This is definitely the wrong place. ->set_page_dirty() callbacks are
called with various locks held and the page need not be locked (thus
dereferencing page->mapping is oopsable). Moreover this particular callback
is called only in data=journal mode.

Believe me, the right place is page_mkwrite() - you have to catch the
read-only => read-write page transition. Once the page is mapped
read-write, you've already lost the race.

								Honza
Surbhi Palande May 3, 2011, 1:56 p.m. UTC | #24
On 05/03/2011 04:46 PM, Jan Kara wrote:
> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:

Sorry for missing the subject line :(
>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>> Toshiyuki pointed out.
>>
>> zap_pte_range()
>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>
>> So, I think that it is here that we should do the checking for a ext4 F.S
>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>
>> Attaching a patch for initial review. Please do let me know your thoughts!
>    This is definitely the wrong place. ->set_page_dirty() callbacks are
> called with various locks held and the page need not be locked (thus
> dereferencing page->mapping is oopsable). Moreover this particular callback
> is called only in data=journal mode.
Ok! Thanks for that!

>
> Believe me, the right place is page_mkwrite() - you have to catch the
> read-only =>  read-write page transition. Once the page is mapped
> read-write, you've already lost the race.

My only point is:
1) something should prevent the freeze from happening. We cant merely 
check the vfs_check_frozen()?

And this should be done where the page is marked dirty.Also, I thought 
that the page is marked read-write only in the page table in the 
__do_page_fault()? i.e the zap_pte_range() marks them dirty in the page 
cache? Is this understanding right?

IMHO, whatever code dirties the page in the page cache should call a F.S 
specific function and let it _prevent_ a fsfreeze while the page is 
getting dirtied, so that a freeze called after this point flushes this page!

Warm Regards,
Surbhi.










>
> 								Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Surbhi Palande May 3, 2011, 3:26 p.m. UTC | #25
On 05/03/2011 04:56 PM, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>
> Sorry for missing the subject line :(
>>> On munmap() zap_pte_range() is called which dirties the PTE dirty
>>> pages as
>>> Toshiyuki pointed out.
>>>
>>> zap_pte_range()
>>> mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>
>>> So, I think that it is here that we should do the checking for a ext4
>>> F.S
>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>
>>> Attaching a patch for initial review. Please do let me know your
>>> thoughts!
>> This is definitely the wrong place. ->set_page_dirty() callbacks are
>> called with various locks held and the page need not be locked (thus
>> dereferencing page->mapping is oopsable). Moreover this particular
>> callback
>> is called only in data=journal mode.
> Ok! Thanks for that!
>
>>
>> Believe me, the right place is page_mkwrite() - you have to catch the
>> read-only => read-write page transition. Once the page is mapped
>> read-write, you've already lost the race.
Also, we then need to prevent a munmap()/zap_pte_range() call from 
dirtying a mmapped file page when the F.S is frozen?

Warm Regards,
Surbhi.

>
> My only point is:
> 1) something should prevent the freeze from happening. We cant merely
> check the vfs_check_frozen()?
>
> And this should be done where the page is marked dirty.Also, I thought
> that the page is marked read-write only in the page table in the
> __do_page_fault()? i.e the zap_pte_range() marks them dirty in the page
> cache? Is this understanding right?
>
> IMHO, whatever code dirties the page in the page cache should call a F.S
> specific function and let it _prevent_ a fsfreeze while the page is
> getting dirtied, so that a freeze called after this point flushes this
> page!
>
> Warm Regards,
> Surbhi.
>
>
>
>
>
>
>
>
>
>
>>
>> Honza
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 3, 2011, 3:36 p.m. UTC | #26
On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> On 05/03/2011 04:46 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> 
> Sorry for missing the subject line :(
> >>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>Toshiyuki pointed out.
> >>
> >>zap_pte_range()
> >>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>
> >>So, I think that it is here that we should do the checking for a ext4 F.S
> >>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>
> >>Attaching a patch for initial review. Please do let me know your thoughts!
> >   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >called with various locks held and the page need not be locked (thus
> >dereferencing page->mapping is oopsable). Moreover this particular callback
> >is called only in data=journal mode.
> Ok! Thanks for that!
> 
> >
> >Believe me, the right place is page_mkwrite() - you have to catch the
> >read-only =>  read-write page transition. Once the page is mapped
> >read-write, you've already lost the race.
> 
> My only point is:
> 1) something should prevent the freeze from happening. We cant
> merely check the vfs_check_frozen()?
  Yes, I agree - see my other email with patches.

> And this should be done where the page is marked dirty.Also, I
> thought that the page is marked read-write only in the page table in
> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> the page cache? Is this understanding right?
  The page can become dirty either because it was written via standard
write - write_begin is responsible for reliable check here - or it was
written via mmap - here we rely on page_mkwrite to do a reliable check -
it is analogous to write_begin callback. There should be no other way
to dirty a page.

With dirty bits it is a bit complicated. We have two of them in fact. One
in page table entry maintained by mmu and one in page structure maintained
by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
from page table into struct page. This is a lazy process so page can in
principle have new data without a dirty bit set in struct page because we
have not yet copied the dirty bit from page table. Only at moments where it
is important (like when we want to unmap the page, or throw away the page,
or so), we make sure struct page and page table bits are in sync.

Another subtle thing you need not be aware of it that when we clear page
dirty bit, we also writeprotect the page. So we are guaranteed to get a
page fault when the page is written to again.

> IMHO, whatever code dirties the page in the page cache should call a
> F.S specific function and let it _prevent_ a fsfreeze while the page
> is getting dirtied, so that a freeze called after this point flushes
> this page!
  Agreed, that's what code in write_begin() and page_mkwrite() should
achieve.
								Honza
Surbhi Palande May 3, 2011, 3:43 p.m. UTC | #27
On 05/03/2011 06:36 PM, Jan Kara wrote:
> On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
>> On 05/03/2011 04:46 PM, Jan Kara wrote:
>>> On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
>>
>> Sorry for missing the subject line :(
>>>> On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
>>>> Toshiyuki pointed out.
>>>>
>>>> zap_pte_range()
>>>>    mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
>>>>
>>>> So, I think that it is here that we should do the checking for a ext4 F.S
>>>> frozen state and also prevent a parallel ext4 F.S freeze from happening.
>>>>
>>>> Attaching a patch for initial review. Please do let me know your thoughts!
>>>    This is definitely the wrong place. ->set_page_dirty() callbacks are
>>> called with various locks held and the page need not be locked (thus
>>> dereferencing page->mapping is oopsable). Moreover this particular callback
>>> is called only in data=journal mode.
>> Ok! Thanks for that!
>>
>>>
>>> Believe me, the right place is page_mkwrite() - you have to catch the
>>> read-only =>   read-write page transition. Once the page is mapped
>>> read-write, you've already lost the race.
>>
>> My only point is:
>> 1) something should prevent the freeze from happening. We cant
>> merely check the vfs_check_frozen()?
>    Yes, I agree - see my other email with patches.
>
>> And this should be done where the page is marked dirty.Also, I
>> thought that the page is marked read-write only in the page table in
>> the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
>> the page cache? Is this understanding right?
>    The page can become dirty either because it was written via standard
> write - write_begin is responsible for reliable check here - or it was
> written via mmap - here we rely on page_mkwrite to do a reliable check -
> it is analogous to write_begin callback. There should be no other way
> to dirty a page.
>
> With dirty bits it is a bit complicated. We have two of them in fact. One
> in page table entry maintained by mmu and one in page structure maintained
> by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> from page table into struct page. This is a lazy process so page can in
> principle have new data without a dirty bit set in struct page because we
> have not yet copied the dirty bit from page table. Only at moments where it
> is important (like when we want to unmap the page, or throw away the page,
> or so), we make sure struct page and page table bits are in sync.
>
> Another subtle thing you need not be aware of it that when we clear page
> dirty bit, we also writeprotect the page. So we are guaranteed to get a
> page fault when the page is written to again.
>
>> IMHO, whatever code dirties the page in the page cache should call a
>> F.S specific function and let it _prevent_ a fsfreeze while the page
>> is getting dirtied, so that a freeze called after this point flushes
>> this page!
>    Agreed, that's what code in write_begin() and page_mkwrite() should
> achieve.
> 								Honza
Thanks a lot for the wonderful explanation :)

How about the revert : i.e calling  jbd2_journal_unlock_updates() from 
ext4_unfreeze() instead of the ext4_freeze()? Do you agree to that?


Warm Regards,
Surbhi.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 4, 2011, 7:24 p.m. UTC | #28
On Tue 03-05-11 18:43:48, Surbhi Palande wrote:
> On 05/03/2011 06:36 PM, Jan Kara wrote:
> >On Tue 03-05-11 16:56:57, Surbhi Palande wrote:
> >>On 05/03/2011 04:46 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 16:08:36, Surbhi Palande wrote:
> >>
> >>Sorry for missing the subject line :(
> >>>>On munmap() zap_pte_range() is called which dirties the PTE dirty pages as
> >>>>Toshiyuki pointed out.
> >>>>
> >>>>zap_pte_range()
> >>>>   mapping->a_ops->set_page_dirty (= ext4_journalled_set_page_dirty)
> >>>>
> >>>>So, I think that it is here that we should do the checking for a ext4 F.S
> >>>>frozen state and also prevent a parallel ext4 F.S freeze from happening.
> >>>>
> >>>>Attaching a patch for initial review. Please do let me know your thoughts!
> >>>   This is definitely the wrong place. ->set_page_dirty() callbacks are
> >>>called with various locks held and the page need not be locked (thus
> >>>dereferencing page->mapping is oopsable). Moreover this particular callback
> >>>is called only in data=journal mode.
> >>Ok! Thanks for that!
> >>
> >>>
> >>>Believe me, the right place is page_mkwrite() - you have to catch the
> >>>read-only =>   read-write page transition. Once the page is mapped
> >>>read-write, you've already lost the race.
> >>
> >>My only point is:
> >>1) something should prevent the freeze from happening. We cant
> >>merely check the vfs_check_frozen()?
> >   Yes, I agree - see my other email with patches.
> >
> >>And this should be done where the page is marked dirty.Also, I
> >>thought that the page is marked read-write only in the page table in
> >>the __do_page_fault()? i.e the zap_pte_range() marks them dirty in
> >>the page cache? Is this understanding right?
> >   The page can become dirty either because it was written via standard
> >write - write_begin is responsible for reliable check here - or it was
> >written via mmap - here we rely on page_mkwrite to do a reliable check -
> >it is analogous to write_begin callback. There should be no other way
> >to dirty a page.
> >
> >With dirty bits it is a bit complicated. We have two of them in fact. One
> >in page table entry maintained by mmu and one in page structure maintained
> >by kernel. Some functions (such as zap_pte_range()) copy the dirty bits
> >from page table into struct page. This is a lazy process so page can in
> >principle have new data without a dirty bit set in struct page because we
> >have not yet copied the dirty bit from page table. Only at moments where it
> >is important (like when we want to unmap the page, or throw away the page,
> >or so), we make sure struct page and page table bits are in sync.
> >
> >Another subtle thing you need not be aware of it that when we clear page
> >dirty bit, we also writeprotect the page. So we are guaranteed to get a
> >page fault when the page is written to again.
> >
> >>IMHO, whatever code dirties the page in the page cache should call a
> >>F.S specific function and let it _prevent_ a fsfreeze while the page
> >>is getting dirtied, so that a freeze called after this point flushes
> >>this page!
> >   Agreed, that's what code in write_begin() and page_mkwrite() should
> >achieve.
> >								Honza
> Thanks a lot for the wonderful explanation :)
> 
> How about the revert : i.e calling  jbd2_journal_unlock_updates()
> from ext4_unfreeze() instead of the ext4_freeze()? Do you agree to
> that?
  Sorry, I don't agree with revert. We could talk about changing
jbd2_journal_unlock_updates() to not return with mutex held (and handle
synchronization of locked journal operations differently) as an alternative
to doing "freeze" reference counting. But returning with mutex held to user
space is no-go. It will cause problems in lockdep, violates kernel locking
rules, and generally is a bad programming ;).

								Honza
Surbhi Palande May 6, 2011, 3:20 p.m. UTC | #29
This patch adds a flag which indicates that the journal is frozen or not and
introduces two new functions jbd2_journal_freeze and jbd2_journal_thaw which
should be called when the F.S freezes and thaws respectively. 
A new handle can only be started now when the barrier count is 0 and when the
journal is not in a frozen state. While the journal is in a frozen state,
trying to start a new handle would put the process on wait queue. Thawing the
journal would wake up all the processes waiting on this wait queue.

I have lightly tested this patch. Sending it here for initial review. Please
do let me know your inputs. Thanks a lot!

Warm Regards,
Surbhi.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/super.c b/fs/super.c
index e848649..4f74718 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,6 +77,7 @@  static struct super_block *alloc_super(struct file_system_type *type)
 		INIT_LIST_HEAD(&s->s_dentry_lru);
 		init_rwsem(&s->s_umount);
 		mutex_init(&s->s_lock);
+		mutex_init(&s->s_freeze_mutex);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
 		/*
 		 * The locking rules for s_lock are up to the
@@ -971,6 +972,24 @@  out:
  * Syncs the super to make sure the filesystem is consistent and calls the fs's
  * freeze_fs.  Subsequent calls to this without first thawing the fs will return
  * -EBUSY.
+ *
+ * Locking of freeze / thaw is tricky (if not messy). Freezing is protected by
+ * exclusively taking s_umount to avoid races with mount / remount / umount and
+ * also provide exclusion of concurrent freeze calls. Then we have
+ * s_freeze_mutex which protects changes to s_frozen and the call ->freeze_fs()
+ * against races with thawing code.
+ *
+ * Thawing code must not take s_umount before the filesystem is unfrozen
+ * because that would cause deadlocks (e.g. background flushing takes s_umount
+ * and then does writeback which blocks on a frozen filesystem). So we take
+ * only s_freeze_mutex, which provides us exclusion against concurrent
+ * freezing, and hold it until the thawing is finished. We are protected
+ * against superblock going away by holding an active sb reference and against
+ * remounting by the fact that the sb is frozen.
+ *
+ * Notes: s_freeze_mutex cannot be merged with bd_fsfreeze_mutex because we
+ * can freeze block devices without filesystems and also freeze filesystems
+ * not backed by block devices.
  */
 int freeze_super(struct super_block *sb)
 {
@@ -978,7 +997,9 @@  int freeze_super(struct super_block *sb)
 
 	atomic_inc(&sb->s_active);
 	down_write(&sb->s_umount);
-	if (sb->s_frozen) {
+	mutex_lock(&sb->s_freeze_mutex);
+	if (sb->s_frozen != SB_UNFROZEN) {
+		mutex_unlock(&sb->s_freeze_mutex);
 		deactivate_locked_super(sb);
 		return -EBUSY;
 	}
@@ -986,15 +1007,18 @@  int freeze_super(struct super_block *sb)
 	if (sb->s_flags & MS_RDONLY) {
 		sb->s_frozen = SB_FREEZE_TRANS;
 		smp_wmb();
+		mutex_unlock(&sb->s_freeze_mutex);
 		up_write(&sb->s_umount);
 		return 0;
 	}
 
 	sb->s_frozen = SB_FREEZE_WRITE;
+	mutex_unlock(&sb->s_freeze_mutex);
 	smp_wmb();
 
 	sync_filesystem(sb);
 
+	mutex_lock(&sb->s_freeze_mutex);
 	sb->s_frozen = SB_FREEZE_TRANS;
 	smp_wmb();
 
@@ -1005,10 +1029,12 @@  int freeze_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
 			sb->s_frozen = SB_UNFROZEN;
+			mutex_unlock(&sb->s_freeze_mutex);
 			deactivate_locked_super(sb);
 			return ret;
 		}
 	}
+	mutex_unlock(&sb->s_freeze_mutex);
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1019,14 +1045,15 @@  EXPORT_SYMBOL(freeze_super);
  * @sb: the super to thaw
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super().
+ * See freeze_super() for locking comments.
  */
 int thaw_super(struct super_block *sb)
 {
 	int error;
 
-	down_write(&sb->s_umount);
-	if (sb->s_frozen == SB_UNFROZEN) {
-		up_write(&sb->s_umount);
+	mutex_lock(&sb->s_freeze_mutex);
+	if (sb->s_frozen != SB_FREEZE_TRANS) {
+		mutex_unlock(&sb->s_freeze_mutex);
 		return -EINVAL;
 	}
 
@@ -1039,7 +1066,7 @@  int thaw_super(struct super_block *sb)
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
 			sb->s_frozen = SB_FREEZE_TRANS;
-			up_write(&sb->s_umount);
+			mutex_unlock(&sb->s_freeze_mutex);
 			return error;
 		}
 	}
@@ -1048,7 +1075,8 @@  out:
 	sb->s_frozen = SB_UNFROZEN;
 	smp_wmb();
 	wake_up(&sb->s_wait_unfrozen);
-	deactivate_locked_super(sb);
+	mutex_unlock(&sb->s_freeze_mutex);
+	deactivate_super(sb);
 
 	return 0;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7061a85..230892d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1382,6 +1382,7 @@  struct super_block {
 	struct dentry		*s_root;
 	struct rw_semaphore	s_umount;
 	struct mutex		s_lock;
+	struct mutex		s_freeze_mutex;
 	int			s_count;
 	atomic_t		s_active;
 #ifdef CONFIG_SECURITY