[RFC] Re: [BUG] ext4: cannot unfreeze a filesystem due to a deadlock

Message ID 20110503151948.GB6009@quack.suse.cz
State Superseded, archived

Commit Message

Jan Kara May 3, 2011, 3:19 p.m. UTC
On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >(2011/04/16 2:13), Jan Kara wrote:
> >>Hello,
> >>
> >>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>probably
> >>>>get modified to block while minor-faulting the page on frozen fs
> >>>>because
> >>>>when blocks are already allocated we may skip starting a transaction
> >>>>and so
> >>>>we could possibly modify the filesystem.
> >>>OK. I think ->page_mkwrite() should also block writing the
> >>>minor-faulting pages.
> >>>
> >>>(minor-pagefault)
> >>>-> do_wp_page()
> >>>-> page_mkwrite(= ext4_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>(major-pagefault)
> >>>-> do_linear_fault()
> >>>-> page_mkwrite(= ext4_mkwrite())
> >>>=> BLOCK!
> >>>
> >>>>
> >>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>file (mmap).
> >>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>while
> >>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>operation
> >>>>>>>while fsfreezing.
> >>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>means we
> >>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>not mean
> >>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>properly
> >>>>>>note the mmap case is one of such exceptions.
> >>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>can't allow
> >>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>path can
> >>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>happens after
> >>>>>fsfreeze operation is done...
> >>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>above
> >>>>when writepage() does not wait for filesystem being frozen or something
> >>>>else?
> >>>Sorry, I didn't understand around the page fault path.
> >>>So, I had read the kernel source code around it, then I maybe
> >>>understand...
> >>>
> >>>I worry whether we can update the file data in mmap case while
> >>>fsfreezing.
> >>>Of course, I understand that we can write to in-memory cache, and it
> >>>is not a
> >>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>problem.
> >>>So, I summarize the cases whether we can write to disk or not.
> >>>
> >>>--------------------------------------------------------------------------
> >>>
> >>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>while fsfreezing)
> >>>
> >>>[1] One of the page which has been mmapped is not bound. And
> >>>the page is not allocated yet. (major fault?)
> >>>
> >>>(1) user dirties a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) __do_fault is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
> >>>
> >>>[2] One of the page which has been mmapped is not bound. But
> >>>the page is already allocated, and the buffer_heads of the page
> >>>are not mapped (BH_Mapped). (minor fault?)
> >>>
> >>>(1) user dirties a page
> >>>(2) a page fault occurs (do_page_fault)
> >>>(3) do_wp_page is called.
> >>>(4) ext4_page_mkwrite is called
> >>>(5) ext4_write_begin is called
> >>>(6) ext4_journal_start_sb => We can STOP!
> 
> What happens in the case as follows:
> 
> Task 1: Mmapped writes
> t1)ext4_page_mkwrite()
>   t2) ext4_write_begin() (FS is thawed so we proceed)
>   t3) ext4_write_end() (journal is stopped now)
> -----Pre-empted-----
> 
> 
> Task 2: Freeze Task
> t4) freezes the super block...
> ...(continues)....
> tn) the page cache is clean and the F.S is frozen. Freeze has
> completed execution.
> 
> Task 1: Mmapped writes
> tn+1) ext4_page_mkwrite() returns 0.
> tn+2) __do_fault() gets control, code gets executed.
> tn+3) __do_fault() marks the page dirty if the intent is to write to
> a file based page which faulted.
> 
> So you end up dirtying the page cache when the F.S is frozen? No?
  You are right, ext4_page_mkwrite() as currently implemented has problems.
You have to return the page locked (and check for frozen fs with page lock
held) to avoid races.

If you check for frozen fs with page lock held, you are guaranteed that
freezing code must wait for the page to get unlocked before proceeding. And
before the page is unlocked, it is marked dirty by the pagefault code which
makes freezing code write the page and writeprotect it again. So everything
will be safe.
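
To make the ordering argument above concrete, here is a minimal user-space
sketch in C. It is not kernel code: toy_state, toy_freeze() and the two
scenario functions are made-up names, freeze is reduced to "write the page
back, then set a frozen flag", and the model assumes the freezing code
actually visits this page.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page-cache page on one filesystem. All names here are
 * hypothetical; nothing below is kernel code. */
struct toy_state {
    bool page_locked;
    bool page_dirty;
    bool fs_frozen;
};

/* Freeze side: write the page back, then mark the fs frozen.
 * Returns false when it would have to sleep on the page lock. */
bool toy_freeze(struct toy_state *s)
{
    if (s->page_locked)
        return false;
    s->page_dirty = false;   /* writeback cleans and write-protects the page */
    s->fs_frozen = true;
    return true;
}

/* Frozen check done without the page lock: freeze fits into the window.
 * Returns true if a dirty page ends up on a frozen fs. */
bool toy_unlocked_check(void)
{
    struct toy_state s = { false, false, false };

    assert(!s.fs_frozen);    /* t2: fs is thawed, so the fault path proceeds */
    toy_freeze(&s);          /* t4..tn: freeze runs to completion */
    s.page_dirty = true;     /* tn+3: fault code dirties the page */
    return s.fs_frozen && s.page_dirty;
}

/* Frozen check done with the page lock held until the page is dirtied. */
bool toy_locked_check(void)
{
    struct toy_state s = { false, false, false };
    bool froze_early;

    s.page_locked = true;    /* page_mkwrite returns the page locked */
    assert(!s.fs_frozen);    /* check made under the page lock */
    froze_early = toy_freeze(&s);
    assert(!froze_early);    /* freezer has to wait for the unlock */
    s.page_dirty = true;     /* dirtied before the unlock */
    s.page_locked = false;
    toy_freeze(&s);          /* freeze now writes the page and completes */
    return s.fs_frozen && s.page_dirty;
}
```

In this model toy_unlocked_check() ends with a dirty page on a frozen fs,
while toy_locked_check() does not, because the freezer cannot get past the
locked page until it has been dirtied and unlocked.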

Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
stable pages during writeback need that as well so it's a reasonable thing
to do). So something like the attached patches should do what's needed - it's
lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.

								Honza

Comments

Surbhi Palande May 4, 2011, 12:09 p.m. UTC | #1
On 05/03/2011 06:19 PM, Jan Kara wrote:
> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>> (2011/04/16 2:13), Jan Kara wrote:
>>>> Hello,
>>>>
>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>> probably
>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>> because
>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>> and so
>>>>>> we could possibly modify the filesystem.
>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>> minor-faulting pages.
>>>>>
>>>>> (minor-pagefault)
>>>>> ->  do_wp_page()
>>>>> ->  page_mkwrite(= ext4_mkwrite())
>>>>> =>  BLOCK!
>>>>>
>>>>> (major-pagefault)
>>>>> ->  do_linear_fault()
>>>>> ->  page_mkwrite(= ext4_mkwrite())
>>>>> =>  BLOCK!
>>>>>
>>>>>>
>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>> file (mmap).
>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>> while
>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>> operation
>>>>>>>>> while fsfreezing.
>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>> means we
>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>> not mean
>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>> properly
>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>> can't allow
>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>> path can
>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>> happens after
>>>>>>> fsfreeze operation is done...
>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>> above
>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>> else?
>>>>> Sorry, I didn't understand around the page fault path.
>>>>> So, I had read the kernel source code around it, then I maybe
>>>>> understand...
>>>>>
>>>>> I worry whether we can update the file data in mmap case while
>>>>> fsfreezing.
>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>> is not a
>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>> problem.
>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>> while fsfreezing)
>>>>>
>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>> the page is not allocated yet. (major fault?)
>>>>>
>>>>> (1) user dirties a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) __do_fault is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb =>  We can STOP!
>>>>>
>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>> the page is already allocated, and the buffer_heads of the page
>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>
>>>>> (1) user dirties a page
>>>>> (2) a page fault occurs (do_page_fault)
>>>>> (3) do_wp_page is called.
>>>>> (4) ext4_page_mkwrite is called
>>>>> (5) ext4_write_begin is called
>>>>> (6) ext4_journal_start_sb =>  We can STOP!
>>
>> What happens in the case as follows:
>>
>> Task 1: Mmapped writes
>> t1)ext4_page_mkwrite()
>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>    t3) ext4_write_end() (journal is stopped now)
>> -----Pre-empted-----
>>
>>
>> Task 2: Freeze Task
>> t4) freezes the super block...
>> ...(continues)....
>> tn) the page cache is clean and the F.S is frozen. Freeze has
>> completed execution.
>>
>> Task 1: Mmapped writes
>> tn+1) ext4_page_mkwrite() returns 0.
>> tn+2) __do_fault() gets control, code gets executed.
>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>> a file based page which faulted.
>>
>> So you end up dirtying the page cache when the F.S is frozen? No?
>    You are right, ext4_page_mkwrite() as currently implemented has problems.
> You have to return the page locked (and check for frozen fs with page lock
> held) to avoid races.
>
> If you check for frozen fs with page lock held, you are guaranteed that
> freezing code must wait for the page to get unlocked before proceeding. And
> before the page is unlocked, it is marked dirty by the pagefault code which
> makes freezing code write the page and writeprotect it again. So everything
> will be safe.
For the locked page to be a part of the freeze initiated sync, should 
its owner inode not be dirtied? The page fault handler dirties the page, 
but who ensures that the inode is dirtied at this point?

Thanks!

Warm Regards,
Surbhi.



>
> Doing this cleanly requires some cleanups to ext4_page_mkwrite() (but
> stable pages during writeback need that as well so it's a reasonable thing
> to do). So something like attached patches should do what's needed - it's
> lightly tested with fsx in delalloc, nodelalloc, and data=journal configs.
>
> 								Honza

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara May 4, 2011, 7:19 p.m. UTC | #2
On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> On 05/03/2011 06:19 PM, Jan Kara wrote:
> >On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
> >>>(2011/04/16 2:13), Jan Kara wrote:
> >>>>Hello,
> >>>>
> >>>>On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
> >>>>>>For ext3 or ext4 without delayed allocation we block inside writepage()
> >>>>>>function. But as I wrote to Dave Chinner, ->page_mkwrite() should
> >>>>>>probably
> >>>>>>get modified to block while minor-faulting the page on frozen fs
> >>>>>>because
> >>>>>>when blocks are already allocated we may skip starting a transaction
> >>>>>>and so
> >>>>>>we could possibly modify the filesystem.
> >>>>>OK. I think ->page_mkwrite() should also block writing the
> >>>>>minor-faulting pages.
> >>>>>
> >>>>>(minor-pagefault)
> >>>>>->  do_wp_page()
> >>>>>->  page_mkwrite(= ext4_mkwrite())
> >>>>>=>  BLOCK!
> >>>>>
> >>>>>(major-pagefault)
> >>>>>->  do_linear_fault()
> >>>>>->  page_mkwrite(= ext4_mkwrite())
> >>>>>=>  BLOCK!
> >>>>>
> >>>>>>
> >>>>>>>>>Mizuma-san's reproducer also writes the data which maps to the
> >>>>>>>>>file (mmap).
> >>>>>>>>>The original problem happens after the fsfreeze operation is done.
> >>>>>>>>>I understand the normal write operation (not mmap) can be blocked
> >>>>>>>>>while
> >>>>>>>>>fsfreezing. So, I guess we don't always block all the write
> >>>>>>>>>operation
> >>>>>>>>>while fsfreezing.
> >>>>>>>>Technically speaking, we block all the transaction starts which
> >>>>>>>>means we
> >>>>>>>>end up blocking all the writes from going to disk. But that does
> >>>>>>>>not mean
> >>>>>>>>we block all the writes from going to in-memory cache - as you
> >>>>>>>>properly
> >>>>>>>>note the mmap case is one of such exceptions.
> >>>>>>>Hm, I also think we can allow the writes to in-memory cache but we
> >>>>>>>can't allow
> >>>>>>>the writes to disk while fsfreezing. I am considering that mmap
> >>>>>>>path can
> >>>>>>>write to disk while fsfreezing because this deadlock problem
> >>>>>>>happens after
> >>>>>>>fsfreeze operation is done...
> >>>>>>I'm sorry I don't understand now - are you speaking about the case
> >>>>>>above
> >>>>>>when writepage() does not wait for filesystem being frozen or something
> >>>>>>else?
> >>>>>Sorry, I didn't understand around the page fault path.
> >>>>>So, I had read the kernel source code around it, then I maybe
> >>>>>understand...
> >>>>>
> >>>>>I worry whether we can update the file data in mmap case while
> >>>>>fsfreezing.
> >>>>>Of course, I understand that we can write to in-memory cache, and it
> >>>>>is not a
> >>>>>problem. However, if we can write to disk while fsfreezing, it is a
> >>>>>problem.
> >>>>>So, I summarize the cases whether we can write to disk or not.
> >>>>>
> >>>>>--------------------------------------------------------------------------
> >>>>>
> >>>>>Cases (Whether we can write the data mmapped to the file on the disk
> >>>>>while fsfreezing)
> >>>>>
> >>>>>[1] One of the page which has been mmapped is not bound. And
> >>>>>the page is not allocated yet. (major fault?)
> >>>>>
> >>>>>(1) user dirties a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) __do_fault is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb =>  We can STOP!
> >>>>>
> >>>>>[2] One of the page which has been mmapped is not bound. But
> >>>>>the page is already allocated, and the buffer_heads of the page
> >>>>>are not mapped (BH_Mapped). (minor fault?)
> >>>>>
> >>>>>(1) user dirties a page
> >>>>>(2) a page fault occurs (do_page_fault)
> >>>>>(3) do_wp_page is called.
> >>>>>(4) ext4_page_mkwrite is called
> >>>>>(5) ext4_write_begin is called
> >>>>>(6) ext4_journal_start_sb =>  We can STOP!
> >>
> >>What happens in the case as follows:
> >>
> >>Task 1: Mmapped writes
> >>t1)ext4_page_mkwrite()
> >>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>   t3) ext4_write_end() (journal is stopped now)
> >>-----Pre-empted-----
> >>
> >>
> >>Task 2: Freeze Task
> >>t4) freezes the super block...
> >>...(continues)....
> >>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>completed execution.
> >>
> >>Task 1: Mmapped writes
> >>tn+1) ext4_page_mkwrite() returns 0.
> >>tn+2) __do_fault() gets control, code gets executed.
> >>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>a file based page which faulted.
> >>
> >>So you end up dirtying the page cache when the F.S is frozen? No?
> >   You are right, ext4_page_mkwrite() as currently implemented has problems.
> >You have to return the page locked (and check for frozen fs with page lock
> >held) to avoid races.
> >
> >If you check for frozen fs with page lock held, you are guaranteed that
> >freezing code must wait for the page to get unlocked before proceeding. And
> >before the page is unlocked, it is marked dirty by the pagefault code which
> >makes freezing code write the page and writeprotect it again. So everything
> >will be safe.
> For the locked page to be a part of the freeze initiated sync,
> should its owner inode not be dirtied? The page fault handler
> dirties the page, but who ensures that the inode is dirtied at this
> point?
  Follow the path from set_page_dirty() -> __set_page_dirty_buffers()
-> __set_page_dirty() -> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
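
The call chain above can be pictured with a small user-space sketch in C.
The model_* names are made up for illustration and only mimic the shape of
the kernel path; the "return true only when the page is newly dirtied"
behaviour is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the inode and page state touched by the path. */
struct model_inode { bool dirty_pages; };   /* stands in for I_DIRTY_PAGES */
struct model_page {
    bool dirty;                 /* page dirty flag */
    bool tagged_dirty;          /* dirty tag in the mapping's radix tree */
    struct model_inode *host;   /* stands in for mapping->host */
};

/* Last step of the chain: the owning inode is marked dirty. */
void model_mark_inode_dirty(struct model_inode *inode)
{
    inode->dirty_pages = true;
}

/* Middle step: tag the page dirty in the mapping, then dirty the inode. */
void model_set_page_dirty_tag(struct model_page *page)
{
    assert(page->host);
    page->tagged_dirty = true;
    model_mark_inode_dirty(page->host);
}

/* Entry point: returns true only if the page was not dirty before. */
bool model_set_page_dirty(struct model_page *page)
{
    if (page->dirty)
        return false;
    page->dirty = true;
    model_set_page_dirty_tag(page);
    return true;
}
```

Dirtying a clean page through model_set_page_dirty() leaves both the page
and its host inode marked dirty, which is the point of the chain.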

  More code reading would save you (and me) some typing ;).

								Honza
Surbhi Palande May 4, 2011, 9:34 p.m. UTC | #3
On 05/04/2011 10:19 PM, Jan Kara wrote:
> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>> On 04/18/2011 12:05 PM, Toshiyuki Okajima wrote:
>>>>> (2011/04/16 2:13), Jan Kara wrote:
>>>>>> Hello,
>>>>>>
>>>>>> On Fri 15-04-11 22:39:07, Toshiyuki Okajima wrote:
>>>>>>>> For ext3 or ext4 without delayed allocation we block inside writepage()
>>>>>>>> function. But as I wrote to Dave Chinner, ->page_mkwrite() should
>>>>>>>> probably
>>>>>>>> get modified to block while minor-faulting the page on frozen fs
>>>>>>>> because
>>>>>>>> when blocks are already allocated we may skip starting a transaction
>>>>>>>> and so
>>>>>>>> we could possibly modify the filesystem.
>>>>>>> OK. I think ->page_mkwrite() should also block writing the
>>>>>>> minor-faulting pages.
>>>>>>>
>>>>>>> (minor-pagefault)
>>>>>>> ->   do_wp_page()
>>>>>>> ->   page_mkwrite(= ext4_mkwrite())
>>>>>>> =>   BLOCK!
>>>>>>>
>>>>>>> (major-pagefault)
>>>>>>> ->   do_linear_fault()
>>>>>>> ->   page_mkwrite(= ext4_mkwrite())
>>>>>>> =>   BLOCK!
>>>>>>>
>>>>>>>>
>>>>>>>>>>> Mizuma-san's reproducer also writes the data which maps to the
>>>>>>>>>>> file (mmap).
>>>>>>>>>>> The original problem happens after the fsfreeze operation is done.
>>>>>>>>>>> I understand the normal write operation (not mmap) can be blocked
>>>>>>>>>>> while
>>>>>>>>>>> fsfreezing. So, I guess we don't always block all the write
>>>>>>>>>>> operation
>>>>>>>>>>> while fsfreezing.
>>>>>>>>>> Technically speaking, we block all the transaction starts which
>>>>>>>>>> means we
>>>>>>>>>> end up blocking all the writes from going to disk. But that does
>>>>>>>>>> not mean
>>>>>>>>>> we block all the writes from going to in-memory cache - as you
>>>>>>>>>> properly
>>>>>>>>>> note the mmap case is one of such exceptions.
>>>>>>>>> Hm, I also think we can allow the writes to in-memory cache but we
>>>>>>>>> can't allow
>>>>>>>>> the writes to disk while fsfreezing. I am considering that mmap
>>>>>>>>> path can
>>>>>>>>> write to disk while fsfreezing because this deadlock problem
>>>>>>>>> happens after
>>>>>>>>> fsfreeze operation is done...
>>>>>>>> I'm sorry I don't understand now - are you speaking about the case
>>>>>>>> above
>>>>>>>> when writepage() does not wait for filesystem being frozen or something
>>>>>>>> else?
>>>>>>> Sorry, I didn't understand around the page fault path.
>>>>>>> So, I had read the kernel source code around it, then I maybe
>>>>>>> understand...
>>>>>>>
>>>>>>> I worry whether we can update the file data in mmap case while
>>>>>>> fsfreezing.
>>>>>>> Of course, I understand that we can write to in-memory cache, and it
>>>>>>> is not a
>>>>>>> problem. However, if we can write to disk while fsfreezing, it is a
>>>>>>> problem.
>>>>>>> So, I summarize the cases whether we can write to disk or not.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> Cases (Whether we can write the data mmapped to the file on the disk
>>>>>>> while fsfreezing)
>>>>>>>
>>>>>>> [1] One of the page which has been mmapped is not bound. And
>>>>>>> the page is not allocated yet. (major fault?)
>>>>>>>
>>>>>>> (1) user dirties a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) __do_fault is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb =>   We can STOP!
>>>>>>>
>>>>>>> [2] One of the page which has been mmapped is not bound. But
>>>>>>> the page is already allocated, and the buffer_heads of the page
>>>>>>> are not mapped (BH_Mapped). (minor fault?)
>>>>>>>
>>>>>>> (1) user dirties a page
>>>>>>> (2) a page fault occurs (do_page_fault)
>>>>>>> (3) do_wp_page is called.
>>>>>>> (4) ext4_page_mkwrite is called
>>>>>>> (5) ext4_write_begin is called
>>>>>>> (6) ext4_journal_start_sb =>   We can STOP!
>>>>
>>>> What happens in the case as follows:
>>>>
>>>> Task 1: Mmapped writes
>>>> t1)ext4_page_mkwrite()
>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>    t3) ext4_write_end() (journal is stopped now)
>>>> -----Pre-empted-----
>>>>
>>>>
>>>> Task 2: Freeze Task
>>>> t4) freezes the super block...
>>>> ...(continues)....
>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>> completed execution.
>>>>
>>>> Task 1: Mmapped writes
>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>> tn+2) __do_fault() gets control, code gets executed.
>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>> a file based page which faulted.
>>>>
>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>    You are right, ext4_page_mkwrite() as currently implemented has problems.
>>> You have to return the page locked (and check for frozen fs with page lock
>>> held) to avoid races.
>>>
>>> If you check for frozen fs with page lock held, you are guaranteed that
>>> freezing code must wait for the page to get unlocked before proceeding. And
>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>> makes freezing code write the page and writeprotect it again. So everything
>>> will be safe.
>> For the locked page to be a part of the freeze initiated sync,
>> should its owner inode not be dirtied? The page fault handler
>> dirties the page, but who ensures that the inode is dirtied at this
>> point?
Well, I mean it as follows:

Doesn't the writeback code (invoked via sync_filesystem(sb)) write all 
the dirty pages of all the _dirty_ inodes of a superblock?

So in the window from the point where ext4_page_mkwrite returns to 
__do_fault() _till_ you mark the inode dirty (in __mark_inode_dirty()), 
you can have a race with freeze i.e if freeze happens meanwhile, then 
the sync initiated by freeze will not consider this locked page as the 
owner inode is _clean_ (or not dirtied yet) at that point?

Key: tx: time at unit x

P1: mmapped writes
t1) __do_page_fault()
    t2) ext4_page_mkwrite()
       // owner inode of the page is in _clean_ state - not yet dirtied
    --- pre-empted---

P2: Freeze_super
tn) freeze_super gets control
freezes the F.S, skips the owner inode as it is in the clean state. 
syncs all the other dirty inodes. page cache is now clean.


P1: mmapped writes (resume)
tn+x)__do_page_fault() gets control back:
    tn+x+1) set_page_dirty()
      tn+x+2) __set_page_dirty_buffers()
         tn+x+3) __set_page_dirty()
  	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)

So don't we end up dirtying the page cache when the F.S is frozen?

Again, apologies if I understood the writeback code or something else wrong!
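
The timeline above can be replayed as a small user-space sketch in C. The
sim_* names are hypothetical, and the model assumes what the question
assumes: that the sync started by freeze only looks at pages of inodes that
are already marked dirty.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one inode with one page on one filesystem. */
struct sim_fs {
    bool frozen;
    bool inode_dirty;
    bool page_locked;
    bool page_dirty;
};

/* Freeze: sync only dirty inodes, then mark the fs frozen.
 * Returns false when the sync would have to wait on a locked page. */
bool sim_freeze_super(struct sim_fs *fs)
{
    if (fs->inode_dirty) {
        if (fs->page_locked)
            return false;
        fs->page_dirty = false;
        fs->inode_dirty = false;
    }
    fs->frozen = true;
    return true;
}

/* Replays P1/P2 from the timeline. Returns true if the page is dirtied
 * after the freeze has completed. */
bool sim_inode_clean_race(void)
{
    struct sim_fs fs = { false, false, false, false };
    bool froze;

    fs.page_locked = true;        /* t2: page_mkwrite done, page locked */
    assert(!fs.inode_dirty);      /* owner inode is still clean */
    froze = sim_freeze_super(&fs);/* tn: clean inode is skipped entirely */
    fs.page_dirty = true;         /* tn+x+1: set_page_dirty() */
    fs.inode_dirty = true;        /* inode is dirtied only now */
    fs.page_locked = false;
    return froze && fs.frozen && fs.page_dirty;
}
```

In this model sim_inode_clean_race() returns true: the page lock never
comes into play because the sync has no reason to look at the page.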

Warm Regards,
Surbhi.

>    Follow the path from set_page_dirty() ->  __set_page_dirty_buffers()
> ->  __set_page_dirty() ->  __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);


>
>    More code reading would save you (and me) some typing ;).
P/S: Sorry about that!

>
> 								Honza

Jan Kara May 4, 2011, 10:48 p.m. UTC | #4
On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> On 05/04/2011 10:19 PM, Jan Kara wrote:
> >On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>What happens in the case as follows:
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>t1)ext4_page_mkwrite()
> >>>>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>>   t3) ext4_write_end() (journal is stopped now)
> >>>>-----Pre-empted-----
> >>>>
> >>>>
> >>>>Task 2: Freeze Task
> >>>>t4) freezes the super block...
> >>>>...(continues)....
> >>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>completed execution.
> >>>>
> >>>>Task 1: Mmapped writes
> >>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>>>a file based page which faulted.
> >>>>
> >>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>>   You are right, ext4_page_mkwrite() as currently implemented has problems.
> >>>You have to return the page locked (and check for frozen fs with page lock
> >>>held) to avoid races.
> >>>
> >>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>makes freezing code write the page and writeprotect it again. So everything
> >>>will be safe.
> >>For the locked page to be a part of the freeze initiated sync,
> >>should its owner inode not be dirtied? The page fault handler
> >>dirties the page, but who ensures that the inode is dirtied at this
> >>point?
> Well, I mean it as follows:
> 
> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> all the dirty pages of all the _dirty_ inodes of a superblock?
> 
> So in the window from the point where ext4_page_mkwrite returns to
> __do_fault() _till_ you mark the inode dirty (in
> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
> happens meanwhile, then the sync initiated by freeze will not
> consider this locked page as the owner inode is _clean_ (or not
> dirtied yet) at that point?
  Ah, I see. That's actually a good point! Thanks for your persistence. So we
should also dirty the page before checking for frozen fs.
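
A minimal user-space sketch in C of that ordering, under the same
assumption that the freeze sync only visits dirty inodes. The fix_* names
are made up, and the case where the fs is already frozen at the time of the
check (where the fault would have to wait) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: one inode with one page on one filesystem. */
struct fix_fs {
    bool frozen;
    bool inode_dirty;
    bool page_locked;
    bool page_dirty;
};

/* Freeze: sync only dirty inodes, then mark the fs frozen.
 * Returns false when the sync would have to wait on a locked page. */
bool fix_freeze_super(struct fix_fs *fs)
{
    if (fs->inode_dirty) {
        if (fs->page_locked)
            return false;
        fs->page_dirty = false;
        fs->inode_dirty = false;
    }
    fs->frozen = true;
    return true;
}

/* Fault path that dirties the page before the frozen check, all under the
 * page lock. Returns true if a dirty page ends up on a frozen fs. */
bool fix_dirty_then_check(void)
{
    struct fix_fs fs = { false, false, false, false };
    bool froze_early;

    fs.page_locked = true;
    fs.page_dirty = true;         /* dirty first ... */
    fs.inode_dirty = true;        /* ... which also dirties the inode */
    assert(!fs.frozen);           /* frozen check, page still locked */
    froze_early = fix_freeze_super(&fs);
    assert(!froze_early);         /* sync now finds the inode and waits */
    fs.page_locked = false;
    fix_freeze_super(&fs);        /* freeze writes the page, then completes */
    return fs.frozen && fs.page_dirty;
}
```

Here fix_dirty_then_check() returns false: because the inode is already
dirty when freeze runs, the sync reaches the locked page and has to wait
for it, so the page is written out before the fs becomes frozen.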

> Key: tx: time at unit x
> 
> P1: mmapped writes
> t1) __do_page_fault()
>    t2) ext4_page_mkwrite()
>       // owner inode of the page is in _clean_ state - not yet dirtied
>    --- pre-empted---
> 
> P2: Freeze_super
> tn) freeze_super gets control
> freezes the F.S, skips the owner inode as it is in the clean state.
> syncs all the other dirty inodes. page cache is now clean.
> 
> 
> P1: mmapped writes (resume)
> tn+x)__do_page_fault() gets control back:
>    tn+x+1) set_page_dirty()
>      tn+x+2) __set_page_dirty_buffers()
>         tn+x+3) __set_page_dirty()
>  	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
> 
> So don't we end up dirtying the page cache when the F.S is frozen?
> 
> Again, apologies if I understood the writeback code or something else wrong!
  No, you understood it right. Just your previous email was too generic so
I have not thought about this particular race.

									Honza
Surbhi Palande May 5, 2011, 6:06 a.m. UTC | #5
On 05/05/2011 01:48 AM, Jan Kara wrote:
> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>> What happens in the case as follows:
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> t1)ext4_page_mkwrite()
>>>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>>    t3) ext4_write_end() (journal is stopped now)
>>>>>> -----Pre-empted-----
>>>>>>
>>>>>>
>>>>>> Task 2: Freeze Task
>>>>>> t4) freezes the super block...
>>>>>> ...(continues)....
>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>> completed execution.
>>>>>>
>>>>>> Task 1: Mmapped writes
>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>>>> a file based page which faulted.
>>>>>>
>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>>    You are right, ext4_page_mkwrite() as currently implemented has problems.
>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>> held) to avoid races.
>>>>>
>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>> will be safe.
>>>> For the locked page to be a part of the freeze initiated sync,
>>>> should its owner inode not be dirtied? The page fault handler
>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>> point?
>> Well, I mean it as follows:
>>
>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>
>> So in the window from the point where ext4_page_mkwrite returns to
>> __do_fault() _till_ you mark the inode dirty (in
>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>> happens meanwhile, then the sync initiated by freeze will not
>> consider this locked page as the owner inode is _clean_ (or not
>> dirtied yet) at that point?
>    Ah, I see. That's actually a good point! Thanks for your persistence. So we
> should also dirty the page before checking for frozen fs.

Should we not also dirty the inode? IMHO, marking the inode dirty will be racy
as well!

Warm Regards,
Surbhi.

>
>> Key: tx: time at unit x
>>
>> P1: mmapped writes
>> t1) __do_page_fault()
>>     t2) ext4_page_mkwrite()
>>        // owner inode of the page is in _clean_ state - not yet dirtied
>>     --- pre-empted---
>>
>> P2: Freeze_super
>> tn) freeze_super gets control
>> freezes the F.S, skips the owner inode as it is in the clean state.
>> syncs all the other dirty inodes. page cache is now clean.
>>
>>
>> P1: mmapped writes (resume)
>> tn+x)__do_page_fault() gets control back:
>>     tn+x+1) set_page_dirty()
>>       tn+x+2) __set_page_dirty_buffers()
>>          tn+x+3) __set_page_dirty()
>>   	   tn+x+4) radix_tree_tag_set(page, PAGECACHE_TAG_DIRTY)
>>
>> So don't we end up dirtying the page cache when the F.S is frozen?
>>
>> Again, apologies if I understood the writeback code or something else wrong!
>    No, you understood it right. Just your previous email was too generic so
> I have not thought about this particular race.
>
> 									Honza

Jan Kara May 5, 2011, 11:18 a.m. UTC | #6
On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
> On 05/05/2011 01:48 AM, Jan Kara wrote:
> >On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
> >>On 05/04/2011 10:19 PM, Jan Kara wrote:
> >>>On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
> >>>>On 05/03/2011 06:19 PM, Jan Kara wrote:
> >>>>>On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
> >>>>>>What happens in the case as follows:
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>t1)ext4_page_mkwrite()
> >>>>>>   t2) ext4_write_begin() (FS is thawed so we proceed)
> >>>>>>   t3) ext4_write_end() (journal is stopped now)
> >>>>>>-----Pre-empted-----
> >>>>>>
> >>>>>>
> >>>>>>Task 2: Freeze Task
> >>>>>>t4) freezes the super block...
> >>>>>>...(continues)....
> >>>>>>tn) the page cache is clean and the F.S is frozen. Freeze has
> >>>>>>completed execution.
> >>>>>>
> >>>>>>Task 1: Mmapped writes
> >>>>>>tn+1) ext4_page_mkwrite() returns 0.
> >>>>>>tn+2) __do_fault() gets control, code gets executed.
> >>>>>>tn+3) __do_fault() marks the page dirty if the intent is to write to
> >>>>>>a file based page which faulted.
> >>>>>>
> >>>>>>So you end up dirtying the page cache when the F.S is frozen? No?
> >>>>>   You are right, ext4_page_mkwrite() as currently implemented has problems.
> >>>>>You have to return the page locked (and check for frozen fs with page lock
> >>>>>held) to avoid races.
> >>>>>
> >>>>>If you check for frozen fs with page lock held, you are guaranteed that
> >>>>>freezing code must wait for the page to get unlocked before proceeding. And
> >>>>>before the page is unlocked, it is marked dirty by the pagefault code which
> >>>>>makes freezing code write the page and writeprotect it again. So everything
> >>>>>will be safe.
> >>>>For the locked page to be a part of the freeze initiated sync,
> >>>>should its owner inode not be dirtied? The page fault handler
> >>>>dirties the page, but who ensures that the inode is dirtied at this
> >>>>point?
> >>Well, I mean it as follows:
> >>
> >>Doesn't the writeback code (invoked via sync_filesystem(sb)) write
> >>all the dirty pages of all the _dirty_ inodes of a superblock?
> >>
> >>So in the window from the point where ext4_page_mkwrite returns to
> >>__do_fault() _till_ you mark the inode dirty (in
> >>__mark_inode_dirty()), you can have a race with freeze i.e if freeze
> >>happens meanwhile, then the sync initiated by freeze will not
> >>consider this locked page as the owner inode is _clean_ (or not
> >>dirtied yet) at that point?
> >   Ah, I see. That's actually a good point! Thanks for persistence. So we
> >should also dirty the page before checking for frozen fs.
> 
> Should we not also dirty the inode? IMHO, marking an inode will be
> racy as well!
  Marking the page dirty marks the inode dirty as well, as I've explained in my
previous emails. So I'm missing what you are concerned about...

								Honza
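
The point that dirtying a page also dirties its owner inode can be pictured with a minimal userspace sketch. It is a toy model, not kernel code: struct toy_inode, toy_set_page_dirty() and toy_sync_inode() are hypothetical names invented for illustration. The only assumption taken from the thread is that set_page_dirty() ends up marking the inode dirty, and that the freeze-time sync skips inodes that are still clean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 4

/* Hypothetical stand-in for an inode and the dirty tags of its pages. */
struct toy_inode {
    bool dirty;                  /* inode is on the list that sync walks */
    bool page_dirty[TOY_PAGES];  /* per-page dirty tag */
};

/* Models set_page_dirty(): tagging the page also dirties the owner inode,
 * so there is no separate step in which only the page is dirty. */
static void toy_set_page_dirty(struct toy_inode *inode, size_t idx)
{
    assert(idx < TOY_PAGES);
    inode->page_dirty[idx] = true;
    inode->dirty = true;
}

/* Models one sync pass: a clean inode is skipped entirely.
 * Returns the number of pages written back. */
static int toy_sync_inode(struct toy_inode *inode)
{
    int written = 0;

    if (!inode->dirty)
        return 0;
    for (size_t i = 0; i < TOY_PAGES; i++) {
        if (inode->page_dirty[i]) {
            inode->page_dirty[i] = false;
            written++;
        }
    }
    inode->dirty = false;
    return written;
}
```

In this model a sync pass that runs after toy_set_page_dirty() always sees the page, because the page tag and the inode flag are set in the same call.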
Surbhi Palande May 5, 2011, 2:01 p.m. UTC | #7
On 05/05/2011 02:18 PM, Jan Kara wrote:
> On Thu 05-05-11 09:06:29, Surbhi Palande wrote:
>> On 05/05/2011 01:48 AM, Jan Kara wrote:
>>> On Thu 05-05-11 00:34:51, Surbhi Palande wrote:
>>>> On 05/04/2011 10:19 PM, Jan Kara wrote:
>>>>> On Wed 04-05-11 15:09:37, Surbhi Palande wrote:
>>>>>> On 05/03/2011 06:19 PM, Jan Kara wrote:
>>>>>>> On Tue 03-05-11 14:01:50, Surbhi Palande wrote:
>>>>>>>> What happens in the case as follows:
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> t1)ext4_page_mkwrite()
>>>>>>>>    t2) ext4_write_begin() (FS is thawed so we proceed)
>>>>>>>>    t3) ext4_write_end() (journal is stopped now)
>>>>>>>> -----Pre-empted-----
>>>>>>>>
>>>>>>>>
>>>>>>>> Task 2: Freeze Task
>>>>>>>> t4) freezes the super block...
>>>>>>>> ...(continues)....
>>>>>>>> tn) the page cache is clean and the F.S is frozen. Freeze has
>>>>>>>> completed execution.
>>>>>>>>
>>>>>>>> Task 1: Mmapped writes
>>>>>>>> tn+1) ext4_page_mkwrite() returns 0.
>>>>>>>> tn+2) __do_fault() gets control, code gets executed.
>>>>>>>> tn+3) __do_fault() marks the page dirty if the intent is to write to
>>>>>>>> a file based page which faulted.
>>>>>>>>
>>>>>>>> So you end up dirtying the page cache when the F.S is frozen? No?
>>>>>>>    You are right, ext4_page_mkwrite() as currently implemented has problems.
>>>>>>> You have to return the page locked (and check for frozen fs with page lock
>>>>>>> held) to avoid races.
>>>>>>>
>>>>>>> If you check for frozen fs with page lock held, you are guaranteed that
>>>>>>> freezing code must wait for the page to get unlocked before proceeding. And
>>>>>>> before the page is unlocked, it is marked dirty by the pagefault code which
>>>>>>> makes freezing code write the page and writeprotect it again. So everything
>>>>>>> will be safe.
>>>>>> For the locked page to be a part of the freeze initiated sync,
>>>>>> should its owner inode not be dirtied? The page fault handler
>>>>>> dirties the page, but who ensures that the inode is dirtied at this
>>>>>> point?
>>>> Well, I mean it as follows:
>>>>
>>>> Doesn't the writeback code (invoked via sync_filesystem(sb)) write
>>>> all the dirty pages of all the _dirty_ inodes of a superblock?
>>>>
>>>> So in the window from the point where ext4_page_mkwrite returns to
>>>> __do_fault() _till_ you mark the inode dirty (in
>>>> __mark_inode_dirty()), you can have a race with freeze i.e if freeze
>>>> happens meanwhile, then the sync initiated by freeze will not
>>>> consider this locked page as the owner inode is _clean_ (or not
>>>> dirtied yet) at that point?
>>>    Ah, I see. That's actually a good point! Thanks for persistence. So we
>>> should also dirty the page before checking for frozen fs.
>>
>> Should we not also dirty the inode? IMHO, marking an inode will be
>> racy as well!
>    Marking the page dirty marks the inode dirty as well, as I've explained in my
> previous emails. So I'm missing what you are concerned about...

Yes you are right! There is no other concern - setting the page dirty 
will be racy.

Warm Regards,
Surbhi.
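
The ordering argument in this thread (check for a frozen filesystem with the page lock held, and return with the page still locked) can be traced with a small single-threaded model. This is only a sketch under stated assumptions: struct toy_fs and every toy_* function are hypothetical, the page lock is a plain flag, and "blocking" is modelled by a return value instead of a real sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state: one filesystem with a single page. */
struct toy_fs {
    bool frozen;       /* stands in for s_frozen != SB_UNFROZEN */
    bool page_locked;
    bool page_dirty;
};

enum toy_fault { TOY_FAULT_LOCKED, TOY_FAULT_RETRY };

/* Fault path, first half: take the page lock, then test for freezing.
 * On success the page stays locked, as with VM_FAULT_LOCKED. */
static enum toy_fault toy_page_mkwrite(struct toy_fs *fs)
{
    assert(!fs->page_locked);
    fs->page_locked = true;
    if (fs->frozen) {
        fs->page_locked = false;   /* back off; the caller retries later */
        return TOY_FAULT_RETRY;
    }
    return TOY_FAULT_LOCKED;
}

/* Fault path, second half: dirty the page, then drop the lock. */
static void toy_finish_fault(struct toy_fs *fs)
{
    assert(fs->page_locked);
    fs->page_dirty = true;
    fs->page_locked = false;
}

/* Freeze, step 1: stop new writers. */
static void toy_freeze_begin(struct toy_fs *fs)
{
    fs->frozen = true;
}

/* Freeze, step 2: write back the page. Returns false where the real
 * writeback would sleep on the page lock. */
static bool toy_freeze_writeback(struct toy_fs *fs)
{
    if (fs->page_locked)
        return false;
    fs->page_dirty = false;   /* written out and write-protected again */
    return true;
}
```

Two interleavings are worth walking through. If the fault takes the page lock first, toy_freeze_writeback() cannot make progress until toy_finish_fault() has dirtied and unlocked the page, so the page is written out before the freeze completes. If the freeze starts first, toy_page_mkwrite() sees the frozen flag under the lock and backs off without dirtying anything.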

Patch

From ee1f2f8cdea23cf19b34e51b4f78e040ce898976 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 3 May 2011 17:00:35 +0200
Subject: [PATCH 3/3] ext4: Block mmapped writes while the fs is frozen

We should not allow file modification via mmap while the filesystem is
frozen. So block in ext4_page_mkwrite() while the filesystem is frozen.

We have to check for a frozen filesystem under the page lock, and then return
from ext4_page_mkwrite() with that lock still held. Only that way can we avoid
racing with the writeback done by the freezing code: either we lock the page
after the writeback has started, see freezing in progress and block, or
writeback waits for our page lock, which is released only when the fault is
done, and then writes out and writeprotects the page again.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   41 ++++++++++++++++++++++++-----------------
 1 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 377fed0..6faadaf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5788,13 +5788,6 @@  static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
 	return !buffer_mapped(bh);
 }
 
-static int ext4_journalled_fault_fn(handle_t *handle, struct buffer_head *bh)
-{
-	if (!buffer_dirty(bh))
-		return 0;
-	return ext4_handle_dirty_metadata(handle, NULL, bh);
-}
-
 int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	struct page *page = vmf->page;
@@ -5804,10 +5797,16 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct file *file = vma->vm_file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
-	handle_t handle;
-	get_block_t get_block;
+	handle_t *handle;
+	get_block_t *get_block;
 	int retries = 0;
 
+restart:
+	/*
+	 * This check is racy but catches the common case. The check at the
+	 * end of this function is reliable.
+	 */
+	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
@@ -5834,10 +5833,8 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	else
 		len = PAGE_CACHE_SIZE;
 	/*
-	 * Return if we have all the buffers mapped. This avoid
-	 * the need to call write_begin/write_end which does a
-	 * journal_start/journal_stop which can block and take
-	 * long time
+	 * Return if we have all the buffers mapped. This avoids the need to do
+	 * journal_start/journal_stop which can block and take a long time
 	 */
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
@@ -5852,7 +5849,7 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		get_block = ext4_get_block_write;
 	else
 		get_block = ext4_get_block;
-retry:
+retry_alloc:
 	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
 	if (IS_ERR(handle)) {
 		ret = VM_FAULT_SIGBUS;
@@ -5861,16 +5858,16 @@  retry:
 	ret = __block_page_mkwrite(vma, vmf, get_block);
 	if (ret == VM_FAULT_LOCKED && ext4_should_journal_data(inode)) {
 		if (walk_page_buffers(handle, page_buffers(page), 0,
-		 	  PAGE_CACHE_SIZE, NULL, ext4_journalled_fault_fn)) {
+		 	  PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) {
 			unlock_page(page);
 			ret = VM_FAULT_SIGBUS;
 			goto out;
 		}
 		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 	}
-	ext4_journal_end(handle);
+	ext4_journal_stop(handle);
 	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-		goto retry;
+		goto retry_alloc;
 out_ret:
 	if (ret < 0) {
 		if (ret == -ENOMEM)
@@ -5879,5 +5876,15 @@  out_ret:
 			ret = VM_FAULT_SIGBUS;
 	}
 out:
+	/*
+	 * Freezing in progress? We check with page lock held so if the test
+	 * here fails, we are sure freezing code will wait until the page
+	 * fault is done - at that point page will be dirty and unlocked so
+	 * freezing code will writeprotect it again.
+	 */
+	if (ret == VM_FAULT_LOCKED && inode->i_sb->s_frozen != SB_UNFROZEN) {
+		unlock_page(page);
+		goto restart;
+	}
 	return ret;
 }
-- 
1.7.1