Patchwork ext4: Rewrite ext4_page_mkwrite() to return locked page

login
register
mail settings
Submitter Jan Kara
Date May 30, 2011, 3:03 p.m.
Message ID <1306767829-27580-1-git-send-email-jack@suse.cz>
Download mbox | patch
Permalink /patch/97924/
State New
Headers show

Comments

Jan Kara - May 30, 2011, 3:03 p.m.
ext4_page_mkwrite() does not return page locked. This makes it hard
to avoid races with filesystem freezing code (so that we don't leave
writeable page on a frozen fs) or writeback code (so that we allow page
to be stable during writeback).

Also the current code uses i_alloc_sem to avoid races with truncate but that
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c.

Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |  106 ++++++++++++++++++++++++++++--------------------------
 1 files changed, 55 insertions(+), 51 deletions(-)

 Ted, necessary generic changes went upstream so I've rebased this patch
on top of 3.0-rc1 (which also resolves the conflict with Darrick's patches).
Theodore Ts'o - June 6, 2011, 2:17 a.m.
On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote:
> ext4_page_mkwrite() does not return page locked. This makes it hard
> to avoid races with filesystem freezing code (so that we don't leave
> writeable page on a frozen fs) or writeback code (so that we allow page
> to be stable during writeback).
> 
> Also the current code uses i_alloc_sem to avoid races with truncate but that
> seems to be the wrong locking order according to lock ordering documented in
> mm/rmap.c.
> 
> Also add a check for frozen filesystem so that we don't busyloop in page fault
> when the filesystem is frozen.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I'm not sure this commit description is accurate --- or I'm horribly
confused.

The old code was in fact returning the page locked --- I assume you're
referring to lock_page(page) (which is called right before the normal
path return).  So what am I missing?

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - June 8, 2011, 2:10 p.m.
Ping?

					- Ted

On Sun, Jun 05, 2011 at 10:17:54PM -0400, Ted Ts'o wrote:
> On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote:
> > ext4_page_mkwrite() does not return page locked. This makes it hard
> > to avoid races with filesystem freezing code (so that we don't leave
> > writeable page on a frozen fs) or writeback code (so that we allow page
> > to be stable during writeback).
> > 
> > Also the current code uses i_alloc_sem to avoid races with truncate but that
> > seems to be the wrong locking order according to lock ordering documented in
> > mm/rmap.c.
> > 
> > Also add a check for frozen filesystem so that we don't busyloop in page fault
> > when the filesystem is frozen.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I'm not sure this commit description is accurate --- or I'm horribly
> confused.
> 
> The old code was in fact returning the page locked --- I assume you're
> referring to lock_page(page) (which is called right before the normal
> path return).  So what am I missing?
> 
> 							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - June 8, 2011, 2:55 p.m.
On Sun 05-06-11 22:17:54, Ted Tso wrote:
> On Mon, May 30, 2011 at 05:03:49PM +0200, Jan Kara wrote:
> > ext4_page_mkwrite() does not return page locked. This makes it hard
> > to avoid races with filesystem freezing code (so that we don't leave
> > writeable page on a frozen fs) or writeback code (so that we allow page
> > to be stable during writeback).
> > 
> > Also the current code uses i_alloc_sem to avoid races with truncate but that
> > seems to be the wrong locking order according to lock ordering documented in
> > mm/rmap.c.
> > 
> > Also add a check for frozen filesystem so that we don't busyloop in page fault
> > when the filesystem is frozen.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> I'm not sure this commit description is accurate --- or I'm horribly
> confused.
> 
> The old code was in fact returning the page locked --- I assume you're
> referring to lock_page(page) (which is called right before the normal
> path return).  So what am I missing?
  OK, so I read the code again and you are right that end_page_writeback()
will be called before the work is queued and thus there are no problems
with i_mutex (both with multiblock-IO code and with the old code). Sorry for
confusion. But reading the code I have a few questions:
  In multiblock io submission code we call end_page_writeback() and drop
page references before we queue work converting extent from uninitialized
to initialized. But what prevents mm from reclaiming the page (and possibly
loading zeros from disk) before we finish the conversion? I suppose that
could be another cause of the problem described in
1449032be17abb69116dbc393f67ceb8bd034f92 but I don't see how that would be
addressed. Related question is: Commit
bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of
inode reference when creating/freeing end IO work. But now I don't see what
prevents mm from reaping the inode before the workqueue gets to processing
it.

Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the
old IO submission code but apparently it forgot to return the old handling
of uninitialized buffers so we unconditionnaly call block_write_full_page()
without specifying end_io function. So AFAICS we never convert unwritten
extents to written in some cases. For example when I mount the fs as:
mount -t ext4 -o nomblk_io_submit,dioread_nolock /dev/ubdb /mnt
and do
	int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
	char buf[1024];
	memset(buf, 'a', sizeof(buf));
	fallocate(fd, 0, 0, 16384);
	write(fd, buf, sizeof(buf));

I get a file full of zeros (after remounting the filesystem so that
pagecache is dropped) instead of seeing the first KB contain 'a's.

So to return to our original discussion regarding i_mutex - you are right
currently i_mutex does not cause a deadlock but to fix the first of the
above problems we need to somehow pin the page before all metadata is
properly updated, that metadata update requires i_mutex, and we will have
to be careful to pin the page in such a way so that memory reclaim cannot
get stuck reaping that page... If you can solve that, I'll be happy but
currently I'd find it more robust to just call end_page_writeback() after
we convert extents and avoid fs recursion from allocations.

								Honza
Theodore Ts'o - June 20, 2011, 2:53 a.m.
On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote:
>   OK, so I read the code again and you are right that end_page_writeback()
> will be called before the work is queued and thus there are no problems
> with i_mutex (both with multiblock-IO code and with the old code). Sorry for
> confusion. But reading the code I have a few questions:
>   In multiblock io submission code we call end_page_writeback() and drop
> page references before we queue work converting extent from uninitialized
> to initialized. But what prevents mm from reclaiming the page (and possibly
> loading zeros from disk) before we finish the conversion? 

We hold a reference on the page (i.e., via get_page()), which we don't
drop until we are doing doing the conversion.  The workqueue function
(ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(),
which drops the page reference.

> Related question is: Commit
> bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of
> inode reference when creating/freeing end IO work. But now I don't see what
> prevents mm from reaping the inode before the workqueue gets to processing
> it.

Actually it was commit f7ad6d2e9201a which removed the getting and
putting of the inode reference, and it describes the issues involved.
Specifically:

    The following BUG can occur when an inode which is getting freed when
    it still has dirty pages outstanding, and it gets deleted (in this
    because it was the target of a rename).  In ordered mode, we need to
    make sure the data pages are written just in case we crash before the
    rename (or unlink) is committed.  If the inode is being freed then
    when we try to igrab the inode, we end up tripping the BUG_ON at
    fs/ext4/page-io.c:146.
    
    To solve this problem, we need to keep track of the number of io
    callbacks which are pending, and avoid destroying the inode until they
    have all been completed.  That way we don't have to bump the inode
    count to keep the inode from being destroyed; an approach which
    doesn't work because the count could have already been dropped down to
    zero before the inode writeback has started (at which point we're not
    allowed to bump the count back up to 1, since it's already started
    getting freed).
    
    Thanks to Dave Chinner for suggesting this approach, which is also
    used by XFS.

> Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the
> old IO submission code but apparently it forgot to return the old handling
> of uninitialized buffers so we unconditionnaly call block_write_full_page()
> without specifying end_io function. 

Yeah.  That was a bug, no question.  No one apparently ran into it (or
at least no one bothered to report it to linux-ext4 or linux-kernel as
far as I know) while it was the default (i.e., during 2.6.37), and as
of 2.6.38 we are now using the new IO submission code by default
again.  Given that no one has reported any problems with the new IO
submission code, I was planning on completely removing the
nomblk_io_submit mount option for the 3.1 kernel, which would fix the
bug by virtue of nuking code in question.  :-)

> So to return to our original discussion regarding i_mutex - you are right
> currently i_mutex does not cause a deadlock but to fix the first of the
> above problems we need to somehow pin the page before all metadata is
> properly updated, that metadata update requires i_mutex, and we will have
> to be careful to pin the page in such a way so that memory reclaim cannot
> get stuck reaping that page... If you can solve that, I'll be happy but
> currently I'd find it more robust to just call end_page_writeback() after
> we convert extents and avoid fs recursion from allocations.

So I think the current fs/ext4/page_io.c code solves the problem
simply by grabbing a reference on the bug, which should prevent the mm
layer from reclaiming the page.  Does that satisfy you?  Do you see
anything that I may have missed?

(I very well could have missed something; after all, this code is
subtle, which is why I didn't feel comfortable responsible until I
could find the time to go and research this in depth; sorry for delay
in getting back to you.)

						- Ted

P.S.  Upon further reflection, perhaps we need to drop a few more code
comments in it so it's clearer to folks who try to go through this
code path in the future.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - June 20, 2011, 10:38 a.m.
On Sun 19-06-11 22:53:14, Ted Tso wrote:
> On Wed, Jun 08, 2011 at 04:55:49PM +0200, Jan Kara wrote:
> >   OK, so I read the code again and you are right that end_page_writeback()
> > will be called before the work is queued and thus there are no problems
> > with i_mutex (both with multiblock-IO code and with the old code). Sorry for
> > confusion. But reading the code I have a few questions:
> >   In multiblock io submission code we call end_page_writeback() and drop
> > page references before we queue work converting extent from uninitialized
> > to initialized. But what prevents mm from reclaiming the page (and possibly
> > loading zeros from disk) before we finish the conversion? 
> 
> We hold a reference on the page (i.e., via get_page()), which we don't
> drop until we are doing doing the conversion.  The workqueue function
> (ext4_end_io_work, in fs/ext4/page_io.c) calls ext4_free_io_end(),
> which drops the page reference.
  Hmm, let's go though the call chain and see where I miss something:
io_end is allocated and bio is initialized in io_submit_init() - we set
  bi_end_io to ext4_end_bio there
pages are added to io_end ext4_bio_write_page() - it creates io_page
  (p_count is set to 1, we grab page reference - get_page()), then
  io_submit_add_bh() adds io_page to io_end (p_count increased to 2), then
  we do put_io_page() in the end of ext4_bio_write_page() -> p_count == 1.
Eventually bio is submitted, io gets completed and ext4_end_bio() is
  called. It goes though all pages and does put_io_page(), thus p_count
  gets to 0, we clear PageWriteback and do put_page(). From this moment on,
  we hold no reference to the page AFAICS.

Note, there is a separate call chain from ext4_writepage(). There the
reference handling seems to be as you describe - i.e., we get reference in
ext4_set_bh_endio(), we call end_buffer_async_write() in
ext4_end_io_buffer_write() which clears PageWriteback and then we drop page
reference in ext4_free_io_end().

> > Related question is: Commit
> > bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc removed getting and putting of
> > inode reference when creating/freeing end IO work. But now I don't see what
> > prevents mm from reaping the inode before the workqueue gets to processing
> > it.
> 
> Actually it was commit f7ad6d2e9201a which removed the getting and
> putting of the inode reference, and it describes the issues involved.
  Ah, OK. Thanks for the pointer.

> > Finally, commit 1449032be17abb69116dbc393f67ceb8bd034f92 returned back the
> > old IO submission code but apparently it forgot to return the old handling
> > of uninitialized buffers so we unconditionnaly call block_write_full_page()
> > without specifying end_io function. 
> 
> Yeah.  That was a bug, no question.  No one apparently ran into it (or
> at least no one bothered to report it to linux-ext4 or linux-kernel as
> far as I know) while it was the default (i.e., during 2.6.37), and as
> of 2.6.38 we are now using the new IO submission code by default
> again.  Given that no one has reported any problems with the new IO
> submission code, I was planning on completely removing the
> nomblk_io_submit mount option for the 3.1 kernel, which would fix the
> bug by virtue of nuking code in question.  :-)
  Yep, that would solve it as well :).

> > So to return to our original discussion regarding i_mutex - you are right
> > currently i_mutex does not cause a deadlock but to fix the first of the
> > above problems we need to somehow pin the page before all metadata is
> > properly updated, that metadata update requires i_mutex, and we will have
> > to be careful to pin the page in such a way so that memory reclaim cannot
> > get stuck reaping that page... If you can solve that, I'll be happy but
> > currently I'd find it more robust to just call end_page_writeback() after
> > we convert extents and avoid fs recursion from allocations.
> 
> So I think the current fs/ext4/page_io.c code solves the problem
> simply by grabbing a reference on the bug, which should prevent the mm
> layer from reclaiming the page.  Does that satisfy you?  Do you see
> anything that I may have missed?
  OK, if we fix the reference counting I'm now convinced that the code
would be correct. Fixing that does not seem to be completely simple - we
need to use p_count to detect when we can do end_page_writeback() when
blocksize < pagesize and io submission races with io completion (current
code is correct in that respect). But we need another mechanism detect when
we can safely put page reference... Separate reference counter would
certainly do but it might be an overkill. Possibly we could get additional
page reference for each io_page reference?

> P.S.  Upon further reflection, perhaps we need to drop a few more code
> comments in it so it's clearer to folks who try to go through this
> code path in the future.
  Well, I think it would be worthwhile to have a highlevel description
somewhere (i.e. what protects inode from freeing, what protects page from
freeing, how we manage PageWriteback, page references and extent
conversion). The actual code is not that hard to follow but it's not easy
to see the design behind it...

								Honza

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a5763e3..d679dd8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5843,80 +5843,84 @@  int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	loff_t size;
 	unsigned long len;
-	int ret = -EINVAL;
-	void *fsdata;
+	int ret;
 	struct file *file = vma->vm_file;
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
+	handle_t *handle;
+	get_block_t *get_block;
+	int retries = 0;
 
 	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
+	 * This check is racy but catches the common case. We rely on
+	 * __block_page_mkwrite() to do a reliable check.
 	 */
-	down_read(&inode->i_alloc_sem);
-	size = i_size_read(inode);
-	if (page->mapping != mapping || size <= page_offset(page)
-	    || !PageUptodate(page)) {
-		/* page got truncated from under us? */
-		goto out_unlock;
+	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	/* Delalloc case is easy... */
+	if (test_opt(inode->i_sb, DELALLOC) &&
+	    !ext4_should_journal_data(inode) &&
+	    !ext4_nonda_switch(inode->i_sb)) {
+		do {
+			ret = __block_page_mkwrite(vma, vmf,
+						   ext4_da_get_block_prep);
+		} while (ret == -ENOSPC &&
+		       ext4_should_retry_alloc(inode->i_sb, &retries));
+		goto out_ret;
 	}
-	ret = 0;
 
 	lock_page(page);
-	wait_on_page_writeback(page);
-	if (PageMappedToDisk(page)) {
-		up_read(&inode->i_alloc_sem);
-		return VM_FAULT_LOCKED;
+	size = i_size_read(inode);
+	/* Page got truncated from under us? */
+	if (page->mapping != mapping || page_offset(page) > size) {
+		unlock_page(page);
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	if (page->index == size >> PAGE_CACHE_SHIFT)
 		len = size & ~PAGE_CACHE_MASK;
 	else
 		len = PAGE_CACHE_SIZE;
-
 	/*
-	 * return if we have all the buffers mapped. This avoid
-	 * the need to call write_begin/write_end which does a
-	 * journal_start/journal_stop which can block and take
-	 * long time
+	 * Return if we have all the buffers mapped. This avoids the need to do
+	 * journal_start/journal_stop which can block and take a long time
 	 */
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			up_read(&inode->i_alloc_sem);
-			return VM_FAULT_LOCKED;
+			/* Wait so that we don't change page under IO */
+			wait_on_page_writeback(page);
+			ret = VM_FAULT_LOCKED;
+			goto out;
 		}
 	}
 	unlock_page(page);
-	/*
-	 * OK, we need to fill the hole... Do write_begin write_end
-	 * to do block allocation/reservation.We are not holding
-	 * inode.i__mutex here. That allow * parallel write_begin,
-	 * write_end call. lock_page prevent this from happening
-	 * on the same page though
-	 */
-	ret = mapping->a_ops->write_begin(file, mapping, page_offset(page),
-			len, AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = mapping->a_ops->write_end(file, mapping, page_offset(page),
-			len, len, page, fsdata);
-	if (ret < 0)
-		goto out_unlock;
-	ret = 0;
-
-	/*
-	 * write_begin/end might have created a dirty page and someone
-	 * could wander in and start the IO.  Make sure that hasn't
-	 * happened.
-	 */
-	lock_page(page);
-	wait_on_page_writeback(page);
-	up_read(&inode->i_alloc_sem);
-	return VM_FAULT_LOCKED;
-out_unlock:
-	if (ret)
+	/* OK, we need to fill the hole... */
+	if (ext4_should_dioread_nolock(inode))
+		get_block = ext4_get_block_write;
+	else
+		get_block = ext4_get_block;
+retry_alloc:
+	handle = ext4_journal_start(inode, ext4_writepage_trans_blocks(inode));
+	if (IS_ERR(handle)) {
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+		goto out;
+	}
+	ret = __block_page_mkwrite(vma, vmf, get_block);
+	if (!ret && ext4_should_journal_data(inode)) {
+		if (walk_page_buffers(handle, page_buffers(page), 0,
+			  PAGE_CACHE_SIZE, NULL, do_journal_get_write_access)) {
+			unlock_page(page);
+			ret = VM_FAULT_SIGBUS;
+			goto out;
+		}
+		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+	}
+	ext4_journal_stop(handle);
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry_alloc;
+out_ret:
+	ret = block_page_mkwrite_return(ret);
+out:
 	return ret;
 }