diff mbox

ext4 out of order when use cfq scheduler

Message ID 20160107114736.GC8380@quack.suse.cz
State Superseded
Headers show

Commit Message

Jan Kara Jan. 7, 2016, 11:47 a.m. UTC
On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > -----Original Message-----
> > From: Jan Kara [mailto:jack@suse.cz]
> > Sent: Thursday, January 07, 2016 6:24 PM
> > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> > Subject: Re: ext4 out of order when use cfq scheduler
> > 
> > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > -----Original Message-----
> > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > >
> > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > > looking into the sources you've sent me I've noticed that
> > > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > > not use openflg at all. So file won't be synced at all. That
> > > > > > would well explain why you see that not all file contents is
> > > > > > written. So did you just send me a different version of the
> > > > > > source or is your test program
> > > > really buggy?
> > > > > >
> > > > >
> > > > > Yes, it is a bug of the test code. So the test tool create files
> > > > > without O_SYNC flag actually.  But , even in this case, is the out
> > > > > of order acceptable ? or is it normal ?
> > > >
> > > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > > are written and others are not since nobody guarantees order of
> > > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > > in the inode (but so far it isn't clear to me whether you really see
> > > > unitialized data or whether we really wrote zeros to those blocks -
> > > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > > show that the problematic inode has extent of length 128 blocks
> > > > starting at block
> > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > What is the block size of the filesystem?  Because inode size is only 0x40010.
> > > >
> > > > Some suggestions to try:
> > > > 1) Print also length of a write request in addition to the starting
> > > > block so that we can see how much actually got written
> > >
> > > Please see below failure analysis.
> > >
> > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > uninitialized blocks from zeroed-out blocks.
> > >
> > > Yes, i Initialize the device to 0xff this time.
> > >
> > > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > > which it is wrong.
> > > The wrong contents are old file contents which are created in previous
> > > test round.  It is caused by the "wrong" sequence inode data(in
> > > journal) and  the file contents. So the file contents are not updated.
> > 
> > So this confuses me somewhat. You previously said that you always remove files
> > after each test round and then new ones are created. Is it still the case? So the old
> > file contents you speak about above is just some random contents that happened
> > to be in disk blocks we freshly allocated to the file, am I right?
> 
> Yes. You are right.
>  The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round.
> 
> 
> > 
> > OK, so I was looking into the code and indeed, reality is correct and my mental
> > model was wrong! ;) I thought that inode gets added to the list of inodes for which
> > we need to wait for data IO completion during transaction commit during block
> > allocation. And I was wrong. It used to happen in
> > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> > ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> > wrong because although we submit data writes before dropping handle for
> > allocating transaction and updating i_size, nobody guarantees that data IO is not
> > delayed in the block layer until transaction commit.
> > Which seems to happen in your case. I'll send a fix. Thanks for your report and
> > persistence!
> > 
> 
> Thanks a lot for your feedback :-)
> Because I am not familiar with the detail of the ext4 internal code.  I will try to understand your explanation which you describe above.  And have a look on related funcations.
> Could you send the fix in this mail ?
> And whether the kernel 3.14 also have such issue, right ?

The problem is in all kernels starting with 3.8. Attached is a patch which
should fix the issue. Can you test whether it fixes the problem for you?

								Honza

Comments

Jan Kara Jan. 7, 2016, 12:19 p.m. UTC | #1
On Thu 07-01-16 12:47:36, Jan Kara wrote:
> On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote:
> > > -----Original Message-----
> > > From: Jan Kara [mailto:jack@suse.cz]
> > > Sent: Thursday, January 07, 2016 6:24 PM
> > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org
> > > Subject: Re: ext4 out of order when use cfq scheduler
> > > 
> > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > -----Original Message-----
> > > > > From: Jan Kara [mailto:jack@suse.cz]
> > > > > Sent: Wednesday, January 06, 2016 6:06 PM
> > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>
> > > > > Subject: Re: ext4 out of order when use cfq scheduler
> > > > >
> > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote:
> > > > > > > So you are running in 'ws' mode of your tool, am I right? Just
> > > > > > > looking into the sources you've sent me I've noticed that
> > > > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do
> > > > > > > not use openflg at all. So file won't be synced at all. That
> > > > > > > would well explain why you see that not all file contents is
> > > > > > > written. So did you just send me a different version of the
> > > > > > > source or is your test program
> > > > > really buggy?
> > > > > > >
> > > > > >
> > > > > > Yes, it is a bug of the test code. So the test tool create files
> > > > > > without O_SYNC flag actually.  But , even in this case, is the out
> > > > > > of order acceptable ? or is it normal ?
> > > > >
> > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some files
> > > > > are written and others are not since nobody guarantees order of
> > > > > writeback of inodes. OTOH you shouldn't ever see uninitialized data
> > > > > in the inode (but so far it isn't clear to me whether you really see
> > > > > unitialized data or whether we really wrote zeros to those blocks -
> > > > > ext4 can sometimes decide to do so).  Your traces and disk contents
> > > > > show that the problematic inode has extent of length 128 blocks
> > > > > starting at block
> > > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e.
> > > > > What is the block size of the filesystem?  Because inode size is only 0x40010.
> > > > >
> > > > > Some suggestions to try:
> > > > > 1) Print also length of a write request in addition to the starting
> > > > > block so that we can see how much actually got written
> > > >
> > > > Please see below failure analysis.
> > > >
> > > > > 2) Initialize the device to 0xff so that we can distinguish
> > > > > uninitialized blocks from zeroed-out blocks.
> > > >
> > > > Yes, i Initialize the device to 0xff this time.
> > > >
> > > > > 3) Report exactly for which 512-byte blocks checksum matches and for
> > > > > which it is wrong.
> > > > The wrong contents are old file contents which are created in previous
> > > > test round.  It is caused by the "wrong" sequence inode data(in
> > > > journal) and  the file contents. So the file contents are not updated.
> > > 
> > > So this confuses me somewhat. You previously said that you always remove files
> > > after each test round and then new ones are created. Is it still the case? So the old
> > > file contents you speak about above is just some random contents that happened
> > > to be in disk blocks we freshly allocated to the file, am I right?
> > 
> > Yes. You are right.
> >  The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round.
> > 
> > 
> > > 
> > > OK, so I was looking into the code and indeed, reality is correct and my mental
> > > model was wrong! ;) I thought that inode gets added to the list of inodes for which
> > > we need to wait for data IO completion during transaction commit during block
> > > allocation. And I was wrong. It used to happen in
> > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to
> > > ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was
> > > wrong because although we submit data writes before dropping handle for
> > > allocating transaction and updating i_size, nobody guarantees that data IO is not
> > > delayed in the block layer until transaction commit.
> > > Which seems to happen in your case. I'll send a fix. Thanks for your report and
> > > persistence!
> > > 
> > 
> > Thanks a lot for your feedback :-)
> > Because I am not familiar with the detail of the ext4 internal code.  I will try to understand your explanation which you describe above.  And have a look on related funcations.
> > Could you send the fix in this mail ?
> > And whether the kernel 3.14 also have such issue, right ?
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Oh, I have realized the patch is on top of current ext4 development tree
and it won't compile for current vanilla kernel because of
EXT4_GET_BLOCKS_ZERO check. Just remove that line when  you get compilation
failure.

> +		if (map->m_flags & EXT4_MAP_NEW &&
> +		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
> +		    !(flags & EXT4_GET_BLOCKS_ZERO) &&

Just remove the above line and things should work for older kernels as
well.

> +		    ext4_should_order_data(inode)) {
> +			ret = ext4_jbd2_file_inode(handle, inode);
> +			if (ret)
> +				return ret;
> +		}
>  	}
>  	return retval;
>  }

								Honza
HUANG Weller (CM/ESW12-CN) Jan. 8, 2016, 12:46 a.m. UTC | #2
> >
> > >
> > > OK, so I was looking into the code and indeed, reality is correct
> > > and my mental model was wrong! ;) I thought that inode gets added to
> > > the list of inodes for which we need to wait for data IO completion
> > > during transaction commit during block allocation. And I was wrong.
> > > It used to happen in
> > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > calls to
> > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > removed. And that was wrong because although we submit data writes
> > > before dropping handle for allocating transaction and updating
> > > i_size, nobody guarantees that data IO is not delayed in the block layer until
> transaction commit.
> > > Which seems to happen in your case. I'll send a fix. Thanks for your
> > > report and persistence!
> > >
> >
> > Thanks a lot for your feedback :-)
> > Because I am not familiar with the detail of the ext4 internal code.  I will try to
> understand your explanation which you describe above.  And have a look on
> related funcations.
> > Could you send the fix in this mail ?
> > And whether the kernel 3.14 also have such issue, right ?
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which should fix
> the issue. Can you test whether it fixes the problem for you?
> 
> 								Honza
> --

Yes, of course I will redo the test with the patch. And also give you feedback.
Thanks.
Huang weller

> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
HUANG Weller (CM/ESW12-CN) Jan. 11, 2016, 9:05 a.m. UTC | #3
> -----Original Message-----
> From: HUANG Weller (CM/ESW12-CN)
> Sent: Friday, January 08, 2016 8:47 AM
> To: 'Jan Kara' <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: RE: ext4 out of order when use cfq scheduler
> 
> > >
> > > >
> > > > OK, so I was looking into the code and indeed, reality is correct
> > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > to the list of inodes for which we need to wait for data IO
> > > > completion during transaction commit during block allocation. And I was
> wrong.
> > > > It used to happen in
> > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > calls to
> > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > removed. And that was wrong because although we submit data writes
> > > > before dropping handle for allocating transaction and updating
> > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > layer until
> > transaction commit.
> > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > your report and persistence!
> > > >
> > >
> > > Thanks a lot for your feedback :-)
> > > Because I am not familiar with the detail of the ext4 internal code.
> > > I will try to
> > understand your explanation which you describe above.  And have a look
> > on related funcations.
> > > Could you send the fix in this mail ?
> > > And whether the kernel 3.14 also have such issue, right ?
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> >
> > 								Honza
> > --
> 
> Yes, of course I will redo the test with the patch. And also give you feedback.

Test result:
Test on 2 targets with the kernel applied your patch. Both of them are OK after 5000 power failure tests. Our target test cycle is 10,000.
By the way, since your original patch can't be applied on 3.x kernel, The attached one  is based on yours and can applied on old kernel(mine is 3.10) directly.

> 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
Jan Kara Jan. 11, 2016, 10:21 a.m. UTC | #4
On Mon 11-01-16 09:05:20, HUANG Weller (CM/ESW12-CN) wrote:
> 
> > -----Original Message-----
> > From: HUANG Weller (CM/ESW12-CN)
> > Sent: Friday, January 08, 2016 8:47 AM
> > To: 'Jan Kara' <jack@suse.cz>
> > Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> > Subject: RE: ext4 out of order when use cfq scheduler
> > 
> > > >
> > > > >
> > > > > OK, so I was looking into the code and indeed, reality is correct
> > > > > and my mental model was wrong! ;) I thought that inode gets added
> > > > > to the list of inodes for which we need to wait for data IO
> > > > > completion during transaction commit during block allocation. And I was
> > wrong.
> > > > > It used to happen in
> > > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove
> > > > > calls to
> > > > > ext4_jbd2_file_inode() from delalloc write path) where it got
> > > > > removed. And that was wrong because although we submit data writes
> > > > > before dropping handle for allocating transaction and updating
> > > > > i_size, nobody guarantees that data IO is not delayed in the block
> > > > > layer until
> > > transaction commit.
> > > > > Which seems to happen in your case. I'll send a fix. Thanks for
> > > > > your report and persistence!
> > > > >
> > > >
> > > > Thanks a lot for your feedback :-)
> > > > Because I am not familiar with the detail of the ext4 internal code.
> > > > I will try to
> > > understand your explanation which you describe above.  And have a look
> > > on related funcations.
> > > > Could you send the fix in this mail ?
> > > > And whether the kernel 3.14 also have such issue, right ?
> > >
> > > The problem is in all kernels starting with 3.8. Attached is a patch
> > > which should fix the issue. Can you test whether it fixes the problem for you?
> > >
> > > 								Honza
> > > --
> > 
> > Yes, of course I will redo the test with the patch. And also give you feedback.
> 
> Test result:
> Test on 2 targets with the kernel applied your patch. Both of them are OK
> after 5000 power failure tests. Our target test cycle is 10,000.  By the
> way, since your original patch can't be applied on 3.x kernel, The
> attached one  is based on yours and can applied on old kernel(mine is
> 3.10) directly.

Thanks for testing and the port! Once the patch is merged upstream, I'll
backport it and push it to all active stable kernels.

								Honza
Theodore Ts'o March 13, 2016, 4:27 a.m. UTC | #5
On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> 
> The problem is in all kernels starting with 3.8. Attached is a patch which
> should fix the issue. Can you test whether it fixes the problem for you?

Sorry, I missed this patch because it was attached to an discussion
thread.

> The problem is that although for delayed allocated blocks we write their
> contents immediately after allocating them, there is no guarantee that
> the IO scheduler or device doesn't reorder things

I don't think that's the problem.  In the commit thread when we call
blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
reorder writes after that point, which is before we write the commit
block.  Instead, I believe the problem is in ext4_writepages:

		ext4_journal_stop(handle);
		/* Submit prepared bio */
		ext4_io_submit(&mpd.io_submit);

Once we release the handle, the commit can start --- *before* we have
a chance to submit the I/O.   Oops.

I believe if we swap these two calls, it should fix the problem Huang
was seeing.

Jan, do you agree?

					- Ted
					
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
HUANG Weller (CM/ESW12-CN) March 14, 2016, 2:43 a.m. UTC | #6
> -----Original Message-----
> From: Theodore Ts'o [mailto:tytso@mit.edu]
> Sent: Sunday, March 13, 2016 12:27 PM
> To: Jan Kara <jack@suse.cz>
> Cc: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>; linux-
> ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com>
> Subject: Re: ext4 out of order when use cfq scheduler
> 
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> >
> > The problem is in all kernels starting with 3.8. Attached is a patch
> > which should fix the issue. Can you test whether it fixes the problem for you?
> 
> Sorry, I missed this patch because it was attached to an discussion thread.
> 
> > The problem is that although for delayed allocated blocks we write
> > their contents immediately after allocating them, there is no
> > guarantee that the IO scheduler or device doesn't reorder things
> 
> I don't think that's the problem.  In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't reorder writes
> after that point, which is before we write the commit block.  Instead, I believe the
> problem is in ext4_writepages:
> 
> 		ext4_journal_stop(handle);
> 		/* Submit prepared bio */
> 		ext4_io_submit(&mpd.io_submit);
> 
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O.   Oops.
> 
> I believe if we swap these two calls, it should fix the problem Huang was seeing.
> 
> Jan, do you agree?
> 
> 					- Ted

Hi Ted and Jan,

You can give me a patch and I can redo the verification on my kernel and HWs.

I also look into the code, since In my test case, I use data=ordered option and without sync.  So the write operation will goto ext4_da_writepages(), right ?
My kernel version is 3.10.63, as I see io_submit and journal_stop sequence  already in that order. 

while (!ret && wbc->nr_to_write > 0) {
ext4_journal_start
write_cache_pages_da
mpage_da_map_and_submit  ==>
ext4_journal_stop
}

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara March 14, 2016, 7:39 a.m. UTC | #7
On Sat 12-03-16 23:27:23, Ted Tso wrote:
> On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote:
> > The problem is in all kernels starting with 3.8. Attached is a patch which
> > should fix the issue. Can you test whether it fixes the problem for you?
> 
> Sorry, I missed this patch because it was attached to an discussion
> thread.

I have actually sent this patch in a standalone thread on January 11
(http://lists.openwall.net/linux-ext4/2016/01/11/3) together with one more
cleanup.

> > The problem is that although for delayed allocated blocks we write their
> > contents immediately after allocating them, there is no guarantee that
> > the IO scheduler or device doesn't reorder things
> 
> I don't think that's the problem.  In the commit thread when we call
> blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't
> reorder writes after that point, which is before we write the commit
> block.  Instead, I believe the problem is in ext4_writepages:
> 
> 		ext4_journal_stop(handle);
> 		/* Submit prepared bio */
> 		ext4_io_submit(&mpd.io_submit);
> 
> Once we release the handle, the commit can start --- *before* we have
> a chance to submit the I/O.   Oops.
> 
> I believe if we swap these two calls, it should fix the problem Huang
> was seeing.

No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
anything to IOs which have not reported completion before
blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
in its internal RB tree, following flush request completely bypasses this
tree and goes directly to the disk where it flushes caches. And only later
CFQ decides to schedule async writeback from the flusher thread which is
queued in the RB tree...

Note that the behavior has changed to be like this with the flushing
machinery rewrite. Before that, IO scheduler had to drain all the
outstanding IO requests (IO cache flush behaved like IO barrier). So your
patch would be enough with the old flushing machinery but is not enough
since 3.0 or so...

								Honza
Theodore Ts'o March 14, 2016, 2:36 p.m. UTC | #8
On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> anything to IOs which have not reported completion before
> blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> in its internal RB tree, following flush request completely bypasses this
> tree and goes directly to the disk where it flushes caches. And only later
> CFQ decides to schedule async writeback from the flusher thread which is
> queued in the RB tree...

Oh, right.  I am forgetting about the flushing mahchinery rewrite.
Thanks for pointing that out.

But what we *could* do is to swap those two calls and then in the case
where delalloc is enabled, could maintain a list of inodes where we
only need to call filemap_fdatawait(), and not initiate writeback for
any dirty pages which had been caused by non-allocating writes.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara March 15, 2016, 10:46 a.m. UTC | #9
On Mon 14-03-16 10:36:35, Ted Tso wrote:
> On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > anything to IOs which have not reported completion before
> > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > in its internal RB tree, following flush request completely bypasses this
> > tree and goes directly to the disk where it flushes caches. And only later
> > CFQ decides to schedule async writeback from the flusher thread which is
> > queued in the RB tree...
> 
> Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> Thanks for pointing that out.
> 
> But what we *could* do is to swap those two calls and then in the case
> where delalloc is enabled, could maintain a list of inodes where we
> only need to call filemap_fdatawait(), and not initiate writeback for
> any dirty pages which had been caused by non-allocating writes.

We actually don't need to swap those two calls - page is already marked as
under writeback in

  mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page

which gets called while we still hold the transaction handle. I agree
calling filemap_fdatawait() from JBD2 during commit should be enough to fix
issues with delalloc writeback. I'm just somewhat afraid that it will be
more fragile: If we add inode to transaction's list in ext4_map_blocks(),
we are pretty sure there's no way to allocate block to an inode without
introducing data exposure issues (which are then very hard to spot). If we
depend on callers of ext4_map_blocks() to properly add inode to appropriate
transaction list, we have much more places to check. I'll think whether we
could make this more robust.

								Honza
diff mbox

Patch

From 4dd4ac4bec65620a71df5e3f893e6728863f05c3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 7 Jan 2016 12:21:25 +0100
Subject: [PATCH] ext4: Fix data exposure after a crash

Huang has reported that in his powerfail testing he is seeing stale
block contents in some of recently allocated blocks although he mounts
ext4 in data=ordered mode. After some investigation I have found out
that indeed when delayed allocation is used, we don't add inode to
transaction's list of inodes needing flushing before commit. Originally
we were doing that but commit f3b59291a69d removed the logic with a
flawed argument that it is not needed.

The problem is that although for delayed allocated blocks we write their
contents immediately after allocating them, there is no guarantee that
the IO scheduler or device doesn't reorder things and thus transaction
allocating blocks and attaching them to inode can reach stable storage
before actual block contents. Actually whenever we attach freshly
allocated blocks to inode using a written extent, we should add inode to
transaction's ordered inode list to make sure we properly wait for block
contents to be written before committing the transaction. So that is
what we do in this patch. This also handles other cases where stale data
exposure was possible - like filling hole via mmap in
data=ordered,nodelalloc mode.

The only exception to the above rule are extending direct IO writes where
blkdev_direct_IO() waits for IO to complete before increasing i_size and
thus stale data exposure is not possible. For now we don't complicate
the code with optimizing this special case since the overhead is pretty
low. In case this is observed to be a performance problem we can always
handle it using a special flag to ext4_map_blocks().

CC: stable@vger.kernel.org
Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d
Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ff2f3cd38522..b216a3eb41a8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -682,6 +682,20 @@  out_sem:
 		ret = check_block_validity(inode, map);
 		if (ret != 0)
 			return ret;
+
+		/*
+		 * Inodes with freshly allocated blocks where contents will be
+		 * visible after transaction commit must be on transaction's
+		 * ordered data list.
+		 */
+		if (map->m_flags & EXT4_MAP_NEW &&
+		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
+		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
+		    ext4_should_order_data(inode)) {
+			ret = ext4_jbd2_file_inode(handle, inode);
+			if (ret)
+				return ret;
+		}
 	}
 	return retval;
 }
@@ -1135,15 +1149,6 @@  static int ext4_write_end(struct file *file,
 	int i_size_changed = 0;
 
 	trace_ext4_write_end(inode, pos, len, copied);
-	if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) {
-		ret = ext4_jbd2_file_inode(handle, inode);
-		if (ret) {
-			unlock_page(page);
-			page_cache_release(page);
-			goto errout;
-		}
-	}
-
 	if (ext4_has_inline_data(inode)) {
 		ret = ext4_write_inline_data_end(inode, pos, len,
 						 copied, page);
-- 
2.6.2