Message ID | 20160107114736.GC8380@quack.suse.cz |
---|---|
State | Superseded |
Headers | show |
On Thu 07-01-16 12:47:36, Jan Kara wrote: > On Thu 07-01-16 11:02:29, HUANG Weller (CM/ESW12-CN) wrote: > > > -----Original Message----- > > > From: Jan Kara [mailto:jack@suse.cz] > > > Sent: Thursday, January 07, 2016 6:24 PM > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com> > > > Cc: Jan Kara <jack@suse.cz>; linux-ext4@vger.kernel.org > > > Subject: Re: ext4 out of order when use cfq scheduler > > > > > > On Thu 07-01-16 06:43:00, HUANG Weller (CM/ESW12-CN) wrote: > > > > > -----Original Message----- > > > > > From: Jan Kara [mailto:jack@suse.cz] > > > > > Sent: Wednesday, January 06, 2016 6:06 PM > > > > > To: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com> > > > > > Subject: Re: ext4 out of order when use cfq scheduler > > > > > > > > > > On Wed 06-01-16 02:39:15, HUANG Weller (CM/ESW12-CN) wrote: > > > > > > > So you are running in 'ws' mode of your tool, am I right? Just > > > > > > > looking into the sources you've sent me I've noticed that > > > > > > > although you set O_SYNC in openflg when mode == MODE_WS, you do > > > > > > > not use openflg at all. So file won't be synced at all. That > > > > > > > would well explain why you see that not all file contents is > > > > > > > written. So did you just send me a different version of the > > > > > > > source or is your test program > > > > > really buggy? > > > > > > > > > > > > > > > > > > > Yes, it is a bug of the test code. So the test tool create files > > > > > > without O_SYNC flag actually. But , even in this case, is the out > > > > > > of order acceptable ? or is it normal ? > > > > > > > > > > Without fsync(2) or O_SYNC, it is perfectly possible that some files > > > > > are written and others are not since nobody guarantees order of > > > > > writeback of inodes. OTOH you shouldn't ever see uninitialized data > > > > > in the inode (but so far it isn't clear to me whether you really see > > > > > unitialized data or whether we really wrote zeros to those blocks - > > > > > ext4 can sometimes decide to do so). Your traces and disk contents > > > > > show that the problematic inode has extent of length 128 blocks > > > > > starting at block > > > > > 0x12c00 and then extent of lenght 1 block starting at block 0x1268e. > > > > > What is the block size of the filesystem? Because inode size is only 0x40010. > > > > > > > > > > Some suggestions to try: > > > > > 1) Print also length of a write request in addition to the starting > > > > > block so that we can see how much actually got written > > > > > > > > Please see below failure analysis. > > > > > > > > > 2) Initialize the device to 0xff so that we can distinguish > > > > > uninitialized blocks from zeroed-out blocks. > > > > > > > > Yes, i Initialize the device to 0xff this time. > > > > > > > > > 3) Report exactly for which 512-byte blocks checksum matches and for > > > > > which it is wrong. > > > > The wrong contents are old file contents which are created in previous > > > > test round. It is caused by the "wrong" sequence inode data(in > > > > journal) and the file contents. So the file contents are not updated. > > > > > > So this confuses me somewhat. You previously said that you always remove files > > > after each test round and then new ones are created. Is it still the case? So the old > > > file contents you speak about above is just some random contents that happened > > > to be in disk blocks we freshly allocated to the file, am I right? > > > > Yes. You are right. > > The "old file contents" means that the disk blocks which the contents is generated from last test round, and they are allocated to a new file in new test round. > > > > > > > > > > OK, so I was looking into the code and indeed, reality is correct and my mental > > > model was wrong! ;) I thought that inode gets added to the list of inodes for which > > > we need to wait for data IO completion during transaction commit during block > > > allocation. And I was wrong. It used to happen in > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove calls to > > > ext4_jbd2_file_inode() from delalloc write path) where it got removed. And that was > > > wrong because although we submit data writes before dropping handle for > > > allocating transaction and updating i_size, nobody guarantees that data IO is not > > > delayed in the block layer until transaction commit. > > > Which seems to happen in your case. I'll send a fix. Thanks for your report and > > > persistence! > > > > > > > Thanks a lot for your feedback :-) > > Because I am not familiar with the detail of the ext4 internal code. I will try to understand your explanation which you describe above. And have a look on related funcations. > > Could you send the fix in this mail ? > > And whether the kernel 3.14 also have such issue, right ? > > The problem is in all kernels starting with 3.8. Attached is a patch which > should fix the issue. Can you test whether it fixes the problem for you? Oh, I have realized the patch is on top of current ext4 development tree and it won't compile for current vanilla kernel because of EXT4_GET_BLOCKS_ZERO check. Just remove that line when you get compilation failure. > + if (map->m_flags & EXT4_MAP_NEW && > + !(map->m_flags & EXT4_MAP_UNWRITTEN) && > + !(flags & EXT4_GET_BLOCKS_ZERO) && Just remove the above line and things should work for older kernels as well. > + ext4_should_order_data(inode)) { > + ret = ext4_jbd2_file_inode(handle, inode); > + if (ret) > + return ret; > + } > } > return retval; > } Honza
> > > > > > > > OK, so I was looking into the code and indeed, reality is correct > > > and my mental model was wrong! ;) I thought that inode gets added to > > > the list of inodes for which we need to wait for data IO completion > > > during transaction commit during block allocation. And I was wrong. > > > It used to happen in > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove > > > calls to > > > ext4_jbd2_file_inode() from delalloc write path) where it got > > > removed. And that was wrong because although we submit data writes > > > before dropping handle for allocating transaction and updating > > > i_size, nobody guarantees that data IO is not delayed in the block layer until > transaction commit. > > > Which seems to happen in your case. I'll send a fix. Thanks for your > > > report and persistence! > > > > > > > Thanks a lot for your feedback :-) > > Because I am not familiar with the detail of the ext4 internal code. I will try to > understand your explanation which you describe above. And have a look on > related funcations. > > Could you send the fix in this mail ? > > And whether the kernel 3.14 also have such issue, right ? > > The problem is in all kernels starting with 3.8. Attached is a patch which should fix > the issue. Can you test whether it fixes the problem for you? > > Honza > -- Yes, of course I will redo the test with the patch. And also give you feedback. Thanks. Huang weller > Jan Kara <jack@suse.com> > SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: HUANG Weller (CM/ESW12-CN) > Sent: Friday, January 08, 2016 8:47 AM > To: 'Jan Kara' <jack@suse.cz> > Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com> > Subject: RE: ext4 out of order when use cfq scheduler > > > > > > > > > > > > OK, so I was looking into the code and indeed, reality is correct > > > > and my mental model was wrong! ;) I thought that inode gets added > > > > to the list of inodes for which we need to wait for data IO > > > > completion during transaction commit during block allocation. And I was > wrong. > > > > It used to happen in > > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove > > > > calls to > > > > ext4_jbd2_file_inode() from delalloc write path) where it got > > > > removed. And that was wrong because although we submit data writes > > > > before dropping handle for allocating transaction and updating > > > > i_size, nobody guarantees that data IO is not delayed in the block > > > > layer until > > transaction commit. > > > > Which seems to happen in your case. I'll send a fix. Thanks for > > > > your report and persistence! > > > > > > > > > > Thanks a lot for your feedback :-) > > > Because I am not familiar with the detail of the ext4 internal code. > > > I will try to > > understand your explanation which you describe above. And have a look > > on related funcations. > > > Could you send the fix in this mail ? > > > And whether the kernel 3.14 also have such issue, right ? > > > > The problem is in all kernels starting with 3.8. Attached is a patch > > which should fix the issue. Can you test whether it fixes the problem for you? > > > > Honza > > -- > > Yes, of course I will redo the test with the patch. And also give you feedback. Test result: Test on 2 targets with the kernel applied your patch. Both of them are OK after 5000 power failure tests. Our target test cycle is 10,000. By the way, since your original patch can't be applied on 3.x kernel, The attached one is based on yours and can applied on old kernel(mine is 3.10) directly. > > > Jan Kara <jack@suse.com> > > SUSE Labs, CR
On Mon 11-01-16 09:05:20, HUANG Weller (CM/ESW12-CN) wrote: > > > -----Original Message----- > > From: HUANG Weller (CM/ESW12-CN) > > Sent: Friday, January 08, 2016 8:47 AM > > To: 'Jan Kara' <jack@suse.cz> > > Cc: linux-ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com> > > Subject: RE: ext4 out of order when use cfq scheduler > > > > > > > > > > > > > > > > OK, so I was looking into the code and indeed, reality is correct > > > > > and my mental model was wrong! ;) I thought that inode gets added > > > > > to the list of inodes for which we need to wait for data IO > > > > > completion during transaction commit during block allocation. And I was > > wrong. > > > > > It used to happen in > > > > > mpage_da_map_and_submit() until commit f3b59291a69d (ext4: remove > > > > > calls to > > > > > ext4_jbd2_file_inode() from delalloc write path) where it got > > > > > removed. And that was wrong because although we submit data writes > > > > > before dropping handle for allocating transaction and updating > > > > > i_size, nobody guarantees that data IO is not delayed in the block > > > > > layer until > > > transaction commit. > > > > > Which seems to happen in your case. I'll send a fix. Thanks for > > > > > your report and persistence! > > > > > > > > > > > > > Thanks a lot for your feedback :-) > > > > Because I am not familiar with the detail of the ext4 internal code. > > > > I will try to > > > understand your explanation which you describe above. And have a look > > > on related funcations. > > > > Could you send the fix in this mail ? > > > > And whether the kernel 3.14 also have such issue, right ? > > > > > > The problem is in all kernels starting with 3.8. Attached is a patch > > > which should fix the issue. Can you test whether it fixes the problem for you? > > > > > > Honza > > > -- > > > > Yes, of course I will redo the test with the patch. And also give you feedback. > > Test result: > Test on 2 targets with the kernel applied your patch. Both of them are OK > after 5000 power failure tests. Our target test cycle is 10,000. By the > way, since your original patch can't be applied on 3.x kernel, The > attached one is based on yours and can applied on old kernel(mine is > 3.10) directly. Thanks for testing and the port! Once the patch is merged upstream, I'll backport it and push it to all active stable kernels. Honza
On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote: > > The problem is in all kernels starting with 3.8. Attached is a patch which > should fix the issue. Can you test whether it fixes the problem for you? Sorry, I missed this patch because it was attached to an discussion thread. > The problem is that although for delayed allocated blocks we write their > contents immediately after allocating them, there is no guarantee that > the IO scheduler or device doesn't reorder things I don't think that's the problem. In the commit thread when we call blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't reorder writes after that point, which is before we write the commit block. Instead, I believe the problem is in ext4_writepages: ext4_journal_stop(handle); /* Submit prepared bio */ ext4_io_submit(&mpd.io_submit); Once we release the handle, the commit can start --- *before* we have a chance to submit the I/O. Oops. I believe if we swap these two calls, it should fix the problem Huang was seeing. Jan, do you agree? - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Theodore Ts'o [mailto:tytso@mit.edu] > Sent: Sunday, March 13, 2016 12:27 PM > To: Jan Kara <jack@suse.cz> > Cc: HUANG Weller (CM/ESW12-CN) <Weller.Huang@cn.bosch.com>; linux- > ext4@vger.kernel.org; Li, Michael <huayil@qti.qualcomm.com> > Subject: Re: ext4 out of order when use cfq scheduler > > On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote: > > > > The problem is in all kernels starting with 3.8. Attached is a patch > > which should fix the issue. Can you test whether it fixes the problem for you? > > Sorry, I missed this patch because it was attached to an discussion thread. > > > The problem is that although for delayed allocated blocks we write > > their contents immediately after allocating them, there is no > > guarantee that the IO scheduler or device doesn't reorder things > > I don't think that's the problem. In the commit thread when we call > blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't reorder writes > after that point, which is before we write the commit block. Instead, I believe the > problem is in ext4_writepages: > > ext4_journal_stop(handle); > /* Submit prepared bio */ > ext4_io_submit(&mpd.io_submit); > > Once we release the handle, the commit can start --- *before* we have > a chance to submit the I/O. Oops. > > I believe if we swap these two calls, it should fix the problem Huang was seeing. > > Jan, do you agree? > > - Ted Hi Ted and Jan, You can give me a patch and I can redo the verification on my kernel and HWs. I also look into the code, since In my test case, I use data=ordered option and without sync. So the write operation will goto ext4_da_writepages(), right ? My kernel version is 3.10.63, as I see io_submit and journal_stop sequence already in that order. while (!ret && wbc->nr_to_write > 0) { ext4_journal_start write_cache_pages_da mpage_da_map_and_submit ==> ext4_journal_stop } Thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat 12-03-16 23:27:23, Ted Tso wrote: > On Thu, Jan 07, 2016 at 12:47:36PM +0100, Jan Kara wrote: > > The problem is in all kernels starting with 3.8. Attached is a patch which > > should fix the issue. Can you test whether it fixes the problem for you? > > Sorry, I missed this patch because it was attached to an discussion > thread. I have actually sent this patch in a standalone thread on January 11 (http://lists.openwall.net/linux-ext4/2016/01/11/3) together with one more cleanup. > > The problem is that although for delayed allocated blocks we write their > > contents immediately after allocating them, there is no guarantee that > > the IO scheduler or device doesn't reorder things > > I don't think that's the problem. In the commit thread when we call > blkdev_issue_flush() that acts as a barrier so the I/O scheduler won't > reorder writes after that point, which is before we write the commit > block. Instead, I believe the problem is in ext4_writepages: > > ext4_journal_stop(handle); > /* Submit prepared bio */ > ext4_io_submit(&mpd.io_submit); > > Once we release the handle, the commit can start --- *before* we have > a chance to submit the I/O. Oops. > > I believe if we swap these two calls, it should fix the problem Huang > was seeing. No, that won't be enough. blkdev_issue_flush() is not guaranteed to do anything to IOs which have not reported completion before blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio in its internal RB tree, following flush request completely bypasses this tree and goes directly to the disk where it flushes caches. And only later CFQ decides to schedule async writeback from the flusher thread which is queued in the RB tree... Note that the behavior has changed to be like this with the flushing machinery rewrite. Before that, IO scheduler had to drain all the outstanding IO requests (IO cache flush behaved like IO barrier). So your patch would be enough with the old flushing machinery but is not enough since 3.0 or so... Honza
On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote: > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do > anything to IOs which have not reported completion before > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio > in its internal RB tree, following flush request completely bypasses this > tree and goes directly to the disk where it flushes caches. And only later > CFQ decides to schedule async writeback from the flusher thread which is > queued in the RB tree... Oh, right. I am forgetting about the flushing mahchinery rewrite. Thanks for pointing that out. But what we *could* do is to swap those two calls and then in the case where delalloc is enabled, could maintain a list of inodes where we only need to call filemap_fdatawait(), and not initiate writeback for any dirty pages which had been caused by non-allocating writes. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon 14-03-16 10:36:35, Ted Tso wrote: > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote: > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do > > anything to IOs which have not reported completion before > > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio > > in its internal RB tree, following flush request completely bypasses this > > tree and goes directly to the disk where it flushes caches. And only later > > CFQ decides to schedule async writeback from the flusher thread which is > > queued in the RB tree... > > Oh, right. I am forgetting about the flushing mahchinery rewrite. > Thanks for pointing that out. > > But what we *could* do is to swap those two calls and then in the case > where delalloc is enabled, could maintain a list of inodes where we > only need to call filemap_fdatawait(), and not initiate writeback for > any dirty pages which had been caused by non-allocating writes. We actually don't need to swap those two calls - page is already marked as under writeback in mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page which gets called while we still hold the transaction handle. I agree calling filemap_fdatawait() from JBD2 during commit should be enough to fix issues with delalloc writeback. I'm just somewhat afraid that it will be more fragile: If we add inode to transaction's list in ext4_map_blocks(), we are pretty sure there's no way to allocate block to an inode without introducing data exposure issues (which are then very hard to spot). If we depend on callers of ext4_map_blocks() to properly add inode to appropriate transaction list, we have much more places to check. I'll think whether we could make this more robust. Honza
From 4dd4ac4bec65620a71df5e3f893e6728863f05c3 Mon Sep 17 00:00:00 2001 From: Jan Kara <jack@suse.cz> Date: Thu, 7 Jan 2016 12:21:25 +0100 Subject: [PATCH] ext4: Fix data exposure after a crash Huang has reported that in his powerfail testing he is seeing stale block contents in some of recently allocated blocks although he mounts ext4 in data=ordered mode. After some investigation I have found out that indeed when delayed allocation is used, we don't add inode to transaction's list of inodes needing flushing before commit. Originally we were doing that but commit f3b59291a69d removed the logic with a flawed argument that it is not needed. The problem is that although for delayed allocated blocks we write their contents immediately after allocating them, there is no guarantee that the IO scheduler or device doesn't reorder things and thus transaction allocating blocks and attaching them to inode can reach stable storage before actual block contents. Actually whenever we attach freshly allocated blocks to inode using a written extent, we should add inode to transaction's ordered inode list to make sure we properly wait for block contents to be written before committing the transaction. So that is what we do in this patch. This also handles other cases where stale data exposure was possible - like filling hole via mmap in data=ordered,nodelalloc mode. The only exception to the above rule are extending direct IO writes where blkdev_direct_IO() waits for IO to complete before increasing i_size and thus stale data exposure is not possible. For now we don't complicate the code with optimizing this special case since the overhead is pretty low. In case this is observed to be a performance problem we can always handle it using a special flag to ext4_map_blocks(). CC: stable@vger.kernel.org Fixes: f3b59291a69d0b734be1fc8be489fef2dd846d3d Reported-by: "HUANG Weller (CM/ESW12-CN)" <Weller.Huang@cn.bosch.com> Signed-off-by: Jan Kara <jack@suse.cz> --- fs/ext4/inode.c | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index ff2f3cd38522..b216a3eb41a8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -682,6 +682,20 @@ out_sem: ret = check_block_validity(inode, map); if (ret != 0) return ret; + + /* + * Inodes with freshly allocated blocks where contents will be + * visible after transaction commit must be on transaction's + * ordered data list. + */ + if (map->m_flags & EXT4_MAP_NEW && + !(map->m_flags & EXT4_MAP_UNWRITTEN) && + !(flags & EXT4_GET_BLOCKS_ZERO) && + ext4_should_order_data(inode)) { + ret = ext4_jbd2_file_inode(handle, inode); + if (ret) + return ret; + } } return retval; } @@ -1135,15 +1149,6 @@ static int ext4_write_end(struct file *file, int i_size_changed = 0; trace_ext4_write_end(inode, pos, len, copied); - if (ext4_test_inode_state(inode, EXT4_STATE_ORDERED_MODE)) { - ret = ext4_jbd2_file_inode(handle, inode); - if (ret) { - unlock_page(page); - page_cache_release(page); - goto errout; - } - } - if (ext4_has_inline_data(inode)) { ret = ext4_write_inline_data_end(inode, pos, len, copied, page); -- 2.6.2