diff mbox

ext4 out of order when use cfq scheduler

Message ID 20160315144633.GA12352@quack.suse.cz
State New, archived
Headers show

Commit Message

Jan Kara March 15, 2016, 2:46 p.m. UTC
On Tue 15-03-16 11:46:34, Jan Kara wrote:
> On Mon 14-03-16 10:36:35, Ted Tso wrote:
> > On Mon, Mar 14, 2016 at 08:39:28AM +0100, Jan Kara wrote:
> > > No, that won't be enough. blkdev_issue_flush() is not guaranteed to do
> > > anything to IOs which have not reported completion before
> > > blkdev_issue_flush() was called. Specifically, CFQ will queue submitted bio
> > > in its internal RB tree, following flush request completely bypasses this
> > > tree and goes directly to the disk where it flushes caches. And only later
> > > CFQ decides to schedule async writeback from the flusher thread which is
> > > queued in the RB tree...
> > 
> > Oh, right.  I am forgetting about the flushing mahchinery rewrite.
> > Thanks for pointing that out.
> > 
> > But what we *could* do is to swap those two calls and then in the case
> > where delalloc is enabled, could maintain a list of inodes where we
> > only need to call filemap_fdatawait(), and not initiate writeback for
> > any dirty pages which had been caused by non-allocating writes.
> 
> We actually don't need to swap those two calls - page is already marked as
> under writeback in
> 
>   mpage_map_and_submit_buffers() -> mpage_submit_page -> ext4_bio_write_page
> 
> which gets called while we still hold the transaction handle. I agree
> calling filemap_fdatawait() from JBD2 during commit should be enough to fix
> issues with delalloc writeback. I'm just somewhat afraid that it will be
> more fragile: If we add inode to transaction's list in ext4_map_blocks(),
> we are pretty sure there's no way to allocate block to an inode without
> introducing data exposure issues (which are then very hard to spot). If we
> depend on callers of ext4_map_blocks() to properly add inode to appropriate
> transaction list, we have much more places to check. I'll think whether we
> could make this more robust.

OK, I have something - Huang, can you check whether the attached patches
also fix your data exposure issues please? The first patch is the original
fix, patch two is a cleanup, patches 3 and 4 implement the speedup
suggested by Ted. Patches are only lightly tested so far.  I'll run more
comprehensive tests later and in particular I want to check whether the
additional complexity actually brings us some advantage at least for
workloads which redirty pages in addition to writing some new ones using
delayed allocation.

								Honza

Comments

HUANG Weller (CM/ESW12-CN) March 16, 2016, 12:41 a.m. UTC | #1
> 
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far.  I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
> 
> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


Sure, I will try to merge the patches and do the test on my kernel.

Huang weller
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
HUANG Weller (CM/ESW12-CN) March 24, 2016, 10:16 a.m. UTC | #2
> 
> OK, I have something - Huang, can you check whether the attached patches also
> fix your data exposure issues please? The first patch is the original fix, patch two
> is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> are only lightly tested so far.  I'll run more comprehensive tests later and in
> particular I want to check whether the additional complexity actually brings us
> some advantage at least for workloads which redirty pages in addition to writing
> some new ones using delayed allocation.
> 

Test done.
Both targets(kernel 3.10.63) PASS the power loss test with 10,000 cycles. Test with io-scheduler CFQ.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara March 24, 2016, 12:17 p.m. UTC | #3
On Thu 24-03-16 10:16:05, HUANG Weller (CM/ESW12-CN) wrote:
> > 
> > OK, I have something - Huang, can you check whether the attached patches also
> > fix your data exposure issues please? The first patch is the original fix, patch two
> > is a cleanup, patches 3 and 4 implement the speedup suggested by Ted. Patches
> > are only lightly tested so far.  I'll run more comprehensive tests later and in
> > particular I want to check whether the additional complexity actually brings us
> > some advantage at least for workloads which redirty pages in addition to writing
> > some new ones using delayed allocation.
> > 
> 
> Test done.
> Both targets(kernel 3.10.63) PASS the power loss test with 10,000 cycles. Test with io-scheduler CFQ.

Thanks for testing! I'll submit those patches once I verify there is some
performance gain in only waiting...

								Honza
diff mbox

Patch

From 7dd6dc4704480d9d07113c7218649cd056ea8508 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Tue, 15 Mar 2016 15:25:02 +0100
Subject: [PATCH 4/4] ext4: Do not ask jbd2 to write data for delalloc buffers

Currently we ask jbd2 to write all dirty allocated buffers before
committing a transaction when doing writeback of delay allocated blocks.
However this is unnecessary since we move all pages to writeback state
before dropping a transaction handle and then submit all the necessary
IO. We still need the transaction commit to wait for all the outstanding
writeback before flushing disk caches during transaction commit to avoid
data exposure issues though. Use the new jbd2 capability and ask it to
only wait for outstanding writeback during transaction commit when
writing back data in ext4_writepages().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h        |  3 +++
 fs/ext4/ext4_jbd2.h   | 12 +++++++++++-
 fs/ext4/inode.c       | 10 +++++++---
 fs/ext4/move_extent.c |  2 +-
 fs/jbd2/transaction.c |  4 ++--
 5 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc4910bce4ff..a095e833a6c0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,6 +561,9 @@  enum {
 #define EXT4_GET_BLOCKS_ZERO			0x0200
 #define EXT4_GET_BLOCKS_CREATE_ZERO		(EXT4_GET_BLOCKS_CREATE |\
 					EXT4_GET_BLOCKS_ZERO)
+	/* Caller will submit data before dropping transaction handle. This
+	 * allows jbd2 to avoid submitting data before commit. */
+#define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 
 /*
  * The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index f1c940b38b30..09c1ef38cbe6 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -359,7 +359,8 @@  static inline int ext4_journal_force_commit(journal_t *journal)
 	return 0;
 }
 
-static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
+static inline int ext4_jbd2_inode_add_write(handle_t *handle,
+					    struct inode *inode)
 {
 	if (ext4_handle_valid(handle))
 		return jbd2_journal_inode_add_write(handle,
@@ -367,6 +368,15 @@  static inline int ext4_jbd2_file_inode(handle_t *handle, struct inode *inode)
 	return 0;
 }
 
+static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
+					   struct inode *inode)
+{
+	if (ext4_handle_valid(handle))
+		return jbd2_journal_inode_add_wait(handle,
+						   EXT4_I(inode)->jinode);
+	return 0;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 173da9d467d1..01ecc67985bb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -692,7 +692,10 @@  out_sem:
 		    !(map->m_flags & EXT4_MAP_UNWRITTEN) &&
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    ext4_should_order_data(inode)) {
-			ret = ext4_jbd2_file_inode(handle, inode);
+			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
+				ret = ext4_jbd2_inode_add_wait(handle, inode);
+			else
+				ret = ext4_jbd2_inode_add_write(handle, inode);
 			if (ret)
 				return ret;
 		}
@@ -2164,7 +2167,8 @@  static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	 * the data was copied into the page cache.
 	 */
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
-			   EXT4_GET_BLOCKS_METADATA_NOFAIL;
+			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
+			   EXT4_GET_BLOCKS_IO_SUBMIT;
 	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
@@ -3538,7 +3542,7 @@  static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_file_inode(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index fb6f11709ae6..0ce9795c7404 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,7 @@  data_copy:
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_file_inode(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
 
 unlock_pages:
 	unlock_page(pagep[0]);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 0fcbec79fc7e..903ad38443c8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2504,14 +2504,14 @@  static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 	 */
 	if ((jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction) &&
-	    jinode->i_flags & flags == flags)
+	    (jinode->i_flags & flags) == flags)
 		return 0;
 
 	spin_lock(&journal->j_list_lock);
 
 	if ((jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction) &&
-	    jinode->i_flags & flags == flags)
+	    (jinode->i_flags & flags) == flags)
 		goto done;
 
 	/*
-- 
2.6.2