diff mbox series

ext4: don't submit unwritten extent while holding active jbd2 handle

Message ID 20181225141625.18141-1-xiaoguang.wang@linux.alibaba.com
State New
Headers show
Series ext4: don't submit unwritten extent while holding active jbd2 handle | expand

Commit Message

Xiaoguang Wang Dec. 25, 2018, 2:16 p.m. UTC
In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
will try to find 2048 pages to map and normally one bio can contain 256
pages at most. If we really found 2048 pages to map, there will be 4 bios
and 4 ext4_io_submit() calls which are called both in ext4_writepages()
and mpage_map_and_submit_extent().

But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle,
when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread
will wait this handle to finish, so wait the unwritten extent is written to
disk, this will introduce unnecessary stall time, especially longer when the
writeback operation is io throttled, need to fix this issue.

Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and only
submit these bios after dropping the jbd2 handle.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/ext4/ext4.h    | 12 ++++++++++++
 fs/ext4/inode.c   |  5 ++++-
 fs/ext4/page-io.c | 45 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 50 insertions(+), 12 deletions(-)

Comments

Theodore Ts'o Jan. 3, 2019, 3:18 a.m. UTC | #1
On Tue, Dec 25, 2018 at 10:16:25PM +0800, Xiaoguang Wang wrote:
> In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
> will try to find 2048 pages to map and normally one bio can contain 256
> pages at most. If we really found 2048 pages to map, there will be 4 bios
> and 4 ext4_io_submit() calls which are called both in ext4_writepages()
> and mpage_map_and_submit_extent().
> 
> But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle,
> when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread
> will wait this handle to finish, so wait the unwritten extent is written to
> disk, this will introduce unnecessary stall time, especially longer when the
> writeback operation is io throttled, need to fix this issue.
> 
> Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and only
> submit these bios after dropping the jbd2 handle.

I think it would be better if we simply don't even try to *start* the
jbd2 handle at all until we after ext4_end_io_rsv_work() is called.  I
suspect the best place to do this will be in ext4_end_io().  The
shorter time we hold the handle, the better the commit completion
latency will be.  (After all, while we are assembling the bios, memory
allocations could block if the system is under heavy memory pressure,
etc.)

This will also make it a bit easier to support dioread_nolock for
block sizes < page size, since we need to start a separate handle for
each extent that needs to be converted, and if the block size is 4k
and the page size is 64k (for example, on the PowerPC), there might be
as many as 16 extents that have to be modified for a heavily
fragmented file system.

Supporting dioread_nolock for 1k file systems on x86 and 4k file
systems on ppc64/ppc64le will allow us to make dioread_nolock the only
supported write path, dropping the current default.  This will give us
only one write path to optimize, debug, and maintain, which will be a
Good Thing.

Cheers,

						- Ted
Jan Kara Jan. 23, 2019, 12:35 p.m. UTC | #2
On Wed 02-01-19 22:18:40, Theodore Y. Ts'o wrote:
> On Tue, Dec 25, 2018 at 10:16:25PM +0800, Xiaoguang Wang wrote:
> > In ext4_writepages(), for every iteration, mpage_prepare_extent_to_map()
> > will try to find 2048 pages to map and normally one bio can contain 256
> > pages at most. If we really found 2048 pages to map, there will be 4 bios
> > and 4 ext4_io_submit() calls which are called both in ext4_writepages()
> > and mpage_map_and_submit_extent().
> > 
> > But note that in mpage_map_and_submit_extent(), we hold a valid jbd2 handle,
> > when dioread_nolock is enabled and extent is unwritten, jbd2 commit thread
> > will wait this handle to finish, so wait the unwritten extent is written to
> > disk, this will introduce unnecessary stall time, especially longer when the
> > writeback operation is io throttled, need to fix this issue.
> > 
> > Here for this scene, we accumulate bios in ext4_io_submit's io_bio, and only
> > submit these bios after dropping the jbd2 handle.
> 
> I think it would be better if we simply don't even try to *start* the
> jbd2 handle at all until we after ext4_end_io_rsv_work() is called.  I
> suspect the best place to do this will be in ext4_end_io().  The
> shorter time we hold the handle, the better the commit completion
> latency will be.  (After all, while we are assembling the bios, memory
> allocations could block if the system is under heavy memory pressure,
> etc.)

Currently this will create a deadlock possibility. Transaction commit can
wait for page writeback (data=ordered requirements) and thus
ext4_journal_start() can end up blocking waiting for transaction commit
which waits for page writeback => you cannot start new transaction while
you have a page locked or under writeback.

But there are in fact two handles ext4_writepages() creates. One is the
reserved one, which we use for extent conversion on io_end - that one
actually does not hold off transaction commit. The reserved space just gets
automatically "reallocated" in the newly started transaction. And then
there's the normal handle which is used for allocating blocks / splitting
extents during ext4_writepages(). That does hold-off transaction commit and
I think this was the handle Xiaoguang was complaining about. Now I also
don't like what Xiaoguang has implemented either because that's just going
to cause another set of problems - essentially it just duplicates current
bio plugging mechanism however without all the precautions that one takes
to avoid deadlocks.

> This will also make it a bit easier to support dioread_nolock for
> block sizes < page size, since we need to start a separate handle for
> each extent that needs to be converted, and if the block size is 4k
> and the page size is 64k (for example, on the PowerPC), there might be
> as many as 16 extents that have to be modified for a heavily
> fragmented file system.

Starting handle for each extent is currently problematic because of
ordering constrains I have described. You need to lock the page for
writeback but once you have the page locked, you cannot start new
transaction. So you have to have transaction ready for being able to write
the whole page before locking the page.

> Supporting dioread_nolock for 1k file systems on x86 and 4k file
> systems on ppc64/ppc64le will allow us to make dioread_nolock the only
> supported write path, dropping the current default.  This will give us
> only one write path to optimize, debug, and maintain, which will be a
> Good Thing.

Agreed on this but I think we need to approach this from different angle
:). Lot of these problems would go away if we got rid of data=ordered pages
as that completely removes the page lock / page writeback vs transaction
handle start lock dependency. So how about just create
ext4_writepages_unwritten() temporarily that will be used for
dioread_nolock mode as a copy-paste of ext4_writepages() (possibly into a
separate file since the writepages code is big anyway). Then we can
simplify the new version with the knowledge that we are using
dioread_nolock and thus don't have ordered pages and can start transactions
under page lock (which also deals with the problem that has started this
thread). Then the result will be hopefully simple enough that we can add
support for blocksize < pagesize to it. Also I'd be really happy if we
managed to implement writeout beyond EOF without creating unwritten extents
first as that will get appending to dioread_nolock file the same
performance as without it. And then we can just drop to old writeback code
and switch to dioread_nolock unconditionally.

								Honza
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3f89d0ab08fc..cda191616fdb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -209,9 +209,21 @@  typedef struct ext4_io_end {
 
 struct ext4_io_submit {
 	struct writeback_control *io_wbc;
+	/*
+	 * When dioread_nolock is enabled, we don't submit bios for unwritten
+	 * extent while holding jbd2 handle, which can avoid jbd2 commit thread
+	 * wait io to complete, for this case, we will link bios cover the
+	 * extent to io_bio using bio->bi_private.
+	 */
 	struct bio		*io_bio;
 	ext4_io_end_t		*io_end;
 	sector_t		io_next_block;
+	/*
+	 * If not zero, we have an active bio and can submit this bio or add
+	 * new bh to this bio, if zero, we'll need to allocate a new bio.
+	 */
+	int have_active_bio;
+	int can_submit;
 };
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 22a9d8159720..09c8da5ef742 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2486,6 +2486,7 @@  static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
 			handle->h_rsv_handle = NULL;
 		}
+		mpd->io_submit.can_submit = 0;
 		ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
 	}
 
@@ -2908,7 +2909,9 @@  static int ext4_writepages(struct address_space *mapping,
 			handle = NULL;
 			mpd.do_map = 0;
 		}
-		/* Submit prepared bio */
+		/* Submit all prepared bios */
+		if (!mpd.io_submit.can_submit)
+			mpd.io_submit.can_submit = 1;
 		ext4_io_submit(&mpd.io_submit);
 		/* Unlock pages we didn't use */
 		mpage_release_unused_pages(&mpd, give_up_on_write);
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index db7590178dfc..0b3b6fbccf6b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -346,15 +346,29 @@  static void ext4_end_bio(struct bio *bio)
 
 void ext4_io_submit(struct ext4_io_submit *io)
 {
-	struct bio *bio = io->io_bio;
-
-	if (bio) {
-		int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
-				  REQ_SYNC : 0;
-		io->io_bio->bi_write_hint = io->io_end->inode->i_write_hint;
-		bio_set_op_attrs(io->io_bio, REQ_OP_WRITE, io_op_flags);
-		submit_bio(io->io_bio);
+	struct bio *bio = io->io_bio, *next = NULL;
+	int io_op_flags = io->io_wbc->sync_mode == WB_SYNC_ALL ?
+				REQ_SYNC : 0;
+
+	if (!io->can_submit) {
+		/*
+		 * Caller tries to submit this bio, but currently we can not
+		 * submit this bio, we set have_active_bio to zero, then caller
+		 * will allocate a new bio.
+		 */
+		io->have_active_bio = 0;
+		return;
 	}
+
+	for (bio = io->io_bio; bio; bio = next) {
+		next = bio->bi_private;
+		bio->bi_write_hint = io->io_end->inode->i_write_hint;
+		bio_set_op_attrs(bio, REQ_OP_WRITE, io_op_flags);
+		bio->bi_private = ext4_get_io_end(io->io_end);
+		submit_bio(bio);
+
+	}
+	io->have_active_bio = 0;
 	io->io_bio = NULL;
 }
 
@@ -364,6 +378,13 @@  void ext4_io_submit_init(struct ext4_io_submit *io,
 	io->io_wbc = wbc;
 	io->io_bio = NULL;
 	io->io_end = NULL;
+	/*
+	 * When dioread_nolock is enabled and submit unwritten extents,
+	 * set can_submit to zero, then we'll accumulate bios for this
+	 * extent and submit these bios after drop jdb2 handle.
+	 */
+	io->can_submit = 1;
+	io->have_active_bio = 0;
 }
 
 static int io_submit_init_bio(struct ext4_io_submit *io,
@@ -378,8 +399,10 @@  static int io_submit_init_bio(struct ext4_io_submit *io,
 	bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio_set_dev(bio, bh->b_bdev);
 	bio->bi_end_io = ext4_end_bio;
-	bio->bi_private = ext4_get_io_end(io->io_end);
+	/* Point to previous bio if there is */
+	bio->bi_private = io->io_bio;
 	io->io_bio = bio;
+	io->have_active_bio = 1;
 	io->io_next_block = bh->b_blocknr;
 	return 0;
 }
@@ -391,11 +414,11 @@  static int io_submit_add_bh(struct ext4_io_submit *io,
 {
 	int ret;
 
-	if (io->io_bio && bh->b_blocknr != io->io_next_block) {
+	if (io->have_active_bio && bh->b_blocknr != io->io_next_block) {
 submit_and_retry:
 		ext4_io_submit(io);
 	}
-	if (io->io_bio == NULL) {
+	if (!io->have_active_bio) {
 		ret = io_submit_init_bio(io, bh);
 		if (ret)
 			return ret;