Ext4: fix slow writeback under dioread_nolock and nodelalloc

Message ID 1542694270-47732-1-git-send-email-bo.liu@linux.alibaba.com
State New
Headers show
Series
  • Ext4: fix slow writeback under dioread_nolock and nodelalloc
Related show

Commit Message

Liu Bo Nov. 20, 2018, 6:11 a.m.
With "nodelalloc", blocks are allocated at the time of writing, and with
"dioread_nolock", these allocated blocks are marked as unwritten as well,
so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.

Everything looks normal except with "dioread_nolock", all allocated
extents are with EXT4_GET_BLOCKS_PRE_IO, which doesn't allow merging
adjacent extents.

And when it comes to writepages, given the fact that bh marked as
BH_Unwritten, it has to hold a journal handle to process these extents,
but when writepages() prepared a bunch of pages in a mpd, it could only
find one block to map to and submit one page at a time, and loop to the
next page over and over again.

ext4_writepages
  ...
  # starting from the 1st dirty page
  ext4_journal_start_with_reserve
  mpage_prepare_extent_to_map
    # batch up to 2048 dirty pages
  mpage_map_and_submit_extent
    mpage_map_one_extent
      ext4_map_blocks #with EXT4_GET_BLOCKS_IO_CREATE_EXT
        ext4_ext_map_blocks
	  ext4_find_extent
	    # find an extent with only one block at the offset
	  ext4_ext_handle_unwritten_extents
	    # try to split due to EXT4_GET_BLOCKS_PRE_IO,
	    # but no need to in this case as there is
	    # only one block in this extent
    mpage_map_and_submit_buffers
      #submit io for only 1st page
  #start from the 2nd dirty page
  ...

---

Given this is for buffered writes, the nice thing we want from
"dioread_nolock" is that extents are converted from unwritten at endio, so
thus we really don't have to take PRE_IO which is desigend for direct IO
path originally.

With this, we do extent merging in case of "nodelalloc" and writeback
doesn't need to do those extra batching and looping, the performance
number is shown as follows:

mount -o dioread_nolock,nodelalloc /dev/loop0 /mnt/
xfs_io -f -c "pwrite -W 0 1G" $M/foobar

- w/o:
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0:02:27.00 (6.951 MiB/sec and 1779.3791 ops/sec)

- w/
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0:00:06.00 (161.915 MiB/sec and 41450.3184 ops/sec)

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
---
 fs/ext4/extents.c | 8 +++++++-
 fs/ext4/inode.c   | 2 +-
 2 files changed, 8 insertions(+), 2 deletions(-)

Comments

Theodore Y. Ts'o Nov. 21, 2018, 12:07 a.m. | #1
On Tue, Nov 20, 2018 at 02:11:10PM +0800, Liu Bo wrote:
> With "nodelalloc", blocks are allocated at the time of writing, and with
> "dioread_nolock", these allocated blocks are marked as unwritten as well,
> so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.

Hi Liu,

Could I ask a larger question, which is why is it that you find it
desirable to use both nodelalloc and dioread_nolock at the same time?
What's the use case which is causing you to want to do this?

One of the things which I have been thinking about doing to simplify
ext4's test matrix and so we can improve performance in a more direct
way --- and also so it would make it much easier for us to switch over
to using more of iomap --- was to consider deprecating and removing
some of ext4's flexibility.  This includes:

*) Making dioread_nolock the default for 4k --- and fixing
   dioread_nolock so it works for cases where page_size != block_size.
   Once that is done, we would use dioread_nolock everywhere and drop
   the older code path.

*) Drop support for data=journal mode (it could be specified, but it
   would be ignored).  It's not clear anyone is using it, since the
   double-write overhead is huge.  Furthermore, requiring the use of
   nodelalloc means is a further performance hit on data=journal mode.

*) Drop support for nodelalloc, and make delayed allocation the
   default.  (This requires dropping data=journal mode, since
   data=journal mode does not co-exist with delalloc.)

The last was assuming that there really wasn't good use cases where
people would really want to use nodelalloc, aside from ext3
bug-for-bug compatibility, and hopefully since most people are using
ext4 these days, it becomes easier to make the case tha we don't need
to support ext3's unusual performance attributes.

The fact that you have done work trying to improve dioread_nolock &&
nodelalloc implies that perhaps I'm wrong about my assumption.  So I'd
like to understand more about what it is that you're trying to do, and
to figure out of perhaps there's a better way to do it.  After all,
delalloc makes it a lot easier for the block allocator to do a good
job, and it also decreases the CPU overhead for the block allocator.
And while forcing writes to the disk every five seconds might be
advantageous for some application, it also can add significant latency
variability for file system operations during a commit.

What am I missing?

Cheers,

						- Ted
Liu Bo Nov. 21, 2018, 1:30 a.m. | #2
Hi Ted,

On Tue, Nov 20, 2018 at 07:07:18PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Nov 20, 2018 at 02:11:10PM +0800, Liu Bo wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
> 
> Hi Liu,
> 
> Could I ask a larger question, which is why is it that you find it
> desirable to use both nodelalloc and dioread_nolock at the same time?
> What's the use case which is causing you to want to do this?
>

So the story started with the fact that we found with '-o delalloc',
applications were experiencing latency fluctuation because of
%i_data_sem's contention between sys_write(2) and writepages().

Things are not always bad with "delalloc", but if applications do
append only writing where new blocks are always needed and are latency
critical workloads, then it'll be noticable.

On a centOS7, a fio test doing a buffered write workload with bs=4k,
5M/s append writes and runtime=180s, (results may be different on
upstream ext4.)

------------------------
- delalloc: 
lat (usec): min=2 , max=193466 , avg= 5.86, stdev=227.91

-nodelalloc:
lat (usec): min=3 , max=16388 , avg= 7.00, stdev=28.92
------------------------

For dioread_nolock, this is suggested to fix the IO priority inversion
issue we've already discussed. (but there're other priority issues,
e.g. do_get_write_access()'s lock_buffer, buffer_shadow())

> One of the things which I have been thinking about doing to simplify
> ext4's test matrix and so we can improve performance in a more direct
> way --- and also so it would make it much easier for us to switch over
> to using more of iomap --- was to consider deprecating and removing
> some of ext4's flexibility.  This includes:
> 
> *) Making dioread_nolock the default for 4k --- and fixing
>    dioread_nolock so it works for cases where page_size != block_size.
>    Once that is done, we would use dioread_nolock everywhere and drop
>    the older code path.
> 
> *) Drop support for data=journal mode (it could be specified, but it
>    would be ignored).  It's not clear anyone is using it, since the
>    double-write overhead is huge.  Furthermore, requiring the use of
>    nodelalloc means is a further performance hit on data=journal mode.
> 
> *) Drop support for nodelalloc, and make delayed allocation the
>    default.  (This requires dropping data=journal mode, since
>    data=journal mode does not co-exist with delalloc.)
> 
> The last was assuming that there really wasn't good use cases where
> people would really want to use nodelalloc, aside from ext3
> bug-for-bug compatibility, and hopefully since most people are using
> ext4 these days, it becomes easier to make the case tha we don't need
> to support ext3's unusual performance attributes.
> 
> The fact that you have done work trying to improve dioread_nolock &&
> nodelalloc implies that perhaps I'm wrong about my assumption.  So I'd
> like to understand more about what it is that you're trying to do, and
> to figure out of perhaps there's a better way to do it.  After all,
> delalloc makes it a lot easier for the block allocator to do a good
> job, and it also decreases the CPU overhead for the block allocator.
> And while forcing writes to the disk every five seconds might be
> advantageous for some application, it also can add significant latency
> variability for file system operations during a commit.
> 
> What am I missing?
>

So looks like the only big complain from us is the contention on
i_data_sem.

If upstream ext4 has already addressed the problems, please let me
know, I'm happy to suggest people to switch to delalloc.

thanks,
-liubo
Artem Blagodarenko Nov. 21, 2018, 4:40 p.m. | #3
Hello Theodore,

> On 21 Nov 2018, at 03:07, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
> *) Drop support for nodelalloc, and make delayed allocation the
>   default.  (This requires dropping data=journal mode, since
>   data=journal mode does not co-exist with delalloc.)
> 
> The last was assuming that there really wasn't good use cases where
> people would really want to use nodelalloc, aside from ext3
> bug-for-bug compatibility, and hopefully since most people are using
> ext4 these days, it becomes easier to make the case tha we don't need
> to support ext3's unusual performance attributes.

Currently Lustre FS uses “nodelalloc” option at least because of this two reasons:

- OSD (layer that interact with EXT4 (LDISKFS backend) expect synchronous behaviour.

- delalloc optimise block allocation the way useless for Lustre FS, because data is striped across of OSTs. 
Lustre FS has its own optimisations. Its behaviour similar with data=journal journaling mode. “Journal” is stored on clients and OST recovery is like journal replay.

Best regards,
Artem Blagodarenko.

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 240b6dea5441..2a95f55563ba 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1718,6 +1718,7 @@  static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
 				struct ext4_extent *ex2)
 {
 	unsigned short ext1_ee_len, ext2_ee_len;
+	bool dioread_nolock = false;
 
 	if (ext4_ext_is_unwritten(ex1) != ext4_ext_is_unwritten(ex2))
 		return 0;
@@ -1741,10 +1742,15 @@  static int ext4_ext_correct_indexes(handle_t *handle, struct inode *inode,
 	 * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
 	 * dropping i_data_sem. But reserved blocks should save us in that
 	 * case.
+	 *
+	 * In case of dioread_nolock, we allow merging extent for buffered
+	 * writes as the split happens in ext4_writepages (where blocks have
+	 * been reserved for updating extent) instead of endio.
 	 */
+	dioread_nolock = ext4_should_dioread_nolock(inode);
 	if (ext4_ext_is_unwritten(ex1) &&
 	    (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
-	     atomic_read(&EXT4_I(inode)->i_unwritten) ||
+	     (!dioread_nolock && atomic_read(&EXT4_I(inode)->i_unwritten)) ||
 	     (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
 		return 0;
 #ifdef AGGRESSIVE_TEST
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6df9c6d60981..71fb3e1654f8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -812,7 +812,7 @@  int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 	ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
 	return _ext4_get_block(inode, iblock, bh_result,
-			       EXT4_GET_BLOCKS_IO_CREATE_EXT);
+			       EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
 }
 
 /* Maximum number of blocks we map for direct IO at once. */