[v2,2/2] ext4: fix slow writeback under dioread_nolock and nodelalloc

Message ID 20181215054840.5960-2-xiaoguang.wang@linux.alibaba.com
State New
Headers show
Series
  • [v2,1/2] ext4: try to merge unwritten extents who are also not under io
Related show

Commit Message

Xiaoguang Wang Dec. 15, 2018, 5:48 a.m.
With "nodelalloc", blocks are allocated at the time of writing, and with
"dioread_nolock", these allocated blocks are marked as unwritten as well,
so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.

Everything looks normal except with "dioread_nolock", all allocated
extents are with EXT4_GET_BLOCKS_PRE_IO, which doesn't allow merging
adjacent extents.

And when it comes to writepages, given the fact that bh marked as
BH_Unwritten, it has to hold a journal handle to process these extents,
but when writepages() prepared a bunch of pages in a mpd, it could only
find one block to map to and submit one page at a time, and loop to the
next page over and over again.

ext4_writepages
  ...
  # starting from the 1st dirty page
  ext4_journal_start_with_reserve
  mpage_prepare_extent_to_map
    # batch up to 2048 dirty pages
  mpage_map_and_submit_extent
    mpage_map_one_extent
      ext4_map_blocks #with EXT4_GET_BLOCKS_IO_CREATE_EXT
        ext4_ext_map_blocks
          ext4_find_extent
            # find an extent with only one block at the offset
          ext4_ext_handle_unwritten_extents
            # try to split due to EXT4_GET_BLOCKS_PRE_IO,
            # but no need to in this case as there is
            # only one block in this extent
    mpage_map_and_submit_buffers
      #submit io for only 1st page
  #start from the 2nd dirty page
  ...

---

Given this is for buffered writes, the nice thing we want from
"dioread_nolock" is that extents are converted from unwritten at endio, so
thus we really don't have to take PRE_IO which is desigend for direct IO
path originally.

With this, we do extent merging in case of "nodelalloc" and writeback
doesn't need to do those extra batching and looping, the performance
number is shown as follows:

mount -o dioread_nolock,nodelalloc /dev/loop0 /mnt/
xfs_io -f -c "pwrite -W 0 1G" $M/foobar

- w/o:
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0:02:27.00 (6.951 MiB/sec and 1779.3791 ops/sec)

- w/
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0:00:06.00 (161.915 MiB/sec and 41450.3184 ops/sec)

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 fs/ext4/extents.c | 6 +++++-
 fs/ext4/inode.c   | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

Comments

Theodore Y. Ts'o Jan. 23, 2019, 6:27 a.m. | #1
On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> With "nodelalloc", blocks are allocated at the time of writing, and with
> "dioread_nolock", these allocated blocks are marked as unwritten as well,
> so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.

I've been looking at your patches, and it seems that a simpler way,
perhaps more maintainable approach in the long term is to change how
we write to newly allocated blocks.  Today, we have two ways of doing
this:

1) In the dioread_nolock case, we allocate blocks, insert an entry in
the extent tree with the blocks marked uninitialized, write the
blocks, and then mark the blocks initialized.

2) In the !dioread_nolock case, we allocate blocks, insert an entry to
the extent tree, write the blocks --- and if we start a commit, we
write out all dirty pages associated with that inode (in the default
data=writeback case) to avoid stale writes.

So what if we change the dioread_nolock case to do write the blocks
first, and *then* insert the entry into the extent tree?  This avoids
stale data getting exposed, either by a direct I/O read, or after a
crash (which means we avoid needing to do the force write-out).

So what we would need to do is to pass a flag to ext4_map_blocks()
which causes it to *not* make any on-disk changes.  Instead, it would
track the fact that blocks have be reserved in the buddy bitmap (this
is how we prevent blocks from being preallocated after they are
deleted, but before the transaction has been committed), and the
location of the assigned blocks in the extent_status tree.  Since no
on-disk changes are being made, we wouldn't need to hold the
transaction open.

Then in the callback after the blocks are written, using the starting
logical block number stored in the io_end structure, we either convert
the unwritten extents or actually insert the newly allocated blocks in
the extent tree and update the on-disk bitmap allocation bitmaps.

Once we get this working, it should be easy to make dioread_nolock for
1k block sizes; it keeps the time that the handle open very short; and
it completely obviates the need for data=writeback.

What do folks think?

						 - Ted
Jan Kara Jan. 23, 2019, 12:48 p.m. | #2
On Wed 23-01-19 01:27:38, Theodore Y. Ts'o wrote:
> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
> 
> I've been looking at your patches, and it seems that a simpler way,
> perhaps more maintainable approach in the long term is to change how
> we write to newly allocated blocks.  Today, we have two ways of doing
> this:
> 
> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
> the extent tree with the blocks marked uninitialized, write the
> blocks, and then mark the blocks initialized.
> 
> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to
> the extent tree, write the blocks --- and if we start a commit, we
> write out all dirty pages associated with that inode (in the default
> data=writeback case) to avoid stale writes.
> 
> So what if we change the dioread_nolock case to do write the blocks
> first, and *then* insert the entry into the extent tree?  This avoids
> stale data getting exposed, either by a direct I/O read, or after a
> crash (which means we avoid needing to do the force write-out).
> 
> So what we would need to do is to pass a flag to ext4_map_blocks()
> which causes it to *not* make any on-disk changes.  Instead, it would
> track the fact that blocks have be reserved in the buddy bitmap (this
> is how we prevent blocks from being preallocated after they are
> deleted, but before the transaction has been committed), and the
> location of the assigned blocks in the extent_status tree.  Since no
> on-disk changes are being made, we wouldn't need to hold the
> transaction open.
> 
> Then in the callback after the blocks are written, using the starting
> logical block number stored in the io_end structure, we either convert
> the unwritten extents or actually insert the newly allocated blocks in
> the extent tree and update the on-disk bitmap allocation bitmaps.
> 
> Once we get this working, it should be easy to make dioread_nolock for
> 1k block sizes; it keeps the time that the handle open very short; and
> it completely obviates the need for data=writeback.
> 
> What do folks think?

Hum, so there is the problem that adding extent to the extent tree may need
some block allocations for metadata. So we'd have to carry over delalloc
reservations upto the io-end time. But that should be doable, just needs
some work. Also in the dioread_nolock case we don't have problems with
page lock & page writeback vs transaction start deadlocks as I've described
in my another email regarding ext4_writepages(). So I don't see any hole in
this and the performance should be good. I like this!

								Honza
Liu Bo Jan. 24, 2019, 1:34 a.m. | #3
On Tue, Jan 22, 2019 at 10:30 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
> > With "nodelalloc", blocks are allocated at the time of writing, and with
> > "dioread_nolock", these allocated blocks are marked as unwritten as well,
> > so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
>
> I've been looking at your patches, and it seems that a simpler way,
> perhaps more maintainable approach in the long term is to change how
> we write to newly allocated blocks.  Today, we have two ways of doing
> this:
>
> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
> the extent tree with the blocks marked uninitialized, write the
> blocks, and then mark the blocks initialized.
>
> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to
> the extent tree, write the blocks --- and if we start a commit, we
> write out all dirty pages associated with that inode (in the default
> data=writeback case) to avoid stale writes.
>
> So what if we change the dioread_nolock case to do write the blocks
> first, and *then* insert the entry into the extent tree?  This avoids
> stale data getting exposed, either by a direct I/O read, or after a
> crash (which means we avoid needing to do the force write-out).
>
> So what we would need to do is to pass a flag to ext4_map_blocks()
> which causes it to *not* make any on-disk changes.  Instead, it would
> track the fact that blocks have be reserved in the buddy bitmap (this
> is how we prevent blocks from being preallocated after they are
> deleted, but before the transaction has been committed), and the
> location of the assigned blocks in the extent_status tree.  Since no
> on-disk changes are being made, we wouldn't need to hold the
> transaction open.
>
> Then in the callback after the blocks are written, using the starting
> logical block number stored in the io_end structure, we either convert
> the unwritten extents or actually insert the newly allocated blocks in
> the extent tree and update the on-disk bitmap allocation bitmaps.
>
> Once we get this working, it should be easy to make dioread_nolock for
> 1k block sizes; it keeps the time that the handle open very short; and
> it completely obviates the need for data=writeback.
>
> What do folks think?
>

So that (reserve, write, insert extent records) is basically what
btrfs is doing and I feel like it will work better than the current
way.

My only concern is performance since metadata reservation for delalloc
now becomes more and needs to be carried until endio, a perf. spike
would appear if the foreground writer needs to wait for flushing dirty
pages to reclaim metadata credits.

thanks,
liubo
Xiaoguang Wang Jan. 25, 2019, 2:02 a.m. | #4
hi,


> On Wed 23-01-19 01:27:38, Theodore Y. Ts'o wrote:
>> On Sat, Dec 15, 2018 at 01:48:40PM +0800, Xiaoguang Wang wrote:
>>> With "nodelalloc", blocks are allocated at the time of writing, and with
>>> "dioread_nolock", these allocated blocks are marked as unwritten as well,
>>> so bh(s) attached to the blocks have BH_Unwritten and BH_Mapped.
>>
>> I've been looking at your patches, and it seems that a simpler way,
>> perhaps more maintainable approach in the long term is to change how
>> we write to newly allocated blocks.  Today, we have two ways of doing
>> this:
>>
>> 1) In the dioread_nolock case, we allocate blocks, insert an entry in
>> the extent tree with the blocks marked uninitialized, write the
>> blocks, and then mark the blocks initialized.
>>
>> 2) In the !dioread_nolock case, we allocate blocks, insert an entry to
>> the extent tree, write the blocks --- and if we start a commit, we
>> write out all dirty pages associated with that inode (in the default
>> data=writeback case) to avoid stale writes.
>>
>> So what if we change the dioread_nolock case to do write the blocks
>> first, and *then* insert the entry into the extent tree?  This avoids
>> stale data getting exposed, either by a direct I/O read, or after a
>> crash (which means we avoid needing to do the force write-out).
>>
>> So what we would need to do is to pass a flag to ext4_map_blocks()
>> which causes it to *not* make any on-disk changes.  Instead, it would
>> track the fact that blocks have be reserved in the buddy bitmap (this
>> is how we prevent blocks from being preallocated after they are
>> deleted, but before the transaction has been committed), and the
>> location of the assigned blocks in the extent_status tree.  Since no
>> on-disk changes are being made, we wouldn't need to hold the
>> transaction open.
>>
>> Then in the callback after the blocks are written, using the starting
>> logical block number stored in the io_end structure, we either convert
>> the unwritten extents or actually insert the newly allocated blocks in
>> the extent tree and update the on-disk bitmap allocation bitmaps.
>>
>> Once we get this working, it should be easy to make dioread_nolock for
>> 1k block sizes; it keeps the time that the handle open very short; and
>> it completely obviates the need for data=writeback.
>>
>> What do folks think?
> 
> Hum, so there is the problem that adding extent to the extent tree may need
> some block allocations for metadata. So we'd have to carry over delalloc
> reservations upto the io-end time. But that should be doable, just needs
> some work. Also in the dioread_nolock case we don't have problems with
> page lock & page writeback vs transaction start deadlocks as I've described
> in my another email regarding ext4_writepages(). So I don't see any hole in
> this and the performance should be good. I like this!
First sorry for late response, I was busy with some internal work.
I really like your ideas, thanks both of you. Writing dirty pages while holding
open jbd2 handle will result in some issues, for example, system load will be high,
many tasks in the D state, some of them is stuck in wait_transaction_locked().

Currently I have tried some methods to fix this issue, for example, change
MAX_WRITEPAGES_EXTENT_LEN to 256, then we can make sure that one bio will
be generated at most, and this bio will be submitted after journal stop:
		if (!ext4_handle_valid(handle) || handle->h_sync == 0) {
			ext4_journal_stop(handle);
			handle = NULL;
			mpd.do_map = 0;
		}
		/* Submit prepared bio */
		ext4_io_submit(&mpd.io_submit);
After this change, the number of tasks in D state will decrease much,
Of course, my method is not good, I just say writing diry pages while holding
open jbd2 handle is not good :) and I believe that your ideas can fix this kind
of problems, thanks again.

Regards,
Xiaoguang Wang



Regards,
Xiaoguang Wang

> 
> 								Honza
>

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 444c739470a5..de73b0152892 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4038,9 +4038,13 @@  ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
 	/*
 	 * repeat fallocate creation request
 	 * we already have an unwritten extent
+	 *
+	 * With nodelalloc + dioread_nolock, write can also come here,
+	 * so make sure map is set with new to avoid exposing stale
+	 * data to reads.
 	 */
 	if (flags & EXT4_GET_BLOCKS_UNWRIT_EXT) {
-		map->m_flags |= EXT4_MAP_UNWRITTEN;
+		map->m_flags |= EXT4_MAP_UNWRITTEN | EXT4_MAP_NEW;
 		goto map_out;
 	}
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ba557a731081..4dbb43ab9d6e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -822,7 +822,7 @@  int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
 	ext4_debug("ext4_get_block_unwritten: inode %lu, create flag %d\n",
 		   inode->i_ino, create);
 	return _ext4_get_block(inode, iblock, bh_result,
-			       EXT4_GET_BLOCKS_IO_CREATE_EXT);
+			       EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
 }
 
 /* Maximum number of blocks we map for direct IO at once. */