
huge fsync latencies for a small file on ext4

Message ID b1ba05256affe171c2afca6c3aceeb7a@codeaurora.org
State Rejected
Series huge fsync latencies for a small file on ext4

Commit Message

Sahitya Tummala Feb. 19, 2019, 10:20 a.m. UTC
Hi,

I am observing huge fsync latencies for a small file under the below
test scenario -

process A -
Issue an async write of 4GB using the dd command (say, large_file) on
/data mounted with ext4:
dd if=/dev/zero of=/data/testfile bs=1M count=4096

process B -
In parallel, another process writes 4KB of data to another file
(say, small_file) and issues fsync() on this file.

Problem -
The fsync() on the 4KB file takes up to ~30sec (worst-case latency).
This is tested on an eMMC based device.

Observations -
This happens when small_file and large_file are both part of the same
committing transaction, or when small_file is part of the running
transaction while large_file is part of the committing transaction.

During the commit of a transaction which includes large_file, the jbd2
thread runs journal_finish_inode_data_buffers(), which calls
filemap_fdatawait_keep_errors() on the file's inode address space.
While this is happening, if the writeback thread is running in parallel
for large_file, then filemap_fdatawait_keep_errors() can end up looping
over all of its pages (up to 4GB of data) and waiting for all of the
file's data to be written to disk, in the current transaction context
itself. At the time journal_finish_inode_data_buffers() is called, the
file size is only 150MB, but by the time filemap_fdatawait_keep_errors()
returns, the file size is 4GB and the page index in
__filemap_fdatawait_range() points to the 4GB file offset, indicating
that it has scanned and waited for writeback of all the pages up to 4GB
and not just 150MB.

Ideally, I think the jbd2 thread should wait only for the amount of data
it has submitted as part of the current transaction, and not for the
on-going pages that are getting tagged for writeback in parallel in
another context. So along these lines, I have tried to use the inode's
size at the time of calling journal_finish_inode_data_buffers() as
below -

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2eb55c3..e86cf67 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
                        continue;
                jinode->i_flags |= JI_COMMIT_RUNNING;
                spin_unlock(&journal->j_list_lock);
-               err = filemap_fdatawait_keep_errors(
-                               jinode->i_vfs_inode->i_mapping);
+               err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
+                               0, i_size_read(jinode->i_vfs_inode->i_mapping->host));
                 if (!ret)
                         ret = err;
                 spin_lock(&journal->j_list_lock);

With this, the fsync latencies for small_file reduced significantly,
to at most ~5sec (worst-case latency).

Although this was seen with test code, it could potentially impact a
phone's performance if any application or the main UI thread in Android
issues fsync() in the foreground while a large data transfer is going
on in another context.

Request you to share your thoughts and comments on this issue
and the fix suggested above.

Thanks,
Sahitya.

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project.

Comments

Jan Kara Feb. 19, 2019, 1:53 p.m. UTC | #1
Hi!

On Tue 19-02-19 15:50:23, stummala@codeaurora.org wrote:
> I am observing huge fsync latencies for a small file under the below test
> scenario -
> 
> process A -
> Issue async write of 4GB using dd command (say large_file) on /data mounted
> with ext4:
> dd if=/dev/zero of=/data/testfile bs=1M count=4096
> 
> process B -
> In parallel another process wrote a small 4KB data to another file
> (say, small_file) and has issued fsync on this file.
> 
> Problem -
> The fsync() on 4KB file, is taking upto ~30sec (worst case latency).
> This is tested on an eMMC based device.
> 
> Observations -
> This happens when the small_file and large_file both are part of the same
> committing transaction or when the small_file is part of the running
> transaction
> while large_file is part of the committing transaction.
> 
> During the commit of a transaction which includes large_file, the jbd2
> thread
> does journal_finish_inode_data_buffers() by calling
> filemap_fdatawait_keep_errors() on the file's inode address space. While
> this is
> happening, if the writeback thread is running in parallel for the
> large_file, then
> filemap_fdatawait_keep_errors() could potentially run in a loop of all the
> pages (upto 4GB of data) and also wait for all the file's data to be written
> to the disk in the current transaction context itself. At the time
> of calling journal_finish_inode_data_buffers(), the file size is of only
> 150MB.
> and by the time filemap_fdatawait_keep_errors() returns, the file size is
> 4GB
> and the page index also points to 4GB file offset in
> __filemap_fdatawait_range(), indicating that it has scanned and waited for
> writeback
> all the pages upto 4GB and not just 150MB.

Thanks for the detailed analysis! I'm somewhat surprised that the flusher
is able to submit a new batch of pages for writeback faster than
__filemap_fdatawait_range() is scanning the radix tree, but it is certainly
a possibility.

> Ideally, I think the jbd2 thread should have waited for only the amount of
> data
> it has submitted as part of the current transaction and not to wait for the
> on-going pages that are getting tagged for writeback in parallel in another
> context.
> So along these lines, I have tried to use the inode's size at the time of
> calling
> journal_finish_inode_data_buffers() as below -

One has to be really careful when using i_size like this. By the time the
transaction is committing, i_size could have been reduced from the value at
the time page writeback was issued. And that change will be journalled only
in the following transaction. So if the system crashes in the wrong moment,
user could see uninitialized blocks between new_size and old_size after
journal replay. So I don't think your patch is really correct.

Ted has outlined a plan how to get rid of data=ordered limitations [1] and
thus also this problem. It is quite some work but you're certainly welcome
to help out :)

								Honza

[1] https://www.spinics.net/lists/linux-ext4/msg64175.html

> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 2eb55c3..e86cf67 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>                         continue;
>                 jinode->i_flags |= JI_COMMIT_RUNNING;
>                 spin_unlock(&journal->j_list_lock);
> -               err = filemap_fdatawait_keep_errors(
> -                               jinode->i_vfs_inode->i_mapping);
> +               err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
> +                               0, i_size_read(jinode->i_vfs_inode->i_mapping->host));
>                 if (!ret)
>                         ret = err;
>                 spin_lock(&journal->j_list_lock);
> 
> With this, the fsync latencies for small_file have reduced significantly.
> It took upto max ~5sec (worst case latency).
> 
> Although this is seen in a test code, this could potentially impact the
> phone's performance if any application or main UI thread in Android issues
> fsync() in foreground while a large data transfer is going on in another
> context.
> 
> Request you to share your thoughts and comments on this issue
> and the fix suggested above.
> 
> Thanks,
> Sahitya.
> 
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
> Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
> Foundation Collaborative Project.
Sahitya Tummala Feb. 25, 2019, 5:10 a.m. UTC | #2
On Tue, Feb 19, 2019 at 02:53:02PM +0100, Jan Kara wrote:
> Hi!
> 
> On Tue 19-02-19 15:50:23, stummala@codeaurora.org wrote:
> > I am observing huge fsync latencies for a small file under the below test
> > scenario -
> > 
> > process A -
> > Issue async write of 4GB using dd command (say large_file) on /data mounted
> > with ext4:
> > dd if=/dev/zero of=/data/testfile bs=1M count=4096
> > 
> > process B -
> > In parallel another process wrote a small 4KB data to another file
> > (say, small_file) and has issued fsync on this file.
> > 
> > Problem -
> > The fsync() on 4KB file, is taking upto ~30sec (worst case latency).
> > This is tested on an eMMC based device.
> > 
> > Observations -
> > This happens when the small_file and large_file both are part of the same
> > committing transaction or when the small_file is part of the running
> > transaction
> > while large_file is part of the committing transaction.
> > 
> > During the commit of a transaction which includes large_file, the jbd2
> > thread
> > does journal_finish_inode_data_buffers() by calling
> > filemap_fdatawait_keep_errors() on the file's inode address space. While
> > this is
> > happening, if the writeback thread is running in parallel for the
> > large_file, then
> > filemap_fdatawait_keep_errors() could potentially run in a loop of all the
> > pages (upto 4GB of data) and also wait for all the file's data to be written
> > to the disk in the current transaction context itself. At the time
> > of calling journal_finish_inode_data_buffers(), the file size is of only
> > 150MB.
> > and by the time filemap_fdatawait_keep_errors() returns, the file size is
> > 4GB
> > and the page index also points to 4GB file offset in
> > __filemap_fdatawait_range(), indicating that it has scanned and waited for
> > writeback
> > all the pages upto 4GB and not just 150MB.
> 
> Thanks for the detailed analysis! I'm somewhat surprised that the flusher
> is able to submit new batch of pages for writeback faster than
> __filemap_fdatawait_range() is scanning the radix tree but it is certainly
> a possibility.
> 
> > Ideally, I think the jbd2 thread should have waited for only the amount of
> > data
> > it has submitted as part of the current transaction and not to wait for the
> > on-going pages that are getting tagged for writeback in parallel in another
> > context.
> > So along these lines, I have tried to use the inode's size at the time of
> > calling
> > journal_finish_inode_data_buffers() as below -
> 
> One has to be really careful when using i_size like this. By the time the
> transaction is committing, i_size could have been reduced from the value at
> the time page writeback was issued. And that change will be journalled only
> in the following transaction. So if the system crashes in the wrong moment,
> user could see uninitialized blocks between new_size and old_size after
> journal replay. So I don't think your patch is really correct.
> 

Thanks Jan for the clarification on the patch. I agree with your comments.

From that discussion, I think the problem being discussed is the journal
thread waiting for on-going active transaction updates to complete, and
thus causing commit latencies. And I think the proposal is to not hold
any handle while extents are being mapped in ext4_map_blocks(), but to
defer that until the IO is completely done.

And with the new proposal, since the inode will be added to
transaction->t_inode_list only after the IO is completed, there will no
longer be a need to do journal_finish_inode_data_buffers() in the
journal context, and thus this problem will also not be observed? Is my
understanding correct, please clarify.

> Ted has outlined a plan how to get rid of data=ordered limitations [1] and
> thus also this problem. It is quite some work but you're certainly welcome
> to help out :)
> 
> 								Honza
> 
> [1] https://www.spinics.net/lists/linux-ext4/msg64175.html
> 
> > 
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 2eb55c3..e86cf67 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >                         continue;
> >                 jinode->i_flags |= JI_COMMIT_RUNNING;
> >                 spin_unlock(&journal->j_list_lock);
> > -               err = filemap_fdatawait_keep_errors(
> > -                               jinode->i_vfs_inode->i_mapping);
> > +               err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
> > +                               0, i_size_read(jinode->i_vfs_inode->i_mapping->host));
> >                 if (!ret)
> >                         ret = err;
> >                 spin_lock(&journal->j_list_lock);
> > 
> > With this, the fsync latencies for small_file have reduced significantly.
> > It took upto max ~5sec (worst case latency).
> > 
> > Although this is seen in a test code, this could potentially impact the
> > phone's performance if any application or main UI thread in Android issues
> > fsync() in foreground while a large data transfer is going on in another
> > context.
> > 
> > Request you to share your thoughts and comments on this issue
> > and the fix suggested above.
> > 
> > Thanks,
> > Sahitya.
> > 
> > --
> > Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
> > Inc.
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
> > Foundation Collaborative Project.
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara Feb. 26, 2019, 8:30 a.m. UTC | #3
On Mon 25-02-19 10:40:07, Sahitya Tummala wrote:
> On Tue, Feb 19, 2019 at 02:53:02PM +0100, Jan Kara wrote:
> > One has to be really careful when using i_size like this. By the time the
> > transaction is committing, i_size could have been reduced from the value at
> > the time page writeback was issued. And that change will be journalled only
> > in the following transaction. So if the system crashes in the wrong moment,
> > user could see uninitialized blocks between new_size and old_size after
> > journal replay. So I don't think your patch is really correct.
> > 
> 
> Thanks Jan for the clarification on the patch. I agree with your comments.
> 
> From that discussion, I think the problem that it is discussing is w.r.t
> journal thread waiting for on-going active transaction updates to be done
> and thus causing commit latencies.

Yes.

> And I think the proposal is to do not
> hold any handle while extents are being mapped in ext4_map_blocks() but
> defer it till IO is completely done.

Yes, real block allocation and insertion in extent tree will happen after
IO completion.

> And with the new proposal since the inode will be added to
> transaction->t_inode_list only after the IO is completed, there will be
> no longer the need to do journal_finish_inode_data_buffers() in the journal
> context and thus this problem also will not be observed? Is my understanding
> correct, please clarify.

Actually, with the new proposal, we can just completely stop adding inodes
to transaction->t_inode_list. But otherwise you're right.

									Honza

> 
> > Ted has outlined a plan how to get rid of data=ordered limitations [1] and
> > thus also this problem. It is quite some work but you're certainly welcome
> > to help out :)
> > 
> > [1] https://www.spinics.net/lists/linux-ext4/msg64175.html

Patch

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2eb55c3..e86cf67 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -261,8 +261,8 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
                        continue;
                jinode->i_flags |= JI_COMMIT_RUNNING;
                spin_unlock(&journal->j_list_lock);
-               err = filemap_fdatawait_keep_errors(
-                               jinode->i_vfs_inode->i_mapping);
+               err = filemap_fdatawait_range(jinode->i_vfs_inode->i_mapping,
+                               0, i_size_read(jinode->i_vfs_inode->i_mapping->host));