Patchwork jbd/2[stable only]: Use WRITE_SYNC_PLUG in journal_commit_transaction.

login
register
mail settings
Submitter Tao Ma
Date July 12, 2011, 10:43 a.m.
Message ID <1310467431-23108-1-git-send-email-tm@tao.ma>
Download mbox | patch
Permalink /patch/104342/
State Not Applicable
Headers show

Comments

Tao Ma - July 12, 2011, 10:43 a.m.
From: Tao Ma <boyu.mt@taobao.com>

In commit 749ef9f8423, we use WRITE_SYNC instead of WRITE in
journal_commit_transaction. It causes a much heavy burden for
the disk as now the seqenctial write can't be merged(see the blktrace below).

Given the description of that commit 749ef9f8423, the reason why
we use WRITE_SYNC is that it wants to use REQ_NOIDLE and WRITE_SYNC_PLUG
also has that flag, so use WRITE_SYNC_PLUG instead. From blktrace,
we can get that:

without the patch:
  8,0    6       18     0.016058423  3342  D   W 461101317 + 4 [jbd2/sda11-8]
  8,0    6       19     0.016065473  3342  D   W 461101321 + 4 [jbd2/sda11-8]
  8,0    6       20     0.016070751  3342  D   W 461101325 + 4 [jbd2/sda11-8]
  8,0    6       21     0.016076180  3342  D   W 461101329 + 4 [jbd2/sda11-8]
  8,0    6       22     0.016081255  3342  D   W 461101333 + 4 [jbd2/sda11-8]
  8,0    6       23     0.016085963  3342  D   W 461101337 + 4 [jbd2/sda11-8]
  8,0    6       24     0.016182048     0  C   W 461101317 + 4 [0]
  8,0    6       25     0.016190820     0  C   W 461101325 + 4 [0]
  8,0    6       26     0.016193927     0  C   W 461101321 + 4 [0]
  8,0    6       27     0.016196532     0  C   W 461101333 + 4 [0]
  8,0    6       28     0.016199180     0  C   W 461101337 + 4 [0]
  8,0    6       29     0.016206180     0  C   W 461101329 + 4 [0]

with this patch:
  8,0    4       23     4.320315739  3129  D   W 461101317 + 24 [jbd2/sda11-8]
  8,0    4       24     4.320364518     0  C   W 461101317 + 24 [0]

This only affects .37 and .38 since Jens' new plug patches are included
in .39 and the bug is removed as a side effect. But I think it is needed
anyway for the stable. And RHEL6 needs this also I guess.

Cc: stable@kernel.org # 2.6.37 and 2.6.38
Cc: Jan Kara <jack@suse.cz>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
---
 fs/jbd/commit.c  |    9 +--------
 fs/jbd2/commit.c |    9 +--------
 2 files changed, 2 insertions(+), 16 deletions(-)
Vivek Goyal - July 12, 2011, 12:30 p.m.
On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
> From: Tao Ma <boyu.mt@taobao.com>
> 
> In commit 749ef9f8423, we use WRITE_SYNC instead of WRITE in
> journal_commit_transaction. It causes a much heavy burden for
> the disk as now the seqenctial write can't be merged(see the blktrace below).

Tao Ma,

Few queries.

- What's the workload you are using for this test.

- Do you see any performance improvement by switching to WRITE_SYNC_PLUG.

- Why writes are not being merged? Because request got dispatched
  immediately? Do you have logs for insertion of requests also.

- WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
  is doing unplug in this case?

- I am not sure in how many cases we are expecting to submit multiple
  sequential write here.

Thanks
Vivek

> 
> Given the description of that commit 749ef9f8423, the reason why
> we use WRITE_SYNC is that it wants to use REQ_NOIDLE and WRITE_SYNC_PLUG
> also has that flag, so use WRITE_SYNC_PLUG instead. From blktrace,
> we can get that:
> 
> without the patch:
>   8,0    6       18     0.016058423  3342  D   W 461101317 + 4 [jbd2/sda11-8]
>   8,0    6       19     0.016065473  3342  D   W 461101321 + 4 [jbd2/sda11-8]
>   8,0    6       20     0.016070751  3342  D   W 461101325 + 4 [jbd2/sda11-8]
>   8,0    6       21     0.016076180  3342  D   W 461101329 + 4 [jbd2/sda11-8]
>   8,0    6       22     0.016081255  3342  D   W 461101333 + 4 [jbd2/sda11-8]
>   8,0    6       23     0.016085963  3342  D   W 461101337 + 4 [jbd2/sda11-8]
>   8,0    6       24     0.016182048     0  C   W 461101317 + 4 [0]
>   8,0    6       25     0.016190820     0  C   W 461101325 + 4 [0]
>   8,0    6       26     0.016193927     0  C   W 461101321 + 4 [0]
>   8,0    6       27     0.016196532     0  C   W 461101333 + 4 [0]
>   8,0    6       28     0.016199180     0  C   W 461101337 + 4 [0]
>   8,0    6       29     0.016206180     0  C   W 461101329 + 4 [0]
> 
> with this patch:
>   8,0    4       23     4.320315739  3129  D   W 461101317 + 24 [jbd2/sda11-8]
>   8,0    4       24     4.320364518     0  C   W 461101317 + 24 [0]
> 
> This only affects .37 and .38 since Jens' new plug patches are included
> in .39 and the bug is removed as a side effect. But I think it is needed
> anyway for the stable. And RHEL6 needs this also I guess.
> 
> Cc: stable@kernel.org # 2.6.37 and 2.6.38
> Cc: Jan Kara <jack@suse.cz>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Corrado Zoccolo <czoccolo@gmail.com>
> Cc: Jens Axboe <jaxboe@fusionio.com>
> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
> ---
>  fs/jbd/commit.c  |    9 +--------
>  fs/jbd2/commit.c |    9 +--------
>  2 files changed, 2 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 34a4861..6d13df5 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
>  	int first_tag = 0;
>  	int tag_flag;
>  	int i;
> -	int write_op = WRITE_SYNC;
> +	int write_op = WRITE_SYNC_PLUG;
>  
>  	/*
>  	 * First job: lock down the current transaction and wait for
> @@ -327,13 +327,6 @@ void journal_commit_transaction(journal_t *journal)
>  	spin_lock(&journal->j_state_lock);
>  	commit_transaction->t_state = T_LOCKED;
>  
> -	/*
> -	 * Use plugged writes here, since we want to submit several before
> -	 * we unplug the device. We don't do explicit unplugging in here,
> -	 * instead we rely on sync_buffer() doing the unplug for us.
> -	 */
> -	if (commit_transaction->t_synchronous_commit)
> -		write_op = WRITE_SYNC_PLUG;
>  	spin_lock(&commit_transaction->t_handle_lock);
>  	while (commit_transaction->t_updates) {
>  		DEFINE_WAIT(wait);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index f3ad159..fc3840f 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	int tag_bytes = journal_tag_bytes(journal);
>  	struct buffer_head *cbh = NULL; /* For transactional checksums */
>  	__u32 crc32_sum = ~0;
> -	int write_op = WRITE_SYNC;
> +	int write_op = WRITE_SYNC_PLUG;
>  
>  	/*
>  	 * First job: lock down the current transaction and wait for
> @@ -363,13 +363,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	write_lock(&journal->j_state_lock);
>  	commit_transaction->t_state = T_LOCKED;
>  
> -	/*
> -	 * Use plugged writes here, since we want to submit several before
> -	 * we unplug the device. We don't do explicit unplugging in here,
> -	 * instead we rely on sync_buffer() doing the unplug for us.
> -	 */
> -	if (commit_transaction->t_synchronous_commit)
> -		write_op = WRITE_SYNC_PLUG;
>  	trace_jbd2_commit_locking(journal, commit_transaction);
>  	stats.run.rs_wait = commit_transaction->t_max_wait;
>  	stats.run.rs_locked = jiffies;
> -- 
> 1.7.4
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - July 12, 2011, 3:19 p.m.
On 07/12/2011 08:30 PM, Vivek Goyal wrote:
> On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
>> From: Tao Ma <boyu.mt@taobao.com>
>>
>> In commit 749ef9f8423, we use WRITE_SYNC instead of WRITE in
>> journal_commit_transaction. It causes a much heavy burden for
>> the disk as now the seqenctial write can't be merged(see the blktrace below).
> 
> Tao Ma,
> 
> Few queries.
> 
> - What's the workload you are using for this test.
A very simple one.
mkfs.ext4 -b 2048 /dev/sdx 10000000
sync
mount -t ext4 -o delalloc /dev/sdx /mnt/ext4
dd if=/dev/zero of=/mnt/ext4/a bs=1024K count=1
and run blktrace immediately after the 'dd'. When jbd2 begins to work,
you will get the blktrace output you want.
> 
> - Do you see any performance improvement by switching to WRITE_SYNC_PLUG.
I haven't done much tests yet. But I guess if there are many heavy sync
workload, we should suffer from some latency if we dispatch these
sequential write one by one. As I have said, Jens added plug/unplug in
39, and now these sequential write are dispatched in a one request. Run
the same test cases with 3.0-rcX, you will get the same result.
> 
> - Why writes are not being merged? Because request got dispatched
>   immediately? Do you have logs for insertion of requests also.
You can get it from the above test case.
> 
> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
>   is doing unplug in this case?
See the comments I removed, "we rely on sync_buffer() doing the unplug
for us". I removed them cause we all use pluged write now.
> 
> - I am not sure in how many cases we are expecting to submit multiple
>   sequential write here.
All the journal write will cause a sequential write to be split to many
requests here. So it would mean too much for metadata heavy test I think.

Thanks
Tao
> 
> Thanks
> Vivek
> 
>>
>> Given the description of that commit 749ef9f8423, the reason why
>> we use WRITE_SYNC is that it wants to use REQ_NOIDLE and WRITE_SYNC_PLUG
>> also has that flag, so use WRITE_SYNC_PLUG instead. From blktrace,
>> we can get that:
>>
>> without the patch:
>>   8,0    6       18     0.016058423  3342  D   W 461101317 + 4 [jbd2/sda11-8]
>>   8,0    6       19     0.016065473  3342  D   W 461101321 + 4 [jbd2/sda11-8]
>>   8,0    6       20     0.016070751  3342  D   W 461101325 + 4 [jbd2/sda11-8]
>>   8,0    6       21     0.016076180  3342  D   W 461101329 + 4 [jbd2/sda11-8]
>>   8,0    6       22     0.016081255  3342  D   W 461101333 + 4 [jbd2/sda11-8]
>>   8,0    6       23     0.016085963  3342  D   W 461101337 + 4 [jbd2/sda11-8]
>>   8,0    6       24     0.016182048     0  C   W 461101317 + 4 [0]
>>   8,0    6       25     0.016190820     0  C   W 461101325 + 4 [0]
>>   8,0    6       26     0.016193927     0  C   W 461101321 + 4 [0]
>>   8,0    6       27     0.016196532     0  C   W 461101333 + 4 [0]
>>   8,0    6       28     0.016199180     0  C   W 461101337 + 4 [0]
>>   8,0    6       29     0.016206180     0  C   W 461101329 + 4 [0]
>>
>> with this patch:
>>   8,0    4       23     4.320315739  3129  D   W 461101317 + 24 [jbd2/sda11-8]
>>   8,0    4       24     4.320364518     0  C   W 461101317 + 24 [0]
>>
>> This only affects .37 and .38 since Jens' new plug patches are included
>> in .39 and the bug is removed as a side effect. But I think it is needed
>> anyway for the stable. And RHEL6 needs this also I guess.
>>
>> Cc: stable@kernel.org # 2.6.37 and 2.6.38
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Vivek Goyal <vgoyal@redhat.com>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Corrado Zoccolo <czoccolo@gmail.com>
>> Cc: Jens Axboe <jaxboe@fusionio.com>
>> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
>> ---
>>  fs/jbd/commit.c  |    9 +--------
>>  fs/jbd2/commit.c |    9 +--------
>>  2 files changed, 2 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
>> index 34a4861..6d13df5 100644
>> --- a/fs/jbd/commit.c
>> +++ b/fs/jbd/commit.c
>> @@ -294,7 +294,7 @@ void journal_commit_transaction(journal_t *journal)
>>  	int first_tag = 0;
>>  	int tag_flag;
>>  	int i;
>> -	int write_op = WRITE_SYNC;
>> +	int write_op = WRITE_SYNC_PLUG;
>>  
>>  	/*
>>  	 * First job: lock down the current transaction and wait for
>> @@ -327,13 +327,6 @@ void journal_commit_transaction(journal_t *journal)
>>  	spin_lock(&journal->j_state_lock);
>>  	commit_transaction->t_state = T_LOCKED;
>>  
>> -	/*
>> -	 * Use plugged writes here, since we want to submit several before
>> -	 * we unplug the device. We don't do explicit unplugging in here,
>> -	 * instead we rely on sync_buffer() doing the unplug for us.
>> -	 */
>> -	if (commit_transaction->t_synchronous_commit)
>> -		write_op = WRITE_SYNC_PLUG;
>>  	spin_lock(&commit_transaction->t_handle_lock);
>>  	while (commit_transaction->t_updates) {
>>  		DEFINE_WAIT(wait);
>> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
>> index f3ad159..fc3840f 100644
>> --- a/fs/jbd2/commit.c
>> +++ b/fs/jbd2/commit.c
>> @@ -329,7 +329,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>>  	int tag_bytes = journal_tag_bytes(journal);
>>  	struct buffer_head *cbh = NULL; /* For transactional checksums */
>>  	__u32 crc32_sum = ~0;
>> -	int write_op = WRITE_SYNC;
>> +	int write_op = WRITE_SYNC_PLUG;
>>  
>>  	/*
>>  	 * First job: lock down the current transaction and wait for
>> @@ -363,13 +363,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>>  	write_lock(&journal->j_state_lock);
>>  	commit_transaction->t_state = T_LOCKED;
>>  
>> -	/*
>> -	 * Use plugged writes here, since we want to submit several before
>> -	 * we unplug the device. We don't do explicit unplugging in here,
>> -	 * instead we rely on sync_buffer() doing the unplug for us.
>> -	 */
>> -	if (commit_transaction->t_synchronous_commit)
>> -		write_op = WRITE_SYNC_PLUG;
>>  	trace_jbd2_commit_locking(journal, commit_transaction);
>>  	stats.run.rs_wait = commit_transaction->t_max_wait;
>>  	stats.run.rs_locked = jiffies;
>> -- 
>> 1.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH - July 12, 2011, 3:55 p.m.
On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
> This only affects .37 and .38 since Jens' new plug patches are included
> in .39 and the bug is removed as a side effect. But I think it is needed
> anyway for the stable. And RHEL6 needs this also I guess.

Both the .37 and .38-stable kernel trees are end-of-life and not having
any more releases, so there's no tree for these to be applied too,
sorry.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - July 13, 2011, 2:10 a.m.
On 07/12/2011 11:55 PM, Greg KH wrote:
> On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
>> This only affects .37 and .38 since Jens' new plug patches are included
>> in .39 and the bug is removed as a side effect. But I think it is needed
>> anyway for the stable. And RHEL6 needs this also I guess.
> 
> Both the .37 and .38-stable kernel trees are end-of-life and not having
> any more releases, so there's no tree for these to be applied too,
> sorry.
Fine. Since RHEL6 still has this problem(not sure SLES has this problem
or not), I may add a bug in redhat's bugzilla.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH - July 13, 2011, 2:17 a.m.
On Wed, Jul 13, 2011 at 10:10:03AM +0800, Tao Ma wrote:
> On 07/12/2011 11:55 PM, Greg KH wrote:
> > On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
> >> This only affects .37 and .38 since Jens' new plug patches are included
> >> in .39 and the bug is removed as a side effect. But I think it is needed
> >> anyway for the stable. And RHEL6 needs this also I guess.
> > 
> > Both the .37 and .38-stable kernel trees are end-of-life and not having
> > any more releases, so there's no tree for these to be applied too,
> > sorry.
> Fine. Since RHEL6 still has this problem(not sure SLES has this problem
> or not), I may add a bug in redhat's bugzilla.

Ok, but neither RHEL6 or SLES11 is based on .37 or .38, so it might not
be relevant for them either.

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - July 13, 2011, 2:21 a.m.
On 07/13/2011 10:17 AM, Greg KH wrote:
> On Wed, Jul 13, 2011 at 10:10:03AM +0800, Tao Ma wrote:
>> On 07/12/2011 11:55 PM, Greg KH wrote:
>>> On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
>>>> This only affects .37 and .38 since Jens' new plug patches are included
>>>> in .39 and the bug is removed as a side effect. But I think it is needed
>>>> anyway for the stable. And RHEL6 needs this also I guess.
>>>
>>> Both the .37 and .38-stable kernel trees are end-of-life and not having
>>> any more releases, so there's no tree for these to be applied too,
>>> sorry.
>> Fine. Since RHEL6 still has this problem(not sure SLES has this problem
>> or not), I may add a bug in redhat's bugzilla.
> 
> Ok, but neither RHEL6 or SLES11 is based on .37 or .38, so it might not
> be relevant for them either.
Thanks for the reminder.
Actually I used RHEL6.1 and it seems that Corrado's patch 749ef9f8423 is
backported by redhat's guy already since it does have the problem I
mentioned.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer - July 14, 2011, 4:30 p.m.
Tao Ma <tm@tao.ma> writes:

> On 07/12/2011 08:30 PM, Vivek Goyal wrote:
>> On Tue, Jul 12, 2011 at 06:43:51PM +0800, Tao Ma wrote:
>>> From: Tao Ma <boyu.mt@taobao.com>
>>>
>>> In commit 749ef9f8423, we use WRITE_SYNC instead of WRITE in
>>> journal_commit_transaction. It causes a much heavy burden for
>>> the disk as now the seqenctial write can't be merged(see the blktrace below).
>> 
>> Tao Ma,
>> 
>> Few queries.
>> 
>> - What's the workload you are using for this test.
> A very simple one.
> mkfs.ext4 -b 2048 /dev/sdx 10000000
> sync
> mount -t ext4 -o delalloc /dev/sdx /mnt/ext4
> dd if=/dev/zero of=/mnt/ext4/a bs=1024K count=1
> and run blktrace immediately after the 'dd'. When jbd2 begins to work,
> you will get the blktrace output you want.
>> 
>> - Do you see any performance improvement by switching to WRITE_SYNC_PLUG.
> I haven't done much tests yet. But I guess if there are many heavy sync
> workload, we should suffer from some latency if we dispatch these
> sequential write one by one. As I have said, Jens added plug/unplug in
> 39, and now these sequential write are dispatched in a one request. Run
> the same test cases with 3.0-rcX, you will get the same result.
>> 
>> - Why writes are not being merged? Because request got dispatched
>>   immediately? Do you have logs for insertion of requests also.
> You can get it from the above test case.

Writes aren't being merged because BIO_RW_UNPLUG is set, so
__make_request unplugs the device immediately after submitting the
request.

>> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
>>   is doing unplug in this case?
> See the comments I removed, "we rely on sync_buffer() doing the unplug
> for us". I removed them cause we all use pluged write now.

Your logic is upside-down.  The code currently only uses the _PLUG
variant when t_synchronous_commit is set, meaning somebody *will* call
sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
layer is going to issue the unplug.  Of course, I'm not 100% sure of the
journaling process, so it may very well be that there always is an
unplug.  Can Jan or someone comment on that?  Anyway, you could test
this theory by seeing if your kernel generates any timer unplugs in the
blktrace output.

>> - I am not sure in how many cases we are expecting to submit multiple
>>   sequential write here.
> All the journal write will cause a sequential write to be split to many
> requests here. So it would mean too much for metadata heavy test I think.

If there is a lot of I/O going to the device, then I think we can expect
some amount of merging.  Your test case is a single process doing a
single buffered write which is written back in the background (so not
really high priority).

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - July 14, 2011, 7:46 p.m.
On Thu 14-07-11 12:30:32, Jeff Moyer wrote:
> Tao Ma <tm@tao.ma> writes:
> >> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
> >>   is doing unplug in this case?
> > See the comments I removed, "we rely on sync_buffer() doing the unplug
> > for us". I removed them cause we all use pluged write now.
> 
> Your logic is upside-down.  The code currently only uses the _PLUG
> variant when t_synchronous_commit is set, meaning somebody *will* call
> sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
> layer is going to issue the unplug.  Of course, I'm not 100% sure of the
> journaling process, so it may very well be that there always is an
> unplug.  Can Jan or someone comment on that?  Anyway, you could test
> this theory by seeing if your kernel generates any timer unplugs in the
> blktrace output.
  So I'm not expert in plugging code but from what I understand when we do
wait_on_buffer() (which calls io_schedule()) which will do
blk_flush_plug()), the queue will get unplugged and IO starts. And we wait
for all buffers we submit so we are guaranteed wait_on_buffer() will be
called...

								Honza
Vivek Goyal - July 14, 2011, 8:01 p.m.
On Thu, Jul 14, 2011 at 09:46:57PM +0200, Jan Kara wrote:
> On Thu 14-07-11 12:30:32, Jeff Moyer wrote:
> > Tao Ma <tm@tao.ma> writes:
> > >> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
> > >>   is doing unplug in this case?
> > > See the comments I removed, "we rely on sync_buffer() doing the unplug
> > > for us". I removed them cause we all use pluged write now.
> > 
> > Your logic is upside-down.  The code currently only uses the _PLUG
> > variant when t_synchronous_commit is set, meaning somebody *will* call
> > sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
> > layer is going to issue the unplug.  Of course, I'm not 100% sure of the
> > journaling process, so it may very well be that there always is an
> > unplug.  Can Jan or someone comment on that?  Anyway, you could test
> > this theory by seeing if your kernel generates any timer unplugs in the
> > blktrace output.
>   So I'm not expert in plugging code but from what I understand when we do
> wait_on_buffer() (which calls io_schedule()) which will do
> blk_flush_plug()), the queue will get unplugged and IO starts. And we wait
> for all buffers we submit so we are guaranteed wait_on_buffer() will be
> called...

But blk_flush_plug() is called only in recent kernels where problem is
not present anyway.

Tao is reporting problem in 2.6.38 and 2.6.39. My concern is that if we
send all the IO as WRITE_SYNC_UNPLUG and not really unplug the queue
explicitly then we might lose more time in waiting for timer unplugs
and not benefit that much from merging.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Moyer - July 14, 2011, 8:08 p.m.
Jan Kara <jack@suse.cz> writes:

> On Thu 14-07-11 12:30:32, Jeff Moyer wrote:
>> Tao Ma <tm@tao.ma> writes:
>> >> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
>> >>   is doing unplug in this case?
>> > See the comments I removed, "we rely on sync_buffer() doing the unplug
>> > for us". I removed them cause we all use pluged write now.
>> 
>> Your logic is upside-down.  The code currently only uses the _PLUG
>> variant when t_synchronous_commit is set, meaning somebody *will* call
>> sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
>> layer is going to issue the unplug.  Of course, I'm not 100% sure of the
>> journaling process, so it may very well be that there always is an
>> unplug.  Can Jan or someone comment on that?  Anyway, you could test
>> this theory by seeing if your kernel generates any timer unplugs in the
>> blktrace output.
>   So I'm not expert in plugging code but from what I understand when we do
> wait_on_buffer() (which calls io_schedule()) which will do
> blk_flush_plug()), the queue will get unplugged and IO starts. And we wait
> for all buffers we submit so we are guaranteed wait_on_buffer() will be
> called...

Sorry, I should have been more specific.  As Vivek mentioned, we're
talking about older kernels (pre the blk plugging series).  So, the
question is, if journal_commit_transaction is called with
t_synchronous_commit not set, will the underlying device ever be
unplugged by the journal code?  My guess is there's no explicit unplug,
so it's not correct to replace a WRITE_SYNC with a WRITE_SYNC_PLUG.

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - July 14, 2011, 9:38 p.m.
On Thu 14-07-11 16:08:24, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > On Thu 14-07-11 12:30:32, Jeff Moyer wrote:
> >> Tao Ma <tm@tao.ma> writes:
> >> >> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
> >> >>   is doing unplug in this case?
> >> > See the comments I removed, "we rely on sync_buffer() doing the unplug
> >> > for us". I removed them cause we all use pluged write now.
> >> 
> >> Your logic is upside-down.  The code currently only uses the _PLUG
> >> variant when t_synchronous_commit is set, meaning somebody *will* call
> >> sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
> >> layer is going to issue the unplug.  Of course, I'm not 100% sure of the
> >> journaling process, so it may very well be that there always is an
> >> unplug.  Can Jan or someone comment on that?  Anyway, you could test
> >> this theory by seeing if your kernel generates any timer unplugs in the
> >> blktrace output.
> >   So I'm not expert in plugging code but from what I understand when we do
> > wait_on_buffer() (which calls io_schedule()) which will do
> > blk_flush_plug()), the queue will get unplugged and IO starts. And we wait
> > for all buffers we submit so we are guaranteed wait_on_buffer() will be
> > called...
> 
> Sorry, I should have been more specific.  As Vivek mentioned, we're
> talking about older kernels (pre the blk plugging series).  So, the
> question is, if journal_commit_transaction is called with
> t_synchronous_commit not set, will the underlying device ever be
> unplugged by the journal code?  My guess is there's no explicit unplug,
> so it's not correct to replace a WRITE_SYNC with a WRITE_SYNC_PLUG.
  There are no explicit unplugs in journalling code. But checking the code
in 2.6.37, I still see wait_on_buffer() calls sync_bh() which calls
blk_run_address_space() which ends up calling bdi->unplug_io_fn() so I
would say unplug is called anyway.

								Honza
Tao Ma - July 15, 2011, 2:43 a.m.
Hi Vivek and Jeff,
On 07/15/2011 05:38 AM, Jan Kara wrote:
> On Thu 14-07-11 16:08:24, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>>> On Thu 14-07-11 12:30:32, Jeff Moyer wrote:
>>>> Tao Ma <tm@tao.ma> writes:
>>>>>> - WRITE_SYNC_PLUG will plug the queue and expects explicity unplug. Who
>>>>>>   is doing unplug in this case?
>>>>> See the comments I removed, "we rely on sync_buffer() doing the unplug
>>>>> for us". I removed them cause we all use pluged write now.
>>>>
>>>> Your logic is upside-down.  The code currently only uses the _PLUG
>>>> variant when t_synchronous_commit is set, meaning somebody *will* call
>>>> sync_buffer.  Simply setting WRITE_SYNC_PLUG doens't mean the upper
>>>> layer is going to issue the unplug.  Of course, I'm not 100% sure of the
>>>> journaling process, so it may very well be that there always is an
>>>> unplug.  Can Jan or someone comment on that?  Anyway, you could test
>>>> this theory by seeing if your kernel generates any timer unplugs in the
>>>> blktrace output.
>>>   So I'm not expert in plugging code but from what I understand when we do
>>> wait_on_buffer() (which calls io_schedule()) which will do
>>> blk_flush_plug()), the queue will get unplugged and IO starts. And we wait
>>> for all buffers we submit so we are guaranteed wait_on_buffer() will be
>>> called...
>>
>> Sorry, I should have been more specific.  As Vivek mentioned, we're
>> talking about older kernels (pre the blk plugging series).  So, the
>> question is, if journal_commit_transaction is called with
>> t_synchronous_commit not set, will the underlying device ever be
>> unplugged by the journal code?  My guess is there's no explicit unplug,
>> so it's not correct to replace a WRITE_SYNC with a WRITE_SYNC_PLUG.
>   There are no explicit unplugs in journalling code. But checking the code
> in 2.6.37, I still see wait_on_buffer() calls sync_bh() which calls
> blk_run_address_space() which ends up calling bdi->unplug_io_fn() so I
> would say unplug is called anyway.
yeah, jbd2 works like what Jan described.
And what's more, if you looked at the commit I mentioned(749ef9f8423),
this commit just changed WRITE to WRITE_SYNC, so in the old times(before
that commit is merged), WRITE has been used in jbd/2 for many years. It
also doesn't do a explicit unplug, let the scheduler do the merge and
reply on sync_bh to unplug the device. So change WRITE to
WRITE_SYNC_PLUG is welcomed, but change WRITE to WRITE_SYNC is broken
since it splits the sequential write to several i/o requests.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 34a4861..6d13df5 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -294,7 +294,7 @@  void journal_commit_transaction(journal_t *journal)
 	int first_tag = 0;
 	int tag_flag;
 	int i;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE_SYNC_PLUG;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -327,13 +327,6 @@  void journal_commit_transaction(journal_t *journal)
 	spin_lock(&journal->j_state_lock);
 	commit_transaction->t_state = T_LOCKED;
 
-	/*
-	 * Use plugged writes here, since we want to submit several before
-	 * we unplug the device. We don't do explicit unplugging in here,
-	 * instead we rely on sync_buffer() doing the unplug for us.
-	 */
-	if (commit_transaction->t_synchronous_commit)
-		write_op = WRITE_SYNC_PLUG;
 	spin_lock(&commit_transaction->t_handle_lock);
 	while (commit_transaction->t_updates) {
 		DEFINE_WAIT(wait);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f3ad159..fc3840f 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -329,7 +329,7 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 	int tag_bytes = journal_tag_bytes(journal);
 	struct buffer_head *cbh = NULL; /* For transactional checksums */
 	__u32 crc32_sum = ~0;
-	int write_op = WRITE_SYNC;
+	int write_op = WRITE_SYNC_PLUG;
 
 	/*
 	 * First job: lock down the current transaction and wait for
@@ -363,13 +363,6 @@  void jbd2_journal_commit_transaction(journal_t *journal)
 	write_lock(&journal->j_state_lock);
 	commit_transaction->t_state = T_LOCKED;
 
-	/*
-	 * Use plugged writes here, since we want to submit several before
-	 * we unplug the device. We don't do explicit unplugging in here,
-	 * instead we rely on sync_buffer() doing the unplug for us.
-	 */
-	if (commit_transaction->t_synchronous_commit)
-		write_op = WRITE_SYNC_PLUG;
 	trace_jbd2_commit_locking(journal, commit_transaction);
 	stats.run.rs_wait = commit_transaction->t_max_wait;
 	stats.run.rs_locked = jiffies;