jbd2 bh_shadow issue

Message ID 871f4cbf-f7bc-35d8-bd42-32721b44c458@linux.alibaba.com
State New
Headers show
Series
  • jbd2 bh_shadow issue
Related show

Commit Message

Xiaoguang Wang Nov. 27, 2018, 11:48 a.m.
hi,

I also meet this issue in our server, sometimes we got big latency for
buffered writes. Using fio, I have done some tests to reproduce this issue,
Please see test results in this mail:
https://www.spinics.net/lists/linux-ext4/msg62885.html

I agree that doing buffer copy-out for every journaled buffer wastes cpu
time, so here I wonder whether we can only do buffer copy-out when buffer
is BH_SHADOW state. The core idea is simple:
     In do_get_write_access(), if buffer has already been in BH_SHADOW state,
     we allocate a new page, make buffer's b_page point to this new page, and
     following journal dirty work can be done in this patch. When transaction
     commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
     to primary page, make buffer point to primary page and free the temporary
     page.

I have written a test patch, the big long latency will disapper. Do you think
the method is useful? If you think so, I can submit a formal patch and do fstests.
Anyway I think this issue is not good, maybe we should fix it:)

Here I attach my test patch here:
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
  fs/jbd2/journal.c            | 26 ++++++++++++++++++++++++--
  fs/jbd2/transaction.c        | 31 +++++++++++++++++++++++++++----
  include/linux/buffer_head.h  |  1 +
  include/linux/jbd2.h         |  7 +++++++
  include/linux/journal-head.h |  4 ++++
  5 files changed, 63 insertions(+), 6 deletions(-)

Comments

Jan Kara Nov. 29, 2018, 6:04 p.m. | #1
Hi,

On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote:
> I also meet this issue in our server, sometimes we got big latency for
> buffered writes. Using fio, I have done some tests to reproduce this issue,
> Please see test results in this mail:
> https://www.spinics.net/lists/linux-ext4/msg62885.html
> 
> I agree that doing buffer copy-out for every journaled buffer wastes cpu
> time, so here I wonder whether we can only do buffer copy-out when buffer
> is BH_SHADOW state. The core idea is simple:
>     In do_get_write_access(), if buffer has already been in BH_SHADOW state,
>     we allocate a new page, make buffer's b_page point to this new page, and
>     following journal dirty work can be done in this patch. When transaction
>     commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
>     to primary page, make buffer point to primary page and free the temporary
>     page.

This isn't going to work. That are places in the kernel that assume that
the page bh points to is the page stored in the page cache. Also you need
to make sure that functions like sb_getblk() are going to return your new
bh etc. So it is not really simple. On the other hand if we speak only
about metadata buffers, the potential for breakage is certainly much
smaller (but you'd have to then make sure data=journal case does not use
your optimization) and there aren't that many ways how bh can be accessed
so maybe you could make something like this working. But it certainly isn't
going to work in the form you've presented here...

								Honza


> 
> I have written a test patch, the big long latency will disapper. Do you think
> the method is useful? If you think so, I can submit a formal patch and do fstests.
> Anyway I think this issue is not good, maybe we should fix it:)
> 
> Here I attach my test patch here:
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

I think that replacing the buffer_head is harder than you think. 

> ---
>  fs/jbd2/journal.c            | 26 ++++++++++++++++++++++++--
>  fs/jbd2/transaction.c        | 31 +++++++++++++++++++++++++++----
>  include/linux/buffer_head.h  |  1 +
>  include/linux/jbd2.h         |  7 +++++++
>  include/linux/journal-head.h |  4 ++++
>  5 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 8ef6b6daaa7a..bb0612cf5447 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -393,8 +393,30 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  		new_page = virt_to_page(jh_in->b_frozen_data);
>  		new_offset = offset_in_page(jh_in->b_frozen_data);
>  	} else {
> -		new_page = jh2bh(jh_in)->b_page;
> -		new_offset = offset_in_page(jh2bh(jh_in)->b_data);
> +		if (buffer_lege(bh_in)) {
> +			char *primary, *temp_primary = jh_in->b_temp_data;
> +			struct page *temp_page;
> +			unsigned temp_offset;
> +
> +			new_page = jh_in->page;
> +			new_offset = jh_in->page_offset;
> +
> +			temp_page = virt_to_page(temp_primary);
> +			temp_offset = offset_in_page(temp_primary);
> +			primary = kmap_atomic(new_page);
> +			mapped_data = kmap_atomic(temp_page);
> +
> +			memcpy(primary + new_offset, mapped_data + temp_offset,
> +			       bh_in->b_size);
> +			set_bh_page(bh_in, new_page, new_offset);
> +			kunmap_atomic(primary);
> +			kunmap_atomic(temp_primary);
> +			jbd2_free(temp_primary, bh_in->b_size);
> +			clear_buffer_primarycopied(bh_in);
> +		} else {
> +			new_page = jh2bh(jh_in)->b_page;
> +			new_offset = offset_in_page(jh2bh(jh_in)->b_data);
> +		}
>  	}
> 
>  	mapped_data = kmap_atomic(new_page);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index c0b66a7a795b..6ad8ec94c7b8 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -827,6 +827,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  	journal_t *journal;
>  	int error;
>  	char *frozen_buffer = NULL;
> +	char *tmp_primary = NULL;
>  	unsigned long start_lock, time_lock;
> 
>  	if (is_handle_aborted(handle))
> @@ -956,10 +957,29 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  	 * here.
>  	 */
>  	if (buffer_shadow(bh)) {
> -		JBUFFER_TRACE(jh, "on shadow: sleep");
> -		jbd_unlock_bh_state(bh);
> -		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
> -		goto repeat;
> +		char *mapped_data;
> +		struct page *new_page;
> +		unsigned int primary_offset, new_offset;
> +
> +		if (!tmp_primary) {
> +			jbd_unlock_bh_state(bh);
> +			tmp_primary = jbd2_alloc(bh->b_size, GFP_NOFS |
> +						 __GFP_NOFAIL);
> +			goto repeat;
> +		}
> +
> +		primary_offset = offset_in_page(bh->b_data);
> +		mapped_data = kmap_atomic(bh->b_page);
> +		memcpy(tmp_primary, mapped_data + primary_offset, bh->b_size);
> +		kunmap_atomic(mapped_data);
> +
> +		new_page = virt_to_page(tmp_primary);
> +		new_offset = offset_in_page(tmp_primary);
> +		jh->b_temp_data = tmp_primary;
> +		jh->page = bh->b_page;
> +		jh->page_offset = primary_offset;
> +		set_bh_page(bh, new_page, new_offset);
> +		set_buffer_primarycopied(bh);
>  	}
> 
>  	/*
> @@ -1009,6 +1029,9 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  	if (unlikely(frozen_buffer))	/* It's usually NULL */
>  		jbd2_free(frozen_buffer, bh->b_size);
> 
> +	if (tmp_primary)
> +		jbd2_free(frozen_buffer, bh->b_size);
> +
>  	JBUFFER_TRACE(jh, "exit");
>  	return error;
>  }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 96225a77c112..88fbe345d291 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -66,6 +66,7 @@ struct buffer_head {
>  	struct page *b_page;		/* the page this bh is mapped to */
> 
>  	sector_t b_blocknr;		/* start block number */
> +	sector_t orig_blocknr;		/* start block number */
>  	size_t b_size;			/* size of mapping */
>  	char *b_data;			/* pointer to data within the page */
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index b708e5169d1d..0928f9bea8f4 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -321,6 +321,12 @@ enum jbd_state_bits {
>  	BH_Shadow,		/* IO on shadow buffer is running */
>  	BH_Verified,		/* Metadata block has been verified ok */
>  	BH_JBDPrivateStart,	/* First bit available for private use by FS */
> +	/*
> +	 * We can not dirty a bh while it's going to disk with BH_Shadow flag.
> +	 * In this case, we can copy this bh and let later fs dirty operations
> +	 * go to this new bh.
> +	 */
> +	BH_PrimaryCopied,
>  };
> 
>  BUFFER_FNS(JBD, jbd)
> @@ -334,6 +340,7 @@ TAS_BUFFER_FNS(RevokeValid, revokevalid)
>  BUFFER_FNS(Freed, freed)
>  BUFFER_FNS(Shadow, shadow)
>  BUFFER_FNS(Verified, verified)
> +BUFFER_FNS(PrimaryCopied, primarycopied)
> 
>  static inline struct buffer_head *jh2bh(struct journal_head *jh)
>  {
> diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
> index 9fb870524314..0fa317124d9a 100644
> --- a/include/linux/journal-head.h
> +++ b/include/linux/journal-head.h
> @@ -51,6 +51,10 @@ struct journal_head {
>  	 */
>  	char *b_frozen_data;
> 
> +	struct page *page;
> +	unsigned int page_offset;
> +	char *b_temp_data;
> +
>  	/*
>  	 * Pointer to a saved copy of the buffer containing no uncommitted
>  	 * deallocation references, so that allocations can avoid overwriting
> -- 
> 2.14.4.44.g2045bb6
> 
> 
> 
> 
> 
> Regards,
> Xiaoguang Wang
Xiaoguang Wang Dec. 4, 2018, 10:49 a.m. | #2
hi,
> Hi,
> 
> On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote:
>> I also meet this issue in our server, sometimes we got big latency for
>> buffered writes. Using fio, I have done some tests to reproduce this issue,
>> Please see test results in this mail:
>> https://www.spinics.net/lists/linux-ext4/msg62885.html
>>
>> I agree that doing buffer copy-out for every journaled buffer wastes cpu
>> time, so here I wonder whether we can only do buffer copy-out when buffer
>> is BH_SHADOW state. The core idea is simple:
>>      In do_get_write_access(), if buffer has already been in BH_SHADOW state,
>>      we allocate a new page, make buffer's b_page point to this new page, and
>>      following journal dirty work can be done in this patch. When transaction
>>      commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
>>      to primary page, make buffer point to primary page and free the temporary
>>      page.
> 
> This isn't going to work. That are places in the kernel that assume that
> the page bh points to is the page stored in the page cache. Also you need
> to make sure that functions like sb_getblk() are going to return your new
> bh etc. So it is not really simple. On the other hand if we speak only
> about metadata buffers, the potential for breakage is certainly much
> smaller (but you'd have to then make sure data=journal case does not use
> your optimization) and there aren't that many ways how bh can be accessed
> so maybe you could make something like this working. But it certainly isn't
> going to work in the form you've presented here...
I had thought some of your mentioned cases and only just considered meta
buffer before, thanks you for pointing more cases. Do you have any suggestions
for this bh_shadow issue, sometimes it really caused app's long tail latency.

Regards,
Xiaoguang Wang

> 
> 								Honza
> 
> 
>>
>> I have written a test patch, the big long latency will disapper. Do you think
>> the method is useful? If you think so, I can submit a formal patch and do fstests.
>> Anyway I think this issue is not good, maybe we should fix it:)
>>
>> Here I attach my test patch here:
>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> 
> I think that replacing the buffer_head is harder than you think.
> 
>> ---
>>   fs/jbd2/journal.c            | 26 ++++++++++++++++++++++++--
>>   fs/jbd2/transaction.c        | 31 +++++++++++++++++++++++++++----
>>   include/linux/buffer_head.h  |  1 +
>>   include/linux/jbd2.h         |  7 +++++++
>>   include/linux/journal-head.h |  4 ++++
>>   5 files changed, 63 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
>> index 8ef6b6daaa7a..bb0612cf5447 100644
>> --- a/fs/jbd2/journal.c
>> +++ b/fs/jbd2/journal.c
>> @@ -393,8 +393,30 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>>   		new_page = virt_to_page(jh_in->b_frozen_data);
>>   		new_offset = offset_in_page(jh_in->b_frozen_data);
>>   	} else {
>> -		new_page = jh2bh(jh_in)->b_page;
>> -		new_offset = offset_in_page(jh2bh(jh_in)->b_data);
>> +		if (buffer_lege(bh_in)) {
>> +			char *primary, *temp_primary = jh_in->b_temp_data;
>> +			struct page *temp_page;
>> +			unsigned temp_offset;
>> +
>> +			new_page = jh_in->page;
>> +			new_offset = jh_in->page_offset;
>> +
>> +			temp_page = virt_to_page(temp_primary);
>> +			temp_offset = offset_in_page(temp_primary);
>> +			primary = kmap_atomic(new_page);
>> +			mapped_data = kmap_atomic(temp_page);
>> +
>> +			memcpy(primary + new_offset, mapped_data + temp_offset,
>> +			       bh_in->b_size);
>> +			set_bh_page(bh_in, new_page, new_offset);
>> +			kunmap_atomic(primary);
>> +			kunmap_atomic(temp_primary);
>> +			jbd2_free(temp_primary, bh_in->b_size);
>> +			clear_buffer_primarycopied(bh_in);
>> +		} else {
>> +			new_page = jh2bh(jh_in)->b_page;
>> +			new_offset = offset_in_page(jh2bh(jh_in)->b_data);
>> +		}
>>   	}
>>
>>   	mapped_data = kmap_atomic(new_page);
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index c0b66a7a795b..6ad8ec94c7b8 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -827,6 +827,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>>   	journal_t *journal;
>>   	int error;
>>   	char *frozen_buffer = NULL;
>> +	char *tmp_primary = NULL;
>>   	unsigned long start_lock, time_lock;
>>
>>   	if (is_handle_aborted(handle))
>> @@ -956,10 +957,29 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>>   	 * here.
>>   	 */
>>   	if (buffer_shadow(bh)) {
>> -		JBUFFER_TRACE(jh, "on shadow: sleep");
>> -		jbd_unlock_bh_state(bh);
>> -		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
>> -		goto repeat;
>> +		char *mapped_data;
>> +		struct page *new_page;
>> +		unsigned int primary_offset, new_offset;
>> +
>> +		if (!tmp_primary) {
>> +			jbd_unlock_bh_state(bh);
>> +			tmp_primary = jbd2_alloc(bh->b_size, GFP_NOFS |
>> +						 __GFP_NOFAIL);
>> +			goto repeat;
>> +		}
>> +
>> +		primary_offset = offset_in_page(bh->b_data);
>> +		mapped_data = kmap_atomic(bh->b_page);
>> +		memcpy(tmp_primary, mapped_data + primary_offset, bh->b_size);
>> +		kunmap_atomic(mapped_data);
>> +
>> +		new_page = virt_to_page(tmp_primary);
>> +		new_offset = offset_in_page(tmp_primary);
>> +		jh->b_temp_data = tmp_primary;
>> +		jh->page = bh->b_page;
>> +		jh->page_offset = primary_offset;
>> +		set_bh_page(bh, new_page, new_offset);
>> +		set_buffer_primarycopied(bh);
>>   	}
>>
>>   	/*
>> @@ -1009,6 +1029,9 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>>   	if (unlikely(frozen_buffer))	/* It's usually NULL */
>>   		jbd2_free(frozen_buffer, bh->b_size);
>>
>> +	if (tmp_primary)
>> +		jbd2_free(frozen_buffer, bh->b_size);
>> +
>>   	JBUFFER_TRACE(jh, "exit");
>>   	return error;
>>   }
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index 96225a77c112..88fbe345d291 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -66,6 +66,7 @@ struct buffer_head {
>>   	struct page *b_page;		/* the page this bh is mapped to */
>>
>>   	sector_t b_blocknr;		/* start block number */
>> +	sector_t orig_blocknr;		/* start block number */
>>   	size_t b_size;			/* size of mapping */
>>   	char *b_data;			/* pointer to data within the page */
>>
>> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
>> index b708e5169d1d..0928f9bea8f4 100644
>> --- a/include/linux/jbd2.h
>> +++ b/include/linux/jbd2.h
>> @@ -321,6 +321,12 @@ enum jbd_state_bits {
>>   	BH_Shadow,		/* IO on shadow buffer is running */
>>   	BH_Verified,		/* Metadata block has been verified ok */
>>   	BH_JBDPrivateStart,	/* First bit available for private use by FS */
>> +	/*
>> +	 * We can not dirty a bh while it's going to disk with BH_Shadow flag.
>> +	 * In this case, we can copy this bh and let later fs dirty operations
>> +	 * go to this new bh.
>> +	 */
>> +	BH_PrimaryCopied,
>>   };
>>
>>   BUFFER_FNS(JBD, jbd)
>> @@ -334,6 +340,7 @@ TAS_BUFFER_FNS(RevokeValid, revokevalid)
>>   BUFFER_FNS(Freed, freed)
>>   BUFFER_FNS(Shadow, shadow)
>>   BUFFER_FNS(Verified, verified)
>> +BUFFER_FNS(PrimaryCopied, primarycopied)
>>
>>   static inline struct buffer_head *jh2bh(struct journal_head *jh)
>>   {
>> diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
>> index 9fb870524314..0fa317124d9a 100644
>> --- a/include/linux/journal-head.h
>> +++ b/include/linux/journal-head.h
>> @@ -51,6 +51,10 @@ struct journal_head {
>>   	 */
>>   	char *b_frozen_data;
>>
>> +	struct page *page;
>> +	unsigned int page_offset;
>> +	char *b_temp_data;
>> +
>>   	/*
>>   	 * Pointer to a saved copy of the buffer containing no uncommitted
>>   	 * deallocation references, so that allocations can avoid overwriting
>> -- 
>> 2.14.4.44.g2045bb6
>>
>>
>>
>>
>>
>> Regards,
>> Xiaoguang Wang
Jan Kara Dec. 11, 2018, 10:19 a.m. | #3
Hi,

On Tue 04-12-18 18:49:49, Xiaoguang Wang wrote:
> > On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote:
> > > I also meet this issue in our server, sometimes we got big latency for
> > > buffered writes. Using fio, I have done some tests to reproduce this issue,
> > > Please see test results in this mail:
> > > https://www.spinics.net/lists/linux-ext4/msg62885.html
> > > 
> > > I agree that doing buffer copy-out for every journaled buffer wastes cpu
> > > time, so here I wonder whether we can only do buffer copy-out when buffer
> > > is BH_SHADOW state. The core idea is simple:
> > >      In do_get_write_access(), if buffer has already been in BH_SHADOW state,
> > >      we allocate a new page, make buffer's b_page point to this new page, and
> > >      following journal dirty work can be done in this patch. When transaction
> > >      commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
> > >      to primary page, make buffer point to primary page and free the temporary
> > >      page.
> > 
> > This isn't going to work. That are places in the kernel that assume that
> > the page bh points to is the page stored in the page cache. Also you need
> > to make sure that functions like sb_getblk() are going to return your new
> > bh etc. So it is not really simple. On the other hand if we speak only
> > about metadata buffers, the potential for breakage is certainly much
> > smaller (but you'd have to then make sure data=journal case does not use
> > your optimization) and there aren't that many ways how bh can be accessed
> > so maybe you could make something like this working. But it certainly isn't
> > going to work in the form you've presented here...
> I had thought some of your mentioned cases and only just considered meta
> buffer before, thanks you for pointing more cases. Do you have any suggestions
> for this bh_shadow issue, sometimes it really caused app's long tail latency.

I was thinking for some time about this but I don't have a good solution to
this problem. Additionally I have realized that you will face the same
latency hit not only when the block is written out to the journal but also
when it is being checkpointed. So the solution could not be just some
special journalling hack but would need to provide a general way how to
modify metadata buffer heads that are being written out.

One solution that looks at least remotely sane to me is that you could do
an equivalent of page migration (it is somewhat similar to what you've
suggested) - allocate new page, copy there data from the old page, replace
original page in the address space with this page, move buffer heads over
from the old page to the new page. You can check what buffer_migrate_page()
is doing. And then you can let processes modify this new page and when the
IO completes the old page would get released.  But you'd have to be extra
careful with the locking and also make sure buffer heads you migrate don't
have extra references so that you do not move buffer heads under someone's
hands (as that could cause data corruption)...

								Honza
Xiaoguang Wang Dec. 12, 2018, 2:24 p.m. | #4
hi,

> Hi,
> 
> On Tue 04-12-18 18:49:49, Xiaoguang Wang wrote:
>>> On Tue 27-11-18 19:48:28, Xiaoguang Wang wrote:
>>>> I also meet this issue in our server, sometimes we got big latency for
>>>> buffered writes. Using fio, I have done some tests to reproduce this issue,
>>>> Please see test results in this mail:
>>>> https://www.spinics.net/lists/linux-ext4/msg62885.html
>>>>
>>>> I agree that doing buffer copy-out for every journaled buffer wastes cpu
>>>> time, so here I wonder whether we can only do buffer copy-out when buffer
>>>> is BH_SHADOW state. The core idea is simple:
>>>>       In do_get_write_access(), if buffer has already been in BH_SHADOW state,
>>>>       we allocate a new page, make buffer's b_page point to this new page, and
>>>>       following journal dirty work can be done in this patch. When transaction
>>>>       commits, in jbd2_journal_write_metadata_buffer(), we copy new page's conten
>>>>       to primary page, make buffer point to primary page and free the temporary
>>>>       page.
>>>
>>> This isn't going to work. That are places in the kernel that assume that
>>> the page bh points to is the page stored in the page cache. Also you need
>>> to make sure that functions like sb_getblk() are going to return your new
>>> bh etc. So it is not really simple. On the other hand if we speak only
>>> about metadata buffers, the potential for breakage is certainly much
>>> smaller (but you'd have to then make sure data=journal case does not use
>>> your optimization) and there aren't that many ways how bh can be accessed
>>> so maybe you could make something like this working. But it certainly isn't
>>> going to work in the form you've presented here...
>> I had thought some of your mentioned cases and only just considered meta
>> buffer before, thanks you for pointing more cases. Do you have any suggestions
>> for this bh_shadow issue, sometimes it really caused app's long tail latency.
> 
> I was thinking for some time about this but I don't have a good solution to
> this problem. Additionally I have realized that you will face the same
> latency hit not only when the block is written out to the journal but also
> when it is being checkpointed. 
To be honest, we often see long tail latency caused by bh_shadow issue, but
don't see any long tail latency caused by checkpointed buffer...

> So the solution could not be just some
> special journalling hack but would need to provide a general way how to
> modify metadata buffer heads that are being written out.
Agree.

> 
> One solution that looks at least remotely sane to me is that you could do
> an equivalent of page migration (it is somewhat similar to what you've
> suggested) - allocate new page, copy there data from the old page, replace
> original page in the address space with this page, move buffer heads over
> from the old page to the new page. You can check what buffer_migrate_page()
> is doing. And then you can let processes modify this new page and when the
> IO completes the old page would get released.  But you'd have to be extra
> careful with the locking and also make sure buffer heads you migrate don't
> have extra references so that you do not move buffer heads under someone's
> hands (as that could cause data corruption)...
Thanks Jan, I'll have a try according to your suggestions.

> 
> 								Honza
>

Patch

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ef6b6daaa7a..bb0612cf5447 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -393,8 +393,30 @@  int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
  		new_page = virt_to_page(jh_in->b_frozen_data);
  		new_offset = offset_in_page(jh_in->b_frozen_data);
  	} else {
-		new_page = jh2bh(jh_in)->b_page;
-		new_offset = offset_in_page(jh2bh(jh_in)->b_data);
+		if (buffer_lege(bh_in)) {
+			char *primary, *temp_primary = jh_in->b_temp_data;
+			struct page *temp_page;
+			unsigned temp_offset;
+
+			new_page = jh_in->page;
+			new_offset = jh_in->page_offset;
+
+			temp_page = virt_to_page(temp_primary);
+			temp_offset = offset_in_page(temp_primary);
+			primary = kmap_atomic(new_page);
+			mapped_data = kmap_atomic(temp_page);
+
+			memcpy(primary + new_offset, mapped_data + temp_offset,
+			       bh_in->b_size);
+			set_bh_page(bh_in, new_page, new_offset);
+			kunmap_atomic(primary);
+			kunmap_atomic(temp_primary);
+			jbd2_free(temp_primary, bh_in->b_size);
+			clear_buffer_primarycopied(bh_in);
+		} else {
+			new_page = jh2bh(jh_in)->b_page;
+			new_offset = offset_in_page(jh2bh(jh_in)->b_data);
+		}
  	}

  	mapped_data = kmap_atomic(new_page);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c0b66a7a795b..6ad8ec94c7b8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -827,6 +827,7 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
  	journal_t *journal;
  	int error;
  	char *frozen_buffer = NULL;
+	char *tmp_primary = NULL;
  	unsigned long start_lock, time_lock;

  	if (is_handle_aborted(handle))
@@ -956,10 +957,29 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
  	 * here.
  	 */
  	if (buffer_shadow(bh)) {
-		JBUFFER_TRACE(jh, "on shadow: sleep");
-		jbd_unlock_bh_state(bh);
-		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
-		goto repeat;
+		char *mapped_data;
+		struct page *new_page;
+		unsigned int primary_offset, new_offset;
+
+		if (!tmp_primary) {
+			jbd_unlock_bh_state(bh);
+			tmp_primary = jbd2_alloc(bh->b_size, GFP_NOFS |
+						 __GFP_NOFAIL);
+			goto repeat;
+		}
+
+		primary_offset = offset_in_page(bh->b_data);
+		mapped_data = kmap_atomic(bh->b_page);
+		memcpy(tmp_primary, mapped_data + primary_offset, bh->b_size);
+		kunmap_atomic(mapped_data);
+
+		new_page = virt_to_page(tmp_primary);
+		new_offset = offset_in_page(tmp_primary);
+		jh->b_temp_data = tmp_primary;
+		jh->page = bh->b_page;
+		jh->page_offset = primary_offset;
+		set_bh_page(bh, new_page, new_offset);
+		set_buffer_primarycopied(bh);
  	}

  	/*
@@ -1009,6 +1029,9 @@  do_get_write_access(handle_t *handle, struct journal_head *jh,
  	if (unlikely(frozen_buffer))	/* It's usually NULL */
  		jbd2_free(frozen_buffer, bh->b_size);

+	if (tmp_primary)
+		jbd2_free(frozen_buffer, bh->b_size);
+
  	JBUFFER_TRACE(jh, "exit");
  	return error;
  }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 96225a77c112..88fbe345d291 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -66,6 +66,7 @@  struct buffer_head {
  	struct page *b_page;		/* the page this bh is mapped to */

  	sector_t b_blocknr;		/* start block number */
+	sector_t orig_blocknr;		/* start block number */
  	size_t b_size;			/* size of mapping */
  	char *b_data;			/* pointer to data within the page */

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index b708e5169d1d..0928f9bea8f4 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -321,6 +321,12 @@  enum jbd_state_bits {
  	BH_Shadow,		/* IO on shadow buffer is running */
  	BH_Verified,		/* Metadata block has been verified ok */
  	BH_JBDPrivateStart,	/* First bit available for private use by FS */
+	/*
+	 * We can not dirty a bh while it's going to disk with BH_Shadow flag.
+	 * In this case, we can copy this bh and let later fs dirty operations
+	 * go to this new bh.
+	 */
+	BH_PrimaryCopied,
  };

  BUFFER_FNS(JBD, jbd)
@@ -334,6 +340,7 @@  TAS_BUFFER_FNS(RevokeValid, revokevalid)
  BUFFER_FNS(Freed, freed)
  BUFFER_FNS(Shadow, shadow)
  BUFFER_FNS(Verified, verified)
+BUFFER_FNS(PrimaryCopied, primarycopied)

  static inline struct buffer_head *jh2bh(struct journal_head *jh)
  {
diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
index 9fb870524314..0fa317124d9a 100644
--- a/include/linux/journal-head.h
+++ b/include/linux/journal-head.h
@@ -51,6 +51,10 @@  struct journal_head {
  	 */
  	char *b_frozen_data;

+	struct page *page;
+	unsigned int page_offset;
+	char *b_temp_data;
+
  	/*
  	 * Pointer to a saved copy of the buffer containing no uncommitted
  	 * deallocation references, so that allocations can avoid overwriting