[1/4] block: Fix dirty bitmap in bdrv_co_discard

Message ID	1430830009-29839-2-git-send-email-famz@redhat.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Fam Zheng <famz@redhat.com> To: qemu-devel@nongnu.org Date: Tue, 5 May 2015 20:46:46 +0800 Message-Id: <1430830009-29839-2-git-send-email-famz@redhat.com> In-Reply-To: <1430830009-29839-1-git-send-email-famz@redhat.com> References: <1430830009-29839-1-git-send-email-famz@redhat.com> Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>, pbonzini@redhat.com, jsnow@redhat.com, wangxiaolong@ucloud.cn Subject: [Qemu-devel] [PATCH 1/4] block: Fix dirty bitmap in bdrv_co_discard Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Fam Zheng May 5, 2015, 12:46 p.m. UTC

Unsetting dirty globally with discard is not very correct. The discard may zero
out sectors (depending on can_write_zeroes_with_unmap), we should replicate
this change to destinition side to make sure that the guest sees the same data.

Calling bdrv_reset_dirty also troubles mirror job because the hbitmap iterator
doesn't expect unsetting of bits after current position.

So let's do it the opposite way which fixes both problems: set the dirty bits
if we are to discard it.

Reported-by: wangxiaolong@ucloud.cn
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Paolo Bonzini May 5, 2015, 1:06 p.m. UTC | #1

On 05/05/2015 14:46, Fam Zheng wrote:
> Unsetting dirty globally with discard is not very correct. The discard may zero
> out sectors (depending on can_write_zeroes_with_unmap), we should replicate
> this change to destinition side to make sure that the guest sees the same data.
> 
> Calling bdrv_reset_dirty also troubles mirror job because the hbitmap iterator
> doesn't expect unsetting of bits after current position.
> 
> So let's do it the opposite way which fixes both problems: set the dirty bits
> if we are to discard it.

This is not enough, you also have to do the discard in block/mirror.c,
otherwise the destination image could even become fully provisioned!

Paolo

> Reported-by: wangxiaolong@ucloud.cn
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 1ce62c4..809688b 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2343,8 +2343,6 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>          return -EROFS;
>      }
>  
> -    bdrv_reset_dirty(bs, sector_num, nb_sectors);
> -
>      /* Do nothing if disabled.  */
>      if (!(bs->open_flags & BDRV_O_UNMAP)) {
>          return 0;
> @@ -2354,6 +2352,8 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
>          return 0;
>      }
>  
> +    bdrv_set_dirty(bs, sector_num, nb_sectors);
> +
>      max_discard = MIN_NON_ZERO(bs->bl.max_discard, BDRV_REQUEST_MAX_SECTORS);
>      while (nb_sectors > 0) {
>          int ret;
>

Fam Zheng May 6, 2015, 1:45 a.m. UTC | #2

On Tue, 05/05 15:06, Paolo Bonzini wrote:
> 
> 
> On 05/05/2015 14:46, Fam Zheng wrote:
> > Unsetting dirty globally with discard is not very correct. The discard may zero
> > out sectors (depending on can_write_zeroes_with_unmap), we should replicate
> > this change to destinition side to make sure that the guest sees the same data.
> > 
> > Calling bdrv_reset_dirty also troubles mirror job because the hbitmap iterator
> > doesn't expect unsetting of bits after current position.
> > 
> > So let's do it the opposite way which fixes both problems: set the dirty bits
> > if we are to discard it.
> 
> This is not enough, you also have to do the discard in block/mirror.c,
> otherwise the destination image could even become fully provisioned!

I wasn't sure what if src and dest have different can_write_zeroes_with_unmap
value, but your argument is stronger.

I will add discard in mirror.  Besides that, do we need to compare the
can_write_zeroes_with_unmap?

Fam

> 
> Paolo
> 
> > Reported-by: wangxiaolong@ucloud.cn
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  block/io.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index 1ce62c4..809688b 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -2343,8 +2343,6 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
> >          return -EROFS;
> >      }
> >  
> > -    bdrv_reset_dirty(bs, sector_num, nb_sectors);
> > -
> >      /* Do nothing if disabled.  */
> >      if (!(bs->open_flags & BDRV_O_UNMAP)) {
> >          return 0;
> > @@ -2354,6 +2352,8 @@ int coroutine_fn bdrv_co_discard(BlockDriverState *bs, int64_t sector_num,
> >          return 0;
> >      }
> >  
> > +    bdrv_set_dirty(bs, sector_num, nb_sectors);
> > +
> >      max_discard = MIN_NON_ZERO(bs->bl.max_discard, BDRV_REQUEST_MAX_SECTORS);
> >      while (nb_sectors > 0) {
> >          int ret;
> >

Paolo Bonzini May 6, 2015, 8:59 a.m. UTC | #3

On 06/05/2015 03:45, Fam Zheng wrote:
>> > This is not enough, you also have to do the discard in block/mirror.c,
>> > otherwise the destination image could even become fully provisioned!
> I wasn't sure what if src and dest have different can_write_zeroes_with_unmap
> value, but your argument is stronger.
> 
> I will add discard in mirror.  Besides that, do we need to compare the
> can_write_zeroes_with_unmap?

Hmm, if can_write_zeroes_with_unmap is set, it's probably better to
write zeroes instead of discarding, in case the guest is relying on it.

Paolo

Fam Zheng May 6, 2015, 9:50 a.m. UTC | #4

On Wed, 05/06 10:59, Paolo Bonzini wrote:
> 
> 
> On 06/05/2015 03:45, Fam Zheng wrote:
> >> > This is not enough, you also have to do the discard in block/mirror.c,
> >> > otherwise the destination image could even become fully provisioned!
> > I wasn't sure what if src and dest have different can_write_zeroes_with_unmap
> > value, but your argument is stronger.
> > 
> > I will add discard in mirror.  Besides that, do we need to compare the
> > can_write_zeroes_with_unmap?
> 
> Hmm, if can_write_zeroes_with_unmap is set, it's probably better to
> write zeroes instead of discarding, in case the guest is relying on it.
> 

I think there are four cases:

 #   src can_write_zeroes_with_unmap       target can_write_zeroes_with_unmap
--------------------------------------------------------------------------------
 1              true                                   true
 2              true                                   false
 3              false                                  true
 4              false                                  false

I think replicating discard only works for 1, because both side would then read
0.  For case 2 & 3 it's probably better to mirror the actual reading of source.

I'm not sure about 4.

What do you think?

Fam

Paolo Bonzini May 6, 2015, 10:21 a.m. UTC | #5

On 06/05/2015 11:50, Fam Zheng wrote:
>  #   src can_write_zeroes_with_unmap       target can_write_zeroes_with_unmap
> --------------------------------------------------------------------------------
>  1              true                                   true
>  2              true                                   false
>  3              false                                  true
>  4              false                                  false

1 should replicate WRITE SAME, in case the unmap granularity of the
target is different from that of the source.  In that case, a discard on
the target might leave some sectors untouched.  Writing zeroes would
ensure matching data between the source and the target.

2 should _not_ discard: it should write zeroes even at the cost of
making the target fully provisioned.  Perhaps you can optimize it by
looking at bdrv_get_block_status for the target, and checking the answer
for BDRV_ZERO.

3 and 4 can use discard on the target.

So it looks like only the source setting matters.

We need to check the cost of bdrv_co_get_block_status for "raw", too.
If it's too expensive, that can be a problem.

Paolo

> For case 2 & 3 it's probably better to mirror the actual reading of source.
> 
> I'm not sure about 4.

Even in case 1, discard could be "UNMAP" and write zeroes could be
"WRITE SAME".  If the unmap granularity of the target is For unaligned
sectors, UNMAP might leave some sectors aside while WRITE SAME will
write with zeroes.

Fam Zheng May 11, 2015, 8:02 a.m. UTC | #6

On Wed, 05/06 12:21, Paolo Bonzini wrote:
> 
> 
> On 06/05/2015 11:50, Fam Zheng wrote:
> >  #   src can_write_zeroes_with_unmap       target can_write_zeroes_with_unmap
> > --------------------------------------------------------------------------------
> >  1              true                                   true
> >  2              true                                   false
> >  3              false                                  true
> >  4              false                                  false
> 
> 1 should replicate WRITE SAME, in case the unmap granularity of the
> target is different from that of the source.  In that case, a discard on
> the target might leave some sectors untouched.  Writing zeroes would
> ensure matching data between the source and the target.
> 
> 2 should _not_ discard: it should write zeroes even at the cost of
> making the target fully provisioned.  Perhaps you can optimize it by
> looking at bdrv_get_block_status for the target, and checking the answer
> for BDRV_ZERO.
> 
> 3 and 4 can use discard on the target.
> 
> So it looks like only the source setting matters.
> 
> We need to check the cost of bdrv_co_get_block_status for "raw", too.
> If it's too expensive, that can be a problem.

In my test, bdrv_is_allocated_above() takes 10~20us on ext4 raw image on my
laptop SATA.  And also note that:

	/*
	 * ...
	 *
	 * 'pnum' is set to the number of sectors (including and immediately following
	 * the specified sector) that are known to be in the same
	 * allocated/unallocated state.
	 *
	 * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
	 * beyond the end of the disk image it will be clamped.
	 */
	static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,

So we currently don't coalesce the sequential dirty sectors.

Was that intentional?

Fam

> 
> Paolo
> 
> > For case 2 & 3 it's probably better to mirror the actual reading of source.
> > 
> > I'm not sure about 4.
> 
> Even in case 1, discard could be "UNMAP" and write zeroes could be
> "WRITE SAME".  If the unmap granularity of the target is For unaligned
> sectors, UNMAP might leave some sectors aside while WRITE SAME will
> write with zeroes.

Paolo Bonzini May 11, 2015, 8:33 a.m. UTC | #7

On 11/05/2015 10:02, Fam Zheng wrote:
> 
> 	/*
> 	 * ...
> 	 *
> 	 * 'pnum' is set to the number of sectors (including and immediately following
> 	 * the specified sector) that are known to be in the same
> 	 * allocated/unallocated state.
> 	 *
> 	 * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
> 	 * beyond the end of the disk image it will be clamped.
> 	 */
> 	static int64_t coroutine_fn bdrv_co_get_block_status(BlockDriverState *bs,
> 
> So we currently don't coalesce the sequential dirty sectors.
> 
> Was that intentional?

The idea, I think, is that for some drivers bdrv_co_get_block_status is
O(nb_sectors).

It's okay to let a driver return *pnum > nb_sectors.  You do need to
audit the callers, though.  It would be nice also to add TODOs whenever
the code is fine but it could explot *pnum > nb_sectors to get better
performance.

Paolo

[1/4] block: Fix dirty bitmap in bdrv_co_discard

Commit Message

Comments

Patch