Patchwork [4/4] qemu-img: conditionally discard target on convert

login
register
mail settings
Submitter Peter Lieven
Date July 15, 2013, 10:49 a.m.
Message ID <1373885375-13601-5-git-send-email-pl@kamp.de>
Download mbox | patch
Permalink /patch/258994/
State New
Headers show

Comments

Peter Lieven - July 15, 2013, 10:49 a.m.
if a destination has has_zero_init = 0, but it supports
discard zeroes use discard to convert the target
into an all zero device.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 qemu-img.c |   56 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)
Kevin Wolf - July 17, 2013, 8:46 a.m.
Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
> if a destination has has_zero_init = 0, but it supports
> discard zeroes use discard to convert the target
> into an all zero device.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>

Wouldn't it be better to use bdrv_write_zeroes() and extend the
implementation of that to use discard internally in those block drivers
where it makes sense?

Because here you're not really discarding (i.e. don't care about the
sectors any more), but you want them to be zeroed.

Kevin
Paolo Bonzini - July 17, 2013, 9:58 a.m.
Il 17/07/2013 10:46, Kevin Wolf ha scritto:
> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>> if a destination has has_zero_init = 0, but it supports
>> discard zeroes use discard to convert the target
>> into an all zero device.
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> 
> Wouldn't it be better to use bdrv_write_zeroes() and extend the
> implementation of that to use discard internally in those block drivers
> where it makes sense?
> 
> Because here you're not really discarding (i.e. don't care about the
> sectors any more), but you want them to be zeroed.

I thought the same yesterday when reviewing the series, but I'm not
convinced.

Discarding is not always the right way to write zeroes, because it can
disrupt performance.  It may be fine when you are already going to write
a sparse image (as is the case for qemu-img convert), but not in
general.  So if you just used write_zeroes, it would have to fall under
yet another -drive option (or an extension to "-drive discard").  I
think what Peter did is a good compromise in the end.

BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
zeroes blocks, but is that true for unaligned operations?

Paolo
Peter Lieven - July 17, 2013, 10:21 a.m.
Am 17.07.2013 um 11:58 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
>> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>>> if a destination has has_zero_init = 0, but it supports
>>> discard zeroes use discard to convert the target
>>> into an all zero device.
>>> 
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> 
>> Wouldn't it be better to use bdrv_write_zeroes() and extend the
>> implementation of that to use discard internally in those block drivers
>> where it makes sense?
>> 
>> Because here you're not really discarding (i.e. don't care about the
>> sectors any more), but you want them to be zeroed.
> 
> I thought the same yesterday when reviewing the series, but I'm not
> convinced.
> 
> Discarding is not always the right way to write zeroes, because it can
> disrupt performance.  It may be fine when you are already going to write
> a sparse image (as is the case for qemu-img convert), but not in
> general.  So if you just used write_zeroes, it would have to fall under
> yet another -drive option (or an extension to "-drive discard").  I
> think what Peter did is a good compromise in the end.
> 
> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
> zeroes blocks, but is that true for unaligned operations?

Good question, I will pass it to ronnie. My guess is that the command will fail with
a check condition if it failed to unmap the data. From what Ronnie sent earlier
it should be guaranteed that the blocks are at least zero after the unmap command.

As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
which should be a multiple of the alignment. It also checks if sectors are deallocated
after the unmap afterwards. If the unmap fails it falls back to has_zero_init =1.

But for the write_zeroes patch of my iscsi series it might be better to check after the discard
if the sectors are unallocated and otherways fail with -ENOTSUP. Just to be absolutely safe.

Peter
Peter Lieven - July 17, 2013, 10:22 a.m.
Am 17.07.2013 um 10:46 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>> if a destination has has_zero_init = 0, but it supports
>> discard zeroes use discard to convert the target
>> into an all zero device.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> 
> Wouldn't it be better to use bdrv_write_zeroes() and extend the
> implementation of that to use discard internally in those block drivers
> where it makes sense?
> 
> Because here you're not really discarding (i.e. don't care about the
> sectors any more), but you want them to be zeroed.

It is just a fall back in case we can't decide if has_zero_init is 1 easily.
This general approach has the benefit that it is also good for any host device
that has discard zeroes.

Peter
Kevin Wolf - July 17, 2013, 10:25 a.m.
Am 17.07.2013 um 11:58 hat Paolo Bonzini geschrieben:
> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
> > Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
> >> if a destination has has_zero_init = 0, but it supports
> >> discard zeroes use discard to convert the target
> >> into an all zero device.
> >>
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> > 
> > Wouldn't it be better to use bdrv_write_zeroes() and extend the
> > implementation of that to use discard internally in those block drivers
> > where it makes sense?
> > 
> > Because here you're not really discarding (i.e. don't care about the
> > sectors any more), but you want them to be zeroed.
> 
> I thought the same yesterday when reviewing the series, but I'm not
> convinced.
> 
> Discarding is not always the right way to write zeroes, because it can
> disrupt performance.  It may be fine when you are already going to write
> a sparse image (as is the case for qemu-img convert), but not in
> general.  So if you just used write_zeroes, it would have to fall under
> yet another -drive option (or an extension to "-drive discard").

Maybe we need a flag to bdrv_write_zeroes() that specifies whether to
use discard or not, kind of like the unmap bit in WRITE SAME.

On the other hand, I think you're right that this is really policy,
and as such shouldn't be hardcoded based on our guesses, but be
configurable.

> I think what Peter did is a good compromise in the end.

At the very least it must become a separate function. img_convert() is
already too big and too deeply nested.

> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
> zeroes blocks, but is that true for unaligned operations?

SBC-3 Rev. 35e, 5.16 READ CAPACITY (16) command:

    "A logical block provisioning read zeros (LBPRZ) bit set to one
    indicates that, for read commands specifying an unmapped LBA (see
    4.7.4.5), the device server returns user data set to zero [...]"

So it depends on the block provisioning state of the LBA, not on the
operations that were performed on it.

5.28 UNMAP command:

    If the ANCHOR bit in the CDB is set to zero, and the logical unit is
    thin provisioned (see 4.7.3.3), then the logical block provisioning
    state for each specified LBA:

    a) should become deallocated;
    b) may become anchored; or
    c) may remain unchanged.

So with UNMAP, I think you don't have any guarantees that the LBA
becomes unmapped and therefore zeroed. It could just keep its current
data. No matter whether your request was aligned or not.

Kevin
Kevin Wolf - July 17, 2013, 10:27 a.m.
Am 17.07.2013 um 12:21 hat Peter Lieven geschrieben:
> 
> Am 17.07.2013 um 11:58 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
> > Il 17/07/2013 10:46, Kevin Wolf ha scritto:
> >> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
> >>> if a destination has has_zero_init = 0, but it supports
> >>> discard zeroes use discard to convert the target
> >>> into an all zero device.
> >>> 
> >>> Signed-off-by: Peter Lieven <pl@kamp.de>
> >> 
> >> Wouldn't it be better to use bdrv_write_zeroes() and extend the
> >> implementation of that to use discard internally in those block drivers
> >> where it makes sense?
> >> 
> >> Because here you're not really discarding (i.e. don't care about the
> >> sectors any more), but you want them to be zeroed.
> > 
> > I thought the same yesterday when reviewing the series, but I'm not
> > convinced.
> > 
> > Discarding is not always the right way to write zeroes, because it can
> > disrupt performance.  It may be fine when you are already going to write
> > a sparse image (as is the case for qemu-img convert), but not in
> > general.  So if you just used write_zeroes, it would have to fall under
> > yet another -drive option (or an extension to "-drive discard").  I
> > think what Peter did is a good compromise in the end.
> > 
> > BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
> > zeroes blocks, but is that true for unaligned operations?
> 
> Good question, I will pass it to ronnie. My guess is that the command will fail with
> a check condition if it failed to unmap the data. From what Ronnie sent earlier
> it should be guaranteed that the blocks are at least zero after the unmap command.
> 
> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
> which should be a multiple of the alignment. It also checks if sectors are deallocated
> after the unmap afterwards. If the unmap fails it falls back to has_zero_init =1.

Well, you use bdrv_discard(), and ignoring discards is valid. Just
another reason to use bdrv_write_zeroes() instead.

Kevin
Peter Lieven - July 17, 2013, 10:31 a.m.
Am 17.07.2013 um 12:27 schrieb Kevin Wolf <kwolf@redhat.com>:

> Am 17.07.2013 um 12:21 hat Peter Lieven geschrieben:
>> 
>> Am 17.07.2013 um 11:58 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>> 
>>> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
>>>> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>>>>> if a destination has has_zero_init = 0, but it supports
>>>>> discard zeroes use discard to convert the target
>>>>> into an all zero device.
>>>>> 
>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> 
>>>> Wouldn't it be better to use bdrv_write_zeroes() and extend the
>>>> implementation of that to use discard internally in those block drivers
>>>> where it makes sense?
>>>> 
>>>> Because here you're not really discarding (i.e. don't care about the
>>>> sectors any more), but you want them to be zeroed.
>>> 
>>> I thought the same yesterday when reviewing the series, but I'm not
>>> convinced.
>>> 
>>> Discarding is not always the right way to write zeroes, because it can
>>> disrupt performance.  It may be fine when you are already going to write
>>> a sparse image (as is the case for qemu-img convert), but not in
>>> general.  So if you just used write_zeroes, it would have to fall under
>>> yet another -drive option (or an extension to "-drive discard").  I
>>> think what Peter did is a good compromise in the end.
>>> 
>>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>>> zeroes blocks, but is that true for unaligned operations?
>> 
>> Good question, I will pass it to ronnie. My guess is that the command will fail with
>> a check condition if it failed to unmap the data. From what Ronnie sent earlier
>> it should be guaranteed that the blocks are at least zero after the unmap command.
>> 
>> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
>> which should be a multiple of the alignment. It also checks if sectors are deallocated
>> after the unmap afterwards. If the unmap fails it falls back to has_zero_init =1.
> 
> Well, you use bdrv_discard(), and ignoring discards is valid. Just
> another reason to use bdrv_write_zeroes() instead.

I use discard and afterwards check the provisioning state again. Just in case it silently failed for
whatever reason.

I am ok to have a general algorithm in bdrv_write_zeroes that is used as soon as the
block driver return discard_zeroes.

Would you be happy with that? I would tweak it that way that it uses discard to write zeroes
and double checks the provisioning status after the discard. If it hasn't turned to unmapped
real zeroes are written. I would further add code to honor max_unmap and discard_alignment
information.

Peter
Paolo Bonzini - July 17, 2013, 10:45 a.m.
Il 17/07/2013 12:21, Peter Lieven ha scritto:
>> > BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>> > zeroes blocks, but is that true for unaligned operations?
> Good question, I will pass it to ronnie. My guess is that the command will fail with
> a check condition if it failed to unmap the data. From what Ronnie sent earlier
> it should be guaranteed that the blocks are at least zero after the unmap command.

I'm not so sure about that, SBC says explicitly that a misaligned unmap
request "may result in unmap operations on fewer LBAs than requested".

Perhaps it's better to do this:

- if LBPWS=1, use WRITE SAME(16).  For WRITE SAME it is explicit that
"if the device server does not deallocate or anchor the LBA, then the
device server shall perform the specified write operation".  Expose the
value of LBPRZ in the "discard zeroes" field of BlockDriverInfo.

- if LBPWS=0, you can still support discard iff LBPU=1 and use UNMAP.
But if LBPRZ=1 and LBPWS=0, you should not expose the "discard zeroes"
field of BlockDriverInfo.

> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
> which should be a multiple of the alignment.

Also, the remainder itself (which SBC calls "UNMAP GRANULARITY
ALIGNMENT") may not be 0.  I think the above trick (only exposing LBPRZ
if you'll use WRITE SAME) will sidestep the problem.  It may also remove
the need to fall back to has_zero_init=1.

Paolo
Paolo Bonzini - July 17, 2013, 10:47 a.m.
Il 17/07/2013 12:27, Kevin Wolf ha scritto:
> Am 17.07.2013 um 12:21 hat Peter Lieven geschrieben:
>>
>> Am 17.07.2013 um 11:58 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>
>>> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
>>>> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>>>>> if a destination has has_zero_init = 0, but it supports
>>>>> discard zeroes use discard to convert the target
>>>>> into an all zero device.
>>>>>
>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>
>>>> Wouldn't it be better to use bdrv_write_zeroes() and extend the
>>>> implementation of that to use discard internally in those block drivers
>>>> where it makes sense?
>>>>
>>>> Because here you're not really discarding (i.e. don't care about the
>>>> sectors any more), but you want them to be zeroed.
>>>
>>> I thought the same yesterday when reviewing the series, but I'm not
>>> convinced.
>>>
>>> Discarding is not always the right way to write zeroes, because it can
>>> disrupt performance.  It may be fine when you are already going to write
>>> a sparse image (as is the case for qemu-img convert), but not in
>>> general.  So if you just used write_zeroes, it would have to fall under
>>> yet another -drive option (or an extension to "-drive discard").  I
>>> think what Peter did is a good compromise in the end.
>>>
>>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>>> zeroes blocks, but is that true for unaligned operations?
>>
>> Good question, I will pass it to ronnie. My guess is that the command will fail with
>> a check condition if it failed to unmap the data. From what Ronnie sent earlier
>> it should be guaranteed that the blocks are at least zero after the unmap command.
>>
>> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
>> which should be a multiple of the alignment. It also checks if sectors are deallocated
>> after the unmap afterwards. If the unmap fails it falls back to has_zero_init =1.
> 
> Well, you use bdrv_discard(), and ignoring discards is valid. Just
> another reason to use bdrv_write_zeroes() instead.

He's only using it if discard_zeroes is true in the new BlockDriverInfo.
 We can define the semantics of that bit, and I think defining it as
"ignored discards will still write zeroes" is a good thing (same as what
SCSI targets do if you use WRITE SAME to do the discard operation).

Paolo
Peter Lieven - July 17, 2013, 2:19 p.m.
Am 17.07.2013 um 12:47 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 17/07/2013 12:27, Kevin Wolf ha scritto:
>> Am 17.07.2013 um 12:21 hat Peter Lieven geschrieben:
>>> 
>>> Am 17.07.2013 um 11:58 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>> 
>>>> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
>>>>> Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>>>>>> if a destination has has_zero_init = 0, but it supports
>>>>>> discard zeroes use discard to convert the target
>>>>>> into an all zero device.
>>>>>> 
>>>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>>> 
>>>>> Wouldn't it be better to use bdrv_write_zeroes() and extend the
>>>>> implementation of that to use discard internally in those block drivers
>>>>> where it makes sense?
>>>>> 
>>>>> Because here you're not really discarding (i.e. don't care about the
>>>>> sectors any more), but you want them to be zeroed.
>>>> 
>>>> I thought the same yesterday when reviewing the series, but I'm not
>>>> convinced.
>>>> 
>>>> Discarding is not always the right way to write zeroes, because it can
>>>> disrupt performance.  It may be fine when you are already going to write
>>>> a sparse image (as is the case for qemu-img convert), but not in
>>>> general.  So if you just used write_zeroes, it would have to fall under
>>>> yet another -drive option (or an extension to "-drive discard").  I
>>>> think what Peter did is a good compromise in the end.
>>>> 
>>>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>>>> zeroes blocks, but is that true for unaligned operations?
>>> 
>>> Good question, I will pass it to ronnie. My guess is that the command will fail with
>>> a check condition if it failed to unmap the data. From what Ronnie sent earlier
>>> it should be guaranteed that the blocks are at least zero after the unmap command.
>>> 
>>> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
>>> which should be a multiple of the alignment. It also checks if sectors are deallocated
>>> after the unmap afterwards. If the unmap fails it falls back to has_zero_init =1.
>> 
>> Well, you use bdrv_discard(), and ignoring discards is valid. Just
>> another reason to use bdrv_write_zeroes() instead.
> 
> He's only using it if discard_zeroes is true in the new BlockDriverInfo.
> We can define the semantics of that bit, and I think defining it as
> "ignored discards will still write zeroes" is a good thing (same as what
> SCSI targets do if you use WRITE SAME to do the discard operation).

Good idea


> 
> Paolo
Peter Lieven - July 17, 2013, 2:21 p.m.
Am 17.07.2013 um 12:45 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 17/07/2013 12:21, Peter Lieven ha scritto:
>>>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>>>> zeroes blocks, but is that true for unaligned operations?
>> Good question, I will pass it to ronnie. My guess is that the command will fail with
>> a check condition if it failed to unmap the data. From what Ronnie sent earlier
>> it should be guaranteed that the blocks are at least zero after the unmap command.
> 
> I'm not so sure about that, SBC says explicitly that a misaligned unmap
> request "may result in unmap operations on fewer LBAs than requested".
> 
> Perhaps it's better to do this:
> 
> - if LBPWS=1, use WRITE SAME(16).  For WRITE SAME it is explicit that
> "if the device server does not deallocate or anchor the LBA, then the
> device server shall perform the specified write operation".  Expose the
> value of LBPRZ in the "discard zeroes" field of BlockDriverInfo.
> 
> - if LBPWS=0, you can still support discard iff LBPU=1 and use UNMAP.
> But if LBPRZ=1 and LBPWS=0, you should not expose the "discard zeroes"
> field of BlockDriverInfo.
> 
>> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
>> which should be a multiple of the alignment.
> 
> Also, the remainder itself (which SBC calls "UNMAP GRANULARITY
> ALIGNMENT") may not be 0.  I think the above trick (only exposing LBPRZ
> if you'll use WRITE SAME) will sidestep the problem.  It may also remove
> the need to fall back to has_zero_init=1.

You mean has_zero_init = 0?

What is if the max write same size and max unmap differ?

Peter

> 
> Paolo
Paolo Bonzini - July 17, 2013, 2:26 p.m.
Il 17/07/2013 16:21, Peter Lieven ha scritto:
> Am 17.07.2013 um 12:45 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>> Il 17/07/2013 12:21, Peter Lieven ha scritto:
>>>>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>>>>> zeroes blocks, but is that true for unaligned operations?
>>> Good question, I will pass it to ronnie. My guess is that the command will fail with
>>> a check condition if it failed to unmap the data. From what Ronnie sent earlier
>>> it should be guaranteed that the blocks are at least zero after the unmap command.
>>
>> I'm not so sure about that, SBC says explicitly that a misaligned unmap
>> request "may result in unmap operations on fewer LBAs than requested".
>>
>> Perhaps it's better to do this:
>>
>> - if LBPWS=1, use WRITE SAME(16).  For WRITE SAME it is explicit that
>> "if the device server does not deallocate or anchor the LBA, then the
>> device server shall perform the specified write operation".  Expose the
>> value of LBPRZ in the "discard zeroes" field of BlockDriverInfo.
>>
>> - if LBPWS=0, you can still support discard iff LBPU=1 and use UNMAP.
>> But if LBPRZ=1 and LBPWS=0, you should not expose the "discard zeroes"
>> field of BlockDriverInfo.
>>
>>> As for the qemu-img patch this shouldn't matter. It uses always blocks of bdi->max_unmap
>>> which should be a multiple of the alignment.
>>
>> Also, the remainder itself (which SBC calls "UNMAP GRANULARITY
>> ALIGNMENT") may not be 0.  I think the above trick (only exposing LBPRZ
>> if you'll use WRITE SAME) will sidestep the problem.  It may also remove
>> the need to fall back to has_zero_init=1.
> 
> You mean has_zero_init = 0?

Yeah (copied the typo from you ;)).

> What is if the max write same size and max unmap differ?

You only need to consult one of the two, depending on whether you plan
to use WRITE SAME or UNMAP.  In other words, iscsi_open takes all the
decisions and stores the values that will be returned by BlockDriverInfo.

By the way, the semantics we gave to discard_zeroes, mean that
discard_zeroes must be 0 if the flags do not include BDRV_O_UNMAP.
Please ensure this is the case, and add an assertion about this in
bdrv_get_info.  Document everything that we decide about the semantics!

Paolo
ronniesahlberg@gmail.com - July 17, 2013, 3:54 p.m.
On Wed, Jul 17, 2013 at 3:25 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 17.07.2013 um 11:58 hat Paolo Bonzini geschrieben:
>> Il 17/07/2013 10:46, Kevin Wolf ha scritto:
>> > Am 15.07.2013 um 12:49 hat Peter Lieven geschrieben:
>> >> if a destination has has_zero_init = 0, but it supports
>> >> discard zeroes use discard to convert the target
>> >> into an all zero device.
>> >>
>> >> Signed-off-by: Peter Lieven <pl@kamp.de>
>> >
>> > Wouldn't it be better to use bdrv_write_zeroes() and extend the
>> > implementation of that to use discard internally in those block drivers
>> > where it makes sense?
>> >
>> > Because here you're not really discarding (i.e. don't care about the
>> > sectors any more), but you want them to be zeroed.
>>
>> I thought the same yesterday when reviewing the series, but I'm not
>> convinced.
>>
>> Discarding is not always the right way to write zeroes, because it can
>> disrupt performance.  It may be fine when you are already going to write
>> a sparse image (as is the case for qemu-img convert), but not in
>> general.  So if you just used write_zeroes, it would have to fall under
>> yet another -drive option (or an extension to "-drive discard").
>
> Maybe we need a flag to bdrv_write_zeroes() that specifies whether to
> use discard or not, kind of like the unmap bit in WRITE SAME.
>
> On the other hand, I think you're right that this is really policy,
> and as such shouldn't be hardcoded based on our guesses, but be
> configurable.
>
>> I think what Peter did is a good compromise in the end.
>
> At the very least it must become a separate function. img_convert() is
> already too big and too deeply nested.
>
>> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
>> zeroes blocks, but is that true for unaligned operations?
>
> SBC-3 Rev. 35e, 5.16 READ CAPACITY (16) command:
>
>     "A logical block provisioning read zeros (LBPRZ) bit set to one
>     indicates that, for read commands specifying an unmapped LBA (see
>     4.7.4.5), the device server returns user data set to zero [...]"
>
> So it depends on the block provisioning state of the LBA, not on the
> operations that were performed on it.
>
> 5.28 UNMAP command:
>
>     If the ANCHOR bit in the CDB is set to zero, and the logical unit is
>     thin provisioned (see 4.7.3.3), then the logical block provisioning
>     state for each specified LBA:
>
>     a) should become deallocated;
>     b) may become anchored; or
>     c) may remain unchanged.
>
> So with UNMAP, I think you don't have any guarantees that the LBA
> becomes unmapped and therefore zeroed. It could just keep its current
> data. No matter whether your request was aligned or not.
>

If you check the LogicalBlockLimits VPD page it has :

---
The UNMAP GRANULARITY ALIGNMENT field indicates the LBA of the first
logical block to which the OPTIMAL
field applies. The unmap granularity alignment is used to calculate an
optimal unmap
request starting LBA as follows:
UNMAP GRANULARITY
optimal unmap request starting LBA = (n × optimal unmap granularity) +
unmap granularity alignment
where:
n is zero or any positive integer value;
optimal unmap granularity is the value in the OPTIMAL UNMAP
GRANULARITY field; and
unmap granularity alignment is the value in the UNMAP GRANULARITY
ALIGNMENT field.
An unmap request with a starting LBA that is not optimal may result in
unmap operations on fewer LBAs than
requested.
---

So if Peter checks OPTIMAL_UNMAP_GRANULARITY and
UNMAP_GRANULARITY_ALIGNMENT and make sure that
the starting LBA is   (n * OUG) + UGA   and that the number of blocks
is a multiple of OUG then you should be fine.
Those unmaps should not silently fail.


I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
"optimal unmap request" then the blocks will become unmapped and they
will
read back as 0.


(
Using this check should also make it safe with UNMAP for 4k block
devices that emulate 512b blocks.
Those devices should report OUG==8 accordingly so you dont need to
check what LogicalBlocksPerPhysicalBlockExponent is.
)

regards
ronnie sahlberg
Paolo Bonzini - July 17, 2013, 4:27 p.m.
Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
> "optimal unmap request" then the blocks will become unmapped and they
> will read back as 0.

Yes, but it is not reasonable to assume that bdrv_discard will only
receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
capacity), is the safer thing to do.

Paolo
ronniesahlberg@gmail.com - July 17, 2013, 4:31 p.m.
On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
>> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
>> "optimal unmap request" then the blocks will become unmapped and they
>> will read back as 0.
>
> Yes, but it is not reasonable to assume that bdrv_discard will only
> receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
> exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
> capacity), is the safer thing to do.
>
> Paolo

ACK.

WRITESAME10/16 with UNMAP flag set is probably best.
Peter Lieven - July 17, 2013, 5:02 p.m.
For Disks we always use read/write16 so i think we Should also use writesame16. Or not?

Von meinem iPhone gesendet

Am 17.07.2013 um 18:31 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:

> On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
>>> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
>>> "optimal unmap request" then the blocks will become unmapped and they
>>> will read back as 0.
>> 
>> Yes, but it is not reasonable to assume that bdrv_discard will only
>> receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
>> exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
>> capacity), is the safer thing to do.
>> 
>> Paolo
> 
> ACK.
> 
> WRITESAME10/16 with UNMAP flag set is probably best.
Paolo Bonzini - July 17, 2013, 5:04 p.m.
Il 17/07/2013 19:02, Peter Lieven ha scritto:
> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?

Yes.

Remember you can still use UNMAP if LBPRZ=0.

Paolo

> Von meinem iPhone gesendet
> 
> Am 17.07.2013 um 18:31 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
> 
>> On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
>>>> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
>>>> "optimal unmap request" then the blocks will become unmapped and they
>>>> will read back as 0.
>>>
>>> Yes, but it is not reasonable to assume that bdrv_discard will only
>>> receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
>>> exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
>>> capacity), is the safer thing to do.
>>>
>>> Paolo
>>
>> ACK.
>>
>> WRITESAME10/16 with UNMAP flag set is probably best.
ronniesahlberg@gmail.com - July 17, 2013, 5:09 p.m.
On Wed, Jul 17, 2013 at 10:02 AM, Peter Lieven <pl@kamp.de> wrote:
> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?

Sounds good.


>
> Von meinem iPhone gesendet
>
> Am 17.07.2013 um 18:31 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>
>> On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
>>>> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
>>>> "optimal unmap request" then the blocks will become unmapped and they
>>>> will read back as 0.
>>>
>>> Yes, but it is not reasonable to assume that bdrv_discard will only
>>> receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
>>> exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
>>> capacity), is the safer thing to do.
>>>
>>> Paolo
>>
>> ACK.
>>
>> WRITESAME10/16 with UNMAP flag set is probably best.
Peter Lieven - July 17, 2013, 5:48 p.m.
Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
> 
> Yes.
> 
> Remember you can still use UNMAP if LBPRZ=0.

I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.

Maybe we should call it discard_writes_zeroes or similar.

Discard_zeroes is sth that should only indicate if lbprz == 1. At least if we refer to the Linux ioctl. We could include both in BDI.

Peter

> 
> Paolo
> 
>> Von meinem iPhone gesendet
>> 
>> Am 17.07.2013 um 18:31 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:
>> 
>>> On Wed, Jul 17, 2013 at 9:27 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>> Il 17/07/2013 17:54, ronnie sahlberg ha scritto:
>>>>> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
>>>>> "optimal unmap request" then the blocks will become unmapped and they
>>>>> will read back as 0.
>>>> 
>>>> Yes, but it is not reasonable to assume that bdrv_discard will only
>>>> receive "optimal" requests.  Thus, using WRITE SAME for LBPRZ=1, and not
>>>> exposing LBPRZ=1 if LBPWS=0 (may also use LBPWS10 depending on the
>>>> capacity), is the safer thing to do.
>>>> 
>>>> Paolo
>>> 
>>> ACK.
>>> 
>>> WRITESAME10/16 with UNMAP flag set is probably best.
>
Paolo Bonzini - July 17, 2013, 8 p.m.
Il 17/07/2013 19:48, Peter Lieven ha scritto:
> 
> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>>> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
>>
>> Yes.
>>
>> Remember you can still use UNMAP if LBPRZ=0.
> 
> I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.
> 
> Maybe we should call it discard_writes_zeroes or similar.
> 
> Discard_zeroes is sth that should only indicate if lbprz == 1. At
> least if we refer to the Linux ioctl. We could include both in BDI.

I don't know... I don't see the usefulness of
discard_zeroes_part_of_what_it_was_told_to... :)

Paolo
Kevin Wolf - July 18, 2013, 9:21 a.m.
Am 17.07.2013 um 17:54 hat ronnie sahlberg geschrieben:
> On Wed, Jul 17, 2013 at 3:25 AM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 17.07.2013 um 11:58 hat Paolo Bonzini geschrieben:
> >> BTW, Peter and Ronnie: we were assuming that UNMAP with LBPRZ=1 always
> >> zeroes blocks, but is that true for unaligned operations?
> >
> > SBC-3 Rev. 35e, 5.16 READ CAPACITY (16) command:
> >
> >     "A logical block provisioning read zeros (LBPRZ) bit set to one
> >     indicates that, for read commands specifying an unmapped LBA (see
> >     4.7.4.5), the device server returns user data set to zero [...]"
> >
> > So it depends on the block provisioning state of the LBA, not on the
> > operations that were performed on it.
> >
> > 5.28 UNMAP command:
> >
> >     If the ANCHOR bit in the CDB is set to zero, and the logical unit is
> >     thin provisioned (see 4.7.3.3), then the logical block provisioning
> >     state for each specified LBA:
> >
> >     a) should become deallocated;
> >     b) may become anchored; or
> >     c) may remain unchanged.
> >
> > So with UNMAP, I think you don't have any guarantees that the LBA
> > becomes unmapped and therefore zeroed. It could just keep its current
> > data. No matter whether your request was aligned or not.
> >
> 
> If you check the LogicalBlockLimits VPD page it has :
> 
> ---
> [...]
> An unmap request with a starting LBA that is not optimal may result in
> unmap operations on fewer LBAs than
> requested.
> ---
> 
> So if Peter checks OPTIMAL_UNMAP_GRANULARITY and
> UNMAP_GRANULARITY_ALIGNMENT and make sure that
> the starting LBA is   (n * OUG) + UGA   and that the number of blocks
> is a multiple of OUG then you should be fine.

Okay, playing language^Whardware spec lawyer...

You've got the logic backwards. The spec says "if misaligned then unmap
fewer LBAs". This is an implication, not an equivalence: It doesn't say
"if unmap fewer LBAs then misaligned". The sentence from the VPD page
just reinforces for a special case what I have quoted above (as take it
as a hint like "this would be the typical result").

The point is that the spec just says it "should" become deallocated, not
"must" or "shall", and there is no condition connected with it. "should"
is defined in the usual way (3.5.15):

    3.5.13 should
    a keyword indicating flexibility of choice with a strongly preferred
    alternative; “should” is equivalent to the phrase “it is strongly
    recommended”

So, it would be nice if UNMAP unmapped, but it doesn't have to.

> I think it is reasonable to assume that IF LBPRZ==1 and IF it is an
> "optimal unmap request" then the blocks will become unmapped and they
> will
> read back as 0.

It might be reasonable to expect in practice, but as I see it, it's not
demanded by the spec.

WRITE SAME is different, there you actually get the guarantee that the
desired data is read back. I strongly recommend using it over UNMAP
therefore if you don't want to discard, but write zeroes.

Kevin
Kevin Wolf - July 18, 2013, 9:23 a.m.
Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
> 
> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
> > Il 17/07/2013 19:02, Peter Lieven ha scritto:
> >> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
> > 
> > Yes.
> > 
> > Remember you can still use UNMAP if LBPRZ=0.
> 
> I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.
> 
> Maybe we should call it discard_writes_zeroes or similar.
> 
> Discard_zeroes is sth that should only indicate if lbprz == 1. At least if we refer to the Linux ioctl. We could include both in BDI.

Maybe what we really should do is to define different operations (with
an exact behaviour) instead of having one bdrv_discard() and then adding
flags everywhere to tell what the operation is doing exactly.

Kevin
Paolo Bonzini - July 18, 2013, 10:24 a.m.
Il 18/07/2013 11:23, Kevin Wolf ha scritto:
> Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
>>
>> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>
>>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>>>> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
>>>
>>> Yes.
>>>
>>> Remember you can still use UNMAP if LBPRZ=0.
>>
>> I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.
>>
>> Maybe we should call it discard_writes_zeroes or similar.
>>
>> Discard_zeroes is sth that should only indicate if lbprz == 1. At least if we refer to the Linux ioctl. We could include both in BDI.
> 
> Maybe what we really should do is to define different operations (with
> an exact behaviour) instead of having one bdrv_discard() and then adding
> flags everywhere to tell what the operation is doing exactly.

A BDRV_MAY_UNMAP flag for bdrv_write_zeroes?

Paolo
Kevin Wolf - July 18, 2013, 10:42 a.m.
Am 18.07.2013 um 12:24 hat Paolo Bonzini geschrieben:
> Il 18/07/2013 11:23, Kevin Wolf ha scritto:
> > Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
> >>
> >> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> >>
> >>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
> >>>> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
> >>>
> >>> Yes.
> >>>
> >>> Remember you can still use UNMAP if LBPRZ=0.
> >>
> >> I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.
> >>
> >> Maybe we should call it discard_writes_zeroes or similar.
> >>
> >> Discard_zeroes is sth that should only indicate if lbprz == 1. At least if we refer to the Linux ioctl. We could include both in BDI.
> > 
> > Maybe what we really should do is to define different operations (with
> > an exact behaviour) instead of having one bdrv_discard() and then adding
> > flags everywhere to tell what the operation is doing exactly.
> 
> A BDRV_MAY_UNMAP flag for bdrv_write_zeroes?

Something like that, yes.

Kevin
Peter Lieven - July 18, 2013, 10:44 a.m.
On 18.07.2013 12:24, Paolo Bonzini wrote:
> Il 18/07/2013 11:23, Kevin Wolf ha scritto:
>> Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
>>> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>>>>> For Disks we always use read/write16 so i think we Should also use writesame16. Or not?
>>>> Yes.
>>>>
>>>> Remember you can still use UNMAP if LBPRZ=0.
>>> I can always use it if writesame is not available, but in this case bdi->discard_zeroes must be 0.
>>>
>>> Maybe we should call it discard_writes_zeroes or similar.
>>>
>>> Discard_zeroes is sth that should only indicate if lbprz == 1. At least if we refer to the Linux ioctl. We could include both in BDI.
>> Maybe what we really should do is to define different operations (with
>> an exact behaviour) instead of having one bdrv_discard() and then adding
>> flags everywhere to tell what the operation is doing exactly.
> A BDRV_MAY_UNMAP flag for bdrv_write_zeroes?

I thought that we wanted to add a paramter to the BDI (call it write_zeroes_w_discard).
If this is set the bdrv MUST accept a flag to bdrv_discard() lets call it BDRV_DISCARD_WRITE_ZEROES
and he has to ensure that all sectors specified in bdrv_discard() read as zero after the operation.

If this flag is not set (e.g. when the OS issues a normal discard) the operation may still silently fail with
undefined provisioning state and content of the specified sectors.

With the above flag a general optimization of bdrv_write_zeroes would be possible, that
calls bdrv_discard with BDRV_DISCARD_WRITE_ZEROES if bdi->write_zeroes_w_discard == 1.

For iscsi this would mean it sets bdi->write_zeroes_w_discard = 1 iff BDRV_O_UNMAP == 1 and lbpme == 1 and lbpws == 1 and lbprz == 1.

The second flag that could be added is bdi->discard_zeroes. But this only says unmapped blocks
read back as zero.

Peter
Paolo Bonzini - July 18, 2013, 10:56 a.m.
Il 18/07/2013 12:44, Peter Lieven ha scritto:
> On 18.07.2013 12:24, Paolo Bonzini wrote:
>> Il 18/07/2013 11:23, Kevin Wolf ha scritto:
>>> Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
>>>> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>>
>>>>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>>>>>> For Disks we always use read/write16 so i think we Should also use
>>>>>> writesame16. Or not?
>>>>> Yes.
>>>>>
>>>>> Remember you can still use UNMAP if LBPRZ=0.
>>>> I can always use it if writesame is not available, but in this case
>>>> bdi->discard_zeroes must be 0.
>>>>
>>>> Maybe we should call it discard_writes_zeroes or similar.
>>>>
>>>> Discard_zeroes is sth that should only indicate if lbprz == 1. At
>>>> least if we refer to the Linux ioctl. We could include both in BDI.
>>> Maybe what we really should do is to define different operations (with
>>> an exact behaviour) instead of having one bdrv_discard() and then adding
>>> flags everywhere to tell what the operation is doing exactly.
>> A BDRV_MAY_UNMAP flag for bdrv_write_zeroes?
> 
> I thought that we wanted to add a paramter to the BDI (call it
> write_zeroes_w_discard).

I also thought so, but I like Kevin's idea of not shoehorning it in
bdrv_discard().  Extending bdrv_write_zeroes is a better API.

So far we've avoided "discard zeroes" semantics in QEMU (device models
are also careful not to expose that).  Since the only sane way to
implement what you want is to use the SCSI WRITE SAME command, adding
flags to bdrv_write_zeroes will even be easier because the mapping with
SCSI is natural: discard = UNMAP, write_zeroes = WRITE SAME (either
without or with the UNMAP bit).

> If this is set the bdrv MUST accept a flag to bdrv_discard() lets call
> it BDRV_DISCARD_WRITE_ZEROES
> and he has to ensure that all sectors specified in bdrv_discard() read
> as zero after the operation.
> 
> If this flag is not set (e.g. when the OS issues a normal discard) the
> operation may still silently fail with
> undefined provisioning state and content of the specified sectors.

But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
begin with?  That's why extending bdrv_write_zeroes is preferable.

> With the above flag a general optimization of bdrv_write_zeroes would be
> possible, that calls bdrv_discard with BDRV_DISCARD_WRITE_ZEROES if
> bdi->write_zeroes_w_discard == 1.

No, you cannot do that unconditionally.  Some operations can use it,
some cannot.  Think of SCSI disk emulation: it must not discard is WRITE
SAME is sent without the UNMAP bit!

> The second flag that could be added is bdi->discard_zeroes. But this
> only says unmapped blocks read back as zero.

I think a flag is not needed here.  You can add BDRV_BLOCK_ZERO directly
in iscsi_co_get_block_status.

Paolo
Peter Lieven - July 18, 2013, 11:04 a.m.
On 18.07.2013 12:56, Paolo Bonzini wrote:
> Il 18/07/2013 12:44, Peter Lieven ha scritto:
>> On 18.07.2013 12:24, Paolo Bonzini wrote:
>>> Il 18/07/2013 11:23, Kevin Wolf ha scritto:
>>>> Am 17.07.2013 um 19:48 hat Peter Lieven geschrieben:
>>>>> Am 17.07.2013 um 19:04 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>>>
>>>>>> Il 17/07/2013 19:02, Peter Lieven ha scritto:
>>>>>>> For Disks we always use read/write16 so i think we Should also use
>>>>>>> writesame16. Or not?
>>>>>> Yes.
>>>>>>
>>>>>> Remember you can still use UNMAP if LBPRZ=0.
>>>>> I can always use it if writesame is not available, but in this case
>>>>> bdi->discard_zeroes must be 0.
>>>>>
>>>>> Maybe we should call it discard_writes_zeroes or similar.
>>>>>
>>>>> Discard_zeroes is sth that should only indicate if lbprz == 1. At
>>>>> least if we refer to the Linux ioctl. We could include both in BDI.
>>>> Maybe what we really should do is to define different operations (with
>>>> an exact behaviour) instead of having one bdrv_discard() and then adding
>>>> flags everywhere to tell what the operation is doing exactly.
>>> A BDRV_MAY_UNMAP flag for bdrv_write_zeroes?
>> I thought that we wanted to add a paramter to the BDI (call it
>> write_zeroes_w_discard).
> I also thought so, but I like Kevin's idea of not shoehorning it in
> bdrv_discard().  Extending bdrv_write_zeroes is a better API.
>
> So far we've avoided "discard zeroes" semantics in QEMU (device models
> are also careful not to expose that).  Since the only sane way to
> implement what you want is to use the SCSI WRITE SAME command, adding
> flags to bdrv_write_zeroes will even be easier because the mapping with
> SCSI is natural: discard = UNMAP, write_zeroes = WRITE SAME (either
> without or with the UNMAP bit).
>
>> If this is set the bdrv MUST accept a flag to bdrv_discard() lets call
>> it BDRV_DISCARD_WRITE_ZEROES
>> and he has to ensure that all sectors specified in bdrv_discard() read
>> as zero after the operation.
>>
>> If this flag is not set (e.g. when the OS issues a normal discard) the
>> operation may still silently fail with
>> undefined provisioning state and content of the specified sectors.
> But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
> fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
> begin with?  That's why extending bdrv_write_zeroes is preferable.
In this case wo do not need a flag to the function at all. If the
driver sets bdi->write_zeroes_w_discard = 1 then bdrv_write_zeroes
can use bdrv_discard to write zeroes and the driver has to
ensure that all is zero afterwards.

If the driver would have a better method of writing zeroes than
discard it simply should not set bdi->write_zeroes_w_discard = 1.

Peter
Paolo Bonzini - July 18, 2013, 12:31 p.m.
Il 18/07/2013 13:04, Peter Lieven ha scritto:
>> But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
>> fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
>> begin with?  That's why extending bdrv_write_zeroes is preferable.
> In this case wo do not need a flag to the function at all. If the
> driver sets bdi->write_zeroes_w_discard = 1 then bdrv_write_zeroes
> can use bdrv_discard to write zeroes and the driver has to
> ensure that all is zero afterwards.

Peter, you removed exactly the part of the email where I explained the
wrong part of your reasoning:

   you cannot do that [discard in bdrv_write_zeroes] unconditionally.
   Some operations can use it, some cannot.  Think of SCSI disk
   emulation: it must not discard is WRITE SAME is sent without the
   UNMAP bit!

> If the driver would have a better method of writing zeroes than
> discard it simply should not set bdi->write_zeroes_w_discard = 1.

If the driver had a better method of writing zeroes than discard, it
simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
its bdrv_write_zeros implementation.

With Kevin's proposal, there is no reason to add a flag to bdi.  (Sorry
it took a few iterations to get it right).

Paolo
Peter Lieven - July 18, 2013, 1:29 p.m.
On 18.07.2013 14:31, Paolo Bonzini wrote:
> Il 18/07/2013 13:04, Peter Lieven ha scritto:
>>> But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
>>> fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
>>> begin with?  That's why extending bdrv_write_zeroes is preferable.
>> In this case wo do not need a flag to the function at all. If the
>> driver sets bdi->write_zeroes_w_discard = 1 then bdrv_write_zeroes
>> can use bdrv_discard to write zeroes and the driver has to
>> ensure that all is zero afterwards.
> Peter, you removed exactly the part of the email where I explained the
> wrong part of your reasoning:
>
>     you cannot do that [discard in bdrv_write_zeroes] unconditionally.
>     Some operations can use it, some cannot.  Think of SCSI disk
>     emulation: it must not discard is WRITE SAME is sent without the
>     UNMAP bit!
>
>> If the driver would have a better method of writing zeroes than
>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
> If the driver had a better method of writing zeroes than discard, it
> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
> its bdrv_write_zeros implementation.
ok, but this would require an individual patch in every driver, wouldn't it.
i am ok with that.

the BDRV_MAY_DISCARD flag is at the end a hint if the driver can optimize
writing zeroes by a discard or if real zeroes should be written e.g. to
sanitize the device?

talking for iscsi:
bdrv->discard can remain to use UNMAP and silently fail if lbpu == 0.
bdrv->write_zeroes will use writesame16 and set the unmap flag only if BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
in case lbpws == 0 it will return -ENOSUP.

Correct?

Peter
Paolo Bonzini - July 18, 2013, 1:52 p.m.
Il 18/07/2013 15:29, Peter Lieven ha scritto:
>>> If the driver would have a better method of writing zeroes than
>>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
>> If the driver had a better method of writing zeroes than discard, it
>> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
>> its bdrv_write_zeros implementation.
>
> ok, but this would require an individual patch in every driver, wouldn't
> it.  i am ok with that.

Yes (making the drivers return the flag in the BDI would also require
per-driver patches).

BDRV_MAY_UNMAP is an advisory flag, that the driver is free to ignore
(just like bdrv_discard can be ignored altogether).

block.c probably should avoid passing down BDRV_MAY_UNMAP altogether if
the BDRV_O_UNMAP flag is not set.

> the BDRV_MAY_DISCARD flag is at the end a hint if the driver can optimize
> writing zeroes by a discard or if real zeroes should be written e.g. to
> sanitize the device?
> 
> talking for iscsi:
> bdrv->discard can remain to use UNMAP and silently fail if lbpu == 0.
> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
> in case lbpws == 0 it will return -ENOSUP.

Almost: for LBPWS == 0 it can use WRITE SAME(16) without the unmap flag.
 If that doesn't work either it can return -ENOTSUP.

Paolo

> 
> Correct?
> 
> Peter
>
ronniesahlberg@gmail.com - July 18, 2013, 1:55 p.m.
On Thu, Jul 18, 2013 at 6:29 AM, Peter Lieven <pl@kamp.de> wrote:
> On 18.07.2013 14:31, Paolo Bonzini wrote:
>>
>> Il 18/07/2013 13:04, Peter Lieven ha scritto:
>>>>
>>>> But if you set BDRV_DISCARD_WRITE_ZEROES, then you always need a
>>>> fallback to bdrv_write_zeroes.  Why not just call bdrv_write_zeroes to
>>>> begin with?  That's why extending bdrv_write_zeroes is preferable.
>>>
>>> In this case wo do not need a flag to the function at all. If the
>>> driver sets bdi->write_zeroes_w_discard = 1 then bdrv_write_zeroes
>>> can use bdrv_discard to write zeroes and the driver has to
>>> ensure that all is zero afterwards.
>>
>> Peter, you removed exactly the part of the email where I explained the
>> wrong part of your reasoning:
>>
>>     you cannot do that [discard in bdrv_write_zeroes] unconditionally.
>>     Some operations can use it, some cannot.  Think of SCSI disk
>>     emulation: it must not discard is WRITE SAME is sent without the
>>     UNMAP bit!
>>
>>> If the driver would have a better method of writing zeroes than
>>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
>>
>> If the driver had a better method of writing zeroes than discard, it
>> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
>> its bdrv_write_zeros implementation.
>
> ok, but this would require an individual patch in every driver, wouldn't it.
> i am ok with that.
>
> the BDRV_MAY_DISCARD flag is at the end a hint if the driver can optimize
> writing zeroes by a discard or if real zeroes should be written e.g. to
> sanitize the device?
>
> talking for iscsi:
> bdrv->discard can remain to use UNMAP and silently fail if lbpu == 0.
> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.

When you use WRITESAME16 you can ignore the lbprz flag.
Just send a WRITESAME16 command with one block of data that is all set to zero.
If the unmap flag is set and if unmapped blocks read back the same as
the block you provided (all zero)
then it will unmap those blocks, where possible.
All other blocks that can not be unmapped, or where unmapped blocks
will not read back the same, will instead be overwritten by the
provided all-zero block.


> in case lbpws == 0 it will return -ENOSUP.
>
> Correct?
>
> Peter
>
Peter Lieven - July 18, 2013, 2:09 p.m.
On 18.07.2013 15:52, Paolo Bonzini wrote:
> Il 18/07/2013 15:29, Peter Lieven ha scritto:
>>>> If the driver would have a better method of writing zeroes than
>>>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
>>> If the driver had a better method of writing zeroes than discard, it
>>> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
>>> its bdrv_write_zeros implementation.
>> ok, but this would require an individual patch in every driver, wouldn't
>> it.  i am ok with that.
> Yes (making the drivers return the flag in the BDI would also require
> per-driver patches).
we still might need a hint for qemu-img convert that the driver does support
writing zeroes by unmap because using write_zeroes in the main loop
might result in unaligned requests that the target is not able to unmap.
and to avoid writing several blocks twice by first writing all zeroes
to the target and then writing all data blocks again I would need to keep the loop
at the beginning of qemu-img convert to write zeroes with correct
alignment and granularity if the driver supports write_zeroes_w_discard.

>
> BDRV_MAY_UNMAP is an advisory flag, that the driver is free to ignore
> (just like bdrv_discard can be ignored altogether).
>
> block.c probably should avoid passing down BDRV_MAY_UNMAP altogether if
> the BDRV_O_UNMAP flag is not set.
>
>> the BDRV_MAY_DISCARD flag is at the end a hint if the driver can optimize
>> writing zeroes by a discard or if real zeroes should be written e.g. to
>> sanitize the device?
>>
>> talking for iscsi:
>> bdrv->discard can remain to use UNMAP and silently fail if lbpu == 0.
>> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
>> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
>> in case lbpws == 0 it will return -ENOSUP.
> Almost: for LBPWS == 0 it can use WRITE SAME(16) without the unmap flag.
>   If that doesn't work either it can return -ENOTSUP.
Ok, i was not aware that WRITE SAME(16) is always available.

Peter
Paolo Bonzini - July 18, 2013, 2:14 p.m.
Il 18/07/2013 15:55, ronnie sahlberg ha scritto:
>> > bdrv->write_zeroes will use writesame16 and set the unmap flag only if
>> > BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
> When you use WRITESAME16 you can ignore the lbprz flag.
> Just send a WRITESAME16 command with one block of data that is all set to zero.
> If the unmap flag is set and if unmapped blocks read back the same as
> the block you provided (all zero)
> then it will unmap those blocks, where possible.

True, so the unmap flag can be set iff BDRV_MAY_DISCARD == 1.  block.c
can take care of checking BDRV_O_UNMAP.

Paolo

> All other blocks that can not be unmapped, or where unmapped blocks
> will not read back the same, will instead be overwritten by the
> provided all-zero block.
> 
>
Paolo Bonzini - July 18, 2013, 2:20 p.m.
Il 18/07/2013 16:09, Peter Lieven ha scritto:
> On 18.07.2013 15:52, Paolo Bonzini wrote:
>> Il 18/07/2013 15:29, Peter Lieven ha scritto:
>>>>> If the driver would have a better method of writing zeroes than
>>>>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
>>>> If the driver had a better method of writing zeroes than discard, it
>>>> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
>>>> its bdrv_write_zeros implementation.
>>> ok, but this would require an individual patch in every driver, wouldn't
>>> it.  i am ok with that.
>> Yes (making the drivers return the flag in the BDI would also require
>> per-driver patches).
> we still might need a hint for qemu-img convert that the driver does
> support
> writing zeroes by unmap because using write_zeroes in the main loop
> might result in unaligned requests that the target is not able to unmap.
> and to avoid writing several blocks twice by first writing all zeroes
> to the target and then writing all data blocks again I would need to
> keep the loop
> at the beginning of qemu-img convert to write zeroes with correct
> alignment and granularity if the driver supports write_zeroes_w_discard.

(Mis)alignment and granularity can be handled later.  We can ignore them
for now.  Later, if we decide the best way to support them is a flag,
we'll add it.  Let's not put the cart before the horse.

BTW, I expect alignment!=0 to be really, really rare.

Paolo
Peter Lieven - July 18, 2013, 2:32 p.m.
On 18.07.2013 16:20, Paolo Bonzini wrote:
> Il 18/07/2013 16:09, Peter Lieven ha scritto:
>> On 18.07.2013 15:52, Paolo Bonzini wrote:
>>> Il 18/07/2013 15:29, Peter Lieven ha scritto:
>>>>>> If the driver would have a better method of writing zeroes than
>>>>>> discard it simply should not set bdi->write_zeroes_w_discard = 1.
>>>>> If the driver had a better method of writing zeroes than discard, it
>>>>> simply should ignore the BDRV_MAY_UNMAP (or BDRV_MAY_DISCARD) flag in
>>>>> its bdrv_write_zeros implementation.
>>>> ok, but this would require an individual patch in every driver, wouldn't
>>>> it.  i am ok with that.
>>> Yes (making the drivers return the flag in the BDI would also require
>>> per-driver patches).
>> we still might need a hint for qemu-img convert that the driver does
>> support
>> writing zeroes by unmap because using write_zeroes in the main loop
>> might result in unaligned requests that the target is not able to unmap.
>> and to avoid writing several blocks twice by first writing all zeroes
>> to the target and then writing all data blocks again I would need to
>> keep the loop
>> at the beginning of qemu-img convert to write zeroes with correct
>> alignment and granularity if the driver supports write_zeroes_w_discard.
> (Mis)alignment and granularity can be handled later.  We can ignore them
> for now.  Later, if we decide the best way to support them is a flag,
> we'll add it.  Let's not put the cart before the horse.
>
> BTW, I expect alignment!=0 to be really, really rare.
To explain my concerns:

I know that my target has internal page size of 15MB. I will check what happens
if I deallocate this 15MB in chunks of lets say 1MB. If the page gets unprovisioned
after the last chunk is unmapped it would be fine :-)

Peter
Paolo Bonzini - July 18, 2013, 2:35 p.m.
Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>
>> (Mis)alignment and granularity can be handled later.  We can ignore them
>> for now.  Later, if we decide the best way to support them is a flag,
>> we'll add it.  Let's not put the cart before the horse.
>>
>> BTW, I expect alignment!=0 to be really, really rare.
> To explain my concerns:
> 
> I know that my target has internal page size of 15MB. I will check what
> happens
> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
> unprovisioned
> after the last chunk is unmapped it would be fine :-)

You're talking of granularity here, not (mis)alignment.

Paolo
Peter Lieven - July 18, 2013, 6:43 p.m.
Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>> 
>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>> for now.  Later, if we decide the best way to support them is a flag,
>>> we'll add it.  Let's not put the cart before the horse.
>>> 
>>> BTW, I expect alignment!=0 to be really, really rare.
>> To explain my concerns:
>> 
>> I know that my target has internal page size of 15MB. I will check what
>> happens
>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>> unprovisioned
>> after the last chunk is unmapped it would be fine :-)
> 
> You're talking of granularity here, not (mis)alignment.

you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-) otherwise i will have to add support for honoring this values in qemu-img convert
as a follow up.

Peter
ronniesahlberg@gmail.com - July 18, 2013, 6:54 p.m.
BlockLimitsVPD  OptimalUnmapGranularity also applies to unmapping with
writesame16 :

An OPTIMAL UNMAP GRANULARITY field set to a non-zero value indicates
the optimal granularity in logical blocks
for unmap requests (e.g., an UNMAP command or a WRITE SAME (16)
command with the UNMAP bit set to
one). An unmap request with a number of logical blocks that is not a
multiple of this value may result in unmap
operations on fewer LBAs than requested. An OPTIMAL UNMAP GRANULARITY
field set to 0000_0000h indicates
that the device server does not report an optimal unmap granularity.

So if you send a writesame16+unmap-bit  that honours the "optimal
unmap request" you have a pretty good chance
that the blocks will be unmapped. If they are not they will at least
be overwritten as zero.


On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>
>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>
>>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>>> for now.  Later, if we decide the best way to support them is a flag,
>>>> we'll add it.  Let's not put the cart before the horse.
>>>>
>>>> BTW, I expect alignment!=0 to be really, really rare.
>>> To explain my concerns:
>>>
>>> I know that my target has internal page size of 15MB. I will check what
>>> happens
>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>> unprovisioned
>>> after the last chunk is unmapped it would be fine :-)
>>
>> You're talking of granularity here, not (mis)alignment.
>
> you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
> i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-) otherwise i will have to add support for honoring this values in qemu-img convert
> as a follow up.
>
> Peter
>
Peter Lieven - July 18, 2013, 7:28 p.m.
Am 18.07.2013 um 20:54 schrieb ronnie sahlberg <ronniesahlberg@gmail.com>:

> BlockLimitsVPD  OptimalUnmapGranularity also applies to unmapping with
> writesame16 :
> 
> An OPTIMAL UNMAP GRANULARITY field set to a non-zero value indicates
> the optimal granularity in logical blocks
> for unmap requests (e.g., an UNMAP command or a WRITE SAME (16)
> command with the UNMAP bit set to
> one). An unmap request with a number of logical blocks that is not a
> multiple of this value may result in unmap
> operations on fewer LBAs than requested. An OPTIMAL UNMAP GRANULARITY
> field set to 0000_0000h indicates
> that the device server does not report an optimal unmap granularity.
> 
> So if you send a writesame16+unmap-bit  that honours the "optimal
> unmap request" you have a pretty good chance
> that the blocks will be unmapped. If they are not they will at least
> be overwritten as zero.

thanks for the details. I think to have optimal performance and best
change for unmapping in qemu-img convert
it might be best to export the OPTIMAL UNMAP GRANULARITY as well
as the write_zeroes_w_discard capability via the BDI and than zero
out the whole device with bdrv_write_zeroes and the BDRV_MAY_DISCARD
flag.

the logic for this is already prepared in patch4 (topic of this email). except that
i have to exchange bdrv_discard with bdrv_write_zeroes w/ BDRV_MAY_DISCARD.

what do you think?



> 
> 
> On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>> 
>>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>> 
>>>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>>>> for now.  Later, if we decide the best way to support them is a flag,
>>>>> we'll add it.  Let's not put the cart before the horse.
>>>>> 
>>>>> BTW, I expect alignment!=0 to be really, really rare.
>>>> To explain my concerns:
>>>> 
>>>> I know that my target has internal page size of 15MB. I will check what
>>>> happens
>>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>>> unprovisioned
>>>> after the last chunk is unmapped it would be fine :-)
>>> 
>>> You're talking of granularity here, not (mis)alignment.
>> 
>> you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
>> i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-) otherwise i will have to add support for honoring this values in qemu-img convert
>> as a follow up.
>> 
>> Peter
>>
Paolo Bonzini - July 19, 2013, 5:58 a.m.
Il 18/07/2013 21:28, Peter Lieven ha scritto:
> thanks for the details. I think to have optimal performance and best
> change for unmapping in qemu-img convert
> it might be best to export the OPTIMAL UNMAP GRANULARITY

Agreed about this.

> as well as the write_zeroes_w_discard capability via the BDI

But why this?!?  It is _not_ needed.  All you need is to change the
default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
nonzero.

Paolo

> and than zero
> out the whole device with bdrv_write_zeroes and the BDRV_MAY_DISCARD
> flag.
> 
> the logic for this is already prepared in patch4 (topic of this email). except that
> i have to exchange bdrv_discard with bdrv_write_zeroes w/ BDRV_MAY_DISCARD.
> 
> what do you think?
> 
> 
> 
>>
>>
>> On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>>>
>>>>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>>>>> for now.  Later, if we decide the best way to support them is a flag,
>>>>>> we'll add it.  Let's not put the cart before the horse.
>>>>>>
>>>>>> BTW, I expect alignment!=0 to be really, really rare.
>>>>> To explain my concerns:
>>>>>
>>>>> I know that my target has internal page size of 15MB. I will check what
>>>>> happens
>>>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>>>> unprovisioned
>>>>> after the last chunk is unmapped it would be fine :-)
>>>>
>>>> You're talking of granularity here, not (mis)alignment.
>>>
>>> you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
>>> i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-) otherwise i will have to add support for honoring this values in qemu-img convert
>>> as a follow up.
>>>
>>> Peter
>>>
> 
> 
>
Peter Lieven - July 19, 2013, 6:08 a.m.
On 19.07.2013 07:58, Paolo Bonzini wrote:
> Il 18/07/2013 21:28, Peter Lieven ha scritto:
>> thanks for the details. I think to have optimal performance and best
>> change for unmapping in qemu-img convert
>> it might be best to export the OPTIMAL UNMAP GRANULARITY
> Agreed about this.
>
>> as well as the write_zeroes_w_discard capability via the BDI
> But why this?!?  It is _not_ needed.  All you need is to change the
> default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
> nonzero.
2 reasons:
a) Does this guarantee that the requests are aligned to multiple of the -S value?

b) If I this flag exists qemu-img convert can do the "zeroing" a priori. This
has the benefit that combined with bdrv_get_block_status requests before it might
not need to touch large areas of the volume. Speaking for iSCSI its likely that
the user sets a fresh volume as the destination, but its not guaranteed.
With the Patch 4 there are only done a few get_block_status requests to verify
this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
thousands of writesame requests for possibly already unmapped data.

To give an example. If I take my storage with an 1TB volume. It takes about 10-12
get_block_status requests to verify that it is completely unmapped. After this
I am safe to set has_zero_init = 1 in qemu-img convert.

If I would convert a 1TB image to this target where lets say 50% are at leat 15MB
zero blocks (15MB is the OUG of my storage) it would take ~35000 write same
requests to achieve the same.

Peter
>
> Paolo
>
>> and than zero
>> out the whole device with bdrv_write_zeroes and the BDRV_MAY_DISCARD
>> flag.
>>
>> the logic for this is already prepared in patch4 (topic of this email). except that
>> i have to exchange bdrv_discard with bdrv_write_zeroes w/ BDRV_MAY_DISCARD.
>>
>> what do you think?
>>
>>
>>
>>>
>>> On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>>>> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>>
>>>>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>>>>>> for now.  Later, if we decide the best way to support them is a flag,
>>>>>>> we'll add it.  Let's not put the cart before the horse.
>>>>>>>
>>>>>>> BTW, I expect alignment!=0 to be really, really rare.
>>>>>> To explain my concerns:
>>>>>>
>>>>>> I know that my target has internal page size of 15MB. I will check what
>>>>>> happens
>>>>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>>>>> unprovisioned
>>>>>> after the last chunk is unmapped it would be fine :-)
>>>>> You're talking of granularity here, not (mis)alignment.
>>>> you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
>>>> i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-) otherwise i will have to add support for honoring this values in qemu-img convert
>>>> as a follow up.
>>>>
>>>> Peter
>>>>
>>
>>
ronniesahlberg@gmail.com - July 19, 2013, 1:25 p.m.
On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven <pl@kamp.de> wrote:
> On 19.07.2013 07:58, Paolo Bonzini wrote:
>>
>> Il 18/07/2013 21:28, Peter Lieven ha scritto:
>>>
>>> thanks for the details. I think to have optimal performance and best
>>> change for unmapping in qemu-img convert
>>> it might be best to export the OPTIMAL UNMAP GRANULARITY
>>
>> Agreed about this.
>>
>>> as well as the write_zeroes_w_discard capability via the BDI
>>
>> But why this?!?  It is _not_ needed.  All you need is to change the
>> default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
>> nonzero.
>
> 2 reasons:
> a) Does this guarantee that the requests are aligned to multiple of the -S
> value?
>
> b) If I this flag exists qemu-img convert can do the "zeroing" a priori.
> This
> has the benefit that combined with bdrv_get_block_status requests before it
> might
> not need to touch large areas of the volume. Speaking for iSCSI its likely
> that
> the user sets a fresh volume as the destination, but its not guaranteed.
> With the Patch 4 there are only done a few get_block_status requests to
> verify
> this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
> thousands of writesame requests for possibly already unmapped data.
>
> To give an example. If I take my storage with an 1TB volume. It takes about
> 10-12
> get_block_status requests to verify that it is completely unmapped. After
> this
> I am safe to set has_zero_init = 1 in qemu-img convert.
>
> If I would convert a 1TB image to this target where lets say 50% are at leat
> 15MB
> zero blocks (15MB is the OUG of my storage) it would take ~35000 write same
> requests to achieve the same.

I am not sure I am reading this right, but you dont have to writesame
exactly 1xOUG to get it to unmap.
nxOUG will work too,
So instead of sending one writesame for each OUG range, you can send
one writesame for every ~10G or so.
Say 10G is ~667 OUGs for your case, so you can send
writesame for ~667xOUG in each command and then it would "only" take
~100 writesames instead of ~35000.

So as long as you are sending in multiples of OUG you should be fine.


>
> Peter
>
>>
>> Paolo
>>
>>> and than zero
>>> out the whole device with bdrv_write_zeroes and the BDRV_MAY_DISCARD
>>> flag.
>>>
>>> the logic for this is already prepared in patch4 (topic of this email).
>>> except that
>>> i have to exchange bdrv_discard with bdrv_write_zeroes w/
>>> BDRV_MAY_DISCARD.
>>>
>>> what do you think?
>>>
>>>
>>>
>>>>
>>>> On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>>>>>
>>>>> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>>>
>>>>>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>>>>
>>>>>>>> (Mis)alignment and granularity can be handled later.  We can ignore
>>>>>>>> them
>>>>>>>> for now.  Later, if we decide the best way to support them is a
>>>>>>>> flag,
>>>>>>>> we'll add it.  Let's not put the cart before the horse.
>>>>>>>>
>>>>>>>> BTW, I expect alignment!=0 to be really, really rare.
>>>>>>>
>>>>>>> To explain my concerns:
>>>>>>>
>>>>>>> I know that my target has internal page size of 15MB. I will check
>>>>>>> what
>>>>>>> happens
>>>>>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>>>>>> unprovisioned
>>>>>>> after the last chunk is unmapped it would be fine :-)
>>>>>>
>>>>>> You're talking of granularity here, not (mis)alignment.
>>>>>
>>>>> you are right. for the target i am talking about this is 30720 512-byte
>>>>> blocks for the granularity (pagesize) and 0 for the alignment.
>>>>> i will see what happens if I write same w/unmap the whole 30720 blocks
>>>>> in smaller blocks ;-) otherwise i will have to add support for honoring this
>>>>> values in qemu-img convert
>>>>> as a follow up.
>>>>>
>>>>> Peter
>>>>>
>>>
>>>
>
>
> --
>
> Mit freundlichen Grüßen
>
> Peter Lieven
>
> ...........................................................
>
>   KAMP Netzwerkdienste GmbH
>   Vestische Str. 89-91 | 46117 Oberhausen
>   Tel: +49 (0) 208.89 402-50 | Fax: +49 (0) 208.89 402-40
>   pl@kamp.de | http://www.kamp.de
>
>   Geschäftsführer: Heiner Lante | Michael Lante
>   Amtsgericht Duisburg | HRB Nr. 12154
>   USt-Id-Nr.: DE 120607556
>
> ...........................................................
>
Peter Lieven - July 19, 2013, 1:49 p.m.
On 19.07.2013 15:25, ronnie sahlberg wrote:
> On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven <pl@kamp.de> wrote:
>> On 19.07.2013 07:58, Paolo Bonzini wrote:
>>> Il 18/07/2013 21:28, Peter Lieven ha scritto:
>>>> thanks for the details. I think to have optimal performance and best
>>>> change for unmapping in qemu-img convert
>>>> it might be best to export the OPTIMAL UNMAP GRANULARITY
>>> Agreed about this.
>>>
>>>> as well as the write_zeroes_w_discard capability via the BDI
>>> But why this?!?  It is _not_ needed.  All you need is to change the
>>> default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
>>> nonzero.
>> 2 reasons:
>> a) Does this guarantee that the requests are aligned to multiple of the -S
>> value?
>>
>> b) If I this flag exists qemu-img convert can do the "zeroing" a priori.
>> This
>> has the benefit that combined with bdrv_get_block_status requests before it
>> might
>> not need to touch large areas of the volume. Speaking for iSCSI its likely
>> that
>> the user sets a fresh volume as the destination, but its not guaranteed.
>> With the Patch 4 there are only done a few get_block_status requests to
>> verify
>> this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
>> thousands of writesame requests for possibly already unmapped data.
>>
>> To give an example. If I take my storage with an 1TB volume. It takes about
>> 10-12
>> get_block_status requests to verify that it is completely unmapped. After
>> this
>> I am safe to set has_zero_init = 1 in qemu-img convert.
>>
>> If I would convert a 1TB image to this target where lets say 50% are at leat
>> 15MB
>> zero blocks (15MB is the OUG of my storage) it would take ~35000 write same
>> requests to achieve the same.
> I am not sure I am reading this right, but you dont have to writesame
> exactly 1xOUG to get it to unmap.
> nxOUG will work too,
> So instead of sending one writesame for each OUG range, you can send
> one writesame for every ~10G or so.
> Say 10G is ~667 OUGs for your case, so you can send
> writesame for ~667xOUG in each command and then it would "only" take
> ~100 writesames instead of ~35000.
>
> So as long as you are sending in multiples of OUG you should be fine.
do I not have to take care of max_ws_size?

Peter
ronniesahlberg@gmail.com - July 19, 2013, 1:58 p.m.
On Thu, Jul 18, 2013 at 11:43 AM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 18.07.2013 um 16:35 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>
>> Il 18/07/2013 16:32, Peter Lieven ha scritto:
>>>>>
>>>> (Mis)alignment and granularity can be handled later.  We can ignore them
>>>> for now.  Later, if we decide the best way to support them is a flag,
>>>> we'll add it.  Let's not put the cart before the horse.
>>>>
>>>> BTW, I expect alignment!=0 to be really, really rare.
>>> To explain my concerns:
>>>
>>> I know that my target has internal page size of 15MB. I will check what
>>> happens
>>> if I deallocate this 15MB in chunks of lets say 1MB. If the page gets
>>> unprovisioned
>>> after the last chunk is unmapped it would be fine :-)
>>
>> You're talking of granularity here, not (mis)alignment.
>
> you are right. for the target i am talking about this is 30720 512-byte blocks for the granularity (pagesize) and 0 for the alignment.
> i will see what happens if I write same w/unmap the whole 30720 blocks in smaller blocks ;-)

If you write in smaller than OUG chunks, whether or not the phusical
block becomes unmapped is I think implementation defined. A target is
allowed to make this unmapped but not required to.

For example if you do two writesame16 for your 30720 chunk :
1, writesame16 lba:0 tl:30000
2, writesame16 lba:30000 tl:720

then

SBC 4.7.3.4
===
A WRITE SAME command shall not cause an LBA to become unmapped if
unmapping that LBA creates a
case in which a subsequent read of that unmapped LBA mayis able to
return user data or protection
informationlogical block data that differs from the Data-Out Buffer
for that WRITE SAME command. The
protection information returned by a read of an unmapped LBA is set to
FFFF_FFFF_FFFF_FFFFh
(see 4.7.4.5).
===

during processing of the second writesame of the physical block, since
the physical block is now all zero once this writesame16 completes,  a
target may trigger the whole physical block to become unmapped,
eventhough this specific writesame16 was only for a fraction of the
block.
It is allowed to unmap the block, but not required to.


or it could do it at a later time. For example, a target is allowed to
have say a background scan-and-unmap process that goes through the
disk and unmaps all all-zero physical blocks :
I dont know if your target would do this. Would be neat to add this to STGT.

SBC 4.7.3.5 :
===
If the LBPRZ bit in the READ CAPACITY (16) parameter data (see 5.16.2)
is set to one, and a mapped LBA
references a logical block that contains:
a) user data with all bits set to zero; and
b) protection information, if any, set to FFFF_FFFF_FFFF_FFFFh,
then the device server may transition that mapped LBA to anchored or
deallocated at any time.
===

I.e. a target is allowed at any time to automatically unmap physical
blocks as long as the block is all zero and lbprz is set.


> otherwise i will have to add support for honoring this values in qemu-img convert
> as a follow up.
>
> Peter
>
ronniesahlberg@gmail.com - July 19, 2013, 2 p.m.
On Fri, Jul 19, 2013 at 6:49 AM, Peter Lieven <pl@kamp.de> wrote:
> On 19.07.2013 15:25, ronnie sahlberg wrote:
>>
>> On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven <pl@kamp.de> wrote:
>>>
>>> On 19.07.2013 07:58, Paolo Bonzini wrote:
>>>>
>>>> Il 18/07/2013 21:28, Peter Lieven ha scritto:
>>>>>
>>>>> thanks for the details. I think to have optimal performance and best
>>>>> change for unmapping in qemu-img convert
>>>>> it might be best to export the OPTIMAL UNMAP GRANULARITY
>>>>
>>>> Agreed about this.
>>>>
>>>>> as well as the write_zeroes_w_discard capability via the BDI
>>>>
>>>> But why this?!?  It is _not_ needed.  All you need is to change the
>>>> default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
>>>> nonzero.
>>>
>>> 2 reasons:
>>> a) Does this guarantee that the requests are aligned to multiple of the
>>> -S
>>> value?
>>>
>>> b) If I this flag exists qemu-img convert can do the "zeroing" a priori.
>>> This
>>> has the benefit that combined with bdrv_get_block_status requests before
>>> it
>>> might
>>> not need to touch large areas of the volume. Speaking for iSCSI its
>>> likely
>>> that
>>> the user sets a fresh volume as the destination, but its not guaranteed.
>>> With the Patch 4 there are only done a few get_block_status requests to
>>> verify
>>> this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
>>> thousands of writesame requests for possibly already unmapped data.
>>>
>>> To give an example. If I take my storage with an 1TB volume. It takes
>>> about
>>> 10-12
>>> get_block_status requests to verify that it is completely unmapped. After
>>> this
>>> I am safe to set has_zero_init = 1 in qemu-img convert.
>>>
>>> If I would convert a 1TB image to this target where lets say 50% are at
>>> leat
>>> 15MB
>>> zero blocks (15MB is the OUG of my storage) it would take ~35000 write
>>> same
>>> requests to achieve the same.
>>
>> I am not sure I am reading this right, but you dont have to writesame
>> exactly 1xOUG to get it to unmap.
>> nxOUG will work too,
>> So instead of sending one writesame for each OUG range, you can send
>> one writesame for every ~10G or so.
>> Say 10G is ~667 OUGs for your case, so you can send
>> writesame for ~667xOUG in each command and then it would "only" take
>> ~100 writesames instead of ~35000.
>>
>> So as long as you are sending in multiples of OUG you should be fine.
>
> do I not have to take care of max_ws_size?

Yes you need to handle mas_ws_size   but I would imagine that on most
targets that max_ws_size >> OUG
I would be surprised if a target would set max_ws_size to just a single OUG.


>
> Peter
Peter Lieven - July 19, 2013, 2:02 p.m.
On 19.07.2013 16:00, ronnie sahlberg wrote:
> On Fri, Jul 19, 2013 at 6:49 AM, Peter Lieven <pl@kamp.de> wrote:
>> On 19.07.2013 15:25, ronnie sahlberg wrote:
>>> On Thu, Jul 18, 2013 at 11:08 PM, Peter Lieven <pl@kamp.de> wrote:
>>>> On 19.07.2013 07:58, Paolo Bonzini wrote:
>>>>> Il 18/07/2013 21:28, Peter Lieven ha scritto:
>>>>>> thanks for the details. I think to have optimal performance and best
>>>>>> change for unmapping in qemu-img convert
>>>>>> it might be best to export the OPTIMAL UNMAP GRANULARITY
>>>>> Agreed about this.
>>>>>
>>>>>> as well as the write_zeroes_w_discard capability via the BDI
>>>>> But why this?!?  It is _not_ needed.  All you need is to change the
>>>>> default of the "-S" option to be the OPTIMAL UNMAP GRANULARITY if it is
>>>>> nonzero.
>>>> 2 reasons:
>>>> a) Does this guarantee that the requests are aligned to multiple of the
>>>> -S
>>>> value?
>>>>
>>>> b) If I this flag exists qemu-img convert can do the "zeroing" a priori.
>>>> This
>>>> has the benefit that combined with bdrv_get_block_status requests before
>>>> it
>>>> might
>>>> not need to touch large areas of the volume. Speaking for iSCSI its
>>>> likely
>>>> that
>>>> the user sets a fresh volume as the destination, but its not guaranteed.
>>>> With the Patch 4 there are only done a few get_block_status requests to
>>>> verify
>>>> this. If we just write zeroes with BDRV_MAY_UNMAP, we send hundreds or
>>>> thousands of writesame requests for possibly already unmapped data.
>>>>
>>>> To give an example. If I take my storage with an 1TB volume. It takes
>>>> about
>>>> 10-12
>>>> get_block_status requests to verify that it is completely unmapped. After
>>>> this
>>>> I am safe to set has_zero_init = 1 in qemu-img convert.
>>>>
>>>> If I would convert a 1TB image to this target where lets say 50% are at
>>>> leat
>>>> 15MB
>>>> zero blocks (15MB is the OUG of my storage) it would take ~35000 write
>>>> same
>>>> requests to achieve the same.
>>> I am not sure I am reading this right, but you dont have to writesame
>>> exactly 1xOUG to get it to unmap.
>>> nxOUG will work too,
>>> So instead of sending one writesame for each OUG range, you can send
>>> one writesame for every ~10G or so.
>>> Say 10G is ~667 OUGs for your case, so you can send
>>> writesame for ~667xOUG in each command and then it would "only" take
>>> ~100 writesames instead of ~35000.
>>>
>>> So as long as you are sending in multiples of OUG you should be fine.
>> do I not have to take care of max_ws_size?
> Yes you need to handle mas_ws_size   but I would imagine that on most
> targets that max_ws_size >> OUG
> I would be surprised if a target would set max_ws_size to just a single OUG.
for may target it is ;-(

maybe I would use max_ws_size for the write_zeroes_with unmap
thing. I would think it is a multiple of oug if it is specified.

Peter
Peter Lieven - Sept. 12, 2013, 8:52 a.m.
On 18.07.2013 16:14, Paolo Bonzini wrote:
> Il 18/07/2013 15:55, ronnie sahlberg ha scritto:
>>>> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
>>>> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
>> When you use WRITESAME16 you can ignore the lbprz flag.
>> Just send a WRITESAME16 command with one block of data that is all set to zero.
>> If the unmap flag is set and if unmapped blocks read back the same as
>> the block you provided (all zero)
>> then it will unmap those blocks, where possible.
> True, so the unmap flag can be set iff BDRV_MAY_DISCARD == 1.  block.c
> can take care of checking BDRV_O_UNMAP.
Would you add BDRV_MAY_DISCARD to BdrvRequestFlags?

This would require making BdrvRequestFlags public.

Before I start implementation I would still like to verify if we need this
flag. E.g. in which case would we do not want to optimize writing zeroes
via discard if the user has set BDRV_O_UNMAP?

One case I would think of is sanitizing data. But in this case shouldn't this
be a different function? In case of a container format its also not guaranteed that the
contents are erased from the underlying media. Looking at the commit message
for f08f2ddae078e8a7683f8b16da8e0cc3029c7b89 it sounds that bdrv_write_zeroes
was intended to provide an efficient way to zero contents, but only in the sense
that it is guaranteed that they read back as zero (compared to bdrv_discard).

Peter
Paolo Bonzini - Sept. 12, 2013, 8:55 a.m.
Il 12/09/2013 10:52, Peter Lieven ha scritto:
> On 18.07.2013 16:14, Paolo Bonzini wrote:
>> Il 18/07/2013 15:55, ronnie sahlberg ha scritto:
>>>>> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
>>>>> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
>>> When you use WRITESAME16 you can ignore the lbprz flag.
>>> Just send a WRITESAME16 command with one block of data that is all
>>> set to zero.
>>> If the unmap flag is set and if unmapped blocks read back the same as
>>> the block you provided (all zero)
>>> then it will unmap those blocks, where possible.
>> True, so the unmap flag can be set iff BDRV_MAY_DISCARD == 1.  block.c
>> can take care of checking BDRV_O_UNMAP.
> Would you add BDRV_MAY_DISCARD to BdrvRequestFlags?
> 
> This would require making BdrvRequestFlags public.
> 
> Before I start implementation I would still like to verify if we need this
> flag. E.g. in which case would we do not want to optimize writing zeroes
> via discard if the user has set BDRV_O_UNMAP?

For example if the bdrv_write_zeroes is coming from a WRITE SAME command
(sent by the guest to an emulated SCSI disk).  In this case, we should
not discard if the UNMAP bit is zero in the request.

The WRITE SAME implementation in scsi-disk.c is old and broken, but the
block layer should provide enough API to make it right.

Paolo

> One case I would think of is sanitizing data. But in this case shouldn't
> this
> be a different function? In case of a container format its also not
> guaranteed that the
> contents are erased from the underlying media. Looking at the commit
> message
> for f08f2ddae078e8a7683f8b16da8e0cc3029c7b89 it sounds that
> bdrv_write_zeroes
> was intended to provide an efficient way to zero contents, but only in
> the sense
> that it is guaranteed that they read back as zero (compared to
> bdrv_discard).
> 
> Peter
>
Peter Lieven - Sept. 12, 2013, 9:03 a.m.
On 12.09.2013 10:55, Paolo Bonzini wrote:
> Il 12/09/2013 10:52, Peter Lieven ha scritto:
>> On 18.07.2013 16:14, Paolo Bonzini wrote:
>>> Il 18/07/2013 15:55, ronnie sahlberg ha scritto:
>>>>>> bdrv->write_zeroes will use writesame16 and set the unmap flag only if
>>>>>> BDRV_MAY_DISCARD == 1 and BDRV_O_UNMAP == 1 and lbprz == 1.
>>>> When you use WRITESAME16 you can ignore the lbprz flag.
>>>> Just send a WRITESAME16 command with one block of data that is all
>>>> set to zero.
>>>> If the unmap flag is set and if unmapped blocks read back the same as
>>>> the block you provided (all zero)
>>>> then it will unmap those blocks, where possible.
>>> True, so the unmap flag can be set iff BDRV_MAY_DISCARD == 1.  block.c
>>> can take care of checking BDRV_O_UNMAP.
>> Would you add BDRV_MAY_DISCARD to BdrvRequestFlags?
>>
>> This would require making BdrvRequestFlags public.
Is the BdrvRequestFlags the right place to put BRDV_REQ_MAY_UNMAP?
>>
>> Before I start implementation I would still like to verify if we need this
>> flag. E.g. in which case would we do not want to optimize writing zeroes
>> via discard if the user has set BDRV_O_UNMAP?
> For example if the bdrv_write_zeroes is coming from a WRITE SAME command
> (sent by the guest to an emulated SCSI disk).  In this case, we should
> not discard if the UNMAP bit is zero in the request.
>
> The WRITE SAME implementation in scsi-disk.c is old and broken, but the
> block layer should provide enough API to make it right.
Its actually calling discard regardless if the the buffer contents are zero.
So the outcome of this call is undefined.

     case WRITE_SAME_16:
         nb_sectors = scsi_data_cdb_length(r->req.cmd.buf);
         if (bdrv_is_read_only(s->qdev.conf.bs)) {
             scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
             return 0;
         }
         if (!check_lba_range(s, r->req.cmd.lba, nb_sectors)) {
             goto illegal_lba;
         }

         /*
          * We only support WRITE SAME with the unmap bit set for now.
          */
         if (!(req->cmd.buf[1] & 0x8)) {
             goto illegal_request;
         }

         /* The request is used as the AIO opaque value, so add a ref.  */
         scsi_req_ref(&r->req);
         r->req.aiocb = bdrv_aio_discard(s->qdev.conf.bs,
                                         r->req.cmd.lba * (s->qdev.blocksize / 512),
                                         nb_sectors * (s->qdev.blocksize / 512),
                                         scsi_aio_complete, r);
         return 0;


Peter

Patch

diff --git a/qemu-img.c b/qemu-img.c
index f8c97d3..cbde02f 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1342,7 +1342,8 @@  static int img_convert(int argc, char **argv)
         goto out;
     }
 
-    flags = BDRV_O_RDWR;
+    flags = BDRV_O_RDWR | BDRV_O_UNMAP;
+
     ret = bdrv_parse_cache_flags(cache, &flags);
     if (ret < 0) {
         error_report("Invalid cache option: %s", cache);
@@ -1442,8 +1443,61 @@  static int img_convert(int argc, char **argv)
         /* signal EOF to align */
         bdrv_write_compressed(out_bs, 0, NULL, 0);
     } else {
+        BlockDriverInfo bdi;
+
         int has_zero_init = bdrv_has_zero_init(out_bs);
 
+        /* if the target has no zero init by default check if we
+         * can discard blocks to zeroize the device */
+        if (!has_zero_init && !out_baseimg &&
+            bdrv_get_info(out_bs, &bdi) == 0 &&
+            bdi.discard_zeroes && bdi.max_unmap > 0) {
+            int64_t target_size = bdrv_getlength(out_bs) / BDRV_SECTOR_SIZE;
+            int64_t sector_num2 = -1;
+            int n;
+            sector_num = 0;
+            for (;;) {
+                nb_sectors = target_size - sector_num;
+                if (nb_sectors <= 0) {
+                    has_zero_init = 1;
+                    break;
+                }
+                if (nb_sectors > INT_MAX) {
+                    nb_sectors = INT_MAX;
+                }
+                if (!bdrv_is_allocated(out_bs, sector_num, nb_sectors, &n)) {
+                    if (!n) {
+                        /* an error occured, continue with has_zero_init = 0 */
+                        break;
+                    }
+                    sector_num += n;
+                    continue;
+                }
+                if (sector_num == sector_num2) {
+                    /* some drivers allow a discard to silently fail.
+                     * double-check that we do not try to discard the
+                     * same sectors twice. if thats the case
+                     * continue with has_zero_init = 0 */
+                    break;
+                }
+                sector_num2 = sector_num;
+                while (n > 0) {
+                    nb_sectors = n;
+                    if (nb_sectors > bdi.max_unmap) {
+                        nb_sectors = bdi.max_unmap;
+                    }
+                    ret = bdrv_discard(out_bs, sector_num2, nb_sectors);
+                    if (ret < 0) {
+                        error_report("error while discarding at sector %" PRId64 ": %s",
+                                     sector_num, strerror(-ret));
+                        goto out;
+                    }
+                    sector_num2 += nb_sectors;
+                    n -= nb_sectors;;
+                }
+            }
+        }
+
         sector_num = 0; // total number of sectors converted so far
         nb_sectors = total_sectors - sector_num;
         if (nb_sectors != 0) {