
[v2,5/9] block: Pass unaligned discard requests to drivers

Message ID 1479413642-22463-6-git-send-email-eblake@redhat.com
State New

Commit Message

Eric Blake Nov. 17, 2016, 8:13 p.m. UTC
Discard is advisory, so rounding the requests to alignment
boundaries is never semantically wrong with respect to the
data that the guest sees.  But at least the Dell Equallogic
iSCSI SAN has the interesting property that its advertised
discard alignment is 15M, yet it documents that discarding
a sequence of 1M slices will eventually result in the 15M
page being marked as discarded, and it is possible to
observe which pages have been discarded.

Between commits 9f1963b and b8d0a980, we converted the block
layer to a byte-based interface that ultimately ignores any
unaligned head or tail based on the driver's advertised
discard granularity, which means that qemu 2.7 refuses to
pass any discard request smaller than 15M down to the Dell
Equallogic hardware.  This is a slight regression in behavior
compared to earlier qemu, where a guest executing discards
in power-of-2 chunks used to be able to get every page
discarded, but is now left with various pages still allocated
because the guest requests did not align with the hardware's
15M pages.

Since the SCSI specification says nothing about a minimum
discard granularity, and only documents the preferred
alignment, it is best if the block layer gives the driver
every bit of information about discard requests, rather than
rounding them to alignment boundaries early.

Rework the block layer discard algorithm to mirror the write
zero algorithm: always peel off any unaligned head or tail
and manage that in isolation, then do the bulk of the request
on an aligned boundary.  The fallback when the driver returns
-ENOTSUP for an unaligned request is to silently ignore that
portion of the discard request; but for devices that can pass
the partial request all the way down to hardware, this can
result in the hardware coalescing requests and discarding
aligned pages after all.

Reported-by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>

---
v2: Split at more boundaries.
---
 block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)
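
As an illustration of the new fragmentation, here is a standalone
sketch (simplified from the patch below, not QEMU code), where
req_align and align stand in for bs->bl.request_alignment and the
discard alignment:

    #include <stdio.h>
    #include <inttypes.h>

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    static void split_discard(int64_t offset, int64_t count,
                              int64_t req_align, int64_t align,
                              int64_t max_pdiscard)
    {
        int64_t head = offset % align;
        int64_t tail = (offset + count) % align;

        while (count > 0) {
            int64_t num = count;

            if (head) {
                /* Small requests until the next alignment boundary. */
                num = MIN(count, align - head);
                if (num % req_align) {
                    num %= req_align;   /* peel off the sub-sector part first */
                }
                head = (head + num) % align;
            } else if (tail) {
                if (num > align) {
                    num -= tail;        /* stop at the last aligned boundary */
                } else if (tail % req_align && tail > req_align) {
                    tail %= req_align;  /* split off the aligned part of the tail */
                    num -= tail;
                }
            }
            if (num > max_pdiscard) {
                num = max_pdiscard;     /* honor the driver's maximum */
            }
            printf("pdiscard offset=%" PRId64 " len=%" PRId64 "\n",
                   offset, num);
            offset += num;
            count -= num;
        }
    }

    int main(void)
    {
        /* Request alignment 512, discard alignment 64k, a discard at
         * offset 510 of length 130k: prints the five calls discussed
         * in the review below. */
        split_discard(510, 130 * 1024, 512, 64 * 1024,
                      0x7fffffff & ~(64 * 1024 - 1));
        return 0;
    }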

Comments

Max Reitz Nov. 17, 2016, 10:26 p.m. UTC | #1
On 17.11.2016 21:13, Eric Blake wrote:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong with respect to the
> data that the guest sees.  But at least the Dell Equallogic
> iSCSI SAN has the interesting property that its advertised
> discard alignment is 15M, yet it documents that discarding
> a sequence of 1M slices will eventually result in the 15M
> page being marked as discarded, and it is possible to
> observe which pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding them to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported-by: Peter Lieven <pl@kamp.de>
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment

Returning -ENOTSUP is not quite "silently" in my book.

> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }

Could be written as

num = num % bs->bl.request_alignment ?: num;

But that's up to you.
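
(For reference: GNU C's "a ?: b" evaluates to a when a is nonzero and
to b otherwise, with a computed only once, so that one-liner matches
the if-block above: keep num when it is already request-aligned,
otherwise reduce it to the unaligned remainder.)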

More importantly, is it possible that request_alignment >
pdiscard_alignment? In that case, align would be request_alignment, head
would be less than request_alignment but could be more than
pdiscard_alignment.

Thus, if that is possible, this might create a request longer than
pdiscard_alignment.

> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;

Same here.

Max

> +            }
> +        }
> +        /* limit request size */
> +        if (num > max_pdiscard) {
> +            num = max_pdiscard;
> +        }
> 
>          if (bs->drv->bdrv_co_pdiscard) {
>              ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
>
Eric Blake Nov. 17, 2016, 11:01 p.m. UTC | #2
On 11/17/2016 04:26 PM, Max Reitz wrote:
> On 17.11.2016 21:13, Eric Blake wrote:
>> Discard is advisory, so rounding the requests to alignment
>> boundaries is never semantically wrong with respect to the
>> data that the guest sees.  But at least the Dell Equallogic
>> iSCSI SAN has the interesting property that its advertised
>> discard alignment is 15M, yet it documents that discarding
>> a sequence of 1M slices will eventually result in the 15M
>> page being marked as discarded, and it is possible to
>> observe which pages have been discarded.
>>

>>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>>                                     align);
>> -    assert(max_pdiscard);
>> +    assert(max_pdiscard >= bs->bl.request_alignment);
>>
>>      while (count > 0) {
>>          int ret;
>> -        int num = MIN(count, max_pdiscard);
>> +        int num = count;
>> +
>> +        if (head) {
>> +            /* Make small requests to get to alignment boundaries. */
>> +            num = MIN(count, align - head);
>> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
>> +                num %= bs->bl.request_alignment;
>> +            }
> 
> Could be written as
> 
> num = num % bs->bl.request_alignment ?: num;
> 
> But that's up to you.
> 
> More importantly, is it possible that request_alignment >
> pdiscard_alignment? In that case, align would be request_alignment, head
> would be less than request_alignment but could be more than
> pdiscard_alignment.

pdiscard_alignment can be 0 (no inherent limit); but if it is
nonzero, it must be at least request_alignment.  The block layer
should be enforcing that (if it is not already) as part of the
.bdrv_refresh_limits() callback contract; at any rate, it is
documented that way in block_int.h.
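
A minimal sketch of that documented contract (a hypothetical helper,
not actual QEMU code):

    #include <stdbool.h>
    #include <stdint.h>

    /* The contract described above: a nonzero pdiscard_alignment must
     * be at least request_alignment (and, per the assertion added in
     * this patch, the effective align is a multiple of it). */
    static bool pdiscard_limits_valid(uint32_t pdiscard_alignment,
                                      uint32_t request_alignment)
    {
        return pdiscard_alignment == 0 ||
               (pdiscard_alignment >= request_alignment &&
                pdiscard_alignment % request_alignment == 0);
    }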
Max Reitz Nov. 17, 2016, 11:03 p.m. UTC | #3
On 18.11.2016 00:01, Eric Blake wrote:
> On 11/17/2016 04:26 PM, Max Reitz wrote:
>> On 17.11.2016 21:13, Eric Blake wrote:
>>> Discard is advisory, so rounding the requests to alignment
>>> boundaries is never semantically wrong with respect to the
>>> data that the guest sees.  But at least the Dell Equallogic
>>> iSCSI SAN has the interesting property that its advertised
>>> discard alignment is 15M, yet it documents that discarding
>>> a sequence of 1M slices will eventually result in the 15M
>>> page being marked as discarded, and it is possible to
>>> observe which pages have been discarded.
>>>
> 
>>>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>>>                                     align);
>>> -    assert(max_pdiscard);
>>> +    assert(max_pdiscard >= bs->bl.request_alignment);
>>>
>>>      while (count > 0) {
>>>          int ret;
>>> -        int num = MIN(count, max_pdiscard);
>>> +        int num = count;
>>> +
>>> +        if (head) {
>>> +            /* Make small requests to get to alignment boundaries. */
>>> +            num = MIN(count, align - head);
>>> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
>>> +                num %= bs->bl.request_alignment;
>>> +            }
>>
>> Could be written as
>>
>> num = num % bs->bl.request_alignment ?: num;
>>
>> But that's up to you.
>>
>> More importantly, is it possible that request_alignment >
>> pdiscard_alignment? In that case, align would be request_alignment, head
>> would be less than request_alignment but could be more than
>> pdiscard_alignment.
> 
> pdiscard_alignment can be 0 (no inherent limit); but if it is
> nonzero, it must be at least request_alignment.  The block layer
> should be enforcing that (if it is not already) as part of the
> .bdrv_refresh_limits() callback contract; at any rate, it is
> documented that way in block_int.h.

Yes, you're right. Thus:

Reviewed-by: Max Reitz <mreitz@redhat.com>
Max Reitz Nov. 17, 2016, 11:44 p.m. UTC | #4
On 17.11.2016 21:13, Eric Blake wrote:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong with respect to the
> data that the guest sees.  But at least the Dell Equallogic
> iSCSI SAN has the interesting property that its advertised
> discard alignment is 15M, yet it documents that discarding
> a sequence of 1M slices will eventually result in the 15M
> page being marked as discarded, and it is possible to
> observe which pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding them to alignment boundaries early.

Is this series supposed to address this issue? Because if so, I fail to
see where it does. If the device advertises 15 MB as the discard
granularity, then the iscsi driver will still drop all discard requests
that are not aligned to 15 MB boundaries, no?

The only difference is that it's now the iscsi driver that drops the
request instead of the generic block layer.

Max
Eric Blake Nov. 18, 2016, 1:13 a.m. UTC | #5
On 11/17/2016 05:44 PM, Max Reitz wrote:
>>
>> Since the SCSI specification says nothing about a minimum
>> discard granularity, and only documents the preferred
>> alignment, it is best if the block layer gives the driver
>> every bit of information about discard requests, rather than
>> rounding them to alignment boundaries early.
> 
> Is this series supposed to address this issue? Because if so, I fail to
> see where it does. If the device advertises 15 MB as the discard
> granularity, then the iscsi driver will still drop all discard requests
> that are not aligned to 15 MB boundaries, no?
> 

I don't have access to the device in question, so I'm hoping Peter
chimes in (oops, how'd I miss him on the original CC?).  Here's all
he said on v1:
https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html

My gut feel, however, is that the iscsi code is NOT rounding (the qcow2
code rounded, but that's different); the regression happened in 2.7
because the block layer also started rounding, and this patch gets the
block layer rounding out of the way. If nothing changed in the iscsi
code in the meantime, then the iscsi code should now (once again) be
discarding all sizes, regardless of the 15M advertisement.

Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
on a slightly modified NBD client that forced 15M alignments, and for
those cases it definitely made the difference between all bytes being
passed (spread across several calls) and just the aligned bytes in the
middle of a request larger than 15M.

> The only difference is that it's now the iscsi driver that drops the
> request instead of the generic block layer.

If the iscsi driver was ever dropping it in the first place.
Max Reitz Nov. 19, 2016, 10:05 p.m. UTC | #6
On 18.11.2016 02:13, Eric Blake wrote:
> On 11/17/2016 05:44 PM, Max Reitz wrote:
>>>
>>> Since the SCSI specification says nothing about a minimum
>>> discard granularity, and only documents the preferred
>>> alignment, it is best if the block layer gives the driver
>>> every bit of information about discard requests, rather than
>>> rounding them to alignment boundaries early.
>>
>> Is this series supposed to address this issue? Because if so, I fail to
>> see where it does. If the device advertises 15 MB as the discard
>> granularity, then the iscsi driver will still drop all discard requests
>> that are not aligned to 15 MB boundaries, no?
>>
> 
> I don't have access to the device in question, so I'm hoping Peter
> chimes in (oops, how'd I miss him on the original CC?).  Here's all
> he said on v1:
> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html
> 
> My gut feel, however, is that the iscsi code is NOT rounding

Why are you relying on your gut feel when you can simply look into the
code in question? As a matter of fact, I know that you have looked at
the very piece of code in question because you touch it in patch 4.

Now that I have looked at it again, though, I see my mistake: I
assumed that is_byte_request_lun_aligned() would check whether the
request is aligned to the advertised discard granularity. Of course
it does not; it only checks whether the request is aligned to the
device's block_size.

So with the device in question, block_size is something like, say, 1 MB
and the pdiscard_alignment is 15 MB. With your series, the block layer
will first split off the head of an unaligned discard request (with the
rest being aligned to the pdiscard_alignment) and then again the head
off that head (with regard to the request_alignment).

The iscsi driver will drop the head of the head (which isn't aligned
to the request_alignment), but pass the part of the head that is aligned
to request_alignment and of course pass everything that's aligned to
pdiscard_alignment, too.

(And the same then happens for the tail.)
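
Plugging these example numbers into the split_discard() sketch shown
under the commit message (request alignment 1M, discard alignment 15M,
and, say, a guest discard at offset 512k of length 20M) yields, under
those assumptions:

    pdiscard offset=512k  len=512k   (head of the head; iscsi drops it)
    pdiscard offset=1M    len=14M    (block-aligned rest of the head; passed down)
    pdiscard offset=15M   len=5M     (block-aligned part of the tail; passed down)
    pdiscard offset=20M   len=512k   (tail of the tail; iscsi drops it)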

OK, now I see.

>                                                              (the qcow2
> code rounded, but that's different); the regression happened in 2.7
> because the block layer also started rounding, and this patch gets the
> block layer rounding out of the way. If nothing changed in the iscsi
> code in the meantime, then the iscsi code should now (once again) be
> discarding all sizes, regardless of the 15M advertisement.

Well, all sizes that are aligned to the request_alignment.

> Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
> on a slightly modified NBD client that forced 15M alignments, and for
> those cases it definitely made the difference between all bytes being
> passed (spread across several calls) and just the aligned bytes in the
> middle of a request larger than 15M.
> 
>> The only difference is that it's now the iscsi driver that drops the
>> request instead of the generic block layer.
> 
> If the iscsi driver was ever dropping it in the first place.

It wasn't dropping them; I was asserting that they were dropped. Yes,
my mistake for thinking that is_byte_request_lun_aligned() would check
whether the request is aligned to pdiscard_alignment.

Max
Peter Lieven Nov. 21, 2016, 1:39 p.m. UTC | #7
Am 19.11.2016 um 23:05 schrieb Max Reitz:
> On 18.11.2016 02:13, Eric Blake wrote:
>> On 11/17/2016 05:44 PM, Max Reitz wrote:
>>>> Since the SCSI specification says nothing about a minimum
>>>> discard granularity, and only documents the preferred
>>>> alignment, it is best if the block layer gives the driver
>>>> every bit of information about discard requests, rather than
>>>> rounding them to alignment boundaries early.
>>> Is this series supposed to address this issue? Because if so, I fail to
>>> see where it does. If the device advertises 15 MB as the discard
>>> granularity, then the iscsi driver will still drop all discard requests
>>> that are not aligned to 15 MB boundaries, no?
>>>
>> I don't have access to the device in question, so I'm hoping Peter
>> chimes in (oops, how'd I miss him on the original CC?).  Here's all
>> he said on v1:
>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html
>>
>> My gut feel, however, is that the iscsi code is NOT rounding
> Why are you relying on your gut feel when you can simply look into the
> code in question? As a matter of fact, I know that you have looked at
> the very piece of code in question because you touch it in patch 4.
>
> Now that I have looked at it again, though, I see my mistake: I
> assumed that is_byte_request_lun_aligned() would check whether the
> request is aligned to the advertised discard granularity. Of course
> it does not; it only checks whether the request is aligned to the
> device's block_size.
>
> So with the device in question, block_size is something like, say, 1 MB
> and the pdiscard_alignment is 15 MB. With your series, the block layer
> will first split off the head of an unaligned discard request (with the
> rest being aligned to the pdiscard_alignment) and then again the head
> off that head (with regard to the request_alignment).
>
> The iscsi driver will drop the head of the head (which isn't aligned
> to the request_alignment), but pass the part of the head that is aligned
> to request_alignment and of course pass everything that's aligned to
> pdiscard_alignment, too.
>
> (And the same then happens for the tail.)
>
> OK, now I see.
>
>>                                                               (the qcow2
>> code rounded, but that's different); the regression happened in 2.7
>> because the block layer also started rounding, and this patch gets the
>> block layer rounding out of the way. If nothing changed in the iscsi
>> code in the meantime, then the iscsi code should now (once again) be
>> discarding all sizes, regardless of the 15M advertisement.
> Well, all sizes that are aligned to the request_alignment.
>
>> Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as
>> on a slightly modified NBD client that forced 15M alignments, and for
>> those cases it definitely made the difference between all bytes being
>> passed (spread across several calls) and just the aligned bytes in the
>> middle of a request larger than 15M.
>>
>>> The only difference is that it's now the iscsi driver that drops the
>>> request instead of the generic block layer.
>> If the iscsi driver was ever dropping it in the first place.
> It wasn't dropping them; I was asserting that they were dropped. Yes,
> my mistake for thinking that is_byte_request_lun_aligned() would check
> whether the request is aligned to pdiscard_alignment.

Sorry for not responding earlier. I was out of the office for a few days.

The iSCSI driver indeed checked only for alignment to block_size and
passed all other discard requests down, which the device was either
handling or just dropping.

This version now seems correct.

Thanks,
Peter
Kevin Wolf Nov. 22, 2016, 2:03 p.m. UTC | #8
Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
> Discard is advisory, so rounding the requests to alignment
> boundaries is never semantically wrong with respect to the
> data that the guest sees.  But at least the Dell Equallogic
> iSCSI SAN has the interesting property that its advertised
> discard alignment is 15M, yet it documents that discarding
> a sequence of 1M slices will eventually result in the 15M
> page being marked as discarded, and it is possible to
> observe which pages have been discarded.
> 
> Between commits 9f1963b and b8d0a980, we converted the block
> layer to a byte-based interface that ultimately ignores any
> unaligned head or tail based on the driver's advertised
> discard granularity, which means that qemu 2.7 refuses to
> pass any discard request smaller than 15M down to the Dell
> Equallogic hardware.  This is a slight regression in behavior
> compared to earlier qemu, where a guest executing discards
> in power-of-2 chunks used to be able to get every page
> discarded, but is now left with various pages still allocated
> because the guest requests did not align with the hardware's
> 15M pages.
> 
> Since the SCSI specification says nothing about a minimum
> discard granularity, and only documents the preferred
> alignment, it is best if the block layer gives the driver
> every bit of information about discard requests, rather than
> rounding them to alignment boundaries early.
> 
> Rework the block layer discard algorithm to mirror the write
> zero algorithm: always peel off any unaligned head or tail
> and manage that in isolation, then do the bulk of the request
> on an aligned boundary.  The fallback when the driver returns
> -ENOTSUP for an unaligned request is to silently ignore that
> portion of the discard request; but for devices that can pass
> the partial request all the way down to hardware, this can
> result in the hardware coalescing requests and discarding
> aligned pages after all.
> 
> Reported-by: Peter Lieven <pl@kamp.de>
> CC: qemu-stable@nongnu.org
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> v2: Split at more boundaries.
> ---
>  block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 085ac34..4f00562 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>  {
>      BdrvTrackedRequest req;
>      int max_pdiscard, ret;
> -    int head, align;
> +    int head, tail, align;
> 
>      if (!bs->drv) {
>          return -ENOMEDIUM;
> @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
>          return 0;
>      }
> 
> -    /* Discard is advisory, so ignore any unaligned head or tail */
> +    /* Discard is advisory, but some devices track and coalesce
> +     * unaligned requests, so we must pass everything down rather than
> +     * round here.  Still, most devices will just silently ignore
> +     * unaligned requests (by returning -ENOTSUP), so we must fragment
> +     * the request accordingly.  */
>      align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
>      assert(align % bs->bl.request_alignment == 0);
>      head = offset % align;
> -    if (head) {
> -        head = MIN(count, align - head);
> -        count -= head;
> -        offset += head;
> -    }
> -    count = QEMU_ALIGN_DOWN(count, align);
> -    if (!count) {
> -        return 0;
> -    }
> +    tail = (offset + count) % align;
> 
>      bdrv_inc_in_flight(bs);
>      tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
> @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
> 
>      max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
>                                     align);
> -    assert(max_pdiscard);
> +    assert(max_pdiscard >= bs->bl.request_alignment);
> 
>      while (count > 0) {
>          int ret;
> -        int num = MIN(count, max_pdiscard);
> +        int num = count;
> +
> +        if (head) {
> +            /* Make small requests to get to alignment boundaries. */
> +            num = MIN(count, align - head);
> +            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
> +                num %= bs->bl.request_alignment;
> +            }
> +            head = (head + num) % align;
> +            assert(num < max_pdiscard);
> +        } else if (tail) {
> +            if (num > align) {
> +                /* Shorten the request to the last aligned cluster.  */
> +                num -= tail;
> +            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
> +                       tail > bs->bl.request_alignment) {
> +                tail %= bs->bl.request_alignment;
> +                num -= tail;
> +            }
> +        }

Hm, from the commit message I expected requests to be split into three
pieces (head, bulk, tail), but actually we can end up splitting into five?

Just to check whether I got this right, let me try an example: Let's
assume request alignment 512, pdiscard alignment 64k, and we get a
discard request with offset 510, length 130k. This algorithm makes the
following calls to the driver:

1. pdiscard offset=510, len=2               | new count = 130k - 2
2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
3. pdiscard offset=64k, len=64k             | new count = 2558
4. pdiscard offset=128k, len=2048           | new count = 510
5. pdiscard offset=130k, len=510            | new count = 0

Correct?

If so, is this really necessary or even helpful? I see that the iscsi
driver throws away requests 1 and 5 and needs the split because
otherwise it would disregard the areas covered by 2 and 4, too. But why
can't or shouldn't the iscsi driver do this rounding itself when it gets
combined 1+2 and 4+5 requests?

Kevin
Eric Blake Nov. 22, 2016, 2:13 p.m. UTC | #9
On 11/22/2016 08:03 AM, Kevin Wolf wrote:
> Am 17.11.2016 um 21:13 hat Eric Blake geschrieben:
>> Discard is advisory, so rounding the requests to alignment
>> boundaries is never semantically wrong with respect to the
>> data that the guest sees.  But at least the Dell Equallogic
>> iSCSI SAN has the interesting property that its advertised
>> discard alignment is 15M, yet it documents that discarding
>> a sequence of 1M slices will eventually result in the 15M
>> page being marked as discarded, and it is possible to
>> observe which pages have been discarded.
>>

>>
>> Rework the block layer discard algorithm to mirror the write
>> zero algorithm: always peel off any unaligned head or tail
>> and manage that in isolation, then do the bulk of the request
>> on an aligned boundary.  The fallback when the driver returns
>> -ENOTSUP for an unaligned request is to silently ignore that
>> portion of the discard request; but for devices that can pass
>> the partial request all the way down to hardware, this can
>> result in the hardware coalescing requests and discarding
>> aligned pages after all.
>>

> 
> Hm, from the commit message I expected requests to be split into three
> pieces (head, bulk, tail), but actually we can end up splitting into five?

Correct; so maybe I need to improve the commit message.  The goal of
the multiple splits was to spare drivers from having to worry about
re-aligning things (a request is either sub-sector, sub-page but
sector-aligned, or page-aligned).

> 
> Just to check whether I got this right, let me try an example: Let's
> assume request alignment 512, pdiscard alignment 64k, and we get a
> discard request with offset 510, length 130k. This algorithm makes the
> following calls to the driver:
> 
> 1. pdiscard offset=510, len=2               | new count = 130k - 2
> 2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
> 3. pdiscard offset=64k, len=64k             | new count = 2558
> 4. pdiscard offset=128k, len=2048           | new count = 510
> 5. pdiscard offset=130k, len=510            | new count = 0
> 
> Correct?
> 

Yes.

> If so, is this really necessary or even helpful? I see that the iscsi
> driver throws away requests 1 and 5 and needs the split because
> otherwise it would disregard the areas covered by 2 and 4, too. But why
> can't or shouldn't the iscsi driver do this rounding itself when it gets
> combined 1+2 and 4+5 requests?

Because then every driver has to implement rounding; I thought that
having centralized rounding code was easier to maintain overall than
having every driver reimplement it.
Eric Blake Nov. 22, 2016, 2:56 p.m. UTC | #10
On 11/22/2016 08:13 AM, Eric Blake wrote:

>> Hm, from the commit message I expected requests to be split into three
>> pieces (head, bulk, tail), but actually we can end up splitting into five?
> 
> Correct; so maybe I need to improve the commit message.  The goal of
> the multiple splits was to spare drivers from having to worry about
> re-aligning things (a request is either sub-sector, sub-page but
> sector-aligned, or page-aligned).
> 
>>
>> Just to check whether I got this right, let me try an example: Let's
>> assume request alignment 512, pdiscard alignment 64k, and we get a
>> discard request with offset 510, length 130k. This algorithm makes the
>> following calls to the driver:
>>
>> 1. pdiscard offset=510, len=2               | new count = 130k - 2
>> 2. pdiscard offset=512, len=(64k - 512)     | new count = 66k + 510
>> 3. pdiscard offset=64k, len=64k             | new count = 2558
>> 4. pdiscard offset=128k, len=2048           | new count = 510
>> 5. pdiscard offset=130k, len=510            | new count = 0
>>
>> Correct?
>>
> 
> Yes.

To further clarify: for write zeroes, we have been doing 1+2 and 4+5
together, because if the driver returns -ENOTSUP we then do a normal
write, and normal writes already take care of fragmenting things back
into alignment boundaries (1 and 5 are handled by read-modify-write, 2
and 4 are aligned to write request boundaries).  So the writes happen
no matter what, even if a write-zero at request boundaries might have
been more efficient.  With discard, there is no fallback to any other
operation, so we HAVE to present as many fragments as possible to the
driver, while still making it easy for the driver to distinguish
between aligned and unaligned requests without doing extra rounding.
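
A rough sketch of the two fallback paths (simplified, not the actual
block/io.c control flow; write_plain_zeroes() is a hypothetical
stand-in for the normal-write path):

    /* Write zeroes: an unaligned fragment the driver rejects falls
     * back to a plain write of an explicit buffer of zeroes. */
    ret = bs->drv->bdrv_co_pwrite_zeroes(bs, offset, num, flags);
    if (ret == -ENOTSUP) {
        ret = write_plain_zeroes(bs, offset, num);
    }

    /* Discard: nothing to fall back to, so a rejected fragment is
     * simply ignored, since discard is only advisory. */
    ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
    if (ret == -ENOTSUP) {
        ret = 0;
    }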

But you made me realize that if I make write_zeroes do the same 5-piece
fragmentation, then it might be easier to use bdrv_aligned_pwritev
instead of manually reimplementing max_transfer logic in the write fallback.

Patch

diff --git a/block/io.c b/block/io.c
index 085ac34..4f00562 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2424,7 +2424,7 @@  int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 {
     BdrvTrackedRequest req;
     int max_pdiscard, ret;
-    int head, align;
+    int head, tail, align;

     if (!bs->drv) {
         return -ENOMEDIUM;
@@ -2447,19 +2447,15 @@  int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
         return 0;
     }

-    /* Discard is advisory, so ignore any unaligned head or tail */
+    /* Discard is advisory, but some devices track and coalesce
+     * unaligned requests, so we must pass everything down rather than
+     * round here.  Still, most devices will just silently ignore
+     * unaligned requests (by returning -ENOTSUP), so we must fragment
+     * the request accordingly.  */
     align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
     assert(align % bs->bl.request_alignment == 0);
     head = offset % align;
-    if (head) {
-        head = MIN(count, align - head);
-        count -= head;
-        offset += head;
-    }
-    count = QEMU_ALIGN_DOWN(count, align);
-    if (!count) {
-        return 0;
-    }
+    tail = (offset + count) % align;

     bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
@@ -2471,11 +2467,34 @@  int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,

     max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
                                    align);
-    assert(max_pdiscard);
+    assert(max_pdiscard >= bs->bl.request_alignment);

     while (count > 0) {
         int ret;
-        int num = MIN(count, max_pdiscard);
+        int num = count;
+
+        if (head) {
+            /* Make small requests to get to alignment boundaries. */
+            num = MIN(count, align - head);
+            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
+                num %= bs->bl.request_alignment;
+            }
+            head = (head + num) % align;
+            assert(num < max_pdiscard);
+        } else if (tail) {
+            if (num > align) {
+                /* Shorten the request to the last aligned cluster.  */
+                num -= tail;
+            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
+                       tail > bs->bl.request_alignment) {
+                tail %= bs->bl.request_alignment;
+                num -= tail;
+            }
+        }
+        /* limit request size */
+        if (num > max_pdiscard) {
+            num = max_pdiscard;
+        }

         if (bs->drv->bdrv_co_pdiscard) {
             ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);