Message ID: 1479413642-22463-6-git-send-email-eblake@redhat.com
State: New
On 17.11.2016 21:13, Eric Blake wrote: > Discard is advisory, so rounding the requests to alignment > boundaries is never semantically wrong from the data that > the guest sees. But at least the Dell Equallogic iSCSI SANs > has an interesting property that its advertised discard > alignment is 15M, yet documents that discarding a sequence > of 1M slices will eventually result in the 15M page being > marked as discarded, and it is possible to observe which > pages have been discarded. > > Between commits 9f1963b and b8d0a980, we converted the block > layer to a byte-based interface that ultimately ignores any > unaligned head or tail based on the driver's advertised > discard granularity, which means that qemu 2.7 refuses to > pass any discard request smaller than 15M down to the Dell > Equallogic hardware. This is a slight regression in behavior > compared to earlier qemu, where a guest executing discards > in power-of-2 chunks used to be able to get every page > discarded, but is now left with various pages still allocated > because the guest requests did not align with the hardware's > 15M pages. > > Since the SCSI specification says nothing about a minimum > discard granularity, and only documents the preferred > alignment, it is best if the block layer gives the driver > every bit of information about discard requests, rather than > rounding it to alignment boundaries early. > > Rework the block layer discard algorithm to mirror the write > zero algorithm: always peel off any unaligned head or tail > and manage that in isolation, then do the bulk of the request > on an aligned boundary. The fallback when the driver returns > -ENOTSUP for an unaligned request is to silently ignore that > portion of the discard request; but for devices that can pass > the partial request all the way down to hardware, this can > result in the hardware coalescing requests and discarding > aligned pages after all. 
> > Reported by: Peter Lieven <pl@kamp.de> > CC: qemu-stable@nongnu.org > Signed-off-by: Eric Blake <eblake@redhat.com> > > --- > v2: Split at more boundaries. > --- > block/io.c | 45 ++++++++++++++++++++++++++++++++------------- > 1 file changed, 32 insertions(+), 13 deletions(-) > > diff --git a/block/io.c b/block/io.c > index 085ac34..4f00562 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > { > BdrvTrackedRequest req; > int max_pdiscard, ret; > - int head, align; > + int head, tail, align; > > if (!bs->drv) { > return -ENOMEDIUM; > @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > return 0; > } > > - /* Discard is advisory, so ignore any unaligned head or tail */ > + /* Discard is advisory, but some devices track and coalesce > + * unaligned requests, so we must pass everything down rather than > + * round here. Still, most devices will just silently ignore > + * unaligned requests (by returning -ENOTSUP), so we must fragment Returning -ENOTSUP is not quite "silently" in my book. > + * the request accordingly. 
*/ > align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment); > assert(align % bs->bl.request_alignment == 0); > head = offset % align; > - if (head) { > - head = MIN(count, align - head); > - count -= head; > - offset += head; > - } > - count = QEMU_ALIGN_DOWN(count, align); > - if (!count) { > - return 0; > - } > + tail = (offset + count) % align; > > bdrv_inc_in_flight(bs); > tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD); > @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > > max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX), > align); > - assert(max_pdiscard); > + assert(max_pdiscard >= bs->bl.request_alignment); > > while (count > 0) { > int ret; > - int num = MIN(count, max_pdiscard); > + int num = count; > + > + if (head) { > + /* Make small requests to get to alignment boundaries. */ > + num = MIN(count, align - head); > + if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) { > + num %= bs->bl.request_alignment; > + } Could be written as num = num % bs->bl.request_alignment ?: num; But that's up to you. More importantly, is it possible that request_alignment > pdiscard_alignment? In that case, align would be request_alignment, head would be less than request_alignment but could be more than pdiscard_alignment. Thus, if that is possible, this might create a request longer than pdiscard_alignment. > + head = (head + num) % align; > + assert(num < max_pdiscard); > + } else if (tail) { > + if (num > align) { > + /* Shorten the request to the last aligned cluster. */ > + num -= tail; > + } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) && > + tail > bs->bl.request_alignment) { > + tail %= bs->bl.request_alignment; > + num -= tail; Same here. Max > + } > + } > + /* limit request size */ > + if (num > max_pdiscard) { > + num = max_pdiscard; > + } > > if (bs->drv->bdrv_co_pdiscard) { > ret = bs->drv->bdrv_co_pdiscard(bs, offset, num); >
On 11/17/2016 04:26 PM, Max Reitz wrote: > On 17.11.2016 21:13, Eric Blake wrote: >> Discard is advisory, so rounding the requests to alignment >> boundaries is never semantically wrong from the data that >> the guest sees. But at least the Dell Equallogic iSCSI SANs >> has an interesting property that its advertised discard >> alignment is 15M, yet documents that discarding a sequence >> of 1M slices will eventually result in the 15M page being >> marked as discarded, and it is possible to observe which >> pages have been discarded. >> >> max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX), >> align); >> - assert(max_pdiscard); >> + assert(max_pdiscard >= bs->bl.request_alignment); >> >> while (count > 0) { >> int ret; >> - int num = MIN(count, max_pdiscard); >> + int num = count; >> + >> + if (head) { >> + /* Make small requests to get to alignment boundaries. */ >> + num = MIN(count, align - head); >> + if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) { >> + num %= bs->bl.request_alignment; >> + } > > Could be written as > > num = num % bs->bl.request_alignment ?: num; > > But that's up to you. > > More importantly, is it possible that request_alignment > > pdiscard_alignment? In that case, align would be request_alignment, head > would be less than request_alignment but could be more than > pdiscard_alignment. pdiscard_alignment can be 0 (no inherent limit); but if it is nonzero, it must be at least request_alignment. The block layer should be enforcing that (if it is not already) as part of the .bdrv_refresh_limits() callback contract; at any rate, it is documented that way in block_int.h.
On 18.11.2016 00:01, Eric Blake wrote: > On 11/17/2016 04:26 PM, Max Reitz wrote: >> On 17.11.2016 21:13, Eric Blake wrote: >>> Discard is advisory, so rounding the requests to alignment >>> boundaries is never semantically wrong from the data that >>> the guest sees. But at least the Dell Equallogic iSCSI SANs >>> has an interesting property that its advertised discard >>> alignment is 15M, yet documents that discarding a sequence >>> of 1M slices will eventually result in the 15M page being >>> marked as discarded, and it is possible to observe which >>> pages have been discarded. >>> > >>> max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX), >>> align); >>> - assert(max_pdiscard); >>> + assert(max_pdiscard >= bs->bl.request_alignment); >>> >>> while (count > 0) { >>> int ret; >>> - int num = MIN(count, max_pdiscard); >>> + int num = count; >>> + >>> + if (head) { >>> + /* Make small requests to get to alignment boundaries. */ >>> + num = MIN(count, align - head); >>> + if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) { >>> + num %= bs->bl.request_alignment; >>> + } >> >> Could be written as >> >> num = num % bs->bl.request_alignment ?: num; >> >> But that's up to you. >> >> More importantly, is it possible that request_alignment > >> pdiscard_alignment? In that case, align would be request_alignment, head >> would be less than request_alignment but could be more than >> pdiscard_alignment. > > pdiscard_alignment can be 0 (no inherent limit); but it if it is > nonzero, it must be at least request_alignment. The block layer (should > be, if it is not already) enforcing that as part of the > .bdrv_refresh_limits() callback contract; at any rate, it is documented > that way in block_int.h Yes, you're right. Thus: Reviewed-by: Max Reitz <mreitz@redhat.com>
On 17.11.2016 21:13, Eric Blake wrote: > Discard is advisory, so rounding the requests to alignment > boundaries is never semantically wrong from the data that > the guest sees. But at least the Dell Equallogic iSCSI SANs > has an interesting property that its advertised discard > alignment is 15M, yet documents that discarding a sequence > of 1M slices will eventually result in the 15M page being > marked as discarded, and it is possible to observe which > pages have been discarded. > > Between commits 9f1963b and b8d0a980, we converted the block > layer to a byte-based interface that ultimately ignores any > unaligned head or tail based on the driver's advertised > discard granularity, which means that qemu 2.7 refuses to > pass any discard request smaller than 15M down to the Dell > Equallogic hardware. This is a slight regression in behavior > compared to earlier qemu, where a guest executing discards > in power-of-2 chunks used to be able to get every page > discarded, but is now left with various pages still allocated > because the guest requests did not align with the hardware's > 15M pages. > > Since the SCSI specification says nothing about a minimum > discard granularity, and only documents the preferred > alignment, it is best if the block layer gives the driver > every bit of information about discard requests, rather than > rounding it to alignment boundaries early. Is this series supposed to address this issue? Because if so, I fail to see where it does. If the device advertises 15 MB as the discard granularity, then the iscsi driver will still drop all discard requests that are not aligned to 15 MB boundaries, no? The only difference is that it's now the iscsi driver that drops the request instead of the generic block layer. Max
On 11/17/2016 05:44 PM, Max Reitz wrote: >> >> Since the SCSI specification says nothing about a minimum >> discard granularity, and only documents the preferred >> alignment, it is best if the block layer gives the driver >> every bit of information about discard requests, rather than >> rounding it to alignment boundaries early. > > Is this series supposed to address this issue? Because if so, I fail to > see where it does. If the device advertises 15 MB as the discard > granularity, then the iscsi driver will still drop all discard requests > that are not aligned to 15 MB boundaries, no? > I don't have access to the device in question, so I'm hoping Peter chimes in (oops, how'd I miss him on original CC?). Here's all the more he said on v1: https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html My gut feel, however, is that the iscsi code is NOT rounding (the qcow2 code rounded, but that's different); the regression happened in 2.7 because the block layer also started rounding, and this patch gets the block layer rounding out of the way. If nothing changed in the iscsi code in the meantime, then the iscsi code should now (once again) be discarding all sizes, regardless of the 15M advertisement. Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as on a slightly modified NBD client that forced 15M alignments, and for those cases, it definitely made the difference on whether all bytes were passed (spread across several calls), vs. just the aligned bytes in the middle of a request larger than 15M. > The only difference is that it's now the iscsi driver that drops the > request instead of the generic block layer. If the iscsi driver was ever dropping it in the first place.
On 18.11.2016 02:13, Eric Blake wrote: > On 11/17/2016 05:44 PM, Max Reitz wrote: >>> >>> Since the SCSI specification says nothing about a minimum >>> discard granularity, and only documents the preferred >>> alignment, it is best if the block layer gives the driver >>> every bit of information about discard requests, rather than >>> rounding it to alignment boundaries early. >> >> Is this series supposed to address this issue? Because if so, I fail to >> see where it does. If the device advertises 15 MB as the discard >> granularity, then the iscsi driver will still drop all discard requests >> that are not aligned to 15 MB boundaries, no? >> > > I don't have access to the device in question, so I'm hoping Peter > chimes in (oops, how'd I miss him on original CC?). Here's all the more > he said on v1: > https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html > > My gut feel, however, is that the iscsi code is NOT rounding Why are you relying on your gut feel when you can simply look into the code in question? As a matter of fact, I know that you have looked at the very piece of code in question because you touch it in patch 4. Now that I looked at it again, though, I see my mistake: I assumed that is_byte_request_lun_aligned() would check whether the request is aligned to the advertised discard granularity. Of course, it does not; it only checks whether it's aligned to the device's block_size. So with the device in question, block_size is something like, say, 1 MB and the pdiscard_alignment is 15 MB. With your series, the block layer will first split off the head of an unaligned discard request (with the rest being aligned to the pdiscard_alignment) and then again the head off that head (with regards to the request_alignment). 
The iscsi driver will discard the head of the head (which isn't aligned to the request_alignment), but pass the part of the head that is aligned to request_alignment and of course pass everything that's aligned to pdiscard_alignment, too. (And the same then happens for the tail.) OK, now I see. > (the qcow2 > code rounded, but that's different); the regression happened in 2.7 > because the block layer also started rounding, and this patch gets the > block layer rounding out of the way. If nothing changed in the iscsi > code in the meantime, then the iscsi code should now (once again) be > discarding all sizes, regardless of the 15M advertisement. Well, all sizes that are aligned to the request_alignment. > Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as > on a slightly modified NBD client that forced 15M alignments, and for > those cases, it definitely made the difference on whether all bytes were > passed (spread across several calls), vs. just the aligned bytes in the > middle of a request larger than 15M. > >> The only difference is that it's now the iscsi driver that drops the >> request instead of the generic block layer. > > If the iscsi driver was ever dropping it in the first place. It wasn't dropping them, it was asserting that it was dropped; yes, my mistake for thinking that is_byte_request_lun_aligned() would check whether the request is aligned to pdiscard_alignment. Max
Am 19.11.2016 um 23:05 schrieb Max Reitz: > On 18.11.2016 02:13, Eric Blake wrote: >> On 11/17/2016 05:44 PM, Max Reitz wrote: >>>> Since the SCSI specification says nothing about a minimum >>>> discard granularity, and only documents the preferred >>>> alignment, it is best if the block layer gives the driver >>>> every bit of information about discard requests, rather than >>>> rounding it to alignment boundaries early. >>> Is this series supposed to address this issue? Because if so, I fail to >>> see where it does. If the device advertises 15 MB as the discard >>> granularity, then the iscsi driver will still drop all discard requests >>> that are not aligned to 15 MB boundaries, no? >>> >> I don't have access to the device in question, so I'm hoping Peter >> chimes in (oops, how'd I miss him on original CC?). Here's all the more >> he said on v1: >> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg02223.html >> >> My gut feel, however, is that the iscsi code is NOT rounding > Why are you relying on your gut feel when you can simply look into the > code in question? As a matter of fact, I know that you have looked at > the very piece of code in question because you touch it in patch 4. > > Now that I looked at it again, though, I see my mistake, though: I > assumed that is_byte_request_lun_aligned() would check whether the > request is aligned to the advertised discard granularity. Of course, it > does not, though, it only checks whether it's aligned to the device's > block_size. > > So with the device in question, block_size is something like, say, 1 MB > and the pdiscard_alignment is 15 MB. With your series, the block layer > will first split off the head of an unaligned discard request (with the > rest being aligned to the pdiscard_alignment) and then again the head > off that head (with regards to the request_alignment). 
> > The iscsi driver will discard the head of the head (which isn't aligned > to the request_alignment), but pass the part of the head that is aligned > to request_alignment and of course pass everything that's aligned to > pdiscard_alignment, too. > > (And the same then happens for the tail.) > > OK, now I see. > >> (the qcow2 >> code rounded, but that's different); the regression happened in 2.7 >> because the block layer also started rounding, and this patch gets the >> block layer rounding out of the way. If nothing changed in the iscsi >> code in the meantime, then the iscsi code should now (once again) be >> discarding all sizes, regardless of the 15M advertisement. > Well, all sizes that are aligned to the request_alignment. > >> Meanwhile, I _did_ test this patch with blkdebug (see 9/9), as well as >> on a slightly modified NBD client that forced 15M alignments, and for >> those cases, it definitely made the difference on whether all bytes were >> passed (spread across several calls), vs. just the aligned bytes in the >> middle of a request larger than 15M. >> >>> The only difference is that it's now the iscsi driver that drops the >>> request instead of the generic block layer. >> If the iscsi driver was ever dropping it in the first place. > It wasn't dropping them, it was asserting that it was dropped; yes, my > mistake for thinking that is_byte_request_lun_aligned() would check > whether the request is aligned to pdiscard_alignment. Sorry for not responding earlier. I was ooo some days. The iSCSI driver indeed checked only for alignment to block_size and was passing all other discard requests which the device was handling or just dropping. The version now seems correct. Thanks, Peter
Am 17.11.2016 um 21:13 hat Eric Blake geschrieben: > Discard is advisory, so rounding the requests to alignment > boundaries is never semantically wrong from the data that > the guest sees. But at least the Dell Equallogic iSCSI SANs > has an interesting property that its advertised discard > alignment is 15M, yet documents that discarding a sequence > of 1M slices will eventually result in the 15M page being > marked as discarded, and it is possible to observe which > pages have been discarded. > > Between commits 9f1963b and b8d0a980, we converted the block > layer to a byte-based interface that ultimately ignores any > unaligned head or tail based on the driver's advertised > discard granularity, which means that qemu 2.7 refuses to > pass any discard request smaller than 15M down to the Dell > Equallogic hardware. This is a slight regression in behavior > compared to earlier qemu, where a guest executing discards > in power-of-2 chunks used to be able to get every page > discarded, but is now left with various pages still allocated > because the guest requests did not align with the hardware's > 15M pages. > > Since the SCSI specification says nothing about a minimum > discard granularity, and only documents the preferred > alignment, it is best if the block layer gives the driver > every bit of information about discard requests, rather than > rounding it to alignment boundaries early. > > Rework the block layer discard algorithm to mirror the write > zero algorithm: always peel off any unaligned head or tail > and manage that in isolation, then do the bulk of the request > on an aligned boundary. The fallback when the driver returns > -ENOTSUP for an unaligned request is to silently ignore that > portion of the discard request; but for devices that can pass > the partial request all the way down to hardware, this can > result in the hardware coalescing requests and discarding > aligned pages after all. 
> > Reported by: Peter Lieven <pl@kamp.de> > CC: qemu-stable@nongnu.org > Signed-off-by: Eric Blake <eblake@redhat.com> > > --- > v2: Split at more boundaries. > --- > block/io.c | 45 ++++++++++++++++++++++++++++++++------------- > 1 file changed, 32 insertions(+), 13 deletions(-) > > diff --git a/block/io.c b/block/io.c > index 085ac34..4f00562 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > { > BdrvTrackedRequest req; > int max_pdiscard, ret; > - int head, align; > + int head, tail, align; > > if (!bs->drv) { > return -ENOMEDIUM; > @@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > return 0; > } > > - /* Discard is advisory, so ignore any unaligned head or tail */ > + /* Discard is advisory, but some devices track and coalesce > + * unaligned requests, so we must pass everything down rather than > + * round here. Still, most devices will just silently ignore > + * unaligned requests (by returning -ENOTSUP), so we must fragment > + * the request accordingly. 
*/ > align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment); > assert(align % bs->bl.request_alignment == 0); > head = offset % align; > - if (head) { > - head = MIN(count, align - head); > - count -= head; > - offset += head; > - } > - count = QEMU_ALIGN_DOWN(count, align); > - if (!count) { > - return 0; > - } > + tail = (offset + count) % align; > > bdrv_inc_in_flight(bs); > tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD); > @@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset, > > max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX), > align); > - assert(max_pdiscard); > + assert(max_pdiscard >= bs->bl.request_alignment); > > while (count > 0) { > int ret; > - int num = MIN(count, max_pdiscard); > + int num = count; > + > + if (head) { > + /* Make small requests to get to alignment boundaries. */ > + num = MIN(count, align - head); > + if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) { > + num %= bs->bl.request_alignment; > + } > + head = (head + num) % align; > + assert(num < max_pdiscard); > + } else if (tail) { > + if (num > align) { > + /* Shorten the request to the last aligned cluster. */ > + num -= tail; > + } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) && > + tail > bs->bl.request_alignment) { > + tail %= bs->bl.request_alignment; > + num -= tail; > + } > + } Hm, from the commit message I expected splitting requests in three (head, bulk, tail), but actually we can end up splitting it in five? Just to check whether I got this right, let me try an example: Let's assume request alignment 512, pdiscard alignment 64k, and we get a discard request with offset 510, length 130k. This algorithm makes the following calls to the driver: 1. pdiscard offset=510, len=2 | new count = 130k - 2 2. pdiscard offset=512, len=(64k - 512) | new count = 66k + 510 3. pdiscard offset=64k, len=64k | new count = 2558 4. 
pdiscard offset=128k, len=2048 | new count = 510 5. pdiscard offset=130k, len=510 | new count = 0 Correct? If so, is this really necessary or even helpful? I see that the iscsi driver throws away requests 1 and 5 and needs the split because otherwise it would disregard the areas covered by 2 and 4, too. But why can't or shouldn't the iscsi driver do this rounding itself when it gets combined 1+2 and 4+5 requests? Kevin
On 11/22/2016 08:03 AM, Kevin Wolf wrote: > Am 17.11.2016 um 21:13 hat Eric Blake geschrieben: >> Discard is advisory, so rounding the requests to alignment >> boundaries is never semantically wrong from the data that >> the guest sees. But at least the Dell Equallogic iSCSI SANs >> has an interesting property that its advertised discard >> alignment is 15M, yet documents that discarding a sequence >> of 1M slices will eventually result in the 15M page being >> marked as discarded, and it is possible to observe which >> pages have been discarded. >> >> >> Rework the block layer discard algorithm to mirror the write >> zero algorithm: always peel off any unaligned head or tail >> and manage that in isolation, then do the bulk of the request >> on an aligned boundary. The fallback when the driver returns >> -ENOTSUP for an unaligned request is to silently ignore that >> portion of the discard request; but for devices that can pass >> the partial request all the way down to hardware, this can >> result in the hardware coalescing requests and discarding >> aligned pages after all. >> > > Hm, from the commit message I expected splitting requests in three > (head, bulk, tail), but actually we can end up splitting it in five? Correct; so maybe I need to improve the commit message. The goal in multiple splits was to make it easier for drivers to not have to worry about re-aligning things (a request is either sub-sector, sub-page but sector-aligned, or page-aligned). > > Just to check whether I got this right, let me try an example: Let's > assume request alignment 512, pdiscard alignment 64k, and we get a > discard request with offset 510, length 130k. This algorithm makes the > following calls to the driver: > > 1. pdiscard offset=510, len=2 | new count = 130k - 2 > 2. pdiscard offset=512, len=(64k - 512) | new count = 66k + 510 > 3. pdiscard offset=64k, len=64k | new count = 2558 > 4. pdiscard offset=128k, len=2048 | new count = 510 > 5. 
pdiscard offset=130k, len=510 | new count = 0 > > Correct? > Yes. > If so, is this really necessary or even helpful? I see that the iscsi > driver throws away requests 1 and 5 and needs the split because > otherwise it would disregard the areas covered by 2 and 4, too. But why > can't or shouldn't the iscsi driver do this rounding itself when it gets > combined 1+2 and 4+5 requests? Because then every driver has to implement rounding; I thought that having centralized rounding code was easier to maintain overall than having every driver reimplement it.
On 11/22/2016 08:13 AM, Eric Blake wrote: >> Hm, from the commit message I expected splitting requests in three >> (head, bulk, tail), but actually we can end up splitting it in five? > > Correct; so maybe I need to improve the commit message. The goal in > multiple splits was to make it easier for drivers to not have to worry > about re-aligning things (a request is either sub-sector, sub-page but > sector-aligned, or page-aligned). > >> >> Just to check whether I got this right, let me try an example: Let's >> assume request alignment 512, pdiscard alignment 64k, and we get a >> discard request with offset 510, length 130k. This algorithm makes the >> following calls to the driver: >> >> 1. pdiscard offset=510, len=2 | new count = 130k - 2 >> 2. pdiscard offset=512, len=(64k - 512) | new count = 66k + 510 >> 3. pdiscard offset=64k, len=64k | new count = 2558 >> 4. pdiscard offset=128k, len=2048 | new count = 510 >> 5. pdiscard offset=130k, len=510 | new count = 0 >> >> Correct? >> > > Yes. To further clarify: for write zeroes, we have been doing 1+2 and 4+5 together, because if the driver returns -ENOTSUP, we then do a normal write, and normal writes already take care of fragmenting things back into alignment boundaries (1 and 5 are handled by read-modify-write, 2 and 4 are aligned to write request boundaries), so the writes happen no matter what even if the write-zero at request boundaries might have been more efficient. With discard, there is no fallback to any other operation, so we HAVE to present as many fragments as possible to the driver, while still making it easy for the driver to distinguish between aligned and unaligned requests and not having to do extra rounding. But you made me realize that if I make write_zeroes do the same 5-piece fragmentation, then it might be easier to use bdrv_aligned_pwritev instead of manually reimplementing max_transfer logic in the write fallback.
diff --git a/block/io.c b/block/io.c
index 085ac34..4f00562 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2424,7 +2424,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 {
     BdrvTrackedRequest req;
     int max_pdiscard, ret;
-    int head, align;
+    int head, tail, align;
 
     if (!bs->drv) {
         return -ENOMEDIUM;
@@ -2447,19 +2447,15 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
         return 0;
     }
 
-    /* Discard is advisory, so ignore any unaligned head or tail */
+    /* Discard is advisory, but some devices track and coalesce
+     * unaligned requests, so we must pass everything down rather than
+     * round here. Still, most devices will just silently ignore
+     * unaligned requests (by returning -ENOTSUP), so we must fragment
+     * the request accordingly. */
     align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
     assert(align % bs->bl.request_alignment == 0);
     head = offset % align;
-    if (head) {
-        head = MIN(count, align - head);
-        count -= head;
-        offset += head;
-    }
-    count = QEMU_ALIGN_DOWN(count, align);
-    if (!count) {
-        return 0;
-    }
+    tail = (offset + count) % align;
 
     bdrv_inc_in_flight(bs);
     tracked_request_begin(&req, bs, offset, count, BDRV_TRACKED_DISCARD);
@@ -2471,11 +2467,34 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 
     max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
                                    align);
-    assert(max_pdiscard);
+    assert(max_pdiscard >= bs->bl.request_alignment);
 
     while (count > 0) {
         int ret;
-        int num = MIN(count, max_pdiscard);
+        int num = count;
+
+        if (head) {
+            /* Make small requests to get to alignment boundaries. */
+            num = MIN(count, align - head);
+            if (!QEMU_IS_ALIGNED(num, bs->bl.request_alignment)) {
+                num %= bs->bl.request_alignment;
+            }
+            head = (head + num) % align;
+            assert(num < max_pdiscard);
+        } else if (tail) {
+            if (num > align) {
+                /* Shorten the request to the last aligned cluster. */
+                num -= tail;
+            } else if (!QEMU_IS_ALIGNED(tail, bs->bl.request_alignment) &&
+                       tail > bs->bl.request_alignment) {
+                tail %= bs->bl.request_alignment;
+                num -= tail;
+            }
+        }
+        /* limit request size */
+        if (num > max_pdiscard) {
+            num = max_pdiscard;
+        }
 
         if (bs->drv->bdrv_co_pdiscard) {
             ret = bs->drv->bdrv_co_pdiscard(bs, offset, num);
Discard is advisory, so rounding the requests to alignment boundaries is never semantically wrong from the data that the guest sees. But at least the Dell Equallogic iSCSI SAN has an interesting property: its advertised discard alignment is 15M, yet it documents that discarding a sequence of 1M slices will eventually result in the 15M page being marked as discarded, and it is possible to observe which pages have been discarded.

Between commits 9f1963b and b8d0a980, we converted the block layer to a byte-based interface that ultimately ignores any unaligned head or tail based on the driver's advertised discard granularity, which means that qemu 2.7 refuses to pass any discard request smaller than 15M down to the Dell Equallogic hardware. This is a slight regression in behavior compared to earlier qemu, where a guest executing discards in power-of-2 chunks used to be able to get every page discarded, but is now left with various pages still allocated because the guest requests did not align with the hardware's 15M pages.

Since the SCSI specification says nothing about a minimum discard granularity, and only documents the preferred alignment, it is best if the block layer gives the driver every bit of information about discard requests, rather than rounding them to alignment boundaries early.

Rework the block layer discard algorithm to mirror the write zero algorithm: always peel off any unaligned head or tail and manage that in isolation, then do the bulk of the request on an aligned boundary. The fallback when the driver returns -ENOTSUP for an unaligned request is to silently ignore that portion of the discard request; but for devices that can pass the partial request all the way down to hardware, this can result in the hardware coalescing requests and discarding aligned pages after all.

Reported-by: Peter Lieven <pl@kamp.de>
CC: qemu-stable@nongnu.org
Signed-off-by: Eric Blake <eblake@redhat.com>

---
v2: Split at more boundaries.
---
 block/io.c | 45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)