Message ID: 1469129688-22848-5-git-send-email-eblake@redhat.com
State: New
On Thu, Jul 21, 2016 at 01:34:48PM -0600, Eric Blake wrote:
> Dell Equallogic iSCSI SANs have a very unusual advertised geometry:
>
> $ iscsi-inq -e 1 -c $((0xb0)) iscsi://XXX/0
> wsnz:0
> maximum compare and write length:1
> optimal transfer length granularity:0
> maximum transfer length:0
> optimal transfer length:0
> maximum prefetch xdread xdwrite transfer length:0
> maximum unmap lba count:30720
> maximum unmap block descriptor count:2
> optimal unmap granularity:30720
> ugavalid:1
> unmap granularity alignment:0
> maximum write same length:30720
>
> which says that both the maximum and the optimal discard size
> is 15M. It is not immediately apparent if the device allows
> discard requests not aligned to the optimal size, nor if it
> allows discards at a finer granularity than the optimal size.
>
> I tried to find details in the SCSI Commands Reference Manual
> Rev. A on what valid values of maximum and optimal sizes are
> permitted, but while that document mentions a "Block Limits
> VPD Page", I couldn't actually find documentation of that page
> or what values it would have, or if a SCSI device has an
> advertisement of its minimal unmap granularity. So it is not
> obvious to me whether the Dell Equallogic device is compliant
> with the SCSI specification.
>
> Fortunately, it is easy enough to support non-power-of-2 sizing,
> even if it means we are less efficient than truly possible when
> targeting that device (for example, it means that we refuse to
> unmap anything that is not a multiple of 15M and aligned to a
> 15M boundary, even if the device truly does support a smaller
> granularity where unmapping actually works).
>
> Reported-by: Peter Lieven <pl@kamp.de>
> Signed-off-by: Eric Blake <eblake@redhat.com>
>
> ---
> Help in locating the actual specs on what SCSI requires for
> page 0xb0 would be nice. But this should at least avoid the
> assertion failures that Peter is hitting.
>
> I was able to test this patch using NBD on a hacked-up qemu
> where I made block/nbd.c report the same block limits, and
> could confirm the assert under qemu-io 'w -z 0 40m' and
> 'discard 0 40m' pre-patch, as well as the post-patch behavior
> of splitting things to 15M alignment ('discard 1M 15M' becomes
> a no-op because it is not aligned). But obviously it needs to
> be tested on the actual iscsi SAN that triggered the original
> report.
> ---
>  include/block/block_int.h | 37 ++++++++++++++++++++-----------------
>  block/io.c                | 15 +++++++++------
>  2 files changed, 29 insertions(+), 23 deletions(-)

Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
On Thu, 07/21 13:34, Eric Blake wrote:
> + max_write_zeroes = max_write_zeroes / alignment * alignment;
Not using QEMU_ALIGN_DOWN despite patch 3?
On 07/27/2016 01:25 AM, Fam Zheng wrote:
> On Thu, 07/21 13:34, Eric Blake wrote:
>> +    max_write_zeroes = max_write_zeroes / alignment * alignment;
>
> Not using QEMU_ALIGN_DOWN despite patch 3?

Looks like I missed that on the rebase. Can fix if there is a reason for a respin.
On 01.08.2016 at 11:22, Paolo Bonzini wrote:
> On 28/07/2016 04:39, Eric Blake wrote:
>> On 07/27/2016 01:25 AM, Fam Zheng wrote:
>>> On Thu, 07/21 13:34, Eric Blake wrote:
>>>> +    max_write_zeroes = max_write_zeroes / alignment * alignment;
>>> Not using QEMU_ALIGN_DOWN despite patch 3?
>> Looks like I missed that on the rebase. Can fix if there is a reason
>> for a respin.
> Since Stefan acked this, I'm applying the patch and fixing it to use
> QEMU_ALIGN_DOWN.
>
> Paolo

Hi,

I came across a sort of regression we introduced with the dropping of
the head and tail of an unaligned discard.

The discard alignment that we use to trim the discard request is just a
hint.

I learned on the Equallogics that a page (which has this unusual size
of 15MB) is unallocated even if the discard happens in pieces, e.g. in
slices of 1MB requests.

From my point of view I would like to restore the old behaviour. What
do you think?

Thanks,
Peter
On 25/10/2016 14:03, Peter Lieven wrote:
> [...]
> I came across a sort of regression we introduced with the dropping of
> the head and tail of an unaligned discard.
>
> The discard alignment that we use to trim the discard request is just
> a hint.
>
> I learned on the Equallogics that a page (which has this unusual size
> of 15MB) is unallocated even if the discard happens in pieces, e.g. in
> slices of 1MB requests.
>
> From my point of view I would like to restore the old behaviour. What
> do you think?

The right logic should be the one in Linux: if splitting a request, and
the next starting sector would be misaligned, stop the discard at the
previous aligned sector. Otherwise leave everything alone.

Paolo
On 25.10.2016 at 14:09, Paolo Bonzini wrote:
> [...]
> The right logic should be the one in Linux: if splitting a request,
> and the next starting sector would be misaligned, stop the discard at
> the previous aligned sector. Otherwise leave everything alone.

Just to clarify: I mean if the guest sends incremental 1MB discards, we
would now drop all of them if the alignment is 15MB. Previously, we
sent all of the 1MB requests.

Peter
On 25/10/2016 14:12, Peter Lieven wrote:
> [...]
> Just to clarify: I mean if the guest sends incremental 1MB discards,
> we would now drop all of them if the alignment is 15MB. Previously, we
> sent all of the 1MB requests.

Yes. In this case there would be no need at all to split the request,
so each request should be passed through.

But hey, that firmware is seriously weird. :)

Paolo
On 25.10.2016 at 14:19, Paolo Bonzini wrote:
> [...]
> Yes. In this case there would be no need at all to split the request,
> so each request should be passed through.
>
> But hey, that firmware is seriously weird. :)

Yes, so you would not change the new implementation?

Even if the discard is e.g. 1MB, it could theoretically be that
internally the device has a finer granularity. It's an optimal discard
alignment, not the minimum required discard size. I think that's a
difference. It does not say "I can't handle smaller discards".

Peter
On 10/25/2016 07:42 AM, Peter Lieven wrote:
>> But hey, that firmware is seriously weird. :)
>
> Yes, so you would not change the new implementation?
>
> Even if the discard is e.g. 1MB, it could theoretically be that
> internally the device has a finer granularity. It's an optimal discard
> alignment, not the minimum required discard size. I think that's a
> difference. It does not say "I can't handle smaller discards".

The firmware is probably technically buggy for advertising too large of
a minimum granularity, if it can piece together smaller requests into a
larger discard. If discards need to happen at a smaller granularity,
the firmware (or kernel quirk system) should fix the advertisement to
the actual granularity that it will honor. I don't see a reason to
change qemu's current behavior.
On 25.10.2016 at 15:59, Eric Blake wrote:
> [...]
> The firmware is probably technically buggy for advertising too large
> of a minimum granularity, if it can piece together smaller requests
> into a larger discard. If discards need to happen at a smaller
> granularity, the firmware (or kernel quirk system) should fix the
> advertisement to the actual granularity that it will honor. I don't
> see a reason to change qemu's current behavior.

The issue is that the optimal unmap granularity is only a hint. There
is no evidence of what happens with unaligned requests or requests that
are smaller. They could still lead to a discard operation in the
storage device. It just says "if you can send me discards of that size,
that's optimal for me". That does not mean that smaller or unaligned
requests have no effect.

Peter
On 10/25/2016 09:20 AM, Peter Lieven wrote:
> [...]
> The issue is that the optimal unmap granularity is only a hint. There
> is no evidence of what happens with unaligned requests or requests
> that are smaller. They could still lead to a discard operation in the
> storage device. It just says "if you can send me discards of that
> size, that's optimal for me". That does not mean that smaller or
> unaligned requests have no effect.

So your argument is that we should always pass down every unaligned
less-than-optimum discard request all the way to the hardware, rather
than dropping it higher in the stack, even though discard requests are
already advisory, in order to leave the hardware as the ultimate
decision on whether to ignore the unaligned request?
On 25/10/2016 16:35, Eric Blake wrote:
> So your argument is that we should always pass down every unaligned
> less-than-optimum discard request all the way to the hardware, rather
> than dropping it higher in the stack, even though discard requests are
> already advisory, in order to leave the hardware as the ultimate
> decision on whether to ignore the unaligned request?

Yes, I agree with Peter as to this.

Paolo
On 10/25/2016 09:36 AM, Paolo Bonzini wrote:
> [...]
> Yes, I agree with Peter as to this.

Okay, I'll work on patches. I think it counts as a bug fix, so
appropriate even if I miss soft freeze (I'd still like to get NBD write
zero support into 2.8, since it already missed 2.7, but that one is
still awaiting review with not much time left).
On 25.10.2016 at 18:12, Eric Blake wrote:
> [...]
> Okay, I'll work on patches. I think it counts as a bug fix, so
> appropriate even if I miss soft freeze (I'd still like to get NBD
> write zero support into 2.8, since it already missed 2.7, but that one
> is still awaiting review with not much time left).

Hi Eric,

have you had time to look at this?
If you need help, let me know.

Peter
On 11/08/2016 05:03 AM, Peter Lieven wrote:
> [...]
> have you had time to look at this?
> If you need help, let me know.

Still on my list. I'm not forgetting it, and it does count as a bug fix
so it is safe for inclusion, although I'm trying to get it in before
this week is out.
On 11/08/2016 05:03 AM, Peter Lieven wrote:
> [...]
> have you had time to look at this?
> If you need help, let me know.

Patches posted, but testing help would be appreciated since you have
the actual hardware that exhibits the issue. I'm also trying to write a
patch to extend the blkdebug driver to share this "feature" of a 15M
page, and write a qemu-iotest to make it harder to regress in the
future.
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 1fe0fd9..47665be 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -330,36 +330,39 @@ typedef struct BlockLimits {
      * otherwise. */
     uint32_t request_alignment;
 
-    /* maximum number of bytes that can be discarded at once (since it
-     * is signed, it must be < 2G, if set), should be multiple of
+    /* Maximum number of bytes that can be discarded at once (since it
+     * is signed, it must be < 2G, if set). Must be multiple of
      * pdiscard_alignment, but need not be power of 2. May be 0 if no
      * inherent 32-bit limit */
     int32_t max_pdiscard;
 
-    /* optimal alignment for discard requests in bytes, must be power
-     * of 2, less than max_pdiscard if that is set, and multiple of
-     * bl.request_alignment. May be 0 if bl.request_alignment is good
-     * enough */
+    /* Optimal alignment for discard requests in bytes. A power of 2
+     * is best but not mandatory. Must be a multiple of
+     * bl.request_alignment, and must be less than max_pdiscard if
+     * that is set. May be 0 if bl.request_alignment is good enough */
     uint32_t pdiscard_alignment;
 
-    /* maximum number of bytes that can zeroized at once (since it is
-     * signed, it must be < 2G, if set), should be multiple of
+    /* Maximum number of bytes that can zeroized at once (since it is
+     * signed, it must be < 2G, if set). Must be multiple of
      * pwrite_zeroes_alignment. May be 0 if no inherent 32-bit limit */
     int32_t max_pwrite_zeroes;
 
-    /* optimal alignment for write zeroes requests in bytes, must be
-     * power of 2, less than max_pwrite_zeroes if that is set, and
-     * multiple of bl.request_alignment. May be 0 if
-     * bl.request_alignment is good enough */
+    /* Optimal alignment for write zeroes requests in bytes. A power
+     * of 2 is best but not mandatory. Must be a multiple of
+     * bl.request_alignment, and must be less than max_pwrite_zeroes
+     * if that is set. May be 0 if bl.request_alignment is good
+     * enough */
     uint32_t pwrite_zeroes_alignment;
 
-    /* optimal transfer length in bytes (must be power of 2, and
-     * multiple of bl.request_alignment), or 0 if no preferred size */
+    /* Optimal transfer length in bytes. A power of 2 is best but not
+     * mandatory. Must be a multiple of bl.request_alignment, or 0 if
+     * no preferred size */
     uint32_t opt_transfer;
 
-    /* maximal transfer length in bytes (need not be power of 2, but
-     * should be multiple of opt_transfer), or 0 for no 32-bit limit.
-     * For now, anything larger than INT_MAX is clamped down. */
+    /* Maximal transfer length in bytes. Need not be power of 2, but
+     * must be multiple of opt_transfer and bl.request_alignment, or 0
+     * for no 32-bit limit. For now, anything larger than INT_MAX is
+     * clamped down. */
     uint32_t max_transfer;
 
     /* memory alignment, in bytes so that no bounce buffer is needed */
diff --git a/block/io.c b/block/io.c
index 7323f0f..0644a5d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1180,10 +1180,11 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
     int alignment = MAX(bs->bl.pwrite_zeroes_alignment,
                         bs->bl.request_alignment);
 
-    assert(is_power_of_2(alignment));
-    head = offset & (alignment - 1);
-    tail = (offset + count) & (alignment - 1);
-    max_write_zeroes &= ~(alignment - 1);
+    assert(alignment % bs->bl.request_alignment == 0);
+    head = offset % alignment;
+    tail = (offset + count) % alignment;
+    max_write_zeroes = max_write_zeroes / alignment * alignment;
+    assert(max_write_zeroes >= bs->bl.request_alignment);
 
     while (count > 0 && !ret) {
         int num = count;
@@ -2429,9 +2430,10 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 
     /* Discard is advisory, so ignore any unaligned head or tail */
     align = MAX(bs->bl.pdiscard_alignment, bs->bl.request_alignment);
-    assert(is_power_of_2(align));
-    head = MIN(count, -offset & (align - 1));
+    assert(align % bs->bl.request_alignment == 0);
+    head = offset % align;
     if (head) {
+        head = MIN(count, align - head);
         count -= head;
         offset += head;
     }
@@ -2449,6 +2451,7 @@ int coroutine_fn bdrv_co_pdiscard(BlockDriverState *bs, int64_t offset,
 
     max_pdiscard = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_pdiscard, INT_MAX),
                                    align);
+    assert(max_pdiscard);
 
     while (count > 0) {
         int ret;
Dell Equallogic iSCSI SANs have a very unusual advertised geometry:

$ iscsi-inq -e 1 -c $((0xb0)) iscsi://XXX/0
wsnz:0
maximum compare and write length:1
optimal transfer length granularity:0
maximum transfer length:0
optimal transfer length:0
maximum prefetch xdread xdwrite transfer length:0
maximum unmap lba count:30720
maximum unmap block descriptor count:2
optimal unmap granularity:30720
ugavalid:1
unmap granularity alignment:0
maximum write same length:30720

which says that both the maximum and the optimal discard size
is 15M. It is not immediately apparent if the device allows
discard requests not aligned to the optimal size, nor if it
allows discards at a finer granularity than the optimal size.

I tried to find details in the SCSI Commands Reference Manual
Rev. A on what valid values of maximum and optimal sizes are
permitted, but while that document mentions a "Block Limits
VPD Page", I couldn't actually find documentation of that page
or what values it would have, or if a SCSI device has an
advertisement of its minimal unmap granularity. So it is not
obvious to me whether the Dell Equallogic device is compliant
with the SCSI specification.

Fortunately, it is easy enough to support non-power-of-2 sizing,
even if it means we are less efficient than truly possible when
targeting that device (for example, it means that we refuse to
unmap anything that is not a multiple of 15M and aligned to a
15M boundary, even if the device truly does support a smaller
granularity where unmapping actually works).

Reported-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Eric Blake <eblake@redhat.com>

---
Help in locating the actual specs on what SCSI requires for
page 0xb0 would be nice. But this should at least avoid the
assertion failures that Peter is hitting.

I was able to test this patch using NBD on a hacked-up qemu
where I made block/nbd.c report the same block limits, and
could confirm the assert under qemu-io 'w -z 0 40m' and
'discard 0 40m' pre-patch, as well as the post-patch behavior
of splitting things to 15M alignment ('discard 1M 15M' becomes
a no-op because it is not aligned). But obviously it needs to
be tested on the actual iscsi SAN that triggered the original
report.
---
 include/block/block_int.h | 37 ++++++++++++++++++++-----------------
 block/io.c                | 15 +++++++++------
 2 files changed, 29 insertions(+), 23 deletions(-)