diff mbox series

[2/4] block: Split padded I/O vectors exceeding IOV_MAX

Message ID 20230317175019.10857-3-hreitz@redhat.com
State New
Headers show
Series block: Split padded I/O vectors exceeding IOV_MAX | expand

Commit Message

Hanna Czenczek March 17, 2023, 5:50 p.m. UTC
When processing vectored guest requests that are not aligned to the
storage request alignment, we pad them by adding head and/or tail
buffers for a read-modify-write cycle.

The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
with this padding, the vector can exceed that limit.  As of
4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
limit, instead returning an error to the guest.

To the guest, this appears as a random I/O error.  We should not return
an I/O error to the guest when it issued a perfectly valid request.

Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
longer than IOV_MAX, which generally seems to work (because the guest
assumes a smaller alignment than we really have, file-posix's
raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
so emulate the request, so that the IOV_MAX does not matter).  However,
that does not seem exactly great.

I see two ways to fix this problem:
1. We split such long requests into two requests.
2. We join some elements of the vector into new buffers to make it
   shorter.

I am wary of (1), because it seems like it may have unintended side
effects.

(2) on the other hand seems relatively simple to implement, with
hopefully few side effects, so this patch does that.

To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
is effectively replaced by the new function bdrv_create_padded_qiov(),
which not only wraps the request IOV with padding head/tail, but also
ensures that the resulting vector will not have more than IOV_MAX
elements.  Putting that functionality into qemu_iovec_init_extended() is
infeasible because it requires allocating a bounce buffer; doing so
would require many more parameters (buffer alignment, how to initialize
the buffer, and out parameters like the buffer, its length, and the
original elements), which is not reasonable.

Conversely, it is not difficult to move qemu_iovec_init_extended()'s
functionality into bdrv_create_padded_qiov() by using public
qemu_iovec_* functions, so that is what this patch does.

Because bdrv_pad_request() was the only "serious" user of
qemu_iovec_init_extended(), the next patch will remove the latter
function, so the functionality is not implemented twice.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 143 insertions(+), 10 deletions(-)

Comments

Eric Blake March 18, 2023, 12:19 p.m. UTC | #1
On Fri, Mar 17, 2023 at 06:50:17PM +0100, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
> 

> 
> To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
> is effectively replaced by the new function bdrv_create_padded_qiov(),
> which not only wraps the request IOV with padding head/tail, but also
> ensures that the resulting vector will not have more than IOV_MAX
> elements.  Putting that functionality into qemu_iovec_init_extended() is
> infeasible because it requires allocating a bounce buffer; doing so
> would require many more parameters (buffer alignment, how to initialize
> the buffer, and out parameters like the buffer, its length, and the
> original elements), which is not reasonable.
> 
> Conversely, it is not difficult to move qemu_iovec_init_extended()'s
> functionality into bdrv_create_padded_qiov() by using public
> qemu_iovec_* functions, so that is what this patch does.
> 
> Because bdrv_pad_request() was the only "serious" user of
> qemu_iovec_init_extended(), the next patch will remove the latter
> function, so the functionality is not implemented twice.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 143 insertions(+), 10 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>
Vladimir Sementsov-Ogievskiy March 20, 2023, 10:31 a.m. UTC | #2
On 17.03.23 20:50, Hanna Czenczek wrote:
> When processing vectored guest requests that are not aligned to the
> storage request alignment, we pad them by adding head and/or tail
> buffers for a read-modify-write cycle.
> 
> The guest can submit I/O vectors up to IOV_MAX (1024) in length, but
> with this padding, the vector can exceed that limit.  As of
> 4c002cef0e9abe7135d7916c51abce47f7fc1ee2 ("util/iov: make
> qemu_iovec_init_extended() honest"), we refuse to pad vectors beyond the
> limit, instead returning an error to the guest.
> 
> To the guest, this appears as a random I/O error.  We should not return
> an I/O error to the guest when it issued a perfectly valid request.
> 
> Before 4c002cef0e9abe7135d7916c51abce47f7fc1ee2, we just made the vector
> longer than IOV_MAX, which generally seems to work (because the guest
> assumes a smaller alignment than we really have, file-posix's
> raw_co_prw() will generally see bdrv_qiov_is_aligned() return false, and
> so emulate the request, so that the IOV_MAX does not matter).  However,
> that does not seem exactly great.
> 
> I see two ways to fix this problem:
> 1. We split such long requests into two requests.
> 2. We join some elements of the vector into new buffers to make it
>     shorter.
> 
> I am wary of (1), because it seems like it may have unintended side
> effects.
> 
> (2) on the other hand seems relatively simple to implement, with
> hopefully few side effects, so this patch does that.
> 
> To do this, the use of qemu_iovec_init_extended() in bdrv_pad_request()
> is effectively replaced by the new function bdrv_create_padded_qiov(),
> which not only wraps the request IOV with padding head/tail, but also
> ensures that the resulting vector will not have more than IOV_MAX
> elements.  Putting that functionality into qemu_iovec_init_extended() is
> infeasible because it requires allocating a bounce buffer; doing so
> would require many more parameters (buffer alignment, how to initialize
> the buffer, and out parameters like the buffer, its length, and the
> original elements), which is not reasonable.
> 
> Conversely, it is not difficult to move qemu_iovec_init_extended()'s
> functionality into bdrv_create_padded_qiov() by using public
> qemu_iovec_* functions, so that is what this patch does.
> 
> Because bdrv_pad_request() was the only "serious" user of
> qemu_iovec_init_extended(), the next patch will remove the latter
> function, so the functionality is not implemented twice.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2141964
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>   block/io.c | 153 +++++++++++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 143 insertions(+), 10 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8974d46941..1e9cdba17a 100644
> --- a/block/io.c
> +++ b/block/io.c

[..]

> +    pad->write = write;
> +
>       return true;
>   }
>   
> @@ -1545,6 +1561,18 @@ zero_mem:
>   
>   static void bdrv_padding_destroy(BdrvRequestPadding *pad)

Maybe, rename to _finalize, to stress that it's not only freeing memory.

>   {
> +    if (pad->collapse_bounce_buf) {
> +        if (!pad->write) {
> +            /*
> +             * If padding required elements in the vector to be collapsed into a
> +             * bounce buffer, copy the bounce buffer content back
> +             */
> +            qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
> +                                pad->collapse_bounce_buf, pad->collapse_len);
> +        }
> +        qemu_vfree(pad->collapse_bounce_buf);
> +        qemu_iovec_destroy(&pad->pre_collapse_qiov);
> +    }
>       if (pad->buf) {
>           qemu_vfree(pad->buf);
>           qemu_iovec_destroy(&pad->local_qiov);
> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>       memset(pad, 0, sizeof(*pad));
>   }
>   
> +/*
> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
> + *
> + * To ensure this, when necessary, the first couple of elements (up to three)

maybe, "first two-three elements"

> + * of @iov are merged into pad->collapse_bounce_buf and replaced by a reference
> + * to that bounce buffer in pad->local_qiov.
> + *
> + * After performing a read request, the data from the bounce buffer must be
> + * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_destroy()).
> + */
> +static int bdrv_create_padded_qiov(BlockDriverState *bs,
> +                                   BdrvRequestPadding *pad,
> +                                   struct iovec *iov, int niov,
> +                                   size_t iov_offset, size_t bytes)
> +{
> +    int padded_niov, surplus_count, collapse_count;
> +
> +    /* Assert this invariant */
> +    assert(niov <= IOV_MAX);
> +
> +    /*
> +     * Cannot pad if resulting length would exceed SIZE_MAX.  Returning an error
> +     * to the guest is not ideal, but there is little else we can do.  At least
> +     * this will practically never happen on 64-bit systems.
> +     */
> +    if (SIZE_MAX - pad->head < bytes ||
> +        SIZE_MAX - pad->head - bytes < pad->tail)
> +    {
> +        return -EINVAL;
> +    }
> +
> +    /* Length of the resulting IOV if we just concatenated everything */
> +    padded_niov = !!pad->head + niov + !!pad->tail;
> +
> +    qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
> +
> +    if (pad->head) {
> +        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
> +    }
> +
> +    /*
> +     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
> +     * Instead, merge the first couple of elements of @iov to reduce the number

maybe, "first two-three elements"

> +     * of vector elements as necessary.
> +     */
> +    if (padded_niov > IOV_MAX) {
>

[..]

> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>           flags |= BDRV_REQ_COPY_ON_READ;
>       }
>   
> -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
> -                           NULL, &flags);
> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
> +                           &pad, NULL, &flags);
>       if (ret < 0) {
>           goto fail;
>       }

a bit later:

tracked_request_end(&req);
bdrv_padding_destroy(&pad);


Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls. With that:

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>



PS, I feel here still exists small space for optimization:

move the logic to bdrv_init_padding(), and

1. allocate only one buffer
2. make the new collpase are to be attached to head or tail padding
3. avoid creating extra iov-slice, maybe with help of some new qemu_iovec_* API that can control number of copied/to-be-copied iovs and/or calculation number of iovs in qiov/qiov_offset/bytes slice
Hanna Czenczek April 3, 2023, 1:33 p.m. UTC | #3
(Sorry for the rather late reply... Thanks for the review!)

On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
> On 17.03.23 20:50, Hanna Czenczek wrote:

[...]

>> diff --git a/block/io.c b/block/io.c
>> index 8974d46941..1e9cdba17a 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>
> [..]
>
>> +    pad->write = write;
>> +
>>       return true;
>>   }
>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>
> Maybe, rename to _finalize, to stress that it's not only freeing memory.

Sounds good!

[...]

>> @@ -1552,6 +1580,101 @@ static void 
>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>>       memset(pad, 0, sizeof(*pad));
>>   }
>>   +/*
>> + * Create pad->local_qiov by wrapping @iov in the padding head and 
>> tail, while
>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>> + *
>> + * To ensure this, when necessary, the first couple of elements (up 
>> to three)
>
> maybe, "first two-three elements"

Sure (here and...

[...]

>> +    /*
>> +     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>> +     * Instead, merge the first couple of elements of @iov to reduce 
>> the number
>
> maybe, "first two-three elements"

...here).

>
>> +     * of vector elements as necessary.
>> +     */
>> +    if (padded_niov > IOV_MAX) {
>>
>
> [..]
>
>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild 
>> *child,
>>           flags |= BDRV_REQ_COPY_ON_READ;
>>       }
>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, 
>> &bytes, &pad,
>> -                           NULL, &flags);
>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, 
>> false,
>> +                           &pad, NULL, &flags);
>>       if (ret < 0) {
>>           goto fail;
>>       }
>
> a bit later:
>
> tracked_request_end(&req);
> bdrv_padding_destroy(&pad);
>
>
> Now, the request is formally finished inside bdrv_padding_destroy().. 
> Not sure, does it really violate something, but seems safer to swap 
> these two calls. 

I’d rather not, for two reasons: First, tracked requests are (as far as 
I understand) only there to implement request serialization, and so only 
care about metadata (offset, length, and type), which is not changed by 
changes to the I/O vector.

Second, even if the state of the I/O vector were relevant to tracked 
requests, I think it would actually be the other way around, i.e. the 
tracked request must be ended before the padding is 
finalized/destroyed.  The tracked request is about the actual request we 
submit to `child` (which is why tracked_request_begin() is called after 
bdrv_pad_request()), and that request is done using the modified I/O 
vector.  So if the tracked request had any connection to the request’s 
I/O vector (which it doesn’t), it would be to this modified one, so we 
mustn’t invalidate it via bdrv_padding_finalize() while the tracked 
request lives.

Or, said differently: I generally try to clean up things in the inverse 
way they were set up, and because bdrv_pad_requests() comes before 
tracked_request_begin(), I think tracked_request_end() should come 
before bdrv_padding_finalize().

> With that:
>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>
>
>
> PS, I feel here still exists small space for optimization:

The question is whether any optimization is really worth it, and I’m not 
sure it is.  The bug has been in qemu for over two years, and because 
the only report I’ve seen about it came from our QE department, it seems 
like a very rare case, so I find it more important for the code to be as 
simple as possible than to optimize.

> move the logic to bdrv_init_padding(), and
>
> 1. allocate only one buffer
> 2. make the new collpase are to be attached to head or tail padding
> 3. avoid creating extra iov-slice, maybe with help of some new 
> qemu_iovec_* API that can control number of copied/to-be-copied iovs 
> and/or calculation number of iovs in qiov/qiov_offset/bytes slice

I’ve actually begun by trying to reuse the padding buffer, and to 
collapse head/tail into it, but found it to be rather complicated. See 
also my reply to Stefan here: 
https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04774.html

Hanna
Vladimir Sementsov-Ogievskiy April 4, 2023, 8:10 a.m. UTC | #4
On 03.04.23 16:33, Hanna Czenczek wrote:
> (Sorry for the rather late reply... Thanks for the review!)
> 
> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>> On 17.03.23 20:50, Hanna Czenczek wrote:
> 
> [...]
> 
>>> diff --git a/block/io.c b/block/io.c
>>> index 8974d46941..1e9cdba17a 100644
>>> --- a/block/io.c
>>> +++ b/block/io.c
>>
>> [..]
>>
>>> +    pad->write = write;
>>> +
>>>       return true;
>>>   }
>>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>
>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
> 
> Sounds good!
> 
> [...]
> 
>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>       memset(pad, 0, sizeof(*pad));
>>>   }
>>>   +/*
>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>> + *
>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>
>> maybe, "first two-three elements"
> 
> Sure (here and...
> 
> [...]
> 
>>> +    /*
>>> +     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>> +     * Instead, merge the first couple of elements of @iov to reduce the number
>>
>> maybe, "first two-three elements"
> 
> ...here).
> 
>>
>>> +     * of vector elements as necessary.
>>> +     */
>>> +    if (padded_niov > IOV_MAX) {
>>>
>>
>> [..]
>>
>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>>           flags |= BDRV_REQ_COPY_ON_READ;
>>>       }
>>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>> -                           NULL, &flags);
>>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>> +                           &pad, NULL, &flags);
>>>       if (ret < 0) {
>>>           goto fail;
>>>       }
>>
>> a bit later:
>>
>> tracked_request_end(&req);
>> bdrv_padding_destroy(&pad);
>>
>>
>> Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls. 
> 
> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
> 
> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed.  The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector.  So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
> 
> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_requests() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().

Note, that it's wise-versa in bdrv_co_pwritev_part().

For me it's just simpler to think that the whole request, including filling user-given qiov with data on read part is inside tracked_request_begin() / tracked_request_end(). And moving the last manipulation with qiov out of it breaks this simple thought.
Guest should not care of it, as it doesn't know about request tracking.. But what about internal code? Some code may depend on some requests be finished after bdrv_drained_begin() call, but now they may be not fully finished, and some data may be not copied back to original qiov.

I agree with your point about sequence of objects finalization, but maybe, that just shows that copying data back to qiov should not be a part of bdrv_padding_finalize(), but instead be a separate function, called before tracked_request_end().

> 
>> With that:
>>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
>>
>>
>>
>> PS, I feel here still exists small space for optimization:
> 
> The question is whether any optimization is really worth it, and I’m not sure it is.  The bug has been in qemu for over two years, and because the only report I’ve seen about it came from our QE department, it seems like a very rare case, so I find it more important for the code to be as simple as possible than to optimize.
> 
>> move the logic to bdrv_init_padding(), and
>>
>> 1. allocate only one buffer
>> 2. make the new collpase are to be attached to head or tail padding
>> 3. avoid creating extra iov-slice, maybe with help of some new qemu_iovec_* API that can control number of copied/to-be-copied iovs and/or calculation number of iovs in qiov/qiov_offset/bytes slice
> 
> I’ve actually begun by trying to reuse the padding buffer, and to collapse head/tail into it, but found it to be rather complicated. See also my reply to Stefan here: https://lists.nongnu.org/archive/html/qemu-devel/2023-03/msg04774.html
> 
> Hanna
>
Hanna Czenczek April 4, 2023, 5:32 p.m. UTC | #5
On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
> On 03.04.23 16:33, Hanna Czenczek wrote:
>> (Sorry for the rather late reply... Thanks for the review!)
>>
>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>
>> [...]
>>
>>>> diff --git a/block/io.c b/block/io.c
>>>> index 8974d46941..1e9cdba17a 100644
>>>> --- a/block/io.c
>>>> +++ b/block/io.c
>>>
>>> [..]
>>>
>>>> +    pad->write = write;
>>>> +
>>>>       return true;
>>>>   }
>>>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>
>>> Maybe, rename to _finalize, to stress that it's not only freeing 
>>> memory.
>>
>> Sounds good!
>>
>> [...]
>>
>>>> @@ -1552,6 +1580,101 @@ static void 
>>>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>       memset(pad, 0, sizeof(*pad));
>>>>   }
>>>>   +/*
>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and 
>>>> tail, while
>>>> + * ensuring that the resulting vector will not exceed IOV_MAX 
>>>> elements.
>>>> + *
>>>> + * To ensure this, when necessary, the first couple of elements 
>>>> (up to three)
>>>
>>> maybe, "first two-three elements"
>>
>> Sure (here and...
>>
>> [...]
>>
>>>> +    /*
>>>> +     * If padded_niov > IOV_MAX, we cannot just concatenate 
>>>> everything.
>>>> +     * Instead, merge the first couple of elements of @iov to 
>>>> reduce the number
>>>
>>> maybe, "first two-three elements"
>>
>> ...here).
>>
>>>
>>>> +     * of vector elements as necessary.
>>>> +     */
>>>> +    if (padded_niov > IOV_MAX) {
>>>>
>>>
>>> [..]
>>>
>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn 
>>>> bdrv_co_preadv_part(BdrvChild *child,
>>>>           flags |= BDRV_REQ_COPY_ON_READ;
>>>>       }
>>>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, 
>>>> &bytes, &pad,
>>>> -                           NULL, &flags);
>>>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, 
>>>> &bytes, false,
>>>> +                           &pad, NULL, &flags);
>>>>       if (ret < 0) {
>>>>           goto fail;
>>>>       }
>>>
>>> a bit later:
>>>
>>> tracked_request_end(&req);
>>> bdrv_padding_destroy(&pad);
>>>
>>>
>>> Now, the request is formally finished inside 
>>> bdrv_padding_destroy().. Not sure, does it really violate something, 
>>> but seems safer to swap these two calls. 
>>
>> I’d rather not, for two reasons: First, tracked requests are (as far 
>> as I understand) only there to implement request serialization, and 
>> so only care about metadata (offset, length, and type), which is not 
>> changed by changes to the I/O vector.
>>
>> Second, even if the state of the I/O vector were relevant to tracked 
>> requests, I think it would actually be the other way around, i.e. the 
>> tracked request must be ended before the padding is 
>> finalized/destroyed.  The tracked request is about the actual request 
>> we submit to `child` (which is why tracked_request_begin() is called 
>> after bdrv_pad_request()), and that request is done using the 
>> modified I/O vector.  So if the tracked request had any connection to 
>> the request’s I/O vector (which it doesn’t), it would be to this 
>> modified one, so we mustn’t invalidate it via bdrv_padding_finalize() 
>> while the tracked request lives.
>>
>> Or, said differently: I generally try to clean up things in the 
>> inverse way they were set up, and because bdrv_pad_requests() comes 
>> before tracked_request_begin(), I think tracked_request_end() should 
>> come before bdrv_padding_finalize().
>
> Note, that it's wise-versa in bdrv_co_pwritev_part().

Well, and it’s this way here.  We agree that for clean-up, the order 
doesn’t functionally matter, so either way is actually fine.

> For me it's just simpler to think that the whole request, including 
> filling user-given qiov with data on read part is inside 
> tracked_request_begin() / tracked_request_end().

It isn’t, though, because padding must be done before the tracked 
request is created.  The tracked request uses the request’s actual 
offset and length, after padding, so bdrv_pad_request() must always be 
done before (i.e., outside) tracked_request_begin().

> And moving the last manipulation with qiov out of it breaks this 
> simple thought.
> Guest should not care of it, as it doesn't know about request 
> tracking.. But what about internal code? Some code may depend on some 
> requests be finished after bdrv_drained_begin() call, but now they may 
> be not fully finished, and some data may be not copied back to 
> original qiov.
>
> I agree with your point about sequence of objects finalization, but 
> maybe, that just shows that copying data back to qiov should not be a 
> part of bdrv_padding_finalize(), but instead be a separate function, 
> called before tracked_request_end().

But my thought is that copying back shouldn’t be done before 
tracked_request_end(), because copying back is not part of the tracked 
request.  What we track is the padded request, which uses a modified I/O 
vector, so undoing that modification shouldn’t be done while the tracked 
request lives.

I know I’m inconsistent with regards to bdrv_co_pwritev_part(), which is 
because it doesn’t matter.  My actual position is that tracked requests 
are about metadata, so undoing/finalizing the padding (including 
potentially copying data back) has nothing to do with a tracked request, 
so the order cannot of finalizing both cannot matter.

But you’re arguing for consistency, and my position on that is, if we 
want consistency, I’d finalize the tracked request first, and the 
padding second.  This is also because tracking is done for 
serialization, so we should end it as soon as possible, so that 
concurrent requests are resumed quickly.  (Though I’m not sure if 
delaying it by a memcpy() matters for an essentially single-threaded 
block layer at this time.)

Hanna
Vladimir Sementsov-Ogievskiy April 5, 2023, 9:59 a.m. UTC | #6
On 04.04.23 20:32, Hanna Czenczek wrote:
> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>> (Sorry for the rather late reply... Thanks for the review!)
>>>
>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>
>>> [...]
>>>
>>>>> diff --git a/block/io.c b/block/io.c
>>>>> index 8974d46941..1e9cdba17a 100644
>>>>> --- a/block/io.c
>>>>> +++ b/block/io.c
>>>>
>>>> [..]
>>>>
>>>>> +    pad->write = write;
>>>>> +
>>>>>       return true;
>>>>>   }
>>>>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>>>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>
>>>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
>>>
>>> Sounds good!
>>>
>>> [...]
>>>
>>>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>       memset(pad, 0, sizeof(*pad));
>>>>>   }
>>>>>   +/*
>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>>>> + *
>>>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>>>
>>>> maybe, "first two-three elements"
>>>
>>> Sure (here and...
>>>
>>> [...]
>>>
>>>>> +    /*
>>>>> +     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>>>> +     * Instead, merge the first couple of elements of @iov to reduce the number
>>>>
>>>> maybe, "first two-three elements"
>>>
>>> ...here).
>>>
>>>>
>>>>> +     * of vector elements as necessary.
>>>>> +     */
>>>>> +    if (padded_niov > IOV_MAX) {
>>>>>
>>>>
>>>> [..]
>>>>
>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>>>>           flags |= BDRV_REQ_COPY_ON_READ;
>>>>>       }
>>>>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>>>> -                           NULL, &flags);
>>>>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>>>> +                           &pad, NULL, &flags);
>>>>>       if (ret < 0) {
>>>>>           goto fail;
>>>>>       }
>>>>
>>>> a bit later:
>>>>
>>>> tracked_request_end(&req);
>>>> bdrv_padding_destroy(&pad);
>>>>
>>>>
>>>> Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls. 
>>>
>>> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
>>>
>>> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed.  The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector.  So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
>>>
>>> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_requests() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().
>>
>> Note, that it's wise-versa in bdrv_co_pwritev_part().
> 
> Well, and it’s this way here.  We agree that for clean-up, the order doesn’t functionally matter, so either way is actually fine.
> 
>> For me it's just simpler to think that the whole request, including filling user-given qiov with data on read part is inside tracked_request_begin() / tracked_request_end().
> 
> It isn’t, though, because padding must be done before the tracked request is created.  The tracked request uses the request’s actual offset and length, after padding, so bdrv_pad_request() must always be done before (i.e., outside) tracked_request_begin().
> 
>> And moving the last manipulation with qiov out of it breaks this simple thought.
>> Guest should not care of it, as it doesn't know about request tracking.. But what about internal code? Some code may depend on some requests be finished after bdrv_drained_begin() call, but now they may be not fully finished, and some data may be not copied back to original qiov.

You didn't answered here. Do you think that's wrong assumption for the user of drained sections?

>>
>> I agree with your point about sequence of objects finalization, but maybe, that just shows that copying data back to qiov should not be a part of bdrv_padding_finalize(), but instead be a separate function, called before tracked_request_end().
> 
> But my thought is that copying back shouldn’t be done before tracked_request_end(), because copying back is not part of the tracked request.  What we track is the padded request, which uses a modified I/O vector, so undoing that modification shouldn’t be done while the tracked request lives.
> 
> I know I’m inconsistent with regards to bdrv_co_pwritev_part(), which is because it doesn’t matter.  My actual position is that tracked requests are about metadata, so undoing/finalizing the padding (including potentially copying data back) has nothing to do with a tracked request, so the order cannot of finalizing both cannot matter.
> 
> But you’re arguing for consistency, and my position on that is, if we want consistency, I’d finalize the tracked request first, and the padding second.  This is also because tracking is done for serialization, so we should end it as soon as possible, so that concurrent requests are resumed quickly.  (Though I’m not sure if delaying it by a memcpy() matters for an essentially single-threaded block layer at this time.)
> 
> Hanna
>
Hanna Czenczek April 6, 2023, 4:51 p.m. UTC | #7
On 05.04.23 11:59, Vladimir Sementsov-Ogievskiy wrote:
> On 04.04.23 20:32, Hanna Czenczek wrote:
>> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>>> (Sorry for the rather late reply... Thanks for the review!)
>>>>
>>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>>
>>>> [...]
>>>>
>>>>>> diff --git a/block/io.c b/block/io.c
>>>>>> index 8974d46941..1e9cdba17a 100644
>>>>>> --- a/block/io.c
>>>>>> +++ b/block/io.c
>>>>>
>>>>> [..]
>>>>>
>>>>>> +    pad->write = write;
>>>>>> +
>>>>>>       return true;
>>>>>>   }
>>>>>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>>>>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>
>>>>> Maybe, rename to _finalize, to stress that it's not only freeing 
>>>>> memory.
>>>>
>>>> Sounds good!
>>>>
>>>> [...]
>>>>
>>>>>> @@ -1552,6 +1580,101 @@ static void 
>>>>>> bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>>       memset(pad, 0, sizeof(*pad));
>>>>>>   }
>>>>>>   +/*
>>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head 
>>>>>> and tail, while
>>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX 
>>>>>> elements.
>>>>>> + *
>>>>>> + * To ensure this, when necessary, the first couple of elements 
>>>>>> (up to three)
>>>>>
>>>>> maybe, "first two-three elements"
>>>>
>>>> Sure (here and...
>>>>
>>>> [...]
>>>>
>>>>>> +    /*
>>>>>> +     * If padded_niov > IOV_MAX, we cannot just concatenate 
>>>>>> everything.
>>>>>> +     * Instead, merge the first couple of elements of @iov to 
>>>>>> reduce the number
>>>>>
>>>>> maybe, "first two-three elements"
>>>>
>>>> ...here).
>>>>
>>>>>
>>>>>> +     * of vector elements as necessary.
>>>>>> +     */
>>>>>> +    if (padded_niov > IOV_MAX) {
>>>>>>
>>>>>
>>>>> [..]
>>>>>
>>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn 
>>>>>> bdrv_co_preadv_part(BdrvChild *child,
>>>>>>           flags |= BDRV_REQ_COPY_ON_READ;
>>>>>>       }
>>>>>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, 
>>>>>> &bytes, &pad,
>>>>>> -                           NULL, &flags);
>>>>>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, 
>>>>>> &bytes, false,
>>>>>> +                           &pad, NULL, &flags);
>>>>>>       if (ret < 0) {
>>>>>>           goto fail;
>>>>>>       }
>>>>>
>>>>> a bit later:
>>>>>
>>>>> tracked_request_end(&req);
>>>>> bdrv_padding_destroy(&pad);
>>>>>
>>>>>
>>>>> Now, the request is formally finished inside 
>>>>> bdrv_padding_destroy().. Not sure, does it really violate 
>>>>> something, but seems safer to swap these two calls. 
>>>>
>>>> I’d rather not, for two reasons: First, tracked requests are (as 
>>>> far as I understand) only there to implement request serialization, 
>>>> and so only care about metadata (offset, length, and type), which 
>>>> is not changed by changes to the I/O vector.
>>>>
>>>> Second, even if the state of the I/O vector were relevant to 
>>>> tracked requests, I think it would actually be the other way 
>>>> around, i.e. the tracked request must be ended before the padding 
>>>> is finalized/destroyed.  The tracked request is about the actual 
>>>> request we submit to `child` (which is why tracked_request_begin() 
>>>> is called after bdrv_pad_request()), and that request is done using 
>>>> the modified I/O vector.  So if the tracked request had any 
>>>> connection to the request’s I/O vector (which it doesn’t), it would 
>>>> be to this modified one, so we mustn’t invalidate it via 
>>>> bdrv_padding_finalize() while the tracked request lives.
>>>>
>>>> Or, said differently: I generally try to clean up things in the 
>>>> inverse way they were set up, and because bdrv_pad_requests() comes 
>>>> before tracked_request_begin(), I think tracked_request_end() 
>>>> should come before bdrv_padding_finalize().
>>>
>>> Note, that it's wise-versa in bdrv_co_pwritev_part().
>>
>> Well, and it’s this way here.  We agree that for clean-up, the order 
>> doesn’t functionally matter, so either way is actually fine.
>>
>>> For me it's just simpler to think that the whole request, including 
>>> filling user-given qiov with data on read part is inside 
>>> tracked_request_begin() / tracked_request_end().
>>
>> It isn’t, though, because padding must be done before the tracked 
>> request is created.  The tracked request uses the request’s actual 
>> offset and length, after padding, so bdrv_pad_request() must always 
>> be done before (i.e., outside) tracked_request_begin().
>>
>>> And moving the last manipulation with qiov out of it breaks this 
>>> simple thought.
>>> Guest should not care of it, as it doesn't know about request 
>>> tracking.. But what about internal code? Some code may depend on 
>>> some requests be finished after bdrv_drained_begin() call, but now 
>>> they may be not fully finished, and some data may be not copied back 
>>> to original qiov.
>
> You didn't answered here. Do you think that's wrong assumption for the 
> user of drained sections?

Tracked requests are about request (write) serialization, they have 
nothing to do with draining.  Draining is about waiting until the 
in_flight counter is 0, i.e. waiting for bdrv_dec_in_flight(), which is 
separate from tracked_request_end() and always comes after 
bdrv_padding_finalize().

Hanna
Vladimir Sementsov-Ogievskiy April 6, 2023, 9:35 p.m. UTC | #8
On 06.04.23 19:51, Hanna Czenczek wrote:
> On 05.04.23 11:59, Vladimir Sementsov-Ogievskiy wrote:
>> On 04.04.23 20:32, Hanna Czenczek wrote:
>>> On 04.04.23 10:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 03.04.23 16:33, Hanna Czenczek wrote:
>>>>> (Sorry for the rather late reply... Thanks for the review!)
>>>>>
>>>>> On 20.03.23 11:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> On 17.03.23 20:50, Hanna Czenczek wrote:
>>>>>
>>>>> [...]
>>>>>
>>>>>>> diff --git a/block/io.c b/block/io.c
>>>>>>> index 8974d46941..1e9cdba17a 100644
>>>>>>> --- a/block/io.c
>>>>>>> +++ b/block/io.c
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> +    pad->write = write;
>>>>>>> +
>>>>>>>       return true;
>>>>>>>   }
>>>>>>>   @@ -1545,6 +1561,18 @@ zero_mem:
>>>>>>>     static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>>
>>>>>> Maybe, rename to _finalize, to stress that it's not only freeing memory.
>>>>>
>>>>> Sounds good!
>>>>>
>>>>> [...]
>>>>>
>>>>>>> @@ -1552,6 +1580,101 @@ static void bdrv_padding_destroy(BdrvRequestPadding *pad)
>>>>>>>       memset(pad, 0, sizeof(*pad));
>>>>>>>   }
>>>>>>>   +/*
>>>>>>> + * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
>>>>>>> + * ensuring that the resulting vector will not exceed IOV_MAX elements.
>>>>>>> + *
>>>>>>> + * To ensure this, when necessary, the first couple of elements (up to three)
>>>>>>
>>>>>> maybe, "first two-three elements"
>>>>>
>>>>> Sure (here and...
>>>>>
>>>>> [...]
>>>>>
>>>>>>> +    /*
>>>>>>> +     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
>>>>>>> +     * Instead, merge the first couple of elements of @iov to reduce the number
>>>>>>
>>>>>> maybe, "first two-three elements"
>>>>>
>>>>> ...here).
>>>>>
>>>>>>
>>>>>>> +     * of vector elements as necessary.
>>>>>>> +     */
>>>>>>> +    if (padded_niov > IOV_MAX) {
>>>>>>>
>>>>>>
>>>>>> [..]
>>>>>>
>>>>>>> @@ -1653,8 +1786,8 @@ int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
>>>>>>>           flags |= BDRV_REQ_COPY_ON_READ;
>>>>>>>       }
>>>>>>>   -    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
>>>>>>> -                           NULL, &flags);
>>>>>>> +    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
>>>>>>> +                           &pad, NULL, &flags);
>>>>>>>       if (ret < 0) {
>>>>>>>           goto fail;
>>>>>>>       }
>>>>>>
>>>>>> a bit later:
>>>>>>
>>>>>> tracked_request_end(&req);
>>>>>> bdrv_padding_destroy(&pad);
>>>>>>
>>>>>>
>>>>>> Now, the request is formally finished inside bdrv_padding_destroy().. Not sure, does it really violate something, but seems safer to swap these two calls. 
>>>>>
>>>>> I’d rather not, for two reasons: First, tracked requests are (as far as I understand) only there to implement request serialization, and so only care about metadata (offset, length, and type), which is not changed by changes to the I/O vector.
>>>>>
>>>>> Second, even if the state of the I/O vector were relevant to tracked requests, I think it would actually be the other way around, i.e. the tracked request must be ended before the padding is finalized/destroyed.  The tracked request is about the actual request we submit to `child` (which is why tracked_request_begin() is called after bdrv_pad_request()), and that request is done using the modified I/O vector.  So if the tracked request had any connection to the request’s I/O vector (which it doesn’t), it would be to this modified one, so we mustn’t invalidate it via bdrv_padding_finalize() while the tracked request lives.
>>>>>
>>>>> Or, said differently: I generally try to clean up things in the inverse way they were set up, and because bdrv_pad_requests() comes before tracked_request_begin(), I think tracked_request_end() should come before bdrv_padding_finalize().
>>>>
>>>> Note, that it's wise-versa in bdrv_co_pwritev_part().
>>>
>>> Well, and it’s this way here.  We agree that for clean-up, the order doesn’t functionally matter, so either way is actually fine.
>>>
>>>> For me it's just simpler to think that the whole request, including filling user-given qiov with data on read part is inside tracked_request_begin() / tracked_request_end().
>>>
>>> It isn’t, though, because padding must be done before the tracked request is created.  The tracked request uses the request’s actual offset and length, after padding, so bdrv_pad_request() must always be done before (i.e., outside) tracked_request_begin().
>>>
>>>> And moving the last manipulation with qiov out of it breaks this simple thought.
>>>> Guest should not care of it, as it doesn't know about request tracking.. But what about internal code? Some code may depend on some requests be finished after bdrv_drained_begin() call, but now they may be not fully finished, and some data may be not copied back to original qiov.
>>
>> You didn't answered here. Do you think that's wrong assumption for the user of drained sections?
> 
> Tracked requests are about request (write) serialization, they have nothing to do with draining.  Draining is about waiting until the in_flight counter is 0, i.e. waiting for bdrv_dec_in_flight(), which is separate from tracked_request_end() and always comes after bdrv_padding_finalize().
> 

Oh, right, I was wrong, sorry for long arguing.

No more objections:

Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
diff mbox series

Patch

diff --git a/block/io.c b/block/io.c
index 8974d46941..1e9cdba17a 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1435,6 +1435,14 @@  out:
  * @merge_reads is true for small requests,
  * if @buf_len == @head + bytes + @tail. In this case it is possible that both
  * head and tail exist but @buf_len == align and @tail_buf == @buf.
+ *
+ * @write is true for write requests, false for read requests.
+ *
+ * If padding makes the vector too long (exceeding IOV_MAX), then we need to
+ * merge existing vector elements into a single one.  @collapse_bounce_buf acts
+ * as the bounce buffer in such cases.  @pre_collapse_qiov has the pre-collapse
+ * I/O vector elements so for read requests, the data can be copied back after
+ * the read is done.
  */
 typedef struct BdrvRequestPadding {
     uint8_t *buf;
@@ -1443,11 +1451,17 @@  typedef struct BdrvRequestPadding {
     size_t head;
     size_t tail;
     bool merge_reads;
+    bool write;
     QEMUIOVector local_qiov;
+
+    uint8_t *collapse_bounce_buf;
+    size_t collapse_len;
+    QEMUIOVector pre_collapse_qiov;
 } BdrvRequestPadding;
 
 static bool bdrv_init_padding(BlockDriverState *bs,
                               int64_t offset, int64_t bytes,
+                              bool write,
                               BdrvRequestPadding *pad)
 {
     int64_t align = bs->bl.request_alignment;
@@ -1479,6 +1493,8 @@  static bool bdrv_init_padding(BlockDriverState *bs,
         pad->tail_buf = pad->buf + pad->buf_len - align;
     }
 
+    pad->write = write;
+
     return true;
 }
 
@@ -1545,6 +1561,18 @@  zero_mem:
 
 static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 {
+    if (pad->collapse_bounce_buf) {
+        if (!pad->write) {
+            /*
+             * If padding required elements in the vector to be collapsed into a
+             * bounce buffer, copy the bounce buffer content back
+             */
+            qemu_iovec_from_buf(&pad->pre_collapse_qiov, 0,
+                                pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_vfree(pad->collapse_bounce_buf);
+        qemu_iovec_destroy(&pad->pre_collapse_qiov);
+    }
     if (pad->buf) {
         qemu_vfree(pad->buf);
         qemu_iovec_destroy(&pad->local_qiov);
@@ -1552,6 +1580,101 @@  static void bdrv_padding_destroy(BdrvRequestPadding *pad)
     memset(pad, 0, sizeof(*pad));
 }
 
+/*
+ * Create pad->local_qiov by wrapping @iov in the padding head and tail, while
+ * ensuring that the resulting vector will not exceed IOV_MAX elements.
+ *
+ * To ensure this, when necessary, the first couple of elements (up to three)
+ * of @iov are merged into pad->collapse_bounce_buf and replaced by a reference
+ * to that bounce buffer in pad->local_qiov.
+ *
+ * After performing a read request, the data from the bounce buffer must be
+ * copied back into pad->pre_collapse_qiov (e.g. by bdrv_padding_destroy()).
+ */
+static int bdrv_create_padded_qiov(BlockDriverState *bs,
+                                   BdrvRequestPadding *pad,
+                                   struct iovec *iov, int niov,
+                                   size_t iov_offset, size_t bytes)
+{
+    int padded_niov, surplus_count, collapse_count;
+
+    /* Assert this invariant */
+    assert(niov <= IOV_MAX);
+
+    /*
+     * Cannot pad if resulting length would exceed SIZE_MAX.  Returning an error
+     * to the guest is not ideal, but there is little else we can do.  At least
+     * this will practically never happen on 64-bit systems.
+     */
+    if (SIZE_MAX - pad->head < bytes ||
+        SIZE_MAX - pad->head - bytes < pad->tail)
+    {
+        return -EINVAL;
+    }
+
+    /* Length of the resulting IOV if we just concatenated everything */
+    padded_niov = !!pad->head + niov + !!pad->tail;
+
+    qemu_iovec_init(&pad->local_qiov, MIN(padded_niov, IOV_MAX));
+
+    if (pad->head) {
+        qemu_iovec_add(&pad->local_qiov, pad->buf, pad->head);
+    }
+
+    /*
+     * If padded_niov > IOV_MAX, we cannot just concatenate everything.
+     * Instead, merge the first couple of elements of @iov to reduce the number
+     * of vector elements as necessary.
+     */
+    if (padded_niov > IOV_MAX) {
+        /*
+         * Only head and tail can have lead to the number of entries exceeding
+         * IOV_MAX, so we can exceed it by the head and tail at most.  We need
+         * to reduce the number of elements by `surplus_count`, so we merge that
+         * many elements plus one into one element.
+         */
+        surplus_count = padded_niov - IOV_MAX;
+        assert(surplus_count <= !!pad->head + !!pad->tail);
+        collapse_count = surplus_count + 1;
+
+        /*
+         * Move the elements to collapse into `pad->pre_collapse_qiov`, then
+         * advance `iov` (and associated variables) by those elements.
+         */
+        qemu_iovec_init(&pad->pre_collapse_qiov, collapse_count);
+        qemu_iovec_concat_iov(&pad->pre_collapse_qiov, iov,
+                              collapse_count, iov_offset, SIZE_MAX);
+        iov += collapse_count;
+        iov_offset = 0;
+        niov -= collapse_count;
+        bytes -= pad->pre_collapse_qiov.size;
+
+        /*
+         * Construct the bounce buffer to match the length of the to-collapse
+         * vector elements, and for write requests, initialize it with the data
+         * from those elements.  Then add it to `pad->local_qiov`.
+         */
+        pad->collapse_len = pad->pre_collapse_qiov.size;
+        pad->collapse_bounce_buf = qemu_blockalign(bs, pad->collapse_len);
+        if (pad->write) {
+            qemu_iovec_to_buf(&pad->pre_collapse_qiov, 0,
+                              pad->collapse_bounce_buf, pad->collapse_len);
+        }
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->collapse_bounce_buf, pad->collapse_len);
+    }
+
+    qemu_iovec_concat_iov(&pad->local_qiov, iov, niov, iov_offset, bytes);
+
+    if (pad->tail) {
+        qemu_iovec_add(&pad->local_qiov,
+                       pad->buf + pad->buf_len - pad->tail, pad->tail);
+    }
+
+    assert(pad->local_qiov.niov == MIN(padded_niov, IOV_MAX));
+    return 0;
+}
+
 /*
  * bdrv_pad_request
  *
@@ -1559,6 +1682,8 @@  static void bdrv_padding_destroy(BdrvRequestPadding *pad)
  * read of padding, bdrv_padding_rmw_read() should be called separately if
  * needed.
  *
+ * @write is true for write requests, false for read requests.
+ *
  * Request parameters (@qiov, &qiov_offset, &offset, &bytes) are in-out:
  *  - on function start they represent original request
  *  - on failure or when padding is not needed they are unchanged
@@ -1567,24 +1692,32 @@  static void bdrv_padding_destroy(BdrvRequestPadding *pad)
 static int bdrv_pad_request(BlockDriverState *bs,
                             QEMUIOVector **qiov, size_t *qiov_offset,
                             int64_t *offset, int64_t *bytes,
+                            bool write,
                             BdrvRequestPadding *pad, bool *padded,
                             BdrvRequestFlags *flags)
 {
     int ret;
+    struct iovec *sliced_iov;
+    int sliced_niov;
+    size_t sliced_head, sliced_tail;
 
     bdrv_check_qiov_request(*offset, *bytes, *qiov, *qiov_offset, &error_abort);
 
-    if (!bdrv_init_padding(bs, *offset, *bytes, pad)) {
+    if (!bdrv_init_padding(bs, *offset, *bytes, write, pad)) {
         if (padded) {
             *padded = false;
         }
         return 0;
     }
 
-    ret = qemu_iovec_init_extended(&pad->local_qiov, pad->buf, pad->head,
-                                   *qiov, *qiov_offset, *bytes,
-                                   pad->buf + pad->buf_len - pad->tail,
-                                   pad->tail);
+    sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+                                  &sliced_head, &sliced_tail,
+                                  &sliced_niov);
+
+    /* Guaranteed by bdrv_check_qiov_request() */
+    assert(*bytes <= SIZE_MAX);
+    ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+                                  sliced_head, *bytes);
     if (ret < 0) {
         bdrv_padding_destroy(pad);
         return ret;
@@ -1653,8 +1786,8 @@  int coroutine_fn bdrv_co_preadv_part(BdrvChild *child,
         flags |= BDRV_REQ_COPY_ON_READ;
     }
 
-    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                           NULL, &flags);
+    ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, false,
+                           &pad, NULL, &flags);
     if (ret < 0) {
         goto fail;
     }
@@ -1996,7 +2129,7 @@  bdrv_co_do_zero_pwritev(BdrvChild *child, int64_t offset, int64_t bytes,
     /* This flag doesn't make sense for padding or zero writes */
     flags &= ~BDRV_REQ_REGISTERED_BUF;
 
-    padding = bdrv_init_padding(bs, offset, bytes, &pad);
+    padding = bdrv_init_padding(bs, offset, bytes, true, &pad);
     if (padding) {
         assert(!(flags & BDRV_REQ_NO_WAIT));
         bdrv_make_request_serialising(req, align);
@@ -2112,8 +2245,8 @@  int coroutine_fn bdrv_co_pwritev_part(BdrvChild *child,
          * bdrv_co_do_zero_pwritev() does aligning by itself, so, we do
          * alignment only if there is no ZERO flag.
          */
-        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, &pad,
-                               &padded, &flags);
+        ret = bdrv_pad_request(bs, &qiov, &qiov_offset, &offset, &bytes, true,
+                               &pad, &padded, &flags);
         if (ret < 0) {
             return ret;
         }