
[4/4] block: avoid creating oversized writes in multiwrite_merge

Message ID 1409935888-18552-5-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Sept. 5, 2014, 4:51 p.m. UTC
Signed-off-by: Peter Lieven <pl@kamp.de>
---
 block.c |    5 +++++
 1 file changed, 5 insertions(+)

Comments

Paolo Bonzini Sept. 18, 2014, 2:13 p.m. UTC | #1
On 05/09/2014 18:51, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  block.c |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block.c b/block.c
> index fa4c34b..db3f842 100644
> --- a/block.c
> +++ b/block.c
> @@ -4554,6 +4554,11 @@ static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
>              merge = 0;
>          }
>  
> +        if (bs->bl.max_transfer_length && reqs[outidx].nb_sectors +
> +            reqs[i].nb_sectors > bs->bl.max_transfer_length) {
> +            merge = 0;
> +        }
> +
>          if (merge) {
>              size_t size;
>              QEMUIOVector *qiov = g_malloc0(sizeof(*qiov));
> 

So I think if we treat it just as a hint for multiwrite, we can avoid
writing code to split oversized requests.  They have always worked so far;
we can certainly wait until we have a real bug fix.

Can you drop patch 2 and resend the rest?

Thanks,

Paolo
Peter Lieven Sept. 18, 2014, 10:56 p.m. UTC | #2
On 18.09.2014 at 16:13, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Il 05/09/2014 18:51, Peter Lieven ha scritto:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> block.c |    5 +++++
>> 1 file changed, 5 insertions(+)
>> 
>> diff --git a/block.c b/block.c
>> index fa4c34b..db3f842 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -4554,6 +4554,11 @@ static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
>>             merge = 0;
>>         }
>> 
>> +        if (bs->bl.max_transfer_length && reqs[outidx].nb_sectors +
>> +            reqs[i].nb_sectors > bs->bl.max_transfer_length) {
>> +            merge = 0;
>> +        }
>> +
>>         if (merge) {
>>             size_t size;
>>             QEMUIOVector *qiov = g_malloc0(sizeof(*qiov));
>> 
> 
> So I think if we treat it just as a hint for multiwrite, we can avoid
> writing code to split oversized requests.  They always worked so far, we
> can certainly wait until we have a real bug fix.

I would not treat this as a hint. I would use it in cases where we definitely
know an absolute hard limit for I/O request size. Otherwise the value for
bs->bl.max_transfer_length should be 0.

If an oversized request comes in, we fail it as early as possible, and in the
multiwrite code we avoid accidentally generating an oversized request.
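
A sketch of what such an early check amounts to (Patch 2 itself is not shown
on this page, and the actual patch may look different):

if (bs->bl.max_transfer_length &&
    nb_sectors > bs->bl.max_transfer_length) {
    /* Sketch only: reject a request that can never satisfy the advertised
     * hard limit before it is sent down to the driver. */
    return -EINVAL;
}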

Peter
Paolo Bonzini Sept. 19, 2014, 1:33 p.m. UTC | #3
On 19/09/2014 00:56, Peter Lieven wrote:
>> > So I think if we treat it just as a hint for multiwrite, we can avoid
>> > writing code to split oversized requests.  They always worked so far, we
>> > can certainly wait until we have a real bug fix.
> I would not treat this as a hint. I would use it in cases where we definitely
> know an absolute hard limit for I/O request size. Otherwise the value for
> bs->bl.max_transfer_length should be 0.
> 
> If there comes in an oversized request we fail it as early as possible

That's the part that I'd rather not touch, at least not without doing
request splitting.

Paolo

> and regarding
> the multi write code we avoid that it accidentally generates an oversized request.
Peter Lieven Sept. 22, 2014, 9:43 a.m. UTC | #4
On 19.09.2014 15:33, Paolo Bonzini wrote:
> Il 19/09/2014 00:56, Peter Lieven ha scritto:
>>>> So I think if we treat it just as a hint for multiwrite, we can avoid
>>>> writing code to split oversized requests.  They always worked so far, we
>>>> can certainly wait until we have a real bug fix.
>> I would not treat this as a hint. I would use it in cases where we definitely
>> know an absolute hard limit for I/O request size. Otherwise the value for
>> bs->bl.max_transfer_length should be 0.
>>
>> If there comes in an oversized request we fail it as early as possible
> That's the part that I'd rather not touch, at least not without doing
> request splitting.

This series does not aim at touching default behaviour. The default for max_transfer_length
is 0 (no limit). max_transfer_length is a limit that MUST be satisfied, otherwise the request
will fail, and Patch 2 aims at catching this failure earlier in the stack. Currently, we only
have a limit for iSCSI. Without Patch 2 it would fail after we have sent the command to
the target. And without Patch 4 it may happen that multiwrite_merge runs into the limit.
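
For reference, advertising the limit boils down to a single assignment in the
driver's refresh_limits callback. A sketch for the iSCSI case
(max_xfer_len_sectors is a placeholder name, the real value would come from
the target's Block Limits VPD page):

/* Sketch only: tell the block layer about the hard per-request limit. */
if (max_xfer_len_sectors > 0) {
    bs->bl.max_transfer_length = max_xfer_len_sectors;
}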

Maybe I should adjust the description of max_transfer_length?

Peter
Paolo Bonzini Sept. 22, 2014, 7:06 p.m. UTC | #5
On 22/09/2014 11:43, Peter Lieven wrote:
> This series aims not at touching default behaviour. The default for max_transfer_length
> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
> will fail. And Patch 2 aims at catching this fail earlier in the stack.

Understood.  But the right fix is to avoid letting backend limits leak
into the guest ABI, not to catch the limits earlier.  So the right fix
would be to implement request splitting.

Since we never had a bug report about this, I'm not pushing to implement
splitting.  However, patch 2 is still wrong.

Patch 4 instead is fixing a real bug, so it's very much welcome.

Paolo

> Currently, we only
> have a limit for iSCSI. Without Patch 2 it would fail after we have send
> the command to
> the target. And without Patch 4 it may happen that multiwrite_merge
> traps the into the limit.
> 
> Maybe I should adjust the description of max_transfer_length?
Peter Lieven Sept. 23, 2014, 6:15 a.m. UTC | #6
On 22.09.2014 21:06, Paolo Bonzini wrote:
> Il 22/09/2014 11:43, Peter Lieven ha scritto:
>> This series aims not at touching default behaviour. The default for max_transfer_length
>> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
>> will fail. And Patch 2 aims at catching this fail earlier in the stack.
> Understood.  But the right fix is to avoid that backend limits transpire
> into guest ABI, not to catch the limits earlier.  So the right fix would
> be to implement request splitting.

Since you proposed to add traces for this, would you leave those in?
And since iSCSI is the only user of this at the moment, would you
go for implementing this check in the iSCSI block layer?

As for the split logic, do you think it is enough to build new qiovs
out of the too big iov without copying the contents? This would work
as long as a single iov inside the qiov is not bigger than max_transfer_length.
Otherwise I would need to allocate temporary buffers and copy around.

Peter

>
> Since we never had a bug report about this, I'm not pushing to implement
> splitting.  However, patch 2 is still wrong.
>
> Patch 4 instead is fixing a real bug, so it's very much welcome.
>
> Paolo
>
>> Currently, we only
>> have a limit for iSCSI. Without Patch 2 it would fail after we have send
>> the command to
>> the target. And without Patch 4 it may happen that multiwrite_merge
>> traps the into the limit.
>>
>> Maybe I should adjust the description of max_transfer_length?
Kevin Wolf Sept. 23, 2014, 8:59 a.m. UTC | #7
On 23.09.2014 at 08:15, Peter Lieven wrote:
> On 22.09.2014 21:06, Paolo Bonzini wrote:
> >Il 22/09/2014 11:43, Peter Lieven ha scritto:
> >>This series aims not at touching default behaviour. The default for max_transfer_length
> >>is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
> >>will fail. And Patch 2 aims at catching this fail earlier in the stack.
> >Understood.  But the right fix is to avoid that backend limits transpire
> >into guest ABI, not to catch the limits earlier.  So the right fix would
> >be to implement request splitting.
> 
> Since you proposed to add traces for this would you leave those in?
> And since iSCSI is the only user of this at the moment would you
> go for implementing this check in the iSCSI block layer?
> 
> As for the split logic would you think it is enough to build new qiov's
> out of the too big iov without copying the contents? This would work
> as long as a single iov inside the qiov is not bigger the max_transfer_length.
> Otherwise I would need to allocate temporary buffers and copy around.

You can split single iovs, too. There are functions that make this very
easy, they copy a sub-qiov with a byte granularity offset and length
(qemu_iovec_concat and friends). qcow2 uses the same to split requests
at (fragmented) cluster boundaries.
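
The call pattern is roughly this (just a sketch; src, soffset and sbytes are
placeholders for the source qiov and the byte range you want):

QEMUIOVector dst;

/* Sketch: append the byte range [soffset, soffset + sbytes) of src to dst.
 * Only the iovec entries are copied, never the data itself. */
qemu_iovec_init(&dst, src->niov);
qemu_iovec_concat(&dst, src, soffset, sbytes);
/* ... issue the I/O with &dst ... */
qemu_iovec_destroy(&dst);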

Kevin
Peter Lieven Sept. 23, 2014, 9:04 a.m. UTC | #8
On 23.09.2014 10:59, Kevin Wolf wrote:
> Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
>> On 22.09.2014 21:06, Paolo Bonzini wrote:
>>> Il 22/09/2014 11:43, Peter Lieven ha scritto:
>>>> This series aims not at touching default behaviour. The default for max_transfer_length
>>>> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
>>>> will fail. And Patch 2 aims at catching this fail earlier in the stack.
>>> Understood.  But the right fix is to avoid that backend limits transpire
>>> into guest ABI, not to catch the limits earlier.  So the right fix would
>>> be to implement request splitting.
>> Since you proposed to add traces for this would you leave those in?
>> And since iSCSI is the only user of this at the moment would you
>> go for implementing this check in the iSCSI block layer?
>>
>> As for the split logic would you think it is enough to build new qiov's
>> out of the too big iov without copying the contents? This would work
>> as long as a single iov inside the qiov is not bigger the max_transfer_length.
>> Otherwise I would need to allocate temporary buffers and copy around.
> You can split single iovs, too. There are functions that make this very
> easy, they copy a sub-qiov with a byte granularity offset and length
> (qemu_iovec_concat and friends). qcow2 uses the same to split requests
> at (fragmented) cluster boundaries.

Thanks for that pointer. Looking at qemu_iovec_concat, this seems to be a
no-brainer.

I will first send an updated series dropping Patch 2 and will then send a follow-up
that splits requests.

Peter
Peter Lieven Sept. 23, 2014, 9:32 a.m. UTC | #9
On 23.09.2014 10:59, Kevin Wolf wrote:
> Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
>> On 22.09.2014 21:06, Paolo Bonzini wrote:
>>> Il 22/09/2014 11:43, Peter Lieven ha scritto:
>>>> This series aims not at touching default behaviour. The default for max_transfer_length
>>>> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
>>>> will fail. And Patch 2 aims at catching this fail earlier in the stack.
>>> Understood.  But the right fix is to avoid that backend limits transpire
>>> into guest ABI, not to catch the limits earlier.  So the right fix would
>>> be to implement request splitting.
>> Since you proposed to add traces for this would you leave those in?
>> And since iSCSI is the only user of this at the moment would you
>> go for implementing this check in the iSCSI block layer?
>>
>> As for the split logic would you think it is enough to build new qiov's
>> out of the too big iov without copying the contents? This would work
>> as long as a single iov inside the qiov is not bigger the max_transfer_length.
>> Otherwise I would need to allocate temporary buffers and copy around.
> You can split single iovs, too. There are functions that make this very
> easy, they copy a sub-qiov with a byte granularity offset and length
> (qemu_iovec_concat and friends). qcow2 uses the same to split requests
> at (fragmented) cluster boundaries.

Might it be as easy as this?

static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
     BdrvRequestFlags flags)
{
     if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
         return -EINVAL;
     }

     if (bs->bl.max_transfer_length &&
         nb_sectors > bs->bl.max_transfer_length) {
         int ret = 0;
         QEMUIOVector *qiov2 = NULL;
         size_t soffset = 0;

         trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
bs->bl.max_transfer_length);

         qemu_iovec_init(qiov2, qiov->niov);
         while (nb_sectors > bs->bl.max_transfer_length && !ret) {
             qemu_iovec_reset(qiov2);
             qemu_iovec_concat(qiov2, qiov, soffset,
                               bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
             ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
                                     bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
                                     qiov2, flags);
             soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
             sector_num += bs->bl.max_transfer_length;
             nb_sectors -= bs->bl.max_transfer_length;
         }
         qemu_iovec_destroy(qiov2);
         if (ret) {
             return ret;
         }
     }

     return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
}

>
> Kevin
Kevin Wolf Sept. 23, 2014, 9:47 a.m. UTC | #10
On 23.09.2014 at 11:32, Peter Lieven wrote:
> On 23.09.2014 10:59, Kevin Wolf wrote:
> >Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
> >>On 22.09.2014 21:06, Paolo Bonzini wrote:
> >>>Il 22/09/2014 11:43, Peter Lieven ha scritto:
> >>>>This series aims not at touching default behaviour. The default for max_transfer_length
> >>>>is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
> >>>>will fail. And Patch 2 aims at catching this fail earlier in the stack.
> >>>Understood.  But the right fix is to avoid that backend limits transpire
> >>>into guest ABI, not to catch the limits earlier.  So the right fix would
> >>>be to implement request splitting.
> >>Since you proposed to add traces for this would you leave those in?
> >>And since iSCSI is the only user of this at the moment would you
> >>go for implementing this check in the iSCSI block layer?
> >>
> >>As for the split logic would you think it is enough to build new qiov's
> >>out of the too big iov without copying the contents? This would work
> >>as long as a single iov inside the qiov is not bigger the max_transfer_length.
> >>Otherwise I would need to allocate temporary buffers and copy around.
> >You can split single iovs, too. There are functions that make this very
> >easy, they copy a sub-qiov with a byte granularity offset and length
> >(qemu_iovec_concat and friends). qcow2 uses the same to split requests
> >at (fragmented) cluster boundaries.
> 
> Might it be as easy as this?

This is completely untested, right? :-)

But ignoring bugs, the principle looks right.

> static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>     BdrvRequestFlags flags)
> {
>     if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
>         return -EINVAL;
>     }
> 
>     if (bs->bl.max_transfer_length &&
>         nb_sectors > bs->bl.max_transfer_length) {
>         int ret = 0;
>         QEMUIOVector *qiov2 = NULL;

Make it "QEMUIOVector qiov2;" on the stack.

>         size_t soffset = 0;
> 
>         trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
> bs->bl.max_transfer_length);
> 
>         qemu_iovec_init(qiov2, qiov->niov);

And &qiov2 here, then this doesn't crash with a NULL dereference.

>         while (nb_sectors > bs->bl.max_transfer_length && !ret) {
>             qemu_iovec_reset(qiov2);
>             qemu_iovec_concat(qiov2, qiov, soffset,
>                               bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
>             ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>                                     bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
>                                     qiov2, flags);
>             soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
>             sector_num += bs->bl.max_transfer_length;
>             nb_sectors -= bs->bl.max_transfer_length;
>         }
>         qemu_iovec_destroy(qiov2);
>         if (ret) {
>             return ret;
>         }

The error check needs to be immediately after the assignment of ret,
otherwise the next loop iteration can overwrite an error with a success
(and if it didn't, it would still do useless I/O because the request as
a whole would fail anyway).

>     }
> 
>     return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>                              nb_sectors << BDRV_SECTOR_BITS, qiov, flags);

qiov doesn't work here for the splitting case. You need the remaining
part, not the whole original qiov.
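
For the tail that means building one more sub-qiov, e.g. (untested sketch,
assuming qiov2 is on the stack as suggested above and the
qemu_iovec_destroy() call is moved down here):

/* Sketch: after the loop, soffset and nb_sectors describe the remaining
 * part of the original request; issue it through one last sub-qiov. */
qemu_iovec_reset(&qiov2);
qemu_iovec_concat(&qiov2, qiov, soffset, nb_sectors << BDRV_SECTOR_BITS);
ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
                        nb_sectors << BDRV_SECTOR_BITS, &qiov2, flags);
qemu_iovec_destroy(&qiov2);
return ret;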

Kevin
Peter Lieven Sept. 23, 2014, 9:52 a.m. UTC | #11
On 23.09.2014 11:47, Kevin Wolf wrote:
> Am 23.09.2014 um 11:32 hat Peter Lieven geschrieben:
>> On 23.09.2014 10:59, Kevin Wolf wrote:
>>> Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
>>>> On 22.09.2014 21:06, Paolo Bonzini wrote:
>>>>> Il 22/09/2014 11:43, Peter Lieven ha scritto:
>>>>>> This series aims not at touching default behaviour. The default for max_transfer_length
>>>>>> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
>>>>>> will fail. And Patch 2 aims at catching this fail earlier in the stack.
>>>>> Understood.  But the right fix is to avoid that backend limits transpire
>>>>> into guest ABI, not to catch the limits earlier.  So the right fix would
>>>>> be to implement request splitting.
>>>> Since you proposed to add traces for this would you leave those in?
>>>> And since iSCSI is the only user of this at the moment would you
>>>> go for implementing this check in the iSCSI block layer?
>>>>
>>>> As for the split logic would you think it is enough to build new qiov's
>>>> out of the too big iov without copying the contents? This would work
>>>> as long as a single iov inside the qiov is not bigger the max_transfer_length.
>>>> Otherwise I would need to allocate temporary buffers and copy around.
>>> You can split single iovs, too. There are functions that make this very
>>> easy, they copy a sub-qiov with a byte granularity offset and length
>>> (qemu_iovec_concat and friends). qcow2 uses the same to split requests
>>> at (fragmented) cluster boundaries.
>> Might it be as easy as this?
> This is completely untested, right? :-)

Yes :-)
I was just unsure if the general approach is right.

>
> But ignoring bugs, the principle looks right.
>
>> static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>>      int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>      BdrvRequestFlags flags)
>> {
>>      if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
>>          return -EINVAL;
>>      }
>>
>>      if (bs->bl.max_transfer_length &&
>>          nb_sectors > bs->bl.max_transfer_length) {
>>          int ret = 0;
>>          QEMUIOVector *qiov2 = NULL;
> Make it "QEMUIOVector qiov2;" on the stack.
>
>>          size_t soffset = 0;
>>
>>          trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
>> bs->bl.max_transfer_length);
>>
>>          qemu_iovec_init(qiov2, qiov->niov);
> And &qiov2 here, then this doesn't crash with a NULL dereference.
>
>>          while (nb_sectors > bs->bl.max_transfer_length && !ret) {
>>              qemu_iovec_reset(qiov2);
>>              qemu_iovec_concat(qiov2, qiov, soffset,
>>                                bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
>>              ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>>                                      bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
>>                                      qiov2, flags);
>>              soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
>>              sector_num += bs->bl.max_transfer_length;
>>              nb_sectors -= bs->bl.max_transfer_length;
>>          }
>>          qemu_iovec_destroy(qiov2);
>>          if (ret) {
>>              return ret;
>>          }
> The error check needs to be immediately after the assignment of ret,
> otherwise the next loop iteration can overwrite an error with a success
> (and if it didn't, it would still do useless I/O because the request as
> a whole would fail anyway).

There is a && !ret in the loop condition. I wanted to avoid copying the destroy part.

BTW, is it !ret or ret < 0 ?

>
>>      }
>>
>>      return bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>>                               nb_sectors << BDRV_SECTOR_BITS, qiov, flags);
> qiov doesn't work here for the splitting case. You need the remaining
> part, not the whole original qiov.

Right ;-)

Peter
Kevin Wolf Sept. 23, 2014, 10:05 a.m. UTC | #12
On 23.09.2014 at 11:52, Peter Lieven wrote:
> On 23.09.2014 11:47, Kevin Wolf wrote:
> >Am 23.09.2014 um 11:32 hat Peter Lieven geschrieben:
> >>On 23.09.2014 10:59, Kevin Wolf wrote:
> >>>Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
> >>>>On 22.09.2014 21:06, Paolo Bonzini wrote:
> >>>>>Il 22/09/2014 11:43, Peter Lieven ha scritto:
> >>>>>>This series aims not at touching default behaviour. The default for max_transfer_length
> >>>>>>is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
> >>>>>>will fail. And Patch 2 aims at catching this fail earlier in the stack.
> >>>>>Understood.  But the right fix is to avoid that backend limits transpire
> >>>>>into guest ABI, not to catch the limits earlier.  So the right fix would
> >>>>>be to implement request splitting.
> >>>>Since you proposed to add traces for this would you leave those in?
> >>>>And since iSCSI is the only user of this at the moment would you
> >>>>go for implementing this check in the iSCSI block layer?
> >>>>
> >>>>As for the split logic would you think it is enough to build new qiov's
> >>>>out of the too big iov without copying the contents? This would work
> >>>>as long as a single iov inside the qiov is not bigger the max_transfer_length.
> >>>>Otherwise I would need to allocate temporary buffers and copy around.
> >>>You can split single iovs, too. There are functions that make this very
> >>>easy, they copy a sub-qiov with a byte granularity offset and length
> >>>(qemu_iovec_concat and friends). qcow2 uses the same to split requests
> >>>at (fragmented) cluster boundaries.
> >>Might it be as easy as this?
> >This is completely untested, right? :-)
> 
> Yes :-)
> I was just unsure if the general approach is right.

Looks alright to me.

> >But ignoring bugs, the principle looks right.
> >
> >>static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> >>     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >>     BdrvRequestFlags flags)
> >>{
> >>     if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
> >>         return -EINVAL;
> >>     }
> >>
> >>     if (bs->bl.max_transfer_length &&
> >>         nb_sectors > bs->bl.max_transfer_length) {
> >>         int ret = 0;
> >>         QEMUIOVector *qiov2 = NULL;
> >Make it "QEMUIOVector qiov2;" on the stack.
> >
> >>         size_t soffset = 0;
> >>
> >>         trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
> >>bs->bl.max_transfer_length);
> >>
> >>         qemu_iovec_init(qiov2, qiov->niov);
> >And &qiov2 here, then this doesn't crash with a NULL dereference.
> >
> >>         while (nb_sectors > bs->bl.max_transfer_length && !ret) {
> >>             qemu_iovec_reset(qiov2);
> >>             qemu_iovec_concat(qiov2, qiov, soffset,
> >>                               bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
> >>             ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
> >>                                     bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
> >>                                     qiov2, flags);
> >>             soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
> >>             sector_num += bs->bl.max_transfer_length;
> >>             nb_sectors -= bs->bl.max_transfer_length;
> >>         }
> >>         qemu_iovec_destroy(qiov2);
> >>         if (ret) {
> >>             return ret;
> >>         }
> >The error check needs to be immediately after the assignment of ret,
> >otherwise the next loop iteration can overwrite an error with a success
> >(and if it didn't, it would still do useless I/O because the request as
> >a whole would fail anyway).
> 
> There is a && !ret in the loop condition. I wanted to avoid copying the destroy part.

Ah, yes, clever. I missed that. Maybe too clever then. ;-)

> BTW, is it !ret or ret < 0 ?

It only ever returns 0 or negative, so both are equivalent. I
prefer checks for ret < 0, but that's a matter of style rather than
correctness.

Another problem I just noticed is that this is not the only caller of
bdrv_co_do_preadv(), so you're not splitting all requests. The
synchronous bdrv_read/write/pread/pwrite/pwritev functions all don't get
the functionality this way.

Perhaps you should be doing it inside bdrv_co_do_preadv(), before the
call to bdrv_aligned_preadv(). Might even be more correct if it can
happen that the alignment adjustment increases a request too much to fit
in bl.max_transfer_length.

Kevin
Peter Lieven Sept. 30, 2014, 7:26 a.m. UTC | #13
On 23.09.2014 12:05, Kevin Wolf wrote:
> Am 23.09.2014 um 11:52 hat Peter Lieven geschrieben:
>> On 23.09.2014 11:47, Kevin Wolf wrote:
>>> Am 23.09.2014 um 11:32 hat Peter Lieven geschrieben:
>>>> On 23.09.2014 10:59, Kevin Wolf wrote:
>>>>> Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
>>>>>> On 22.09.2014 21:06, Paolo Bonzini wrote:
>>>>>>> Il 22/09/2014 11:43, Peter Lieven ha scritto:
>>>>>>>> This series aims not at touching default behaviour. The default for max_transfer_length
>>>>>>>> is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
>>>>>>>> will fail. And Patch 2 aims at catching this fail earlier in the stack.
>>>>>>> Understood.  But the right fix is to avoid that backend limits transpire
>>>>>>> into guest ABI, not to catch the limits earlier.  So the right fix would
>>>>>>> be to implement request splitting.
>>>>>> Since you proposed to add traces for this would you leave those in?
>>>>>> And since iSCSI is the only user of this at the moment would you
>>>>>> go for implementing this check in the iSCSI block layer?
>>>>>>
>>>>>> As for the split logic would you think it is enough to build new qiov's
>>>>>> out of the too big iov without copying the contents? This would work
>>>>>> as long as a single iov inside the qiov is not bigger the max_transfer_length.
>>>>>> Otherwise I would need to allocate temporary buffers and copy around.
>>>>> You can split single iovs, too. There are functions that make this very
>>>>> easy, they copy a sub-qiov with a byte granularity offset and length
>>>>> (qemu_iovec_concat and friends). qcow2 uses the same to split requests
>>>>> at (fragmented) cluster boundaries.
>>>> Might it be as easy as this?
>>> This is completely untested, right? :-)
>> Yes :-)
>> I was just unsure if the general approach is right.
> Looks alright to me.
>
>>> But ignoring bugs, the principle looks right.
>>>
>>>> static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
>>>>      int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>      BdrvRequestFlags flags)
>>>> {
>>>>      if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
>>>>          return -EINVAL;
>>>>      }
>>>>
>>>>      if (bs->bl.max_transfer_length &&
>>>>          nb_sectors > bs->bl.max_transfer_length) {
>>>>          int ret = 0;
>>>>          QEMUIOVector *qiov2 = NULL;
>>> Make it "QEMUIOVector qiov2;" on the stack.
>>>
>>>>          size_t soffset = 0;
>>>>
>>>>          trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
>>>> bs->bl.max_transfer_length);
>>>>
>>>>          qemu_iovec_init(qiov2, qiov->niov);
>>> And &qiov2 here, then this doesn't crash with a NULL dereference.
>>>
>>>>          while (nb_sectors > bs->bl.max_transfer_length && !ret) {
>>>>              qemu_iovec_reset(qiov2);
>>>>              qemu_iovec_concat(qiov2, qiov, soffset,
>>>>                                bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
>>>>              ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
>>>>                                      bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
>>>>                                      qiov2, flags);
>>>>              soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
>>>>              sector_num += bs->bl.max_transfer_length;
>>>>              nb_sectors -= bs->bl.max_transfer_length;
>>>>          }
>>>>          qemu_iovec_destroy(qiov2);
>>>>          if (ret) {
>>>>              return ret;
>>>>          }
>>> The error check needs to be immediately after the assignment of ret,
>>> otherwise the next loop iteration can overwrite an error with a success
>>> (and if it didn't, it would still do useless I/O because the request as
>>> a whole would fail anyway).
>> There is a && !ret in the loop condition. I wanted to avoid copying the destroy part.
> Ah, yes, clever. I missed that. Maybe too clever then. ;-)
>
>> BTW, is it !ret or ret < 0 ?
> It only ever returns 0 or negative, so both are equivalent. I
> prefer checks for ret < 0, but that's a matter of style rather than
> correctness.
>
> Another problem I just noticed is that this is not the only caller of
> bdrv_co_do_preadv(), so you're not splitting all requests. The
> synchronous bdrv_read/write/pread/pwrite/pwritev functions all don't get
> the functionality this way.
>
> Perhaps you should be doing it inside bdrv_co_do_preadv(), before the
> call to bdrv_aligned_preadv(). Might even be more correct if it can
> happen that the alignment adjustment increases a request too much to fit
> in bl.max_transfer_length.

If I do it this way, can I use the same req object for all split
requests?

Peter
Kevin Wolf Sept. 30, 2014, 8:03 a.m. UTC | #14
On 30.09.2014 at 09:26, Peter Lieven wrote:
> On 23.09.2014 12:05, Kevin Wolf wrote:
> >Am 23.09.2014 um 11:52 hat Peter Lieven geschrieben:
> >>On 23.09.2014 11:47, Kevin Wolf wrote:
> >>>Am 23.09.2014 um 11:32 hat Peter Lieven geschrieben:
> >>>>On 23.09.2014 10:59, Kevin Wolf wrote:
> >>>>>Am 23.09.2014 um 08:15 hat Peter Lieven geschrieben:
> >>>>>>On 22.09.2014 21:06, Paolo Bonzini wrote:
> >>>>>>>Il 22/09/2014 11:43, Peter Lieven ha scritto:
> >>>>>>>>This series aims not at touching default behaviour. The default for max_transfer_length
> >>>>>>>>is 0 (no limit). max_transfer_length is a limit that MUST be satisfied otherwise the request
> >>>>>>>>will fail. And Patch 2 aims at catching this fail earlier in the stack.
> >>>>>>>Understood.  But the right fix is to avoid that backend limits transpire
> >>>>>>>into guest ABI, not to catch the limits earlier.  So the right fix would
> >>>>>>>be to implement request splitting.
> >>>>>>Since you proposed to add traces for this would you leave those in?
> >>>>>>And since iSCSI is the only user of this at the moment would you
> >>>>>>go for implementing this check in the iSCSI block layer?
> >>>>>>
> >>>>>>As for the split logic would you think it is enough to build new qiov's
> >>>>>>out of the too big iov without copying the contents? This would work
> >>>>>>as long as a single iov inside the qiov is not bigger the max_transfer_length.
> >>>>>>Otherwise I would need to allocate temporary buffers and copy around.
> >>>>>You can split single iovs, too. There are functions that make this very
> >>>>>easy, they copy a sub-qiov with a byte granularity offset and length
> >>>>>(qemu_iovec_concat and friends). qcow2 uses the same to split requests
> >>>>>at (fragmented) cluster boundaries.
> >>>>Might it be as easy as this?
> >>>This is completely untested, right? :-)
> >>Yes :-)
> >>I was just unsure if the general approach is right.
> >Looks alright to me.
> >
> >>>But ignoring bugs, the principle looks right.
> >>>
> >>>>static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
> >>>>     int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >>>>     BdrvRequestFlags flags)
> >>>>{
> >>>>     if (nb_sectors < 0 || nb_sectors > (UINT_MAX >> BDRV_SECTOR_BITS)) {
> >>>>         return -EINVAL;
> >>>>     }
> >>>>
> >>>>     if (bs->bl.max_transfer_length &&
> >>>>         nb_sectors > bs->bl.max_transfer_length) {
> >>>>         int ret = 0;
> >>>>         QEMUIOVector *qiov2 = NULL;
> >>>Make it "QEMUIOVector qiov2;" on the stack.
> >>>
> >>>>         size_t soffset = 0;
> >>>>
> >>>>         trace_bdrv_co_do_readv_toobig(bs, sector_num, nb_sectors,
> >>>>bs->bl.max_transfer_length);
> >>>>
> >>>>         qemu_iovec_init(qiov2, qiov->niov);
> >>>And &qiov2 here, then this doesn't crash with a NULL dereference.
> >>>
> >>>>         while (nb_sectors > bs->bl.max_transfer_length && !ret) {
> >>>>             qemu_iovec_reset(qiov2);
> >>>>             qemu_iovec_concat(qiov2, qiov, soffset,
> >>>>                               bs->bl.max_transfer_length << BDRV_SECTOR_BITS);
> >>>>             ret = bdrv_co_do_preadv(bs, sector_num << BDRV_SECTOR_BITS,
> >>>>                                     bs->bl.max_transfer_length << BDRV_SECTOR_BITS,
> >>>>                                     qiov2, flags);
> >>>>             soffset += bs->bl.max_transfer_length << BDRV_SECTOR_BITS;
> >>>>             sector_num += bs->bl.max_transfer_length;
> >>>>             nb_sectors -= bs->bl.max_transfer_length;
> >>>>         }
> >>>>         qemu_iovec_destroy(qiov2);
> >>>>         if (ret) {
> >>>>             return ret;
> >>>>         }
> >>>The error check needs to be immediately after the assignment of ret,
> >>>otherwise the next loop iteration can overwrite an error with a success
> >>>(and if it didn't, it would still do useless I/O because the request as
> >>>a whole would fail anyway).
> >>There is a && !ret in the loop condition. I wanted to avoid copying the destroy part.
> >Ah, yes, clever. I missed that. Maybe too clever then. ;-)
> >
> >>BTW, is it !ret or ret < 0 ?
> >It only ever returns 0 or negative, so both are equivalent. I
> >prefer checks for ret < 0, but that's a matter of style rather than
> >correctness.
> >
> >Another problem I just noticed is that this is not the only caller of
> >bdrv_co_do_preadv(), so you're not splitting all requests. The
> >synchronous bdrv_read/write/pread/pwrite/pwritev functions all don't get
> >the functionality this way.
> >
> >Perhaps you should be doing it inside bdrv_co_do_preadv(), before the
> >call to bdrv_aligned_preadv(). Might even be more correct if it can
> >happen that the alignment adjustment increases a request too much to fit
> >in bl.max_transfer_length.
> 
> If I do it this way can I use the same req Object for all splitted
> requests?

That's a good question. I think as long as you process the parts of the
split request one after another, reusing the same req object should be
safe. If you were to process them in parallel, though, I wouldn't be as
sure about it (which you probably don't want because it complicates
things :-)).

The most obviously correct way to handle things would probably be to
have one tracked_request_begin/end() for the whole request and then call
bdrv_aligned_preadv() multiple times in between. Otherwise you'd have to
serialise each part individually, etc.
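
In rough pseudocode (a sketch only; the chunking variables are made up and
the helper signatures are from memory, so double-check against block.c):

/* Sketch: one tracked request for the whole guest request, the I/O issued
 * in max_transfer_length sized chunks via bdrv_aligned_preadv(). */
tracked_request_begin(&req, bs, offset, bytes, false);

while (bytes_remaining) {
    unsigned int num = MIN(bytes_remaining, max_bytes); /* limit in bytes */

    qemu_iovec_reset(&hd_qiov);
    qemu_iovec_concat(&hd_qiov, qiov, bytes_done, num);

    ret = bdrv_aligned_preadv(bs, &req, offset + bytes_done, num,
                              align, &hd_qiov, flags);
    if (ret < 0) {
        break;
    }

    bytes_done += num;
    bytes_remaining -= num;
}

tracked_request_end(&req);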

Kevin

Patch

diff --git a/block.c b/block.c
index fa4c34b..db3f842 100644
--- a/block.c
+++ b/block.c
@@ -4554,6 +4554,11 @@  static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
             merge = 0;
         }
 
+        if (bs->bl.max_transfer_length && reqs[outidx].nb_sectors +
+            reqs[i].nb_sectors > bs->bl.max_transfer_length) {
+            merge = 0;
+        }
+
         if (merge) {
             size_t size;
             QEMUIOVector *qiov = g_malloc0(sizeof(*qiov));