diff mbox

increase BlockConf.min_io_size type from uint16_t to uint32_t

Message ID 1335801160-321-1-git-send-email-mjt@msgid.tls.msk.ru
State New
Headers show

Commit Message

Michael Tokarev April 30, 2012, 3:52 p.m. UTC
This value is used currently for virtio-blk only.  It was defined
as uint16_t before, which is the same as in kernel<=>user interface
(in virtio_blk.h, struct virtio_blk_config).  But the problem is
that in kernel<=>user interface the units are sectors (which is
usually 512 bytes or more), while in qemu it is in bytes.  However,
for, say, md raid5 arrays, it is typical to have actual min_io_size
of a host device to be large.  For example, for raid5 device of
3 drives with 64Kb chunk size, that value will be 128Kb, which does
not fit in a uint16_t anymore.

Increase the value size from 16bits to 32bits for now.

But apparently, the kernel<=>user interface needs to be fixed too
(while it is much more difficult due to compatibility issues),
because even with 512byte units, the 16bit value there will overflow
too with quite normal MD RAID configuration.

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
---
 block.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Stefan Hajnoczi May 1, 2012, 8:27 a.m. UTC | #1
On Mon, Apr 30, 2012 at 4:52 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
> This value is used currently for virtio-blk only.  It was defined
> as uint16_t before, which is the same as in kernel<=>user interface
> (in virtio_blk.h, struct virtio_blk_config).  But the problem is
> that in kernel<=>user interface the units are sectors (which is
> usually 512 bytes or more), while in qemu it is in bytes.  However,
> for, say, md raid5 arrays, it is typical to have actual min_io_size
> of a host device to be large.  For example, for raid5 device of
> 3 drives with 64Kb chunk size, that value will be 128Kb, which does
> not fit in a uint16_t anymore.
>
> Increase the value size from 16bits to 32bits for now.
>
> But apparently, the kernel<=>user interface needs to be fixed too
> (while it is much more difficult due to compatibility issues),
> because even with 512byte units, the 16bit value there will overflow
> too with quite normal MD RAID configuration.
>
> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
> ---
>  block.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Which kernel<=>user interface?

Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Michael Tokarev May 1, 2012, 8:43 a.m. UTC | #2
On 01.05.2012 12:27, Stefan Hajnoczi wrote:
> On Mon, Apr 30, 2012 at 4:52 PM, Michael Tokarev <mjt@tls.msk.ru> wrote:
>> This value is used currently for virtio-blk only.  It was defined
>> as uint16_t before, which is the same as in kernel<=>user interface
>> (in virtio_blk.h, struct virtio_blk_config).  But the problem is
>> that in kernel<=>user interface the units are sectors (which is
>> usually 512 bytes or more), while in qemu it is in bytes.  However,
>> for, say, md raid5 arrays, it is typical to have actual min_io_size
>> of a host device to be large.  For example, for raid5 device of
>> 3 drives with 64Kb chunk size, that value will be 128Kb, which does
>> not fit in a uint16_t anymore.
>>
>> Increase the value size from 16bits to 32bits for now.
>>
>> But apparently, the kernel<=>user interface needs to be fixed too
>> (while it is much more difficult due to compatibility issues),
>> because even with 512byte units, the 16bit value there will overflow
>> too with quite normal MD RAID configuration.
>>
>> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
>> ---
>>  block.h |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> Which kernel<=>user interface?

struct virtio_blk_config in virtio_blk.h, which is used to
pass information about block device from (qemu) userspace
to guest kernel.

Besides, it appears that at least minimum_io_size is not used
anywhere in the kernel, -- so, for example, filesystems does
not directly benefit from seeing this information.  But mkfs.xfs
do use it.

> Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Thanks,

/mjt
Kevin Wolf May 2, 2012, 9:57 a.m. UTC | #3
Am 30.04.2012 17:52, schrieb Michael Tokarev:
> This value is used currently for virtio-blk only.  It was defined
> as uint16_t before, which is the same as in kernel<=>user interface
> (in virtio_blk.h, struct virtio_blk_config).  But the problem is
> that in kernel<=>user interface the units are sectors (which is
> usually 512 bytes or more), while in qemu it is in bytes.  However,
> for, say, md raid5 arrays, it is typical to have actual min_io_size
> of a host device to be large.  For example, for raid5 device of
> 3 drives with 64Kb chunk size, that value will be 128Kb, which does
> not fit in a uint16_t anymore.
> 
> Increase the value size from 16bits to 32bits for now.
> 
> But apparently, the kernel<=>user interface needs to be fixed too
> (while it is much more difficult due to compatibility issues),
> because even with 512byte units, the 16bit value there will overflow
> too with quite normal MD RAID configuration.
> 
> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
> ---
>  block.h |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block.h b/block.h
> index f163e54..cd5ae79 100644
> --- a/block.h
> +++ b/block.h
> @@ -425,7 +425,7 @@ typedef struct BlockConf {
>      BlockDriverState *bs;
>      uint16_t physical_block_size;
>      uint16_t logical_block_size;
> -    uint16_t min_io_size;
> +    uint32_t min_io_size;
>      uint32_t opt_io_size;
>      int32_t bootindex;
>      uint32_t discard_granularity;
> @@ -450,7 +450,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
>                            _conf.logical_block_size, 512),               \
>      DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,                \
>                            _conf.physical_block_size, 512),              \
> -    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
> +    DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),  \
>      DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
>      DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
>      DEFINE_PROP_UINT32("discard_granularity", _state, \

Don't you need an additional check in virtio-blk now so that you don't
overflow the 16 bit field in the virtio protocol? And I guess the same
for SCSI, where INQUIRY reports only a 16 bits sector count as well.

Kevin
Michael Tokarev May 2, 2012, 10:08 a.m. UTC | #4
02.05.2012 13:57, Kevin Wolf пишет:
> Am 30.04.2012 17:52, schrieb Michael Tokarev:
>> This value is used currently for virtio-blk only.  It was defined
>> as uint16_t before, which is the same as in kernel<=>user interface
>> (in virtio_blk.h, struct virtio_blk_config).  But the problem is
>> that in kernel<=>user interface the units are sectors (which is
>> usually 512 bytes or more), while in qemu it is in bytes.  However,
>> for, say, md raid5 arrays, it is typical to have actual min_io_size
>> of a host device to be large.  For example, for raid5 device of
>> 3 drives with 64Kb chunk size, that value will be 128Kb, which does
>> not fit in a uint16_t anymore.
>>
>> Increase the value size from 16bits to 32bits for now.
>>
>> But apparently, the kernel<=>user interface needs to be fixed too
>> (while it is much more difficult due to compatibility issues),
>> because even with 512byte units, the 16bit value there will overflow
>> too with quite normal MD RAID configuration.
>>
>> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
>> ---
>>  block.h |    4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block.h b/block.h
>> index f163e54..cd5ae79 100644
>> --- a/block.h
>> +++ b/block.h
>> @@ -425,7 +425,7 @@ typedef struct BlockConf {
>>      BlockDriverState *bs;
>>      uint16_t physical_block_size;
>>      uint16_t logical_block_size;
>> -    uint16_t min_io_size;
>> +    uint32_t min_io_size;
>>      uint32_t opt_io_size;
>>      int32_t bootindex;
>>      uint32_t discard_granularity;
>> @@ -450,7 +450,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
>>                            _conf.logical_block_size, 512),               \
>>      DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,                \
>>                            _conf.physical_block_size, 512),              \
>> -    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
>> +    DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),  \
>>      DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
>>      DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
>>      DEFINE_PROP_UINT32("discard_granularity", _state, \
> 
> Don't you need an additional check in virtio-blk now so that you don't
> overflow the 16 bit field in the virtio protocol? And I guess the same
> for SCSI, where INQUIRY reports only a 16 bits sector count as well.

That probably should be added too, but I'm not sure I agree these should be
added into virtio-blk.

As I already mentioned, the virtio protocol has the same defect (but there
it is less serious due to usage of larger units).  And that's where the
additional overflow needs to be ELIMINATED, not just checked.  Ie, the
protocol should be changed somehow - the only issue is that I don't know
how to do that now, once it has been in use for quite some time.

Note that now, we don't have ANY checks of this theme whatsoever, at all:
I tried using 128K as min_io_size, and the guest sees it as 0 currently, --
the limits (at least the fact that the value fits in the defined number
of bits) should be checked in the common function which implements these
DEFINE_PROP_UINT* defines.

I'm not sure I can fix all of this in one go, so I went on and fixed the
most specific bug first, without any additional bad side effects.

Besides, as I also already mentioned, this whole structure is used in
virtio-blk *only*, so only virtio driver passes this information into
guest, scsi does not use it.  And again, in case of scsi, the units are
*sectors*, not bytes.

So maybe the solution is to keep this as 16bits but switch to sectors.
But for this, DEFINE_PROP_UINT can't be used anymore, unless we agree
to change user interface TOO and require the user to specify these
values in sectors to start with.

And yes, if SCSI uses 16bit value here, it will have the same issue
as virtio currently have -- pretty normal MD RAID array can't be
expressed using 16bit number of sectors already...  Oh well, but we
at least have a (small) chance to fix virtio.

Thanks,

/mjt
Kevin Wolf May 2, 2012, 2:35 p.m. UTC | #5
Am 02.05.2012 12:08, schrieb Michael Tokarev:
> 02.05.2012 13:57, Kevin Wolf пишет:
>> Am 30.04.2012 17:52, schrieb Michael Tokarev:
>>> This value is used currently for virtio-blk only.  It was defined
>>> as uint16_t before, which is the same as in kernel<=>user interface
>>> (in virtio_blk.h, struct virtio_blk_config).  But the problem is
>>> that in kernel<=>user interface the units are sectors (which is
>>> usually 512 bytes or more), while in qemu it is in bytes.  However,
>>> for, say, md raid5 arrays, it is typical to have actual min_io_size
>>> of a host device to be large.  For example, for raid5 device of
>>> 3 drives with 64Kb chunk size, that value will be 128Kb, which does
>>> not fit in a uint16_t anymore.
>>>
>>> Increase the value size from 16bits to 32bits for now.
>>>
>>> But apparently, the kernel<=>user interface needs to be fixed too
>>> (while it is much more difficult due to compatibility issues),
>>> because even with 512byte units, the 16bit value there will overflow
>>> too with quite normal MD RAID configuration.
>>>
>>> Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
>>> ---
>>>  block.h |    4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block.h b/block.h
>>> index f163e54..cd5ae79 100644
>>> --- a/block.h
>>> +++ b/block.h
>>> @@ -425,7 +425,7 @@ typedef struct BlockConf {
>>>      BlockDriverState *bs;
>>>      uint16_t physical_block_size;
>>>      uint16_t logical_block_size;
>>> -    uint16_t min_io_size;
>>> +    uint32_t min_io_size;
>>>      uint32_t opt_io_size;
>>>      int32_t bootindex;
>>>      uint32_t discard_granularity;
>>> @@ -450,7 +450,7 @@ static inline unsigned int get_physical_block_exp(BlockConf *conf)
>>>                            _conf.logical_block_size, 512),               \
>>>      DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,                \
>>>                            _conf.physical_block_size, 512),              \
>>> -    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
>>> +    DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),  \
>>>      DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
>>>      DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
>>>      DEFINE_PROP_UINT32("discard_granularity", _state, \
>>
>> Don't you need an additional check in virtio-blk now so that you don't
>> overflow the 16 bit field in the virtio protocol? And I guess the same
>> for SCSI, where INQUIRY reports only a 16 bits sector count as well.
> 
> That probably should be added too, but I'm not sure I agree these should be
> added into virtio-blk.
> 
> As I already mentioned, the virtio protocol has the same defect (but there
> it is less serious due to usage of larger units).  And that's where the
> additional overflow needs to be ELIMINATED, not just checked.  Ie, the
> protocol should be changed somehow - the only issue is that I don't know
> how to do that now, once it has been in use for quite some time.

Even if you create a new version of the protocol (introduce a new
feature flag or whatever), newer qemus will still have to deal with
older guests and vice versa.

> Note that now, we don't have ANY checks of this theme whatsoever, at all:
> I tried using 128K as min_io_size, and the guest sees it as 0 currently, --
> the limits (at least the fact that the value fits in the defined number
> of bits) should be checked in the common function which implements these
> DEFINE_PROP_UINT* defines.

Looks like a bug in the qdev property parser to me. 128k is obviously an
invalid value for a 16 bit property.

But now that you're extending the property value to 32 bits, but only 25
bits can be really used, the property type doesn't even theoretically
suffice as a check.

> I'm not sure I can fix all of this in one go, so I went on and fixed the
> most specific bug first, without any additional bad side effects.
> 
> Besides, as I also already mentioned, this whole structure is used in
> virtio-blk *only*, so only virtio driver passes this information into
> guest, scsi does not use it.  And again, in case of scsi, the units are
> *sectors*, not bytes.

Yes, same as for virtio-blk.

But I can't see which structure is only used by virtio-blk, though. The
min_io_size property is contained in DEFINE_BLOCK_PROPERTIES(), which is
used by every qdevified block device. Most of them ignore min_io_size,
but virtio-blk and scsi-disk don't.

> So maybe the solution is to keep this as 16bits but switch to sectors.
> But for this, DEFINE_PROP_UINT can't be used anymore, unless we agree
> to change user interface TOO and require the user to specify these
> values in sectors to start with.

That wouldn't be a good interface. Let's just take a 32 bit number and
add the checks in the devices.

> And yes, if SCSI uses 16bit value here, it will have the same issue
> as virtio currently have -- pretty normal MD RAID array can't be
> expressed using 16bit number of sectors already...  Oh well, but we
> at least have a (small) chance to fix virtio.

Just curious... What values do you want to use? The 32 MB minimum I/O
size that are possible with 16 bits and 512 byte sectors already sounds
insanely large.

Kevin
Michael Tokarev May 2, 2012, 3:37 p.m. UTC | #6
On 02.05.2012 18:35, Kevin Wolf wrote:
[]
>> As I already mentioned, the virtio protocol has the same defect (but there
>> it is less serious due to usage of larger units).  And that's where the
>> additional overflow needs to be ELIMINATED, not just checked.  Ie, the
>> protocol should be changed somehow - the only issue is that I don't know
>> how to do that now, once it has been in use for quite some time.
> 
> Even if you create a new version of the protocol (introduce a new
> feature flag or whatever), newer qemus will still have to deal with
> older guests and vice versa.

Sure.  And for these, the checks indeed should be done in lower layers.

[]
> But now that you're extending the property value to 32 bits, but only 25
> bits can be really used, the property type doesn't even theoretically
> suffice as a check.

So, what do you propose?  To add a check into virtio-blk.c (and into
whatever else place is using this variable) too?  If yes, and indeed
it appears to be the right thing to do, care to show me the right
place please, I'm not very familiar with that code...

[]
> But I can't see which structure is only used by virtio-blk, though. The
> min_io_size property is contained in DEFINE_BLOCK_PROPERTIES(), which is
> used by every qdevified block device. Most of them ignore min_io_size,
> but virtio-blk and scsi-disk don't.

I mean the BlockConf struct.  It isn't used anywhere but in virtio-blk.c.

But again, since I'm not familiar with the code, I might be wrong.

[]

> That wouldn't be a good interface. Let's just take a 32 bit number and
> add the checks in the devices.

That's fine.  The only thing left to do is to find the proper places for
the checks.  Help?

[]
> Just curious... What values do you want to use? The 32 MB minimum I/O
> size that are possible with 16 bits and 512 byte sectors already sounds
> insanely large.

I don't "use" any values.  I merely pass whatever is defined on my systems
down to the guest.

md layer in kernel - raid4, raid5 and raid6 implementation - sets min_io_size
to the chunk size, and opt_io_size to "stripe size".  It is not uncommon at
all to have chunk size = 1Mb or more.   I've no idea how useful this information
is, but at least with it present (my small raid5 array has 256Kb chunk size),
xfs created in the guest performs much faster than without this information
(which means usual 512 there).

This is how I discovered the issue - I wondered why xfs created in the guest
is so much slower than the same xfs but created on host.  I/O sizes immediately
come to min, so I added these, but it was still slow.  So I noticed min_io_size
isn't passed correctly, increased the size of this type, and voila, xfs
created in guest now behaves as fast as created on host.  Something like that
anyway.

There's an obvious bug in there, but it is not obvious for me where/how it should
be fixed.  Maybe the sizes used by md raid5 are insane, that's arguable ofcourse,
but this is what is in use now (and since the day the i/o sizes were added to
md to start with), and this is what makes things fly.

Thanks,

/mjt
Kevin Wolf May 2, 2012, 3:53 p.m. UTC | #7
Am 02.05.2012 17:37, schrieb Michael Tokarev:
> On 02.05.2012 18:35, Kevin Wolf wrote:
> []
>>> As I already mentioned, the virtio protocol has the same defect (but there
>>> it is less serious due to usage of larger units).  And that's where the
>>> additional overflow needs to be ELIMINATED, not just checked.  Ie, the
>>> protocol should be changed somehow - the only issue is that I don't know
>>> how to do that now, once it has been in use for quite some time.
>>
>> Even if you create a new version of the protocol (introduce a new
>> feature flag or whatever), newer qemus will still have to deal with
>> older guests and vice versa.
> 
> Sure.  And for these, the checks indeed should be done in lower layers.

What lower layers do you mean?

>> But now that you're extending the property value to 32 bits, but only 25
>> bits can be really used, the property type doesn't even theoretically
>> suffice as a check.
> 
> So, what do you propose?  To add a check into virtio-blk.c (and into
> whatever else place is using this variable) too?  If yes, and indeed
> it appears to be the right thing to do, care to show me the right
> place please, I'm not very familiar with that code...

I think the qdev init functions, or whatever they have become with QOM,
are the right place. Looks like they are virtio_blk_init() and
scsi_initfn(). I believe it's possible to fail them.

>> But I can't see which structure is only used by virtio-blk, though. The
>> min_io_size property is contained in DEFINE_BLOCK_PROPERTIES(), which is
>> used by every qdevified block device. Most of them ignore min_io_size,
>> but virtio-blk and scsi-disk don't.
> 
> I mean the BlockConf struct.  It isn't used anywhere but in virtio-blk.c.
> 
> But again, since I'm not familiar with the code, I might be wrong.

$ git grep BlockConf
block.h:typedef struct BlockConf {
block.h:} BlockConf;
block.h:static inline unsigned int get_physical_block_exp(BlockConf *conf)
hw/ide/internal.h:    BlockConf conf;
hw/s390-virtio-bus.h:    BlockConf block;
hw/scsi.h:    BlockConf conf;
hw/usb/dev-storage.c:    BlockConf conf;
hw/virtio-blk.c:    BlockConf *conf;
hw/virtio-blk.c:VirtIODevice *virtio_blk_init(DeviceState *dev,
BlockConf *conf,
hw/virtio-pci.h:    BlockConf block;
hw/virtio.h:VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf,

So there are more users than just virtio-blk.

>> That wouldn't be a good interface. Let's just take a 32 bit number and
>> add the checks in the devices.
> 
> That's fine.  The only thing left to do is to find the proper places for
> the checks.  Help?

See above.

>> Just curious... What values do you want to use? The 32 MB minimum I/O
>> size that are possible with 16 bits and 512 byte sectors already sounds
>> insanely large.
> 
> I don't "use" any values.  I merely pass whatever is defined on my systems
> down to the guest.
> 
> md layer in kernel - raid4, raid5 and raid6 implementation - sets min_io_size
> to the chunk size, and opt_io_size to "stripe size".  It is not uncommon at
> all to have chunk size = 1Mb or more.   I've no idea how useful this information
> is, but at least with it present (my small raid5 array has 256Kb chunk size),
> xfs created in the guest performs much faster than without this information
> (which means usual 512 there).
> 
> This is how I discovered the issue - I wondered why xfs created in the guest
> is so much slower than the same xfs but created on host.  I/O sizes immediately
> come to min, so I added these, but it was still slow.  So I noticed min_io_size
> isn't passed correctly, increased the size of this type, and voila, xfs
> created in guest now behaves as fast as created on host.  Something like that
> anyway.
> 
> There's an obvious bug in there, but it is not obvious for me where/how it should
> be fixed.  Maybe the sizes used by md raid5 are insane, that's arguable ofcourse,
> but this is what is in use now (and since the day the i/o sizes were added to
> md to start with), and this is what makes things fly.

Thanks for the explanation. I wasn't trying to say that your setup is
wrong, I just wasn't aware that such sizes are in use.

Kevin
diff mbox

Patch

diff --git a/block.h b/block.h
index f163e54..cd5ae79 100644
--- a/block.h
+++ b/block.h
@@ -425,7 +425,7 @@  typedef struct BlockConf {
     BlockDriverState *bs;
     uint16_t physical_block_size;
     uint16_t logical_block_size;
-    uint16_t min_io_size;
+    uint32_t min_io_size;
     uint32_t opt_io_size;
     int32_t bootindex;
     uint32_t discard_granularity;
@@ -450,7 +450,7 @@  static inline unsigned int get_physical_block_exp(BlockConf *conf)
                           _conf.logical_block_size, 512),               \
     DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,                \
                           _conf.physical_block_size, 512),              \
-    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
+    DEFINE_PROP_UINT32("min_io_size", _state, _conf.min_io_size, 0),  \
     DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
     DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
     DEFINE_PROP_UINT32("discard_granularity", _state, \