diff mbox

[2/4] block: add block topology options

Message ID 20100129190440.GA25287@lst.de
State New
Headers show

Commit Message

Christoph Hellwig Jan. 29, 2010, 7:04 p.m. UTC
Add three new suboptions for the drive option to export block topology
information to the guest.  This is needed to get optimal I/O alignment
for RAID arrays or SSDs.

The options are:

 - physical_block_size to specify the physical block size of the device,
   this is going to increase from 512 bytes to 4096 kilobytes for many
   modern storage devices
 - min_io_size to specify the minimal I/O size without performance impact,
   this is typically set to the RAID chunk size for arrays.
 - opt_io_size to specify the optimal sustained I/O size, this is
   typically the RAID stripe width for arrays.

I decided to not auto-probe these values from blkid which might easily
be possible as I don't know how to deal with these issues on migration.

Note that we specificly only set the physical_block_size, and not the
logial one which is the unit all I/O is described in.  The reason for
that is that IDE does not support increasing the logical block size and
at last for now I want to stick to one meachnisms in queue and allow
for easy switching of transports for a given backing image which would
not be possible if scsi and virtio use real 4k sectors, while ide only
uses the physical block exponent.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Comments

Anthony Liguori Feb. 3, 2010, 7 p.m. UTC | #1
On 01/29/2010 01:04 PM, Christoph Hellwig wrote:
> Add three new suboptions for the drive option to export block topology
> information to the guest.  This is needed to get optimal I/O alignment
> for RAID arrays or SSDs.
>
> The options are:
>
>   - physical_block_size to specify the physical block size of the device,
>     this is going to increase from 512 bytes to 4096 kilobytes for many
>     modern storage devices
>   - min_io_size to specify the minimal I/O size without performance impact,
>     this is typically set to the RAID chunk size for arrays.
>   - opt_io_size to specify the optimal sustained I/O size, this is
>     typically the RAID stripe width for arrays.
>    

This makes sense.

> I decided to not auto-probe these values from blkid which might easily
> be possible as I don't know how to deal with these issues on migration.
>
> Note that we specificly only set the physical_block_size, and not the
> logial one which is the unit all I/O is described in.  The reason for
> that is that IDE does not support increasing the logical block size and
> at last for now I want to stick to one meachnisms in queue and allow
> for easy switching of transports for a given backing image which would
> not be possible if scsi and virtio use real 4k sectors, while ide only
> uses the physical block exponent.
>
> Signed-off-by: Christoph Hellwig<hch@lst.de>
>
> Index: qemu/block.c
> ===================================================================
> --- qemu.orig/block.c	2010-01-29 11:07:50.083004364 +0100
> +++ qemu/block.c	2010-01-29 11:08:32.940004255 +0100
> @@ -1028,6 +1028,51 @@ int bdrv_enable_write_cache(BlockDriverS
>       return bs->enable_write_cache;
>   }
>
> +unsigned int bdrv_get_physical_block_size(BlockDriverState *bs)
> +{
> +    return bs->physical_block_size;
> +}
> +
> +unsigned int bdrv_get_physical_block_exp(BlockDriverState *bs)
> +{
> +    unsigned int exp = 0, size;
> +
> +    for (size = bs->physical_block_size; size>  512; size>>= 1) {
> +        exp++;
> +    }
> +
> +    return exp;
> +}
> +
> +void bdrv_set_physical_block_size(BlockDriverState *bs,
> +        unsigned int physical_block_size)
> +{
> +    bs->physical_block_size = physical_block_size;
> +}
> +
>    

But I don't think this is the wrong place to do it.  The 
BlockDriverState reflects that backing device, not the emulated device 
itself.  In this case, you're trying to set a property of the emulated 
device.

I think these need to be qdev properties of the respective devices.  
 From a UI perspective, you can still expose -drive options for the end 
user to consume, but this data should be associated with the devices 
themselves.

Regards,

Anthony Liguori

> +unsigned int bdrv_get_min_io_size(BlockDriverState *bs)
> +{
> +    return bs->min_io_size;
> +}
> +
> +void bdrv_set_min_io_size(BlockDriverState *bs,
> +        unsigned int min_io_size)
> +{
> +    bs->min_io_size = min_io_size;
> +}
> +
> +unsigned int bdrv_get_opt_io_size(BlockDriverState *bs)
> +{
> +    return bs->opt_io_size;
> +}
> +
> +void bdrv_set_opt_io_size(BlockDriverState *bs,
> +        unsigned int opt_io_size)
> +{
> +    bs->opt_io_size = opt_io_size;
> +}
> +
>   /* XXX: no longer used */
>   void bdrv_set_change_cb(BlockDriverState *bs,
>                           void (*change_cb)(void *opaque), void *opaque)
> Index: qemu/block.h
> ===================================================================
> --- qemu.orig/block.h	2010-01-29 11:07:50.089004011 +0100
> +++ qemu/block.h	2010-01-29 11:08:32.940004255 +0100
> @@ -152,6 +152,16 @@ int bdrv_is_inserted(BlockDriverState *b
>   int bdrv_media_changed(BlockDriverState *bs);
>   int bdrv_is_locked(BlockDriverState *bs);
>   void bdrv_set_locked(BlockDriverState *bs, int locked);
> +unsigned int bdrv_get_physical_block_size(BlockDriverState *bs);
> +unsigned int bdrv_get_physical_block_exp(BlockDriverState *bs);
> +void bdrv_set_physical_block_size(BlockDriverState *bs,
> +        unsigned int physical_block_size);
> +unsigned int bdrv_get_min_io_size(BlockDriverState *bs);
> +void bdrv_set_min_io_size(BlockDriverState *bs,
> +        unsigned int min_io_size);
> +unsigned int bdrv_get_opt_io_size(BlockDriverState *bs);
> +void bdrv_set_opt_io_size(BlockDriverState *bs,
> +        unsigned int opt_io_size);
>   int bdrv_eject(BlockDriverState *bs, int eject_flag);
>   void bdrv_set_change_cb(BlockDriverState *bs,
>                           void (*change_cb)(void *opaque), void *opaque);
> Index: qemu/block_int.h
> ===================================================================
> --- qemu.orig/block_int.h	2010-01-29 11:07:50.096004065 +0100
> +++ qemu/block_int.h	2010-01-29 11:08:32.941003474 +0100
> @@ -173,6 +173,14 @@ struct BlockDriverState {
>          drivers. They are not used by the block driver */
>       int cyls, heads, secs, translation;
>       int type;
> +
> +    /*
> +     * Topology information, all optional.
> +     */
> +    unsigned int physical_block_size;
> +    unsigned int min_io_size;
> +    unsigned int opt_io_size;
> +
>       char device_name[32];
>       unsigned long *dirty_bitmap;
>       BlockDriverState *next;
> Index: qemu/qemu-config.c
> ===================================================================
> --- qemu.orig/qemu-config.c	2010-01-29 11:07:50.133004032 +0100
> +++ qemu/qemu-config.c	2010-01-29 11:08:32.944025367 +0100
> @@ -78,6 +78,15 @@ QemuOptsList qemu_drive_opts = {
>           },{
>               .name = "readonly",
>               .type = QEMU_OPT_BOOL,
> +        },{
> +            .name = "physical_block_size",
> +            .type = QEMU_OPT_NUMBER,
> +        },{
> +            .name = "min_io_size",
> +            .type = QEMU_OPT_NUMBER,
> +        },{
> +            .name = "opt_io_size",
> +            .type = QEMU_OPT_NUMBER,
>           },
>           { /* end if list */ }
>       },
> Index: qemu/vl.c
> ===================================================================
> --- qemu.orig/vl.c	2010-01-29 11:07:50.141004284 +0100
> +++ qemu/vl.c	2010-01-29 11:08:32.947003820 +0100
> @@ -1904,6 +1904,9 @@ DriveInfo *drive_init(QemuOpts *opts, vo
>       int index;
>       int cache;
>       int aio = 0;
> +    unsigned long physical_block_size = 512;
> +    unsigned long min_io_size = 0;
> +    unsigned long opt_io_size = 0;
>       int ro = 0;
>       int bdrv_flags;
>       int on_read_error, on_write_error;
> @@ -2053,6 +2056,32 @@ DriveInfo *drive_init(QemuOpts *opts, vo
>       }
>   #endif
>
> +    if ((buf = qemu_opt_get(opts, "physical_block_size")) != NULL) {
> +        physical_block_size = strtoul(buf, NULL, 10);
> +        if (physical_block_size<  512) {
> +            fprintf(stderr, "sector size must be larger than 512 bytes\n");
> +            return NULL;
> +        }
> +    }
> +
> +    if ((buf = qemu_opt_get(opts, "min_io_size")) != NULL) {
> +        min_io_size = strtoul(buf, NULL, 10);
> +        if (!min_io_size || (min_io_size % physical_block_size)) {
> +            fprintf(stderr,
> +                    "min_io_size must be a multiple of the sector size\n");
> +            return NULL;
> +        }
> +    }
> +
> +    if ((buf = qemu_opt_get(opts, "opt_io_size")) != NULL) {
> +        opt_io_size = strtoul(buf, NULL, 10);
> +        if (!opt_io_size || (opt_io_size % min_io_size)) {
> +            fprintf(stderr,
> +                    "opt_io_size must be a multiple of min_io_size\n");
> +            return NULL;
> +        }
> +    }
> +
>       if ((buf = qemu_opt_get(opts, "format")) != NULL) {
>          if (strcmp(buf, "?") == 0) {
>               fprintf(stderr, "qemu: Supported formats:");
> @@ -2257,6 +2286,12 @@ DriveInfo *drive_init(QemuOpts *opts, vo
>           return NULL;
>       }
>
> +    bdrv_set_physical_block_size(dinfo->bdrv, physical_block_size);
> +    if (min_io_size)
> +        bdrv_set_min_io_size(dinfo->bdrv, min_io_size);
> +    if (opt_io_size)
> +        bdrv_set_opt_io_size(dinfo->bdrv, opt_io_size);
> +
>       if (bdrv_key_required(dinfo->bdrv))
>           autostart = 0;
>       *fatal_error = 0;
> Index: qemu/qemu-options.hx
> ===================================================================
> --- qemu.orig/qemu-options.hx	2010-01-29 11:07:50.149004395 +0100
> +++ qemu/qemu-options.hx	2010-01-29 19:36:06.118256061 +0100
> @@ -104,6 +104,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
>       "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
>       "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
>       "       [,addr=A][,id=name][,aio=threads|native][,readonly=on|off]\n"
> +    "       [,physical_block=size=size][,min_io_size=size][,opt_io_size=size]\n"
>       "                use 'file' as a drive image\n")
>   DEF("set", HAS_ARG, QEMU_OPTION_set,
>       "-set group.id.arg=value\n"
> @@ -149,6 +150,15 @@ an untrusted format header.
>   This option specifies the serial number to assign to the device.
>   @item addr=@var{addr}
>   Specify the controller's PCI address (if=virtio only).
> +@item physical_sector_size=@var{size}
> +Report a physical block size larger than the logical block size of 512 bytes.
> +@item min_io_size=@var{size}
> +Reported a minimum I/O size or optimum I/O granularity.  This is the smallest
> +I/O size the device can perform without a performance penalty.  For RAID
> +devices this should be set to the stripe chunk size.
> +@item opt_io_size=@var{size}
> +Report an optimal I/O size, which is the device's preferred unit for
> +sustained I/O.  This should be set to the stripe width for RAID devices.
>   @end table
>
>   By default, writethrough caching is used for all block device.  This means that
>
>
>
>
Christoph Hellwig Feb. 5, 2010, 1:09 p.m. UTC | #2
On Wed, Feb 03, 2010 at 01:00:47PM -0600, Anthony Liguori wrote:
> But I don't think this is the wrong place to do it.  The 
> BlockDriverState reflects that backing device, not the emulated device 
> itself.  In this case, you're trying to set a property of the emulated 
> device.

I think that's very borderline.  While the emulated device exposes these
properties, they are in fact a property of the backing storage, the
right sector and min/max I/O sizes are determined by the backing storage
device.

> I think these need to be qdev properties of the respective devices.  
> From a UI perspective, you can still expose -drive options for the end 
> user to consume, but this data should be associated with the devices 
> themselves.

In addition to not really beeing more logical this would be a lot more
effort.  We'd need to add properties to all the device, which means
including dealing with the n+1 ide variants, the virtio-pci proxy, etc.

If you believe it really needs to be in the qdev properties I'll
implement it, but I suspect the current version is a better idea.
Jamie Lokier Feb. 5, 2010, 4:16 p.m. UTC | #3
Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 01:00:47PM -0600, Anthony Liguori wrote:
> > But I don't think this is the wrong place to do it.  The 
> > BlockDriverState reflects that backing device, not the emulated device 
> > itself.  In this case, you're trying to set a property of the emulated 
> > device.
> 
> I think that's very borderline.  While the emulated device exposes these
> properties, they are in fact a property of the backing storage, the
> right sector and min/max I/O sizes are determined by the backing storage
> device.
> 
> > I think these need to be qdev properties of the respective devices.  
> > From a UI perspective, you can still expose -drive options for the end 
> > user to consume, but this data should be associated with the devices 
> > themselves.
> 
> In addition to not really beeing more logical this would be a lot more
> effort.  We'd need to add properties to all the device, which means
> including dealing with the n+1 ide variants, the virtio-pci proxy, etc.
> 
> If you believe it really needs to be in the qdev properties I'll
> implement it, but I suspect the current version is a better idea.

If you move your VM to a new system with different backing devices,
sometimes you want to be sure there is no guest-visible change.  Or
even if you just replace a drive - you might prefer confidence that
the guest sees no change.

Even if you just convert between qcow2 and a raw block device, or the
other way, you'll sometimes want to be sure it's not guest-visible.

-- Jamie
Christoph Hellwig Feb. 5, 2010, 4:22 p.m. UTC | #4
On Fri, Feb 05, 2010 at 04:16:20PM +0000, Jamie Lokier wrote:
> If you move your VM to a new system with different backing devices,
> sometimes you want to be sure there is no guest-visible change.  Or
> even if you just replace a drive - you might prefer confidence that
> the guest sees no change.

Yes, that's why we do not auto probe it but require it to be set
manually.  Note that not the physical block size attribute can
we a data integrity issue, though.  A storage device guarnatees
that it can write a sector atomically, so moving from a 4k to a 512 byte
physcical sector device could lead to not beeing able to atomically
write a 4k piece of data that the guest expects to write atomcially.

I'm not sure how failure safe the read-modify-write algorithms on
4k sector disks with logical 512 bye blocks are, but I'd expect issues
there, too.

> Even if you just convert between qcow2 and a raw block device, or the
> other way, you'll sometimes want to be sure it's not guest-visible.

The image format has no hooks into these options currently.
Jamie Lokier Feb. 5, 2010, 5:16 p.m. UTC | #5
Christoph Hellwig wrote:
> Note that not the physical block size attribute can
> we a data integrity issue, though.

I agree.

> A storage device guarnatees that it can write a sector atomically,

I've looked everywhere for confirmation of *atomicity*, and so far all
I've seen are rumours.  Some people believe sector writes are atomic
during power failure, some people believe they are not.  Those who
believe it is, don't have reliable references, and I haven't seen it
in any standard.

SQLite has a flag which can be set for a backing store to say block
writes are atomic, to enable it to optimise some things; the flag is
not set when writing to a disk block device, because they didn't find
confirmation of it.

> so moving from a 4k to a 512 byte physcical sector device could lead
> to not beeing able to atomically write a 4k piece of data that the
> guest expects to write atomcially.

If there is no confirmation that sector writes are atomic, then no
database or filesystem should be relying on that property anyway.

> I'm not sure how failure safe the read-modify-write algorithms on
> 4k sector disks with logical 512 bye blocks are, but I'd expect issues
> there, too.

I think you might be referring to what I'm calling "radius of
destruction", because I don't know if there's a well known term for it.

By that I mean if you write 512 bytes, and it's implemented as RMW to
a 4k sector, then on power failure any part of the 4k sector could be
corrupted.

On some RAIDs the size is much larger; also on many flash devices.
(RAIDs make it clearer that the alignment is relevant too.)

Note that if 4k sector writes were _atomic_, then read-modify-write of
512 bytes would be completely reliable.

> > Even if you just convert between qcow2 and a raw block device, or the
> > other way, you'll sometimes want to be sure it's not guest-visible.
> 
> The image format has no hooks into these options currently.

No, but whatever is reported to the guest, you may device you want it
to continue being reported to the guest after doing the convert
operation.  Even if it's a data integrity concern.  In fact
*precisely* when the guest has algorithms which write differently
depending on the sector size, for integrity, that means changing the
guest-visible sector size may trigger bugs and other changes that you
sometimes don't want.

I agree that it should report the size by default, though, because the
integrity concern.

-- Jamie
Anthony Liguori Feb. 5, 2010, 5:33 p.m. UTC | #6
On 02/05/2010 07:09 AM, Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 01:00:47PM -0600, Anthony Liguori wrote:
>    
>> But I don't think this is the wrong place to do it.  The
>> BlockDriverState reflects that backing device, not the emulated device
>> itself.  In this case, you're trying to set a property of the emulated
>> device.
>>      
> I think that's very borderline.  While the emulated device exposes these
> properties, they are in fact a property of the backing storage, the
> right sector and min/max I/O sizes are determined by the backing storage
> device.
>    

It's guest visible state that should be configured based on the 
properties of the backing store.

However, as you said, live migration complicates things.

drive is not an optimal interface because it mixes host and guest 
state.  But the subset of -drive that you normally use should be 
independent of the guest.  For instance, with:

-drive file=/home/anthony/foo.img,cache=off,aio=native,id=foo -device 
virtio-blk-pci,drive=foo

file, cache, and aio are all host properties.  They have no visible 
impact to the guest and if I change those properties during live 
migration, they take affect without the guest noticing

These topology options still can be specified via -drive, but the proper 
way of splitting things would have them as part of the -device option.  
During live migration, things on the -device side inherited as part of 
the live migration process.

>> I think these need to be qdev properties of the respective devices.
>>  From a UI perspective, you can still expose -drive options for the end
>> user to consume, but this data should be associated with the devices
>> themselves.
>>      
> In addition to not really beeing more logical this would be a lot more
> effort.  We'd need to add properties to all the device, which means
> including dealing with the n+1 ide variants, the virtio-pci proxy, etc.
>    

The way we deal with this with network devices is we have a common 
DEFINE_NIC_PROPERTIES that lets us consistently add properties for all 
network devices.  We need exactly this sort of thing for disk drives for 
the reason you describe.

> If you believe it really needs to be in the qdev properties I'll
> implement it, but I suspect the current version is a better idea.
>    

It definitely a correctness issue.  Once we can describe the full PC via 
qdev, we want to be able to migrate that description as part of the 
migration traffic.  For that to work, we need to have all of the device 
characteristics expressed as qdev properties.

Regards,

Anthony Liguori
Markus Armbruster Feb. 9, 2010, 3:26 p.m. UTC | #7
Anthony Liguori <anthony@codemonkey.ws> writes:

> On 02/05/2010 07:09 AM, Christoph Hellwig wrote:
>> On Wed, Feb 03, 2010 at 01:00:47PM -0600, Anthony Liguori wrote:
>>    
>>> But I don't think this is the wrong place to do it.  The
>>> BlockDriverState reflects that backing device, not the emulated device
>>> itself.  In this case, you're trying to set a property of the emulated
>>> device.
>>>      
>> I think that's very borderline.  While the emulated device exposes these
>> properties, they are in fact a property of the backing storage, the
>> right sector and min/max I/O sizes are determined by the backing storage
>> device.
>>    
>
> It's guest visible state that should be configured based on the
> properties of the backing store.
>
> However, as you said, live migration complicates things.
>
> drive is not an optimal interface because it mixes host and guest
> state.

Qdev separates guest and host stuff.  -device does guest, -netdev,
-chardev, -drive if=none do host.  Unfortunately, -drive can also do
host, with other values for parameter if.  A clean, host-only -blockdev
for use with -device would be nice to have.

>         But the subset of -drive that you normally use should be
> independent of the guest.

To be precise: those properties that make sense with if=none.

>                            For instance, with:
>
> -drive file=/home/anthony/foo.img,cache=off,aio=native,id=foo -device
> virtio-blk-pci,drive=foo

There's an implied if=ide, which means it's mixed host and guest stuff.

> file, cache, and aio are all host properties.  They have no visible
> impact to the guest and if I change those properties during live
> migration, they take affect without the guest noticing
>
> These topology options still can be specified via -drive, but the
> proper way of splitting things would have them as part of the -device
> option.  During live migration, things on the -device side inherited
> as part of the live migration process.

If it's guest-visible, it should live in guest state (qdev device
state), and be configured with -device.  Else, it should live in host
state (block driver state), and be configured via -drive if=none (or
-blockdev once we have it).

[...]
diff mbox

Patch

Index: qemu/block.c
===================================================================
--- qemu.orig/block.c	2010-01-29 11:07:50.083004364 +0100
+++ qemu/block.c	2010-01-29 11:08:32.940004255 +0100
@@ -1028,6 +1028,51 @@  int bdrv_enable_write_cache(BlockDriverS
     return bs->enable_write_cache;
 }
 
+unsigned int bdrv_get_physical_block_size(BlockDriverState *bs)
+{
+    return bs->physical_block_size;
+}
+
+unsigned int bdrv_get_physical_block_exp(BlockDriverState *bs)
+{
+    unsigned int exp = 0, size;
+
+    for (size = bs->physical_block_size; size > 512; size >>= 1) {
+        exp++;
+    }
+
+    return exp;
+}
+
+void bdrv_set_physical_block_size(BlockDriverState *bs,
+        unsigned int physical_block_size)
+{
+    bs->physical_block_size = physical_block_size;
+}
+
+
+unsigned int bdrv_get_min_io_size(BlockDriverState *bs)
+{
+    return bs->min_io_size;
+}
+
+void bdrv_set_min_io_size(BlockDriverState *bs,
+        unsigned int min_io_size)
+{
+    bs->min_io_size = min_io_size;
+}
+
+unsigned int bdrv_get_opt_io_size(BlockDriverState *bs)
+{
+    return bs->opt_io_size;
+}
+
+void bdrv_set_opt_io_size(BlockDriverState *bs,
+        unsigned int opt_io_size)
+{
+    bs->opt_io_size = opt_io_size;
+}
+
 /* XXX: no longer used */
 void bdrv_set_change_cb(BlockDriverState *bs,
                         void (*change_cb)(void *opaque), void *opaque)
Index: qemu/block.h
===================================================================
--- qemu.orig/block.h	2010-01-29 11:07:50.089004011 +0100
+++ qemu/block.h	2010-01-29 11:08:32.940004255 +0100
@@ -152,6 +152,16 @@  int bdrv_is_inserted(BlockDriverState *b
 int bdrv_media_changed(BlockDriverState *bs);
 int bdrv_is_locked(BlockDriverState *bs);
 void bdrv_set_locked(BlockDriverState *bs, int locked);
+unsigned int bdrv_get_physical_block_size(BlockDriverState *bs);
+unsigned int bdrv_get_physical_block_exp(BlockDriverState *bs);
+void bdrv_set_physical_block_size(BlockDriverState *bs,
+        unsigned int physical_block_size);
+unsigned int bdrv_get_min_io_size(BlockDriverState *bs);
+void bdrv_set_min_io_size(BlockDriverState *bs,
+        unsigned int min_io_size);
+unsigned int bdrv_get_opt_io_size(BlockDriverState *bs);
+void bdrv_set_opt_io_size(BlockDriverState *bs,
+        unsigned int opt_io_size);
 int bdrv_eject(BlockDriverState *bs, int eject_flag);
 void bdrv_set_change_cb(BlockDriverState *bs,
                         void (*change_cb)(void *opaque), void *opaque);
Index: qemu/block_int.h
===================================================================
--- qemu.orig/block_int.h	2010-01-29 11:07:50.096004065 +0100
+++ qemu/block_int.h	2010-01-29 11:08:32.941003474 +0100
@@ -173,6 +173,14 @@  struct BlockDriverState {
        drivers. They are not used by the block driver */
     int cyls, heads, secs, translation;
     int type;
+
+    /*
+     * Topology information, all optional.
+     */
+    unsigned int physical_block_size;
+    unsigned int min_io_size;
+    unsigned int opt_io_size;
+
     char device_name[32];
     unsigned long *dirty_bitmap;
     BlockDriverState *next;
Index: qemu/qemu-config.c
===================================================================
--- qemu.orig/qemu-config.c	2010-01-29 11:07:50.133004032 +0100
+++ qemu/qemu-config.c	2010-01-29 11:08:32.944025367 +0100
@@ -78,6 +78,15 @@  QemuOptsList qemu_drive_opts = {
         },{
             .name = "readonly",
             .type = QEMU_OPT_BOOL,
+        },{
+            .name = "physical_block_size",
+            .type = QEMU_OPT_NUMBER,
+        },{
+            .name = "min_io_size",
+            .type = QEMU_OPT_NUMBER,
+        },{
+            .name = "opt_io_size",
+            .type = QEMU_OPT_NUMBER,
         },
         { /* end if list */ }
     },
Index: qemu/vl.c
===================================================================
--- qemu.orig/vl.c	2010-01-29 11:07:50.141004284 +0100
+++ qemu/vl.c	2010-01-29 11:08:32.947003820 +0100
@@ -1904,6 +1904,9 @@  DriveInfo *drive_init(QemuOpts *opts, vo
     int index;
     int cache;
     int aio = 0;
+    unsigned long physical_block_size = 512;
+    unsigned long min_io_size = 0;
+    unsigned long opt_io_size = 0;
     int ro = 0;
     int bdrv_flags;
     int on_read_error, on_write_error;
@@ -2053,6 +2056,32 @@  DriveInfo *drive_init(QemuOpts *opts, vo
     }
 #endif
 
+    if ((buf = qemu_opt_get(opts, "physical_block_size")) != NULL) {
+        physical_block_size = strtoul(buf, NULL, 10);
+        if (physical_block_size < 512) {
+            fprintf(stderr, "sector size must be larger than 512 bytes\n");
+            return NULL;
+        }
+    }
+
+    if ((buf = qemu_opt_get(opts, "min_io_size")) != NULL) {
+        min_io_size = strtoul(buf, NULL, 10);
+        if (!min_io_size || (min_io_size % physical_block_size)) {
+            fprintf(stderr,
+                    "min_io_size must be a multiple of the sector size\n");
+            return NULL;
+        }
+    }
+
+    if ((buf = qemu_opt_get(opts, "opt_io_size")) != NULL) {
+        opt_io_size = strtoul(buf, NULL, 10);
+        if (!opt_io_size || (opt_io_size % min_io_size)) {
+            fprintf(stderr,
+                    "opt_io_size must be a multiple of min_io_size\n");
+            return NULL;
+        }
+    }
+
     if ((buf = qemu_opt_get(opts, "format")) != NULL) {
        if (strcmp(buf, "?") == 0) {
             fprintf(stderr, "qemu: Supported formats:");
@@ -2257,6 +2286,12 @@  DriveInfo *drive_init(QemuOpts *opts, vo
         return NULL;
     }
 
+    bdrv_set_physical_block_size(dinfo->bdrv, physical_block_size);
+    if (min_io_size)
+        bdrv_set_min_io_size(dinfo->bdrv, min_io_size);
+    if (opt_io_size)
+        bdrv_set_opt_io_size(dinfo->bdrv, opt_io_size);
+
     if (bdrv_key_required(dinfo->bdrv))
         autostart = 0;
     *fatal_error = 0;
Index: qemu/qemu-options.hx
===================================================================
--- qemu.orig/qemu-options.hx	2010-01-29 11:07:50.149004395 +0100
+++ qemu/qemu-options.hx	2010-01-29 19:36:06.118256061 +0100
@@ -104,6 +104,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cyls=c,heads=h,secs=s[,trans=t]][,snapshot=on|off]\n"
     "       [,cache=writethrough|writeback|none][,format=f][,serial=s]\n"
     "       [,addr=A][,id=name][,aio=threads|native][,readonly=on|off]\n"
+    "       [,physical_block=size=size][,min_io_size=size][,opt_io_size=size]\n"
     "                use 'file' as a drive image\n")
 DEF("set", HAS_ARG, QEMU_OPTION_set,
     "-set group.id.arg=value\n"
@@ -149,6 +150,15 @@  an untrusted format header.
 This option specifies the serial number to assign to the device.
 @item addr=@var{addr}
 Specify the controller's PCI address (if=virtio only).
+@item physical_sector_size=@var{size}
+Report a physical block size larger than the logical block size of 512 bytes.
+@item min_io_size=@var{size}
+Reported a minimum I/O size or optimum I/O granularity.  This is the smallest
+I/O size the device can perform without a performance penalty.  For RAID
+devices this should be set to the stripe chunk size.
+@item opt_io_size=@var{size}
+Report an optimal I/O size, which is the device's preferred unit for
+sustained I/O.  This should be set to the stripe width for RAID devices.
 @end table
 
 By default, writethrough caching is used for all block device.  This means that