diff mbox

qemu: make virtio-blk PCI compliant by default

Message ID 20090907181436.GA8538@redhat.com
State Superseded
Headers show

Commit Message

Michael S. Tsirkin Sept. 7, 2009, 6:14 p.m. UTC
commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
PCI-compliant, since it makes region 0 (which is an i/o region)
size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.

When the ATA serial number feature is off, which is the default,
make the device spec compliant again, by making region 0 smaller.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Tested-by: Vadim Rozenfeld <vrozenfe@redhat.com>

---

Note: the companion feature in guest kernel was added by
1d589bb16b825b3a7b4edd34d997f1f1f953033d in linux 2.6.31-rc1.

Comments

john cooper Sept. 8, 2009, 7:40 a.m. UTC | #1
Michael S. Tsirkin wrote:
> commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
> PCI-compliant, since it makes region 0 (which is an i/o region)
> size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.
> 
> When the ATA serial number feature is off, which is the default,
> make the device spec compliant again, by making region 0 smaller.

I'd hazard this is the cause of the breakage others
encountered -- even when the driver was initialized
but unused.  For some odd reason I hadn't seen nor
been able to reproduce the failure.

The mock-up of an entire ATA IDENTIFY page is really
overkill for what we're trying to accomplish here,
namely passing a 20 byte S/N from qemu to the guest.
However emulating and passing an IDENTIFY page allows
guest apps to interpret the information via an
existing interface, with the guest driver doing nothing
more than transferring the data as opaque.  During
review, other defined fields of the IDENTIFY page were
speculated to be potentially useful thus the entire
512 byte page was passed wholesale.  But it is clearly
more trouble than benefit at this point.  I'll rework
the patch or use an alternate mechanism.

-john
Michael S. Tsirkin Sept. 8, 2009, 7:58 a.m. UTC | #2
On Tue, Sep 08, 2009 at 03:40:08AM -0400, john cooper wrote:
> Michael S. Tsirkin wrote:
> > commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
> > PCI-compliant, since it makes region 0 (which is an i/o region)
> > size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.
> > 
> > When the ATA serial number feature is off, which is the default,
> > make the device spec compliant again, by making region 0 smaller.
> 
> I'd hazard this is the cause of the breakage others
> encountered -- even when the driver was initialized
> but unused.  For some odd reason I hadn't seen nor
> been able to reproduce the failure.
> 
> The mock-up of an entire ATA IDENTIFY page is really
> overkill for what we're trying to accomplish here,
> namely passing a 20 byte S/N from qemu to the guest.
> However emulating and passing an IDENTIFY page allows
> guest apps to interpret the information via an
> existing interface, with the guest driver doing nothing
> more than transferring the data as opaque.  During
> review, other defined fields of the IDENTIFY page were
> speculated to be potentially useful thus the entire
> 512 byte page was passed wholesale.  But it is clearly
> more trouble than benefit at this point.  I'll rework
> the patch or use an alternate mechanism.

BTW, will the change you have in mind affect the guest driver as well?
If yes, maybe revert 1d589bb16b825b3a7b4edd34d997f1f1f953033d in linux
before 2.6.31 is out? Otherwise we'll be stuck supporting the bad
interface from upstream 2.6.31 in upstream qemu ...

> -john
> 
> -- 
> john.cooper@redhat.com
Michael S. Tsirkin Sept. 14, 2009, 11:39 a.m. UTC | #3
On Tue, Sep 08, 2009 at 03:40:08AM -0400, john cooper wrote:
> Michael S. Tsirkin wrote:
> > commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
> > PCI-compliant, since it makes region 0 (which is an i/o region)
> > size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.
> > 
> > When the ATA serial number feature is off, which is the default,
> > make the device spec compliant again, by making region 0 smaller.
> 
> I'd hazard this is the cause of the breakage others
> encountered -- even when the driver was initialized
> but unused.  For some odd reason I hadn't seen nor
> been able to reproduce the failure.
> 
> The mock-up of an entire ATA IDENTIFY page is really
> overkill for what we're trying to accomplish here,
> namely passing a 20 byte S/N from qemu to the guest.
> However emulating and passing an IDENTIFY page allows
> guest apps to interpret the information via an
> existing interface, with the guest driver doing nothing
> more than transferring the data as opaque.  During
> review, other defined fields of the IDENTIFY page were
> speculated to be potentially useful thus the entire
> 512 byte page was passed wholesale.  But it is clearly
> more trouble than benefit at this point.  I'll rework
> the patch or use an alternate mechanism.
> 
> -john

For now, what my patch does is fix the PCI compliance issue for everyone
who uses the default setup (without the serial #). With serial disabled,
we should not reserve any space in region 0 (not even 20 bytes).

So I propose we apply this patch for now, and then your patch only has
to handle the case of serial number enabled. Makes sense?  If yes pls
ack.


> -- 
> john.cooper@redhat.com
john cooper Sept. 15, 2009, 7:29 a.m. UTC | #4
Michael S. Tsirkin wrote:

> For now, what my patch does is fix the PCI compliance issue for everyone
> who uses the default setup (without the serial #). With serial disabled,
> we should not reserve any space in region 0 (not even 20 bytes).
> 
> So I propose we apply this patch for now, and then your patch only has
> to handle the case of serial number enabled. Makes sense?  If yes pls
> ack.

Sorry, I intended to do so in my previous mail.
Looks fine, ACK.

-john
Rusty Russell Sept. 21, 2009, 11:09 a.m. UTC | #5
On Tue, 8 Sep 2009 05:28:31 pm Michael S. Tsirkin wrote:
> On Tue, Sep 08, 2009 at 03:40:08AM -0400, john cooper wrote:
> > Michael S. Tsirkin wrote:
> > > commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
> > > PCI-compliant, since it makes region 0 (which is an i/o region)
> > > size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.
> > > 
> > > When the ATA serial number feature is off, which is the default,
> > > make the device spec compliant again, by making region 0 smaller.
> > 
> > I'd hazard this is the cause of the breakage others
> > encountered -- even when the driver was initialized
> > but unused.  For some odd reason I hadn't seen nor
> > been able to reproduce the failure.
> > 
> > The mock-up of an entire ATA IDENTIFY page is really
> > overkill for what we're trying to accomplish here,
> > namely passing a 20 byte S/N from qemu to the guest.
> > However emulating and passing an IDENTIFY page allows
> > guest apps to interpret the information via an
> > existing interface, with the guest driver doing nothing
> > more than transferring the data as opaque.  During
> > review, other defined fields of the IDENTIFY page were
> > speculated to be potentially useful thus the entire
> > 512 byte page was passed wholesale.  But it is clearly
> > more trouble than benefit at this point.  I'll rework
> > the patch or use an alternate mechanism.
> 
> BTW, will the change you have in mind affect the guest driver as well?
> If yes, maybe revert 1d589bb16b825b3a7b4edd34d997f1f1f953033d in linux
> before 2.6.31 is out? Otherwise we'll be stuck supporting the bad
> interface from upstream 2.6.31 in upstream qemu ...

Too late, but no big deal.  We just add another feature bit for the
new method.  We can have a command virtqueue for queries such as this.

(Interestingly, this same issue broke lguest, with the 8-bit limit on
configuration space sizes).

Thanks,
Rusty.
john cooper Sept. 21, 2009, 3:47 p.m. UTC | #6
Rusty Russell wrote:

> Too late, but no big deal.  We just add another feature bit for the
> new method.  We can have a command virtqueue for queries such as this.
> 
> (Interestingly, this same issue broke lguest, with the 8-bit limit on
> configuration space sizes).

I believe the existing guest visible interface
is preservable as-is without introducing any
ATA specific knowledge into the driver.  A
trivial alternate R/O mapping for the page
into the guest is all that is required.

I ran out of time last week to tie this off
but hopefully will do so shortly.

Thanks,

-john
Rusty Russell Sept. 22, 2009, 5:06 a.m. UTC | #7
On Mon, 14 Sep 2009 09:09:59 pm Michael S. Tsirkin wrote:
> On Tue, Sep 08, 2009 at 03:40:08AM -0400, john cooper wrote:
> > Michael S. Tsirkin wrote:
> > > commit bf011293faaa7f87e4de83185931e7411b794128 made virtio-blk-pci not
> > > PCI-compliant, since it makes region 0 (which is an i/o region)
> > > size > 256, and, since PCI 2.1, i/o regions are limited to 256 bytes size.
> > > 
> > > When the ATA serial number feature is off, which is the default,
> > > make the device spec compliant again, by making region 0 smaller.
> > 
> > I'd hazard this is the cause of the breakage others
> > encountered -- even when the driver was initialized
> > but unused.  For some odd reason I hadn't seen nor
> > been able to reproduce the failure.
> > 
> > The mock-up of an entire ATA IDENTIFY page is really
> > overkill for what we're trying to accomplish here,
> > namely passing a 20 byte S/N from qemu to the guest.
> > However emulating and passing an IDENTIFY page allows
> > guest apps to interpret the information via an
> > existing interface, with the guest driver doing nothing
> > more than transferring the data as opaque.  During
> > review, other defined fields of the IDENTIFY page were
> > speculated to be potentially useful thus the entire
> > 512 byte page was passed wholesale.  But it is clearly
> > more trouble than benefit at this point.  I'll rework
> > the patch or use an alternate mechanism.
> > 
> > -john
> 
> For now, what my patch does is fix the PCI compliance issue for everyone
> who uses the default setup (without the serial #). With serial disabled,
> we should not reserve any space in region 0 (not even 20 bytes).
> 
> So I propose we apply this patch for now, and then your patch only has
> to handle the case of serial number enabled. Makes sense?  If yes pls
> ack.

lguest uses the same workaround (since we don't support that feature anyway
at the moment).

Thanks!
Rusty.
Avi Kivity Sept. 22, 2009, 9:30 a.m. UTC | #8
On 09/21/2009 06:47 PM, john cooper wrote:
> Rusty Russell wrote:
>
>    
>> Too late, but no big deal.  We just add another feature bit for the
>> new method.  We can have a command virtqueue for queries such as this.
>>
>> (Interestingly, this same issue broke lguest, with the 8-bit limit on
>> configuration space sizes).
>>      
> I believe the existing guest visible interface
> is preservable as-is without introducing any
> ATA specific knowledge into the driver.  A
> trivial alternate R/O mapping for the page
> into the guest is all that is required.
>
> I ran out of time last week to tie this off
> but hopefully will do so shortly.
>
>    

Can we just read this page as a virtqueue command instead of having it 
mapped permanently?
john cooper Sept. 22, 2009, 2:21 p.m. UTC | #9
Avi Kivity wrote:
> On 09/21/2009 06:47 PM, john cooper wrote:
>> Rusty Russell wrote:
>>
>>   
>>> Too late, but no big deal.  We just add another feature bit for the
>>> new method.  We can have a command virtqueue for queries such as this.
>>>
>>> (Interestingly, this same issue broke lguest, with the 8-bit limit on
>>> configuration space sizes).
>>>      
>> I believe the existing guest visible interface
>> is preservable as-is without introducing any
>> ATA specific knowledge into the driver.  A
>> trivial alternate R/O mapping for the page
>> into the guest is all that is required.
>>
>> I ran out of time last week to tie this off
>> but hopefully will do so shortly.
>>
>>    
> 
> Can we just read this page as a virtqueue command instead of having it
> mapped permanently?

Probably although I hadn't looked specifically
at doing so.   Mapping the data via an unused
pci bar is pretty trivial and seemed minimally
intrusive to the existing driver.

-john
Avi Kivity Sept. 22, 2009, 2:27 p.m. UTC | #10
On 09/22/2009 05:21 PM, john cooper wrote:
>> Can we just read this page as a virtqueue command instead of having it
>> mapped permanently?
>>      
> Probably although I hadn't looked specifically
> at doing so.   Mapping the data via an unused
> pci bar is pretty trivial and seemed minimally
> intrusive to the existing driver.
>    

We'll run out of bars if we expend them like that.
Michael S. Tsirkin Sept. 22, 2009, 2:41 p.m. UTC | #11
On Tue, Sep 22, 2009 at 05:27:22PM +0300, Avi Kivity wrote:
> On 09/22/2009 05:21 PM, john cooper wrote:
>>> Can we just read this page as a virtqueue command instead of having it
>>> mapped permanently?
>>>      
>> Probably although I hadn't looked specifically
>> at doing so.   Mapping the data via an unused
>> pci bar is pretty trivial and seemed minimally
>> intrusive to the existing driver.
>>    
>
> We'll run out of bars if we expend them like that.

BAR1 is used for MSI-X already (if MSI-X is enabled), you can just add
your data there. I would report the offset within the BAR somewhere
in io space or in configuration space, so that we can relocate tables
later if we like without changing guests.

> -- 
> error compiling committee.c: too many arguments to function
Avi Kivity Sept. 22, 2009, 2:45 p.m. UTC | #12
On 09/22/2009 05:41 PM, Michael S. Tsirkin wrote:
>>>
>>> Probably although I hadn't looked specifically
>>> at doing so.   Mapping the data via an unused
>>> pci bar is pretty trivial and seemed minimally
>>> intrusive to the existing driver.
>>>
>>>        
>> We'll run out of bars if we expend them like that.
>>      
> BAR1 is used for MSI-X already (if MSI-X is enabled), you can just add
> your data there. I would report the offset within the BAR somewhere
> in io space or in configuration space, so that we can relocate tables
> later if we like without changing guests.
>    

I'd really like to avoid such entanglements.
john cooper Sept. 22, 2009, 3:09 p.m. UTC | #13
Avi Kivity wrote:
> On 09/22/2009 05:21 PM, john cooper wrote:
>>> Can we just read this page as a virtqueue command instead of having it
>>> mapped permanently?
>>>      
>> Probably although I hadn't looked specifically
>> at doing so.   Mapping the data via an unused
>> pci bar is pretty trivial and seemed minimally
>> intrusive to the existing driver.
>>    
> 
> We'll run out of bars if we expend them like that.

Agreed.  However my motivation here was to use a
single additional bar specifically to compensate
for the PCI spec imposed 256 byte size limitation
of the config space mapping.  As we're defining the
content/size of this area, future use to accommodate
additional data should be unrestricted.

If using an additional bar is still a concern I
can certainly have a look at accomplishing the
same via virtqueue scaffolding.

-john
Anthony Liguori Sept. 23, 2009, 1:59 a.m. UTC | #14
john cooper wrote:
> Avi Kivity wrote:
>   
>> On 09/22/2009 05:21 PM, john cooper wrote:
>>     
>>>> Can we just read this page as a virtqueue command instead of having it
>>>> mapped permanently?
>>>>      
>>>>         
>>> Probably although I hadn't looked specifically
>>> at doing so.   Mapping the data via an unused
>>> pci bar is pretty trivial and seemed minimally
>>> intrusive to the existing driver.
>>>    
>>>       
>> We'll run out of bars if we expend them like that.
>>     
>
> Agreed.  However my motivation here was to use a
> single additional bar specifically to compensate
> for the PCI spec imposed 256 byte size limitation
> of the config space mapping.  As we're defining the
> content/size of this area, future use to accommodate
> additional data should be unrestricted.
>   

Why expose the whole ATAPI page instead of just the serial number?

I think the proper solution is to move the config to a separate bar 
that's MMIO instead of PIO.  config access is never performance 
sensitive and an MMIO bar has less restrictions on size.

Regards,

Anthony Liguori
john cooper Sept. 23, 2009, 4:56 a.m. UTC | #15
Anthony Liguori wrote:
> john cooper wrote:
>> Avi Kivity wrote:
>>  
>>> On 09/22/2009 05:21 PM, john cooper wrote:
>>>    
>>>>> Can we just read this page as a virtqueue command instead of having it
>>>>> mapped permanently?
>>>>>              
>>>> Probably although I hadn't looked specifically
>>>> at doing so.   Mapping the data via an unused
>>>> pci bar is pretty trivial and seemed minimally
>>>> intrusive to the existing driver.
>>>>          
>>> We'll run out of bars if we expend them like that.
>>>     
>>
>> Agreed.  However my motivation here was to use a
>> single additional bar specifically to compensate
>> for the PCI spec imposed 256 byte size limitation
>> of the config space mapping.  As we're defining the
>> content/size of this area, future use to accommodate
>> additional data should be unrestricted.
>>   
> 
> Why expose the whole ATAPI page instead of just the serial number?

Exposing the s/n alone was what I'd done originally
which in retrospect was well contained within the
pci config area -- even if squandering the somewhat
limited space in that area.

However exposing it as an identify page allows use
of the existing HDIO_GET_IDENTITY ioctl (and existing
use of that interface, eg: "hdparm -i") as a means to
extract the s/n along with potentially useful related
data in the page.

Creation of the identify structure is within qemu
as that is where the information exists.  And this
allows the virtio_blk driver to treat the data as
opaque, simply exporting it uninterpreted via a copyout.
So the only real issue remaining is how to best saw
a hole between qemu and the guest virtio_blk driver
to transfer the identity data.

> I think the proper solution is to move the config to a separate bar 
> that's MMIO instead of PIO.  config access is never performance 
> sensitive and an MMIO bar has less restrictions on size.

That's exactly what I've done.  I'll clean up the
patch and post it here so we can have a more concrete
discussion.

-john
john cooper Sept. 29, 2009, 6:09 a.m. UTC | #16
This is a correction to the previous version
which violated the PCI config space 256 byte
limitation.

The crux of the feature is simply passing a qemu
received virtio serial number to/through the guest
driver and making it available to guest userspace
via a preexisting interface.

To accomplish this, ATA IDENTIFY data as implemented
by the HDIO_GET_IDENTITY ioctl is returned to the
caller for the virtio_blk drive of interest.  Content
of this identify structure is created by qemu and
passed wholesale to guest userspace without
interpretation by the guest driver.

The change this patch implements is passing of the
identify data through a mapping established by
PCI BAR #5 rather than the PCI config area, the
latter of which resulted in the above breakage.

virtio_blk with this patch now pulls the information
from the mapped bar area and includes minimal
scaffolding to help map bars external to virtio_pci.

-john
Michael S. Tsirkin Sept. 29, 2009, 6:58 a.m. UTC | #17
On Tue, Sep 29, 2009 at 02:09:30AM -0400, john cooper wrote:
> This is a correction to the previous version
> which violated the PCI config space 256 byte
> limitation.
> 
> The crux of the feature is simply passing a qemu
> received virtio serial number to/through the guest
> driver and making it available to guest userspace
> via a preexisting interface.
> 
> To accomplish this, ATA IDENTIFY data as implemented
> by the HDIO_GET_IDENTITY ioctl is returned to the
> caller for the virtio_blk drive of interest.  Content
> of this identify structure is created by qemu and
> passed wholesale to guest userspace without
> interpretation by the guest driver.
> 
> The change this patch implements is passing of the
> identify data through a mapping established by
> PCI BAR #5 rather than the PCI config area, the
> latter of which resulted in the above breakage.

Using bar per feature we'll quickly run out of BARs.
We already use a BAR for MSI-X - let's add
ATA identity there?

> virtio_blk with this patch now pulls the information
> from the mapped bar area and includes minimal
> scaffolding to help map bars external to virtio_pci.
> 
> -john
> 
> 
> -- 
> john.cooper@redhat.com
Avi Kivity Sept. 29, 2009, 7:22 a.m. UTC | #18
On 09/29/2009 08:58 AM, Michael S. Tsirkin wrote:
> Using bar per feature we'll quickly run out of BARs.
> We already use a BAR for MSI-X - let's add
> ATA identity there?
>    

Mixing unrelated features will quickly cause confusion.
Michael S. Tsirkin Sept. 29, 2009, 8:54 a.m. UTC | #19
On Tue, Sep 29, 2009 at 09:22:53AM +0200, Avi Kivity wrote:
> On 09/29/2009 08:58 AM, Michael S. Tsirkin wrote:
>> Using bar per feature we'll quickly run out of BARs.
>> We already use a BAR for MSI-X - let's add
>> ATA identity there?
>>    
> Mixing unrelated features will quickly cause confusion.

Note if we ever switch to 64 bit BARs, we'll only have 3 of these
available, so the ID would be using the last free one.
We can just as well plan ahead for when we'll add another feature?

I agree fixed offsets are messy though.  I previously proposed storing
the offset to the identity data in an i/o register. This will let each
implementation lay out this data in an optimal manner, without
confusion.

> -- 
> Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Avi Kivity Sept. 29, 2009, 9:16 a.m. UTC | #20
On 09/29/2009 10:54 AM, Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2009 at 09:22:53AM +0200, Avi Kivity wrote:
>    
>> On 09/29/2009 08:58 AM, Michael S. Tsirkin wrote:
>>      
>>> Using bar per feature we'll quickly run out of BARs.
>>> We already use a BAR for MSI-X - let's add
>>> ATA identity there?
>>>
>>>        
>> Mixing unrelated features will quickly cause confusion.
>>      
> Note if we ever switch to 64 bit BARs, we'll only have 3 of these
> available, so the ID would be using the last free one.
> We can just as well plan ahead for when we'll add another feature?
>
> I agree fixed offsets are messy though.  I previously proposed storing
> the offset to the identity data in an i/o register. This will let each
> implementation lay out this data in an optimal manner, without
> confusion.
>    

Yes, that works much better.
Anthony Liguori Sept. 29, 2009, 1:51 p.m. UTC | #21
john cooper wrote:
> This is a correction to the previous version
> which violated the PCI config space 256 byte
> limitation.
>
> The crux of the feature is simply passing a qemu
> received virtio serial number to/through the guest
> driver and making it available to guest userspace
> via a preexisting interface.
>
> To accomplish this, ATA IDENTIFY data as implemented
> by the HDIO_GET_IDENTITY ioctl is returned to the
> caller for the virtio_blk drive of interest.  Content
> of this identify structure is created by qemu and
> passed wholesale to guest userspace without
> interpretation by the guest driver.
>
> The change this patch implements is passing of the
> identify data through a mapping established by
> PCI BAR #5 rather than the PCI config area, the
> latter of which resulted in the above breakage.
>   

This is a massive layering violation.  The virtio-blk ABI cannot make 
demands of the transport.

The better solution would be to move the entire virtio-pci config space 
to a separate BAR that's an MMIO region.  Then there is no practical 
limit on the size of the config area.

Regards,

Anthony Liguori
Anthony Liguori Sept. 29, 2009, 1:55 p.m. UTC | #22
Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2009 at 09:22:53AM +0200, Avi Kivity wrote:
>   
>> On 09/29/2009 08:58 AM, Michael S. Tsirkin wrote:
>>     
>>> Using bar per feature we'll quickly run out of BARs.
>>> We already use a BAR for MSI-X - let's add
>>> ATA identity there?
>>>    
>>>       
>> Mixing unrelated features will quickly cause confusion.
>>     
>
> Note if we ever switch to 64 bit BARs, we'll only have 3 of these
> available, so the ID would be using the last free one.
> We can just as well plan ahead for when we'll add another feature?
>
> I agree fixed offsets are messy though.  I previously proposed storing
> the offset to the identity data in an i/o register. This will let each
> implementation lay out this data in an optimal manner, without
> confusion.
>   

1) There's no need to make the identity page separate from the config 
space.  The real problem is the config space is too small.  We can fix 
that without making any changes to virtio-blk by putting the config 
space in a separate BAR.

2) Passing an ATA identity page is goofy.  We should just pass the 
serial number and let Linux generate the identity page.  Just because 
Linux requires this as it's user space interface, that doesn't mean that 
other guests will (like Windows).  Instead of exposing an opaque blob, 
we should expose the information we need in a structured way.

Regards,

Anthony Liguori
Michael S. Tsirkin Sept. 29, 2009, 2:06 p.m. UTC | #23
On Tue, Sep 29, 2009 at 08:55:20AM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Sep 29, 2009 at 09:22:53AM +0200, Avi Kivity wrote:
>>   
>>> On 09/29/2009 08:58 AM, Michael S. Tsirkin wrote:
>>>     
>>>> Using bar per feature we'll quickly run out of BARs.
>>>> We already use a BAR for MSI-X - let's add
>>>> ATA identity there?
>>>>          
>>> Mixing unrelated features will quickly cause confusion.
>>>     
>>
>> Note if we ever switch to 64 bit BARs, we'll only have 3 of these
>> available, so the ID would be using the last free one.
>> We can just as well plan ahead for when we'll add another feature?
>>
>> I agree fixed offsets are messy though.  I previously proposed storing
>> the offset to the identity data in an i/o register. This will let each
>> implementation lay out this data in an optimal manner, without
>> confusion.
>>   
>
> 1) There's no need to make the identity page separate from the config  
> space.  The real problem is the config space is too small.  We can fix  
> that without making any changes to virtio-blk by putting the config  
> space in a separate BAR.

I don't think it's such a good idea.  We want to keep exposing most of
config space in i/o bar because of backwards compatibility.  And we want
to keep datapath operations such as vq kicks, in i/o space.


> 2) Passing an ATA identity page is goofy.  We should just pass the  
> serial number and let Linux generate the identity page.  Just because  
> Linux requires this as it's user space interface, that doesn't mean that  
> other guests will (like Windows).  Instead of exposing an opaque blob,  
> we should expose the information we need in a structured way.
>
> Regards,
>
> Anthony Liguori
Anthony Liguori Sept. 29, 2009, 2:14 p.m. UTC | #24
Michael S. Tsirkin wrote:
> I don't think it's such a good idea.  We want to keep exposing most of
> config space in i/o bar because of backwards compatibility.  And we want
> to keep datapath operations such as vq kicks, in i/o space.
>   

We can do feature negotiation in virtio-pci.  That lets us continue 
exposing the first 246 bytes of the config space in PIO and guests can 
negotiate a second bar for access to larger config spaces.

Regards,

Anthony Liguori
Avi Kivity Sept. 29, 2009, 4:22 p.m. UTC | #25
On 09/29/2009 03:51 PM, Anthony Liguori wrote:
>> The change this patch implements is passing of the
>> identify data through a mapping established by
>> PCI BAR #5 rather than the PCI config area, the
>> latter of which resulted in the above breakage.
>
>
> This is a massive layering violation.  The virtio-blk ABI cannot make 
> demands of the transport.

True.

>
> The better solution would be to move the entire virtio-pci config 
> space to a separate BAR that's an MMIO region.  Then there is no 
> practical limit on the size of the config area.

Don't some fast-paths accesses go through the config space?  Using mmio 
will slow them down.
Avi Kivity Sept. 29, 2009, 4:24 p.m. UTC | #26
On 09/29/2009 04:14 PM, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> I don't think it's such a good idea.  We want to keep exposing most of
>> config space in i/o bar because of backwards compatibility.  And we want
>> to keep datapath operations such as vq kicks, in i/o space.
>
> We can do feature negotiation in virtio-pci.  That lets us continue 
> exposing the first 246 bytes of the config space in PIO and guests can 
> negotiate a second bar for access to larger config spaces.

Shouldn't the guest's pci layer automatically adapt to what the card 
throws at it?

Of course we need to keep migration compatibility so we need to keep pio.
Michael S. Tsirkin Sept. 29, 2009, 4:30 p.m. UTC | #27
On Tue, Sep 29, 2009 at 09:14:56AM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> I don't think it's such a good idea.  We want to keep exposing most of
>> config space in i/o bar because of backwards compatibility.  And we want
>> to keep datapath operations such as vq kicks, in i/o space.
>>   
>
> We can do feature negotiation in virtio-pci.  That lets us continue  
> exposing the first 246 bytes of the config space in PIO and guests can  
> negotiate a second bar for access to larger config spaces.
>
> Regards,
>
> Anthony Liguori

guests can't negotiate a bar.

What I think we want to do
- Use i/o for small fields (2-4 bytes) as we did previously
  we won't run out anytime soon
- For large fields, put them in memory BAR1 and report the
  offset in i/o space
- For huge fields, provide a mailbox in i/o space where guest
  writes an offset and reads out a value
Anthony Liguori Sept. 29, 2009, 5:24 p.m. UTC | #28
Avi Kivity wrote:
> On 09/29/2009 03:51 PM, Anthony Liguori wrote:
>>> The change this patch implements is passing of the
>>> identify data through a mapping established by
>>> PCI BAR #5 rather than the PCI config area, the
>>> latter of which resulted in the above breakage.
>>
>>
>> This is a massive layering violation.  The virtio-blk ABI cannot make 
>> demands of the transport.
>
> True.
>
>>
>> The better solution would be to move the entire virtio-pci config 
>> space to a separate BAR that's an MMIO region.  Then there is no 
>> practical limit on the size of the config area.
>
> Don't some fast-paths accesses go through the config space?  Using 
> mmio will slow them down.

Not that I know of.

Regards,

Anthony Liguori
Anthony Liguori Sept. 29, 2009, 5:26 p.m. UTC | #29
Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2009 at 09:14:56AM -0500, Anthony Liguori wrote:
>   
>> Michael S. Tsirkin wrote:
>>     
>>> I don't think it's such a good idea.  We want to keep exposing most of
>>> config space in i/o bar because of backwards compatibility.  And we want
>>> to keep datapath operations such as vq kicks, in i/o space.
>>>   
>>>       
>> We can do feature negotiation in virtio-pci.  That lets us continue  
>> exposing the first 246 bytes of the config space in PIO and guests can  
>> negotiate a second bar for access to larger config spaces.
>>
>> Regards,
>>
>> Anthony Liguori
>>     
>
> guests can't negotiate a bar.
>   

The host always exposes the config in two places.  config[0..235] in 
bytes 20-255 in BAR0 and the full config space in BAR2 (or BAR1 + offset 
if desired).  Old drivers will just ignore the new config location.

> What I think we want to do
> - Use i/o for small fields (2-4 bytes) as we did previously
>   we won't run out anytime soon
> - For large fields, put them in memory BAR1 and report the
>   offset in i/o space
>   

This is at best a band aid as a large number of small fields can result 
in the same problem.

> - For huge fields, provide a mailbox in i/o space where guest
>   writes an offset and reads out a value
>   

Regards,

Anthony Liguori
Rusty Russell Sept. 29, 2009, 5:28 p.m. UTC | #30
On Tue, 29 Sep 2009 11:25:20 pm Anthony Liguori wrote:
> 2) Passing an ATA identity page is goofy.  We should just pass the 
> serial number and let Linux generate the identity page.  Just because 
> Linux requires this as it's user space interface, that doesn't mean that 
> other guests will (like Windows).  Instead of exposing an opaque blob, 
> we should expose the information we need in a structured way.

I think John did this on my prompting, because it already existed as a
defined ABI.  If someone is going to make stuff up, isn't it better that
kvm does it (some of those fields might have useful values?) than linux?

(I don't care: my patches stick with the current scheme, but changing is
fairly easy, but needs to be decided before commit).

Thanks,
Rusty,
Anthony Liguori Sept. 29, 2009, 5:31 p.m. UTC | #31
Rusty Russell wrote:
> On Tue, 29 Sep 2009 11:25:20 pm Anthony Liguori wrote:
>   
>> 2) Passing an ATA identity page is goofy.  We should just pass the 
>> serial number and let Linux generate the identity page.  Just because 
>> Linux requires this as it's user space interface, that doesn't mean that 
>> other guests will (like Windows).  Instead of exposing an opaque blob, 
>> we should expose the information we need in a structured way.
>>     
>
> I think John did this on my prompting, because it already existed as a
> defined ABI.  If someone is going to make stuff up, isn't it better that
> kvm does it (some of those fields might have useful values?) than linux?
>   

We should expose Linux ABI quirks in the virtio ABI.  Down the road, 
someone may decide that there's a better way to expose things like S/N 
down to userspace and then in our drivers we would have to parse the ATA 
identity page in order to figure out the S/N to expose that properly 
through the new interface.  Moreover, there's probably an OS out there 
that we already have to do that for.

That's a pretty ugly thing to have to do.

> (I don't care: my patches stick with the current scheme, but changing is
> fairly easy, but needs to be decided before commit).
>   

Everything is tremendously simplified if we just expose the S/N in the 
ABI since we can avoid solving the config size limitation.  I'm a big 
fan of deferring difficult problems when there's an immediate easy 
solution :-)

> Thanks,
> Rusty,
>
Michael S. Tsirkin Sept. 29, 2009, 5:31 p.m. UTC | #32
On Tue, Sep 29, 2009 at 12:26:13PM -0500, Anthony Liguori wrote:
> Michael S. Tsirkin wrote:
>> On Tue, Sep 29, 2009 at 09:14:56AM -0500, Anthony Liguori wrote:
>>   
>>> Michael S. Tsirkin wrote:
>>>     
>>>> I don't think it's such a good idea.  We want to keep exposing most of
>>>> config space in i/o bar because of backwards compatibility.  And we want
>>>> to keep datapath operations such as vq kicks, in i/o space.
>>>>         
>>> We can do feature negotiation in virtio-pci.  That lets us continue   
>>> exposing the first 246 bytes of the config space in PIO and guests 
>>> can  negotiate a second bar for access to larger config spaces.
>>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>     
>>
>> guests can't negotiate a bar.
>>   
>
> The host always exposes the config in two places.  config[0..235] in  
> bytes 20-255 in BAR0 and the full config space in BAR2 (or BAR1 + offset  
> if desired).  Old drivers will just ignore the new config location.
>
>> What I think we want to do
>> - Use i/o for small fields (2-4 bytes) as we did previously
>>   we won't run out anytime soon
>> - For large fields, put them in memory BAR1 and report the
>>   offset in i/o space
>>   
>
> This is at best a band aid as a large number of small fields can result  
> in the same problem.

It would have to be very large :)

>> - For huge fields, provide a mailbox in i/o space where guest
>>   writes an offset and reads out a value
>>   
>
> Regards,
>
> Anthony Liguori
john cooper Sept. 29, 2009, 6:44 p.m. UTC | #33
Anthony Liguori wrote:

> 2) Passing an ATA identity page is goofy.  We should just pass the 
> serial number and let Linux generate the identity page.

I have a hard time disagreeing with that comment.

But the rationale that got us here was from the
user's perspective this information could be
trivially exposed via a preexisting interface vs.
creating a new interface as I'd done initially.
It was also pointed out other defined fields in
the identify page exist which were potentially
useful.  So pressing a HDIO_GET_IDENTITY ioctl
into service here didn't seem unreasonable.

In terms of getting that bag of bits from qemu to
the guest userspace, having qemu cobble together
the page is arguably appropriate as that it where
the source of the information resides.  The
alternative of pushing knowledge of this structure
into the guest driver seems unnatural.  Moreover
we would still somehow need a means to get the
data from qemu to the driver even if the driver
was only packaging it up for export as an identify
page.

> Just because 
> Linux requires this as it's user space interface, that doesn't mean that 
> other guests will (like Windows).  Instead of exposing an opaque blob, 
> we should expose the information we need in a structured way.

The identify page is defined as the return of an
"identify drive" command from an ATA drive/controller.
I'd hazard it is a familiar structure in a windows
driver/userland context as well.

-john
Anthony Liguori Sept. 29, 2009, 8:55 p.m. UTC | #34
john cooper wrote:
> Anthony Liguori wrote:
>
>> 2) Passing an ATA identity page is goofy.  We should just pass the 
>> serial number and let Linux generate the identity page.
>
> I have a hard time disagreeing with that comment.
>
> But the rationale that got us here was from the
> user's perspective this information could be
> trivially exposed via a preexisting interface vs.
> creating a new interface as I'd done initially.
> It was also pointed out other defined fields in
> the identify page exist which were potentially
> useful.  So pressing a HDIO_GET_IDENTITY ioctl
> into service here didn't seem unreasonable.

virtio-blk isn't an ATA device so why pretend like it is?

We can present the information we think is useful to the guest in a 
concise format and then if the guest wants to pretend it's an ATA 
device, it can create the ATA identify page on the fly.

>> Just because Linux requires this as it's user space interface, that 
>> doesn't mean that other guests will (like Windows).  Instead of 
>> exposing an opaque blob, we should expose the information we need in 
>> a structured way.
>
> The identify page is defined as the return of an
> "identify drive" command from an ATA drive/controller.
> I'd hazard it is a familiar structure in a windows
> driver/userland context as well.

Regards,

Anthony Liguori
Rusty Russell Sept. 30, 2009, 1:12 a.m. UTC | #35
On Wed, 30 Sep 2009 03:01:12 am Anthony Liguori wrote:
> Rusty Russell wrote:
> > On Tue, 29 Sep 2009 11:25:20 pm Anthony Liguori wrote:
> >   
> >> 2) Passing an ATA identity page is goofy.  We should just pass the 
> >> serial number and let Linux generate the identity page.  Just because 
> >> Linux requires this as it's user space interface, that doesn't mean that 
> >> other guests will (like Windows).  Instead of exposing an opaque blob, 
> >> we should expose the information we need in a structured way.
> >>     
> >
> > I think John did this on my prompting, because it already existed as a
> > defined ABI.  If someone is going to make stuff up, isn't it better that
> > kvm does it (some of those fields might have useful values?) than linux?
> 
> We should expose Linux ABI quirks in the virtio ABI.  Down the road, 
> someone may decide that there's a better way to expose things like S/N 
> down to userspace and then in our drivers we would have to parse the ATA 
> identity page in order to figure out the S/N to expose that properly 
> through the new interface.  Moreover, there's probably an OS out there 
> that we already have to do that for.
> 
> That's a pretty ugly thing to have to do.

Calling the ATA IDENTIFY command a Linux ABI quirk is disingenuous.

Not sure what the SCSI equiv is to get a serial number (VPD?) But if everyone
is happy with 20 bytes, and John wants to do SN that way, I'll take the patch.

Still, I got to learn fun stuff about the block layer!
Rusty.
Rusty Russell Sept. 30, 2009, 1:19 a.m. UTC | #36
On Wed, 30 Sep 2009 06:25:39 am Anthony Liguori wrote:
> virtio-blk isn't an ATA device so why pretend like it is?

Do you want to maintain a "vdparm" to match hdparm and sdparm?

Confused,
Rusty.
Jamie Lokier Sept. 30, 2009, 1:22 a.m. UTC | #37
Rusty Russell wrote:
> Calling the ATA IDENTIFY command a Linux ABI quirk is disingenuous.
> 
> Not sure what the SCSI equiv is to get a serial number (VPD?) But if
> everyone is happy with 20 bytes, and John wants to do SN that way,
> I'll take the patch.

One thing I'd hope to see at some point is a "read-only" flag, set for
read-only images.  I don't know if ATA IDENTIFY has that yet, but I'm
sure the SCSI equivalent does.

-- Jamie
Anthony Liguori Sept. 30, 2009, 2:17 a.m. UTC | #38
Rusty Russell wrote:
> On Wed, 30 Sep 2009 06:25:39 am Anthony Liguori wrote:
>   
>> virtio-blk isn't an ATA device so why pretend like it is?
>>     
>
> Do you want to maintain a "vdparm" to match hdparm and sdparm?
>   

That's an argument to implement the HDIO_GET_IDENTITY ioctl in the Linux 
virtio-blk driver.  That's a totally reasonable thing to do.

However, baking something into the virtio-blk ABI because Linux requires 
an hdparm, sdparm, and potentially a vdparm doesn't seem like a good 
idea to me.  We should expose the information that makes sense for a 
virtio block device and if a guest wants to create an ATA identify 
response out of that data, more power to it.

Regards,

Anthony Liguori

> Confused,
> Rusty.
>
Paul Brook Sept. 30, 2009, 11:47 a.m. UTC | #39
> > But the rationale that got us here was from the
> > user's perspective this information could be
> > trivially exposed via a preexisting interface vs.
> > creating a new interface as I'd done initially.
> > It was also pointed out other defined fields in
> > the identify page exist which were potentially
> > useful.  So pressing a HDIO_GET_IDENTITY ioctl
> > into service here didn't seem unreasonable.
> 
> virtio-blk isn't an ATA device so why pretend like it is?

Agreed. If we're going to copy anything than my vote would be for SCSI. We 
already have partial support for feeding SCSI commands to a virtio device.
TBH I wouldn't have bothered with virtio-blk at all, and gone straight for 
virtio-scsi. I guess it's a bit late for that though.

Paul
Rusty Russell Sept. 30, 2009, noon UTC | #40
On Wed, 30 Sep 2009 11:47:00 am Anthony Liguori wrote:
> Rusty Russell wrote:
> > On Wed, 30 Sep 2009 06:25:39 am Anthony Liguori wrote:
> >   
> >> virtio-blk isn't an ATA device so why pretend like it is?
... 
> That's an argument to implement the HDIO_GET_IDENTITY ioctl in the Linux 
> virtio-blk driver.  That's a totally reasonable thing to do.

Your second sentence contradicts your first here.  If it's reasonable to
imitate ATA in Linux, it's reasonable to imitate ATA in virtio.

Using an existing standard should always be the default; one should need a
compelling reason *not* to do so.  In practice, (1) that code already exists
in kvm, and (2) it makes the Linux implementation trivial.

So, what are the real arguments?
(1) Some fields are redundant (eg. geometry) with existing fields already.
(2) Future fields will have to decide whether to use ATA IDENTIFY or normal
    features,
(3) We don't know enough about ATA IDENTIFY to do a decent job of filling
    it in.

I can buy (1), but I still have to deal with (2) and (3) if they're inside
the virtio_blk driver anyway.

Anything else?
Rusty.
Jamie Lokier Sept. 30, 2009, 6:04 p.m. UTC | #41
Rusty Russell wrote:
> On Wed, 30 Sep 2009 11:47:00 am Anthony Liguori wrote:
> > Rusty Russell wrote:
> > > On Wed, 30 Sep 2009 06:25:39 am Anthony Liguori wrote:
> > >   
> > >> virtio-blk isn't an ATA device so why pretend like it is?
> ... 
> > That's an argument to implement the HDIO_GET_IDENTITY ioctl in the Linux 
> > virtio-blk driver.  That's a totally reasonable thing to do.
> 
> Your second sentence contradicts your first here.  If it's reasonable to
> imitate ATA in Linux, it's reasonable to imitate ATA in virtio.
> 
> Using an existing standard should always be the default; one should need a
> compelling reason *not* to do so.  In practice, (1) that code already exists
> in kvm, and (2) it makes the Linux implementation trivial.
> 
> So, what are the real arguments?
> (1) Some fields are redundant (eg. geometry) with existing fields already.
> (2) Future fields will have to decide whether to use ATA IDENTIFY or normal
>     features,
> (3) We don't know enough about ATA IDENTIFY to do a decent job of filling
>     it in.
> 
> I can buy (1), but I still have to deal with (2) and (3) if they're inside
> the virtio_blk driver anyway.
> 
> Anything else?

(4) Putting it into the virtio_blk driver means another duplicate
implementation, and it might have to be duplicated another time in the
Windows virtio_blk driver too, if there's a Windows generic equivalent
to HDIO_GET_IDENTITY (I don't know if there is though).

(5) When the host's backing driver is itself a disk implementing the
ATA command, would you ever want the guest to see all details of the
identity page as close as possible to the host device sometimes?
I.e. is virtio_blk a way to accelerate block I/O to a host device, or
a way to abstract it to the bare minimum?

(6) Why don't we use the SCSI identity page instead?  Presumably
that's already in qemu/kvm too, for SCSI and USB disks :-)

-- Jamie
john cooper Oct. 5, 2009, 3:40 p.m. UTC | #42
Paul Brook wrote:

> Agreed. If we're going to copy anything than my vote would be for SCSI. We 
> already have partial support for feeding SCSI commands to a virtio device.
> TBH I wouldn't have bothered with virtio-blk at all, and gone straight for 
> virtio-scsi. I guess it's a bit late for that though.

Initially the patch introduced a dedicated ioctl
in the guest driver to retrieve this data.  Review
comments from that version suggested using an
existing interface which to me seemed reasonable.

I had looked first at implementing the above via
a scsi INQUIRY cmd.   But at least at that time
this option would have been more invasive to the
existing code vs. just doing the equivalent via
an ATA HDIO_GET_IDENTITY.

-john
john cooper Oct. 5, 2009, 3:41 p.m. UTC | #43
Jamie Lokier wrote:
> (5) When the host's backing driver is itself a disk implementing the
> ATA command, would you ever want the guest to see all details of the
> identity page as close as possible to the host device sometimes?
> I.e. is virtio_blk a way to accelerate block I/O to a host device, or
> a way to abstract it to the bare minimum?

Admittedly most of the data in the identify page
as seen by the host is fodder for the sake of
existing userspace utilities.  So I don't really
see any case where pass-through of host->guest
data is meaningful or useful.  The patch just
follows suit here with a similar mockup of this
data for the guest userland.

-john
john cooper Oct. 5, 2009, 3:44 p.m. UTC | #44
Rusty Russell wrote:
> Calling the ATA IDENTIFY command a Linux ABI quirk is disingenuous.

I think ATA has earned that moniker all by itself.

> Not sure what the SCSI equiv is to get a serial number (VPD?) But if everyone
> is happy with 20 bytes, and John wants to do SN that way, I'll take the patch.

I've been swayed from one side to the other of
this argument.  Still I'd begrudgingly give my
vote to the ATA option only because it reuses an
existing wart vs. introducing a new one.

Updated qemu patch follows leveraging your virtqueue
REQ_TYPE_SPECIAL scaffolding.  At this point only
the issue of this guest visible interface should
remain.

-john
diff mbox

Patch

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index a33eafb..3652c0a 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -27,6 +27,7 @@  typedef struct VirtIOBlock
     void *rq;
     char serial_str[BLOCK_SERIAL_STRLEN + 1];
     QEMUBH *bh;
+    size_t config_size;
 } VirtIOBlock;
 
 static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -362,7 +363,7 @@  static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
     virtio_identify_template(&blkcfg);
     memcpy(&blkcfg.identify[VIRTIO_BLK_ID_SN], s->serial_str,
         VIRTIO_BLK_ID_SN_BYTES);
-    memcpy(config, &blkcfg, sizeof(blkcfg));
+    memcpy(config, &blkcfg, s->config_size);
 }
 
 static uint32_t virtio_blk_get_features(VirtIODevice *vdev)
@@ -419,18 +420,21 @@  VirtIODevice *virtio_blk_init(DeviceState *dev, DriveInfo *dinfo)
     VirtIOBlock *s;
     int cylinders, heads, secs;
     static int virtio_blk_id;
-    char *ps;
+    char *ps = (char *)drive_get_serial(dinfo->bdrv);
+    size_t size = strlen(ps) ? sizeof(struct virtio_blk_config) :
+	    offsetof(struct virtio_blk_config, _blk_size);
 
     s = (VirtIOBlock *)virtio_common_init("virtio-blk", VIRTIO_ID_BLOCK,
-                                          sizeof(struct virtio_blk_config),
+                                          size,
                                           sizeof(VirtIOBlock));
 
+    s->config_size = size;
     s->vdev.get_config = virtio_blk_update_config;
     s->vdev.get_features = virtio_blk_get_features;
     s->vdev.reset = virtio_blk_reset;
     s->bs = dinfo->bdrv;
     s->rq = NULL;
-    if (strlen(ps = (char *)drive_get_serial(s->bs)))
+    if (strlen(ps))
         strncpy(s->serial_str, ps, sizeof(s->serial_str));
     else
         snprintf(s->serial_str, sizeof(s->serial_str), "0");