Message ID | 1448987005-28335-1-git-send-email-shmulik.ladkani@ravellosystems.com |
---|---|
State | New |
Headers | show |
On 12/01/2015 06:23 PM, Shmulik Ladkani wrote: > In 1811e64 'hw/virtio: Add PCIe capability to virtio devices', the > QEMU_PCI_CAP_EXPRESS capability was added to virtio's pci_dev, within > 'virtio_pci_realize' - the pci device object realization method. > > This occurs to late, as 'pci_qdev_realize' (DeviceClass.realize of > TYPE_PCI_DEVICE) has already been called, without knowing that the > device instance is indeed an "express" instance, thus allocating > insufficient pci config space. > > As a result, device may crash upon attempt to write to the PCIE config > space. > > Fix, by arming the QEMU_PCI_CAP_EXPRESS capability early in virtio-pci's > own DeviceClass realize method. > > This also makes code cleaner, as 'virtio_pci_realize' may now access the > 'pci_is_express' predicate when needed. > > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> > --- > hw/virtio/virtio-pci.c | 24 +++++++++++++++++++----- > hw/virtio/virtio-pci.h | 1 + > 2 files changed, 20 insertions(+), 5 deletions(-) > > diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c > index dd48562..cee1b7b 100644 > --- a/hw/virtio/virtio-pci.c > +++ b/hw/virtio/virtio-pci.c > @@ -1814,13 +1814,10 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp) > > address_space_init(&proxy->modern_as, &proxy->modern_cfg, "virtio-pci-cfg-as"); > > - if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) > - && !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN) > - && pci_bus_is_express(pci_dev->bus) > - && !pci_bus_is_root(pci_dev->bus)) { > + if (pci_is_express(pci_dev) && pci_bus_is_express(pci_dev->bus) && > + !pci_bus_is_root(pci_dev->bus)) { > int pos; Hi, Here you should check only for 'pci_is_express(pci_dev)' . > > - pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; > pos = pcie_endpoint_cap_init(pci_dev, 0); > assert(pos > 0); > > @@ -1879,10 +1876,25 @@ static Property virtio_pci_properties[] = { > DEFINE_PROP_END_OF_LIST(), > }; > > +static void virtio_pci_dc_realize(DeviceState *qdev, Error **errp) > +{ > + VirtioPCIClass *vpciklass = VIRTIO_PCI_GET_CLASS(qdev); > + VirtIOPCIProxy *proxy = VIRTIO_PCI(qdev); > + PCIDevice *pci_dev = &proxy->pci_dev; > + > + if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) && > + !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN)) { > + pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; And here you should also check: pci_bus_is_express(pci_dev->bus) && !pci_bus_is_root(pci_dev->bus)) The reason is the device becomes express only if *all* the conditions are met. > + } > + > + vpciklass->saved_dc_realize(qdev, errp); > +} > + > static void virtio_pci_class_init(ObjectClass *klass, void *data) > { > DeviceClass *dc = DEVICE_CLASS(klass); > PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); > + VirtioPCIClass *vpciklass = VIRTIO_PCI_CLASS(klass); > > dc->props = virtio_pci_properties; > k->realize = virtio_pci_realize; > @@ -1890,6 +1902,8 @@ static void virtio_pci_class_init(ObjectClass *klass, void *data) > k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; > k->revision = VIRTIO_PCI_ABI_VERSION; > k->class_id = PCI_CLASS_OTHERS; > + vpciklass->saved_dc_realize = dc->realize; > + dc->realize = virtio_pci_dc_realize; > dc->reset = virtio_pci_reset; > } > > diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h > index ffb74bb..f18e948 100644 > --- a/hw/virtio/virtio-pci.h > +++ b/hw/virtio/virtio-pci.h > @@ -105,6 +105,7 @@ typedef struct { > > typedef struct VirtioPCIClass { > PCIDeviceClass parent_class; > + DeviceRealize saved_dc_realize; I would change the name to parent_realize :) Also please add "for-2.5" to prefix. Thanks for posting it! Marcel > void (*realize)(VirtIOPCIProxy *vpci_dev, Error **errp); > } VirtioPCIClass; > >
Hi, On Tue, 1 Dec 2015 18:36:39 +0200 Marcel Apfelbaum <marcel@redhat.com> wrote: > > + if (pci_is_express(pci_dev) && pci_bus_is_express(pci_dev->bus) && > > + !pci_bus_is_root(pci_dev->bus)) { > > int pos; > > Here you should check only for 'pci_is_express(pci_dev)' . [snip] > > +static void virtio_pci_dc_realize(DeviceState *qdev, Error **errp) > > +{ > > + VirtioPCIClass *vpciklass = VIRTIO_PCI_GET_CLASS(qdev); > > + VirtIOPCIProxy *proxy = VIRTIO_PCI(qdev); > > + PCIDevice *pci_dev = &proxy->pci_dev; > > + > > + if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) && > > + !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN)) { > > + pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; > > And here you should also check: > pci_bus_is_express(pci_dev->bus) && !pci_bus_is_root(pci_dev->bus)) > > The reason is the device becomes express only if *all* the conditions > are met. I'm ok with either approaches. However it seems common practice to set QEMU_PCI_CAP_EXPRESS unconditionally for PCIE devices. The few existing PCIE devices do so by assigning their PCIDeviceClass.is_express to 1 within their 'class_init', regardless the properties of the bus their on. (e.g. xhci_class_init, megasas_class_init, vfio_pci_dev_class_init, nvme_class_init, and more) Some devices later call pcie_endpoint_cap_init conditionally. (e.g. usb_xhci_realize). Can you please examine this and let me know the preferred approach? > > + DeviceRealize saved_dc_realize; > > I would change the name to parent_realize :) Sure.
On 12/01/2015 09:30 PM, Shmulik Ladkani wrote: > Hi, > > On Tue, 1 Dec 2015 18:36:39 +0200 Marcel Apfelbaum <marcel@redhat.com> wrote: >>> + if (pci_is_express(pci_dev) && pci_bus_is_express(pci_dev->bus) && >>> + !pci_bus_is_root(pci_dev->bus)) { >>> int pos; >> >> Here you should check only for 'pci_is_express(pci_dev)' . > > [snip] > >>> +static void virtio_pci_dc_realize(DeviceState *qdev, Error **errp) >>> +{ >>> + VirtioPCIClass *vpciklass = VIRTIO_PCI_GET_CLASS(qdev); >>> + VirtIOPCIProxy *proxy = VIRTIO_PCI(qdev); >>> + PCIDevice *pci_dev = &proxy->pci_dev; >>> + >>> + if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) && >>> + !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN)) { >>> + pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; >> >> And here you should also check: >> pci_bus_is_express(pci_dev->bus) && !pci_bus_is_root(pci_dev->bus)) >> >> The reason is the device becomes express only if *all* the conditions >> are met. > > I'm ok with either approaches. > > However it seems common practice to set QEMU_PCI_CAP_EXPRESS > unconditionally for PCIE devices. > > The few existing PCIE devices do so by assigning their > PCIDeviceClass.is_express to 1 within their 'class_init', regardless the > properties of the bus their on. > (e.g. xhci_class_init, megasas_class_init, vfio_pci_dev_class_init, > nvme_class_init, and more) > > Some devices later call pcie_endpoint_cap_init conditionally. > (e.g. usb_xhci_realize). > > Can you please examine this and let me know the preferred approach? Yes, I saw that..., as always not a walk in the park. - So we have "is_express = true" <=> QEMU_PCI_CAP_EXPRESS on <=> "config size = PCIe" - Not related to the above (!!), if (some condition) => add PCIe express capability (megasas is the exception) Let's take "usb_xhci": - If we put it under a PCI bus it will not be an express device, but it will have a "big" config space. Also pci_is_express(dev) will still return true! - This is probably a bug. (or I am missing something) NVME: - simple, always PCIe Now let's see vfio-pci: - is_express = true (with the comment: we might be) => PCIe config - vfio_populate_device => checks actual register (I think), if not PCIe, rewinds it : vdev->config_size = reg_info.size; if (vdev->config_size == PCI_CONFIG_SPACE_SIZE) { vdev->pdev.cap_present &= ~QEMU_PCI_CAP_EXPRESS; } - better (we still "loose" the space, but at least pci_is_express will return false) Now virtio case: - If we split the conditions into 2 parts we would have usb_xhci issues: - PCIe config space for a PCI device if *some* conditions are not met. - pci_is_express will return true when we don't want that. If you see a reason to split, please do, I only see problems :) Our solution to make it "clean" is to not mark the class as "is_express", but hijack realize method and add our "conditions" before calling it. A more elegant solution would be to make is_express a method and let the subclasses implement it: - vfio will look for the actual device config space - NVME will return true - usb_xhci will condition this on the bus type - virtio will have its own conditions. But this is not 2.5 material. I hope I helped, Thanks for getting involved. Marcel > >>> + DeviceRealize saved_dc_realize; >> >> I would change the name to parent_realize :) > > Sure. >
Thanks Marcel, On Tue, 1 Dec 2015 22:46:33 +0200, marcel@redhat.com wrote: > >> The reason is the device becomes express only if *all* the conditions > >> are met. > > > > I'm ok with either approaches. > > > > However it seems common practice to set QEMU_PCI_CAP_EXPRESS > > unconditionally for PCIE devices. > > > > The few existing PCIE devices do so by assigning their > > PCIDeviceClass.is_express to 1 within their 'class_init', regardless the > > properties of the bus their on. > > (e.g. xhci_class_init, megasas_class_init, vfio_pci_dev_class_init, > > nvme_class_init, and more) > > > > Some devices later call pcie_endpoint_cap_init conditionally. > > (e.g. usb_xhci_realize). > > > > Can you please examine this and let me know the preferred approach? > > Yes, I saw that..., as always not a walk in the park. > > - So we have "is_express = true" <=> QEMU_PCI_CAP_EXPRESS on <=> "config size = PCIe" > - Not related to the above (!!), if (some condition) => add PCIe express capability > (megasas is the exception) > > Let's take "usb_xhci": > - If we put it under a PCI bus it will not be an express device, but > it will have a "big" config space. Also pci_is_express(dev) will still return true! > - This is probably a bug. (or I am missing something) I actually assumed this is the right behavior. A device class reports whether its instances *could* be pcie by arming its PCIDeviceClass.is_express. As such, the "big" config space is allocated for the instance. This is harmless. Such a device may (or may not) be connected to a pcie bus, and only if so, we report it is a pcie endpoint. Also, pcie_add_capability is allowed on that device, in order to setup whatever capabilities on its pcie config space (even if finally not on a pcie bus). Moreover, VMSTATE_PCIE_DEVICE (which uses vmstate_pcie_device rather than vmstate_pci_device) can be used for that device's VMStateDescription fields *without* worrying whether the actual config space is "big" or "small". Otherwise one should examine whether vmstate_pcie_device or vmstate_pci_device need to be used. Seems tedious. This is the reasoning I can think of, why assigning QEMU_PCI_CAP_EXPRESS and reporting pcie_endpoint_cap_init are not tightly coupled. Indeed, no strict solution here, both approaches seem reasoanble (and both are used!). WDYT? Is my above interpretation makes sense? Regards, Shmulik
On 12/02/2015 10:01 AM, Shmulik Ladkani wrote: > Thanks Marcel, > > On Tue, 1 Dec 2015 22:46:33 +0200, marcel@redhat.com wrote: >>>> The reason is the device becomes express only if *all* the conditions >>>> are met. >>> >>> I'm ok with either approaches. >>> >>> However it seems common practice to set QEMU_PCI_CAP_EXPRESS >>> unconditionally for PCIE devices. >>> >>> The few existing PCIE devices do so by assigning their >>> PCIDeviceClass.is_express to 1 within their 'class_init', regardless the >>> properties of the bus their on. >>> (e.g. xhci_class_init, megasas_class_init, vfio_pci_dev_class_init, >>> nvme_class_init, and more) >>> >>> Some devices later call pcie_endpoint_cap_init conditionally. >>> (e.g. usb_xhci_realize). >>> >>> Can you please examine this and let me know the preferred approach? >> >> Yes, I saw that..., as always not a walk in the park. >> >> - So we have "is_express = true" <=> QEMU_PCI_CAP_EXPRESS on <=> "config size = PCIe" >> - Not related to the above (!!), if (some condition) => add PCIe express capability >> (megasas is the exception) >> >> Let's take "usb_xhci": >> - If we put it under a PCI bus it will not be an express device, but >> it will have a "big" config space. Also pci_is_express(dev) will still return true! >> - This is probably a bug. (or I am missing something) > > I actually assumed this is the right behavior. > > A device class reports whether its instances *could* be pcie by arming > its PCIDeviceClass.is_express. > As such, the "big" config space is allocated for the instance. This is > harmless. > > Such a device may (or may not) be connected to a pcie bus, and only if > so, we report it is a pcie endpoint. > > Also, pcie_add_capability is allowed on that device, in order to setup > whatever capabilities on its pcie config space (even if finally not on a > pcie bus). > > Moreover, VMSTATE_PCIE_DEVICE (which uses vmstate_pcie_device rather > than vmstate_pci_device) can be used for that device's > VMStateDescription fields *without* worrying whether the actual config > space is "big" or "small". > Otherwise one should examine whether vmstate_pcie_device or > vmstate_pci_device need to be used. Seems tedious. > > This is the reasoning I can think of, why assigning QEMU_PCI_CAP_EXPRESS > and reporting pcie_endpoint_cap_init are not tightly coupled. I agree it may be the reason, but that does not make it right. I still see two *possible* problems: 1. Pci config space is guest visible. The guest can read/write to a place it shouldn't. I don't know if is a *real* issue, but it needs checking. 2. We still have pci_is_express returning true, this is error prone because one can use this function assuming the device is express. Maybe we should call it "can_be_express" ? If the migration construct (VMSTATE) is *the only* reason for doing this, maybe is not a good enough reason (I am not the one to decide :) ). Is still it seems a little off to me. If you think this is good enough, you can simply do the same: - Instead of replacing the realize method, just advertise it with "is_express" (meaning it can be express) - Leave all the conditions as they were in prev patch. As a result, the pci config space will have the right length. The consequences are obvious now, if virtio/pci maintainers are OK with that, so am I. Thanks, Marcel > > Indeed, no strict solution here, both approaches seem reasoanble (and > both are used!). > > WDYT? Is my above interpretation makes sense? > > Regards, > Shmulik >
Hi, On Wed, 2 Dec 2015 11:51:46 +0200, marcel@redhat.com wrote: > 2. We still have pci_is_express returning true, this is error prone because > one can use this function assuming the device is express. Maybe we should > call it "can_be_express" ? > > If you think this is good enough, you can simply do the same: > - Instead of replacing the realize method, just advertise it with > "is_express" (meaning it can be express) > - Leave all the conditions as they were in prev patch. > As a result, the pci config space will have the right length. Oh but we can't do so, as the change of config space size is guest visible and breaks migration; it must depend on your x-pcie-disable flag :) As I can't decide what's better, I'm following your initial suggestion and submit for maintainers to review. However, do note that there are few more evidence that 'pci_is_express' is true while not necessarily placed on a pcie bus: - pcie_endpoint_cap_init: it tests for 'pci_bus_is_express' although 'dev' is guaranteed to be 'pci_is_express' (assertion in pcie_cap_init) - 058fdcf 'xhci: add endpoint cap on express bus only' Thanks, Shmulik
On 12/02/2015 03:30 PM, Shmulik Ladkani wrote: > Hi, > > On Wed, 2 Dec 2015 11:51:46 +0200, marcel@redhat.com wrote: >> 2. We still have pci_is_express returning true, this is error prone because >> one can use this function assuming the device is express. Maybe we should >> call it "can_be_express" ? >> >> If you think this is good enough, you can simply do the same: >> - Instead of replacing the realize method, just advertise it with >> "is_express" (meaning it can be express) >> - Leave all the conditions as they were in prev patch. >> As a result, the pci config space will have the right length. > > Oh but we can't do so, as the change of config space size is guest > visible and breaks migration; it must depend on your x-pcie-disable > flag :) Indeed, we need at least to condition it on x-pcie-disable. > > As I can't decide what's better, I'm following your initial suggestion > and submit for maintainers to review. Sure, and thanks for the patience to get to the bottom of it. > > However, do note that there are few more evidence that 'pci_is_express' > is true while not necessarily placed on a pcie bus: and this is scary ... we really should call it "pci_can_be_express" > - pcie_endpoint_cap_init: > it tests for 'pci_bus_is_express' although 'dev' is guaranteed to be > 'pci_is_express' (assertion in pcie_cap_init) > - 058fdcf 'xhci: add endpoint cap on express bus only' Thanks, Marcel > > Thanks, > Shmulik >
Hi, On Wed, 2 Dec 2015 16:00:41 +0200, marcel@redhat.com wrote: > > As I can't decide what's better, I'm following your initial suggestion > > and submit for maintainers to review. > > Sure, and thanks for the patience to get to the bottom of it. Sorry, your initial suggestion testing 'pci_bus_is_express() && !pci_bus_is_root()' within 'virtio_pci_dc_realize' will not do the job as in that place pci_dev->bus is still NULL... It's only assigned at pci_qdev_realize. I think this is the main reason for separation. The classes themselves describe the type of the device, which is either pci or pcie, using k->is_express. The class code can't tell whether the actual device instance gets attached to a pci or pcie bus. So yes, 'pci_is_express' really states whether device instance *can* be an pcie device. Thus, I'm going with my initial suggestion, plus minor naming changes. Many thanks!
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c index dd48562..cee1b7b 100644 --- a/hw/virtio/virtio-pci.c +++ b/hw/virtio/virtio-pci.c @@ -1814,13 +1814,10 @@ static void virtio_pci_realize(PCIDevice *pci_dev, Error **errp) address_space_init(&proxy->modern_as, &proxy->modern_cfg, "virtio-pci-cfg-as"); - if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) - && !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN) - && pci_bus_is_express(pci_dev->bus) - && !pci_bus_is_root(pci_dev->bus)) { + if (pci_is_express(pci_dev) && pci_bus_is_express(pci_dev->bus) && + !pci_bus_is_root(pci_dev->bus)) { int pos; - pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; pos = pcie_endpoint_cap_init(pci_dev, 0); assert(pos > 0); @@ -1879,10 +1876,25 @@ static Property virtio_pci_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void virtio_pci_dc_realize(DeviceState *qdev, Error **errp) +{ + VirtioPCIClass *vpciklass = VIRTIO_PCI_GET_CLASS(qdev); + VirtIOPCIProxy *proxy = VIRTIO_PCI(qdev); + PCIDevice *pci_dev = &proxy->pci_dev; + + if (!(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_PCIE) && + !(proxy->flags & VIRTIO_PCI_FLAG_DISABLE_MODERN)) { + pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS; + } + + vpciklass->saved_dc_realize(qdev, errp); +} + static void virtio_pci_class_init(ObjectClass *klass, void *data) { DeviceClass *dc = DEVICE_CLASS(klass); PCIDeviceClass *k = PCI_DEVICE_CLASS(klass); + VirtioPCIClass *vpciklass = VIRTIO_PCI_CLASS(klass); dc->props = virtio_pci_properties; k->realize = virtio_pci_realize; @@ -1890,6 +1902,8 @@ static void virtio_pci_class_init(ObjectClass *klass, void *data) k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; k->revision = VIRTIO_PCI_ABI_VERSION; k->class_id = PCI_CLASS_OTHERS; + vpciklass->saved_dc_realize = dc->realize; + dc->realize = virtio_pci_dc_realize; dc->reset = virtio_pci_reset; } diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h index ffb74bb..f18e948 100644 --- a/hw/virtio/virtio-pci.h +++ b/hw/virtio/virtio-pci.h @@ -105,6 +105,7 @@ typedef struct { typedef struct VirtioPCIClass { PCIDeviceClass parent_class; + DeviceRealize saved_dc_realize; void (*realize)(VirtIOPCIProxy *vpci_dev, Error **errp); } VirtioPCIClass;
In 1811e64 'hw/virtio: Add PCIe capability to virtio devices', the QEMU_PCI_CAP_EXPRESS capability was added to virtio's pci_dev, within 'virtio_pci_realize' - the pci device object realization method. This occurs to late, as 'pci_qdev_realize' (DeviceClass.realize of TYPE_PCI_DEVICE) has already been called, without knowing that the device instance is indeed an "express" instance, thus allocating insufficient pci config space. As a result, device may crash upon attempt to write to the PCIE config space. Fix, by arming the QEMU_PCI_CAP_EXPRESS capability early in virtio-pci's own DeviceClass realize method. This also makes code cleaner, as 'virtio_pci_realize' may now access the 'pci_is_express' predicate when needed. Signed-off-by: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> --- hw/virtio/virtio-pci.c | 24 +++++++++++++++++++----- hw/virtio/virtio-pci.h | 1 + 2 files changed, 20 insertions(+), 5 deletions(-)