diff mbox

[v16,00/14] vfio-pci: pass the aer error to guest

Message ID 56B1C029.3080901@cn.fujitsu.com
State New
Headers show

Commit Message

chenfan Feb. 3, 2016, 8:54 a.m. UTC
On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>
>> For now, for vfio pci passthough devices when qemu receives
>> an error from host aer report, currentlly just terminate the guest,
>> but usually user want to know what error occurred but stopping the
>> guest, so this patches add aer capability support for vfio device,
>> and pass the error to guest, and have guest driver to recover
>> from the error.
> I would like to see a version of this patchset that doesn't
> depend on pci core changes.
> I think that if you make this simplifying assumption:
>
> - all devices on same bus in guest are on same bus in host
>
> then you can handle both reset and hotplug simply in function 0
> since it will belong to vfio.
>
> So we can have a version without pci core changes that simply assumes
> this, and things will just work.
>
>
> Now, if we wanted to enforce this limitation, I think the
> cleanest way would be to add a callback in struct PCIDevice:
>
> 	bool is_valid_function(PCIDevice *newfunction)
>
> and call it as each function is added.
> This way aer function can validate that each function
> added shares the same bus.
> And this way issues will be detected directly and not when
> function 0 is added.
>
> I would prefer this validation code to be a patch on top so we can merge
> the functionality directly and avoid blocking it while we figure out the
> best api to validate things.
>
> I don't see why making guest topology match host would
> ever be a problem, but if it's required to support
> configurations where these differ, I'd like to see
> an attempt to address that be split out, after aer
> is supported.
Hi Michael,

Just think about this more,  I think we also should check the vfio
devices whether on the same bus at the time of function 0 is added.
because we don't know the affected devices by a bus reset have
already all been assigned to VM. for example, the multi-function's hotplug.
devices on same bus in host are added to VM one by one. when we
test one device, we haven't yet added the other devices. so I think
the patch should like below. then we could add a vfio_is_valid_function 
in vfio
to test each device whether the affected devices on the same bus.

Thanks,
Chen




>
>
>
>> v15-v16:
>>     10/14, 11/14 are new to introduce a reset sequence id to specify the
>>     vfio devices has been reset for that reset. other patches aren't modified.
>>
>> v14-v15:
>>     1. add device hot reset callback
>>     2. add bus_in_reset for vfio device to avoid multi do host bus reset
>>
>> v13-v14:
>>     1. for multifunction device, requiring all functions enable AER.(9/13)
>>     2. due to all affected functions receive error signal, ignore no
>>        error occurred function. (12/13)
>>
>> v12-v13:
>>     1. since support multifuncion hotplug, here add callback to enable aer.
>>     2. add pci device pre+post reset for aer host reset.
>>
>> Chen Fan (14):
>>    vfio: extract vfio_get_hot_reset_info as a single function
>>    vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
>>    pcie: modify the capability size assert
>>    vfio: make the 4 bytes aligned for capability size
>>    vfio: add pcie extanded capability support
>>    aer: impove pcie_aer_init to support vfio device
>>    vfio: add aer support for vfio device
>>    vfio: add check host bus reset is support or not
>>    add check reset mechanism when hotplug vfio device
>>    pci: introduce pci bus pre reset
>>    vfio: introduce last reset sequence id
>>    pcie_aer: expose pcie_aer_msg() interface
>>    vfio-pci: pass the aer error to guest
>>    vfio: add 'aer' property to expose aercap
>>
>>   hw/pci-bridge/ioh3420.c            |   2 +-
>>   hw/pci-bridge/xio3130_downstream.c |   2 +-
>>   hw/pci-bridge/xio3130_upstream.c   |   2 +-
>>   hw/pci/pci.c                       |  42 +++
>>   hw/pci/pci_bridge.c                |   3 +
>>   hw/pci/pcie.c                      |   2 +-
>>   hw/pci/pcie_aer.c                  |   6 +-
>>   hw/vfio/pci.c                      | 616 +++++++++++++++++++++++++++++++++----
>>   hw/vfio/pci.h                      |   9 +
>>   include/hw/pci/pci.h               |   1 +
>>   include/hw/pci/pci_bus.h           |   8 +
>>   include/hw/pci/pcie_aer.h          |   3 +-
>>   12 files changed, 624 insertions(+), 72 deletions(-)
>>
>> -- 
>> 1.9.3
>>
>>
>
> .
>

Comments

Michael S. Tsirkin Feb. 3, 2016, 1:57 p.m. UTC | #1
On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:
> 
> On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
> >On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
> >>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>
> >>For now, for vfio pci passthough devices when qemu receives
> >>an error from host aer report, currentlly just terminate the guest,
> >>but usually user want to know what error occurred but stopping the
> >>guest, so this patches add aer capability support for vfio device,
> >>and pass the error to guest, and have guest driver to recover
> >>from the error.
> >I would like to see a version of this patchset that doesn't
> >depend on pci core changes.
> >I think that if you make this simplifying assumption:
> >
> >- all devices on same bus in guest are on same bus in host
> >
> >then you can handle both reset and hotplug simply in function 0
> >since it will belong to vfio.
> >
> >So we can have a version without pci core changes that simply assumes
> >this, and things will just work.
> >
> >
> >Now, if we wanted to enforce this limitation, I think the
> >cleanest way would be to add a callback in struct PCIDevice:
> >
> >	bool is_valid_function(PCIDevice *newfunction)
> >
> >and call it as each function is added.
> >This way aer function can validate that each function
> >added shares the same bus.
> >And this way issues will be detected directly and not when
> >function 0 is added.
> >
> >I would prefer this validation code to be a patch on top so we can merge
> >the functionality directly and avoid blocking it while we figure out the
> >best api to validate things.
> >
> >I don't see why making guest topology match host would
> >ever be a problem, but if it's required to support
> >configurations where these differ, I'd like to see
> >an attempt to address that be split out, after aer
> >is supported.
> Hi Michael,
> 
> Just think about this more,  I think we also should check the vfio
> devices whether on the same bus at the time of function 0 is added.
> because we don't know the affected devices by a bus reset have
> already all been assigned to VM.

This is something vfio in kernel should check.
You can't rely on qemu being well behaved, so don't
even try to catch cases which would break host in userspace.

qemu should only worry about not breaking guest.


> for example, the multi-function's hotplug.
> devices on same bus in host are added to VM one by one. when we
> test one device, we haven't yet added the other devices.
> so I think
> the patch should like below. then we could add a vfio_is_valid_function in
> vfio
> to test each device whether the affected devices on the same bus.
> 
> Thanks,
> Chen
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index d940f79..7163b56 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus, int bus_num,
> uint8_t devfn)
>      return bus->devices[devfn];
>  }
> 
> +static int pci_bus_check_devices(PCIBus *bus)
> +{
> +    PCIDeviceClass *pc;
> +    int i, ret = 0;
> +
> +    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> +        if (!bus->devices[i]) {
> +            continue;
> +        }
> +
> +        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> +        if (!pc->is_valid_func) {
> +            continue;
> +        }
> +
> +        ret = pc->is_valid_func(bus->devices[i], bus);
> +        if (!ret) {
> +            return -1;
> +        }
> +    }
> +    return 0;
> +}
> +
> +static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> +{
> +    if (pdev->bus == bus) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +

I don't really understand what is this one doing.
Why do we need a default function?

>  static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>  {
>      PCIDevice *pci_dev = (PCIDevice *)qdev;
> @@ -1878,6 +1910,14 @@ static void pci_qdev_realize(DeviceState *qdev, Error
> **errp)
>          pci_qdev_unrealize(DEVICE(pci_dev), NULL);
>          return;
>      }
> +
> +    if (DEVICE(pci_dev)->hotplugged &&
> +        pci_get_function_0(pci_dev) == pci_dev &&
> +        pci_bus_check_devices(bus)) {
> +        error_setg(errp, "failed to hotplug function 0");
> +        pci_qdev_unrealize(DEVICE(pci_dev), NULL);
> +        return;
> +    }
>  }
> 
>  static void pci_default_realize(PCIDevice *dev, Error **errp)
> @@ -2390,6 +2430,7 @@ static void pci_device_class_init(ObjectClass *klass,
> void *data)
>      k->bus_type = TYPE_PCI_BUS;
>      k->props = pci_props;
>      pc->realize = pci_default_realize;
> +    pc->is_valid_func = pci_is_valid_function;
>  }
> 
>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index dedf277..a89580f 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -191,6 +191,7 @@ typedef struct PCIDeviceClass {
> 
>      void (*realize)(PCIDevice *dev, Error **errp);
>      int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
> +    bool (*is_valid_func)(PCIDevice *dev, PCIBus *bus);
>      PCIUnregisterFunc *exit;
>      PCIConfigReadFunc *config_read;
>      PCIConfigWriteFunc *config_write;
> 
> 
> 
> >
> >
> >
> >>v15-v16:
> >>    10/14, 11/14 are new to introduce a reset sequence id to specify the
> >>    vfio devices has been reset for that reset. other patches aren't modified.
> >>
> >>v14-v15:
> >>    1. add device hot reset callback
> >>    2. add bus_in_reset for vfio device to avoid multi do host bus reset
> >>
> >>v13-v14:
> >>    1. for multifunction device, requiring all functions enable AER.(9/13)
> >>    2. due to all affected functions receive error signal, ignore no
> >>       error occurred function. (12/13)
> >>
> >>v12-v13:
> >>    1. since support multifuncion hotplug, here add callback to enable aer.
> >>    2. add pci device pre+post reset for aer host reset.
> >>
> >>Chen Fan (14):
> >>   vfio: extract vfio_get_hot_reset_info as a single function
> >>   vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
> >>   pcie: modify the capability size assert
> >>   vfio: make the 4 bytes aligned for capability size
> >>   vfio: add pcie extanded capability support
> >>   aer: impove pcie_aer_init to support vfio device
> >>   vfio: add aer support for vfio device
> >>   vfio: add check host bus reset is support or not
> >>   add check reset mechanism when hotplug vfio device
> >>   pci: introduce pci bus pre reset
> >>   vfio: introduce last reset sequence id
> >>   pcie_aer: expose pcie_aer_msg() interface
> >>   vfio-pci: pass the aer error to guest
> >>   vfio: add 'aer' property to expose aercap
> >>
> >>  hw/pci-bridge/ioh3420.c            |   2 +-
> >>  hw/pci-bridge/xio3130_downstream.c |   2 +-
> >>  hw/pci-bridge/xio3130_upstream.c   |   2 +-
> >>  hw/pci/pci.c                       |  42 +++
> >>  hw/pci/pci_bridge.c                |   3 +
> >>  hw/pci/pcie.c                      |   2 +-
> >>  hw/pci/pcie_aer.c                  |   6 +-
> >>  hw/vfio/pci.c                      | 616 +++++++++++++++++++++++++++++++++----
> >>  hw/vfio/pci.h                      |   9 +
> >>  include/hw/pci/pci.h               |   1 +
> >>  include/hw/pci/pci_bus.h           |   8 +
> >>  include/hw/pci/pcie_aer.h          |   3 +-
> >>  12 files changed, 624 insertions(+), 72 deletions(-)
> >>
> >>-- 
> >>1.9.3
> >>
> >>
> >
> >.
> >
> 
>
chenfan Feb. 4, 2016, 2:04 a.m. UTC | #2
On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:
>> On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
>>> On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
>>>> From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
>>>>
>>>> For now, for vfio pci passthough devices when qemu receives
>>>> an error from host aer report, currentlly just terminate the guest,
>>>> but usually user want to know what error occurred but stopping the
>>>> guest, so this patches add aer capability support for vfio device,
>>>> and pass the error to guest, and have guest driver to recover
>>> >from the error.
>>> I would like to see a version of this patchset that doesn't
>>> depend on pci core changes.
>>> I think that if you make this simplifying assumption:
>>>
>>> - all devices on same bus in guest are on same bus in host
>>>
>>> then you can handle both reset and hotplug simply in function 0
>>> since it will belong to vfio.
>>>
>>> So we can have a version without pci core changes that simply assumes
>>> this, and things will just work.
>>>
>>>
>>> Now, if we wanted to enforce this limitation, I think the
>>> cleanest way would be to add a callback in struct PCIDevice:
>>>
>>> 	bool is_valid_function(PCIDevice *newfunction)
>>>
>>> and call it as each function is added.
>>> This way aer function can validate that each function
>>> added shares the same bus.
>>> And this way issues will be detected directly and not when
>>> function 0 is added.
>>>
>>> I would prefer this validation code to be a patch on top so we can merge
>>> the functionality directly and avoid blocking it while we figure out the
>>> best api to validate things.
>>>
>>> I don't see why making guest topology match host would
>>> ever be a problem, but if it's required to support
>>> configurations where these differ, I'd like to see
>>> an attempt to address that be split out, after aer
>>> is supported.
>> Hi Michael,
>>
>> Just think about this more,  I think we also should check the vfio
>> devices whether on the same bus at the time of function 0 is added.
>> because we don't know the affected devices by a bus reset have
>> already all been assigned to VM.
> This is something vfio in kernel should check.
> You can't rely on qemu being well behaved, so don't
> even try to catch cases which would break host in userspace.
>
> qemu should only worry about not breaking guest.
>
>
>> for example, the multi-function's hotplug.
>> devices on same bus in host are added to VM one by one. when we
>> test one device, we haven't yet added the other devices.
>> so I think
>> the patch should like below. then we could add a vfio_is_valid_function in
>> vfio
>> to test each device whether the affected devices on the same bus.
>>
>> Thanks,
>> Chen
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index d940f79..7163b56 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus, int bus_num,
>> uint8_t devfn)
>>       return bus->devices[devfn];
>>   }
>>
>> +static int pci_bus_check_devices(PCIBus *bus)
>> +{
>> +    PCIDeviceClass *pc;
>> +    int i, ret = 0;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
>> +        if (!bus->devices[i]) {
>> +            continue;
>> +        }
>> +
>> +        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
>> +        if (!pc->is_valid_func) {
>> +            continue;
>> +        }
>> +
>> +        ret = pc->is_valid_func(bus->devices[i], bus);
>> +        if (!ret) {
>> +            return -1;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
>> +{
>> +    if (pdev->bus == bus) {
>> +        return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
> I don't really understand what is this one doing.
> Why do we need a default function?
if the vfio driver in kernel can handle the bus reset for any one
device in qemu without the affected devices assigned. I think
we don't need this default one.
BTW, IIRC at present the devices on the same bus in host can
be assigned to different VM, so if we want to support this kind of
bus reset for an independent device when enable aer, aren't we
limiting the case that others devices on the same bus must be
assigned to current VM?

Thanks,
Chen
>>   static void pci_qdev_realize(DeviceState *qdev, Error **errp)
>>   {
>>       PCIDevice *pci_dev = (PCIDevice *)qdev;
>> @@ -1878,6 +1910,14 @@ static void pci_qdev_realize(DeviceState *qdev, Error
>> **errp)
>>           pci_qdev_unrealize(DEVICE(pci_dev), NULL);
>>           return;
>>       }
>> +
>> +    if (DEVICE(pci_dev)->hotplugged &&
>> +        pci_get_function_0(pci_dev) == pci_dev &&
>> +        pci_bus_check_devices(bus)) {
>> +        error_setg(errp, "failed to hotplug function 0");
>> +        pci_qdev_unrealize(DEVICE(pci_dev), NULL);
>> +        return;
>> +    }
>>   }
>>
>>   static void pci_default_realize(PCIDevice *dev, Error **errp)
>> @@ -2390,6 +2430,7 @@ static void pci_device_class_init(ObjectClass *klass,
>> void *data)
>>       k->bus_type = TYPE_PCI_BUS;
>>       k->props = pci_props;
>>       pc->realize = pci_default_realize;
>> +    pc->is_valid_func = pci_is_valid_function;
>>   }
>>
>>   AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index dedf277..a89580f 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -191,6 +191,7 @@ typedef struct PCIDeviceClass {
>>
>>       void (*realize)(PCIDevice *dev, Error **errp);
>>       int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
>> +    bool (*is_valid_func)(PCIDevice *dev, PCIBus *bus);
>>       PCIUnregisterFunc *exit;
>>       PCIConfigReadFunc *config_read;
>>       PCIConfigWriteFunc *config_write;
>>
>>
>>
>>>
>>>
>>>> v15-v16:
>>>>     10/14, 11/14 are new to introduce a reset sequence id to specify the
>>>>     vfio devices has been reset for that reset. other patches aren't modified.
>>>>
>>>> v14-v15:
>>>>     1. add device hot reset callback
>>>>     2. add bus_in_reset for vfio device to avoid multi do host bus reset
>>>>
>>>> v13-v14:
>>>>     1. for multifunction device, requiring all functions enable AER.(9/13)
>>>>     2. due to all affected functions receive error signal, ignore no
>>>>        error occurred function. (12/13)
>>>>
>>>> v12-v13:
>>>>     1. since support multifuncion hotplug, here add callback to enable aer.
>>>>     2. add pci device pre+post reset for aer host reset.
>>>>
>>>> Chen Fan (14):
>>>>    vfio: extract vfio_get_hot_reset_info as a single function
>>>>    vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
>>>>    pcie: modify the capability size assert
>>>>    vfio: make the 4 bytes aligned for capability size
>>>>    vfio: add pcie extanded capability support
>>>>    aer: impove pcie_aer_init to support vfio device
>>>>    vfio: add aer support for vfio device
>>>>    vfio: add check host bus reset is support or not
>>>>    add check reset mechanism when hotplug vfio device
>>>>    pci: introduce pci bus pre reset
>>>>    vfio: introduce last reset sequence id
>>>>    pcie_aer: expose pcie_aer_msg() interface
>>>>    vfio-pci: pass the aer error to guest
>>>>    vfio: add 'aer' property to expose aercap
>>>>
>>>>   hw/pci-bridge/ioh3420.c            |   2 +-
>>>>   hw/pci-bridge/xio3130_downstream.c |   2 +-
>>>>   hw/pci-bridge/xio3130_upstream.c   |   2 +-
>>>>   hw/pci/pci.c                       |  42 +++
>>>>   hw/pci/pci_bridge.c                |   3 +
>>>>   hw/pci/pcie.c                      |   2 +-
>>>>   hw/pci/pcie_aer.c                  |   6 +-
>>>>   hw/vfio/pci.c                      | 616 +++++++++++++++++++++++++++++++++----
>>>>   hw/vfio/pci.h                      |   9 +
>>>>   include/hw/pci/pci.h               |   1 +
>>>>   include/hw/pci/pci_bus.h           |   8 +
>>>>   include/hw/pci/pcie_aer.h          |   3 +-
>>>>   12 files changed, 624 insertions(+), 72 deletions(-)
>>>>
>>>> -- 
>>>> 1.9.3
>>>>
>>>>
>>> .
>>>
>>
>
> .
>
Michael S. Tsirkin Feb. 4, 2016, 11:21 a.m. UTC | #3
On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote:
> 
> On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:
> >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:
> >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
> >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
> >>>>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> >>>>
> >>>>For now, for vfio pci passthough devices when qemu receives
> >>>>an error from host aer report, currentlly just terminate the guest,
> >>>>but usually user want to know what error occurred but stopping the
> >>>>guest, so this patches add aer capability support for vfio device,
> >>>>and pass the error to guest, and have guest driver to recover
> >>>>from the error.
> >>>I would like to see a version of this patchset that doesn't
> >>>depend on pci core changes.
> >>>I think that if you make this simplifying assumption:
> >>>
> >>>- all devices on same bus in guest are on same bus in host
> >>>
> >>>then you can handle both reset and hotplug simply in function 0
> >>>since it will belong to vfio.
> >>>
> >>>So we can have a version without pci core changes that simply assumes
> >>>this, and things will just work.
> >>>
> >>>
> >>>Now, if we wanted to enforce this limitation, I think the
> >>>cleanest way would be to add a callback in struct PCIDevice:
> >>>
> >>>	bool is_valid_function(PCIDevice *newfunction)
> >>>
> >>>and call it as each function is added.
> >>>This way aer function can validate that each function
> >>>added shares the same bus.
> >>>And this way issues will be detected directly and not when
> >>>function 0 is added.
> >>>
> >>>I would prefer this validation code to be a patch on top so we can merge
> >>>the functionality directly and avoid blocking it while we figure out the
> >>>best api to validate things.
> >>>
> >>>I don't see why making guest topology match host would
> >>>ever be a problem, but if it's required to support
> >>>configurations where these differ, I'd like to see
> >>>an attempt to address that be split out, after aer
> >>>is supported.
> >>Hi Michael,
> >>
> >>Just think about this more,  I think we also should check the vfio
> >>devices whether on the same bus at the time of function 0 is added.
> >>because we don't know the affected devices by a bus reset have
> >>already all been assigned to VM.
> >This is something vfio in kernel should check.
> >You can't rely on qemu being well behaved, so don't
> >even try to catch cases which would break host in userspace.
> >
> >qemu should only worry about not breaking guest.
> >
> >
> >>for example, the multi-function's hotplug.
> >>devices on same bus in host are added to VM one by one. when we
> >>test one device, we haven't yet added the other devices.
> >>so I think
> >>the patch should like below. then we could add a vfio_is_valid_function in
> >>vfio
> >>to test each device whether the affected devices on the same bus.
> >>
> >>Thanks,
> >>Chen
> >>
> >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >>index d940f79..7163b56 100644
> >>--- a/hw/pci/pci.c
> >>+++ b/hw/pci/pci.c
> >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus, int bus_num,
> >>uint8_t devfn)
> >>      return bus->devices[devfn];
> >>  }
> >>
> >>+static int pci_bus_check_devices(PCIBus *bus)
> >>+{
> >>+    PCIDeviceClass *pc;
> >>+    int i, ret = 0;
> >>+
> >>+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> >>+        if (!bus->devices[i]) {
> >>+            continue;
> >>+        }
> >>+
> >>+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> >>+        if (!pc->is_valid_func) {
> >>+            continue;
> >>+        }
> >>+
> >>+        ret = pc->is_valid_func(bus->devices[i], bus);
> >>+        if (!ret) {
> >>+            return -1;
> >>+        }
> >>+    }
> >>+    return 0;
> >>+}
> >>+
> >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> >>+{
> >>+    if (pdev->bus == bus) {
> >>+        return true;
> >>+    }
> >>+
> >>+    return false;
> >>+}
> >>+
> >I don't really understand what is this one doing.
> >Why do we need a default function?
> if the vfio driver in kernel can handle the bus reset for any one
> device in qemu without the affected devices assigned. I think
> we don't need this default one.
> BTW, IIRC at present the devices on the same bus in host can
> be assigned to different VM, so if we want to support this kind of
> bus reset for an independent device when enable aer, aren't we
> limiting the case that others devices on the same bus must be
> assigned to current VM?
> 
> Thanks,
> Chen

I don't believe this works at the moment, and
I'd expect kernel to prevent this,
so we should not rely on userspace code for this.
Alex, could you comment please?


> >>  static void pci_qdev_realize(DeviceState *qdev, Error **errp)
> >>  {
> >>      PCIDevice *pci_dev = (PCIDevice *)qdev;
> >>@@ -1878,6 +1910,14 @@ static void pci_qdev_realize(DeviceState *qdev, Error
> >>**errp)
> >>          pci_qdev_unrealize(DEVICE(pci_dev), NULL);
> >>          return;
> >>      }
> >>+
> >>+    if (DEVICE(pci_dev)->hotplugged &&
> >>+        pci_get_function_0(pci_dev) == pci_dev &&
> >>+        pci_bus_check_devices(bus)) {
> >>+        error_setg(errp, "failed to hotplug function 0");
> >>+        pci_qdev_unrealize(DEVICE(pci_dev), NULL);
> >>+        return;
> >>+    }
> >>  }
> >>
> >>  static void pci_default_realize(PCIDevice *dev, Error **errp)
> >>@@ -2390,6 +2430,7 @@ static void pci_device_class_init(ObjectClass *klass,
> >>void *data)
> >>      k->bus_type = TYPE_PCI_BUS;
> >>      k->props = pci_props;
> >>      pc->realize = pci_default_realize;
> >>+    pc->is_valid_func = pci_is_valid_function;
> >>  }
> >>
> >>  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> >>index dedf277..a89580f 100644
> >>--- a/include/hw/pci/pci.h
> >>+++ b/include/hw/pci/pci.h
> >>@@ -191,6 +191,7 @@ typedef struct PCIDeviceClass {
> >>
> >>      void (*realize)(PCIDevice *dev, Error **errp);
> >>      int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
> >>+    bool (*is_valid_func)(PCIDevice *dev, PCIBus *bus);
> >>      PCIUnregisterFunc *exit;
> >>      PCIConfigReadFunc *config_read;
> >>      PCIConfigWriteFunc *config_write;
> >>
> >>
> >>
> >>>
> >>>
> >>>>v15-v16:
> >>>>    10/14, 11/14 are new to introduce a reset sequence id to specify the
> >>>>    vfio devices has been reset for that reset. other patches aren't modified.
> >>>>
> >>>>v14-v15:
> >>>>    1. add device hot reset callback
> >>>>    2. add bus_in_reset for vfio device to avoid multi do host bus reset
> >>>>
> >>>>v13-v14:
> >>>>    1. for multifunction device, requiring all functions enable AER.(9/13)
> >>>>    2. due to all affected functions receive error signal, ignore no
> >>>>       error occurred function. (12/13)
> >>>>
> >>>>v12-v13:
> >>>>    1. since support multifuncion hotplug, here add callback to enable aer.
> >>>>    2. add pci device pre+post reset for aer host reset.
> >>>>
> >>>>Chen Fan (14):
> >>>>   vfio: extract vfio_get_hot_reset_info as a single function
> >>>>   vfio: squeeze out vfio_pci_do_hot_reset for support bus reset
> >>>>   pcie: modify the capability size assert
> >>>>   vfio: make the 4 bytes aligned for capability size
> >>>>   vfio: add pcie extanded capability support
> >>>>   aer: impove pcie_aer_init to support vfio device
> >>>>   vfio: add aer support for vfio device
> >>>>   vfio: add check host bus reset is support or not
> >>>>   add check reset mechanism when hotplug vfio device
> >>>>   pci: introduce pci bus pre reset
> >>>>   vfio: introduce last reset sequence id
> >>>>   pcie_aer: expose pcie_aer_msg() interface
> >>>>   vfio-pci: pass the aer error to guest
> >>>>   vfio: add 'aer' property to expose aercap
> >>>>
> >>>>  hw/pci-bridge/ioh3420.c            |   2 +-
> >>>>  hw/pci-bridge/xio3130_downstream.c |   2 +-
> >>>>  hw/pci-bridge/xio3130_upstream.c   |   2 +-
> >>>>  hw/pci/pci.c                       |  42 +++
> >>>>  hw/pci/pci_bridge.c                |   3 +
> >>>>  hw/pci/pcie.c                      |   2 +-
> >>>>  hw/pci/pcie_aer.c                  |   6 +-
> >>>>  hw/vfio/pci.c                      | 616 +++++++++++++++++++++++++++++++++----
> >>>>  hw/vfio/pci.h                      |   9 +
> >>>>  include/hw/pci/pci.h               |   1 +
> >>>>  include/hw/pci/pci_bus.h           |   8 +
> >>>>  include/hw/pci/pcie_aer.h          |   3 +-
> >>>>  12 files changed, 624 insertions(+), 72 deletions(-)
> >>>>
> >>>>-- 
> >>>>1.9.3
> >>>>
> >>>>
> >>>.
> >>>
> >>
> >
> >.
> >
> 
>
Alex Williamson Feb. 4, 2016, 5:46 p.m. UTC | #4
On Thu, 4 Feb 2016 13:21:57 +0200
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote:
> > 
> > On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:  
> > >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:  
> > >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:  
> > >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:  
> > >>>>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > >>>>
> > >>>>For now, for vfio pci passthough devices when qemu receives
> > >>>>an error from host aer report, currentlly just terminate the
> > >>>>guest, but usually user want to know what error occurred but
> > >>>>stopping the guest, so this patches add aer capability support
> > >>>>for vfio device, and pass the error to guest, and have guest
> > >>>>driver to recover from the error.  
> > >>>I would like to see a version of this patchset that doesn't
> > >>>depend on pci core changes.
> > >>>I think that if you make this simplifying assumption:
> > >>>
> > >>>- all devices on same bus in guest are on same bus in host
> > >>>
> > >>>then you can handle both reset and hotplug simply in function 0
> > >>>since it will belong to vfio.
> > >>>
> > >>>So we can have a version without pci core changes that simply
> > >>>assumes this, and things will just work.
> > >>>
> > >>>
> > >>>Now, if we wanted to enforce this limitation, I think the
> > >>>cleanest way would be to add a callback in struct PCIDevice:
> > >>>
> > >>>	bool is_valid_function(PCIDevice *newfunction)
> > >>>
> > >>>and call it as each function is added.
> > >>>This way aer function can validate that each function
> > >>>added shares the same bus.
> > >>>And this way issues will be detected directly and not when
> > >>>function 0 is added.
> > >>>
> > >>>I would prefer this validation code to be a patch on top so we
> > >>>can merge the functionality directly and avoid blocking it while
> > >>>we figure out the best api to validate things.
> > >>>
> > >>>I don't see why making guest topology match host would
> > >>>ever be a problem, but if it's required to support
> > >>>configurations where these differ, I'd like to see
> > >>>an attempt to address that be split out, after aer
> > >>>is supported.  
> > >>Hi Michael,
> > >>
> > >>Just think about this more,  I think we also should check the vfio
> > >>devices whether on the same bus at the time of function 0 is
> > >>added. because we don't know the affected devices by a bus reset
> > >>have already all been assigned to VM.  
> > >This is something vfio in kernel should check.
> > >You can't rely on qemu being well behaved, so don't
> > >even try to catch cases which would break host in userspace.
> > >
> > >qemu should only worry about not breaking guest.
> > >
> > >  
> > >>for example, the multi-function's hotplug.
> > >>devices on same bus in host are added to VM one by one. when we
> > >>test one device, we haven't yet added the other devices.
> > >>so I think
> > >>the patch should like below. then we could add a
> > >>vfio_is_valid_function in vfio
> > >>to test each device whether the affected devices on the same bus.
> > >>
> > >>Thanks,
> > >>Chen
> > >>
> > >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > >>index d940f79..7163b56 100644
> > >>--- a/hw/pci/pci.c
> > >>+++ b/hw/pci/pci.c
> > >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus,
> > >>int bus_num, uint8_t devfn)
> > >>      return bus->devices[devfn];
> > >>  }
> > >>
> > >>+static int pci_bus_check_devices(PCIBus *bus)
> > >>+{
> > >>+    PCIDeviceClass *pc;
> > >>+    int i, ret = 0;
> > >>+
> > >>+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> > >>+        if (!bus->devices[i]) {
> > >>+            continue;
> > >>+        }
> > >>+
> > >>+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> > >>+        if (!pc->is_valid_func) {
> > >>+            continue;
> > >>+        }
> > >>+
> > >>+        ret = pc->is_valid_func(bus->devices[i], bus);
> > >>+        if (!ret) {
> > >>+            return -1;
> > >>+        }
> > >>+    }
> > >>+    return 0;
> > >>+}
> > >>+
> > >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> > >>+{
> > >>+    if (pdev->bus == bus) {
> > >>+        return true;
> > >>+    }
> > >>+
> > >>+    return false;
> > >>+}
> > >>+  
> > >I don't really understand what is this one doing.
> > >Why do we need a default function?  
> > if the vfio driver in kernel can handle the bus reset for any one
> > device in qemu without the affected devices assigned. I think
> > we don't need this default one.
> > BTW, IIRC at present the devices on the same bus in host can
> > be assigned to different VM, so if we want to support this kind of
> > bus reset for an independent device when enable aer, aren't we
> > limiting the case that others devices on the same bus must be
> > assigned to current VM?
> > 
> > Thanks,
> > Chen  
> 
> I don't believe this works at the moment, and
> I'd expect kernel to prevent this,
> so we should not rely on userspace code for this.
> Alex, could you comment please?

DMA isolation and bus isolation are separate things.  So long as
devices on the same bus are DMA isolated then they can be assigned to
separate VMs or split between host and VM.  However, certain features
like bus reset are not available to the user unless they "own" all of
the DMA isolated sets affected by a bus reset.  The kernel doesn't care
how the user is using them, but they must prove they own them through
vfio group file descriptors.

I thought in previous discussions we decided that unused devices made
the problem set more complicated for userspace so we simplified by
requiring them to be assigned.  For instance imagine a two function
device, with DMA isolation between functions, where we only want one
function assigned to the VM. QEMU would need to learn to take ownership
of the other function without exposing it to the VM simply for the
purpose of being able to perform a bus reset.  Another simplification
was to expose them in the same slot, so we don't need to worry about VM
configurations where one device could be hot-unplugged and the
ownership released, breaking QEMU's ability to do a bus reset on the
remaining device.  There are a lot of configuration restrictions
imposed by adding the requirement that QEMU needs to be able to perform
a host bus reset on a device in order to support this feature.  Thanks,

Alex
Michael S. Tsirkin Feb. 4, 2016, 6:09 p.m. UTC | #5
On Thu, Feb 04, 2016 at 10:46:52AM -0700, Alex Williamson wrote:
> On Thu, 4 Feb 2016 13:21:57 +0200
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote:
> > > 
> > > On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:  
> > > >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:  
> > > >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:  
> > > >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:  
> > > >>>>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > > >>>>
> > > >>>>For now, for vfio pci passthough devices when qemu receives
> > > >>>>an error from host aer report, currentlly just terminate the
> > > >>>>guest, but usually user want to know what error occurred but
> > > >>>>stopping the guest, so this patches add aer capability support
> > > >>>>for vfio device, and pass the error to guest, and have guest
> > > >>>>driver to recover from the error.  
> > > >>>I would like to see a version of this patchset that doesn't
> > > >>>depend on pci core changes.
> > > >>>I think that if you make this simplifying assumption:
> > > >>>
> > > >>>- all devices on same bus in guest are on same bus in host
> > > >>>
> > > >>>then you can handle both reset and hotplug simply in function 0
> > > >>>since it will belong to vfio.
> > > >>>
> > > >>>So we can have a version without pci core changes that simply
> > > >>>assumes this, and things will just work.
> > > >>>
> > > >>>
> > > >>>Now, if we wanted to enforce this limitation, I think the
> > > >>>cleanest way would be to add a callback in struct PCIDevice:
> > > >>>
> > > >>>	bool is_valid_function(PCIDevice *newfunction)
> > > >>>
> > > >>>and call it as each function is added.
> > > >>>This way aer function can validate that each function
> > > >>>added shares the same bus.
> > > >>>And this way issues will be detected directly and not when
> > > >>>function 0 is added.
> > > >>>
> > > >>>I would prefer this validation code to be a patch on top so we
> > > >>>can merge the functionality directly and avoid blocking it while
> > > >>>we figure out the best api to validate things.
> > > >>>
> > > >>>I don't see why making guest topology match host would
> > > >>>ever be a problem, but if it's required to support
> > > >>>configurations where these differ, I'd like to see
> > > >>>an attempt to address that be split out, after aer
> > > >>>is supported.  
> > > >>Hi Michael,
> > > >>
> > > >>Just think about this more,  I think we also should check the vfio
> > > >>devices whether on the same bus at the time of function 0 is
> > > >>added. because we don't know the affected devices by a bus reset
> > > >>have already all been assigned to VM.  
> > > >This is something vfio in kernel should check.
> > > >You can't rely on qemu being well behaved, so don't
> > > >even try to catch cases which would break host in userspace.
> > > >
> > > >qemu should only worry about not breaking guest.
> > > >
> > > >  
> > > >>for example, the multi-function's hotplug.
> > > >>devices on same bus in host are added to VM one by one. when we
> > > >>test one device, we haven't yet added the other devices.
> > > >>so I think
> > > >>the patch should like below. then we could add a
> > > >>vfio_is_valid_function in vfio
> > > >>to test each device whether the affected devices on the same bus.
> > > >>
> > > >>Thanks,
> > > >>Chen
> > > >>
> > > >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > >>index d940f79..7163b56 100644
> > > >>--- a/hw/pci/pci.c
> > > >>+++ b/hw/pci/pci.c
> > > >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus,
> > > >>int bus_num, uint8_t devfn)
> > > >>      return bus->devices[devfn];
> > > >>  }
> > > >>
> > > >>+static int pci_bus_check_devices(PCIBus *bus)
> > > >>+{
> > > >>+    PCIDeviceClass *pc;
> > > >>+    int i, ret = 0;
> > > >>+
> > > >>+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> > > >>+        if (!bus->devices[i]) {
> > > >>+            continue;
> > > >>+        }
> > > >>+
> > > >>+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> > > >>+        if (!pc->is_valid_func) {
> > > >>+            continue;
> > > >>+        }
> > > >>+
> > > >>+        ret = pc->is_valid_func(bus->devices[i], bus);
> > > >>+        if (!ret) {
> > > >>+            return -1;
> > > >>+        }
> > > >>+    }
> > > >>+    return 0;
> > > >>+}
> > > >>+
> > > >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> > > >>+{
> > > >>+    if (pdev->bus == bus) {
> > > >>+        return true;
> > > >>+    }
> > > >>+
> > > >>+    return false;
> > > >>+}
> > > >>+  
> > > >I don't really understand what is this one doing.
> > > >Why do we need a default function?  
> > > if the vfio driver in kernel can handle the bus reset for any one
> > > device in qemu without the affected devices assigned. I think
> > > we don't need this default one.
> > > BTW, IIRC at present the devices on the same bus in host can
> > > be assigned to different VM, so if we want to support this kind of
> > > bus reset for an independent device when enable aer, aren't we
> > > limiting the case that others devices on the same bus must be
> > > assigned to current VM?
> > > 
> > > Thanks,
> > > Chen  
> > 
> > I don't believe this works at the moment, and
> > I'd expect kernel to prevent this,
> > so we should not rely on userspace code for this.
> > Alex, could you comment please?
> 
> DMA isolation and bus isolation are separate things.  So long as
> devices on the same bus are DMA isolated then they can be assigned to
> separate VMs or split between host and VM.  However, certain features
> like bus reset are not available to the user unless they "own" all of
> the DMA isolated sets affected by a bus reset.  The kernel doesn't care
> how the user is using them, but they must prove they own them through
> vfio group file descriptors.

OK this makes sense.
So I think the solution is for userspace to make sure bus reset
is available before exposing aer to guest.
For example, attempt a bus reset.


> I thought in previous discussions we decided that unused devices made
> the problem set more complicated for userspace so we simplified by
> requiring them to be assigned.  For instance imagine a two function
> device, with DMA isolation between functions, where we only want one
> function assigned to the VM. QEMU would need to learn to take ownership
> of the other function without exposing it to the VM simply for the
> purpose of being able to perform a bus reset.

Hmm this might be a security problem.
Ideally we should not touch devices we don't *need* to touch.
But given kernel requires this at the moment (IIUC)
QEMU could open the device
but not expose it to guest until actually asked.

>  Another simplification
> was to expose them in the same slot, so we don't need to worry about VM
> configurations where one device could be hot-unplugged and the
> ownership released, breaking QEMU's ability to do a bus reset on the
> remaining device.

Same would apply here.

>  There are a lot of configuration restrictions
> imposed by adding the requirement that QEMU needs to be able to perform
> a host bus reset on a device in order to support this feature.  Thanks,
> 
> Alex
Alex Williamson Feb. 4, 2016, 8:15 p.m. UTC | #6
----- Original Message -----
> On Thu, Feb 04, 2016 at 10:46:52AM -0700, Alex Williamson wrote:
> > On Thu, 4 Feb 2016 13:21:57 +0200
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote:
> > > > 
> > > > On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:
> > > > >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:
> > > > >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
> > > > >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
> > > > >>>>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > > > >>>>
> > > > >>>>For now, for vfio pci passthough devices when qemu receives
> > > > >>>>an error from host aer report, currentlly just terminate the
> > > > >>>>guest, but usually user want to know what error occurred but
> > > > >>>>stopping the guest, so this patches add aer capability support
> > > > >>>>for vfio device, and pass the error to guest, and have guest
> > > > >>>>driver to recover from the error.
> > > > >>>I would like to see a version of this patchset that doesn't
> > > > >>>depend on pci core changes.
> > > > >>>I think that if you make this simplifying assumption:
> > > > >>>
> > > > >>>- all devices on same bus in guest are on same bus in host
> > > > >>>
> > > > >>>then you can handle both reset and hotplug simply in function 0
> > > > >>>since it will belong to vfio.
> > > > >>>
> > > > >>>So we can have a version without pci core changes that simply
> > > > >>>assumes this, and things will just work.
> > > > >>>
> > > > >>>
> > > > >>>Now, if we wanted to enforce this limitation, I think the
> > > > >>>cleanest way would be to add a callback in struct PCIDevice:
> > > > >>>
> > > > >>>	bool is_valid_function(PCIDevice *newfunction)
> > > > >>>
> > > > >>>and call it as each function is added.
> > > > >>>This way aer function can validate that each function
> > > > >>>added shares the same bus.
> > > > >>>And this way issues will be detected directly and not when
> > > > >>>function 0 is added.
> > > > >>>
> > > > >>>I would prefer this validation code to be a patch on top so we
> > > > >>>can merge the functionality directly and avoid blocking it while
> > > > >>>we figure out the best api to validate things.
> > > > >>>
> > > > >>>I don't see why making guest topology match host would
> > > > >>>ever be a problem, but if it's required to support
> > > > >>>configurations where these differ, I'd like to see
> > > > >>>an attempt to address that be split out, after aer
> > > > >>>is supported.
> > > > >>Hi Michael,
> > > > >>
> > > > >>Just think about this more,  I think we also should check the vfio
> > > > >>devices whether on the same bus at the time of function 0 is
> > > > >>added. because we don't know the affected devices by a bus reset
> > > > >>have already all been assigned to VM.
> > > > >This is something vfio in kernel should check.
> > > > >You can't rely on qemu being well behaved, so don't
> > > > >even try to catch cases which would break host in userspace.
> > > > >
> > > > >qemu should only worry about not breaking guest.
> > > > >
> > > > >  
> > > > >>for example, the multi-function's hotplug.
> > > > >>devices on same bus in host are added to VM one by one. when we
> > > > >>test one device, we haven't yet added the other devices.
> > > > >>so I think
> > > > >>the patch should like below. then we could add a
> > > > >>vfio_is_valid_function in vfio
> > > > >>to test each device whether the affected devices on the same bus.
> > > > >>
> > > > >>Thanks,
> > > > >>Chen
> > > > >>
> > > > >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > >>index d940f79..7163b56 100644
> > > > >>--- a/hw/pci/pci.c
> > > > >>+++ b/hw/pci/pci.c
> > > > >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus,
> > > > >>int bus_num, uint8_t devfn)
> > > > >>      return bus->devices[devfn];
> > > > >>  }
> > > > >>
> > > > >>+static int pci_bus_check_devices(PCIBus *bus)
> > > > >>+{
> > > > >>+    PCIDeviceClass *pc;
> > > > >>+    int i, ret = 0;
> > > > >>+
> > > > >>+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> > > > >>+        if (!bus->devices[i]) {
> > > > >>+            continue;
> > > > >>+        }
> > > > >>+
> > > > >>+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> > > > >>+        if (!pc->is_valid_func) {
> > > > >>+            continue;
> > > > >>+        }
> > > > >>+
> > > > >>+        ret = pc->is_valid_func(bus->devices[i], bus);
> > > > >>+        if (!ret) {
> > > > >>+            return -1;
> > > > >>+        }
> > > > >>+    }
> > > > >>+    return 0;
> > > > >>+}
> > > > >>+
> > > > >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> > > > >>+{
> > > > >>+    if (pdev->bus == bus) {
> > > > >>+        return true;
> > > > >>+    }
> > > > >>+
> > > > >>+    return false;
> > > > >>+}
> > > > >>+
> > > > >I don't really understand what is this one doing.
> > > > >Why do we need a default function?
> > > > if the vfio driver in kernel can handle the bus reset for any one
> > > > device in qemu without the affected devices assigned. I think
> > > > we don't need this default one.
> > > > BTW, IIRC at present the devices on the same bus in host can
> > > > be assigned to different VM, so if we want to support this kind of
> > > > bus reset for an independent device when enable aer, aren't we
> > > > limiting the case that others devices on the same bus must be
> > > > assigned to current VM?
> > > > 
> > > > Thanks,
> > > > Chen
> > > 
> > > I don't believe this works at the moment, and
> > > I'd expect kernel to prevent this,
> > > so we should not rely on userspace code for this.
> > > Alex, could you comment please?
> > 
> > DMA isolation and bus isolation are separate things.  So long as
> > devices on the same bus are DMA isolated then they can be assigned to
> > separate VMs or split between host and VM.  However, certain features
> > like bus reset are not available to the user unless they "own" all of
> > the DMA isolated sets affected by a bus reset.  The kernel doesn't care
> > how the user is using them, but they must prove they own them through
> > vfio group file descriptors.
> 
> OK this makes sense.
> So I think the solution is for userspace to make sure bus reset
> is available before exposing aer to guest.
> For example, attempt a bus reset.

The API includes a mechanism for retrieving the affected devices without
making it necessary to actually perform a bus reset.

> > I thought in previous discussions we decided that unused devices made
> > the problem set more complicated for userspace so we simplified by
> > requiring them to be assigned.  For instance imagine a two function
> > device, with DMA isolation between functions, where we only want one
> > function assigned to the VM. QEMU would need to learn to take ownership
> > of the other function without exposing it to the VM simply for the
> > purpose of being able to perform a bus reset.
> 
> Hmm this might be a security problem.

How so?  We don't simply simply trust that the user previously validated
the ability to do a bus reset and allow it any time they ask, each reset
needs to pass the file descriptors related to the affected devices.

> Ideally we should not touch devices we don't *need* to touch.
> But given kernel requires this at the moment (IIUC)
> QEMU could open the device
> but not expose it to guest until actually asked.

Well, a bus reset counts as needing to touch the device.  I don't see
how it could not.  Requiring QEMU to add a new mode of holding a device
only for ownership purposes sounds like an expansion of the scope of
these changes, which is exactly what we don't need at this point.

> >  Another simplification
> > was to expose them in the same slot, so we don't need to worry about VM
> > configurations where one device could be hot-unplugged and the
> > ownership released, breaking QEMU's ability to do a bus reset on the
> > remaining device.
> 
> Same would apply here.

Again, expanding scope versus configuration requirement, I take the
latter.  Thanks,

Alex
Michael S. Tsirkin Feb. 4, 2016, 9:58 p.m. UTC | #7
On Thu, Feb 04, 2016 at 03:15:39PM -0500, Alex Williamson wrote:
> 
> 
> ----- Original Message -----
> > On Thu, Feb 04, 2016 at 10:46:52AM -0700, Alex Williamson wrote:
> > > On Thu, 4 Feb 2016 13:21:57 +0200
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > On Thu, Feb 04, 2016 at 10:04:01AM +0800, Chen Fan wrote:
> > > > > 
> > > > > On 02/03/2016 09:57 PM, Michael S. Tsirkin wrote:
> > > > > >On Wed, Feb 03, 2016 at 04:54:01PM +0800, Chen Fan wrote:
> > > > > >>On 01/17/2016 02:34 AM, Michael S. Tsirkin wrote:
> > > > > >>>On Tue, Jan 12, 2016 at 10:43:01AM +0800, Cao jin wrote:
> > > > > >>>>From: Chen Fan <chen.fan.fnst@cn.fujitsu.com>
> > > > > >>>>
> > > > > >>>>For now, for vfio pci passthough devices when qemu receives
> > > > > >>>>an error from host aer report, currentlly just terminate the
> > > > > >>>>guest, but usually user want to know what error occurred but
> > > > > >>>>stopping the guest, so this patches add aer capability support
> > > > > >>>>for vfio device, and pass the error to guest, and have guest
> > > > > >>>>driver to recover from the error.
> > > > > >>>I would like to see a version of this patchset that doesn't
> > > > > >>>depend on pci core changes.
> > > > > >>>I think that if you make this simplifying assumption:
> > > > > >>>
> > > > > >>>- all devices on same bus in guest are on same bus in host
> > > > > >>>
> > > > > >>>then you can handle both reset and hotplug simply in function 0
> > > > > >>>since it will belong to vfio.
> > > > > >>>
> > > > > >>>So we can have a version without pci core changes that simply
> > > > > >>>assumes this, and things will just work.
> > > > > >>>
> > > > > >>>
> > > > > >>>Now, if we wanted to enforce this limitation, I think the
> > > > > >>>cleanest way would be to add a callback in struct PCIDevice:
> > > > > >>>
> > > > > >>>	bool is_valid_function(PCIDevice *newfunction)
> > > > > >>>
> > > > > >>>and call it as each function is added.
> > > > > >>>This way aer function can validate that each function
> > > > > >>>added shares the same bus.
> > > > > >>>And this way issues will be detected directly and not when
> > > > > >>>function 0 is added.
> > > > > >>>
> > > > > >>>I would prefer this validation code to be a patch on top so we
> > > > > >>>can merge the functionality directly and avoid blocking it while
> > > > > >>>we figure out the best api to validate things.
> > > > > >>>
> > > > > >>>I don't see why making guest topology match host would
> > > > > >>>ever be a problem, but if it's required to support
> > > > > >>>configurations where these differ, I'd like to see
> > > > > >>>an attempt to address that be split out, after aer
> > > > > >>>is supported.
> > > > > >>Hi Michael,
> > > > > >>
> > > > > >>Just think about this more,  I think we also should check the vfio
> > > > > >>devices whether on the same bus at the time of function 0 is
> > > > > >>added. because we don't know the affected devices by a bus reset
> > > > > >>have already all been assigned to VM.
> > > > > >This is something vfio in kernel should check.
> > > > > >You can't rely on qemu being well behaved, so don't
> > > > > >even try to catch cases which would break host in userspace.
> > > > > >
> > > > > >qemu should only worry about not breaking guest.
> > > > > >
> > > > > >  
> > > > > >>for example, the multi-function's hotplug.
> > > > > >>devices on same bus in host are added to VM one by one. when we
> > > > > >>test one device, we haven't yet added the other devices.
> > > > > >>so I think
> > > > > >>the patch should like below. then we could add a
> > > > > >>vfio_is_valid_function in vfio
> > > > > >>to test each device whether the affected devices on the same bus.
> > > > > >>
> > > > > >>Thanks,
> > > > > >>Chen
> > > > > >>
> > > > > >>diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > > > > >>index d940f79..7163b56 100644
> > > > > >>--- a/hw/pci/pci.c
> > > > > >>+++ b/hw/pci/pci.c
> > > > > >>@@ -1836,6 +1836,38 @@ PCIDevice *pci_find_device(PCIBus *bus,
> > > > > >>int bus_num, uint8_t devfn)
> > > > > >>      return bus->devices[devfn];
> > > > > >>  }
> > > > > >>
> > > > > >>+static int pci_bus_check_devices(PCIBus *bus)
> > > > > >>+{
> > > > > >>+    PCIDeviceClass *pc;
> > > > > >>+    int i, ret = 0;
> > > > > >>+
> > > > > >>+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
> > > > > >>+        if (!bus->devices[i]) {
> > > > > >>+            continue;
> > > > > >>+        }
> > > > > >>+
> > > > > >>+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
> > > > > >>+        if (!pc->is_valid_func) {
> > > > > >>+            continue;
> > > > > >>+        }
> > > > > >>+
> > > > > >>+        ret = pc->is_valid_func(bus->devices[i], bus);
> > > > > >>+        if (!ret) {
> > > > > >>+            return -1;
> > > > > >>+        }
> > > > > >>+    }
> > > > > >>+    return 0;
> > > > > >>+}
> > > > > >>+
> > > > > >>+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
> > > > > >>+{
> > > > > >>+    if (pdev->bus == bus) {
> > > > > >>+        return true;
> > > > > >>+    }
> > > > > >>+
> > > > > >>+    return false;
> > > > > >>+}
> > > > > >>+
> > > > > >I don't really understand what is this one doing.
> > > > > >Why do we need a default function?
> > > > > if the vfio driver in kernel can handle the bus reset for any one
> > > > > device in qemu without the affected devices assigned. I think
> > > > > we don't need this default one.
> > > > > BTW, IIRC at present the devices on the same bus in host can
> > > > > be assigned to different VM, so if we want to support this kind of
> > > > > bus reset for an independent device when enable aer, aren't we
> > > > > limiting the case that others devices on the same bus must be
> > > > > assigned to current VM?
> > > > > 
> > > > > Thanks,
> > > > > Chen
> > > > 
> > > > I don't believe this works at the moment, and
> > > > I'd expect kernel to prevent this,
> > > > so we should not rely on userspace code for this.
> > > > Alex, could you comment please?
> > > 
> > > DMA isolation and bus isolation are separate things.  So long as
> > > devices on the same bus are DMA isolated then they can be assigned to
> > > separate VMs or split between host and VM.  However, certain features
> > > like bus reset are not available to the user unless they "own" all of
> > > the DMA isolated sets affected by a bus reset.  The kernel doesn't care
> > > how the user is using them, but they must prove they own them through
> > > vfio group file descriptors.
> > 
> > OK this makes sense.
> > So I think the solution is for userspace to make sure bus reset
> > is available before exposing aer to guest.
> > For example, attempt a bus reset.
> 
> The API includes a mechanism for retrieving the affected devices without
> making it necessary to actually perform a bus reset.
> 
> > > I thought in previous discussions we decided that unused devices made
> > > the problem set more complicated for userspace so we simplified by
> > > requiring them to be assigned.  For instance imagine a two function
> > > device, with DMA isolation between functions, where we only want one
> > > function assigned to the VM. QEMU would need to learn to take ownership
> > > of the other function without exposing it to the VM simply for the
> > > purpose of being able to perform a bus reset.
> > 
> > Hmm this might be a security problem.
> 
> How so?  We don't simply simply trust that the user previously validated
> the ability to do a bus reset and allow it any time they ask, each reset
> needs to pass the file descriptors related to the affected devices.

Not a problem. A risk.  Simply passing in
each extra function to guest increases an attack surface.  We should not
expose devices to guest if it's not strictly required.

> > Ideally we should not touch devices we don't *need* to touch.
> > But given kernel requires this at the moment (IIUC)
> > QEMU could open the device
> > but not expose it to guest until actually asked.
> 
> Well, a bus reset counts as needing to touch the device.  I don't see
> how it could not.  Requiring QEMU to add a new mode of holding a device
> only for ownership purposes sounds like an expansion of the scope of
> these changes, which is exactly what we don't need at this point.

All this can be done gradually, I have no problem with that.

> > >  Another simplification
> > > was to expose them in the same slot, so we don't need to worry about VM
> > > configurations where one device could be hot-unplugged and the
> > > ownership released, breaking QEMU's ability to do a bus reset on the
> > > remaining device.
> > 
> > Same would apply here.
> 
> Again, expanding scope versus configuration requirement, I take the
> latter.  Thanks,
> 
> Alex
diff mbox

Patch

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index d940f79..7163b56 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1836,6 +1836,38 @@  PCIDevice *pci_find_device(PCIBus *bus, int 
bus_num, uint8_t devfn)
      return bus->devices[devfn];
  }

+static int pci_bus_check_devices(PCIBus *bus)
+{
+    PCIDeviceClass *pc;
+    int i, ret = 0;
+
+    for (i = 0; i < ARRAY_SIZE(bus->devices); ++i) {
+        if (!bus->devices[i]) {
+            continue;
+        }
+
+        pc = PCI_DEVICE_GET_CLASS(bus->devices[i]);
+        if (!pc->is_valid_func) {
+            continue;
+        }
+
+        ret = pc->is_valid_func(bus->devices[i], bus);
+        if (!ret) {
+            return -1;
+        }
+    }
+    return 0;
+}
+
+static bool pci_is_valid_function(PCIDevice *pdev, PCIBus *bus)
+{
+    if (pdev->bus == bus) {
+        return true;
+    }
+
+    return false;
+}
+
  static void pci_qdev_realize(DeviceState *qdev, Error **errp)
  {
      PCIDevice *pci_dev = (PCIDevice *)qdev;
@@ -1878,6 +1910,14 @@  static void pci_qdev_realize(DeviceState *qdev, 
Error **errp)
          pci_qdev_unrealize(DEVICE(pci_dev), NULL);
          return;
      }
+
+    if (DEVICE(pci_dev)->hotplugged &&
+        pci_get_function_0(pci_dev) == pci_dev &&
+        pci_bus_check_devices(bus)) {
+        error_setg(errp, "failed to hotplug function 0");
+        pci_qdev_unrealize(DEVICE(pci_dev), NULL);
+        return;
+    }
  }

  static void pci_default_realize(PCIDevice *dev, Error **errp)
@@ -2390,6 +2430,7 @@  static void pci_device_class_init(ObjectClass 
*klass, void *data)
      k->bus_type = TYPE_PCI_BUS;
      k->props = pci_props;
      pc->realize = pci_default_realize;
+    pc->is_valid_func = pci_is_valid_function;
  }

  AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index dedf277..a89580f 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -191,6 +191,7 @@  typedef struct PCIDeviceClass {

      void (*realize)(PCIDevice *dev, Error **errp);
      int (*init)(PCIDevice *dev);/* TODO convert to realize() and remove */
+    bool (*is_valid_func)(PCIDevice *dev, PCIBus *bus);
      PCIUnregisterFunc *exit;
      PCIConfigReadFunc *config_read;
      PCIConfigWriteFunc *config_write;