Patchwork [2/3] VFIO: VFIO_DEVICE_SET_ADDR_MAPPING command

Submitter Gavin Shan
Date March 15, 2013, 7:26 a.m.
Message ID <1363332390-12754-3-git-send-email-shangw@linux.vnet.ibm.com>
Permalink /patch/227861/
State Changes Requested

Comments

Gavin Shan - March 15, 2013, 7:26 a.m.
The address (domain/bus/slot/function) of a passed-through PCI device
looks quite different from the host's and the guest's perspectives.
Some architectures, such as PPC, need to set up the mapping in the host.
This patch introduces an additional VFIO device IOCTL command to address that.

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 include/uapi/linux/vfio.h |   16 ++++++++++++++++
 1 files changed, 16 insertions(+), 0 deletions(-)
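As a rough illustration, userspace (e.g. QEMU) might invoke the proposed command like this. This is only a sketch built on the uAPI proposed in this patch: the struct and command number mirror the patch, the `VFIO_TYPE`/`VFIO_BASE` values are copied from `include/uapi/linux/vfio.h`, and the helper name is invented here.

```c
/* Hypothetical userspace sketch of the proposed ioctl.  The struct and
 * command number mirror this patch; the VFIO_TYPE/VFIO_BASE values match
 * include/uapi/linux/vfio.h. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define VFIO_TYPE	(';')
#define VFIO_BASE	100

struct vfio_addr_mapping {
	uint64_t buid;	/* Bus Unit ID of the (virtual) PHB */
	uint8_t  bus;
	uint8_t  slot;
	uint8_t  func;
};
#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)

/* Tell the host how the guest will address this passed-through device. */
static int vfio_set_addr_mapping(int device_fd, uint64_t buid,
				 uint8_t bus, uint8_t slot, uint8_t func)
{
	struct vfio_addr_mapping map = {
		.buid = buid, .bus = bus, .slot = slot, .func = func,
	};
	return ioctl(device_fd, VFIO_DEVICE_SET_ADDR_MAPPING, &map);
}
```

QEMU would presumably call this once per device, right after opening the VFIO device fd, with the virtual BUID and guest BDF it has assigned.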
Alex Williamson - March 15, 2013, 7:29 p.m.
On Fri, 2013-03-15 at 15:26 +0800, Gavin Shan wrote:
> The address (domain/bus/slot/function) of the passed PCI device
> looks quite different from perspective of host and guest. Some
> architectures like PPC need to setup the mapping in host. The patch
> introduces additional VFIO device IOCTL command to address that.

Could you explain further how this will be used?  How the device is
exposed to a guest is entirely a userspace construct, so why does vfio
need to know or care about this?  I had assumed for AER that QEMU would
do the translation from host to guest address space.

> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
> ---
>  include/uapi/linux/vfio.h |   16 ++++++++++++++++
>  1 files changed, 16 insertions(+), 0 deletions(-)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 6e58d9b..ecc4f38 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -289,6 +289,22 @@ struct vfio_irq_set {
>   */
>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
>  
> +/**
> + * VFIO_DEVICE_SET_ADDR_MAPPING - _IO(VFIO_TYPE, VFIO_BASE + 12)
> + *
> + * The address, which comprised of domain/bus/slot/function looks
> + * different between host and guest. We need to setup the mapping
> + * in host for some architectures like PPC so that the passed PCI
> + * devices could support RTAS smoothly.
> + */
> +struct vfio_addr_mapping {
> +	__u64 buid;

What's a buid?  Thanks,

Alex

> +	__u8  bus;
> +	__u8  slot;
> +	__u8  func;
> +};
> +#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)
> +
>  /*
>   * The VFIO-PCI bus driver makes use of the following fixed region and
>   * IRQ index mapping.  Unimplemented regions return a size of zero.
Gavin Shan - March 16, 2013, 1:34 a.m.
On Fri, Mar 15, 2013 at 01:29:00PM -0600, Alex Williamson wrote:
>On Fri, 2013-03-15 at 15:26 +0800, Gavin Shan wrote:
>> The address (domain/bus/slot/function) of the passed PCI device
>> looks quite different from perspective of host and guest. Some
>> architectures like PPC need to setup the mapping in host. The patch
>> introduces additional VFIO device IOCTL command to address that.
>
>Could you explain further how this will be used?  How the device is
>exposed to a guest is entirely a userspace construct, so why does vfio
>need to know or care about this?  I had assumed for AER that QEMU would
>do the translation from host to guest address space.
>

The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by the previous
patch. The PowerNV platform is going to override it to figure out the
information for the EEH core to use. On the other hand, QEMU will run into
the IOCTL command while opening (creating) a VFIO device.

Though I'm not familiar with AER very much, AER is quite different from
EEH. The EEH functionality is implemented in the PHB instead of in the
PCI device core, so we don't care about AER stuff in EEH directly :-)
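The weak-function override pattern described here can be sketched as follows. This is illustrative only: the function name comes from the series, but the bodies and the config guard are invented for this example.

```c
/* Sketch of the weak arch-ioctl hook described above.  The function
 * name mirrors the series; the bodies here are invented. */
#include <errno.h>

/* Generic fallback in vfio-pci core: arch-specific commands are
 * unsupported unless a platform provides a strong override. */
long __attribute__((weak))
vfio_pci_arch_ioctl(void *vdev, unsigned int cmd, unsigned long arg)
{
	(void)vdev; (void)cmd; (void)arg;
	return -ENOTTY;
}

#ifdef CONFIG_PPC_POWERNV
/* PowerNV override: the strong definition wins at link time, and is
 * where the guest address mapping would be recorded for the EEH core. */
long vfio_pci_arch_ioctl(void *vdev, unsigned int cmd, unsigned long arg)
{
	/* ... handle VFIO_DEVICE_SET_ADDR_MAPPING here ... */
	return 0;
}
#endif
```

With the weak symbol in the core driver, platforms that don't care about the command pay nothing and simply return -ENOTTY.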

>> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
>> ---
>>  include/uapi/linux/vfio.h |   16 ++++++++++++++++
>>  1 files changed, 16 insertions(+), 0 deletions(-)
>> 
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 6e58d9b..ecc4f38 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -289,6 +289,22 @@ struct vfio_irq_set {
>>   */
>>  #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
>>  
>> +/**
>> + * VFIO_DEVICE_SET_ADDR_MAPPING - _IO(VFIO_TYPE, VFIO_BASE + 12)
>> + *
>> + * The address, which comprised of domain/bus/slot/function looks
>> + * different between host and guest. We need to setup the mapping
>> + * in host for some architectures like PPC so that the passed PCI
>> + * devices could support RTAS smoothly.
>> + */
>> +struct vfio_addr_mapping {
>> +	__u64 buid;
>
>What's a buid?  Thanks,
>

BUID means "Bus Unit Identifier". The BUID is the identifier for a specific
PHB. Firmware figures it out and exposes it to the OS through the device
tree. For the VFIO case, QEMU figures it out and exposes it to the guest
eventually. It's notable that the PHB (including the BUID) figured out by
QEMU is virtual, something like a container :-)

>> +	__u8  bus;
>> +	__u8  slot;
>> +	__u8  func;
>> +};
>> +#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)
>> +
>>  /*
>>   * The VFIO-PCI bus driver makes use of the following fixed region and
>>   * IRQ index mapping.  Unimplemented regions return a size of zero.
>

Thanks,
Gavin
Benjamin Herrenschmidt - March 16, 2013, 5:37 a.m.
On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> >Could you explain further how this will be used?  How the device is
> >exposed to a guest is entirely a userspace construct, so why does vfio
> >need to know or care about this?  I had assumed for AER that QEMU would
> >do the translation from host to guest address space.
> >
> 
> The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> patch. The PowerNV platform is going to override it to figure out the
> information for EEH core to use. On the other hand, QEMU will runs into
> the IOCTL command while opening (creating) one VFIO device.
> 
> Though I'm not familiar with AER very much. AER is quite different from
> EEH. The EEH functionality implemented in PHB instead of in PCI device
> core. So we don't care AER stuff in EEH directly :-)

To give Alex a bit more background...

EEH is our IBM-specific error handling facility, which is a superset of AER.

IE. In addition to AER's error detection and logging, it adds a layer of
error detection at the host bridge level (such as iommu violations etc...)
and a mechanism for handling and recovering from errors. This is tied to
our iommu domain stuff (our PE's) and our device "freezing" capability
among others.

With VFIO + KVM, we want to implement most of the EEH support for guests in
the host kernel. The reason is multipart and we can discuss this separately
as some of it might well be debatable (mostly it's more convenient that way
because we hook into the underlying HW/FW EEH which isn't directly userspace
accessible so we don't have to add a new layer of kernel -> user API in
addition to the VFIO stuff), but there's at least one aspect of it that drives
this requirement more strongly which is performance:

When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
a firmware call to query the EEH state of the device and check whether it
has been frozen. On some devices, that can be a performance issue, and
going all the way to qemu for that would be horribly expensive.

So we want at least a way to handle that call in the kernel and for that we
need at least some way of mapping things there.
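The hot-path check described above might look roughly like this. It is a sketch with invented names: the real logic lives in the powerpc EEH core, and the frozen-state query is a firmware (RTAS/OPAL) call, which is exactly why it is too slow to bounce through qemu.

```c
#include <stdint.h>
#include <stdbool.h>

/* Stub for illustration: the real check is a firmware call querying
 * the PE state, which is why it is expensive. */
static bool eeh_dev_is_frozen(void *edev)
{
	(void)edev;
	return false;
}

/* Sketch of the EEH MMIO-read check: a read returning all 1's is what
 * a frozen (isolated) device produces, so only then pay for the
 * firmware state query. */
static uint32_t eeh_readl(void *edev, const volatile uint32_t *addr)
{
	uint32_t val = *addr;

	if (val == 0xffffffffU && eeh_dev_is_frozen(edev)) {
		/* device frozen: kick off EEH recovery here */
	}
	return val;
}
```

Since all 1's is also a legal value for a healthy device, the firmware query is the only way to disambiguate, and it sits on the MMIO read path.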

Cheers,
Ben.
Alex Williamson - March 18, 2013, 9:01 p.m.
On Sat, 2013-03-16 at 06:37 +0100, Benjamin Herrenschmidt wrote:
> On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> > >Could you explain further how this will be used?  How the device is
> > >exposed to a guest is entirely a userspace construct, so why does vfio
> > >need to know or care about this?  I had assumed for AER that QEMU would
> > >do the translation from host to guest address space.
> > >
> > 
> > The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> > patch. The PowerNV platform is going to override it to figure out the
> > information for EEH core to use. On the other hand, QEMU will runs into
> > the IOCTL command while opening (creating) one VFIO device.
> > 
> > Though I'm not familiar with AER very much. AER is quite different from
> > EEH. The EEH functionality implemented in PHB instead of in PCI device
> > core. So we don't care AER stuff in EEH directly :-)
> 
> To give Alex a bit more background...
> 
> EEH is our IBM specific error handling facility which is a superset of AER.
> 
> IE. In addition to AER's error detection and logging, it adds a layer of
> error detection at the host bridge level (such as iommu violations etc...)
> and a mechanism for handling and recovering from errors. This is tied to
> our iommu domain stuff (our PE's) and our device "freezing" capability
> among others.
> 
> With VFIO + KVM, we want to implement most of the EEH support for guests in
> the host kernel. The reason is multipart and we can discuss this separately
> as some of it might well be debatable (mostly it's more convenient that way
> because we hook into the underlying HW/FW EEH which isn't directly userspace
> accessible so we don't have to add a new layer of kernel -> user API in
> addition to the VFIO stuff), but there's at least one aspect of it that drives
> this requirement more strongly which is performance:
> 
> When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
> a firmware call to query the EEH state of the device and check whether it
> has been frozen. On some devices, that can be a performance issue, and
> going all the way to qemu for that would be horribly expensive.
> 
> So we want at least a way to handle that call in the kernel and for that we
> need at least some way of mapping things there.

There's no notification mechanism when a PHB is frozen?  I suppose
notification would be asynchronous so you risk data for every read that
happens in the interim.  So the choices are a) tell the host kernel the
mapping, b) tell the guest kernel the mapping, c) identity mapping, or
d) qemu intercept?

Presumably your firmware call to query the EEH is not going through
VFIO, so is VFIO the appropriate place to setup this mapping?  As you
say, this seems like just a convenient place to put it even though it
really has nothing to do with the VFIO kernel component.  QEMU has this
information and could register it with the host kernel through other
means if available.  Maybe the mapping should be registered with KVM if
that's how the EEH data is accessed.  I'm not yet sold on why this
mapping is registered here.  Thanks,

Alex
Gavin Shan - March 19, 2013, 3:24 a.m.
On Mon, Mar 18, 2013 at 03:01:14PM -0600, Alex Williamson wrote:
>On Sat, 2013-03-16 at 06:37 +0100, Benjamin Herrenschmidt wrote:
>> On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
>> > >Could you explain further how this will be used?  How the device is
>> > >exposed to a guest is entirely a userspace construct, so why does vfio
>> > >need to know or care about this?  I had assumed for AER that QEMU would
>> > >do the translation from host to guest address space.
>> > >
>> > 
>> > The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
>> > patch. The PowerNV platform is going to override it to figure out the
>> > information for EEH core to use. On the other hand, QEMU will runs into
>> > the IOCTL command while opening (creating) one VFIO device.
>> > 
>> > Though I'm not familiar with AER very much. AER is quite different from
>> > EEH. The EEH functionality implemented in PHB instead of in PCI device
>> > core. So we don't care AER stuff in EEH directly :-)
>> 
>> To give Alex a bit more background...
>> 
>> EEH is our IBM specific error handling facility which is a superset of AER.
>> 
>> IE. In addition to AER's error detection and logging, it adds a layer of
>> error detection at the host bridge level (such as iommu violations etc...)
>> and a mechanism for handling and recovering from errors. This is tied to
>> our iommu domain stuff (our PE's) and our device "freezing" capability
>> among others.
>> 
>> With VFIO + KVM, we want to implement most of the EEH support for guests in
>> the host kernel. The reason is multipart and we can discuss this separately
>> as some of it might well be debatable (mostly it's more convenient that way
>> because we hook into the underlying HW/FW EEH which isn't directly userspace
>> accessible so we don't have to add a new layer of kernel -> user API in
>> addition to the VFIO stuff), but there's at least one aspect of it that drives
>> this requirement more strongly which is performance:
>> 
>> When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
>> a firmware call to query the EEH state of the device and check whether it
>> has been frozen. On some devices, that can be a performance issue, and
>> going all the way to qemu for that would be horribly expensive.
>> 
>> So we want at least a way to handle that call in the kernel and for that we
>> need at least some way of mapping things there.
>
>There's no notification mechanism when a PHB is frozen?  I suppose
>notification would be asynchronous so you risk data for every read that
>happens in the interim.  So the choices are a) tell the host kernel the
>mapping, b) tell the guest kernel the mapping, c) identity mapping, or
>d) qemu intercept?
>

We do have dedicated interrupts for detecting a frozen PHB on the host side.
However, the guest has to poll/check the frozen state (frozen PE) during
accesses to config or MMIO space. Of the suggested methods, (a) is what
we want to do with this patchset. (b) seems infeasible since the guest
shouldn't be aware of which hypervisor (e.g. KVM or PowerVM) it's running
on top of, so it's hard to teach the guest to do it. (d) sounds applicable
since QEMU knows the addresses (BDF) of both the host and guest devices.
However, we still need to let the host EEH core know which PCI device
has been passed to the guest, and the best place to do that would be when
opening the corresponding VFIO PCI device. In turn, that would still need a
weak function for the ppc platform to override. Why not directly take (a)
and finish everything in one VFIO IOCTL command?

Sorry, Alex. I didn't understand (c) well :-)

>Presumably your firmware call to query the EEH is not going through
>VFIO, so is VFIO the appropriate place to setup this mapping?  As you
>say, this seems like just a convenient place to put it even though it
>really has nothing to do with the VFIO kernel component.  QEMU has this
>information and could register it with the host kernel through other
>means if available.  Maybe the mapping should be registered with KVM if
>that's how the EEH data is accessed.  I'm not yet sold on why this
>mapping is registered here.  Thanks,
>

Yes, the EEH firmware calls needn't go through VFIO. However, EEH has a
very close relationship with PCI, and so does VFIO-PCI. Eventually, EEH
has a close relationship with VFIO-PCI :-)

Thanks,
Gavin
Alex Williamson - March 19, 2013, 4:18 a.m.
On Tue, 2013-03-19 at 11:24 +0800, Gavin Shan wrote:
> On Mon, Mar 18, 2013 at 03:01:14PM -0600, Alex Williamson wrote:
> >On Sat, 2013-03-16 at 06:37 +0100, Benjamin Herrenschmidt wrote:
> >> On Sat, 2013-03-16 at 09:34 +0800, Gavin Shan wrote:
> >> > >Could you explain further how this will be used?  How the device is
> >> > >exposed to a guest is entirely a userspace construct, so why does vfio
> >> > >need to know or care about this?  I had assumed for AER that QEMU would
> >> > >do the translation from host to guest address space.
> >> > >
> >> > 
> >> > The weak IOCTL function (vfio_pci_arch_ioctl) was introduced by previous
> >> > patch. The PowerNV platform is going to override it to figure out the
> >> > information for EEH core to use. On the other hand, QEMU will runs into
> >> > the IOCTL command while opening (creating) one VFIO device.
> >> > 
> >> > Though I'm not familiar with AER very much. AER is quite different from
> >> > EEH. The EEH functionality implemented in PHB instead of in PCI device
> >> > core. So we don't care AER stuff in EEH directly :-)
> >> 
> >> To give Alex a bit more background...
> >> 
> >> EEH is our IBM specific error handling facility which is a superset of AER.
> >> 
> >> IE. In addition to AER's error detection and logging, it adds a layer of
> >> error detection at the host bridge level (such as iommu violations etc...)
> >> and a mechanism for handling and recovering from errors. This is tied to
> >> our iommu domain stuff (our PE's) and our device "freezing" capability
> >> among others.
> >> 
> >> With VFIO + KVM, we want to implement most of the EEH support for guests in
> >> the host kernel. The reason is multipart and we can discuss this separately
> >> as some of it might well be debatable (mostly it's more convenient that way
> >> because we hook into the underlying HW/FW EEH which isn't directly userspace
> >> accessible so we don't have to add a new layer of kernel -> user API in
> >> addition to the VFIO stuff), but there's at least one aspect of it that drives
> >> this requirement more strongly which is performance:
> >> 
> >> When EEH is enabled, whenever any MMIO returns all 1's, the kernel will do
> >> a firmware call to query the EEH state of the device and check whether it
> >> has been frozen. On some devices, that can be a performance issue, and
> >> going all the way to qemu for that would be horribly expensive.
> >> 
> >> So we want at least a way to handle that call in the kernel and for that we
> >> need at least some way of mapping things there.
> >
> >There's no notification mechanism when a PHB is frozen?  I suppose
> >notification would be asynchronous so you risk data for every read that
> >happens in the interim.  So the choices are a) tell the host kernel the
> >mapping, b) tell the guest kernel the mapping, c) identity mapping, or
> >d) qemu intercept?
> >
> 
> We do have dedicated interrupts on detecting frozen PHB on host side.
> However, the guest has to poll/check the frozen state (frozen PE) during
> access to config or MMIO space.

Can you make use of something like this to notify the guest:

https://github.com/awilliam/linux-vfio/commit/dad9f8972e04cd081a028d8fb1249d746d97fc03

As a first step this only notifies QEMU, but the plan is to forward that
on to the guest.  If we can leverage similar interfaces between AER and
EEH, I'd obviously like to do that.

> For the recommended methods, (a) is what
> we want to do with the patchset. (b) seems infeasible since the guest
> shouldn't be aware of hypervisor (e.g. KVM or PowerVM) it's running on
> top of, it's hard to polish the guest to do it. (d) sounds applicable
> since the QEMU should know the address (BDF) of host and guest devices.
> However, we still need let the host EEH core know that which PCI device
> has been passed to guest and the best place to do that would be when opening
> the corresponding VFIO PCI device. In turn, it will still need weak function
> for ppc platform to override it. Why we not directly take (a) to finish
> everything in one VFIO IOCTL command?

Because it seems like VFIO is just being used as a relay and has no
purpose knowing this information on its own.  It's just a convenient
place to host the ioctl, but that alone is not a good enough reason to
put it there.

> Sorry, Alex. I didn't understand (c) well :-)

(c) is if the buid/bus/slot/func exposed to the guest matches the same
for the host, then there's no need for mapping translation.

> >Presumably your firmware call to query the EEH is not going through
> >VFIO, so is VFIO the appropriate place to setup this mapping?  As you
> >say, this seems like just a convenient place to put it even though it
> >really has nothing to do with the VFIO kernel component.  QEMU has this
> >information and could register it with the host kernel through other
> >means if available.  Maybe the mapping should be registered with KVM if
> >that's how the EEH data is accessed.  I'm not yet sold on why this
> >mapping is registered here.  Thanks,
> >
> 
> Yes, EEH firmware call needn't going through VFIO. However, EEH has
> very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> has close relationship with VFIO-PCI :-)

Is there some plan to do more with EEH through VFIO in the future or are
we just talking about some kind of weird associative property to sell
this ioctl?  Thanks,

Alex
Benjamin Herrenschmidt - March 19, 2013, 4:45 a.m.
On Mon, 2013-03-18 at 22:18 -0600, Alex Williamson wrote:
> > Yes, EEH firmware call needn't going through VFIO. However, EEH has
> > very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> > has close relationship with VFIO-PCI :-)
> 
> Is there some plan to do more with EEH through VFIO in the future or are
> we just talking about some kind of weird associative property to sell
> this ioctl?  Thanks,

Well, I'm not sure how 'weird' that is but it makes sense to me... VFIO
is the mechanism that virtualizes access to a PCI device and provides
interfaces to qemu & kvm to access it &| map it.

Or rather VFIO-PCI is.

At a basic level it provides ... the basic PCI interfaces, ie, config
space access (with or without filtering), interrupts, etc...

In our environment, EEH is just another functionality of PCI really.
The firmware calls used by the guest to do that fall into more or less
the same category as the ones used for PCI config space accesses,
manipulation of DMA windows, etc... Similar to host (though guest
and host use a different FW interface for various reasons).

So it's very natural to "transport" these via VFIO-PCI like everything
else, I don't see a more natural place to put the ioctl's we need for
qemu to be able to access the EEH state, trigger EEH (un)isolation,
resets, etc...

Fundamentally, the design should be for VFIO-PCI to provide some specific
ioctls for EEH that userspace programs such as qemu can use, and then
re-expose those APIs to the guest.

In addition, to do some of it in the kernel for performance reason, we
want to establish that mapping, but I see that as a VFIO "accelerator".

IE. Whatever is going to respond to the EEH calls from the guest in-kernel
will have to share state with the rest of the EEH stuff provided to qemu
by vfio-pci.

Ben.
Alex Williamson - March 20, 2013, 6:48 p.m.
On Tue, 2013-03-19 at 05:45 +0100, Benjamin Herrenschmidt wrote:
> On Mon, 2013-03-18 at 22:18 -0600, Alex Williamson wrote:
> > > Yes, EEH firmware call needn't going through VFIO. However, EEH has
> > > very close relationship with PCI and so VFIO-PCI does. Eventually, EEH
> > > has close relationship with VFIO-PCI :-)
> > 
> > Is there some plan to do more with EEH through VFIO in the future or are
> > we just talking about some kind of weird associative property to sell
> > this ioctl?  Thanks,
> 
> Well, I'm not sure how 'weird' that is but it makes sense to me... VFIO
> is the mechanism that virtualizes access to a PCI device and provides
> interfaces to qemu & kvm to access it &| map it.
> 
> Or rather VFIO-PCI is.
> 
> At a basic level it provides ... the basic PCI interfaces, ie, config
> space access (with or without filtering), interrupts, etc...
> 
> In our environment, EEH is just another functionality of PCI really.
> The firmware calls used by the guest to do that fall into more or less
> the same category as the ones used for PCI config space accesses,
> manipulation of DMA windows, etc... Similar to host (though guest
> and host use a different FW interface for various reasons).
> 
> So it's very natural to "transport" these via VFIO-PCI like everything
> else, I don't see a more natural place to put the ioctl's we need for
> qemu to be able to access the EEH state, trigger EEH (un)isolation,
> resets, etc...
> 
> Fundamentally, the design should be for VFIO-PCI to provide some specific
> ioctls for EEH that userspace programs such as qemu can use, and then
> re-expose those APIs to the guest.
> 
> In addition, to do some of it in the kernel for performance reason, we
> want to establish that mapping, but I see that as a VFIO "accelerator".
> 
> IE. Whatever is going to respond to the EEH calls from the guest in-kernel
> will have to share state with the rest of the EEH stuff provided to qemu
> by vfio-pci.

Perhaps my problem is that I don't have a clear picture of where you're
going with this like I do for AER.  For AER we're starting with
notification of an error, from that we build into how to retrieve the
error information, and finally how to perform corrective action.  Each
of these will be done through vfio-pci.

Here we're starting by registering a mapping that's really only useful
to the vfio "accelerator" path, but we don't even have a hint of what
the non-accelerated path is and how vfio is involved with it.  Thanks,

Alex
Benjamin Herrenschmidt - March 20, 2013, 7:31 p.m.
On Wed, 2013-03-20 at 12:48 -0600, Alex Williamson wrote:
> Perhaps my problem is that I don't have a clear picture of where
> you're
> going with this like I do for AER.  For AER we're starting with
> notification of an error, from that we build into how to retrieve the
> error information, and finally how to perform corrective action.  Each
> of these will be done through vifo-pci.
> 
> Here we're starting by registering a mapping that's really only useful
> to the vfio "accelerator" path, but we don't even have a hint of what
> the non-accelerated path is and how vfio is involved with it.  Thanks,

I'm surprised that you are building so much policy around AER ... can't
you just pass the raw stuff down to the guest and let the guest do its
own corrective actions?

As for EEH, I will let Gavin describe in more details what he is doing,
though I wouldn't be surprised if so far he doesn't have a
non-accelerated path :-) Which indeed makes things oddball, granted ...
at least for now. I *think* what Gavin's doing right now is a
pass-through to the host EEH directly in the kernel, so without a slow
path...

Gavin, it really boils down to that. If in-kernel EEH for guests is a
KVMism that ends up not involving VFIO in any other way than
establishing the mapping, then arguably it could be done via a VM ioctl.

If there's more going through VFIO and shared state, then it should
probably go through VFIO-PCI.

Note (FYI) that EEH somewhat encompass AER... the EEH logic triggers on
AER errors as well and the error reports generated by the firmware
contain the AER register dump in addition to the bridge internal stuff.
IE. EEH encompass pretty much all sort of errors, correctable or not,
that happens on PCI. It adds a mechanism of "isolation" of domains on
first error (involving blocking MMIOs, DMAs and MSIs) which helps with
preventing propagation of bad data, and various recovery schemes.

Cheers,
Ben.
Alex Williamson - March 20, 2013, 7:46 p.m.
On Wed, 2013-03-20 at 20:31 +0100, Benjamin Herrenschmidt wrote:
> On Wed, 2013-03-20 at 12:48 -0600, Alex Williamson wrote:
> > Perhaps my problem is that I don't have a clear picture of where
> > you're
> > going with this like I do for AER.  For AER we're starting with
> > notification of an error, from that we build into how to retrieve the
> > error information, and finally how to perform corrective action.  Each
> > of these will be done through vifo-pci.
> > 
> > Here we're starting by registering a mapping that's really only useful
> > to the vfio "accelerator" path, but we don't even have a hint of what
> > the non-accelerated path is and how vfio is involved with it.  Thanks,
> 
> I'm surprised that you are building so much policy around AER ... can't
> you just pass the raw stuff down to the guest and let the guest do it's
> own corrective actions ?

How does the guest get the raw stuff?  We need to get the AER interrupt
out to the guest so it can be injected into the virtual PCIe port, then
we need to be able to retrieve the physical device's error log and pass it
to qemu to mangle to match the guest topology.  We don't have existing
firmware interfaces for the guest to do that, so it's all being routed
through vfio-pci.

> As for EEH, I will let Gavin describe in more details what he is doing,
> though I wouldn't be surprised if so far he doesn't have a
> non-accelerated path :-) Which indeed makes things oddball, granted ...
> at least for now. I *think* what Gavin's doing right now is a
> pass-through to the host EEH directly in the kernel, so without a slow
> path...
> 
> Gavin, it really boils down to that. In-kernel EEH for guests is a
> KVMism that ends up not involving VFIO in any other way than
> establishing the mapping, then arguably it could be done via a VM ioctl.
> 
> If there's more going through VFIO and shared state, then it should
> probably go through VFIO-PCI.

Exactly my thinking.  Thanks,

Alex
Gavin Shan - March 21, 2013, 2:09 a.m.
On Wed, Mar 20, 2013 at 01:46:22PM -0600, Alex Williamson wrote:
>On Wed, 2013-03-20 at 20:31 +0100, Benjamin Herrenschmidt wrote:
>> On Wed, 2013-03-20 at 12:48 -0600, Alex Williamson wrote:

.../...

>> As for EEH, I will let Gavin describe in more details what he is doing,
>> though I wouldn't be surprised if so far he doesn't have a
>> non-accelerated path :-) Which indeed makes things oddball, granted ...
>> at least for now. I *think* what Gavin's doing right now is a
>> pass-through to the host EEH directly in the kernel, so without a slow
>> path...
>> 

Yes, I don't have a non-accelerated path. Let me describe what I'm
doing: on the host side, an interrupt is triggered on detecting a frozen
PE that has been passed to the guest. We don't send an EEH event to the
EEH core on the host side; instead, we wait for the guest to get involved
in handling the EEH error. In the guest, any access to config or MMIO
space of the frozen PE will trigger an EEH event and, in turn, the guest
uses the existing RTAS calls (exactly the same as the pSeries-on-PHYP
case) to recover from the error. The RTAS calls are emulated in the host
kernel.

Part of the RTAS call arguments is the PCI domain/bus/slot/function as
viewed from the guest's perspective. That's different from the address of
the same physical PCI device on the host side. So I used VFIO-PCI to set up
the mapping and maintain the information in the host kernel.
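The guest-to-host resolution described here could be sketched as a simple lookup table keyed by the guest-visible address. All names below are invented for illustration; the real state would live in the host EEH core and point at actual pci_dev structures.

```c
/* Illustrative sketch of the guest->host address mapping: the host
 * records each passed-through device under its guest-visible
 * (buid, bus, slot, func) and resolves RTAS-call arguments through it. */
#include <stdint.h>
#include <stddef.h>

struct eeh_guest_map {
	uint64_t buid;		/* virtual PHB BUID seen by the guest */
	uint8_t  bus, slot, func;
	void	*host_dev;	/* the passed-through host PCI device */
};

static struct eeh_guest_map map_table[32];
static size_t map_count;

/* Record a mapping, e.g. when the VFIO device is opened. */
static int eeh_guest_record(uint64_t buid, uint8_t bus, uint8_t slot,
			    uint8_t func, void *host_dev)
{
	if (map_count >= sizeof(map_table) / sizeof(map_table[0]))
		return -1;
	map_table[map_count++] = (struct eeh_guest_map){
		.buid = buid, .bus = bus, .slot = slot,
		.func = func, .host_dev = host_dev,
	};
	return 0;
}

/* Resolve a guest RTAS-call address to the host device, if any. */
static void *eeh_guest_lookup(uint64_t buid, uint8_t bus,
			      uint8_t slot, uint8_t func)
{
	for (size_t i = 0; i < map_count; i++) {
		struct eeh_guest_map *m = &map_table[i];
		if (m->buid == buid && m->bus == bus &&
		    m->slot == slot && m->func == func)
			return m->host_dev;
	}
	return NULL;	/* not a passed-through device */
}
```

Whether this table is populated via a VFIO ioctl or a VM ioctl is exactly the question being debated in this thread.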

>> Gavin, it really boils down to that. In-kernel EEH for guests is a
>> KVMism that ends up not involving VFIO in any other way than
>> establishing the mapping, then arguably it could be done via a VM ioctl.
>> 
>> If there's more going through VFIO and shared state, then it should
>> probably go through VFIO-PCI.
>

Ben, you're right. I use VFIO for nothing other than the address mapping,
so I will do the address mapping in a VM IOCTL instead of in VFIO-PCI.

Thanks,
Gavin

Patch

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 6e58d9b..ecc4f38 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -289,6 +289,22 @@  struct vfio_irq_set {
  */
 #define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
 
+/**
+ * VFIO_DEVICE_SET_ADDR_MAPPING - _IO(VFIO_TYPE, VFIO_BASE + 12)
+ *
+ * The address, which comprised of domain/bus/slot/function looks
+ * different between host and guest. We need to setup the mapping
+ * in host for some architectures like PPC so that the passed PCI
+ * devices could support RTAS smoothly.
+ */
+struct vfio_addr_mapping {
+	__u64 buid;
+	__u8  bus;
+	__u8  slot;
+	__u8  func;
+};
+#define VFIO_DEVICE_SET_ADDR_MAPPING	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
 /*
  * The VFIO-PCI bus driver makes use of the following fixed region and
  * IRQ index mapping.  Unimplemented regions return a size of zero.