[RFC] vfio: VFIO PCI driver for Qemu

Message ID 20120725165948.17260.82862.stgit@bling.home
State New

Commit Message

Alex Williamson July 25, 2012, 5:03 p.m. UTC
This adds PCI based device assignment to Qemu using the Linux VFIO
userspace driver interface.  After setting up VFIO device access,
devices can be added to Qemu guests using the vfio-pci device
option:

 -device vfio-pci,host=1:10.1,id=net0

or for hotplug:

(qemu) device_add vfio-pci,host=1:10.1,id=net0
(qemu) device_del net0

This patch adds support for assigning host physical PCI devices,
with or without KVM[1] for x86 hosts and guests.  Support for
POWER hosts and guests is working and expected to follow shortly.
Other platforms wishing to make use of this need to do the following:
 - Add a VFIO IOMMU interface to the host kernel driver or make use of
   an existing one if possible (pre-req: linux host IOMMU support)
 - Add corresponding mapping calls for your IOMMU in qemu, see
   x86 and POWER for examples.

And if you care about PCI legacy interrupts:
 - Add support for EOI notification (TBD for everyone)

While not requiring KVM support, VFIO based device assignment still
supports acceleration through KVM.  MMIO regions with sufficient
alignment are mapped directly into the guest address space and
platforms supporting direct interrupt injection through eventfds can
bypass Qemu userspace.  This support is included and automatically
enabled when KVM and KVM irqchip are enabled.  These allow VFIO
based assignment to meet the same performance levels as KVM based
assignment in the qemu-kvm tree.

Sending this as an RFC for review as we're waiting on VFIO to be
accepted into the Linux kernel.  I'm hoping it will be accepted
for Linux v3.6.  Pending Linux VFIO acceptance, I'd like to get
this support in for 1.2 and work on generic Qemu EOI infrastructure
in-tree.  This patch is based on current qemu.git merged with MST's
latest pull request.  Thanks,

Alex

[1] The proposed level IRQFD/EOIFD KVM interface is currently
required to support legacy PCI INTx interrupts.  Qemu support for
this is included here.  Qemu infrastructure for EOI notification
is not yet in place to do this without KVM.  Devices which rely only
on MSI/MSIX work in unaccelerated Qemu.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 MAINTAINERS                |    5 
 configure                  |   12 
 hw/i386/Makefile.objs      |    1 
 hw/vfio_pci.c              | 2030 ++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio_pci.h              |  100 ++
 linux-headers/linux/kvm.h  |   21 
 linux-headers/linux/vfio.h |  445 ++++++++++
 7 files changed, 2614 insertions(+)
 create mode 100644 hw/vfio_pci.c
 create mode 100644 hw/vfio_pci.h
 create mode 100644 linux-headers/linux/vfio.h

Comments

Avi Kivity July 25, 2012, 7:30 p.m. UTC | #1
On 07/25/2012 08:03 PM, Alex Williamson wrote:
> This adds PCI based device assignment to Qemu using the Linux VFIO
> userspace driver interface.  After setting up VFIO device access,
> devices can be added to Qemu guests using the vfio-pci device
> option:
>
>  -device vfio-pci,host=1:10.1,id=net0
>
>

Let's use the same syntax as for kvm device assignment.  Then we can
fall back on kvm when vfio is not available.  We can also have an
optional parameter kernel-driver to explicitly select vfio or kvm.
Alex Williamson July 25, 2012, 7:53 p.m. UTC | #2
On Wed, 2012-07-25 at 22:30 +0300, Avi Kivity wrote:
> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> > This adds PCI based device assignment to Qemu using the Linux VFIO
> > userspace driver interface.  After setting up VFIO device access,
> > devices can be added to Qemu guests using the vfio-pci device
> > option:
> >
> >  -device vfio-pci,host=1:10.1,id=net0
> >
> >
> 
> Let's use the same syntax as for kvm device assignment.  Then we can
> fall back on kvm when vfio is not available.  We can also have an
> optional parameter kernel-driver to explicitly select vfio or kvm.

This seems confusing to me; pci-assign already has options like
prefer_msi, share_intx, and configfd that vfio doesn't.  I'm sure vfio
will eventually get options that pci-assign won't have.  How is a user
supposed to figure out what options are actually available from -device
pci-assign,?  Isn't this the same as asking to drop all model specific
devices and just use -device net,model=e1000... hey, we've been there
before ;)  Thanks,

Alex
Avi Kivity July 26, 2012, 8:35 a.m. UTC | #3
On 07/25/2012 10:53 PM, Alex Williamson wrote:
> On Wed, 2012-07-25 at 22:30 +0300, Avi Kivity wrote:
>> On 07/25/2012 08:03 PM, Alex Williamson wrote:
>> > This adds PCI based device assignment to Qemu using the Linux VFIO
>> > userspace driver interface.  After setting up VFIO device access,
>> > devices can be added to Qemu guests using the vfio-pci device
>> > option:
>> >
>> >  -device vfio-pci,host=1:10.1,id=net0
>> >
>> >
>> 
>> Let's use the same syntax as for kvm device assignment.  Then we can
>> fall back on kvm when vfio is not available.  We can also have an
>> optional parameter kernel-driver to explicitly select vfio or kvm.
> 
> This seems confusing to me, pci-assign already has options like
> prefer_msi, share_intx, and configfd that vfio doesn't.  I'm sure vfio
> will eventually get options that pci-assign won't have.  How is a user
> supposed to figure out what options are actually available from -device
> pci-assign,? 

Read the documentation.

> Isn't this the same as asking to drop all model specific
> devices and just use -device net,model=e1000... hey, we've been there
> before ;)  Thanks,

It's not.  e1000 is a guest visible feature. vfio and kvm assignment do
exactly the same thing, as far as the guest is concerned, just using a
different driver.  This is more akin to -device virtio-net,vhost=on|off
(where we also have a default and a fallback, which wouldn't make sense
for model=e1000).
Alex Williamson July 26, 2012, 2:56 p.m. UTC | #4
On Thu, 2012-07-26 at 11:35 +0300, Avi Kivity wrote:
> On 07/25/2012 10:53 PM, Alex Williamson wrote:
> > On Wed, 2012-07-25 at 22:30 +0300, Avi Kivity wrote:
> >> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> >> > This adds PCI based device assignment to Qemu using the Linux VFIO
> >> > userspace driver interface.  After setting up VFIO device access,
> >> > devices can be added to Qemu guests using the vfio-pci device
> >> > option:
> >> >
> >> >  -device vfio-pci,host=1:10.1,id=net0
> >> >
> >> >
> >> 
> >> Let's use the same syntax as for kvm device assignment.  Then we can
> >> fall back on kvm when vfio is not available.  We can also have an
> >> optional parameter kernel-driver to explicitly select vfio or kvm.
> > 
> > This seems confusing to me, pci-assign already has options like
> > prefer_msi, share_intx, and configfd that vfio doesn't.  I'm sure vfio
> > will eventually get options that pci-assign won't have.  How is a user
> > supposed to figure out what options are actually available from -device
> > pci-assign,? 
> 
> Read the documentation.

And libvirt is supposed to parse the qemu-docs package matching the
installed qemu binary package to figure out what's supported?

> > Isn't this the same as asking to drop all model specific
> > devices and just use -device net,model=e1000... hey, we've been there
> > before ;)  Thanks,
> 
> It's not.  e1000 is a guest visible feature. vfio and kvm assignment do
> exactly the same thing, as far as the guest is concerned, just using a
> different driver.  This is more akin to -device virtio-net,vhost=on|off
> (where we also have a default and a fallback, which wouldn't make sense
> for model=e1000).

I understand and agree with your desire to make this transparent from the
user perspective, but I think the place to do that abstraction is
libvirt.  The qemu command line is just the final step in a process that
already needs to be aware of which backend will be used.  This is not
simply a small tweak to the qemu options and now I'm using vfio.  It
goes something like this:

   KVM                                     VFIO
1. Identify the assigned device         1. Identify the assigned device
2. Unbind from host driver              2. Identify the iommu group for the device
3. Bind to pci-stub                     3. Evaluate all the devices for the group
4. Launch qemu                          4. Unbind all devices in the group from host drivers
                                        5. Bind all devices in the group to vfio-pci
                                        6. Launch qemu

I've actually already had a report from an early adopter that did
everything under the VFIO list on the right, but happened to be
using qemu-kvm and the -device pci-assign option and couldn't figure out
what was going on.  Due to KVM's poor device ownership model, it was
more than happy to bind to a device owned by vfio-pci.  Imagine the
support questions we have to ask if we support both via pci-assign;
well, what version of qemu are you using and does that default to vfio
or kvm assignment or has the distro modified it to switch the default...
VFIO offers certain advantages, for instance correctly managing the
IOMMU domain on systems like Andreas' where KVM can't manage the domain
of the bridge because it doesn't understand grouping.  There are also
obvious advantages in the device ownership model.  Users want to be sure
they're using these things.

Both KVM and VFIO do strive to make the device in the guest look as much
like it does on bare metal as possible, but we don't guarantee they're
identical and we don't guarantee to match each other.  So in fact, we
can expect subtle differences in how the guest sees it.  Things like the
capabilities exposed, the emulation/virtualization of some of those
capabilities, eventually things like express config space support and
AER error propagation.  These are all a bit more than "add vhost=on to
your virtio-net-pci options and magically your networking is faster".
Thanks,

Alex
Alex Williamson July 26, 2012, 3:09 p.m. UTC | #5
On Wed, 2012-07-25 at 11:03 -0600, Alex Williamson wrote:
> This adds PCI based device assignment to Qemu using the Linux VFIO
> userspace driver interface.  After setting up VFIO device access,
> devices can be added to Qemu guests using the vfio-pci device
> option:
> 
>  -device vfio-pci,host=1:10.1,id=net0
> 
> or for hotplug:
> 
> (qemu) device_add vfio-pci,host=1:10.1,id=net0
> (qemu) device_del net0
> 
> This patch adds support for assigning host physical PCI devices,
> with or without KVM[1] for x86 hosts and guests.  Support for
> POWER hosts and guests is working and expected to follow shortly.
> Other platforms wishing to make use of this need to do the following:
>  - Add a VFIO IOMMU interface to the host kernel driver or make us of
>    an existing one if possible (pre-req: linux host IOMMU support)
>  - Add corresponding mapping calls for your IOMMU in qemu, see
>    x86 and POWER for examples.
> 
> And if you care about PCI legacy interrupts:
>  - Add support for EOI notification (TBD for everyone)
> 
> While not requiring KVM support, VFIO based device assignment still
> supports acceleration through KVM.  MMIO regions with sufficient
> alignment are mapped directly into the guest addres space and
> platforms supporting direct interrupt injection through eventfds can
> bypass Qemu userspace.  This support is included and automatically
> enabled when KVM and KVM irqchip is enabled.  These allow VFIO
> based assignment to meet the same performance levels as KVM based
> assignment in the qemu-kvm tree.
> 
> Sending this as an RFC for review as we're waiting on VFIO to be
> accepted into the Linux kernel.  I'm hoping it will be accepted
> for Linux v3.6.  Pending Linux VFIO acceptance, I'd like to get
> this support in for 1.2 and work on generic Qemu EOI infrastructure
> in-tree.  This patch is based on current qemu.git merged with MST's
> latest pull request.  Thanks,
> 
> Alex
> 
> [1] The proposed level IRQFD/EOIFD KVM interface is currently
> required to support legacy PCI INTx interrupts.  Qemu support for
> this is included here.  Qemu infrastructure for EOI notification
> is not yet in place to do this without KVM.  Device which rely only
> on MSI/MSIX work in unaccelerated Qemu.

I forgot to mention that anyone wanting to test this out can use the
vfio-for-qemu branch of my git tree here:
git://github.com/awilliam/qemu-vfio.git

Thanks,
Alex
Avi Kivity July 26, 2012, 3:59 p.m. UTC | #6
On 07/26/2012 05:56 PM, Alex Williamson wrote:
>> >> Let's use the same syntax as for kvm device assignment.  Then we can
>> >> fall back on kvm when vfio is not available.  We can also have an
>> >> optional parameter kernel-driver to explicitly select vfio or kvm.
>> > 
>> > This seems confusing to me, pci-assign already has options like
>> > prefer_msi, share_intx, and configfd that vfio doesn't.  I'm sure vfio
>> > will eventually get options that pci-assign won't have.  How is a user
>> > supposed to figure out what options are actually available from -device
>> > pci-assign,? 
>> 
>> Read the documentation.
> 
> And libvirt is supposed to parse the qemu-docs package matching the
> installed qemu binary package to figure out what's supported?

I was hoping that we could avoid any change in libvirt.

> 
>> > Isn't this the same as asking to drop all model specific
>> > devices and just use -device net,model=e1000... hey, we've been there
>> > before ;)  Thanks,
>> 
>> It's not.  e1000 is a guest visible feature. vfio and kvm assignment do
>> exactly the same thing, as far as the guest is concerned, just using a
>> different driver.  This is more akin to -device virtio-net,vhost=on|off
>> (where we also have a default and a fallback, which wouldn't make sense
>> for model=e1000).
> 
> I understand an agree with your desire to make this transparent from the
> user perspective, but I think the place to do that abstraction is
> libvirt.  The qemu command line is just the final step in a process that
> already needs to be aware of which backend will be used.  This is not
> simply a small tweak to the qemu options and now I'm using vfio.  It
> goes something like this:
> 
>    KVM                                     VFIO
> 1. Identify the assigned device         1. Identify the assigned device
> 2. Unbind from host driver              2. Identify the iommu group for the device
> 3. Bind to pci-stub                     3. Evaluate all the devices for the group
> 4. Launch qemu                          4. Unbind all devices in the group from host drivers
>                                         5. Bind all devices in the group to vfio-pci
>                                         6. Launch qemu

In the common case, on x86 (but I'm repeating myself), the iommu group
includes just one device, yes?  Could we make pci-stub an alias for the
corresponding vfio steps?

Though I generally dislike doing magic behind the user's back.  qemu and
even more the kernel are low level interfaces and should behave as
regularly as possible.

> 
> I've actually already had a report from an early adopter that did
> everything under the VFIO list on the right, but but happened to be
> using qemu-kvm and the -device pci-assign option and couldn't figure out
> what was going on.  Due to KVM's poor device ownership model, it was
> more than happy to bind to a device owned by vfio-pci.  Imagine the
> support questions we have to ask if we support both via pci-assign;

In fact we had the same experience with kvm being enabled or not.  We
have 'info kvm' for that.

> well, what version of qemu are you using and does that default to vfio
> or kvm assignment or has the distro modified it to switch the default...
> VFIO offers certain advantages, for instance correctly managing the
> IOMMU domain on systems like Andreas' where KVM can't manage the domain
> of the bridge because it doesn't understand grouping.  There are also
> obvious advantages in the device ownership model.  Users want to be sure
> they're using these things.
> 
> Both KVM and VFIO do strive to make the device in the guest look as much
> like it does on bare metal as possible, but we don't guarantee they're
> identical and we don't guarantee to match each other.  So in fact, we
> can expect subtle difference in how the guest sees it.  Things like the
> capabilities exposed, the emulation/virtualization of some of those
> capabilities, eventually things like express config space support and
> AER error propagation.  These are all a bit more than "add vhost=on to
> your virtio-net-pci options and magically your networking is faster".

I see.  Thanks for the explanation.
Avi Kivity July 26, 2012, 4:06 p.m. UTC | #7
On 07/26/2012 05:56 PM, Alex Williamson wrote:
> 
> Both KVM and VFIO do strive to make the device in the guest look as much
> like it does on bare metal as possible, but we don't guarantee they're
> identical and we don't guarantee to match each other.

btw, this is somewhat problematic, conceivably this could break a guest
(due to a guest bug).  But with device assignment the compatibility
requirements can be relaxed a bit since there is no live migration.
Alex Williamson July 26, 2012, 4:33 p.m. UTC | #8
On Thu, 2012-07-26 at 18:59 +0300, Avi Kivity wrote:
> On 07/26/2012 05:56 PM, Alex Williamson wrote:
> >> >> Let's use the same syntax as for kvm device assignment.  Then we can
> >> >> fall back on kvm when vfio is not available.  We can also have an
> >> >> optional parameter kernel-driver to explicitly select vfio or kvm.
> >> > 
> >> > This seems confusing to me, pci-assign already has options like
> >> > prefer_msi, share_intx, and configfd that vfio doesn't.  I'm sure vfio
> >> > will eventually get options that pci-assign won't have.  How is a user
> >> > supposed to figure out what options are actually available from -device
> >> > pci-assign,? 
> >> 
> >> Read the documentation.
> > 
> > And libvirt is supposed to parse the qemu-docs package matching the
> > installed qemu binary package to figure out what's supported?
> 
> I was hoping that we could avoid any change in libvirt.

I don't think that's possible...

> >> > Isn't this the same as asking to drop all model specific
> >> > devices and just use -device net,model=e1000... hey, we've been there
> >> > before ;)  Thanks,
> >> 
> >> It's not.  e1000 is a guest visible feature. vfio and kvm assignment do
> >> exactly the same thing, as far as the guest is concerned, just using a
> >> different driver.  This is more akin to -device virtio-net,vhost=on|off
> >> (where we also have a default and a fallback, which wouldn't make sense
> >> for model=e1000).
> > 
> > I understand an agree with your desire to make this transparent from the
> > user perspective, but I think the place to do that abstraction is
> > libvirt.  The qemu command line is just the final step in a process that
> > already needs to be aware of which backend will be used.  This is not
> > simply a small tweak to the qemu options and now I'm using vfio.  It
> > goes something like this:
> > 
> >    KVM                                     VFIO
> > 1. Identify the assigned device         1. Identify the assigned device
> > 2. Unbind from host driver              2. Identify the iommu group for the device
> > 3. Bind to pci-stub                     3. Evaluate all the devices for the group
> > 4. Launch qemu                          4. Unbind all devices in the group from host drivers
> >                                         5. Bind all devices in the group to vfio-pci
> >                                         6. Launch qemu
> 
> In the common case, on x86 (but I'm repeating myself), the iommu group
> includes just one device, yes?  Could we make pci-stub an alias for the
> corresponding vfio steps?

PCI bridges masking devices is not as uncommon as you'd like, that's
exactly why Andreas is using VFIO instead of KVM assignment.  Not to
mention that VFIO takes a much more strict stance on multifunction ACS
requirements, typically resulting in all functions of a multifunction
device being inseparable.  So no, I don't think multi-device groups will
be unusual at all, even on x86.  Playing games with pci-stub sounds like
a nightmare.  Personally I think we have the opportunity to make libvirt
and tools like virt-manager a lot better with VFIO.  They no longer need
to do PCI bridge testing or ACS checking for VFIO and they can better
inform the user about what devices need to be removed from the host to
provide safe assignment.

> Though I generally dislike doing magic behind the user's back.  qemu and
> even more the kernel are low level interfaces and should behave as
> regularly as possible.
> 
> > 
> > I've actually already had a report from an early adopter that did
> > everything under the VFIO list on the right, but but happened to be
> > using qemu-kvm and the -device pci-assign option and couldn't figure out
> > what was going on.  Due to KVM's poor device ownership model, it was
> > more than happy to bind to a device owned by vfio-pci.  Imagine the
> > support questions we have to ask if we support both via pci-assign;
> 
> In fact we had the same experience with kvm being enabled or not.  We
> have 'info kvm' for that.

If we have both vfio and kvm assignment in the same tree there's no
reason we couldn't intermix them within a VM.  Unfortunately we have to
beware of KVM assignment's poor ownership model, but that's true whether
the device is attached to vfio-pci or some other driver.  Maybe we
should prevent that, but I see that happening by deprecating KVM
assignment and eventually disabling and removing it.  Thanks,

Alex
Avi Kivity July 26, 2012, 4:34 p.m. UTC | #9
On 07/25/2012 08:03 PM, Alex Williamson wrote:

> +/*
> + * Resource setup
> + */
> +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> +{
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t size;
> +
> +    if (!memory_region_size(&bar->mem)) {
> +        return;
> +    }
> +
> +    size = memory_region_size(&bar->mmap_mem);
> +    if (size) {
> +         memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> +         munmap(bar->mmap, size);
> +    }
> +
> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        size = memory_region_size(&vdev->msix->mmap_mem);
> +        if (size) {
> +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> +            munmap(vdev->msix->mmap, size);
> +        }
> +    }

Are the three size checks needed? Everything should work without them
from the memory core point of view.

> +
> +    memory_region_destroy(&bar->mem);
> +}
> +
> +static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
> +                         void **map, size_t size, off_t offset,
> +                         const char *name)
> +{
> +    *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +                MAP_SHARED, bar->fd, bar->fd_offset + offset);
> +    if (*map == MAP_FAILED) {
> +        *map = NULL;
> +        return -1;
> +    }
> +
> +    memory_region_init_ram_ptr(submem, name, size, *map);
> +    memory_region_add_subregion(mem, offset, submem);



> +
> +    return 0;
> +}
> +
> +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> +{
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    unsigned size = bar->size;
> +    char name[64];
> +
> +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> +
> +    /* A "slow" read/write mapping underlies all BARs */
> +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);

So far all container BARs have been pure containers, without RAM or I/O
callbacks.  It should all work, but this sets precedent and requires it
to work.  I guess there's no problem supporting it though.

> +
> +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> +        return; /* IO space is only slow, don't expect high perf here */
> +    }

What about non-x86 where IO is actually memory?  I think you can drop
this and let the address space filtering in the listener drop it if it
turns out to be in IO space.

> +
> +    if (size & ~TARGET_PAGE_MASK) {
> +        error_report("%s is too small to mmap, this may affect performance.\n",
> +                     name);
> +        return;
> +    }

We can work a little harder and align the host space offset with the
guest space offset, and map it in.
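
A rough sketch of the two conditions being discussed, assuming 4 KiB
pages.  The first helper is the whole-page test the patch applies, with
an extra check for a zero size; the second is one plausible reading of
"align the host space offset with the guest space offset": a sub-page
BAR could be mapped through its enclosing page only if both addresses
share the same offset within the page.  All names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumption: 4 KiB target pages */
#define SKETCH_PAGE_MASK (~(uint64_t)(SKETCH_PAGE_SIZE - 1))

/* True when size is a nonzero whole number of pages. */
static bool bar_is_page_multiple(uint64_t size)
{
    return size != 0 && (size & ~SKETCH_PAGE_MASK) == 0;
}

/*
 * True when the host file offset and the guest address have the same
 * offset within a page, so one page mapping can serve both.
 */
static bool offsets_congruent(uint64_t host_off, uint64_t guest_addr)
{
    return ((host_off ^ guest_addr) & ~SKETCH_PAGE_MASK) == 0;
}
```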

> +
> +    /*
> +     * We can't mmap areas overlapping the MSIX vector table, so we
> +     * potentially insert a direct-mapped subregion before and after it.
> +     */

This splitting is what the memory core really enjoys.  You can just
place the MSIX page over the RAM page and let it do the cut-n-paste.
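
For reference, the arithmetic the quoted code below does by hand can be
sketched as a standalone helper.  It assumes 4 KiB pages and the 16-byte
MSI-X table entry size from the PCI spec; the struct and function names
are made up for illustration.  Overlaying the MSIX page on the RAM
region would leave this computation to the memory core instead.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE       4096u  /* assumption: 4 KiB pages */
#define SKETCH_MSIX_ENTRY_SIZE 16u    /* bytes per MSI-X table entry */

struct mmap_range {
    uint64_t offset;  /* offset into the BAR */
    uint64_t size;    /* 0 means nothing to map */
};

static uint64_t page_round_down(uint64_t x)
{
    return x & ~(uint64_t)(SKETCH_PAGE_SIZE - 1);
}

static uint64_t page_round_up(uint64_t x)
{
    return page_round_down(x + SKETCH_PAGE_SIZE - 1);
}

/*
 * Compute the directly mappable ranges below and above the pages that
 * hold the MSI-X vector table.
 */
static void split_around_msix(uint64_t bar_size, uint64_t table_offset,
                              unsigned entries,
                              struct mmap_range *lo, struct mmap_range *hi)
{
    uint64_t table_end = table_offset +
                         (uint64_t)entries * SKETCH_MSIX_ENTRY_SIZE;
    uint64_t start = page_round_up(table_end);

    /* Everything up to the page containing the start of the table. */
    lo->offset = 0;
    lo->size = page_round_down(table_offset);

    /* Everything from the first page after the end of the table. */
    hi->offset = start;
    hi->size = start < bar_size ? bar_size - start : 0;
}
```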

> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> +    }
> +         
> +    if (size) {
> +        strcat(name, " mmap");
> +        if (vfio_mmap_bar(bar, &bar->mem, &bar->mmap_mem, &bar->mmap,
> +                          size, 0, name)) {
> +            error_report("%s Failed. Performance may be slow\n", name);
> +        }
> +    }
> +
> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        unsigned start;
> +
> +        start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
> +
> +        if (start < bar->size) {
> +            size = bar->size - start;
> +            strcat(name, " msix-hi");
> +            /* MSIXInfo contains another MemoryRegion for this mapping */
> +            if (vfio_mmap_bar(bar, &bar->mem, &vdev->msix->mmap_mem,
> +                              &vdev->msix->mmap, size, start, name)) {
> +                error_report("%s Failed. Performance may be slow\n", name);
> +            }
> +        }
> +    }
> +
> +    return;
> +}
> +
> +
> +static int __vfio_get_device(VFIOGroup *group,
> +                             const char *name, VFIODevice *vdev)

__foo is a reserved symbol.

> +{
> +    int ret;
> +
> +    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    if (ret < 0) {
> +        error_report("vfio: error getting device %s from group %d: %s",
> +                     name, group->groupid, strerror(errno));
> +        error_report("Verify all devices in group %d "
> +                     "are bound to vfio-pci or pci-stub and not already in use",
> +                     group->groupid);
> +        return -1;
> +    }
> +
> +    vdev->group = group;
> +    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
> +
> +    vdev->fd = ret;
> +
> +    return 0;
> +}
> +
> +
> +static Property vfio_pci_dev_properties[] = {
> +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> +    //TODO - support passed fds... is this necessary?

Yes.

> +    //DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name),
> +    //DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +
> +
> +typedef struct MSIVector {
> +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> +    struct VFIODevice *vdev; /* back pointer to device */
> +    int vector; /* the vector number for this element */
> +    int virq; /* KVM irqchip route for Qemu bypass */

This calls for an abstraction (don't we have a cache where we look those
up?)

> +    bool use;
> +} MSIVector;
> +
> +
> +typedef struct VFIOContainer {
> +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> +    struct {
> +        /* enable abstraction to support various iommu backends */
> +        union {
> +            MemoryListener listener; /* Used by type1 iommu */
> +        };

The usual way is to have a Type1VFIOContainer deriving from
VFIOContainer and adding a MemoryListener.
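
Something along these lines, as a self-contained sketch of the pattern
rather than real Qemu code.  SketchListener stands in for
MemoryListener, and every name below is hypothetical.  The base struct
carries only what is common; the type1 variant embeds it and recovers
itself from a base pointer.

```c
#include <stddef.h>

/* Stand-in for Qemu's MemoryListener (assumption, not the real type). */
typedef struct SketchListener {
    int priority;
} SketchListener;

/* Base container: only state shared by all iommu backends. */
typedef struct SketchContainer {
    int fd;
    void (*release)(struct SketchContainer *);
} SketchContainer;

/* Type1 backend: derives from the base by embedding it. */
typedef struct SketchType1Container {
    SketchContainer parent;
    SketchListener listener;
    int released;  /* set by release, for demonstration only */
} SketchType1Container;

/* Recover the enclosing struct from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static void type1_release(SketchContainer *c)
{
    SketchType1Container *t1 =
        sketch_container_of(c, SketchType1Container, parent);

    t1->released = 1;
}

static void type1_init(SketchType1Container *t1, int fd)
{
    t1->parent.fd = fd;
    t1->parent.release = type1_release;
    t1->listener.priority = 0;
    t1->released = 0;
}
```

Generic code then only ever sees a SketchContainer pointer and calls
release through it, without a union of backend data.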

> +        void (*release)(struct VFIOContainer *);
> +    } iommu_data;
> +    QLIST_HEAD(, VFIOGroup) group_list;
> +    QLIST_ENTRY(VFIOContainer) next;
> +} VFIOContainer;
> +

> +#endif /* __VFIO_H__ */
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 5a9d4e3..bd1a76c 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h

Separate patch when leaving RFC mode.

> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> new file mode 100644
> index 0000000..0a4f180
> --- /dev/null
> +++ b/linux-headers/linux/vfio.h
> @@ -0,0 +1,445 @@
> +/*
> + * VFIO API definition
> + *
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define VFIO_API_VERSION	0
> +
> +#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */

Use the exported file, that gets rid of the __KERNEL__ bits.
Avi Kivity July 26, 2012, 4:40 p.m. UTC | #10
On 07/26/2012 07:33 PM, Alex Williamson wrote:
>> 
>> In the common case, on x86 (but I'm repeating myself), the iommu group
>> includes just one device, yes?  Could we make pci-stub an alias for the
>> corresponding vfio steps?
> 
> PCI bridges masking devices is not as uncommon as you'd like, that's
> exactly why Andreas is using VFIO instead of KVM assignment.  

Well, we have been using it in production for quite a while with few such reports.

> Not to
> mention that VFIO takes a much more strict stance on multifunction ACS
> requirements, typically resulting in all function of a multifunction
> device being inseparable.  So no, I don't think multi-device groups will
> be unusual at all, even on x86.  Playing games with pci-stub sounds like
> a nightmare.  Personally I think we have the opportunity to make libvirt
> and tools like virt-manager a lot better with VFIO.  They no longer need
> to do PCI bridge testing or ACS checking for VFIO and they can better
> inform the user about what devices need to be removed from the host to
> provide safe assignment.
> 
> 
> If we have both vfio and kvm assignment in the same tree there's no
> reason we couldn't intermix them within a VM.  Unfortunately we have to
> beware of KVM assignment's poor ownership model, but that's true whether
> the device is attached to vfio-pci or some other driver.  Maybe we
> should prevent that, but I see that happening by deprecating KVM
> assignment and eventually disabling and removing it.  Thanks,

That's the plan.  By making the command lines compatible, we allow
upgrading the kernel and qemu, but keeping libvirt or another stack, and
more importantly their config files, unchanged.

Perhaps we could do it part-way by making pci-assign do the magic needed
to switch from pci-stub to an iommu group and forwarding the device to
vfio-pci.  But it would probably be root-only.
Alex Williamson July 26, 2012, 4:40 p.m. UTC | #11
On Thu, 2012-07-26 at 19:06 +0300, Avi Kivity wrote:
> On 07/26/2012 05:56 PM, Alex Williamson wrote:
> > 
> > Both KVM and VFIO do strive to make the device in the guest look as much
> > like it does on bare metal as possible, but we don't guarantee they're
> > identical and we don't guarantee to match each other.
> 
> btw, this is somewhat problematic, conceivably this could break a guest
> (due to a guest bug).  But with device assignment the compatibility
> requirements can be relaxed a bit since there is no live migration.

Well, I would hope that things work better in vfio and we work to make
that the recommended method of device assignment.  We can't hold one
back to make things identical.  The only barrier I see to this is that
vfio focuses on security, enforcing things like ACS to make sure devices
can't do DMA to other devices outside of the group whereas KVM
assignment will let you attempt to do nearly anything and counts on
libvirt to only let the user attempt to do sane things.  As you say,
there's no live migration with device assignment, so absolute identical
config space is not a requirement and the difference we do have should
be sufficiently subtle that the guest doesn't care boot-to-boot.
Thanks,

Alex
Avi Kivity July 26, 2012, 4:47 p.m. UTC | #12
On 07/26/2012 07:40 PM, Alex Williamson wrote:
> On Thu, 2012-07-26 at 19:06 +0300, Avi Kivity wrote:
>> On 07/26/2012 05:56 PM, Alex Williamson wrote:
>> > 
>> > Both KVM and VFIO do strive to make the device in the guest look as much
>> > like it does on bare metal as possible, but we don't guarantee they're
>> > identical and we don't guarantee to match each other.
>> 
>> btw, this is somewhat problematic, conceivably this could break a guest
>> (due to a guest bug).  But with device assignment the compatibility
>> requirements can be relaxed a bit since there is no live migration.
> 
> Well, I would hope that things work better in vfio and we work to make
> that the recommended method of device assignment.  We can't hold one
> back to make things identical.  The only barrier I see to this is that
> vfio focuses on security, enforcing things like ACS to make sure devices
> can't do DMA to other devices outside of the group whereas KVM
> assignment will let you attempt to do nearly anything and counts on
> libvirt to only let the user attempt to do sane things.  As you say,
> there's no live migration with device assignment, so absolute identical
> config space is not a requirement and the difference we do have should
> be sufficiently subtle that the guest doesn't care boot-to-boot.

We could add a strict backward compatibility option that forces the
layout, but it isn't worth it.
Alex Williamson July 26, 2012, 5:40 p.m. UTC | #13
On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> 
> > +/*
> > + * Resource setup
> > + */
> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> > +{
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    uint64_t size;
> > +
> > +    if (!memory_region_size(&bar->mem)) {
> > +        return;
> > +    }

This one is the "slow" mapped MemoryRegion.  If there's nothing here,
the BAR isn't populated.

> > +
> > +    size = memory_region_size(&bar->mmap_mem);
> > +    if (size) {
> > +         memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> > +         munmap(bar->mmap, size);
> > +    }

This is the direct mapped MemoryRegion that potentially overlays the
"slow" mapping above for MMIO BARs of sufficient alignment.  If the BAR
includes the MSI-X vector table, this covers the region in front of the
table.

> > +
> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> > +        if (size) {
> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> > +            munmap(vdev->msix->mmap, size);
> > +        }
> > +    }

And this one potentially unmaps the overlap after the vector table if
there's any space for one.

> Are the three size checks needed? Everything should work without them
> from the memory core point of view.

I haven't tried, but I strongly suspect I shouldn't be munmap'ing
NULL... no?

> > +
> > +    memory_region_destroy(&bar->mem);
> > +}
> > +
> > +static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
> > +                         void **map, size_t size, off_t offset,
> > +                         const char *name)
> > +{
> > +    *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
> > +                MAP_SHARED, bar->fd, bar->fd_offset + offset);
> > +    if (*map == MAP_FAILED) {
> > +        *map = NULL;
> > +        return -1;
> > +    }
> > +
> > +    memory_region_init_ram_ptr(submem, name, size, *map);
> > +    memory_region_add_subregion(mem, offset, submem);
> 
> 
> 
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> > +{
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    unsigned size = bar->size;
> > +    char name[64];
> > +
> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> > +
> > +    /* A "slow" read/write mapping underlies all BARs */
> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> 
> So far all container BARs have been pure containers, without RAM or I/O
> callbacks.  It should all work, but this sets precedent and requires it
> to work.  I guess there's no problem supporting it though.

KVM device assignment already makes use of this as well, if I understand
correctly.

> > +
> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> > +        return; /* IO space is only slow, don't expect high perf here */
> > +    }
> 
> What about non-x86 where IO is actually memory?  I think you can drop
> this and let the address space filtering in the listener drop it if it
> turns out to be in IO space.

They're probably saying "What's I/O port space?" ;)  Yeah, there may be
some room to do more here, but no need until we have something that can
make use of it.  Note that these are the BAR mappings, which turn into
MemoryRegions, so I'm not sure what the listener has to do with
filtering these just yet.

> > +
> > +    if (size & ~TARGET_PAGE_MASK) {
> > +        error_report("%s is too small to mmap, this may affect performance.\n",
> > +                     name);
> > +        return;
> > +    }
> 
> We can work a little harder and align the host space offset with the
> guest space offset, and map it in.

That's actually pretty involved, requiring shifting the device in the
host address space and potentially adjusting port and bridge apertures to
make room for the device.  Not to mention that it assumes accessing
dead space between device regions is no harm, no foul.  True on x86 now,
but wasn't true on HP ia64 chipsets and I suspect some other platforms.

> > +
> > +    /*
> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> > +     * potentially insert a direct-mapped subregion before and after it.
> > +     */
> 
> This splitting is what the memory core really enjoys.  You can just
> place the MSIX page over the RAM page and let it do the cut-n-paste.

Sure, but VFIO won't allow us to mmap over the MSI-X table for security
reasons.  It might be worthwhile to someday make VFIO insert an
anonymous page over the MSI-X table to allow this, but it didn't look
trivial for my novice mm abilities.  Easy to add a flag from the VFIO
kernel structure where we learn about this BAR if we add it in the
future.
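
To make the arithmetic of the split concrete, here is a minimal standalone sketch of how the two direct-mapped windows around the vector table could be computed.  The struct and function names are hypothetical; only the 16-byte MSI-X entry size comes from the PCI spec, and the page size is passed in rather than taken from TARGET_PAGE_MASK:

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u     /* bytes per MSI-X table entry */

/* Hypothetical description of one mmap-able window inside a BAR */
struct mmap_window {
    uint64_t start;             /* offset from the start of the BAR */
    uint64_t size;              /* 0 means "no window" */
};

/*
 * Compute the window below the MSI-X table (rounded down to a page
 * boundary) and the window above it (rounded up to a page boundary).
 */
static void split_around_msix(uint64_t bar_size, uint64_t page_size,
                              uint64_t table_offset, uint32_t entries,
                              struct mmap_window *lo,
                              struct mmap_window *hi)
{
    uint64_t mask, table_end, hi_start;

    /* page_size must be a non-zero power of two */
    assert(page_size && (page_size & (page_size - 1)) == 0);

    mask = ~(page_size - 1);
    table_end = table_offset + (uint64_t)entries * MSIX_ENTRY_SIZE;
    hi_start = (table_end + page_size - 1) & mask;

    lo->start = 0;
    lo->size = table_offset & mask;

    hi->start = hi_start;
    hi->size = hi_start < bar_size ? bar_size - hi_start : 0;
}
```

For a 16K BAR with 4K pages and an 8-entry table at offset 0x2000, this gives a low window of 0x2000 bytes and a high window of 0x1000 bytes starting at 0x3000; the page holding the table stays on the slow path.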

> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> > +    }
> > +         
> > +    if (size) {
> > +        strcat(name, " mmap");
> > +        if (vfio_mmap_bar(bar, &bar->mem, &bar->mmap_mem, &bar->mmap,
> > +                          size, 0, name)) {
> > +            error_report("%s Failed. Performance may be slow\n", name);
> > +        }
> > +    }
> > +
> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        unsigned start;
> > +
> > +        start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
> > +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
> > +
> > +        if (start < bar->size) {
> > +            size = bar->size - start;
> > +            strcat(name, " msix-hi");
> > +            /* MSIXInfo contains another MemoryRegion for this mapping */
> > +            if (vfio_mmap_bar(bar, &bar->mem, &vdev->msix->mmap_mem,
> > +                              &vdev->msix->mmap, size, start, name)) {
> > +                error_report("%s Failed. Performance may be slow\n", name);
> > +            }
> > +        }
> > +    }
> > +
> > +    return;
> > +}
> > +
> > +
> > +static int __vfio_get_device(VFIOGroup *group,
> > +                             const char *name, VFIODevice *vdev)
> 
> __foo is a reserved symbol.

sigh, ok

> > +{
> > +    int ret;
> > +
> > +    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> > +    if (ret < 0) {
> > +        error_report("vfio: error getting device %s from group %d: %s",
> > +                     name, group->groupid, strerror(errno));
> > +        error_report("Verify all devices in group %d "
> > +                     "are bound to vfio-pci or pci-stub and not already in use",
> > +                     group->groupid);
> > +        return -1;
> > +    }
> > +
> > +    vdev->group = group;
> > +    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
> > +
> > +    vdev->fd = ret;
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +static Property vfio_pci_dev_properties[] = {
> > +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> > +    //TODO - support passed fds... is this necessary?
> 
> Yes.

This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
an instance of a container in the kernel.  A group can only be attached
to one container.  So whoever calls us with passed fds needs to track
this very carefully.  This is also why I've dropped any kind of shared
IOMMU option to give us a hint whether to try to cram everything in the
same container (~= iommu domain).  It's too easy to pass conflicting
info to share a container for one device, but not another... yet they
may be in the same group.  I'll work on the fd passing though and try to
come up with a reasonable model.

> > +    //DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name),
> > +    //DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name),
> > +    DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +
> > +
> > +typedef struct MSIVector {
> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> > +    struct VFIODevice *vdev; /* back pointer to device */
> > +    int vector; /* the vector number for this element */
> > +    int virq; /* KVM irqchip route for Qemu bypass */
> 
> This calls for an abstraction (don't we have a cache where we look those
> up?)

I haven't seen one, pointer?  I tried to follow vhost's lead here.

> > +    bool use;
> > +} MSIVector;
> > +
> > +
> > +typedef struct VFIOContainer {
> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> > +    struct {
> > +        /* enable abstraction to support various iommu backends */
> > +        union {
> > +            MemoryListener listener; /* Used by type1 iommu */
> > +        };
> 
> The usual way is to have a Type1VFIOContainer deriving from
> VFIOContainer and adding a MemoryListener.

Yep, that would work too.  It gets a bit more complicated that way
though because we have to know when the container is allocated what type
it's going to be.  This way we can step through possible iommu types and
support the right one.  Eventually there may be more than one type
supported on the same platform (ex. one that enables PRI).  Do-able, but
I'm not sure it's worth it at this point.
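
For reference, the derived-type layout being suggested might look roughly like the sketch below.  All names here are illustrative stand-ins rather than the types in this patch, and the listener is reduced to a flag to keep the example self-contained.  It also shows the cost mentioned above: the concrete type has to be chosen at allocation time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative base type: what every iommu backend would share */
typedef struct Container {
    int fd;
    void (*release)(struct Container *);
} Container;

/* Illustrative derived type: base first, backend state after it */
typedef struct Type1Container {
    Container parent;
    int listener_registered;    /* stands in for a MemoryListener */
} Type1Container;

/* Recover the derived object from a pointer to its embedded base */
#define CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static int type1_released;      /* counts releases, for demonstration */

static void type1_release(Container *c)
{
    Type1Container *t1 = CONTAINER_OF(c, Type1Container, parent);

    t1->listener_registered = 0;    /* would unregister the listener */
    type1_released++;
    free(t1);
}

/* The backend type is fixed here, at allocation time */
static Container *type1_container_new(int fd)
{
    Type1Container *t1 = calloc(1, sizeof(*t1));

    assert(t1 != NULL);
    t1->parent.fd = fd;
    t1->parent.release = type1_release;
    t1->listener_registered = 1;
    return &t1->parent;
}
```

Generic code only ever sees a Container pointer and calls c->release(c); the backend gets back to its own state through CONTAINER_OF().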

> > +        void (*release)(struct VFIOContainer *);
> > +    } iommu_data;
> > +    QLIST_HEAD(, VFIOGroup) group_list;
> > +    QLIST_ENTRY(VFIOContainer) next;
> > +} VFIOContainer;
> > +
> 
> > +#endif /* __VFIO_H__ */
> > diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> > index 5a9d4e3..bd1a76c 100644
> > --- a/linux-headers/linux/kvm.h
> > +++ b/linux-headers/linux/kvm.h
> 
> Separate patch when leaving RFC mode.

Sure, this is still RFC though since the irqfd/eoifd changes are
pending.

> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > new file mode 100644
> > index 0000000..0a4f180
> > --- /dev/null
> > +++ b/linux-headers/linux/vfio.h
> > @@ -0,0 +1,445 @@
> > +/*
> > + * VFIO API definition
> > + *
> > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +#ifndef VFIO_H
> > +#define VFIO_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/ioctl.h>
> > +
> > +#define VFIO_API_VERSION	0
> > +
> > +#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
> 
> Use the exported file, that gets rid of the __KERNEL__ bits.

Oh?  How do I generate that aside from just deleting lines?  Thanks!

Alex
Alex Williamson July 26, 2012, 7:11 p.m. UTC | #14
On Thu, 2012-07-26 at 19:40 +0300, Avi Kivity wrote:
> On 07/26/2012 07:33 PM, Alex Williamson wrote:
> >> 
> >> In the common case, on x86 (but I'm repeating myself), the iommu group
> >> includes just one device, yes?  Could we make pci-stub an alias for the
> >> corresponding vfio steps?
> > 
> > PCI bridges masking devices is not as uncommon as you'd like, that's
> > exactly why Andreas is using VFIO instead of KVM assignment.  
> 
> Well, we are using it in production for quite a while with few such reports.

In the enterprise space, sure.  In the hobbyist/power user space, I
suspect users are too often finding that it doesn't work and moving on to
something else.  Maybe we'll know if it's working better if we get more
complaints about random oddball devices not working because people are
actually able to get far enough to try it.

> > Not to
> > mention that VFIO takes a much more strict stance on multifunction ACS
> > requirements, typically resulting in all function of a multifunction
> > device being inseparable.  So no, I don't think multi-device groups will
> > be unusual at all, even on x86.  Playing games with pci-stub sounds like
> > a nightmare.  Personally I think we have the opportunity to make libvirt
> > and tools like virt-manager a lot better with VFIO.  They no longer need
> > to do PCI bridge testing or ACS checking for VFIO and they can better
> > inform the user about what devices need to be removed from the host to
> > provide safe assignment.
> > 
> > 
> > If we have both vfio and kvm assignment in the same tree there's no
> > reason we couldn't intermix them within a VM.  Unfortunately we have to
> > beware of KVM assignment's poor ownership model, but that's true whether
> > the device is attached to vfio-pci or some other driver.  Maybe we
> > should prevent that, but I see that happening by deprecating KVM
> > assignment and eventually disabling and removing it.  Thanks,
> 
> That's the plan.  By making the command lines compatible, we allow
> upgrading the kernel and qemu, but keeping libvirt or another stack, and
> more importantly their config files, unchanged.

That's why I think libvirt should do it, there's no reason libvirt can't
detect that vfio is available and default to it.  Nothing in the xml
file needs to change.  We can add an option to specify what backend to
use for people that care.  If someone is launching qemu from a script
they hit the same setup issues we've been discussing for vfio-pci, so
there's not much to preserve there.
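
As a rough illustration of what a management tool has to do to find out which group a device lives in: with the VFIO kernel support, each PCI device gets an iommu_group symlink under /sys/bus/pci/devices/<addr>/ that points at /sys/kernel/iommu_groups/<N>.  A minimal sketch of pulling N out of the link target follows; parse_iommu_group() is a hypothetical helper, not part of libvirt or this patch:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: given the target of an iommu_group symlink,
 * e.g. "../../../kernel/iommu_groups/26", return the group number,
 * or -1 if the last path component is not a plain decimal number.
 */
static int parse_iommu_group(const char *link_target)
{
    const char *base;
    char *end;
    long id;

    assert(link_target != NULL);

    base = strrchr(link_target, '/');
    base = base ? base + 1 : link_target;
    if (*base == '\0') {
        return -1;              /* trailing slash, no component */
    }

    errno = 0;
    id = strtol(base, &end, 10);
    if (errno || *end != '\0' || id < 0 || id > INT_MAX) {
        return -1;
    }
    return (int)id;
}
```

The caller would feed this the result of readlink() on the device's iommu_group entry, then list the devices directory of that group to tell the user what else has to be unbound from host drivers.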

> Perhaps we could do it part-way by making pci-assign do the magic needed
> to switch from pci-stub to an iommu group and forwarding the device to
> vfio-pci.  But it would probably be root-only.

That would definitely be root only and perhaps would even only work in some
cases, assuming we decide we can only do it for single-device groups.
Then we get partial breakage and users left wondering why the VM with a
NIC assigned works, but the VM with a TV card fails.  As a user, I hate
those kinds of problems.  Thanks,

Alex
Blue Swirl July 27, 2012, 7:22 p.m. UTC | #15
On Wed, Jul 25, 2012 at 5:03 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> This adds PCI based device assignment to Qemu using the Linux VFIO
> userspace driver interface.  After setting up VFIO device access,
> devices can be added to Qemu guests using the vfio-pci device
> option:
>
>  -device vfio-pci,host=1:10.1,id=net0
>
> or for hotplug:
>
> (qemu) device_add vfio-pci,host=1:10.1,id=net0
> (qemu) device_del net0
>
> This patch adds support for assigning host physical PCI devices,
> with or without KVM[1] for x86 hosts and guests.  Support for
> POWER hosts and guests is working and expected to follow shortly.
> Other platforms wishing to make use of this need to do the following:
>  - Add a VFIO IOMMU interface to the host kernel driver or make use of
>    an existing one if possible (pre-req: linux host IOMMU support)
>  - Add corresponding mapping calls for your IOMMU in qemu, see
>    x86 and POWER for examples.
>
> And if you care about PCI legacy interrupts:
>  - Add support for EOI notification (TBD for everyone)
>
> While not requiring KVM support, VFIO based device assignment still
> supports acceleration through KVM.  MMIO regions with sufficient
> alignment are mapped directly into the guest address space and
> platforms supporting direct interrupt injection through eventfds can
> bypass Qemu userspace.  This support is included and automatically
> enabled when KVM and KVM irqchip is enabled.  These allow VFIO
> based assignment to meet the same performance levels as KVM based
> assignment in the qemu-kvm tree.
>
> Sending this as an RFC for review as we're waiting on VFIO to be
> accepted into the Linux kernel.  I'm hoping it will be accepted
> for Linux v3.6.  Pending Linux VFIO acceptance, I'd like to get
> this support in for 1.2 and work on generic Qemu EOI infrastructure
> in-tree.  This patch is based on current qemu.git merged with MST's
> latest pull request.  Thanks,
>
> Alex
>
> [1] The proposed level IRQFD/EOIFD KVM interface is currently
> required to support legacy PCI INTx interrupts.  Qemu support for
> this is included here.  Qemu infrastructure for EOI notification
> is not yet in place to do this without KVM.  Devices which rely only
> on MSI/MSIX work in unaccelerated Qemu.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
>  MAINTAINERS                |    5
>  configure                  |   12
>  hw/i386/Makefile.objs      |    1
>  hw/vfio_pci.c              | 2030 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio_pci.h              |  100 ++
>  linux-headers/linux/kvm.h  |   21
>  linux-headers/linux/vfio.h |  445 ++++++++++
>  7 files changed, 2614 insertions(+)
>  create mode 100644 hw/vfio_pci.c
>  create mode 100644 hw/vfio_pci.h
>  create mode 100644 linux-headers/linux/vfio.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 30ed56d..68406a3 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -460,6 +460,11 @@ M: Gerd Hoffmann <kraxel@redhat.com>
>  S: Maintained
>  F: hw/usb*
>
> +VFIO
> +M: Alex Williamson <alex.williamson@redhat.com>
> +S: Supported
> +F: hw/vfio*
> +
>  vhost
>  M: Michael S. Tsirkin <mst@redhat.com>
>  S: Supported
> diff --git a/configure b/configure
> index cef0a71..62d921e 100755
> --- a/configure
> +++ b/configure
> @@ -143,6 +143,7 @@ attr=""
>  libattr=""
>  xfs=""
>
> +vfio_pci="no"
>  vhost_net="no"
>  kvm="no"
>  gprof="no"
> @@ -489,6 +490,7 @@ Haiku)
>    usb="linux"
>    kvm="yes"
>    vhost_net="yes"
> +  vfio_pci="yes"
>    if [ "$cpu" = "i386" -o "$cpu" = "x86_64" ] ; then
>      audio_possible_drivers="$audio_possible_drivers fmod"
>    fi
> @@ -824,6 +826,10 @@ for opt do
>    ;;
>    --disable-guest-agent) guest_agent="no"
>    ;;
> +  --disable-vfio-pci) vfio_pci="no"
> +  ;;
> +  --enable-vfio-pci) vfio_pci="yes"
> +  ;;
>    *) echo "ERROR: unknown option $opt"; show_help="yes"
>    ;;
>    esac
> @@ -1110,6 +1116,8 @@ echo "  --disable-guest-agent    disable building of the QEMU Guest Agent"
>  echo "  --enable-guest-agent     enable building of the QEMU Guest Agent"
>  echo "  --with-coroutine=BACKEND coroutine backend. Supported options:"
>  echo "                           gthread, ucontext, sigaltstack, windows"
> +echo "  --disable-vfio-pci       disable vfio pci device assignment support"
> +echo "  --enable-vfio-pci        enable vfio pci device assignment support"
>  echo ""
>  echo "NOTE: The object files are built at the place where configure is launched"
>  exit 1
> @@ -3070,6 +3078,7 @@ echo "OpenGL support    $opengl"
>  echo "libiscsi support  $libiscsi"
>  echo "build guest agent $guest_agent"
>  echo "coroutine backend $coroutine_backend"
> +echo "VFIO PCI support  $vfio_pci"
>
>  if test "$sdl_too_old" = "yes"; then
>  echo "-> Your SDL version is too old - please upgrade to have SDL support"
> @@ -3747,6 +3756,9 @@ case "$target_arch2" in
>    *)
>      echo "CONFIG_NO_XEN=y" >> $config_target_mak
>  esac
> +if test "$vfio_pci" = "yes" -a "$target_softmmu" = "yes" ; then
> +  echo "CONFIG_VFIO_PCI=y" >> $config_target_mak
> +fi
>  case "$target_arch2" in
>    i386|x86_64|ppcemb|ppc|ppc64|s390x)
>      # Make sure the target and host cpus are compatible
> diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
> index 8c764bb..a2783ef 100644
> --- a/hw/i386/Makefile.objs
> +++ b/hw/i386/Makefile.objs
> @@ -11,5 +11,6 @@ obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o
>  obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o xen_pt_msi.o
>  obj-y += kvm/
>  obj-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
> +obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
>
>  obj-y := $(addprefix ../,$(obj-y))
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> new file mode 100644
> index 0000000..e9ae421
> --- /dev/null
> +++ b/hw/vfio_pci.c
> @@ -0,0 +1,2030 @@
> +/*
> + * vfio based device assignment support
> + *
> + * Copyright Red Hat, Inc. 2012
> + *
> + * Authors:
> + *  Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See

GPLv2only?

> + * the COPYING file in the top-level directory.
> + *
> + * Based on qemu-kvm device-assignment:
> + *  Adapted for KVM by Qumranet.
> + *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
> + *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
> + *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
> + *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
> + *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
> + */
> +
> +#include <dirent.h>
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <sys/io.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <linux/vfio.h>
> +
> +#include "config.h"
> +#include "event_notifier.h"
> +#include "exec-memory.h"
> +#include "hw.h"
> +#include "kvm.h"
> +#include "memory.h"
> +#include "monitor.h"
> +#include "msi.h"
> +#include "msix.h"
> +#include "notify.h"
> +#include "pc.h"
> +#include "qemu-error.h"
> +#include "qemu-timer.h"
> +#include "range.h"
> +#include "vfio_pci.h"
> +
> +//#define DEBUG_VFIO
> +#ifdef DEBUG_VFIO
> +#define DPRINTF(fmt, ...) \
> +    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> +    do { } while (0)
> +#endif
> +
> +#define MSIX_CAP_LENGTH 12
> +
> +static QLIST_HEAD(, VFIOContainer)
> +    container_list = QLIST_HEAD_INITIALIZER(container_list);
> +
> +static QLIST_HEAD(, VFIOGroup)
> +    group_list = QLIST_HEAD_INITIALIZER(group_list);
> +
> +static void vfio_disable_interrupts(VFIODevice *vdev);
> +static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> +
> +/*
> + * Common VFIO interrupt disable
> + */
> +static void vfio_disable_irqindex(VFIODevice *vdev, int index)
> +{
> +    struct vfio_irq_set irq_set = {
> +        .argsz = sizeof(irq_set),
> +        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
> +        .index = index,
> +        .start = 0,
> +        .count = 0,
> +    };
> +
> +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +
> +    vdev->interrupt = INT_NONE;
> +}
> +
> +/*
> + * INTx
> + */
> +static inline void vfio_unmask_intx(VFIODevice *vdev)

'inline' may be premature optimization.

> +{
> +    struct vfio_irq_set irq_set = {
> +        .argsz = sizeof(irq_set),
> +        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
> +        .index = VFIO_PCI_INTX_IRQ_INDEX,
> +        .start = 0,
> +        .count = 1,
> +    };
> +
> +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +}
> +
> +static inline void vfio_mask_intx(VFIODevice *vdev)
> +{
> +    struct vfio_irq_set irq_set = {
> +        .argsz = sizeof(irq_set),
> +        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
> +        .index = VFIO_PCI_INTX_IRQ_INDEX,
> +        .start = 0,
> +        .count = 1,
> +    };
> +
> +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> +}
> +
> +static void vfio_intx_interrupt(void *opaque)
> +{
> +    VFIODevice *vdev = opaque;
> +
> +    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
> +        return;
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) Pin %c\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function,
> +            'A' + vdev->intx.pin);
> +
> +    vdev->intx.pending = true;
> +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 1);
> +}
> +
> +static void vfio_eoi(VFIODevice *vdev)
> +{
> +    if (!vdev->intx.pending) {
> +        return;
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function);
> +
> +    vdev->intx.pending = false;
> +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> +    vfio_unmask_intx(vdev);
> +}
> +
> +struct vfio_irq_set_fd {
> +    struct vfio_irq_set irq_set;
> +    int32_t fd;
> +} QEMU_PACKED;

Why is this structure not defined in kernel headers?
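
One alternative to a private packed wrapper, assuming the kernel struct ends in a variable-length data area (as it does in the VFIO uAPI header that was eventually merged), is to size the ioctl argument at run time and copy the fd into the trailing bytes.  The sketch below uses a local stand-in struct with the same five 32-bit header fields so it builds without linux/vfio.h; irq_set_hdr and irq_set_with_fd() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in mirroring the header fields of struct vfio_irq_set */
struct irq_set_hdr {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t start;
    uint32_t count;
    uint8_t  data[];            /* payload, here a single eventfd */
};

/*
 * Hypothetical helper: allocate a header plus room for one int32_t fd
 * and fill it in.  The caller frees the result with free().
 */
static struct irq_set_hdr *irq_set_with_fd(uint32_t flags, uint32_t index,
                                           int32_t fd)
{
    size_t argsz = sizeof(struct irq_set_hdr) + sizeof(fd);
    struct irq_set_hdr *set = calloc(1, argsz);

    assert(set != NULL);
    set->argsz = (uint32_t)argsz;   /* header plus payload */
    set->flags = flags;
    set->index = index;
    set->start = 0;
    set->count = 1;
    memcpy(set->data, &fd, sizeof(fd));
    return set;
}
```

With the real header the same pattern would be applied to struct vfio_irq_set directly and the result handed to the VFIO_DEVICE_SET_IRQS ioctl, with no extra struct on the Qemu side.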

> +
> +static void vfio_enable_intx_kvm(VFIODevice *vdev)
> +{
> +#ifdef CONFIG_KVM

These shouldn't be needed. The device will not be useful without KVM,
so the file shouldn't be compiled for non-KVM case at all.

> +    struct vfio_irq_set_fd irq_set_fd = {
> +       .irq_set = {
> +            .argsz = sizeof(irq_set_fd),
> +            .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK,
> +            .index = VFIO_PCI_INTX_IRQ_INDEX,
> +            .start = 0,
> +            .count = 1,
> +        },
> +    };
> +    struct kvm_irqfd irqfd = {
> +        .gsi = vdev->intx.route.irq,
> +        .flags = KVM_IRQFD_FLAG_LEVEL,
> +    };
> +    struct kvm_eoifd eoifd = {
> +        .flags = KVM_EOIFD_FLAG_LEVEL_IRQFD,
> +    };
> +    int key;
> +
> +    if (vdev->intx.kvm_accel || !kvm_irqchip_in_kernel() ||
> +        vdev->intx.route.mode == PCI_INTX_DISABLED ||
> +        !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_LEVEL) ||
> +        !kvm_check_extension(kvm_state, KVM_CAP_EOIFD_LEVEL_IRQFD)) {
> +        return;
> +    }
> +
> +    /*
> +     * We've already got an eventfd for interrupt signals from VFIO
> +     * into Qemu.  Plumb it into an IRQFD.
> +     */
> +    irqfd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
> +
> +    /*
> +     * Get to a known state, not listening for interrupts, hardware
> +     * masked, Qemu IRQ de-asserted.
> +     */
> +    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
> +    /* TBD - Disable qemu eoi notifier */
> +    vfio_mask_intx(vdev);
> +    vdev->intx.pending = false;
> +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> +
> +    /*
> +     * Create a new eventfd to connect unmask signals from KVM EOIFD
> +     * directly into VFIO.
> +     */
> +    if (event_notifier_init(&vdev->intx.unmask, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed eoi\n");
> +        goto fail;
> +    }
> +
> +    /* Tell both KVM EOIFD and VFIO about this eventfd */
> +    eoifd.fd = irq_set_fd.fd = event_notifier_get_fd(&vdev->intx.unmask);
> +
> +    /* IRQFD first sets up the level interrupt */
> +    key = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> +    if (key < 0) {
> +        error_report("vfio: Error: Failed to setup INTx irqfd: %s\n",
> +                     strerror(errno));
> +        goto fail;
> +    }
> +
> +    /* Giving us a key that lets us configure the EOIFD */
> +    eoifd.key = key;
> +    if (kvm_vm_ioctl(kvm_state, KVM_EOIFD, &eoifd)) {
> +        error_report("vfio: Error: Failed to setup INTx EOI: %s\n",
> +                     strerror(errno));
> +        goto fail_eoifd;
> +    }
> +
> +    /* Finally configure the irqfd-like vfio mechanism for unmasks */
> +    if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd)) {
> +        error_report("vfio: Error: Failed to setup INTx unmask fd: %s\n",
> +                     strerror(errno));
> +        goto fail_vfio;
> +    }
> +
> +    /* Let'em rip */
> +    vfio_unmask_intx(vdev);
> +
> +    vdev->intx.kvm_accel = true;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel enabled\n",
> +            __FUNCTION__, vdev->host.domain, vdev->host.bus,
> +            vdev->host.slot, vdev->host.function);
> +
> +    return;
> +
> +fail_vfio:
> +    eoifd.flags |= KVM_EOIFD_FLAG_DEASSIGN;
> +    kvm_vm_ioctl(kvm_state, KVM_EOIFD, &eoifd);
> +fail_eoifd:
> +    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
> +    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> +fail:
> +    /* TBD - Enable qemu eoi notifier */
> +    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +    vfio_unmask_intx(vdev);
> +#endif
> +}
> +
> +static void vfio_disable_intx_kvm(VFIODevice *vdev)
> +{
> +#ifdef CONFIG_KVM
> +    struct kvm_irqfd irqfd = {
> +        .gsi = vdev->intx.route.irq,
> +        .flags = KVM_IRQFD_FLAG_DEASSIGN,
> +    };
> +
> +    if (!vdev->intx.kvm_accel) {
> +        return;
> +    }
> +
> +    /*
> +     * Get to a known state, hardware masked, Qemu ready to accept new
> +     * interrupts, Qemu IRQ de-asserted.
> +     */
> +    vfio_mask_intx(vdev);
> +    vdev->intx.pending = false;
> +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> +
> +    /*
> +     * Both ends of the unmask eventfd watch for POLLHUP, so this kills
> +     * the eoifd and the vfio unmask handler in one shot.
> +     */
> +    event_notifier_cleanup(&vdev->intx.unmask);
> +
> +    /*
> +     * Tell the kernel to stop listening for interrupt events.
> +     */
> +    irqfd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
> +    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> +        error_report("vfio: Error: Failed to disable INTx irqfd: %s\n",
> +                     strerror(errno));
> +    }
> +
> +    /*
> +     * Qemu starts listening for interrupt events.
> +     */
> +    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +
> +    /* TBD - Enable qemu eoi notifier */
> +
> +    vdev->intx.kvm_accel = false;
> +
> +    /*
> +     * If we've missed an event, let it re-fire through qemu.
> +     */
> +    vfio_unmask_intx(vdev);
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n",
> +            __FUNCTION__, vdev->host.domain, vdev->host.bus,
> +            vdev->host.slot, vdev->host.function);
> +#endif
> +}
> +
> +static void vfio_update_irq(PCIDevice *pdev)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    PCIINTxRoute route;
> +
> +    if (vdev->interrupt != INT_INTx) {
> +        return;
> +    }
> +
> +    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
> +    if (!memcmp(&route, &vdev->intx.route, sizeof(route))) {
> +        return; /* Nothing changed */
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) IRQ moved %d -> %d\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, vdev->intx.route.irq, route.irq);
> +
> +    vfio_disable_intx_kvm(vdev);
> +    /* TBD - Disable qemu eoi notifier */
> +
> +    vdev->intx.route = route;
> +
> +    if (route.mode == PCI_INTX_DISABLED) {
> +        return;
> +    }
> +
> +    /* TBD - Enable qemu eoi notifier */
> +    vfio_enable_intx_kvm(vdev);
> +
> +    /* Re-enable the interrupt in cased we missed an EOI */
> +    vfio_eoi(vdev);
> +}
> +
> +static int vfio_enable_intx(VFIODevice *vdev)
> +{
> +    struct vfio_irq_set_fd irq_set_fd = {
> +       .irq_set = {
> +            .argsz = sizeof(irq_set_fd),
> +            .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
> +            .index = VFIO_PCI_INTX_IRQ_INDEX,
> +            .start = 0,
> +            .count = 1,
> +        },
> +    };
> +    uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
> +
> +    if (!pin) {
> +        return 0;
> +    }
> +
> +    vfio_disable_interrupts(vdev);
> +
> +    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
> +    vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
> +                                                    vdev->intx.pin);
> +    /* TBD - Enable qemu eoi notifier */
> +
> +    if (event_notifier_init(&vdev->intx.interrupt, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed\n");
> +        return -1;
> +    }
> +
> +    irq_set_fd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
> +    qemu_set_fd_handler(irq_set_fd.fd, vfio_intx_interrupt, NULL, vdev);
> +
> +    if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd)) {
> +        error_report("vfio: Error: Failed to setup INTx fd: %s\n",
> +                     strerror(errno));
> +        return -1;
> +    }
> +
> +    vfio_enable_intx_kvm(vdev);
> +
> +    vdev->interrupt = INT_INTx;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function);
> +
> +    return 0;
> +}
> +
> +static void vfio_disable_intx(VFIODevice *vdev)
> +{
> +    int fd;
> +
> +    vfio_disable_intx_kvm(vdev);
> +    vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
> +
> +    /* TBD - Disable qemu eoi notifier */
> +
> +    fd = event_notifier_get_fd(&vdev->intx.interrupt);
> +    qemu_set_fd_handler(fd, NULL, NULL, vdev);
> +    event_notifier_cleanup(&vdev->intx.interrupt);
> +
> +    vdev->interrupt = INT_NONE;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function);
> +}
> +
> +/*
> + * MSI/X
> + */
> +static void vfio_msi_interrupt(void *opaque)
> +{
> +    MSIVector *vec = opaque;
> +    VFIODevice *vdev = vec->vdev;
> +
> +    if (!event_notifier_test_and_clear(&vec->interrupt)) {
> +        return;
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, vec->vector);
> +
> +    if (vdev->interrupt == INT_MSIX) {
> +        msix_notify(&vdev->pdev, vec->vector);
> +    } else if (vdev->interrupt == INT_MSI) {
> +        msi_notify(&vdev->pdev, vec->vector);
> +    } else {
> +        error_report("vfio: MSI interrupt receieved, but not enabled?\n");
> +    }
> +}
> +
> +static int vfio_enable_vectors(VFIODevice *vdev, bool msix)
> +{
> +    struct vfio_irq_set *irq_set;
> +    int ret = 0, i, argsz;
> +    int32_t *fds;
> +
> +    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
> +
> +    irq_set = g_malloc0(argsz);
> +    irq_set->argsz = argsz;
> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
> +    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
> +    irq_set->start = 0;
> +    irq_set->count = vdev->nr_vectors;
> +    fds = (int32_t *)&irq_set->data;
> +
> +    for (i = 0; i < vdev->nr_vectors; i++) {
> +        if (!vdev->msi_vectors[i].use) {
> +            fds[i] = -1;
> +            continue;
> +        }
> +
> +        fds[i] = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> +    }
> +
> +    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
> +
> +    g_free(irq_set);
> +
> +    if (!ret) {
> +        vdev->interrupt = msix ? INT_MSIX : INT_MSI;
> +    }
> +
> +    return ret;
> +}
> +
> +static int vfio_msix_vector_use(PCIDevice *pdev,
> +                                unsigned int vector, MSIMessage msg)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    int ret, fd;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d used\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, vector);
> +
> +    if (vdev->interrupt != INT_MSIX) {
> +        vfio_disable_interrupts(vdev);
> +    }
> +
> +    if (!vdev->msi_vectors) {
> +        vdev->msi_vectors = g_malloc0(vdev->msix->entries * sizeof(MSIVector));
> +    }
> +
> +    vdev->msi_vectors[vector].vdev = vdev;
> +    vdev->msi_vectors[vector].vector = vector;
> +    vdev->msi_vectors[vector].use = true;
> +
> +    msix_vector_use(pdev, vector);
> +
> +    if (event_notifier_init(&vdev->msi_vectors[vector].interrupt, 0)) {
> +        error_report("vfio: Error: event_notifier_init failed\n");
> +    }
> +
> +    fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
> +
> +    /*
> +     * Attempt to enable route through KVM irqchip,
> +     * default to userspace handling if unavailable.
> +     */
> +    vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
> +    if (vdev->msi_vectors[vector].virq < 0 ||
> +        kvm_irqchip_add_irqfd(kvm_state, fd,
> +                              vdev->msi_vectors[vector].virq) < 0) {
> +        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> +                            &vdev->msi_vectors[vector]);
> +    }
> +
> +    /*
> +     * We don't want to have the host allocate all possible MSI vectors
> +     * for a device if they're not in use, so we shutdown and incrementally
> +     * increase them as needed.
> +     */
> +    if (vdev->nr_vectors < vector + 1) {
> +        int i;
> +
> +        vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX);
> +        vdev->nr_vectors = vector + 1;
> +        ret = vfio_enable_vectors(vdev, true);
> +        if (ret) {
> +            error_report("vfio: failed to enable vectors, %d\n", ret);
> +        }
> +
> +        /* We don't know if we've missed interrupts in the interim... */
> +        for (i = 0; i < vdev->msix->entries; i++) {
> +            if (vdev->msi_vectors[i].use) {
> +                msix_notify(&vdev->pdev, i);
> +            }
> +        }
> +    } else {
> +        struct vfio_irq_set_fd irq_set_fd = {
> +            .irq_set = {
> +                .argsz = sizeof(irq_set_fd),
> +                .flags = VFIO_IRQ_SET_DATA_EVENTFD |
> +                         VFIO_IRQ_SET_ACTION_TRIGGER,
> +                .index = VFIO_PCI_MSIX_IRQ_INDEX,
> +                .start = vector,
> +                .count = 1,
> +            },
> +            .fd = fd,
> +        };
> +        ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd);
> +        if (ret) {
> +            error_report("vfio: failed to modify vector, %d\n", ret);
> +        }
> +        msix_notify(&vdev->pdev, vector);
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int vector)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    struct vfio_irq_set_fd irq_set_fd = {
> +        .irq_set = {
> +            .argsz = sizeof(irq_set_fd),
> +            .flags = VFIO_IRQ_SET_DATA_EVENTFD |
> +                     VFIO_IRQ_SET_ACTION_TRIGGER,
> +            .index = VFIO_PCI_MSIX_IRQ_INDEX,
> +            .start = vector,
> +            .count = 1,
> +        },
> +        .fd = -1,
> +    };
> +    int fd;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d released\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, vector);
> +
> +    /*
> +     * XXX What's the right thing to do here?  This turns off the interrupt
> +     * completely, but do we really just want to switch the interrupt to
> +     * bouncing through userspace and let msix.c drop it?  Not sure.
> +     */
> +    msix_vector_unuse(pdev, vector);
> +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd);
> +
> +    fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
> +
> +    if (vdev->msi_vectors[vector].virq < 0) {
> +        qemu_set_fd_handler(fd, NULL, NULL, NULL);
> +    } else {
> +        kvm_irqchip_remove_irqfd(kvm_state, fd, vdev->msi_vectors[vector].virq);
> +        kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[vector].virq);
> +        vdev->msi_vectors[vector].virq = -1;
> +    }
> +
> +    event_notifier_cleanup(&vdev->msi_vectors[vector].interrupt);
> +    vdev->msi_vectors[vector].use = false;
> +}
> +
> +/* XXX This should move to msi.c */
> +static MSIMessage msi_get_msg(PCIDevice *pdev, unsigned int vector)
> +{
> +    uint16_t flags = pci_get_word(pdev->config + pdev->msi_cap + PCI_MSI_FLAGS);
> +    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
> +    MSIMessage msg;
> +
> +    if (msi64bit) {
> +        msg.address = pci_get_quad(pdev->config +
> +                                   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
> +    } else {
> +        msg.address = pci_get_long(pdev->config +
> +                                   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
> +    }
> +
> +    msg.data = pci_get_word(pdev->config + pdev->msi_cap +
> +                            (msi64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32));
> +    msg.data += vector;
> +
> +    return msg;
> +}
> +
> +/* So should this */
> +static void msi_set_qsize(PCIDevice *pdev, uint8_t size)
> +{
> +    uint8_t *config = pdev->config + pdev->msi_cap;
> +    uint16_t flags;
> +
> +    flags = pci_get_word(config + PCI_MSI_FLAGS);
> +    flags = le16_to_cpu(flags);
> +    flags &= ~PCI_MSI_FLAGS_QSIZE;
> +    flags |= (size & 0x7) << 4;
> +    flags = cpu_to_le16(flags);
> +    pci_set_word(config + PCI_MSI_FLAGS, flags);
> +}
> +
> +static void vfio_enable_msi(VFIODevice *vdev)
> +{
> +    int ret, i;
> +
> +    vfio_disable_interrupts(vdev);
> +
> +    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
> +retry:
> +    vdev->msi_vectors = g_malloc0(vdev->nr_vectors * sizeof(MSIVector));
> +
> +    for (i = 0; i < vdev->nr_vectors; i++) {
> +        MSIMessage msg;
> +        int fd;
> +
> +        vdev->msi_vectors[i].vdev = vdev;
> +        vdev->msi_vectors[i].vector = i;
> +        vdev->msi_vectors[i].use = true;
> +
> +        if (event_notifier_init(&vdev->msi_vectors[i].interrupt, 0)) {
> +            error_report("vfio: Error: event_notifier_init failed\n");
> +        }
> +
> +        fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> +
> +        msg = msi_get_msg(&vdev->pdev, i);
> +
> +        /*
> +         * Attempt to enable route through KVM irqchip,
> +         * default to userspace handling if unavailable.
> +         */
> +        vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
> +        if (vdev->msi_vectors[i].virq < 0 ||
> +            kvm_irqchip_add_irqfd(kvm_state, fd,
> +                                  vdev->msi_vectors[i].virq) < 0) {
> +            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
> +                                &vdev->msi_vectors[i]);
> +        }
> +    }
> +
> +    ret = vfio_enable_vectors(vdev, false);
> +    if (ret) {
> +        if (ret < 0) {
> +            error_report("vfio: Error: Failed to setup MSI fds: %s\n",
> +                         strerror(errno));
> +        } else if (ret != vdev->nr_vectors) {
> +            error_report("vfio: Error: Failed to enable %d "
> +                         "MSI vectors, retry with %d\n", vdev->nr_vectors, ret);
> +        }
> +
> +        for (i = 0; i < vdev->nr_vectors; i++) {
> +            int fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> +            if (vdev->msi_vectors[i].virq >= 0) {
> +                kvm_irqchip_remove_irqfd(kvm_state, fd,
> +                                         vdev->msi_vectors[i].virq);
> +                kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[i].virq);
> +                vdev->msi_vectors[i].virq = -1;
> +            } else {
> +                qemu_set_fd_handler(fd, NULL, NULL, NULL);
> +            }
> +            event_notifier_cleanup(&vdev->msi_vectors[i].interrupt);
> +        }
> +
> +        g_free(vdev->msi_vectors);
> +
> +        if (ret > 0 && ret != vdev->nr_vectors) {
> +            vdev->nr_vectors = ret;
> +            goto retry;
> +        }
> +        vdev->nr_vectors = 0;
> +
> +        return;
> +    }
> +
> +    msi_set_qsize(&vdev->pdev, vdev->nr_vectors);
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) Enabled %d MSI vectors\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, vdev->nr_vectors);
> +}
> +
> +static void vfio_disable_msi_x(VFIODevice *vdev, bool msix)
> +{
> +    int i;
> +
> +    vfio_disable_irqindex(vdev, msix ? VFIO_PCI_MSIX_IRQ_INDEX :
> +                                       VFIO_PCI_MSI_IRQ_INDEX);
> +
> +    for (i = 0; i < vdev->nr_vectors; i++) {
> +        int fd;
> +
> +        if (!vdev->msi_vectors[i].use) {
> +            continue;
> +        }
> +
> +        fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
> +
> +        if (vdev->msi_vectors[i].virq >= 0) {
> +            kvm_irqchip_remove_irqfd(kvm_state, fd, vdev->msi_vectors[i].virq);
> +            kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[i].virq);
> +            vdev->msi_vectors[i].virq = -1;
> +        } else {
> +            qemu_set_fd_handler(fd, NULL, NULL, NULL);
> +        }
> +
> +        if (msix) {
> +            msix_vector_unuse(&vdev->pdev, i);
> +        }
> +
> +        event_notifier_cleanup(&vdev->msi_vectors[i].interrupt);
> +    }
> +
> +    g_free(vdev->msi_vectors);
> +    vdev->msi_vectors = NULL;
> +    vdev->nr_vectors = 0;
> +
> +    if (!msix) {
> +        msi_set_qsize(&vdev->pdev, 0); /* Actually still means 1 vector */
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x, msi%s)\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, msix ? "x" : "");
> +
> +    vfio_enable_intx(vdev);
> +}
> +
> +/*
> + * IO Port/MMIO - Beware of the endians, VFIO is always little endian
> + */
> +static void vfio_bar_write(void *opaque, target_phys_addr_t addr,
> +                           uint64_t data, unsigned size)
> +{
> +    VFIOBAR *bar = opaque;
> +    uint8_t buf[8];
> +
> +    switch (size) {
> +    case 1:
> +        *buf = data & 0xff;
> +        break;
> +    case 2:
> +        *(uint16_t *)buf = cpu_to_le16(data);
> +        break;
> +    case 4:
> +        *(uint32_t *)buf = cpu_to_le32(data);
> +        break;
> +    default:
> +        hw_error("vfio: unsupported write size, %d bytes\n", size);
> +    }
> +
> +    if (pwrite(bar->fd, buf, size, bar->fd_offset + addr) != size) {
> +        error_report("%s(,0x%"PRIx64", 0x%"PRIx64", %d) failed: %s\n",
> +                     __FUNCTION__, addr, data, size, strerror(errno));
> +    }
> +
> +    DPRINTF("%s(BAR%d+0x%"PRIx64", 0x%"PRIx64", %d)\n",
> +            __FUNCTION__, bar->nr, addr, data, size);
> +}
> +
> +static uint64_t vfio_bar_read(void *opaque,
> +                              target_phys_addr_t addr, unsigned size)
> +{
> +    VFIOBAR *bar = opaque;
> +    uint8_t buf[8];
> +    uint64_t data = 0;
> +
> +    if (pread(bar->fd, buf, size, bar->fd_offset + addr) != size) {
> +        error_report("%s(,0x%"PRIx64", %d) failed: %s\n",
> +                     __FUNCTION__, addr, size, strerror(errno));
> +        return (uint64_t)-1;
> +    }
> +
> +    switch (size) {
> +    case 1:
> +        data = buf[0];
> +        break;
> +    case 2:
> +        data = le16_to_cpu(*(uint16_t *)buf);
> +        break;
> +    case 4:
> +        data = le32_to_cpu(*(uint32_t *)buf);
> +        break;
> +    default:
> +        hw_error("vfio: unsupported read size, %d bytes\n", size);
> +    }
> +
> +    DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64"\n",
> +            __FUNCTION__, bar->nr, addr, size, data);
> +
> +    return data;
> +}
> +
> +static const MemoryRegionOps vfio_bar_ops = {
> +    .read = vfio_bar_read,
> +    .write = vfio_bar_write,
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +};
> +
> +/*
> + * PCI config space
> + */
> +static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    uint32_t val = 0;
> +
> +    /*
> +     * We only need Qemu PCI config support for the ROM BAR, the MSI and MSIX
> +     * capabilities, and the multifunction bit below.  We let VFIO handle
> +     * virtualizing everything else.  Performance is not a concern here.
> +     */
> +    if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) ||
> +        (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
> +         ranges_overlap(addr, len, pdev->msix_cap, MSIX_CAP_LENGTH)) ||
> +        (pdev->cap_present & QEMU_PCI_CAP_MSI &&
> +         ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size))) {
> +
> +        val = pci_default_read_config(pdev, addr, len);
> +    } else {
> +        if (pread(vdev->fd, &val, len, vdev->config_offset + addr) != len) {
> +            error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
> +                         __FUNCTION__, vdev->host.domain, vdev->host.bus,
> +                         vdev->host.slot, vdev->host.function, addr, len,
> +                         strerror(errno));
> +            return -1;
> +        }
> +        val = le32_to_cpu(val);
> +    }
> +
> +    /* Multifunction bit is virualized in qemu */
> +    if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) {
> +        uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION;
> +
> +        if (len == 4) {
> +            mask <<= 16;
> +        }
> +
> +        if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
> +            val |= mask;
> +        } else {
> +            val &= ~mask;
> +        }
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, addr, len, val);
> +
> +    return val;
> +}
> +
> +static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
> +                                  uint32_t val, int len)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    uint32_t val_le = cpu_to_le32(val);
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __FUNCTION__,
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, addr, val, len);
> +
> +    /* Write everything to VFIO, let it filter out what we can't write */
> +    if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) {
> +        error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %s\n",
> +                     __FUNCTION__, vdev->host.domain, vdev->host.bus,
> +                     vdev->host.slot, vdev->host.function, addr, val, len,
> +                     strerror(errno));
> +    }
> +
> +    /* Write standard header bits to emulation */
> +    if (addr < PCI_CONFIG_HEADER_SIZE) {
> +        pci_default_write_config(pdev, addr, val, len);
> +        return;
> +    }
> +
> +    /* MSI/MSI-X Enabling/Disabling */
> +    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
> +        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
> +        int is_enabled, was_enabled = msi_enabled(pdev);
> +
> +        pci_default_write_config(pdev, addr, val, len);
> +
> +        is_enabled = msi_enabled(pdev);
> +
> +        if (!was_enabled && is_enabled) {
> +            vfio_enable_msi(vdev);
> +        } else if (was_enabled && !is_enabled) {
> +            vfio_disable_msi_x(vdev, false);
> +        }
> +    }
> +
> +    if (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
> +        ranges_overlap(addr, len, pdev->msix_cap, MSIX_CAP_LENGTH)) {
> +        int is_enabled, was_enabled = msix_enabled(pdev);
> +
> +        pci_default_write_config(pdev, addr, val, len);
> +
> +        is_enabled = msix_enabled(pdev);
> +
> +        if (!was_enabled && is_enabled) {
> +            /* vfio_msix_vector_use handles this automatically */
> +        } else if (was_enabled && !is_enabled) {
> +            vfio_disable_msi_x(vdev, true);
> +        }
> +    }
> +}
> +
> +/*
> + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
> + */
> +static int vfio_dma_map(VFIOContainer *container, target_phys_addr_t iova,
> +                        ram_addr_t size, void* vaddr, bool readonly)
> +{
> +    struct vfio_iommu_type1_dma_map map = {
> +        .argsz = sizeof(map),
> +        .flags = VFIO_DMA_MAP_FLAG_READ,
> +        .vaddr = (__u64)vaddr,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    if (!readonly) {
> +        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
> +    }
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
> +        DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_dma_unmap(VFIOContainer *container,
> +                          target_phys_addr_t iova, ram_addr_t size)
> +{
> +    struct vfio_iommu_type1_dma_unmap unmap = {
> +        .argsz = sizeof(unmap),
> +        .flags = 0,
> +        .iova = iova,
> +        .size = size,
> +    };
> +
> +    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
> +        return -errno;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_listener_dummy1(MemoryListener *listener)
> +{
> +    /* We don't do batching (begin/commit) or care about logging */
> +}
> +
> +static void vfio_listener_dummy2(MemoryListener *listener,
> +                                 MemoryRegionSection *section)
> +{
> +    /* We don't do logging or care about nops */
> +}
> +
> +static void vfio_listener_dummy3(MemoryListener *listener,
> +                                 MemoryRegionSection *section,
> +                                 bool match_data, uint64_t data,
> +                                 EventNotifier *e)
> +{
> +    /* We don't care about eventfds */
> +}
> +
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    return (section->address_space != get_system_memory() ||
> +            !memory_region_is_ram(section->mr));
> +}
> +
> +static void vfio_listener_region_add(MemoryListener *listener,
> +                                     MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.listener);
> +    target_phys_addr_t iova, end;
> +    void *vaddr;
> +    int ret;
> +
> +    if (vfio_listener_skipped_section(section)) {
> +        DPRINTF("vfio: SKIPPING region_add %016lx - %016lx\n",
> +                section->offset_within_address_space,
> +                section->offset_within_address_space + section->size - 1);
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> +                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +        error_report("%s received unaligned region\n", __FUNCTION__);
> +        return;
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    end = (section->offset_within_address_space + section->size) &
> +          TARGET_PAGE_MASK;
> +
> +    if (iova >= end) {
> +        return;
> +    }
> +
> +    vaddr = memory_region_get_ram_ptr(section->mr) +
> +            section->offset_within_region +
> +            (iova - section->offset_within_address_space);
> +
> +    DPRINTF("vfio: region_add %016lx - %016lx [%p]\n",
> +            iova, end - 1, vaddr);
> +
> +    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> +    if (ret) {
> +        error_report("vfio_dma_map(%p, 0x%016lx, 0x%lx, %p) = %d (%s)\n",
> +                     container, iova, end - iova, vaddr, ret, strerror(errno));
> +    }
> +}
> +
> +static void vfio_listener_region_del(MemoryListener *listener,
> +                                     MemoryRegionSection *section)
> +{
> +    VFIOContainer *container = container_of(listener, VFIOContainer,
> +                                            iommu_data.listener);
> +    target_phys_addr_t iova, end;
> +    int ret;
> +
> +    if (vfio_listener_skipped_section(section)) {
> +        DPRINTF("vfio: SKIPPING region_del %016lx - %016lx\n",
> +                section->offset_within_address_space,
> +                section->offset_within_address_space + section->size - 1);
> +        return;
> +    }
> +
> +    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> +                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +        error_report("%s received unaligned region\n", __FUNCTION__);
> +        return;
> +    }
> +
> +    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    end = (section->offset_within_address_space + section->size) &
> +          TARGET_PAGE_MASK;
> +
> +    if (iova >= end) {
> +        return;
> +    }
> +
> +    DPRINTF("vfio: region_del %016lx - %016lx\n", iova, end - 1);
> +
> +    ret = vfio_dma_unmap(container, iova, end - iova);
> +    if (ret) {
> +        error_report("vfio_dma_unmap(%p, 0x%016lx, 0x%lx) = %d (%s)\n",
> +                     container, iova, end - iova, ret, strerror(errno));
> +    }
> +}
> +
> +static void vfio_listener_release(VFIOContainer *container)
> +{
> +    memory_listener_unregister(&container->iommu_data.listener);
> +}
> +
> +/*
> + * Interrupt setup
> + */
> +static void vfio_disable_interrupts(VFIODevice *vdev)
> +{
> +    switch (vdev->interrupt) {
> +    case INT_INTx:
> +        vfio_disable_intx(vdev);
> +        break;
> +    case INT_MSI:
> +        vfio_disable_msi_x(vdev, false);
> +        break;
> +    case INT_MSIX:
> +        vfio_disable_msi_x(vdev, true);

I'd add a 'break' here, and maybe also to the default cases earlier. That
way, if somebody adds another case after this one, they won't introduce a
fall-through accidentally.
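
Roughly what I have in mind, as a standalone sketch. The enum and the
function name are made up for illustration and are not from the patch; the
return values just stand in for the real disable calls:

```c
/* Hypothetical stand-ins for the interrupt modes used in the patch. */
enum demo_int_mode {
    DEMO_INT_NONE,
    DEMO_INT_INTX,
    DEMO_INT_MSI,
    DEMO_INT_MSIX,
};

/*
 * Returns a small code identifying which disable path was taken.
 * Every case, including the last one, ends in an explicit break.
 */
int demo_disable_interrupts(enum demo_int_mode mode)
{
    int path = 0;

    switch (mode) {
    case DEMO_INT_INTX:
        path = 1;
        break;
    case DEMO_INT_MSI:
        path = 2;
        break;
    case DEMO_INT_MSIX:
        path = 3;
        break; /* redundant today, but safe if a case is appended later */
    default:
        break; /* nothing enabled, nothing to disable */
    }

    return path;
}
```

The last break changes nothing at runtime; it only protects whoever appends
the next case.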

> +    }
> +}
> +
> +static int vfio_setup_msi(VFIODevice *vdev, int pos)
> +{
> +    uint16_t ctrl;
> +    bool msi_64bit, msi_maskbit;
> +    int ret, entries;
> +
> +    if (!msi_supported) {
> +        return 0;
> +    }
> +
> +    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
> +              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
> +        return -1;
> +    }
> +    ctrl = le16_to_cpu(ctrl);
> +
> +    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
> +    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
> +    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
> +
> +    DPRINTF("%04x:%02x:%02x.%x PCI MSI CAP @0x%x\n", vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function, pos);
> +
> +    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
> +    if (ret < 0) {
> +        error_report("vfio: msi_init failed\n");
> +        return ret;
> +    }
> +    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
> +
> +    return 0;
> +}
> +
> +/*
> + * We don't have any control over how pci_add_capability() inserts
> + * capabilities into the chain.  In order to setup MSI-X we need a
> + * MemoryRegion for the BAR.  In order to setup the BAR and not
> + * attempt to mmap the MSI-X table area, which VFIO won't allow, we
> + * need to first look for where the MSI-X table lives.  So we
> + * unfortunately split MSI-X setup across two functions.
> + */
> +static int vfio_early_setup_msix(VFIODevice *vdev)
> +{
> +    uint8_t pos;
> +    uint16_t ctrl;
> +    uint32_t table, pba;
> +
> +    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);
> +    if (!pos) {
> +        return 0;
> +    }
> +
> +    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
> +              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
> +        return -1;
> +    }
> +
> +    if (pread(vdev->fd, &table, sizeof(table),
> +              vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
> +        return -1;
> +    }
> +
> +    if (pread(vdev->fd, &pba, sizeof(pba),
> +              vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
> +        return -1;
> +    }
> +
> +    ctrl = le16_to_cpu(ctrl);
> +    table = le32_to_cpu(table);
> +    pba = le32_to_cpu(pba);
> +
> +    vdev->msix = g_malloc0(sizeof(*(vdev->msix)));
> +    vdev->msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
> +    vdev->msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
> +    vdev->msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
> +    vdev->msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
> +    vdev->msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
> +
> +    DPRINTF("%04x:%02x:%02x.%x "
> +            "PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d\n",
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function, pos, vdev->msix->table_bar,
> +            vdev->msix->table_offset, vdev->msix->entries);
> +
> +    return 0;
> +}
> +
> +static int vfio_setup_msix(VFIODevice *vdev, int pos)
> +{
> +    int ret;
> +
> +    if (!msi_supported) {
> +        return 0;
> +    }
> +
> +    ret = msix_init(&vdev->pdev, vdev->msix->entries,
> +                    &vdev->bars[vdev->msix->table_bar].mem,
> +                    vdev->msix->table_bar, vdev->msix->table_offset,
> +                    &vdev->bars[vdev->msix->pba_bar].mem,
> +                    vdev->msix->pba_bar, vdev->msix->pba_offset, pos);
> +    if (ret < 0) {
> +        error_report("vfio: msix_init failed\n");
> +        return ret;
> +    }
> +
> +    ret = msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
> +                                    vfio_msix_vector_release);
> +    if (ret) {
> +        error_report("vfio: msix_set_vector_notifiers failed %d\n", ret);
> +        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem,
> +                    &vdev->bars[vdev->msix->pba_bar].mem);
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_teardown_msi(VFIODevice *vdev)
> +{
> +    msi_uninit(&vdev->pdev);
> +
> +    if (vdev->msix) {
> +        /* FIXME: Why can't unset just silently do nothing?? */
> +        if (vdev->pdev.msix_vector_use_notifier &&
> +            vdev->pdev.msix_vector_release_notifier) {
> +            msix_unset_vector_notifiers(&vdev->pdev);
> +        }
> +
> +        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem,
> +                    &vdev->bars[vdev->msix->pba_bar].mem);
> +    }
> +}
> +
> +/*
> + * Resource setup
> + */
> +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> +{
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    uint64_t size;
> +
> +    if (!memory_region_size(&bar->mem)) {
> +        return;
> +    }
> +
> +    size = memory_region_size(&bar->mmap_mem);
> +    if (size) {
> +         memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> +         munmap(bar->mmap, size);
> +    }
> +
> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        size = memory_region_size(&vdev->msix->mmap_mem);
> +        if (size) {
> +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> +            munmap(vdev->msix->mmap, size);
> +        }
> +    }
> +
> +    memory_region_destroy(&bar->mem);
> +}
> +
> +static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
> +                         void **map, size_t size, off_t offset,
> +                         const char *name)
> +{
> +    *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +                MAP_SHARED, bar->fd, bar->fd_offset + offset);
> +    if (*map == MAP_FAILED) {
> +        *map = NULL;
> +        return -1;
> +    }
> +
> +    memory_region_init_ram_ptr(submem, name, size, *map);
> +    memory_region_add_subregion(mem, offset, submem);
> +
> +    return 0;
> +}
> +
> +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> +{
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    unsigned size = bar->size;
> +    char name[64];
> +
> +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);

snprintf, please, bounded by sizeof(name).
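
A minimal sketch of the bounded version, assuming a hypothetical helper named format_bar_name() that is not in the patch. The format string is the one quoted above; the return-value check is how truncation would be detected.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): format the BAR name into a
 * caller-supplied buffer without writing past its end.
 * Returns 0 on success, -1 if the name did not fit.
 */
static int format_bar_name(char *buf, size_t len, unsigned domain,
                           unsigned bus, unsigned slot, unsigned function,
                           int nr)
{
    int n;

    assert(buf && len > 0);
    n = snprintf(buf, len, "VFIO %04x:%02x:%02x.%x BAR %d",
                 domain, bus, slot, function, nr);
    /* snprintf returns the length it wanted to write; >= len means cut off */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called as format_bar_name(name, sizeof(name), ...), this reports an error instead of overflowing if the buffer is ever made smaller.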

> +
> +    /* A "slow" read/write mapping underlies all BARs */
> +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> +
> +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> +        return; /* IO space is only slow, don't expect high perf here */
> +    }
> +
> +    if (size & ~TARGET_PAGE_MASK) {
> +        error_report("%s is too small to mmap, this may affect performance.\n",
> +                     name);
> +        return;
> +    }
> +
> +    /*
> +     * We can't mmap areas overlapping the MSIX vector table, so we
> +     * potentially insert a direct-mapped subregion before and after it.
> +     */
> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> +    }
> +
> +    if (size) {
> +        strcat(name, " mmap");

strncat, please, with the bound leaving room for the trailing NUL.
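
Roughly what that could look like, as a sketch only: append_suffix() is a hypothetical helper, and it assumes the buffer already holds a NUL-terminated string. strncat() always writes a terminating NUL after the bytes it copies, so the bound has to be the free space minus one.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): append suffix to the
 * NUL-terminated string in buf, which has room for cap bytes in total.
 * Returns 0 if the whole suffix fit, -1 if it was truncated.
 */
static int append_suffix(char *buf, size_t cap, const char *suffix)
{
    size_t used, room;

    assert(buf && suffix && cap > 0);
    used = strlen(buf);
    assert(used < cap);          /* buf must already be terminated */
    room = cap - used - 1;       /* keep one byte for the trailing NUL */

    strncat(buf, suffix, room);  /* copies at most room bytes, then NUL */
    return strlen(suffix) > room ? -1 : 0;
}
```

Note that name accumulates suffixes in this function (" mmap", then " msix-hi"), so the remaining space shrinks with each call.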

> +        if (vfio_mmap_bar(bar, &bar->mem, &bar->mmap_mem, &bar->mmap,
> +                          size, 0, name)) {
> +            error_report("%s Failed. Performance may be slow\n", name);
> +        }
> +    }
> +
> +    if (vdev->msix && vdev->msix->table_bar == nr) {
> +        unsigned start;
> +
> +        start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
> +
> +        if (start < bar->size) {
> +            size = bar->size - start;
> +            strcat(name, " msix-hi");
> +            /* MSIXInfo contains another MemoryRegion for this mapping */
> +            if (vfio_mmap_bar(bar, &bar->mem, &vdev->msix->mmap_mem,
> +                              &vdev->msix->mmap, size, start, name)) {
> +                error_report("%s Failed. Performance may be slow\n", name);
> +            }
> +        }
> +    }
> +
> +    return;
> +}
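
For anyone following the split-mapping arithmetic in vfio_map_bar(), here is a standalone sketch of the same rounding. It assumes 4 KiB target pages and 16-byte MSI-X table entries; struct split and split_around_msix() are illustrative names, not from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB target pages */
#define MSIX_ENTRY_SIZE   16u    /* size of one MSI-X table entry */

struct split {
    uint32_t lo_size;   /* bytes mmap'ed from offset 0, before the table */
    uint32_t hi_start;  /* first page-aligned offset after the table */
    uint32_t hi_size;   /* bytes mmap'ed from hi_start to end of BAR */
};

/* Compute the two direct-mapped ranges around the MSI-X vector table. */
static struct split split_around_msix(uint32_t bar_size,
                                      uint32_t table_offset,
                                      uint32_t entries)
{
    struct split s;
    uint32_t mask = ~(PAGE_SIZE_ASSUMED - 1);
    uint32_t end = table_offset + entries * MSIX_ENTRY_SIZE;

    assert(bar_size % PAGE_SIZE_ASSUMED == 0);
    s.lo_size = table_offset & mask;                    /* round start down */
    s.hi_start = (end + PAGE_SIZE_ASSUMED - 1) & mask;  /* round end up */
    s.hi_size = s.hi_start < bar_size ? bar_size - s.hi_start : 0;
    return s;
}
```

With a 16 KiB BAR and a 3-entry table at offset 0x2000, the low mapping covers 0x0000-0x1fff, the page at 0x2000 stays on the slow path, and the high mapping covers 0x3000-0x3fff.
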
> +
> +static int vfio_map_bars(VFIODevice *vdev)
> +{
> +    int i;
> +
> +    for (i = 0; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
> +        VFIOBAR *bar;
> +        int ret;
> +        uint32_t bar_val;
> +        uint8_t bar_type;
> +
> +        bar = &vdev->bars[i];
> +        if (!bar->size) {
> +            continue;
> +        }
> +
> +        ret = pread(vdev->fd, &bar_val, sizeof(bar_val),
> +                    vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * i));
> +        if (ret != sizeof(bar_val)) {
> +            error_report("vfio: Failed to read BAR %d (%s)\n", i,
> +                         strerror(errno));
> +            return -1;
> +        }
> +
> +        bar_val = le32_to_cpu(bar_val);
> +        bar_type = bar_val & (bar_val & PCI_BASE_ADDRESS_SPACE_IO ?
> +                   ~PCI_BASE_ADDRESS_IO_MASK : ~PCI_BASE_ADDRESS_MEM_MASK);
> +
> +        vfio_map_bar(vdev, i, bar_type);
> +
> +        if (!(bar_type & PCI_BASE_ADDRESS_SPACE_IO) &&
> +            bar_type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
> +            i++;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void vfio_unmap_bars(VFIODevice *vdev)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        vfio_unmap_bar(vdev, i);
> +    }
> +}
> +
> +/*
> + * General setup
> + */
> +static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
> +{
> +    uint8_t tmp, next = 0xff;
> +
> +    for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
> +         tmp = pdev->config[tmp + 1]) {
> +        if (tmp > pos && tmp < next) {
> +            next = tmp;
> +        }
> +    }
> +
> +    return next - pos;
> +}
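
To make the sizing rule concrete: a capability's maximum size is taken as the distance from its offset to the next-higher capability offset found anywhere in the chain, or to 0xff if none follows. The sketch below repeats that walk on a plain config-space array; cap_max_size() is an illustrative name, and it assumes a well-formed, terminated chain.

```c
#include <assert.h>
#include <stdint.h>

#define CAP_LIST_PTR 0x34  /* offset of the capabilities pointer */

/*
 * Walk the capability chain and return the space available to the
 * capability at pos: up to the nearest capability above it, else 0xff.
 */
static uint8_t cap_max_size(const uint8_t *config, uint8_t pos)
{
    uint8_t tmp, next = 0xff;

    assert(config);
    for (tmp = config[CAP_LIST_PTR]; tmp; tmp = config[tmp + 1]) {
        if (tmp > pos && tmp < next) {
            next = tmp;  /* closest capability located after pos so far */
        }
    }
    return next - pos;
}
```

For a chain 0x40 -> 0x50 -> 0x70 this gives 0x10, 0x20 and 0x8f respectively. The chain order does not matter, only the offsets do.
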
> +
> +static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +    uint8_t cap_id, next, size;
> +    int ret;
> +
> +    cap_id = pdev->config[pos];
> +    next = pdev->config[pos + 1];
> +
> +    /*
> +     * If it becomes important to configure capabilities to their actual
> +     * size, use this as the default when it's something we don't recognize.
> +     * Since qemu doesn't actually handle many of the config accesses,
> +     * exact size doesn't seem worthwhile.
> +     */
> +    size = vfio_std_cap_max_size(pdev, pos);
> +
> +    /*
> +     * pci_add_capability always inserts the new capability at the head
> +     * of the chain.  Therefore to end up with a chain that matches the
> +     * physical device, we insert from the end by making this recursive.
> +     * This is also why we pre-calculate size above as cached config space
> +     * will be changed as we unwind the stack.
> +     */
> +    if (next) {
> +        ret = vfio_add_std_cap(vdev, next);
> +        if (ret) {
> +            return ret;
> +        }
> +    } else {
> +        pdev->config[PCI_CAPABILITY_LIST] = 0; /* Begin the rebuild */
> +    }
> +
> +    switch (cap_id) {
> +    case PCI_CAP_ID_MSI:
> +        ret = vfio_setup_msi(vdev, pos);
> +        break;
> +    case PCI_CAP_ID_MSIX:
> +        ret = vfio_setup_msix(vdev, pos);
> +        break;
> +    default:
> +        ret = pci_add_capability(pdev, cap_id, pos, size);
> +    }
> +
> +    if (ret < 0) {
> +        error_report("vfio: %04x:%02x:%02x.%x Error adding PCI capability "
> +                     "0x%x[0x%x]@0x%x: %d\n", vdev->host.domain,
> +                     vdev->host.bus, vdev->host.slot, vdev->host.function,
> +                     cap_id, size, pos, ret);
> +        return ret;
> +    }
> +
> +    return 0;
> +}
> +
> +static int vfio_add_capabilities(VFIODevice *vdev)
> +{
> +    PCIDevice *pdev = &vdev->pdev;
> +
> +    if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
> +        !pdev->config[PCI_CAPABILITY_LIST]) {
> +        return 0; /* Nothing to add */
> +    }
> +
> +    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
> +}
> +
> +static int vfio_load_rom(VFIODevice *vdev)
> +{
> +    uint64_t size = vdev->rom_size;
> +    const VMStateDescription *vmsd;
> +    char name[32];
> +    off_t off = 0, voff = vdev->rom_offset;
> +    ssize_t bytes;
> +    void *ptr;
> +
> +    /* If loading ROM from file, pci handles it */
> +    if (vdev->pdev.romfile || !vdev->pdev.rom_bar || !size)

Braces, please, even around a single-statement body (see CODING_STYLE).

> +        return 0;
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function);
> +
> +    vmsd = qdev_get_vmsd(DEVICE(&vdev->pdev));
> +
> +    if (vmsd) {
> +        snprintf(name, sizeof(name), "%s.rom", vmsd->name);
> +    } else {
> +        snprintf(name, sizeof(name), "%s.rom",
> +                 object_get_typename(OBJECT(&vdev->pdev)));
> +    }
> +    memory_region_init_ram(&vdev->pdev.rom, name, size);
> +    ptr = memory_region_get_ram_ptr(&vdev->pdev.rom);
> +    memset(ptr, 0xff, size);
> +
> +    while (size) {
> +        bytes = pread(vdev->fd, ptr + off, size, voff + off);
> +        if (bytes == 0) {
> +            break; /* expect that we could get back less than the ROM BAR */
> +        } else if (bytes > 0) {
> +            off += bytes;
> +            size -= bytes;
> +        } else {
> +            if (errno == EINTR || errno == EAGAIN) {
> +                continue;
> +            }
> +            error_report("vfio: Error reading device ROM: %s\n",
> +                         strerror(errno));
> +            memory_region_destroy(&vdev->pdev.rom);
> +            return -1;
> +        }
> +    }
> +
> +    pci_register_bar(&vdev->pdev, PCI_ROM_SLOT, 0, &vdev->pdev.rom);
> +    vdev->pdev.has_rom = true;
> +    return 0;
> +}
> +
> +static int vfio_connect_container(VFIOGroup *group)
> +{
> +    VFIOContainer *container;
> +    int ret, fd;
> +
> +    if (group->container) {
> +        return 0;
> +    }
> +
> +    QLIST_FOREACH(container, &container_list, next) {
> +        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
> +            group->container = container;
> +            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +            return 0;
> +        }
> +    }
> +
> +    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
> +    if (fd < 0) {
> +        error_report("vfio: failed to open /dev/vfio/vfio: %s\n",
> +                     strerror(errno));
> +        return -1;
> +    }
> +
> +    ret = ioctl(fd, VFIO_GET_API_VERSION);
> +    if (ret != VFIO_API_VERSION) {
> +        error_report("vfio: supported vfio version: %d, "
> +                     "reported version: %d\n", VFIO_API_VERSION, ret);
> +        close(fd);
> +        return -1;
> +    }
> +
> +    container = g_malloc0(sizeof(*container));
> +    container->fd = fd;
> +
> +    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %s\n",
> +                         strerror(errno));
> +            g_free(container);
> +            close(fd);
> +            return -1;
> +        }
> +
> +        container->iommu_data.listener = (MemoryListener) {
> +            .begin = vfio_listener_dummy1,
> +            .commit = vfio_listener_dummy1,
> +            .region_add = vfio_listener_region_add,
> +            .region_del = vfio_listener_region_del,
> +            .region_nop = vfio_listener_dummy2,
> +            .log_start = vfio_listener_dummy2,
> +            .log_stop = vfio_listener_dummy2,
> +            .log_sync = vfio_listener_dummy2,
> +            .log_global_start = vfio_listener_dummy1,
> +            .log_global_stop = vfio_listener_dummy1,
> +            .eventfd_add = vfio_listener_dummy3,
> +            .eventfd_del = vfio_listener_dummy3,
> +        };
> +        container->iommu_data.release = vfio_listener_release;
> +
> +        memory_listener_register(&container->iommu_data.listener,
> +                                 get_system_memory());
> +    } else {
> +        error_report("vfio: No available IOMMU models\n");
> +        g_free(container);
> +        close(fd);
> +        return -1;
> +    }
> +
> +    QLIST_INIT(&container->group_list);
> +    QLIST_INSERT_HEAD(&container_list, container, next);
> +
> +    group->container = container;
> +    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
> +
> +    return 0;
> +}
> +
> +static void vfio_disconnect_container(VFIOGroup *group)
> +{
> +    VFIOContainer *container = group->container;
> +
> +    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
> +        error_report("vfio: error disconnecting group %d from container\n",
> +                     group->groupid);
> +    }
> +
> +    QLIST_REMOVE(group, container_next);
> +    group->container = NULL;
> +
> +    if (QLIST_EMPTY(&container->group_list)) {
> +        if (container->iommu_data.release) {
> +            container->iommu_data.release(container);
> +        }
> +        QLIST_REMOVE(container, next);
> +        DPRINTF("vfio_disconnect_container: close container->fd\n");
> +        close(container->fd);
> +        g_free(container);
> +    }
> +}
> +
> +static VFIOGroup *vfio_get_group(int groupid)
> +{
> +    VFIOGroup *group;
> +    char path[32];
> +    struct vfio_group_status status = { .argsz = sizeof(status) };
> +
> +    QLIST_FOREACH(group, &group_list, next) {
> +        if (group->groupid == groupid) {
> +            return group;
> +        }
> +    }
> +
> +    group = g_malloc0(sizeof(*group));
> +
> +    sprintf(path, "/dev/vfio/%d", groupid);

snprintf

> +    group->fd = qemu_open(path, O_RDWR);
> +    if (group->fd < 0) {
> +        error_report("vfio: error opening %s: %s", path, strerror(errno));
> +        g_free(group);
> +        return NULL;
> +    }
> +
> +    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> +        error_report("vfio: error getting group status: %s\n",
> +                     strerror(errno));
> +        close(group->fd);
> +        g_free(group);
> +        return NULL;
> +    }
> +
> +    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> +        error_report("vfio: error, group %d is not viable, please ensure "
> +                     "all devices within the iommu_group are bound to their "
> +                     "vfio bus driver.\n", groupid);
> +        close(group->fd);
> +        g_free(group);
> +        return NULL;
> +    }
> +
> +    group->groupid = groupid;
> +    QLIST_INIT(&group->device_list);
> +
> +    if (vfio_connect_container(group)) {
> +        error_report("vfio: failed to setup container for group %d\n", groupid);
> +        close(group->fd);
> +        g_free(group);
> +        return NULL;
> +    }
> +
> +    QLIST_INSERT_HEAD(&group_list, group, next);
> +
> +    return group;
> +}
> +
> +static void vfio_put_group(VFIOGroup *group)
> +{
> +    if (!QLIST_EMPTY(&group->device_list)) {
> +        return;
> +    }
> +
> +    vfio_disconnect_container(group);
> +    QLIST_REMOVE(group, next);
> +    DPRINTF("vfio_put_group: close group->fd\n");
> +    close(group->fd);
> +    g_free(group);
> +}
> +
> +static int __vfio_get_device(VFIOGroup *group,
> +                             const char *name, VFIODevice *vdev)

Please remove the leading underscores; identifiers beginning with a double
underscore are reserved for the implementation.

> +{
> +    int ret;
> +
> +    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> +    if (ret < 0) {
> +        error_report("vfio: error getting device %s from group %d: %s",
> +                     name, group->groupid, strerror(errno));
> +        error_report("Verify all devices in group %d "
> +                     "are bound to vfio-pci or pci-stub and not already in use",
> +                     group->groupid);
> +        return -1;
> +    }
> +
> +    vdev->group = group;
> +    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
> +
> +    vdev->fd = ret;
> +
> +    return 0;
> +}
> +
> +static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
> +{
> +    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
> +    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
> +    int ret, i;
> +
> +    ret = __vfio_get_device(group, name, vdev);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    /* Sanity check device */
> +    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
> +    if (ret) {
> +        error_report("vfio: error getting device info: %s", strerror(errno));
> +        goto error;
> +    }
> +
> +    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
> +            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
> +
> +    if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) {
> +        error_report("vfio: Um, this isn't a PCI device");
> +        goto error;
> +    }
> +
> +    vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
> +    if (!vdev->reset_works) {
> +        error_report("Warning, device %s does not support reset\n", name);
> +    }
> +
> +    if (dev_info.num_regions != VFIO_PCI_NUM_REGIONS) {
> +        error_report("vfio: unexpected number of io regions %u",
> +                     dev_info.num_regions);
> +        goto error;
> +    }
> +
> +    if (dev_info.num_irqs != VFIO_PCI_NUM_IRQS) {
> +        error_report("vfio: unexpected number of irqs %u", dev_info.num_irqs);
> +        goto error;
> +    }
> +
> +    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
> +        reg_info.index = i;
> +
> +        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
> +        if (ret) {
> +            error_report("vfio: Error getting region %d info: %s", i,
> +                         strerror(errno));
> +            goto error;
> +        }
> +
> +        DPRINTF("Device %s region %d:\n", name, i);
> +        DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
> +                (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
> +                (unsigned long)reg_info.flags);
> +
> +        vdev->bars[i].size = reg_info.size;
> +        vdev->bars[i].fd_offset = reg_info.offset;
> +        vdev->bars[i].fd = vdev->fd;
> +        vdev->bars[i].nr = i;
> +    }
> +
> +    reg_info.index = VFIO_PCI_ROM_REGION_INDEX;
> +
> +    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
> +    if (ret) {
> +        error_report("vfio: Error getting ROM info: %s", strerror(errno));
> +        goto error;
> +    }
> +
> +    DPRINTF("Device %s ROM:\n", name);
> +    DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
> +            (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
> +            (unsigned long)reg_info.flags);
> +
> +    vdev->rom_size = reg_info.size;
> +    vdev->rom_offset = reg_info.offset;
> +
> +    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
> +
> +    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
> +    if (ret) {
> +        error_report("vfio: Error getting config info: %s", strerror(errno));
> +        goto error;
> +    }
> +
> +    DPRINTF("Device %s config:\n", name);
> +    DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
> +            (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
> +            (unsigned long)reg_info.flags);
> +
> +    vdev->config_size = reg_info.size;
> +    vdev->config_offset = reg_info.offset;
> +
> +error:
> +    if (ret) {
> +        QLIST_REMOVE(vdev, next);
> +        vdev->group = NULL;
> +        close(vdev->fd);
> +    }
> +    return ret;
> +}
> +
> +static void vfio_put_device(VFIODevice *vdev)
> +{
> +    QLIST_REMOVE(vdev, next);
> +    vdev->group = NULL;
> +    DPRINTF("vfio_put_device: close vdev->fd\n");
> +    close(vdev->fd);
> +    if (vdev->msix) {
> +        g_free(vdev->msix);
> +       vdev->msix = NULL;
> +    }
> +}
> +
> +static int vfio_initfn(struct PCIDevice *pdev)
> +{
> +    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    VFIOGroup *group;
> +    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
> +    ssize_t len;
> +    struct stat st;
> +    int groupid;
> +    int ret;
> +
> +    /* Check that the host device exists */
> +    sprintf(path, "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function);

snprintf

> +    if (stat(path, &st) < 0) {
> +        error_report("vfio: error: no such host device: %s", path);
> +        return -1;
> +    }
> +
> +    strcat(path, "iommu_group");

strncat
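
One way to avoid the append altogether, sketched with a hypothetical helper format_sysfs_path() that is not in the patch: format the device directory and the leaf name in a single bounded snprintf() call. An empty leaf yields the directory passed to stat(); "iommu_group" yields the link passed to readlink().

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): build
 * /sys/bus/pci/devices/DDDD:BB:SS.F/<leaf> in buf.
 * Returns 0 on success, -1 if the path did not fit in len bytes.
 */
static int format_sysfs_path(char *buf, size_t len, unsigned domain,
                             unsigned bus, unsigned slot, unsigned function,
                             const char *leaf)
{
    int n;

    assert(buf && leaf && len > 0);
    n = snprintf(buf, len, "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/%s",
                 domain, bus, slot, function, leaf);
    /* a return value >= len means the output was truncated */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```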

> +
> +    len = readlink(path, iommu_group_path, PATH_MAX);
> +    if (len <= 0) {
> +        error_report("vfio: error no iommu_group for device\n");
> +        return -1;
> +    }
> +
> +    iommu_group_path[len] = 0;
> +    group_name = basename(iommu_group_path);
> +
> +    if (sscanf(group_name, "%d", &groupid) != 1) {
> +        error_report("vfio: error reading %s: %s", path, strerror(errno));
> +        return -1;
> +    }
> +
> +    DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __FUNCTION__, vdev->host.domain,
> +            vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
> +
> +    group = vfio_get_group(groupid);
> +    if (!group) {
> +        error_report("vfio: failed to get group %d", groupid);
> +        return -1;
> +    }
> +
> +    sprintf(path, "%04x:%02x:%02x.%01x",
> +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> +            vdev->host.function);

snprintf

> +
> +    QLIST_FOREACH(pvdev, &group->device_list, next) {
> +        if (pvdev->host.domain == vdev->host.domain &&
> +            pvdev->host.bus == vdev->host.bus &&
> +            pvdev->host.slot == vdev->host.slot &&
> +            pvdev->host.function == vdev->host.function) {
> +
> +            error_report("vfio: error: device %s is already attached\n", path);
> +            vfio_put_group(group);
> +            return -1;
> +        }
> +    }
> +
> +    ret = vfio_get_device(group, path, vdev);
> +    if (ret) {
> +        error_report("vfio: failed to get device %s", path);
> +        vfio_put_group(group);
> +        return -1;
> +    }
> +
> +    /* Get a copy of config space */
> +    assert(pci_config_size(&vdev->pdev) <= vdev->config_size);
> +    ret = pread(vdev->fd, vdev->pdev.config,
> +                pci_config_size(&vdev->pdev), vdev->config_offset);
> +    if (ret < (int)pci_config_size(&vdev->pdev)) {
> +        error_report("vfio: Failed to read device config space\n");
> +        goto out_put;
> +    }
> +
> +    /*
> +     * Clear host resource mapping info.  If we choose not to register a
> +     * BAR, such as might be the case with the option ROM, we can get
> +     * confusing, unwritable, residual addresses from the host here.
> +     */
> +    memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
> +    memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
> +
> +    vfio_load_rom(vdev);
> +
> +    if (vfio_early_setup_msix(vdev)) {
> +        goto out_put;
> +    }
> +
> +    if (vfio_map_bars(vdev)) {
> +        goto out_unmap_bars;
> +    }
> +
> +    if (vfio_add_capabilities(vdev)) {
> +        goto out_teardown_msi;
> +    }
> +
> +    if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
> +        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_update_irq);
> +    }
> +
> +    if (vfio_enable_intx(vdev)) {
> +        goto out_teardown_msi;
> +    }
> +
> +    return 0;
> +
> +out_teardown_msi:
> +    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> +    vfio_teardown_msi(vdev);
> +out_unmap_bars:
> +    vfio_unmap_bars(vdev);
> +out_put:
> +    vfio_put_device(vdev);
> +    vfio_put_group(group);
> +    return -1;
> +}
> +
> +static void vfio_exitfn(struct PCIDevice *pdev)
> +{
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +    VFIOGroup *group = vdev->group;
> +
> +    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> +    vfio_disable_interrupts(vdev);
> +    vfio_teardown_msi(vdev);
> +    vfio_unmap_bars(vdev);
> +    vfio_put_device(vdev);
> +    vfio_put_group(group);
> +}
> +
> +static void vfio_reset(DeviceState *dev)
> +{
> +    PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
> +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> +
> +    if (!vdev->reset_works) {
> +        return;
> +    }
> +
> +    if (ioctl(vdev->fd, VFIO_DEVICE_RESET)) {
> +        error_report("vfio: Error unable to reset physical device "
> +                     "(%04x:%02x:%02x.%x): %s\n", vdev->host.domain,
> +                     vdev->host.bus, vdev->host.slot, vdev->host.function,
> +                     strerror(errno));
> +    }
> +}
> +
> +static Property vfio_pci_dev_properties[] = {
> +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> +    //TODO - support passed fds... is this necessary?
> +    //DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name),
> +    //DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +
> +static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
> +{
> +    PCIDeviceClass *dc = PCI_DEVICE_CLASS(klass);
> +
> +    dc->parent_class.reset = vfio_reset;
> +    dc->init = vfio_initfn;
> +    dc->exit = vfio_exitfn;
> +    dc->config_read = vfio_pci_read_config;
> +    dc->config_write = vfio_pci_write_config;
> +    dc->parent_class.props = vfio_pci_dev_properties;
> +}
> +
> +static TypeInfo vfio_pci_dev_info = {
> +    .name          = "vfio-pci",
> +    .parent        = TYPE_PCI_DEVICE,
> +    .instance_size = sizeof(VFIODevice),
> +    .class_init    = vfio_pci_dev_class_init,
> +};
> +
> +static void register_vfio_pci_dev_type(void)
> +{
> +    type_register_static(&vfio_pci_dev_info);
> +}
> +
> +type_init(register_vfio_pci_dev_type)
> diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> new file mode 100644
> index 0000000..9fb27ce
> --- /dev/null
> +++ b/hw/vfio_pci.h
> @@ -0,0 +1,100 @@
> +#ifndef __VFIO_H__
> +#define __VFIO_H__

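Please use HW_VFIO_PCI_H as the include guard (here and on the closing
#endif comment); it matches the file name and avoids the reserved
double-underscore prefix.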

> +
> +#include "qemu-common.h"
> +#include "qemu-queue.h"
> +#include "pci.h"
> +#include "event_notifier.h"
> +
> +typedef struct VFIOBAR {
> +    off_t fd_offset; /* offset of BAR within device fd */
> +    int fd; /* device fd, allows us to pass VFIOBAR as opaque data */
> +    MemoryRegion mem; /* slow, read/write access */
> +    MemoryRegion mmap_mem; /* direct mapped access */
> +    void *mmap;
> +    size_t size;
> +    uint8_t nr; /* cache the BAR number for debug */
> +} VFIOBAR;
> +
> +typedef struct INTx {
> +    bool pending; /* interrupt pending */
> +    bool kvm_accel; /* set when Qemu bypass through KVM enabled */
> +    uint8_t pin; /* which pin to pull for qemu_set_irq */
> +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> +    EventNotifier unmask; /* eventfd for unmask on Qemu bypass */
> +    PCIINTxRoute route; /* routing info for Qemu bypass */
> +} INTx;
> +
> +struct VFIODevice;
> +
> +typedef struct MSIVector {
> +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> +    struct VFIODevice *vdev; /* back pointer to device */
> +    int vector; /* the vector number for this element */
> +    int virq; /* KVM irqchip route for Qemu bypass */
> +    bool use;
> +} MSIVector;
> +
> +enum {
> +    INT_NONE = 0,
> +    INT_INTx = 1,
> +    INT_MSI  = 2,
> +    INT_MSIX = 3,
> +};
> +
> +struct VFIOGroup;
> +
> +typedef struct VFIOContainer {
> +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> +    struct {
> +        /* enable abstraction to support various iommu backends */
> +        union {
> +            MemoryListener listener; /* Used by type1 iommu */
> +        };
> +        void (*release)(struct VFIOContainer *);
> +    } iommu_data;
> +    QLIST_HEAD(, VFIOGroup) group_list;
> +    QLIST_ENTRY(VFIOContainer) next;
> +} VFIOContainer;
> +
> +/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> +typedef struct MSIXInfo {
> +    uint8_t table_bar;
> +    uint8_t pba_bar;
> +    uint16_t entries;
> +    uint32_t table_offset;
> +    uint32_t pba_offset;
> +    MemoryRegion mmap_mem;
> +    void *mmap;
> +} MSIXInfo;
> +
> +typedef struct VFIODevice {
> +    PCIDevice pdev;
> +    int fd;
> +    INTx intx;
> +    unsigned int config_size;
> +    off_t config_offset; /* Offset of config space region within device fd */
> +    unsigned int rom_size;
> +    off_t rom_offset; /* Offset of ROM region within device fd */
> +    int msi_cap_size;
> +    MSIVector *msi_vectors;
> +    MSIXInfo *msix;
> +    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
> +    int interrupt; /* Current interrupt type */
> +    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
> +    PCIHostDeviceAddress host;
> +    QLIST_ENTRY(VFIODevice) next;
> +    struct VFIOGroup *group;
> +    bool reset_works;
> +} VFIODevice;
> +
> +typedef struct VFIOGroup {
> +    int fd;
> +    int groupid;
> +    VFIOContainer *container;
> +    QLIST_HEAD(, VFIODevice) device_list;
> +    QLIST_ENTRY(VFIOGroup) next;
> +    QLIST_ENTRY(VFIOGroup) container_next;
> +} VFIOGroup;
> +
> +#endif /* __VFIO_H__ */
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 5a9d4e3..bd1a76c 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -617,6 +617,10 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_SIGNAL_MSI 77
>  #define KVM_CAP_PPC_GET_SMMU_INFO 78
>  #define KVM_CAP_S390_COW 79
> +#define KVM_CAP_PPC_ALLOC_HTAB 80
> +#define KVM_CAP_IRQFD_LEVEL 81
> +#define KVM_CAP_EOIFD 82
> +#define KVM_CAP_EOIFD_LEVEL_IRQFD 83
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -682,6 +686,8 @@ struct kvm_xen_hvm_config {
>  #endif
>
>  #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
> +/* Available with KVM_CAP_IRQFD_LEVEL */
> +#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
>
>  struct kvm_irqfd {
>         __u32 fd;
> @@ -690,6 +696,17 @@ struct kvm_irqfd {
>         __u8  pad[20];
>  };
>
> +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> +/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */
> +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> +
> +struct kvm_eoifd {
> +       __u32 fd;
> +       __u32 flags;
> +       __u32 key;
> +       __u8 pad[20];
> +};
> +
>  struct kvm_clock_data {
>         __u64 clock;
>         __u32 flags;
> @@ -828,6 +845,10 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_SIGNAL_MSI            _IOW(KVMIO,  0xa5, struct kvm_msi)
>  /* Available with KVM_CAP_PPC_GET_SMMU_INFO */
>  #define KVM_PPC_GET_SMMU_INFO    _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
> +/* Available with KVM_CAP_PPC_ALLOC_HTAB */
> +#define KVM_PPC_ALLOCATE_HTAB    _IOWR(KVMIO, 0xa7, __u32)
> +/* Available with KVM_CAP_EOIFD */
> +#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
>
>  /*
>   * ioctls for vcpu fds
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> new file mode 100644
> index 0000000..0a4f180
> --- /dev/null
> +++ b/linux-headers/linux/vfio.h
> @@ -0,0 +1,445 @@
> +/*
> + * VFIO API definition
> + *
> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define VFIO_API_VERSION       0
> +
> +#ifdef __KERNEL__      /* Internal VFIO-core/bus driver API */
> +
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +
> +/**
> + * struct vfio_device_ops - VFIO bus driver device callbacks
> + *
> + * @open: Called when userspace creates new file descriptor for device
> + * @release: Called when userspace releases file descriptor for device
> + * @read: Perform read(2) on device file descriptor
> + * @write: Perform write(2) on device file descriptor
> + * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
> + *         operations documented below
> + * @mmap: Perform mmap(2) on a region of the device file descriptor
> + */
> +struct vfio_device_ops {
> +       char    *name;
> +       int     (*open)(void *device_data);
> +       void    (*release)(void *device_data);
> +       ssize_t (*read)(void *device_data, char __user *buf,
> +                       size_t count, loff_t *ppos);
> +       ssize_t (*write)(void *device_data, const char __user *buf,
> +                        size_t count, loff_t *size);
> +       long    (*ioctl)(void *device_data, unsigned int cmd,
> +                        unsigned long arg);
> +       int     (*mmap)(void *device_data, struct vm_area_struct *vma);
> +};
> +
> +extern int vfio_add_group_dev(struct device *dev,
> +                             const struct vfio_device_ops *ops,
> +                             void *device_data);
> +
> +extern void *vfio_del_group_dev(struct device *dev);
> +
> +/**
> + * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
> + */
> +struct vfio_iommu_driver_ops {
> +       char            *name;
> +       struct module   *owner;
> +       void            *(*open)(unsigned long arg);
> +       void            (*release)(void *iommu_data);
> +       ssize_t         (*read)(void *iommu_data, char __user *buf,
> +                               size_t count, loff_t *ppos);
> +       ssize_t         (*write)(void *iommu_data, const char __user *buf,
> +                                size_t count, loff_t *size);
> +       long            (*ioctl)(void *iommu_data, unsigned int cmd,
> +                                unsigned long arg);
> +       int             (*mmap)(void *iommu_data, struct vm_area_struct *vma);
> +       int             (*attach_group)(void *iommu_data,
> +                                       struct iommu_group *group);
> +       void            (*detach_group)(void *iommu_data,
> +                                       struct iommu_group *group);
> +
> +};
> +
> +extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> +
> +extern void vfio_unregister_iommu_driver(
> +                               const struct vfio_iommu_driver_ops *ops);
> +
> +/**
> + * offsetofend(TYPE, MEMBER)
> + *
> + * @TYPE: The type of the structure
> + * @MEMBER: The member within the structure to get the end offset of
> + *
> + * Simple helper macro for dealing with variable sized structures passed
> + * from user space.  This allows us to easily determine if the provided
> + * structure is sized to include various fields.
> + */
> +#define offsetofend(TYPE, MEMBER) ({                           \
> +       TYPE tmp;                                               \
> +       offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })         \
> +
> +#endif /* __KERNEL__ */
> +
> +/* Kernel & User level defines for VFIO IOCTLs. */
> +
> +/* Extensions */
> +
> +#define VFIO_TYPE1_IOMMU               1
> +
> +/*
> + * The IOCTL interface is designed for extensibility by embedding the
> + * structure length (argsz) and flags into structures passed between
> + * kernel and userspace.  We therefore use the _IO() macro for these
> + * defines to avoid implicitly embedding a size into the ioctl request.
> + * As structure fields are added, argsz will increase to match and flag
> + * bits will be defined to indicate additional fields with valid data.
> + * It's *always* the caller's responsibility to indicate the size of
> + * the structure passed by setting argsz appropriately.
> + */
> +
> +#define VFIO_TYPE      (';')
> +#define VFIO_BASE      100
> +
> +/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
> +
> +/**
> + * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
> + *
> + * Report the version of the VFIO API.  This allows us to bump the entire
> + * API version should we later need to add or change features in incompatible
> + * ways.
> + * Return: VFIO_API_VERSION
> + * Availability: Always
> + */
> +#define VFIO_GET_API_VERSION           _IO(VFIO_TYPE, VFIO_BASE + 0)
> +
> +/**
> + * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
> + *
> + * Check whether an extension is supported.
> + * Return: 0 if not supported, 1 (or some other positive integer) if supported.
> + * Availability: Always
> + */
> +#define VFIO_CHECK_EXTENSION           _IO(VFIO_TYPE, VFIO_BASE + 1)
> +
> +/**
> + * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
> + *
> + * Set the iommu to the given type.  The type must be supported by an
> + * iommu driver as verified by calling CHECK_EXTENSION using the same
> + * type.  A group must be set to this file descriptor before this
> + * ioctl is available.  The IOMMU interfaces enabled by this call are
> + * specific to the value set.
> + * Return: 0 on success, -errno on failure
> + * Availability: When VFIO group attached
> + */
> +#define VFIO_SET_IOMMU                 _IO(VFIO_TYPE, VFIO_BASE + 2)
> +
> +/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
> +
> +/**
> + * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
> + *                                             struct vfio_group_status)
> + *
> + * Retrieve information about the group.  Fills in provided
> + * struct vfio_group_status.  Caller sets argsz.
> + * Return: 0 on success, -errno on failure.
> + * Availability: Always
> + */
> +struct vfio_group_status {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_GROUP_FLAGS_VIABLE                (1 << 0)
> +#define VFIO_GROUP_FLAGS_CONTAINER_SET (1 << 1)
> +};
> +#define VFIO_GROUP_GET_STATUS          _IO(VFIO_TYPE, VFIO_BASE + 3)
> +
> +/**
> + * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
> + *
> + * Set the container for the VFIO group to the open VFIO file
> + * descriptor provided.  Groups may only belong to a single
> + * container.  Containers may, at their discretion, support multiple
> + * groups.  Only when a container is set are all of the interfaces
> + * of the VFIO file descriptor and the VFIO group file descriptor
> + * available to the user.
> + * Return: 0 on success, -errno on failure.
> + * Availability: Always
> + */
> +#define VFIO_GROUP_SET_CONTAINER       _IO(VFIO_TYPE, VFIO_BASE + 4)
> +
> +/**
> + * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
> + *
> + * Remove the group from the attached container.  This is the
> + * opposite of the SET_CONTAINER call and returns the group to
> + * an initial state.  All device file descriptors must be released
> + * prior to calling this interface.  When removing the last group
> + * from a container, the IOMMU will be disabled and all state lost,
> + * effectively also returning the VFIO file descriptor to an initial
> + * state.
> + * Return: 0 on success, -errno on failure.
> + * Availability: When attached to container
> + */
> +#define VFIO_GROUP_UNSET_CONTAINER     _IO(VFIO_TYPE, VFIO_BASE + 5)
> +
> +/**
> + * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
> + *
> + * Return a new file descriptor for the device object described by
> + * the provided string.  The string should match a device listed in
> + * the devices subdirectory of the IOMMU group sysfs entry.  The
> + * group containing the device must already be added to this context.
> + * Return: new file descriptor on success, -errno on failure.
> + * Availability: When attached to container
> + */
> +#define VFIO_GROUP_GET_DEVICE_FD       _IO(VFIO_TYPE, VFIO_BASE + 6)
> +
> +/* --------------- IOCTLs for DEVICE file descriptors --------------- */
> +
> +/**
> + * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
> + *                                             struct vfio_device_info)
> + *
> + * Retrieve information about the device.  Fills in provided
> + * struct vfio_device_info.  Caller sets argsz.
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_device_info {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_DEVICE_FLAGS_RESET        (1 << 0)        /* Device supports reset */
> +#define VFIO_DEVICE_FLAGS_PCI  (1 << 1)        /* vfio-pci device */
> +       __u32   num_regions;    /* Max region index + 1 */
> +       __u32   num_irqs;       /* Max IRQ index + 1 */
> +};
> +#define VFIO_DEVICE_GET_INFO           _IO(VFIO_TYPE, VFIO_BASE + 7)
> +
> +/**
> + * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
> + *                                    struct vfio_region_info)
> + *
> + * Retrieve information about a device region.  Caller provides
> + * struct vfio_region_info with index value set.  Caller sets argsz.
> + * Implementation of region mapping is bus driver specific.  This is
> + * intended to describe MMIO, I/O port, as well as bus specific
> + * regions (ex. PCI config space).  Zero sized regions may be used
> + * to describe unimplemented regions (ex. unimplemented PCI BARs).
> + * Return: 0 on success, -errno on failure.
> + */
> +struct vfio_region_info {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_REGION_INFO_FLAG_READ     (1 << 0) /* Region supports read */
> +#define VFIO_REGION_INFO_FLAG_WRITE    (1 << 1) /* Region supports write */
> +#define VFIO_REGION_INFO_FLAG_MMAP     (1 << 2) /* Region supports mmap */
> +       __u32   index;          /* Region index */
> +       __u32   resv;           /* Reserved for alignment */
> +       __u64   size;           /* Region size (bytes) */
> +       __u64   offset;         /* Region offset from start of device fd */
> +};
> +#define VFIO_DEVICE_GET_REGION_INFO    _IO(VFIO_TYPE, VFIO_BASE + 8)
> +
> +/**
> + * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
> + *                                 struct vfio_irq_info)
> + *
> + * Retrieve information about a device IRQ.  Caller provides
> + * struct vfio_irq_info with index value set.  Caller sets argsz.
> + * Implementation of IRQ mapping is bus driver specific.  Indexes
> + * using multiple IRQs are primarily intended to support MSI-like
> + * interrupt blocks.  Zero count irq blocks may be used to describe
> + * unimplemented interrupt types.
> + *
> + * The EVENTFD flag indicates the interrupt index supports eventfd based
> + * signaling.
> + *
> + * The MASKABLE flags indicates the index supports MASK and UNMASK
> + * actions described below.
> + *
> + * AUTOMASKED indicates that after signaling, the interrupt line is
> + * automatically masked by VFIO and the user needs to unmask the line
> + * to receive new interrupts.  This is primarily intended to distinguish
> + * level triggered interrupts.
> + *
> + * The NORESIZE flag indicates that the interrupt lines within the index
> + * are setup as a set and new subindexes cannot be enabled without first
> + * disabling the entire index.  This is used for interrupts like PCI MSI
> + * and MSI-X where the driver may only use a subset of the available
> + * indexes, but VFIO needs to enable a specific number of vectors
> + * upfront.  In the case of MSI-X, where the user can enable MSI-X and
> + * then add and unmask vectors, it's up to userspace to make the decision
> + * whether to allocate the maximum supported number of vectors or tear
> + * down setup and incrementally increase the vectors as each is enabled.
> + */
> +struct vfio_irq_info {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_IRQ_INFO_EVENTFD          (1 << 0)
> +#define VFIO_IRQ_INFO_MASKABLE         (1 << 1)
> +#define VFIO_IRQ_INFO_AUTOMASKED       (1 << 2)
> +#define VFIO_IRQ_INFO_NORESIZE         (1 << 3)
> +       __u32   index;          /* IRQ index */
> +       __u32   count;          /* Number of IRQs within this index */
> +};
> +#define VFIO_DEVICE_GET_IRQ_INFO       _IO(VFIO_TYPE, VFIO_BASE + 9)
> +
> +/**
> + * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
> + *
> + * Set signaling, masking, and unmasking of interrupts.  Caller provides
> + * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
> + * the range of subindexes being specified.
> + *
> + * The DATA flags specify the type of data provided.  If DATA_NONE, the
> + * operation performs the specified action immediately on the specified
> + * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
> + * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
> + *
> + * DATA_BOOL allows sparse support for the same on arrays of interrupts.
> + * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
> + * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
> + * data = {1,0,1}
> + *
> + * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
> + * A value of -1 can be used to either de-assign interrupts if already
> + * assigned or skip un-assigned interrupts.  For example, to set an eventfd
> + * to be trigger for interrupts [0,0] and [0,2]:
> + * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
> + * data = {fd1, -1, fd2}
> + * If index [0,1] is previously set, two count = 1 ioctls calls would be
> + * required to set [0,0] and [0,2] without changing [0,1].
> + *
> + * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
> + * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
> + * from userspace (ie. simulate hardware triggering).
> + *
> + * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
> + * enables the interrupt index for the device.  Individual subindex interrupts
> + * can be disabled using the -1 value for DATA_EVENTFD or the index can be
> + * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
> + *
> + * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
> + * ACTION_TRIGGER specifies kernel->user signaling.
> + */
> +struct vfio_irq_set {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_IRQ_SET_DATA_NONE         (1 << 0) /* Data not present */
> +#define VFIO_IRQ_SET_DATA_BOOL         (1 << 1) /* Data is bool (u8) */
> +#define VFIO_IRQ_SET_DATA_EVENTFD      (1 << 2) /* Data is eventfd (s32) */
> +#define VFIO_IRQ_SET_ACTION_MASK       (1 << 3) /* Mask interrupt */
> +#define VFIO_IRQ_SET_ACTION_UNMASK     (1 << 4) /* Unmask interrupt */
> +#define VFIO_IRQ_SET_ACTION_TRIGGER    (1 << 5) /* Trigger interrupt */
> +       __u32   index;
> +       __u32   start;
> +       __u32   count;
> +       __u8    data[];
> +};
> +#define VFIO_DEVICE_SET_IRQS           _IO(VFIO_TYPE, VFIO_BASE + 10)
> +
> +#define VFIO_IRQ_SET_DATA_TYPE_MASK    (VFIO_IRQ_SET_DATA_NONE | \
> +                                        VFIO_IRQ_SET_DATA_BOOL | \
> +                                        VFIO_IRQ_SET_DATA_EVENTFD)
> +#define VFIO_IRQ_SET_ACTION_TYPE_MASK  (VFIO_IRQ_SET_ACTION_MASK | \
> +                                        VFIO_IRQ_SET_ACTION_UNMASK | \
> +                                        VFIO_IRQ_SET_ACTION_TRIGGER)
> +/**
> + * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
> + *
> + * Reset a device.
> + */
> +#define VFIO_DEVICE_RESET              _IO(VFIO_TYPE, VFIO_BASE + 11)
> +
> +/*
> + * The VFIO-PCI bus driver makes use of the following fixed region and
> + * IRQ index mapping.  Unimplemented regions return a size of zero.
> + * Unimplemented IRQ types return a count of zero.
> + */
> +
> +enum {
> +       VFIO_PCI_BAR0_REGION_INDEX,
> +       VFIO_PCI_BAR1_REGION_INDEX,
> +       VFIO_PCI_BAR2_REGION_INDEX,
> +       VFIO_PCI_BAR3_REGION_INDEX,
> +       VFIO_PCI_BAR4_REGION_INDEX,
> +       VFIO_PCI_BAR5_REGION_INDEX,
> +       VFIO_PCI_ROM_REGION_INDEX,
> +       VFIO_PCI_CONFIG_REGION_INDEX,
> +       VFIO_PCI_NUM_REGIONS
> +};
> +
> +enum {
> +       VFIO_PCI_INTX_IRQ_INDEX,
> +       VFIO_PCI_MSI_IRQ_INDEX,
> +       VFIO_PCI_MSIX_IRQ_INDEX,
> +       VFIO_PCI_NUM_IRQS
> +};
> +
> +/* -------- API for Type1 VFIO IOMMU -------- */
> +
> +/**
> + * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_info)
> + *
> + * Retrieve information about the IOMMU object. Fills in provided
> + * struct vfio_iommu_info. Caller sets argsz.
> + *
> + * XXX Should we do these by CHECK_EXTENSION too?
> + */
> +struct vfio_iommu_type1_info {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)       /* supported page sizes info */
> +       __u64   iova_pgsizes;           /* Bitmap of supported page sizes */
> +};
> +
> +#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +/**
> + * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_dma_map)
> + *
> + * Map process virtual addresses to IO virtual addresses using the
> + * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> + */
> +struct vfio_iommu_type1_dma_map {
> +       __u32   argsz;
> +       __u32   flags;
> +#define VFIO_DMA_MAP_FLAG_READ (1 << 0)                /* readable from device */
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)       /* writable from device */
> +       __u64   vaddr;                          /* Process virtual address */
> +       __u64   iova;                           /* IO virtual address */
> +       __u64   size;                           /* Size of mapping (bytes) */
> +};
> +
> +#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
> +
> +/**
> + * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 14, struct vfio_dma_unmap)
> + *
> + * Unmap IO virtual addresses using the provided struct vfio_dma_unmap.
> + * Caller sets argsz.
> + */
> +struct vfio_iommu_type1_dma_unmap {
> +       __u32   argsz;
> +       __u32   flags;
> +       __u64   iova;                           /* IO virtual address */
> +       __u64   size;                           /* Size of mapping (bytes) */
> +};
> +
> +#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
> +
> +#endif /* VFIO_H */
>
>
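
For readers following the argsz convention described in the header comment
above, here is a minimal sketch of how a handler could decide which fields
of a caller-supplied structure are safe to touch.  The struct is a local
stand-in with the same layout as struct vfio_region_info, OFFSETOFEND is a
portable rewrite of the patch's offsetofend() (no GNU statement
expression), and both helper functions are hypothetical names invented for
this illustration, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable variant of the patch's offsetofend(): no GNU statement expression. */
#define OFFSETOFEND(TYPE, MEMBER) \
    (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

/* Local stand-in with the same layout as struct vfio_region_info above. */
struct region_info {
    uint32_t argsz;   /* bytes the caller actually provided */
    uint32_t flags;
    uint32_t index;
    uint32_t resv;
    uint64_t size;
    uint64_t offset;
};

/* Smallest argsz a GET_REGION_INFO-style handler needs to read 'index'. */
static size_t region_info_min_argsz(void)
{
    return OFFSETOFEND(struct region_info, index);
}

/* Hypothetical check: may the handler fill in fields up to 'offset'? */
static int region_info_covers_offset(uint32_t argsz)
{
    assert(argsz >= sizeof(uint32_t)); /* argsz itself must have been readable */
    return argsz >= OFFSETOFEND(struct region_info, offset);
}
```

The point of comparing against the end offset of a member rather than
sizeof() of the whole struct is that fields appended later do not
invalidate callers built against the shorter layout.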
Alex Williamson July 27, 2012, 8:28 p.m. UTC | #16
On Fri, 2012-07-27 at 19:22 +0000, Blue Swirl wrote:
> On Wed, Jul 25, 2012 at 5:03 PM, Alex Williamson
> > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > new file mode 100644
> > index 0000000..e9ae421
> > --- /dev/null
> > +++ b/hw/vfio_pci.c
> > @@ -0,0 +1,2030 @@
> > +/*
> > + * vfio based device assignment support
> > + *
> > + * Copyright Red Hat, Inc. 2012
> > + *
> > + * Authors:
> > + *  Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This work is licensed under the terms of the GNU GPL, version 2.  See
> 
> GPLv2only?

Much of this is derived from KVM assignment, which is v2 only.
...
> > +
> > +/*
> > + * INTx
> > + */
> > +static inline void vfio_unmask_intx(VFIODevice *vdev)
> 
> 'inline' may be premature optimization.

ok

...
> > +
> > +struct vfio_irq_set_fd {
> > +    struct vfio_irq_set irq_set;
> > +    int32_t fd;
> > +} QEMU_PACKED;
> 
> Why is this structure not defined in kernel headers?

The kernel header defines struct vfio_irq_set, the data that follows is
specified by the flags and count values in that struct.  It's just
convenient for us to have a locally defined version that tacks on a
single fd since we use that a couple times.
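
To illustrate the general case that the single-fd wrapper is a shortcut
for, here is a rough sketch of building a header followed by 'count'
eventfds, as in the {fd1, -1, fd2} example from the header comment.  The
struct and flag values are local copies of what the quoted header defines
(this does not include the real kernel header), and
irq_set_alloc_eventfds() is a made-up helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-ins copied from the quoted header; not the real <linux/vfio.h>. */
#define IRQ_SET_DATA_EVENTFD   (1 << 2)
#define IRQ_SET_ACTION_TRIGGER (1 << 5)

struct irq_set {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t start;
    uint32_t count;
    uint8_t  data[];    /* 'count' int32_t eventfds follow for DATA_EVENTFD */
};

/*
 * Hypothetical helper: allocate a header plus 'count' int32_t eventfds.
 * The caller frees the result.  A -1 entry means "skip or de-assign".
 */
static struct irq_set *irq_set_alloc_eventfds(uint32_t index, uint32_t start,
                                              const int32_t *fds, uint32_t count)
{
    size_t argsz = sizeof(struct irq_set) + count * sizeof(int32_t);
    struct irq_set *set;

    assert(fds != NULL || count == 0);

    set = calloc(1, argsz);
    if (!set) {
        return NULL;
    }
    set->argsz = (uint32_t)argsz;   /* caller always reports the full size */
    set->flags = IRQ_SET_DATA_EVENTFD | IRQ_SET_ACTION_TRIGGER;
    set->index = index;
    set->start = start;
    set->count = count;
    if (count) {
        memcpy(set->data, fds, count * sizeof(int32_t));
    }
    return set;
}
```

With count = 1 this describes the same bytes as the packed local struct:
a 20-byte header followed by one 4-byte fd.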

> > +
> > +static void vfio_enable_intx_kvm(VFIODevice *vdev)
> > +{
> > +#ifdef CONFIG_KVM
> 
> These shouldn't be needed. The device will not be useful without KVM,
> so the file shouldn't be compiled for non-KVM case at all.

Actually we do support qemu-only device assignment, so this is an
accelerator.  I can't guarantee that TCG is going to do atomic memory
updates on control structures, which might make devices unhappy, but
otherwise it should work.  There's no way to get INTx EOIs in qemu-only,
but I hope that's a temporary problem.

...
> > +/*
> > + * Interrupt setup
> > + */
> > +static void vfio_disable_interrupts(VFIODevice *vdev)
> > +{
> > +    switch (vdev->interrupt) {
> > +    case INT_INTx:
> > +        vfio_disable_intx(vdev);
> > +        break;
> > +    case INT_MSI:
> > +        vfio_disable_msi_x(vdev, false);
> > +        break;
> > +    case INT_MSIX:
> > +        vfio_disable_msi_x(vdev, true);
> 
> I'd add 'break' here, maybe also to default cases earlier. That way if
> somebody adds code after this he won't introduce a fall through
> accidentally.

ok

...
> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> > +{
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    unsigned size = bar->size;
> > +    char name[64];
> > +
> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> 
> snprintf, please.

ok, and I'll fix all of them

> > +
> > +    /* A "slow" read/write mapping underlies all BARs */
> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> > +
> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> > +        return; /* IO space is only slow, don't expect high perf here */
> > +    }
> > +
> > +    if (size & ~TARGET_PAGE_MASK) {
> > +        error_report("%s is too small to mmap, this may affect performance.\n",
> > +                     name);
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> > +     * potentially insert a direct-mapped subregion before and after it.
> > +     */
> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> > +    }
> > +
> > +    if (size) {
> > +        strcat(name, " mmap");
> 
> strncat + trailing NUL.

ok, assume you mean strncat(name, " mmap", sizeof(name) - strlen(name) - 1).
I'll fix all of these too.
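
For reference, a small standalone sketch of the bounded version of both
calls.  bar_name() is a hypothetical helper written for this example, not
a function in the patch; it also reports truncation, which the original
code did not need to consider.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a BAR name, then append a suffix without
 * overrunning 'buf'.  Returns 0 on success, -1 if anything would be cut.
 */
static int bar_name(char *buf, size_t len, unsigned domain, unsigned bus,
                    unsigned slot, unsigned fn, int nr, const char *suffix)
{
    int n;
    size_t used;

    assert(buf != NULL && len > 0);

    n = snprintf(buf, len, "VFIO %04x:%02x:%02x.%x BAR %d",
                 domain, bus, slot, fn, nr);
    if (n < 0 || (size_t)n >= len) {
        return -1;              /* base name did not fit */
    }

    used = (size_t)n;
    if (strlen(suffix) > len - used - 1) {
        return -1;              /* suffix would be cut short */
    }
    /* strncat always writes a terminating NUL after the appended bytes */
    strncat(buf, suffix, len - used - 1);
    return 0;
}
```

The third argument to strncat() is the space left for new characters, not
the size of the buffer, which is why the "- strlen(name) - 1" matters.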

...
> > +static int vfio_load_rom(VFIODevice *vdev)
> > +{
> > +    uint64_t size = vdev->rom_size;
> > +    const VMStateDescription *vmsd;
> > +    char name[32];
> > +    off_t off = 0, voff = vdev->rom_offset;
> > +    ssize_t bytes;
> > +    void *ptr;
> > +
> > +    /* If loading ROM from file, pci handles it */
> > +    if (vdev->pdev.romfile || !vdev->pdev.rom_bar || !size)
> 
> Braces.

d'oh

...
> > +
> > +static int __vfio_get_device(VFIOGroup *group,
> > +                             const char *name, VFIODevice *vdev)
> 
> Please remove leading underscores.

yup, rolled this into the caller since it's only used once.

...
> > diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
> > new file mode 100644
> > index 0000000..9fb27ce
> > --- /dev/null
> > +++ b/hw/vfio_pci.h
> > @@ -0,0 +1,100 @@
> > +#ifndef __VFIO_H__
> > +#define __VFIO_H__
> 
> HW_VFIO_PCI_H

ok

Thanks for the review!

Alex
Alexey Kardashevskiy July 28, 2012, 2:55 a.m. UTC | #17
On 28/07/12 05:22, Blue Swirl wrote:
> On Wed, Jul 25, 2012 at 5:03 PM, Alex Williamson
>> +
>> +static void vfio_enable_intx_kvm(VFIODevice *vdev)
>> +{
>> +#ifdef CONFIG_KVM
>
> These shouldn't be needed. The device will not be useful without KVM,
> so the file shouldn't be compiled for non-KVM case at all.

It compiles without --enable-kvm and works in ppc64 full emulation though.
Not very practical, but it can still be used for debugging ppc64 drivers on
real PCI hardware on an x86 machine.
Avi Kivity July 29, 2012, 1:47 p.m. UTC | #18
On 07/26/2012 08:40 PM, Alex Williamson wrote:
> On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
>> On 07/25/2012 08:03 PM, Alex Williamson wrote:
>> 
>> > +/*
>> > + * Resource setup
>> > + */
>> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
>> > +{
>> > +    VFIOBAR *bar = &vdev->bars[nr];
>> > +    uint64_t size;
>> > +
>> > +    if (!memory_region_size(&bar->mem)) {
>> > +        return;
>> > +    }
> 
> This one is the "slow" mapped MemoryRegion.  If there's nothing here,
> the BAR isn't populated.
> 
>> > +
>> > +    size = memory_region_size(&bar->mmap_mem);
>> > +    if (size) {
>> > +         memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
>> > +         munmap(bar->mmap, size);
>> > +    }
> 
> This is the direct mapped MemoryRegion that potentially overlays the
> "slow" mapping above for MMIO BARs of sufficient alignment.  If the BAR
> includes the MSI-X vector table, this maps the region in front of the
> table

If the region size is zero, then both memory_region_del_subregion()
(assuming the region is parented) and munmap() do nothing.  So you could
call this unconditionally.

> 
>> > +
>> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
>> > +        size = memory_region_size(&vdev->msix->mmap_mem);
>> > +        if (size) {
>> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> > +            munmap(vdev->msix->mmap, size);
>> > +        }
>> > +    }
> 
> And this one potentially unmaps the overlap after the vector table if
> there's any space for one.
> 
>> Are the three size checks needed? Everything should work without them
>> from the memory core point of view.
> 
> I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> NULL... no?

NULL isn't the problem (well some kernels protect against mmaping NULL
to avoid kernel exploits), but it seems the kernel doesn't like a zero
length.

>> > +
>> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
>> > +{
>> > +    VFIOBAR *bar = &vdev->bars[nr];
>> > +    unsigned size = bar->size;
>> > +    char name[64];
>> > +
>> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
>> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
>> > +
>> > +    /* A "slow" read/write mapping underlies all BARs */
>> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
>> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
>> 
>> So far all container BARs have been pure containers, without RAM or I/O
>> callbacks.  It should all work, but this sets precedent and requires it
>> to work.  I guess there's no problem supporting it though.
> 
> KVM device assignment already makes use of this as well, if I understand
> correctly.

Okay.

> 
>> > +
>> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
>> > +        return; /* IO space is only slow, don't expect high perf here */
>> > +    }
>> 
>> What about non-x86 where IO is actually memory?  I think you can drop
>> this and let the address space filtering in the listener drop it if it
>> turns out to be in IO space.
> 
> They're probably saying "What's I/O port space?" ;)  Yeah, there may be
> some room to do more here, but no need until we have something that can
> make use of it. 

Most likely all that is needed is to drop the test.

> Note that these are the BAR mappings, which turn into
> MemoryRegions, so I'm not sure what the listener has to do with
> filtering these just yet.

+static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+{
+    return (section->address_space != get_system_memory() ||
+            !memory_region_is_ram(section->mr));
+}

Or the filter argument to memory_listener_register() (which you use --
you can drop the first check above).  On x86 those I/O regions will be
filtered out, on non-x86 with a properly-wired chipset emulation they'll
be passed to vfio (and kvm).

> 
>> > +
>> > +    if (size & ~TARGET_PAGE_MASK) {
>> > +        error_report("%s is too small to mmap, this may affect performance.\n",
>> > +                     name);
>> > +        return;
>> > +    }
>> 
>> We can work a little harder and align the host space offset with the
>> guest space offset, and map it in.
> 
> That's actually pretty involved, requiring shifting the device in the
> host address space and potentially adjust port and bridge apertures to
> enable room for the device.  Not to mention that it assumes accessing
> dead space between device regions is no harm, no foul.  True on x86 now,
> but wasn't true on HP ia64 chipsets and I suspect some other platforms.

Are sub-4k BARs common?  I expect only on older cards.

> 
>> > +
>> > +    /*
>> > +     * We can't mmap areas overlapping the MSIX vector table, so we
>> > +     * potentially insert a direct-mapped subregion before and after it.
>> > +     */
>> 
>> This splitting is what the memory core really enjoys.  You can just
>> place the MSIX page over the RAM page and let it do the cut-n-paste.
> 
> Sure, but VFIO won't allow us to mmap over the MSI-X table for security
> reasons.  It might be worthwhile to someday make VFIO insert an
> anonymous page over the MSI-X table to allow this, but it didn't look
> trivial for my novice mm abilities.  Easy to add a flag from the VFIO
> kernel structure where we learn about this BAR if we add it in the
> future.

I meant do it purely in qemu.  Instead of an emulated region overlaid
by two assigned regions, have an assigned region overlaid by the
emulated region.  The regions seen by the vfio listener will be the same.
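
Whichever way the regions are layered, the directly mapped ranges come out
as "whole pages before the table" and "whole pages after it".  A rough
sketch of that arithmetic, assuming 4 KiB pages and the 16-byte MSI-X
table entries from the PCI spec; msix_split() and struct range are names
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u            /* assumption: 4 KiB target pages */
#define PAGE_MASK_ASSUMED (~(uint64_t)(PAGE_SIZE_ASSUMED - 1))
#define MSIX_ENTRY_BYTES  16u              /* one MSI-X table entry */

struct range {
    uint64_t start;
    uint64_t len;
};

/*
 * Hypothetical helper: compute the page-aligned ranges of a BAR that lie
 * entirely before and entirely after the MSI-X vector table.
 */
static void msix_split(uint64_t bar_size, uint64_t table_offset,
                       uint32_t entries, struct range *head,
                       struct range *tail)
{
    uint64_t table_end = table_offset + (uint64_t)entries * MSIX_ENTRY_BYTES;
    uint64_t tail_start;

    assert(table_end <= bar_size);

    /* Round the table start down to a page boundary. */
    head->start = 0;
    head->len = table_offset & PAGE_MASK_ASSUMED;

    /* Round the table end up to a page boundary. */
    tail_start = (table_end + PAGE_SIZE_ASSUMED - 1) & PAGE_MASK_ASSUMED;
    tail->start = tail_start;
    tail->len = tail_start < bar_size ? bar_size - tail_start : 0;
}
```

A zero length in either range simply means there is nothing to map
directly on that side of the table.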

>> > +
>> > +
>> > +static Property vfio_pci_dev_properties[] = {
>> > +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
>> > +    //TODO - support passed fds... is this necessary?
>> 
>> Yes.
> 
> This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
> an instance of a container in the kernel.  A group can only be attached
> to one container.  So whoever calls us with passed fds needs to track
> this very carefully.  This is also why I've dropped any kind of shared
> IOMMU option to give us a hint whether to try to cram everything in the
> same container (~= iommu domain).  It's too easy to pass conflicting
> info to share a container for one device, but not another... yet they
> may be in the same group.  I'll work on the fd passing though and try to
> come up with a reasonable model.

I didn't really follow the container stuff so I can't comment here.  But
suppose all assigned devices are done via fd passing, isn't it
sufficient to just pass the fd for the device (and keep the iommu group
fd in the management tool)?

>> > +
>> > +
>> > +typedef struct MSIVector {
>> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
>> > +    struct VFIODevice *vdev; /* back pointer to device */
>> > +    int vector; /* the vector number for this element */
>> > +    int virq; /* KVM irqchip route for Qemu bypass */
>> 
>> This calls for an abstraction (don't we have a cache where we look those
>> up?)
> 
> I haven't seen one, pointer?  I tried to follow vhost's lead here.

See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.

> 
>> > +    bool use;
>> > +} MSIVector;
>> > +
>> > +
>> > +typedef struct VFIOContainer {
>> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> > +    struct {
>> > +        /* enable abstraction to support various iommu backends */
>> > +        union {
>> > +            MemoryListener listener; /* Used by type1 iommu */
>> > +        };
>> 
>> The usual way is to have a Type1VFIOContainer deriving from
>> VFIOContainer and adding a MemoryListener.
> 
> Yep, that would work too.  It gets a bit more complicated that way
> though because we have to know when the container is allocated what type
> it's going to be.  This way we can step through possible iommu types and
> support the right one.  Eventually there may be more than one type
> supported on the same platform (ex. one that enables PRI).  Do-able, but
> I'm not sure it's worth it at this point.

An alternative alternative is to put a pointer to an abstract type here,
then you can defer the decision on the concrete type later.  But I agree
it's not worth it at this point.  Maybe just drop the union and decide
later when a second iommu type is added.
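
Roughly what the pointer-to-abstract-type variant could look like, as a
sketch only.  None of these names exist in the patch: struct
iommu_backend, its probe/release callbacks, and container_pick_backend()
are all hypothetical, and the "type1" probe is a placeholder rather than a
real CHECK_EXTENSION call.

```c
#include <assert.h>
#include <stddef.h>

struct container;

/* Hypothetical per-backend operations; the names are made up here. */
struct iommu_backend {
    const char *name;
    int  (*probe)(struct container *c);    /* 0 if this backend can be used */
    void (*release)(struct container *c);
};

struct container {
    int fd;
    const struct iommu_backend *backend;   /* chosen after allocation */
    int released;
};

/* Placeholder probe: a real one would query the container fd. */
static int type1_probe(struct container *c)
{
    return c->fd >= 0 ? 0 : -1;
}

static void type1_release(struct container *c)
{
    c->released = 1;
}

static const struct iommu_backend type1_backend = {
    "type1", type1_probe, type1_release
};

/* Step through the candidates and keep the first one that probes OK. */
static int container_pick_backend(struct container *c,
                                  const struct iommu_backend *const *list,
                                  size_t n)
{
    size_t i;

    assert(c != NULL);
    for (i = 0; i < n; i++) {
        if (list[i]->probe(c) == 0) {
            c->backend = list[i];
            return 0;
        }
    }
    return -1;
}
```

This keeps the "allocate first, decide the type later" property of the
union while leaving room for a second backend in the list.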

> 
>> > +        void (*release)(struct VFIOContainer *);
>> > +    } iommu_data;
>> > +    QLIST_HEAD(, VFIOGroup) group_list;
>> > +    QLIST_ENTRY(VFIOContainer) next;
>> > +} VFIOContainer;
>> > +
>> > +
>> > +#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
>> 
>> Use the exported file, that gets rid of the __KERNEL__ bits.
> 
> Oh?  How do I generate that aside from just deleting lines?  Thanks!
> 

make headers_install
Alex Williamson July 30, 2012, 10:29 p.m. UTC | #19
On Sun, 2012-07-29 at 16:47 +0300, Avi Kivity wrote:
> On 07/26/2012 08:40 PM, Alex Williamson wrote:
> > On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
> >> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> >> 
> >> > +/*
> >> > + * Resource setup
> >> > + */
> >> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> >> > +{
> >> > +    VFIOBAR *bar = &vdev->bars[nr];
> >> > +    uint64_t size;
> >> > +
> >> > +    if (!memory_region_size(&bar->mem)) {
> >> > +        return;
> >> > +    }
> > 
> > This one is the "slow" mapped MemoryRegion.  If there's nothing here,
> > the BAR isn't populated.
> > 
> >> > +
> >> > +    size = memory_region_size(&bar->mmap_mem);
> >> > +    if (size) {
> >> > +         memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> >> > +         munmap(bar->mmap, size);
> >> > +    }
> > 
> > This is the direct mapped MemoryRegion that potentially overlays the
> > "slow" mapping above for MMIO BARs of sufficient alignment.  If the BAR
> > includes the MSI-X vector table, this maps the region in front of the
> > table
> 
> If the region size is zero, then both memory_region_del_subregion()
> (assuming the region is parented) and munmap() do nothing.  So you could
> call this unconditionally.

I suppose parenting them is the key.  I'm counting on memory_region_size
of zero for an uninitialized, g_malloc0() MemoryRegion.  Initializing
them just to have a parent so we can unconditionally remove them here
seems like it's just shifting complexity from one function to another.
The majority of BARs aren't even implemented, so we'd actually be
setting up a lot of dummy infrastructure for a slightly cleaner unmap
function.  I'll keep looking at this, but I'm not optimistic there's an
overall simplification here.
 
> >> > +
> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> >> > +        if (size) {
> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> >> > +            munmap(vdev->msix->mmap, size);
> >> > +        }
> >> > +    }
> > 
> > And this one potentially unmaps the overlap after the vector table if
> > there's any space for one.
> > 
> >> Are the three size checks needed? Everything should work without them
> >> from the memory core point of view.
> > 
> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> > NULL... no?
> 
> NULL isn't the problem (well some kernels protect against mmaping NULL
> to avoid kernel exploits), but it seems the kernel doesn't like a zero
> length.

in mm/mmap.c:do_munmap() I see:

        if ((len = PAGE_ALIGN(len)) == 0)
                return -EINVAL;

Before anything scary happens, so that should be ok.  It's not really
worthwhile to call the munmaps unconditionally if we already have the
condition tests because the subregions are unparented though.

> >> > +
> >> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> >> > +{
> >> > +    VFIOBAR *bar = &vdev->bars[nr];
> >> > +    unsigned size = bar->size;
> >> > +    char name[64];
> >> > +
> >> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> >> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> >> > +
> >> > +    /* A "slow" read/write mapping underlies all BARs */
> >> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> >> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> >> 
> >> So far all container BARs have been pure containers, without RAM or I/O
> >> callbacks.  It should all work, but this sets precedent and requires it
> >> to work.  I guess there's no problem supporting it though.
> > 
> > KVM device assignment already makes use of this as well, if I understand
> > correctly.
> 
> Okay.
> 
> > 
> >> > +
> >> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> >> > +        return; /* IO space is only slow, don't expect high perf here */
> >> > +    }
> >> 
> >> What about non-x86 where IO is actually memory?  I think you can drop
> >> this and let the address space filtering in the listener drop it if it
> >> turns out to be in IO space.
> > 
> > They're probably saying "What's I/O port space?" ;)  Yeah, there may be
> > some room to do more here, but no need until we have something that can
> > make use of it. 
> 
> Most likely all that is needed is to drop the test.
> 
> > Note that these are the BAR mappings, which turn into
> > MemoryRegions, so I'm not sure what the listener has to do with
> > filtering these just yet.
> 
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    return (section->address_space != get_system_memory() ||
> +            !memory_region_is_ram(section->mr));
> +}
> 
> Or the filter argument to memory_listener_register() (which you use --
> you can drop the first check above).  On x86 those I/O regions will be
> filtered out, on non-x86 with a properly-wired chipset emulation they'll
> be passed to vfio (and kvm).

Ah, I see what you're going for now.  Sure, I can store/test the vfio
region info flags somewhere to tell me whether vfio supports mmap of the
region instead of assuming mem = mmap, io = ops.  I'll re-evaluate
unconditional removal after that.

> >> > +
> >> > +    if (size & ~TARGET_PAGE_MASK) {
> >> > +        error_report("%s is too small to mmap, this may affect performance.\n",
> >> > +                     name);
> >> > +        return;
> >> > +    }
> >> 
> >> We can work a little harder and align the host space offset with the
> >> guest space offset, and map it in.
> > 
> > That's actually pretty involved, requiring shifting the device in the
> > host address space and potentially adjust port and bridge apertures to
> > enable room for the device.  Not to mention that it assumes accessing
> > dead space between device regions is no harm, no foul.  True on x86 now,
> > but wasn't true on HP ia64 chipsets and I suspect some other platforms.
> 
> Are sub-4k BARs common?  I expect only on older cards.

Correct, mostly older devices.  I don't think I've seen any "high
performance" devices with this problem.

> >> > +
> >> > +    /*
> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> >> > +     * potentially insert a direct-mapped subregion before and after it.
> >> > +     */
> >> 
> >> This splitting is what the memory core really enjoys.  You can just
> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
> > 
> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
> > reasons.  It might be worthwhile to someday make VFIO insert an
> > anonymous page over the MSI-X table to allow this, but it didn't look
> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
> > kernel structure where we learn about this BAR if we add it in the
> > future.
> 
> I meant do it purely in qemu.  Instead of an emulated region overlaid
> by two assigned regions, have an assigned region overlaid by the
> emulated region.  The regions seen by the vfio listener will be the same.

Sure, that's what KVM device assignment does, but it requires being able
to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
can't assume userspace isn't malicious so it has to prevent this.

> >> > +
> >> > +
> >> > +static Property vfio_pci_dev_properties[] = {
> >> > +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> >> > +    //TODO - support passed fds... is this necessary?
> >> 
> >> Yes.
> > 
> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
> > an instance of a container in the kernel.  A group can only be attached
> > to one container.  So whoever calls us with passed fds needs to track
> > this very carefully.  This is also why I've dropped any kind of shared
> > IOMMU option to give us a hint whether to try to cram everything in the
> > same container (~= iommu domain).  It's too easy to pass conflicting
> > info to share a container for one device, but not another... yet they
> > may be in the same group.  I'll work on the fd passing though and try to
> > come up with a reasonable model.
> 
> I didn't really follow the container stuff so I can't comment here.  But
> suppose all assigned devices are done via fd passing, isn't it
> sufficient to just pass the fd for the device (and keep the iommu group
> fd in the management tool)?

Nope.

containerfd = open(/dev/vfio/vfio)
groupfd = open(/dev/vfio/$GROUPID)
devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)

The container provides access to the iommu, the group is the unit of
ownership and privilege, and a device cannot be accessed without iommu
protection.  Therefore to get to a devicefd, we first need to privilege
the container by attaching a group to it, that lets us initialize the
iommu, which allows us to get the device fd.  At a minimum, we'd need
both container and device fds, which means libvirt would be responsible
for determining what type of iommu interface to initialize.  Doing that
makes adding another device tenuous.  It's not impossible, but VFIO is
designed such that /dev/vfio/vfio is completely harmless on its own, safe
for mode 0666 access, just like /dev/kvm.  The groupfd is the important
access point, so maybe it's sufficient that libvirt could pass only that
and let qemu open /dev/vfio/vfio on its own.  The only problem then is
that libvirt needs to pass the same groupfd for each device that gets
assigned within a group.

> >> > +
> >> > +
> >> > +typedef struct MSIVector {
> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> >> > +    struct VFIODevice *vdev; /* back pointer to device */
> >> > +    int vector; /* the vector number for this element */
> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
> >> 
> >> This calls for an abstraction (don't we have a cache where we look those
> >> up?)
> > 
> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
> 
> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.

Right, the irqfd is what we're really after.
 
> >> > +    bool use;
> >> > +} MSIVector;
> >> > +
> >> > +
> >> > +typedef struct VFIOContainer {
> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >> > +    struct {
> >> > +        /* enable abstraction to support various iommu backends */
> >> > +        union {
> >> > +            MemoryListener listener; /* Used by type1 iommu */
> >> > +        };
> >> 
> >> The usual way is to have a Type1VFIOContainer deriving from
> >> VFIOContainer and adding a MemoryListener.
> > 
> > Yep, that would work too.  It gets a bit more complicated that way
> > though because we have to know when the container is allocated what type
> > it's going to be.  This way we can step through possible iommu types and
> > support the right one.  Eventually there may be more than one type
> > supported on the same platform (ex. one that enables PRI).  Do-able, but
> > I'm not sure it's worth it at this point.
> 
> An alternative alternative is to put a pointer to an abstract type here,
> then you can defer the decision on the concrete type later.  But I agree
> it's not worth it at this point.  Maybe just drop the union and decide
> later when a second iommu type is added.

A pointer doesn't allow us to use container_of to get back to the
VFIOContainer from the memory listener callback, so we'd have to create
some new struct just to hold that back pointer.  Alexey's proposed POWER
support for VFIO already makes use of the union, so it seems like a
sufficient solution for now.  We'll have to re-evaluate if it's getting
unwieldy after we get a few though.

> >> > +        void (*release)(struct VFIOContainer *);
> >> > +    } iommu_data;
> >> > +    QLIST_HEAD(, VFIOGroup) group_list;
> >> > +    QLIST_ENTRY(VFIOContainer) next;
> >> > +} VFIOContainer;
> >> > +
> >> > +
> >> > +#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
> >> 
> >> Use the exported file, that gets rid of the __KERNEL__ bits.
> > 
> > Oh?  How do I generate that aside from just deleting lines?  Thanks!
> > 
> 
> make headers_install

Thanks!

Alex
Avi Kivity July 31, 2012, 12:34 p.m. UTC | #20
On 07/31/2012 01:29 AM, Alex Williamson wrote:
>> 
>> If the region size is zero, then both memory_region_del_subregion()
>> (assuming the region is parented) and munmap() do nothing.  So you could
>> call this unconditionally.
> 
> I suppose parenting them is the key.  I'm counting on memory_region_size
> of zero for an uninitialized, g_malloc0() MemoryRegion.

That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
the contents by xoring with a private variable.

>  Initializing
> them just to have a parent so we can unconditionally remove them here
> seems like it's just shifting complexity from one function to another.
> The majority of BARs aren't even implemented, so we'd actually be
> setting up a lot of dummy infrastructure for a slightly cleaner unmap
> function.  I'll keep looking at this, but I'm not optimistic there's an
> overall simplification here.

Ok.  But use your own bool, don't overload something from MemoryRegion.


>  
>> >> > +
>> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
>> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
>> >> > +        if (size) {
>> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
>> >> > +            munmap(vdev->msix->mmap, size);
>> >> > +        }
>> >> > +    }
>> > 
>> > And this one potentially unmaps the overlap after the vector table if
>> > there's any space for one.
>> > 
>> >> Are the three size checks needed? Everything should work without them
>> >> from the memory core point of view.
>> > 
>> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
>> > NULL... no?
>> 
>> NULL isn't the problem (well some kernels protect against mmaping NULL
>> to avoid kernel exploits), but it seems the kernel doesn't like a zero
>> length.
> 
> in mm/mmap.c:do_munmap() I see:
> 
>         if ((len = PAGE_ALIGN(len)) == 0)
>                 return -EINVAL;
> 
> Before anything scary happens, so that should be ok.  It's not really
> worthwhile to call the munmaps unconditionally if we already have the
> condition tests because the subregions are unparented though.

Yeah.

> 
>> >> > +
>> >> > +    /*
>> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
>> >> > +     * potentially insert a direct-mapped subregion before and after it.
>> >> > +     */
>> >> 
>> >> This splitting is what the memory core really enjoys.  You can just
>> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
>> > 
>> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
>> > reasons.  It might be worthwhile to someday make VFIO insert an
>> > anonymous page over the MSI-X table to allow this, but it didn't look
>> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
>> > kernel structure where we learn about this BAR if we add it in the
>> > future.
>> 
>> I meant do it purely in qemu.  Instead of an emulated region overlaid
>> by two assigned regions, have an assigned region overlaid by the
>> emulated region.  The regions seen by the vfio listener will be the same.
> 
> Sure, that's what KVM device assignment does, but it requires being able
> to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
> can't assume userspace isn't malicious so it has to prevent this.

I wonder whether it should prevent the mmap(), or let it through and just
SIGBUS on accesses.

>> > 
>> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
>> > an instance of a container in the kernel.  A group can only be attached
>> > to one container.  So whoever calls us with passed fds needs to track
>> > this very carefully.  This is also why I've dropped any kind of shared
>> > IOMMU option to give us a hint whether to try to cram everything in the
>> > same container (~= iommu domain).  It's too easy to pass conflicting
>> > info to share a container for one device, but not another... yet they
>> > may be in the same group.  I'll work on the fd passing though and try to
>> > come up with a reasonable model.
>> 
>> I didn't really follow the container stuff so I can't comment here.  But
>> suppose all assigned devices are done via fd passing, isn't it
>> sufficient to just pass the fd for the device (and keep the iommu group
>> fd in the management tool)?
> 
> Nope.
> 
> containerfd = open(/dev/vfio/vfio)
> groupfd = open(/dev/vfio/$GROUPID)
> devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
> 
> The container provides access to the iommu, the group is the unit of
> ownership and privilege, and a device cannot be accessed without iommu
> protection.  Therefore to get to a devicefd, we first need to privilege
> the container by attaching a group to it, that lets us initialize the
> iommu, which allows us to get the device fd.  At a minimum, we'd need
> both container and device fds, which means libvirt would be responsible
> for determining what type of iommu interface to initialize.  Doing that
> makes adding another device tenuous.  It's not impossible, but VFIO is
> designed such that /dev/vfio/vfio is completely harmless on its own, safe
> for mode 0666 access, just like /dev/kvm.  The groupfd is the important
> access point, so maybe it's sufficient that libvirt could pass only that
> and let qemu open /dev/vfio/vfio on its own.  The only problem then is
> that libvirt needs to pass the same groupfd for each device that gets
> assigned within a group.

What I was thinking was that libvirt would do all the setup, including
attaching the iommu, then pass something that is safe to qemu.  I don't
see an issue with libvirt keeping track of groups; libvirt is supposed
to be doing the host-side management anyway.  But I'm not familiar with
the API so I guess it can't be done.  Maybe an extension?

> 
>> >> > +
>> >> > +
>> >> > +typedef struct MSIVector {
>> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
>> >> > +    struct VFIODevice *vdev; /* back pointer to device */
>> >> > +    int vector; /* the vector number for this element */
>> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
>> >> 
>> >> This calls for an abstraction (don't we have a cache where we look those
>> >> up?)
>> > 
>> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
>> 
>> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
> 
> Right, the irqfd is what we're really after.

Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
creates an irqfd if available, or emulates it by adding a listener to
that eventfd and injecting the interrupt (either through tcg or kvm) itself.

>  
>> >> > +    bool use;
>> >> > +} MSIVector;
>> >> > +
>> >> > +
>> >> > +typedef struct VFIOContainer {
>> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
>> >> > +    struct {
>> >> > +        /* enable abstraction to support various iommu backends */
>> >> > +        union {
>> >> > +            MemoryListener listener; /* Used by type1 iommu */
>> >> > +        };
>> >> 
>> >> The usual way is to have a Type1VFIOContainer deriving from
>> >> VFIOContainer and adding a MemoryListener.
>> > 
>> > Yep, that would work too.  It gets a bit more complicated that way
>> > though because we have to know when the container is allocated what type
>> > it's going to be.  This way we can step through possible iommu types and
>> > support the right one.  Eventually there may be more than one type
>> > supported on the same platform (ex. one that enables PRI).  Do-able, but
>> > I'm not sure it's worth it at this point.
>> 
>> An alternative alternative is to put a pointer to an abstract type here,
>> then you can defer the decision on the concrete type later.  But I agree
>> it's not worth it at this point.  Maybe just drop the union and decide
>> later when a second iommu type is added.
> 
> A pointer doesn't allow us to use container_of to get back to the
> VFIOContainer from the memory listener callback, so we'd have to create
> some new struct just to hold that back pointer.  Alexey's proposed POWER
> support for VFIO already makes use of the union, so it seems like a
> sufficient solution for now.  We'll have to re-evaluate if it's getting
> unwieldy after we get a few though.

Ok.
Alex Williamson July 31, 2012, 4:56 p.m. UTC | #21
On Tue, 2012-07-31 at 15:34 +0300, Avi Kivity wrote:
> On 07/31/2012 01:29 AM, Alex Williamson wrote:
> >> 
> >> If the region size is zero, then both memory_region_del_subregion()
> >> (assuming the region is parented) and munmap() do nothing.  So you could
> >> call this unconditionally.
> > 
> > I suppose parenting them is the key.  I'm counting on memory_region_size
> > of zero for an uninitialized, g_malloc0() MemoryRegion.
> 
> That's a no-no.  We have APIs for a reason.  Maybe I'll start encrypting
> the contents by xoring with a private variable.

Heh.  Ok, I thought it was part of the API.  I'll refrain from using
that test on uninitialized MemoryRegions.

> >  Initializing
> > them just to have a parent so we can unconditionally remove them here
> > seems like it's just shifting complexity from one function to another.
> > The majority of BARs aren't even implemented, so we'd actually be
> > setting up a lot of dummy infrastructure for a slightly cleaner unmap
> > function.  I'll keep looking at this, but I'm not optimistic there's an
> > overall simplification here.
> 
> Ok.  But use your own bool, don't overload something from MemoryRegion.

Yup.

> >  
> >> >> > +
> >> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> >> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> >> >> > +        if (size) {
> >> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> >> >> > +            munmap(vdev->msix->mmap, size);
> >> >> > +        }
> >> >> > +    }
> >> > 
> >> > And this one potentially unmaps the overlap after the vector table if
> >> > there's any space for one.
> >> > 
> >> >> Are the three size checks needed? Everything should work without them
> >> >> from the memory core point of view.
> >> > 
> >> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> >> > NULL... no?
> >> 
> >> NULL isn't the problem (well some kernels protect against mmaping NULL
> >> to avoid kernel exploits), but it seems the kernel doesn't like a zero
> >> length.
> > 
> > in mm/mmap.c:do_munmap() I see:
> > 
> >         if ((len = PAGE_ALIGN(len)) == 0)
> >                 return -EINVAL;
> > 
> > Before anything scary happens, so that should be ok.  It's not really
> > worthwhile to call the munmaps unconditionally if we already have the
> > condition tests because the subregions are unparented though.
> 
> Yeah.
> 
> > 
> >> >> > +
> >> >> > +    /*
> >> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> >> >> > +     * potentially insert a direct-mapped subregion before and after it.
> >> >> > +     */
> >> >> 
> >> >> This splitting is what the memory core really enjoys.  You can just
> >> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
> >> > 
> >> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
> >> > reasons.  It might be worthwhile to someday make VFIO insert an
> >> > anonymous page over the MSI-X table to allow this, but it didn't look
> >> > trivial for my novice mm abilities.  Easy to add a flag from the VFIO
> >> > kernel structure where we learn about this BAR if we add it in the
> >> > future.
> >> 
> >> I meant do it purely in qemu.  Instead of an emulated region overlaid
> >> by two assigned regions, have an assigned region overlaid by the
> >> emulated region.  The regions seen by the vfio listener will be the same.
> > 
> > Sure, that's what KVM device assignment does, but it requires being able
> > to mmap the whole BAR, including an MSI-X table.  The VFIO kernel side
> > can't assume userspace isn't malicious so it has to prevent this.
> 
> I wonder whether it should prevent the mmap(), or let it through and just
> SIGBUS on accesses.

That's a good idea too, maybe better than just an anonymous page so the
user knows the access doesn't get to hardware.  This is definitely an
improvement I'd like to see as we go.

> >> > 
> >> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
> >> > an instance of a container in the kernel.  A group can only be attached
> >> > to one container.  So whoever calls us with passed fds needs to track
> >> > this very carefully.  This is also why I've dropped any kind of shared
> >> > IOMMU option to give us a hint whether to try to cram everything in the
> >> > same container (~= iommu domain).  It's too easy to pass conflicting
> >> > info to share a container for one device, but not another... yet they
> >> > may be in the same group.  I'll work on the fd passing though and try to
> >> > come up with a reasonable model.
> >> 
> >> I didn't really follow the container stuff so I can't comment here.  But
> >> suppose all assigned devices are done via fd passing, isn't it
> >> sufficient to just pass the fd for the device (and keep the iommu group
> >> fd in the management tool)?
> > 
> > Nope.
> > 
> > containerfd = open(/dev/vfio/vfio)
> > groupfd = open(/dev/vfio/$GROUPID)
> > devicefd  = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)
> > 
> > The container provides access to the iommu, the group is the unit of
> > ownership and privilege, and a device cannot be accessed without iommu
> > protection.  Therefore to get to a devicefd, we first need to privilege
> > the container by attaching a group to it, that lets us initialize the
> > iommu, which allows us to get the device fd.  At a minimum, we'd need
> > both container and device fds, which means libvirt would be responsible
> > for determining what type of iommu interface to initialize.  Doing that
> > makes adding another device tenuous.  It's not impossible, but VFIO is
> > designed such that /dev/vfio/vfio is completely harmless on its own, safe
> > for mode 0666 access, just like /dev/kvm.  The groupfd is the important
> > access point, so maybe it's sufficient that libvirt could pass only that
> > and let qemu open /dev/vfio/vfio on its own.  The only problem then is
> > that libvirt needs to pass the same groupfd for each device that gets
> > assigned within a group.
> 
> What I was thinking was that libvirt would do all the setup, including
> attaching the iommu, then pass something that is safe to qemu.  I don't
> see an issue with libvirt keeping track of groups; libvirt is supposed
> to be doing the host-side management anyway.  But I'm not familiar with
> the API so I guess it can't be done.  Maybe an extension?

It can be done, I think it will just be challenging to keep qemu and
libvirt in sync.  It means that libvirt has to know which iommu model to
use, when to create new containers, when to re-use old containers, etc.
For each qemu -device vfio-pci we'd need a containerfd and a devicefd
where the containerfd may or may not be the same as used for other
devices.  I've also been thinking that it would be nice if there was a
reset ioctl on the group to make it easier to handle things like p2p
bridges.  That would mean passing a groupfd as well, but then if libvirt
provides that, there's not much point in handing us the device fd
because it's not offering any protection (we can get it ourselves), which
also then makes passing the containerfd awkward because libvirt would
have to assume a 1:1 container:group model.  Passing the devicefd really
never offers protection because we know it's not isolated from other
devices in the group.  I guess I'm back to libvirt only passing the
groupfd and letting qemu handle the rest.

> >> >> > +
> >> >> > +
> >> >> > +typedef struct MSIVector {
> >> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> >> >> > +    struct VFIODevice *vdev; /* back pointer to device */
> >> >> > +    int vector; /* the vector number for this element */
> >> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
> >> >> 
> >> >> This calls for an abstraction (don't we have a cache where we look those
> >> >> up?)
> >> > 
> >> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
> >> 
> >> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.
> > 
> > Right, the irqfd is what we're really after.
> 
> Ok, I guess both vhost and vfio could use a qemu_irq_eventfd() which
> creates an irqfd if available, or emulates it by adding a listener to
> that eventfd and injecting the interrupt (either through tcg or kvm) itself.

Yeah, that would be a nice simplification.
  
> >> >> > +    bool use;
> >> >> > +} MSIVector;
> >> >> > +
> >> >> > +
> >> >> > +typedef struct VFIOContainer {
> >> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >> >> > +    struct {
> >> >> > +        /* enable abstraction to support various iommu backends */
> >> >> > +        union {
> >> >> > +            MemoryListener listener; /* Used by type1 iommu */
> >> >> > +        };
> >> >> 
> >> >> The usual way is to have a Type1VFIOContainer deriving from
> >> >> VFIOContainer and adding a MemoryListener.
> >> > 
> >> > Yep, that would work too.  It gets a bit more complicated that way
> >> > though because we have to know when the container is allocated what type
> >> > it's going to be.  This way we can step through possible iommu types and
> >> > support the right one.  Eventually there may be more than one type
> >> > supported on the same platform (ex. one that enables PRI).  Do-able, but
> >> > I'm not sure it's worth it at this point.
> >> 
> >> An alternative alternative is to put a pointer to an abstract type here,
> >> then you can defer the decision on the concrete type later.  But I agree
> >> it's not worth it at this point.  Maybe just drop the union and decide
> >> later when a second iommu type is added.
> > 
> > A pointer doesn't allow us to use container_of to get back to the
> > VFIOContainer from the memory listener callback, so we'd have to create
> > some new struct just to hold that back pointer.  Alexey's proposed POWER
> > support for VFIO already makes use of the union, so it seems like a
> > sufficient solution for now.  We'll have to re-evaluate if it's getting
> > unwieldy after we get a few though.
> 
> Ok.

Thanks!

Alex

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 30ed56d..68406a3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -460,6 +460,11 @@  M: Gerd Hoffmann <kraxel@redhat.com>
 S: Maintained
 F: hw/usb*
 
+VFIO
+M: Alex Williamson <alex.williamson@redhat.com>
+S: Supported
+F: hw/vfio*
+
 vhost
 M: Michael S. Tsirkin <mst@redhat.com>
 S: Supported
diff --git a/configure b/configure
index cef0a71..62d921e 100755
--- a/configure
+++ b/configure
@@ -143,6 +143,7 @@  attr=""
 libattr=""
 xfs=""
 
+vfio_pci="no"
 vhost_net="no"
 kvm="no"
 gprof="no"
@@ -489,6 +490,7 @@  Haiku)
   usb="linux"
   kvm="yes"
   vhost_net="yes"
+  vfio_pci="yes"
   if [ "$cpu" = "i386" -o "$cpu" = "x86_64" ] ; then
     audio_possible_drivers="$audio_possible_drivers fmod"
   fi
@@ -824,6 +826,10 @@  for opt do
   ;;
   --disable-guest-agent) guest_agent="no"
   ;;
+  --disable-vfio-pci) vfio_pci="no"
+  ;;
+  --enable-vfio-pci) vfio_pci="yes"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -1110,6 +1116,8 @@  echo "  --disable-guest-agent    disable building of the QEMU Guest Agent"
 echo "  --enable-guest-agent     enable building of the QEMU Guest Agent"
 echo "  --with-coroutine=BACKEND coroutine backend. Supported options:"
 echo "                           gthread, ucontext, sigaltstack, windows"
+echo "  --disable-vfio-pci       disable vfio pci device assignment support"
+echo "  --enable-vfio-pci        enable vfio pci device assignment support"
 echo ""
 echo "NOTE: The object files are built at the place where configure is launched"
 exit 1
@@ -3070,6 +3078,7 @@  echo "OpenGL support    $opengl"
 echo "libiscsi support  $libiscsi"
 echo "build guest agent $guest_agent"
 echo "coroutine backend $coroutine_backend"
+echo "VFIO PCI support  $vfio_pci"
 
 if test "$sdl_too_old" = "yes"; then
 echo "-> Your SDL version is too old - please upgrade to have SDL support"
@@ -3747,6 +3756,9 @@  case "$target_arch2" in
   *)
     echo "CONFIG_NO_XEN=y" >> $config_target_mak
 esac
+if test "$vfio_pci" = "yes" -a "$target_softmmu" = "yes" ; then
+  echo "CONFIG_VFIO_PCI=y" >> $config_target_mak
+fi
 case "$target_arch2" in
   i386|x86_64|ppcemb|ppc|ppc64|s390x)
     # Make sure the target and host cpus are compatible
diff --git a/hw/i386/Makefile.objs b/hw/i386/Makefile.objs
index 8c764bb..a2783ef 100644
--- a/hw/i386/Makefile.objs
+++ b/hw/i386/Makefile.objs
@@ -11,5 +11,6 @@  obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen-host-pci-device.o
 obj-$(CONFIG_XEN_PCI_PASSTHROUGH) += xen_pt.o xen_pt_config_init.o xen_pt_msi.o
 obj-y += kvm/
 obj-$(CONFIG_SPICE) += qxl.o qxl-logger.o qxl-render.o
+obj-$(CONFIG_VFIO_PCI) += vfio_pci.o
 
 obj-y := $(addprefix ../,$(obj-y))
diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
new file mode 100644
index 0000000..e9ae421
--- /dev/null
+++ b/hw/vfio_pci.c
@@ -0,0 +1,2030 @@ 
+/*
+ * vfio based device assignment support
+ *
+ * Copyright Red Hat, Inc. 2012
+ *
+ * Authors:
+ *  Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ * Based on qemu-kvm device-assignment:
+ *  Adapted for KVM by Qumranet.
+ *  Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com)
+ *  Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com)
+ *  Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com)
+ *  Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com)
+ *  Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com)
+ */
+
+#include <dirent.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/io.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <linux/vfio.h>
+
+#include "config.h"
+#include "event_notifier.h"
+#include "exec-memory.h"
+#include "hw.h"
+#include "kvm.h"
+#include "memory.h"
+#include "monitor.h"
+#include "msi.h"
+#include "msix.h"
+#include "notify.h"
+#include "pc.h"
+#include "qemu-error.h"
+#include "qemu-timer.h"
+#include "range.h"
+#include "vfio_pci.h"
+
+/* #define DEBUG_VFIO */
+#ifdef DEBUG_VFIO
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define MSIX_CAP_LENGTH 12
+
+static QLIST_HEAD(, VFIOContainer)
+    container_list = QLIST_HEAD_INITIALIZER(container_list);
+
+static QLIST_HEAD(, VFIOGroup)
+    group_list = QLIST_HEAD_INITIALIZER(group_list);
+
+static void vfio_disable_interrupts(VFIODevice *vdev);
+static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
+
+/*
+ * Common VFIO interrupt disable
+ */
+static void vfio_disable_irqindex(VFIODevice *vdev, int index)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+        .index = index,
+        .start = 0,
+        .count = 0,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+
+    vdev->interrupt = INT_NONE;
+}
+
+/*
+ * INTx
+ */
+static inline void vfio_unmask_intx(VFIODevice *vdev)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK,
+        .index = VFIO_PCI_INTX_IRQ_INDEX,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+static inline void vfio_mask_intx(VFIODevice *vdev)
+{
+    struct vfio_irq_set irq_set = {
+        .argsz = sizeof(irq_set),
+        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
+        .index = VFIO_PCI_INTX_IRQ_INDEX,
+        .start = 0,
+        .count = 1,
+    };
+
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+}
+
+static void vfio_intx_interrupt(void *opaque)
+{
+    VFIODevice *vdev = opaque;
+
+    if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) {
+        return;
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) Pin %c\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function,
+            'A' + vdev->intx.pin);
+
+    vdev->intx.pending = true;
+    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 1);
+}
+
+static void vfio_eoi(VFIODevice *vdev)
+{
+    if (!vdev->intx.pending) {
+        return;
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    vdev->intx.pending = false;
+    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
+    vfio_unmask_intx(vdev);
+}
+
+struct vfio_irq_set_fd {
+    struct vfio_irq_set irq_set;
+    int32_t fd;
+} QEMU_PACKED;
+
+static void vfio_enable_intx_kvm(VFIODevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct vfio_irq_set_fd irq_set_fd = {
+        .irq_set = {
+            .argsz = sizeof(irq_set_fd),
+            .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK,
+            .index = VFIO_PCI_INTX_IRQ_INDEX,
+            .start = 0,
+            .count = 1,
+        },
+    };
+    struct kvm_irqfd irqfd = {
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_LEVEL,
+    };
+    struct kvm_eoifd eoifd = {
+        .flags = KVM_EOIFD_FLAG_LEVEL_IRQFD,
+    };
+    int key;
+
+    if (vdev->intx.kvm_accel || !kvm_irqchip_in_kernel() ||
+        vdev->intx.route.mode == PCI_INTX_DISABLED ||
+        !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_LEVEL) ||
+        !kvm_check_extension(kvm_state, KVM_CAP_EOIFD_LEVEL_IRQFD)) {
+        return;
+    }
+
+    /*
+     * We've already got an eventfd for interrupt signals from VFIO
+     * into Qemu.  Plumb it into an IRQFD.
+     */
+    irqfd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
+
+    /*
+     * Get to a known state, not listening for interrupts, hardware
+     * masked, Qemu IRQ de-asserted.
+     */
+    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    /* TBD - Disable qemu eoi notifier */
+    vfio_mask_intx(vdev);
+    vdev->intx.pending = false;
+    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
+
+    /*
+     * Create a new eventfd to connect unmask signals from KVM EOIFD
+     * directly into VFIO.
+     */
+    if (event_notifier_init(&vdev->intx.unmask, 0)) {
+        error_report("vfio: Error: event_notifier_init failed for EOI\n");
+        goto fail;
+    }
+
+    /* Tell both KVM EOIFD and VFIO about this eventfd */
+    eoifd.fd = irq_set_fd.fd = event_notifier_get_fd(&vdev->intx.unmask);
+
+    /* IRQFD first sets up the level interrupt */
+    key = kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+    if (key < 0) {
+        error_report("vfio: Error: Failed to setup INTx irqfd: %s\n",
+                     strerror(errno));
+        goto fail;
+    }
+
+    /* Giving us a key that lets us configure the EOIFD */
+    eoifd.key = key;
+    if (kvm_vm_ioctl(kvm_state, KVM_EOIFD, &eoifd)) {
+        error_report("vfio: Error: Failed to setup INTx EOI: %s\n",
+                     strerror(errno));
+        goto fail_eoifd;
+    }
+
+    /* Finally configure the irqfd-like vfio mechanism for unmasks */
+    if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd)) {
+        error_report("vfio: Error: Failed to setup INTx unmask fd: %s\n",
+                     strerror(errno));
+        goto fail_vfio;
+    }
+
+    /* Let 'em rip */
+    vfio_unmask_intx(vdev);
+
+    vdev->intx.kvm_accel = true;
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel enabled\n",
+            __FUNCTION__, vdev->host.domain, vdev->host.bus,
+            vdev->host.slot, vdev->host.function);
+
+    return;
+
+fail_vfio:
+    eoifd.flags |= KVM_EOIFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_EOIFD, &eoifd);
+fail_eoifd:
+    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
+    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+fail:
+    /* TBD - Enable qemu eoi notifier */
+    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+    vfio_unmask_intx(vdev);
+#endif
+}
+
+static void vfio_disable_intx_kvm(VFIODevice *vdev)
+{
+#ifdef CONFIG_KVM
+    struct kvm_irqfd irqfd = {
+        .gsi = vdev->intx.route.irq,
+        .flags = KVM_IRQFD_FLAG_DEASSIGN,
+    };
+
+    if (!vdev->intx.kvm_accel) {
+        return;
+    }
+
+    /*
+     * Get to a known state, hardware masked, Qemu ready to accept new
+     * interrupts, Qemu IRQ de-asserted.
+     */
+    vfio_mask_intx(vdev);
+    vdev->intx.pending = false;
+    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
+
+    /*
+     * Both ends of the unmask eventfd watch for POLLHUP, so this kills
+     * the eoifd and the vfio unmask handler in one shot.
+     */
+    event_notifier_cleanup(&vdev->intx.unmask);
+
+    /*
+     * Tell the kernel to stop listening for interrupt events.
+     */
+    irqfd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+        error_report("vfio: Error: Failed to disable INTx irqfd: %s\n",
+                     strerror(errno));
+    }
+
+    /*
+     * Qemu starts listening for interrupt events.
+     */
+    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+
+    /* TBD - Enable qemu eoi notifier */
+
+    vdev->intx.kvm_accel = false;
+
+    /*
+     * If we've missed an event, let it re-fire through qemu.
+     */
+    vfio_unmask_intx(vdev);
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n",
+            __FUNCTION__, vdev->host.domain, vdev->host.bus,
+            vdev->host.slot, vdev->host.function);
+#endif
+}
+
+static void vfio_update_irq(PCIDevice *pdev)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    PCIINTxRoute route;
+
+    if (vdev->interrupt != INT_INTx) {
+        return;
+    }
+
+    route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
+    if (!memcmp(&route, &vdev->intx.route, sizeof(route))) {
+        return; /* Nothing changed */
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) IRQ moved %d -> %d\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, vdev->intx.route.irq, route.irq);
+
+    vfio_disable_intx_kvm(vdev);
+    /* TBD - Disable qemu eoi notifier */
+
+    vdev->intx.route = route;
+
+    if (route.mode == PCI_INTX_DISABLED) {
+        return;
+    }
+
+    /* TBD - Enable qemu eoi notifier */
+    vfio_enable_intx_kvm(vdev);
+
+    /* Re-enable the interrupt in case we missed an EOI */
+    vfio_eoi(vdev);
+}
+
+static int vfio_enable_intx(VFIODevice *vdev)
+{
+    struct vfio_irq_set_fd irq_set_fd = {
+        .irq_set = {
+            .argsz = sizeof(irq_set_fd),
+            .flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+            .index = VFIO_PCI_INTX_IRQ_INDEX,
+            .start = 0,
+            .count = 1,
+        },
+    };
+    uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
+
+    if (!pin) {
+        return 0;
+    }
+
+    vfio_disable_interrupts(vdev);
+
+    vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
+    vdev->intx.route = pci_device_route_intx_to_irq(&vdev->pdev,
+                                                    vdev->intx.pin);
+    /* TBD - Enable qemu eoi notifier */
+
+    if (event_notifier_init(&vdev->intx.interrupt, 0)) {
+        error_report("vfio: Error: event_notifier_init failed\n");
+        return -1;
+    }
+
+    irq_set_fd.fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(irq_set_fd.fd, vfio_intx_interrupt, NULL, vdev);
+
+    if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd)) {
+        error_report("vfio: Error: Failed to setup INTx fd: %s\n",
+                     strerror(errno));
+        return -1;
+    }
+
+    vfio_enable_intx_kvm(vdev);
+
+    vdev->interrupt = INT_INTx;
+
+    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    return 0;
+}
+
+static void vfio_disable_intx(VFIODevice *vdev)
+{
+    int fd;
+
+    vfio_disable_intx_kvm(vdev);
+    vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
+
+    /* TBD - Disable qemu eoi notifier */
+
+    fd = event_notifier_get_fd(&vdev->intx.interrupt);
+    qemu_set_fd_handler(fd, NULL, NULL, vdev);
+    event_notifier_cleanup(&vdev->intx.interrupt);
+
+    vdev->interrupt = INT_NONE;
+
+    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+}
+
+/*
+ * MSI/X
+ */
+static void vfio_msi_interrupt(void *opaque)
+{
+    MSIVector *vec = opaque;
+    VFIODevice *vdev = vec->vdev;
+
+    if (!event_notifier_test_and_clear(&vec->interrupt)) {
+        return;
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, vec->vector);
+
+    if (vdev->interrupt == INT_MSIX) {
+        msix_notify(&vdev->pdev, vec->vector);
+    } else if (vdev->interrupt == INT_MSI) {
+        msi_notify(&vdev->pdev, vec->vector);
+    } else {
+        error_report("vfio: MSI interrupt received, but not enabled?\n");
+    }
+}
+
+static int vfio_enable_vectors(VFIODevice *vdev, bool msix)
+{
+    struct vfio_irq_set *irq_set;
+    int ret = 0, i, argsz;
+    int32_t *fds;
+
+    argsz = sizeof(*irq_set) + (vdev->nr_vectors * sizeof(*fds));
+
+    irq_set = g_malloc0(argsz);
+    irq_set->argsz = argsz;
+    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
+    irq_set->index = msix ? VFIO_PCI_MSIX_IRQ_INDEX : VFIO_PCI_MSI_IRQ_INDEX;
+    irq_set->start = 0;
+    irq_set->count = vdev->nr_vectors;
+    fds = (int32_t *)&irq_set->data;
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        if (!vdev->msi_vectors[i].use) {
+            fds[i] = -1;
+            continue;
+        }
+
+        fds[i] = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+    }
+
+    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
+
+    g_free(irq_set);
+
+    if (!ret) {
+        vdev->interrupt = msix ? INT_MSIX : INT_MSI;
+    }
+
+    return ret;
+}
+
+static int vfio_msix_vector_use(PCIDevice *pdev,
+                                unsigned int vector, MSIMessage msg)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    int ret, fd;
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d used\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, vector);
+
+    if (vdev->interrupt != INT_MSIX) {
+        vfio_disable_interrupts(vdev);
+    }
+
+    if (!vdev->msi_vectors) {
+        vdev->msi_vectors = g_malloc0(vdev->msix->entries * sizeof(MSIVector));
+    }
+
+    vdev->msi_vectors[vector].vdev = vdev;
+    vdev->msi_vectors[vector].vector = vector;
+    vdev->msi_vectors[vector].use = true;
+
+    msix_vector_use(pdev, vector);
+
+    if (event_notifier_init(&vdev->msi_vectors[vector].interrupt, 0)) {
+        error_report("vfio: Error: event_notifier_init failed\n");
+    }
+
+    fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
+
+    /*
+     * Attempt to enable route through KVM irqchip,
+     * default to userspace handling if unavailable.
+     */
+    vdev->msi_vectors[vector].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
+    if (vdev->msi_vectors[vector].virq < 0 ||
+        kvm_irqchip_add_irqfd(kvm_state, fd,
+                              vdev->msi_vectors[vector].virq) < 0) {
+        qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
+                            &vdev->msi_vectors[vector]);
+    }
+
+    /*
+     * We don't want to have the host allocate all possible MSI vectors
+     * for a device if they're not in use, so we shut down and incrementally
+     * increase them as needed.
+     */
+    if (vdev->nr_vectors < vector + 1) {
+        int i;
+
+        vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX);
+        vdev->nr_vectors = vector + 1;
+        ret = vfio_enable_vectors(vdev, true);
+        if (ret) {
+            error_report("vfio: failed to enable vectors, %d\n", ret);
+        }
+
+        /* We don't know if we've missed interrupts in the interim... */
+        for (i = 0; i < vdev->msix->entries; i++) {
+            if (vdev->msi_vectors[i].use) {
+                msix_notify(&vdev->pdev, i);
+            }
+        }
+    } else {
+        struct vfio_irq_set_fd irq_set_fd = {
+            .irq_set = {
+                .argsz = sizeof(irq_set_fd),
+                .flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                         VFIO_IRQ_SET_ACTION_TRIGGER,
+                .index = VFIO_PCI_MSIX_IRQ_INDEX,
+                .start = vector,
+                .count = 1,
+            },
+            .fd = fd,
+        };
+        ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd);
+        if (ret) {
+            error_report("vfio: failed to modify vector, %d\n", ret);
+        }
+        msix_notify(&vdev->pdev, vector);
+    }
+
+    return 0;
+}
+
+static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int vector)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    struct vfio_irq_set_fd irq_set_fd = {
+        .irq_set = {
+            .argsz = sizeof(irq_set_fd),
+            .flags = VFIO_IRQ_SET_DATA_EVENTFD |
+                     VFIO_IRQ_SET_ACTION_TRIGGER,
+            .index = VFIO_PCI_MSIX_IRQ_INDEX,
+            .start = vector,
+            .count = 1,
+        },
+        .fd = -1,
+    };
+    int fd;
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) vector %d released\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, vector);
+
+    /*
+     * XXX What's the right thing to do here?  This turns off the interrupt
+     * completely, but do we really just want to switch the interrupt to
+     * bouncing through userspace and let msix.c drop it?  Not sure.
+     */
+    msix_vector_unuse(pdev, vector);
+    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set_fd);
+
+    fd = event_notifier_get_fd(&vdev->msi_vectors[vector].interrupt);
+
+    if (vdev->msi_vectors[vector].virq < 0) {
+        qemu_set_fd_handler(fd, NULL, NULL, NULL);
+    } else {
+        kvm_irqchip_remove_irqfd(kvm_state, fd, vdev->msi_vectors[vector].virq);
+        kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[vector].virq);
+        vdev->msi_vectors[vector].virq = -1;
+    }
+
+    event_notifier_cleanup(&vdev->msi_vectors[vector].interrupt);
+    vdev->msi_vectors[vector].use = false;
+}
+
+/* XXX This should move to msi.c */
+static MSIMessage msi_get_msg(PCIDevice *pdev, unsigned int vector)
+{
+    uint16_t flags = pci_get_word(pdev->config + pdev->msi_cap + PCI_MSI_FLAGS);
+    bool msi64bit = flags & PCI_MSI_FLAGS_64BIT;
+    MSIMessage msg;
+
+    if (msi64bit) {
+        msg.address = pci_get_quad(pdev->config +
+                                   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
+    } else {
+        msg.address = pci_get_long(pdev->config +
+                                   pdev->msi_cap + PCI_MSI_ADDRESS_LO);
+    }
+
+    msg.data = pci_get_word(pdev->config + pdev->msi_cap +
+                            (msi64bit ? PCI_MSI_DATA_64 : PCI_MSI_DATA_32));
+    msg.data += vector;
+
+    return msg;
+}
+
+/* So should this */
+static void msi_set_qsize(PCIDevice *pdev, uint8_t size)
+{
+    uint8_t *config = pdev->config + pdev->msi_cap;
+    uint16_t flags;
+
+    flags = pci_get_word(config + PCI_MSI_FLAGS);
+    flags = le16_to_cpu(flags);
+    flags &= ~PCI_MSI_FLAGS_QSIZE;
+    flags |= (size & 0x7) << 4;
+    flags = cpu_to_le16(flags);
+    pci_set_word(config + PCI_MSI_FLAGS, flags);
+}
+
+static void vfio_enable_msi(VFIODevice *vdev)
+{
+    int ret, i;
+
+    vfio_disable_interrupts(vdev);
+
+    vdev->nr_vectors = msi_nr_vectors_allocated(&vdev->pdev);
+retry:
+    vdev->msi_vectors = g_malloc0(vdev->nr_vectors * sizeof(MSIVector));
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        MSIMessage msg;
+        int fd;
+
+        vdev->msi_vectors[i].vdev = vdev;
+        vdev->msi_vectors[i].vector = i;
+        vdev->msi_vectors[i].use = true;
+
+        if (event_notifier_init(&vdev->msi_vectors[i].interrupt, 0)) {
+            error_report("vfio: Error: event_notifier_init failed\n");
+        }
+
+        fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+
+        msg = msi_get_msg(&vdev->pdev, i);
+
+        /*
+         * Attempt to enable route through KVM irqchip,
+         * default to userspace handling if unavailable.
+         */
+        vdev->msi_vectors[i].virq = kvm_irqchip_add_msi_route(kvm_state, msg);
+        if (vdev->msi_vectors[i].virq < 0 ||
+            kvm_irqchip_add_irqfd(kvm_state, fd,
+                                  vdev->msi_vectors[i].virq) < 0) {
+            qemu_set_fd_handler(fd, vfio_msi_interrupt, NULL,
+                                &vdev->msi_vectors[i]);
+        }
+    }
+
+    ret = vfio_enable_vectors(vdev, false);
+    if (ret) {
+        if (ret < 0) {
+            error_report("vfio: Error: Failed to setup MSI fds: %s\n",
+                         strerror(errno));
+        } else if (ret != vdev->nr_vectors) {
+            error_report("vfio: Error: Failed to enable %d "
+                         "MSI vectors, retry with %d\n", vdev->nr_vectors, ret);
+        }
+
+        for (i = 0; i < vdev->nr_vectors; i++) {
+            int fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+            if (vdev->msi_vectors[i].virq >= 0) {
+                kvm_irqchip_remove_irqfd(kvm_state, fd,
+                                         vdev->msi_vectors[i].virq);
+                kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[i].virq);
+                vdev->msi_vectors[i].virq = -1;
+            } else {
+                qemu_set_fd_handler(fd, NULL, NULL, NULL);
+            }
+            event_notifier_cleanup(&vdev->msi_vectors[i].interrupt);
+        }
+
+        g_free(vdev->msi_vectors);
+
+        if (ret > 0 && ret != vdev->nr_vectors) {
+            vdev->nr_vectors = ret;
+            goto retry;
+        }
+        vdev->nr_vectors = 0;
+
+        return;
+    }
+
+    msi_set_qsize(&vdev->pdev, vdev->nr_vectors);
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) Enabled %d MSI vectors\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, vdev->nr_vectors);
+}
+
+static void vfio_disable_msi_x(VFIODevice *vdev, bool msix)
+{
+    int i;
+
+    vfio_disable_irqindex(vdev, msix ? VFIO_PCI_MSIX_IRQ_INDEX :
+                                       VFIO_PCI_MSI_IRQ_INDEX);
+
+    for (i = 0; i < vdev->nr_vectors; i++) {
+        int fd;
+
+        if (!vdev->msi_vectors[i].use) {
+            continue;
+        }
+
+        fd = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt);
+
+        if (vdev->msi_vectors[i].virq >= 0) {
+            kvm_irqchip_remove_irqfd(kvm_state, fd, vdev->msi_vectors[i].virq);
+            kvm_irqchip_release_virq(kvm_state, vdev->msi_vectors[i].virq);
+            vdev->msi_vectors[i].virq = -1;
+        } else {
+            qemu_set_fd_handler(fd, NULL, NULL, NULL);
+        }
+
+        if (msix) {
+            msix_vector_unuse(&vdev->pdev, i);
+        }
+
+        event_notifier_cleanup(&vdev->msi_vectors[i].interrupt);
+    }
+
+    g_free(vdev->msi_vectors);
+    vdev->msi_vectors = NULL;
+    vdev->nr_vectors = 0;
+
+    if (!msix) {
+        msi_set_qsize(&vdev->pdev, 0); /* Actually still means 1 vector */
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x, msi%s)\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, msix ? "x" : "");
+
+    vfio_enable_intx(vdev);
+}
+
+/*
+ * IO Port/MMIO - Beware of the endians, VFIO is always little endian
+ */
+static void vfio_bar_write(void *opaque, target_phys_addr_t addr,
+                           uint64_t data, unsigned size)
+{
+    VFIOBAR *bar = opaque;
+    uint8_t buf[8];
+
+    switch (size) {
+    case 1:
+        *buf = data & 0xff;
+        break;
+    case 2:
+        *(uint16_t *)buf = cpu_to_le16(data);
+        break;
+    case 4:
+        *(uint32_t *)buf = cpu_to_le32(data);
+        break;
+    default:
+        hw_error("vfio: unsupported write size, %d bytes\n", size);
+    }
+
+    if (pwrite(bar->fd, buf, size, bar->fd_offset + addr) != size) {
+        error_report("%s(,0x%"PRIx64", 0x%"PRIx64", %d) failed: %s\n",
+                     __FUNCTION__, addr, data, size, strerror(errno));
+    }
+
+    DPRINTF("%s(BAR%d+0x%"PRIx64", 0x%"PRIx64", %d)\n",
+            __FUNCTION__, bar->nr, addr, data, size);
+}
+
+static uint64_t vfio_bar_read(void *opaque,
+                              target_phys_addr_t addr, unsigned size)
+{
+    VFIOBAR *bar = opaque;
+    uint8_t buf[8];
+    uint64_t data = 0;
+
+    if (pread(bar->fd, buf, size, bar->fd_offset + addr) != size) {
+        error_report("%s(,0x%"PRIx64", %d) failed: %s\n",
+                     __FUNCTION__, addr, size, strerror(errno));
+        return (uint64_t)-1;
+    }
+
+    switch (size) {
+    case 1:
+        data = buf[0];
+        break;
+    case 2:
+        data = le16_to_cpu(*(uint16_t *)buf);
+        break;
+    case 4:
+        data = le32_to_cpu(*(uint32_t *)buf);
+        break;
+    default:
+        hw_error("vfio: unsupported read size, %d bytes\n", size);
+    }
+
+    DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64"\n",
+            __FUNCTION__, bar->nr, addr, size, data);
+
+    return data;
+}
+
+static const MemoryRegionOps vfio_bar_ops = {
+    .read = vfio_bar_read,
+    .write = vfio_bar_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+/*
+ * PCI config space
+ */
+static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    uint32_t val = 0;
+
+    /*
+     * We only need Qemu PCI config support for the ROM BAR, the MSI and MSIX
+     * capabilities, and the multifunction bit below.  We let VFIO handle
+     * virtualizing everything else.  Performance is not a concern here.
+     */
+    if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) ||
+        (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
+         ranges_overlap(addr, len, pdev->msix_cap, MSIX_CAP_LENGTH)) ||
+        (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+         ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size))) {
+
+        val = pci_default_read_config(pdev, addr, len);
+    } else {
+        if (pread(vdev->fd, &val, len, vdev->config_offset + addr) != len) {
+            error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
+                         __FUNCTION__, vdev->host.domain, vdev->host.bus,
+                         vdev->host.slot, vdev->host.function, addr, len,
+                         strerror(errno));
+            return -1;
+        }
+        val = le32_to_cpu(val);
+    }
+
+    /* Multifunction bit is virtualized in qemu */
+    if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) {
+        uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION;
+
+        if (len == 4) {
+            mask <<= 16;
+        }
+
+        if (pdev->cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+            val |= mask;
+        } else {
+            val &= ~mask;
+        }
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, addr, len, val);
+
+    return val;
+}
+
+static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
+                                  uint32_t val, int len)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    uint32_t val_le = cpu_to_le32(val);
+
+    DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __FUNCTION__,
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, addr, val, len);
+
+    /* Write everything to VFIO, let it filter out what we can't write */
+    if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) {
+        error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %s\n",
+                     __FUNCTION__, vdev->host.domain, vdev->host.bus,
+                     vdev->host.slot, vdev->host.function, addr, val, len,
+                     strerror(errno));
+    }
+
+    /* Write standard header bits to emulation */
+    if (addr < PCI_CONFIG_HEADER_SIZE) {
+        pci_default_write_config(pdev, addr, val, len);
+        return;
+    }
+
+    /* MSI/MSI-X Enabling/Disabling */
+    if (pdev->cap_present & QEMU_PCI_CAP_MSI &&
+        ranges_overlap(addr, len, pdev->msi_cap, vdev->msi_cap_size)) {
+        int is_enabled, was_enabled = msi_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msi_enabled(pdev);
+
+        if (!was_enabled && is_enabled) {
+            vfio_enable_msi(vdev);
+        } else if (was_enabled && !is_enabled) {
+            vfio_disable_msi_x(vdev, false);
+        }
+    }
+
+    if (pdev->cap_present & QEMU_PCI_CAP_MSIX &&
+        ranges_overlap(addr, len, pdev->msix_cap, MSIX_CAP_LENGTH)) {
+        int is_enabled, was_enabled = msix_enabled(pdev);
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        is_enabled = msix_enabled(pdev);
+
+        if (!was_enabled && is_enabled) {
+            /* vfio_msix_vector_use handles this automatically */
+        } else if (was_enabled && !is_enabled) {
+            vfio_disable_msi_x(vdev, true);
+        }
+    }
+}
+
+/*
+ * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
+ */
+static int vfio_dma_map(VFIOContainer *container, target_phys_addr_t iova,
+                        ram_addr_t size, void *vaddr, bool readonly)
+{
+    struct vfio_iommu_type1_dma_map map = {
+        .argsz = sizeof(map),
+        .flags = VFIO_DMA_MAP_FLAG_READ,
+        .vaddr = (__u64)vaddr,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (!readonly) {
+        map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+    }
+
+    if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map)) {
+        DPRINTF("VFIO_MAP_DMA: %d\n", -errno);
+        return -errno;
+    }
+
+    return 0;
+}
+
+static int vfio_dma_unmap(VFIOContainer *container,
+                          target_phys_addr_t iova, ram_addr_t size)
+{
+    struct vfio_iommu_type1_dma_unmap unmap = {
+        .argsz = sizeof(unmap),
+        .flags = 0,
+        .iova = iova,
+        .size = size,
+    };
+
+    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno);
+        return -errno;
+    }
+
+    return 0;
+}
+
+static void vfio_listener_dummy1(MemoryListener *listener)
+{
+    /* We don't do batching (begin/commit) or care about logging */
+}
+
+static void vfio_listener_dummy2(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    /* We don't do logging or care about nops */
+}
+
+static void vfio_listener_dummy3(MemoryListener *listener,
+                                 MemoryRegionSection *section,
+                                 bool match_data, uint64_t data,
+                                 EventNotifier *e)
+{
+    /* We don't care about eventfds */
+}
+
+static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+{
+    return (section->address_space != get_system_memory() ||
+            !memory_region_is_ram(section->mr));
+}
+
+static void vfio_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.listener);
+    target_phys_addr_t iova, end;
+    void *vaddr;
+    int ret;
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("vfio: SKIPPING region_add %016lx - %016lx\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space + section->size - 1);
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region\n", __FUNCTION__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + section->size) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    DPRINTF("vfio: region_add %016lx - %016lx [%p]\n",
+            iova, end - 1, vaddr);
+
+    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
+    if (ret) {
+        error_report("vfio_dma_map(%p, 0x%016lx, 0x%lx, %p) = %d (%s)",
+                     container, iova, end - iova, vaddr, ret, strerror(errno));
+    }
+}
+
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.listener);
+    target_phys_addr_t iova, end;
+    int ret;
+
+    if (vfio_listener_skipped_section(section)) {
+        DPRINTF("vfio: SKIPPING region_del %016lx - %016lx\n",
+                section->offset_within_address_space,
+                section->offset_within_address_space + section->size - 1);
+        return;
+    }
+
+    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
+                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+        error_report("%s received unaligned region", __FUNCTION__);
+        return;
+    }
+
+    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    end = (section->offset_within_address_space + section->size) &
+          TARGET_PAGE_MASK;
+
+    if (iova >= end) {
+        return;
+    }
+
+    DPRINTF("vfio: region_del %016lx - %016lx\n", iova, end - 1);
+
+    ret = vfio_dma_unmap(container, iova, end - iova);
+    if (ret) {
+        error_report("vfio_dma_unmap(%p, 0x%016lx, 0x%lx) = %d (%s)",
+                     container, iova, end - iova, ret, strerror(errno));
+    }
+}
+
+static void vfio_listener_release(VFIOContainer *container)
+{
+    memory_listener_unregister(&container->iommu_data.listener);
+}
+
+/*
+ * Interrupt setup
+ */
+static void vfio_disable_interrupts(VFIODevice *vdev)
+{
+    switch (vdev->interrupt) {
+    case INT_INTx:
+        vfio_disable_intx(vdev);
+        break;
+    case INT_MSI:
+        vfio_disable_msi_x(vdev, false);
+        break;
+    case INT_MSIX:
+        vfio_disable_msi_x(vdev, true);
+    }
+}
+
+static int vfio_setup_msi(VFIODevice *vdev, int pos)
+{
+    uint16_t ctrl;
+    bool msi_64bit, msi_maskbit;
+    int ret, entries;
+
+    if (!msi_supported) {
+        return 0;
+    }
+
+    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -1;
+    }
+    ctrl = le16_to_cpu(ctrl);
+
+    msi_64bit = !!(ctrl & PCI_MSI_FLAGS_64BIT);
+    msi_maskbit = !!(ctrl & PCI_MSI_FLAGS_MASKBIT);
+    entries = 1 << ((ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+    DPRINTF("%04x:%02x:%02x.%x PCI MSI CAP @0x%x\n", vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function, pos);
+
+    ret = msi_init(&vdev->pdev, pos, entries, msi_64bit, msi_maskbit);
+    if (ret < 0) {
+        error_report("vfio: msi_init failed");
+        return ret;
+    }
+    vdev->msi_cap_size = 0xa + (msi_maskbit ? 0xa : 0) + (msi_64bit ? 0x4 : 0);
+
+    return 0;
+}
+
+/*
+ * We don't have any control over how pci_add_capability() inserts
+ * capabilities into the chain.  In order to setup MSI-X we need a
+ * MemoryRegion for the BAR.  In order to setup the BAR and not
+ * attempt to mmap the MSI-X table area, which VFIO won't allow, we
+ * need to first look for where the MSI-X table lives.  So we
+ * unfortunately split MSI-X setup across two functions.
+ */
+static int vfio_early_setup_msix(VFIODevice *vdev)
+{
+    uint8_t pos;
+    uint16_t ctrl;
+    uint32_t table, pba;
+
+    pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);
+    if (!pos) {
+        return 0;
+    }
+
+    if (pread(vdev->fd, &ctrl, sizeof(ctrl),
+              vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
+        return -1;
+    }
+
+    if (pread(vdev->fd, &table, sizeof(table),
+              vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
+        return -1;
+    }
+
+    if (pread(vdev->fd, &pba, sizeof(pba),
+              vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
+        return -1;
+    }
+
+    ctrl = le16_to_cpu(ctrl);
+    table = le32_to_cpu(table);
+    pba = le32_to_cpu(pba);
+
+    vdev->msix = g_malloc0(sizeof(*(vdev->msix)));
+    vdev->msix->table_bar = table & PCI_MSIX_FLAGS_BIRMASK;
+    vdev->msix->table_offset = table & ~PCI_MSIX_FLAGS_BIRMASK;
+    vdev->msix->pba_bar = pba & PCI_MSIX_FLAGS_BIRMASK;
+    vdev->msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK;
+    vdev->msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
+
+    DPRINTF("%04x:%02x:%02x.%x "
+            "PCI MSI-X CAP @0x%x, BAR %d, offset 0x%x, entries %d\n",
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function, pos, vdev->msix->table_bar,
+            vdev->msix->table_offset, vdev->msix->entries);
+
+    return 0;
+}
+
+static int vfio_setup_msix(VFIODevice *vdev, int pos)
+{
+    int ret;
+
+    if (!msi_supported) {
+        return 0;
+    }
+
+    ret = msix_init(&vdev->pdev, vdev->msix->entries,
+                    &vdev->bars[vdev->msix->table_bar].mem,
+                    vdev->msix->table_bar, vdev->msix->table_offset,
+                    &vdev->bars[vdev->msix->pba_bar].mem,
+                    vdev->msix->pba_bar, vdev->msix->pba_offset, pos);
+    if (ret < 0) {
+        error_report("vfio: msix_init failed");
+        return ret;
+    }
+
+    ret = msix_set_vector_notifiers(&vdev->pdev, vfio_msix_vector_use,
+                                    vfio_msix_vector_release);
+    if (ret) {
+        error_report("vfio: msix_set_vector_notifiers failed %d", ret);
+        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem,
+                    &vdev->bars[vdev->msix->pba_bar].mem);
+        return ret;
+    }
+
+    return 0;
+}
+
+static void vfio_teardown_msi(VFIODevice *vdev)
+{
+    msi_uninit(&vdev->pdev);
+
+    if (vdev->msix) {
+        /* FIXME: Why can't unset just silently do nothing?? */
+        if (vdev->pdev.msix_vector_use_notifier &&
+            vdev->pdev.msix_vector_release_notifier) {
+            msix_unset_vector_notifiers(&vdev->pdev);
+        }
+
+        msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem,
+                    &vdev->bars[vdev->msix->pba_bar].mem);
+    }
+}
+
+/*
+ * Resource setup
+ */
+static void vfio_unmap_bar(VFIODevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    uint64_t size;
+
+    if (!memory_region_size(&bar->mem)) {
+        return;
+    }
+
+    size = memory_region_size(&bar->mmap_mem);
+    if (size) {
+        memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
+        munmap(bar->mmap, size);
+    }
+
+    if (vdev->msix && vdev->msix->table_bar == nr) {
+        size = memory_region_size(&vdev->msix->mmap_mem);
+        if (size) {
+            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
+            munmap(vdev->msix->mmap, size);
+        }
+    }
+
+    memory_region_destroy(&bar->mem);
+}
+
+static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
+                         void **map, size_t size, off_t offset,
+                         const char *name)
+{
+    *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
+                MAP_SHARED, bar->fd, bar->fd_offset + offset);
+    if (*map == MAP_FAILED) {
+        *map = NULL;
+        return -1;
+    }
+
+    memory_region_init_ram_ptr(submem, name, size, *map);
+    memory_region_add_subregion(mem, offset, submem);
+
+    return 0;
+}
+
+static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    unsigned size = bar->size;
+    char name[64];
+
+    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
+
+    /* A "slow" read/write mapping underlies all BARs */
+    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
+    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
+
+    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
+        return; /* IO space is only slow, don't expect high perf here */
+    }
+
+    if (size & ~TARGET_PAGE_MASK) {
+        error_report("%s is too small to mmap, this may affect performance.",
+                     name);
+        return;
+    }
+
+    /*
+     * We can't mmap areas overlapping the MSIX vector table, so we
+     * potentially insert a direct-mapped subregion before and after it.
+     */
+    if (vdev->msix && vdev->msix->table_bar == nr) {
+        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
+    }
+
+    if (size) {
+        strcat(name, " mmap");
+        if (vfio_mmap_bar(bar, &bar->mem, &bar->mmap_mem, &bar->mmap,
+                          size, 0, name)) {
+            error_report("%s Failed. Performance may be slow", name);
+        }
+    }
+
+    if (vdev->msix && vdev->msix->table_bar == nr) {
+        unsigned start;
+
+        start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
+                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
+
+        if (start < bar->size) {
+            size = bar->size - start;
+            strcat(name, " msix-hi");
+            /* MSIXInfo contains another MemoryRegion for this mapping */
+            if (vfio_mmap_bar(bar, &bar->mem, &vdev->msix->mmap_mem,
+                              &vdev->msix->mmap, size, start, name)) {
+                error_report("%s Failed. Performance may be slow", name);
+            }
+        }
+    }
+
+    return;
+}
+
+static int vfio_map_bars(VFIODevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        VFIOBAR *bar;
+        int ret;
+        uint32_t bar_val;
+        uint8_t bar_type;
+
+        bar = &vdev->bars[i];
+        if (!bar->size) {
+            continue;
+        }
+
+        ret = pread(vdev->fd, &bar_val, sizeof(bar_val),
+                    vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * i));
+        if (ret != sizeof(bar_val)) {
+            error_report("vfio: Failed to read BAR %d (%s)", i,
+                         strerror(errno));
+            return -1;
+        }
+
+        bar_val = le32_to_cpu(bar_val);
+        bar_type = bar_val & (bar_val & PCI_BASE_ADDRESS_SPACE_IO ?
+                   ~PCI_BASE_ADDRESS_IO_MASK : ~PCI_BASE_ADDRESS_MEM_MASK);
+
+        vfio_map_bar(vdev, i, bar_type);
+
+        if (!(bar_type & PCI_BASE_ADDRESS_SPACE_IO) &&
+            bar_type & PCI_BASE_ADDRESS_MEM_TYPE_64) {
+            i++;
+        }
+    }
+
+    return 0;
+}
+
+static void vfio_unmap_bars(VFIODevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_unmap_bar(vdev, i);
+    }
+}
+
+/*
+ * General setup
+ */
+static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos)
+{
+    uint8_t tmp, next = 0xff;
+
+    for (tmp = pdev->config[PCI_CAPABILITY_LIST]; tmp;
+         tmp = pdev->config[tmp + 1]) {
+        if (tmp > pos && tmp < next) {
+            next = tmp;
+        }
+    }
+
+    return next - pos;
+}
+
+static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos)
+{
+    PCIDevice *pdev = &vdev->pdev;
+    uint8_t cap_id, next, size;
+    int ret;
+
+    cap_id = pdev->config[pos];
+    next = pdev->config[pos + 1];
+
+    /*
+     * If it becomes important to configure capabilities to their actual
+     * size, use this as the default when it's something we don't recognize.
+     * Since qemu doesn't actually handle many of the config accesses,
+     * exact size doesn't seem worthwhile.
+     */
+    size = vfio_std_cap_max_size(pdev, pos);
+
+    /*
+     * pci_add_capability always inserts the new capability at the head
+     * of the chain.  Therefore to end up with a chain that matches the
+     * physical device, we insert from the end by making this recursive.
+     * This is also why we pre-calculate size above as cached config space
+     * will be changed as we unwind the stack.
+     */
+    if (next) {
+        ret = vfio_add_std_cap(vdev, next);
+        if (ret) {
+            return ret;
+        }
+    } else {
+        pdev->config[PCI_CAPABILITY_LIST] = 0; /* Begin the rebuild */
+    }
+
+    switch (cap_id) {
+    case PCI_CAP_ID_MSI:
+        ret = vfio_setup_msi(vdev, pos);
+        break;
+    case PCI_CAP_ID_MSIX:
+        ret = vfio_setup_msix(vdev, pos);
+        break;
+    default:
+        ret = pci_add_capability(pdev, cap_id, pos, size);
+    }
+
+    if (ret < 0) {
+        error_report("vfio: %04x:%02x:%02x.%x Error adding PCI capability "
+                     "0x%x[0x%x]@0x%x: %d\n", vdev->host.domain,
+                     vdev->host.bus, vdev->host.slot, vdev->host.function,
+                     cap_id, size, pos, ret);
+        return ret;
+    }
+
+    return 0;
+}
+
+static int vfio_add_capabilities(VFIODevice *vdev)
+{
+    PCIDevice *pdev = &vdev->pdev;
+
+    if (!(pdev->config[PCI_STATUS] & PCI_STATUS_CAP_LIST) ||
+        !pdev->config[PCI_CAPABILITY_LIST]) {
+        return 0; /* Nothing to add */
+    }
+
+    return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]);
+}
+
+static int vfio_load_rom(VFIODevice *vdev)
+{
+    uint64_t size = vdev->rom_size;
+    const VMStateDescription *vmsd;
+    char name[32];
+    off_t off = 0, voff = vdev->rom_offset;
+    ssize_t bytes;
+    void *ptr;
+
+    /* If loading ROM from file, pci handles it */
+    if (vdev->pdev.romfile || !vdev->pdev.rom_bar || !size) {
+        return 0;
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x)\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function);
+
+    vmsd = qdev_get_vmsd(DEVICE(&vdev->pdev));
+
+    if (vmsd) {
+        snprintf(name, sizeof(name), "%s.rom", vmsd->name);
+    } else {
+        snprintf(name, sizeof(name), "%s.rom",
+                 object_get_typename(OBJECT(&vdev->pdev)));
+    }
+    memory_region_init_ram(&vdev->pdev.rom, name, size);
+    ptr = memory_region_get_ram_ptr(&vdev->pdev.rom);
+    memset(ptr, 0xff, size);
+
+    while (size) {
+        bytes = pread(vdev->fd, ptr + off, size, voff + off);
+        if (bytes == 0) {
+            break; /* expect that we could get back less than the ROM BAR */
+        } else if (bytes > 0) {
+            off += bytes;
+            size -= bytes;
+        } else {
+            if (errno == EINTR || errno == EAGAIN) {
+                continue;
+            }
+            error_report("vfio: Error reading device ROM: %s",
+                         strerror(errno));
+            memory_region_destroy(&vdev->pdev.rom);
+            return -1;
+        }
+    }
+
+    pci_register_bar(&vdev->pdev, PCI_ROM_SLOT, 0, &vdev->pdev.rom);
+    vdev->pdev.has_rom = true;
+    return 0;
+}
+
+static int vfio_connect_container(VFIOGroup *group)
+{
+    VFIOContainer *container;
+    int ret, fd;
+
+    if (group->container) {
+        return 0;
+    }
+
+    QLIST_FOREACH(container, &container_list, next) {
+        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
+            group->container = container;
+            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+            return 0;
+        }
+    }
+
+    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_report("vfio: failed to open /dev/vfio/vfio: %s",
+                     strerror(errno));
+        return -1;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_report("vfio: supported vfio version: %d, "
+                     "reported version: %d", VFIO_API_VERSION, ret);
+        close(fd);
+        return -1;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->fd = fd;
+
+    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -1;
+        }
+
+        container->iommu_data.listener = (MemoryListener) {
+            .begin = vfio_listener_dummy1,
+            .commit = vfio_listener_dummy1,
+            .region_add = vfio_listener_region_add,
+            .region_del = vfio_listener_region_del,
+            .region_nop = vfio_listener_dummy2,
+            .log_start = vfio_listener_dummy2,
+            .log_stop = vfio_listener_dummy2,
+            .log_sync = vfio_listener_dummy2,
+            .log_global_start = vfio_listener_dummy1,
+            .log_global_stop = vfio_listener_dummy1,
+            .eventfd_add = vfio_listener_dummy3,
+            .eventfd_del = vfio_listener_dummy3,
+        };
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.listener,
+                                 get_system_memory());
+    } else {
+        error_report("vfio: No available IOMMU models");
+        g_free(container);
+        close(fd);
+        return -1;
+    }
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&container_list, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    return 0;
+}
+
+static void vfio_disconnect_container(VFIOGroup *group)
+{
+    VFIOContainer *container = group->container;
+
+    if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) {
+        error_report("vfio: error disconnecting group %d from container",
+                     group->groupid);
+    }
+
+    QLIST_REMOVE(group, container_next);
+    group->container = NULL;
+
+    if (QLIST_EMPTY(&container->group_list)) {
+        if (container->iommu_data.release) {
+            container->iommu_data.release(container);
+        }
+        QLIST_REMOVE(container, next);
+        DPRINTF("vfio_disconnect_container: close container->fd\n");
+        close(container->fd);
+        g_free(container);
+    }
+}
+
+static VFIOGroup *vfio_get_group(int groupid)
+{
+    VFIOGroup *group;
+    char path[32];
+    struct vfio_group_status status = { .argsz = sizeof(status) };
+
+    QLIST_FOREACH(group, &group_list, next) {
+        if (group->groupid == groupid) {
+            return group;
+        }
+    }
+
+    group = g_malloc0(sizeof(*group));
+
+    sprintf(path, "/dev/vfio/%d", groupid);
+    group->fd = qemu_open(path, O_RDWR);
+    if (group->fd < 0) {
+        error_report("vfio: error opening %s: %s", path, strerror(errno));
+        g_free(group);
+        return NULL;
+    }
+
+    if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
+        error_report("vfio: error getting group status: %s",
+                     strerror(errno));
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+        error_report("vfio: error, group %d is not viable, please ensure "
+                     "all devices within the iommu_group are bound to their "
+                     "vfio bus driver.", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    group->groupid = groupid;
+    QLIST_INIT(&group->device_list);
+
+    if (vfio_connect_container(group)) {
+        error_report("vfio: failed to set up container for group %d", groupid);
+        close(group->fd);
+        g_free(group);
+        return NULL;
+    }
+
+    QLIST_INSERT_HEAD(&group_list, group, next);
+
+    return group;
+}
+
+static void vfio_put_group(VFIOGroup *group)
+{
+    if (!QLIST_EMPTY(&group->device_list)) {
+        return;
+    }
+
+    vfio_disconnect_container(group);
+    QLIST_REMOVE(group, next);
+    DPRINTF("vfio_put_group: close group->fd\n");
+    close(group->fd);
+    g_free(group);
+}
+
+static int __vfio_get_device(VFIOGroup *group,
+                             const char *name, VFIODevice *vdev)
+{
+    int ret;
+
+    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
+    if (ret < 0) {
+        error_report("vfio: error getting device %s from group %d: %s",
+                     name, group->groupid, strerror(errno));
+        error_report("Verify all devices in group %d "
+                     "are bound to vfio-pci or pci-stub and not already in use",
+                     group->groupid);
+        return -1;
+    }
+
+    vdev->group = group;
+    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
+
+    vdev->fd = ret;
+
+    return 0;
+}
+
+static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev)
+{
+    struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
+    struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+    int ret, i;
+
+    ret = __vfio_get_device(group, name, vdev);
+    if (ret) {
+        return ret;
+    }
+
+    /* Sanity check device */
+    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info);
+    if (ret) {
+        error_report("vfio: error getting device info: %s", strerror(errno));
+        goto error;
+    }
+
+    DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name,
+            dev_info.flags, dev_info.num_regions, dev_info.num_irqs);
+
+    if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) {
+        error_report("vfio: Um, this isn't a PCI device");
+        ret = -1; /* ioctl succeeded, so ret is 0 here; report failure */
+        goto error;
+    }
+
+    vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
+    if (!vdev->reset_works) {
+        error_report("Warning, device %s does not support reset", name);
+    }
+
+    if (dev_info.num_regions != VFIO_PCI_NUM_REGIONS) {
+        error_report("vfio: unexpected number of io regions %u",
+                     dev_info.num_regions);
+        ret = -1; /* not a syscall failure, ret is still 0 */
+        goto error;
+    }
+
+    if (dev_info.num_irqs != VFIO_PCI_NUM_IRQS) {
+        error_report("vfio: unexpected number of irqs %u", dev_info.num_irqs);
+        ret = -1; /* not a syscall failure, ret is still 0 */
+        goto error;
+    }
+
+    for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
+        reg_info.index = i;
+
+        ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+        if (ret) {
+            error_report("vfio: Error getting region %d info: %s", i,
+                         strerror(errno));
+            goto error;
+        }
+
+        DPRINTF("Device %s region %d:\n", name, i);
+        DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
+                (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
+                (unsigned long)reg_info.flags);
+
+        vdev->bars[i].size = reg_info.size;
+        vdev->bars[i].fd_offset = reg_info.offset;
+        vdev->bars[i].fd = vdev->fd;
+        vdev->bars[i].nr = i;
+    }
+
+    reg_info.index = VFIO_PCI_ROM_REGION_INDEX;
+
+    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting ROM info: %s", strerror(errno));
+        goto error;
+    }
+
+    DPRINTF("Device %s ROM:\n", name);
+    DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
+            (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
+            (unsigned long)reg_info.flags);
+
+    vdev->rom_size = reg_info.size;
+    vdev->rom_offset = reg_info.offset;
+
+    reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+
+    ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+    if (ret) {
+        error_report("vfio: Error getting config info: %s", strerror(errno));
+        goto error;
+    }
+
+    DPRINTF("Device %s config:\n", name);
+    DPRINTF("  size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n",
+            (unsigned long)reg_info.size, (unsigned long)reg_info.offset,
+            (unsigned long)reg_info.flags);
+
+    vdev->config_size = reg_info.size;
+    vdev->config_offset = reg_info.offset;
+
+error:
+    if (ret) {
+        QLIST_REMOVE(vdev, next);
+        vdev->group = NULL;
+        close(vdev->fd);
+    }
+    return ret;
+}
+
+static void vfio_put_device(VFIODevice *vdev)
+{
+    QLIST_REMOVE(vdev, next);
+    vdev->group = NULL;
+    DPRINTF("vfio_put_device: close vdev->fd\n");
+    close(vdev->fd);
+    if (vdev->msix) {
+        g_free(vdev->msix);
+        vdev->msix = NULL;
+    }
+}
+
+static int vfio_initfn(struct PCIDevice *pdev)
+{
+    VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOGroup *group;
+    char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
+    ssize_t len;
+    struct stat st;
+    int groupid;
+    int ret;
+
+    /* Check that the host device exists */
+    sprintf(path, "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function);
+    if (stat(path, &st) < 0) {
+        error_report("vfio: error: no such host device: %s", path);
+        return -1;
+    }
+
+    strcat(path, "iommu_group");
+
+    len = readlink(path, iommu_group_path, sizeof(iommu_group_path) - 1);
+    if (len <= 0) {
+        error_report("vfio: error no iommu_group for device");
+        return -1;
+    }
+
+    iommu_group_path[len] = 0;
+    group_name = basename(iommu_group_path);
+
+    if (sscanf(group_name, "%d", &groupid) != 1) {
+        error_report("vfio: error reading %s: %s", path, strerror(errno));
+        return -1;
+    }
+
+    DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __FUNCTION__, vdev->host.domain,
+            vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
+
+    group = vfio_get_group(groupid);
+    if (!group) {
+        error_report("vfio: failed to get group %d", groupid);
+        return -1;
+    }
+
+    sprintf(path, "%04x:%02x:%02x.%01x",
+            vdev->host.domain, vdev->host.bus, vdev->host.slot,
+            vdev->host.function);
+
+    QLIST_FOREACH(pvdev, &group->device_list, next) {
+        if (pvdev->host.domain == vdev->host.domain &&
+            pvdev->host.bus == vdev->host.bus &&
+            pvdev->host.slot == vdev->host.slot &&
+            pvdev->host.function == vdev->host.function) {
+
+            error_report("vfio: error: device %s is already attached", path);
+            vfio_put_group(group);
+            return -1;
+        }
+    }
+
+    ret = vfio_get_device(group, path, vdev);
+    if (ret) {
+        error_report("vfio: failed to get device %s", path);
+        vfio_put_group(group);
+        return -1;
+    }
+
+    /* Get a copy of config space */
+    assert(pci_config_size(&vdev->pdev) <= vdev->config_size);
+    ret = pread(vdev->fd, vdev->pdev.config,
+                pci_config_size(&vdev->pdev), vdev->config_offset);
+    if (ret < (int)pci_config_size(&vdev->pdev)) {
+        error_report("vfio: Failed to read device config space");
+        goto out_put;
+    }
+
+    /*
+     * Clear host resource mapping info.  If we choose not to register a
+     * BAR, such as might be the case with the option ROM, we can get
+     * confusing, unwritable, residual addresses from the host here.
+     */
+    memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
+    memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
+
+    vfio_load_rom(vdev);
+
+    if (vfio_early_setup_msix(vdev)) {
+        goto out_put;
+    }
+
+    if (vfio_map_bars(vdev)) {
+        goto out_unmap_bars;
+    }
+
+    if (vfio_add_capabilities(vdev)) {
+        goto out_teardown_msi;
+    }
+
+    if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_update_irq);
+    }
+
+    if (vfio_enable_intx(vdev)) {
+        goto out_teardown_msi;
+    }
+
+    return 0;
+
+out_teardown_msi:
+    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
+    vfio_teardown_msi(vdev);
+out_unmap_bars:
+    vfio_unmap_bars(vdev);
+out_put:
+    vfio_put_device(vdev);
+    vfio_put_group(group);
+    return -1;
+}
+
+static void vfio_exitfn(struct PCIDevice *pdev)
+{
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+    VFIOGroup *group = vdev->group;
+
+    pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
+    vfio_disable_interrupts(vdev);
+    vfio_teardown_msi(vdev);
+    vfio_unmap_bars(vdev);
+    vfio_put_device(vdev);
+    vfio_put_group(group);
+}
+
+static void vfio_reset(DeviceState *dev)
+{
+    PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
+    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
+
+    if (!vdev->reset_works) {
+        return;
+    }
+
+    if (ioctl(vdev->fd, VFIO_DEVICE_RESET)) {
+        error_report("vfio: Error unable to reset physical device "
+                     "(%04x:%02x:%02x.%x): %s", vdev->host.domain,
+                     vdev->host.bus, vdev->host.slot, vdev->host.function,
+                     strerror(errno));
+    }
+}
+
+static Property vfio_pci_dev_properties[] = {
+    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
+    /* TODO - support passed fds... is this necessary? */
+    /* DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name), */
+    /* DEFINE_PROP_STRING("vfiogroupfd", VFIODevice, vfiogroupfd_name), */
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+
+static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+    PCIDeviceClass *dc = PCI_DEVICE_CLASS(klass);
+
+    dc->parent_class.reset = vfio_reset;
+    dc->init = vfio_initfn;
+    dc->exit = vfio_exitfn;
+    dc->config_read = vfio_pci_read_config;
+    dc->config_write = vfio_pci_write_config;
+    dc->parent_class.props = vfio_pci_dev_properties;
+}
+
+static TypeInfo vfio_pci_dev_info = {
+    .name          = "vfio-pci",
+    .parent        = TYPE_PCI_DEVICE,
+    .instance_size = sizeof(VFIODevice),
+    .class_init    = vfio_pci_dev_class_init,
+};
+
+static void register_vfio_pci_dev_type(void)
+{
+    type_register_static(&vfio_pci_dev_info);
+}
+
+type_init(register_vfio_pci_dev_type)
diff --git a/hw/vfio_pci.h b/hw/vfio_pci.h
new file mode 100644
index 0000000..9fb27ce
--- /dev/null
+++ b/hw/vfio_pci.h
@@ -0,0 +1,100 @@ 
+#ifndef __VFIO_H__
+#define __VFIO_H__
+
+#include "qemu-common.h"
+#include "qemu-queue.h"
+#include "pci.h"
+#include "event_notifier.h"
+
+typedef struct VFIOBAR {
+    off_t fd_offset; /* offset of BAR within device fd */
+    int fd; /* device fd, allows us to pass VFIOBAR as opaque data */
+    MemoryRegion mem; /* slow, read/write access */
+    MemoryRegion mmap_mem; /* direct mapped access */
+    void *mmap;
+    size_t size;
+    uint8_t nr; /* cache the BAR number for debug */
+} VFIOBAR;
+
+typedef struct INTx {
+    bool pending; /* interrupt pending */
+    bool kvm_accel; /* set when Qemu bypass through KVM enabled */
+    uint8_t pin; /* which pin to pull for qemu_set_irq */
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    EventNotifier unmask; /* eventfd for unmask on Qemu bypass */
+    PCIINTxRoute route; /* routing info for Qemu bypass */
+} INTx;
+
+struct VFIODevice;
+
+typedef struct MSIVector {
+    EventNotifier interrupt; /* eventfd triggered on interrupt */
+    struct VFIODevice *vdev; /* back pointer to device */
+    int vector; /* the vector number for this element */
+    int virq; /* KVM irqchip route for Qemu bypass */
+    bool use;
+} MSIVector;
+
+enum {
+    INT_NONE = 0,
+    INT_INTx = 1,
+    INT_MSI  = 2,
+    INT_MSIX = 3,
+};
+
+struct VFIOGroup;
+
+typedef struct VFIOContainer {
+    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
+    struct {
+        /* enable abstraction to support various iommu backends */
+        union {
+            MemoryListener listener; /* Used by type1 iommu */
+        };
+        void (*release)(struct VFIOContainer *);
+    } iommu_data;
+    QLIST_HEAD(, VFIOGroup) group_list;
+    QLIST_ENTRY(VFIOContainer) next;
+} VFIOContainer;
+
+/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
+typedef struct MSIXInfo {
+    uint8_t table_bar;
+    uint8_t pba_bar;
+    uint16_t entries;
+    uint32_t table_offset;
+    uint32_t pba_offset;
+    MemoryRegion mmap_mem;
+    void *mmap;
+} MSIXInfo;
+
+typedef struct VFIODevice {
+    PCIDevice pdev;
+    int fd;
+    INTx intx;
+    unsigned int config_size;
+    off_t config_offset; /* Offset of config space region within device fd */
+    unsigned int rom_size;
+    off_t rom_offset; /* Offset of ROM region within device fd */
+    int msi_cap_size;
+    MSIVector *msi_vectors;
+    MSIXInfo *msix;
+    int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
+    int interrupt; /* Current interrupt type */
+    VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+    PCIHostDeviceAddress host;
+    QLIST_ENTRY(VFIODevice) next;
+    struct VFIOGroup *group;
+    bool reset_works;
+} VFIODevice;
+
+typedef struct VFIOGroup {
+    int fd;
+    int groupid;
+    VFIOContainer *container;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOGroup) next;
+    QLIST_ENTRY(VFIOGroup) container_next;
+} VFIOGroup;
+
+#endif /* __VFIO_H__ */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 5a9d4e3..bd1a76c 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -617,6 +617,10 @@  struct kvm_ppc_smmu_info {
 #define KVM_CAP_SIGNAL_MSI 77
 #define KVM_CAP_PPC_GET_SMMU_INFO 78
 #define KVM_CAP_S390_COW 79
+#define KVM_CAP_PPC_ALLOC_HTAB 80
+#define KVM_CAP_IRQFD_LEVEL 81
+#define KVM_CAP_EOIFD 82
+#define KVM_CAP_EOIFD_LEVEL_IRQFD 83
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -682,6 +686,8 @@  struct kvm_xen_hvm_config {
 #endif
 
 #define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_IRQFD_LEVEL */
+#define KVM_IRQFD_FLAG_LEVEL (1 << 1)
 
 struct kvm_irqfd {
 	__u32 fd;
@@ -690,6 +696,17 @@  struct kvm_irqfd {
 	__u8  pad[20];
 };
 
+#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
+/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */
+#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
+
+struct kvm_eoifd {
+	__u32 fd;
+	__u32 flags;
+	__u32 key;
+	__u8 pad[20];
+};
+
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
@@ -828,6 +845,10 @@  struct kvm_s390_ucas_mapping {
 #define KVM_SIGNAL_MSI            _IOW(KVMIO,  0xa5, struct kvm_msi)
 /* Available with KVM_CAP_PPC_GET_SMMU_INFO */
 #define KVM_PPC_GET_SMMU_INFO	  _IOR(KVMIO,  0xa6, struct kvm_ppc_smmu_info)
+/* Available with KVM_CAP_PPC_ALLOC_HTAB */
+#define KVM_PPC_ALLOCATE_HTAB	  _IOWR(KVMIO, 0xa7, __u32)
+/* Available with KVM_CAP_EOIFD */
+#define KVM_EOIFD                 _IOW(KVMIO,  0xa8, struct kvm_eoifd)
 
 /*
  * ioctls for vcpu fds
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
new file mode 100644
index 0000000..0a4f180
--- /dev/null
+++ b/linux-headers/linux/vfio.h
@@ -0,0 +1,445 @@ 
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION	0
+
+#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
+
+#include <linux/iommu.h>
+#include <linux/mm.h>
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @open: Called when userspace creates new file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ *         operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+	char	*name;
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t count, loff_t *size);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+extern int vfio_add_group_dev(struct device *dev,
+			      const struct vfio_device_ops *ops,
+			      void *device_data);
+
+extern void *vfio_del_group_dev(struct device *dev);
+
+/**
+ * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
+ */
+struct vfio_iommu_driver_ops {
+	char		*name;
+	struct module	*owner;
+	void		*(*open)(unsigned long arg);
+	void		(*release)(void *iommu_data);
+	ssize_t		(*read)(void *iommu_data, char __user *buf,
+				size_t count, loff_t *ppos);
+	ssize_t		(*write)(void *iommu_data, const char __user *buf,
+				 size_t count, loff_t *size);
+	long		(*ioctl)(void *iommu_data, unsigned int cmd,
+				 unsigned long arg);
+	int		(*mmap)(void *iommu_data, struct vm_area_struct *vma);
+	int		(*attach_group)(void *iommu_data,
+					struct iommu_group *group);
+	void		(*detach_group)(void *iommu_data,
+					struct iommu_group *group);
+
+};
+
+extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
+
+extern void vfio_unregister_iommu_driver(
+				const struct vfio_iommu_driver_ops *ops);
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space.  This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({				\
+	TYPE tmp;						\
+	offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_TYPE1_IOMMU		1
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_status.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+	__u32	index;		/* Region index */
+	__u32	resv;		/* Reserved for alignment */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero count irq blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flag indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are set up as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or tear
+ * down setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__u32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows sparse support for the same on arrays of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be triggered for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctl calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (i.e. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__u32	start;
+	__u32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
+/* -------- API for Type1 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_type1_info)
+ *
+ * Retrieve information about the IOMMU object. Fills in provided
+ * struct vfio_iommu_type1_info. Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_type1_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_iommu_type1_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the provided
+ * struct vfio_iommu_type1_dma_map. Caller sets argsz. READ and/or WRITE required.
+ */
+struct vfio_iommu_type1_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+	__u64	vaddr;				/* Process virtual address */
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 14, struct vfio_iommu_type1_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct vfio_iommu_type1_dma_unmap.
+ * Caller sets argsz.
+ */
+struct vfio_iommu_type1_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+#endif /* VFIO_H */
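
For reviewers, here is a minimal userspace sketch of the DATA_EVENTFD example from the VFIO_DEVICE_SET_IRQS comment in linux/vfio.h above (data = {fd1, -1, fd2}). It is not part of the patch. The struct and flag names prefixed with sketch_/SKETCH_ are local stand-ins that mirror struct vfio_irq_set and its flag values so the snippet builds without the new header, and sketch_build_trigger_set() is a hypothetical helper, not something Qemu or the kernel provides. The point it illustrates is that argsz must cover the fixed header plus the trailing array of __s32 eventfds.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Local stand-in mirroring struct vfio_irq_set from the header in this
 * patch (assumption: real code would include <linux/vfio.h> instead).
 */
struct sketch_irq_set {
    uint32_t argsz;
    uint32_t flags;
    uint32_t index;
    uint32_t start;
    uint32_t count;
    uint8_t  data[];
};

/* Same bit positions as VFIO_IRQ_SET_DATA_EVENTFD / ACTION_TRIGGER. */
#define SKETCH_IRQ_SET_DATA_EVENTFD    (1 << 2)
#define SKETCH_IRQ_SET_ACTION_TRIGGER  (1 << 5)

/*
 * Hypothetical helper: build a request that binds `count` eventfds to
 * subindexes [start, start + count) of `index`.  An fd of -1 de-assigns
 * or skips that subindex.  The caller frees the returned buffer.
 */
static struct sketch_irq_set *
sketch_build_trigger_set(uint32_t index, uint32_t start,
                         const int32_t *fds, uint32_t count)
{
    /* argsz covers the fixed header plus one s32 per subindex. */
    size_t argsz = sizeof(struct sketch_irq_set) +
                   (size_t)count * sizeof(int32_t);
    struct sketch_irq_set *set;

    assert(fds != NULL || count == 0);

    set = calloc(1, argsz);
    if (!set) {
        return NULL;
    }

    set->argsz = (uint32_t)argsz;
    set->flags = SKETCH_IRQ_SET_DATA_EVENTFD | SKETCH_IRQ_SET_ACTION_TRIGGER;
    set->index = index;
    set->start = start;
    set->count = count;
    if (count) {
        /* data[] is declared as u8, so copy the s32 array bytewise. */
        memcpy(set->data, fds, (size_t)count * sizeof(int32_t));
    }
    return set;
}
```

With the real header, the equivalent buffer would be handed to ioctl(device_fd, VFIO_DEVICE_SET_IRQS, set) on the device file descriptor. As the comment in the header notes, if [0,1] were already assigned, two count = 1 requests would be needed instead to leave it untouched.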