[RFC] vfio-pci: Disallow device from using NoSnoop transactions

Message ID 20131004191930.3131.95209.stgit@bling.home
State New
Headers show

Commit Message

Alex Williamson Oct. 4, 2013, 7:39 p.m.
NoSnoop is a PCIe attribute that allows devices to issue transactions
that bypass cache.  If devices are allowed to do this then the
hypervisor must emulate instructions like WBINVD or else drivers in
the guest have no way to synchronize RAM for the device.  Instead of
forcing WBINVD when a device is assigned, we can instead prevent the
device from using NoSnoop, making all transactions coherent.  The
danger here is whether we can expect devices to fully support this
bit, but this is an improvement over neglecting the problem.

This fixes the Code 43 error seen on Windows guests with Intel host
systems when trying to assign Nvidia VGA devices.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

VGA folks - Use a v3.12-rc1 or newer kernel and the 6 qemu patches
            from my pull request yesterday and you should see a big
            improvement with this patch.  GTX660 is now working well
            on my Intel box with Win7 guest and Nvidia drivers.

RFC - There are a number of solutions to this coherency problem.
Legacy KVM device assignment solves it by using the IOMMU's ability
to strip NoSnoop attributes from transactions.  When this is
unavailable, WBINVD emulation is enabled in KVM.  AMD-Vi is always
capable of doing this, VT-d is a mix.  The flaw in its design is
that it remaps the guest non-atomically when the IOMMU domain
capabilities change.

If we use the IOMMU to strip NoSnoop, we cannot guarantee that the
feature will always be available and may change for a domain as
devices are added or removed.  Therefore KVM needs to assume non-
coherent or have the ability to switch when VFIO devices are attached.
This is where I was headed with the KVM-VFIO pseudo device.

However, if we disable NoSnoop at the device, as done here, KVM
can always operate assuming cache coherent I/O.  Per the spec, this
should work.  I'm not sure how this will play out across real
hardware, but it avoids a lot of messiness with page table flags
when IOMMU domains change properties and requires no new interfaces
for VFIO or KVM.

Note that clearing and preventing writes to the NoSnoop enable bit
is not a VFIO problem.  VFIO is a userspace driver interface and
other drivers may have legitimate use of NoSnoop.  This is a
hypervisor problem because the hypervisor would like to optimize
away WBINVD emulation.  QEMU is therefore the proper place to
clear the bit and prevent futher changes.

 hw/misc/vfio.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)


diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index a2d5283..a2e5742 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -163,6 +163,7 @@  typedef struct VFIODevice {
     VFIOINTx intx;
     unsigned int config_size;
     uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
+    uint8_t *read_only_config_bits; /* Bits not guest modifiable to VFIO */
     off_t config_offset; /* Offset of config space region within device fd */
     unsigned int rom_size;
     off_t rom_offset; /* Offset of ROM region within device fd */
@@ -1968,13 +1969,20 @@  static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
                                   uint32_t val, int len)
     VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
-    uint32_t val_le = cpu_to_le32(val);
+    uint32_t ro_bits, ro_val = 0, val_le;
     DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __func__,
             vdev->host.domain, vdev->host.bus, vdev->host.slot,
             vdev->host.function, addr, val, len);
-    /* Write everything to VFIO, let it filter out what we can't write */
+    memcpy(&ro_bits, vdev->read_only_config_bits + addr, len);
+    /*
+     * Read-only bits should be cleared in pdev->wmask and set to the
+     * value we want for hardware in pdev->config.
+     */
+    ro_val = pci_default_read_config(pdev, addr, len);
+    val_le = (cpu_to_le32(ro_val) & ro_bits) | (cpu_to_le32(val) & ~ro_bits);
     if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) {
         error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %m",
                      __func__, vdev->host.domain, vdev->host.bus,
@@ -2554,10 +2562,10 @@  static void vfio_add_emulated_long(VFIODevice *vdev, int pos,
 static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
-    uint16_t flags;
+    uint16_t flags, devctl;
     uint8_t type;
-    flags = pci_get_word(vdev->pdev.config + pos + PCI_CAP_FLAGS);
+    flags = pci_get_word(vdev->pdev.config + pos + PCI_EXP_FLAGS);
     type = (flags & PCI_EXP_FLAGS_TYPE) >> 4;
     if (type != PCI_EXP_TYPE_ENDPOINT &&
@@ -2569,6 +2577,22 @@  static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
         return -EINVAL;
+    /*
+     * As a VM we need to worry about how the hypervisor handles coherency.
+     * In the typical, non-assigned device case, everything is coherent and
+     * there's no need to emulate coherency instructions.  PCIe however
+     * allows devices to do non-coherent transactions with the NoSnoop
+     * attribute.  Disable the device from using these transactions and
+     * expose the bit as read-only to the guest.
+     */
+    devctl = vfio_pci_read_config(&vdev->pdev, pos + PCI_EXP_DEVCTL, 2);
+    devctl &= ~PCI_EXP_DEVCTL_NOSNOOP_EN;
+    vfio_pci_write_config(&vdev->pdev, pos + PCI_EXP_DEVCTL, devctl, 2);
+    vfio_add_emulated_word(vdev, pos + PCI_EXP_DEVCTL, 0,
+                           PCI_EXP_DEVCTL_NOSNOOP_EN);
+    vfio_set_word_bits(vdev->read_only_config_bits + pos + PCI_EXP_DEVCTL,
     if (!pci_bus_is_express(vdev->pdev.bus)) {
          * Use express capability as-is on PCI bus.  It doesn't make much
@@ -3559,6 +3583,7 @@  static int vfio_initfn(PCIDevice *pdev)
     /* vfio emulates a lot for us, but some bits need extra love */
     vdev->emulated_config_bits = g_malloc0(vdev->config_size);
+    vdev->read_only_config_bits = g_malloc0(vdev->config_size);
     /* QEMU can choose to expose the ROM or not */
     memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
@@ -3621,6 +3646,7 @@  out_teardown:
+    g_free(vdev->read_only_config_bits);
     return ret;
@@ -3640,6 +3666,7 @@  static void vfio_exitfn(PCIDevice *pdev)
+    g_free(vdev->read_only_config_bits);