[RFC/RFT,5/5] vfio/pci: Allow relocating MSI-X MMIO

Message ID 20171218050253.13478.49457.stgit@gimli.home
State New
Series vfio/pci: MSI-X MMIO relocation

Commit Message

Alex Williamson Dec. 18, 2017, 5:02 a.m. UTC
With recently proposed kernel side vfio-pci changes, the MSI-X vector
table area can be mmap'd from userspace, allowing direct access to
non-MSI-X registers within the host page size of this area.  However,
we only get that direct access if QEMU isn't also emulating MSI-X
within that same page.  For x86/64 host, the system page size is 4K
and the PCI spec recommends a minimum of 4K to 8K alignment to
separate MSI-X from non-MSI-X registers, therefore only devices which
don't honor this recommendation would see any improvement from this
option.  The real targets for this feature are hosts where the page
size exceeds the PCI spec recommended alignment, such as ARM64 systems
with 64K pages.

This new x-msix-relocation option accepts the following options:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Automatically relocate MSI-X MMIO to another BAR or offset
       based on minimum additional MMIO requirement
  bar0..bar5: Specify the target BAR, which will either be extended
       if the BAR exists or added if the BAR slot is available.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h        |    1 
 hw/vfio/trace-events |    2 +
 3 files changed, 104 insertions(+), 1 deletion(-)

Comments

Alexey Kardashevskiy Dec. 18, 2017, 9:04 a.m. UTC | #1
On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
> 
> This new x-msix-relocation option accepts the following options:
> 
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>        based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>        if the BAR exists or added if the BAR slot is available.


While I am digesting the patchset, here are some test results.

This is the device:

00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=1 offset=0000e000
        PBA: BAR=1 offset=0000f000


Test #1: x-msix-relocation = "off":

FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1 @000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works.


Test #2: x-msix-relocation = "auto":

FlatView #2
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0 @0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


The guest fails probing because the first 64bit BAR is broken.

lspci:

Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
        Vector table: BAR=0 offset=00000000
        PBA: BAR=0 offset=00000600



Test #3: x-msix-relocation = "bar1"


FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000010000-00002100000105ff (prio 0, i/o): msix-table
  0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1 @0000000000010600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works. BAR1 became 128K. However, no part of BAR1 was mmap'd, i.e.
nothing in it appears as "ramd" in the flatview. Should it have?
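
For reference, the 128K figure follows from the sizing arithmetic in
vfio_pci_relocate_msix() in the patch. A minimal sketch of that
arithmetic, assuming 16-byte MSI-X table entries and a 64K host page;
the helper names (align_up, pow2_ceil, msix_reloc_size,
extended_bar_size) are made up for illustration and are not QEMU
functions:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per MSI-X vector table entry (PCI_MSIX_ENTRY_SIZE in the patch). */
#define MSIX_ENTRY_SIZE 16u

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (x + align - 1) & ~(align - 1);
}

/* Smallest power of two >= x, for 0 < x <= 2^63. */
static uint64_t pow2_ceil(uint64_t x)
{
    uint64_t p = 1;

    assert(x != 0 && x <= (UINT64_C(1) << 63));
    while (p < x) {
        p <<= 1;
    }
    return p;
}

/*
 * MMIO needed for the relocated MSI-X structures: vector table plus
 * PBA (one bit per vector, in 64-bit units), rounded up to whole host
 * pages and then to a power of two.
 */
static uint64_t msix_reloc_size(unsigned entries, uint64_t host_page)
{
    uint64_t table = (uint64_t)entries * MSIX_ENTRY_SIZE;
    uint64_t pba = align_up(entries, 64) / 8;

    return pow2_ceil(align_up(table + pba, host_page));
}

/*
 * New size of an existing BAR that is extended to hold MSI-X in its
 * upper half: the larger of double the BAR or double msix_sz.
 */
static uint64_t extended_bar_size(uint64_t bar_size, uint64_t msix_sz)
{
    uint64_t a = bar_size * 2;
    uint64_t b = msix_sz * 2;

    return a > b ? a : b;
}
```

With 96 vectors the table is 0x600 bytes and the PBA 16 bytes; rounded
to a 64K page that is 0x10000, so the 64K BAR1 doubles to 0x20000
(128K) with the vector table at offset 0x10000 and the PBA at 0x10600,
which agrees with the mtree output.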

This is "mtree":

memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
      0000210000010000-00002100000105ff (prio 0, i/o): msix-table
      0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]




Test #4: x-msix-relocation = "bar5"

The same net result as test #3: it works but BAR1 is not mapped:


Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=5 offset=00000000
        PBA: BAR=5 offset=00000600

FlatView #0
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5 @0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
      0000000080000000-00000000800005ff (prio 0, i/o): msix-table
      0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]



and there is also one minor comment below.


> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h        |    1 
>  hw/vfio/trace-events |    2 +
>  3 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c383b842da20..b4426abf297a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> +{
> +    int target_bar = -1;
> +    size_t msix_sz;
> +
> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> +        return;
> +    }
> +
> +    /* The actual minimum size of MSI-X structures */
> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> +    /* Round up to host pages, we don't want to share a page */
> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> +    /* PCI BARs must be a power of 2 */
> +    msix_sz = pow2ceil(msix_sz);
> +
> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> +        int i;
> +        size_t best = UINT64_MAX;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
> +            size_t size;
> +
> +            if (vdev->bars[i].ioport) {
> +                continue;
> +            }
> +
> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> +                continue;
> +
> +            /*
> +             * Must be pow2, so larger of double existing or double msix_sz,
> +             * or if BAR unimplemented, msix_sz
> +             */
> +            size = MAX(vdev->bars[i].size * 2,
> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> +
> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> +
> +            if (size < best) {
> +                best = size;
> +                target_bar = i;
> +            }
> +
> +            if (vdev->bars[i].mem64) {
> +                i++;
> +            }
> +        }
> +    } else {
> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> +    }
> +
> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> +        (!vdev->bars[target_bar].size &&
> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> +        return; /* Go BOOM?  Plumb Error */
> +    }
> +
> +    /*
> +     * If adding a new BAR, test if we can make it 64bit.  We make it
> +     * prefetchable since QEMU MSI-X emulation has no read side effects
> +     * and doing so makes mapping more flexible.
> +     */
> +    if (!vdev->bars[target_bar].size) {
> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> +            !vdev->bars[target_bar + 1].size) {
> +            vdev->bars[target_bar].mem64 = true;
> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        }
> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +        vdev->bars[target_bar].size = msix_sz;
> +        vdev->msix->table_offset = 0;
> +    } else {
> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> +                                          msix_sz * 2);
> +        /*
> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> +         * which will always be a separate host page.
> +         */
> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> +    }
> +
> +    vdev->msix->table_bar = target_bar;
> +    vdev->msix->pba_bar = target_bar;
> +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> +    vdev->msix->pba_offset = vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    trace_vfio_msix_relo(vdev->vbasedev.name,
> +                         vdev->msix->table_bar, vdev->msix->table_offset);
> +}
> +
>  /*
>   * We don't have any control over how pci_add_capability() inserts
>   * capabilities into the chain.  In order to setup MSI-X we need a
> @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vdev->msix = msix;
>  
>      vfio_pci_fixup_msix_region(vdev);
> +
> +    vfio_pci_relocate_msix(vdev);
>  }
>  
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>      vfio_pci_size_rom(vdev);
>  
> +    vfio_bars_prepare(vdev);
> +
>      vfio_msix_early_setup(vdev, &err);
>      if (err) {
>          error_propagate(errp, err);
>          goto error;
>      }
>  
> -    vfio_bars_prepare(vdev);


This could be in 2/5.


>      vfio_bars_register(vdev);
>  
>      ret = vfio_add_capabilities(vdev, errp);
> @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> +                                OFF_AUTOPCIBAR_OFF),
>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index dcdb1a806769..588381f201b4 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fae096c0724f..437ccdd29053 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>  vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
>  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> 
>
Alex Williamson Dec. 18, 2017, 1:28 p.m. UTC | #2
On Mon, 18 Dec 2017 20:04:23 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > With recently proposed kernel side vfio-pci changes, the MSI-X vector
> > table area can be mmap'd from userspace, allowing direct access to
> > non-MSI-X registers within the host page size of this area.  However,
> > we only get that direct access if QEMU isn't also emulating MSI-X
> > within that same page.  For x86/64 host, the system page size is 4K
> > and the PCI spec recommends a minimum of 4K to 8K alignment to
> > separate MSI-X from non-MSI-X registers, therefore only devices which
> > don't honor this recommendation would see any improvement from this
> > option.  The real targets for this feature are hosts where the page
> > size exceeds the PCI spec recommended alignment, such as ARM64 systems
> > with 64K pages.
> > 
> > This new x-msix-relocation option accepts the following options:
> > 
> >   off: Disable MSI-X relocation, use native device config (default)
> >   auto: Automatically relocate MSI-X MMIO to another BAR or offset
> >        based on minimum additional MMIO requirement
> >   bar0..bar5: Specify the target BAR, which will either be extended
> >        if the BAR exists or added if the BAR slot is available.  
> 
> 
> While I am digesting the patchset, here are some test results.

Thanks for testing!

> This is the device:
> 
> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> PCI-Express Fusion-MPT SAS-3 (rev 02)

BAR1:

> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]

BAR3:

> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> 
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>         Vector table: BAR=1 offset=0000e000
>         PBA: BAR=1 offset=0000f000
> 
> 
> Test #1: x-msix-relocation = "off":
> 
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> Ok, works.
> 
> 
> Test #2: x-msix-relocation = "auto":
> 
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> The guest fails probing because the first 64bit BAR is broken.
> 
> lspci:
> 
> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> 
> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>         Vector table: BAR=0 offset=00000000
>         PBA: BAR=0 offset=00000600

Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
64bit since BAR1 is implemented.  I don't see anything fundamentally
different between this and the working BAR5 test below.

> Test #3: x-msix-relocation = "bar1"
> 
> 
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>   0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
> @0000000000010600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
> appear as "ramd" in flatview, should it have appeared?
> 
> This is "mtree":
> 
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>       0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>       0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
series doesn't do anything about consuming the new MSI-X mappable flag
that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
continue to exclude mmap'ing the 64K page overlapping the actual BAR.
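
To illustrate why the whole of the original BAR1 ends up excluded on a
64K host, here is a rough sketch of the page rounding involved. It is
an assumption about how the exclusion range is derived (round the
vector table start down and its end up to host page boundaries), not a
copy of vfio_pci_fixup_msix_region(); struct range and
msix_excluded_range() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end) within a BAR. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Host-page-aligned range covering the MSI-X vector table, i.e. the
 * part of the BAR that cannot be mmap'd while the table is emulated.
 * Assumes 16-byte table entries and a power-of-two host page size.
 */
static struct range msix_excluded_range(uint64_t table_offset,
                                        unsigned entries,
                                        uint64_t host_page)
{
    struct range r;
    uint64_t table_end = table_offset + (uint64_t)entries * 16;

    assert(host_page != 0 && (host_page & (host_page - 1)) == 0);

    r.start = table_offset & ~(host_page - 1);
    r.end = (table_end + host_page - 1) & ~(host_page - 1);
    return r;
}
```

For this device (table at 0xe000, 96 vectors) a 64K host page gives
[0x0, 0x10000), which is the entire 64K BAR, whereas a 4K host page
would give only [0xe000, 0xf000).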

> Test #4: x-msix-relocation = "bar5"
> 
> The same net result as test #3: it works but BAR1 is not mapped:
> 
> 
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> 
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>         Vector table: BAR=5 offset=00000000
>         PBA: BAR=5 offset=00000600
> 
> FlatView #0
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
>       0000000080000000-00000000800005ff (prio 0, i/o): msix-table
>       0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

As above, you won't get the mmap without disabling the implicit page
exclusion.  The real question for this case is why it works when
'auto' came up with a nearly identical layout, merely swapping BAR5
for BAR0, and did not work.  The placement of the BARs is even the same.

> and there is also one minor comment below.
> 
> 
> > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >  
> >      vfio_pci_size_rom(vdev);
> >  
> > +    vfio_bars_prepare(vdev);
> > +
> >      vfio_msix_early_setup(vdev, &err);
> >      if (err) {
> >          error_propagate(errp, err);
> >          goto error;
> >      }
> >  
> > -    vfio_bars_prepare(vdev);  
> 
> 
> This could be in 2/5.

It could, but 2/5 was attempting to add the base BAR MemoryRegion and
split vfio_bars_setup() into vfio_bars_prepare() and
vfio_bars_register() without otherwise changing the ordering.  It's
only when we want to modify BARs between prepare and register that we
need to make this change, thus it's done here.  Thanks,

Alex
Alexey Kardashevskiy Dec. 18, 2017, 1:55 p.m. UTC | #3
On 19/12/17 00:28, Alex Williamson wrote:
> On Mon, 18 Dec 2017 20:04:23 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>> table area can be mmap'd from userspace, allowing direct access to
>>> non-MSI-X registers within the host page size of this area.  However,
>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>> within that same page.  For x86/64 host, the system page size is 4K
>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>> don't honor this recommendation would see any improvement from this
>>> option.  The real targets for this feature are hosts where the page
>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>> with 64K pages.
>>>
>>> This new x-msix-relocation option accepts the following options:
>>>
>>>   off: Disable MSI-X relocation, use native device config (default)
>>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>>>        based on minimum additional MMIO requirement
>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>        if the BAR exists or added if the BAR slot is available.  
>>
>>
>> While I am digesting the patchset, here are some test results.
> 
> Thanks for testing!
> 
>> This is the device:
>>
>> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
>> PCI-Express Fusion-MPT SAS-3 (rev 02)
> 
> BAR1:
> 
>> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> 
> BAR3:
> 
>> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>
>> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>>         Vector table: BAR=1 offset=0000e000
>>         PBA: BAR=1 offset=0000f000
>>
>>
>> Test #1: x-msix-relocation = "off":
>>
>> FlatView #1
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>> @000000000000e600
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>> Ok, works.
>>
>>
>> Test #2: x-msix-relocation = "auto":
>>
>> FlatView #2
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
>> @0000000000000600
>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>>
>> The guest fails probing because the first 64bit BAR is broken.
>>
>> lspci:
>>
>> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>
>> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>         Vector table: BAR=0 offset=00000000
>>         PBA: BAR=0 offset=00000600
> 
> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> 64bit since BAR1 is implemented.  I don't see anything fundamentally
> different between this and the working BAR5 test below.


BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as its upper 32 bits when it is a
64bit BAR, no?


> 
>> Test #3: x-msix-relocation = "bar1"
>>
>>
>> FlatView #1
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>>   0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>> @0000000000010600
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
>> appear as "ramd" in flatview, should it have appeared?
>>
>> This is "mtree":
>>
>> memory-region: pci@800000020000000.mmio
>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>     0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>       0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>>       0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
>> 3 mmaps[0]
> 
> Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
> series doesn't do anything about consuming the new MSI-X mappable flag
> that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
> continue to exclude mmap'ing the 64K page overlapping the actual BAR.


Ah, my bad, I had read this but forgot it by the time I got to testing.
Sorry for the noise, tests 3 and 4 mmap as expected with fixup disabled.
Alex Williamson Dec. 18, 2017, 2:28 p.m. UTC | #4
On Tue, 19 Dec 2017 00:55:32 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/12/17 00:28, Alex Williamson wrote:
> > On Mon, 18 Dec 2017 20:04:23 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 18/12/17 16:02, Alex Williamson wrote:  
> >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> >>> table area can be mmap'd from userspace, allowing direct access to
> >>> non-MSI-X registers within the host page size of this area.  However,
> >>> we only get that direct access if QEMU isn't also emulating MSI-X
> >>> within that same page.  For x86/64 host, the system page size is 4K
> >>> and the PCI spec recommends a minimum of 4K to 8K alignment to
> >>> separate MSI-X from non-MSI-X registers, therefore only devices which
> >>> don't honor this recommendation would see any improvement from this
> >>> option.  The real targets for this feature are hosts where the page
> >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> >>> with 64K pages.
> >>>
> >>> This new x-msix-relocation option accepts the following options:
> >>>
> >>>   off: Disable MSI-X relocation, use native device config (default)
> >>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
> >>>        based on minimum additional MMIO requirement
> >>>   bar0..bar5: Specify the target BAR, which will either be extended
> >>>        if the BAR exists or added if the BAR slot is available.    
> >>
> >>
> >> While I am digesting the patchset, here are some test results.  
> > 
> > Thanks for testing!
> >   
> >> This is the device:
> >>
> >> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> >> PCI-Express Fusion-MPT SAS-3 (rev 02)  
> > 
> > BAR1:
> >   
> >> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]  
> > 
> > BAR3:
> >   
> >> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> >>
> >> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
> >>         Vector table: BAR=1 offset=0000e000
> >>         PBA: BAR=1 offset=0000f000
> >>
> >>
> >> Test #1: x-msix-relocation = "off":
> >>
> >> FlatView #1
> >>  AS "memory", root: system
> >>  AS "cpu-memory", root: system
> >>  Root memory region: system
> >>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
> >>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> >> @000000000000e600
> >>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>
> >> Ok, works.
> >>
> >>
> >> Test #2: x-msix-relocation = "auto":
> >>
> >> FlatView #2
> >>  AS "memory", root: system
> >>  AS "cpu-memory", root: system
> >>  Root memory region: system
> >>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
> >>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
> >> @0000000000000600
> >>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> >>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>
> >>
> >> The guest fails probing because the first 64bit BAR is broken.
> >>
> >> lspci:
> >>
> >> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> >> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> >> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> >>
> >> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
> >>         Vector table: BAR=0 offset=00000000
> >>         PBA: BAR=0 offset=00000600  
> > 
> > Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> > 64bit since BAR1 is implemented.  I don't see anything fundamentally
> > different between this and the working BAR5 test below.  
> 
> 
> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
> BAR, no?

AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bits and
0x18-0x1b for the upper 32bits, i.e. it consumes BAR1 + BAR2.  Likewise
the 64bit BAR3 also consumes BAR4.  See for instance the 82576
datasheet:

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf

9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
(upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
(lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
layout of your SAS card does seem poorly thought out, in that they've
essentially precluded a 3rd 64bit BAR by starting with BAR1, but
perhaps it's for compatibility with an equally poorly designed 32bit
version of the device.  Thanks,

Alex
Alexey Kardashevskiy Dec. 19, 2017, 1:22 a.m. UTC | #5
On 19/12/17 01:28, Alex Williamson wrote:
> On Tue, 19 Dec 2017 00:55:32 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/12/17 00:28, Alex Williamson wrote:
>>> On Mon, 18 Dec 2017 20:04:23 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 18/12/17 16:02, Alex Williamson wrote:  
>>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>>>> table area can be mmap'd from userspace, allowing direct access to
>>>>> non-MSI-X registers within the host page size of this area.  However,
>>>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>>>> within that same page.  For x86/64 host, the system page size is 4K
>>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>>>> don't honor this recommendation would see any improvement from this
>>>>> option.  The real targets for this feature are hosts where the page
>>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>>>> with 64K pages.
>>>>>
>>>>> This new x-msix-relocation option accepts the following options:
>>>>>
>>>>>   off: Disable MSI-X relocation, use native device config (default)
>>>>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>>>>>        based on minimum additional MMIO requirement
>>>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>>>        if the BAR exists or added if the BAR slot is available.    
>>>>
>>>>
>>>> While I am digesting the patchset, here are some test results.  
>>>
>>> Thanks for testing!
>>>   
>>>> This is the device:
>>>>
>>>> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
>>>> PCI-Express Fusion-MPT SAS-3 (rev 02)  
>>>
>>> BAR1:
>>>   
>>>> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]  
>>>
>>> BAR3:
>>>   
>>>> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>>>
>>>> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>>>>         Vector table: BAR=1 offset=0000e000
>>>>         PBA: BAR=1 offset=0000f000
>>>>
>>>>
>>>> Test #1: x-msix-relocation = "off":
>>>>
>>>> FlatView #1
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>> @000000000000e600
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>> Ok, works.
>>>>
>>>>
>>>> Test #2: x-msix-relocation = "auto":
>>>>
>>>> FlatView #2
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>>>>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
>>>> @0000000000000600
>>>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>>
>>>> The guest fails probing because the first 64bit BAR is broken.
>>>>
>>>> lspci:
>>>>
>>>> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>>>> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>>>> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>>>
>>>> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>>>         Vector table: BAR=0 offset=00000000
>>>>         PBA: BAR=0 offset=00000600  
>>>
>>> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
>>> 64bit since BAR1 is implemented.  I don't see anything fundamentally
>>> different between this and the working BAR5 test below.  
>>
>>
>> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
>> BAR, no?
> 
> AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bits and
> 0x18-0x1b for the upper 32bits, i.e. it consumes BAR1 + BAR2.  Likewise
> the 64bit BAR3 also consumes BAR4.  See for instance the 82576
> datasheet:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf
> 
> 9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
> BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
> (upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
> (lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
> layout of your SAS card does seem poorly thought out that they've
> essentially precluded a 3rd 64bit BAR by starting with BAR1, but
> perhaps it's for compatibility with an equally poorly designed 32bit
> version of the device.  Thanks,


Ah, makes sense, I just never saw 64bit BARs starting from an odd offset.
My card is weird^Wunusual then:


aik@stratton2:~$ lspci -vbxs 0001:03:00.0

0001:03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
        Subsystem: Super Micro Computer Inc SAS3008 PCI-Express Fusion-MPT SAS-3
        Flags: bus master, fast devsel, latency 0
        I/O ports at <unassigned> [disabled]
        Memory at 80140000 (64-bit, non-prefetchable)
        Memory at 80100000 (64-bit, non-prefetchable)
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: mpt3sas
00: 00 10 97 00 46 05 10 00 02 00 07 01 00 00 00 00
10: 01 00 00 00 04 00 14 80 00 00 00 00 04 00 10 80
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 08 08
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00
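
To double-check the slot usage, the BAR registers in the dump above can be
decoded by hand.  The following is only a rough sketch, not anything from
QEMU or lspci: the names bar_info, cfg_read32 and decode_bars are made up
for illustration, and the byte array is simply the first three rows of the
hexdump copied in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BARS 6

/* Hypothetical decoded view of one BAR slot. */
struct bar_info {
    int      present;   /* register reads back non-zero */
    int      is_io;     /* bit 0 set: I/O space */
    int      is_64bit;  /* memory type field (bits 2:1) == 2 */
    int      is_upper;  /* slot holds the upper 32 bits of the previous BAR */
    uint64_t addr;
};

/* First 0x30 bytes of config space, copied from the lspci -x dump. */
static const uint8_t sas3008_cfg[0x30] = {
    0x00, 0x10, 0x97, 0x00, 0x46, 0x05, 0x10, 0x00,
    0x02, 0x00, 0x07, 0x01, 0x00, 0x00, 0x00, 0x00,
    0x01, 0x00, 0x00, 0x00, 0x04, 0x00, 0x14, 0x80,
    0x00, 0x00, 0x00, 0x00, 0x04, 0x00, 0x10, 0x80,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0xd9, 0x15, 0x08, 0x08,
};

/* Read a little-endian 32-bit register from config space bytes. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Walk BAR0..BAR5 (0x10..0x27); a 64-bit BAR also consumes the next slot. */
static void decode_bars(const uint8_t *cfg, struct bar_info bars[NUM_BARS])
{
    assert(cfg != NULL);
    for (int i = 0; i < NUM_BARS; i++) {
        uint32_t lo = cfg_read32(cfg, 0x10 + 4 * i);

        bars[i] = (struct bar_info){0};
        bars[i].present = lo != 0;
        if (lo & 0x1) {                 /* I/O BAR */
            bars[i].is_io = 1;
            bars[i].addr = lo & ~0x3u;
            continue;
        }
        bars[i].addr = lo & ~0xfu;      /* memory BAR, strip flag bits */
        if (((lo >> 1) & 0x3) == 0x2 && i + 1 < NUM_BARS) {
            uint32_t hi = cfg_read32(cfg, 0x10 + 4 * (i + 1));

            bars[i].is_64bit = 1;
            bars[i].addr |= (uint64_t)hi << 32;
            i++;                        /* next slot is the upper half */
            bars[i] = (struct bar_info){0};
            bars[i].is_upper = 1;
        }
    }
}
```

Run over sas3008_cfg, this gives BAR0 as I/O, BAR1 as a 64bit memory BAR at
0x80140000 with BAR2 as its upper half, BAR3 as a 64bit memory BAR at
0x80100000 with BAR4 as its upper half, and BAR5 empty, which agrees with
the lspci summary lines.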



The mpt3sas driver is funny too - it fails probing with MSIX in bar0 but
succeeds with bar5.

Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)
Region 5: Memory at 80000000 (32-bit, prefetchable)
Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=5 offset=00000000
        PBA: BAR=5 offset=00000600


vs.

Region 0: Memory at 80000000 (32-bit, prefetchable)
Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)
Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
        Vector table: BAR=0 offset=00000000
        PBA: BAR=0 offset=00000600


Here is why:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/mpt3sas/mpt3sas_base.c?h=v4.15-rc4#n2608

It looks for the first MMIO BAR and assumes that is the one which
implements the basic registers, including the doorbell. I am not so sure this
is that unusual.
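
A simplified model of that pattern, to show why bar0 breaks and bar5 does
not.  This is not the mpt3sas code, only a sketch of "take the first memory
BAR as the register BAR"; guest_bar, first_mem_bar and the two layouts are
hypothetical names, with sizes taken from the guest lspci output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BARS 6

enum bar_kind { BAR_NONE, BAR_IO, BAR_MEM };

struct guest_bar {
    enum bar_kind kind;
    uint64_t      size;
};

/* Guest view with MSI-X relocated to a new BAR0. */
static const struct guest_bar relo_bar0[NUM_BARS] = {
    [0] = { BAR_MEM, 64 * 1024 },   /* MSI-X only, no device registers */
    [1] = { BAR_MEM, 64 * 1024 },   /* real register BAR */
    [3] = { BAR_MEM, 256 * 1024 },
};

/* Guest view with MSI-X relocated to a new BAR5. */
static const struct guest_bar relo_bar5[NUM_BARS] = {
    [1] = { BAR_MEM, 64 * 1024 },   /* real register BAR */
    [3] = { BAR_MEM, 256 * 1024 },
    [5] = { BAR_MEM, 64 * 1024 },   /* MSI-X only */
};

/* Return the index of the first implemented memory BAR, or -1 if none. */
static int first_mem_bar(const struct guest_bar bars[NUM_BARS])
{
    assert(bars != NULL);
    for (int i = 0; i < NUM_BARS; i++) {
        if (bars[i].kind == BAR_MEM && bars[i].size) {
            return i;
        }
    }
    return -1;
}
```

With relo_bar0 the search lands on BAR0, which only contains the emulated
MSI-X structures, so a driver written this way never reaches its doorbell.
With relo_bar5 it still lands on BAR1.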
Alexey Kardashevskiy Dec. 19, 2017, 3:07 a.m. UTC | #6
On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
> 
> This new x-msix-relocation option accepts the following options:
> 
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>        based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>        if the BAR exists or added if the BAR slot is available.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h        |    1 
>  hw/vfio/trace-events |    2 +
>  3 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c383b842da20..b4426abf297a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> +{
> +    int target_bar = -1;
> +    size_t msix_sz;
> +
> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> +        return;
> +    }
> +
> +    /* The actual minimum size of MSI-X structures */
> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> +    /* Round up to host pages, we don't want to share a page */
> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> +    /* PCI BARs must be a power of 2 */
> +    msix_sz = pow2ceil(msix_sz);
> +
> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> +        int i;
> +        size_t best = UINT64_MAX;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {


I believe that going from the other end is a safer approach for "auto",
especially after discovering how mpt3sas works. Or you could add an
"autoreverse" switch...
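
To illustrate what I mean, here is a standalone sketch of the same cost loop
with a direction flag.  It is not a proposed patch: vbar, relo_cost and
pick_bar are hypothetical names, the BAR layout is my SAS card as the guest
sees it, and msix_sz is assumed to be 64K.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6

struct vbar {
    uint64_t size;    /* 0 if the slot is unimplemented */
    bool     ioport;
    bool     mem64;   /* set on the lower slot of a 64-bit BAR */
};

/* BAR1 = 64K and BAR3 = 256K, both 64-bit; BAR0 and BAR5 are free. */
static const struct vbar sas_bars[NUM_BARS] = {
    [1] = { 64 * 1024, false, true },
    [3] = { 256 * 1024, false, true },
};

/* True if slot i only holds the upper 32 bits of the preceding BAR. */
static bool is_upper_half(const struct vbar *bars, int i)
{
    return i > 0 && bars[i - 1].mem64;
}

/* Cost model from the quoted loop: the BAR size after relocation. */
static uint64_t relo_cost(const struct vbar *bar, uint64_t msix_sz)
{
    uint64_t doubled = bar->size * 2;
    uint64_t floor = bar->size ? msix_sz * 2 : msix_sz;

    return doubled > floor ? doubled : floor;
}

/* Scan in the given direction; the first slot with the lowest cost wins. */
static int pick_bar(const struct vbar *bars, uint64_t msix_sz, bool reverse)
{
    int target = -1;
    uint64_t best = UINT64_MAX;

    assert(msix_sz > 0);
    for (int n = 0; n < NUM_BARS; n++) {
        int i = reverse ? NUM_BARS - 1 - n : n;
        uint64_t cost;

        if (bars[i].ioport || is_upper_half(bars, i) ||
            bars[i].size > UINT32_MAX / 2) {
            continue;
        }
        cost = relo_cost(&bars[i], msix_sz);
        if (cost < best) {          /* strict: ties keep the earlier pick */
            best = cost;
            target = i;
        }
    }
    return target;
}
```

For this layout BAR0 and BAR5 tie at a cost of 64K, so the forward scan
picks BAR0 and the reverse scan picks BAR5.  Only the tie-break changes;
the cost model itself is untouched.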




> +            size_t size;
> +
> +            if (vdev->bars[i].ioport) {
> +                continue;
> +            }
> +
> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> +                continue;
> +
> +            /*
> +             * Must be pow2, so larger of double existing or double msix_sz,
> +             * or if BAR unimplemented, msix_sz
> +             */
> +            size = MAX(vdev->bars[i].size * 2,
> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> +
> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> +
> +            if (size < best) {
> +                best = size;
> +                target_bar = i;
> +            }
> +
> +            if (vdev->bars[i].mem64) {
> +              i++;
> +            }
> +        }
> +    } else {
> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> +    }
> +
> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> +        (!vdev->bars[target_bar].size &&
> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> +        return; /* Go BOOM?  Plumb Error */
> +    }


This "if" only seems to make sense for the non-auto branch...


> +
> +    /*
> +     * If adding a new BAR, test if we can make it 64bit.  We make it
> +     * prefetchable since QEMU MSI-X emulation has no read side effects
> +     * and doing so makes mapping more flexible.
> +     */
> +    if (!vdev->bars[target_bar].size) {
> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> +            !vdev->bars[target_bar + 1].size) {
> +            vdev->bars[target_bar].mem64 = true;
> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        }
> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +        vdev->bars[target_bar].size = msix_sz;
> +        vdev->msix->table_offset = 0;
> +    } else {
> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> +                                          msix_sz * 2);
> +        /*
> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> +         * which will always be a separate host page.
> +         */
> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> +    }
> +
> +    vdev->msix->table_bar = target_bar;
> +    vdev->msix->pba_bar = target_bar;


Ah, this is how I got confused about commenting out
vfio_pci_fixup_msix_region() not being necessary at the time: I missed that
it is called before vfio_pci_relocate_msix(). When the two are simply
swapped, the BARs get mapped. Ok, thanks,
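
For reference, the size and offset arithmetic in this function can be
checked outside of QEMU.  The sketch below re-implements it with plain
integers; align_up, pow2_ceil, msix_layout and relocate are hypothetical
stand-ins for the QEMU helpers, and the 64K host page size used in the note
after it is an assumption about my host.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u  /* bytes per table entry (PCI_MSIX_ENTRY_SIZE) */

struct msix_layout {
    uint64_t bar_size;
    uint64_t table_offset;
    uint64_t pba_offset;
};

/* Round v up to a multiple of a. */
static uint64_t align_up(uint64_t v, uint64_t a)
{
    return (v + a - 1) / a * a;
}

/* Smallest power of two that is >= v. */
static uint64_t pow2_ceil(uint64_t v)
{
    uint64_t p = 1;

    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* old_size == 0 adds a new BAR, otherwise the existing BAR is extended. */
static struct msix_layout relocate(uint32_t entries, uint64_t old_size,
                                   uint64_t host_page)
{
    struct msix_layout l;
    uint64_t msix_sz;

    assert(entries > 0 && host_page > 0);
    /* Vector table plus PBA (one bit per vector, in 64-bit chunks). */
    msix_sz = (uint64_t)entries * MSIX_ENTRY_SIZE +
              align_up(entries, 64) / 8;
    msix_sz = pow2_ceil(align_up(msix_sz, host_page));

    if (!old_size) {
        l.bar_size = msix_sz;
        l.table_offset = 0;
    } else {
        uint64_t a = old_size * 2, b = msix_sz * 2;

        l.bar_size = a > b ? a : b;
        l.table_offset = l.bar_size / 2;   /* MSI-X starts halfway in */
    }
    l.pba_offset = l.table_offset + (uint64_t)entries * MSIX_ENTRY_SIZE;
    return l;
}
```

With 96 vectors and 64K host pages, the raw size is 1536 + 16 = 1552 bytes,
which rounds up to one 64K page.  A new BAR is then 64K with the table at 0
and the PBA at 0x600, matching the lspci output from the tests.  Extending
the 64K BAR1 instead would give a 128K BAR with the table at 0x10000 and
the PBA at 0x10600.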




> +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> +    vdev->msix->pba_offset = vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    trace_vfio_msix_relo(vdev->vbasedev.name,
> +                         vdev->msix->table_bar, vdev->msix->table_offset);
> +}
> +
>  /*
>   * We don't have any control over how pci_add_capability() inserts
>   * capabilities into the chain.  In order to setup MSI-X we need a
> @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vdev->msix = msix;
>  
>      vfio_pci_fixup_msix_region(vdev);
> +
> +    vfio_pci_relocate_msix(vdev);
>  }
>  
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>      vfio_pci_size_rom(vdev);
>  
> +    vfio_bars_prepare(vdev);
> +
>      vfio_msix_early_setup(vdev, &err);
>      if (err) {
>          error_propagate(errp, err);
>          goto error;
>      }
>  
> -    vfio_bars_prepare(vdev);
>      vfio_bars_register(vdev);
>  
>      ret = vfio_add_capabilities(vdev, errp);
> @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> +                                OFF_AUTOPCIBAR_OFF),
>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index dcdb1a806769..588381f201b4 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fae096c0724f..437ccdd29053 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>  vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
>  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> 
>
Alex Williamson Dec. 19, 2017, 3:40 a.m. UTC | #7
On Tue, 19 Dec 2017 14:07:13 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > With recently proposed kernel side vfio-pci changes, the MSI-X vector
> > table area can be mmap'd from userspace, allowing direct access to
> > non-MSI-X registers within the host page size of this area.  However,
> > we only get that direct access if QEMU isn't also emulating MSI-X
> > within that same page.  For x86/64 host, the system page size is 4K
> > and the PCI spec recommends a minimum of 4K to 8K alignment to
> > separate MSI-X from non-MSI-X registers, therefore only devices which
> > don't honor this recommendation would see any improvement from this
> > option.  The real targets for this feature are hosts where the page
> > size exceeds the PCI spec recommended alignment, such as ARM64 systems
> > with 64K pages.
> > 
> > This new x-msix-relocation option accepts the following options:
> > 
> >   off: Disable MSI-X relocation, use native device config (default)
> >   auto: Automatically relocate MSI-X MMIO to another BAR or offset
> >        based on minimum additional MMIO requirement
> >   bar0..bar5: Specify the target BAR, which will either be extended
> >        if the BAR exists or added if the BAR slot is available.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.h        |    1 
> >  hw/vfio/trace-events |    2 +
> >  3 files changed, 104 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c383b842da20..b4426abf297a 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >      }
> >  }
> >  
> > +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> > +{
> > +    int target_bar = -1;
> > +    size_t msix_sz;
> > +
> > +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> > +        return;
> > +    }
> > +
> > +    /* The actual minimum size of MSI-X structures */
> > +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> > +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> > +    /* Round up to host pages, we don't want to share a page */
> > +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> > +    /* PCI BARs must be a power of 2 */
> > +    msix_sz = pow2ceil(msix_sz);
> > +
> > +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> > +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> > +        int i;
> > +        size_t best = UINT64_MAX;
> > +
> > +        for (i = 0; i < PCI_ROM_SLOT; i++) {  
> 
> 
> I believe that going from the other end is a safer approach for "auto",
> especially after discovering how mpt3sas works. Or you could add
> "autoreverse" switch...

Or is extending the smallest BAR really a safer option?  I wonder how
many drivers go through and fill fixed sized arrays with BAR info,
expecting only the device implemented number of BARs.  Maybe they
wouldn't notice if the BAR was simply bigger than expected.  On the
other hand there are probably drivers dumb enough to index registers
from the end for the BAR as well.  I don't think there exists an
auto algorithm that will fit every device, but a higher hit rate than
we have so far would be nice.  We could also implement MemoryRegionOps
for the base BAR with some error reporting if it gets called.  That
might make the problem more obvious than unassigned_mem_ops silently
eating those accesses.

> > +            size_t size;
> > +
> > +            if (vdev->bars[i].ioport) {
> > +                continue;
> > +            }
> > +
> > +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> > +            if (vdev->bars[i].size > (UINT32_MAX / 2))

Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.

NB, the existing test here needs a bit of work too, 32bit BARs max out
at 2G not 4G, so maybe we need separate tests here.  >1G for 32bit
BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
32bit BARs to 64bit?  It's all virtual addresses anyway, right.  We're
in real trouble if we're extending BARs where this is an issue though.
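
Roughly what I have in mind for the split test, as a standalone sketch
rather than a patch.  bar_can_be_extended is a hypothetical helper, and the
1G and 2G thresholds are just the ones suggested above, not something I've
validated against the spec wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)

/* Hypothetical: may an existing BAR of this size be doubled for MSI-X? */
static bool bar_can_be_extended(uint64_t size, bool mem64)
{
    /* PCI BAR sizes are powers of two. */
    assert(size != 0 && (size & (size - 1)) == 0);

    if (!mem64) {
        /* A 32-bit BAR tops out at 2G, so only up to 1G can be doubled. */
        return size <= 1 * GiB;
    }
    /*
     * After doubling, MSI-X starts at the old size, and that offset has
     * to fit in the 32-bit table offset field.
     */
    return size <= 2 * GiB;
}
```

So a 1G 32bit BAR or a 2G 64bit BAR would still be candidates, while a 2G
32bit BAR or a 4G 64bit BAR would be skipped by auto.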

> > +                continue;
> > +
> > +            /*
> > +             * Must be pow2, so larger of double existing or double msix_sz,
> > +             * or if BAR unimplemented, msix_sz
> > +             */
> > +            size = MAX(vdev->bars[i].size * 2,
> > +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> > +
> > +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> > +
> > +            if (size < best) {
> > +                best = size;
> > +                target_bar = i;
> > +            }
> > +
> > +            if (vdev->bars[i].mem64) {
> > +              i++;
> > +            }
> > +        }
> > +    } else {
> > +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> > +    }
> > +
> > +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> > +        (!vdev->bars[target_bar].size &&
> > +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> > +        return; /* Go BOOM?  Plumb Error */
> > +    }  
> 
> 
> This "if" only seems to make sense for the non-auto branch...

Most of it, yes, but it's still possible for a device to exist where
the auto loop would come up empty.  Imagine if each BAR was
sufficiently large that we couldn't extend it and still give the MSI-X
MMIO areas a 32-bit offset within the BAR.  Exceptionally unlikely, but it
doesn't hurt to test all the corner cases.  I also missed the case of
testing that the BAR isn't too large already here.
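
Pulled out of the function, the existing check looks like the sketch below.
vbar and target_bar_ok are hypothetical names, the layout is the SAS card
from the earlier tests, and the too-large test mentioned above is
deliberately not included since it doesn't exist yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6

struct vbar {
    uint64_t size;    /* 0 if the slot is unimplemented */
    bool     ioport;
    bool     mem64;   /* set on the lower slot of a 64-bit BAR */
};

/* BAR1 = 64K and BAR3 = 256K, both 64-bit. */
static const struct vbar sas_bars[NUM_BARS] = {
    [1] = { 64 * 1024, false, true },
    [3] = { 256 * 1024, false, true },
};

/* Mirrors the sanity check applied to target_bar in the quoted code. */
static bool target_bar_ok(const struct vbar *bars, int target)
{
    assert(target < NUM_BARS);

    if (target < 0) {
        return false;               /* auto found no candidate */
    }
    if (bars[target].ioport) {
        return false;               /* MSI-X cannot live in an I/O BAR */
    }
    if (!bars[target].size && target > 0 && bars[target - 1].mem64) {
        return false;               /* slot is the upper half of a BAR */
    }
    return true;
}
```

On this card bar2 and bar4 are rejected because they are the upper halves
of BAR1 and BAR3, while bar0, bar1, bar3 and bar5 pass.  The -1 case is the
only one the auto path can hit today.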
 
> > +
> > +    /*
> > +     * If adding a new BAR, test if we can make it 64bit.  We make it
> > +     * prefetchable since QEMU MSI-X emulation has no read side effects
> > +     * and doing so makes mapping more flexible.
> > +     */
> > +    if (!vdev->bars[target_bar].size) {
> > +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> > +            !vdev->bars[target_bar + 1].size) {
> > +            vdev->bars[target_bar].mem64 = true;
> > +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> > +        }
> > +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> > +        vdev->bars[target_bar].size = msix_sz;
> > +        vdev->msix->table_offset = 0;
> > +    } else {
> > +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> > +                                          msix_sz * 2);
> > +        /*
> > +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> > +         * which will always be a separate host page.
> > +         */
> > +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> > +    }
> > +
> > +    vdev->msix->table_bar = target_bar;
> > +    vdev->msix->pba_bar = target_bar;  
> 
> 
> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out
> was not necessary at the time but I missed that it is called before
> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,

For a kernel that allows mapping the MSI-X region, yes, but if you ran
that on an older kernel I think QEMU would break when it can't mmap the
entire region.  We can't only support new kernels.  Thanks,

Alex

> > +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> > +    vdev->msix->pba_offset = vdev->msix->table_offset +
> > +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> > +
> > +    trace_vfio_msix_relo(vdev->vbasedev.name,
> > +                         vdev->msix->table_bar, vdev->msix->table_offset);
> > +}
> > +
> >  /*
> >   * We don't have any control over how pci_add_capability() inserts
> >   * capabilities into the chain.  In order to setup MSI-X we need a
> > @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >      vdev->msix = msix;
> >  
> >      vfio_pci_fixup_msix_region(vdev);
> > +
> > +    vfio_pci_relocate_msix(vdev);
> >  }
> >  
> >  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >  
> >      vfio_pci_size_rom(vdev);
> >  
> > +    vfio_bars_prepare(vdev);
> > +
> >      vfio_msix_early_setup(vdev, &err);
> >      if (err) {
> >          error_propagate(errp, err);
> >          goto error;
> >      }
> >  
> > -    vfio_bars_prepare(vdev);
> >      vfio_bars_register(vdev);
> >  
> >      ret = vfio_add_capabilities(vdev, errp);
> > @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
> >      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
> >                                     nv_gpudirect_clique,
> >                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> > +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> > +                                OFF_AUTOPCIBAR_OFF),
> >      /*
> >       * TODO - support passed fds... is this necessary?
> >       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index dcdb1a806769..588381f201b4 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
> >                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
> >      int32_t bootindex;
> >      uint32_t igd_gms;
> > +    OffAutoPCIBAR msix_relo;
> >      uint8_t pm_cap;
> >      uint8_t nv_gpudirect_clique;
> >      bool pci_aer;
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index fae096c0724f..437ccdd29053 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
> >  vfio_msix_pba_enable(const char *name) " (%s)"
> >  vfio_msix_disable(const char *name) " (%s)"
> >  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> > +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> > +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
> >  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
> >  vfio_msi_disable(const char *name) " (%s)"
> >  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> > 
> >   
> 
>
Alexey Kardashevskiy Dec. 19, 2017, 6:02 a.m. UTC | #8
On 19/12/17 14:40, Alex Williamson wrote:
> On Tue, 19 Dec 2017 14:07:13 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>> table area can be mmap'd from userspace, allowing direct access to
>>> non-MSI-X registers within the host page size of this area.  However,
>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>> within that same page.  For x86/64 host, the system page size is 4K
>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>> don't honor this recommendation would see any improvement from this
>>> option.  The real targets for this feature are hosts where the page
>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>> with 64K pages.
>>>
>>> This new x-msix-relocation option accepts the following options:
>>>
>>>   off: Disable MSI-X relocation, use native device config (default)
>>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>>>        based on minimum additional MMIO requirement
>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>        if the BAR exists or added if the BAR slot is available.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  hw/vfio/pci.h        |    1 
>>>  hw/vfio/trace-events |    2 +
>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index c383b842da20..b4426abf297a 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>      }
>>>  }
>>>  
>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>> +{
>>> +    int target_bar = -1;
>>> +    size_t msix_sz;
>>> +
>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* The actual minimum size of MSI-X structures */
>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>> +    /* Round up to host pages, we don't want to share a page */
>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>> +    /* PCI BARs must be a power of 2 */
>>> +    msix_sz = pow2ceil(msix_sz);
>>> +
>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>> +        int i;
>>> +        size_t best = UINT64_MAX;
>>> +
>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {  
>>
>>
>> I believe that going from the other end is a safer approach for "auto",
>> especially after discovering how mpt3sas works. Or you could add
>> "autoreverse" switch...
> 
> Or is extending the smallest BAR really a safer option?  I wonder how
> many drivers go through and fill fixed sized arrays with BAR info,
> expecting only the device implemented number of BARs.  Maybe they
> wouldn't notice if the BAR was simply bigger than expected.  On the
> other hand there are probably drivers dumb enough to index registers
> from the end for the BAR as well.  I don't think there exists an
> auto algorithm that will fit every device, but a higher hit rate than
> we have so far would be nice.

Everything is possible :(

I do not know if there are many users for this relocation though. So far
only one device has the problem (in 5 years or so) and it is fixed by
moving MSI-X to bar5, so I'd suggest starting with this for now.

In general, I think we still need a way to simply disable that msix_table
region anyway if we find a device driver which uses all BARs, does not
tolerate changes to the default set of BARs, etc.


>  We could also implement MemoryRegionOps
> for the base BAR with some error reporting if it gets called.  That
> might make the problem more obvious than unassigned_mem_ops silently
> eating those accesses.

Makes sense.


> 
>>> +            size_t size;
>>> +
>>> +            if (vdev->bars[i].ioport) {
>>> +                continue;
>>> +            }
>>> +
>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> 
> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> 
> NB, the existing test here needs a bit of work too, 32bit BARs max out
> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
> 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
> in real trouble if we're extending BARs where this is an issue though.

until you get a driver like this :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782
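
The separate limits suggested in the quoted text above (skip 32-bit BARs
larger than 1G and 64-bit BARs larger than 2G) could be sketched roughly as
follows. This is only an illustration: bar_size_allows_doubling() and the
GiB macro are hypothetical names, not part of the patch, and the thresholds
are taken directly from the suggestion rather than from any tested code.

```c
#include <stdbool.h>
#include <stdint.h>

#define GiB (1024ULL * 1024 * 1024)

/*
 * Hypothetical helper: can a BAR of this size be doubled to host MSI-X?
 *
 * 32-bit BARs max out at 2G, so anything above 1G cannot be doubled.
 * For 64-bit BARs the relocated table starts at the old BAR size, and
 * that offset must still fit in 32 bits, so anything above 2G is out.
 */
static bool bar_size_allows_doubling(uint64_t size, bool mem64)
{
    uint64_t limit = mem64 ? 2 * GiB : 1 * GiB;

    return size <= limit;
}
```

An unimplemented BAR (size 0) trivially passes this test; whether "auto"
should consider such slots at all is the separate question discussed above.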


> 
>>> +                continue;
>>> +
>>> +            /*
>>> +             * Must be pow2, so larger of double existing or double msix_sz,
>>> +             * or if BAR unimplemented, msix_sz
>>> +             */
>>> +            size = MAX(vdev->bars[i].size * 2,
>>> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
>>> +
>>> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
>>> +
>>> +            if (size < best) {
>>> +                best = size;
>>> +                target_bar = i;
>>> +            }
>>> +
>>> +            if (vdev->bars[i].mem64) {
>>> +              i++;
>>> +            }
>>> +        }
>>> +    } else {
>>> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
>>> +    }
>>> +
>>> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
>>> +        (!vdev->bars[target_bar].size &&
>>> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
>>> +        return; /* Go BOOM?  Plumb Error */
>>> +    }  
>>
>>
>> This "if" only seems to make sense for the non-auto branch...
> 
> Most of it, yes, but it's still possible for a device to exist where
> the auto loop would come up empty.  Imagine if each BAR was
> sufficiently large that we couldn't extend it and still give the MSI-X
> MMIO areas a 32-bit offset within the BAR.  Exceptionally unlikely, it
> doesn't hurt to test all the corner cases.  I also missed the case of
> testing that the BAR isn't too large already here.

Fair enough.
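
For reference, the corner cases discussed here (no candidate found, I/O port
BAR, upper half of a 64-bit BAR, BAR already too large) might be collected
into one check along these lines. This is a simplified sketch: struct
bar_info, NUM_BARS and msix_target_ok() are hypothetical stand-ins, not the
QEMU structures, and the size limit simply reuses the UINT32_MAX / 2
threshold from the auto loop in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6   /* standard BARs 0..5, ROM BAR excluded */

/* Hypothetical, simplified per-BAR state. */
struct bar_info {
    uint64_t size;   /* 0 means the slot is unimplemented */
    bool ioport;     /* I/O port BAR, cannot hold MMIO */
    bool mem64;      /* 64-bit BAR, also consumes the next slot */
};

/* Return true if MSI-X could be relocated into BAR 'target'. */
static bool msix_target_ok(const struct bar_info bars[NUM_BARS], int target)
{
    if (target < 0 || target >= NUM_BARS) {
        return false;               /* auto loop came up empty, or bad index */
    }
    if (bars[target].ioport) {
        return false;
    }
    /* An empty slot right after a 64-bit BAR is that BAR's upper half. */
    if (!bars[target].size && target > 0 && bars[target - 1].mem64) {
        return false;
    }
    /* Too large to double while keeping a 32-bit MSI-X offset. */
    if (bars[target].size > UINT32_MAX / 2) {
        return false;
    }
    return true;
}
```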


>  
>>> +
>>> +    /*
>>> +     * If adding a new BAR, test if we can make it 64bit.  We make it
>>> +     * prefetchable since QEMU MSI-X emulation has no read side effects
>>> +     * and doing so makes mapping more flexible.
>>> +     */
>>> +    if (!vdev->bars[target_bar].size) {
>>> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
>>> +            !vdev->bars[target_bar + 1].size) {
>>> +            vdev->bars[target_bar].mem64 = true;
>>> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
>>> +        }
>>> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
>>> +        vdev->bars[target_bar].size = msix_sz;
>>> +        vdev->msix->table_offset = 0;
>>> +    } else {
>>> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
>>> +                                          msix_sz * 2);
>>> +        /*
>>> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
>>> +         * which will always be a separate host page.
>>> +         */
>>> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
>>> +    }
>>> +
>>> +    vdev->msix->table_bar = target_bar;
>>> +    vdev->msix->pba_bar = target_bar;  
>>
>>
>> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out
>> was not necessary at the time but I missed that it is called before
>> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,
> 
> For a kernel that allows mapping the MSI-X region, yes, but if you ran
> that on an older kernel I think QEMU would break when it can't mmap the
> entire region.  We can't only support new kernels.  Thanks,


Sure, I am not suggesting changing this.
Alex Williamson Dec. 19, 2017, 6:56 a.m. UTC | #9
On Tue, 19 Dec 2017 17:02:59 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/12/17 14:40, Alex Williamson wrote:
> > On Tue, 19 Dec 2017 14:07:13 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 18/12/17 16:02, Alex Williamson wrote:  
> >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> >>> table area can be mmap'd from userspace, allowing direct access to
> >>> non-MSI-X registers within the host page size of this area.  However,
> >>> we only get that direct access if QEMU isn't also emulating MSI-X
> >>> within that same page.  For x86/64 host, the system page size is 4K
> >>> and the PCI spec recommends a minimum of 4K to 8K alignment to
> >>> separate MSI-X from non-MSI-X registers, therefore only devices which
> >>> don't honor this recommendation would see any improvement from this
> >>> option.  The real targets for this feature are hosts where the page
> >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> >>> with 64K pages.
> >>>
> >>> This new x-msix-relocation option accepts the following options:
> >>>
> >>>   off: Disable MSI-X relocation, use native device config (default)
> >>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
> >>>        based on minimum additional MMIO requirement
> >>>   bar0..bar5: Specify the target BAR, which will either be extended
> >>>        if the BAR exists or added if the BAR slot is available.
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  hw/vfio/pci.h        |    1 
> >>>  hw/vfio/trace-events |    2 +
> >>>  3 files changed, 104 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>> index c383b842da20..b4426abf297a 100644
> >>> --- a/hw/vfio/pci.c
> >>> +++ b/hw/vfio/pci.c
> >>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>      }
> >>>  }
> >>>  
> >>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    int target_bar = -1;
> >>> +    size_t msix_sz;
> >>> +
> >>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    /* The actual minimum size of MSI-X structures */
> >>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> >>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> >>> +    /* Round up to host pages, we don't want to share a page */
> >>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> >>> +    /* PCI BARs must be a power of 2 */
> >>> +    msix_sz = pow2ceil(msix_sz);
> >>> +
> >>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> >>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> >>> +        int i;
> >>> +        size_t best = UINT64_MAX;
> >>> +
> >>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {    
> >>
> >>
> >> I believe that going from the other end is a safer approach for "auto",
> >> especially after discovering how mpt3sas works. Or you could add
> >> "autoreverse" switch...  
> > 
> > Or is extending the smallest BAR really a safer option?  I wonder how
> > many drivers go through and fill fixed sized arrays with BAR info,
> > expecting only the device implemented number of BARs.  Maybe they
> > wouldn't notice if the BAR was simply bigger than expected.  On the
> > other hand there are probably drivers dumb enough to index registers
> > from the end for the BAR as well.  I don't think there exists an
> > auto algorithm that will fit every device, but a higher hit rate than
> > we have so far would be nice.  
> 
> Everything is possible :(
> 
> I do not know if there are many users for this relocation though. So far
> only one device has the problem (in 5 years or so) and it is fixed by
> moving msix to bar5, I'd suggest start with this for now.

Interesting, I would have thought it to be more common.

> In general, I think we still need a way to simply disable that msix_table
> region anyway if we find a device driver which uses all BARs, does not
> tolerate changes to the default set of BARs, etc.

Only SPAPR can do that.  In fact, I'm somewhat surprised by your
interest in this series, as I positioned it as a solution for other
platforms, which require interaction with the MSI-X MMIO space to
program interrupts.
 
> >  We could also implement MemoryRegionOps
> > for the base BAR with some error reporting if it gets called.  That
> > might make the problem more obvious than unassigned_mem_ops silently
> > eating those accesses.  
> 
> Makes sense.
> 
> 
> >   
> >>> +            size_t size;
> >>> +
> >>> +            if (vdev->bars[i].ioport) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> >>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))  
> > 
> > Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> > 
> > NB, the existing test here needs a bit of work too, 32bit BARs max out
> > at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> > BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
> > 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
> > in real trouble if we're extending BARs where this is an issue though.
> 
> until you get a driver like this :)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782

Right, a diametric opposite of the SAS driver, verifying all the
attributes it can of specific BARs rather than assuming the first BAR
it finds must be the one to use.  Is it even worthwhile to try to have
any automatic selection?  I suppose this driver is another point
towards a reverse search rather than extended BAR.  Thanks,

Alex
Alexey Kardashevskiy Dec. 19, 2017, 8:28 a.m. UTC | #10
On 19/12/17 17:56, Alex Williamson wrote:
> On Tue, 19 Dec 2017 17:02:59 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/12/17 14:40, Alex Williamson wrote:
>>> On Tue, 19 Dec 2017 14:07:13 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 18/12/17 16:02, Alex Williamson wrote:  
>>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>>>> table area can be mmap'd from userspace, allowing direct access to
>>>>> non-MSI-X registers within the host page size of this area.  However,
>>>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>>>> within that same page.  For x86/64 host, the system page size is 4K
>>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>>>> don't honor this recommendation would see any improvement from this
>>>>> option.  The real targets for this feature are hosts where the page
>>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>>>> with 64K pages.
>>>>>
>>>>> This new x-msix-relocation option accepts the following options:
>>>>>
>>>>>   off: Disable MSI-X relocation, use native device config (default)
>>>>>   auto: Automatically relocate MSI-X MMIO to another BAR or offset
>>>>>        based on minimum additional MMIO requirement
>>>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>>>        if the BAR exists or added if the BAR slot is available.
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> ---
>>>>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  hw/vfio/pci.h        |    1 
>>>>>  hw/vfio/trace-events |    2 +
>>>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index c383b842da20..b4426abf297a 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>>>> +{
>>>>> +    int target_bar = -1;
>>>>> +    size_t msix_sz;
>>>>> +
>>>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    /* The actual minimum size of MSI-X structures */
>>>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>>>> +    /* Round up to host pages, we don't want to share a page */
>>>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>>>> +    /* PCI BARs must be a power of 2 */
>>>>> +    msix_sz = pow2ceil(msix_sz);
>>>>> +
>>>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>>>> +        int i;
>>>>> +        size_t best = UINT64_MAX;
>>>>> +
>>>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {    
>>>>
>>>>
>>>> I believe that going from the other end is a safer approach for "auto",
>>>> especially after discovering how mpt3sas works. Or you could add
>>>> "autoreverse" switch...  
>>>
>>> Or is extending the smallest BAR really a safer option?  I wonder how
>>> many drivers go through and fill fixed sized arrays with BAR info,
>>> expecting only the device implemented number of BARs.  Maybe they
>>> wouldn't notice if the BAR was simply bigger than expected.  On the
>>> other hand there are probably drivers dumb enough to index registers
>>> from the end for the BAR as well.  I don't think there exists an
>>> auto algorithm that will fit every device, but a higher hit rate than
>>> we have so far would be nice.  
>>
>> Everything is possible :(
>>
>> I do not know if there are many users for this relocation though. So far
>> only one device has the problem (in 5 years or so) and it is fixed by
>> moving msix to bar5, I'd suggest start with this for now.
> 
> Interesting, I would have thought it to be more common.

Just to clarify: I meant only one device with a performance issue caused by
MSI-X emulation; non-64K-aligned MSI-X data is not that unusual.


> 
>> In general, I think we still need a way to simply disable that msix_table
>> region anyway if we find a device driver which uses all BARs, does not
>> tolerate changes to the default set of BARs, etc.
> 
> Only SPAPR can do that.  In fact, I'm somewhat surprised by your
> interest in this series as I positioned it as a way for other
> platforms, which require interaction with MSI-X MMIO space for
> programming interrupts.

Well, it moves the guest-visible msix section away from the BAR causing
performance issues so I figured it might work for SPAPR eventually :)


>>>  We could also implement MemoryRegionOps
>>> for the base BAR with some error reporting if it gets called.  That
>>> might make the problem more obvious than unassigned_mem_ops silently
>>> eating those accesses.  
>>
>> Makes sense.
>>
>>
>>>   
>>>>> +            size_t size;
>>>>> +
>>>>> +            if (vdev->bars[i].ioport) {
>>>>> +                continue;
>>>>> +            }
>>>>> +
>>>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))  
>>>
>>> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
>>>
>>> NB, the existing test here needs a bit of work too, 32bit BARs max out
>>> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
>>> BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
>>> 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
>>> in real trouble if we're extending BARs where this is an issue though.
>>
>> until you get a driver like this :)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782
> 
> Right, a diametric opposite of the SAS driver, verifying all the
> attributes it can of specific BARs rather than assuming the first BAR
> it finds must be the one to use.  Is it even worthwhile to try to have
> any automatic selection?  I suppose this driver is another point
> towards a reverse search rather than extended BAR.  Thanks,

Well, guessing like this may fail occasionally, whereas simply allowing MSI-X
mapping won't fail on SPAPR; I do not really know if it is going to be very
useful anywhere other than SPAPR.

And I guess if we go down the automatic selection path, then extending a BAR
does not have much benefit over using the last BAR, because it seems quite
unlikely that a device 1) does not have any unused BARs and 2) has no
MSI-X-only BAR. If that is the case, though, I am not sure which guess would
be safer.
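
A reverse search for the last unused BAR, as an alternative to the forward
"least cost" loop in the patch, might look roughly like this. It is only a
sketch: struct bar_info, NUM_BARS and find_last_free_bar() are hypothetical
simplifications, not the QEMU types.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BARS 6   /* standard BARs 0..5, ROM BAR excluded */

/* Hypothetical, simplified per-BAR state. */
struct bar_info {
    uint64_t size;   /* 0 means the slot is unimplemented */
    bool ioport;
    bool mem64;      /* 64-bit BAR, also consumes the next slot */
};

/*
 * Walk from BAR5 down to BAR0 and return the first free slot, or -1 if
 * every slot is in use.  A slot following a 64-bit BAR is not free: it
 * holds the upper 32 bits of that BAR.
 */
static int find_last_free_bar(const struct bar_info bars[NUM_BARS])
{
    for (int i = NUM_BARS - 1; i >= 0; i--) {
        if (bars[i].size) {
            continue;
        }
        if (i > 0 && bars[i - 1].size && bars[i - 1].mem64) {
            continue;
        }
        return i;
    }
    return -1;
}
```

For a device with three 64-bit BARs in slots 0, 2 and 4 this returns -1, so
a fallback (extending a BAR, or leaving MSI-X where it is) would still be
needed.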

I looked nearby, for example:

001e:80:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719
Gigabit Ethernet PCIe (rev 01)
Region 0: Memory at 3fc2c0250000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at 3fc2c0240000 (64-bit, prefetchable) [size=64K]
Region 4: Memory at 3fc2c0230000 (64-bit, prefetchable) [size=64K]
Capabilities: [a0] MSI-X: Enable- Count=17 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00000120

It is fully packed and it *seems* that BAR4 is MSIX only but who knows why
it is 64K - can be anything...


This one looks more convincing but still no guarantee:

0001:09:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0
xHCI Host Controller (rev 02)
Region 0: Memory at 3fe080800000 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at 3fe080810000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
        Vector table: BAR=2 offset=00000000
        PBA: BAR=2 offset=00001000
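
To put numbers on the two devices above, the size calculation from the patch
can be reproduced standalone. This is a sketch under assumptions: the helper
names are hypothetical, MSIX_ENTRY_SIZE mirrors PCI_MSIX_ENTRY_SIZE (16
bytes), and the host page size is passed in explicitly instead of using
REAL_HOST_PAGE_ALIGN().

```c
#include <stdint.h>

#define MSIX_ENTRY_SIZE 16u   /* bytes per MSI-X vector table entry */

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Smallest power of two that is >= x (for x > 0). */
static uint64_t pow2_ceil(uint64_t x)
{
    uint64_t p = 1;

    while (p < x) {
        p <<= 1;
    }
    return p;
}

/* Page-aligned, power-of-two space needed for vector table plus PBA. */
static uint64_t msix_reloc_size(unsigned entries, uint64_t host_page)
{
    uint64_t table = (uint64_t)entries * MSIX_ENTRY_SIZE;
    uint64_t pba = align_up(entries, 64) / 8;   /* one pending bit per vector */

    return pow2_ceil(align_up(table + pba, host_page));
}

/* Resulting BAR size: msix_sz for an empty slot, otherwise a doubling. */
static uint64_t msix_reloc_cost(uint64_t bar_size, uint64_t msix_sz)
{
    uint64_t doubled_bar = bar_size * 2;
    uint64_t doubled_msix = msix_sz * 2;

    if (!bar_size) {
        return msix_sz;
    }
    return doubled_bar > doubled_msix ? doubled_bar : doubled_msix;
}
```

With 64K host pages, both the 17-vector BCM5719 (272 + 8 bytes) and the
8-vector TUSB73x0 (128 + 8 bytes) round up to a single 64K page. Extending
an existing 64K BAR would then produce a 128K BAR, while a new BAR would be
64K.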



A funny thing - my thinkpad x1 does not have a single msix-capable device,
many are MSI and "Express (v2) Endpoint, MSI 00". Hmmm. Xeon and POWER8
boxes do have MSIX.

Patch

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c383b842da20..b4426abf297a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1352,6 +1352,101 @@  static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
     }
 }
 
+static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
+{
+    int target_bar = -1;
+    size_t msix_sz;
+
+    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
+        return;
+    }
+
+    /* The actual minimum size of MSI-X structures */
+    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
+              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
+    /* Round up to host pages, we don't want to share a page */
+    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
+    /* PCI BARs must be a power of 2 */
+    msix_sz = pow2ceil(msix_sz);
+
+    /* Auto: pick the BAR that incurs the least additional MMIO space */
+    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
+        int i;
+        size_t best = UINT64_MAX;
+
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            size_t size;
+
+            if (vdev->bars[i].ioport) {
+                continue;
+            }
+
+            /* MSI-X MMIO must reside within first 32bit offset of BAR */
+            if (vdev->bars[i].size > (UINT32_MAX / 2))
+                continue;
+
+            /*
+             * Must be pow2, so larger of double existing or double msix_sz,
+             * or if BAR unimplemented, msix_sz
+             */
+            size = MAX(vdev->bars[i].size * 2,
+                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
+
+            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
+
+            if (size < best) {
+                best = size;
+                target_bar = i;
+            }
+
+            if (vdev->bars[i].mem64) {
+                i++;
+            }
+        }
+    } else {
+        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
+    }
+
+    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
+        (!vdev->bars[target_bar].size &&
+         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
+        return; /* Go BOOM?  Plumb Error */
+    }
+
+    /*
+     * If adding a new BAR, test if we can make it 64bit.  We make it
+     * prefetchable since QEMU MSI-X emulation has no read side effects
+     * and doing so makes mapping more flexible.
+     */
+    if (!vdev->bars[target_bar].size) {
+        if (target_bar < (PCI_ROM_SLOT - 1) &&
+            !vdev->bars[target_bar + 1].size) {
+            vdev->bars[target_bar].mem64 = true;
+            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
+        }
+        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+        vdev->bars[target_bar].size = msix_sz;
+        vdev->msix->table_offset = 0;
+    } else {
+        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
+                                          msix_sz * 2);
+        /*
+         * Due to above size calc, MSI-X always starts halfway into the BAR,
+         * which will always be a separate host page.
+         */
+        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
+    }
+
+    vdev->msix->table_bar = target_bar;
+    vdev->msix->pba_bar = target_bar;
+    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
+    vdev->msix->pba_offset = vdev->msix->table_offset +
+                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
+
+    trace_vfio_msix_relo(vdev->vbasedev.name,
+                         vdev->msix->table_bar, vdev->msix->table_offset);
+}
+
 /*
  * We don't have any control over how pci_add_capability() inserts
  * capabilities into the chain.  In order to setup MSI-X we need a
@@ -1430,6 +1525,8 @@  static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
     vdev->msix = msix;
 
     vfio_pci_fixup_msix_region(vdev);
+
+    vfio_pci_relocate_msix(vdev);
 }
 
 static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
@@ -2845,13 +2942,14 @@  static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_pci_size_rom(vdev);
 
+    vfio_bars_prepare(vdev);
+
     vfio_msix_early_setup(vdev, &err);
     if (err) {
         error_propagate(errp, err);
         goto error;
     }
 
-    vfio_bars_prepare(vdev);
     vfio_bars_register(vdev);
 
     ret = vfio_add_capabilities(vdev, errp);
@@ -3041,6 +3139,8 @@  static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
                                    nv_gpudirect_clique,
                                    qdev_prop_nv_gpudirect_clique, uint8_t),
+    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
+                                OFF_AUTOPCIBAR_OFF),
     /*
      * TODO - support passed fds... is this necessary?
      * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index dcdb1a806769..588381f201b4 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -135,6 +135,7 @@  typedef struct VFIOPCIDevice {
                                 (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
     int32_t bootindex;
     uint32_t igd_gms;
+    OffAutoPCIBAR msix_relo;
     uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index fae096c0724f..437ccdd29053 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -16,6 +16,8 @@  vfio_msix_pba_disable(const char *name) " (%s)"
 vfio_msix_pba_enable(const char *name) " (%s)"
 vfio_msix_disable(const char *name) " (%s)"
 vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
+vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
+vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
 vfio_msi_disable(const char *name) " (%s)"
 vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"