diff mbox series

[qemu,1/3] vfio/spapr: Fix indirect levels calculation

Message ID 20190117025115.81178-2-aik@ozlabs.ru
State New
Headers show
Series spapr_pci, vfio: NVIDIA V100 + P9 passthrough | expand

Commit Message

Alexey Kardashevskiy Jan. 17, 2019, 2:51 a.m. UTC
The current code assumes that we can address more bits on a PCI bus
for DMA than we really can but there is no way knowing the actual limit.

This makes a better guess for the number of levels and if the kernel
fails to allocate that, this increases the level numbers till succeeded
or reached the 64bit limit.

This adds levels to the trace point.

This may cause the kernel to warn about failed allocation:
   [65122.837458] Failed to allocate a TCE memory, level shift=28
which might happen if MAX_ORDER is not large enough as it can vary:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/spapr.c      | 38 +++++++++++++++++++++++++++++---------
 hw/vfio/trace-events |  2 +-
 2 files changed, 30 insertions(+), 10 deletions(-)

Comments

David Gibson Feb. 5, 2019, 5:54 a.m. UTC | #1
On Thu, Jan 17, 2019 at 01:51:13PM +1100, Alexey Kardashevskiy wrote:
> The current code assumes that we can address more bits on a PCI bus
> for DMA than we really can but there is no way knowing the actual limit.
> 
> This makes a better guess for the number of levels and if the kernel
> fails to allocate that, this increases the level numbers till succeeded
> or reached the 64bit limit.
> 
> This adds levels to the trace point.
> 
> This may cause the kernel to warn about failed allocation:
>    [65122.837458] Failed to allocate a TCE memory, level shift=28
> which might happen if MAX_ORDER is not large enough as it can vary:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
>  hw/vfio/spapr.c      | 38 +++++++++++++++++++++++++++++---------
>  hw/vfio/trace-events |  2 +-
>  2 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
> index becf71a..2675bf5 100644
> --- a/hw/vfio/spapr.c
> +++ b/hw/vfio/spapr.c
> @@ -146,7 +146,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      int ret;
>      IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
>      uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
> -    unsigned entries, pages;
> +    unsigned entries, bits_total, bits_per_level, max_levels;
>      struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>      long systempagesize = qemu_getrampagesize();

So, "systempagesize" is a bad name, since this is *not* the system
page size, but rather the (minimum) page size we're using to map guest
RAM...

> @@ -176,16 +176,35 @@ int vfio_spapr_create_window(VFIOContainer *container,
>      create.window_size = int128_get64(section->size);
>      create.page_shift = ctz64(pagesize);
>      /*
> -     * SPAPR host supports multilevel TCE tables, there is some
> -     * heuristic to decide how many levels we want for our table:
> -     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
> +     * SPAPR host supports multilevel TCE tables. We try to guess optimal
> +     * levels number and if this fails (for example due to the host memory
> +     * fragmentation), we increase levels. The DMA address structure is:
> +     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
> +     * where:
> +     *   r = reserved (bits >= 55 are reserved in the existing hardware)
> +     *   i = IOMMU page offset (64K in this example)
> +     *   x = bits to index a TCE which can be split to equal chunks to index
> +     *      within the level.
> +     * The aim is to split "x" to smaller possible number of levels.
>       */
>      entries = create.window_size >> create.page_shift;
> -    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
> -    pages = MAX(pow2ceil(pages), 1); /* Round up */
> -    create.levels = ctz64(pages) / 6 + 1;
> -
> -    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +    /* bits_total is number of "x" needed */
> +    bits_total = ctz64((entries * sizeof(uint64_t)));
> +    /*
> +     * bits_per_level is a safe guess of how much we can allocate per level:
> +     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
> +     * is usually bigger than that.
> +     */
> +    bits_per_level = ctz64(systempagesize) + 8;

..which is relevant, because here you really do want the system page
size, since you're guestimating what the kernel will be able to put on
a page, entirely unrelated to the guest page size.

> +    create.levels = bits_total / bits_per_level;

Don't you need a div-rounded-up here?

> +    create.levels = MAX(1, create.levels);
> +    max_levels = (64 - create.page_shift) / ctz64(systempagesize);

And I think you want the real system page size here again.

> +    for ( ; create.levels <= max_levels; ++create.levels) {
> +        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
> +        if (!ret) {
> +            break;
> +        }
> +    }
>      if (ret) {
>          error_report("Failed to create a window, ret = %d (%m)", ret);
>          return -errno;
> @@ -200,6 +219,7 @@ int vfio_spapr_create_window(VFIOContainer *container,
>          return -EINVAL;
>      }
>      trace_vfio_spapr_create_window(create.page_shift,
> +                                   create.levels,
>                                     create.window_size,
>                                     create.start_addr);
>      *pgsize = pagesize;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 0f9f810..adfa75e 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -129,6 +129,6 @@ vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
>  vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
>  vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
>  vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
> -vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
> +vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
>  vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
>  vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"
diff mbox series

Patch

diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index becf71a..2675bf5 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -146,7 +146,7 @@  int vfio_spapr_create_window(VFIOContainer *container,
     int ret;
     IOMMUMemoryRegion *iommu_mr = IOMMU_MEMORY_REGION(section->mr);
     uint64_t pagesize = memory_region_iommu_get_min_page_size(iommu_mr);
-    unsigned entries, pages;
+    unsigned entries, bits_total, bits_per_level, max_levels;
     struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
     long systempagesize = qemu_getrampagesize();
 
@@ -176,16 +176,35 @@  int vfio_spapr_create_window(VFIOContainer *container,
     create.window_size = int128_get64(section->size);
     create.page_shift = ctz64(pagesize);
     /*
-     * SPAPR host supports multilevel TCE tables, there is some
-     * heuristic to decide how many levels we want for our table:
-     * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4
+     * SPAPR host supports multilevel TCE tables. We try to guess optimal
+     * levels number and if this fails (for example due to the host memory
+     * fragmentation), we increase levels. The DMA address structure is:
+     * rrrrrrrr rxxxxxxx xxxxxxxx xxxxxxxx  xxxxxxxx xxxxxxxx xxxxxxxx iiiiiiii
+     * where:
+     *   r = reserved (bits >= 55 are reserved in the existing hardware)
+     *   i = IOMMU page offset (64K in this example)
+     *   x = bits to index a TCE which can be split to equal chunks to index
+     *      within the level.
+     * The aim is to split "x" to smaller possible number of levels.
      */
     entries = create.window_size >> create.page_shift;
-    pages = MAX((entries * sizeof(uint64_t)) / getpagesize(), 1);
-    pages = MAX(pow2ceil(pages), 1); /* Round up */
-    create.levels = ctz64(pages) / 6 + 1;
-
-    ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+    /* bits_total is number of "x" needed */
+    bits_total = ctz64((entries * sizeof(uint64_t)));
+    /*
+     * bits_per_level is a safe guess of how much we can allocate per level:
+     * 8 is the current minimum for CONFIG_FORCE_MAX_ZONEORDER and MAX_ORDER
+     * is usually bigger than that.
+     */
+    bits_per_level = ctz64(systempagesize) + 8;
+    create.levels = bits_total / bits_per_level;
+    create.levels = MAX(1, create.levels);
+    max_levels = (64 - create.page_shift) / ctz64(systempagesize);
+    for ( ; create.levels <= max_levels; ++create.levels) {
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create);
+        if (!ret) {
+            break;
+        }
+    }
     if (ret) {
         error_report("Failed to create a window, ret = %d (%m)", ret);
         return -errno;
@@ -200,6 +219,7 @@  int vfio_spapr_create_window(VFIOContainer *container,
         return -EINVAL;
     }
     trace_vfio_spapr_create_window(create.page_shift,
+                                   create.levels,
                                    create.window_size,
                                    create.start_addr);
     *pgsize = pagesize;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0f9f810..adfa75e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -129,6 +129,6 @@  vfio_prereg_listener_region_add_skip(uint64_t start, uint64_t end) "0x%"PRIx64"
 vfio_prereg_listener_region_del_skip(uint64_t start, uint64_t end) "0x%"PRIx64" - 0x%"PRIx64
 vfio_prereg_register(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
 vfio_prereg_unregister(uint64_t va, uint64_t size, int ret) "va=0x%"PRIx64" size=0x%"PRIx64" ret=%d"
-vfio_spapr_create_window(int ps, uint64_t ws, uint64_t off) "pageshift=0x%x winsize=0x%"PRIx64" offset=0x%"PRIx64
+vfio_spapr_create_window(int ps, unsigned int levels, uint64_t ws, uint64_t off) "pageshift=0x%x levels=%u winsize=0x%"PRIx64" offset=0x%"PRIx64
 vfio_spapr_remove_window(uint64_t off) "offset=0x%"PRIx64
 vfio_spapr_group_attach(int groupfd, int tablefd) "Attached groupfd %d to liobn fd %d"