[v3,01/12] intel-iommu: send PSI always even if across PDEs

Message ID 20180517085927.24925-2-peterx@redhat.com
State New
Series
  • intel-iommu: nested vIOMMU, cleanups, bug fixes

Commit Message

Peter Xu May 17, 2018, 8:59 a.m.
During an IOVA page table walk, there is a special case when the PSI
covers one whole PDE (a Page Directory Entry, which contains 512 Page
Table Entries) or more.  Previously we skipped such an entry and did
not notify the IOMMU notifiers.  This is not correct: we should send
an UNMAP notification to the registered UNMAP notifiers in this case.

For UNMAP-only notifiers, the missed notification can leave IOTLB
entries cached in the devices even after they have become invalid.
For MAP/UNMAP notifiers like vfio-pci, it leaves stale page mappings
behind.

This special case doesn't trigger often, but it is easily triggered by
nested device assignment, since in that case we may map the whole L2
guest RAM region into the device's IOVA address space (several GBs at
least), which is far bigger than a normal kernel driver's usage of the
device (typically tens of MBs).
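
To make the trigger condition concrete, here is a hypothetical sketch
(the constants mirror VT-d's layout, but the ex_* names are invented
for this illustration): with 4KiB pages and 9 address bits per level,
one PDE maps 512 pages = 2MiB, and since the VT-d spec requires a PSI
address to be aligned to the size implied by its address mask, any PSI
with an address mask of 9 or more necessarily covers whole PDEs:

/*
 * Hypothetical sketch, not part of the patch.  A PSI of (addr, am)
 * invalidates (1 << am) 4KiB pages starting at addr.
 */
#include <stdbool.h>
#include <stdint.h>

#define EX_PAGE_SHIFT  12                                  /* 4 KiB pages */
#define EX_LEVEL_BITS  9                                   /* 512 entries per table */
#define EX_PDE_SIZE    (1ULL << (EX_PAGE_SHIFT + EX_LEVEL_BITS))  /* 2 MiB */

static bool ex_psi_covers_whole_pde(uint64_t addr, uint8_t am)
{
    uint64_t size = (uint64_t)1 << (am + EX_PAGE_SHIFT);   /* bytes invalidated */

    /* Aligned (as the spec requires) and spanning at least one full PDE. */
    return size >= EX_PDE_SIZE && (addr & (size - 1)) == 0;
}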

Without this patch applied to the L1 QEMU, nested device assignment to
L2 guests dumps errors like:

qemu-system-x86_64: VFIO_MAP_DMA: -17
qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
                    0x7f89a920d000) = -17 (File exists)
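
The -17 is EEXIST: the stale mapping that was never torn down makes
the next VFIO_IOMMU_MAP_DMA for the same IOVA fail.  A minimal
userspace sketch of that failure mode (the ex_map_dma helper is
invented for this illustration; error handling is omitted and
container_fd is assumed to be an already configured VFIO container):

#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Map size bytes of vaddr at iova.  Calling this twice for the same
 * IOVA without an intervening VFIO_IOMMU_UNMAP_DMA fails with errno
 * set to EEXIST, which is exactly what the stale mapping provokes. */
static int ex_map_dma(int container_fd, void *vaddr, uint64_t iova,
                      uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uint64_t)(uintptr_t)vaddr;
    map.iova  = iova;
    map.size  = size;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}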

Acked-by: Jason Wang <jasowang@redhat.com>
[peterx: rewrite the commit message]
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

Comments

Michael S. Tsirkin May 17, 2018, 9 p.m. | #1
On Thu, May 17, 2018 at 04:59:16PM +0800, Peter Xu wrote:
> During an IOVA page table walk, there is a special case when the PSI
> covers one whole PDE (a Page Directory Entry, which contains 512 Page
> Table Entries) or more.  Previously we skipped such an entry and did
> not notify the IOMMU notifiers.  This is not correct: we should send
> an UNMAP notification to the registered UNMAP notifiers in this case.
> 
> For UNMAP-only notifiers, the missed notification can leave IOTLB
> entries cached in the devices even after they have become invalid.
> For MAP/UNMAP notifiers like vfio-pci, it leaves stale page mappings
> behind.
> 
> This special case doesn't trigger often, but it is easily triggered by
> nested device assignment, since in that case we may map the whole L2
> guest RAM region into the device's IOVA address space (several GBs at
> least), which is far bigger than a normal kernel driver's usage of the
> device (typically tens of MBs).
> 
> Without this patch applied to the L1 QEMU, nested device assignment to
> L2 guests dumps errors like:
> 
> qemu-system-x86_64: VFIO_MAP_DMA: -17
> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>                     0x7f89a920d000) = -17 (File exists)

Could you add a description of the security implications, if any, and
their severity?

> Acked-by: Jason Wang <jasowang@redhat.com>
> [peterx: rewrite the commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>

Cc stable here? For any other patches as well?
> ---
>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fb31de9416..b359efd6f9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> +                             vtd_page_walk_hook hook_fn, void *private)
> +{
> +    assert(hook_fn);
> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> +                            entry->addr_mask, entry->perm);
> +    return hook_fn(entry, private);
> +}
> +
>  /**
>   * vtd_page_walk_level - walk over specific level for IOVA range
>   *
> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> +        entry.target_as = &address_space_memory;
> +        entry.iova = iova & subpage_mask;
> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +        entry.addr_mask = ~subpage_mask;
> +
>          if (vtd_is_last_slpte(slpte, level)) {
> -            entry.target_as = &address_space_memory;
> -            entry.iova = iova & subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            entry.addr_mask = ~subpage_mask;
> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              if (!entry_valid && !notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> -                                    entry.addr_mask, entry.perm);
> -            if (hook_fn) {
> -                ret = hook_fn(&entry, private);
> -                if (ret < 0) {
> -                    return ret;
> -                }
> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            if (ret < 0) {
> +                return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                if (notify_unmap) {
> +                    /*
> +                     * The whole entry is invalid; unmap it all.
> +                     * Translated address is meaningless, zero it.
> +                     */
> +                    entry.translated_addr = 0x0;
> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                } else {
> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                }
>                  goto next;
>              }
>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
> -- 
> 2.17.0
Auger Eric May 18, 2018, 8:23 a.m. | #2
Hi Peter,

On 05/17/2018 10:59 AM, Peter Xu wrote:
> During an IOVA page table walk, there is a special case when the PSI
> covers one whole PDE (a Page Directory Entry, which contains 512 Page
> Table Entries) or more.  Previously we skipped such an entry and did
> not notify the IOMMU notifiers.  This is not correct: we should send
> an UNMAP notification to the registered UNMAP notifiers in this case.
> 
> For UNMAP-only notifiers, the missed notification can leave IOTLB
> entries cached in the devices even after they have become invalid.
> For MAP/UNMAP notifiers like vfio-pci, it leaves stale page mappings
> behind.
> 
> This special case doesn't trigger often, but it is easily triggered by
> nested device assignment, since in that case we may map the whole L2
> guest RAM region into the device's IOVA address space (several GBs at
> least), which is far bigger than a normal kernel driver's usage of the
> device (typically tens of MBs).
> 
> Without this patch applied to the L1 QEMU, nested device assignment to
> L2 guests dumps errors like:
> 
> qemu-system-x86_64: VFIO_MAP_DMA: -17
> qemu-system-x86_64: vfio_dma_map(0x557305420c30, 0xad000, 0x1000,
>                     0x7f89a920d000) = -17 (File exists)
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> [peterx: rewrite the commit message]
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++------------
>  1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fb31de9416..b359efd6f9 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
>  
>  typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
>  
> +static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> +                             vtd_page_walk_hook hook_fn, void *private)
Besides the name of this function (nit)
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> +{
> +    assert(hook_fn);
> +    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
> +                            entry->addr_mask, entry->perm);
> +    return hook_fn(entry, private);
> +}
> +
>  /**
>   * vtd_page_walk_level - walk over specific level for IOVA range
>   *
> @@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> +        entry.target_as = &address_space_memory;
> +        entry.iova = iova & subpage_mask;
> +        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> +        entry.addr_mask = ~subpage_mask;
> +
>          if (vtd_is_last_slpte(slpte, level)) {
> -            entry.target_as = &address_space_memory;
> -            entry.iova = iova & subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
>              entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
> -            entry.addr_mask = ~subpage_mask;
> -            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              if (!entry_valid && !notify_unmap) {
>                  trace_vtd_page_walk_skip_perm(iova, iova_next);
>                  goto next;
>              }
> -            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> -                                    entry.addr_mask, entry.perm);
> -            if (hook_fn) {
> -                ret = hook_fn(&entry, private);
> -                if (ret < 0) {
> -                    return ret;
> -                }
> +            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +            if (ret < 0) {
> +                return ret;
>              }
>          } else {
>              if (!entry_valid) {
> -                trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                if (notify_unmap) {
> +                    /*
> +                     * The whole entry is invalid; unmap it all.
> +                     * Translated address is meaningless, zero it.
> +                     */
> +                    entry.translated_addr = 0x0;
> +                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
> +                    if (ret < 0) {
> +                        return ret;
> +                    }
> +                } else {
> +                    trace_vtd_page_walk_skip_perm(iova, iova_next);
> +                }
>                  goto next;
>              }
>              ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
>

Patch

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb31de9416..b359efd6f9 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -722,6 +722,15 @@ static int vtd_iova_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
 
 typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
 
+static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
+                             vtd_page_walk_hook hook_fn, void *private)
+{
+    assert(hook_fn);
+    trace_vtd_page_walk_one(level, entry->iova, entry->translated_addr,
+                            entry->addr_mask, entry->perm);
+    return hook_fn(entry, private);
+}
+
 /**
  * vtd_page_walk_level - walk over specific level for IOVA range
  *
@@ -781,28 +790,37 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
+        entry.target_as = &address_space_memory;
+        entry.iova = iova & subpage_mask;
+        entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
+        entry.addr_mask = ~subpage_mask;
+
         if (vtd_is_last_slpte(slpte, level)) {
-            entry.target_as = &address_space_memory;
-            entry.iova = iova & subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
             entry.translated_addr = vtd_get_slpte_addr(slpte, aw);
-            entry.addr_mask = ~subpage_mask;
-            entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             if (!entry_valid && !notify_unmap) {
                 trace_vtd_page_walk_skip_perm(iova, iova_next);
                 goto next;
             }
-            trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
-                                    entry.addr_mask, entry.perm);
-            if (hook_fn) {
-                ret = hook_fn(&entry, private);
-                if (ret < 0) {
-                    return ret;
-                }
+            ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+            if (ret < 0) {
+                return ret;
             }
         } else {
             if (!entry_valid) {
-                trace_vtd_page_walk_skip_perm(iova, iova_next);
+                if (notify_unmap) {
+                    /*
+                     * The whole entry is invalid; unmap it all.
+                     * Translated address is meaningless, zero it.
+                     */
+                    entry.translated_addr = 0x0;
+                    ret = vtd_page_walk_one(&entry, level, hook_fn, private);
+                    if (ret < 0) {
+                        return ret;
+                    }
+                } else {
+                    trace_vtd_page_walk_skip_perm(iova, iova_next);
+                }
                 goto next;
             }
             ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte, aw), iova,
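
For reference, here is a hypothetical hook (the ex_walk_hook name and
its counting logic are invented for this illustration; it is not the
hook QEMU actually registers) showing the vtd_page_walk_hook contract
that vtd_page_walk_one() relies on: entry->perm is IOMMU_NONE for the
UNMAP notifications this patch adds, and a negative return value
aborts the walk:

/* IOMMUTLBEntry and IOMMU_NONE as in QEMU's include/exec/memory.h. */
static int ex_walk_hook(IOMMUTLBEntry *entry, void *private)
{
    unsigned long *unmaps = private;

    if (entry->perm == IOMMU_NONE) {
        /* An UNMAP notification, e.g. the whole-PDE case above: the
         * range [iova, iova + addr_mask] is no longer valid and
         * entry->translated_addr is meaningless (zeroed). */
        (*unmaps)++;
    }

    /* Returning a negative value would abort the page walk. */
    return 0;
}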