
[v2,09/10] intel-iommu: don't unmap all for shadow page table

Message ID 20180504030811.28111-10-peterx@redhat.com
State New
Series intel-iommu: nested vIOMMU, cleanups, bug fixes

Commit Message

Peter Xu May 4, 2018, 3:08 a.m. UTC
IOMMU replay was previously carried out in many use cases, e.g., context
cache invalidations and domain flushes.  We used this mechanism to sync
the shadow page table by first (1) unmapping the whole address space,
then (2) walking the page table to remap what's in the table.

This is very dangerous.

The problem is that there is a small window (in my measurement, about
3ms) between steps (1) and (2) during which the device will see no (or
an incomplete) device page table.  However, the device never knows
that.  This can cause DMA errors on devices, which assume the page
table is always there.

So the point is that MAP typed notifiers (vfio-pci, for example) need
the mapped page entries to always be there.  We can never unmap
existing page entries the way we did in step (1) above.
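
The check for such notifiers used in the diff below boils down to a
flag test.  A minimal sketch, assuming the v2 helper name
vtd_as_notify_mappings() and a notifier_flags field on VTDAddressSpace
(illustrative, not the literal series code):

/*
 * Illustrative sketch: an address space has a MAP typed notifier when
 * someone (e.g. vfio-pci) registered with IOMMU_NOTIFIER_MAP.  Such a
 * notifier needs live mappings preserved, never flushed wholesale.
 */
static bool vtd_as_notify_mappings(VTDAddressSpace *as)
{
    return as->notifier_flags & IOMMU_NOTIFIER_MAP;
}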

The only solution is to remove step (1).  We couldn't do that before
because we didn't know which device pages were mapped and which were
not, so we unmapped them all.  Now, with the new IOVA tree, QEMU knows
what has been mapped and what has not, so step (1) is no longer
needed.  Remove it.

Note that after removing the global unmap flushing, we need to notify
UNMAP during the page walks instead.
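
Concretely, each walked entry can be diffed against the IOVA tree so an
UNMAP is only emitted for ranges QEMU had actually mapped.  A minimal
sketch, assuming the util/iova-tree.h API from this series (DMAMap,
iova_tree_find/insert/remove) plus memory_region_notify_one(); the
helper name is made up for illustration:

/*
 * Minimal sketch (illustrative, not the series code): reconcile one
 * walked entry against the IOVA tree, so UNMAP is only ever sent for
 * ranges QEMU itself previously mapped.
 */
static int vtd_sync_one_entry(IOMMUTLBEntry *entry, IOMMUNotifier *n,
                              IOVATree *tree)
{
    DMAMap target = {
        .iova = entry->iova,
        .size = entry->addr_mask,        /* DMAMap size is length - 1 */
        .translated_addr = entry->translated_addr,
    };
    const DMAMap *old = iova_tree_find(tree, &target);

    if (entry->perm != IOMMU_NONE) {
        if (old) {
            return 0;                /* already shadowed: keep it live */
        }
        iova_tree_insert(tree, &target);
        memory_region_notify_one(n, entry);           /* notify MAP */
    } else if (old) {
        iova_tree_remove(tree, &target);
        memory_region_notify_one(n, entry);           /* notify UNMAP */
    }
    return 0;
}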

This should fix the DMA error problem that Jintack Lim reported with
nested device assignment.  The problem does not always happen, e.g., I
cannot reproduce the error myself, but the collected logs show that
this is the likely cause of Jintack's problem.

Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Eric Auger May 17, 2018, 5:23 p.m. UTC | #1
Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> IOMMU replay was previously carried out in many use cases, e.g., context
> cache invalidations and domain flushes.  We used this mechanism to sync
> the shadow page table by first (1) unmapping the whole address space,
> then (2) walking the page table to remap what's in the table.
> 
> This is very dangerous.
> 
> The problem is that there is a small window (in my measurement, about
> 3ms) between steps (1) and (2) during which the device will see no (or
> an incomplete) device page table.  However, the device never knows
> that.  This can cause DMA errors on devices, which assume the page
> table is always there.
But now that you have the QemuMutex, can a translate and an
invalidation occur concurrently?  Don't the IOTLB flush and replay
happen while the lock is held?

Thanks

Eric
> 
> So the point is that MAP typed notifiers (vfio-pci, for example) need
> the mapped page entries to always be there.  We can never unmap
> existing page entries the way we did in step (1) above.
> 
> The only solution is to remove step (1).  We couldn't do that before
> because we didn't know which device pages were mapped and which were
> not, so we unmapped them all.  Now, with the new IOVA tree, QEMU knows
> what has been mapped and what has not, so step (1) is no longer
> needed.  Remove it.
> 
> Note that after removing the global unmap flushing, we need to notify
> UNMAP during the page walks instead.
> 
> This should fix the DMA error problem that Jintack Lim reported with
> nested device assignment.  The problem does not always happen, e.g., I
> cannot reproduce the error myself, but the collected logs show that
> this is the likely cause of Jintack's problem.
> 
> Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 16b3514afb..9439103cac 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>  
>      /*
>       * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> +     * created entry.
>       */
> -    vtd_address_space_unmap(vtd_as, n);
>  
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> @@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                    ce.hi, ce.lo);
>          if (vtd_as_notify_mappings(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
> -            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
>                            vtd_as);
> +        } else {
> +            vtd_address_space_unmap(vtd_as, n);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
>
Peter Xu May 18, 2018, 6:06 a.m. UTC | #2
On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
> Hi Peter,
> 
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > IOMMU replay was previously carried out in many use cases, e.g., context
> > cache invalidations and domain flushes.  We used this mechanism to sync
> > the shadow page table by first (1) unmapping the whole address space,
> > then (2) walking the page table to remap what's in the table.
> > 
> > This is very dangerous.
> > 
> > The problem is that there is a small window (in my measurement, about
> > 3ms) between steps (1) and (2) during which the device will see no (or
> > an incomplete) device page table.  However, the device never knows
> > that.  This can cause DMA errors on devices, which assume the page
> > table is always there.
> But now that you have the QemuMutex, can a translate and an
> invalidation occur concurrently?  Don't the IOTLB flush and replay
> happen while the lock is held?

Note that when we are using vfio-pci devices we can't really know when
the device starts a DMA, since that happens entirely between the host
IOMMU and the hardware.  That is, vfio-pci device page translation
happens in the shadow page table, not in QEMU, so IMO we aren't
protected by anything.
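
To make that concrete, every MAP/UNMAP notification from the vIOMMU is
pushed into the host IOMMU by vfio, and after that the hardware walks
the resulting shadow table on its own.  A simplified sketch of the vfio
side (based on hw/vfio/common.c, with error handling and some details
dropped):

/* Simplified from hw/vfio/common.c (error handling omitted): turn a
 * vIOMMU notification into a host IOMMU (shadow table) update. */
static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
    void *vaddr;
    bool read_only;

    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
            return;
        }
        /* Pin the backing pages and install the translation */
        vfio_dma_map(giommu->container, iotlb->iova,
                     iotlb->addr_mask + 1, vaddr, read_only);
    } else {
        /* Drop the range from the host IOMMU */
        vfio_dma_unmap(giommu->container, iotlb->iova,
                       iotlb->addr_mask + 1);
    }
}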

Also, this patch is dropped in version 3, where I did something else to
achieve a similar goal (introducing a shadow page sync helper, which
DSIs then use instead of calling IOMMU replay here).  Please have a
look.  Thanks,
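
For reference, a rough shape of that helper, assuming the v3 name
vtd_sync_shadow_page_table(); this is a simplified sketch built from
this file's existing helpers, not the literal v3 patch:

/*
 * Simplified sketch (not the literal v3 patch), reusing this file's
 * existing helpers: re-walk the page table with notify_unmap enabled
 * when the context entry is valid, and fall back to a full unmap only
 * when the device sits behind no valid context entry.
 */
static int vtd_sync_shadow_page_table(VTDAddressSpace *vtd_as)
{
    VTDContextEntry ce;
    IOMMUNotifier *n;

    if (vtd_dev_to_context_entry(vtd_as->iommu_state,
                                 pci_bus_num(vtd_as->bus),
                                 vtd_as->devfn, &ce) == 0) {
        IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n,
                          true /* notify_unmap */, vtd_as);
        }
    } else {
        IOMMU_NOTIFIER_FOREACH(n, &vtd_as->iommu) {
            vtd_address_space_unmap(vtd_as, n);
        }
    }
    return 0;
}
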
Eric Auger May 18, 2018, 7:31 a.m. UTC | #3
Hi Peter,

On 05/18/2018 08:06 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> IOMMU replay was previously carried out in many use cases, e.g., context
>>> cache invalidations and domain flushes.  We used this mechanism to sync
>>> the shadow page table by first (1) unmapping the whole address space,
>>> then (2) walking the page table to remap what's in the table.
>>>
>>> This is very dangerous.
>>>
>>> The problem is that there is a small window (in my measurement, about
>>> 3ms) between steps (1) and (2) during which the device will see no (or
>>> an incomplete) device page table.  However, the device never knows
>>> that.  This can cause DMA errors on devices, which assume the page
>>> table is always there.
>> But now that you have the QemuMutex, can a translate and an
>> invalidation occur concurrently?  Don't the IOTLB flush and replay
>> happen while the lock is held?
> 
> Note that when we are using vfio-pci devices we can't really know when
> the device starts a DMA, since that happens entirely between the host
> IOMMU and the hardware.

Oh yes, that's fully relevant in the vfio-pci use case.  Thank you for
the clarification.

> That is, vfio-pci device page translation happens in the shadow page
> table, not in QEMU, so IMO we aren't protected by anything.
> 
> Also, this patch is dropped in version 3, where I did something else to
> achieve a similar goal (introducing a shadow page sync helper, which
> DSIs then use instead of calling IOMMU replay here).  Please have a
> look.  Thanks,

OK

Thanks

Eric
>

Patch

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 16b3514afb..9439103cac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3003,10 +3003,8 @@  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 
     /*
      * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
+     * created entry.
      */
-    vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3015,8 +3013,10 @@  static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
                           vtd_as);
+        } else {
+            vtd_address_space_unmap(vtd_as, n);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),