Message ID: 20180504030811.28111-10-peterx@redhat.com
State: New
Series: intel-iommu: nested vIOMMU, cleanups, bug fixes
Hi Peter,

On 05/04/2018 05:08 AM, Peter Xu wrote:
> IOMMU replay was carried out before in many use cases, e.g., context
> cache invalidations and domain flushes. We used this mechanism to sync
> the shadow page table by first (1) unmapping the whole address space,
> then (2) walking the page table to remap what's in the table.
>
> This is very dangerous.
>
> The problem is that there is a very small window (in my measurement,
> it can be about 3ms) between steps (1) and (2) during which the device
> sees no (or an incomplete) page table. However, the device never knows
> that. This can cause DMA errors on devices, which assume the page
> table is always there.

But now that you have the QemuMutex, can a translate and an
invalidation occur concurrently? Don't the iotlb flush and replay
happen while the lock is held?

Thanks

Eric

> So the point is that MAP-typed notifiers (vfio-pci, for example) need
> the mapped page entries to always be there. We can never unmap
> existing page entries like we did in (1) above.
>
> The only solution is to remove step (1). We couldn't do that before
> because we didn't know which device pages were mapped and which were
> not, so we unmapped them all. Now, with the new IOVA tree, QEMU knows
> what has been mapped and what has not, so step (1) is no longer
> needed. Remove it.
>
> Note that after removing the global unmap flush, we need to notify
> unmaps during page walks.
>
> This should fix the DMA error problem that Jintack Lim reported with
> nested device assignment. The problem does not always happen; e.g., I
> cannot reproduce the error myself. However, the collected logs show
> that this is the likely cause of Jintack's problem.
> Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 16b3514afb..9439103cac 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>
>      /*
>       * The replay can be triggered by either a invalidation or a newly
> -     * created entry. No matter what, we release existing mappings
> -     * (it means flushing caches for UNMAP-only registers).
> +     * created entry.
>       */
> -    vtd_address_space_unmap(vtd_as, n);
>
>      if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
>          trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> @@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>                                    ce.hi, ce.lo);
>          if (vtd_as_notify_mappings(vtd_as)) {
>              /* This is required only for MAP typed notifiers */
> -            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
> +            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
>                            vtd_as);
> +        } else {
> +            vtd_address_space_unmap(vtd_as, n);
>          }
>      } else {
>          trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
> Hi Peter,
>
> On 05/04/2018 05:08 AM, Peter Xu wrote:
> > IOMMU replay was carried out before in many use cases, e.g., context
> > cache invalidations and domain flushes. We used this mechanism to
> > sync the shadow page table by first (1) unmapping the whole address
> > space, then (2) walking the page table to remap what's in the table.
> >
> > This is very dangerous.
> >
> > The problem is that there is a very small window (in my measurement,
> > it can be about 3ms) between steps (1) and (2) during which the
> > device sees no (or an incomplete) page table. However, the device
> > never knows that. This can cause DMA errors on devices, which assume
> > the page table is always there.
> But now that you have the QemuMutex, can a translate and an
> invalidation occur concurrently? Don't the iotlb flush and replay
> happen while the lock is held?

Note that when we are using vfio-pci devices we can't really know when
the device starts a DMA, since that happens entirely between the host
IOMMU and the hardware. That is, vfio-pci page translation happens in
the shadow page table, not in QEMU. So IMO we aren't protected by
anything.

Also, this patch is dropped in version 3; I did something else to
achieve a similar goal (by introducing a shadow page sync helper, which
DSIs now use instead of calling IOMMU replay here). Please have a look.

Thanks,
Hi Peter,

On 05/18/2018 08:06 AM, Peter Xu wrote:
> On Thu, May 17, 2018 at 07:23:33PM +0200, Auger Eric wrote:
>> Hi Peter,
>>
>> On 05/04/2018 05:08 AM, Peter Xu wrote:
>>> IOMMU replay was carried out before in many use cases, e.g., context
>>> cache invalidations and domain flushes. We used this mechanism to
>>> sync the shadow page table by first (1) unmapping the whole address
>>> space, then (2) walking the page table to remap what's in the table.
>>>
>>> This is very dangerous.
>>>
>>> The problem is that there is a very small window (in my measurement,
>>> it can be about 3ms) between steps (1) and (2) during which the
>>> device sees no (or an incomplete) page table. However, the device
>>> never knows that. This can cause DMA errors on devices, which assume
>>> the page table is always there.
>> But now that you have the QemuMutex, can a translate and an
>> invalidation occur concurrently? Don't the iotlb flush and replay
>> happen while the lock is held?
>
> Note that when we are using vfio-pci devices we can't really know when
> the device starts a DMA, since that happens entirely between the host
> IOMMU and the hardware.

Oh yes, that's fully relevant in the vfio-pci use case. Thank you for
the clarification.

> That is, vfio-pci page translation happens in the shadow page table,
> not in QEMU. So IMO we aren't protected by anything.
>
> Also, this patch is dropped in version 3; I did something else to
> achieve a similar goal (by introducing a shadow page sync helper,
> which DSIs now use instead of calling IOMMU replay here). Please have
> a look.

OK

Thanks

Eric
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 16b3514afb..9439103cac 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3003,10 +3003,8 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
 
     /*
      * The replay can be triggered by either a invalidation or a newly
-     * created entry. No matter what, we release existing mappings
-     * (it means flushing caches for UNMAP-only registers).
+     * created entry.
      */
-    vtd_address_space_unmap(vtd_as, n);
 
     if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
         trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
@@ -3015,8 +3013,10 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
                                   ce.hi, ce.lo);
         if (vtd_as_notify_mappings(vtd_as)) {
             /* This is required only for MAP typed notifiers */
-            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, false,
+            vtd_page_walk(&ce, 0, ~0ULL, vtd_replay_hook, (void *)n, true,
                           vtd_as);
+        } else {
+            vtd_address_space_unmap(vtd_as, n);
         }
     } else {
         trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
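The control flow after this patch can be sketched in simplified C. The helper names below are hypothetical stand-ins for vtd_page_walk() and vtd_address_space_unmap(), with counters added just to make the two paths observable; this is an illustration of the diff's logic, not the real QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Counters to make the two paths visible in this sketch. */
static int walks_done;
static int flushes_done;

/* Stand-in for vtd_page_walk(); the real function takes the context
 * entry, the IOVA range, the hook and the address space.  The flag
 * mirrors the argument the patch flips from false to true. */
static void page_walk(bool notify_unmaps)
{
    (void)notify_unmaps;
    walks_done++;
}

/* Stand-in for vtd_address_space_unmap(). */
static void address_space_unmap(void)
{
    flushes_done++;
}

/* After the patch: a MAP-typed notifier (e.g. vfio-pci) gets a page
 * walk whose hook now also reports unmaps, so existing mappings are
 * never blown away wholesale.  The global unmap survives only on the
 * UNMAP-only path, where it merely flushes cached translations. */
static void replay(bool has_map_notifier)
{
    if (has_map_notifier) {
        page_walk(true);          /* was: unmap-all first, then walk */
    } else {
        address_space_unmap();
    }
}
```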
IOMMU replay was carried out before in many use cases, e.g., context
cache invalidations and domain flushes. We used this mechanism to sync
the shadow page table by first (1) unmapping the whole address space,
then (2) walking the page table to remap what's in the table.

This is very dangerous.

The problem is that there is a very small window (in my measurement, it
can be about 3ms) between steps (1) and (2) during which the device sees
no (or an incomplete) page table. However, the device never knows that.
This can cause DMA errors on devices, which assume the page table is
always there.

So the point is that MAP-typed notifiers (vfio-pci, for example) need
the mapped page entries to always be there. We can never unmap existing
page entries like we did in (1) above.

The only solution is to remove step (1). We couldn't do that before
because we didn't know which device pages were mapped and which were
not, so we unmapped them all. Now, with the new IOVA tree, QEMU knows
what has been mapped and what has not, so step (1) is no longer needed.
Remove it.

Note that after removing the global unmap flush, we need to notify
unmaps during page walks.

This should fix the DMA error problem that Jintack Lim reported with
nested device assignment. The problem does not always happen; e.g., I
cannot reproduce the error myself. However, the collected logs show that
this is the likely cause of Jintack's problem.

Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
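The delta-sync idea the commit message describes — touch only what changed instead of unmapping everything and remapping — can be sketched with a toy stand-in for the IOVA tree. Everything here (the flat table, `sync_shadow`, the counters) is hypothetical; QEMU's real IOVA tree is an interval tree and the notifications go through IOMMU notifiers, but the invariant illustrated is the same: entries still present in the guest table are never unmapped, so the device never observes an empty window:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ENT 16

typedef struct {
    uint64_t iova;   /* start of a page-sized mapping */
    bool mapped;
} Ent;

typedef struct {
    Ent ent[MAX_ENT];
    int n;
} Shadow;

static int maps_sent, unmaps_sent;   /* notifier traffic, for illustration */

static bool shadow_has(const Shadow *s, uint64_t iova)
{
    for (int i = 0; i < s->n; i++) {
        if (s->ent[i].mapped && s->ent[i].iova == iova) {
            return true;
        }
    }
    return false;
}

/* Delta sync: MAP only what is new, UNMAP only what disappeared.
 * (Bounds checking against MAX_ENT is omitted for brevity.) */
static void sync_shadow(Shadow *s, const uint64_t *guest, int guest_n)
{
    /* Pass 1: anything in the guest table but not in the shadow -> MAP. */
    for (int i = 0; i < guest_n; i++) {
        if (!shadow_has(s, guest[i])) {
            s->ent[s->n++] = (Ent){ guest[i], true };
            maps_sent++;
        }
    }
    /* Pass 2: anything in the shadow but gone from the guest -> UNMAP. */
    for (int i = 0; i < s->n; i++) {
        bool still_there = false;
        for (int j = 0; j < guest_n; j++) {
            if (guest[j] == s->ent[i].iova) {
                still_there = true;
                break;
            }
        }
        if (s->ent[i].mapped && !still_there) {
            s->ent[i].mapped = false;
            unmaps_sent++;
        }
    }
}
```

For example, if the shadow holds 0x1000 and 0x2000 and the guest table now holds 0x2000 and 0x3000, a sync sends exactly one MAP (0x3000) and one UNMAP (0x1000), while 0x2000 stays mapped throughout.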