mbox series

[v3,00/12] intel-iommu: nested vIOMMU, cleanups, bug fixes

Message ID 20180517085927.24925-1-peterx@redhat.com
Headers show
Series intel-iommu: nested vIOMMU, cleanups, bug fixes | expand

Message

Peter Xu May 17, 2018, 8:59 a.m. UTC
(Hello, Jintack, Feel free to test this branch again against your scp
 error case when you got free time)

I rewrote some of the patches in V3.  Major changes:

- Dropped mergable interval tree, instead introduced IOVA tree, which
  is even simpler.

- Fix the scp error issue that Jintack reported.  Please see patches
  for detailed information.  That's the major reason to rewrite a few
  of the patches.  We use replay for domain flushes are possibly
  incorrect in the past.  The thing is that IOMMU replay has an
  "definition" that "we should only send MAP when new page detected",
  while for shadow page syncing we actually need something else than
  that.  So in this version I started to use a new
  vtd_sync_shadow_page_table() helper to do the page sync.

- Some other refines after the refactoring.

I'll add unit test for the IOVA tree after this series merged to make
sure we won't switch to another new tree implementaion...

The element size in the new IOVA tree should be around
sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
worst case usage ratio would be 72/4K=2%, which still seems acceptable
(it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
mapping in QEMU).

I did explicit test with scp this time, copying 1G sized file for >10
times on each of the following case:

- L1 guest, with vIOMMU and with assigned device
- L2 guest, without vIOMMU and with assigned device
- L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device

Please review.  Thanks,

(Below are old content from previous cover letter)

==========================

v2:
- fix patchew code style warnings
- interval tree: postpone malloc when inserting; simplify node remove
  a bit where proper [Jason]
- fix up comment and commit message for iommu lock patch [Kevin]
- protect context cache too using the iommu lock [Kevin, Jason]
- add vast comment in patch 8 to explain the modify-PTE problem
  [Jason, Kevin]

Online repo:

  https://github.com/xzpeter/qemu/tree/fix-vtd-dma

This series fixes several major problems that current code has:

- Issue 1: when getting very big PSI UNMAP invalidations, the current
  code is buggy in that we might skip the notification while actually
  we should always send that notification.

- Issue 2: IOTLB is not thread safe, while block dataplane can be
  accessing and updating it in parallel.

- Issue 3: For devices that only registered with UNMAP-only notifiers,
  we don't really need to do page walking for PSIs, we can directly
  deliver the notification down.  For example, vhost.

- Issue 4: unsafe window for MAP notified devices like vfio-pci (and
  in the future, vDPA as well).  The problem is that, now for domain
  invalidations we do this to make sure the shadow page tables are
  correctly synced:

  1. unmap the whole address space
  2. replay the whole address space, map existing pages

  However during step 1 and 2 there will be a very tiny window (it can
  be as big as 3ms) that the shadow page table is either invalid or
  incomplete (since we're rebuilding it up).  That's fatal error since
  devices never know that happending and it's still possible to DMA to
  memories.

Patch 1 fixes issue 1.  I put it at the first since it's picked from
an old post.

Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.

Patch 3 fixes issue 2.

Patch 4 fixes issue 3.

Patch 5-9 fix issue 4.  Here a very simple interval tree is
implemented based on Gtree.  It's different with general interval tree
in that it does not allow user to pass in private data (e.g.,
translated addresses).  However that benefits us that then we can
merge adjacent interval leaves so that hopefully we won't consume much
memory even if the mappings are a lot (that happens for nested virt -
when mapping the whole L2 guest RAM range, it can be at least in GBs).

Patch 10 is another big cleanup only can work after patch 9.

Tests:

- device assignments to L1, even L2 guests.  With this series applied
  (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
  we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
  with assigned devices and things will work.  We can't before.

- vhost smoke test for regression.

Please review.  Thanks,

Peter Xu (12):
  intel-iommu: send PSI always even if across PDEs
  intel-iommu: remove IntelIOMMUNotifierNode
  intel-iommu: add iommu lock
  intel-iommu: only do page walk for MAP notifiers
  intel-iommu: introduce vtd_page_walk_info
  intel-iommu: pass in address space when page walk
  intel-iommu: trace domain id during page walk
  util: implement simple iova tree
  intel-iommu: maintain per-device iova ranges
  intel-iommu: simplify page walk logic
  intel-iommu: new vtd_sync_shadow_page_table_range
  intel-iommu: new sync_shadow_page_table

 include/hw/i386/intel_iommu.h |  19 +-
 include/qemu/iova-tree.h      | 134 ++++++++++++
 hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
 util/iova-tree.c              | 114 ++++++++++
 MAINTAINERS                   |   6 +
 hw/i386/trace-events          |   5 +-
 util/Makefile.objs            |   1 +
 7 files changed, 556 insertions(+), 104 deletions(-)
 create mode 100644 include/qemu/iova-tree.h
 create mode 100644 util/iova-tree.c

Comments

Jintack Lim May 17, 2018, 7:49 p.m. UTC | #1
On Thu, May 17, 2018 at 4:59 AM, Peter Xu <peterx@redhat.com> wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)

Hi Peter,

>
> I rewrote some of the patches in V3.  Major changes:
>
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
>
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.

I checked that the scp problem I had (i.e. scp from the host to the
guest having virtual IOMMU and an assigned network device) was gone
with this patch series. Cool!

Please feel free to move this tag if this is not the right place!
Tested-by: Jintack Lim <jintack@cs.columbia.edu>

Thanks,
Jintack

>
> - Some other refines after the refactoring.
>
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
>
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
>
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
>
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
>
> Please review.  Thanks,
>
> (Below are old content from previous cover letter)
>
> ==========================
>
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
>
> Online repo:
>
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
>
> This series fixes several major problems that current code has:
>
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.
>
> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.
>
> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.
>
> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
>
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
>
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.
>
> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
>
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
>
> Patch 3 fixes issue 2.
>
> Patch 4 fixes issue 3.
>
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
>
> Patch 10 is another big cleanup only can work after patch 9.
>
> Tests:
>
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
>
> - vhost smoke test for regression.
>
> Please review.  Thanks,
>
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
>
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
>
> --
> 2.17.0
>
>
Michael S. Tsirkin May 17, 2018, 9:04 p.m. UTC | #2
On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)
> 
> I rewrote some of the patches in V3.  Major changes:
> 
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
> 
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> - Some other refines after the refactoring.
> 
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
> 
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
> 
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
> 
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> 
> Please review.  Thanks,
> 
> (Below are old content from previous cover letter)
> 
> ==========================
> 
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> 
> This series fixes several major problems that current code has:
> 
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.

security issue

> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.

security issue

> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.

optimization

> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
> 
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
> 
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.

correctness but not a security issue

> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
> 
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> 
> Patch 3 fixes issue 2.
> 
> Patch 4 fixes issue 3.
> 
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
> 
> Patch 10 is another big cleanup only can work after patch 9.


So 1-2 are needed on stable. 1-9 would be nice to have
there too, even though they are big and it looks risky.

> Tests:
> 
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
> 
> - vhost smoke test for regression.
> 
> Please review.  Thanks,
> 
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
> 
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
> 
> -- 
> 2.17.0
Michael S. Tsirkin May 17, 2018, 9:08 p.m. UTC | #3
On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> (Hello, Jintack, Feel free to test this branch again against your scp
>  error case when you got free time)
> 
> I rewrote some of the patches in V3.  Major changes:
> 
> - Dropped mergable interval tree, instead introduced IOVA tree, which
>   is even simpler.
> 
> - Fix the scp error issue that Jintack reported.  Please see patches
>   for detailed information.  That's the major reason to rewrite a few
>   of the patches.  We use replay for domain flushes are possibly
>   incorrect in the past.  The thing is that IOMMU replay has an
>   "definition" that "we should only send MAP when new page detected",
>   while for shadow page syncing we actually need something else than
>   that.  So in this version I started to use a new
>   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> - Some other refines after the refactoring.
> 
> I'll add unit test for the IOVA tree after this series merged to make
> sure we won't switch to another new tree implementaion...
> 
> The element size in the new IOVA tree should be around
> sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> worst case usage ratio would be 72/4K=2%, which still seems acceptable
> (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> mapping in QEMU).
> 
> I did explicit test with scp this time, copying 1G sized file for >10
> times on each of the following case:
> 
> - L1 guest, with vIOMMU and with assigned device
> - L2 guest, without vIOMMU and with assigned device
> - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> 
> Please review.  Thanks,
> 
> (Below are old content from previous cover letter)
> 
> ==========================
> 
> v2:
> - fix patchew code style warnings
> - interval tree: postpone malloc when inserting; simplify node remove
>   a bit where proper [Jason]
> - fix up comment and commit message for iommu lock patch [Kevin]
> - protect context cache too using the iommu lock [Kevin, Jason]
> - add vast comment in patch 8 to explain the modify-PTE problem
>   [Jason, Kevin]
> 
> Online repo:
> 
>   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> 
> This series fixes several major problems that current code has:
> 
> - Issue 1: when getting very big PSI UNMAP invalidations, the current
>   code is buggy in that we might skip the notification while actually
>   we should always send that notification.
> 
> - Issue 2: IOTLB is not thread safe, while block dataplane can be
>   accessing and updating it in parallel.
> 
> - Issue 3: For devices that only registered with UNMAP-only notifiers,
>   we don't really need to do page walking for PSIs, we can directly
>   deliver the notification down.  For example, vhost.
> 
> - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
>   in the future, vDPA as well).  The problem is that, now for domain
>   invalidations we do this to make sure the shadow page tables are
>   correctly synced:
> 
>   1. unmap the whole address space
>   2. replay the whole address space, map existing pages
> 
>   However during step 1 and 2 there will be a very tiny window (it can
>   be as big as 3ms) that the shadow page table is either invalid or
>   incomplete (since we're rebuilding it up).  That's fatal error since
>   devices never know that happending and it's still possible to DMA to
>   memories.
> 
> Patch 1 fixes issue 1.  I put it at the first since it's picked from
> an old post.
> 
> Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> 
> Patch 3 fixes issue 2.
> 
> Patch 4 fixes issue 3.
> 
> Patch 5-9 fix issue 4.  Here a very simple interval tree is
> implemented based on Gtree.  It's different with general interval tree
> in that it does not allow user to pass in private data (e.g.,
> translated addresses).  However that benefits us that then we can
> merge adjacent interval leaves so that hopefully we won't consume much
> memory even if the mappings are a lot (that happens for nested virt -
> when mapping the whole L2 guest RAM range, it can be at least in GBs).
> 
> Patch 10 is another big cleanup only can work after patch 9.

I think patch numbers are wrong somehow.

Given you also want to tweak one comment, could
you please repost with this fix, and also
in commit log for each patch
- Cc stable
- for security patches mention as much, if possible
  add data about the issue and its severity

> Tests:
> 
> - device assignments to L1, even L2 guests.  With this series applied
>   (and the kernel IOMMU patches: https://lkml.org/lkml/2018/4/18/5),
>   we can even nest vIOMMU now, e.g., we can specify vIOMMU in L2 guest
>   with assigned devices and things will work.  We can't before.
> 
> - vhost smoke test for regression.
> 
> Please review.  Thanks,
> 
> Peter Xu (12):
>   intel-iommu: send PSI always even if across PDEs
>   intel-iommu: remove IntelIOMMUNotifierNode
>   intel-iommu: add iommu lock
>   intel-iommu: only do page walk for MAP notifiers
>   intel-iommu: introduce vtd_page_walk_info
>   intel-iommu: pass in address space when page walk
>   intel-iommu: trace domain id during page walk
>   util: implement simple iova tree
>   intel-iommu: maintain per-device iova ranges
>   intel-iommu: simplify page walk logic
>   intel-iommu: new vtd_sync_shadow_page_table_range
>   intel-iommu: new sync_shadow_page_table
> 
>  include/hw/i386/intel_iommu.h |  19 +-
>  include/qemu/iova-tree.h      | 134 ++++++++++++
>  hw/i386/intel_iommu.c         | 381 +++++++++++++++++++++++++---------
>  util/iova-tree.c              | 114 ++++++++++
>  MAINTAINERS                   |   6 +
>  hw/i386/trace-events          |   5 +-
>  util/Makefile.objs            |   1 +
>  7 files changed, 556 insertions(+), 104 deletions(-)
>  create mode 100644 include/qemu/iova-tree.h
>  create mode 100644 util/iova-tree.c
> 
> -- 
> 2.17.0
Peter Xu May 18, 2018, 6:26 a.m. UTC | #4
On Thu, May 17, 2018 at 03:49:05PM -0400, Jintack Lim wrote:
> On Thu, May 17, 2018 at 4:59 AM, Peter Xu <peterx@redhat.com> wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> 
> Hi Peter,
> 
> >
> > I rewrote some of the patches in V3.  Major changes:
> >
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> >
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> 
> I checked that the scp problem I had (i.e. scp from the host to the
> guest having virtual IOMMU and an assigned network device) was gone
> with this patch series. Cool!
> 
> Please feel free to move this tag if this is not the right place!
> Tested-by: Jintack Lim <jintack@cs.columbia.edu>

Thanks for the quick feedback!

I'll temporarily consider putting it at the last patch of series if no
one jumps out and tells me another more correct way.  Also I'll
possibly make bold to append your suggested-by too to further claim
your contribution to this problem.

Regards,
Peter Xu May 18, 2018, 6:28 a.m. UTC | #5
On Fri, May 18, 2018 at 02:26:59PM +0800, Peter Xu wrote:

[...]

> I'll temporarily consider putting it at the last patch of series if no
> one jumps out and tells me another more correct way.  Also I'll
> possibly make bold to append your suggested-by too to further claim
                                    ^^^^^^^^^^^^
Sorry, I mean "Reported-by".

> your contribution to this problem.

Regards,
Peter Xu May 18, 2018, 6:30 a.m. UTC | #6
On Fri, May 18, 2018 at 12:08:01AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> > 
> > I rewrote some of the patches in V3.  Major changes:
> > 
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> > 
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> > 
> > - Some other refines after the refactoring.
> > 
> > I'll add unit test for the IOVA tree after this series merged to make
> > sure we won't switch to another new tree implementaion...
> > 
> > The element size in the new IOVA tree should be around
> > sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> > worst case usage ratio would be 72/4K=2%, which still seems acceptable
> > (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> > mapping in QEMU).
> > 
> > I did explicit test with scp this time, copying 1G sized file for >10
> > times on each of the following case:
> > 
> > - L1 guest, with vIOMMU and with assigned device
> > - L2 guest, without vIOMMU and with assigned device
> > - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> > 
> > Please review.  Thanks,
> > 
> > (Below are old content from previous cover letter)
> > 
> > ==========================
> > 
> > v2:
> > - fix patchew code style warnings
> > - interval tree: postpone malloc when inserting; simplify node remove
> >   a bit where proper [Jason]
> > - fix up comment and commit message for iommu lock patch [Kevin]
> > - protect context cache too using the iommu lock [Kevin, Jason]
> > - add vast comment in patch 8 to explain the modify-PTE problem
> >   [Jason, Kevin]
> > 
> > Online repo:
> > 
> >   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> > 
> > This series fixes several major problems that current code has:
> > 
> > - Issue 1: when getting very big PSI UNMAP invalidations, the current
> >   code is buggy in that we might skip the notification while actually
> >   we should always send that notification.
> > 
> > - Issue 2: IOTLB is not thread safe, while block dataplane can be
> >   accessing and updating it in parallel.
> > 
> > - Issue 3: For devices that only registered with UNMAP-only notifiers,
> >   we don't really need to do page walking for PSIs, we can directly
> >   deliver the notification down.  For example, vhost.
> > 
> > - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
> >   in the future, vDPA as well).  The problem is that, now for domain
> >   invalidations we do this to make sure the shadow page tables are
> >   correctly synced:
> > 
> >   1. unmap the whole address space
> >   2. replay the whole address space, map existing pages
> > 
> >   However during step 1 and 2 there will be a very tiny window (it can
> >   be as big as 3ms) that the shadow page table is either invalid or
> >   incomplete (since we're rebuilding it up).  That's fatal error since
> >   devices never know that happending and it's still possible to DMA to
> >   memories.
> > 
> > Patch 1 fixes issue 1.  I put it at the first since it's picked from
> > an old post.
> > 
> > Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> > 
> > Patch 3 fixes issue 2.
> > 
> > Patch 4 fixes issue 3.
> > 
> > Patch 5-9 fix issue 4.  Here a very simple interval tree is
> > implemented based on Gtree.  It's different with general interval tree
> > in that it does not allow user to pass in private data (e.g.,
> > translated addresses).  However that benefits us that then we can
> > merge adjacent interval leaves so that hopefully we won't consume much
> > memory even if the mappings are a lot (that happens for nested virt -
> > when mapping the whole L2 guest RAM range, it can be at least in GBs).
> > 
> > Patch 10 is another big cleanup only can work after patch 9.
> 
> I think patch numbers are wrong somehow.

Yes, this is the old cover letter, so the numbers do not match.

> 
> Given you also want to tweak one comment, could
> you please repost with this fix, and also
> in commit log for each patch
> - Cc stable
> - for security patches mention as much, if possible
>   add data about the issue and its severity

The rest I can try to do.  Thanks,
Peter Xu May 18, 2018, 6:34 a.m. UTC | #7
On Fri, May 18, 2018 at 12:04:04AM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2018 at 04:59:15PM +0800, Peter Xu wrote:
> > (Hello, Jintack, Feel free to test this branch again against your scp
> >  error case when you got free time)
> > 
> > I rewrote some of the patches in V3.  Major changes:
> > 
> > - Dropped mergable interval tree, instead introduced IOVA tree, which
> >   is even simpler.
> > 
> > - Fix the scp error issue that Jintack reported.  Please see patches
> >   for detailed information.  That's the major reason to rewrite a few
> >   of the patches.  We use replay for domain flushes are possibly
> >   incorrect in the past.  The thing is that IOMMU replay has an
> >   "definition" that "we should only send MAP when new page detected",
> >   while for shadow page syncing we actually need something else than
> >   that.  So in this version I started to use a new
> >   vtd_sync_shadow_page_table() helper to do the page sync.
> > 
> > - Some other refines after the refactoring.
> > 
> > I'll add unit test for the IOVA tree after this series merged to make
> > sure we won't switch to another new tree implementaion...
> > 
> > The element size in the new IOVA tree should be around
> > sizeof(GTreeNode + IOMMUTLBEntry) ~= (5*8+4*8) = 72 bytes.  So the
> > worst case usage ratio would be 72/4K=2%, which still seems acceptable
> > (it means 8G L2 guest will use 8G*2%=160MB as metadata to maintain the
> > mapping in QEMU).
> > 
> > I did explicit test with scp this time, copying 1G sized file for >10
> > times on each of the following case:
> > 
> > - L1 guest, with vIOMMU and with assigned device
> > - L2 guest, without vIOMMU and with assigned device
> > - L2 guest, with vIOMMU (so 3-layer nested IOMMU) and with assigned device
> > 
> > Please review.  Thanks,
> > 
> > (Below are old content from previous cover letter)
> > 
> > ==========================
> > 
> > v2:
> > - fix patchew code style warnings
> > - interval tree: postpone malloc when inserting; simplify node remove
> >   a bit where proper [Jason]
> > - fix up comment and commit message for iommu lock patch [Kevin]
> > - protect context cache too using the iommu lock [Kevin, Jason]
> > - add vast comment in patch 8 to explain the modify-PTE problem
> >   [Jason, Kevin]
> > 
> > Online repo:
> > 
> >   https://github.com/xzpeter/qemu/tree/fix-vtd-dma
> > 
> > This series fixes several major problems that current code has:
> > 
> > - Issue 1: when getting very big PSI UNMAP invalidations, the current
> >   code is buggy in that we might skip the notification while actually
> >   we should always send that notification.
> 
> security issue
> 
> > - Issue 2: IOTLB is not thread safe, while block dataplane can be
> >   accessing and updating it in parallel.
> 
> security issue
> 
> > - Issue 3: For devices that only registered with UNMAP-only notifiers,
> >   we don't really need to do page walking for PSIs, we can directly
> >   deliver the notification down.  For example, vhost.
> 
> optimization
> 
> > - Issue 4: unsafe window for MAP notified devices like vfio-pci (and
> >   in the future, vDPA as well).  The problem is that, now for domain
> >   invalidations we do this to make sure the shadow page tables are
> >   correctly synced:
> > 
> >   1. unmap the whole address space
> >   2. replay the whole address space, map existing pages
> > 
> >   However during step 1 and 2 there will be a very tiny window (it can
> >   be as big as 3ms) that the shadow page table is either invalid or
> >   incomplete (since we're rebuilding it up).  That's fatal error since
> >   devices never know that happending and it's still possible to DMA to
> >   memories.
> 
> correctness but not a security issue
> 
> > Patch 1 fixes issue 1.  I put it at the first since it's picked from
> > an old post.
> > 
> > Patch 2 is a cleanup to remove useless IntelIOMMUNotifierNode struct.
> > 
> > Patch 3 fixes issue 2.
> > 
> > Patch 4 fixes issue 3.
> > 
> > Patch 5-9 fix issue 4.  Here a very simple interval tree is
> > implemented based on Gtree.  It's different with general interval tree
> > in that it does not allow user to pass in private data (e.g.,
> > translated addresses).  However that benefits us that then we can
> > merge adjacent interval leaves so that hopefully we won't consume much
> > memory even if the mappings are a lot (that happens for nested virt -
> > when mapping the whole L2 guest RAM range, it can be at least in GBs).
> > 
> > Patch 10 is another big cleanup only can work after patch 9.
> 
> 
> So 1-2 are needed on stable. 1-9 would be nice to have
> there too, even though they are big and it looks risky.

Yes, although issue 4 is not a security issue, but it might cause DMA
errors and unusability of devices to happen.  I don't know very much
on the details of how stable tree should treat patches like this, but
considering that this whole series only touches VT-d code, and as you
mentioned merely all the patches would be nice to have even for
stable, I'll just CC stable for all the patches, refine messages and
repost.  Thanks,