diff mbox series

[v3] intel-iommu: Document iova_tree

Message ID 20221206221348.396020-1-peterx@redhat.com
State New
Headers show
Series [v3] intel-iommu: Document iova_tree | expand

Commit Message

Peter Xu Dec. 6, 2022, 10:13 p.m. UTC
It seems not super clear on when iova_tree is used, and why.  Add a rich
comment above iova_tree to track why we needed the iova_tree, and when we
need it.

Also comment for the map/unmap messages, on how they're used and
implications (e.g. unmap can be larger than the mapped ranges).

Suggested-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3:
- Adjust according to Eric's comment
---
 include/exec/memory.h         | 28 ++++++++++++++++++++++++++
 include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
 2 files changed, 65 insertions(+), 1 deletion(-)

Comments

Eric Auger Dec. 7, 2022, 9:51 a.m. UTC | #1
Hi Peter,

On 12/6/22 23:13, Peter Xu wrote:
> It seems not super clear on when iova_tree is used, and why.  Add a rich
> comment above iova_tree to track why we needed the iova_tree, and when we
> need it.
>
> Also comment for the map/unmap messages, on how they're used and
> implications (e.g. unmap can be larger than the mapped ranges).
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
> v3:
> - Adjust according to Eric's comment
> ---
>  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
>  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
>  2 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 91f8a2395a..269ecb873b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
>  /*
>   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
>   * register with one or multiple IOMMU Notifier capability bit(s).
> + *
> + * Normally there're two use cases for the notifiers:
> + *
> + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> + *       tables, it needs to register with both MAP|UNMAP notifies (which
> + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> + *
> + *       Regarding to accurate synchronization, it's when the notified
> + *       device maintains a shadow page table and must be notified on each
> + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> + *       events (e.g. VFIO). Both notifications must be accurate so that
> + *       the shadow page table is fully in sync with the guest view.
> + *
> + *   (2) When the device doesn't need accurate synchronizations of the
> + *       vIOMMU page tables, it needs to register only with UNMAP or
> + *       DEVIOTLB_UNMAP notifies.
> + *
> + *       It's when the device maintains a cache of IOMMU translations
> + *       (IOTLB) and is able to fill that cache by requesting translations
> + *       from the vIOMMU through a protocol similar to ATS (Address
> + *       Translation Service).
> + *
> + *       Note that in this mode the vIOMMU will not maintain a shadowed
> + *       page table for the address space, and the UNMAP messages can be
> + *       actually larger than the real invalidations (just like how the
> + *       Linux IOMMU driver normally works, where an invalidation can be
> + *       enlarged as long as it still covers the target range).  The IOMMU
> + *       notifiee should be able to take care of over-sized invalidations.
>   */
>  typedef enum {
>      IOMMU_NOTIFIER_NONE = 0,
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 46d973e629..89dcbc5e1e 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -109,7 +109,43 @@ struct VTDAddressSpace {
>      QLIST_ENTRY(VTDAddressSpace) next;
>      /* Superset of notifier flags that this address space has */
>      IOMMUNotifierFlag notifier_flags;
> -    IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
> +    /*
> +     * @iova_tree traces mapped IOVA ranges.
> +     *
> +     * The tree is not needed if no MAP notifier is registered with current
> +     * VTD address space, because all guest invalidate commands can be
> +     * directly passed to the IOMMU UNMAP notifiers without any further
> +     * reshuffling.
> +     *
> +     * The tree OTOH is required for MAP typed iommu notifiers for a few
> +     * reasons.
> +     *
> +     * Firstly, there's no way to identify whether an PSI (Page Selective
> +     * Invalidations) or DSI (Domain Selective Invalidations) event is an
> +     * MAP or UNMAP event within the message itself.  Without having prior
> +     * knowledge of existing state vIOMMU doesn't know whether it should
> +     * notify MAP or UNMAP for a PSI message it received when caching mode
> +     * is enabled (for MAP notifiers).
> +     *
> +     * Secondly, PSI messages received from guest driver can be enlarged in
> +     * range, covers but not limited to what the guest driver wanted to
> +     * invalidate.  When the range to invalidates gets bigger than the
> +     * limit of a PSI message, it can even become a DSI which will
> +     * invalidate the whole domain.  If the vIOMMU directly notifies the
> +     * registered device with the unmodified range, it may confuse the
> +     * registered drivers (e.g. vfio-pci) on either:
> +     *
> +     *   (1) Trying to map the same region more than once (for
> +     *       VFIO_IOMMU_MAP_DMA, -EEXIST will trigger), or,
> +     *
> +     *   (2) Trying to UNMAP a range that is still partially mapped.
> +     *
> +     * That accuracy is not required for UNMAP-only notifiers, but it is a
> +     * must-to-have for notifiers registered with MAP events, because the
> +     * vIOMMU needs to make sure the shadow page table is always in sync
> +     * with the guest IOMMU pgtables for a device.
> +     */
> +    IOVATree *iova_tree;
>  };
>  
>  struct VTDIOTLBEntry {
Jason Wang Dec. 23, 2022, 7:48 a.m. UTC | #2
On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
>
> It seems not super clear on when iova_tree is used, and why.  Add a rich
> comment above iova_tree to track why we needed the iova_tree, and when we
> need it.
>
> Also comment for the map/unmap messages, on how they're used and
> implications (e.g. unmap can be larger than the mapped ranges).
>
> Suggested-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v3:
> - Adjust according to Eric's comment
> ---
>  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
>  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
>  2 files changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 91f8a2395a..269ecb873b 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
>  /*
>   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
>   * register with one or multiple IOMMU Notifier capability bit(s).
> + *
> + * Normally there're two use cases for the notifiers:
> + *
> + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> + *       tables, it needs to register with both MAP|UNMAP notifies (which
> + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> + *
> + *       Regarding to accurate synchronization, it's when the notified
> + *       device maintains a shadow page table and must be notified on each
> + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> + *       events (e.g. VFIO). Both notifications must be accurate so that
> + *       the shadow page table is fully in sync with the guest view.
> + *
> + *   (2) When the device doesn't need accurate synchronizations of the
> + *       vIOMMU page tables, it needs to register only with UNMAP or
> + *       DEVIOTLB_UNMAP notifies.
> + *
> + *       It's when the device maintains a cache of IOMMU translations
> + *       (IOTLB) and is able to fill that cache by requesting translations
> + *       from the vIOMMU through a protocol similar to ATS (Address
> + *       Translation Service).
> + *
> + *       Note that in this mode the vIOMMU will not maintain a shadowed
> + *       page table for the address space, and the UNMAP messages can be
> + *       actually larger than the real invalidations (just like how the
> + *       Linux IOMMU driver normally works, where an invalidation can be
> + *       enlarged as long as it still covers the target range).  The IOMMU

Just spot this when testing your fix for DSI:

        assert(entry->iova >= notifier->start && entry_end <= notifier->end);

Do we need to remove this (but it seems a partial revert of
03c7140c1a0336af3d4fca768de791b9c0e2b128)?

Thanks

> + *       notifiee should be able to take care of over-sized invalidations.
>   */
>  typedef enum {
>      IOMMU_NOTIFIER_NONE = 0,
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 46d973e629..89dcbc5e1e 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -109,7 +109,43 @@ struct VTDAddressSpace {
>      QLIST_ENTRY(VTDAddressSpace) next;
>      /* Superset of notifier flags that this address space has */
>      IOMMUNotifierFlag notifier_flags;
> -    IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
> +    /*
> +     * @iova_tree traces mapped IOVA ranges.
> +     *
> +     * The tree is not needed if no MAP notifier is registered with current
> +     * VTD address space, because all guest invalidate commands can be
> +     * directly passed to the IOMMU UNMAP notifiers without any further
> +     * reshuffling.
> +     *
> +     * The tree OTOH is required for MAP typed iommu notifiers for a few
> +     * reasons.
> +     *
> +     * Firstly, there's no way to identify whether an PSI (Page Selective
> +     * Invalidations) or DSI (Domain Selective Invalidations) event is an
> +     * MAP or UNMAP event within the message itself.  Without having prior
> +     * knowledge of existing state vIOMMU doesn't know whether it should
> +     * notify MAP or UNMAP for a PSI message it received when caching mode
> +     * is enabled (for MAP notifiers).
> +     *
> +     * Secondly, PSI messages received from guest driver can be enlarged in
> +     * range, covers but not limited to what the guest driver wanted to
> +     * invalidate.  When the range to invalidates gets bigger than the
> +     * limit of a PSI message, it can even become a DSI which will
> +     * invalidate the whole domain.  If the vIOMMU directly notifies the
> +     * registered device with the unmodified range, it may confuse the
> +     * registered drivers (e.g. vfio-pci) on either:
> +     *
> +     *   (1) Trying to map the same region more than once (for
> +     *       VFIO_IOMMU_MAP_DMA, -EEXIST will trigger), or,
> +     *
> +     *   (2) Trying to UNMAP a range that is still partially mapped.
> +     *
> +     * That accuracy is not required for UNMAP-only notifiers, but it is a
> +     * must-to-have for notifiers registered with MAP events, because the
> +     * vIOMMU needs to make sure the shadow page table is always in sync
> +     * with the guest IOMMU pgtables for a device.
> +     */
> +    IOVATree *iova_tree;
>  };
>
>  struct VTDIOTLBEntry {
> --
> 2.37.3
>
Peter Xu Dec. 23, 2022, 4:26 p.m. UTC | #3
On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > comment above iova_tree to track why we needed the iova_tree, and when we
> > need it.
> >
> > Also comment for the map/unmap messages, on how they're used and
> > implications (e.g. unmap can be larger than the mapped ranges).
> >
> > Suggested-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > v3:
> > - Adjust according to Eric's comment
> > ---
> >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> >  2 files changed, 65 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 91f8a2395a..269ecb873b 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> >  /*
> >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> >   * register with one or multiple IOMMU Notifier capability bit(s).
> > + *
> > + * Normally there're two use cases for the notifiers:
> > + *
> > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > + *
> > + *       Regarding to accurate synchronization, it's when the notified
> > + *       device maintains a shadow page table and must be notified on each
> > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > + *       the shadow page table is fully in sync with the guest view.
> > + *
> > + *   (2) When the device doesn't need accurate synchronizations of the
> > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > + *       DEVIOTLB_UNMAP notifies.
> > + *
> > + *       It's when the device maintains a cache of IOMMU translations
> > + *       (IOTLB) and is able to fill that cache by requesting translations
> > + *       from the vIOMMU through a protocol similar to ATS (Address
> > + *       Translation Service).
> > + *
> > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > + *       page table for the address space, and the UNMAP messages can be
> > + *       actually larger than the real invalidations (just like how the
> > + *       Linux IOMMU driver normally works, where an invalidation can be
> > + *       enlarged as long as it still covers the target range).  The IOMMU
> 
> Just spot this when testing your fix for DSI:
> 
>         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> 
> Do we need to remove this (but it seems a partial revert of
> 03c7140c1a0336af3d4fca768de791b9c0e2b128)?

Replied in the othe thread.

I assume this documentation patch is still correct, am I right?  It's
talking about the possibility of enlarged invalidation range sent from the
guest and vIOMMU.  That should still not be bigger than the registered
range in iommu notifiers (even if bigger than the actual unmapped range).

Thanks,
Jason Wang Dec. 26, 2022, 4:09 a.m. UTC | #4
On Sat, Dec 24, 2022 at 12:26 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> > On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > > comment above iova_tree to track why we needed the iova_tree, and when we
> > > need it.
> > >
> > > Also comment for the map/unmap messages, on how they're used and
> > > implications (e.g. unmap can be larger than the mapped ranges).
> > >
> > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > > v3:
> > > - Adjust according to Eric's comment
> > > ---
> > >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> > >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 65 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 91f8a2395a..269ecb873b 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> > >  /*
> > >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> > >   * register with one or multiple IOMMU Notifier capability bit(s).
> > > + *
> > > + * Normally there're two use cases for the notifiers:
> > > + *
> > > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > > + *
> > > + *       Regarding to accurate synchronization, it's when the notified
> > > + *       device maintains a shadow page table and must be notified on each
> > > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > > + *       the shadow page table is fully in sync with the guest view.
> > > + *
> > > + *   (2) When the device doesn't need accurate synchronizations of the
> > > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > > + *       DEVIOTLB_UNMAP notifies.
> > > + *
> > > + *       It's when the device maintains a cache of IOMMU translations
> > > + *       (IOTLB) and is able to fill that cache by requesting translations
> > > + *       from the vIOMMU through a protocol similar to ATS (Address
> > > + *       Translation Service).
> > > + *
> > > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > > + *       page table for the address space, and the UNMAP messages can be
> > > + *       actually larger than the real invalidations (just like how the
> > > + *       Linux IOMMU driver normally works, where an invalidation can be
> > > + *       enlarged as long as it still covers the target range).  The IOMMU
> >
> > Just spot this when testing your fix for DSI:
> >
> >         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> >
> > Do we need to remove this (but it seems a partial revert of
> > 03c7140c1a0336af3d4fca768de791b9c0e2b128)?
>
> Replied in the othe thread.
>
> I assume this documentation patch is still correct, am I right?  It's
> talking about the possibility of enlarged invalidation range sent from the
> guest and vIOMMU.  That should still not be bigger than the registered
> range in iommu notifiers (even if bigger than the actual unmapped range).

Adding Eugenio.

So I think we need to evaluate the possible side effects to all the
current nmap notifiers. For example the vfio_iommu_map_notify().

And in another thread, if we crop the size, it basically means the
notifier itself will still assume the range is valid, which is not
what is documented in this patch.

What's more interesting I see smmu had:

/* Unmap the whole notifier's range */
static void smmu_unmap_notifier_range(IOMMUNotifier *n)
{
    IOMMUTLBEvent event;

    event.type = IOMMU_NOTIFIER_UNMAP;
    event.entry.target_as = &address_space_memory;
    event.entry.iova = n->start;
    event.entry.perm = IOMMU_NONE;
    event.entry.addr_mask = n->end - n->start;

    memory_region_notify_iommu_one(n, &event);
}

So it looks to me it's more safe to do something similar for vtd first.

Btw, I forgot the reason why we need to crop the size in the case of
device IOTLB, Eguenio do you know that?

Thanks

>
> Thanks,
>
> --
> Peter Xu
>
Peter Xu Jan. 3, 2023, 5:30 p.m. UTC | #5
On Mon, Dec 26, 2022 at 12:09:52PM +0800, Jason Wang wrote:
> On Sat, Dec 24, 2022 at 12:26 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> > > On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > > > comment above iova_tree to track why we needed the iova_tree, and when we
> > > > need it.
> > > >
> > > > Also comment for the map/unmap messages, on how they're used and
> > > > implications (e.g. unmap can be larger than the mapped ranges).
> > > >
> > > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > > v3:
> > > > - Adjust according to Eric's comment
> > > > ---
> > > >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> > > >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> > > >  2 files changed, 65 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > > index 91f8a2395a..269ecb873b 100644
> > > > --- a/include/exec/memory.h
> > > > +++ b/include/exec/memory.h
> > > > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> > > >  /*
> > > >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> > > >   * register with one or multiple IOMMU Notifier capability bit(s).
> > > > + *
> > > > + * Normally there're two use cases for the notifiers:
> > > > + *
> > > > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > > > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > > > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > > > + *
> > > > + *       Regarding to accurate synchronization, it's when the notified
> > > > + *       device maintains a shadow page table and must be notified on each
> > > > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > > > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > > > + *       the shadow page table is fully in sync with the guest view.
> > > > + *
> > > > + *   (2) When the device doesn't need accurate synchronizations of the
> > > > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > > > + *       DEVIOTLB_UNMAP notifies.
> > > > + *
> > > > + *       It's when the device maintains a cache of IOMMU translations
> > > > + *       (IOTLB) and is able to fill that cache by requesting translations
> > > > + *       from the vIOMMU through a protocol similar to ATS (Address
> > > > + *       Translation Service).
> > > > + *
> > > > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > > > + *       page table for the address space, and the UNMAP messages can be
> > > > + *       actually larger than the real invalidations (just like how the
> > > > + *       Linux IOMMU driver normally works, where an invalidation can be
> > > > + *       enlarged as long as it still covers the target range).  The IOMMU
> > >
> > > Just spot this when testing your fix for DSI:
> > >
> > >         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > >
> > > Do we need to remove this (but it seems a partial revert of
> > > 03c7140c1a0336af3d4fca768de791b9c0e2b128)?
> >
> > Replied in the othe thread.
> >
> > I assume this documentation patch is still correct, am I right?  It's
> > talking about the possibility of enlarged invalidation range sent from the
> > guest and vIOMMU.  That should still not be bigger than the registered
> > range in iommu notifiers (even if bigger than the actual unmapped range).
> 
> Adding Eugenio.
> 
> So I think we need to evaluate the possible side effects to all the
> current nmap notifiers. For example the vfio_iommu_map_notify().
> 
> And in another thread, if we crop the size, it basically means the
> notifier itself will still assume the range is valid, which is not
> what is documented in this patch.
> 
> What's more interesting I see smmu had:
> 
> /* Unmap the whole notifier's range */
> static void smmu_unmap_notifier_range(IOMMUNotifier *n)
> {
>     IOMMUTLBEvent event;
> 
>     event.type = IOMMU_NOTIFIER_UNMAP;
>     event.entry.target_as = &address_space_memory;
>     event.entry.iova = n->start;
>     event.entry.perm = IOMMU_NONE;
>     event.entry.addr_mask = n->end - n->start;
> 
>     memory_region_notify_iommu_one(n, &event);
> }
> 
> So it looks to me it's more safe to do something similar for vtd first.

Jason, could you elaborate more on this one?

Meanwhile, I don't immediately see what's the side effect you mentioned for
vfio map events.  I thought any map event should always be in the notifier
range anyway because map event only comes in page sizes and generated by
vt-d page walkers (not guest driver, which is IIUC the only place where the
range of invalidation can be enlarged).  So I don't expect any functional
change to map events if we decide to crop the ranges unconditionally.  Did
I miss anything?

Thanks,

> 
> Btw, I forgot the reason why we need to crop the size in the case of
> device IOTLB, Eguenio do you know that?
> 
> Thanks
> 
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
>
Jason Wang Jan. 4, 2023, 4:15 a.m. UTC | #6
On Wed, Jan 4, 2023 at 1:30 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Dec 26, 2022 at 12:09:52PM +0800, Jason Wang wrote:
> > On Sat, Dec 24, 2022 at 12:26 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> > > > On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > > > > comment above iova_tree to track why we needed the iova_tree, and when we
> > > > > need it.
> > > > >
> > > > > Also comment for the map/unmap messages, on how they're used and
> > > > > implications (e.g. unmap can be larger than the mapped ranges).
> > > > >
> > > > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > > v3:
> > > > > - Adjust according to Eric's comment
> > > > > ---
> > > > >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> > > > >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> > > > >  2 files changed, 65 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > > > index 91f8a2395a..269ecb873b 100644
> > > > > --- a/include/exec/memory.h
> > > > > +++ b/include/exec/memory.h
> > > > > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> > > > >  /*
> > > > >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> > > > >   * register with one or multiple IOMMU Notifier capability bit(s).
> > > > > + *
> > > > > + * Normally there're two use cases for the notifiers:
> > > > > + *
> > > > > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > > > > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > > > > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > > > > + *
> > > > > + *       Regarding to accurate synchronization, it's when the notified
> > > > > + *       device maintains a shadow page table and must be notified on each
> > > > > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > > > > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > > > > + *       the shadow page table is fully in sync with the guest view.
> > > > > + *
> > > > > + *   (2) When the device doesn't need accurate synchronizations of the
> > > > > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > > > > + *       DEVIOTLB_UNMAP notifies.
> > > > > + *
> > > > > + *       It's when the device maintains a cache of IOMMU translations
> > > > > + *       (IOTLB) and is able to fill that cache by requesting translations
> > > > > + *       from the vIOMMU through a protocol similar to ATS (Address
> > > > > + *       Translation Service).
> > > > > + *
> > > > > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > > > > + *       page table for the address space, and the UNMAP messages can be
> > > > > + *       actually larger than the real invalidations (just like how the
> > > > > + *       Linux IOMMU driver normally works, where an invalidation can be
> > > > > + *       enlarged as long as it still covers the target range).  The IOMMU
> > > >
> > > > Just spot this when testing your fix for DSI:
> > > >
> > > >         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > >
> > > > Do we need to remove this (but it seems a partial revert of
> > > > 03c7140c1a0336af3d4fca768de791b9c0e2b128)?
> > >
> > > Replied in the othe thread.
> > >
> > > I assume this documentation patch is still correct, am I right?  It's
> > > talking about the possibility of enlarged invalidation range sent from the
> > > guest and vIOMMU.  That should still not be bigger than the registered
> > > range in iommu notifiers (even if bigger than the actual unmapped range).
> >
> > Adding Eugenio.
> >
> > So I think we need to evaluate the possible side effects to all the
> > current nmap notifiers. For example the vfio_iommu_map_notify().
> >
> > And in another thread, if we crop the size, it basically means the
> > notifier itself will still assume the range is valid, which is not
> > what is documented in this patch.
> >
> > What's more interesting I see smmu had:
> >
> > /* Unmap the whole notifier's range */
> > static void smmu_unmap_notifier_range(IOMMUNotifier *n)
> > {
> >     IOMMUTLBEvent event;
> >
> >     event.type = IOMMU_NOTIFIER_UNMAP;
> >     event.entry.target_as = &address_space_memory;
> >     event.entry.iova = n->start;
> >     event.entry.perm = IOMMU_NONE;
> >     event.entry.addr_mask = n->end - n->start;
> >
> >     memory_region_notify_iommu_one(n, &event);
> > }
> >
> > So it looks to me it's more safe to do something similar for vtd first.
>
> Jason, could you elaborate more on this one?

I meant it's more safe to have a vtd version:

 > > static void vtd_unmap_notifier_range(IOMMUNotifier *n)
> > {
> >     IOMMUTLBEvent event;
> >
> >     event.type = IOMMU_NOTIFIER_UNMAP;
> >     event.entry.target_as = &address_space_memory;
> >     event.entry.iova = n->start;
> >     event.entry.perm = IOMMU_NONE;
> >     event.entry.addr_mask = n->end - n->start;
> >
> >     memory_region_notify_iommu_one(n, &event);

Or move it to the memory.c.

>
> Meanwhile, I don't immediately see what's the side effect you mentioned for
> vfio map events.

I don't see but it looks more safe. Do you know the reason why SMMU
doesn't simply do a [0, ULONG_MAX] unmap notify? (Maybe Eric know)

> I thought any map event should always be in the notifier
> range anyway because map event only comes in page sizes and generated by
> vt-d page walkers (not guest driver, which is IIUC the only place where the
> range of invalidation can be enlarged).  So I don't expect any functional
> change to map events if we decide to crop the ranges unconditionally.

If we crop the ranges, the above description:

"""
and the UNMAP messages can be actually larger than the real invalidations.
"""

doesn't apply anymore.

Thanks

>  Did I miss anything?
>
> Thanks,
>
> >
> > Btw, I forgot the reason why we need to crop the size in the case of
> > device IOTLB, Eguenio do you know that?
> >
> > Thanks
> >
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
Peter Xu Jan. 4, 2023, 3:14 p.m. UTC | #7
On Wed, Jan 04, 2023 at 12:15:20PM +0800, Jason Wang wrote:
> On Wed, Jan 4, 2023 at 1:30 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Dec 26, 2022 at 12:09:52PM +0800, Jason Wang wrote:
> > > On Sat, Dec 24, 2022 at 12:26 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> > > > > On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > > > > > comment above iova_tree to track why we needed the iova_tree, and when we
> > > > > > need it.
> > > > > >
> > > > > > Also comment for the map/unmap messages, on how they're used and
> > > > > > implications (e.g. unmap can be larger than the mapped ranges).
> > > > > >
> > > > > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > > v3:
> > > > > > - Adjust according to Eric's comment
> > > > > > ---
> > > > > >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> > > > > >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> > > > > >  2 files changed, 65 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > > > > index 91f8a2395a..269ecb873b 100644
> > > > > > --- a/include/exec/memory.h
> > > > > > +++ b/include/exec/memory.h
> > > > > > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> > > > > >  /*
> > > > > >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> > > > > >   * register with one or multiple IOMMU Notifier capability bit(s).
> > > > > > + *
> > > > > > + * Normally there're two use cases for the notifiers:
> > > > > > + *
> > > > > > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > > > > > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > > > > > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > > > > > + *
> > > > > > + *       Regarding to accurate synchronization, it's when the notified
> > > > > > + *       device maintains a shadow page table and must be notified on each
> > > > > > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > > > > > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > > > > > + *       the shadow page table is fully in sync with the guest view.
> > > > > > + *
> > > > > > + *   (2) When the device doesn't need accurate synchronizations of the
> > > > > > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > > > > > + *       DEVIOTLB_UNMAP notifies.
> > > > > > + *
> > > > > > + *       It's when the device maintains a cache of IOMMU translations
> > > > > > + *       (IOTLB) and is able to fill that cache by requesting translations
> > > > > > + *       from the vIOMMU through a protocol similar to ATS (Address
> > > > > > + *       Translation Service).
> > > > > > + *
> > > > > > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > > > > > + *       page table for the address space, and the UNMAP messages can be
> > > > > > + *       actually larger than the real invalidations (just like how the
> > > > > > + *       Linux IOMMU driver normally works, where an invalidation can be
> > > > > > + *       enlarged as long as it still covers the target range).  The IOMMU
> > > > >
> > > > > Just spot this when testing your fix for DSI:
> > > > >
> > > > >         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > > >
> > > > > Do we need to remove this (but it seems a partial revert of
> > > > > 03c7140c1a0336af3d4fca768de791b9c0e2b128)?
> > > >
> > > > Replied in the othe thread.
> > > >
> > > > I assume this documentation patch is still correct, am I right?  It's
> > > > talking about the possibility of enlarged invalidation range sent from the
> > > > guest and vIOMMU.  That should still not be bigger than the registered
> > > > range in iommu notifiers (even if bigger than the actual unmapped range).
> > >
> > > Adding Eugenio.
> > >
> > > So I think we need to evaluate the possible side effects to all the
> > > current nmap notifiers. For example the vfio_iommu_map_notify().
> > >
> > > And in another thread, if we crop the size, it basically means the
> > > notifier itself will still assume the range is valid, which is not
> > > what is documented in this patch.
> > >
> > > What's more interesting I see smmu had:
> > >
> > > /* Unmap the whole notifier's range */
> > > static void smmu_unmap_notifier_range(IOMMUNotifier *n)
> > > {
> > >     IOMMUTLBEvent event;
> > >
> > >     event.type = IOMMU_NOTIFIER_UNMAP;
> > >     event.entry.target_as = &address_space_memory;
> > >     event.entry.iova = n->start;
> > >     event.entry.perm = IOMMU_NONE;
> > >     event.entry.addr_mask = n->end - n->start;
> > >
> > >     memory_region_notify_iommu_one(n, &event);
> > > }
> > >
> > > So it looks to me it's more safe to do something similar for vtd first.
> >
> > Jason, could you elaborate more on this one?
> 
> I meant it's more safe to have a vtd version:
> 
>  > > static void vtd_unmap_notifier_range(IOMMUNotifier *n)
> > > {
> > >     IOMMUTLBEvent event;
> > >
> > >     event.type = IOMMU_NOTIFIER_UNMAP;
> > >     event.entry.target_as = &address_space_memory;
> > >     event.entry.iova = n->start;
> > >     event.entry.perm = IOMMU_NONE;
> > >     event.entry.addr_mask = n->end - n->start;
> > >
> > >     memory_region_notify_iommu_one(n, &event);
> 
> Or move it to the memory.c.

I see.

If we always do the crop in memory_region_notify_iommu_one() it'll have
similar effect of having above helper, am I right?

I checked again on the VFIO code path in kernel, it (at least type1v2)
doesn't allow unmapping of partial mapped range, but it looks always fine
to have unmap covering not-mapped spaces.

One more thing I noticed is there's a new flag introduced in 2021 for vfio
to unmap the whole address space (VFIO_DMA_UNMAP_FLAG_ALL).  In the future
we can leverage this when we want to do DSI more efficiently, but not
immediately necessary - I think that needs a new IOMMU notifier API hook.

And if you see the impl of that new flag (in vfio_dma_do_unmap) it also
shows that a larger range of unmap is fine to vfio, because for unmap_all
it's the same as specifying the size to be max:

	if (unmap_all) {
		if (iova || size)
			goto unlock;
		size = U64_MAX;
        }

> 
> >
> > Meanwhile, I don't immediately see what's the side effect you mentioned for
> > vfio map events.
> 
> I don't see but it looks more safe. Do you know the reason why SMMU
> doesn't simply do a [0, ULONG_MAX] unmap notify? (Maybe Eric know)

Same here..

> 
> > I thought any map event should always be in the notifier
> > range anyway because map event only comes in page sizes and generated by
> > vt-d page walkers (not guest driver, which is IIUC the only place where the
> > range of invalidation can be enlarged).  So I don't expect any functional
> > change to map events if we decide to crop the ranges unconditionally.
> 
> If we crop the ranges, the above description:
> 
> """
> and the UNMAP messages can be actually larger than the real invalidations.
> """
> 
> doesn't apply anymore.

It depends on how to define the "real invalidations".  There're two places
that can enlarge an invalidation, here I wanted to reference the case where
e.g. a PSI is enlarged to a DSI.  Even if that's the driver behavior, I
wanted to make sure the qemu iommu notifiees are aware of the facts that
unmap can be bigger than what it used to have mapped.

Thanks,

> 
> Thanks
> 
> >  Did I miss anything?
> >
> > Thanks,
> >
> > >
> > > Btw, I forgot the reason why we need to crop the size in the case of
> > > device IOTLB, Eguenio do you know that?
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > > >
> > >
> >
> > --
> > Peter Xu
> >
>
Jason Wang Jan. 9, 2023, 9:08 a.m. UTC | #8
On Wed, Jan 4, 2023 at 11:14 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jan 04, 2023 at 12:15:20PM +0800, Jason Wang wrote:
> > On Wed, Jan 4, 2023 at 1:30 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Mon, Dec 26, 2022 at 12:09:52PM +0800, Jason Wang wrote:
> > > > On Sat, Dec 24, 2022 at 12:26 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Fri, Dec 23, 2022 at 03:48:01PM +0800, Jason Wang wrote:
> > > > > > On Wed, Dec 7, 2022 at 6:13 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > It seems not super clear on when iova_tree is used, and why.  Add a rich
> > > > > > > comment above iova_tree to track why we needed the iova_tree, and when we
> > > > > > > need it.
> > > > > > >
> > > > > > > Also comment for the map/unmap messages, on how they're used and
> > > > > > > implications (e.g. unmap can be larger than the mapped ranges).
> > > > > > >
> > > > > > > Suggested-by: Jason Wang <jasowang@redhat.com>
> > > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > > ---
> > > > > > > v3:
> > > > > > > - Adjust according to Eric's comment
> > > > > > > ---
> > > > > > >  include/exec/memory.h         | 28 ++++++++++++++++++++++++++
> > > > > > >  include/hw/i386/intel_iommu.h | 38 ++++++++++++++++++++++++++++++++++-
> > > > > > >  2 files changed, 65 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > > > > > index 91f8a2395a..269ecb873b 100644
> > > > > > > --- a/include/exec/memory.h
> > > > > > > +++ b/include/exec/memory.h
> > > > > > > @@ -129,6 +129,34 @@ struct IOMMUTLBEntry {
> > > > > > >  /*
> > > > > > >   * Bitmap for different IOMMUNotifier capabilities. Each notifier can
> > > > > > >   * register with one or multiple IOMMU Notifier capability bit(s).
> > > > > > > + *
> > > > > > > + * Normally there're two use cases for the notifiers:
> > > > > > > + *
> > > > > > > + *   (1) When the device needs accurate synchronizations of the vIOMMU page
> > > > > > > + *       tables, it needs to register with both MAP|UNMAP notifies (which
> > > > > > > + *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
> > > > > > > + *
> > > > > > > + *       Regarding to accurate synchronization, it's when the notified
> > > > > > > + *       device maintains a shadow page table and must be notified on each
> > > > > > > + *       guest MAP (page table entry creation) and UNMAP (invalidation)
> > > > > > > + *       events (e.g. VFIO). Both notifications must be accurate so that
> > > > > > > + *       the shadow page table is fully in sync with the guest view.
> > > > > > > + *
> > > > > > > + *   (2) When the device doesn't need accurate synchronizations of the
> > > > > > > + *       vIOMMU page tables, it needs to register only with UNMAP or
> > > > > > > + *       DEVIOTLB_UNMAP notifies.
> > > > > > > + *
> > > > > > > + *       It's when the device maintains a cache of IOMMU translations
> > > > > > > + *       (IOTLB) and is able to fill that cache by requesting translations
> > > > > > > + *       from the vIOMMU through a protocol similar to ATS (Address
> > > > > > > + *       Translation Service).
> > > > > > > + *
> > > > > > > + *       Note that in this mode the vIOMMU will not maintain a shadowed
> > > > > > > + *       page table for the address space, and the UNMAP messages can be
> > > > > > > + *       actually larger than the real invalidations (just like how the
> > > > > > > + *       Linux IOMMU driver normally works, where an invalidation can be
> > > > > > > + *       enlarged as long as it still covers the target range).  The IOMMU
> > > > > >
> > > > > > Just spot this when testing your fix for DSI:
> > > > > >
> > > > > >         assert(entry->iova >= notifier->start && entry_end <= notifier->end);
> > > > > >
> > > > > > Do we need to remove this (but it seems a partial revert of
> > > > > > 03c7140c1a0336af3d4fca768de791b9c0e2b128)?
> > > > >
> > > > > Replied in the othe thread.
> > > > >
> > > > > I assume this documentation patch is still correct, am I right?  It's
> > > > > talking about the possibility of enlarged invalidation range sent from the
> > > > > guest and vIOMMU.  That should still not be bigger than the registered
> > > > > range in iommu notifiers (even if bigger than the actual unmapped range).
> > > >
> > > > Adding Eugenio.
> > > >
> > > > So I think we need to evaluate the possible side effects to all the
> > > > current nmap notifiers. For example the vfio_iommu_map_notify().
> > > >
> > > > And in another thread, if we crop the size, it basically means the
> > > > notifier itself will still assume the range is valid, which is not
> > > > what is documented in this patch.
> > > >
> > > > What's more interesting I see smmu had:
> > > >
> > > > /* Unmap the whole notifier's range */
> > > > static void smmu_unmap_notifier_range(IOMMUNotifier *n)
> > > > {
> > > >     IOMMUTLBEvent event;
> > > >
> > > >     event.type = IOMMU_NOTIFIER_UNMAP;
> > > >     event.entry.target_as = &address_space_memory;
> > > >     event.entry.iova = n->start;
> > > >     event.entry.perm = IOMMU_NONE;
> > > >     event.entry.addr_mask = n->end - n->start;
> > > >
> > > >     memory_region_notify_iommu_one(n, &event);
> > > > }
> > > >
> > > > So it looks to me it's more safe to do something similar for vtd first.
> > >
> > > Jason, could you elaborate more on this one?
> >
> > I meant it's more safe to have a vtd version:
> >
> >  > > static void vtd_unmap_notifier_range(IOMMUNotifier *n)
> > > > {
> > > >     IOMMUTLBEvent event;
> > > >
> > > >     event.type = IOMMU_NOTIFIER_UNMAP;
> > > >     event.entry.target_as = &address_space_memory;
> > > >     event.entry.iova = n->start;
> > > >     event.entry.perm = IOMMU_NONE;
> > > >     event.entry.addr_mask = n->end - n->start;
> > > >
> > > >     memory_region_notify_iommu_one(n, &event);
> >
> > Or move it to the memory.c.
>
> I see.
>
> If we always do the crop in memory_region_notify_iommu_one() it'll have
> similar effect of having above helper, am I right?

Right, but we may end up twice cropping. I don't have an actual
preference, but I think we need to be consistent here.

Either:

1) cropping in the memory core and remove the iommu cropping like
smmu_unmap_notifier_range()

or

2) don't corp in the memory core but move smmu_unmap_notifier_range to
the core (still, a kind of implicit crop, since the function was
called without a range)

2) seems safer but I can go with 1 if you insist.

>
> I checked again on the VFIO code path in kernel, it (at least type1v2)
> doesn't allow unmapping of partial mapped range, but it looks always fine
> to have unmap covering not-mapped spaces.

That's my understanding as well.

>
> One more thing I noticed is there's a new flag introduced in 2021 for vfio
> to unmap the whole address space (VFIO_DMA_UNMAP_FLAG_ALL).  In the future
> we can leverage this when we want to do DSI more efficiently, but not
> immediately necessary - I think that needs a new IOMMU notifier API hook.
>

Yes.

> And if you see the impl of that new flag (in vfio_dma_do_unmap) it also
> shows that a larger range of unmap is fine to vfio, because for unmap_all
> it's the same as specifying the size to be max:
>
>         if (unmap_all) {
>                 if (iova || size)
>                         goto unlock;
>                 size = U64_MAX;
>         }
>
> >
> > >
> > > Meanwhile, I don't immediately see what's the side effect you mentioned for
> > > vfio map events.
> >
> > I don't see but it looks more safe. Do you know the reason why SMMU
> > doesn't simply do a [0, ULONG_MAX] unmap notify? (Maybe Eric know)
>
> Same here..
>
> >
> > > I thought any map event should always be in the notifier
> > > range anyway because map event only comes in page sizes and generated by
> > > vt-d page walkers (not guest driver, which is IIUC the only place where the
> > > range of invalidation can be enlarged).  So I don't expect any functional
> > > change to map events if we decide to crop the ranges unconditionally.
> >
> > If we crop the ranges, the above description:
> >
> > """
> > and the UNMAP messages can be actually larger than the real invalidations.
> > """
> >
> > doesn't apply anymore.
>
> It depends on how to define the "real invalidations".  There're two places
> that can enlarge an invalidation, here I wanted to reference the case where
> e.g. a PSI is enlarged to a DSI.  Even if that's the driver behavior, I
> wanted to make sure the qemu iommu notifiees are aware of the facts that
> unmap can be bigger than what it used to have mapped.

Ok, I think the confusion came from "real invalidations". I think
there's no way for the device to know about the real invalidation
since the driver can enlarge it at will? If this is true, is this
better to say the UNAMP messages can cover the range that is not
mapped?

Thanks

>
> Thanks,
>
> >
> > Thanks
> >
> > >  Did I miss anything?
> > >
> > > Thanks,
> > >
> > > >
> > > > Btw, I forgot the reason why we need to crop the size in the case of
> > > > device IOTLB, Eguenio do you know that?
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Peter Xu
> > > > >
> > > >
> > >
> > > --
> > > Peter Xu
> > >
> >
>
> --
> Peter Xu
>
Peter Xu Jan. 9, 2023, 7:34 p.m. UTC | #9
On Mon, Jan 09, 2023 at 05:08:59PM +0800, Jason Wang wrote:
> Either:
> 
> 1) cropping in the memory core and remove the iommu cropping like
> smmu_unmap_notifier_range()
> 
> or
> 
> 2) don't corp in the memory core but move smmu_unmap_notifier_range to
> the core (still, a kind of implicit crop, since the function was
> called without a range)
> 
> 2) seems safer but I can go with 1 if you insist.

No strong opinion here, thanks for checking!  I'm not exactly sure how
it'll look like at last with 2), but so far either way sounds good.

[...]

> > It depends on how to define the "real invalidations".  There're two places
> > that can enlarge an invalidation, here I wanted to reference the case where
> > e.g. a PSI is enlarged to a DSI.  Even if that's the driver behavior, I
> > wanted to make sure the qemu iommu notifiees are aware of the facts that
> > unmap can be bigger than what it used to have mapped.
> 
> Ok, I think the confusion came from "real invalidations". I think
> there's no way for the device to know about the real invalidation
> since the driver can enlarge it at will? If this is true, is this
> better to say the UNAMP messages can cover the range that is not
> mapped?

I can reword, will repost soon.

Thanks,
diff mbox series

Patch

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 91f8a2395a..269ecb873b 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -129,6 +129,34 @@  struct IOMMUTLBEntry {
 /*
  * Bitmap for different IOMMUNotifier capabilities. Each notifier can
  * register with one or multiple IOMMU Notifier capability bit(s).
+ *
+ * Normally there're two use cases for the notifiers:
+ *
+ *   (1) When the device needs accurate synchronizations of the vIOMMU page
+ *       tables, it needs to register with both MAP|UNMAP notifies (which
+ *       is defined as IOMMU_NOTIFIER_IOTLB_EVENTS below).
+ *
+ *       Regarding to accurate synchronization, it's when the notified
+ *       device maintains a shadow page table and must be notified on each
+ *       guest MAP (page table entry creation) and UNMAP (invalidation)
+ *       events (e.g. VFIO). Both notifications must be accurate so that
+ *       the shadow page table is fully in sync with the guest view.
+ *
+ *   (2) When the device doesn't need accurate synchronizations of the
+ *       vIOMMU page tables, it needs to register only with UNMAP or
+ *       DEVIOTLB_UNMAP notifies.
+ *
+ *       It's when the device maintains a cache of IOMMU translations
+ *       (IOTLB) and is able to fill that cache by requesting translations
+ *       from the vIOMMU through a protocol similar to ATS (Address
+ *       Translation Service).
+ *
+ *       Note that in this mode the vIOMMU will not maintain a shadowed
+ *       page table for the address space, and the UNMAP messages can be
+ *       actually larger than the real invalidations (just like how the
+ *       Linux IOMMU driver normally works, where an invalidation can be
+ *       enlarged as long as it still covers the target range).  The IOMMU
+ *       notifiee should be able to take care of over-sized invalidations.
  */
 typedef enum {
     IOMMU_NOTIFIER_NONE = 0,
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 46d973e629..89dcbc5e1e 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -109,7 +109,43 @@  struct VTDAddressSpace {
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
-    IOVATree *iova_tree;          /* Traces mapped IOVA ranges */
+    /*
+     * @iova_tree traces mapped IOVA ranges.
+     *
+     * The tree is not needed if no MAP notifier is registered with current
+     * VTD address space, because all guest invalidate commands can be
+     * directly passed to the IOMMU UNMAP notifiers without any further
+     * reshuffling.
+     *
+     * The tree OTOH is required for MAP typed iommu notifiers for a few
+     * reasons.
+     *
+     * Firstly, there's no way to identify whether an PSI (Page Selective
+     * Invalidations) or DSI (Domain Selective Invalidations) event is an
+     * MAP or UNMAP event within the message itself.  Without having prior
+     * knowledge of existing state vIOMMU doesn't know whether it should
+     * notify MAP or UNMAP for a PSI message it received when caching mode
+     * is enabled (for MAP notifiers).
+     *
+     * Secondly, PSI messages received from guest driver can be enlarged in
+     * range, covers but not limited to what the guest driver wanted to
+     * invalidate.  When the range to invalidates gets bigger than the
+     * limit of a PSI message, it can even become a DSI which will
+     * invalidate the whole domain.  If the vIOMMU directly notifies the
+     * registered device with the unmodified range, it may confuse the
+     * registered drivers (e.g. vfio-pci) on either:
+     *
+     *   (1) Trying to map the same region more than once (for
+     *       VFIO_IOMMU_MAP_DMA, -EEXIST will trigger), or,
+     *
+     *   (2) Trying to UNMAP a range that is still partially mapped.
+     *
+     * That accuracy is not required for UNMAP-only notifiers, but it is a
+     * must-to-have for notifiers registered with MAP events, because the
+     * vIOMMU needs to make sure the shadow page table is always in sync
+     * with the guest IOMMU pgtables for a device.
+     */
+    IOVATree *iova_tree;
 };
 
 struct VTDIOTLBEntry {