[v7,14/17] memory: add MemoryRegionIOMMUOps.replay() callback

Message ID 1486456099-7345-15-git-send-email-peterx@redhat.com
State New

Commit Message

Peter Xu Feb. 7, 2017, 8:28 a.m. UTC
Currently we have a single memory_region_iommu_replay() function, which
implements the default behavior: replay the translations of the whole
IOMMU region. However, on some platforms, such as x86, we may want our
own replay logic for IOMMU regions. This patch adds one more hook to
IOMMUOps for that callback; if set, it overrides the default.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/exec/memory.h | 2 ++
 memory.c              | 6 ++++++
 2 files changed, 8 insertions(+)
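
To make the dispatch described in the commit message concrete, here is a minimal sketch in C. It is not QEMU code: ToyRegion, ToyNotifier, ToyIOMMUOps, toy_region_replay() and toy_sparse_replay() are hypothetical stand-ins invented for illustration, and the default walk only counts steps instead of translating and notifying.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for the real types; only the fields this sketch needs. */
typedef struct ToyRegion ToyRegion;

typedef struct ToyNotifier {
    unsigned long calls;   /* how many entries were replayed to us */
} ToyNotifier;

typedef struct ToyIOMMUOps {
    /* Optional hook: when non-NULL it replaces the default walk. */
    void (*replay)(ToyRegion *mr, ToyNotifier *n);
} ToyIOMMUOps;

struct ToyRegion {
    const ToyIOMMUOps *ops;
    uint64_t size;         /* bytes covered by the region */
    uint64_t granularity;  /* minimum page size in bytes */
};

/* Generic entry point: prefer the per-IOMMU hook, else visit every page. */
void toy_region_replay(ToyRegion *mr, ToyNotifier *n)
{
    if (mr->ops->replay) {
        mr->ops->replay(mr, n);
        return;
    }

    assert(mr->granularity != 0);  /* a zero step would never terminate */
    for (uint64_t addr = 0; addr < mr->size; addr += mr->granularity) {
        n->calls++;                /* stand-in for "translate and notify" */
    }
}

/* Hypothetical custom hook that reports a single mapping. */
void toy_sparse_replay(ToyRegion *mr, ToyNotifier *n)
{
    (void)mr;
    n->calls += 1;
}
```

With a 16-page region, a ToyIOMMUOps whose replay member is NULL makes toy_region_replay() take 16 steps, while one that points at toy_sparse_replay() takes exactly one, which is the override behavior the patch adds.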

Comments

David Gibson Feb. 10, 2017, 2:34 a.m. UTC | #1
On Tue, Feb 07, 2017 at 04:28:16PM +0800, Peter Xu wrote:
> Originally we have one memory_region_iommu_replay() function, which is
> the default behavior to replay the translations of the whole IOMMU
> region. However, on some platform like x86, we may want our own replay
> logic for IOMMU regions. This patch add one more hook for IOMMUOps for
> the callback, and it'll override the default if set.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0767888..30b2a74 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
>  
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/memory.c b/memory.c
> index 7a4f2f9..9c253cc 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
>  
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }
> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
>  
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
Yi Liu March 27, 2017, 8:35 a.m. UTC | #2
> -----Original Message-----
> From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> Behalf Of Peter Xu
> Sent: Tuesday, February 7, 2017 4:28 PM
> To: qemu-devel@nongnu.org
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> peterx@redhat.com; alex.williamson@redhat.com; bd.aviv@gmail.com; David
> Gibson <david@gibson.dropbear.id.au>
> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> Originally we have one memory_region_iommu_replay() function, which is the
> default behavior to replay the translations of the whole IOMMU region. However,
> on some platform like x86, we may want our own replay logic for IOMMU regions.
> This patch add one more hook for IOMMUOps for the callback, and it'll override the
> default if set.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/exec/memory.h | 2 ++
>  memory.c              | 6 ++++++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0767888..30b2a74 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>      void (*notify_flag_changed)(MemoryRegion *iommu,
>                                  IOMMUNotifierFlag old_flags,
>                                  IOMMUNotifierFlag new_flags);
> +    /* Set this up to provide customized IOMMU replay function */
> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>  };
> 
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> diff --git a/memory.c b/memory.c
> index 7a4f2f9..9c253cc 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>      hwaddr addr, granularity;
>      IOMMUTLBEntry iotlb;
> +    /* If the IOMMU has its own replay callback, override */
> +    if (mr->iommu_ops->replay) {
> +        mr->iommu_ops->replay(mr, n);
> +        return;
> +    }

Hi Alex, Peter,

Will all the other vendors (e.g. PPC, s390, ARM) add their own replay callbacks
as well? I guess it depends on whether the original replay algorithm works well
for them. Do you happen to know?

Regards,
Yi L

> +
>      granularity = memory_region_iommu_get_min_page_size(mr);
> 
>      for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
> --
> 2.7.4
>
Peter Xu March 27, 2017, 9:12 a.m. UTC | #3
On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > -----Original Message-----
> > From: Qemu-devel [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> > Behalf Of Peter Xu
> > Sent: Tuesday, February 7, 2017 4:28 PM
> > To: qemu-devel@nongnu.org
> > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> > mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> > peterx@redhat.com; alex.williamson@redhat.com; bd.aviv@gmail.com; David
> > Gibson <david@gibson.dropbear.id.au>
> > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > MemoryRegionIOMMUOps.replay() callback
> > 
> > Originally we have one memory_region_iommu_replay() function, which is the
> > default behavior to replay the translations of the whole IOMMU region. However,
> > on some platform like x86, we may want our own replay logic for IOMMU regions.
> > This patch add one more hook for IOMMUOps for the callback, and it'll override the
> > default if set.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/exec/memory.h | 2 ++
> >  memory.c              | 6 ++++++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > index 0767888..30b2a74 100644
> > --- a/include/exec/memory.h
> > +++ b/include/exec/memory.h
> > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >      void (*notify_flag_changed)(MemoryRegion *iommu,
> >                                  IOMMUNotifierFlag old_flags,
> >                                  IOMMUNotifierFlag new_flags);
> > +    /* Set this up to provide customized IOMMU replay function */
> > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >  };
> > 
> >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> > diff --git a/memory.c b/memory.c
> > index 7a4f2f9..9c253cc 100644
> > --- a/memory.c
> > +++ b/memory.c
> > @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> >      hwaddr addr, granularity;
> >      IOMMUTLBEntry iotlb;
> > +    /* If the IOMMU has its own replay callback, override */
> > +    if (mr->iommu_ops->replay) {
> > +        mr->iommu_ops->replay(mr, n);
> > +        return;
> > +    }
> 
> Hi Alex, Peter,
> 
> Will all the other vendors(e.g. PPC, s390, ARM) add their own replay callback
> as well? I guess it depends on whether the original replay algorithm work well
> for them? Do you have such knowledge?

I guess so. At least for VT-d we had this callback since the default
replay mechanism did not work well on x86 due to its extremely large
memory region size. Thanks,

-- peterx
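
To give a sense of scale for the "extremely large memory region size" mentioned above: the default replay takes one step per minimum-size page, so the step count grows with the region size. The helper below computes that count for a power-of-two region. toy_walk_steps() is a hypothetical name, and the 48-bit width and 4 KiB page used in the note after it are illustrative assumptions, not figures taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of iterations a page-by-page walk needs for a region of
 * 2^addr_bits bytes when each step covers 2^page_shift bytes.
 */
uint64_t toy_walk_steps(unsigned addr_bits, unsigned page_shift)
{
    /* Keep the shift in range so the result fits in 64 bits. */
    assert(addr_bits >= page_shift && addr_bits - page_shift < 64);
    return UINT64_C(1) << (addr_bits - page_shift);
}
```

Under those assumed values, toy_walk_steps(48, 12) is 2^36, about 68.7 billion iterations, whereas a 32-bit region at the same page size needs 2^20, about one million. That gap is the kind of cost a custom replay callback can avoid by walking only what is actually mapped.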
Yi Liu March 27, 2017, 9:21 a.m. UTC | #4
> -----Original Message-----
> From: Peter Xu [mailto:peterx@redhat.com]
> Sent: Monday, March 27, 2017 5:12 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > > -----Original Message-----
> > > From: Qemu-devel
> > > [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On Behalf
> > > Of Peter Xu
> > > Sent: Tuesday, February 7, 2017 4:28 PM
> > > To: qemu-devel@nongnu.org
> > > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> > > <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> > > jasowang@redhat.com; peterx@redhat.com; alex.williamson@redhat.com;
> > > bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>
> > > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > > MemoryRegionIOMMUOps.replay() callback
> > >
> > > Originally we have one memory_region_iommu_replay() function, which
> > > is the default behavior to replay the translations of the whole
> > > IOMMU region. However, on some platform like x86, we may want our own
> > > replay logic for IOMMU regions.
> > > This patch add one more hook for IOMMUOps for the callback, and
> > > it'll override the default if set.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/exec/memory.h | 2 ++
> > >  memory.c              | 6 ++++++
> > >  2 files changed, 8 insertions(+)
> > >
> > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > index 0767888..30b2a74 100644
> > > --- a/include/exec/memory.h
> > > +++ b/include/exec/memory.h
> > > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> > >      void (*notify_flag_changed)(MemoryRegion *iommu,
> > >                                  IOMMUNotifierFlag old_flags,
> > >                                  IOMMUNotifierFlag new_flags);
> > > +    /* Set this up to provide customized IOMMU replay function */
> > > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> > >  };
> > >
> > >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> > > diff --git a/memory.c b/memory.c
> > > index 7a4f2f9..9c253cc 100644
> > > --- a/memory.c
> > > +++ b/memory.c
> > > @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> > >      hwaddr addr, granularity;
> > >      IOMMUTLBEntry iotlb;
> > > +    /* If the IOMMU has its own replay callback, override */
> > > +    if (mr->iommu_ops->replay) {
> > > +        mr->iommu_ops->replay(mr, n);
> > > +        return;
> > > +    }
> >
> > Hi Alex, Peter,
> >
> > Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> > callback as well? I guess it depends on whether the original replay
> > algorithm work well for them? Do you have such knowledge?
> 
> I guess so. At least for VT-d we had this callback since the default replay mechanism
> did not work well on x86 due to its extremely large memory region size. Thanks,

thx. that would make sense.
Yi Liu March 30, 2017, 11:06 a.m. UTC | #5
> -----Original Message-----
> From: Liu, Yi L
> Sent: Monday, March 27, 2017 5:22 PM
> To: Peter Xu <peterx@redhat.com>
> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> > -----Original Message-----
> > From: Peter Xu [mailto:peterx@redhat.com]
> > Sent: Monday, March 27, 2017 5:12 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> > Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> > jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> > Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> > Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> > MemoryRegionIOMMUOps.replay() callback
> >
> > On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> > > > -----Original Message-----
> > > > From: Qemu-devel
> > > > [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> > > > Behalf Of Peter Xu
> > > > Sent: Tuesday, February 7, 2017 4:28 PM
> > > > To: qemu-devel@nongnu.org
> > > > Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> > > > <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> > > > jasowang@redhat.com; peterx@redhat.com;
> > > > alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> > > > <david@gibson.dropbear.id.au>
> > > > Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> > > > MemoryRegionIOMMUOps.replay() callback
> > > >
> > > > Originally we have one memory_region_iommu_replay() function,
> > > > which is the default behavior to replay the translations of the
> > > > whole IOMMU region. However, on some platform like x86, we may
> > > > want our own replay logic for IOMMU regions.
> > > > This patch add one more hook for IOMMUOps for the callback, and
> > > > it'll override the default if set.
> > > >
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  include/exec/memory.h | 2 ++
> > > >  memory.c              | 6 ++++++
> > > >  2 files changed, 8 insertions(+)
> > > >
> > > > diff --git a/include/exec/memory.h b/include/exec/memory.h
> > > > index 0767888..30b2a74 100644
> > > > --- a/include/exec/memory.h
> > > > +++ b/include/exec/memory.h
> > > > @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> > > >      void (*notify_flag_changed)(MemoryRegion *iommu,
> > > >                                  IOMMUNotifierFlag old_flags,
> > > >                                  IOMMUNotifierFlag new_flags);
> > > > +    /* Set this up to provide customized IOMMU replay function */
> > > > +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> > > >  };
> > > >
> > > >  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> > > > diff --git a/memory.c b/memory.c
> > > > index 7a4f2f9..9c253cc 100644
> > > > --- a/memory.c
> > > > +++ b/memory.c
> > > > @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
> > > >      hwaddr addr, granularity;
> > > >      IOMMUTLBEntry iotlb;
> > > > +    /* If the IOMMU has its own replay callback, override */
> > > > +    if (mr->iommu_ops->replay) {
> > > > +        mr->iommu_ops->replay(mr, n);
> > > > +        return;
> > > > +    }
> > >
> > > Hi Alex, Peter,
> > >
> > > Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> > > callback as well? I guess it depends on whether the original replay
> > > algorithm work well for them? Do you have such knowledge?
> >
> > I guess so. At least for VT-d we had this callback since the default
> > replay mechanism did not work well on x86 due to its extremely large
> > memory region size. Thanks,
> 
> thx. that would make sense.


Peter,

It just came to mind that there may be a corner case here.

Intel VT-d has a "pt" (pass-through) mode that allows a device to use physical
addresses even when VT-d is enabled. In the kernel, there is an
iommu_identity_mapping. If a device is in this map, it uses "pt" mode, so the
IOMMU driver does not build a second-level page table for it.

Back to the virtual IOVA implementation: if an assigned device is in the
iommu_identity_mapping (e.g. a VGA controller), it uses GPAs directly for DMA,
so it needs a GPA->HPA mapping in the host. However, the iommu_ops->replay
callback cannot build that mapping when the guest SL page table is empty.

So I think it would be better to build the entire guest PA->HPA mapping before
the guest kernel boots. Any thoughts?

Regards,
Yi L
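
The corner case above can be sketched with a toy model in C. Everything here is hypothetical and heavily simplified: ToyPageTable, toy_replay_from_table() and toy_pages_needed_for_pt() are invented names, and an 8-page guest is assumed purely for illustration. The point is only that a replay driven by the guest's second-level table produces nothing when that table is empty, while a pass-through device still needs every guest page to be reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 8   /* hypothetical guest with 8 pages of RAM */

/* present[i] says whether the guest's second-level table maps page i. */
typedef struct ToyPageTable {
    bool present[TOY_PAGES];
} ToyPageTable;

/* Replay driven by the guest table: one host mapping per present entry. */
size_t toy_replay_from_table(const ToyPageTable *pt)
{
    size_t mapped = 0;

    assert(pt != NULL);
    for (size_t i = 0; i < TOY_PAGES; i++) {
        if (pt->present[i]) {
            mapped++;
        }
    }
    return mapped;
}

/* What a pass-through device needs: every guest page reachable by DMA. */
size_t toy_pages_needed_for_pt(void)
{
    return TOY_PAGES;
}
```

For an all-false table, toy_replay_from_table() returns 0 while toy_pages_needed_for_pt() returns 8; the difference is the set of GPA->HPA mappings that something other than the table walk would have to establish.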
Jason Wang March 30, 2017, 11:57 a.m. UTC | #6
On 2017-03-30 19:06, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Liu, Yi L
>> Sent: Monday, March 27, 2017 5:22 PM
>> To: Peter Xu <peterx@redhat.com>
>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>> jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
>> <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>> -----Original Message-----
>>> From: Peter Xu [mailto:peterx@redhat.com]
>>> Sent: Monday, March 27, 2017 5:12 PM
>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
>>> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>> MemoryRegionIOMMUOps.replay() callback
>>>
>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>> -----Original Message-----
>>>>> From: Qemu-devel
>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>> Behalf Of Peter Xu
>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>> To: qemu-devel@nongnu.org
>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>> <david@gibson.dropbear.id.au>
>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>
>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>> which is the default behavior to replay the translations of the
>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>> want our own replay logic for IOMMU regions.
>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>> it'll override the default if set.
>>>>>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>>   include/exec/memory.h | 2 ++
>>>>>   memory.c              | 6 ++++++
>>>>>   2 files changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>>>>> index 0767888..30b2a74 100644
>>>>> --- a/include/exec/memory.h
>>>>> +++ b/include/exec/memory.h
>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>       void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>                                   IOMMUNotifierFlag old_flags,
>>>>>                                   IOMMUNotifierFlag new_flags);
>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>>>>>   };
>>>>>
>>>>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>>>> diff --git a/memory.c b/memory.c
>>>>> index 7a4f2f9..9c253cc 100644
>>>>> --- a/memory.c
>>>>> +++ b/memory.c
>>>>> @@ -1630,6 +1630,12 @@ void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
>>>>>       hwaddr addr, granularity;
>>>>>       IOMMUTLBEntry iotlb;
>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>> +    if (mr->iommu_ops->replay) {
>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>> +        return;
>>>>> +    }
>>>> Hi Alex, Peter,
>>>>
>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
>>>> callback as well? I guess it depends on whether the original replay
>>>> algorithm work well for them? Do you have such knowledge?
>>> I guess so. At least for VT-d we had this callback since the default
>>> replay mechanism did not work well on x86 due to its extremely large
>>> memory region size. Thanks,
>> thx. that would make sense.
> Peter,
>
> Just come to mind that there may be a corner case here.
>
> Intel VT-d actually has a "pt" mode which allows device use physical address
> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> would not build second-level page table for it.

Yes, but QEMU does not support ECAP_PT yet, so the guest will still have a
page table in this case.

>
> Back to the virtual IOVA implementation, if an assigned device is in the
> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> is not able to build it when guest SL page table is empty.
>
> So I think building an entire guest PA->HPA mapping before guest kernel boot
> would be recommended. Any thoughts?

We plan to add PT in 2.10. A rough idea is to disable the IOMMU DMAR
region and use another region without iommu_ops. Then
vfio_listener_region_add() will just set up the correct mappings.

Thanks

>
> Regards,
> Yi L
Peter Xu March 31, 2017, 2:56 a.m. UTC | #7
On Thu, Mar 30, 2017 at 07:57:38PM +0800, Jason Wang wrote:
> 
> 
> On 2017-03-30 19:06, Liu, Yi L wrote:
> >>-----Original Message-----
> >>From: Liu, Yi L
> >>Sent: Monday, March 27, 2017 5:22 PM
> >>To: Peter Xu <peterx@redhat.com>
> >>Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >><kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>jasowang@redhat.com; bd.aviv@gmail.com; David Gibson
> >><david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>MemoryRegionIOMMUOps.replay() callback
> >>
> >>>-----Original Message-----
> >>>From: Peter Xu [mailto:peterx@redhat.com]
> >>>Sent: Monday, March 27, 2017 5:12 PM
> >>>To: Liu, Yi L <yi.l.liu@intel.com>
> >>>Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>>Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>>jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> >>>Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>>Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>MemoryRegionIOMMUOps.replay() callback
> >>>
> >>>On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>>-----Original Message-----
> >>>>>From: Qemu-devel
> >>>>>[mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>>Behalf Of Peter Xu
> >>>>>Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>>To: qemu-devel@nongnu.org
> >>>>>Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>><kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>>jasowang@redhat.com; peterx@redhat.com;
> >>>>>alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>><david@gibson.dropbear.id.au>
> >>>>>Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>>Originally we have one memory_region_iommu_replay() function,
> >>>>>which is the default behavior to replay the translations of the
> >>>>>whole IOMMU region. However, on some platform like x86, we may
> > >>>>>want our own replay logic for IOMMU regions.
> >>>>>This patch add one more hook for IOMMUOps for the callback, and
> >>>>>it'll override the default if set.
> >>>>>
> >>>>>Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>>---
> >>>>>  include/exec/memory.h | 2 ++
> >>>>>  memory.c              | 6 ++++++
> >>>>>  2 files changed, 8 insertions(+)
> >>>>>
> > >>>>>diff --git a/include/exec/memory.h b/include/exec/memory.h
> > >>>>>index 0767888..30b2a74 100644
> >>>>>--- a/include/exec/memory.h
> >>>>>+++ b/include/exec/memory.h
> >>>>>@@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>      void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>                                  IOMMUNotifierFlag old_flags,
> >>>>>                                  IOMMUNotifierFlag new_flags);
> >>>>>+    /* Set this up to provide customized IOMMU replay function */
> >>>>>+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >>>>>  };
> >>>>>
> > >>>>>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> > >>>>>diff --git a/memory.c b/memory.c
> > >>>>>index 7a4f2f9..9c253cc 100644
> >>>>>--- a/memory.c
> >>>>>+++ b/memory.c
> >>>>>@@ -1630,6 +1630,12 @@ void
> >>>>>memory_region_iommu_replay(MemoryRegion
> >>>>>*mr, IOMMUNotifier *n,
> >>>>>      hwaddr addr, granularity;
> >>>>>      IOMMUTLBEntry iotlb;
> >>>>>+    /* If the IOMMU has its own replay callback, override */
> >>>>>+    if (mr->iommu_ops->replay) {
> >>>>>+        mr->iommu_ops->replay(mr, n);
> >>>>>+        return;
> >>>>>+    }
> >>>>Hi Alex, Peter,
> >>>>
> >>>>Will all the other vendors(e.g. PPC, s390, ARM) add their own replay
> >>>>callback as well? I guess it depends on whether the original replay
> >>>>algorithm work well for them? Do you have such knowledge?
> >>>I guess so. At least for VT-d we had this callback since the default
> >>>replay mechanism did not work well on x86 due to its extremely large
> >>>memory region size. Thanks,
> >>thx. that would make sense.
> >Peter,
> >
> >Just come to mind that there may be a corner case here.
> >
> >Intel VT-d actually has a "pt" mode which allows device use physical address
> >even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >would not build second-level page table for it.
> 
> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> table in this case.
> 
> >
> >Back to the virtual IOVA implementation, if an assigned device is in the
> >iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >is not able to build it when guest SL page table is empty.
> >
> >So I think building an entire guest PA->HPA mapping before guest kernel boot
> >would be recommended. Any thoughts?
> 
> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> region and use another region without iommu_ops. Then
> vfio_listener_region_add() will just do the correct mappings.

We may not even need a new region. With patch 16/17 ("intel_iommu: allow
dynamic switch of IOMMU region"), we can just turn the IOMMU region
on/off, following the device's PT bit, maybe using the new
vtd_switch_address_space() interface. That should be enough.

Again, we just need to wait until the current series is merged.

(Oh, now I see why I had an extra "on/off" parameter for
 vtd_switch_address_space() in previous versions, though it was removed.)

Thanks,

-- peterx
Jason Wang March 31, 2017, 4:21 a.m. UTC | #8
On 2017-03-31 10:56, Peter Xu wrote:
>>> Just come to mind that there may be a corner case here.
>>>
>>> Intel VT-d actually has a "pt" mode which allows device use physical address
>>> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
>>> would not build second-level page table for it.
>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
>> table in this case.
>>
>>> Back to the virtual IOVA implementation, if an assigned device is in the
>>> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
>>> is not able to build it when guest SL page table is empty.
>>>
>>> So I think building an entire guest PA->HPA mapping before guest kernel boot
>>> would be recommended. Any thoughts?
>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
>> region and use another region without iommu_ops. Then
>> vfio_listener_region_add() will just do the correct mappings.
> Even without any new region. With the patch 16/17 ("intel_iommu: allow
> dynamic switch of IOMMU region"), we can just turn the IOMMU region
> on/off, following the device's PT bit, maybe using the new
> vtd_switch_address_space() interface. That should be enough.

Right. For vhost it will probably need more work, e.g. setting up static
mappings during region_add().

>
> Again, we just need to wait until current series merged.
>
> (Oh, then I found why I had an extra "on/off" parameter in previous
>   versions in vtd_switch_address_space(), but it was removed.)

Good to know this.

Thanks

>
> Thanks,
>
> -- peterx
Peter Xu March 31, 2017, 5:01 a.m. UTC | #9
On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
> 
> 
> On 2017-03-31 10:56, Peter Xu wrote:
> >>>Just come to mind that there may be a corner case here.
> >>>
> >>>Intel VT-d actually has a "pt" mode which allows device use physical address
> >>>even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >>>If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >>>would not build second-level page table for it.
> >>Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> >>table in this case.
> >>
> >>>Back to the virtual IOVA implementation, if an assigned device is in the
> >>>iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >>>So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >>>is not able to build it when guest SL page table is empty.
> >>>
> >>>So I think building an entire guest PA->HPA mapping before guest kernel boot
> >>>would be recommended. Any thoughts?
> >>We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> >>region and use another region without iommu_ops. Then
> >>vfio_listener_region_add() will just do the correct mappings.
> >Even without any new region. With the patch 16/17 ("intel_iommu: allow
> >dynamic switch of IOMMU region"), we can just turn the IOMMU region
> >on/off, following the device's PT bit, maybe using the new
> >vtd_switch_address_space() interface. That should be enough.
> 
> Right. For vhost it was probably need more works, e.g setting up static
> mappings during region_add().

Do we need to?

VFIO will need it for building up the shadow page table, even without a
vIOMMU. But IMHO vhost should not need that, right?

-- peterx
Jason Wang March 31, 2017, 5:12 a.m. UTC | #10
On 2017-03-31 13:01, Peter Xu wrote:
> On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
>>
>> On 2017-03-31 10:56, Peter Xu wrote:
>>>>> Just come to mind that there may be a corner case here.
>>>>>
>>>>> Intel VT-d actually has a "pt" mode which allows device use physical address
>>>>> even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>>>> If a device is in this map, then it would use "pt" mode. So that IOMMU driver
>>>>> would not build second-level page table for it.
>>>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page
>>>> table in this case.
>>>>
>>>>> Back to the virtual IOVA implementation, if an assigned device is in the
>>>>> iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>>>> So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
>>>>> is not able to build it when guest SL page table is empty.
>>>>>
>>>>> So I think building an entire guest PA->HPA mapping before guest kernel boot
>>>>> would be recommended. Any thoughts?
>>>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
>>>> region and use another region without iommu_ops. Then
>>>> vfio_listener_region_add() will just do the correct mappings.
>>> Even without any new region. With the patch 16/17 ("intel_iommu: allow
>>> dynamic switch of IOMMU region"), we can just turn the IOMMU region
>>> on/off, following the device's PT bit, maybe using the new
>>> vtd_switch_address_space() interface. That should be enough.
>> Right. For vhost it was probably need more works, e.g setting up static
>> mappings during region_add().
> Do we need to?

Not a must if we don't care about performance.

>
> VFIO will need it for building up shadow page table, even without a
> vIOMMU. But imho that should not be needed by vhost, right?

Device IOTLB will be enabled unconditionally if iommu_platform is
specified. If we don't set up static mappings, vhost will send IOTLB miss
requests. The performance will be horrible in this case.
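
To make the cost difference concrete, here is a rough sketch in plain C (not vhost or QEMU code). It models a small direct-mapped device IOTLB and counts how often a translation has to fall back to a miss request. Every `Toy*`/`toy_*` name is hypothetical, and the miss path simply assumes an identity (passthrough) mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SHIFT 12
#define TOY_TLB_SLOTS  64

/* Hypothetical toy device IOTLB; this is NOT vhost's real data structure. */
typedef struct {
    bool          valid[TOY_TLB_SLOTS];
    uint64_t      iova_pfn[TOY_TLB_SLOTS];
    uint64_t      host_pfn[TOY_TLB_SLOTS];
    unsigned long misses;   /* number of simulated IOTLB miss requests */
} ToyIOTLB;

void toy_iotlb_init(ToyIOTLB *t)
{
    memset(t, 0, sizeof(*t));
}

static void toy_iotlb_insert(ToyIOTLB *t, uint64_t iova_pfn, uint64_t host_pfn)
{
    size_t slot = iova_pfn % TOY_TLB_SLOTS;

    t->valid[slot] = true;
    t->iova_pfn[slot] = iova_pfn;
    t->host_pfn[slot] = host_pfn;
}

/* Static setup: install identity mappings for pages [0, npages) up front. */
void toy_iotlb_prefill_identity(ToyIOTLB *t, uint64_t npages)
{
    assert(npages <= TOY_TLB_SLOTS);
    for (uint64_t pfn = 0; pfn < npages; pfn++) {
        toy_iotlb_insert(t, pfn, pfn);
    }
}

/*
 * Translate an IOVA. On a miss we count one (expensive) miss request and
 * then cache the answer; the answer is assumed to be an identity mapping.
 */
uint64_t toy_iotlb_translate(ToyIOTLB *t, uint64_t iova)
{
    uint64_t pfn = iova >> TOY_PAGE_SHIFT;
    uint64_t offset = iova & ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1);
    size_t slot = pfn % TOY_TLB_SLOTS;

    if (!t->valid[slot] || t->iova_pfn[slot] != pfn) {
        t->misses++;
        toy_iotlb_insert(t, pfn, pfn);
    }
    return (t->host_pfn[slot] << TOY_PAGE_SHIFT) | offset;
}
```

With an empty table every first touch of a page pays for a miss; after `toy_iotlb_prefill_identity()` the same accesses hit without any miss request.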

Thanks

>
> -- peterx
Peter Xu March 31, 2017, 5:28 a.m. UTC | #11
On Fri, Mar 31, 2017 at 01:12:56PM +0800, Jason Wang wrote:
> 
> 
> On 2017年03月31日 13:01, Peter Xu wrote:
> >On Fri, Mar 31, 2017 at 12:21:23PM +0800, Jason Wang wrote:
> >>
> >>On 2017年03月31日 10:56, Peter Xu wrote:
> >>>>>Just come to mind that there may be a corner case here.
> >>>>>
> >>>>>Intel VT-d actually has a "pt" mode which allows device use physical address
> >>>>>even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >>>>>If a device is in this map, then it would use "pt" mode. So that IOMMU driver
> >>>>>would not build second-level page table for it.
> >>>>Yes, but qemu does not support ECAP_PT now, so guest will still have a page
> >>>>table in this case.
> >>>>
> >>>>>Back to the virtual IOVA implementation, if an assigned device is in the
> >>>>>iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >>>>>So it demands a GPA->HPA mapping in host. However, the iommu->ops.replay
> >>>>>is not able to build it when guest SL page table is empty.
> >>>>>
> >>>>>So I think building an entire guest PA->HPA mapping before guest kernel boot
> >>>>>would be recommended. Any thoughts?
> >>>>We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar
> >>>>region and use another region without iommu_ops. Then
> >>>>vfio_listener_region_add() will just do the correct mappings.
> >>>Even without any new region. With the patch 16/17 ("intel_iommu: allow
> >>>dynamic switch of IOMMU region"), we can just turn the IOMMU region
> >>>on/off, following the device's PT bit, maybe using the new
> >>>vtd_switch_address_space() interface. That should be enough.
> >>Right. For vhost it was probably need more works, e.g setting up static
> >>mappings during region_add().
> >Do we need to?
> 
> Not a must if we don't care about performance.
> 
> >
> >VFIO will need it for building up shadow page table, even without a
> >vIOMMU. But imho that should not be needed by vhost, right?
> 
> Device IOTLB will be enabled unconditionally if iommu_platform is specified.
> If we don't set static mappings, vhost will send IOTLB miss request. The
> performance will be horrible in this case.

I see, thanks. So it looks like we will need one more patch for PT
support now. :)

-- peterx
Yi Liu March 31, 2017, 5:34 a.m. UTC | #12
> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 30, 2017 7:58 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan, Tianyu
> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>; 'mst@redhat.com'
> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-devel@nongnu.org>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> 
> 
> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >> -----Original Message-----
> >> From: Liu, Yi L
> >> Sent: Monday, March 27, 2017 5:22 PM
> >> To: Peter Xu <peterx@redhat.com>
> >> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
> >> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >> MemoryRegionIOMMUOps.replay() callback
> >>
> >>> -----Original Message-----
> >>> From: Peter Xu [mailto:peterx@redhat.com]
> >>> Sent: Monday, March 27, 2017 5:12 PM
> >>> To: Liu, Yi L <yi.l.liu@intel.com>
> >>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
> >>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>> MemoryRegionIOMMUOps.replay() callback
> >>>
> >>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>> -----Original Message-----
> >>>>> From: Qemu-devel
> >>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>> Behalf Of Peter Xu
> >>>>> Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>> To: qemu-devel@nongnu.org
> >>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>> jasowang@redhat.com; peterx@redhat.com;
> >>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>> <david@gibson.dropbear.id.au>
> >>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>> Originally we have one memory_region_iommu_replay() function,
> >>>>> which is the default behavior to replay the translations of the
> >>>>> whole IOMMU region. However, on some platform like x86, we may
> >>>>> want our own
> >>> replay logic for IOMMU regions.
> >>>>> This patch add one more hook for IOMMUOps for the callback, and
> >>>>> it'll override the default if set.
> >>>>>
> >>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>> ---
> >>>>>   include/exec/memory.h | 2 ++
> >>>>>   memory.c              | 6 ++++++
> >>>>>   2 files changed, 8 insertions(+)
> >>>>>
> >>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
> >>>>> 0767888..30b2a74 100644
> >>>>> --- a/include/exec/memory.h
> >>>>> +++ b/include/exec/memory.h
> >>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>       void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>                                   IOMMUNotifierFlag old_flags,
> >>>>>                                   IOMMUNotifierFlag new_flags);
> >>>>> +    /* Set this up to provide customized IOMMU replay function */
> >>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
> >>>>>   };
> >>>>>
> >>>>>   typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
> >>>>> --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>> --- a/memory.c
> >>>>> +++ b/memory.c
> >>>>> @@ -1630,6 +1630,12 @@ void
> >>>>> memory_region_iommu_replay(MemoryRegion
> >>>>> *mr, IOMMUNotifier *n,
> >>>>>       hwaddr addr, granularity;
> >>>>>       IOMMUTLBEntry iotlb;
> >>>>> +    /* If the IOMMU has its own replay callback, override */
> >>>>> +    if (mr->iommu_ops->replay) {
> >>>>> +        mr->iommu_ops->replay(mr, n);
> >>>>> +        return;
> >>>>> +    }
> >>>> Hi Alex, Peter,
> >>>>
> >>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
> >>>> replay callback as well? I guess it depends on whether the original
> >>>> replay algorithm work well for them? Do you have such knowledge?
> >>> I guess so. At least for VT-d we had this callback since the default
> >>> replay mechanism did not work well on x86 due to its extremely large
> >>> memory region size. Thanks,
> >> thx. that would make sense.
> > Peter,
> >
> > Just come to mind that there may be a corner case here.
> >
> > Intel VT-d actually has a "pt" mode which allows device use physical
> > address even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> > If a device is in this map, then it would use "pt" mode. So that IOMMU
> > driver would not build second-level page table for it.
> 
> Yes, but qemu does not support ECAP_PT now, so guest will still have a page table in
> this case.

That's true. Without ECAP_PT, the IOMMU driver would create a 1:1 map, so this solution
works well even if a device is in the identity map.

> 
> >
> > Back to the virtual IOVA implementation, if an assigned device is in
> > the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> > So it demands a GPA->HPA mapping in host. However, the
> > iommu->ops.replay is not able to build it when guest SL page table is empty.
> >
> > So I think building an entire guest PA->HPA mapping before guest
> > kernel boot would be recommended. Any thoughts?
> 
> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar region and
> use another region without iommu_ops. Then
> vfio_listener_region_add() will just do the correct mappings.

Good to know. Actually, I also need to expose ECAP_PT for vSVM, so I just realized that
the current replay solution may not work well once ECAP_PT is exposed to the guest.
I also have a rough idea here. When a virtual VT-d is present, the current listener in the
container listens to the address space named after the devfn. How about adding one more
listener on the memory address space, so that it can build the entire guest PA->HPA
mapping? Also, the vfio notifier is registered when changes happen in the device address
space. However, I didn't check whether all the layout changes in the memory address space
happen before the first dynamic map/unmap request from the guest. If not, this solution is
not practical.

Thanks,
Yi L
Jason Wang March 31, 2017, 7:16 a.m. UTC | #13
On 2017-03-31 13:34, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Thursday, March 30, 2017 7:58 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan, Tianyu
>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>; 'mst@redhat.com'
>> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
>> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
>> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
>> devel@nongnu.org>
>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>
>>
>> On 2017年03月30日 19:06, Liu, Yi L wrote:
>>>> -----Original Message-----
>>>> From: Liu, Yi L
>>>> Sent: Monday, March 27, 2017 5:22 PM
>>>> To: Peter Xu <peterx@redhat.com>
>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com; David
>>>> Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>> MemoryRegionIOMMUOps.replay() callback
>>>>
>>>>> -----Original Message-----
>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>> Sent: Monday, March 27, 2017 5:12 PM
>>>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
>>>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>
>>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Qemu-devel
>>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>>>> Behalf Of Peter Xu
>>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>>>> To: qemu-devel@nongnu.org
>>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>>>> <david@gibson.dropbear.id.au>
>>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>
>>>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>>>> which is the default behavior to replay the translations of the
>>>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>>>> want our own
>>>>> replay logic for IOMMU regions.
>>>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>>>> it'll override the default if set.
>>>>>>>
>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>> ---
>>>>>>>    include/exec/memory.h | 2 ++
>>>>>>>    memory.c              | 6 ++++++
>>>>>>>    2 files changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>>>>>>> 0767888..30b2a74 100644
>>>>>>> --- a/include/exec/memory.h
>>>>>>> +++ b/include/exec/memory.h
>>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>>>        void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>>>                                    IOMMUNotifierFlag old_flags,
>>>>>>>                                    IOMMUNotifierFlag new_flags);
>>>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
>>>>>>>    };
>>>>>>>
>>>>>>>    typedef struct CoalescedMemoryRange CoalescedMemoryRange; diff
>>>>>>> --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
>>>>>>> --- a/memory.c
>>>>>>> +++ b/memory.c
>>>>>>> @@ -1630,6 +1630,12 @@ void
>>>>>>> memory_region_iommu_replay(MemoryRegion
>>>>>>> *mr, IOMMUNotifier *n,
>>>>>>>        hwaddr addr, granularity;
>>>>>>>        IOMMUTLBEntry iotlb;
>>>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>>>> +    if (mr->iommu_ops->replay) {
>>>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>>>> +        return;
>>>>>>> +    }
>>>>>> Hi Alex, Peter,
>>>>>>
>>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
>>>>>> replay callback as well? I guess it depends on whether the original
>>>>>> replay algorithm work well for them? Do you have such knowledge?
>>>>> I guess so. At least for VT-d we had this callback since the default
>>>>> replay mechanism did not work well on x86 due to its extremely large
>>>>> memory region size. Thanks,
>>>> thx. that would make sense.
>>> Peter,
>>>
>>> Just come to mind that there may be a corner case here.
>>>
>>> Intel VT-d actually has a "pt" mode which allows device use physical
>>> address even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
>>> If a device is in this map, then it would use "pt" mode. So that IOMMU
>>> driver would not build second-level page table for it.
>> Yes, but qemu does not support ECAP_PT now, so guest will still have a page table in
>> this case.
> That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So this solution
> can work well even a device is in identify_map.
>
>>> Back to the virtual IOVA implementation, if an assigned device is in
>>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
>>> So it demands a GPA->HPA mapping in host. However, the
>>> iommu->ops.replay is not able to build it when guest SL page table is empty.
>>>
>>> So I think building an entire guest PA->HPA mapping before guest
>>> kernel boot would be recommended. Any thoughts?
>> We plan to add PT in 2.10, a possible rough idea is disabled iommu dmar region and
>> use another region without iommu_ops. Then
>> vfio_listener_region_add() will just do the correct mappings.
> Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So just comes to
> realize that the current replay solution may not work well when I expose ECAP_PT to guest.
> I also have a rough idea here. The current listener in container listens to address space
> named with devfn if virtual VTd is added. How about adding one more listener to listen
> memory address space. So that the listener can build entire guest PA->HPA mapping.

This is only needed for PT, so I think the current code is sufficient to
do this. See the else branch of the if (memory_region_is_iommu()) check in
vfio_listener_region_add().
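
A toy model of the branch in question, as plain C rather than the real vfio code. It assumes a listener that first skips anything that is neither IOMMU nor RAM, then either replays through the IOMMU (honouring the optional replay override from this patch) or maps RAM directly in the else branch. Every `Toy*`/`toy_*` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyRegion ToyRegion;
typedef struct ToyNotifier ToyNotifier;

/* Hypothetical stand-in for the IOMMU ops table with the new hook. */
typedef struct {
    /* Optional: if set, it overrides the default replay. */
    void (*replay)(ToyRegion *mr, ToyNotifier *n);
} ToyIOMMUOps;

struct ToyRegion {
    const ToyIOMMUOps *iommu_ops;  /* non-NULL means "IOMMU region" */
    bool               is_ram;
    uint64_t           npages;
};

struct ToyNotifier {
    uint64_t default_replayed_pages;  /* pages walked by the default replay */
    unsigned custom_replays;          /* times a custom hook was called */
    uint64_t static_pages;            /* pages mapped via the RAM branch */
};

typedef enum { TOY_ADD_SKIPPED, TOY_ADD_IOMMU, TOY_ADD_RAM } ToyAddResult;

/* Default behaviour: walk the whole region page by page. */
static void toy_default_replay(ToyRegion *mr, ToyNotifier *n)
{
    for (uint64_t i = 0; i < mr->npages; i++) {
        n->default_replayed_pages++;
    }
}

/* Example override, e.g. for a region too large to walk linearly. */
void toy_custom_replay(ToyRegion *mr, ToyNotifier *n)
{
    (void)mr;
    n->custom_replays++;
}

void toy_iommu_replay(ToyRegion *mr, ToyNotifier *n)
{
    assert(mr->iommu_ops != NULL);
    /* If the IOMMU has its own replay callback, it takes over. */
    if (mr->iommu_ops->replay) {
        mr->iommu_ops->replay(mr, n);
        return;
    }
    toy_default_replay(mr, n);
}

ToyAddResult toy_listener_region_add(ToyRegion *mr, ToyNotifier *n)
{
    if (!mr->iommu_ops && !mr->is_ram) {
        return TOY_ADD_SKIPPED;        /* neither IOMMU nor RAM */
    }
    if (mr->iommu_ops) {
        toy_iommu_replay(mr, n);       /* dynamic mappings via the IOMMU */
        return TOY_ADD_IOMMU;
    }
    n->static_pages += mr->npages;     /* else branch: map RAM directly */
    return TOY_ADD_RAM;
}
```

In this model the else branch only helps if the listener actually sees a RAM region; a region that is neither RAM nor IOMMU is skipped before the branch is reached.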

Thanks

>   Also,
> the vfio notifier is registered when changes happen in device address space. However, I
> didn’t check if all the layout changes in memory address space happen before the first
> dynamic map/unmap request from guest. If not, this solution is not practical.
>
> Thanks,
> Yi L
Yi Liu March 31, 2017, 7:30 a.m. UTC | #14
> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, March 31, 2017 3:17 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>;
> 'qemu-devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
> 
> 
> 
> On 2017年03月31日 13:34, Liu, Yi L wrote:
> >> -----Original Message-----
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Thursday, March 30, 2017 7:58 PM
> >> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> >> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan,
> >> Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>; 'mst@redhat.com'
> >> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
> >> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
> >> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-devel@nongnu.org>
> >> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >> MemoryRegionIOMMUOps.replay() callback
> >>
> >>
> >>
> >> On 2017年03月30日 19:06, Liu, Yi L wrote:
> >>>> -----Original Message-----
> >>>> From: Liu, Yi L
> >>>> Sent: Monday, March 27, 2017 5:22 PM
> >>>> To: Peter Xu <peterx@redhat.com>
> >>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
> >>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
> >>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
> >>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
> >>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>> MemoryRegionIOMMUOps.replay() callback
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Peter Xu [mailto:peterx@redhat.com]
> >>>>> Sent: Monday, March 27, 2017 5:12 PM
> >>>>> To: Liu, Yi L <yi.l.liu@intel.com>
> >>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
> >>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> >>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
> >>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
> >>>>> qemu-devel@nongnu.org
> >>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>
> >>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
> >>>>>>> -----Original Message-----
> >>>>>>> From: Qemu-devel
> >>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
> >>>>>>> Behalf Of Peter Xu
> >>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
> >>>>>>> To: qemu-devel@nongnu.org
> >>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
> >>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
> >>>>>>> jasowang@redhat.com; peterx@redhat.com;
> >>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
> >>>>>>> <david@gibson.dropbear.id.au>
> >>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
> >>>>>>> MemoryRegionIOMMUOps.replay() callback
> >>>>>>>
> >>>>>>> Originally we have one memory_region_iommu_replay() function,
> >>>>>>> which is the default behavior to replay the translations of the
> >>>>>>> whole IOMMU region. However, on some platform like x86, we may
> >>>>>>> want our own
> >>>>> replay logic for IOMMU regions.
> >>>>>>> This patch add one more hook for IOMMUOps for the callback, and
> >>>>>>> it'll override the default if set.
> >>>>>>>
> >>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>>>>>> ---
> >>>>>>>    include/exec/memory.h | 2 ++
> >>>>>>>    memory.c              | 6 ++++++
> >>>>>>>    2 files changed, 8 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
> >>>>>>> 0767888..30b2a74 100644
> >>>>>>> --- a/include/exec/memory.h
> >>>>>>> +++ b/include/exec/memory.h
> >>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
> >>>>>>>        void (*notify_flag_changed)(MemoryRegion *iommu,
> >>>>>>>                                    IOMMUNotifierFlag old_flags,
> >>>>>>>                                    IOMMUNotifierFlag new_flags);
> >>>>>>> +    /* Set this up to provide customized IOMMU replay function */
> >>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier
> >>>>>>> + *notifier);
> >>>>>>>    };
> >>>>>>>
> >>>>>>>    typedef struct CoalescedMemoryRange CoalescedMemoryRange;
> >>>>>>> diff --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
> >>>>>>> --- a/memory.c
> >>>>>>> +++ b/memory.c
> >>>>>>> @@ -1630,6 +1630,12 @@ void
> >>>>>>> memory_region_iommu_replay(MemoryRegion
> >>>>>>> *mr, IOMMUNotifier *n,
> >>>>>>>        hwaddr addr, granularity;
> >>>>>>>        IOMMUTLBEntry iotlb;
> >>>>>>> +    /* If the IOMMU has its own replay callback, override */
> >>>>>>> +    if (mr->iommu_ops->replay) {
> >>>>>>> +        mr->iommu_ops->replay(mr, n);
> >>>>>>> +        return;
> >>>>>>> +    }
> >>>>>> Hi Alex, Peter,
> >>>>>>
> >>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
> >>>>>> replay callback as well? I guess it depends on whether the
> >>>>>> original replay algorithm work well for them? Do you have such knowledge?
> >>>>> I guess so. At least for VT-d we had this callback since the
> >>>>> default replay mechanism did not work well on x86 due to its
> >>>>> extremely large memory region size. Thanks,
> >>>> thx. that would make sense.
> >>> Peter,
> >>>
> >>> Just come to mind that there may be a corner case here.
> >>>
> >>> Intel VT-d actually has a "pt" mode which allows device use physical
> >>> address even when VT-d is enabled. In kernel, there is a iommu_identity_mapping.
> >>> If a device is in this map, then it would use "pt" mode. So that
> >>> IOMMU driver would not build second-level page table for it.
> >> Yes, but qemu does not support ECAP_PT now, so guest will still have
> >> a page table in this case.
> > That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So
> > this solution can work well even a device is in identify_map.
> >
> >>> Back to the virtual IOVA implementation, if an assigned device is in
> >>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do DMA.
> >>> So it demands a GPA->HPA mapping in host. However, the
> >>> iommu->ops.replay is not able to build it when guest SL page table is empty.
> >>>
> >>> So I think building an entire guest PA->HPA mapping before guest
> >>> kernel boot would be recommended. Any thoughts?
> >> We plan to add PT in 2.10, a possible rough idea is disabled iommu
> >> dmar region and use another region without iommu_ops. Then
> >> vfio_listener_region_add() will just do the correct mappings.
> > Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So
> > just comes to realize that the current replay solution may not work well when I expose ECAP_PT to guest.
> > I also have a rough idea here. The current listener in container
> > listens to address space named with devfn if virtual VTd is added. How
> > about adding one more listener to listen memory address space. So that the listener can build entire guest PA->HPA mapping.
> 
> This is only needed for PT. So looks like current code is sufficient to do this I think.
> See the else part of if (memory_region_is_iommu()) of vfio_listener_region_add().

Jason, when the listener listens to the device address space, the "else part" may not work
even if we set mr->iommu_ops = NULL. The mr would be a non-RAM region at the time
region_add is called, since the listener is actually listening to changes in the device
address space.

Regards,
Yi L
Jason Wang April 1, 2017, 5 a.m. UTC | #15
On 2017-03-31 15:30, Liu, Yi L wrote:
>> -----Original Message-----
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Friday, March 31, 2017 3:17 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
>> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'qemu-
>> devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
>> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>> MemoryRegionIOMMUOps.replay() callback
>>
>>
>>
>> On 2017年03月31日 13:34, Liu, Yi L wrote:
>>>> -----Original Message-----
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Thursday, March 30, 2017 7:58 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
>>>> Cc: 'alex.williamson@redhat.com' <alex.williamson@redhat.com>; Lan,
>>>> Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>> 'mst@redhat.com'
>>>> <mst@redhat.com>; 'jan.kiszka@siemens.com' <jan.kiszka@siemens.com>;
>>>> 'bd.aviv@gmail.com' <bd.aviv@gmail.com>; 'David Gibson'
>>>> <david@gibson.dropbear.id.au>; 'qemu-devel@nongnu.org' <qemu-
>>>> devel@nongnu.org>
>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>> MemoryRegionIOMMUOps.replay() callback
>>>>
>>>>
>>>>
>>>> On 2017年03月30日 19:06, Liu, Yi L wrote:
>>>>>> -----Original Message-----
>>>>>> From: Liu, Yi L
>>>>>> Sent: Monday, March 27, 2017 5:22 PM
>>>>>> To: Peter Xu <peterx@redhat.com>
>>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu <tianyu.lan@intel.com>;
>>>>>> Tian, Kevin <kevin.tian@intel.com>; mst@redhat.com;
>>>>>> jan.kiszka@siemens.com; jasowang@redhat.com; bd.aviv@gmail.com;
>>>>>> David Gibson <david@gibson.dropbear.id.au>; qemu-devel@nongnu.org
>>>>>> Subject: RE: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Peter Xu [mailto:peterx@redhat.com]
>>>>>>> Sent: Monday, March 27, 2017 5:12 PM
>>>>>>> To: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Cc: alex.williamson@redhat.com; Lan, Tianyu
>>>>>>> <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
>>>>>>> mst@redhat.com; jan.kiszka@siemens.com; jasowang@redhat.com;
>>>>>>> bd.aviv@gmail.com; David Gibson <david@gibson.dropbear.id.au>;
>>>>>>> qemu-devel@nongnu.org
>>>>>>> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>
>>>>>>> On Mon, Mar 27, 2017 at 08:35:05AM +0000, Liu, Yi L wrote:
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Qemu-devel
>>>>>>>>> [mailto:qemu-devel-bounces+yi.l.liu=intel.com@nongnu.org] On
>>>>>>>>> Behalf Of Peter Xu
>>>>>>>>> Sent: Tuesday, February 7, 2017 4:28 PM
>>>>>>>>> To: qemu-devel@nongnu.org
>>>>>>>>> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin
>>>>>>>>> <kevin.tian@intel.com>; mst@redhat.com; jan.kiszka@siemens.com;
>>>>>>>>> jasowang@redhat.com; peterx@redhat.com;
>>>>>>>>> alex.williamson@redhat.com; bd.aviv@gmail.com; David Gibson
>>>>>>>>> <david@gibson.dropbear.id.au>
>>>>>>>>> Subject: [Qemu-devel] [PATCH v7 14/17] memory: add
>>>>>>>>> MemoryRegionIOMMUOps.replay() callback
>>>>>>>>>
>>>>>>>>> Originally we have one memory_region_iommu_replay() function,
>>>>>>>>> which is the default behavior to replay the translations of the
>>>>>>>>> whole IOMMU region. However, on some platform like x86, we may
>>>>>>>>> want our own
>>>>>>> replay logic for IOMMU regions.
>>>>>>>>> This patch add one more hook for IOMMUOps for the callback, and
>>>>>>>>> it'll override the default if set.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>>>>>> ---
>>>>>>>>>     include/exec/memory.h | 2 ++
>>>>>>>>>     memory.c              | 6 ++++++
>>>>>>>>>     2 files changed, 8 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/exec/memory.h b/include/exec/memory.h index
>>>>>>>>> 0767888..30b2a74 100644
>>>>>>>>> --- a/include/exec/memory.h
>>>>>>>>> +++ b/include/exec/memory.h
>>>>>>>>> @@ -191,6 +191,8 @@ struct MemoryRegionIOMMUOps {
>>>>>>>>>         void (*notify_flag_changed)(MemoryRegion *iommu,
>>>>>>>>>                                     IOMMUNotifierFlag old_flags,
>>>>>>>>>                                     IOMMUNotifierFlag new_flags);
>>>>>>>>> +    /* Set this up to provide customized IOMMU replay function */
>>>>>>>>> +    void (*replay)(MemoryRegion *iommu, IOMMUNotifier
>>>>>>>>> + *notifier);
>>>>>>>>>     };
>>>>>>>>>
>>>>>>>>>     typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>>>>>>>>> diff --git a/memory.c b/memory.c index 7a4f2f9..9c253cc 100644
>>>>>>>>> --- a/memory.c
>>>>>>>>> +++ b/memory.c
>>>>>>>>> @@ -1630,6 +1630,12 @@ void
>>>>>>>>> memory_region_iommu_replay(MemoryRegion
>>>>>>>>> *mr, IOMMUNotifier *n,
>>>>>>>>>         hwaddr addr, granularity;
>>>>>>>>>         IOMMUTLBEntry iotlb;
>>>>>>>>> +    /* If the IOMMU has its own replay callback, override */
>>>>>>>>> +    if (mr->iommu_ops->replay) {
>>>>>>>>> +        mr->iommu_ops->replay(mr, n);
>>>>>>>>> +        return;
>>>>>>>>> +    }
>>>>>>>> Hi Alex, Peter,
>>>>>>>>
>>>>>>>> Will all the other vendors(e.g. PPC, s390, ARM) add their own
>>>>>>>> replay callback as well? I guess it depends on whether the
>>>>>>>> original replay algorithm work well for them? Do you have such knowledge?
>>>>>>> I guess so. At least for VT-d we had this callback since the
>>>>>>> default replay mechanism did not work well on x86 due to its
>>>>>>> extremely large memory region size. Thanks,
>>>>>> thx. that would make sense.
>>>>> Peter,
>>>>>
>>>>> Just come to mind that there may be a corner case here.
>>>>>
>>>>> Intel VT-d actually has a "pt" mode which allows device use physical
>>>>> address even when VT-d is enabled. In kernel, there is a
>> iommu_identity_mapping.
>>>>> If a device is in this map, then it would use "pt" mode. So that
>>>>> IOMMU driver would not build second-level page table for it.
>>>> Yes, but qemu does not support ECAP_PT now, so guest will still have
>>>> a page table in this case.
>>> That's true. Without ECAP_PT, IOMMU driver would create a 1:1 map. So
>>> this solution can work well even a device is in identify_map.
>>>
>>>>> Back to the virtual IOVA implementation, if an assigned device is in
>>>>> the iommu_identity_mapping(e.g. VGA controller), it uses GPA directly to do
>> DMA.
>>>>> So it demands a GPA->HPA mapping in host. However, the
>>>>> iommu->ops.replay is not able to build it when guest SL page table is empty.
>>>>>
>>>>> So I think building an entire guest PA->HPA mapping before guest
>>>>> kernel boot would be recommended. Any thoughts?
>>>> We plan to add PT in 2.10, a possible rough idea is disabled iommu
>>>> dmar region and use another region without iommu_ops. Then
>>>> vfio_listener_region_add() will just do the correct mappings.
>>> Good to know it. Actually, I also need to expose ECAP_PT for vSVM. So
>>> just comes to realize that the current replay solution may not work well when I
>> expose ECAP_PT to guest.
>>> I also have a rough idea here. The current listener in container
>>> listens to address space named with devfn if virtual VTd is added. How
>>> about adding one more listener to listen memory address space. So that the
>> listener can build entire guest PA->HPA mapping.
>>
>> This is only needed for PT, so it looks like the current code is sufficient to do this, I think.
>> See the else part of if (memory_region_is_iommu()) in vfio_listener_region_add().
> Jason, when the listener listens to the device address space, the "else part" may not work
> even if we set mr->iommu_ops = NULL. The mr would be a non-RAM region at the
> time region_add is called, since it is actually listening to changes from the device address space.
>
> Regards,
> Yi L
>

See Peter's patch ("intel_iommu: allow dynamic switch of IOMMU region").
It has

+        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
+                                 "vtd_sys_alias", get_system_memory(),
+                                 0, memory_region_size(get_system_memory()));

We can enable sys_alias when PT is used, which should work I think.
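
To make the idea concrete, here is a rough, self-contained toy model (not QEMU
code). It assumes the per-device address space holds two overlapping regions,
the translated IOMMU region and the system-memory alias, and that exactly one
of them is enabled at a time. ToyRegion, ToyDevAS, toy_switch_address_space()
and toy_active_region() are all hypothetical names made up for illustration;
only "vtd_sys_alias" comes from the snippet above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a MemoryRegion: only a name and an enabled flag. */
typedef struct ToyRegion {
    const char *name;
    bool enabled;
} ToyRegion;

/* Toy stand-in for a per-device address space with two overlapping regions. */
typedef struct ToyDevAS {
    ToyRegion iommu;      /* translated region, served by the vIOMMU */
    ToyRegion sys_alias;  /* alias of system memory: GPA is used directly */
} ToyDevAS;

/* Hypothetical helper: enable exactly one of the two regions. */
static void toy_switch_address_space(ToyDevAS *as, bool use_pt)
{
    assert(as != NULL);
    as->iommu.enabled = !use_pt;
    as->sys_alias.enabled = use_pt;
}

/* Which region would a listener on this address space see as active? */
static const ToyRegion *toy_active_region(const ToyDevAS *as)
{
    assert(as != NULL);
    return as->sys_alias.enabled ? &as->sys_alias : &as->iommu;
}
```

With PT on, a listener attached to the device address space would see a plain
alias of system memory rather than an IOMMU region, which is the case the
"else part" of vfio_listener_region_add() is meant to handle.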

Thanks
Yi Liu April 1, 2017, 6:39 a.m. UTC | #16
> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Saturday, April 1, 2017 1:01 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; 'Peter Xu' <peterx@redhat.com>
> Cc: Lan, Tianyu <tianyu.lan@intel.com>; Tian, Kevin <kevin.tian@intel.com>;
> 'mst@redhat.com' <mst@redhat.com>; 'jan.kiszka@siemens.com'
> <jan.kiszka@siemens.com>; 'bd.aviv@gmail.com' <bd.aviv@gmail.com>;
> 'qemu-devel@nongnu.org' <qemu-devel@nongnu.org>; 'alex.williamson@redhat.com'
> <alex.williamson@redhat.com>; 'David Gibson' <david@gibson.dropbear.id.au>
> Subject: Re: [Qemu-devel] [PATCH v7 14/17] memory: add
> MemoryRegionIOMMUOps.replay() callback
>
> See Peter's patch ("intel_iommu: allow dynamic switch of IOMMU region").
> It has
>
> +        memory_region_init_alias(&vtd_dev_as->sys_alias, OBJECT(s),
> +                                 "vtd_sys_alias", get_system_memory(),
> +                                 0, memory_region_size(get_system_memory()));
>
> We can enable sys_alias when PT is used, which should work I think.

Great. I think it works. Thx.

Regards,
Yi L

Patch

diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0767888..30b2a74 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -191,6 +191,8 @@  struct MemoryRegionIOMMUOps {
     void (*notify_flag_changed)(MemoryRegion *iommu,
                                 IOMMUNotifierFlag old_flags,
                                 IOMMUNotifierFlag new_flags);
+    /* Set this up to provide customized IOMMU replay function */
+    void (*replay)(MemoryRegion *iommu, IOMMUNotifier *notifier);
 };
 
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
diff --git a/memory.c b/memory.c
index 7a4f2f9..9c253cc 100644
--- a/memory.c
+++ b/memory.c
@@ -1630,6 +1630,12 @@  void memory_region_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n,
     hwaddr addr, granularity;
     IOMMUTLBEntry iotlb;
 
+    /* If the IOMMU has its own replay callback, override */
+    if (mr->iommu_ops->replay) {
+        mr->iommu_ops->replay(mr, n);
+        return;
+    }
+
     granularity = memory_region_iommu_get_min_page_size(mr);
 
     for (addr = 0; addr < memory_region_size(mr); addr += granularity) {
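
The dispatch added above can be illustrated with a small standalone toy model
(not QEMU code). It assumes a simplified ops table with an optional replay
hook and a notifier that only counts how many pages a replay touched.
ToyRegion, ToyIOMMUOps, ToyNotifier, toy_iommu_replay() and
toy_sparse_replay() are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyRegion ToyRegion;

/* Toy notifier: just counts the pages a replay touched. */
typedef struct ToyNotifier {
    unsigned long pages_touched;
} ToyNotifier;

/* Toy ops table; replay is optional, mirroring the new hook. */
typedef struct ToyIOMMUOps {
    void (*replay)(ToyRegion *mr, ToyNotifier *n);
} ToyIOMMUOps;

struct ToyRegion {
    const ToyIOMMUOps *ops;
    uint64_t size;         /* region size in bytes */
    uint64_t granularity;  /* minimum page size in bytes */
};

/* Same shape as the patched function: use the hook if set, else walk. */
static void toy_iommu_replay(ToyRegion *mr, ToyNotifier *n)
{
    assert(mr && mr->ops && n && mr->granularity > 0);

    /* If the region has its own replay callback, it overrides the default. */
    if (mr->ops->replay) {
        mr->ops->replay(mr, n);
        return;
    }

    /* Default: visit every page; cost grows with size / granularity. */
    for (uint64_t addr = 0; addr < mr->size; addr += mr->granularity) {
        n->pages_touched++;
    }
}

/* Hypothetical custom hook: pretend only two pages are actually mapped. */
static void toy_sparse_replay(ToyRegion *mr, ToyNotifier *n)
{
    (void)mr;
    n->pages_touched += 2;
}
```

The default walk touches one page per granule across the whole region, which
is why it scales poorly for a very large region, while a custom hook can visit
only what it knows to be mapped.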