diff mbox series

[kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains

Message ID 20220707135552.3688927-1-aik@ozlabs.ru
State Changes Requested
Headers show
Series [kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains | expand

Commit Message

Alexey Kardashevskiy July 7, 2022, 1:55 p.m. UTC
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
the Type1 VFIO driver. Recent development though has added a coherency
capability check to the generic part of VFIO and essentially disabled
VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed().

This adds an iommu_ops stub which reports support for cache
coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices,
this provides minimum code for the probing to not crash.

Because now we have to set iommu_ops to the system (bus_set_iommu() or
iommu_device_register()), this requires the POWERNV PCI setup to happen
after bus_register(&pci_bus_type) which is postcore_initcall
TODO: check if it still works, read sha1, for more details:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387

Because setting the ops triggers probing, this does not work well with
iommu_group_add_device(), hence the move to iommu_probe_device().

Because iommu_probe_device() does not take the group (which is why
we had the helper in the first place), this adds
pci_controller_ops::device_group.

So, basically there is one iommu_device per PHB and devices are added to
groups indirectly via series of calls inside the IOMMU code.

pSeries is out of scope here (a minor fix needed for barely supported
platform in regard to VFIO).

The previous discussion is here:
https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/

Fixes: e8ae0e140c05 ("vfio: Require that devices support DMA cache coherence")
Fixes: 70693f470848 ("vfio: Set DMA ownership for VFIO devices")
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>
Cc: Fabiano Rosas <farosas@linux.ibm.com>
Cc: Murilo Opsfelder Araujo <muriloo@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---


does it make sense to have this many callbacks, or
the generic IOMMU code can safely operate without some
(given I add some more checks for !NULL)? thanks,


---
 arch/powerpc/include/asm/iommu.h          |   2 +
 arch/powerpc/include/asm/pci-bridge.h     |   7 ++
 arch/powerpc/kernel/iommu.c               | 106 +++++++++++++++++++++-
 arch/powerpc/kernel/pci-common.c          |   2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  40 ++++++++
 5 files changed, 155 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe July 7, 2022, 3:10 p.m. UTC | #1
On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
> the Type1 VFIO driver. Recent development though has added a coherency
> capability check to the generic part of VFIO and essentially disabled
> VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed().
> 
> This adds an iommu_ops stub which reports support for cache
> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices,
> this provides minimum code for the probing to not crash.
> 
> Because now we have to set iommu_ops to the system (bus_set_iommu() or
> iommu_device_register()), this requires the POWERNV PCI setup to happen
> after bus_register(&pci_bus_type) which is postcore_initcall
> TODO: check if it still works, read sha1, for more details:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387
> 
> Because setting the ops triggers probing, this does not work well with
> iommu_group_add_device(), hence the move to iommu_probe_device().
> 
> Because iommu_probe_device() does not take the group (which is why
> we had the helper in the first place), this adds
> pci_controller_ops::device_group.
> 
> So, basically there is one iommu_device per PHB and devices are added to
> groups indirectly via series of calls inside the IOMMU code.
> 
> pSeries is out of scope here (a minor fix needed for barely supported
> platform in regard to VFIO).
> 
> The previous discussion is here:
> https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/

I think this is basically OK, for what it is. It looks like there is
more some-day opportunity to make use of the core infrastructure though.

> does it make sense to have this many callbacks, or
> the generic IOMMU code can safely operate without some
> (given I add some more checks for !NULL)? thanks,

I wouldn't worry about it..

> @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
>  	pr_debug("%s: Adding %s to iommu group %d\n",
>  		 __func__, dev_name(dev),  iommu_group_id(table_group->group));
>  
> -	return iommu_group_add_device(table_group->group, dev);
> +	ret = iommu_probe_device(dev);
> +	dev_info(dev, "probed with %d\n", ret);

For another day, but it seems a bit strange to call iommu_probe_device() like this?
Shouldn't one of the existing call sites cover this? The one in
of_iommu.c perhaps?

> +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
> +{
> +       return false;
> +}

I think you can NULL this op:

static bool iommu_is_attach_deferred(struct device *dev)
{
	const struct iommu_ops *ops = dev_iommu_ops(dev);

	if (ops->is_attach_deferred)
		return ops->is_attach_deferred(dev);

	return false;
}

> +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
> +{
> +	struct pci_controller *hose;
> +	struct pci_dev *pdev;
> +
> +	/* Weirdly iommu_device_register() assigns the same ops to all buses */
> +	if (!dev_is_pci(dev))
> +		return ERR_PTR(-EPERM);
> +
> +	pdev = to_pci_dev(dev);
> +	hose = pdev->bus->sysdata;
> +
> +	if (!hose->controller_ops.device_group)
> +		return ERR_PTR(-ENOENT);
> +
> +	return hose->controller_ops.device_group(hose, pdev);
> +}

Is this missing a refcount get on the group?

> +
> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
> +				      struct device *dev)
> +{
> +	return 0;
> +}

It is important when this returns that the iommu translation is all
emptied. There should be no left over translations from the DMA API at
this point. I have no idea how power works in this regard, but it
should be explained why this is safe in a comment at a minimum.

It will turn into a security problem to allow kernel mappings to leak
past this point.

Jason
Alexey Kardashevskiy July 8, 2022, 5 a.m. UTC | #2
On 7/8/22 01:10, Jason Gunthorpe wrote:
> On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
>> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
>> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
>> the Type1 VFIO driver. Recent development though has added a coherency
>> capability check to the generic part of VFIO and essentially disabled
>> VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed().
>>
>> This adds an iommu_ops stub which reports support for cache
>> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices,
>> this provides minimum code for the probing to not crash.
>>
>> Because now we have to set iommu_ops to the system (bus_set_iommu() or
>> iommu_device_register()), this requires the POWERNV PCI setup to happen
>> after bus_register(&pci_bus_type) which is postcore_initcall
>> TODO: check if it still works, read sha1, for more details:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387
>>
>> Because setting the ops triggers probing, this does not work well with
>> iommu_group_add_device(), hence the move to iommu_probe_device().
>>
>> Because iommu_probe_device() does not take the group (which is why
>> we had the helper in the first place), this adds
>> pci_controller_ops::device_group.
>>
>> So, basically there is one iommu_device per PHB and devices are added to
>> groups indirectly via series of calls inside the IOMMU code.
>>
>> pSeries is out of scope here (a minor fix needed for barely supported
>> platform in regard to VFIO).
>>
>> The previous discussion is here:
>> https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/
> 
> I think this is basically OK, for what it is. It looks like there is
> more some-day opportunity to make use of the core infrastructure though.
> 
>> does it make sense to have this many callbacks, or
>> the generic IOMMU code can safely operate without some
>> (given I add some more checks for !NULL)? thanks,
> 
> I wouldn't worry about it..
> 
>> @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
>>   	pr_debug("%s: Adding %s to iommu group %d\n",
>>   		 __func__, dev_name(dev),  iommu_group_id(table_group->group));
>>   
>> -	return iommu_group_add_device(table_group->group, dev);
>> +	ret = iommu_probe_device(dev);
>> +	dev_info(dev, "probed with %d\n", ret);
> 
> For another day, but it seems a bit strange to call iommu_probe_device() like this?
> Shouldn't one of the existing call sites cover this? The one in
> of_iommu.c perhaps?


It looks to me that of_iommu.c expects the iommu setup to happen before 
linux starts as linux looks for #iommu-cells or iommu-map properties in 
the device tree. The powernv firmware (aka skiboot) does not do this and 
it is linux which manages iommu groups.


>> +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
>> +{
>> +       return false;
>> +}
> 
> I think you can NULL this op:
> 
> static bool iommu_is_attach_deferred(struct device *dev)
> {
> 	const struct iommu_ops *ops = dev_iommu_ops(dev);
> 
> 	if (ops->is_attach_deferred)
> 		return ops->is_attach_deferred(dev);
> 
> 	return false;
> }
> 
>> +static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
>> +{
>> +	struct pci_controller *hose;
>> +	struct pci_dev *pdev;
>> +
>> +	/* Weirdly iommu_device_register() assigns the same ops to all buses */
>> +	if (!dev_is_pci(dev))
>> +		return ERR_PTR(-EPERM);
>> +
>> +	pdev = to_pci_dev(dev);
>> +	hose = pdev->bus->sysdata;
>> +
>> +	if (!hose->controller_ops.device_group)
>> +		return ERR_PTR(-ENOENT);
>> +
>> +	return hose->controller_ops.device_group(hose, pdev);
>> +}
> 
> Is this missing a refcount get on the group?
> 
>> +
>> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
>> +				      struct device *dev)
>> +{
>> +	return 0;
>> +}
> 
> It is important when this returns that the iommu translation is all
> emptied. There should be no left over translations from the DMA API at
> this point. I have no idea how power works in this regard, but it
> should be explained why this is safe in a comment at a minimum.
>
 > It will turn into a security problem to allow kernel mappings to leak
 > past this point.
 >

I've added for v2 checking for no valid mappings for a device (or, more 
precisely, in the associated iommu_group), this domain does not need 
checking, right?

In general, is "domain" something from hardware or it is a software 
concept? Thanks,


> Jason
Alexey Kardashevskiy July 8, 2022, 6:34 a.m. UTC | #3
On 7/8/22 15:00, Alexey Kardashevskiy wrote:
> 
> 
> On 7/8/22 01:10, Jason Gunthorpe wrote:
>> On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
>>> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
>>> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
>>> the Type1 VFIO driver. Recent development though has added a coherency
>>> capability check to the generic part of VFIO and essentially disabled
>>> VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed().
>>>
>>> This adds an iommu_ops stub which reports support for cache
>>> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI 
>>> devices,
>>> this provides minimum code for the probing to not crash.
>>>
>>> Because now we have to set iommu_ops to the system (bus_set_iommu() or
>>> iommu_device_register()), this requires the POWERNV PCI setup to happen
>>> after bus_register(&pci_bus_type) which is postcore_initcall
>>> TODO: check if it still works, read sha1, for more details:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387
>>>
>>> Because setting the ops triggers probing, this does not work well with
>>> iommu_group_add_device(), hence the move to iommu_probe_device().
>>>
>>> Because iommu_probe_device() does not take the group (which is why
>>> we had the helper in the first place), this adds
>>> pci_controller_ops::device_group.
>>>
>>> So, basically there is one iommu_device per PHB and devices are added to
>>> groups indirectly via series of calls inside the IOMMU code.
>>>
>>> pSeries is out of scope here (a minor fix needed for barely supported
>>> platform in regard to VFIO).
>>>
>>> The previous discussion is here:
>>> https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/
>>
>> I think this is basically OK, for what it is. It looks like there is
>> more some-day opportunity to make use of the core infrastructure though.
>>
>>> does it make sense to have this many callbacks, or
>>> the generic IOMMU code can safely operate without some
>>> (given I add some more checks for !NULL)? thanks,
>>
>> I wouldn't worry about it..
>>
>>> @@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group 
>>> *table_group, struct device *dev)
>>>       pr_debug("%s: Adding %s to iommu group %d\n",
>>>            __func__, dev_name(dev),  
>>> iommu_group_id(table_group->group));
>>> -    return iommu_group_add_device(table_group->group, dev);
>>> +    ret = iommu_probe_device(dev);
>>> +    dev_info(dev, "probed with %d\n", ret);
>>
>> For another day, but it seems a bit strange to call 
>> iommu_probe_device() like this?
>> Shouldn't one of the existing call sites cover this? The one in
>> of_iommu.c perhaps?
> 
> 
> It looks to me that of_iommu.c expects the iommu setup to happen before 
> linux starts as linux looks for #iommu-cells or iommu-map properties in 
> the device tree. The powernv firmware (aka skiboot) does not do this and 
> it is linux which manages iommu groups.
> 
> 
>>> +static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
>>> +{
>>> +       return false;
>>> +}
>>
>> I think you can NULL this op:
>>
>> static bool iommu_is_attach_deferred(struct device *dev)
>> {
>>     const struct iommu_ops *ops = dev_iommu_ops(dev);
>>
>>     if (ops->is_attach_deferred)
>>         return ops->is_attach_deferred(dev);
>>
>>     return false;
>> }
>>
>>> +static struct iommu_group *spapr_tce_iommu_device_group(struct 
>>> device *dev)
>>> +{
>>> +    struct pci_controller *hose;
>>> +    struct pci_dev *pdev;
>>> +
>>> +    /* Weirdly iommu_device_register() assigns the same ops to all 
>>> buses */
>>> +    if (!dev_is_pci(dev))
>>> +        return ERR_PTR(-EPERM);
>>> +
>>> +    pdev = to_pci_dev(dev);
>>> +    hose = pdev->bus->sysdata;
>>> +
>>> +    if (!hose->controller_ops.device_group)
>>> +        return ERR_PTR(-ENOENT);
>>> +
>>> +    return hose->controller_ops.device_group(hose, pdev);
>>> +}
>>
>> Is this missing a refcount get on the group?
>>
>>> +
>>> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
>>> +                      struct device *dev)
>>> +{
>>> +    return 0;
>>> +}
>>
>> It is important when this returns that the iommu translation is all
>> emptied. There should be no left over translations from the DMA API at
>> this point. I have no idea how power works in this regard, but it
>> should be explained why this is safe in a comment at a minimum.
>>
>  > It will turn into a security problem to allow kernel mappings to leak
>  > past this point.
>  >
> 
> I've added for v2 checking for no valid mappings for a device (or, more 
> precisely, in the associated iommu_group), this domain does not need 
> checking, right?


Uff, not that simple. Looks like once a device is in a group, its 
dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess then 
there is a way to set those to NULL or do something similar to let
dma_map_direct() from kernel/dma/mapping.c return "true", is not there?

For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is 
fine to do nothing as tce_iommu_take_ownership() and 
tce_iommu_take_ownership_ddw() take care of not having active DMA 
mappings. Thanks,


> 
> In general, is "domain" something from hardware or it is a software 
> concept? Thanks,
> 
> 
>> Jason
>
Tian, Kevin July 8, 2022, 7:32 a.m. UTC | #4
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> Sent: Friday, July 8, 2022 2:35 PM
> On 7/8/22 15:00, Alexey Kardashevskiy wrote:
> >
> >
> > On 7/8/22 01:10, Jason Gunthorpe wrote:
> >> On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
> >>> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
> >>> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
> >>> the Type1 VFIO driver. Recent development though has added a
> coherency
> >>> capability check to the generic part of VFIO and essentially disabled
> >>> VFIO on PPC64; the similar story about
> iommu_group_dma_owner_claimed().
> >>>
> >>> This adds an iommu_ops stub which reports support for cache
> >>> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI
> >>> devices,
> >>> this provides minimum code for the probing to not crash.

stale comment since this patch doesn't use bus_set_iommu() now.

> >>> +
> >>> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
> >>> +                      struct device *dev)
> >>> +{
> >>> +    return 0;
> >>> +}
> >>
> >> It is important when this returns that the iommu translation is all
> >> emptied. There should be no left over translations from the DMA API at
> >> this point. I have no idea how power works in this regard, but it
> >> should be explained why this is safe in a comment at a minimum.
> >>
> >  > It will turn into a security problem to allow kernel mappings to leak
> >  > past this point.
> >  >
> >
> > I've added for v2 checking for no valid mappings for a device (or, more
> > precisely, in the associated iommu_group), this domain does not need
> > checking, right?
> 
> 
> Uff, not that simple. Looks like once a device is in a group, its
> dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess
> then
> there is a way to set those to NULL or do something similar to let
> dma_map_direct() from kernel/dma/mapping.c return "true", is not there?

dev->dma_ops is NULL as long as you don't handle DMA domain type
here and don't call iommu_setup_dma_ops().

Given this only supports blocking domain then above should be irrelevant.

> 
> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is
> fine to do nothing as tce_iommu_take_ownership() and
> tce_iommu_take_ownership_ddw() take care of not having active DMA
> mappings. Thanks,
> 
> 
> >
> > In general, is "domain" something from hardware or it is a software
> > concept? Thanks,
> >

'domain' is a software concept to represent the hardware I/O page
table. A blocking domain means all DMAs from a device attached to
this domain are blocked/rejected (equivalent to an empty I/O page
table), usually enforced in the .attach_dev() callback. 

Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.

Thanks
Kevin
Alexey Kardashevskiy July 8, 2022, 9:45 a.m. UTC | #5
On 08/07/2022 17:32, Tian, Kevin wrote:
>> From: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Sent: Friday, July 8, 2022 2:35 PM
>> On 7/8/22 15:00, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 7/8/22 01:10, Jason Gunthorpe wrote:
>>>> On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
>>>>> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
>>>>> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
>>>>> the Type1 VFIO driver. Recent development though has added a
>> coherency
>>>>> capability check to the generic part of VFIO and essentially disabled
>>>>> VFIO on PPC64; the similar story about
>> iommu_group_dma_owner_claimed().
>>>>>
>>>>> This adds an iommu_ops stub which reports support for cache
>>>>> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI
>>>>> devices,
>>>>> this provides minimum code for the probing to not crash.
> 
> stale comment since this patch doesn't use bus_set_iommu() now.
> 
>>>>> +
>>>>> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
>>>>> +                      struct device *dev)
>>>>> +{
>>>>> +    return 0;
>>>>> +}
>>>>
>>>> It is important when this returns that the iommu translation is all
>>>> emptied. There should be no left over translations from the DMA API at
>>>> this point. I have no idea how power works in this regard, but it
>>>> should be explained why this is safe in a comment at a minimum.
>>>>
>>>   > It will turn into a security problem to allow kernel mappings to leak
>>>   > past this point.
>>>   >
>>>
>>> I've added for v2 checking for no valid mappings for a device (or, more
>>> precisely, in the associated iommu_group), this domain does not need
>>> checking, right?
>>
>>
>> Uff, not that simple. Looks like once a device is in a group, its
>> dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess
>> then
>> there is a way to set those to NULL or do something similar to let
>> dma_map_direct() from kernel/dma/mapping.c return "true", is not there?
> 
> dev->dma_ops is NULL as long as you don't handle DMA domain type
> here and don't call iommu_setup_dma_ops().
> 
> Given this only supports blocking domain then above should be irrelevant.
> 
>>
>> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is
>> fine to do nothing as tce_iommu_take_ownership() and
>> tce_iommu_take_ownership_ddw() take care of not having active DMA
>> mappings. Thanks,
>>
>>
>>>
>>> In general, is "domain" something from hardware or it is a software
>>> concept? Thanks,
>>>
> 
> 'domain' is a software concept to represent the hardware I/O page
> table. A blocking domain means all DMAs from a device attached to
> this domain are blocked/rejected (equivalent to an empty I/O page
> table), usually enforced in the .attach_dev() callback.
> 
> Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.


Making it NULL makes __iommu_attach_device() fail, .attach_dev() needs 
to return 0 in this crippled environment. Thanks for explaining the 
rest, good food for thoughts.



> 
> Thanks
> Kevin
Tian, Kevin July 8, 2022, 10:18 a.m. UTC | #6
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> Sent: Friday, July 8, 2022 5:46 PM
> 
> >>>
> >>> In general, is "domain" something from hardware or it is a software
> >>> concept? Thanks,
> >>>
> >
> > 'domain' is a software concept to represent the hardware I/O page
> > table. A blocking domain means all DMAs from a device attached to
> > this domain are blocked/rejected (equivalent to an empty I/O page
> > table), usually enforced in the .attach_dev() callback.
> >
> > Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.
> 
> 
> Making it NULL makes __iommu_attach_device() fail, .attach_dev() needs
> to return 0 in this crippled environment. Thanks for explaining the
> rest, good food for thoughts.
> 

Yeah, that's a typo. 😊
Jason Gunthorpe July 8, 2022, 11:55 a.m. UTC | #7
On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote:

> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine
> to do nothing as tce_iommu_take_ownership() and
> tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.

That will still cause a security problem because
tce_iommu_take_ownership()/etc are called too late. This is the moment
in the flow when the ownershift must change away from the DMA API that
power implements and to VFIO, not later.

Jason
Alexey Kardashevskiy July 8, 2022, 1:10 p.m. UTC | #8
On 08/07/2022 21:55, Jason Gunthorpe wrote:
> On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote:
> 
>> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine
>> to do nothing as tce_iommu_take_ownership() and
>> tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.
> 
> That will still cause a security problem because
> tce_iommu_take_ownership()/etc are called too late. This is the moment
> in the flow when the ownershift must change away from the DMA API that
> power implements and to VFIO, not later.

It is getting better and better :)

On POWERNV, at the boot time the platforms sets up PHBs, enables bypass, 
creates groups and attaches devices. As for now attaching devices to the 
default domain (which is BLOCKED) fails the not-being-use check as 
enabled bypass means "everything is mapped for DMA". So at this point 
the default domain has to be IOMMU_DOMAIN_IDENTITY or 
IOMMU_DOMAIN_UNMANAGED so later on VFIO can move devices to 
IOMMU_DOMAIN_BLOCKED. Am I missing something?


> 
> Jason
Jason Gunthorpe July 8, 2022, 1:19 p.m. UTC | #9
On Fri, Jul 08, 2022 at 11:10:07PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 08/07/2022 21:55, Jason Gunthorpe wrote:
> > On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote:
> > 
> > > For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine
> > > to do nothing as tce_iommu_take_ownership() and
> > > tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.
> > 
> > That will still cause a security problem because
> > tce_iommu_take_ownership()/etc are called too late. This is the moment
> > in the flow when the ownershift must change away from the DMA API that
> > power implements and to VFIO, not later.
> 
> It is getting better and better :)
> 
> On POWERNV, at the boot time the platforms sets up PHBs, enables bypass,
> creates groups and attaches devices. As for now attaching devices to the
> default domain (which is BLOCKED) fails the not-being-use check as enabled
> bypass means "everything is mapped for DMA". So at this point the default
> domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on
> VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something?

For power the default domain should be NULL

NULL means that the platform is using the group to provide its DMA
ops. IIRC this patch was already setup correctly to do this?

The transition from NULL to blocking must isolate the group so all DMA
is blocked. blocking to NULL should re-estbalish platform DMA API
control.

The default domain should be non-NULL when the normal dma-iommu stuff is
providing the DMA API.

So, I think it is already setup properly, it is just the question of
what to do when entering/leaving blocking mode.

Jason
Alexey Kardashevskiy July 8, 2022, 1:32 p.m. UTC | #10
On 08/07/2022 23:19, Jason Gunthorpe wrote:
> On Fri, Jul 08, 2022 at 11:10:07PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 08/07/2022 21:55, Jason Gunthorpe wrote:
>>> On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote:
>>>
>>>> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine
>>>> to do nothing as tce_iommu_take_ownership() and
>>>> tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.
>>>
>>> That will still cause a security problem because
>>> tce_iommu_take_ownership()/etc are called too late. This is the moment
>>> in the flow when the ownershift must change away from the DMA API that
>>> power implements and to VFIO, not later.
>>
>> It is getting better and better :)
>>
>> On POWERNV, at the boot time the platforms sets up PHBs, enables bypass,
>> creates groups and attaches devices. As for now attaching devices to the
>> default domain (which is BLOCKED) fails the not-being-use check as enabled
>> bypass means "everything is mapped for DMA". So at this point the default
>> domain has to be IOMMU_DOMAIN_IDENTITY or IOMMU_DOMAIN_UNMANAGED so later on
>> VFIO can move devices to IOMMU_DOMAIN_BLOCKED. Am I missing something?
> 
> For power the default domain should be NULL
> 
> NULL means that the platform is using the group to provide its DMA
> ops. IIRC this patch was already setup correctly to do this?
> 
> The transition from NULL to blocking must isolate the group so all DMA
> is blocked. blocking to NULL should re-estbalish platform DMA API
> control.
> 
> The default domain should be non-NULL when the normal dma-iommu stuff is
> providing the DMA API.
> 
> So, I think it is already setup properly, it is just the question of
> what to do when entering/leaving blocking mode.



Well, the patch calls iommu_probe_device() which calls 
iommu_alloc_default_domain() which creates IOMMU_DOMAIN_BLOCKED (==0) as 
nothing initialized iommu_def_domain_type. Need a different default type 
(and return NULL when IOMMU API tries creating this type)?



> 
> Jason
Jason Gunthorpe July 8, 2022, 1:59 p.m. UTC | #11
On Fri, Jul 08, 2022 at 11:32:58PM +1000, Alexey Kardashevskiy wrote:
> > For power the default domain should be NULL
> > 
> > NULL means that the platform is using the group to provide its DMA
> > ops. IIRC this patch was already setup correctly to do this?
> > 
> > The transition from NULL to blocking must isolate the group so all DMA
> > is blocked. blocking to NULL should re-estbalish platform DMA API
> > control.
> > 
> > The default domain should be non-NULL when the normal dma-iommu stuff is
> > providing the DMA API.
> > 
> > So, I think it is already setup properly, it is just the question of
> > what to do when entering/leaving blocking mode.
> 
> Well, the patch calls iommu_probe_device() which calls
> iommu_alloc_default_domain() which creates IOMMU_DOMAIN_BLOCKED
> (==0) as

Yes, we always create a blocking domain during probe, but it isn't
used until required

> nothing initialized iommu_def_domain_type. Need a different default type
> (and return NULL when IOMMU API tries creating this type)?

iommu_alloc_default_domain() should fail on power because none of the
domain types it tries to create are supported. This should result in a
NULL group->default_domain

Jason
Alexey Kardashevskiy July 9, 2022, 2:58 a.m. UTC | #12
On 08/07/2022 21:55, Jason Gunthorpe wrote:
> On Fri, Jul 08, 2022 at 04:34:55PM +1000, Alexey Kardashevskiy wrote:
> 
>> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is fine
>> to do nothing as tce_iommu_take_ownership() and
>> tce_iommu_take_ownership_ddw() take care of not having active DMA mappings.
> 
> That will still cause a security problem because
> tce_iommu_take_ownership()/etc are called too late. This is the moment
> in the flow when the ownershift must change away from the DMA API that
> power implements and to VFIO, not later.

Trying to do that.

vfio_group_set_container:
     iommu_group_claim_dma_owner
     driver->ops->attach_group

iommu_group_claim_dma_owner sets a domain to a group. Good. But it 
attaches devices, not groups. Bad.

driver->ops->attach_group on POWER attaches a group so VFIO claims 
ownership over a group, not devices. Underlying API 
(pnv_ioda2_take_ownership()) does not need to keep track of the state, 
it is one group, one ownership transfer, easy.


What is exactly the reason why iommu_group_claim_dma_owner() cannot stay 
inside Type1 (sorry if it was explained, I could have missed)?



Also, from another mail, you said iommu_alloc_default_domain() should 
fail on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or 
the whole iommu_group_claim_dma_owner() thing falls apart.
And iommu_ops::domain_alloc() is not told if it is asked to create a 
default domain, it only takes a type.
Jason Gunthorpe July 10, 2022, 6:29 a.m. UTC | #13
On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote:
 
> driver->ops->attach_group on POWER attaches a group so VFIO claims ownership
> over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does
> not need to keep track of the state, it is one group, one ownership
> transfer, easy.

It should not change, I think you can just map the attach_dev to the group?

> What is exactly the reason why iommu_group_claim_dma_owner() cannot stay
> inside Type1 (sorry if it was explained, I could have missed)?

It has nothing to do with type1 - the ownership system is designed to
exclude other in-kernel drivers from using the group at the same time
vfio is using the group. power still needs this protection regardless
of if is using the formal iommu api or not.

> Also, from another mail, you said iommu_alloc_default_domain() should fail
> on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole
> iommu_group_claim_dma_owner() thing falls apart.

Yes

> And iommu_ops::domain_alloc() is not told if it is asked to create a default
> domain, it only takes a type.

"default domain" refers to the default type pased to domain_alloc(),
it will never be blocking, so it will always fail on power.

"default domain" is better understood as the domain used by the DMA
API

Jason
Alexey Kardashevskiy July 10, 2022, 12:32 p.m. UTC | #14
On 10/07/2022 16:29, Jason Gunthorpe wrote:
> On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote:
>   
>> driver->ops->attach_group on POWER attaches a group so VFIO claims ownership
>> over a group, not devices. Underlying API (pnv_ioda2_take_ownership()) does
>> not need to keep track of the state, it is one group, one ownership
>> transfer, easy.
> 
> It should not change, I think you can just map the attach_dev to the group?

There are multiple devices in a group, cannot just map 1:1.


>> What is exactly the reason why iommu_group_claim_dma_owner() cannot stay
>> inside Type1 (sorry if it was explained, I could have missed)?
> 
> It has nothing to do with type1 - the ownership system is designed to
> exclude other in-kernel drivers from using the group at the same time
> vfio is using the group. power still needs this protection regardless
> of if is using the formal iommu api or not.

POWER deals with it in vfio_iommu_driver_ops::attach_group.

>> Also, from another mail, you said iommu_alloc_default_domain() should fail
>> on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the whole
>> iommu_group_claim_dma_owner() thing falls apart.
> 
> Yes
> 
>> And iommu_ops::domain_alloc() is not told if it is asked to create a default
>> domain, it only takes a type.
> 
> "default domain" refers to the default type pased to domain_alloc(),
> it will never be blocking, so it will always fail on power.
> "default domain" is better understood as the domain used by the DMA
> API

The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from 
arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain 
is type == 0 where 0 == BLOCKED.
Alexey Kardashevskiy July 11, 2022, 1:24 p.m. UTC | #15
On 10/07/2022 22:32, Alexey Kardashevskiy wrote:
> 
> 
> On 10/07/2022 16:29, Jason Gunthorpe wrote:
>> On Sat, Jul 09, 2022 at 12:58:00PM +1000, Alexey Kardashevskiy wrote:
>>> driver->ops->attach_group on POWER attaches a group so VFIO claims 
>>> ownership
>>> over a group, not devices. Underlying API 
>>> (pnv_ioda2_take_ownership()) does
>>> not need to keep track of the state, it is one group, one ownership
>>> transfer, easy.
>>
>> It should not change, I think you can just map the attach_dev to the 
>> group?
> 
> There are multiple devices in a group, cannot just map 1:1.
> 
> 
>>> What is exactly the reason why iommu_group_claim_dma_owner() cannot stay
>>> inside Type1 (sorry if it was explained, I could have missed)?
>>
>> It has nothing to do with type1 - the ownership system is designed to
>> exclude other in-kernel drivers from using the group at the same time
>> vfio is using the group. power still needs this protection regardless
>> of if is using the formal iommu api or not.
> 
> POWER deals with it in vfio_iommu_driver_ops::attach_group.


I really think that for 5.19 we should really move this blocked domain 
business to Type1 like this:

https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf

Thanks,


>>> Also, from another mail, you said iommu_alloc_default_domain() should 
>>> fail
>>> on power but at least IOMMU_DOMAIN_BLOCKED must be supported, or the 
>>> whole
>>> iommu_group_claim_dma_owner() thing falls apart.
>>
>> Yes
>>
>>> And iommu_ops::domain_alloc() is not told if it is asked to create a 
>>> default
>>> domain, it only takes a type.
>>
>> "default domain" refers to the default type pased to domain_alloc(),
>> it will never be blocking, so it will always fail on power.
>> "default domain" is better understood as the domain used by the DMA
>> API
> 
> The DMA API on POWER does not use iommu_ops, it is dma_iommu_ops from 
> arch/powerpc/kernel/dma-iommu.c from before 2005. so the default domain 
> is type == 0 where 0 == BLOCKED.
>
Jason Gunthorpe July 11, 2022, 6:46 p.m. UTC | #16
On Mon, Jul 11, 2022 at 11:24:32PM +1000, Alexey Kardashevskiy wrote:

> I really think that for 5.19 we should really move this blocked domain
> business to Type1 like this:
> 
> https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf

This creates the same security bug for power we are discussing here. If you
don't want to fix it then lets just merge this iommu_ops patch as is rather than
mangle the core code.

Jason
Alexey Kardashevskiy July 12, 2022, 2:27 a.m. UTC | #17
On 7/12/22 04:46, Jason Gunthorpe wrote:
> On Mon, Jul 11, 2022 at 11:24:32PM +1000, Alexey Kardashevskiy wrote:
> 
>> I really think that for 5.19 we should really move this blocked domain
>> business to Type1 like this:
>>
>> https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf
> 
> This creates the same security bug for power we are discussing here. If you

How so? attach_dev() on power makes uninitalizes DMA setup for the group 
on the hardware level, any other DMA user won't be able to initiate DMA.


> don't want to fix it then lets just merge this iommu_ops patch as is rather than
> mangle the core code.

The core code should not be assuming iommu_ops != NULL, Type1 should, I 
thought it is the whole point of having Type1, why is not it the case 
anymore?
Jason Gunthorpe July 12, 2022, 5:44 a.m. UTC | #18
On Tue, Jul 12, 2022 at 12:27:17PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 7/12/22 04:46, Jason Gunthorpe wrote:
> > On Mon, Jul 11, 2022 at 11:24:32PM +1000, Alexey Kardashevskiy wrote:
> > 
> > > I really think that for 5.19 we should really move this blocked domain
> > > business to Type1 like this:
> > > 
> > > https://github.com/aik/linux/commit/96f80c8db03b181398ad355f6f90e574c3ada4bf
> > 
> > This creates the same security bug for power we are discussing here. If you
> 
> How so? attach_dev() on power makes uninitalizes DMA setup for the group on
> the hardware level, any other DMA user won't be able to initiate DMA.

We removed all the code from VFIO that prevented dma driver conflicts
and lowered into the new APIs. You have to use these new APIs or
there are problems with exclusivity of the group.

The previous code that was allowing power to work safely doesn't exist
any more, which is why you can't just ignore these apis for
type2.

They have nothing to do with the vfio 'type', they are all about
arbitrating who gets to use the group or not and making a safe hand
off protocol from one group owner to the other. Since power says it
has groups it must implement the sharing protocol for groups.

> > don't want to fix it then lets just merge this iommu_ops patch as is rather than
> > mangle the core code.
> 
> The core code should not be assuming iommu_ops != NULL, Type1 should, I
> thought it is the whole point of having Type1, why is not it the case
> anymore?

Architectures should not be creating iommu groups without providing
proper iommu subsystem support. The half baked use of the iommu
subsystem in power is the problem here.

Adding the ops and starting to use the subsystem properly is the
correct thing to do, even if you can't complete every corner right
now. At least the issues are limited to arch code and can be fixed by
arch maintainers.

I think the patch you have here is fine to fix vfio on power and it
should simply be merged for v5.19 and power folks can further work on
this in the later cycles.

Jason
Alexey Kardashevskiy July 29, 2022, 2:21 a.m. UTC | #19
On 08/07/2022 17:32, Tian, Kevin wrote:
>> From: Alexey Kardashevskiy <aik@ozlabs.ru>
>> Sent: Friday, July 8, 2022 2:35 PM
>> On 7/8/22 15:00, Alexey Kardashevskiy wrote:
>>>
>>>
>>> On 7/8/22 01:10, Jason Gunthorpe wrote:
>>>> On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
>>>>> Historically PPC64 managed to avoid using iommu_ops. The VFIO driver
>>>>> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
>>>>> the Type1 VFIO driver. Recent development though has added a
>> coherency
>>>>> capability check to the generic part of VFIO and essentially disabled
>>>>> VFIO on PPC64; the similar story about
>> iommu_group_dma_owner_claimed().
>>>>>
>>>>> This adds an iommu_ops stub which reports support for cache
>>>>> coherency. Because bus_set_iommu() triggers IOMMU probing of PCI
>>>>> devices,
>>>>> this provides minimum code for the probing to not crash.
> 
> stale comment since this patch doesn't use bus_set_iommu() now.
> 
>>>>> +
>>>>> +static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
>>>>> +                      struct device *dev)
>>>>> +{
>>>>> +    return 0;
>>>>> +}
>>>>
>>>> It is important when this returns that the iommu translation is all
>>>> emptied. There should be no left over translations from the DMA API at
>>>> this point. I have no idea how power works in this regard, but it
>>>> should be explained why this is safe in a comment at a minimum.
>>>>
>>>   > It will turn into a security problem to allow kernel mappings to leak
>>>   > past this point.
>>>   >
>>>
>>> I've added for v2 checking for no valid mappings for a device (or, more
>>> precisely, in the associated iommu_group), this domain does not need
>>> checking, right?
>>
>>
>> Uff, not that simple. Looks like once a device is in a group, its
>> dma_ops is set to iommu_dma_ops and IOMMU code owns DMA. I guess
>> then
>> there is a way to set those to NULL or do something similar to let
>> dma_map_direct() from kernel/dma/mapping.c return "true", is not there?
> 
> dev->dma_ops is NULL as long as you don't handle DMA domain type
> here and don't call iommu_setup_dma_ops().
> 
> Given this only supports blocking domain then above should be irrelevant.
> 
>>
>> For now I'll add a comment in spapr_tce_iommu_attach_dev() that it is
>> fine to do nothing as tce_iommu_take_ownership() and
>> tce_iommu_take_ownership_ddw() take care of not having active DMA
>> mappings. Thanks,
>>
>>
>>>
>>> In general, is "domain" something from hardware or it is a software
>>> concept? Thanks,
>>>
> 
> 'domain' is a software concept to represent the hardware I/O page
> table. 


About this. If a platform has a concept of explicit DMA windows (2 or 
more), is it one domain with 2 windows or 2 domains with one window each?

If it is 2 windows, iommu_domain_ops misses windows manipulation 
callbacks (I vaguely remember it being there for embedded PPC64 but 
cannot find it quickly).

If it is 1 window per a domain, then can a device be attached to 2 
domains at least in theory (I suspect not)?

On server POWER CPUs, each DMA window is backed by an independent IOMMU 
page table. (reminder) A window is a bus address range where devices are 
allowed to DMA to/from ;)

Thanks,


> A blocking domain means all DMAs from a device attached to
> this domain are blocked/rejected (equivalent to an empty I/O page
> table), usually enforced in the .attach_dev() callback.
> 
> Yes, a comment for why having a NULL .attach_dev() is OK is welcomed.
> 
> Thanks
> Kevin
Oliver O'Halloran July 29, 2022, 2:53 a.m. UTC | #20
On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> *snip*
>
> About this. If a platform has a concept of explicit DMA windows (2 or
> more), is it one domain with 2 windows or 2 domains with one window each?
>
> If it is 2 windows, iommu_domain_ops misses windows manipulation
> callbacks (I vaguely remember it being there for embedded PPC64 but
> cannot find it quickly).
>
> If it is 1 window per a domain, then can a device be attached to 2
> domains at least in theory (I suspect not)?
>
> On server POWER CPUs, each DMA window is backed by an independent IOMMU
> page table. (reminder) A window is a bus address range where devices are
> allowed to DMA to/from ;)

I've always thought of windows as being entries to a top-level "iommu
page table" for the device / domain. The fact each window is backed by
a separate IOMMU page table shouldn't really be relevant outside the
arch/platform.
Tian, Kevin July 29, 2022, 3:10 a.m. UTC | #21
> From: Oliver O'Halloran <oohall@gmail.com>
> Sent: Friday, July 29, 2022 10:53 AM
> 
> On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> > *snip*
> >
> > About this. If a platform has a concept of explicit DMA windows (2 or
> > more), is it one domain with 2 windows or 2 domains with one window
> each?
> >
> > If it is 2 windows, iommu_domain_ops misses windows manipulation
> > callbacks (I vaguely remember it being there for embedded PPC64 but
> > cannot find it quickly).
> >
> > If it is 1 window per a domain, then can a device be attached to 2
> > domains at least in theory (I suspect not)?
> >
> > On server POWER CPUs, each DMA window is backed by an independent
> IOMMU
> > page table. (reminder) A window is a bus address range where devices are
> > allowed to DMA to/from ;)
> 
> I've always thought of windows as being entries to a top-level "iommu
> page table" for the device / domain. The fact each window is backed by
> a separate IOMMU page table shouldn't really be relevant outside the
> arch/platform.

Yes. This is what was agreed when discussing how to integrate iommufd
with POWER [1].

One domain represents one address space.

Windows are just constraints on the address space for what ranges can
be mapped.

having two page tables underlying is just kind of POWER specific format.

Thanks
Kevin

[1] https://lore.kernel.org/all/Yns+TCSa6hWbU7wZ@yekko/
Alexey Kardashevskiy July 29, 2022, 3:50 a.m. UTC | #22
On 29/07/2022 13:10, Tian, Kevin wrote:
>> From: Oliver O'Halloran <oohall@gmail.com>
>> Sent: Friday, July 29, 2022 10:53 AM
>>
>> On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>
>>> *snip*
>>>
>>> About this. If a platform has a concept of explicit DMA windows (2 or
>>> more), is it one domain with 2 windows or 2 domains with one window
>> each?
>>>
>>> If it is 2 windows, iommu_domain_ops misses windows manipulation
>>> callbacks (I vaguely remember it being there for embedded PPC64 but
>>> cannot find it quickly).
>>>
>>> If it is 1 window per a domain, then can a device be attached to 2
>>> domains at least in theory (I suspect not)?
>>>
>>> On server POWER CPUs, each DMA window is backed by an independent
>> IOMMU
>>> page table. (reminder) A window is a bus address range where devices are
>>> allowed to DMA to/from ;)
>>
>> I've always thought of windows as being entries to a top-level "iommu
>> page table" for the device / domain. The fact each window is backed by
>> a separate IOMMU page table shouldn't really be relevant outside the
>> arch/platform.
> 
> Yes. This is what was agreed when discussing how to integrate iommufd
> with POWER [1].
> 
> One domain represents one address space.
> 
> Windows are just constraints on the address space for what ranges can
> be mapped.
> 
> having two page tables underlying is just kind of POWER specific format.


It is a POWER specific thing with one not-so-obvious consequence of each 
window having an independent page size (fixed at the moment or creation) 
and (most likely) different page size, like, 4K vs. 2M.


> 
> Thanks
> Kevin
> 
> [1] https://lore.kernel.org/all/Yns+TCSa6hWbU7wZ@yekko/
Tian, Kevin July 29, 2022, 4:24 a.m. UTC | #23
> From: Alexey Kardashevskiy <aik@ozlabs.ru>
> Sent: Friday, July 29, 2022 11:50 AM
> 
> 
> On 29/07/2022 13:10, Tian, Kevin wrote:
> >> From: Oliver O'Halloran <oohall@gmail.com>
> >> Sent: Friday, July 29, 2022 10:53 AM
> >>
> >> On Fri, Jul 29, 2022 at 12:21 PM Alexey Kardashevskiy <aik@ozlabs.ru>
> wrote:
> >>>
> >>> *snip*
> >>>
> >>> About this. If a platform has a concept of explicit DMA windows (2 or
> >>> more), is it one domain with 2 windows or 2 domains with one window
> >> each?
> >>>
> >>> If it is 2 windows, iommu_domain_ops misses windows manipulation
> >>> callbacks (I vaguely remember it being there for embedded PPC64 but
> >>> cannot find it quickly).
> >>>
> >>> If it is 1 window per a domain, then can a device be attached to 2
> >>> domains at least in theory (I suspect not)?
> >>>
> >>> On server POWER CPUs, each DMA window is backed by an independent
> >> IOMMU
> >>> page table. (reminder) A window is a bus address range where devices
> are
> >>> allowed to DMA to/from ;)
> >>
> >> I've always thought of windows as being entries to a top-level "iommu
> >> page table" for the device / domain. The fact each window is backed by
> >> a separate IOMMU page table shouldn't really be relevant outside the
> >> arch/platform.
> >
> > Yes. This is what was agreed when discussing how to integrate iommufd
> > with POWER [1].
> >
> > One domain represents one address space.
> >
> > Windows are just constraints on the address space for what ranges can
> > be mapped.
> >
> > having two page tables underlying is just kind of POWER specific format.
> 
> 
> It is a POWER specific thing with one not-so-obvious consequence of each
> window having an independent page size (fixed at the moment or creation)
> and (most likely) different page size, like, 4K vs. 2M.
> 
> 

page size is anyway decided by the iommu driver. Same for other vendors.
the caller (e.g. vfio) just tries to map as many contiguous pages as
possible but in the end it's iommu driver to decide the actual size.
Jason Gunthorpe July 29, 2022, 12:09 p.m. UTC | #24
On Fri, Jul 29, 2022 at 04:24:36AM +0000, Tian, Kevin wrote:
> > It is a POWER specific thing with one not-so-obvious consequence of each
> > window having an independent page size (fixed at the moment or creation)
> > and (most likely) different page size, like, 4K vs. 2M.
> 
> page size is anyway decided by the iommu driver. Same for other vendors.
> the caller (e.g. vfio) just tries to map as many contiguous pages as
> possible but in the end it's iommu driver to decide the actual size.

An iommu_domain with a non-uniform page_size is weird, but it could be
made to work as some device-specific thing. The only thing the core
codes care about is the minimum alignment, so you'd just have to set
it to 4k and mapping the wrong thing to the 2M area would fail.

domains with their io page size != PAGE_SIZE are already very special
things that require special handling by userspace..

Jason
diff mbox series

Patch

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7e29c73e3dd4..4bdae0ee29d0 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -215,6 +215,8 @@  extern long iommu_tce_xchg_no_kill(struct mm_struct *mm,
 		enum dma_data_direction *direction);
 extern void iommu_tce_kill(struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages);
+
+extern const struct iommu_ops spapr_tce_iommu_ops;
 #else
 static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index c85f901227c9..338a45b410b4 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -8,6 +8,7 @@ 
 #include <linux/list.h>
 #include <linux/ioport.h>
 #include <linux/numa.h>
+#include <linux/iommu.h>
 
 struct device_node;
 
@@ -44,6 +45,9 @@  struct pci_controller_ops {
 #endif
 
 	void		(*shutdown)(struct pci_controller *hose);
+
+	struct iommu_group *(*device_group)(struct pci_controller *hose,
+					    struct pci_dev *pdev);
 };
 
 /*
@@ -131,6 +135,9 @@  struct pci_controller {
 	struct irq_domain	*dev_domain;
 	struct irq_domain	*msi_domain;
 	struct fwnode_handle	*fwnode;
+
+	/* iommu_ops support */
+	struct iommu_device	iommu;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 7e56ddb3e0b9..c4c7eb596fef 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1138,6 +1138,8 @@  EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
+	int ret;
+
 	/*
 	 * The sysfs entries should be populated before
 	 * binding IOMMU group. If sysfs entries isn't
@@ -1156,7 +1158,10 @@  int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 	pr_debug("%s: Adding %s to iommu group %d\n",
 		 __func__, dev_name(dev),  iommu_group_id(table_group->group));
 
-	return iommu_group_add_device(table_group->group, dev);
+	ret = iommu_probe_device(dev);
+	dev_info(dev, "probed with %d\n", ret);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
@@ -1176,4 +1181,103 @@  void iommu_del_device(struct device *dev)
 	iommu_group_remove_device(dev);
 }
 EXPORT_SYMBOL_GPL(iommu_del_device);
+
+/*
+ * A simple iommu_ops to allow less cruft in generic VFIO code.
+ */
+static bool spapr_tce_iommu_capable(enum iommu_cap cap)
+{
+	switch (cap) {
+	case IOMMU_CAP_CACHE_COHERENCY:
+		return true;
+	default:
+		break;
+	}
+
+	return false;
+}
+
+static struct iommu_domain *spapr_tce_iommu_domain_alloc(unsigned int type)
+{
+	struct iommu_domain *domain;
+
+	if (type != IOMMU_DOMAIN_BLOCKED)
+		return NULL;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return NULL;
+
+	domain->geometry.aperture_start = 0;
+	domain->geometry.aperture_end = ~0ULL;
+	domain->geometry.force_aperture = true;
+
+	return domain;
+}
+
+static struct iommu_device *spapr_tce_iommu_probe_device(struct device *dev)
+{
+	struct pci_dev *pdev;
+	struct pci_controller *hose;
+
+	/* Weirdly iommu_device_register() assigns the same ops to all buses */
+	if (!dev_is_pci(dev))
+		return ERR_PTR(-EPERM);
+
+	pdev = to_pci_dev(dev);
+	hose = pdev->bus->sysdata;
+	return &hose->iommu;
+}
+
+static void spapr_tce_iommu_release_device(struct device *dev)
+{
+}
+
+static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
+{
+	return false;
+}
+
+static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
+{
+	struct pci_controller *hose;
+	struct pci_dev *pdev;
+
+	/* Weirdly iommu_device_register() assigns the same ops to all buses */
+	if (!dev_is_pci(dev))
+		return ERR_PTR(-EPERM);
+
+	pdev = to_pci_dev(dev);
+	hose = pdev->bus->sysdata;
+
+	if (!hose->controller_ops.device_group)
+		return ERR_PTR(-ENOENT);
+
+	return hose->controller_ops.device_group(hose, pdev);
+}
+
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+				      struct device *dev)
+{
+	return 0;
+}
+
+static void spapr_tce_iommu_detach_dev(struct iommu_domain *dom,
+				       struct device *dev)
+{
+}
+
+const struct iommu_ops spapr_tce_iommu_ops = {
+	.capable = spapr_tce_iommu_capable,
+	.domain_alloc = spapr_tce_iommu_domain_alloc,
+	.probe_device = spapr_tce_iommu_probe_device,
+	.release_device = spapr_tce_iommu_release_device,
+	.device_group = spapr_tce_iommu_device_group,
+	.is_attach_deferred = spapr_tce_iommu_is_attach_deferred,
+	.default_domain_ops = &(const struct iommu_domain_ops) {
+		.attach_dev = spapr_tce_iommu_attach_dev,
+		.detach_dev = spapr_tce_iommu_detach_dev,
+	}
+};
+
 #endif /* CONFIG_IOMMU_API */
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 068410cd54a3..72ca5afba0c0 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1714,4 +1714,4 @@  static int __init discover_phbs(void)
 
 	return 0;
 }
-core_initcall(discover_phbs);
+postcore_initcall_sync(discover_phbs);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 5e65d983e257..d5139d003794 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2912,6 +2912,25 @@  static void pnv_pci_ioda_dma_bus_setup(struct pci_bus *bus)
 	}
 }
 
+static struct iommu_group *pnv_pci_device_group(struct pci_controller *hose,
+						struct pci_dev *pdev)
+{
+	struct pnv_phb *phb = hose->private_data;
+	struct pnv_ioda_pe *pe;
+
+	if (WARN_ON(!phb))
+		return ERR_PTR(-ENODEV);
+
+	pe = pnv_pci_bdfn_to_pe(phb, pdev->devfn | (pdev->bus->number << 8));
+	if (!pe)
+		return ERR_PTR(-ENODEV);
+
+	if (!pe->table_group.group)
+		return ERR_PTR(-ENODEV);
+
+	return pe->table_group.group;
+}
+
 static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
 	.dma_dev_setup		= pnv_pci_ioda_dma_dev_setup,
 	.dma_bus_setup		= pnv_pci_ioda_dma_bus_setup,
@@ -2922,6 +2941,7 @@  static const struct pci_controller_ops pnv_pci_ioda_controller_ops = {
 	.setup_bridge		= pnv_pci_fixup_bridge_resources,
 	.reset_secondary_bus	= pnv_pci_reset_secondary_bus,
 	.shutdown		= pnv_pci_ioda_shutdown,
+	.device_group		= pnv_pci_device_group,
 };
 
 static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
@@ -2932,6 +2952,20 @@  static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = {
 	.shutdown		= pnv_pci_ioda_shutdown,
 };
 
+static struct attribute *spapr_tce_iommu_attrs[] = {
+	NULL,
+};
+
+static struct attribute_group spapr_tce_iommu_group = {
+	.name = "spapr-tce-iommu",
+	.attrs = spapr_tce_iommu_attrs,
+};
+
+static const struct attribute_group *spapr_tce_iommu_groups[] = {
+	&spapr_tce_iommu_group,
+	NULL,
+};
+
 static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 					 u64 hub_id, int ioda_type)
 {
@@ -3199,6 +3233,12 @@  static void __init pnv_pci_init_ioda_phb(struct device_node *np,
 
 	/* create pci_dn's for DT nodes under this PHB */
 	pci_devs_phb_init_dynamic(hose);
+
+	iommu_device_sysfs_add(&hose->iommu, hose->parent,
+			       spapr_tce_iommu_groups, "iommu%lld", phb_id);
+	rc = iommu_device_register(&hose->iommu, &spapr_tce_iommu_ops, hose->parent);
+	if (rc)
+		pr_warn("iommu_device_register returned %ld\n", rc);
 }
 
 void __init pnv_pci_init_ioda2_phb(struct device_node *np)