mbox series

[v2,00/17] Solve iommu probe races around iommu_fwspec

Message ID 0-v2-36a0088ecaa7+22c6e-iommu_fwspec_jgg@nvidia.com
Headers show
Series Solve iommu probe races around iommu_fwspec | expand

Message

Jason Gunthorpe Nov. 15, 2023, 2:05 p.m. UTC
[Several people have tested this now, so it is something that should sit in
linux-next for a while]

The iommu subsystem uses dev->iommu to store bits of information about the
attached iommu driver. This has been co-opted by the ACPI/OF code to also
be a place to pass around the iommu_fwspec before a driver is probed.

Since both are using the same pointers without any locking it triggers
races if there is concurrent driver loading:

     CPU0                                     CPU1
of_iommu_configure()                iommu_device_register()
 ..                                   bus_iommu_probe()
  iommu_fwspec_of_xlate()              __iommu_probe_device()
                                        iommu_init_device()
   dev_iommu_get()
                                          .. ops->probe fails, no fwspec ..
                                          dev_iommu_free()
   dev->iommu->fwspec    *crash*

My first attempt get correct locking here was to use the device_lock to
protect the entire *_iommu_configure() and iommu_probe() paths. This
allowed safe use of dev->iommu within those paths. Unfortuately enough
drivers abuse the of_iommu_configure() flow without proper locking and
this approach failed.

This approach removes touches of dev->iommu from the *_iommu_configure()
code. The few remaining required touches are moved into iommu.c and
protected with the existing iommu_probe_device_lock.

To do this we change *_iommu_configure() to hold the iommu_fwspec on the
stack while it is being built. Once it is fully formed the core code will
install it into the dev->iommu when it calls probe.

This also removes all the touches of iommu_ops from
the *_iommu_configure() paths and makes that mechanism private to the
iommu core.

A few more lockdep assertions are added to discourage future mis-use.

This is on github: https://github.com/jgunthorpe/linux/commits/iommu_fwspec

v2:
 - Fix all the kconfig randomization 0-day stuff
 - Add missing kdoc parameters
 - Remove NO_IOMMU, replace it with ENODEV
 - Use PTR_ERR to print errno in the new/moved logging
v1: https://lore.kernel.org/r/0-v1-5f734af130a3+34f-iommu_fwspec_jgg@nvidia.com

Jason Gunthorpe (17):
  iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()
  iommmu/of: Do not return struct iommu_ops from of_iommu_configure()
  iommu/of: Use -ENODEV consistently in of_iommu_configure()
  acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()
  iommu: Make iommu_fwspec->ids a distinct allocation
  iommu: Add iommu_fwspec_alloc/dealloc()
  iommu: Add iommu_probe_device_fwspec()
  iommu/of: Do not use dev->iommu within of_iommu_configure()
  iommu: Add iommu_fwspec_append_ids()
  acpi: Do not use dev->iommu within acpi_iommu_configure()
  iommu: Hold iommu_probe_device_lock while calling ops->of_xlate
  iommu: Make iommu_ops_from_fwnode() static
  iommu: Remove dev_iommu_fwspec_set()
  iommu: Remove pointless iommu_fwspec_free()
  iommu: Add ops->of_xlate_fwspec()
  iommu: Mark dev_iommu_get() with lockdep
  iommu: Mark dev_iommu_priv_set() with a lockdep

 arch/arc/mm/dma.c                           |   2 +-
 arch/arm/mm/dma-mapping-nommu.c             |   2 +-
 arch/arm/mm/dma-mapping.c                   |  10 +-
 arch/arm64/mm/dma-mapping.c                 |   4 +-
 arch/mips/mm/dma-noncoherent.c              |   2 +-
 arch/riscv/mm/dma-noncoherent.c             |   2 +-
 drivers/acpi/arm64/iort.c                   |  42 ++--
 drivers/acpi/scan.c                         | 104 +++++----
 drivers/acpi/viot.c                         |  45 ++--
 drivers/hv/hv_common.c                      |   2 +-
 drivers/iommu/amd/iommu.c                   |   2 -
 drivers/iommu/apple-dart.c                  |   1 -
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  23 +-
 drivers/iommu/intel/iommu.c                 |   2 -
 drivers/iommu/iommu.c                       | 227 +++++++++++++++-----
 drivers/iommu/of_iommu.c                    | 133 +++++-------
 drivers/iommu/omap-iommu.c                  |   1 -
 drivers/iommu/tegra-smmu.c                  |   1 -
 drivers/iommu/virtio-iommu.c                |   8 +-
 drivers/of/device.c                         |  24 ++-
 include/acpi/acpi_bus.h                     |   8 +-
 include/linux/acpi_iort.h                   |   8 +-
 include/linux/acpi_viot.h                   |   5 +-
 include/linux/dma-map-ops.h                 |   4 +-
 include/linux/iommu.h                       |  47 ++--
 include/linux/of_iommu.h                    |  13 +-
 27 files changed, 424 insertions(+), 307 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86

Comments

Jerry Snitselaar Nov. 15, 2023, 2:54 p.m. UTC | #1
Did patch 12 v2 get sent? I'm not seeing it locally, nor in lore, and b4 doesn't find it when pulling then thread.

Regards,
Jerry
Robin Murphy Nov. 15, 2023, 3:22 p.m. UTC | #2
On 2023-11-15 2:05 pm, Jason Gunthorpe wrote:
> [Several people have tested this now, so it is something that should sit in
> linux-next for a while]

What's the aim here? This is obviously far, far too much for a stable 
fix, but then it's also not the refactoring we want for the future 
either, since it's moving in the wrong direction of cementing the 
fundamental brokenness further in place rather than getting any closer 
to removing it.

Thanks,
Robin.

> The iommu subsystem uses dev->iommu to store bits of information about the
> attached iommu driver. This has been co-opted by the ACPI/OF code to also
> be a place to pass around the iommu_fwspec before a driver is probed.
> 
> Since both are using the same pointers without any locking it triggers
> races if there is concurrent driver loading:
> 
>       CPU0                                     CPU1
> of_iommu_configure()                iommu_device_register()
>   ..                                   bus_iommu_probe()
>    iommu_fwspec_of_xlate()              __iommu_probe_device()
>                                          iommu_init_device()
>     dev_iommu_get()
>                                            .. ops->probe fails, no fwspec ..
>                                            dev_iommu_free()
>     dev->iommu->fwspec    *crash*
> 
> My first attempt get correct locking here was to use the device_lock to
> protect the entire *_iommu_configure() and iommu_probe() paths. This
> allowed safe use of dev->iommu within those paths. Unfortuately enough
> drivers abuse the of_iommu_configure() flow without proper locking and
> this approach failed.
> 
> This approach removes touches of dev->iommu from the *_iommu_configure()
> code. The few remaining required touches are moved into iommu.c and
> protected with the existing iommu_probe_device_lock.
> 
> To do this we change *_iommu_configure() to hold the iommu_fwspec on the
> stack while it is being built. Once it is fully formed the core code will
> install it into the dev->iommu when it calls probe.
> 
> This also removes all the touches of iommu_ops from
> the *_iommu_configure() paths and makes that mechanism private to the
> iommu core.
> 
> A few more lockdep assertions are added to discourage future mis-use.
> 
> This is on github: https://github.com/jgunthorpe/linux/commits/iommu_fwspec
> 
> v2:
>   - Fix all the kconfig randomization 0-day stuff
>   - Add missing kdoc parameters
>   - Remove NO_IOMMU, replace it with ENODEV
>   - Use PTR_ERR to print errno in the new/moved logging
> v1: https://lore.kernel.org/r/0-v1-5f734af130a3+34f-iommu_fwspec_jgg@nvidia.com
> 
> Jason Gunthorpe (17):
>    iommu: Remove struct iommu_ops *iommu from arch_setup_dma_ops()
>    iommmu/of: Do not return struct iommu_ops from of_iommu_configure()
>    iommu/of: Use -ENODEV consistently in of_iommu_configure()
>    acpi: Do not return struct iommu_ops from acpi_iommu_configure_id()
>    iommu: Make iommu_fwspec->ids a distinct allocation
>    iommu: Add iommu_fwspec_alloc/dealloc()
>    iommu: Add iommu_probe_device_fwspec()
>    iommu/of: Do not use dev->iommu within of_iommu_configure()
>    iommu: Add iommu_fwspec_append_ids()
>    acpi: Do not use dev->iommu within acpi_iommu_configure()
>    iommu: Hold iommu_probe_device_lock while calling ops->of_xlate
>    iommu: Make iommu_ops_from_fwnode() static
>    iommu: Remove dev_iommu_fwspec_set()
>    iommu: Remove pointless iommu_fwspec_free()
>    iommu: Add ops->of_xlate_fwspec()
>    iommu: Mark dev_iommu_get() with lockdep
>    iommu: Mark dev_iommu_priv_set() with a lockdep
> 
>   arch/arc/mm/dma.c                           |   2 +-
>   arch/arm/mm/dma-mapping-nommu.c             |   2 +-
>   arch/arm/mm/dma-mapping.c                   |  10 +-
>   arch/arm64/mm/dma-mapping.c                 |   4 +-
>   arch/mips/mm/dma-noncoherent.c              |   2 +-
>   arch/riscv/mm/dma-noncoherent.c             |   2 +-
>   drivers/acpi/arm64/iort.c                   |  42 ++--
>   drivers/acpi/scan.c                         | 104 +++++----
>   drivers/acpi/viot.c                         |  45 ++--
>   drivers/hv/hv_common.c                      |   2 +-
>   drivers/iommu/amd/iommu.c                   |   2 -
>   drivers/iommu/apple-dart.c                  |   1 -
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |   9 +-
>   drivers/iommu/arm/arm-smmu/arm-smmu.c       |  23 +-
>   drivers/iommu/intel/iommu.c                 |   2 -
>   drivers/iommu/iommu.c                       | 227 +++++++++++++++-----
>   drivers/iommu/of_iommu.c                    | 133 +++++-------
>   drivers/iommu/omap-iommu.c                  |   1 -
>   drivers/iommu/tegra-smmu.c                  |   1 -
>   drivers/iommu/virtio-iommu.c                |   8 +-
>   drivers/of/device.c                         |  24 ++-
>   include/acpi/acpi_bus.h                     |   8 +-
>   include/linux/acpi_iort.h                   |   8 +-
>   include/linux/acpi_viot.h                   |   5 +-
>   include/linux/dma-map-ops.h                 |   4 +-
>   include/linux/iommu.h                       |  47 ++--
>   include/linux/of_iommu.h                    |  13 +-
>   27 files changed, 424 insertions(+), 307 deletions(-)
> 
> 
> base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
Jason Gunthorpe Nov. 15, 2023, 3:36 p.m. UTC | #3
On Wed, Nov 15, 2023 at 03:22:09PM +0000, Robin Murphy wrote:
> On 2023-11-15 2:05 pm, Jason Gunthorpe wrote:
> > [Several people have tested this now, so it is something that should sit in
> > linux-next for a while]
> 
> What's the aim here? This is obviously far, far too much for a
> stable fix,

To fix the locking bug and ugly abuse of dev->iommu?

I wouldn't say that, it is up to the people who care about this to
decide. It seems alot of people are hitting it so maybe it should be
backported in some situations. Regardless, we should not continue to
have this locking bug in v6.8.

> but then it's also not the refactoring we want for the future either, since
> it's moving in the wrong direction of cementing the fundamental brokenness
> further in place rather than getting any closer to removing it.

I haven't seen patches or an outline on what you have in mind though?

In my view I would like to get rid of of_xlate(), at a minimum. It is
a micro-optimization I don't think we need. I see a pretty
straightforward path to get there from here.

Do you also want to get rid of iommu_fwspec, or at least thin it out?
That seems reasonable too, I think that becomes within reach once
of_xlate is gone.

What do you see as "cemeting"?

Jason
Robin Murphy Nov. 15, 2023, 8:23 p.m. UTC | #4
On 2023-11-15 3:36 pm, Jason Gunthorpe wrote:
> On Wed, Nov 15, 2023 at 03:22:09PM +0000, Robin Murphy wrote:
>> On 2023-11-15 2:05 pm, Jason Gunthorpe wrote:
>>> [Several people have tested this now, so it is something that should sit in
>>> linux-next for a while]
>>
>> What's the aim here? This is obviously far, far too much for a
>> stable fix,
> 
> To fix the locking bug and ugly abuse of dev->iommu?

Fixing the locking can be achieved by fixing the locking, as I have now 
demonstrated.

> I wouldn't say that, it is up to the people who care about this to
> decide. It seems alot of people are hitting it so maybe it should be
> backported in some situations. Regardless, we should not continue to
> have this locking bug in v6.8.
> 
>> but then it's also not the refactoring we want for the future either, since
>> it's moving in the wrong direction of cementing the fundamental brokenness
>> further in place rather than getting any closer to removing it.
> 
> I haven't seen patches or an outline on what you have in mind though?
> 
> In my view I would like to get rid of of_xlate(), at a minimum. It is
> a micro-optimization I don't think we need. I see a pretty
> straightforward path to get there from here.

Micro-optimisation!? OK, I think I have to say it. Please stop trying to 
rewrite code you don't understand.

> Do you also want to get rid of iommu_fwspec, or at least thin it out?
> That seems reasonable too, I think that becomes within reach once
> of_xlate is gone.
> 
> What do you see as "cemeting"?

Most of this series constitutes a giant sweeping redesign of a whole 
bunch of internal machinery to permit it to be used concurrently, where 
that concurrency should still not exist in the first place because the 
thing that allows it to happen also causes other problems like groups 
being broken. Once the real problem is fixed there will be no need for 
any of this, and at worst some of it will then actually get in the way.

I feel like I've explained it many times already, but what needs to 
happen is for the firmware parsing and of_xlate stage to be initiated by 
__iommu_probe_device() itself. The first step is my bus ops series (if 
I'm ever allowed to get it landed...) which gets to the state of 
expecting to start from a fwspec. Then it's a case of shuffling around 
what's currently in the bus_type dma_configure methods such that point 
is where the fwspec is created as well, and the driver-probe-time work 
is almost removed except for still deferring if a device is waiting for 
its IOMMU instance (since that instance turning up and registering will 
retrigger the rest itself). And there at last, a trivial lifecycle and 
access pattern for dev->iommu (with the overlapping bits of iommu_fwspec 
finally able to be squashed as well), and finally an end to 8 long and 
unfortunate years of calling things in the wrong order in ways they were 
never supposed to be.

Thanks,
Robin.
Jason Gunthorpe Nov. 16, 2023, 4:17 a.m. UTC | #5
On Wed, Nov 15, 2023 at 08:23:54PM +0000, Robin Murphy wrote:
> On 2023-11-15 3:36 pm, Jason Gunthorpe wrote:
> > On Wed, Nov 15, 2023 at 03:22:09PM +0000, Robin Murphy wrote:
> > > On 2023-11-15 2:05 pm, Jason Gunthorpe wrote:
> > > > [Several people have tested this now, so it is something that should sit in
> > > > linux-next for a while]
> > > 
> > > What's the aim here? This is obviously far, far too much for a
> > > stable fix,
> > 
> > To fix the locking bug and ugly abuse of dev->iommu?
> 
> Fixing the locking can be achieved by fixing the locking, as I have now
> demonstrated.

Obviously. I rejected that right away because of how incredibly
wrongly layered and hacky it is to do something like that.

> > I haven't seen patches or an outline on what you have in mind though?
> > 
> > In my view I would like to get rid of of_xlate(), at a minimum. It is
> > a micro-optimization I don't think we need. I see a pretty
> > straightforward path to get there from here.
> 
> Micro-optimisation!? OK, I think I have to say it. Please stop trying to
> rewrite code you don't understand.

I understand it fine. The list of (fwnode_handle, of_phandle_args)
tuples doesn't change between when of_xlate is callled and when probe
is called. Probe can have the same list. As best I can tell the extra
ops avoids maybe some memory allocation, maybe an extra iteration.

What it does do is screw up alot of the drivers that seem to want to
allocate the per-device data in of_xlate and make it convoluted and
memory leaking buggy on error paths.

So, I would move toward having the driver's probe invoke a helper like:

   iommu_of_xlate(dev, fwspec, &per_fwnode_function, &args);

Which generates the same list of (fwnode_handle, of_phandle_args) that
was passed to of_xlate today, but is ordered sensibly within the
sequence of probe for what many drivers seem to want to do.

So, it is not so much that that the idea of of_xlate goes away, but
the specific op->of_xlate does, it gets shifted into a helper that
invokes the same function in a more logical spot.

The per-device data can be allocated at the top of probe and passed
through args to fix the lifetime bugs.

It is pretty simple to do.

> Most of this series constitutes a giant sweeping redesign of a whole bunch
> of internal machinery to permit it to be used concurrently, where that
> concurrency should still not exist in the first place because the thing that
> allows it to happen also causes other problems like groups being broken.
> Once the real problem is fixed there will be no need for any of this, and at
> worst some of it will then actually get in the way.

Not quite. This decouples two unrelated things into seperate
concerns. It is not so much about the concurrency but removing the
abuse of dev->iommu by code that has no need to touch it at all.

Decoupling makes moving code around easier since the relationships are
easier to reason about.

You can still allocated a fwnode, populate it, and do the rest of the
flow under a probe function just fine.
 
> I feel like I've explained it many times already, but what needs to happen
> is for the firmware parsing and of_xlate stage to be initiated by
> __iommu_probe_device() itself.

Yes, OK I see. I don't see a problem, I think this still a good
improvement even in that world it is undesirable to use dev->iommu as
a temporary, even if the locking can work.

> ever allowed to get it landed...) which gets to the state of
> expecting to

Repost it? Rc1 is out and you need to add one hunk to the new user
domain creation in iommufd.

> start from a fwspec. Then it's a case of shuffling around what's currently
> in the bus_type dma_configure methods such that point is where the fwspec is
> created as well, and the driver-probe-time work is almost removed except for
> still deferring if a device is waiting for its IOMMU instance (since that
> instance turning up and registering will retrigger the rest itself). And
> there at last, a trivial lifecycle and access pattern for dev->iommu (with
> the overlapping bits of iommu_fwspec finally able to be squashed as well),
> and finally an end to 8 long and unfortunate years of calling things in the
> wrong order in ways they were never supposed to be.

Having just gone through this all in detail I don't think it is as
entirely straightforward as this, the open coded callers to
of_dma_configure() are not going to be so nice to unravel.

Regardless, I don't see any worrying incompatibility with that
direction.

Cheers,
Jason
Robin Murphy Nov. 21, 2023, 4:06 p.m. UTC | #6
On 2023-11-16 4:17 am, Jason Gunthorpe wrote:
> On Wed, Nov 15, 2023 at 08:23:54PM +0000, Robin Murphy wrote:
>> On 2023-11-15 3:36 pm, Jason Gunthorpe wrote:
>>> On Wed, Nov 15, 2023 at 03:22:09PM +0000, Robin Murphy wrote:
>>>> On 2023-11-15 2:05 pm, Jason Gunthorpe wrote:
>>>>> [Several people have tested this now, so it is something that should sit in
>>>>> linux-next for a while]
>>>>
>>>> What's the aim here? This is obviously far, far too much for a
>>>> stable fix,
>>>
>>> To fix the locking bug and ugly abuse of dev->iommu?
>>
>> Fixing the locking can be achieved by fixing the locking, as I have now
>> demonstrated.
> 
> Obviously. I rejected that right away because of how incredibly
> wrongly layered and hacky it is to do something like that.

What, and dressing up the fundamental layering violation by baking it 
even further into the API flow, while still not actually fixing it or 
any of its *other* symptoms, is somehow better?

Ultimately, this series is still basically doing the same thing my patch 
does - extending the scope of the existing iommu_probe_device_lock hack 
to cover fwspec creation. A hack is a hack, so frankly I'd rather it be 
simple and obvious and look like one, and being easy to remove again is 
an obvious bonus too.

>>> I haven't seen patches or an outline on what you have in mind though?
>>>
>>> In my view I would like to get rid of of_xlate(), at a minimum. It is
>>> a micro-optimization I don't think we need. I see a pretty
>>> straightforward path to get there from here.
>>
>> Micro-optimisation!? OK, I think I have to say it. Please stop trying to
>> rewrite code you don't understand.
> 
> I understand it fine. The list of (fwnode_handle, of_phandle_args)
> tuples doesn't change between when of_xlate is callled and when probe
> is called. Probe can have the same list. As best I can tell the extra
> ops avoids maybe some memory allocation, maybe an extra iteration.
> 
> What it does do is screw up alot of the drivers that seem to want to
> allocate the per-device data in of_xlate and make it convoluted and
> memory leaking buggy on error paths.
> 
> So, I would move toward having the driver's probe invoke a helper like:
> 
>     iommu_of_xlate(dev, fwspec, &per_fwnode_function, &args);
> 
> Which generates the same list of (fwnode_handle, of_phandle_args) that
> was passed to of_xlate today, but is ordered sensibly within the
> sequence of probe for what many drivers seem to want to do.

Grep for of_xlate. It is a standard and well-understood callback pattern 
for a subsystem to parse a common DT binding and pass a driver-specific 
specifier to a driver to interpret. Or maybe you just have a peculiar 
definition of what you think "micro-optimisation" means? :/

> So, it is not so much that that the idea of of_xlate goes away, but
> the specific op->of_xlate does, it gets shifted into a helper that
> invokes the same function in a more logical spot.

I'm curious how you imagine an IOMMU driver's ->probe function could be 
called *before* parsing the firmware to work out what, if any, IOMMU, 
and thus driver, a device is associated with. Unless you think we should 
have the horrible perf model of passing the device to *every* registered 
->probe callback in turn until someone claims it. And then every driver 
has to have identical boilerplate to go off and parse the generic 
"iommus" binding... which is the whole damn reason for *not* going down 
that route and instead using an of_xlate mechanism in the first place.

> The per-device data can be allocated at the top of probe and passed
> through args to fix the lifetime bugs.
> 
> It is pretty simple to do.

I believe the kids these days would say "Say you don't understand the 
code without saying you don't understand the code."

>> Most of this series constitutes a giant sweeping redesign of a whole bunch
>> of internal machinery to permit it to be used concurrently, where that
>> concurrency should still not exist in the first place because the thing that
>> allows it to happen also causes other problems like groups being broken.
>> Once the real problem is fixed there will be no need for any of this, and at
>> worst some of it will then actually get in the way.
> 
> Not quite. This decouples two unrelated things into seperate
> concerns. It is not so much about the concurrency but removing the
> abuse of dev->iommu by code that has no need to touch it at all.

Sorry, the "abuse" of storing IOMMU-API-specific data in the place we 
intentionally created to consolidate all the IOMMU-API-specific data 
into? Yes, there is an issue with the circumstances in which this data 
is sometimes accessed, but as I'm starting to tire of repeating, that 
issue fundamentally dates back to 2017, and the implications were 
unfortunately overlooked when dev->iommu was later introduced and fwspec 
moved into it (since the non-DT probing paths still worked as originally 
designed). Pretending that dev->iommu is the issue here is missing the 
point.

> Decoupling makes moving code around easier since the relationships are
> easier to reason about.

Again with the odd definitions of "easier". You know what I think is 
easy? Having a thing be in the obvious place where it should be (but 
used in the way that was intended). What I would consider objectively 
less easy is having a thing sometimes be there but sometimes be 
somewhere else with loads more API machinery to juggle between the two. 
Especially when once again, that machinery is itself prone to new bugs.

Once again you've got hung up on one particular detail of one symptom of 
the *real* issue, so although I can see and follow your chain of 
reasoning, the fact that it starts from the wrong place makes it not 
particularly useful in the bigger picture.

> You can still allocated a fwnode, populate it, and do the rest of the
> flow under a probe function just fine.
>   
>> I feel like I've explained it many times already, but what needs to happen
>> is for the firmware parsing and of_xlate stage to be initiated by
>> __iommu_probe_device() itself.
> 
> Yes, OK I see. I don't see a problem, I think this still a good
> improvement even in that world it is undesirable to use dev->iommu as
> a temporary, even if the locking can work.
> 
>> ever allowed to get it landed...) which gets to the state of
>> expecting to
> 
> Repost it? Rc1 is out and you need to add one hunk to the new user
> domain creation in iommufd.

Well yeah, I'm trying to get that rebase finished (hence why I'm finding 
broken IOMMUFD selftests), but as always I'm also busy with a lot of 
other non-upstream things, and every time I have managed to do it so far 
this year it's ended up being blocked by conflicting changes, so I 
reserve my optimism...

>> start from a fwspec. Then it's a case of shuffling around what's currently
>> in the bus_type dma_configure methods such that point is where the fwspec is
>> created as well, and the driver-probe-time work is almost removed except for
>> still deferring if a device is waiting for its IOMMU instance (since that
>> instance turning up and registering will retrigger the rest itself). And
>> there at last, a trivial lifecycle and access pattern for dev->iommu (with
>> the overlapping bits of iommu_fwspec finally able to be squashed as well),
>> and finally an end to 8 long and unfortunate years of calling things in the
>> wrong order in ways they were never supposed to be.
> 
> Having just gone through this all in detail I don't think it is as
> entirely straightforward as this, the open coded callers to
> of_dma_configure() are not going to be so nice to unravel.

I've only had this turning over in the back of my mind for about the 
last 4 years now, so I think I have a good understanding of what to 
expect, thanks.

Robin.
Jason Gunthorpe Nov. 21, 2023, 5:55 p.m. UTC | #7
On Tue, Nov 21, 2023 at 04:06:15PM +0000, Robin Murphy wrote:

> > Obviously. I rejected that right away because of how incredibly
> > wrongly layered and hacky it is to do something like that.
> 
> What, and dressing up the fundamental layering violation by baking it even
> further into the API flow, while still not actually fixing it or any of its
> *other* symptoms, is somehow better?

It puts the locks and the controlled data in the right place, in the
right layer.

> Ultimately, this series is still basically doing the same thing my patch
> does - extending the scope of the existing iommu_probe_device_lock hack to
> cover fwspec creation. A hack is a hack, so frankly I'd rather it be simple
> and obvious and look like one, and being easy to remove again is an obvious
> bonus too.

Not entirely, as I've said this series removes the use of the dev->iommu
during fwspec creation. This is different than just extending the lock.

> > So, I would move toward having the driver's probe invoke a helper like:
> > 
> >     iommu_of_xlate(dev, fwspec, &per_fwnode_function, &args);
> > 
> > Which generates the same list of (fwnode_handle, of_phandle_args) that
> > was passed to of_xlate today, but is ordered sensibly within the
> > sequence of probe for what many drivers seem to want to do.
> 
> Grep for of_xlate. It is a standard and well-understood callback pattern for
> a subsystem to parse a common DT binding and pass a driver-specific
> specifier to a driver to interpret. Or maybe you just have a peculiar
> definition of what you think "micro-optimisation" means? :/

Don't know about other subsystems, here is making a mess. Look at what
the drivers are doing. The two SMMU drivers are sort of sane, but
everything else has coded half their pobe into of_xlate. It doesn't
make alot of sense.

> > So, it is not so much that that the idea of of_xlate goes away, but
> > the specific op->of_xlate does, it gets shifted into a helper that
> > invokes the same function in a more logical spot.
> 
> I'm curious how you imagine an IOMMU driver's ->probe function could be
> called *before* parsing the firmware to work out what, if any, IOMMU, and
> thus driver, a device is associated with.

You've jumped ahead here, I'm talking about removing ops->of_xlate.

With everything kept the same as today this means we'd scan the FW
description twice. Once to find the ops and once inside the driver to
parse it.

When I say micro-optimization this is what I mean - structuring the
code to try to avoid multiple-scans of the FW table.

Once the drivers are self-parsing I see there are two paths to keep it
at one FW parse:

 1) Have the core code parse and cache common cases in the iommu_fwspec.
    Driver then pulls out the common cases from the iommu_fwspec and
    reparsed in uncommon cases.

 2) Accept we have only a single ops in all real systems and just
    invoke the driver to parse it. That parse will cache enough
    information to decide if EPROBE_DEFER is needed.

In either case the drivers would call the same APIs and have the same
logic. The choice is just infrastructure-side stuff.

It is a different view than trying to do everything up front, but I'm
not seeing that it is so differently efficient as to be a performance
problem.

On the plus side it looks to make the drivers alot simpler and more
logical.

> And then every driver has to have
> identical boilerplate to go off and parse the generic "iommus" binding...
> which is the whole damn reason for *not* going down that route and instead
> using an of_xlate mechanism in the first place.

Let's not guess. I've attached below a sketch conversion for
apple-dart.

Diffstat says it *saves* 1 line. But also we fix several bugs!

 - iommu_of_xlate() will check for NULL fwspec and reject the bus
   probe path

 - iommu_of_xlate() can match the iommu's from the iommu list and
   check that the OF actually points only to iommus with this driver's
   ops (missed sanity check)

 - The of parsing machinery already computes the iommu_driver but throws
   it out. This forces all of the drivers to do their own thing. Pass
   it in and thus apple_dart_of_xlate() doesn't need to mess around
   with of_find_device_by_node() and we fix the bug where it leaks the
   reference on the struct device (woops!)

 - We don't leak the cfg allocation that apple_dart_of_xlate() did on
   various error paths. All error paths logically free in probe. We
   don't have to think about what happens if of_xlate is called twice
   for the same args on error/reprobe paths.

Many drivers follow this pattern of apple-dart and would end up like this.

Drivers that just need the u32 array would be simpler, more like:

   struct iommu_driver *instance;
   unsigned int num_ids;

   instance = iommu_of_get_iommu(..., &num_ids);
   if (IS_ERR(instance))
      return ERR_CAST(instance);
   cfg = kzalloc(struct_size(cfg, sids, num_ids), GFP_KERNEL);
   if (!cfg)
      return -ENOMEM;
   iommu_of_read_u32_ids(..., &cfg->sids);
   [..]
   return instance;

I haven't sketched *every* driver yet, but I've sketched enough to be
confident.

Robin, I know it is not what you have in your head, but you should
stop with the insults and be more open to other perspectives.

> > The per-device data can be allocated at the top of probe and passed
> > through args to fix the lifetime bugs.
> > 
> > It is pretty simple to do.
> 
> I believe the kids these days would say "Say you don't understand the code
> without saying you don't understand the code."

I think you are too fixated on what you have in your mind. It will
take me a bit but I will come with a series to move all the FW parsing
into probe along this model. Once done it is trivial to make bus probe
work like it should.

Regarding this series, I don't really see a problem. There is no
"concrete" or anything like that.

> > Not quite. This decouples two unrelated things into seperate
> > concerns. It is not so much about the concurrency but removing the
> > abuse of dev->iommu by code that has no need to touch it at all.
> 
> Sorry, the "abuse" of storing IOMMU-API-specific data in the place we
> intentionally created to consolidate all the IOMMU-API-specific data
> into?

The global state should not be filled until the driver is probed. It
is an abuse to use a global instead of an on-stack variable when
building it. Publishing only fully initialized data to global
visibility is concurrency 101. :(

Jason

diff --git a/drivers/iommu/apple-dart.c b/drivers/iommu/apple-dart.c
index ee05f4824bfad1..476938722460b8 100644
--- a/drivers/iommu/apple-dart.c
+++ b/drivers/iommu/apple-dart.c
@@ -721,19 +721,29 @@ static struct iommu_domain apple_dart_blocked_domain = {
 
 static struct iommu_device *apple_dart_probe_device(struct device *dev)
 {
-	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
+	struct apple_dart_master_cfg *cfg;
 	struct apple_dart_stream_map *stream_map;
+	int ret;
 	int i;
 
+	cfg = kzalloc(sizeof(*cfg), GFP_KERNEL);
 	if (!cfg)
-		return ERR_PTR(-ENODEV);
+		return ERR_PTR(-ENOMEM);
+	ret = iommu_of_xlate(dev_iommu_fwspec_get(dev), &apple_dart_iommu_ops,
+			     1, &apple_dart_of_xlate, cfg);
+	if (ret)
+		goto err_free;
 
 	for_each_stream_map(i, cfg, stream_map)
 		device_link_add(
 			dev, stream_map->dart->dev,
 			DL_FLAG_PM_RUNTIME | DL_FLAG_AUTOREMOVE_SUPPLIER);
 
+	dev_iommu_priv_set(dev, cfg);
 	return &cfg->stream_maps[0].dart->iommu;
+err_free:
+	kfree(cfg);
+	return ret;
 }
 
 static void apple_dart_release_device(struct device *dev)
@@ -777,25 +787,15 @@ static void apple_dart_domain_free(struct iommu_domain *domain)
 	kfree(dart_domain);
 }
 
-static int apple_dart_of_xlate(struct device *dev, struct of_phandle_args *args)
+static int apple_dart_of_xlate(struct iommu_device *iommu,
+			       struct of_phandle_args *args, void *priv)
 {
-	struct apple_dart_master_cfg *cfg = dev_iommu_priv_get(dev);
-	struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
-	struct apple_dart *dart = platform_get_drvdata(iommu_pdev);
-	struct apple_dart *cfg_dart;
-	int i, sid;
+	struct apple_dart *dart = container_of(iommu, struct apple_dart, iommu);
+	struct apple_dart_master_cfg *cfg = priv;
+	struct apple_dart *cfg_dart = cfg->stream_maps[0].dart;
+	int sid = args->args[0];
+	int i;
 
-	if (args->args_count != 1)
-		return -EINVAL;
-	sid = args->args[0];
-
-	if (!cfg)
-		cfg = kzalloc(sizeof(*cfg), GFP_KERNEL);
-	if (!cfg)
-		return -ENOMEM;
-	dev_iommu_priv_set(dev, cfg);
-
-	cfg_dart = cfg->stream_maps[0].dart;
 	if (cfg_dart) {
 		if (cfg_dart->supports_bypass != dart->supports_bypass)
 			return -EINVAL;
@@ -980,7 +980,6 @@ static const struct iommu_ops apple_dart_iommu_ops = {
 	.probe_device = apple_dart_probe_device,
 	.release_device = apple_dart_release_device,
 	.device_group = apple_dart_device_group,
-	.of_xlate = apple_dart_of_xlate,
 	.def_domain_type = apple_dart_def_domain_type,
 	.get_resv_regions = apple_dart_get_resv_regions,
 	.pgsize_bitmap = -1UL, /* Restricted during dart probe */