
[kernel,RFC,2/2] vfio-pci-nvlink2: Implement interconnect isolation

Message ID 20190315081835.14083-3-aik@ozlabs.ru
State Not Applicable
Series vfio, powerpc/powernv: Isolate GV100GL

Commit Message

Alexey Kardashevskiy March 15, 2019, 8:18 a.m. UTC
The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
(on POWER9) NVLinks. In addition to that, the GPUs themselves have direct
peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
platform puts all interconnected GPUs into the same IOMMU group.

However, the user may want to pass individual GPUs to userspace; to do
so we need to put them into separate IOMMU groups and cut off the
interconnects.

Thankfully V100 GPUs implement an interface to do so: a link-disabling
mask is programmed into BAR0 of the GPU. Once a link is disabled in a GPU
using this interface, it cannot be re-enabled until a secondary bus reset
is issued to the GPU.

This defines a reset_done() handler for the V100 NVLink2 device which
determines which links need to be disabled. It relies on the presence of
the new "ibm,nvlink-peers" device tree property of a GPU, which lists the
PCI peers it is connected to (NVLink bridges or peer GPUs).

This does not change the existing behaviour; instead it adds a new
"isolate_nvlink" kernel parameter to enable such isolation.

The alternative approaches would be:

1. do this in the system firmware (skiboot) but for that we would need
to tell skiboot via an additional OPAL call whether or not we want this
isolation - skiboot is unaware of IOMMU groups.

2. do this in the secondary bus reset handler in the POWERNV platform -
the problem is that at that point the device is not enabled, i.e.
config space has not been restored, so we would need to enable the device
(set the MMIO bit in the command register and program a valid address
into BAR0) in order to disable the links, and then undo all of this
initialization to bring the device back to the state where
pci_try_reset_function() expects it to be (see the sketch below).
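
For comparison, a rough sketch of the enable/disable/restore dance that
alternative 2 would require from the platform's secondary bus reset handler,
before config space has been restored. Offsets and values are placeholders;
a 32-bit BAR0 is assumed for brevity.

#include <linux/io.h>
#include <linux/kernel.h>
#include <linux/pci.h>

static void npu_disable_links_after_sbr(struct pci_dev *gpu)
{
	void __iomem *bar0;
	u16 saved_cmd;
	u32 saved_bar0;

	/* Save whatever is currently in config space */
	pci_read_config_word(gpu, PCI_COMMAND, &saved_cmd);
	pci_read_config_dword(gpu, PCI_BASE_ADDRESS_0, &saved_bar0);

	/* Program a valid address into BAR0 and turn on memory decode */
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0,
			       lower_32_bits(pci_resource_start(gpu, 0)));
	pci_write_config_word(gpu, PCI_COMMAND, saved_cmd | PCI_COMMAND_MEMORY);

	bar0 = pci_iomap(gpu, 0, 0);
	if (bar0) {
		/* offset 0x0 is a placeholder for the link-disable register */
		iowrite32(~0U, bar0 + 0x0);
		pci_iounmap(gpu, bar0);
	}

	/* Undo everything so pci_try_reset_function() sees the expected state */
	pci_write_config_word(gpu, PCI_COMMAND, saved_cmd);
	pci_write_config_dword(gpu, PCI_BASE_ADDRESS_0, saved_bar0);
}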

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 24 +++++-
 drivers/vfio/pci/vfio_pci_nvlink2.c      | 98 ++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 2 deletions(-)

Comments

Alex Williamson March 19, 2019, 4:36 p.m. UTC | #1
On Fri, 15 Mar 2019 19:18:35 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> platform puts all interconnected GPUs to the same IOMMU group.
> 
> However the user may want to pass individual GPUs to the userspace so
> in order to do so we need to put them into separate IOMMU groups and
> cut off the interconnects.
> 
> Thankfully V100 GPUs implement an interface to do by programming link
> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> this interface, it cannot be re-enabled until the secondary bus reset is
> issued to the GPU.
> 
> This defines a reset_done() handler for V100 NVlink2 device which
> determines what links need to be disabled. This relies on presence
> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> 
> This does not change the existing behaviour and instead adds
> a new "isolate_nvlink" kernel parameter to allow such isolation.
> 
> The alternative approaches would be:
> 
> 1. do this in the system firmware (skiboot) but for that we would need
> to tell skiboot via an additional OPAL call whether or not we want this
> isolation - skiboot is unaware of IOMMU groups.
> 
> 2. do this in the secondary bus reset handler in the POWERNV platform -
> the problem with that is at that point the device is not enabled, i.e.
> config space is not restored so we need to enable the device (i.e. MMIO
> bit in CMD register + program valid address to BAR0) in order to disable
> links and then perhaps undo all this initialization to bring the device
> back to the state where pci_try_reset_function() expects it to be.

The trouble seems to be that this approach only maintains the isolation
exposed by the IOMMU group when vfio-pci is the active driver for the
device.  IOMMU groups can be used by any driver and the IOMMU core is
incorporating groups in various ways.  So, if there's a device specific
way to configure the isolation reported in the group, which requires
some sort of active management against things like secondary bus
resets, then I think we need to manage it above the attached endpoint
driver.  Ideally I'd see this as a set of PCI quirks so that we might
leverage it beyond POWER platforms.  I'm not sure how we get past the
reliance on device tree properties that we won't have on other
platforms though; if only NVIDIA could at least open a spec addressing
the discovery and configuration of NVLink registers on their
devices :-\  Thanks,

Alex
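
As a purely illustrative sketch of the "set of PCI quirks" idea: the
PCI_ANY_ID match and the nvlink2_disable_peer_links() helper below are
placeholders, and a FINAL fixup like this only runs at enumeration time,
so re-applying the isolation after a secondary bus reset would still need
separate handling.

#include <linux/pci.h>
#include <linux/pci_ids.h>

/* Placeholder for the (unpublished) device-specific link-disable sequence */
static void nvlink2_disable_peer_links(struct pci_dev *pdev)
{
}

static void quirk_nvlink2_isolate(struct pci_dev *pdev)
{
	nvlink2_disable_peer_links(pdev);
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID, quirk_nvlink2_isolate);
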
Alexey Kardashevskiy March 20, 2019, 1:57 a.m. UTC | #2
On 20/03/2019 03:36, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
>> (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
>> peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
>> platform puts all interconnected GPUs to the same IOMMU group.
>>
>> However the user may want to pass individual GPUs to the userspace so
>> in order to do so we need to put them into separate IOMMU groups and
>> cut off the interconnects.
>>
>> Thankfully V100 GPUs implement an interface to do by programming link
>> disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
>> this interface, it cannot be re-enabled until the secondary bus reset is
>> issued to the GPU.
>>
>> This defines a reset_done() handler for V100 NVlink2 device which
>> determines what links need to be disabled. This relies on presence
>> of the new "ibm,nvlink-peers" device tree property of a GPU telling which
>> PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
>>
>> This does not change the existing behaviour and instead adds
>> a new "isolate_nvlink" kernel parameter to allow such isolation.
>>
>> The alternative approaches would be:
>>
>> 1. do this in the system firmware (skiboot) but for that we would need
>> to tell skiboot via an additional OPAL call whether or not we want this
>> isolation - skiboot is unaware of IOMMU groups.
>>
>> 2. do this in the secondary bus reset handler in the POWERNV platform -
>> the problem with that is at that point the device is not enabled, i.e.
>> config space is not restored so we need to enable the device (i.e. MMIO
>> bit in CMD register + program valid address to BAR0) in order to disable
>> links and then perhaps undo all this initialization to bring the device
>> back to the state where pci_try_reset_function() expects it to be.
> 
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device.  IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways.  So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

Fair point. So for now I'll go for 2) then.

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms.  I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

This would be nice, yes...
David Gibson March 20, 2019, 4:38 a.m. UTC | #3
On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> On Fri, 15 Mar 2019 19:18:35 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
> > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > platform puts all interconnected GPUs to the same IOMMU group.
> > 
> > However the user may want to pass individual GPUs to the userspace so
> > in order to do so we need to put them into separate IOMMU groups and
> > cut off the interconnects.
> > 
> > Thankfully V100 GPUs implement an interface to do by programming link
> > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > this interface, it cannot be re-enabled until the secondary bus reset is
> > issued to the GPU.
> > 
> > This defines a reset_done() handler for V100 NVlink2 device which
> > determines what links need to be disabled. This relies on presence
> > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > 
> > This does not change the existing behaviour and instead adds
> > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > 
> > The alternative approaches would be:
> > 
> > 1. do this in the system firmware (skiboot) but for that we would need
> > to tell skiboot via an additional OPAL call whether or not we want this
> > isolation - skiboot is unaware of IOMMU groups.
> > 
> > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > the problem with that is at that point the device is not enabled, i.e.
> > config space is not restored so we need to enable the device (i.e. MMIO
> > bit in CMD register + program valid address to BAR0) in order to disable
> > links and then perhaps undo all this initialization to bring the device
> > back to the state where pci_try_reset_function() expects it to be.
> 
> The trouble seems to be that this approach only maintains the isolation
> exposed by the IOMMU group when vfio-pci is the active driver for the
> device.  IOMMU groups can be used by any driver and the IOMMU core is
> incorporating groups in various ways.

I don't think that reasoning is quite right.  An IOMMU group doesn't
necessarily represent devices which *are* isolated, just devices which
*can be* isolated.  There are plenty of instances when we don't need
to isolate devices in different IOMMU groups: passing both groups to
the same guest or userspace VFIO driver for example, or indeed when
both groups are owned by regular host kernel drivers.

In at least some of those cases we also don't want to isolate the
devices when we don't have to, usually for performance reasons.

> So, if there's a device specific
> way to configure the isolation reported in the group, which requires
> some sort of active management against things like secondary bus
> resets, then I think we need to manage it above the attached endpoint
> driver.

The problem is that above the endpoint driver, we don't actually have
enough information about what should be isolated.  For VFIO we want to
isolate things if they're in different containers; for most regular
host kernel drivers we don't need to isolate at all (although we might
as well when it doesn't have a cost).  The host-side nVidia GPGPU
drivers also won't want to isolate the (host-owned) NVLink devices
from each other, since they'll want to use the fast interconnects.

> Ideally I'd see this as a set of PCI quirks so that we might
> leverage it beyond POWER platforms.  I'm not sure how we get past the
> reliance on device tree properties that we won't have on other
> platforms though, if only NVIDIA could at least open a spec addressing
> the discovery and configuration of NVLink registers on their
> devices :-\  Thanks,

Yeah, that'd be nice :/.
Alex Williamson March 20, 2019, 7:09 p.m. UTC | #4
On Wed, 20 Mar 2019 15:38:24 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > On Fri, 15 Mar 2019 19:18:35 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > platform puts all interconnected GPUs to the same IOMMU group.
> > > 
> > > However the user may want to pass individual GPUs to the userspace so
> > > in order to do so we need to put them into separate IOMMU groups and
> > > cut off the interconnects.
> > > 
> > > Thankfully V100 GPUs implement an interface to do by programming link
> > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > issued to the GPU.
> > > 
> > > This defines a reset_done() handler for V100 NVlink2 device which
> > > determines what links need to be disabled. This relies on presence
> > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > 
> > > This does not change the existing behaviour and instead adds
> > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > 
> > > The alternative approaches would be:
> > > 
> > > 1. do this in the system firmware (skiboot) but for that we would need
> > > to tell skiboot via an additional OPAL call whether or not we want this
> > > isolation - skiboot is unaware of IOMMU groups.
> > > 
> > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > the problem with that is at that point the device is not enabled, i.e.
> > > config space is not restored so we need to enable the device (i.e. MMIO
> > > bit in CMD register + program valid address to BAR0) in order to disable
> > > links and then perhaps undo all this initialization to bring the device
> > > back to the state where pci_try_reset_function() expects it to be.  
> > 
> > The trouble seems to be that this approach only maintains the isolation
> > exposed by the IOMMU group when vfio-pci is the active driver for the
> > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > incorporating groups in various ways.  
> 
> I don't think that reasoning is quite right.  An IOMMU group doesn't
> necessarily represent devices which *are* isolated, just devices which
> *can be* isolated.  There are plenty of instances when we don't need
> to isolate devices in different IOMMU groups: passing both groups to
> the same guest or userspace VFIO driver for example, or indeed when
> both groups are owned by regular host kernel drivers.
> 
> In at least some of those cases we also don't want to isolate the
> devices when we don't have to, usually for performance reasons.

I see IOMMU groups as representing the current isolation of the device,
not just the possible isolation.  If there are ways to break down that
isolation then ideally the group would be updated to reflect it.  The
ACS disable patches seem to support this, at boot time we can choose to
disable ACS at certain points in the topology to favor peer-to-peer
performance over isolation.  This is then reflected in the group
composition, because even though ACS *can be* enabled at the given
isolation points, it's intentionally not enabled with this option.  Whether or
not a given user who owns multiple devices needs that isolation is
really beside the point, the user can choose to connect groups via IOMMU
mappings or reconfigure the system to disable ACS and potentially more
direct routing.  The IOMMU groups are still accurately reflecting the
topology and IOMMU based isolation.

> > So, if there's a device specific
> > way to configure the isolation reported in the group, which requires
> > some sort of active management against things like secondary bus
> > resets, then I think we need to manage it above the attached endpoint
> > driver.  
> 
> The problem is that above the endpoint driver, we don't actually have
> enough information about what should be isolated.  For VFIO we want to
> isolate things if they're in different containers, for most regular
> host kernel drivers we don't need to isolate at all (although we might
> as well when it doesn't have a cost).

This idea that we only want to isolate things if they're in different
containers is bogus, imo.  There are performance reasons why we might
not want things isolated, but there are also address space reasons why
we do.  If there are direct routes between devices, the user needs to
be aware of the IOVA pollution; if we maintain singleton groups, they
don't.  Granted we don't really account for this well in most
userspaces and fumble through it by luck of the address space layout
and lack of devices really attempting peer to peer access.

For in-kernel users, we're still theoretically trying to isolate
devices such that they have restricted access to only the resources
they need.  Disabling things like ACS in the topology reduces that
isolation.  AFAICT, most users don't really care about that degree of
isolation, so they run with iommu=pt for native driver performance
while still having the IOMMU available for isolation use cases running
in parallel.  We don't currently have support for on-demand enabling
isolation.

> The host side nVidia GPGPU
> drivers also won't want to isolate the (host owned) NVLink devices
> from each other, since they'll want to use the fast interconnects

This falls into the same mixed use case scenario above where we don't
really have a good solution today.  Things like ACS are dynamically
configurable, but we don't expose any interfaces to let drivers or
users change it (aside from setpci, which we don't account for
dynamically).  We assume a simplistic model where if you want IOMMU,
then you must also want the maximum configurable isolation.
Dynamically changing routing is not necessarily the most foolproof
thing either with potentially in-flight transactions and existing DMA
mappings, which is why I've suggested a couple times that perhaps we
could do a software hot-unplug of a sub-hierarchy, muck with isolation
at the remaining node, then re-discover the removed devices.

Of course when we bring NVIDIA into the mix, I have little sympathy
that the NVLink interfaces are all proprietary and we have no idea how
to make those dynamic changes or discover the interconnected-ness of a
device.  Thanks,

Alex
David Gibson March 20, 2019, 11:56 p.m. UTC | #5
On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> On Wed, 20 Mar 2019 15:38:24 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:
> > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > >   
> > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > 
> > > > However the user may want to pass individual GPUs to the userspace so
> > > > in order to do so we need to put them into separate IOMMU groups and
> > > > cut off the interconnects.
> > > > 
> > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > issued to the GPU.
> > > > 
> > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > determines what links need to be disabled. This relies on presence
> > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > > 
> > > > This does not change the existing behaviour and instead adds
> > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > 
> > > > The alternative approaches would be:
> > > > 
> > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > isolation - skiboot is unaware of IOMMU groups.
> > > > 
> > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > the problem with that is at that point the device is not enabled, i.e.
> > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > links and then perhaps undo all this initialization to bring the device
> > > > back to the state where pci_try_reset_function() expects it to be.  
> > > 
> > > The trouble seems to be that this approach only maintains the isolation
> > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > incorporating groups in various ways.  
> > 
> > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > necessarily represent devices which *are* isolated, just devices which
> > *can be* isolated.  There are plenty of instances when we don't need
> > to isolate devices in different IOMMU groups: passing both groups to
> > the same guest or userspace VFIO driver for example, or indeed when
> > both groups are owned by regular host kernel drivers.
> > 
> > In at least some of those cases we also don't want to isolate the
> > devices when we don't have to, usually for performance reasons.
> 
> I see IOMMU groups as representing the current isolation of the device,
> not just the possible isolation.  If there are ways to break down that
> isolation then ideally the group would be updated to reflect it.  The
> ACS disable patches seem to support this, at boot time we can choose to
> disable ACS at certain points in the topology to favor peer-to-peer
> performance over isolation.  This is then reflected in the group
> composition, because even though ACS *can be* enabled at the given
> isolation points, it's intentionally not with this option.  Whether or
> not a given user who owns multiple devices needs that isolation is
> really beside the point, the user can choose to connect groups via IOMMU
> mappings or reconfigure the system to disable ACS and potentially more
> direct routing.  The IOMMU groups are still accurately reflecting the
> topology and IOMMU based isolation.

Huh, ok, I think we need to straighten this out.  Thinking of iommu
groups as possible rather than potential isolation was a conscious
decision on my part when we were first coming up with them.  The
rationale was that that way iommu groups could be static for the
lifetime of boot, with more dynamic isolation state layered on top.

Now, that was based on analogy with PAPR's concept of "Partitionable
Endpoints" which are decided by firmware before boot.  However, I
think it makes sense in other contexts too: if iommu groups represent
current isolation, then we need some other way to advertise possible
isolation - otherwise how will the admin (and/or tools) know how it
can configure the iommu groups.

VFIO already has the container, which represents explicitly a "group
of groups" that we don't care to isolate from each other.  I don't
actually know what other uses of the iommu group infrastructure we
have at present and how they treat them.

So, if we now have dynamically reconfigurable groups, which are a
departure from that design, how can we go from here to bring things
back to consistency?

> > > So, if there's a device specific
> > > way to configure the isolation reported in the group, which requires
> > > some sort of active management against things like secondary bus
> > > resets, then I think we need to manage it above the attached endpoint
> > > driver.  
> > 
> > The problem is that above the endpoint driver, we don't actually have
> > enough information about what should be isolated.  For VFIO we want to
> > isolate things if they're in different containers, for most regular
> > host kernel drivers we don't need to isolate at all (although we might
> > as well when it doesn't have a cost).
> 
> This idea that we only want to isolate things if they're in different
> containers is bogus, imo.  There are performance reasons why we might
> not want things isolated, but there are also address space reasons why
> we do.  If there are direct routes between devices, the user needs to
> be aware of the IOVA pollution, if we maintain singleton groups, they
> don't.  Granted we don't really account for this well in most
> userspaces and fumble through it by luck of the address space layout
> and lack of devices really attempting peer to peer access.

I don't really follow what you're saying here.

> For in-kernel users, we're still theoretically trying to isolate
> devices such that they have restricted access to only the resources
> they need.  Disabling things like ACS in the topology reduces that
> isolation.  AFAICT, most users don't really care about that degree of
> isolation, so they run with iommu=pt for native driver performance
> while still having the IOMMU available for isolation use cases running
> in parallel.  We don't currently have support for on-demand enabling
> isolation.

Ok.

> > The host side nVidia GPGPU
> > drivers also won't want to isolate the (host owned) NVLink devices
> > from each other, since they'll want to use the fast interconnects
> 
> This falls into the same mixed use case scenario above where we don't
> really have a good solution today.  Things like ACS are dynamically
> configurable, but we don't expose any interfaces to let drivers or
> users change it (aside from setpci, which we don't account for
> dynamically).  We assume a simplistic model where if you want IOMMU,
> then you must also want the maximum configurable isolation.
> Dynamically changing routing is not necessarily the most foolproof
> thing either with potentially in-flight transactions and existing DMA
> mappings, which is why I've suggested a couple times that perhaps we
> could do a software hot-unplug of a sub-hierarchy, muck with isolation
> at the remaining node, then re-discover the removed devices.

Ok, so I feel like we need to go fully one way or the other.  Either:

1) Groups represent current isolation status, in which case we
   deprecate vfio containers in favour of fusing groups beforehand,
   and we need some new concept ("isolation atoms"?) to represent what
   isolation is possible

or

2) Groups represent potential isolation, with a higher level construct
   representing current isolation.  This could involve making vfio
   containers essentially a wrapper around some more generic concept
   ("isolation clusters"?) of a group of groups.

> Of course when we bring NVIDIA into the mix, I have little sympathy
> that the NVLink interfaces are all proprietary and we have no idea how
> to make those dynamic changes or discover the interconnected-ness of a
> device.  Thanks,

Yes, well, no argument there.
Alex Williamson March 21, 2019, 6:19 p.m. UTC | #6
On Thu, 21 Mar 2019 10:56:00 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> > On Wed, 20 Mar 2019 15:38:24 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:  
> > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > >     
> > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > 
> > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > cut off the interconnects.
> > > > > 
> > > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > > issued to the GPU.
> > > > > 
> > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > determines what links need to be disabled. This relies on presence
> > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > > > 
> > > > > This does not change the existing behaviour and instead adds
> > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > 
> > > > > The alternative approaches would be:
> > > > > 
> > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > 
> > > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > > links and then perhaps undo all this initialization to bring the device
> > > > > back to the state where pci_try_reset_function() expects it to be.    
> > > > 
> > > > The trouble seems to be that this approach only maintains the isolation
> > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > incorporating groups in various ways.    
> > > 
> > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > necessarily represent devices which *are* isolated, just devices which
> > > *can be* isolated.  There are plenty of instances when we don't need
> > > to isolate devices in different IOMMU groups: passing both groups to
> > > the same guest or userspace VFIO driver for example, or indeed when
> > > both groups are owned by regular host kernel drivers.
> > > 
> > > In at least some of those cases we also don't want to isolate the
> > > devices when we don't have to, usually for performance reasons.  
> > 
> > I see IOMMU groups as representing the current isolation of the device,
> > not just the possible isolation.  If there are ways to break down that
> > isolation then ideally the group would be updated to reflect it.  The
> > ACS disable patches seem to support this, at boot time we can choose to
> > disable ACS at certain points in the topology to favor peer-to-peer
> > performance over isolation.  This is then reflected in the group
> > composition, because even though ACS *can be* enabled at the given
> > isolation points, it's intentionally not with this option.  Whether or
> > not a given user who owns multiple devices needs that isolation is
> > really beside the point, the user can choose to connect groups via IOMMU
> > mappings or reconfigure the system to disable ACS and potentially more
> > direct routing.  The IOMMU groups are still accurately reflecting the
> > topology and IOMMU based isolation.  
> 
> Huh, ok, I think we need to straighten this out.  Thinking of iommu
> groups as possible rather than potential isolation was a conscious

possible ~= potential

> decision on my part when we were first coming up with them.  The
> rationale was that that way iommu groups could be static for the
> lifetime of boot, with more dynamic isolation state layered on top.
> 
> Now, that was based on analogy with PAPR's concept of "Partitionable
> Endpoints" which are decided by firmware before boot.  However, I
> think it makes sense in other contexts too: if iommu groups represent
> current isolation, then we need some other way to advertise possible
> isolation - otherwise how will the admin (and/or tools) know how it
> can configure the iommu groups.
> 
> VFIO already has the container, which represents explicitly a "group
> of groups" that we don't care to isolate from each other.  I don't
> actually know what other uses of the iommu group infrastructure we
> have at present and how they treat them.

s/that we don't care to isolate/that the user doesn't care to isolate/

Though even that is not necessarily accurate: the container represents
a shared IOMMU context, and that context doesn't necessarily include
mappings between devices, so the devices can still be isolated *from
each other*.

> So, if we now have dynamically reconfigurable groups which are a
> departure from that design, how can we go from here trying to bring
> things back to consistency.

We don't currently have dynamically reconfigurable groups; I think the
question is how we introduce dynamically configurable groups within
the existing design, because whatever the intentions were 7.5 years ago
is rather irrelevant now.

VFIO groups are mapped directly to IOMMU groups.  IOMMU groups are the
smallest set of devices which can be considered isolated from other
sets of devices (not potentially, but as configured).  VFIO groups are
the unit of ownership to userspace, therefore a VFIO group must be
(currently) isolated from other groups.  The user owns and manages the
IOMMU context for one or more groups via a container.  A container
allows efficiency in managing a shared IOVA space between groups.
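
A minimal userspace sketch of that model, with error handling omitted and
the type1 IOMMU used for brevity (POWER would use VFIO_SPAPR_TCE_IOMMU);
group numbers, IOVA and size are placeholders. Two groups attached to one
container share a single IOMMU context, so one MAP_DMA call populates the
IOVA space seen by both.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group1 = open("/dev/vfio/42", O_RDWR);	/* placeholder group numbers */
	int group2 = open("/dev/vfio/43", O_RDWR);
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.iova = 0x100000000ULL,		/* placeholder IOVA */
		.size = 1UL << 20,
	};

	/* Both groups join the same container, i.e. the same IOMMU context */
	ioctl(group1, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(group2, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* A single mapping call serves the shared IOVA space of both groups */
	map.vaddr = (unsigned long)mmap(NULL, map.size, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	return 0;
}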

> > > > So, if there's a device specific
> > > > way to configure the isolation reported in the group, which requires
> > > > some sort of active management against things like secondary bus
> > > > resets, then I think we need to manage it above the attached endpoint
> > > > driver.    
> > > 
> > > The problem is that above the endpoint driver, we don't actually have
> > > enough information about what should be isolated.  For VFIO we want to
> > > isolate things if they're in different containers, for most regular
> > > host kernel drivers we don't need to isolate at all (although we might
> > > as well when it doesn't have a cost).  
> > 
> > This idea that we only want to isolate things if they're in different
> > containers is bogus, imo.  There are performance reasons why we might
> > not want things isolated, but there are also address space reasons why
> > we do.  If there are direct routes between devices, the user needs to
> > be aware of the IOVA pollution, if we maintain singleton groups, they
> > don't.  Granted we don't really account for this well in most
> > userspaces and fumble through it by luck of the address space layout
> > and lack of devices really attempting peer to peer access.  
> 
> I don't really follow what you're saying here.

If the lack of isolation between devices includes peer-to-peer channels
then the MMIO of the devices within the group pollutes the IOVA space
of the container.
 
> > For in-kernel users, we're still theoretically trying to isolate
> > devices such that they have restricted access to only the resources
> > they need.  Disabling things like ACS in the topology reduces that
> > isolation.  AFAICT, most users don't really care about that degree of
> > isolation, so they run with iommu=pt for native driver performance
> > while still having the IOMMU available for isolation use cases running
> > in parallel.  We don't currently have support for on-demand enabling
> > isolation.  
> 
> Ok.
> 
> > > The host side nVidia GPGPU
> > > drivers also won't want to isolate the (host owned) NVLink devices
> > > from each other, since they'll want to use the fast interconnects  
> > 
> > This falls into the same mixed use case scenario above where we don't
> > really have a good solution today.  Things like ACS are dynamically
> > configurable, but we don't expose any interfaces to let drivers or
> > users change it (aside from setpci, which we don't account for
> > dynamically).  We assume a simplistic model where if you want IOMMU,
> > then you must also want the maximum configurable isolation.
> > Dynamically changing routing is not necessarily the most foolproof
> > thing either with potentially in-flight transactions and existing DMA
> > mappings, which is why I've suggested a couple times that perhaps we
> > could do a software hot-unplug of a sub-hierarchy, muck with isolation
> > at the remaining node, then re-discover the removed devices.  
> 
> Ok, so I feel like we need to go fully one way or the other.  Either:
> 
> 1) Groups represent current isolation status, in which case we
>    deprecate vfio containers in favour of fusing groups beforehand,
>    and we need some new concept ("isolation atoms"?) to represent what
>    isolation is possible

Nope, containers are owned by users and serve a purpose beyond what
you're assuming here.  Also, there's 7.5 years of userspace tooling
broken by this.  I do remember we had discussions about merging groups,
but what we have now is what we agreed on.
 
> or
> 
> 2) Groups represent potential isolation, with a higher level construct
>    representing current isolation.  This could involve making vfio
>    containers essentially a wrapper around some more generic concept
>    ("isolation clusters"?) of a group of groups.

Have fun.  Containers and groups are exposed to userspace, so you're
essentially suggesting to throw away all that support, for... I don't
really know what.  Groups are involved in IOMMU context, so dynamically
changing what a group defines is hard.  Thus my suggestions that mixed
or dynamic workloads could make use of soft remove and rediscovery for
sub-hierarchies, unbinding and rebinding drivers such that we have the
proper expectations of DMA context.   Thanks,

Alex
David Gibson March 22, 2019, 3:08 a.m. UTC | #7
On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> On Thu, 21 Mar 2019 10:56:00 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:
> > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:  
> > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > >     
> > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > 
> > > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > > cut off the interconnects.
> > > > > > 
> > > > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > > > issued to the GPU.
> > > > > > 
> > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > determines what links need to be disabled. This relies on presence
> > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > > > > 
> > > > > > This does not change the existing behaviour and instead adds
> > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > 
> > > > > > The alternative approaches would be:
> > > > > > 
> > > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > 
> > > > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > > > links and then perhaps undo all this initialization to bring the device
> > > > > > back to the state where pci_try_reset_function() expects it to be.    
> > > > > 
> > > > > The trouble seems to be that this approach only maintains the isolation
> > > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > > incorporating groups in various ways.    
> > > > 
> > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > necessarily represent devices which *are* isolated, just devices which
> > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > both groups are owned by regular host kernel drivers.
> > > > 
> > > > In at least some of those cases we also don't want to isolate the
> > > > devices when we don't have to, usually for performance reasons.  
> > > 
> > > I see IOMMU groups as representing the current isolation of the device,
> > > not just the possible isolation.  If there are ways to break down that
> > > isolation then ideally the group would be updated to reflect it.  The
> > > ACS disable patches seem to support this, at boot time we can choose to
> > > disable ACS at certain points in the topology to favor peer-to-peer
> > > performance over isolation.  This is then reflected in the group
> > > composition, because even though ACS *can be* enabled at the given
> > > isolation points, it's intentionally not with this option.  Whether or
> > > not a given user who owns multiple devices needs that isolation is
> > > really beside the point, the user can choose to connect groups via IOMMU
> > > mappings or reconfigure the system to disable ACS and potentially more
> > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > topology and IOMMU based isolation.  
> > 
> > Huh, ok, I think we need to straighten this out.  Thinking of iommu
> > groups as possible rather than potential isolation was a conscious
> 
> possible ~= potential

Sorry, I meant "current" not "potential".

> > decision on my part when we were first coming up with them.  The
> > rationale was that that way iommu groups could be static for the
> > lifetime of boot, with more dynamic isolation state layered on top.
> > 
> > Now, that was based on analogy with PAPR's concept of "Partitionable
> > Endpoints" which are decided by firmware before boot.  However, I
> > think it makes sense in other contexts too: if iommu groups represent
> > current isolation, then we need some other way to advertise possible
> > isolation - otherwise how will the admin (and/or tools) know how it
> > can configure the iommu groups.
> > 
> > VFIO already has the container, which represents explicitly a "group
> > of groups" that we don't care to isolate from each other.  I don't
> > actually know what other uses of the iommu group infrastructure we
> > have at present and how they treat them.
> 
> s/that we don't care to isolate/that the user doesn't care to isolate/

Well, true, I guess we still want to isolate them when possible for
additional safety.  But I don't think we should do so when that
isolation has a real performance cost, which it would in the case of
isolating linked GPUs here.

> Though even that is not necessarily accurate, the container represents
> a shared IOMMU context, that context doesn't necessarily include
> mappings between devices, so the device can still be isolated *from
> each other*.

Well, sure, but it's the only mechanism the user has for indicating
that they don't care about isolation between these devices.  That might
be because they simply don't care, but it might also be because they
want to use interlinks between those devices for additional
performance.

> > So, if we now have dynamically reconfigurable groups which are a
> > departure from that design, how can we go from here trying to bring
> > things back to consistency.
> 
> We don't currently have dynamically reconfigurable groups,

Ah, ok.  So what were these other use cases you were describing?  Boot
time options which change how the iommu groups are generated?  Or
something else.

> I think the
> question is how do we introduce dynamically configurable groups within
> the existing design, because whatever intentions were 7.5 years ago is
> rather irrelevant now.

Well, yeah, I guess.

> VFIO groups are mapped directly to IOMMU groups.  IOMMU groups are the
> smallest set of devices which can be considered isolated from other
> sets of devices (not potentially, but as configured).

Hm.. but what actually makes that assumption that this is the case as
configured, rather than just possibly - I'm taking "possible" to mean
possible with this host kernel, not just possible with this hardware.

> VFIO groups are
> the unit of ownership to userspace, therefore a VFIO group must be
> (currently) isolated from other groups.

It must be currently isolated once the VFIO group is instantiated, but
is there actually anything that requires that current isolation before
the iommu group is instantiated as a vfio group?

> The user owns and manages the
> IOMMU context for one or more groups via a container.  A container
> allows efficiency in managing a shared IOVA space between groups.

Yes, and it seems to me an obvious extension to have the container
also permit other efficiencies which aren't compatible with full
isolation.

> > > > > So, if there's a device specific
> > > > > way to configure the isolation reported in the group, which requires
> > > > > some sort of active management against things like secondary bus
> > > > > resets, then I think we need to manage it above the attached endpoint
> > > > > driver.    
> > > > 
> > > > The problem is that above the endpoint driver, we don't actually have
> > > > enough information about what should be isolated.  For VFIO we want to
> > > > isolate things if they're in different containers, for most regular
> > > > host kernel drivers we don't need to isolate at all (although we might
> > > > as well when it doesn't have a cost).  
> > > 
> > > This idea that we only want to isolate things if they're in different
> > > containers is bogus, imo.  There are performance reasons why we might
> > > not want things isolated, but there are also address space reasons why
> > > we do.  If there are direct routes between devices, the user needs to
> > > be aware of the IOVA pollution, if we maintain singleton groups, they
> > > don't.  Granted we don't really account for this well in most
> > > userspaces and fumble through it by luck of the address space layout
> > > and lack of devices really attempting peer to peer access.  
> > 
> > I don't really follow what you're saying here.
> 
> If the lack of isolation between devices includes peer-to-peer channels
> then the MMIO of the devices within the group pollutes the IOVA space
> of the container.

Yes, if the permissible IOVA addresses overlap with valid MMIO
addresses.  I don't really see why that's significant, though.  We
already have basically the same situation if there are multiple
devices in a group.  e.g. if you have several devices behind a dumb
PCI-E to PCI bridge, you can't actually prohibit peer-to-peer DMA
between them, and so those devices' MMIOs pollute the IOVA space for
each other, even if they don't for other devices in the container.
There's only so far we can go to prevent the user from shooting
themselves in the foot.

> > > For in-kernel users, we're still theoretically trying to isolate
> > > devices such that they have restricted access to only the resources
> > > they need.  Disabling things like ACS in the topology reduces that
> > > isolation.  AFAICT, most users don't really care about that degree of
> > > isolation, so they run with iommu=pt for native driver performance
> > > while still having the IOMMU available for isolation use cases running
> > > in parallel.  We don't currently have support for on-demand enabling
> > > isolation.  
> > 
> > Ok.
> > 
> > > > The host side nVidia GPGPU
> > > > drivers also won't want to isolate the (host owned) NVLink devices
> > > > from each other, since they'll want to use the fast interconnects  
> > > 
> > > This falls into the same mixed use case scenario above where we don't
> > > really have a good solution today.  Things like ACS are dynamically
> > > configurable, but we don't expose any interfaces to let drivers or
> > > users change it (aside from setpci, which we don't account for
> > > dynamically).  We assume a simplistic model where if you want IOMMU,
> > > then you must also want the maximum configurable isolation.
> > > Dynamically changing routing is not necessarily the most foolproof
> > > thing either with potentially in-flight transactions and existing DMA
> > > mappings, which is why I've suggested a couple times that perhaps we
> > > could do a software hot-unplug of a sub-hierarchy, muck with isolation
> > > at the remaining node, then re-discover the removed devices.  
> > 
> > Ok, so I feel like we need to go fully one way or the other.  Either:
> > 
> > 1) Groups represent current isolation status, in which case we
> >    deprecate vfio containers in favour of fusing groups beforehand,
> >    and we need some new concept ("isolation atoms"?) to represent what
> >    isolation is possible
> 
> Nope, containers are owned by users and serve a purpose beyond what
> you're assuming here.  Also, there's 7.5 years of userspace tooling
> broken by this.  I do remember we had discussions about merging groups,
> but what we have now is what we agreed on.

Well, quite - and I thought the reasoning that led to that design was
that groups would be a representation of isolation granularity, not
actual isolation.  That allows them to be static, which, as you say,
tools are likely to assume.

> > or
> > 
> > 2) Groups represent potential isolation, with a higher level construct
> >    representing current isolation.  This could involve making vfio
> >    containers essentially a wrapper around some more generic concept
> >    ("isolation clusters"?) of a group of groups.
> 
> Have fun.  Containers and groups are exposed to userspace, so you're
> essentially suggesting to throw away all that support, for... I don't
> really know what.

No, I'm not proposing removing containers, just changing them from
being purely VFIO specific to being the VFIO front end to a generic
concept.  Roughly speaking VFIO containers would be to "isolation
clusters" (or whatever) as VFIO groups are to IOMMU groups now.

> Groups are involved in IOMMU context, so dynamically
> changing what a group defines is hard.

Absolutely!  That's why I don't want to, and why my conception of
groups was defined so that they didn't have to.

> Thus my suggestions that mixed
> or dynamic workloads could make use of soft remove and rediscovery for
> sub-hierarchies, unbinding and rebinding drivers such that we have the
> proper expectations of DMA context.   Thanks,

Well, if we have to change groups, then that sounds like the sanest
available way to do it.  But I'm not yet convinced that altering
groups makes sense rather than using a group-of-groups concept on top
of it - which would map to containers in the case of VFIO.
Alex Williamson March 22, 2019, 11:10 p.m. UTC | #8
On Fri, 22 Mar 2019 14:08:38 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > On Thu, 21 Mar 2019 10:56:00 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > >     
> > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:    
> > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > > >       
> > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > 
> > > > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > > > cut off the interconnects.
> > > > > > > 
> > > > > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > > > > issued to the GPU.
> > > > > > > 
> > > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > > determines what links need to be disabled. This relies on presence
> > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > > > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > > > > > 
> > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > 
> > > > > > > The alternative approaches would be:
> > > > > > > 
> > > > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > 
> > > > > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > > > > links and then perhaps undo all this initialization to bring the device
> > > > > > > back to the state where pci_try_reset_function() expects it to be.      
> > > > > > 
> > > > > > The trouble seems to be that this approach only maintains the isolation
> > > > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > > > incorporating groups in various ways.      
> > > > > 
> > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > necessarily represent devices which *are* isolated, just devices which
> > > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > > both groups are owned by regular host kernel drivers.
> > > > > 
> > > > > In at least some of those cases we also don't want to isolate the
> > > > > devices when we don't have to, usually for performance reasons.    
> > > > 
> > > > I see IOMMU groups as representing the current isolation of the device,
> > > > not just the possible isolation.  If there are ways to break down that
> > > > isolation then ideally the group would be updated to reflect it.  The
> > > > ACS disable patches seem to support this, at boot time we can choose to
> > > > disable ACS at certain points in the topology to favor peer-to-peer
> > > > performance over isolation.  This is then reflected in the group
> > > > composition, because even though ACS *can be* enabled at the given
> > > > isolation points, it's intentionally not with this option.  Whether or
> > > > not a given user who owns multiple devices needs that isolation is
> > > > really beside the point, the user can choose to connect groups via IOMMU
> > > > mappings or reconfigure the system to disable ACS and potentially more
> > > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > > topology and IOMMU based isolation.    
> > > 
> > > Huh, ok, I think we need to straighten this out.  Thinking of iommu
> > > groups as possible rather than potential isolation was a conscious  
> > 
> > possible ~= potential  
> 
> Sorry, I meant "current" not "potential".
> 
> > > decision on my part when we were first coming up with them.  The
> > > rationale was that that way iommu groups could be static for the
> > > lifetime of boot, with more dynamic isolation state layered on top.
> > > 
> > > Now, that was based on analogy with PAPR's concept of "Partitionable
> > > Endpoints" which are decided by firmware before boot.  However, I
> > > think it makes sense in other contexts too: if iommu groups represent
> > > current isolation, then we need some other way to advertise possible
> > > isolation - otherwise how will the admin (and/or tools) know how it
> > > can configure the iommu groups.
> > > 
> > > VFIO already has the container, which represents explicitly a "group
> > > of groups" that we don't care to isolate from each other.  I don't
> > > actually know what other uses of the iommu group infrastructure we
> > > have at present and how they treat them.  
> > 
> > s/that we don't care to isolate/that the user doesn't care to isolate/  
> 
> Well, true, I guess we still want to isolate them when possible for
> additional safety.  But I don't think we should do so when that
> isolation has a real performance cost, which it would in the case of
> isolating linked GPUs here.

I don't see that we can presume the user's intentions.  This sounds
like we're putting the kernel into the position of deciding policy.
 
> > Though even that is not necessarily accurate, the container represents
> > a shared IOMMU context, that context doesn't necessarily include
> > mappings between devices, so the device can still be isolated *from
> > each other*.  
> 
> Well, sure, but it's the only mechanism the user has for indicating
> > that they don't care about isolation between these devices.  That might
> be because they simply don't care, but it might also be because they
> want to use interlinks between those devices for additional
> performance.

Is a user really indicating "I don't care about isolation" by placing
groups within the same container or just taking advantage of a shared
IOMMU context?  Again, we're presuming a policy decision that's never
been defined in the API.  It is possible to have both isolation and
performance, for instance ACS includes a direct translated p2p feature
that allows translated requests a direct path between devices, at least
on the PCIe fabric (where supported by hardware).
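
(Purely for reference, a minimal userspace sketch of what "placing groups
within the same container" actually amounts to - error checking omitted,
group numbers invented, and on POWER the IOMMU type would be
VFIO_SPAPR_TCE_IOMMU rather than type1:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group1 = open("/dev/vfio/26", O_RDWR);
	int group2 = open("/dev/vfio/27", O_RDWR);

	/* both groups now share one IOMMU context (IOVA space) */
	ioctl(group1, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(group2, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
	return 0;
}

Nothing in that sequence says anything about p2p between the devices.)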

> > > So, if we now have dynamically reconfigurable groups which are a
> > > departure from that design, how can we go from here trying to bring
> > > things back to consistency.  
> > 
> > We don't currently have dynamically reconfigurable groups,  
> 
> Ah, ok.  So what were these other use cases you were describing?  Boot
> time options which change how the iommu groups are generated?  Or
> something else.

        pci=option[,option...]  [PCI] various PCI subsystem options.

                                Some options herein operate on a specific device
                                or a set of devices (<pci_dev>). These are
                                specified in one of the following formats:

                                [<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
                                pci:<vendor>:<device>[:<subvendor>:<subdevice>]

        ...
                disable_acs_redir=<pci_dev>[; ...]
                                Specify one or more PCI devices (in the format
                                specified above) separated by semicolons.
                                Each device specified will have the PCI ACS
                                redirect capabilities forced off which will
                                allow P2P traffic between devices through
                                bridges without forcing it upstream. Note:
                                this removes isolation between devices and
                                may put more devices in an IOMMU group.

AIUI, GPU and RDMA were primary use cases for this.  As noted, IOMMU
groups take this into account since it's done at boot time and static,
we're simply disabling isolation switches that would have otherwise
been enabled at points in the topology (specifically enabling p2p as
much as possible), and the IOMMU group grows to match.
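
For example (device addresses invented for illustration), booting with:

    pci=disable_acs_redir=0000:40:01.0;0000:40:02.0

forces ACS redirect off at those two ports, and whatever sits below them
lands in a shared IOMMU group instead of singleton groups.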

The NVLink problem is rather the opposite of this, rather than simply
disabling isolation for a sub-hierarchy and growing the IOMMU group, we
need to enable isolation and then actively, and repeatedly maintain
that configuration as it represents a finer granularity isolation
promise to the system.  And of course we're trying to manage the
isolation of an interconnect for which we have no visibility.

> > I think the
> > question is how do we introduce dynamically configurable groups within
> > the existing design, because whatever intentions were 7.5 years ago is
> > rather irrelevant now.  
> 
> Well, yeah, I guess.
> 
> > VFIO groups are mapped directly to IOMMU groups.  IOMMU groups are the
> > smallest set of devices which can be considered isolated from other
> > sets of devices (not potentially, but as configured).  
> 
> Hm.. but what actually makes that assumption that this is the case as
> configured, rather than just possibly - I'm taking "possible" to mean
> possible with this host kernel, not just possible with this hardware.

The IOMMU driver is ultimately responsible for deciding the isolation,
but we have shared code that understands the PCI/e topology.  That code
looks at the actual, current ACS settings to decide the highest point
in the topology where we might have p2p.
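
Something along these lines - a from-memory simplification of the walk
in the iommu core's pci_device_group(), not the literal code:

#include <linux/pci.h>

/*
 * Walk upstream from the device and stop once ACS is enabled on the
 * whole path from a bridge to the root, because p2p from below that
 * bridge is then forced up through the IOMMU.  Everything at or below
 * the last bridge without such a path has to share a group.
 */
static struct pci_dev *p2p_group_top(struct pci_dev *pdev)
{
	u16 acs = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;
	struct pci_dev *top = pdev;
	struct pci_bus *bus;

	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
		if (!bus->self)
			continue;
		if (pci_acs_path_enabled(bus->self, NULL, acs))
			break;
		top = bus->self;
	}
	return top;
}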

> > VFIO groups are
> > the unit of ownership to userspace, therefore a VFIO group must be
> > (currently) isolated from other groups.  
> 
> It must be currently isolated once the VFIO group is instantiated, but
> is there actually anything that requires that current isolation before
> the iommu group is instantiated as a vfio group.

The iommu-core is incorporating more group-aware support, I believe
default domains are group based and we might have per group reserved
ranges, I'm not sure about that.

> > The user owns and manages the
> > IOMMU context for one or more groups via a container.  A container
> > allows efficiency in managing a shared IOVA space between groups.  
> 
> Yes, and it seems to me an obvious extension to have the container
> also permit other efficiencies which aren't compatible with full
> isolation.

It might be a control point, yes, but I don't see that we can presume
it.  If we were to automatically enable redirection within a container,
we could wreak havoc on the IOVA space for the user.  ACS also
theoretically provides a hardware path to enable this via direct
translated p2p as well.
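
For completeness, that's just the Direct Translated P2P enable bit in
the ACS control register.  A sketch of flipping it, assuming the port
actually implements it (nothing in-tree does this by default, AFAIK):

#include <linux/pci.h>

/* Sketch only: enable ACS Direct Translated P2P on a port so requests
 * that already carry an IOMMU translation (ATS) can be routed directly
 * to a peer instead of being redirected upstream.
 */
static int enable_acs_direct_translated_p2p(struct pci_dev *pdev)
{
	u16 cap, ctrl;
	int pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ACS);

	if (!pos)
		return -ENODEV;

	pci_read_config_word(pdev, pos + PCI_ACS_CAP, &cap);
	if (!(cap & PCI_ACS_DT))
		return -ENODEV;

	pci_read_config_word(pdev, pos + PCI_ACS_CTRL, &ctrl);
	pci_write_config_word(pdev, pos + PCI_ACS_CTRL, ctrl | PCI_ACS_DT);

	return 0;
}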

> > > > > > So, if there's a device specific
> > > > > > way to configure the isolation reported in the group, which requires
> > > > > > some sort of active management against things like secondary bus
> > > > > > resets, then I think we need to manage it above the attached endpoint
> > > > > > driver.      
> > > > > 
> > > > > The problem is that above the endpoint driver, we don't actually have
> > > > > enough information about what should be isolated.  For VFIO we want to
> > > > > isolate things if they're in different containers, for most regular
> > > > > host kernel drivers we don't need to isolate at all (although we might
> > > > > as well when it doesn't have a cost).    
> > > > 
> > > > This idea that we only want to isolate things if they're in different
> > > > containers is bogus, imo.  There are performance reasons why we might
> > > > not want things isolated, but there are also address space reasons why
> > > > we do.  If there are direct routes between devices, the user needs to
> > > > be aware of the IOVA pollution, if we maintain singleton groups, they
> > > > don't.  Granted we don't really account for this well in most
> > > > userspaces and fumble through it by luck of the address space layout
> > > > and lack of devices really attempting peer to peer access.    
> > > 
> > > I don't really follow what you're saying here.  
> > 
> > If the lack of isolation between devices includes peer-to-peer channels
> > then the MMIO of the devices within the group pollutes the IOVA space
> > of the container.  
> 
> Yes, if the permissible IOVA addresses overlap with valid MMIO
> addresses.  I don't really see why that's significant, though.  We
> already have basically the same situation if there are multiple
> devices in a group.  e.g. if you have several devices behind a dumb
> PCI-E to PCI bridge, you can't actually prohibit peer-to-peer DMA
> between them, and so those devices' MMIOs pollute the IOVA space for
> each other, even if they don't for other devices in the container.
> There's only so far we can go to prevent the user from shooting
> themselves in the foot.

So it's therefore OK to intentionally pollute the container IOVA
space?  I believe we'd like to think that hardware is headed towards
being more isolation-aware and will evolve to the point where we
typically have singleton groups.  Perhaps we'll also evolve to supporting ACS
direct translated p2p so that we have both a pristine IOVA space and
direct routes between devices.  There are current issues with
non-singleton groups that userspace doesn't handle (see "fumble
through" comment above), but I'm not willing to say "this is
userspace's problem anyway, therefore it's ok to make it worse".

> > > > For in-kernel users, we're still theoretically trying to isolate
> > > > devices such that they have restricted access to only the resources
> > > > they need.  Disabling things like ACS in the topology reduces that
> > > > isolation.  AFAICT, most users don't really care about that degree of
> > > > isolation, so they run with iommu=pt for native driver performance
> > > > while still having the IOMMU available for isolation use cases running
> > > > in parallel.  We don't currently have support for on-demand enabling
> > > > isolation.    
> > > 
> > > Ok.
> > >   
> > > > > The host side nVidia GPGPU
> > > > > drivers also won't want to isolate the (host owned) NVLink devices
> > > > > from each other, since they'll want to use the fast interconnects    
> > > > 
> > > > This falls into the same mixed use case scenario above where we don't
> > > > really have a good solution today.  Things like ACS are dynamically
> > > > configurable, but we don't expose any interfaces to let drivers or
> > > > users change it (aside from setpci, which we don't account for
> > > > dynamically).  We assume a simplistic model where if you want IOMMU,
> > > > then you must also want the maximum configurable isolation.
> > > > Dynamically changing routing is not necessarily the most foolproof
> > > > thing either with potentially in-flight transactions and existing DMA
> > > > mappings, which is why I've suggested a couple times that perhaps we
> > > > could do a software hot-unplug of a sub-hierarchy, muck with isolation
> > > > at the remaining node, then re-discover the removed devices.    
> > > 
> > > Ok, so I feel like we need to go fully one way or the other.  Either:
> > > 
> > > 1) Groups represent current isolation status, in which case we
> > >    deprecate vfio containers in favour of fusing groups beforehand,
> > >    and we need some new concept ("isolation atoms"?) to represent what
> > >    isolation is possible  
> > 
> > Nope, containers are owned by users and serve a purpose beyond what
> > you're assuming here.  Also, there's 7.5 years of userspace tooling
> > broken by this.  I do remember we had discussions about merging groups,
> > but what we have now is what we agreed on.  
> 
> Well, quite - and I thought the reasoning that led to that design was
> that groups would be a representation of isolation granularity, not
> actual isolation.  That allows them to be static, which as you say,
> tools are likely to assume.

What is the value of this idea of potential isolation?  Any device
might be considered potentially isolated, simply disable all the
devices around it.  So now we have a "potentially isolated" group at
each device.  What have we accomplished?  How do we build that into a
user interface?  Who decides where the actual isolation lines are
drawn?  Not that it didn't happen, but I don't recall this design idea
and I don't see how it's practical.

> > > or
> > > 
> > > 2) Groups represent potential isolation, with a higher level construct
> > >    representing current isolation.  This could involve making vfio
> > >    containers essentially a wrapper around some more generic concept
> > >    ("isolation clusters"?) of a group of groups.  
> > 
> > Have fun.  Containers and groups are exposed to userspace, so you're
> > essentially suggesting to throw away all that support, for... I don't
> > really know what.  
> 
> No, I'm not proposing removing containers, just changing them from
> being purely VFIO specific to being the VFIO front end to a generic
> concept.  Roughly speaking VFIO containers would be to "isolation
> clusters" (or whatever) as VFIO groups are to IOMMU groups now.

Renaming everything doesn't help.  AFAIK, there's nothing in vfio that
prevents groups from being dynamic (changing them while in use might be
unkind though), but they must be isolated from other groups.  We can't
move the line where we have isolation to anything above the group.
Groups are currently not dynamic because "it's hard" (tm), but
obviously it can be done with device removal and rediscovery.
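
Roughly, with an invented address:

    # soft-remove the sub-hierarchy below a port
    echo 1 > /sys/bus/pci/devices/0000:40:01.0/remove
    # ...reconfigure isolation at the remaining node...
    # then rediscover the removed devices
    echo 1 > /sys/bus/pci/rescan

with drivers re-binding and groups re-evaluated on rediscovery.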
 
> > Groups are involved in IOMMU context, so dynamically
> > changing what a group defines is hard.  
> 
> Absolutely!  That's why I don't want to, and why my conception of
> groups was defined so that they didn't have to.

As above, groups are the unit of userspace ownership and therefore must
be isolated.

> > Thus my suggestions that mixed
> > or dynamic workloads could make use of soft remove and rediscovery for
> > sub-hierarchies, unbinding and rebinding drivers such that we have the
> > proper expectations of DMA context.   Thanks,  
> 
> Well, if we have to change groups, then that sounds like the sanest
> available way to do it.  But I'm not yet convinced that altering
> groups makes sense rather than using a group-of-groups concept on top
> of it - which would map to containers in the case of VFIO.

Groups must be isolated in the vfio API.  What happens underneath a
group might be fungible, but the exposed group must be isolated.
Thanks,

Alex
David Gibson April 5, 2019, 12:34 a.m. UTC | #9
On Fri, Mar 22, 2019 at 05:10:25PM -0600, Alex Williamson wrote:
> On Fri, 22 Mar 2019 14:08:38 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Mar 21, 2019 at 12:19:34PM -0600, Alex Williamson wrote:
> > > On Thu, 21 Mar 2019 10:56:00 +1100
> > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > >   
> > > > On Wed, Mar 20, 2019 at 01:09:08PM -0600, Alex Williamson wrote:  
> > > > > On Wed, 20 Mar 2019 15:38:24 +1100
> > > > > David Gibson <david@gibson.dropbear.id.au> wrote:
> > > > >     
> > > > > > On Tue, Mar 19, 2019 at 10:36:19AM -0600, Alex Williamson wrote:    
> > > > > > > On Fri, 15 Mar 2019 19:18:35 +1100
> > > > > > > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> > > > > > >       
> > > > > > > > The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
> > > > > > > > (on POWER9) NVLinks. In addition to that, GPUs themselves have direct
> > > > > > > > peer to peer NVLinks in groups of 2 to 4 GPUs. At the moment the POWERNV
> > > > > > > > platform puts all interconnected GPUs to the same IOMMU group.
> > > > > > > > 
> > > > > > > > However the user may want to pass individual GPUs to the userspace so
> > > > > > > > in order to do so we need to put them into separate IOMMU groups and
> > > > > > > > cut off the interconnects.
> > > > > > > > 
> > > > > > > > Thankfully V100 GPUs implement an interface to do by programming link
> > > > > > > > disabling mask to BAR0 of a GPU. Once a link is disabled in a GPU using
> > > > > > > > this interface, it cannot be re-enabled until the secondary bus reset is
> > > > > > > > issued to the GPU.
> > > > > > > > 
> > > > > > > > This defines a reset_done() handler for V100 NVlink2 device which
> > > > > > > > determines what links need to be disabled. This relies on presence
> > > > > > > > of the new "ibm,nvlink-peers" device tree property of a GPU telling which
> > > > > > > > PCI peers it is connected to (which includes NVLink bridges or peer GPUs).
> > > > > > > > 
> > > > > > > > This does not change the existing behaviour and instead adds
> > > > > > > > a new "isolate_nvlink" kernel parameter to allow such isolation.
> > > > > > > > 
> > > > > > > > The alternative approaches would be:
> > > > > > > > 
> > > > > > > > 1. do this in the system firmware (skiboot) but for that we would need
> > > > > > > > to tell skiboot via an additional OPAL call whether or not we want this
> > > > > > > > isolation - skiboot is unaware of IOMMU groups.
> > > > > > > > 
> > > > > > > > 2. do this in the secondary bus reset handler in the POWERNV platform -
> > > > > > > > the problem with that is at that point the device is not enabled, i.e.
> > > > > > > > config space is not restored so we need to enable the device (i.e. MMIO
> > > > > > > > bit in CMD register + program valid address to BAR0) in order to disable
> > > > > > > > links and then perhaps undo all this initialization to bring the device
> > > > > > > > back to the state where pci_try_reset_function() expects it to be.      
> > > > > > > 
> > > > > > > The trouble seems to be that this approach only maintains the isolation
> > > > > > > exposed by the IOMMU group when vfio-pci is the active driver for the
> > > > > > > device.  IOMMU groups can be used by any driver and the IOMMU core is
> > > > > > > incorporating groups in various ways.      
> > > > > > 
> > > > > > I don't think that reasoning is quite right.  An IOMMU group doesn't
> > > > > > necessarily represent devices which *are* isolated, just devices which
> > > > > > *can be* isolated.  There are plenty of instances when we don't need
> > > > > > to isolate devices in different IOMMU groups: passing both groups to
> > > > > > the same guest or userspace VFIO driver for example, or indeed when
> > > > > > both groups are owned by regular host kernel drivers.
> > > > > > 
> > > > > > In at least some of those cases we also don't want to isolate the
> > > > > > devices when we don't have to, usually for performance reasons.    
> > > > > 
> > > > > I see IOMMU groups as representing the current isolation of the device,
> > > > > not just the possible isolation.  If there are ways to break down that
> > > > > isolation then ideally the group would be updated to reflect it.  The
> > > > > ACS disable patches seem to support this, at boot time we can choose to
> > > > > disable ACS at certain points in the topology to favor peer-to-peer
> > > > > performance over isolation.  This is then reflected in the group
> > > > > composition, because even though ACS *can be* enabled at the given
> > > > > isolation points, it's intentionally not with this option.  Whether or
> > > > > not a given user who owns multiple devices needs that isolation is
> > > > > really beside the point, the user can choose to connect groups via IOMMU
> > > > > mappings or reconfigure the system to disable ACS and potentially more
> > > > > direct routing.  The IOMMU groups are still accurately reflecting the
> > > > > topology and IOMMU based isolation.    
> > > > 
> > > > Huh, ok, I think we need to straighten this out.  Thinking of iommu
> > > > groups as possible rather than potential isolation was a conscious  
> > > 
> > > possible ~= potential  
> > 
> > Sorry, I meant "current" not "potential".
> > 
> > > > decision on my part when we were first coming up with them.  The
> > > > rationale was that that way iommu groups could be static for the
> > > > lifetime of boot, with more dynamic isolation state layered on top.
> > > > 
> > > > Now, that was based on analogy with PAPR's concept of "Partitionable
> > > > Endpoints" which are decided by firmware before boot.  However, I
> > > > think it makes sense in other contexts too: if iommu groups represent
> > > > current isolation, then we need some other way to advertise possible
> > > > isolation - otherwise how will the admin (and/or tools) know how it
> > > > can configure the iommu groups.
> > > > 
> > > > VFIO already has the container, which represents explicitly a "group
> > > > of groups" that we don't care to isolate from each other.  I don't
> > > > actually know what other uses of the iommu group infrastructure we
> > > > have at present and how they treat them.  
> > > 
> > > s/that we don't care to isolate/that the user doesn't care to isolate/  
> > 
> > Well, true, I guess we still want to isolate them when possible for
> > additional safety.  But I don't think we should do so when that
> > isolation has a real performance cost, which it would in the case of
> > isolating linked GPUs here.
> 
> I don't see that we can presume the user's intentions.  This sounds
> like we're putting the kernel into the position of deciding policy.

It's not really a question of intentions.  If two devices are in the
same container, then they are *not* isolated from the user's point of
view.  Maybe they can't get at each other's MMIO, but they can clobber
each other's DMA inputs and outputs, and for many devices that will
include descriptor tables, allowing them to adversely affect each
other's behaviours.

> > > Though even that is not necessarily accurate, the container represents
> > > a shared IOMMU context, that context doesn't necessarily include
> > > mappings between devices, so the device can still be isolated *from
> > > each other*.  
> > 
> > Well, sure, but it's the only mechanism the user has for indicating
> > that they don't care about isolation between these devices.  That might
> > be because they simply don't care, but it might also be because they
> > want to use interlinks between those devices for additional
> > performance.
> 
> Is a user really indicating "I don't care about isolation" by placing
> groups within the same container or just taking advantage of a shared
> IOMMU context?  Again, we're presuming a policy decision that's never
> been defined in the API.  It is possible to have both isolation and
> performance, for instance ACS includes a direct translated p2p feature
> that allows translated requests a direct path between devices, at least
> on the PCIe fabric (where supported by hardware).
> 
> > > > So, if we now have dynamically reconfigurable groups which are a
> > > > departure from that design, how can we go from here trying to bring
> > > > things back to consistency.  
> > > 
> > > We don't currently have dynamically reconfigurable groups,  
> > 
> > Ah, ok.  So what were these other use cases you were describing?  Boot
> > time options which change how the iommu groups are generated?  Or
> > something else.
> 
>         pci=option[,option...]  [PCI] various PCI subsystem options.
> 
>                                 Some options herein operate on a specific device
>                                 or a set of devices (<pci_dev>). These are
>                                 specified in one of the following formats:
> 
>                                 [<domain>:]<bus>:<dev>.<func>[/<dev>.<func>]*
>                                 pci:<vendor>:<device>[:<subvendor>:<subdevice>]
> 
>         ...
>                 disable_acs_redir=<pci_dev>[; ...]
>                                 Specify one or more PCI devices (in the format
>                                 specified above) separated by semicolons.
>                                 Each device specified will have the PCI ACS
>                                 redirect capabilities forced off which will
>                                 allow P2P traffic between devices through
>                                 bridges without forcing it upstream. Note:
>                                 this removes isolation between devices and
>                                 may put more devices in an IOMMU group.
> 
> AIUI, GPU and RDMA were primary use cases for this.  As noted, IOMMU
> groups take this into account since it's done at boot time and static,
> we're simply disabling isolation switches that would have otherwise
> been enabled at points in the topology (specifically enabling p2p as
> much as possible), and the IOMMU group grows to match.

Right.  It's better than nothing, but requiring boot time
configuration is a pain.  It'd be nice to have a way of runtime
configuring it.  That will be much nicer to use if we can also runtime
introspect it, which requires some kind of two-level system
distinguishing current from possible isolation.

> The NVLink problem is rather the opposite of this, rather than simply
> disabling isolation for a sub-hierarchy and growing the IOMMU group, we
> need to enable isolation and then actively, and repeatedly maintain
> that configuration as it represents a finer granularity isolation
> promise to the system.  And of course we're trying to manage the
> isolation of an interconnect for which we have no visibility.

Sigh.  Yeah.  That's pretty horrible.

> > > I think the
> > > question is how do we introduce dynamically configurable groups within
> > > the existing design, because whatever intentions were 7.5 years ago is
> > > rather irrelevant now.  
> > 
> > Well, yeah, I guess.
> > 
> > > VFIO groups are mapped directly to IOMMU groups.  IOMMU groups are the
> > > smallest set of devices which can be considered isolated from other
> > > sets of devices (not potentially, but as configured).  
> > 
> > Hm.. but what actually makes that assumption that this is the case as
> > configured, rather than just possibly - I'm taking "possible" to mean
> > possible with this host kernel, not just possible with this hardware.
> 
> The IOMMU driver is ultimately responsible for deciding the isolation,
> but we have shared code that understands the PCI/e topology.  That code
> looks at the actual, current ACS settings to decide the highest point
> in the topology where we might have p2p.

Right, but I was more thinking about the consumers of the information
than the producer.

> > > VFIO groups are
> > > the unit of ownership to userspace, therefore a VFIO group must be
> > > (currently) isolated from other groups.  
> > 
> > It must be currently isolated once the VFIO group is instantiated, but
> > is there actually anything that requires that current isolation before
> > the iommu group is instantiated as a vfio group.
> 
> The iommu-core is incorporating more group-aware support, I believe
> default domains are group based and we might have per group reserved
> ranges, I'm not sure about that.

Hm, ok.

> > > The user owns and manages the
> > > IOMMU context for one or more groups via a container.  A container
> > > allows efficiency in managing a shared IOVA space between groups.  
> > 
> > Yes, and it seems to me an obvious extension to have the container
> > also permit other efficiencies which aren't compatible with full
> > isolation.
> 
> It might be a control point, yes, but I don't see that we can presume
> it.

Well, as above, I think the container already means the devices can't
really be considered isolated, even if they are in the limited sense
of not being able to touch each other's MMIOs.

> If we were to automatically enable redirection within a container,
> we could wreak havoc on the IOVA space for the user.

Granted, I am thinking more of the POWER case where the IOVA and MMIO
address ranges are well separated from each other anyway.  Is there
any way we can set up the PCI address space on x86 to ensure that
regardless of group separation?

> ACS also
> theoretically provides a hardware path to enable this via direct
> translated p2p as well.
> 
> > > > > > > So, if there's a device specific
> > > > > > > way to configure the isolation reported in the group, which requires
> > > > > > > some sort of active management against things like secondary bus
> > > > > > > resets, then I think we need to manage it above the attached endpoint
> > > > > > > driver.      
> > > > > > 
> > > > > > The problem is that above the endpoint driver, we don't actually have
> > > > > > enough information about what should be isolated.  For VFIO we want to
> > > > > > isolate things if they're in different containers, for most regular
> > > > > > host kernel drivers we don't need to isolate at all (although we might
> > > > > > as well when it doesn't have a cost).    
> > > > > 
> > > > > This idea that we only want to isolate things if they're in different
> > > > > containers is bogus, imo.  There are performance reasons why we might
> > > > > not want things isolated, but there are also address space reasons why
> > > > > we do.  If there are direct routes between devices, the user needs to
> > > > > be aware of the IOVA pollution, if we maintain singleton groups, they
> > > > > don't.  Granted we don't really account for this well in most
> > > > > userspaces and fumble through it by luck of the address space layout
> > > > > and lack of devices really attempting peer to peer access.    
> > > > 
> > > > I don't really follow what you're saying here.  
> > > 
> > > If the lack of isolation between devices includes peer-to-peer channels
> > > then the MMIO of the devices within the group pollutes the IOVA space
> > > of the container.  
> > 
> > Yes, if the permissible IOVA addresses overlap with valid MMIO
> > addresses.  I don't really see why that's significant, though.  We
> > already have basically the same situation if there are multiple
> > devices in a group.  e.g. if you have several devices behind a dumb
> > PCI-E to PCI bridge, you can't actually prohibit peer-to-peer DMA
> > between them, and so those devices' MMIOs pollute the IOVA space for
> > each other, even if they don't for other devices in the container.
> > There's only so far we can go to prevent the user from shooting
> > themselves in the foot.
> 
> So it's therefore OK to intentionally pollute the container IOVA
> space?  I believe we'd like to think that hardware is headed towards
> being more isolation-aware and will evolve to the point where we
> typically have singleton groups.  Perhaps we'll also evolve to supporting ACS
> direct translated p2p so that we have both a pristine IOVA space and
> direct routes between devices.  There are current issues with
> non-singleton groups that userspace doesn't handle (see "fumble
> through" comment above), but I'm not willing to say "this is
> userspace's problem anyway, therefore it's ok to make it worse".
> 
> > > > > For in-kernel users, we're still theoretically trying to isolate
> > > > > devices such that they have restricted access to only the resources
> > > > > they need.  Disabling things like ACS in the topology reduces that
> > > > > isolation.  AFAICT, most users don't really care about that degree of
> > > > > isolation, so they run with iommu=pt for native driver performance
> > > > > while still having the IOMMU available for isolation use cases running
> > > > > in parallel.  We don't currently have support for on-demand enabling
> > > > > isolation.    
> > > > 
> > > > Ok.
> > > >   
> > > > > > The host side nVidia GPGPU
> > > > > > drivers also won't want to isolate the (host owned) NVLink devices
> > > > > > from each other, since they'll want to use the fast interconnects    
> > > > > 
> > > > > This falls into the same mixed use case scenario above where we don't
> > > > > really have a good solution today.  Things like ACS are dynamically
> > > > > configurable, but we don't expose any interfaces to let drivers or
> > > > > users change it (aside from setpci, which we don't account for
> > > > > dynamically).  We assume a simplistic model where if you want IOMMU,
> > > > > then you must also want the maximum configurable isolation.
> > > > > Dynamically changing routing is not necessarily the most foolproof
> > > > > thing either with potentially in-flight transactions and existing DMA
> > > > > mappings, which is why I've suggested a couple times that perhaps we
> > > > > could do a software hot-unplug of a sub-hierarchy, muck with isolation
> > > > > at the remaining node, then re-discover the removed devices.    
> > > > 
> > > > Ok, so I feel like we need to go fully one way or the other.  Either:
> > > > 
> > > > 1) Groups represent current isolation status, in which case we
> > > >    deprecate vfio containers in favour of fusing groups beforehand,
> > > >    and we need some new concept ("isolation atoms"?) to represent what
> > > >    isolation is possible  
> > > 
> > > Nope, containers are owned by users and serve a purpose beyond what
> > > you're assuming here.  Also, there's 7.5 years of userspace tooling
> > > broken by this.  I do remember we had discussions about merging groups,
> > > but what we have now is what we agreed on.  
> > 
> > Well, quite - and I thought the reasoning that led to that design was
> > that groups would be a representation of isolation granularity, not
> > actual isolation.  That allows them to be static, which as you say,
> > tools are likely to assume.
> 
> What is the value of this idea of potential isolation?  Any device
> might be considered potentially isolated, simply disable all the
> devices around it.

Well, assuming there's a disabling mechanism of suitable granularity.

> So now we have a "potentially isolated" group at
> each device.  What have we accomplished?  How do we build that into a
> user interface?  Who decides where the actual isolation lines are
> drawn?  Not that it didn't happen, but I don't recall this design idea
> and I don't see how it's practical.

The UI I had in mind is that this minimum granularity is what's
advertised to userspace.  Again, this could vary with kernel
capabilities not just hardware capabilities - "theoretically
isolatable, but we haven't implemented that" counts as a single group.

When userspace wants to use some devices, it can choose whether to
grab them individually or to put several of its choice into a
"container" (analgous to a VFIO container, but could become a more
generic concept).  Being in a container means the kernel should
prioritize performance over isolation.  If the devices are widely
separated (e.g. separate PCI domains) that might not mean anything
much (beyond sharing a logical IOMMU AS), but if there's some sort of
fast-path between them, the kernel would enable it.
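
To make that concrete - and this is purely hypothetical, nothing like it
exists today - imagine something like:

    /sys/kernel/iommu_groups/5                     # minimum isolatable unit, static
    /sys/kernel/isolation_clusters/0/groups/{5,6}  # user-created "don't isolate" set

where putting groups 5 and 6 into cluster 0 tells the kernel it may
trade their mutual isolation for whatever fast path it knows about
between them.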

> > > > or
> > > > 
> > > > 2) Groups represent potential isolation, with a higher level construct
> > > >    representing current isolation.  This could involve making vfio
> > > >    containers essentially a wrapper around some more generic concept
> > > >    ("isolation clusters"?) of a group of groups.  
> > > 
> > > Have fun.  Containers and groups are exposed to userspace, so you're
> > > essentially suggesting to throw away all that support, for... I don't
> > > really know what.  
> > 
> > No, I'm not proposing removing containers, just changing them from
> > being purely VFIO specific to being the VFIO front end to a generic
> > concept.  Roughly speaking VFIO containers would be to "isolation
> > clusters" (or whatever) as VFIO groups are to IOMMU groups now.
> 
> Renaming everything doesn't help.  AFAIK, there's nothing in vfio that
> prevents groups from being dynamic (changing them while in use might be
> unkind though), but they must be isolated from other groups.  We can't
> move the line where we have isolation to anything above the group.
> Groups are currently not dynamic because "it's hard" (tm), but
> obviously it can be done with device removal and rediscovery.
>  
> > > Groups are involved in IOMMU context, so dynamically
> > > changing what a group defines is hard.  
> > 
> > Absolutely!  That's why I don't want to, and why my conception of
> > groups was defined so that they didn't have to.
> 
> As above, groups are the unit of userspace ownership and therefore must
> be isolated.
> 
> > > Thus my suggestions that mixed
> > > or dynamic workloads could make use of soft remove and rediscovery for
> > > sub-hierarchies, unbinding and rebinding drivers such that we have the
> > > proper expectations of DMA context.   Thanks,  
> > 
> > Well, if we have to change groups, then that sounds like the sanest
> > available way to do it.  But I'm not yet convinced that altering
> > groups makes sense rather than using a group-of-groups concept on top
> > of it - which would map to containers in the case of VFIO.
> 
> Groups must be isolated in the vfio API.  What happens underneath a
> group might be fungible, but the exposed group must be isolated.

Still not seeing what makes that an immutable law.
diff mbox series

Patch

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 3a102378c8dc..6f5c769b6fc8 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -441,6 +441,23 @@  static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
 	++npucomp->pe_num;
 }
 
+static bool isolate_nvlink;
+
+static int __init parse_isolate_nvlink(char *p)
+{
+	bool val;
+
+	if (!p)
+		val = true;
+	else if (kstrtobool(p, &val))
+		return -EINVAL;
+
+	isolate_nvlink = val;
+
+	return 0;
+}
+early_param("isolate_nvlink", parse_isolate_nvlink);
+
 struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 {
 	struct iommu_table_group *table_group;
@@ -463,7 +480,7 @@  struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 	hose = pci_bus_to_host(npdev->bus);
 	phb = hose->private_data;
 
-	if (hose->npu) {
+	if (hose->npu && !isolate_nvlink) {
 		if (!phb->npucomp) {
 			phb->npucomp = kzalloc(sizeof(struct npu_comp),
 					GFP_KERNEL);
@@ -477,7 +494,10 @@  struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 					pe->pe_number);
 		}
 	} else {
-		/* Create a group for 1 GPU and attached NPUs for POWER8 */
+		/*
+		 * Create a group for 1 GPU and attached NPUs for
+		 * POWER8 (always) or POWER9 (when isolate_nvlink).
+		 */
 		pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
 		table_group = &pe->npucomp->table_group;
 		table_group->ops = &pnv_npu_peers_ops;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 32f695ffe128..bb6bba762f46 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -206,6 +206,102 @@  static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static int vfio_pci_nvdia_v100_is_ph_in_group(struct device *dev, void *data)
+{
+	return dev->of_node->phandle == *(phandle *) data;
+}
+
+static u32 vfio_pci_nvdia_v100_get_disable_mask(struct device *dev)
+{
+	int npu, peer;
+	u32 mask;
+	struct device_node *dn;
+	struct iommu_group *group;
+
+	dn = dev->of_node;
+	if (!of_find_property(dn, "ibm,nvlink-peers", NULL))
+		return 0;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return 0;
+
+	/*
+	 * Collect links to keep which includes links to NPU and links to
+	 * other GPUs in the same IOMMU group.
+	 */
+	for (npu = 0, mask = 0; ; ++npu) {
+		u32 npuph = 0;
+
+		if (of_property_read_u32_index(dn, "ibm,npu", npu, &npuph))
+			break;
+
+		for (peer = 0; ; ++peer) {
+			u32 peerph = 0;
+
+			if (of_property_read_u32_index(dn, "ibm,nvlink-peers",
+					peer, &peerph))
+				break;
+
+			if (peerph != npuph &&
+				!iommu_group_for_each_dev(group, &peerph,
+					vfio_pci_nvdia_v100_is_ph_in_group))
+				continue;
+
+			mask |= 1 << (peer + 16);
+		}
+	}
+	iommu_group_put(group);
+
+	/* Disabling mechanism takes links to disable so invert it here */
+	mask = ~mask & 0x3F0000;
+
+	return mask;
+}
+
+static void vfio_pci_nvdia_v100_nvlink2_reset_done(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 cmd = 0, cmdmask;
+	u32 mask, val;
+	void __iomem *bar0;
+
+	bar0 = vdev->barmap[0];
+	if (!bar0)
+		return;
+
+	mask = vfio_pci_nvdia_v100_get_disable_mask(&pdev->dev);
+	if (!mask)
+		return;
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	cmdmask = PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER | PCI_COMMAND_PARITY;
+	if ((cmd & cmdmask) != cmdmask)
+		pci_write_config_word(pdev, PCI_COMMAND, cmd | cmdmask);
+
+	/*
+	 * The sequence is from
+	 * Tesla P100 and V100 SXM2 NVLink Isolation on Multi-Tenant Systems.
+	 * The register names are not provided there either, hence raw values.
+	 */
+	iowrite32(0x4, bar0 + 0x12004C);
+	iowrite32(0x2, bar0 + 0x122204);
+	val = ioread32(bar0 + 0x200);
+	val |= 0x02000000;
+	iowrite32(val, bar0 + 0x200);
+	val = ioread32(bar0 + 0xA00148);
+	val |= mask;
+	iowrite32(val, bar0 + 0xA00148);
+	val = ioread32(bar0 + 0xA00148);
+
+	if ((cmd | cmdmask) != cmd)
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+}
+
+static struct vfio_pci_error_handlers vfio_pci_nvdia_v100_error_handlers = {
+	.reset_done = vfio_pci_nvdia_v100_nvlink2_reset_done,
+};
+
 int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 {
 	int ret;
@@ -286,6 +382,8 @@  int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 	if (ret)
 		goto free_exit;
 
+	vdev->error_handlers = &vfio_pci_nvdia_v100_error_handlers;
+
 	return 0;
 free_exit:
 	kfree(data);