
[kernel,v2,0/7] powerpc/powernv/ioda: Allow huge DMA window at 4GB

Message ID 20200323075354.93825-1-aik@ozlabs.ru (mailing list archive)

Message

Alexey Kardashevskiy March 23, 2020, 7:53 a.m. UTC
Here is an attempt to support a bigger DMA space for devices
supporting DMA masks of less than 59 bits (GPUs come to mind
first). POWER9 PHBs have an option to map two windows at 0
and select a window based on the DMA address being below or
above 4GB.

This adds the "iommu=iommu_bypass" kernel parameter and
supports the VFIO+pseries machine - currently this requires telling
upstream+unmodified QEMU about it via
-global spapr-pci-host-bridge.dma64_win_addr=0x100000000
or a per-PHB property. 4/4 advertises the new option but
there is no automation around it in QEMU (should there be?).

For now it is either the 1<<59 or the 4GB mode; dynamic switching is
not supported (it could be done via sysfs).

This is a rebased version of
https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/

The main change since v1 is that the series is now 7 patches with
a clearer separation of steps.


This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"

Please comment. Thanks.



Alexey Kardashevskiy (7):
  powerpc/powernv/ioda: Move TCE bypass base to PE
  powerpc/powernv/ioda: Rework for huge DMA window at 4GB
  powerpc/powernv/ioda: Allow smaller TCE table levels
  powerpc/powernv/phb4: Use IOMMU instead of bypassing
  powerpc/iommu: Add a window number to
    iommu_table_group_ops::get_table_size
  powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
  vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

 arch/powerpc/include/asm/iommu.h              |   3 +
 arch/powerpc/include/asm/opal-api.h           |   9 +-
 arch/powerpc/include/asm/opal.h               |   2 +
 arch/powerpc/platforms/powernv/pci.h          |   4 +-
 include/uapi/linux/vfio.h                     |   2 +
 arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
 arch/powerpc/platforms/powernv/opal-call.c    |   2 +
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
 drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
 10 files changed, 213 insertions(+), 65 deletions(-)

Comments

Alexey Kardashevskiy April 8, 2020, 9:43 a.m. UTC | #1
On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
> Here is an attempt to support bigger DMA space for devices
> supporting DMA masks less than 59 bits (GPUs come into mind
> first). POWER9 PHBs have an option to map 2 windows at 0
> and select a windows based on DMA address being below or above
> 4GB.
> 
> This adds the "iommu=iommu_bypass" kernel parameter and
> supports VFIO+pseries machine - current this requires telling
> upstream+unmodified QEMU about this via
> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
> or per-phb property. 4/4 advertises the new option but
> there is no automation around it in QEMU (should it be?).
> 
> For now it is either 1<<59 or 4GB mode; dynamic switching is
> not supported (could be via sysfs).
> 
> This is a rebased version of
> https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/
> 
> The main change since v1 is that now it is 7 patches with
> clearer separation of steps.
> 
> 
> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
> 
> Please comment. Thanks.

Ping?


> 
> 
> 
> Alexey Kardashevskiy (7):
>   powerpc/powernv/ioda: Move TCE bypass base to PE
>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>   powerpc/powernv/ioda: Allow smaller TCE table levels
>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>   powerpc/iommu: Add a window number to
>     iommu_table_group_ops::get_table_size
>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
> 
>  arch/powerpc/include/asm/iommu.h              |   3 +
>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>  arch/powerpc/include/asm/opal.h               |   2 +
>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>  include/uapi/linux/vfio.h                     |   2 +
>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>  10 files changed, 213 insertions(+), 65 deletions(-)
>
Alexey Kardashevskiy April 16, 2020, 1:27 a.m. UTC | #2
Anyone? Is it totally useless or the wrong approach? Thanks,


On 08/04/2020 19:43, Alexey Kardashevskiy wrote:
> 
> 
> On 23/03/2020 18:53, Alexey Kardashevskiy wrote:
>> Here is an attempt to support bigger DMA space for devices
>> supporting DMA masks less than 59 bits (GPUs come into mind
>> first). POWER9 PHBs have an option to map 2 windows at 0
>> and select a windows based on DMA address being below or above
>> 4GB.
>>
>> This adds the "iommu=iommu_bypass" kernel parameter and
>> supports VFIO+pseries machine - current this requires telling
>> upstream+unmodified QEMU about this via
>> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
>> or per-phb property. 4/4 advertises the new option but
>> there is no automation around it in QEMU (should it be?).
>>
>> For now it is either 1<<59 or 4GB mode; dynamic switching is
>> not supported (could be via sysfs).
>>
>> This is a rebased version of
>> https://lore.kernel.org/kvm/20191202015953.127902-1-aik@ozlabs.ru/
>>
>> The main change since v1 is that now it is 7 patches with
>> clearer separation of steps.
>>
>>
>> This is based on 6c90b86a745a "Merge tag 'mmc-v5.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc"
>>
>> Please comment. Thanks.
> 
> Ping?
> 
> 
>>
>>
>>
>> Alexey Kardashevskiy (7):
>>   powerpc/powernv/ioda: Move TCE bypass base to PE
>>   powerpc/powernv/ioda: Rework for huge DMA window at 4GB
>>   powerpc/powernv/ioda: Allow smaller TCE table levels
>>   powerpc/powernv/phb4: Use IOMMU instead of bypassing
>>   powerpc/iommu: Add a window number to
>>     iommu_table_group_ops::get_table_size
>>   powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
>>   vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB
>>
>>  arch/powerpc/include/asm/iommu.h              |   3 +
>>  arch/powerpc/include/asm/opal-api.h           |   9 +-
>>  arch/powerpc/include/asm/opal.h               |   2 +
>>  arch/powerpc/platforms/powernv/pci.h          |   4 +-
>>  include/uapi/linux/vfio.h                     |   2 +
>>  arch/powerpc/platforms/powernv/npu-dma.c      |   1 +
>>  arch/powerpc/platforms/powernv/opal-call.c    |   2 +
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
>>  arch/powerpc/platforms/powernv/pci-ioda.c     | 234 ++++++++++++++----
>>  drivers/vfio/vfio_iommu_spapr_tce.c           |  17 +-
>>  10 files changed, 213 insertions(+), 65 deletions(-)
>>
>
Oliver O'Halloran April 16, 2020, 2:34 a.m. UTC | #3
On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> Anyone? Is it totally useless or wrong approach? Thanks,

I wouldn't say it's either, but I still hate it.

The 4GB mode being per-PHB makes it difficult to use unless we force
that mode on 100% of the time, which I'd prefer not to do. Ideally
devices that actually support 64bit addressing (which is most of them)
should be able to use no-translate mode when possible since a) it's
faster, and b) it frees up room in the TCE cache for devices that
actually need it. I know you've done some testing with 100G NICs and
found the overhead was fine, but IMO that's a bad test since it's
pretty much the best-case scenario: all the devices on the PHB are in
the same PE. The PHB's TCE cache only hits when the TCE matches the
DMA bus address and the PE number for the device, so in a multi-PE
environment there's a lot of potential for TCE cache thrashing. If
there were one or two PEs under that PHB it's probably not going to
matter, but if you have an NVMe rack with 20 drives it starts to look
a bit ugly.
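
To make the cache point concrete, here's a toy model of the lookup
(entry count, layout and names are all invented, not the real PHB
cache):

#include <stdbool.h>
#include <stdint.h>

#define MODEL_TCE_CACHE_ENTRIES 64
#define MODEL_IO_PAGE_SHIFT     16

/* toy model: a hit needs both the DMA page address and the PE number
 * to match, so many active PEs under one PHB compete for the same
 * entries and can thrash the cache */
struct model_tce_cache_entry {
        bool     valid;
        uint16_t pe;
        uint64_t dma_page;
        uint64_t tce;
};

static struct model_tce_cache_entry model_cache[MODEL_TCE_CACHE_ENTRIES];

static bool model_tce_cache_lookup(uint16_t pe, uint64_t dma_addr,
                                   uint64_t *tce)
{
        uint64_t page = dma_addr >> MODEL_IO_PAGE_SHIFT;
        struct model_tce_cache_entry *e =
                &model_cache[page % MODEL_TCE_CACHE_ENTRIES];

        if (e->valid && e->pe == pe && e->dma_page == page) {
                *tce = e->tce;
                return true;
        }
        return false;           /* miss: fetch the TCE from memory */
}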

That all said, it might be worth doing this anyway since we probably
want the software infrastructure in place to take advantage of it.
Maybe expand the command line parameters to allow it to be enabled on
a per-PHB basis rather than globally.

Oliver
Oliver O'Halloran April 16, 2020, 2:53 a.m. UTC | #4
On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >
> > Anyone? Is it totally useless or wrong approach? Thanks,
>
> I wouldn't say it's either, but I still hate it.
>
> The 4GB mode being per-PHB makes it difficult to use unless we force
> that mode on 100% of the time which I'd prefer not to do. Ideally
> devices that actually support 64bit addressing (which is most of them)
> should be able to use no-translate mode when possible since a) It's
> faster, and b) It frees up room in the TCE cache devices that actually
> need them. I know you've done some testing with 100G NICs and found
> the overhead was fine, but IMO that's a bad test since it's pretty
> much the best-case scenario since all the devices on the PHB are in
> the same PE. The PHB's TCE cache only hits when the TCE matches the
> DMA bus address and the PE number for the device so in a multi-PE
> environment there's a lot of potential for TCE cache trashing. If
> there was one or two PEs under that PHB it's probably not going to
> matter, but if you have an NVMe rack with 20 drives it starts to look
> a bit ugly.
>
> That all said, it might be worth doing this anyway since we probably
> want the software infrastructure in place to take advantage of it.
> Maybe expand the command line parameters to allow it to be enabled on
> a per-PHB basis rather than globally.

Since we're on the topic

I've been thinking the real issue we have is that we're trying to pick
an "optimal" IOMMU config at a point where we don't have enough
information to work out what's actually optimal. The IOMMU config is
done on a per-PE basis, but since PEs may contain devices with
different DMA masks (looking at you, weird AMD audio function) we're
always going to have to pick something conservative as the default
config for TVE#0 (64k, no bypass mapping) since the driver will tell
us what the device actually supports long after the IOMMU configuration
is done. What we really want is to be able to have separate IOMMU
contexts for each device, or at the very least a separate context for
the crippled devices.

We could allow a per-device IOMMU context by extending the Master /
Slave PE thing to cover DMA in addition to MMIO. Right now we only use
slave PEs when a device's MMIO BARs extend over multiple m64 segments.
When that happens an MMIO error causes the PHB to freeze the PE
corresponding to one of those segments, but not any of the others. To
present a single "PE" to the EEH core we check the freeze status of
each of the slave PEs when the EEH core does a PE status check and if
any of them are frozen, we freeze the rest of them too. When a driver
sets a limited DMA mask we could move that device to a separate slave
PE so that it has its own IOMMU context tailored to its DMA
addressing limits.
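
Very roughly, as a standalone model (every name and threshold below is
invented; nothing like this exists in the tree):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_pe {
        int      pe_number;
        bool     dma_only_slave;
        uint64_t window_top;    /* highest DMA address the window covers */
};

struct model_dev {
        struct model_pe *native_pe;
        struct model_pe *dma_pe;  /* NULL: DMA stays with the native PE */
};

static struct model_pe *model_alloc_dma_slave_pe(uint64_t mask)
{
        static struct model_pe slave = { .pe_number = 0xf0 };  /* arbitrary spare */

        slave.dma_only_slave = true;
        slave.window_top = mask;        /* window tailored to the mask */
        return &slave;
}

static void model_set_dma_mask(struct model_dev *dev, uint64_t mask)
{
        /* illustrative threshold: the device has to be able to address
         * the bypass window (at 1<<59) plus RAM to use it as-is */
        if (mask > (1ULL << 59)) {
                dev->dma_pe = NULL;     /* keep the native PE */
                return;
        }
        /* crippled mask: give the function its own DMA-only slave PE
         * with an IOMMU window sized to what it can address */
        dev->dma_pe = model_alloc_dma_slave_pe(mask);
}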

Thoughts?

Oliver
Russell Currey April 17, 2020, 1:26 a.m. UTC | #5
On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com>
> wrote:
> > On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <
> > aik@ozlabs.ru> wrote:
> > > Anyone? Is it totally useless or wrong approach? Thanks,
> > 
> > I wouldn't say it's either, but I still hate it.
> > 
> > The 4GB mode being per-PHB makes it difficult to use unless we
> > force
> > that mode on 100% of the time which I'd prefer not to do. Ideally
> > devices that actually support 64bit addressing (which is most of
> > them)
> > should be able to use no-translate mode when possible since a) It's
> > faster, and b) It frees up room in the TCE cache devices that
> > actually
> > need them. I know you've done some testing with 100G NICs and found
> > the overhead was fine, but IMO that's a bad test since it's pretty
> > much the best-case scenario since all the devices on the PHB are in
> > the same PE. The PHB's TCE cache only hits when the TCE matches the
> > DMA bus address and the PE number for the device so in a multi-PE
> > environment there's a lot of potential for TCE cache trashing. If
> > there was one or two PEs under that PHB it's probably not going to
> > matter, but if you have an NVMe rack with 20 drives it starts to
> > look
> > a bit ugly.
> > 
> > That all said, it might be worth doing this anyway since we
> > probably
> > want the software infrastructure in place to take advantage of it.
> > Maybe expand the command line parameters to allow it to be enabled
> > on
> > a per-PHB basis rather than globally.
> 
> Since we're on the topic
> 
> I've been thinking the real issue we have is that we're trying to
> pick
> an "optimal" IOMMU config at a point where we don't have enough
> information to work out what's actually optimal. The IOMMU config is
> done on a per-PE basis, but since PEs may contain devices with
> different DMA masks (looking at you wierd AMD audio function) we're
> always going to have to pick something conservative as the default
> config for TVE#0 (64k, no bypass mapping) since the driver will tell
> us what the device actually supports long after the IOMMU
> configuation
> is done. What we really want is to be able to have separate IOMMU
> contexts for each device, or at the very least a separate context for
> the crippled devices.
> 
> We could allow a per-device IOMMU context by extending the Master /
> Slave PE thing to cover DMA in addition to MMIO. Right now we only
> use
> slave PEs when a device's MMIO BARs extend over multiple m64
> segments.
> When that happens an MMIO error causes the PHB to freezes the PE
> corresponding to one of those segments, but not any of the others. To
> present a single "PE" to the EEH core we check the freeze status of
> each of the slave PEs when the EEH core does a PE status check and if
> any of them are frozen, we freeze the rest of them too. When a driver
> sets a limited DMA mask we could move that device to a seperate slave
> PE so that it has it's own IOMMU context taylored to its DMA
> addressing limits.
> 
> Thoughts?

For what it's worth this sounds like a good idea to me; it just sounds
tricky to implement.  You're adding another layer of complexity on top
of EEH (well, making things look simple to the EEH core and doing your
own freezing on top of it) in addition to the DMA handling.

If it works then great; it just has a high potential to become a new
bug haven.

> 
> Oliver
Alexey Kardashevskiy April 17, 2020, 5:47 a.m. UTC | #6
On 17/04/2020 11:26, Russell Currey wrote:
> On Thu, 2020-04-16 at 12:53 +1000, Oliver O'Halloran wrote:
>> On Thu, Apr 16, 2020 at 12:34 PM Oliver O'Halloran <oohall@gmail.com>
>> wrote:
>>> On Thu, Apr 16, 2020 at 11:27 AM Alexey Kardashevskiy <
>>> aik@ozlabs.ru> wrote:
>>>> Anyone? Is it totally useless or wrong approach? Thanks,
>>>
>>> I wouldn't say it's either, but I still hate it.
>>>
>>> The 4GB mode being per-PHB makes it difficult to use unless we
>>> force
>>> that mode on 100% of the time which I'd prefer not to do. Ideally
>>> devices that actually support 64bit addressing (which is most of
>>> them)
>>> should be able to use no-translate mode when possible since a) It's
>>> faster, and b) It frees up room in the TCE cache devices that
>>> actually
>>> need them. I know you've done some testing with 100G NICs and found
>>> the overhead was fine, but IMO that's a bad test since it's pretty
>>> much the best-case scenario since all the devices on the PHB are in
>>> the same PE. The PHB's TCE cache only hits when the TCE matches the
>>> DMA bus address and the PE number for the device so in a multi-PE
>>> environment there's a lot of potential for TCE cache trashing. If
>>> there was one or two PEs under that PHB it's probably not going to
>>> matter, but if you have an NVMe rack with 20 drives it starts to
>>> look
>>> a bit ugly.
>>>
>>> That all said, it might be worth doing this anyway since we
>>> probably
>>> want the software infrastructure in place to take advantage of it.
>>> Maybe expand the command line parameters to allow it to be enabled
>>> on
>>> a per-PHB basis rather than globally.
>>
>> Since we're on the topic
>>
>> I've been thinking the real issue we have is that we're trying to
>> pick
>> an "optimal" IOMMU config at a point where we don't have enough
>> information to work out what's actually optimal. The IOMMU config is
>> done on a per-PE basis, but since PEs may contain devices with
>> different DMA masks (looking at you wierd AMD audio function) we're
>> always going to have to pick something conservative as the default
>> config for TVE#0 (64k, no bypass mapping) since the driver will tell
>> us what the device actually supports long after the IOMMU
>> configuation
>> is done. What we really want is to be able to have separate IOMMU
>> contexts for each device, or at the very least a separate context for
>> the crippled devices.
>>
>> We could allow a per-device IOMMU context by extending the Master /
>> Slave PE thing to cover DMA in addition to MMIO. Right now we only
>> use
>> slave PEs when a device's MMIO BARs extend over multiple m64
>> segments.
>> When that happens an MMIO error causes the PHB to freezes the PE
>> corresponding to one of those segments, but not any of the others. To
>> present a single "PE" to the EEH core we check the freeze status of
>> each of the slave PEs when the EEH core does a PE status check and if
>> any of them are frozen, we freeze the rest of them too. When a driver
>> sets a limited DMA mask we could move that device to a seperate slave
>> PE so that it has it's own IOMMU context taylored to its DMA
>> addressing limits.
>>
>> Thoughts?
> 
> For what it's worth this sounds like a good idea to me, it just sounds
> tricky to implement.  You're adding another layer of complexity on top
> of EEH (well, making things look simple to the EEH core and doing your
> own freezing on top of it) in addition to the DMA handling.
> 
> If it works then great, just has a high potential to become a new bug
> haven.


imho putting every PCI function into a separate PE is the right thing
to do here but I've been told it is not that simple, and I believe
that. Reusing slave PEs seems unreliable - the configuration will
depend on whether a PE occupied enough segments to give a unique PE to
a PCI function and my little brain explodes.

So this is not happening soon.

For the time being, this patchset is good for:
1. weird hardware which has a limited DMA mask (this is why the
patchset was written in the first place)
2. debugging DMA by routing it via the IOMMU (even when the 4GB hack
is not enabled).
Oliver O'Halloran April 20, 2020, 2:04 p.m. UTC | #7
On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
> 
> On 17/04/2020 11:26, Russell Currey wrote:
> > 
> > For what it's worth this sounds like a good idea to me, it just sounds
> > tricky to implement.  You're adding another layer of complexity on top
> > of EEH (well, making things look simple to the EEH core and doing your
> > own freezing on top of it) in addition to the DMA handling.
> > 
> > If it works then great, just has a high potential to become a new bug
> > haven.
> 
> imho putting every PCI function to a separate PE is the right thing to
> do here but I've been told it is not that simple, and I believe that.
> Reusing slave PEs seems unreliable - the configuration will depend on
> whether a PE occupied enough segments to give an unique PE to a PCI
> function and my little brain explodes.

You're overthinking it.

If a bus has no m64 MMIO space we're free to assign whatever PE number
we want since the PE for the bus isn't fixed by the m64 segment its
BARs were placed in. For those buses we assign a PE number starting
from the max and counting down (0xff, 0xfe, etc). For example, with
this PHB:

# lspci -s 1:: -v | egrep '0001:|Memory at'
0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
        Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
(prog-if 00 [Normal decode])
0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
f117 (rev 06) (prog-if 02 [NVM Express])
        Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
        Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
X710/X557-AT 10GBASE-T (rev 02)
        Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
        Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
(redundant functions removed)

We get these PE assignments:

0001:00:00.0 - 0xfe # Root port
0001:01:00.0 - 0xfc # upstream port
0001:02:01.0 - 0xfd # downstream port bus
0001:02:08.0 - 0xfd
0001:02:09.0 - 0xfd
0001:03:00.0 - 0x0  # NVMe
0001:09:00.0 - 0x1  # Ethernet

All the switch ports either have 32bit BARs or no BARs so they get
assigned PEs starting from the top. The NVMe and the Ethernet have
some prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
respectively since that's where their m64 BARs are located. For our
DMA-only slave PEs any MMIO space would remain in their master PE so we
can allocate a PE number for the DMA-PE (our IOMMU context).
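
i.e. something like this toy allocator (names and the PE count are
made up):

#include <stdbool.h>

#define MODEL_NUM_PE 256

static bool model_pe_used[MODEL_NUM_PE];

/* buses with no m64 BARs aren't tied to an m64 segment, so their PE
 * number is just taken from the top of the range, counting down
 * (0xff, 0xfe, ... as in the example above) */
static int model_alloc_pe_from_top(void)
{
        for (int pe = MODEL_NUM_PE - 1; pe >= 0; pe--) {
                if (!model_pe_used[pe]) {
                        model_pe_used[pe] = true;
                        return pe;
                }
        }
        return -1;              /* no PEs left */
}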

I think the key thing to realise is that we'd only be using the DMA-PE
when a crippled DMA mask is set by the driver. In all other cases we
can just use the "native PE" and when the driver unbinds we can de-
allocate our DMA-PE and return the device to the PE containing its
MMIO BARs. I think we can keep things relatively sane that way and the
real issue is detecting EEH events on the DMA-PE.

On P9 we don't have PHB error interrupts enabled in firmware so we're
completely reliant on seeing a 0xFF response to an MMIO and manually
checking the PE status to see if it's due to a PE freeze. For our DMA-
PE it could be frozen (due to a bad DMA) and we'd never notice unless
we explicitly check the status of the DMA-PE since there's no
corresponding MMIO space to freeze.

On P8 we had PHB Error interrupts so you would notice that *something*
happened, then go check for frozen PEs, at which point the master-slave
grouping logic would see that one PE in the group was frozen and freeze
the rest of them. We can re-use that here, but we still need
something to actually notice a freeze occurred. A background poller
checking for freezes on each PE might do the trick.
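
Hand-waving heavily, something like the below; the freeze query and
handler are stand-ins for whatever EEH/OPAL interfaces would really be
used:

#include <stdbool.h>

#define MODEL_NUM_PE 256

/* stand-in: would ask firmware/EEH whether the PE is frozen */
static bool model_pe_is_frozen(int pe) { (void)pe; return false; }
/* stand-in: would kick the existing master/slave freeze handling */
static void model_handle_frozen_pe(int pe) { (void)pe; }

/* run periodically (e.g. from a workqueue); unlike the MMIO 0xFF path
 * this also notices freezes on DMA-only PEs, which have no MMIO space
 * of their own to read back */
static void model_poll_pe_freezes(const bool *pe_in_use)
{
        for (int pe = 0; pe < MODEL_NUM_PE; pe++)
                if (pe_in_use[pe] && model_pe_is_frozen(pe))
                        model_handle_frozen_pe(pe);
}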

> So this is not happening soon.

Oh ye of little faith.

> For the time being, this patchset is good for:
> 1. weird hardware which has limited DMA mask (this is why the patchset
> was written in the first place)
> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).

Sure, but it's still dependent on having firmware which supports the
4GB hack and I don't think that's in any official firmware releases yet.

Oliver
Alexey Kardashevskiy April 21, 2020, 5:11 a.m. UTC | #8
On 21/04/2020 00:04, Oliver O'Halloran wrote:
> On Fri, 2020-04-17 at 15:47 +1000, Alexey Kardashevskiy wrote:
>>
>> On 17/04/2020 11:26, Russell Currey wrote:
>>>
>>> For what it's worth this sounds like a good idea to me, it just sounds
>>> tricky to implement.  You're adding another layer of complexity on top
>>> of EEH (well, making things look simple to the EEH core and doing your
>>> own freezing on top of it) in addition to the DMA handling.
>>>
>>> If it works then great, just has a high potential to become a new bug
>>> haven.
>>
>> imho putting every PCI function to a separate PE is the right thing to
>> do here but I've been told it is not that simple, and I believe that.
>> Reusing slave PEs seems unreliable - the configuration will depend on
>> whether a PE occupied enough segments to give an unique PE to a PCI
>> function and my little brain explodes.
> 
> You're overthinking it.
> 
> If a bus has no m64 MMIO space we're free to assign whatever PE number
> we want since the PE for the bus isn't fixed by the m64 segment its
> BARs were placed in. For those buses we assign a PE number starting
> from the max and counting down (0xff, 0xfe, etc). For example, with
> this PHB:
> 
> # lspci -s 1:: -v | egrep '0001:|Memory at'
> 0001:00:00.0 PCI bridge: IBM Device 04c1 (prog-if 00 [Normal decode])
> 0001:01:00.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
>         Memory at 600c081000000 (32-bit, non-prefetchable) [size=256K]
> 0001:02:01.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:08.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:02:09.0 PCI bridge: PLX Technology, Inc. Device 8725 (rev ca)
> (prog-if 00 [Normal decode])
> 0001:03:00.0 Non-Volatile memory controller: PMC-Sierra Inc. Device
> f117 (rev 06) (prog-if 02 [NVM Express])
>         Memory at 600c080000000 (64-bit, non-prefetchable) [size=16K]
>         Memory at 6004000000000 (64-bit, prefetchable) [size=1M]
> 0001:09:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> X710/X557-AT 10GBASE-T (rev 02)
>         Memory at 6004048000000 (64-bit, prefetchable) [size=8M]
>         Memory at 600404a000000 (64-bit, prefetchable) [size=32K]
> (redundant functions removed)
> 
> We get these PE assignments:
> 
> 0001:00:00.0 - 0xfe # Root port
> 0001:01:00.0 - 0xfc # upstream port
> 0001:02:01.0 - 0xfd # downstream port bus
> 0001:02:08.0 - 0xfd
> 0001:02:09.0 - 0xfd
> 0001:03:00.0 - 0x0  # NVMe
> 0001:09:00.0 - 0x1  # Ethernet
> 
> All the switch ports either have 32bit BARs or no BARs so they get
> assigned PEs starting from the top. The Ethernet and the NVMe have some
> prefetchable 64bit BARs so they have to be in PE 0x0 and 0x1
> respectively since that's where their m64 BARs are located. For our
> DMA-only slave PEs any MMIO space would remain in their master PE so we
> can allocate a PE number for the DMA-PE (our iommu context).


One example of a problem device is an AMD GPU with a 64bit video PCI
function and a 32bit audio function, no?

What PEs will they get assigned to now? Where will the audio's MMIO go?
It cannot be the same 64bit MMIO segment, right? If so, it is a separate
PE already. If not, then I do not understand "we're free to assign
whatever PE number we want".


> I think the key thing to realise is that we'd only be using the DMA-PE
> when a crippled DMA mask is set by the driver. In all other cases we
> can just use the "native PE" and when the driver unbinds we can de-
> allocate our DMA-PE and return the device to the PE containing it's
> MMIO BARs. I think we can keep things relatively sane that way and the
> real issue is detecting EEH events on the DMA-PE.


Oooor we could just have 1 DMA window (or, more precisely, a single
"TVE", as it is either a window or bypass) per PE and give every
function its own PE, creating a window or a table when a device sets a
DMA mask. I feel I am missing something here though.


> 
> On P9 we don't have PHB error interrupts enabled in firmware so we're
> completely reliant on seeing a 0xFF response to an MMIO and manually
> checking the PE status to see if it's due to a PE freeze. For our DMA-
> PE it could be frozen (due to a bad DMA) and we'd never notice unless
> we explicitly check the status of the DMA-PE since there's no
> corresponding MMIO space to freeze.
> 
> On P8 we had PHB Error interrupts so you would notice that *something*
> happened, then go check for frozen PEs, at which point the master-slave 
> grouping logic would see that one PE in the group was frozen and freeze
> the rest of them. We can re-use that on that, but we still need
> something to actually notice a freeze occured. A background poller
> checking for freezes on each PE might do the trick.
> 
>> So this is not happening soon.
> 
> Oh ye of little faith.
> 
>> For the time being, this patchset is good for:
>> 1. weird hardware which has limited DMA mask (this is why the patchset
>> was written in the first place)
>> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).
> 
> Sure, but it's still dependent on having firmware which supports the
> 4GB hack and I don't think that's in any offical firmware releases yet.

It's been a while :-/
Oliver O'Halloran April 21, 2020, 6:35 a.m. UTC | #9
On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> One example of a problem device is AMD GPU with 64bit video PCI function
> and 32bit audio, no?
>
> What PEs will they get assigned to now? Where will audio's MMIO go? It
> cannot be the same 64bit MMIO segment, right? If so, it is a separate PE
> already. If not, then I do not understand "we're free to assign whatever
> PE number we want.

The BARs stay in the same place and as far as MMIO is concerned
nothing has changed. For MMIO the PHB uses the MMIO address to find a
PE via the M64 BAR table, but for DMA it uses a *completely* different
mechanism. Instead it takes the BDFN (included in the DMA packet
header) and the Requester Translation Table (RTT) to map the BDFN to a
PE. Normally you would configure the PHB so the same PE is used for
MMIO and DMA, but you don't have to.
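
A cartoon of the two paths, with table sizes and names invented for
the example:

#include <stdint.h>

#define MODEL_M64_SEGMENTS 64
#define MODEL_RTT_ENTRIES  65536        /* indexed by 16-bit BDFN */

static uint16_t model_m64_segment_to_pe[MODEL_M64_SEGMENTS];
static uint16_t model_rtt[MODEL_RTT_ENTRIES];

/* MMIO: the PE comes from where the address lands in the m64 window */
static uint16_t model_pe_for_mmio(uint64_t addr, uint64_t m64_base,
                                  uint64_t seg_size)
{
        return model_m64_segment_to_pe[(addr - m64_base) / seg_size];
}

/* DMA: the PE comes from the requester's BDFN via the RTT, so DMA and
 * MMIO for the same function don't have to land in the same PE */
static uint16_t model_pe_for_dma(uint16_t bdfn)
{
        return model_rtt[bdfn];
}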

> > I think the key thing to realise is that we'd only be using the DMA-PE
> > when a crippled DMA mask is set by the driver. In all other cases we
> > can just use the "native PE" and when the driver unbinds we can de-
> > allocate our DMA-PE and return the device to the PE containing it's
> > MMIO BARs. I think we can keep things relatively sane that way and the
> > real issue is detecting EEH events on the DMA-PE.
>
>
> Oooor we could just have 1 DMA window (or, more precisely, a single
> "TVE" as it is either window or bypass) per a PE and give every function
> its own PE and create a window or a table when a device sets a DMA mask.
> I feel I am missing something here though.

Yes, we could do that, but do we want to?

I was thinking we should try to minimise the number of DMA-only PEs
since it complicates the EEH freeze handling. When MMIO and DMA are
mapped to the same PE an error on either will cause the hardware to
stop both. When separate PEs are used for DMA and MMIO you lose that
atomicity. It's not a big deal if DMA is stopped and MMIO allowed
since PAPR (sort-of) allows that, but having MMIO frozen with DMA
unfrozen is a bit sketch.

> >> For the time being, this patchset is good for:
> >> 1. weird hardware which has limited DMA mask (this is why the patchset
> >> was written in the first place)
> >> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).
> >
> > Sure, but it's still dependent on having firmware which supports the
> > 4GB hack and I don't think that's in any offical firmware releases yet.
>
> It's been a while :-/

There have been no official FW releases with a skiboot that supports
the phb get/set option opal calls so the only systems that can actually
take advantage of it are our lab systems. It might still be useful for
future systems, but I'd rather have something that doesn't depend on FW
support.


Oliver
Alexey Kardashevskiy April 22, 2020, 6:49 a.m. UTC | #10
On 21/04/2020 16:35, Oliver O'Halloran wrote:
> On Tue, Apr 21, 2020 at 3:11 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>> One example of a problem device is AMD GPU with 64bit video PCI function
>> and 32bit audio, no?
>>
>> What PEs will they get assigned to now? Where will audio's MMIO go? It
>> cannot be the same 64bit MMIO segment, right? If so, it is a separate PE
>> already. If not, then I do not understand "we're free to assign whatever
>> PE number we want.
> 
> The BARs stay in the same place and as far as MMIO is concerned
> nothing has changed. For MMIO the PHB uses the MMIO address to find a
> PE via the M64 BAR table, but for DMA it uses a *completely* different
> mechanism. Instead it takes the BDFN (included in the DMA packet
> header) and the Requester Translation Table (RTT) to map the BDFN to a
> PE. Normally you would configure the PHB so the same PE used for MMIO
> and DMA, but you don't have to.

32bit MMIO is what puzzles me in this picture - how does it work?


>>> I think the key thing to realise is that we'd only be using the DMA-PE
>>> when a crippled DMA mask is set by the driver. In all other cases we
>>> can just use the "native PE" and when the driver unbinds we can de-
>>> allocate our DMA-PE and return the device to the PE containing it's
>>> MMIO BARs. I think we can keep things relatively sane that way and the
>>> real issue is detecting EEH events on the DMA-PE.
>>
>>
>> Oooor we could just have 1 DMA window (or, more precisely, a single
>> "TVE" as it is either window or bypass) per a PE and give every function
>> its own PE and create a window or a table when a device sets a DMA mask.
>> I feel I am missing something here though.
> 
> Yes, we could do that, but do we want to?
> 
> I was thinking we should try minimise the number of DMA-only PEs since
> it complicates the EEH freeze handling. When MMIO and DMA are mapped
> to the same PE an error on either will cause the hardware to stop
> both. When seperate PEs are used for DMA and MMIO you lose that
> atomicity. It's not a big deal if DMA is stopped and MMIO allowed
> since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> unfrozen is a bit sketch.


You suggested using slave PEs for crippled functions - won't we have the
same problem then?
And is this "slave PE" something the hardware supports, or is it a
software concept?


>>>> For the time being, this patchset is good for:
>>>> 1. weird hardware which has limited DMA mask (this is why the patchset
>>>> was written in the first place)
>>>> 2. debug DMA by routing it via IOMMU (even when 4GB hack is not enabled).
>>>
>>> Sure, but it's still dependent on having firmware which supports the
>>> 4GB hack and I don't think that's in any offical firmware releases yet.
>>
>> It's been a while :-/
> 
> There's been no official FW releases with a skiboot that supports the
> phb get/set option opal calls so the only systems that can actually
> take advantage of it are our lab systems. It might still be useful for
> future systems, but I'd rather something that doesn't depend on FW
> support.

Pensando folks use it ;)
Oliver O'Halloran April 22, 2020, 9:11 a.m. UTC | #11
On Wed, Apr 22, 2020 at 4:49 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> 32bit MMIO is what puzzles me in this picture, how does it work?

For devices with no m64 we allocate a PE number as described above. In
the 32bit MMIO window we have a segment-to-PE remapping table so any
m32 segment can be assigned to any PE. As a result the slave PE concept
isn't really needed. If the BARs of a device span multiple m32
segments then we can setup the remapping table so that all the
segments point to the same PE.
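
In toy form (segment count and names invented):

#include <stdint.h>

#define MODEL_M32_SEGMENTS 256

static uint16_t model_m32_segment_to_pe[MODEL_M32_SEGMENTS];

/* any m32 segment can point at any PE, so a device whose 32bit BARs
 * span several segments simply has all of them mapped to its PE */
static void model_map_m32_segs_to_pe(unsigned int first, unsigned int count,
                                     uint16_t pe)
{
        for (unsigned int i = 0; i < count && first + i < MODEL_M32_SEGMENTS; i++)
                model_m32_segment_to_pe[first + i] = pe;
}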

> > I was thinking we should try minimise the number of DMA-only PEs since
> > it complicates the EEH freeze handling. When MMIO and DMA are mapped
> > to the same PE an error on either will cause the hardware to stop
> > both. When seperate PEs are used for DMA and MMIO you lose that
> > atomicity. It's not a big deal if DMA is stopped and MMIO allowed
> > since PAPR (sort-of) allows that, but having MMIO frozen with DMA
> > unfrozen is a bit sketch.
>
> You suggested using slave PEs for crippled functions - won't we have the
> same problem then?

Yes, but I think it's probably worth doing in that case. You get
slightly janky EEH in exchange for better DMA performance.

> And is this "slave PE" something the hardware supports or it is a
> software concept?

It's all in software. The hardware does have the PELT-V which allows
you to specify a group of PEs to additionally freeze when a PE is
frozen, but the PELT-V is only used when handling AER messages.  All
other error sources (DMAs, MMIOs, etc) will only freeze one PE (or all
of them in very rare cases).
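
For illustration, a toy version of that PELT-V behaviour (layout and
names invented):

#include <stdbool.h>

#define MODEL_NUM_PE 256

/* model_pelt_v[a][b] set means "also freeze PE b when PE a freezes" */
static bool model_pelt_v[MODEL_NUM_PE][MODEL_NUM_PE];
static bool model_pe_frozen[MODEL_NUM_PE];

/* only taken on AER messages in this model; DMA/MMIO errors freeze
 * just the one PE */
static void model_freeze_on_aer(int pe)
{
        model_pe_frozen[pe] = true;
        for (int other = 0; other < MODEL_NUM_PE; other++)
                if (model_pelt_v[pe][other])
                        model_pe_frozen[other] = true;
}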

> > There's been no official FW releases with a skiboot that supports the
> > phb get/set option opal calls so the only systems that can actually
> > take advantage of it are our lab systems. It might still be useful for
> > future systems, but I'd rather something that doesn't depend on FW
> > support.
>
> Pensando folks use it ;)

the what folks

Oliver