[v4,00/14] Copy Offload in NVMe Fabrics with P2P PCI Memory

Message ID 20180423233046.21476-1-logang@deltatee.com

Message

Logan Gunthorpe April 23, 2018, 11:30 p.m. UTC
Hi Everyone,

Here's v4 of our series to introduce P2P based copy offload to NVMe
fabrics. This version has been rebased onto v4.17-rc2. A git repo
is here:

https://github.com/sbates130272/linux-p2pmem pci-p2p-v4

Thanks,

Logan

Changes in v4:

* Change the original upstream_bridges_match() function to
  upstream_bridge_distance() which calculates the distance between two
  devices as long as they are behind the same root port. This should
  address Bjorn's concerns that the code was too focused on
  being behind a single switch.

* The disable ACS function now disables ACS for all bridge ports instead
  of just switch ports (i.e. those that had two upstream_bridge ports).

* Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
  API to be more like sgl_alloc() in that the alloc function returns
  the allocated scatterlist and nents is not required by the free
  function.

* Moved the new documentation into the driver-api tree as requested
  by Jonathan.

* Add SGL alloc and free helpers in the nvmet code so that the
  individual drivers can share the code that allocates P2P memory.
  As requested by Christoph.

* Cleanup the nvmet_p2pmem_store() function as Christoph
  thought my first attempt was ugly.

* Numerous commit message and comment fix-ups

Changes in v3:

* Many more fixes and minor cleanups that were spotted by Bjorn

* Additional explanation of the ACS change in both the commit message
  and Kconfig doc. Also, the code that disables the ACS bits is surrounded
  explicitly by an #ifdef

* Removed the flag we added to rdma_rw_ctx() in favour of using
  is_pci_p2pdma_page(), as suggested by Sagi.

* Adjust pci_p2pmem_find() so that it prefers P2P providers that
  are closest to (or the same as) the clients using them. In cases
  of ties, the provider is randomly chosen.

* Modify the NVMe Target code so that the PCI device name of the provider
  may be explicitly specified, bypassing the logic in pci_p2pmem_find().
  (Note: it's still enforced that the provider must be behind the
   same switch as the clients).

* As requested by Bjorn, added documentation for driver writers.


Changes in v2:

* Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
  as a bunch of cleanup and spelling fixes he pointed out in the last
  series.

* To address Alex's ACS concerns, we change to a simpler method of
  just disabling ACS behind switches for any kernel that has
  CONFIG_PCI_P2PDMA.

* We also reject using devices that employ 'dma_virt_ops' which should
  fairly simply handle Jason's concerns that this work might break with
  the HFI, QIB and rxe drivers that use the virtual ops to implement
  their own special DMA operations.

--

This is a continuation of our work to enable using Peer-to-Peer PCI
memory in the kernel with initial support for the NVMe fabrics target
subsystem. Many thanks go to Christoph Hellwig who provided valuable
feedback to get these patches to where they are today.

The concept here is to use memory that's exposed on a PCI BAR as
data buffers in the NVMe target code such that data can be transferred
from an RDMA NIC to the special memory and then directly to an NVMe
device, avoiding system memory entirely. The upsides of this are better
QoS for applications running on the CPU and utilizing system memory, and
lower PCI bandwidth required to the CPU (such that systems could be
designed with fewer lanes connected to the CPU).

Due to these trade-offs we've designed the system to only enable using
the PCI memory in cases where the NIC, NVMe devices and memory are all
behind the same PCI switch hierarchy. This will mean many setups that
could likely work well will not be supported so that we can be more
confident it will work and not place any responsibility on the user to
understand their topology. (We chose to go this route based on feedback
we received at the last LSF). Future work may enable these transfers
using a white list of known good root complexes. However, at this time,
there is no reliable way to ensure that Peer-to-Peer transactions are
permitted between PCI Root Ports.

In order to enable this functionality, we introduce a few new PCI
functions such that a driver can register P2P memory with the system.
Struct pages are created for this memory using devm_memremap_pages()
and the PCI bus offset is stored in the corresponding pagemap structure.
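
For illustration, a provider driver registering one of its BARs with
this infrastructure would look roughly like the sketch below. It is
based on the pci_p2pdma_add_resource() and pci_p2pmem_publish() helpers
added by this series, but the exact parameters and error handling are
simplified, so treat it as illustrative rather than authoritative.

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

static int example_setup_p2pmem(struct pci_dev *pdev)
{
        int rc;

        /* Create ZONE_DEVICE struct pages for all of BAR 4; the PCI bus
         * offset of the BAR is recorded in the corresponding pagemap. */
        rc = pci_p2pdma_add_resource(pdev, 4, pci_resource_len(pdev, 4), 0);
        if (rc)
                return rc;

        /* Make the memory discoverable by pci_p2pmem_find() users. */
        pci_p2pmem_publish(pdev, true);

        return 0;
}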

When the PCI P2PDMA config option is selected the ACS bits in every
bridge port in the system are turned off to allow traffic to
pass freely behind the root port. At this time, the bits must be disabled
at boot so the IOMMU subsystem can correctly create the groups, though
this could be addressed in the future. There is no way to dynamically
disable the bits and alter the groups.
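
For reference, clearing the ACS P2P redirect bits on a single bridge
port amounts to something like the sketch below, using the standard
config space accessors. In the series this is done centrally in the PCI
core behind CONFIG_PCI_P2PDMA rather than open-coded per driver; the
sketch only shows the register-level effect.

static void example_clear_acs_redirect(struct pci_dev *bridge)
{
        int pos = pci_find_ext_capability(bridge, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;

        if (!pos)
                return;

        /* Clearing P2P Request Redirect and Completion Redirect lets the
         * bridge route peer-to-peer TLPs directly instead of forcing them
         * upstream to the root complex/IOMMU. */
        pci_read_config_word(bridge, pos + PCI_ACS_CTRL, &ctrl);
        ctrl &= ~(PCI_ACS_RR | PCI_ACS_CR);
        pci_write_config_word(bridge, pos + PCI_ACS_CTRL, ctrl);
}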

Another set of functions allows a client driver to create a list of
client devices that will be used in a given P2P transaction and then
use that list to find any P2P memory that is supported by all the
client devices.
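
A client-side sketch of that flow is below: build a client list, then
ask for a provider every client can reach. The helper names
(pci_p2pdma_add_client(), pci_p2pdma_client_list_free()) are paraphrased
from the series, so treat the exact signatures as illustrative.

static struct pci_dev *example_find_provider(struct device *rdma_dev,
                                             struct device *nvme_dev)
{
        LIST_HEAD(clients);
        struct pci_dev *provider = NULL;

        /* Register every device that will participate in the transfers. */
        if (pci_p2pdma_add_client(&clients, rdma_dev) ||
            pci_p2pdma_add_client(&clients, nvme_dev))
                goto out;

        /* Returns a published provider usable by all clients, or NULL. */
        provider = pci_p2pmem_find(&clients);

out:
        pci_p2pdma_client_list_free(&clients);
        return provider;
}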

In the block layer, we also introduce a P2P request flag to indicate a
given request targets P2P memory as well as a flag for a request queue
to indicate a given queue supports targeting P2P memory. P2P requests
will only be accepted by queues that support it. Also, P2P requests
are marked not to be merged, since a non-homogeneous request would
complicate the DMA mapping requirements.
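
In code, a submitter would only mark an I/O as targeting P2P memory
when the queue has advertised support, roughly as below. The flag and
helper names are those introduced by the block patch in this series as
I understand them, so treat them as illustrative.

static void example_submit(struct request_queue *q, struct bio *bio,
                           bool data_is_p2pmem)
{
        if (data_is_p2pmem) {
                if (!blk_queue_pci_p2pdma(q)) {
                        /* The queue has not opted in to P2P pages. */
                        bio_io_error(bio);
                        return;
                }
                /* Mark the I/O as P2P; such requests are never merged. */
                bio->bi_opf |= REQ_PCI_P2PDMA;
        }

        generic_make_request(bio);
}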

In the PCI NVMe driver, we modify the existing CMB support to utilize
the new PCI P2P memory infrastructure and also add support for P2P
memory in its request queue. When a P2P request is received, it uses the
pci_p2pmem_map_sg() function, which applies the necessary transformation
to get the correct pci_bus_addr_t for the DMA transactions.
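
Conceptually, the transformation amounts to applying the provider's bus
offset to each segment instead of going through the normal DMA API, as
in this simplified sketch (the real logic lives in pci_p2pmem_map_sg();
the bus_offset parameter here stands in for the value recorded in the
pagemap when the BAR was registered):

/* bus_offset: the BAR's PCI bus address minus its CPU physical address. */
static void example_map_p2p_sg(struct scatterlist *sgl, int nents,
                               u64 bus_offset)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                sg_dma_address(sg) = sg_phys(sg) + bus_offset;
                sg_dma_len(sg) = sg->length;
        }
}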

In the RDMA core, we also adjust rdma_rw_ctx_init() and
rdma_rw_ctx_destroy() to detect P2P memory using is_pci_p2pdma_page()
and use the PCI P2P mapping functions for it (the extra flags argument
from earlier revisions was dropped in v3, per Sagi's suggestion). To
avoid odd RDMA devices that don't use the proper DMA infrastructure,
this code rejects any device that employs dma_virt_ops.
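
The decision in the rw code boils down to choosing the mapping path per
scatterlist, roughly as sketched here (the P2P helper signatures are
paraphrased from this series):

static int example_map_rw_sg(struct ib_device *ib_dev, struct device *dma_dev,
                             struct scatterlist *sg, u32 sg_cnt,
                             enum dma_data_direction dir)
{
        /* P2P pages must not go through the usual DMA API (or dma_virt_ops);
         * map them with the PCI P2PDMA helper instead. */
        if (is_pci_p2pdma_page(sg_page(sg)))
                return pci_p2pmem_map_sg(dma_dev, sg, sg_cnt);

        return ib_dma_map_sg(ib_dev, sg, sg_cnt, dir);
}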

Finally, in the NVMe fabrics target port we introduce a new
configuration boolean: 'allow_p2pmem'. When set, the port will attempt
to find P2P memory supported by the RDMA NIC and all namespaces. If
supported memory is found, it will be used in all IO transfers. And if
a port is using P2P memory, adding new namespaces that are not supported
by that memory will fail.
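
The resulting allocation policy in the target looks roughly like the
sketch below: prefer the port's P2P provider when one was found,
otherwise fall back to ordinary system memory via sgl_alloc(). The
pci_p2pmem_alloc_sgl() argument order shown here is illustrative.

static struct scatterlist *example_alloc_req_sgl(struct pci_dev *p2p_dev,
                                                 u32 len, unsigned int *nents)
{
        struct scatterlist *sg = NULL;

        /* p2p_dev is non-NULL only when the port has allow_p2pmem set and
         * a compatible provider was found for the NIC and all namespaces. */
        if (p2p_dev)
                sg = pci_p2pmem_alloc_sgl(p2p_dev, nents, len);

        if (!sg)
                sg = sgl_alloc(len, GFP_KERNEL, nents);

        return sg;
}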

These patches have been tested on a number of Intel-based systems and
with a variety of RDMA NICs (Mellanox, Broadcom, Chelsio), NVMe
SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
Microsemi, Chelsio and Everspin), using switches from both Microsemi
and Broadcom.

Logan Gunthorpe (14):
  PCI/P2PDMA: Support peer-to-peer memory
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
  PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  block: Introduce PCI P2P flags for request and request queue
  IB/core: Ensure we map P2P memory correctly in
    rdma_rw_ctx_[init|destroy]()
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Add a quirk for a pseudo CMB
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvmet-rdma: Use new SGL alloc/free helper for requests
  nvmet: Optionally use PCI P2P memory

 Documentation/ABI/testing/sysfs-bus-pci    |  25 +
 Documentation/PCI/index.rst                |  14 +
 Documentation/driver-api/index.rst         |   2 +-
 Documentation/driver-api/pci/index.rst     |  20 +
 Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
 Documentation/driver-api/{ => pci}/pci.rst |   0
 Documentation/index.rst                    |   3 +-
 block/blk-core.c                           |   3 +
 drivers/infiniband/core/rw.c               |  13 +-
 drivers/nvme/host/core.c                   |   4 +
 drivers/nvme/host/nvme.h                   |   8 +
 drivers/nvme/host/pci.c                    | 118 +++--
 drivers/nvme/target/configfs.c             |  67 +++
 drivers/nvme/target/core.c                 | 143 ++++-
 drivers/nvme/target/io-cmd.c               |   3 +
 drivers/nvme/target/nvmet.h                |  15 +
 drivers/nvme/target/rdma.c                 |  22 +-
 drivers/pci/Kconfig                        |  26 +
 drivers/pci/Makefile                       |   1 +
 drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
 drivers/pci/pci.c                          |   6 +
 include/linux/blk_types.h                  |  18 +-
 include/linux/blkdev.h                     |   3 +
 include/linux/memremap.h                   |  19 +
 include/linux/pci-p2pdma.h                 | 118 +++++
 include/linux/pci.h                        |   4 +
 26 files changed, 1579 insertions(+), 56 deletions(-)
 create mode 100644 Documentation/PCI/index.rst
 create mode 100644 Documentation/driver-api/pci/index.rst
 create mode 100644 Documentation/driver-api/pci/p2pdma.rst
 rename Documentation/driver-api/{ => pci}/pci.rst (100%)
 create mode 100644 drivers/pci/p2pdma.c
 create mode 100644 include/linux/pci-p2pdma.h

--
2.11.0

Comments

Christian König May 2, 2018, 11:51 a.m. UTC | #1
Hi Logan,

it would be rather nice if you could separate out the functions
to detect whether peer2peer is possible between two devices.

That would allow me to reuse the same logic for GPU peer2peer where I 
don't really have ZONE_DEVICE.

Regards,
Christian.

Am 24.04.2018 um 01:30 schrieb Logan Gunthorpe:
> Hi Everyone,
>
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
>
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...
Logan Gunthorpe May 2, 2018, 3:56 p.m. UTC | #2
Hi Christian,

On 5/2/2018 5:51 AM, Christian König wrote:
> it would be rather nice to have if you could separate out the functions 
> to detect if peer2peer is possible between two devices.

This would essentially be pci_p2pdma_distance() in the existing 
patchset. It returns the sum of the distance between a list of clients 
and a P2PDMA provider. It returns -1 if peer2peer is not possible 
between the devices (presently this means they are not behind the same 
root port).

Logan
Christian König May 3, 2018, 9:05 a.m. UTC | #3
Am 02.05.2018 um 17:56 schrieb Logan Gunthorpe:
> Hi Christian,
>
> On 5/2/2018 5:51 AM, Christian König wrote:
>> it would be rather nice to have if you could separate out the 
>> functions to detect if peer2peer is possible between two devices.
>
> This would essentially be pci_p2pdma_distance() in the existing 
> patchset. It returns the sum of the distance between a list of clients 
> and a P2PDMA provider. It returns -1 if peer2peer is not possible 
> between the devices (presently this means they are not behind the same 
> root port).

Ok, I'm still missing the big picture here. First question is what is 
the P2PDMA provider?

Second question is how do you want to handle things when devices are not
behind the same root port (which is perfectly possible in the cases I 
deal with)?

Third question why multiple clients? That feels a bit like you are 
pushing something special to your use case into the common PCI 
subsystem. Something which usually isn't a good idea.



As far as I can see we need a function which returns the distance between
an initiator and a target device. This function then returns -1 if the
transaction can't be made and a positive value otherwise.

We also need to give the direction of the transaction and have a 
whitelist of root complex PCI IDs which can handle P2P transactions from
different ports for a certain DMA direction.


Christian.

>
> Logan
Logan Gunthorpe May 3, 2018, 3:59 p.m. UTC | #4
On 03/05/18 03:05 AM, Christian König wrote:
> Ok, I'm still missing the big picture here. First question is what is 
> the P2PDMA provider?

Well there's some pretty good documentation in the patchset for this,
but in short, a provider is a device that provides some kind of P2P
resource (ie. BAR memory, or perhaps a doorbell register -- only memory
is supported at this time).

> Second question is how to you want to handle things when device are not 
> behind the same root port (which is perfectly possible in the cases I 
> deal with)?

I think we need to implement a whitelist. If both root ports are in the
white list and are on the same bus then we return a larger distance
instead of -1.

> Third question why multiple clients? That feels a bit like you are 
> pushing something special to your use case into the common PCI 
> subsystem. Something which usually isn't a good idea.

No, I think this will be pretty standard. In the simple general case you
are going to have one provider and at least two clients (one which
writes the memory and one which reads it). However, one client is
likely, but not necessarily, the same as the provider.

In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
devices. The code doesn't care which device provides the memory as it
could be the RDMA device or one/all of the block devices (or, in theory,
a completely separate device with P2P-able memory). However, it does
require that all devices involved are accessible per
pci_p2pdma_distance() or it won't use P2P transactions.

I could also imagine other use cases: ie. an RDMA NIC sends data to a
GPU for processing and then sends the data to an NVMe device for storage
(or vice-versa). In this case we have 3 clients and one provider.

> As far as I can see we need a function which return the distance between 
> a initiator and target device. This function then returns -1 if the 
> transaction can't be made and a positive value otherwise.

If you need to make a simpler convenience function for your use case I'm
not against it.

> We also need to give the direction of the transaction and have a 
> whitelist root complex PCI-IDs which can handle P2P transactions from 
> different ports for a certain DMA direction.

Yes. In the NVMeof case we need all devices to be able to DMA in both
directions so we did not need the DMA direction. But I can see this
being useful once we add the whitelist.

Logan
Christian König May 3, 2018, 5:29 p.m. UTC | #5
Am 03.05.2018 um 17:59 schrieb Logan Gunthorpe:
> On 03/05/18 03:05 AM, Christian König wrote:
>> Second question is how to you want to handle things when device are not
>> behind the same root port (which is perfectly possible in the cases I
>> deal with)?
> I think we need to implement a whitelist. If both root ports are in the
> white list and are on the same bus then we return a larger distance
> instead of -1.

Sounds good.

>> Third question why multiple clients? That feels a bit like you are
>> pushing something special to your use case into the common PCI
>> subsystem. Something which usually isn't a good idea.
> No, I think this will be pretty standard. In the simple general case you
> are going to have one provider and at least two clients (one which
> writes the memory and one which reads it). However, one client is
> likely, but not necessarily, the same as the provider.

Ok, that is the point where I'm stuck. Why do we need that in one 
function call in the PCIe subsystem?

The problem at least with GPUs is that we seriously don't have that 
information here, cause the PCI subsystem might not be aware of all the 
interconnections.

For example it isn't uncommon to put multiple GPUs on one board. To the 
PCI subsystem that looks like separate devices, but in reality all GPUs 
are interconnected and can access each other's memory directly without
going over the PCIe bus.

I seriously don't want to model that in the PCI subsystem, but rather 
the driver. That's why it feels like a mistake to me to push all that 
into the PCI function.

> In the NVMeof case, we might have N clients: 1 RDMA device and N-1 block
> devices. The code doesn't care which device provides the memory as it
> could be the RDMA device or one/all of the block devices (or, in theory,
> a completely separate device with P2P-able memory). However, it does
> require that all devices involved are accessible per
> pci_p2pdma_distance() or it won't use P2P transactions.
>
> I could also imagine other use cases: ie. an RDMA NIC sends data to a
> GPU for processing and then sends the data to an NVMe device for storage
> (or vice-versa). In this case we have 3 clients and one provider.

Why can't we model that as two separate transactions?

E.g. one from the RDMA NIC to the GPU memory. And another one from the 
GPU memory to the NVMe device.

That would also match how I get this information from userspace.

>> As far as I can see we need a function which return the distance between
>> a initiator and target device. This function then returns -1 if the
>> transaction can't be made and a positive value otherwise.
> If you need to make a simpler convenience function for your use case I'm
> not against it.

Yeah, same for me. If Bjorn is ok with those specialized NVM functions
then I'm fine with that as well.

I think it would just be more convenient if we can come up with
functions which can handle all use cases, cause there still seem to be
a lot of similarities.

>
>> We also need to give the direction of the transaction and have a
>> whitelist root complex PCI-IDs which can handle P2P transactions from
>> different ports for a certain DMA direction.
> Yes. In the NVMeof case we need all devices to be able to DMA in both
> directions so we did not need the DMA direction. But I can see this
> being useful once we add the whitelist.

Ok, I agree that can be added later on. For simplicity let's assume for
now we always do bidirectional transfers.

Thanks for the explanation,
Christian.

>
> Logan
Logan Gunthorpe May 3, 2018, 6:43 p.m. UTC | #6
On 03/05/18 11:29 AM, Christian König wrote:
> Ok, that is the point where I'm stuck. Why do we need that in one 
> function call in the PCIe subsystem?
> 
> The problem at least with GPUs is that we seriously don't have that 
> information here, cause the PCI subsystem might not be aware of all the 
> interconnections.
> 
> For example it isn't uncommon to put multiple GPUs on one board. To the 
> PCI subsystem that looks like separate devices, but in reality all GPUs 
> are interconnected and can access each others memory directly without 
> going over the PCIe bus.
> 
> I seriously don't want to model that in the PCI subsystem, but rather 
> the driver. That's why it feels like a mistake to me to push all that 
> into the PCI function.

Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
list to this API, if you want. If the driver is _sure_ they are all the
same, you only have to send one. In your terminology, you'd just have to
call the interface with:

pci_p2pdma_distance(target, [initiator, target])

> Why can't we model that as two separate transactions?

You could, but this is more convenient for users of the API that need to
deal with multiple devices (and manage devices that may be added or
removed at any time).

> Yeah, same for me. If Bjorn is ok with that specialized NVM functions 
> that I'm fine with that as well.
> 
> I think it would just be more convenient when we can come up with 
> functions which can handle all use cases, cause there still seems to be 
> a lot of similarities.

The way it's implemented is more general and can handle all use cases.
You are arguing for a function that can handle your case (albeit with a
bit more fuss) but can't handle mine and is therefore less general.
Calling my interface specialized is wrong.

Logan
Christian König May 4, 2018, 2:27 p.m. UTC | #7
Am 03.05.2018 um 20:43 schrieb Logan Gunthorpe:
>
> On 03/05/18 11:29 AM, Christian König wrote:
>> Ok, that is the point where I'm stuck. Why do we need that in one
>> function call in the PCIe subsystem?
>>
>> The problem at least with GPUs is that we seriously don't have that
>> information here, cause the PCI subsystem might not be aware of all the
>> interconnections.
>>
>> For example it isn't uncommon to put multiple GPUs on one board. To the
>> PCI subsystem that looks like separate devices, but in reality all GPUs
>> are interconnected and can access each others memory directly without
>> going over the PCIe bus.
>>
>> I seriously don't want to model that in the PCI subsystem, but rather
>> the driver. That's why it feels like a mistake to me to push all that
>> into the PCI function.
> Huh? I'm lost. If you have a bunch of PCI devices you can send them as a
> list to this API, if you want. If the driver is _sure_ they are all the
> same, you only have to send one. In your terminology, you'd just have to
> call the interface with:
>
> pci_p2pdma_distance(target, [initiator, target])

Ok, I expected that something like that would do it.

So just to confirm: When I have a bunch of GPUs which could be the 
initiator I only need to do "pci_p2pdma_distance(target, [first GPU, 
target]);" and not "pci_p2pdma_distance(target, [first GPU, second GPU, 
third GPU, forth...., target])" ?

>> Why can't we model that as two separate transactions?
> You could, but this is more convenient for users of the API that need to
> deal with multiple devices (and manage devices that may be added or
> removed at any time).

Are you sure that this is more convenient? At least on first glance it 
feels overly complicated.

I mean what's the difference between the two approaches?

     sum = pci_p2pdma_distance(target, [A, B, C, target]);

and

     sum = pci_p2pdma_distance(target, A);
     sum += pci_p2pdma_distance(target, B);
     sum += pci_p2pdma_distance(target, C);

>> Yeah, same for me. If Bjorn is ok with that specialized NVM functions
>> that I'm fine with that as well.
>>
>> I think it would just be more convenient when we can come up with
>> functions which can handle all use cases, cause there still seems to be
>> a lot of similarities.
> The way it's implemented is more general and can handle all use cases.
> You are arguing for a function that can handle your case (albeit with a
> bit more fuss) but can't handle mine and is therefore less general.
> Calling my interface specialized is wrong.

Well at the end of the day you only need to convince Bjorn of the 
interface, so I'm perfectly fine with it as long as it serves my use 
case as well :)

But I still would like to understand your intention, cause that really 
helps not to accidentally break something in the long term.

Now when I take a look at the pure PCI hardware level, what I have is a 
transaction between an initiator and a target, and not multiple devices 
in one operation.

I mean you must have a very good reason that you now want to deal with 
multiple devices in the software layer, but neither from the code nor 
from your explanation that reason becomes obvious to me.

Thanks,
Christian.

>
> Logan
Logan Gunthorpe May 4, 2018, 3:52 p.m. UTC | #8
On 04/05/18 08:27 AM, Christian König wrote:
> Are you sure that this is more convenient? At least on first glance it 
> feels overly complicated.
> 
> I mean what's the difference between the two approaches?
> 
>      sum = pci_p2pdma_distance(target, [A, B, C, target]);
> 
> and
> 
>      sum = pci_p2pdma_distance(target, A);
>      sum += pci_p2pdma_distance(target, B);
>      sum += pci_p2pdma_distance(target, C);

Well, it's more for consistency with pci_p2pmem_find(), which has to
take a list of devices to find a resource which matches all of them.
(You can't use multiple calls in that case because all the devices in
the list might not have the same set of compatible providers.) That way
we can use the same list to check the distance (when the user specifies
a device) as we do to find a compatible device (when the user wants to
automatically find one).

Logan
Bjorn Helgaas May 7, 2018, 11:23 p.m. UTC | #9
On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> Hi Everyone,
> 
> Here's v4 of our series to introduce P2P based copy offload to NVMe
> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> is here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> ...

> Logan Gunthorpe (14):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
>   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Introduce PCI P2P flags for request and request queue
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet-rdma: Use new SGL alloc/free helper for requests
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  25 +
>  Documentation/PCI/index.rst                |  14 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  20 +
>  Documentation/driver-api/pci/p2pdma.rst    | 166 ++++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  Documentation/index.rst                    |   3 +-
>  block/blk-core.c                           |   3 +
>  drivers/infiniband/core/rw.c               |  13 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 118 +++--
>  drivers/nvme/target/configfs.c             |  67 +++
>  drivers/nvme/target/core.c                 | 143 ++++-
>  drivers/nvme/target/io-cmd.c               |   3 +
>  drivers/nvme/target/nvmet.h                |  15 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  26 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 814 +++++++++++++++++++++++++++++
>  drivers/pci/pci.c                          |   6 +
>  include/linux/blk_types.h                  |  18 +-
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |  19 +
>  include/linux/pci-p2pdma.h                 | 118 +++++
>  include/linux/pci.h                        |   4 +
>  26 files changed, 1579 insertions(+), 56 deletions(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

How do you envision merging this?  There's a big chunk in drivers/pci, but
really no opportunity for conflicts there, and there's significant stuff in
block and nvme that I don't really want to merge.

If Alex is OK with the ACS situation, I can ack the PCI parts and you could
merge it elsewhere?

Bjorn
Logan Gunthorpe May 7, 2018, 11:34 p.m. UTC | #10
> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

Honestly, I don't know. I guess with your ACK on the PCI parts, the vast
balance is NVMe stuff so we could look at merging it through that tree.
The block patch and IB patch are pretty small.

Thanks,

Logan
Alex Williamson May 8, 2018, 4:57 p.m. UTC | #11
On Mon, 7 May 2018 18:23:46 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
> > Hi Everyone,
> > 
> > Here's v4 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.17-rc2. A git repo
> > is here:
> > 
> > https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> > ...
> 
> How do you envison merging this?  There's a big chunk in drivers/pci, but
> really no opportunity for conflicts there, and there's significant stuff in
> block and nvme that I don't really want to merge.
> 
> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> merge it elsewhere?

AIUI from previously questioning this, the change is hidden behind a
build-time config option and only custom kernels or distros optimized
for this sort of support would enable that build option.  I'm more than
a little dubious though that we're not going to have a wave of distros
enabling this only to get user complaints that they can no longer make
effective use of their devices for assignment due to the resulting span
of the IOMMU groups, nor is there any sort of compromise: configure
the kernel for p2p or device assignment, not both.  Is this really such
a unique feature that distro users aren't going to be asking for both
features?  Thanks,

Alex
Logan Gunthorpe May 8, 2018, 7:14 p.m. UTC | #12
On 08/05/18 10:57 AM, Alex Williamson wrote:
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,

I think it is. But it sounds like the majority want this to be a command
line option. So we will look at doing that for v5.

Logan
Don Dutile May 8, 2018, 9:25 p.m. UTC | #13
On 05/08/2018 12:57 PM, Alex Williamson wrote:
> On Mon, 7 May 2018 18:23:46 -0500
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
>> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:
>>> Hi Everyone,
>>>
>>> Here's v4 of our series to introduce P2P based copy offload to NVMe
>>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
>>> is here:
>>>
>>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
>>> ...
>>
>> How do you envison merging this?  There's a big chunk in drivers/pci, but
>> really no opportunity for conflicts there, and there's significant stuff in
>> block and nvme that I don't really want to merge.
>>
>> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
>> merge it elsewhere?
> 
> AIUI from previously questioning this, the change is hidden behind a
> build-time config option and only custom kernels or distros optimized
> for this sort of support would enable that build option.  I'm more than
> a little dubious though that we're not going to have a wave of distros
> enabling this only to get user complaints that they can no longer make
> effective use of their devices for assignment due to the resulting span
> of the IOMMU groups, nor is there any sort of compromise, configure
> the kernel for p2p or device assignment, not both.  Is this really such
> a unique feature that distro users aren't going to be asking for both
> features?  Thanks,
> 
> Alex
At least half of the cases presented to me by existing customers want it in a tunable kernel,
and tunable between two points, if the hw allows it to be 'contained' in that manner, which
a (layer of) switch(ing) provides.
To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... I'm not
enough of a configfs aficionado to say which is best) method to make two points p2p DMA capable.

Worst case, the whole system is one large IOMMU group (the current mindset of this static or run-time config option);
best case (over time, with more hw), a secure set of the primary system with p2p-enabled sections that are deemed
'safe' or 'self-inflicting-unsecure' -- the latter being the case of today's VM with an assigned device, which can
scribble all over the VM, but not other VMs and not the host/HV.


Alex Williamson May 8, 2018, 9:40 p.m. UTC | #14
On Tue, 8 May 2018 17:25:24 -0400
Don Dutile <ddutile@redhat.com> wrote:

> On 05/08/2018 12:57 PM, Alex Williamson wrote:
> > On Mon, 7 May 2018 18:23:46 -0500
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> >   
> >> On Mon, Apr 23, 2018 at 05:30:32PM -0600, Logan Gunthorpe wrote:  
> >>> Hi Everyone,
> >>>
> >>> Here's v4 of our series to introduce P2P based copy offload to NVMe
> >>> fabrics. This version has been rebased onto v4.17-rc2. A git repo
> >>> is here:
> >>>
> >>> https://github.com/sbates130272/linux-p2pmem pci-p2p-v4
> >>> ...
> >>
> >> How do you envison merging this?  There's a big chunk in drivers/pci, but
> >> really no opportunity for conflicts there, and there's significant stuff in
> >> block and nvme that I don't really want to merge.
> >>
> >> If Alex is OK with the ACS situation, I can ack the PCI parts and you could
> >> merge it elsewhere?  
> > 
> > AIUI from previously questioning this, the change is hidden behind a
> > build-time config option and only custom kernels or distros optimized
> > for this sort of support would enable that build option.  I'm more than
> > a little dubious though that we're not going to have a wave of distros
> > enabling this only to get user complaints that they can no longer make
> > effective use of their devices for assignment due to the resulting span
> > of the IOMMU groups, nor is there any sort of compromise, configure
> > the kernel for p2p or device assignment, not both.  Is this really such
> > a unique feature that distro users aren't going to be asking for both
> > features?  Thanks,
> > 
> > Alex  
> At least 1/2 the cases presented to me by existing customers want it in a tunable kernel,
> and tunable btwn two points, if the hw allows it to be 'contained' in that manner, which
> a (layer of) switch(ing) provides.
> To me, that means a kernel cmdline parameter to _enable_, and another sysfs (configfs? ... i'm not a configfs afficionato to say which is best),
> method to make two points p2p dma capable.

That's not what's done here AIUI.  There are also some complications to
making IOMMU groups dynamic, for instance could a downstream endpoint
already be in use by a userspace tool as ACS is being twiddled in
sysfs?  Probably the easiest solution would be that all devices
affected by the ACS change are soft unplugged before and re-added after
the ACS change.  Note that "affected" is not necessarily only the
downstream devices if the downstream port at which we're playing with
ACS is part of a multifunction device.  Thanks,

Alex