[0/3] Add support for Block Passthrough Endpoint function driver

Message ID 20240224210409.112333-1-wafgo01@gmail.com

Message

Wadim Mueller Feb. 24, 2024, 9:03 p.m. UTC
Hello,

This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
PCI Block Device Passthrough allows a Linux device running in EP mode to expose its block devices to the PCI(e) host (RC). The device can export either the full disk or only selected partitions.
An export in read-only mode is also possible. This is useful if you want to share the same block device between different SoCs, giving each SoC its own partition(s).


Block Passthrough
==================
PCI Block Passthrough is a useful feature if you have multiple SoCs in your system connected
through a PCI(e) link, one running in RC mode and the other in EP mode,
where the block devices are connected to one SoC (SoC2 in EP mode in the diagram below) and you want to access
them from the other SoC (SoC1 in RC mode below), which has no direct connection to
those block devices (e.g. if you want to share an NVMe drive between two SoCs). A simple example of such a configuration is shown below:


                                                           +-------------+
                                                           |             |
                                                           |   SD Card   |
                                                           |             |
                                                           +------^------+
                                                                  |
                                                                  |
    +--------------------------+                +-----------------v----------------+
    |                          |      PCI(e)    |                                  |
    |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
    | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
    |                          |                |                                  |
    +--------------------------+                +-----------------^----------------+
                                                                  |
                                                                  |
                                                           +------v------+
                                                           |             |
                                                           |    NVMe     |
                                                           |             |
                                                           +-------------+


This is, to a certain extent, similar to the functionality which NBD provides over the network, but on the PCI(e) bus, utilizing the EPC/EPF kernel framework.

The endpoint function driver creates parallel queues which run on separate CPU cores using percpu structures. The number of parallel queues is limited
by the number of CPUs on the EP device. The actual number of queues is configurable (as are all other features of the driver) through configfs.
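
As an illustration of the queueing scheme, here is a minimal kernel-C sketch with made-up structure and function names (not the actual code from this series) of how per-CPU queue contexts could be allocated and initialized:

  #include <linux/cpumask.h>
  #include <linux/errno.h>
  #include <linux/percpu.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  /* Hypothetical per-queue context; the real driver's layout differs. */
  struct passthru_queue {
          spinlock_t lock;        /* serializes access to this CPU's ring */
          u32 head;
          u32 tail;
  };

  static struct passthru_queue __percpu *queues;

  static int passthru_alloc_queues(void)
  {
          int cpu;

          /*
           * One queue context per possible CPU; the usable queue count is
           * therefore bounded by the number of CPUs on the EP side.
           */
          queues = alloc_percpu(struct passthru_queue);
          if (!queues)
                  return -ENOMEM;

          for_each_possible_cpu(cpu) {
                  struct passthru_queue *q = per_cpu_ptr(queues, cpu);

                  spin_lock_init(&q->lock);
                  q->head = 0;
                  q->tail = 0;
          }

          return 0;
  }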

Documentation covering the functional description as well as a user guide showing how both drivers can be configured is part of this series.

Test setup
==========

This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.

A performance measurement on the described setup shows good results. The S32G2 SoC has a 2xGen3 link with a maximum bandwidth of ~2GiB/s.
With this setup, a read data rate of 1.3GiB/s (with DMA; without DMA the speed saturated at ~200MiB/s) was achieved using a 512GiB Kingston NVMe
when accessing the NVMe from the ARM64 (SoC1) host. The local read data rate when accessing the NVMe directly from the S32G2 (SoC2) was around 1.5GiB/s.

The measurement was done with the fio tool [1] using 4kiB blocks.

[1] https://linux.die.net/man/1/fio

Wadim Mueller (3):
  PCI: Add PCI Endpoint function driver for Block-device passthrough
  PCI: Add PCI driver for a PCI EP remote Blockdevice
  Documentation: PCI: Add documentation for the PCI Block Passthrough

 .../function/binding/pci-block-passthru.rst   |   24 +
 Documentation/PCI/endpoint/index.rst          |    3 +
 .../pci-endpoint-block-passthru-function.rst  |  331 ++++
 .../pci-endpoint-block-passthru-howto.rst     |  158 ++
 MAINTAINERS                                   |    8 +
 drivers/block/Kconfig                         |   14 +
 drivers/block/Makefile                        |    1 +
 drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
 drivers/pci/endpoint/functions/Kconfig        |   12 +
 drivers/pci/endpoint/functions/Makefile       |    1 +
 .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
 include/linux/pci-epf-block-passthru.h        |   77 +
 12 files changed, 3069 insertions(+)
 create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
 create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
 create mode 100644 drivers/block/pci-remote-disk.c
 create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
 create mode 100644 include/linux/pci-epf-block-passthru.h

Comments

Manivannan Sadhasivam Feb. 25, 2024, 4:09 p.m. UTC | #1
On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> Hello,
> 
> This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> 
> 
> Block Passthrough
> ==================
> The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> through a PCI(e) link, one running in RC mode, the other in EP mode.
> If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> 
> 
>                                                            +-------------+
>                                                            |             |
>                                                            |   SD Card   |
>                                                            |             |
>                                                            +------^------+
>                                                                   |
>                                                                   |
>     +--------------------------+                +-----------------v----------------+
>     |                          |      PCI(e)    |                                  |
>     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
>     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
>     |                          |                |                                  |
>     +--------------------------+                +-----------------^----------------+
>                                                                   |
>                                                                   |
>                                                            +------v------+
>                                                            |             |
>                                                            |    NVMe     |
>                                                            |             |
>                                                            +-------------+
> 
> 
> This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> 
> The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> 
> A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> 
> Test setup
> ==========
> 
> This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> 
> A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> 
> The measurement was done through the FIO tool [1] with 4kiB Blocks.
> 
> [1] https://linux.die.net/man/1/fio
> 

Thanks for the proposal! We are planning to add virtio function support to
the endpoint subsystem to cover use cases like this. I think your use case can be
satisfied using virtio-blk. Maybe you can add the virtio-blk endpoint function
support once we have the infra in place. Thoughts?

- Mani

> Wadim Mueller (3):
>   PCI: Add PCI Endpoint function driver for Block-device passthrough
>   PCI: Add PCI driver for a PCI EP remote Blockdevice
>   Documentation: PCI: Add documentation for the PCI Block Passthrough
> 
>  .../function/binding/pci-block-passthru.rst   |   24 +
>  Documentation/PCI/endpoint/index.rst          |    3 +
>  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
>  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
>  MAINTAINERS                                   |    8 +
>  drivers/block/Kconfig                         |   14 +
>  drivers/block/Makefile                        |    1 +
>  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
>  drivers/pci/endpoint/functions/Kconfig        |   12 +
>  drivers/pci/endpoint/functions/Makefile       |    1 +
>  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
>  include/linux/pci-epf-block-passthru.h        |   77 +
>  12 files changed, 3069 insertions(+)
>  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
>  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
>  create mode 100644 drivers/block/pci-remote-disk.c
>  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
>  create mode 100644 include/linux/pci-epf-block-passthru.h
> 
> -- 
> 2.25.1
>
Wadim Mueller Feb. 25, 2024, 8:39 p.m. UTC | #2
On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > Hello,
> > 
> > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > 
> > 
> > Block Passthrough
> > ==================
> > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > 
> > 
> >                                                            +-------------+
> >                                                            |             |
> >                                                            |   SD Card   |
> >                                                            |             |
> >                                                            +------^------+
> >                                                                   |
> >                                                                   |
> >     +--------------------------+                +-----------------v----------------+
> >     |                          |      PCI(e)    |                                  |
> >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> >     |                          |                |                                  |
> >     +--------------------------+                +-----------------^----------------+
> >                                                                   |
> >                                                                   |
> >                                                            +------v------+
> >                                                            |             |
> >                                                            |    NVMe     |
> >                                                            |             |
> >                                                            +-------------+
> > 
> > 
> > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > 
> > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > 
> > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > 
> > Test setup
> > ==========
> > 
> > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > 
> > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > 
> > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > 
> > [1] https://linux.die.net/man/1/fio
> > 
> 
> Thanks for the proposal! We are planning to add virtio function support to
> endpoint subsystem to cover usecases like this. I think your usecase can be
> satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> support once we have the infra in place. Thoughts?
> 
> - Mani
>

Hi Mani,
I initially had the plan to implement virtio-blk as an endpoint
function driver instead of a self-baked driver.

This would certainly be more elegant, as we could reuse the
virtio-blk PCI driver instead of implementing a new one (as I did).

But I initially had some concerns about the feasibility, especially
that the virtio-blk PCI driver expects immediate responses to some
register writes which I would not be able to satisfy, simply because we
do not have any kind of interrupt/event which would be triggered on the
EP side when the RC accesses some BAR registers (at least there is
no mechanism I know of). As virtio is made mainly for Hypervisor <->
Guest communication, I was afraid it relies on the hypervisor being able to trap every
register access from the guest and act accordingly, which I would not be
able to do. I hope this makes sense to you.

But to make a long story short: yes, I agree with you that virtio-blk
would satisfy my use case, and I generally think it would be a better
solution; I just did not know that you are working on some
infrastructure for that. And yes, I would like to implement the endpoint
function driver for virtio-blk. Is there already a development tree you
use to work on the infrastructure that I could have a look at?

- Wadim



> > Wadim Mueller (3):
> >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > 
> >  .../function/binding/pci-block-passthru.rst   |   24 +
> >  Documentation/PCI/endpoint/index.rst          |    3 +
> >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> >  MAINTAINERS                                   |    8 +
> >  drivers/block/Kconfig                         |   14 +
> >  drivers/block/Makefile                        |    1 +
> >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> >  drivers/pci/endpoint/functions/Makefile       |    1 +
> >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> >  include/linux/pci-epf-block-passthru.h        |   77 +
> >  12 files changed, 3069 insertions(+)
> >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> >  create mode 100644 drivers/block/pci-remote-disk.c
> >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > 
> > -- 
> > 2.25.1
> > 
> 
> -- 
> மணிவண்ணன் சதாசிவம்
Manivannan Sadhasivam Feb. 26, 2024, 9:45 a.m. UTC | #3
On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > > Hello,
> > > 
> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > > 
> > > 
> > > Block Passthrough
> > > ==================
> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > > 
> > > 
> > >                                                            +-------------+
> > >                                                            |             |
> > >                                                            |   SD Card   |
> > >                                                            |             |
> > >                                                            +------^------+
> > >                                                                   |
> > >                                                                   |
> > >     +--------------------------+                +-----------------v----------------+
> > >     |                          |      PCI(e)    |                                  |
> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> > >     |                          |                |                                  |
> > >     +--------------------------+                +-----------------^----------------+
> > >                                                                   |
> > >                                                                   |
> > >                                                            +------v------+
> > >                                                            |             |
> > >                                                            |    NVMe     |
> > >                                                            |             |
> > >                                                            +-------------+
> > > 
> > > 
> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > > 
> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > > 
> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > > 
> > > Test setup
> > > ==========
> > > 
> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > > 
> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > > 
> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > > 
> > > [1] https://linux.die.net/man/1/fio
> > > 
> > 
> > Thanks for the proposal! We are planning to add virtio function support to
> > endpoint subsystem to cover usecases like this. I think your usecase can be
> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> > support once we have the infra in place. Thoughts?
> > 
> > - Mani
> >
> 
> Hi Mani,
> I initially had the plan to implement the virtio-blk as an endpoint
> function driver instead of a self baked driver. 
> 
> This for sure is more elegant as we could reuse the
> virtio-blk pci driver instead of implementing a new one (as I did) 
> 
> But I initially had some concerns about the feasibility, especially
> that the virtio-blk pci driver is expecting immediate responses to some
> register writes which I would not be able to satisfy, simply because we
> do not have any kind of interrupt/event which would be triggered on the
> EP side when the RC is accessing some BAR Registers (at least there is
> no machanism I know of). As virtio is made mainly for Hypervisor <->

Right. There is currently a limitation w.r.t. triggering a doorbell from the host
to the endpoint. But I believe that could be addressed later by repurposing the
endpoint MSI controller [1].

> As virtio is made mainly for Hypervisor <->
> Guest communication I was afraid that a Hypersisor is able to Trap every
> Register access from the Guest and act accordingly, which I would not be
> able to do. I hope this make sense to you.
> 

I'm not worrying about the hypervisor right now. Here the endpoint is exposing
the virtio devices and the host is consuming them. There is no virtualization in play
here. I talked about this at the last Plumbers [2].

> But to make a long story short, yes I agree with you that virtio-blk
> would satisfy my usecase, and I generally think it would be a better
> solution, I just did not know that you are working on some
> infrastructure for that. And yes I would like to implement the endpoint
> function driver for virtio-blk. Is there already an development tree you
> use to work on the infrastructre I could have a look at?
> 

Shunsuke has a WIP branch [3] that I plan to co-work on in the coming days.
You can use it as a reference in the meantime.

- Mani

[1] https://lore.kernel.org/all/20230911220920.1817033-1-Frank.Li@nxp.com/
[2] https://www.youtube.com/watch?v=1tqOTge0eq0
[3] https://github.com/ShunsukeMie/linux-virtio-rdma/tree/v6.6-rc1-epf-vcon

> - Wadim
> 
> 
> 
> > > Wadim Mueller (3):
> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > > 
> > >  .../function/binding/pci-block-passthru.rst   |   24 +
> > >  Documentation/PCI/endpoint/index.rst          |    3 +
> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> > >  MAINTAINERS                                   |    8 +
> > >  drivers/block/Kconfig                         |   14 +
> > >  drivers/block/Makefile                        |    1 +
> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> > >  include/linux/pci-epf-block-passthru.h        |   77 +
> > >  12 files changed, 3069 insertions(+)
> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> > >  create mode 100644 drivers/block/pci-remote-disk.c
> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > -- 
> > மணிவண்ணன் சதாசிவம்
Christoph Hellwig Feb. 26, 2024, 11:08 a.m. UTC | #4
Please don't just create a new (and as far as I can tell underspecified)
"hardware" interface for this.  If the nvme endpoint work is too
much for your use case, maybe just implement a minimal virtio_blk
interface.
Damien Le Moal Feb. 26, 2024, 12:58 p.m. UTC | #5
On 2024/02/26 1:45, Manivannan Sadhasivam wrote:

[...]

>> As virtio is made mainly for Hypervisor <->
>> Guest communication I was afraid that a Hypersisor is able to Trap every
>> Register access from the Guest and act accordingly, which I would not be
>> able to do. I hope this make sense to you.
>>
> 
> I'm not worrying about the hypervisor right now. Here the endpoint is exposing
> the virtio devices and host is consuming it. There is no virtualization play
> here. I talked about this in the last plumbers [2].

FYI, we are still working on our NVMe PCI EPF function driver. It is working OK
using either a rockpro64 (PCI Gen2) board or a Radxa Rock 5B board (PCI Gen3,
rk3588 SoC/DWC EPF driver). I have just been super busy recently with the block layer &
ATA stuff, so I have not been able to rebase/clean up and send the patches. This driver
also depends on many cleanup/improvement patches (see below).

> 
>> But to make a long story short, yes I agree with you that virtio-blk
>> would satisfy my usecase, and I generally think it would be a better
>> solution, I just did not know that you are working on some
>> infrastructure for that. And yes I would like to implement the endpoint
>> function driver for virtio-blk. Is there already an development tree you
>> use to work on the infrastructre I could have a look at?
>>
> 
> Shunsuke has a WIP branch [3], that I plan to co-work in the coming days.
> You can use it as a reference in the meantime.

This one is very similar to what I did in my series:

https://github.com/torvalds/linux/commit/05e21d458b1eaa8c22697f12a1ae42dcb04ff377

My series is here:

https://github.com/damien-lemoal/linux/tree/rock5b_ep_v8

It is a bit of a mess but what's there is:
1) Add the "map_info" EPF method to get mapping that are not dependent on the
host address alignment. That is similar to the align_mem method Shunsuke
introduced, but with more info to make it generic and allow EPF to deal with any
host DMA address.
2) Fixes for the rockpro64 DMA mapping as it is broken
3) Adds rk2588 EPF driver
4) Adds the NVMe EPF function driver. That is implemented as a PCI EPF frontend
to an NVMe-of controller so that any NMVe-Of supported device can be exposed
over PCI (block device, file, real NVMe controller).

There are also a bunch of API changes and cleanups to make the EPF code (core
and driver) more compact/easier to read.

Once I am done with my current work on the block layer side, I intend to come
back to this for the next cycle. I still need to complete the IRQ legacy -> intx
renaming as well...

Cheers.
Frank Li Feb. 26, 2024, 4:47 p.m. UTC | #6
On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
> > > Hello,
> > > 
> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
> > > 
> > > 
> > > Block Passthrough
> > > ==================
> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
> > > 
> > > 
> > >                                                            +-------------+
> > >                                                            |             |
> > >                                                            |   SD Card   |
> > >                                                            |             |
> > >                                                            +------^------+
> > >                                                                   |
> > >                                                                   |
> > >     +--------------------------+                +-----------------v----------------+
> > >     |                          |      PCI(e)    |                                  |
> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
> > >     |                          |                |                                  |
> > >     +--------------------------+                +-----------------^----------------+
> > >                                                                   |
> > >                                                                   |
> > >                                                            +------v------+
> > >                                                            |             |
> > >                                                            |    NVMe     |
> > >                                                            |             |
> > >                                                            +-------------+
> > > 
> > > 
> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
> > > 
> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
> > > 
> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
> > > 
> > > Test setup
> > > ==========
> > > 
> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
> > > 
> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
> > > 
> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
> > > 
> > > [1] https://linux.die.net/man/1/fio
> > > 
> > 
> > Thanks for the proposal! We are planning to add virtio function support to
> > endpoint subsystem to cover usecases like this. I think your usecase can be
> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
> > support once we have the infra in place. Thoughts?
> > 
> > - Mani
> >
> 
> Hi Mani,
> I initially had the plan to implement the virtio-blk as an endpoint
> function driver instead of a self baked driver. 
> 
> This for sure is more elegant as we could reuse the
> virtio-blk pci driver instead of implementing a new one (as I did) 
> 
> But I initially had some concerns about the feasibility, especially
> that the virtio-blk pci driver is expecting immediate responses to some
> register writes which I would not be able to satisfy, simply because we
> do not have any kind of interrupt/event which would be triggered on the
> EP side when the RC is accessing some BAR Registers (at least there is
> no machanism I know of). As virtio is made mainly for Hypervisor <->

A possible solution is to use an ITS MSI to trigger an IRQ at the EP side:
https://lore.kernel.org/linux-pci/20230911220920.1817033-1-Frank.Li@nxp.com/
Anyway, the virtio layer needs some modification.

> Guest communication I was afraid that a Hypersisor is able to Trap every
> Register access from the Guest and act accordingly, which I would not be
> able to do. I hope this make sense to you.
> 
> But to make a long story short, yes I agree with you that virtio-blk
> would satisfy my usecase, and I generally think it would be a better
> solution, I just did not know that you are working on some
> infrastructure for that. And yes I would like to implement the endpoint
> function driver for virtio-blk. Is there already an development tree you
> use to work on the infrastructre I could have a look at?

Several people have tried this:
https://patchew.org/linux/20230427104428.862643-1-mie@igel.co.jp/
https://lore.kernel.org/linux-pci/796eb893-f7e2-846c-e75f-9a5774089b8e@igel.co.jp/
https://lore.kernel.org/imx/d098a631-9930-26d3-48f3-8f95386c8e50@ti.com/T/#t
https://lore.kernel.org/linux-pci/20200702082143.25259-1-kishon@ti.com/

With EDMA support and ITS MSI, it should be possible now.

Frank

> 
> - Wadim
> 
> 
> 
> > > Wadim Mueller (3):
> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
> > > 
> > >  .../function/binding/pci-block-passthru.rst   |   24 +
> > >  Documentation/PCI/endpoint/index.rst          |    3 +
> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
> > >  MAINTAINERS                                   |    8 +
> > >  drivers/block/Kconfig                         |   14 +
> > >  drivers/block/Makefile                        |    1 +
> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
> > >  include/linux/pci-epf-block-passthru.h        |   77 +
> > >  12 files changed, 3069 insertions(+)
> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
> > >  create mode 100644 drivers/block/pci-remote-disk.c
> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
> > > 
> > > -- 
> > > 2.25.1
> > > 
> > 
> > -- 
> > மணிவண்ணன் சதாசிவம்
Wadim Mueller Feb. 26, 2024, 6:47 p.m. UTC | #7

Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> writes:

> On Sun, Feb 25, 2024 at 09:39:17PM +0100, Wadim Mueller wrote:
>> On Sun, Feb 25, 2024 at 09:39:26PM +0530, Manivannan Sadhasivam wrote:
>> > On Sat, Feb 24, 2024 at 10:03:59PM +0100, Wadim Mueller wrote:
>> > > Hello,
>> > >
>> > > This series adds support for the Block Passthrough PCI(e) Endpoint functionality.
>> > > PCI Block Device Passthrough allows one Linux Device running in EP mode to expose its Block devices to the PCI(e) host (RC). The device can export either the full disk or just certain partitions.
>> > > Also an export in readonly mode is possible. This is useful if you want to share the same blockdevice between different SoCs, providing each SoC its own partition(s).
>> > >
>> > >
>> > > Block Passthrough
>> > > ==================
>> > > The PCI Block Passthrough can be a useful feature if you have multiple SoCs in your system connected
>> > > through a PCI(e) link, one running in RC mode, the other in EP mode.
>> > > If the block devices are connected to one SoC (SoC2 in EP Mode from the diagramm below) and you want to access
>> > > those from the other SoC (SoC1 in RC mode below), without having any direct connection to
>> > > those block devices (e.g. if you want to share an NVMe between two SoCs). An simple example of such a configurationis is shown below:
>> > >
>> > >
>> > >                                                            +-------------+
>> > >                                                            |             |
>> > >                                                            |   SD Card   |
>> > >                                                            |             |
>> > >                                                            +------^------+
>> > >                                                                   |
>> > >                                                                   |
>> > >     +--------------------------+                +-----------------v----------------+
>> > >     |                          |      PCI(e)    |                                  |
>> > >     |         SoC1 (RC)        |<-------------->|            SoC2 (EP)             |
>> > >     | (CONFIG_PCI_REMOTE_DISK) |                |(CONFIG_PCI_EPF_BLOCK_PASSTHROUGH)|
>> > >     |                          |                |                                  |
>> > >     +--------------------------+                +-----------------^----------------+
>> > >                                                                   |
>> > >                                                                   |
>> > >                                                            +------v------+
>> > >                                                            |             |
>> > >                                                            |    NVMe     |
>> > >                                                            |             |
>> > >                                                            +-------------+
>> > >
>> > >
>> > > This is to a certain extent a similar functionality which NBD exposes over Network, but on the PCI(e) bus utilizing the EPC/EPF Kernel Framework.
>> > >
>> > > The Endpoint Function driver creates parallel Queues which run on seperate CPU Cores using percpu structures. The number of parallel queues is limited
>> > > by the number of CPUs on the EP device. The actual number of queues is configurable (as all other features of the driver) through CONFIGFS.
>> > >
>> > > A documentation about the functional description as well as a user guide showing how both drivers can be configured is part of this series.
>> > >
>> > > Test setup
>> > > ==========
>> > >
>> > > This series has been tested on an NXP S32G2 SoC running in Endpoint mode with a direct connection to an ARM64 host machine.
>> > >
>> > > A performance measurement on the described setup shows good performance metrics. The S32G2 SoC has a 2xGen3 link which has a maximum Bandwidth of ~2GiB/s.
>> > > With the explained setup a Read Datarate of 1.3GiB/s (with DMA ... without DMA the speed saturated at ~200MiB/s) was achieved using an 512GiB Kingston NVMe
>> > > when accessing the NVMe from the ARM64 (SoC1) Host. The local Read Datarate accessing the NVMe dirctly from the S32G2 (SoC2) was around 1.5GiB.
>> > >
>> > > The measurement was done through the FIO tool [1] with 4kiB Blocks.
>> > >
>> > > [1] https://linux.die.net/man/1/fio
>> > >
>> >
>> > Thanks for the proposal! We are planning to add virtio function support to
>> > endpoint subsystem to cover usecases like this. I think your usecase can be
>> > satisfied using vitio-blk. Maybe you can add the virtio-blk endpoint function
>> > support once we have the infra in place. Thoughts?
>> >
>> > - Mani
>> >
>>
>> Hi Mani,
>> I initially had the plan to implement the virtio-blk as an endpoint
>> function driver instead of a self baked driver.
>>
>> This for sure is more elegant as we could reuse the
>> virtio-blk pci driver instead of implementing a new one (as I did)
>>
>> But I initially had some concerns about the feasibility, especially
>> that the virtio-blk pci driver is expecting immediate responses to some
>> register writes which I would not be able to satisfy, simply because we
>> do not have any kind of interrupt/event which would be triggered on the
>> EP side when the RC is accessing some BAR Registers (at least there is
>> no machanism I know of). As virtio is made mainly for Hypervisor <->
>
> Right. There is a limitation currently w.r.t triggering doorbell from the host
> to endpoint. But I believe that could be addressed later by repurposing the
> endpoint MSI controller [1].
>
>> As virtio is made mainly for Hypervisor <->
>> Guest communication I was afraid that a Hypersisor is able to Trap every
>> Register access from the Guest and act accordingly, which I would not be
>> able to do. I hope this make sense to you.
>>
>
> I'm not worrying about the hypervisor right now. Here the endpoint is exposing
> the virtio devices and host is consuming it. There is no virtualization play
> here. I talked about this in the last plumbers [2].
>

Okay, I understand this. The hypervisor was more of an example. I will
try to explain.

I am currently reading through the virtio spec [1].
In chapter 4.1.4.5.1 there is the following statement:

"The device MUST reset ISR status to 0 on driver read."

So I was wondering: how are we, as a PCI EP device, supposed to clear a
register when the driver reads that same register? I mean, how do we detect a
register read?
If you are a hypervisor it is easy to do, because you can intercept
every memory access made by the guest (the same applies if you build
custom HW for this purpose). But for us as an EP device it is
difficult to detect this, even with MSIs and doorbell registers in
place.

Modifying the virtio layer to write to some doorbell register after
reading the ISR status register would be possible, but kind of ugly.
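
For illustration, here is a minimal sketch (an assumption loosely modeled on the legacy virtio-pci interrupt path, not the upstream code) of the driver-side behaviour in question: the interrupt is acknowledged purely by an MMIO read of the ISR register, so the endpoint never sees a write it could latch onto:

  #include <linux/io.h>
  #include <linux/types.h>

  /*
   * Hypothetical helper: the driver acknowledges the interrupt simply by
   * reading the ISR status register. From the endpoint's point of view this
   * is just an inbound memory read of a BAR; there is no interrupt or event
   * the EP framework could raise so that software clears the register in
   * response.
   */
  static u8 example_read_isr(void __iomem *isr)
  {
          /* Read-to-clear: the spec requires the *device* to reset it to 0. */
          return ioread8(isr);
  }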


[1] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.pdf

>> But to make a long story short, yes I agree with you that virtio-blk
>> would satisfy my usecase, and I generally think it would be a better
>> solution, I just did not know that you are working on some
>> infrastructure for that. And yes I would like to implement the endpoint
>> function driver for virtio-blk. Is there already an development tree you
>> use to work on the infrastructre I could have a look at?
>>
>
> Shunsuke has a WIP branch [3], that I plan to co-work in the coming days.
> You can use it as a reference in the meantime.
>
> - Mani
>
> [1] https://lore.kernel.org/all/20230911220920.1817033-1-Frank.Li@nxp.com/
> [2] https://www.youtube.com/watch?v=1tqOTge0eq0
> [3] https://github.com/ShunsukeMie/linux-virtio-rdma/tree/v6.6-rc1-epf-vcon
>
>> - Wadim
>>
>>
>>
>> > > Wadim Mueller (3):
>> > >   PCI: Add PCI Endpoint function driver for Block-device passthrough
>> > >   PCI: Add PCI driver for a PCI EP remote Blockdevice
>> > >   Documentation: PCI: Add documentation for the PCI Block Passthrough
>> > >
>> > >  .../function/binding/pci-block-passthru.rst   |   24 +
>> > >  Documentation/PCI/endpoint/index.rst          |    3 +
>> > >  .../pci-endpoint-block-passthru-function.rst  |  331 ++++
>> > >  .../pci-endpoint-block-passthru-howto.rst     |  158 ++
>> > >  MAINTAINERS                                   |    8 +
>> > >  drivers/block/Kconfig                         |   14 +
>> > >  drivers/block/Makefile                        |    1 +
>> > >  drivers/block/pci-remote-disk.c               | 1047 +++++++++++++
>> > >  drivers/pci/endpoint/functions/Kconfig        |   12 +
>> > >  drivers/pci/endpoint/functions/Makefile       |    1 +
>> > >  .../functions/pci-epf-block-passthru.c        | 1393 +++++++++++++++++
>> > >  include/linux/pci-epf-block-passthru.h        |   77 +
>> > >  12 files changed, 3069 insertions(+)
>> > >  create mode 100644 Documentation/PCI/endpoint/function/binding/pci-block-passthru.rst
>> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-function.rst
>> > >  create mode 100644 Documentation/PCI/endpoint/pci-endpoint-block-passthru-howto.rst
>> > >  create mode 100644 drivers/block/pci-remote-disk.c
>> > >  create mode 100644 drivers/pci/endpoint/functions/pci-epf-block-passthru.c
>> > >  create mode 100644 include/linux/pci-epf-block-passthru.h
>> > >
>> > > -- 
>> > > 2.25.1
>> > >
>> >
>> > -- 
>> > மணிவண்ணன் சதாசிவம்