[v3,9/9] PCI: endpoint: Set prefetch when allocating memory for 64-bit BARs

Message ID 20240313105804.100168-10-cassel@kernel.org
State New
Series PCI: endpoint: set prefetchable bit for 64-bit BARs

Commit Message

Niklas Cassel March 13, 2024, 10:58 a.m. UTC
From the PCIe 6.0 base spec:
"Generally only 64-bit BARs are good candidates, since only Legacy
Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
and most scalable platforms map all 32-bit Memory BARs into
non-prefetchable Memory Space regardless of the Prefetchable bit value."

"For a PCI Express Endpoint, 64-bit addressing must be supported for all
BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
for all BARs that do not have the Prefetchable bit Set."

"Any device that has a range that behaves like normal memory should mark
the range as prefetchable. A linear frame buffer in a graphics device is
an example of a range that should be marked prefetchable."

The PCIe spec tells us that we should have the prefetchable bit set for
64-bit BARs backed by "normal memory". The backing memory that we allocate
for a 64-bit BAR using pci_epf_alloc_space() (which calls
dma_alloc_coherent()) is obviously "normal memory".

Thus, set the prefetchable bit when allocating backing memory for a 64-bit
BAR.
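
For reference, the BAR flag bits combined here (from
include/uapi/linux/pci_regs.h):

#define PCI_BASE_ADDRESS_MEM_TYPE_64	0x04	/* 64 bit address */
#define PCI_BASE_ADDRESS_MEM_PREFETCH	0x08	/* prefetchable? */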

Signed-off-by: Niklas Cassel <cassel@kernel.org>
---
 drivers/pci/endpoint/pci-epf-core.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Manivannan Sadhasivam March 15, 2024, 6:44 a.m. UTC | #1
+ Arnd

On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> From the PCIe 6.0 base spec:

It'd be good to mention the section also.

> "Generally only 64-bit BARs are good candidates, since only Legacy
> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> and most scalable platforms map all 32-bit Memory BARs into
> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> 
> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> for all BARs that do not have the Prefetchable bit Set."
> 
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
> 
> The PCIe spec tells us that we should have the prefetchable bit set for
> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> dma_alloc_coherent()) is obviously "normal memory".
> 

I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
'normal memory' but rather 'consistent/coherent memory'. Here the question is,
can the memory returned by dma_alloc_coherent() be prefetched or write-combined
on all architectures.

I hope Arnd can answer this question.

- Mani

> Thus, set the prefetchable bit when allocating backing memory for a 64-bit
> BAR.
> 
> Signed-off-by: Niklas Cassel <cassel@kernel.org>
> ---
>  drivers/pci/endpoint/pci-epf-core.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/endpoint/pci-epf-core.c b/drivers/pci/endpoint/pci-epf-core.c
> index e7dbbeb1f0de..20d2bde0747c 100644
> --- a/drivers/pci/endpoint/pci-epf-core.c
> +++ b/drivers/pci/endpoint/pci-epf-core.c
> @@ -309,6 +309,9 @@ void *pci_epf_alloc_space(struct pci_epf *epf, size_t size, enum pci_barno bar,
>  	else
>  		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_TYPE_32;
>  
> +	if (epf_bar[bar].flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +
>  	return space;
>  }
>  EXPORT_SYMBOL_GPL(pci_epf_alloc_space);
> -- 
> 2.44.0
>
Arnd Bergmann March 15, 2024, 5:29 p.m. UTC | #2
On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
>> "Generally only 64-bit BARs are good candidates, since only Legacy
>> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
>> and most scalable platforms map all 32-bit Memory BARs into
>> non-prefetchable Memory Space regardless of the Prefetchable bit value."
>> 
>> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
>> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
>> for all BARs that do not have the Prefetchable bit Set."
>> 
>> "Any device that has a range that behaves like normal memory should mark
>> the range as prefetchable. A linear frame buffer in a graphics device is
>> an example of a range that should be marked prefetchable."
>> 
>> The PCIe spec tells us that we should have the prefetchable bit set for
>> 64-bit BARs backed by "normal memory". The backing memory that we allocate
>> for a 64-bit BAR using pci_epf_alloc_space() (which calls
>> dma_alloc_coherent()) is obviously "normal memory".
>> 
>
> I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
> 'normal memory' but rather 'consistent/coherent memory'. Here the question is,
> can the memory returned by dma_alloc_coherent() be prefetched or write-combined
> on all architectures.
>
> I hope Arnd can answer this question.

I think there are three separate questions here when talking about
a scenario where a PCI master accesses memory behind a PCI endpoint:

- The CPU on the host side usually uses ioremap() for mapping
  the PCI BAR of the device. If the BAR is marked as prefetchable,
  we usually allow mapping it using ioremap_wc() for write-combining
  or ioremap_wt() for write-through mappings that allow both
  write-combining and prefetching. On some architectures, these
  all fall back to normal register mappings which do none of these.
  If it uses write-combining or prefetching, the host side driver
  will need to manually serialize against concurrent access from
  the endpoint side.

- The endpoint device accessing a buffer in memory is controlled
  by the endpoint driver and may decide to prefetch data into a
  local cache independent of the other two. I don't know if any
  of the supported endpoint devices actually do that. A prefetch
  from the PCI host side would appear as a normal transaction here.

- The local CPU on the endpoint side may access the same buffer as
  the endpoint device. On low-end SoCs the DMA from the PCI
  endpoint is not coherent with the CPU caches, so the CPU may
  need to map it as uncacheable to allow data consistency with
  the CPU on the PCI host side. On higher-end SoCs (e.g. most
  non-ARM ones) DMA is coherent with the caches, so the CPU
  on the endpoint side may map the buffer as cached and
  still be coherent with a CPU on the PCI host side that has
  mapped it with ioremap().
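
A minimal host-side sketch of that first point (map_example_bar() is a
hypothetical helper, error handling omitted):

void __iomem *map_example_bar(struct pci_dev *pdev, int bar)
{
        /* Prefetchable BAR: write-combining is permitted, so opt in. */
        if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
                return pci_ioremap_wc_bar(pdev, bar);

        /* Non-prefetchable BAR: strictly ordered register mapping. */
        return pci_ioremap_bar(pdev, bar);
}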

       Arnd
Niklas Cassel March 17, 2024, 11:54 a.m. UTC | #3
Hello all,

On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >> "Generally only 64-bit BARs are good candidates, since only Legacy
> >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> >> and most scalable platforms map all 32-bit Memory BARs into
> >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> >> 
> >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> >> for all BARs that do not have the Prefetchable bit Set."
> >> 
> >> "Any device that has a range that behaves like normal memory should mark
> >> the range as prefetchable. A linear frame buffer in a graphics device is
> >> an example of a range that should be marked prefetchable."
> >> 
> >> The PCIe spec tells us that we should have the prefetchable bit set for
> >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> >> dma_alloc_coherent()) is obviously "normal memory".
> >> 
> >
> > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
> > 'normal memory' but rather 'consistent/coherent memory'. Here the question is,
> > can the memory returned by dma_alloc_coherent() be prefetched or write-combined
> > on all architectures.
> >
> > I hope Arnd can answer this question.
> 
> I think there are three separate questions here when talking about
> a scenario where a PCI master accesses memory behind a PCI endpoint:

I think the question is if the PCI epf-core, which runs on the endpoint
side, and which calls dma_alloc_coherent() to allocate backing memory for
a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
the BAR as a 64-bit BAR).

The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
states:
"Any device that has a range that behaves like normal memory should mark
the range as prefetchable. A linear frame buffer in a graphics device is
an example of a range that should be marked prefetchable."

Does not backing memory allocated for a specific BAR using
dma_alloc_coherent() on the EP side behave like normal memory from the
host's point of view?



On the host side, this will mean that the host driver sees the
Prefetchable bit, and according to:
https://docs.kernel.org/driver-api/device-io.html
The host might map the BAR using ioremap_wc().

Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
BARs using pci_ioremap_bar():
https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
which will not map it using ioremap_wc().
(But the code we have in the PCI epf-core must of course work with host
side drivers other than pci_endpoint_test.c as well.)
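
For reference, pci_ioremap_bar() boils down to roughly the following in
v6.8 (simplified, error checks omitted):

static void __iomem *__pci_ioremap_resource(struct pci_dev *pdev, int bar,
                                            bool write_combine)
{
        struct resource *res = &pdev->resource[bar];

        if (write_combine)
                return ioremap_wc(res->start, resource_size(res));

        return ioremap(res->start, resource_size(res));
}

void __iomem *pci_ioremap_bar(struct pci_dev *pdev, int bar)
{
        return __pci_ioremap_resource(pdev, bar, false);
}

So a prefetchable BAR still ends up with a plain ioremap() unless the
host driver explicitly asks for write-combining via pci_ioremap_wc_bar().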


> 
> - The CPU on the host side usually uses ioremap() for mapping
>   the PCI BAR of the device. If the BAR is marked as prefetchable,
>   we usually allow mapping it using ioremap_wc() for write-combining
>   or ioremap_wt() for write-through mappings that allow both
>   write-combining and prefetching. On some architectures, these
>   all fall back to normal register mappings which do none of these.
>   If it uses write-combining or prefetching, the host side driver
>   will need to manually serialize against concurrent access from
>   the endpoint side.
> 
> - The endpoint device accessing a buffer in memory is controlled
>   by the endpoint driver and may decide to prefetch data into a
>   local cache independent of the other two. I don't know if any
>   of the supported endpoint devices actually do that. A prefetch
>   from the PCI host side would appear as a normal transaction here.
> 
> - The local CPU on the endpoint side may access the same buffer as
>   the endpoint device. On low-end SoCs the DMA from the PCI
>   endpoint is not coherent with the CPU caches, so the CPU may

I don't follow. When doing DMA *from* the endpoint, then the DMA HW
on the EP side will read or write data to a buffer allocated on the
host side (most likely using dma_alloc_coherent()), but what does
that have to do with how the EP configures the BARs that it exposes?


>   need to map it as uncacheable to allow data consistency with
>   the CPU on the PCI host side. On higher-end SoCs (e.g. most
>   non-ARM ones) DMA is coherent with the caches, so the CPU
>   on the endpoint side may map the buffer as cached and
>   still be coherent with a CPU on the PCI host side that has
>   mapped it with ioremap().


Kind regards,
Niklas
Manivannan Sadhasivam March 18, 2024, 3:53 a.m. UTC | #4
On Sun, Mar 17, 2024 at 12:54:11PM +0100, Niklas Cassel wrote:
> Hello all,
> 
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> > On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> > >> "Generally only 64-bit BARs are good candidates, since only Legacy
> > >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> > >> and most scalable platforms map all 32-bit Memory BARs into
> > >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> > >> 
> > >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> > >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> > >> for all BARs that do not have the Prefetchable bit Set."
> > >> 
> > >> "Any device that has a range that behaves like normal memory should mark
> > >> the range as prefetchable. A linear frame buffer in a graphics device is
> > >> an example of a range that should be marked prefetchable."
> > >> 
> > >> The PCIe spec tells us that we should have the prefetchable bit set for
> > >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> > >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> > >> dma_alloc_coherent()) is obviously "normal memory".
> > >> 
> > >
> > > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
> > > 'normal memory' but rather 'consistent/coherent memory'. Here the question is,
> > > can the memory returned by dma_alloc_coherent() be prefetched or write-combined
> > > on all architectures.
> > >
> > > I hope Arnd can answer this question.
> > 
> > I think there are three separate questions here when talking about
> > a scenario where a PCI master accesses memory behind a PCI endpoint:
> 
> I think the question is if the PCI epf-core, which runs on the endpoint
> side, and which calls dma_alloc_coherent() to allocate backing memory for
> a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
> the BAR as a 64-bit BAR).
> 
> The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
> states:
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
> 
> Does not backing memory allocated for a specific BAR using
> dma_alloc_coherent() on the EP side behave like normal memory from the
> host's point of view?
> 
> 
> 
> On the host side, this will mean that the host driver sees the
> Prefetchable bit, and according to:
> https://docs.kernel.org/driver-api/device-io.html
> The host might map the BAR using ioremap_wc().
> 
> Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
> BARs using pci_ioremap_bar():
> https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
> which will not map it using ioremap_wc().
> (But the code we have in the PCI epf-core must of course work with host
> side drivers other than pci_endpoint_test.c as well.)
> 
> 

Right. I don't see any problem with the host side assumption. But my question
is, is it OK to advertise the coherent memory allocated on the endpoint as
prefetchable to the host?

As you quoted the spec,

"Any device that has a range that behaves like normal memory should mark
the range as prefetchable."

Here, the coherent memory allocated by the device (endpoint) won't behave as
normal memory on the _endpoint_. But I'm not at all sure whether there are any
implications in exposing this memory as 'normal memory' to the host.

- Mani

> > 
> > - The CPU on the host side usually uses ioremap() for mapping
> >   the PCI BAR of the device. If the BAR is marked as prefetchable,
> >   we usually allow mapping it using ioremap_wc() for write-combining
> >   or ioremap_wt() for write-through mappings that allow both
> >   write-combining and prefetching. On some architectures, these
> >   all fall back to normal register mappings which do none of these.
> >   If it uses write-combining or prefetching, the host side driver
> >   will need to manually serialize against concurrent access from
> >   the endpoint side.
> > 
> > - The endpoint device accessing a buffer in memory is controlled
> >   by the endpoint driver and may decide to prefetch data into a
> >   local cache independent of the other two. I don't know if any
> >   of the supported endpoint devices actually do that. A prefetch
> >   from the PCI host side would appear as a normal transaction here.
> > 
> > - The local CPU on the endpoint side may access the same buffer as
> >   the endpoint device. On low-end SoCs the DMA from the PCI
> >   endpoint is not coherent with the CPU caches, so the CPU may
> 
> I don't follow. When doing DMA *from* the endpoint, then the DMA HW
> on the EP side will read or write data to a buffer allocated on the
> host side (most likely using dma_alloc_coherent()), but what does
> that have to do with how the EP configures the BARs that it exposes?
> 
> 
> >   need to map it as uncacheable to allow data consistency with
> >   the CPU on the PCI host side. On higher-end SoCs (e.g. most
> >   non-ARM ones) DMA is coherent with the caches, so the CPU
> >   on the endpoint side may map the buffer as cached and
> >   still be coherent with a CPU on the PCI host side that has
> >   mapped it with ioremap().
> 
> 
> Kind regards,
> Niklas
Manivannan Sadhasivam March 18, 2024, 4:30 a.m. UTC | #5
On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >> "Generally only 64-bit BARs are good candidates, since only Legacy
> >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> >> and most scalable platforms map all 32-bit Memory BARs into
> >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> >> 
> >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> >> for all BARs that do not have the Prefetchable bit Set."
> >> 
> >> "Any device that has a range that behaves like normal memory should mark
> >> the range as prefetchable. A linear frame buffer in a graphics device is
> >> an example of a range that should be marked prefetchable."
> >> 
> >> The PCIe spec tells us that we should have the prefetchable bit set for
> >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> >> dma_alloc_coherent()) is obviously "normal memory".
> >> 
> >
> > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
> > 'normal memory' but rather 'consistent/coherent memory'. Here the question is,
> > can the memory returned by dma_alloc_coherent() be prefetched or write-combined
> > on all architectures.
> >
> > I hope Arnd can answer this question.
> 
> I think there are three separate questions here when talking about
> a scenario where a PCI master accesses memory behind a PCI endpoint:
> 
> - The CPU on the host side usually uses ioremap() for mapping
>   the PCI BAR of the device. If the BAR is marked as prefetchable,
>   we usually allow mapping it using ioremap_wc() for write-combining
>   or ioremap_wt() for write-through mappings that allow both
>   write-combining and prefetching. On some architectures, these
>   all fall back to normal register mappings which do none of these.
>   If it uses write-combining or prefetching, the host side driver
>   will need to manually serialize against concurrent access from
>   the endpoint side.
> 
> - The endpoint device accessing a buffer in memory is controlled
>   by the endpoint driver and may decide to prefetch data into a
>   local cache independent of the other two. I don't know if any
>   of the supported endpoint devices actually do that. A prefetch
>   from the PCI host side would appear as a normal transaction here.
> 
> - The local CPU on the endpoint side may access the same buffer as
>   the endpoint device. On low-end SoCs the DMA from the PCI
>   endpoint is not coherent with the CPU caches, so the CPU may
>   need to map it as uncacheable to allow data consistency with
>   the CPU on the PCI host side. On higher-end SoCs (e.g. most
>   non-ARM ones) DMA is coherent with the caches, so the CPU
>   on the endpoint side may map the buffer as cached and
>   still be coherent with a CPU on the PCI host side that has
>   mapped it with ioremap().
> 

Thanks Arnd for the reply.

But I'm not sure I got the answer I was looking for. So let me rephrase my
question a bit.

For BAR memory, PCIe spec states that,

'A PCI Express Function requesting Memory Space through a BAR must set the BAR's
Prefetchable bit unless the range contains locations with read side effects or
locations in which the Function does not tolerate write merging'

So here, the spec refers to the backing memory allocated on the endpoint side
as the 'range', i.e., the BAR memory allocated on the host that gets mapped on
the endpoint.

Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
memory for each BAR and map it using iATU.

So I want to know if the memory range allocated in the endpoint through
dma_alloc_coherent() satisfies the above two conditions in PCIe spec on all
architectures:

1. No Read side effects
2. Tolerates write merging

I believe the reason why we are allocating the coherent memory on the endpoint
in the first place is that not all PCIe controllers are DMA coherent, as you
said above.

- Mani
Arnd Bergmann March 18, 2024, 6:44 a.m. UTC | #6
On Mon, Mar 18, 2024, at 05:30, Manivannan Sadhasivam wrote:
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
>> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
>
> But I'm not sure I got the answer I was looking for. So let me rephrase my
> question a bit.
>
> For BAR memory, PCIe spec states that,
>
> 'A PCI Express Function requesting Memory Space through a BAR must set the BAR's
> Prefetchable bit unless the range contains locations with read side effects or
> locations in which the Function does not tolerate write merging'
>
> So here, the spec refers to the backing memory allocated on the endpoint side
> as the 'range', i.e., the BAR memory allocated on the host that gets mapped on
> the endpoint.
>
> Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
> memory for each BAR and map it using iATU.
>
> So I want to know if the memory range allocated in the endpoint through
> dma_alloc_coherent() satisfies the above two conditions in PCIe spec on all
> architectures:
>
> 1. No Read side effects
> 2. Tolerates write merging
>
> I believe the reason why we are allocating the coherent memory on the endpoint
> in the first place is that not all PCIe controllers are DMA coherent, as you
> said above.

As far as I can tell, we never have read side effects for memory
backed BARs, but the write merging is something that depends on
how the memory is used:

If you have anything in that memory that relies on ordering,
you probably want to map it as coherent on the endpoint side,
and non-prefetchable on the host controller side, and then
use the normal rmb()/wmb() barriers on both ends between
serialized accesses. An example of this would be having blocks
of data separate from metadata that says whether the data is
valid.

If you don't care about ordering on that level, I would use
dma_map_sg() on the endpoint side and prefetchable mapping on
the host side, with the endpoint using dma_sync_*() to pass
buffer ownership between the two sides, as controlled by some
other communication method (non-prefetchable BAR, MSI, ...).
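
A rough sketch of that streaming variant on the endpoint side, assuming
the controller's inbound window (iATU) is programmed with the mapped
address (buffer and device names hypothetical):

        /* EP: back the BAR with a regular cacheable buffer. */
        buf = kzalloc(bar_size, GFP_KERNEL);
        dma_addr = dma_map_single(dma_dev, buf, bar_size,
                                  DMA_BIDIRECTIONAL);

        /* Hand ownership to the PCI side before the host touches it: */
        dma_sync_single_for_device(dma_dev, dma_addr, bar_size,
                                   DMA_BIDIRECTIONAL);
        /* ... signal the host (MSI, doorbell in a non-WC BAR, ...) ... */

        /* Take ownership back before the EP CPU reads the data: */
        dma_sync_single_for_cpu(dma_dev, dma_addr, bar_size,
                                DMA_BIDIRECTIONAL);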

     Arnd
Arnd Bergmann March 18, 2024, 7:25 a.m. UTC | #7
On Sun, Mar 17, 2024, at 12:54, Niklas Cassel wrote:
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
>> 
>> I think there are three separate questions here when talking about
>> a scenario where a PCI master accesses memory behind a PCI endpoint:
>
> I think the question is if the PCI epf-core, which runs on the endpoint
> side, and which calls dma_alloc_coherent() to allocate backing memory for
> a BAR, can set/mark the Prefetchable bit for the BAR (if we also set/mark
> the BAR as a 64-bit BAR).
>
> The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
> states:
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
>
> Does not backing memory allocated for a specific BAR using
> dma_alloc_coherent() on the EP side behave like normal memory from the
> host's point of view?

I'm not sure I follow this logic: If the device wants the
buffer to act like "normal memory", then it can be marked
as prefetchable and mapped into the host as write-combining,
but I think in this case you *don't* want it to be coherent
on the endpoint side either but use a streaming mapping with
explicit cache management instead.

Conversely, if the endpoint side requires a coherent mapping,
then I think you will want a strictly ordered (non-wc,
non-prefetchable) mapping on the host side as well.

It would be helpful to have actual endpoint function drivers
in the kernel rather than just the test drivers to see what type
of serialization you actually want for best performance on
both sides.

Can you give a specific example of an endpoint that you are
actually interested in, maybe just one that we have a host-side
device driver for in tree?

> On the host side, this will mean that the host driver sees the
> Prefetchable bit, and according to:
> https://docs.kernel.org/driver-api/device-io.html
> The host might map the BAR using ioremap_wc().
>
> Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
> BARs using pci_ioremap_bar():
> https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
> which will not map it using ioremap_wc().
> (But the code we have in the PCI epf-core must of course work with host
> side drivers other than pci_endpoint_test.c as well.)

It is to some degree architecture-specific here. On powerpc
and i386 with MTRRs, any prefetchable BAR will be mapped as
write-combining IIRC, but on arm and arm64 it only depends on
whether the host side driver uses ioremap() or ioremap_wc().

>> - The local CPU on the endpoint side may access the same buffer as
>>   the endpoint device. On low-end SoCs the DMA from the PCI
>>   endpoint is not coherent with the CPU caches, so the CPU may
>
> I don't follow. When doing DMA *from* the endpoint, then the DMA HW
> on the EP side will read or write data to a buffer allocated on the
> host side (most likely using dma_alloc_coherent()), but what does
> that have to do with how the EP configures the BARs that it exposes?

I meant doing DMA to the memory of the endpoint side, not the
host side. DMA to the host side memory is completely separate
from this question.

     Arnd
Niklas Cassel March 18, 2024, 3:13 p.m. UTC | #8
Hello Arnd,

On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
> 
> I'm not sure I follow this logic: If the device wants the
> buffer to act like "normal memory", then it can be marked
> as prefetchable and mapped into the host as write-combining,
> but I think in this case you *don't* want it to be coherent
> on the endpoint side either but use a streaming mapping with
> explicit cache management instead.
> 
> Conversely, if the endpoint side requires a coherent mapping,
> then I think you will want a strictly ordered (non-wc,
> non-prefetchable) mapping on the host side as well.
> 
> It would be helpful to have actual endpoint function drivers
> in the kernel rather than just the test drivers to see what type
> of serialization you actually want for best performance on
> both sides.

Yes, that would be nice.

This specific API, pci_epf_alloc_space(), is only used by the
following drivers:
drivers/pci/endpoint/functions/pci-epf-test.c
drivers/pci/endpoint/functions/pci-epf-ntb.c
drivers/pci/endpoint/functions/pci-epf-vntb.c

pci_epf_alloc_space() is only used to allocate backing
memory for the BARs.


> 
> Can you give a specific example of an endpoint that you are
> actually interested in, maybe just one that we have a host-side
> device driver for in tree?

I personally just care about pci-epf-test, but obviously I don't
want to regress any other user of pci_epf_alloc_space().

Looking at the endpoint side driver:
drivers/pci/endpoint/functions/pci-epf-test.c
and the host side driver:
drivers/misc/pci_endpoint_test.c

On the RC side, allocating buffers that the EP will DMA to is
done using: kzalloc() + dma_map_single().

On EP side:
drivers/pci/endpoint/functions/pci-epf-test.c
uses dma_map_single() when using DMA, and signals completion using MSI.

On EP side:
When reading/writing to the BARs, it simply does:
READ_ONCE()/WRITE_ONCE():
https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648

There is no dma_sync(), so the pci-epf-test driver currently seems to
depend on the backing memory being allocated by dma_alloc_coherent().
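
The referenced lines are roughly:

        command = READ_ONCE(reg->command);
        if (!command)
                goto reset_handler;

        WRITE_ONCE(reg->command, 0);
        WRITE_ONCE(reg->status, 0);

i.e. plain loads/stores to the dma_alloc_coherent() buffer, with no
cache management anywhere in between.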


> If you don't care about ordering on that level, I would use
> dma_map_sg() on the endpoint side and prefetchable mapping on
> the host side, with the endpoint using dma_sync_*() to pass
> buffer ownership between the two sides, as controlled by some
> other communication method (non-prefetchable BAR, MSI, ...).

I don't think that there is a big reason why pci-epf-test is
implemented using dma_alloc_coherent() rather than dma_sync()
for the memory backing the BARs, but that is the way it is.

Since I don't feel like totally rewriting pci-epf-test, and since
you say that we shouldn't use dma_alloc_coherent() for the memory
backing the BARs together with exporting the BAR as prefetchable,
I will drop this patch from the series in the next revision.


Kind regards,
Niklas
Arnd Bergmann March 18, 2024, 3:49 p.m. UTC | #9
On Mon, Mar 18, 2024, at 16:13, Niklas Cassel wrote:
> On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
>
> I personally just care about pci-epf-test, but obviously I don't
> want to regress any other user of pci_epf_alloc_space().
>
> Looking at the endpoint side driver:
> drivers/pci/endpoint/functions/pci-epf-test.c
> and the host side driver:
> drivers/misc/pci_endpoint_test.c
>
> On the RC side, allocating buffers that the EP will DMA to is
> done using: kzalloc() + dma_map_single().
>
> On EP side:
> drivers/pci/endpoint/functions/pci-epf-test.c
> uses dma_map_single() when using DMA, and signals completion using MSI.
>
> On EP side:
> When reading/writing to the BARs, it simply does:
> READ_ONCE()/WRITE_ONCE():
> https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648
>
> There is no dma_sync(), so the pci-epf-test driver currently seems to
> depend on the backing memory being allocated by dma_alloc_coherent().

From my reading of that function, this is really some kind
of command buffer that implements individual structured
registers and can be accessed from both sides at the same
time, so it would not actually make sense with the streaming
interface and wc/prefetchable access in place of explicit
READ_ONCE/WRITE_ONCE and readl/writel accesses.

>> If you don't care about ordering on that level, I would use
>> dma_map_sg() on the endpoint side and prefetchable mapping on
>> the host side, with the endpoint using dma_sync_*() to pass
>> buffer ownership between the two sides, as controlled by some
>> other communication method (non-prefetchable BAR, MSI, ...).
>
> I don't think that there is a big reason why pci-epf-test is
> implemented using dma_alloc_coherent() rather than dma_sync()
> for the memory backing the BARs, but that is the way it is.
>
> Since I don't feel like totally rewriting pci-epf-test, and since
> you say that we shouldn't use dma_alloc_coherent() for the memory
> backing the BARs together with exporting the BAR as prefetchable,
> I will drop this patch from the series in the next revision.

Ok. It might still be useful to extend the driver to also
allow transferring streaming data through a BAR on the
endpoint side. From what I can tell, it currently supports
using either slave DMA or an RC side buffer that is ioremapped
into the endpoint, but that uses a regular ioremap() as well.
Mapping the RC side buffer as WC should make it possible to
transfer data from EP to RC more efficiently, but for the RC
to EP transfers you really want the buffer to be allocated on
the EP, so you can ioremap_wc() it to the RC for a memcpy_toio,
or cacheable read from the EP.
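
A sketch of that RC-to-EP direction, assuming the EP exposes such a
buffer through a prefetchable BAR (names hypothetical):

        /* RC: map the EP's prefetchable BAR write-combined. */
        void __iomem *base = pci_ioremap_wc_bar(pdev, bar);

        memcpy_toio(base + offset, src_buf, len); /* stores may merge */
        wmb(); /* order/flush the WC writes before signalling the EP */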

      Arnd
Manivannan Sadhasivam March 19, 2024, 6:20 a.m. UTC | #10
On Mon, Mar 18, 2024 at 07:44:21AM +0100, Arnd Bergmann wrote:
> On Mon, Mar 18, 2024, at 05:30, Manivannan Sadhasivam wrote:
> > On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> >> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> >> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >
> > But I'm not sure I got the answer I was looking for. So let me rephrase my
> > question a bit.
> >
> > For BAR memory, PCIe spec states that,
> >
> > 'A PCI Express Function requesting Memory Space through a BAR must set the BAR's
> > Prefetchable bit unless the range contains locations with read side effects or
> > locations in which the Function does not tolerate write merging'
> >
> > So here, the spec refers to the backing memory allocated on the endpoint side
> > as the 'range', i.e., the BAR memory allocated on the host that gets mapped on
> > the endpoint.
> >
> > Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
> > memory for each BAR and map it using iATU.
> >
> > So I want to know if the memory range allocated in the endpoint through
> > dma_alloc_coherent() satisfies the above two conditions in PCIe spec on all
> > architectures:
> >
> > 1. No Read side effects
> > 2. Tolerates write merging
> >
> > I believe the reason why we are allocating the coherent memory on the endpoint
> > in the first place is that not all PCIe controllers are DMA coherent, as you
> > said above.
> 
> As far as I can tell, we never have read side effects for memory
> backed BARs, but the write merging is something that depends on
> how the memory is used:
> 
> If you have anything in that memory that relies on ordering,
> you probably want to map it as coherent on the endpoint side,
> and non-prefetchable on the host controller side, and then
> use the normal rmb()/wmb() barriers on both ends between
> serialized accesses. An example of this would be having blocks
> of data separate from metadata that says whether the data is
> valid.
> 
> If you don't care about ordering on that level, I would use
> dma_map_sg() on the endpoint side and prefetchable mapping on
> the host side, with the endpoint using dma_sync_*() to pass
> buffer ownership between the two sides, as controlled by some
> other communication method (non-prefetchable BAR, MSI, ...).
> 

Right now, only the Test driver and a couple of NTB drivers make use of the
pci_epf_alloc_space() API, and they do not need streaming DMA.

So to conclude, we should just live with coherent allocation/non-prefetch for
now and extend it to streaming DMA/prefetch once we have a function driver that
needs it.

Thanks a lot for your inputs!

- Mani
Manivannan Sadhasivam March 19, 2024, 6:22 a.m. UTC | #11
On Mon, Mar 18, 2024 at 04:49:07PM +0100, Arnd Bergmann wrote:
> On Mon, Mar 18, 2024, at 16:13, Niklas Cassel wrote:
> > On Mon, Mar 18, 2024 at 08:25:36AM +0100, Arnd Bergmann wrote:
> >
> > I personally just care about pci-epf-test, but obviously I don't
> > want to regress any other user of pci_epf_alloc_space().
> >
> > Looking at the endpoint side driver:
> > drivers/pci/endpoint/functions/pci-epf-test.c
> > and the host side driver:
> > drivers/misc/pci_endpoint_test.c
> >
> > On the RC side, allocating buffers that the EP will DMA to is
> > done using: kzalloc() + dma_map_single().
> >
> > On EP side:
> > drivers/pci/endpoint/functions/pci-epf-test.c
> > uses dma_map_single() when using DMA, and signals completion using MSI.
> >
> > On EP side:
> > When reading/writing to the BARs, it simply does:
> > READ_ONCE()/WRITE_ONCE():
> > https://github.com/torvalds/linux/blob/v6.8/drivers/pci/endpoint/functions/pci-epf-test.c#L643-L648
> >
> > There is no dma_sync(), so the pci-epf-test driver currently seems to
> > depend on the backing memory being allocated by dma_alloc_coherent().
> 
> From my reading of that function, this is really some kind
> of command buffer that implements individual structured
> registers and can be accessed from both sides at the same
> time, so it would not actually make sense with the streaming
> interface and wc/prefetchable access in place of explicit
> READ_ONCE/WRITE_ONCE and readl/writel accesses.
> 

Right. We should stick to the current implementation for now, until a function
driver with a streaming DMA use case comes in.

- Mani

> >> If you don't care about ordering on that level, I would use
> >> dma_map_sg() on the endpoint side and prefetchable mapping on
> >> the host side, with the endpoint using dma_sync_*() to pass
> >> buffer ownership between the two sides, as controlled by some
> >> other communication method (non-prefetchable BAR, MSI, ...).
> >
> > I don't think that there is a big reason why pci-epf-test is
> > implemented using dma_alloc_coherent() rather than dma_sync()
> > for the memory backing the BARs, but that is the way it is.
> >
> > Since I don't feel like totally rewriting pci-epf-test, and since
> > you say that we shouldn't use dma_alloc_coherent() for the memory
> > backing the BARs together with exporting the BAR as prefetchable,
> > I will drop this patch from the series in the next revision.
> 
> Ok. It might still be useful to extend the driver to also
> allow transferring streaming data through a BAR on the
> endpoint side. From what I can tell, it currently supports
> using either slave DMA or an RC side buffer that is ioremapped
> into the endpoint, but that uses a regular ioremap() as well.
> Mapping the RC side buffer as WC should make it possible to
> transfer data from EP to RC more efficiently, but for the RC
> to EP transfers you really want the buffer to be allocated on
> the EP, so you can ioremap_wc() it to the RC for a memcpy_toio,
> or cacheable read from the EP.
> 
>       Arnd

Patch

diff --git a/drivers/pci/endpoint/pci-epf-core.c b/drivers/pci/endpoint/pci-epf-core.c
index e7dbbeb1f0de..20d2bde0747c 100644
--- a/drivers/pci/endpoint/pci-epf-core.c
+++ b/drivers/pci/endpoint/pci-epf-core.c
@@ -309,6 +309,9 @@  void *pci_epf_alloc_space(struct pci_epf *epf, size_t size, enum pci_barno bar,
 	else
 		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_TYPE_32;
 
+	if (epf_bar[bar].flags & PCI_BASE_ADDRESS_MEM_TYPE_64)
+		epf_bar[bar].flags |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+
 	return space;
 }
 EXPORT_SYMBOL_GPL(pci_epf_alloc_space);