[GIT,PULL,v3] updates to qbman (soc drivers) to support arm/arm64

Message ID HE1PR0401MB17071A7853695DB6546FE28C8FC50@HE1PR0401MB1707.eurprd04.prod.outlook.com
State New

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/leo/linux.git tags/qbman-for-arm-soc

Message

Li Yang June 20, 2017, 5:27 p.m. UTC
v2: Removed the patches for the MAINTAINERS file, as they have already
been picked up by the powerpc tree.

v3: Added signed tag to the pull request.

Hi arm-soc maintainers,

As Scott has left NXP, he agreed to transfer the maintainership of
drivers/soc/fsl to me.  Previously most of the soc drivers were going
through the powerpc tree as they were only used/tested on Power-based
SoCs.  Going forward new changes will be mostly related to arm/arm64
SoCs, and I would prefer them to go through the arm-soc tree.

This pull request includes updates to the QMAN/BMAN drivers to make
them work on the arm/arm64 architectures in addition to the power
architecture.

DPAA (Data Path Acceleration Architecture) is a set of hardware
components used on some FSL/NXP QorIQ networking SoCs. It provides the
infrastructure for simplified sharing of networking interfaces and
accelerators by multiple CPU cores, as well as the accelerators
themselves.  QMan (Queue Manager) and BMan (Buffer Manager) are
infrastructural components within the DPAA framework.  They are used to
manage queues and buffers for the various I/O interfaces and hardware
accelerators.

More information can be found at:
http://www.nxp.com/products/microcontrollers-and-processors/power-architecture-processors/qoriq-platforms/data-path-acceleration:QORIQ_DPAA

Regards,
Leo




The following changes since commit 5ed02dbb497422bf225783f46e6eadd237d23d6b:

  Linux 4.12-rc3 (2017-05-28 17:20:53 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/leo/linux.git tags/qbman-for-arm-soc

for you to fetch changes up to 35a875f231a254b7a65f8bed0c4bf4fc58586ea0:

  soc/fsl/qbman: Add missing headers on ARM (2017-06-08 15:11:00 -0500)

----------------------------------------------------------------
Adding Q/BMAN support to ARM

----------------------------------------------------------------
Claudiu Manoil (2):
      soc/fsl/qbman: Drop L1_CACHE_BYTES compile time check
      soc/fsl/qbman: Add missing headers on ARM

Madalin Bucur (3):
      soc/fsl/qbman: Drop set/clear_bits usage
      soc/fsl/qbman: define new revision QMAN_REV32
      soc/fsl/qbman: define register offsets on ARM/ARM64

Roy Pledge (4):
      soc/fsl/qbman: Use shared-dma-pool for BMan private memory allocations
      soc/fsl/qbman: Use shared-dma-pool for QMan private memory allocations
      dt-bindings: soc/fsl: Update reserved memory binding for QBMan
      soc/fsl/qbman: Use different ioremap() variants for ARM/PPC

Valentin Rothberg (1):
      soc/fsl/qbman: Fix CONFIG_ARM32 typo

 Documentation/devicetree/bindings/soc/fsl/bman.txt |  12 +-
 Documentation/devicetree/bindings/soc/fsl/qman.txt |  26 ++--
 drivers/soc/fsl/qbman/bman.c                       |  24 +++-
 drivers/soc/fsl/qbman/bman_ccsr.c                  |  35 +++++-
 drivers/soc/fsl/qbman/bman_portal.c                |  12 +-
 drivers/soc/fsl/qbman/bman_priv.h                  |   3 +
 drivers/soc/fsl/qbman/dpaa_sys.h                   |   8 +-
 drivers/soc/fsl/qbman/qman.c                       |  46 ++++++-
 drivers/soc/fsl/qbman/qman_ccsr.c                  | 140 ++++++++++++++++-----
 drivers/soc/fsl/qbman/qman_portal.c                |  12 +-
 drivers/soc/fsl/qbman/qman_priv.h                  |   5 +-
 drivers/soc/fsl/qbman/qman_test.h                  |   2 -
 12 files changed, 261 insertions(+), 64 deletions(-)

Comments

Arnd Bergmann June 23, 2017, 2:56 p.m. UTC | #1
On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
> [...]
>
> DPAA (Data Path Acceleration Architecture) is a set of hardware
> components used on some FSL/NXP QorIQ Networking SoCs, it provides the
> infrastructure to support simplified sharing of networking interfaces
> and accelerators by multiple CPU cores, and the accelerators
> themselves.  The QMan(Queue Manager) and BMan(Buffer Manager) are
> infrastructural components within the DPAA framework.  They are used to
> manage queues and buffers for various I/O interfaces, hardware
> accelerators.
>
> More information can be found via link:
> http://www.nxp.com/products/microcontrollers-and-processors/power-architecture-processors/qoriq-platforms/data-path-acceleration:QORIQ_DPAA

Hi Leo,

sorry for taking you through yet another revision, but I have two
more points here:

1. Please add a tag description whenever you create a signed tag. The
description is what ends up in the git history, and if there is none, I have
to think of something myself. In this case, the text above seems
roughly appropriate, so I first copied it into the commit log, but then
noticed the second issue:

2. I know we have discussed in the past the unusual way this driver
accesses MMIO registers, using ioremap_wc() to map them and then
manually flushing the caches to store the cache contents into the MMIO
registers.  What I don't know is whether there was any conclusion on
whether this is actually allowed by the architecture, or at least by the
chip, based on implementation-specific features that make it work even
where the architecture doesn't guarantee it.

Can I have an Ack from the architecture maintainers (Russell, Catalin,
Will) on the use of these architecture specific interfaces?

static inline void dpaa_flush(void *p)
{
#ifdef CONFIG_PPC
        flush_dcache_range((unsigned long)p, (unsigned long)p+64);
#elif defined(CONFIG_ARM32)
        __cpuc_flush_dcache_area(p, 64);
#elif defined(CONFIG_ARM64)
        __flush_dcache_area(p, 64);
#endif
}
#define dpaa_invalidate(p) dpaa_flush(p)
#define dpaa_zero(p) memset(p, 0, 64)
static inline void dpaa_touch_ro(void *p)
{
#if (L1_CACHE_BYTES == 32)
        prefetch(p+32);
#endif
        prefetch(p);
}

As described by Leo, the code is already there and is actively used on
powerpc; his pull request is merely for enabling the driver on ARM and
ARM64.

      Arnd
Mark Rutland June 23, 2017, 3:22 p.m. UTC | #2
On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
> [...]
> 
> 2. I know we have discussed the unusual way this driver accesses MMIO
> registers in the past, using ioremap_wc() to map them and then manually
> flushing the caches to store the cache contents into the MMIO registers.
> What I don't know is whether there was any conclusion on this topic whether
> this is actually allowed by the architecture or at least the chip, based on
> implementation-specific features that make it work even when the architecture
> doesn't guarantee it.

From prior discussions, my understanding was that the region in question
was memory reserved for the device, rather than MMIO registers.

The prior discussions on that front were largely to do with the
shareability of that memory, which is an orthogonal concern.

If these are actually MMIO registers, a Device memory type must be used,
rather than a Normal memory type. There are a number of things that
could go wrong due to relaxations permitted for Normal memory, such as
speculative reads, the potential set of access sizes, memory
transactions that the endpoint might not understand, etc.

> Can I have an Ack from the architecture maintainers (Russell, Catalin,
> Will) on the use of these architecture specific interfaces?
> 
> static inline void dpaa_flush(void *p)
> {
> #ifdef CONFIG_PPC
>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
> #elif defined(CONFIG_ARM32)
>         __cpuc_flush_dcache_area(p, 64);
> #elif defined(CONFIG_ARM64)
>         __flush_dcache_area(p, 64);
> #endif
> }

Assuming this is memory, why can't the driver use the DMA APIs to handle
this without reaching into arch-internal APIs?

Thanks,
Mark.
Russell King - ARM Linux admin June 23, 2017, 3:38 p.m. UTC | #3
On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
> [...]
> 
> 2. I know we have discussed the unusual way this driver accesses MMIO
> registers in the past, using ioremap_wc() to map them and then manually
> flushing the caches to store the cache contents into the MMIO registers.
> What I don't know is whether there was any conclusion on this topic whether
> this is actually allowed by the architecture or at least the chip, based on
> implementation-specific features that make it work even when the architecture
> doesn't guarantee it.
> 
> Can I have an Ack from the architecture maintainers (Russell, Catalin,
> Will) on the use of these architecture specific interfaces?
> 
> static inline void dpaa_flush(void *p)
> {
> #ifdef CONFIG_PPC
>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
> #elif defined(CONFIG_ARM32)
>         __cpuc_flush_dcache_area(p, 64);
> #elif defined(CONFIG_ARM64)
>         __flush_dcache_area(p, 64);
> #endif
> }
> #define dpaa_invalidate(p) dpaa_flush(p)
> #define dpaa_zero(p) memset(p, 0, 64)
> static inline void dpaa_touch_ro(void *p)
> {
> #if (L1_CACHE_BYTES == 32)
>         prefetch(p+32);
> #endif
>         prefetch(p);
> }
> 
> As described by Leo, the code is already there and is actively used
> on powerpc, his pull request is merely for enabling the driver on ARM
> and ARM64.

Well, as arch maintainer, "code already there" matters not one bit when
converting it to work on other architectures.

For the record, I'm completely dismayed that there seems to be:

(a) the thought of getting changes into the kernel that use arch-private
    interfaces without review by architecture maintainers.

    Here, I'm specifically talking about the work that was merged through
    PowerPC trees adding rather broken ARM support to this driver.

(b) the lack of Cc to ARM and ARM64 architecture maintainers of the patch
    set now being asked to be merged through Arnd.

(c) the thought that it's acceptable to use architecture private interfaces
    in driver code that really have no business being there.

(d) the lack of explanation why it's necessary to poke about using
    architecture private interfaces.

Now, from looking at the code already merged in mainline, I think that
this driver really has no business touching these arch private
interfaces.

I think, rather than myself and Will/Catalin independently spending time
trying to understand what's going on with this code, what we need is for
NXP people to document in detail what they are doing that requires all
this special treatment and why they need to poke about in stuff that's
architecture private before this can be merged.

Basically, I want a proper understanding of this code to fully
understand why it's using arch private interfaces, and what the problem
is with using the standard interfaces provided for drivers.
Russell King - ARM Linux admin June 23, 2017, 3:55 p.m. UTC | #4
On Fri, Jun 23, 2017 at 04:22:27PM +0100, Mark Rutland wrote:
> The prior discussions on that front were largely to do with the
> shareability of that memory, which is an orthogonal concern.
> 
> If these are actually MMIO registers, a Device memory type must be used,
> rather than a Normal memory type. There are a number of things that
> could go wrong due to relaxations permitted for Normal memory, such as
> speculative reads, the potential set of access sizes, memory
> transactions that the endpoint might not understand, etc.

Well, we have:

        dma_wmb();
        mc->cr->_ncw_verb = myverb | mc->vbit;
        dpaa_flush(mc->cr);

and we also have:

        mc->cr = portal->addr.ce + BM_CL_CR;
        mc->rridx = (__raw_readb(&mc->cr->_ncw_verb) & BM_MCC_VERB_VBIT) ?
                    0 : 1;

and:

        dpaa_zero(mc->cr);

which is just another name for a 64-byte memset() which doesn't have
any flushing.

Note that this is memory obtained from the ioremap() family of functions,
so all of these pointers _should_ have __iomem annotations (but don't.)
However, if this isn't iomem, then it shouldn't be using the ioremap()
family of functions, and should be using memremap() instead.

Without really having an understanding of what this chunk of code is
doing (and abuse like the above makes it harder than necessary) no one
can really say from just reading through the code... which brings up
another problem - that of long-term maintainability.
Mark Rutland June 23, 2017, 4:23 p.m. UTC | #5
On Fri, Jun 23, 2017 at 04:55:40PM +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 04:22:27PM +0100, Mark Rutland wrote:
> [...]

Sure; agreed on all counts.

I was (evidently poorly) attempting to clarify what prior discussions
had (and hadn't) covered.

Mark.
Roy Pledge June 23, 2017, 6:58 p.m. UTC | #6
On 6/23/2017 11:23 AM, Mark Rutland wrote:
> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>> [...]
>>
>> 2. I know we have discussed the unusual way this driver accesses MMIO
>> registers in the past, using ioremap_wc() to map them and then manually
>> flushing the caches to store the cache contents into the MMIO registers.
>> What I don't know is whether there was any conclusion on this topic whether
>> this is actually allowed by the architecture or at least the chip, based on
>> implementation-specific features that make it work even when the architecture
>> doesn't guarantee it.
> From prior discussions, my understanding was that the region in question
> was memory reserved for the device, rather than MMIO registers.
>
> The prior discussions on that front were largely to do with the
> shareability of that memory, which is an orthogonal concern.
>
> If these are actually MMIO registers, a Device memory type must be used,
> rather than a Normal memory type. There are a number of things that
> could go wrong due to relaxations permitted for Normal memory, such as
> speculative reads, the potential set of access sizes, memory
> transactions that the endpoint might not understand, etc.
The memory for this device (what we refer to as Software Portals) has
two regions. One region is MMIO registers, which we access using the
readl()/writel() APIs.

The second region is what we refer to as the cacheable area.  This is
memory implemented as part of the QBMan device, and the device accepts
cacheline-sized transactions from the interconnect. This is needed
because the descriptors read and written by software are fairly large
(larger than 64 bits but smaller than a cacheline), and in order to meet
the data rates of our high-speed Ethernet ports and other accelerators
we need the CPU to be able to form the descriptor in a CPU cache and
flush it safely when the device is ready to consume it.  The system
architect and I have had many discussions with our design counterparts
in ARM to ensure that our interactions with the core/interconnect/device
are safe for the set of CPU cores and interconnects we integrate into
our products.

I understand there are concerns regarding our shareability proposal
(which is not enabled in this patch set). We have been collecting some
information and talking to ARM, and I do intend to address these
concerns, but I was delaying confusing things further until this basic
support gets accepted and merged.
>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>> Will) on the use of these architecture specific interfaces?
>>
>> static inline void dpaa_flush(void *p)
>> {
>> #ifdef CONFIG_PPC
>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>> #elif defined(CONFIG_ARM32)
>>         __cpuc_flush_dcache_area(p, 64);
>> #elif defined(CONFIG_ARM64)
>>         __flush_dcache_area(p, 64);
>> #endif
>> }
> Assuming this is memory, why can't the driver use the DMA APIs to handle
> this without reaching into arch-internal APIs?
I agree this isn't pretty - I think we could use
dma_sync_single_for_device() here, but I am concerned it will be
expensive and hurt performance significantly. The DMA APIs have a lot of
branches. At some point we were doing 'dc cvac' here, and even switching
to the above calls caused a measurable drop in throughput at high frame
rates.
Roy Pledge June 23, 2017, 7:25 p.m. UTC | #7
On 6/23/2017 11:39 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>> [...]
> Well, as arch maintainer, "code already there" matters not one bit when
> converting it to work on other architectures.
>
> For the record, I'm completely dismayed that there seems to be:
>
> (a) the thought of getting changes into the kernel that use arch-private
>     interfaces without review by architecture maintainers.
>
>     Here, I'm specifically talking about the work that was merged through
>     PowerPC trees adding rather broken ARM support to this driver.
The PowerPC support was added to support PowerPC - this device existed
on PowerPC parts for many years before we started developing ARM cores.
We have been trying to "catch up" on our upstreaming efforts for this
family of parts, and PowerPC was done first primarily because there are
more parts available at this point in time.
>
> (b) the lack of Cc to ARM and ARM64 architecture maintainers of the patch
>     set now being asked to be merged through Arnd.
I apologize for the lack of CC - many of us are new to the ARM
community, and learning who should be CC'd on requests takes time. The
patch set was originally posted on the ARM list in January of this year,
and we have incorporated feedback from many people, including Robin
Murphy and yourself.  I certainly want and value the feedback from the
community.  The reason we requested that the patches be merged is that
the last submission had no feedback after several rounds of review
(starting in January).
>
> (c) the thought that it's acceptable to use architecture private interfaces
>     in driver code that really have no business being there.
>
> (d) the lack of explanation why it's necessary to poke about using
>     architecture private interfaces.
>
> Now, from looking at the code already merged in mainline, I think that
> this driver really has no business touching these arch private
> interfaces.
>
> I think, rather than myself and Will/Catalin independently spending time
> trying to understand what's going on with this code, what we need is for
> NXP people to document in detail what they are doing that requires all
> this special treatment and why they need to poke about in stuff that's
> architecture private before this can be merged.
>
> Basically, I want a proper understanding of this code to fully
> understand why it's using arch private interfaces, and what the problem
> is with using the standard interfaces provided for drivers.
The requested pull removes most of the arch-specific code, so I would
recommend looking at the driver with those patches applied.  I will
enumerate here all the places in the final driver that check for
CONFIG_ARM or CONFIG_PPC:

1) bman.c: Some register offsets in the device changed when we moved
the IP to ARM; we use CONFIG_ARM || CONFIG_ARM64 to define them at
different offsets for each platform.
2) bman_portal.c: On PPC we map the SW portals as
cacheable/non-coherent. This is required for those devices, as they will
cause a trap if the memory isn't mapped that way.  When the device was
moved to ARM we removed that limitation.  Since mapping the device as
non-shareable is controversial, the code is using ioremap_wc() right
now.  This code will potentially change once there is more discussion,
but for now we distinguish between the two architectures.
3) dpaa_sys.h: I mentioned dpaa_flush() in a previous message.  This
can likely be changed, but it has performance implications.
4) qman.c: The same register offset business as point 1, but for QMan.
5) qman_portal.c: The same non-shareable difference exists as mentioned
in point 2.

I'm not aware of any other code that is architecture-dependent. The only
architecture-private interface is the flush(), and I'm willing to look
into how much impact changing it will have if it is a showstopper.
Roy Pledge June 23, 2017, 7:39 p.m. UTC | #8
On 6/23/2017 11:56 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 04:22:27PM +0100, Mark Rutland wrote:
>> The prior discussions on that front were largely to do with the
>> shareability of that memory, which is an orthogonal concern.
>>
>> If these are actually MMIO registers, a Device memory type must be used,
>> rather than a Normal memory type. There are a number of things that
>> could go wrong due to relaxations permitted for Normal memory, such as
>> speculative reads, the potential set of access sizes, memory
>> transactions that the endpoint might not understand, etc.
> Well, we have:
>
>         dma_wmb();
>         mc->cr->_ncw_verb = myverb | mc->vbit;
>         dpaa_flush(mc->cr);
>
> and we also have:
>
>         mc->cr = portal->addr.ce + BM_CL_CR;
>         mc->rridx = (__raw_readb(&mc->cr->_ncw_verb) & BM_MCC_VERB_VBIT) ?
>                     0 : 1;
This is an inconsistency for sure - I will look into this.
>
> and:
>
>         dpaa_zero(mc->cr);
>
> which is just another name for a 64-byte memset() which doesn't have
> any flushing.
This is an artifact of a PPC-specific optimization we did in the past
but have since removed.
We should just replace these calls with memset() so things are clear.
>
> Note that this is memory obtained from the ioremap() family of functions,
> so all of these pointers _should_ have __iomem annotations (but don't.)
> However, if this isn't iomem, then it shouldn't be using the ioremap()
> family of functions, and should be using memremap() instead.
I was following a previous thread with Catalin and some Cavium folks
last month discussing ioremap_wc() vs memremap().  I think you are
correct in suggesting we should be using memremap() here. I had made a
note to explore this further but hadn't had time to investigate.  I will
move this up on my list.
>
> Without really having an understanding of what this chunk of code is
> doing (and abuse like the above makes it harder than necessary) no one
> can really say from just reading through the code... which brings up
> another problem - that of long-term maintainability.
Is this a request for more comments? I agree that looking at the code
here is difficult without understanding how the device interface works
(which is admittedly somewhat unique). I can try to add some comments in
the areas that are likely not obvious to anyone who hasn't read the
device manual.
>
Russell King - ARM Linux admin June 24, 2017, 12:10 p.m. UTC | #9
On Fri, Jun 23, 2017 at 06:58:17PM +0000, Roy Pledge wrote:
> On 6/23/2017 11:23 AM, Mark Rutland wrote:
> > On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
> >> static inline void dpaa_flush(void *p)
> >> {
> >> #ifdef CONFIG_PPC
> >>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
> >> #elif defined(CONFIG_ARM)
> >>         __cpuc_flush_dcache_area(p, 64);
> >> #elif defined(CONFIG_ARM64)
> >>         __flush_dcache_area(p, 64);
> >> #endif
> >> }
> > Assuming this is memory, why can't the driver use the DMA APIs to handle
> > this without reaching into arch-internal APIs?
> I agree this isn't pretty - I think we could use
> dma_sync_single_for_device() here but I am concerned it will be
> expensive and hurt performance significantly. The DMA APIs have a lot of
> branches. At some point we were doing 'dc cvac' here and even switching
> to the above calls caused a measurable drop in throughput at high frame
> rates.

Well...

__cpuc_flush_dcache_area() is used to implement flush_dcache_page() and
flush_anon_page().  The former is about ensuring coherency between
multiple mappings of a kernel page cache page and userspace.  The latter
is about ensuring coherency between an anonymous page and userspace.

Currently, on ARMv7, this is implemented using "mcr p15, 0, r0, c7, c14, 1"
but we _could_ decide that is too heavy for these interfaces, and instead
switch to a lighter cache flush if one were available in a future
architecture revision (since the use case of this only requires it to
flush to the point of unification, not to the point of coherence.)

The overall effect is that changing the behaviour of this would introduce
a regression into your driver, which would have to be reverted - and that
makes my job as architecture maintainer difficult.

It may make sense to explicitly introduce a "flush data cache to point
of coherence" function - we already have gicv3 and kvm wanting this, and
doing it the right way, via:

#define gic_flush_dcache_to_poc(a,l)    __cpuc_flush_dcache_area((a), (l))
#define kvm_flush_dcache_to_poc(a,l)     __cpuc_flush_dcache_area((a), (l))

So, how about we introduce something like:

	void flush_dcache_to_poc(void *addr, size_t size)

which is defined to ensure that data is visible to the point of coherence
on all architectures?
Arnd Bergmann June 27, 2017, 7:17 a.m. UTC | #10
On Fri, Jun 23, 2017 at 8:58 PM, Roy Pledge <roy.pledge@nxp.com> wrote:
> On 6/23/2017 11:23 AM, Mark Rutland wrote:
>> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>>>
>>> 2. I know we have discussed the unusual way this driver accesses MMIO
>>> registers in the past, using ioremap_wc() to map them and then manually
>>> flushing the caches to store the cache contents into the MMIO registers.
>>> What I don't know is whether there was any conclusion on whether
>>> this is actually allowed by the architecture or at least the chip, based on
>>> implementation-specific features that make it work even when the architecture
>>> doesn't guarantee it.
>> From prior discussions, my understanding was that the region in question
>> was memory reserved for the device, rather than MMIO registers.

Ok.

>> The prior discussion on that front was largely to do with the
>> shareability of that memory, which is an orthogonal concern.
>>
>> If these are actually MMIO registers, a Device memory type must be used,
>> rather than a Normal memory type. There are a number of things that
>> could go wrong due to relaxations permitted for Normal memory, such as
>> speculative reads, the potential set of access sizes, memory
>> transactions that the endpoint might not understand, etc.
> The memory for this device (what we refer to as Software Portals) has 2
> regions. One region is MMIO registers and we access it using
> readl()/writel() APIs.

Ok, good.

> The second region is what we refer to as the cacheable area.  This is
> memory implemented as part of the QBMan device and the device accepts
> cacheline sized transactions from the interconnect. This is needed
> because the descriptors read and written by SW are fairly large (larger
> than 64 bits but smaller than a cacheline) and in order to meet the data
> rates of our high-speed Ethernet ports and other accelerators we need the
> CPU to be able to form the descriptor in a CPU cache and flush it safely
> when the device is ready to consume it.  The system architect and I
> have had many discussions with our design counterparts in ARM to ensure
> that our interaction with the core/interconnect/device are safe for the
> set of CPU cores and interconnects we integrate into our products.
>
> I understand there are concerns regarding our shareability proposal
> (which is not enabled in this patch set). We have been collecting some
> information and talking to ARM and I do intend to address these concerns
> but I was delaying confusing things more until this basic support gets
> accepted and merged.

Can you summarize what the constraints are that we have for mapping
this area? E.g. why can't we just have an uncached write-combining
mapping and skip the flushes?

>>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>>> Will) on the use of these architecture specific interfaces?
>>>
>>> static inline void dpaa_flush(void *p)
>>> {
>>> #ifdef CONFIG_PPC
>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>>> #elif defined(CONFIG_ARM)
>>>         __cpuc_flush_dcache_area(p, 64);
>>> #elif defined(CONFIG_ARM64)
>>>         __flush_dcache_area(p, 64);
>>> #endif
>>> }
>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>> this without reaching into arch-internal APIs?
>
> I agree this isn't pretty - I think we could use
> dma_sync_single_for_device() here but I am concerned it will be
> expensive and hurt performance significantly. The DMA APIs have a lot of
> branches. At some point we were doing 'dc cvac' here and even switching
> to the above calls caused a measurable drop in throughput at high frame
> rates.

I'd suggest we start out by converting this to some standard API
first, regardless of performance, to get it working properly with code
that should be maintainable at least, and make progress with your
hardware enablement.

In parallel, we can discuss what kind of API we actually need and
how to fit that into the existing frameworks. This may take a while
as it depends on your other patch set, and perhaps input from
other parties that may have similar requirements. I could imagine
that e.g. Marvell, Cavium or Annapurna (Amazon) do something
similar in their hardware, so if we find that we need a new API,
we should define it in a way that works for everyone.

        Arnd
Russell King - ARM Linux admin June 27, 2017, 8:17 a.m. UTC | #11
On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
> I'd suggest we start out by converting this to some standard API
> first, regardless of performance, to get it working properly with code
> that should be maintainable at least, and make progress with your
> hardware enablement.

I think Roy is rather confused right now about what the driver does
and doesn't do.

With his patches, two areas are mapped on ARM:

1. The addr.ci area is mapped using ioremap().  This is a device mapping,
   and is used by the "Cache-inhibited register access." functions.

2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
   including the verb accesses go through this region.  This is a
   memory, non-cacheable mapping.

The addr.ce region is the area that has a mixture of memset(), MMIO
accessors, direct CPU accesses, cache flushes and prefetches used on
it.

As it is marked non-cacheable, the cache flushes are a pure waste of
CPU cycles.  Prefetching to a non-cacheable area doesn't make much
sense either.

So, I think for ARM: just kill the cache flushing - it's doing nothing
useful.  The prefetching could also be removed to recover a
few more CPU cycles.
Arnd Bergmann June 27, 2017, 9:05 a.m. UTC | #12
On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
>> I'd suggest we start out by converting this to some standard API
>> first, regardless of performance, to get it working properly with code
>> that should be maintainable at least, and make progress with your
>> hardware enablement.
>
> I think Roy is rather confused right now about what the driver does
> and doesn't do.
>
> With his patches, two areas are mapped on ARM:
>
> 1. The addr.ci area is mapped using ioremap().  This is a device mapping,
>    and is used by the "Cache-inhibited register access." functions.
>
> 2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
>    including the verb accesses go through this region.  This is a
>    memory, non-cacheable mapping.
>
> The addr.ce region is the area that has a mixture of memset(), MMIO
> accessors, direct CPU accesses, cache flushes and prefetches used on
> it.
>
> As it is marked non-cacheable, the cache flushes are a pure waste of
> CPU cycles.  Prefetching to a non-cacheable area doesn't make much
> sense either.
>
> So, I think for ARM: just kill the cache flushing - it's doing nothing
> useful.  The prefetching could also be removed to recover a
> few more CPU cycles.

Right, that sounds good. I wondered about the ioremap_wc() earlier
and thought I must have missed something as it didn't make any sense
to me, but apparently that was because the code really doesn't make
sense ;-).

It would be good to get consistent __iomem annotations on it too,
so we can see exactly where the MMIO interfaces are used incorrectly,
and find a fix for that.

I wonder if doing memcpy_toio() on the write-combining mapping
will do what Roy wants and end up sending a single bus transaction
for one descriptor most of the time, while fitting in with the ioremap_wc()
interface and its __iomem pointers.

       Arnd
Russell King - ARM Linux admin June 27, 2017, 9:10 a.m. UTC | #13
On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
> >> I'd suggest we start out by converting this to some standard API
> >> first, regardless of performance, to get it working properly with code
> >> that should be maintainable at least, and make progress with your
> >> hardware enablement.
> >
> > I think Roy is rather confused right now about what the driver does
> > and doesn't do.
> >
> > With his patches, two areas are mapped on ARM:
> >
> > 1. The addr.ci area is mapped using ioremap().  This is a device mapping,
> >    and is used by the "Cache-inhibited register access." functions.
> >
> > 2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
> >    including the verb accesses go through this region.  This is a
> >    memory, non-cacheable mapping.
> >
> > The addr.ce region is the area that has a mixture of memset(), MMIO
> > accessors, direct CPU accesses, cache flushes and prefetches used on
> > it.
> >
> > As it is marked non-cacheable, the cache flushes are a pure waste of
> > CPU cycles.  Prefetching to a non-cacheable area doesn't make much
> > sense either.
> >
> > So, I think for ARM: just kill the cache flushing - it's doing nothing
> > useful.  The prefetching could also be removed to recover a
> > few more CPU cycles.
> 
> Right, that sounds good. I wondered about the ioremap_wc() earlier
> and thought I must have missed something as it didn't make any sense
> to me, but apparently that was because the code really doesn't make
> sense ;-).
> 
> It would be good to get consistent __iomem annotations on it too,
> so we can see exactly where the MMIO interfaces are used incorrectly,
> and find a fix for that.
> 
> I wonder if doing memcpy_toio() on the write-combining mapping
> will do what Roy wants and end up sending a single bus transaction
> for one descriptor most of the time, while fitting in with the ioremap_wc()
> interface and its __iomem pointers.

Either the CE region needs __iomem pointers and proper accessors, or
it needs to be using memremap(offset, size, MEMREMAP_WC) and not use
the iomem accessors.

The CI region should definitely have __iomem annotations though.
Arnd Bergmann June 27, 2017, 10:35 a.m. UTC | #14
On Tue, Jun 27, 2017 at 11:10 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
>> <linux@armlinux.org.uk> wrote:
>>
>> I wonder if doing memcpy_toio on the write-combining mapping
>> will do what Roy wants and and up sending a single bus transaction
>> for one descriptor most of the time, while fitting in with the ioremap_wc()
>> interface and its __iomem pointers.
>
> Either the CE region needs __iomem pointers and proper accessors, or
> it needs to be using memremap(offset, size, MEMREMAP_WC) and not use
> the iomem accessors.

Right, memremap() would work here as well. If there are no other constraints,
using ioremap_wc and __iomem pointers would have the advantage of giving
us a little more type safety, but it depends on what operations we actually
need to do on the data.

        Arnd
Roy Pledge June 27, 2017, 6:36 p.m. UTC | #15
On 6/24/2017 8:10 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 06:58:17PM +0000, Roy Pledge wrote:
>> On 6/23/2017 11:23 AM, Mark Rutland wrote:
>>> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>>>> static inline void dpaa_flush(void *p)
>>>> {
>>>> #ifdef CONFIG_PPC
>>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
> >>>> #elif defined(CONFIG_ARM)
>>>>         __cpuc_flush_dcache_area(p, 64);
>>>> #elif defined(CONFIG_ARM64)
>>>>         __flush_dcache_area(p, 64);
>>>> #endif
>>>> }
>>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>>> this without reaching into arch-internal APIs?
>> I agree this isn't pretty - I think we could use
>> dma_sync_single_for_device() here but I am concerned it will be
>> expensive and hurt performance significantly. The DMA APIs have a lot of
>> branches. At some point we were doing 'dc cvac' here and even switching
>> to the above calls caused a measurable drop in throughput at high frame
>> rates.
>
> Well...
>
> __cpuc_flush_dcache_area() is used to implement flush_dcache_page() and
> flush_anon_page().  The former is about ensuring coherency between
> multiple mappings of a kernel page cache page and userspace.  The latter
> is about ensuring coherency between an anonymous page and userspace.
>
> Currently, on ARMv7, this is implemented using "mcr p15, 0, r0, c7, c14, 1"
> but we _could_ decide that is too heavy for these interfaces, and instead
> switch to a lighter cache flush if one were available in a future
> architecture revision (since the use case of this only requires it to
> flush to the point of unification, not to the point of coherence.)
>
> The overall effect is that changing the behaviour of this would introduce
> a regression into your driver, which would have to be reverted - and that
> makes my job as architecture maintainer difficult.
>
> It may make sense to explicitly introduce a "flush data cache to point
> of coherence" function - we already have gicv3 and kvm wanting this, and
> doing it the right way, via:
>
> #define gic_flush_dcache_to_poc(a,l)    __cpuc_flush_dcache_area((a), (l))
> #define kvm_flush_dcache_to_poc(a,l)     __cpuc_flush_dcache_area((a), (l))
>
> So, how about we introduce something like:
>
> 	void flush_dcache_to_poc(void *addr, size_t size)
>
> which is defined to ensure that data is visible to the point of coherence
> on all architectures?
>
This would be very useful and would help with this issue. To be
completely honest, an invalidate variant would be useful for us as well,
as we currently use flush in cases where we really just need to
invalidate. The extra unneeded write-back to the device can show a
small performance cost. I understand that exposing invalidate to users
is potentially dangerous, since you need to be 100% sure you're willing
to throw away any data in that cacheline, so I won't be surprised if
people aren't willing to expose an API for this.
Roy Pledge June 27, 2017, 6:59 p.m. UTC | #16
On 6/27/2017 3:17 AM, Arnd Bergmann wrote:
> On Fri, Jun 23, 2017 at 8:58 PM, Roy Pledge <roy.pledge@nxp.com> wrote:
>> On 6/23/2017 11:23 AM, Mark Rutland wrote:
>>> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>>>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>>>>
>>>> 2. I know we have discussed the unusual way this driver accesses MMIO
>>>> registers in the past, using ioremap_wc() to map them and then manually
>>>> flushing the caches to store the cache contents into the MMIO registers.
>>>> What I don't know is whether there was any conclusion on whether
>>>> this is actually allowed by the architecture or at least the chip, based on
>>>> implementation-specific features that make it work even when the architecture
>>>> doesn't guarantee it.
>>> From prior discussions, my understanding was that the region in question
>>> was memory reserved for the device, rather than MMIO registers.
>
> Ok.
>
>>> The prior discussion on that front was largely to do with the
>>> shareability of that memory, which is an orthogonal concern.
>>>
>>> If these are actually MMIO registers, a Device memory type must be used,
>>> rather than a Normal memory type. There are a number of things that
>>> could go wrong due to relaxations permitted for Normal memory, such as
>>> speculative reads, the potential set of access sizes, memory
>>> transactions that the endpoint might not understand, etc.
>> The memory for this device (what we refer to as Software Portals) has 2
>> regions. One region is MMIO registers and we access it using
>> readl()/writel() APIs.
>
> Ok, good.
>
>> The second region is what we refer to as the cacheable area.  This is
>> memory implemented as part of the QBMan device and the device accepts
>> cacheline sized transactions from the interconnect. This is needed
>> because the descriptors read and written by SW are fairly large (larger
>> than 64 bits but smaller than a cacheline) and in order to meet the data
>> rates of our high-speed Ethernet ports and other accelerators we need the
>> CPU to be able to form the descriptor in a CPU cache and flush it safely
>> when the device is ready to consume it.  The system architect and I
>> have had many discussions with our design counterparts in ARM to ensure
>> that our interaction with the core/interconnect/device are safe for the
>> set of CPU cores and interconnects we integrate into our products.
>>
>> I understand there are concerns regarding our shareability proposal
>> (which is not enabled in this patch set). We have been collecting some
>> information and talking to ARM and I do intend to address these concerns
>> but I was delaying confusing things more until this basic support gets
>> accepted and merged.
>
> Can you summarize what the constraints are that we have for mapping
> this area? E.g. why can't we just have an uncached write-combining
> mapping and skip the flushes?

For ARM/ARM64 we can (and currently do) map as uncached write-combining,
but for PPC we cannot do this as the device only accepts cacheline-sized
transactions in the cache-enabled area. The device requirements were
relaxed for ARM since the interconnect had different behavior than the
PPC interconnect we were using. The goal for ARM is to map this memory
as non-shareable, as it gives a significant performance increase over
write-combining. We intend to work on acceptance of this mode more once
basic support is in.

The flushes in the code for ARM can be removed but the PPC calls are 
needed.  In the dpaa_flush() code below I think the ARM variants could 
be removed for now (I would want to run some tests of course).

>
>>>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>>>> Will) on the use of these architecture specific interfaces?
>>>>
>>>> static inline void dpaa_flush(void *p)
>>>> {
>>>> #ifdef CONFIG_PPC
>>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>>>> #elif defined(CONFIG_ARM)
>>>>         __cpuc_flush_dcache_area(p, 64);
>>>> #elif defined(CONFIG_ARM64)
>>>>         __flush_dcache_area(p, 64);
>>>> #endif
>>>> }
>>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>>> this without reaching into arch-internal APIs?
>>
>> I agree this isn't pretty - I think we could use
>> dma_sync_single_for_device() here but I am concerned it will be
>> expensive and hurt performance significantly. The DMA APIs have a lot of
>> branches. At some point we were doing 'dc cvac' here and even switching
>> to the above calls caused a measurable drop in throughput at high frame
>> rates.
>
> I'd suggest we start out by converting this to some standard API
> first, regardless of performance, to get it working properly with code
> that should be maintainable at least, and make progress with your
> hardware enablement.
>
> In parallel, we can discuss what kind of API we actually need and
> how to fit that into the existing frameworks. This may take a while
> as it depends on your other patch set, and perhaps input from
> other parties that may have similar requirements. I could imagine
> that e.g. Marvell, Cavium or Annapurna (Amazon) do something
> similar in their hardware, so if we find that we need a new API,
> we should define it in a way that works for everyone.

I'm happy that you're open to extending the API for these types of
devices. I will reply to another email shortly, but I think using the
memremap() API for the cache-enabled area is a good change for this
device and will address the __iomem concerns.

>
>         Arnd
>
Roy Pledge June 27, 2017, 7:03 p.m. UTC | #17
On 6/27/2017 6:35 AM, Arnd Bergmann wrote:
> On Tue, Jun 27, 2017 at 11:10 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
>> On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
>>> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
>>> <linux@armlinux.org.uk> wrote:
>>>
>>> I wonder if doing memcpy_toio on the write-combining mapping
>>> will do what Roy wants and and up sending a single bus transaction
>>> for one descriptor most of the time, while fitting in with the ioremap_wc()
>>> interface and its __iomem pointers.
>>
>> Either the CE region needs __iomem pointers and proper accessors, or
>> it needs to be using memremap(offset, size, MEMREMAP_WC) and not use
>> the iomem accessors.
>
> Right, memremap() would work here as well. If there are no other constraints,
> using ioremap_wc and __iomem pointers would have the advantage of giving
> us a little more type safety, but it depends on what operations we actually
> need to do on the data.

I explained a bit in other threads, so this is my plan for a respin:
1) Map the cache-enabled areas with memremap()
2) Make sure __iomem is being respected for the cache-inhibited areas
3) Get rid of the flush()/prefetch() ARM/ARM64 implementations - the
calls will only have implementations for PPC.

Roy


>
>         Arnd
>