Message ID: HE1PR0401MB17071A7853695DB6546FE28C8FC50@HE1PR0401MB1707.eurprd04.prod.outlook.com
State: New
On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>
> v2: Removed the patches for MAINTAINERS file as they are already picked
> up by powerpc tree.
>
> v3: Added signed tag to the pull request.
>
> Hi arm-soc maintainers,
>
> As Scott has left NXP, he agreed to transfer the maintainership of
> drivers/soc/fsl to me. Previously most of the soc drivers were going
> through the powerpc tree as they were only used/tested on Power-based
> SoCs. Going forward new changes will be mostly related to arm/arm64
> SoCs, and I would prefer them to go through the arm-soc tree.
>
> This pull request includes updates to the QMAN/BMAN drivers to make
> them work on the arm/arm64 architectures in addition to the power
> architecture.
>
> DPAA (Data Path Acceleration Architecture) is a set of hardware
> components used on some FSL/NXP QorIQ Networking SoCs. It provides the
> infrastructure to support simplified sharing of networking interfaces
> and accelerators by multiple CPU cores, and the accelerators
> themselves. The QMan (Queue Manager) and BMan (Buffer Manager) are
> infrastructural components within the DPAA framework. They are used to
> manage queues and buffers for various I/O interfaces and hardware
> accelerators.
>
> More information can be found via link:
> http://www.nxp.com/products/microcontrollers-and-processors/power-architecture-processors/qoriq-platforms/data-path-acceleration:QORIQ_DPAA

Hi Leo,

sorry for taking you through yet another revision, but I have two
more points here:

1. Please add a tag description whenever you create a signed tag. The
description is what ends up in the git history, and if there is none, I
have to think of something myself. In this case, the text above seems
roughly appropriate, so I first copied it into the commit log, but then
noticed the second issue:

2. I know we have discussed the unusual way this driver accesses MMIO
registers in the past, using ioremap_wc() to map them and then manually
flushing the caches to store the cache contents into the MMIO registers.
What I don't know is whether there was any conclusion on this topic:
whether this is actually allowed by the architecture, or at least the
chip, based on implementation-specific features that make it work even
when the architecture doesn't guarantee it.

Can I have an Ack from the architecture maintainers (Russell, Catalin,
Will) on the use of these architecture specific interfaces?

static inline void dpaa_flush(void *p)
{
#ifdef CONFIG_PPC
	flush_dcache_range((unsigned long)p, (unsigned long)p+64);
#elif defined(CONFIG_ARM32)
	__cpuc_flush_dcache_area(p, 64);
#elif defined(CONFIG_ARM64)
	__flush_dcache_area(p, 64);
#endif
}
#define dpaa_invalidate(p) dpaa_flush(p)
#define dpaa_zero(p) memset(p, 0, 64)
static inline void dpaa_touch_ro(void *p)
{
#if (L1_CACHE_BYTES == 32)
	prefetch(p+32);
#endif
	prefetch(p);
}

As described by Leo, the code is already there and is actively used
on powerpc; his pull request is merely for enabling the driver on ARM
and ARM64.

	Arnd
On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
[...]
> 2. I know we have discussed the unusual way this driver accesses MMIO
> registers in the past, using ioremap_wc() to map them and then manually
> flushing the caches to store the cache contents into the MMIO registers.
> What I don't know is whether there was any conclusion on this topic:
> whether this is actually allowed by the architecture, or at least the
> chip, based on implementation-specific features that make it work even
> when the architecture doesn't guarantee it.

From prior discussions, my understanding was that the region in question
was memory reserved for the device, rather than MMIO registers.

The prior discussions on that front were largely to do with the
shareability of that memory, which is an orthogonal concern.

If these are actually MMIO registers, a Device memory type must be used,
rather than a Normal memory type. There are a number of things that
could go wrong due to relaxations permitted for Normal memory, such as
speculative reads, the potential set of access sizes, memory
transactions that the endpoint might not understand, etc.

> Can I have an Ack from the architecture maintainers (Russell, Catalin,
> Will) on the use of these architecture specific interfaces?
>
> static inline void dpaa_flush(void *p)
> {
> #ifdef CONFIG_PPC
> 	flush_dcache_range((unsigned long)p, (unsigned long)p+64);
> #elif defined(CONFIG_ARM32)
> 	__cpuc_flush_dcache_area(p, 64);
> #elif defined(CONFIG_ARM64)
> 	__flush_dcache_area(p, 64);
> #endif
> }

Assuming this is memory, why can't the driver use the DMA APIs to handle
this without reaching into arch-internal APIs?

Thanks,
Mark.
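For illustration, a hedged sketch of what a DMA-streaming-API version of the flush in question might look like, per the suggestion above. This is not the driver's actual code: the `dev` and `handle` bookkeeping is hypothetical, and assumes the portal's cacheable area was mapped once with dma_map_single() and the resulting dma_addr_t stashed alongside the CPU pointer.

```c
/* Hypothetical sketch, not the in-tree driver code: replacing the
 * arch-private dpaa_flush() with the generic DMA streaming API.
 */
#include <linux/dma-mapping.h>

static inline void dpaa_flush_generic(struct device *dev, dma_addr_t handle)
{
	/* Write back the 64-byte descriptor line so the device observes
	 * the CPU's stores; DMA_TO_DEVICE maps to a clean on arm/arm64
	 * and a flush on powerpc via the arch dma_map_ops.
	 */
	dma_sync_single_for_device(dev, handle, 64, DMA_TO_DEVICE);
}
```

Whether this is fast enough for the portal hot path is exactly the performance question raised later in the thread.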
On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
[...]
> Can I have an Ack from the architecture maintainers (Russell, Catalin,
> Will) on the use of these architecture specific interfaces?
[...]
> As described by Leo, the code is already there and is actively used
> on powerpc; his pull request is merely for enabling the driver on ARM
> and ARM64.

Well, as arch maintainer, "code already there" matters not one bit when
converting it to work on other architectures.

For the record, I'm completely dismayed that there seems to be:

(a) the thought of getting changes into the kernel that use arch-private
    interfaces without review by architecture maintainers.

    Here, I'm specifically talking about the work that was merged through
    PowerPC trees adding rather broken ARM support to this driver.

(b) the lack of Cc to ARM and ARM64 architecture maintainers of the patch
    set now being asked to be merged through Arnd.

(c) the thought that it's acceptable to use architecture private
    interfaces in driver code that really have no business being there.

(d) the lack of explanation why it's necessary to poke about using
    architecture private interfaces.

Now, from looking at the code already merged in mainline, I think that
this driver really has no business touching these arch private
interfaces.

I think, rather than myself and Will/Catalin independently spending time
trying to understand what's going on with this code, what we need is for
NXP people to document in detail what they are doing that requires all
this special treatment, and why they need to poke about in stuff that's
architecture private, before this can be merged.

Basically, I want a proper understanding of this code to fully
understand why it's using arch private interfaces, and what the problem
is with using the standard interfaces provided for drivers.
On Fri, Jun 23, 2017 at 04:22:27PM +0100, Mark Rutland wrote:
> The prior discussions on that front were largely to do with the
> shareability of that memory, which is an orthogonal concern.
>
> If these are actually MMIO registers, a Device memory type must be used,
> rather than a Normal memory type. There are a number of things that
> could go wrong due to relaxations permitted for Normal memory, such as
> speculative reads, the potential set of access sizes, memory
> transactions that the endpoint might not understand, etc.

Well, we have:

	dma_wmb();
	mc->cr->_ncw_verb = myverb | mc->vbit;
	dpaa_flush(mc->cr);

and we also have:

	mc->cr = portal->addr.ce + BM_CL_CR;
	mc->rridx = (__raw_readb(&mc->cr->_ncw_verb) & BM_MCC_VERB_VBIT) ?
		    0 : 1;

and:

	dpaa_zero(mc->cr);

which is just another name for a 64-byte memset() which doesn't have
any flushing.

Note that this is memory obtained from the ioremap() family of functions,
so all of these pointers _should_ have __iomem annotations (but don't.)
However, if this isn't iomem, then it shouldn't be using the ioremap()
family of functions, and should be using memremap() instead.

Without really having an understanding of what this chunk of code is
doing (and abuse like the above makes it harder than necessary) no one
can really say from just reading through the code... which brings up
another problem - that of long term maintainability.
On Fri, Jun 23, 2017 at 04:55:40PM +0100, Russell King - ARM Linux wrote:
[...]
> Without really having an understanding of what this chunk of code is
> doing (and abuse like the above makes it harder than necessary) no one
> can really say from just reading through the code... which brings up
> another problem - that of long term maintainability.

Sure; agreed on all counts.

I was (evidently poorly) attempting to clarify what prior discussions
had (and hadn't) covered.

Mark.
On 6/23/2017 11:23 AM, Mark Rutland wrote:
> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
[...]
>> 2. I know we have discussed the unusual way this driver accesses MMIO
>> registers in the past, using ioremap_wc() to map them and then manually
>> flushing the caches to store the cache contents into the MMIO registers.
>> What I don't know is whether there was any conclusion on this topic:
>> whether this is actually allowed by the architecture, or at least the
>> chip, based on implementation-specific features that make it work even
>> when the architecture doesn't guarantee it.
>
> From prior discussions, my understanding was that the region in question
> was memory reserved for the device, rather than MMIO registers.
>
> The prior discussions on that front were largely to do with the
> shareability of that memory, which is an orthogonal concern.
>
> If these are actually MMIO registers, a Device memory type must be used,
> rather than a Normal memory type. There are a number of things that
> could go wrong due to relaxations permitted for Normal memory, such as
> speculative reads, the potential set of access sizes, memory
> transactions that the endpoint might not understand, etc.

The memory for this device (what we refer to as Software Portals) has
two regions. One region is MMIO registers, and we access it using the
readl()/writel() APIs.

The second region is what we refer to as the cacheable area. This is
memory implemented as part of the QBMan device, and the device accepts
cacheline-sized transactions from the interconnect. This is needed
because the descriptors read and written by SW are fairly large (larger
than 64 bits, less than a cacheline), and in order to meet the data
rates of our high speed ethernet ports and other accelerators we need
the CPU to be able to form the descriptor in a CPU cache and flush it
safely when the device is ready to consume it. The system architect and
I have had many discussions with our design counterparts in ARM to
ensure that our interactions with the core/interconnect/device are safe
for the set of CPU cores and interconnects we integrate into our
products.

I understand there are concerns regarding our shareability proposal
(which is not enabled in this patch set). We have been collecting some
information and talking to ARM, and I do intend to address these
concerns, but I was delaying confusing things more until this basic
support gets accepted and merged.

>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>> Will) on the use of these architecture specific interfaces?
>>
>> static inline void dpaa_flush(void *p)
>> {
>> #ifdef CONFIG_PPC
>> 	flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>> #elif defined(CONFIG_ARM32)
>> 	__cpuc_flush_dcache_area(p, 64);
>> #elif defined(CONFIG_ARM64)
>> 	__flush_dcache_area(p, 64);
>> #endif
>> }
>
> Assuming this is memory, why can't the driver use the DMA APIs to handle
> this without reaching into arch-internal APIs?

I agree this isn't pretty - I think we could use
dma_sync_single_for_device() here, but I am concerned it will be
expensive and hurt performance significantly. The DMA APIs have a lot of
branches. At some point we were doing 'dc cvac' here, and even switching
to the above calls caused a measurable drop in throughput at high frame
rates.

> Thanks,
> Mark.
On 6/23/2017 11:39 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
[...]
> Well, as arch maintainer, "code already there" matters not one bit when
> converting it to work on other architectures.
>
> For the record, I'm completely dismayed that there seems to be:
>
> (a) the thought of getting changes into the kernel that use arch-private
>     interfaces without review by architecture maintainers.
>
>     Here, I'm specifically talking about the work that was merged through
>     PowerPC trees adding rather broken ARM support to this driver.

The PowerPC support was added to support PowerPC - this device has
existed on PowerPC parts for many years before we started developing ARM
cores. We have been trying to "catch up" on our upstreaming efforts for
this family of parts, and PowerPC was done first primarily because there
are more parts available at this point in time.

> (b) the lack of Cc to ARM and ARM64 architecture maintainers of the patch
>     set now being asked to be merged through Arnd.

I apologize for the lack of CC - many of us are new to the ARM
community, and learning who should be CC'd on requests takes time. The
patch set was originally posted on the ARM list in January of this year,
and we have incorporated feedback from many people, including Robin
Murphy and yourself. I certainly want and value the feedback from the
community. The reason we were requesting the patches be merged is that
the last submission had no feedback after several rounds of review
(starting in January).

> (c) the thought that it's acceptable to use architecture private
>     interfaces in driver code that really have no business being there.
>
> (d) the lack of explanation why it's necessary to poke about using
>     architecture private interfaces.
>
> Now, from looking at the code already merged in mainline, I think that
> this driver really has no business touching these arch private
> interfaces.
>
> I think, rather than myself and Will/Catalin independently spending time
> trying to understand what's going on with this code, what we need is for
> NXP people to document in detail what they are doing that requires all
> this special treatment and why they need to poke about in stuff that's
> architecture private before this can be merged.
>
> Basically, I want a proper understanding of this code to fully
> understand why it's using arch private interfaces, and what the problem
> is with using the standard interfaces provided for drivers.

The requested pull removed most of the arch-specific code, so I would
recommend looking at the driver with those patches applied. I will
enumerate here all the places in the final driver that check for
CONFIG_ARM or CONFIG_PPC:

1) bman.c: Some register offsets in the device changed when we moved the
   IP to ARM; we use CONFIG_ARM || CONFIG_ARM64 to do the defines at
   different offsets for each platform.

2) bman_portal.c: On PPC we map the SW portals as cacheable/non-coherent.
   This is required for those devices, as they will cause a trap if the
   memory isn't mapped that way. When the device was moved to ARM we
   removed that limitation. Since mapping the device as non-shareable is
   controversial, the code is using ioremap_wc() right now. This code
   will potentially change once there is more discussion on this, but
   for now we distinguish between the two architectures.

3) dpaa_sys.h: I mentioned the dpaa_flush() in a previous message. This
   can likely be changed, but it has performance implications.

4) qman.c: Same register offset business as point 1, but for QMan.

5) qman_portal.c: The same non-shareable difference exists as mentioned
   in bullet 2.

I'm not aware of any other code that is architecture dependent. The only
architecture private interface is the flush(), and I'm willing to look
into how much impact changing it will have if it is a showstopper.
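For illustration, points 1 and 4 describe a compile-time split of the register map. A hedged sketch of that pattern follows; the macro name and offset values here are invented for the example and are not the real BMan layout, which lives in drivers/soc/fsl/qbman/.

```c
/* Illustrative only: a register whose offset moved when the QBMan IP
 * was re-integrated on ARM SoCs.  Names and values are hypothetical.
 */
#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
#define BM_REG_EXAMPLE_CINH	0x3000	/* hypothetical ARM-era offset */
#else /* CONFIG_PPC */
#define BM_REG_EXAMPLE_CINH	0x0000	/* hypothetical PPC-era offset */
#endif
```

The point of the pattern is that the accessor code stays identical on both architectures; only the offset constants differ.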
On 6/23/2017 11:56 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 04:22:27PM +0100, Mark Rutland wrote:
[...]
> Well, we have:
>
> 	dma_wmb();
> 	mc->cr->_ncw_verb = myverb | mc->vbit;
> 	dpaa_flush(mc->cr);
>
> and we also have:
>
> 	mc->cr = portal->addr.ce + BM_CL_CR;
> 	mc->rridx = (__raw_readb(&mc->cr->_ncw_verb) & BM_MCC_VERB_VBIT) ?
> 		    0 : 1;

This is an inconsistency for sure - I will look into this.

> and:
>
> 	dpaa_zero(mc->cr);
>
> which is just another name for a 64-byte memset() which doesn't have
> any flushing.

This is an artifact of a PPC-specific optimization we did in the past
but have since removed. We should just replace these calls with memset()
so things are clear.

> Note that this is memory obtained from the ioremap() family of functions,
> so all of these pointers _should_ have __iomem annotations (but don't.)
> However, if this isn't iomem, then it shouldn't be using the ioremap()
> family of functions, and should be using memremap() instead.

I was following a previous thread with Catalin and some Cavium folks
last month discussing ioremap_wc() vs memremap(). I think you are
correct in suggesting we should be using memremap() here. I had made a
note to explore this further but hadn't had time to investigate. I will
move this up on my list.

> Without really having an understanding of what this chunk of code is
> doing (and abuse like the above makes it harder than necessary) no one
> can really say from just reading through the code... which brings up
> another problem - that of long term maintainability.

Is this a request for more comments? I agree that looking at the code
here is difficult without understanding how the device interface works
(which is admittedly somewhat unique). I can try to add some comments in
the areas that are likely not obvious to anyone who hasn't read the
device manual.
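A hedged sketch of the memremap() direction discussed above. The `portal` and `res` names are hypothetical stand-ins for the driver's own structures; the real conversion would need to thread this through the portal setup path.

```c
/* Hypothetical sketch: map the portal's cacheable area with memremap()
 * instead of ioremap_wc(), since it is RAM-like memory on the device
 * rather than registers.  memremap() returns a plain void * with no
 * __iomem annotation, so memset()/memcpy() on the result are
 * legitimate and sparse stays quiet.
 */
#include <linux/io.h>

static int portal_map_ce(struct resource *res, void **ce)
{
	*ce = memremap(res->start, resource_size(res), MEMREMAP_WC);
	if (!*ce)
		return -ENOMEM;
	return 0;
}
```

Note that MEMREMAP_WC is a request, not a guarantee; memremap() may fail if the region intersects System RAM, which would need handling in the real driver.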
On Fri, Jun 23, 2017 at 06:58:17PM +0000, Roy Pledge wrote:
> On 6/23/2017 11:23 AM, Mark Rutland wrote:
> > Assuming this is memory, why can't the driver use the DMA APIs to
> > handle this without reaching into arch-internal APIs?
>
> I agree this isn't pretty - I think we could use
> dma_sync_single_for_device() here but I am concerned it will be
> expensive and hurt performance significantly. The DMA APIs have a lot
> of branches. At some point we were doing 'dc cvac' here and even
> switching to the above calls caused a measurable drop in throughput at
> high frame rates.

Well... __cpuc_flush_dcache_area() is used to implement
flush_dcache_page() and flush_anon_page(). The former is about ensuring
coherency between multiple mappings of a kernel page cache page and
userspace. The latter is about ensuring coherency between an anonymous
page and userspace.

Currently, on ARMv7, this is implemented using
"mcr p15, 0, r0, c7, c14, 1", but we _could_ decide that is too heavy
for these interfaces, and instead switch to a lighter cache flush if one
were available in a future architecture revision (since the use case of
this only requires it to flush to the point of unification, not to the
point of coherence.)

The overall effect is that changing the behaviour of this would
introduce a regression into your driver, which would have to be
reverted - and that makes my job as architecture maintainer difficult.

It may make sense to explicitly introduce a "flush data cache to point
of coherence" function - we already have gicv3 and kvm wanting this, and
doing it the right way, via:

#define gic_flush_dcache_to_poc(a,l)	__cpuc_flush_dcache_area((a), (l))
#define kvm_flush_dcache_to_poc(a,l)	__cpuc_flush_dcache_area((a), (l))

So, how about we introduce something like:

	void flush_dcache_to_poc(void *addr, size_t size)

which is defined to ensure that data is visible to the point of
coherence on all architectures?
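A minimal sketch of how such a helper might be spelled per architecture, following the gic/kvm pattern quoted above. This is an assumption about the shape of the API, not an agreed design; the per-arch bodies simply reuse the calls already present in dpaa_flush().

```c
/* Hypothetical common "flush data cache to point of coherence" helper. */
#if defined(CONFIG_ARM)
#include <asm/cacheflush.h>
static inline void flush_dcache_to_poc(void *addr, size_t size)
{
	__cpuc_flush_dcache_area(addr, size);
}
#elif defined(CONFIG_ARM64)
#include <asm/cacheflush.h>
static inline void flush_dcache_to_poc(void *addr, size_t size)
{
	__flush_dcache_area(addr, size);
}
#elif defined(CONFIG_PPC)
#include <asm/cacheflush.h>
static inline void flush_dcache_to_poc(void *addr, size_t size)
{
	flush_dcache_range((unsigned long)addr,
			   (unsigned long)addr + size);
}
#endif
```

With something like this in generic headers, a driver could express its intent ("make these bytes visible to a non-snooping observer") without binding itself to arch-private symbols whose semantics are free to change.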
On Fri, Jun 23, 2017 at 8:58 PM, Roy Pledge <roy.pledge@nxp.com> wrote: > On 6/23/2017 11:23 AM, Mark Rutland wrote: >> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote: >>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote: >>> >>> 2. I know we have discussed the unusual way this driver accesses MMIO >>> registers in the past, using ioremap_wc() to map them and the manually >>> flushing the caches to store the cache contents into the MMIO registers. >>> What I don't know is whether there was any conclusion on this topic whether >>> this is actually allowed by the architecture or at least the chip, based on >>> implementation-specific features that make it work even when the architecture >>> doesn't guarantee it. >> From prior discussions, my understanding was that the region in question >> was memory reserved for the device, rather than MMIO registers. Ok. >> The prior discussion on that front were largely to do with teh >> shareability of that memory, which is an orthogonal concern. >> >> If these are actually MMIO registers, a Device memory type must be used, >> rather than a Normal memory type. There are a number of things that >> could go wrong due to relaxations permitted for Normal memory, such as >> speculative reads, the potential set of access sizes, memory >> transactions that the endpoint might not understand, etc. > The memory for this device (what we refer to as Software Portals) has 2 > regions. One region is MMIO registers and we access it using > readl()/writel() APIs. Ok, good. > The second region is what we refer to as the cacheable area. This is > memory implemented as part of the QBMan device and the device accepts > cacheline sized transactions from the interconnect. 
> This is needed because the descriptors read and written by SW are fairly
> large (larger than 64 bits, but less than a cacheline), and in order to
> meet the data rates of our high-speed ethernet ports and other
> accelerators we need the CPU to be able to form the descriptor in a CPU
> cache and flush it safely when the device is ready to consume it. The
> system architect and I have had many discussions with our design
> counterparts in ARM to ensure that our interaction with the
> core/interconnect/device is safe for the set of CPU cores and
> interconnects we integrate into our products.
>
> I understand there are concerns regarding our shareability proposal
> (which is not enabled in this patch set). We have been collecting some
> information and talking to ARM, and I do intend to address these
> concerns, but I was delaying confusing things further until this basic
> support gets accepted and merged.

Can you summarize what the constraints are that we have for mapping
this area? E.g. why can't we just have an uncached write-combining
mapping and skip the flushes?

>>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>>> Will) on the use of these architecture-specific interfaces?
>>>
>>> static inline void dpaa_flush(void *p)
>>> {
>>> #ifdef CONFIG_PPC
>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>>> #elif defined(CONFIG_ARM32)
>>>         __cpuc_flush_dcache_area(p, 64);
>>> #elif defined(CONFIG_ARM64)
>>>         __flush_dcache_area(p, 64);
>>> #endif
>>> }
>>
>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>> this without reaching into arch-internal APIs?
>
> I agree this isn't pretty - I think we could use
> dma_sync_single_for_device() here, but I am concerned it will be
> expensive and hurt performance significantly. The DMA APIs have a lot of
> branches. At some point we were doing 'dc cvac' here, and even switching
> to the above calls caused a measurable drop in throughput at high frame
> rates.
I'd suggest we start out by converting this to some standard API first,
regardless of performance, to get it working properly with code that is
at least maintainable, and make progress with your hardware enablement.

In parallel, we can discuss what kind of API we actually need and how to
fit that into the existing frameworks. This may take a while, as it
depends on your other patch set, and perhaps input from other parties
that may have similar requirements. I could imagine that e.g. Marvell,
Cavium or Annapurna (Amazon) do something similar in their hardware, so
if we find that we need a new API, we should define it in a way that
works for everyone.

       Arnd
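As a rough illustration of the "standard API first" conversion, the open-coded arch flushes could be replaced by dma_sync_single_for_device(). The types and the function body below are userspace stand-ins so the sketch is self-contained; in the driver they would come from <linux/dma-mapping.h>, and the device/dma_addr_t plumbing is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel DMA API types and entry point. */
typedef unsigned long dma_addr_t;
enum dma_data_direction { DMA_TO_DEVICE, DMA_FROM_DEVICE };
struct device;

static size_t synced_size;	/* records the last sync, for illustration */

static void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
				       size_t size, enum dma_data_direction dir)
{
	(void)dev; (void)addr; (void)dir;
	synced_size = size;	/* a real kernel would flush/clean here */
}

/* dpaa_flush() rewritten on top of the DMA API: hand the 64-byte
 * descriptor back to the device before it consumes it, with no
 * per-architecture #ifdef branches in the driver. */
static inline void dpaa_flush(struct device *dev, dma_addr_t addr)
{
	dma_sync_single_for_device(dev, addr, 64, DMA_TO_DEVICE);
}
```

Whether this is fast enough at high frame rates is exactly the open question in the thread; the point of the sketch is only that the driver-side code becomes architecture-neutral.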
On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
> I'd suggest we start out by converting this to some standard API
> first, regardless of performance, to get it working properly with code
> that should be maintainable at least, and make progress with your
> hardware enablement.

I think Roy is rather confused right now about what the driver does
and doesn't do.

With his patches, two areas are mapped on ARM:

1. The addr.ci area is mapped using ioremap(). This is a device mapping,
   and is used by the "cache-inhibited register access" functions.

2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
   including the verb accesses goes through this region. This is a
   memory, non-cacheable mapping.

The addr.ce region is the area that has a mixture of memset(), MMIO
accessors, direct CPU accesses, cache flushes and prefetches used on it.

As it is marked non-cacheable, the cache flushes are a pure waste of CPU
cycles. Prefetching from a non-cacheable area doesn't make much sense
either.

So, I think for ARM: just kill the cache flushing - it's doing nothing
useful. The prefetching could also be removed to recover a few more CPU
cycles.
On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
>> I'd suggest we start out by converting this to some standard API
>> first, regardless of performance, to get it working properly with code
>> that should be maintainable at least, and make progress with your
>> hardware enablement.
>
> I think Roy is rather confused right now about what the driver does
> and doesn't do.
>
> With his patches, two areas are mapped on ARM:
>
> 1. The addr.ci area is mapped using ioremap(). This is a device mapping,
>    and is used by the "cache-inhibited register access" functions.
>
> 2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
>    including the verb accesses goes through this region. This is a
>    memory, non-cacheable mapping.
>
> The addr.ce region is the area that has a mixture of memset(), MMIO
> accessors, direct CPU accesses, cache flushes and prefetches used on
> it.
>
> As it is marked non-cacheable, the cache flushes are a pure waste of
> CPU cycles. Prefetching from a non-cacheable area doesn't make much
> sense either.
>
> So, I think for ARM: just kill the cache flushing - it's doing nothing
> useful. The prefetching could also be removed as well to recover a
> few more CPU cycles.

Right, that sounds good. I wondered about the ioremap_wc() earlier and
thought I must have missed something, as it didn't make any sense to me,
but apparently that was because the code really doesn't make sense ;-).

It would be good to get consistent __iomem annotations on it too, so we
can see exactly where the MMIO interfaces are used incorrectly, and find
a fix for that.

I wonder if doing memcpy_toio on the write-combining mapping will do
what Roy wants and end up sending a single bus transaction for one
descriptor most of the time, while fitting in with the ioremap_wc()
interface and its __iomem pointers.

       Arnd
On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Tue, Jun 27, 2017 at 09:17:48AM +0200, Arnd Bergmann wrote:
> >> I'd suggest we start out by converting this to some standard API
> >> first, regardless of performance, to get it working properly with code
> >> that should be maintainable at least, and make progress with your
> >> hardware enablement.
> >
> > I think Roy is rather confused right now about what the driver does
> > and doesn't do.
> >
> > With his patches, two areas are mapped on ARM:
> >
> > 1. The addr.ci area is mapped using ioremap(). This is a device mapping,
> >    and is used by the "cache-inhibited register access" functions.
> >
> > 2. The addr.ce area is mapped using ioremap_wc(), and all other stuff
> >    including the verb accesses goes through this region. This is a
> >    memory, non-cacheable mapping.
> >
> > The addr.ce region is the area that has a mixture of memset(), MMIO
> > accessors, direct CPU accesses, cache flushes and prefetches used on
> > it.
> >
> > As it is marked non-cacheable, the cache flushes are a pure waste of
> > CPU cycles. Prefetching from a non-cacheable area doesn't make much
> > sense either.
> >
> > So, I think for ARM: just kill the cache flushing - it's doing nothing
> > useful. The prefetching could also be removed as well to recover a
> > few more CPU cycles.
>
> Right, that sounds good. I wondered about the ioremap_wc() earlier
> and thought I must have missed something as it didn't make any sense
> to me, but apparently that was because the code really doesn't make
> sense ;-).
>
> It would be good to get consistent __iomem annotations on it too,
> so we can see exactly where the MMIO interfaces are used incorrectly,
> and find a fix for that.
>
> I wonder if doing memcpy_toio on the write-combining mapping will do
> what Roy wants and end up sending a single bus transaction for one
> descriptor most of the time, while fitting in with the ioremap_wc()
> interface and its __iomem pointers.

Either the CE region needs __iomem pointers and proper accessors, or it
needs to be using memremap(offset, size, MEMREMAP_WC) and not use the
iomem accessors.

The CI region should definitely have __iomem annotations, though.
On Tue, Jun 27, 2017 at 11:10 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
>> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
>> <linux@armlinux.org.uk> wrote:
>>
>> I wonder if doing memcpy_toio on the write-combining mapping will do
>> what Roy wants and end up sending a single bus transaction for one
>> descriptor most of the time, while fitting in with the ioremap_wc()
>> interface and its __iomem pointers.
>
> Either the CE region needs __iomem pointers and proper accessors, or
> it needs to be using memremap(offset, size, MEMREMAP_WC) and not use
> the iomem accessors.

Right, memremap() would work here as well. If there are no other
constraints, using ioremap_wc() and __iomem pointers would have the
advantage of giving us a little more type safety, but it depends on what
operations we actually need to do on the data.

       Arnd
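A rough sketch of the memremap() option being discussed. Everything below is a userspace stand-in: the real memremap() and MEMREMAP_WC come from <linux/io.h>, qbman_map_ce() is a hypothetical helper, and the static buffer plays the part of the portal's cache-enabled window. The point is only that memremap() hands back a plain pointer without the __iomem annotation, so the driver's existing memset() and direct CPU stores on the CE area become legitimate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MEMREMAP_WC (1UL << 2)		/* stand-in for the kernel flag */

static char fake_ce_window[4096];	/* pretend CE portal window */

/* Stand-in for the kernel's memremap(): returns a plain void * with no
 * __iomem annotation, unlike ioremap_wc(). */
static void *memremap(unsigned long offset, size_t size, unsigned long flags)
{
	(void)offset; (void)size; (void)flags;
	return fake_ce_window;
}

/* Hypothetical helper: map the CE area so the driver can keep using
 * memset() and direct stores without sparse warnings about __iomem. */
static void *qbman_map_ce(unsigned long offset, size_t size)
{
	return memremap(offset, size, MEMREMAP_WC);
}
```

The trade-off Arnd mentions cuts the other way with ioremap_wc(): __iomem pointers give type safety but force everything through accessors like memcpy_toio().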
On 6/24/2017 8:10 AM, Russell King - ARM Linux wrote:
> On Fri, Jun 23, 2017 at 06:58:17PM +0000, Roy Pledge wrote:
>> On 6/23/2017 11:23 AM, Mark Rutland wrote:
>>> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>>>> static inline void dpaa_flush(void *p)
>>>> {
>>>> #ifdef CONFIG_PPC
>>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>>>> #elif defined(CONFIG_ARM32)
>>>>         __cpuc_flush_dcache_area(p, 64);
>>>> #elif defined(CONFIG_ARM64)
>>>>         __flush_dcache_area(p, 64);
>>>> #endif
>>>> }
>>>
>>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>>> this without reaching into arch-internal APIs?
>>
>> I agree this isn't pretty - I think we could use
>> dma_sync_single_for_device() here, but I am concerned it will be
>> expensive and hurt performance significantly. The DMA APIs have a lot of
>> branches. At some point we were doing 'dc cvac' here, and even switching
>> to the above calls caused a measurable drop in throughput at high frame
>> rates.
>
> Well...
>
> __cpuc_flush_dcache_area() is used to implement flush_dcache_page() and
> flush_anon_page(). The former is about ensuring coherency between
> multiple mappings of a kernel page cache page and userspace. The latter
> is about ensuring coherency between an anonymous page and userspace.
>
> Currently, on ARMv7, this is implemented using "mcr p15, 0, r0, c7, c14, 1",
> but we _could_ decide that is too heavy for these interfaces, and instead
> switch to a lighter cache flush if one were available in a future
> architecture revision (since the use case of this only requires a flush
> to the point of unification, not to the point of coherence.)
>
> The overall effect is that changing the behaviour of this would introduce
> a regression into your driver, which would have to be reverted - and that
> makes my job as architecture maintainer difficult.
>
> It may make sense to explicitly introduce a "flush data cache to point
> of coherence" function - we already have gicv3 and kvm wanting this, and
> doing it the right way, via:
>
> #define gic_flush_dcache_to_poc(a,l)    __cpuc_flush_dcache_area((a), (l))
> #define kvm_flush_dcache_to_poc(a,l)    __cpuc_flush_dcache_area((a), (l))
>
> So, how about we introduce something like:
>
>       void flush_dcache_to_poc(void *addr, size_t size)
>
> which is defined to ensure that data is visible to the point of
> coherence on all architectures?

This would be very useful and would help with this issue. To be
completely honest, an invalidate variant would be useful for us as well,
since we currently use flush in cases where we really just need to
invalidate. The extra, unneeded write-back to the device can show a
small performance cost. I understand that exposing invalidate to users
is potentially dangerous, as you need to be 100% sure you're willing to
throw away any data in that cacheline, so I won't be surprised if people
aren't willing to expose an API for this.
On 6/27/2017 3:17 AM, Arnd Bergmann wrote:
> On Fri, Jun 23, 2017 at 8:58 PM, Roy Pledge <roy.pledge@nxp.com> wrote:
>> On 6/23/2017 11:23 AM, Mark Rutland wrote:
>>> On Fri, Jun 23, 2017 at 04:56:10PM +0200, Arnd Bergmann wrote:
>>>> On Tue, Jun 20, 2017 at 7:27 PM, Leo Li <leoyang.li@nxp.com> wrote:
>>>>
>>>> 2. I know we have discussed the unusual way this driver accesses MMIO
>>>> registers in the past, using ioremap_wc() to map them and then manually
>>>> flushing the caches to store the cache contents into the MMIO registers.
>>>> What I don't know is whether there was any conclusion on whether this is
>>>> actually allowed by the architecture or at least the chip, based on
>>>> implementation-specific features that make it work even when the
>>>> architecture doesn't guarantee it.
>>>
>>> From prior discussions, my understanding was that the region in question
>>> was memory reserved for the device, rather than MMIO registers.
>
> Ok.
>
>>> The prior discussions on that front were largely to do with the
>>> shareability of that memory, which is an orthogonal concern.
>>>
>>> If these are actually MMIO registers, a Device memory type must be used,
>>> rather than a Normal memory type. There are a number of things that
>>> could go wrong due to relaxations permitted for Normal memory, such as
>>> speculative reads, the potential set of access sizes, memory
>>> transactions that the endpoint might not understand, etc.
>> The memory for this device (what we refer to as Software Portals) has 2
>> regions. One region is MMIO registers, which we access using the
>> readl()/writel() APIs.
>
> Ok, good.
>
>> The second region is what we refer to as the cacheable area. This is
>> memory implemented as part of the QBMan device, and the device accepts
>> cacheline-sized transactions from the interconnect.
>> This is needed because the descriptors read and written by SW are fairly
>> large (larger than 64 bits, but less than a cacheline), and in order to
>> meet the data rates of our high-speed ethernet ports and other
>> accelerators we need the CPU to be able to form the descriptor in a CPU
>> cache and flush it safely when the device is ready to consume it. The
>> system architect and I have had many discussions with our design
>> counterparts in ARM to ensure that our interaction with the
>> core/interconnect/device is safe for the set of CPU cores and
>> interconnects we integrate into our products.
>>
>> I understand there are concerns regarding our shareability proposal
>> (which is not enabled in this patch set). We have been collecting some
>> information and talking to ARM, and I do intend to address these
>> concerns, but I was delaying confusing things further until this basic
>> support gets accepted and merged.
>
> Can you summarize what the constraints are that we have for mapping
> this area? E.g. why can't we just have an uncached write-combining
> mapping and skip the flushes?

For ARM/ARM64 we can (and currently do) map it as uncached
write-combining, but for PPC we cannot do this, as the device only
accepts cacheline-sized transactions in the cache-enabled area. The
device requirements were relaxed for ARM since the interconnect had
different behavior than the PPC interconnect we were using.

The goal for ARM is to map this memory as non-shareable, as that gives a
significant performance increase over write-combining. We intend to work
more on acceptance of this mode once basic support is in.

The flushes in the code for ARM can be removed, but the PPC calls are
needed. In the dpaa_flush() code below, I think the ARM variants could
be removed for now (I would want to run some tests, of course).

>
>>>> Can I have an Ack from the architecture maintainers (Russell, Catalin,
>>>> Will) on the use of these architecture-specific interfaces?
>>>>
>>>> static inline void dpaa_flush(void *p)
>>>> {
>>>> #ifdef CONFIG_PPC
>>>>         flush_dcache_range((unsigned long)p, (unsigned long)p+64);
>>>> #elif defined(CONFIG_ARM32)
>>>>         __cpuc_flush_dcache_area(p, 64);
>>>> #elif defined(CONFIG_ARM64)
>>>>         __flush_dcache_area(p, 64);
>>>> #endif
>>>> }
>>>
>>> Assuming this is memory, why can't the driver use the DMA APIs to handle
>>> this without reaching into arch-internal APIs?
>>
>> I agree this isn't pretty - I think we could use
>> dma_sync_single_for_device() here, but I am concerned it will be
>> expensive and hurt performance significantly. The DMA APIs have a lot of
>> branches. At some point we were doing 'dc cvac' here, and even switching
>> to the above calls caused a measurable drop in throughput at high frame
>> rates.
>
> I'd suggest we start out by converting this to some standard API
> first, regardless of performance, to get it working properly with code
> that should be maintainable at least, and make progress with your
> hardware enablement.
>
> In parallel, we can discuss what kind of API we actually need and
> how to fit that into the existing frameworks. This may take a while
> as it depends on your other patch set, and perhaps input from
> other parties that may have similar requirements. I could imagine
> that e.g. Marvell, Cavium or Annapurna (Amazon) do something
> similar in their hardware, so if we find that we need a new API,
> we should define it in a way that works for everyone.

I'm happy that you're open to extending the API for these types of
devices. I will reply to another email shortly, but I think using the
memremap() API for the cache-enabled area is a good change for this
device and will address the __iomem concerns.

>
> Arnd
>
On 6/27/2017 6:35 AM, Arnd Bergmann wrote:
> On Tue, Jun 27, 2017 at 11:10 AM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
>> On Tue, Jun 27, 2017 at 11:05:38AM +0200, Arnd Bergmann wrote:
>>> On Tue, Jun 27, 2017 at 10:17 AM, Russell King - ARM Linux
>>> <linux@armlinux.org.uk> wrote:
>>>
>>> I wonder if doing memcpy_toio on the write-combining mapping will do
>>> what Roy wants and end up sending a single bus transaction for one
>>> descriptor most of the time, while fitting in with the ioremap_wc()
>>> interface and its __iomem pointers.
>>
>> Either the CE region needs __iomem pointers and proper accessors, or
>> it needs to be using memremap(offset, size, MEMREMAP_WC) and not use
>> the iomem accessors.
>
> Right, memremap() would work here as well. If there are no other
> constraints, using ioremap_wc() and __iomem pointers would have the
> advantage of giving us a little more type safety, but it depends on
> what operations we actually need to do on the data.

I explained a bit in other threads, so this is my plan for a respin:

1) Map the cache-enabled areas with memremap().
2) Make sure __iomem is being respected for the cache-inhibited areas.
3) Get rid of the flush()/prefetch() ARM/ARM64 implementations - the
   calls will only have an implementation for PPC.

Roy

>
> Arnd
>
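For illustration, a sketch of what item 3 of that respin plan might look like: dpaa_flush() keeps a real flush only for PPC, where the device requires cacheline transactions, and becomes a no-op on ARM/ARM64, where the CE area is mapped non-cacheable. CONFIG_PPC is force-defined and flush_dcache_range() is a recording stub here so the sketch is self-contained; in the driver they come from the powerpc build and <asm/cacheflush.h>.

```c
#include <assert.h>

#define CONFIG_PPC 1	/* pretend this is a powerpc build for the sketch */

/* Recording stub standing in for the powerpc cache helper. */
static unsigned long flushed_from, flushed_to;

static void flush_dcache_range(unsigned long start, unsigned long stop)
{
	flushed_from = start;
	flushed_to = stop;
}

static inline void dpaa_flush(void *p)
{
#ifdef CONFIG_PPC
	flush_dcache_range((unsigned long)p, (unsigned long)p + 64);
#endif
	/* no ARM/ARM64 branches any more: with the CE area mapped
	 * non-cacheable there is nothing to flush on those builds */
}
```

On an ARM/ARM64 build (CONFIG_PPC undefined) the function compiles to nothing, which matches Russell's observation that the flushes there were pure overhead.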