Message ID:   20141113192723.12579.25343.stgit@ahduyck-server
State:        Not Applicable, archived
Delegated to: David Miller
Hi Alex,

On Thu, Nov 13, 2014 at 07:27:23PM +0000, Alexander Duyck wrote:
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory. On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
>
> As a result this patch introduces load_acquire() and store_release(). The
> load_acquire() function can be used in the place of situations where a test
> for ownership must be followed by a memory barrier. The below example is
> from ixgbe:
>
> 	if (!rx_desc->wb.upper.status_error)
> 		break;
>
> 	/* This memory barrier is needed to keep us from reading
> 	 * any other fields out of the rx_desc until we know the
> 	 * descriptor has been written back
> 	 */
> 	rmb();
>
> With load_acquire() this can be changed to:
>
> 	if (!load_acquire(&rx_desc->wb.upper.status_error))
> 		break;

I still don't think this is a good idea for the specific use-case you're
highlighting.

On ARM, an mb() can be *significantly* more expensive than an rmb() (since
we may have to drain store buffers on an outer L2 cache) and on arm64 it's
not at all clear that an LDAR is more efficient than an LDR; DMB LD
sequence. I can certainly imagine implementations where the latter would be
preferred.

So, whilst I'm perfectly fine to go along with mandatory acquire/release
macros (we should probably add a check to barf on __iomem pointers), I
don't agree with using them in preference to finer-grained read/write
barriers. Doing so will have a real impact on I/O performance.

Finally, do you know of any architectures where load_acquire/store_release
aren't implemented the same way as the smp_* variants on SMP kernels?
Will

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
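For readers following the two idioms being debated above, here is a hedged userspace sketch using C11 atomics as an analogy for the kernel primitives (it is not kernel code; `rx_desc`, `STATUS_DD`, and the function names are illustrative, not taken from ixgbe). An acquire fence stands in for rmb(), and an acquire load stands in for the proposed load_acquire():

```c
#include <stdatomic.h>
#include <stdint.h>

#define STATUS_DD 0x1u  /* illustrative "descriptor done" bit */

struct rx_desc {
    uint32_t payload;        /* written first by the "device" */
    _Atomic uint32_t status; /* ownership/status word, published last */
};

/* "Device" side of the analogy: write the payload, then publish the
 * status word with release semantics. */
static void device_writeback(struct rx_desc *d, uint32_t data)
{
    d->payload = data;
    atomic_store_explicit(&d->status, STATUS_DD, memory_order_release);
}

/* Style 1: the proposed load_acquire() -- a single acquire load, which
 * pays its ordering cost on every poll, including the early exit. */
static int poll_acquire(struct rx_desc *d, uint32_t *out)
{
    if (!(atomic_load_explicit(&d->status, memory_order_acquire) & STATUS_DD))
        return 0;
    *out = d->payload; /* ordered after the status check */
    return 1;
}

/* Style 2: the existing pattern -- plain load, then a read barrier
 * (modelled by an acquire fence) only once ownership is confirmed. */
static int poll_rmb(struct rx_desc *d, uint32_t *out)
{
    if (!(atomic_load_explicit(&d->status, memory_order_relaxed) & STATUS_DD))
        return 0; /* no barrier paid on the not-ready path */
    atomic_thread_fence(memory_order_acquire); /* analogue of rmb() */
    *out = d->payload;
    return 1;
}
```

This makes Will's and David's objection concrete: style 2 only pays for the barrier after the ownership test succeeds, whereas the acquire load orders every iteration of the poll loop.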
From: Alexander Duyck
> It is common for device drivers to make use of acquire/release semantics
> when dealing with descriptors stored in device memory. On reviewing the
> documentation and code for smp_load_acquire() and smp_store_release() as
> well as reviewing an IBM website that goes over the use of PowerPC barriers
> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
> occurred to me that the same code could likely be applied to device drivers.
>
> As a result this patch introduces load_acquire() and store_release(). The
> load_acquire() function can be used in the place of situations where a test
> for ownership must be followed by a memory barrier. The below example is
> from ixgbe:
>
> 	if (!rx_desc->wb.upper.status_error)
> 		break;
>
> 	/* This memory barrier is needed to keep us from reading
> 	 * any other fields out of the rx_desc until we know the
> 	 * descriptor has been written back
> 	 */
> 	rmb();
>
> With load_acquire() this can be changed to:
>
> 	if (!load_acquire(&rx_desc->wb.upper.status_error))
> 		break;

If I'm quickly reading the 'new' code I need to look up yet another
function, with the 'old' code I can easily see the logic.

You've also added a memory barrier to the 'break' path - which isn't needed.

The driver might also have additional code that can be added before the
barrier, so reducing the cost of the barrier.

The driver may also be able to perform multiple actions before a barrier
is needed.

Hiding barriers isn't necessarily a good idea anyway. If you are writing
a driver you need to understand when and where they are needed.

Maybe you need a new (weaker) barrier to replace rmb() on some
architectures.

	...

	David
On 11/14/2014 02:19 AM, Will Deacon wrote:
> Hi Alex,
>
> On Thu, Nov 13, 2014 at 07:27:23PM +0000, Alexander Duyck wrote:
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory. On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release(). The
>> load_acquire() function can be used in the place of situations where a test
>> for ownership must be followed by a memory barrier. The below example is
>> from ixgbe:
>>
>> 	if (!rx_desc->wb.upper.status_error)
>> 		break;
>>
>> 	/* This memory barrier is needed to keep us from reading
>> 	 * any other fields out of the rx_desc until we know the
>> 	 * descriptor has been written back
>> 	 */
>> 	rmb();
>>
>> With load_acquire() this can be changed to:
>>
>> 	if (!load_acquire(&rx_desc->wb.upper.status_error))
>> 		break;
> I still don't think this is a good idea for the specific use-case you're
> highlighting.
>
> On ARM, an mb() can be *significantly* more expensive than an rmb() (since
> we may have to drain store buffers on an outer L2 cache) and on arm64 it's
> not at all clear that an LDAR is more efficient than an LDR; DMB LD
> sequence. I can certainly imagine implementations where the latter would
> be preferred.

Yeah, I am pretty sure I overdid it in using a mb() for arm. I think what
I should probably be using is something like dmb(ish), which is used for
smp_mb() instead. The general idea is to enforce ordering between
memory-memory accesses. Memory-MMIO accesses should still use a full
rmb()/wmb() barrier.
The alternative I am mulling over is creating something like a lightweight
set of memory barriers named lw_mb(), lw_rmb(), lw_wmb() that could be used
instead. The general idea is that on many architectures a full mb/rmb/wmb
is far too much for just guaranteeing ordering of system-memory-only writes
or reads. I'm thinking I could probably use the smp_ varieties as a
template for them, since in most cases this should be correct.

Also, just to be clear, I am not advocating replacing the wmb() in most I/O
setups where we have to sync the system memory before doing the MMIO write.
This is for the case where the device descriptor ring has some bit
indicating ownership by either the device or the CPU. So for example on the
r8169 they have to do a wmb() before writing the DescOwn bit in the first
descriptor of a given set of Tx descriptors to guarantee the rest are
written, then they set the DescOwn bit, then they call wmb() again to flush
that last bit before notifying the device it can start fetching the
descriptors. My goal is to deal with that first wmb() and leave the second
as is, since it is correct.

> So, whilst I'm perfectly fine to go along with mandatory acquire/release
> macros (we should probably add a check to barf on __iomem pointers), I
> don't agree with using them in preference to finer-grained read/write
> barriers. Doing so will have a real impact on I/O performance.

Couldn't that type of check be added to compiletime_assert_atomic_type?
That seems like it would be the best place for something like that.

> Finally, do you know of any architectures where load_acquire/store_release
> aren't implemented the same way as the smp_* variants on SMP kernels?
>
> Will

I should probably go back through and sort out the cases where mb() and
smp_mb() are not the same thing. I think I went with too harsh a barrier
in a couple of other cases as well.
Thanks,

Alex
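The r8169 flow Alex describes above can be sketched in the same userspace C11-atomics analogy (a hedged illustration only: `DescOwn`, the descriptor layout, and the doorbell variable are stand-ins, not the real driver's structures). The first wmb() — ordering the descriptor writes before DescOwn is set — is what a release store subsumes; the second wmb(), before the MMIO doorbell, is modelled here by a full fence and stays:

```c
#include <stdatomic.h>
#include <stdint.h>

#define DescOwn 0x80000000u /* illustrative ownership bit */

struct tx_desc {
    uint32_t addr;           /* illustrative buffer address field */
    _Atomic uint32_t opts1;  /* ownership/options word */
};

static _Atomic uint32_t doorbell; /* stand-in for the MMIO tx-poll register */

static void post_tx(struct tx_desc *ring, int first, int count, uint32_t len)
{
    /* Fill in the descriptors the device will fetch. */
    for (int i = first; i < first + count; i++)
        ring[i].addr = 0x1000u * (uint32_t)i; /* illustrative addresses */

    /* First wmb() in the driver: descriptor contents must be visible
     * before DescOwn is set. A release store on opts1 subsumes it. */
    atomic_store_explicit(&ring[first].opts1, DescOwn | len,
                          memory_order_release);

    /* Second wmb(): flush everything before notifying the device.
     * This one stays a full barrier (strongest fence in this model). */
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(&doorbell, 1, memory_order_relaxed);
}
```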
On 11/14/2014 02:45 AM, David Laight wrote:
> From: Alexander Duyck
>> It is common for device drivers to make use of acquire/release semantics
>> when dealing with descriptors stored in device memory. On reviewing the
>> documentation and code for smp_load_acquire() and smp_store_release() as
>> well as reviewing an IBM website that goes over the use of PowerPC barriers
>> at http://www.ibm.com/developerworks/systems/articles/powerpc.html it
>> occurred to me that the same code could likely be applied to device drivers.
>>
>> As a result this patch introduces load_acquire() and store_release(). The
>> load_acquire() function can be used in the place of situations where a test
>> for ownership must be followed by a memory barrier. The below example is
>> from ixgbe:
>>
>> 	if (!rx_desc->wb.upper.status_error)
>> 		break;
>>
>> 	/* This memory barrier is needed to keep us from reading
>> 	 * any other fields out of the rx_desc until we know the
>> 	 * descriptor has been written back
>> 	 */
>> 	rmb();
>>
>> With load_acquire() this can be changed to:
>>
>> 	if (!load_acquire(&rx_desc->wb.upper.status_error))
>> 		break;
> If I'm quickly reading the 'new' code I need to look up yet another
> function, with the 'old' code I can easily see the logic.
>
> You've also added a memory barrier to the 'break' path - which isn't needed.
>
> The driver might also have additional code that can be added before the
> barrier so reducing the cost of the barrier.
>
> The driver may also be able to perform multiple actions before a barrier
> is needed.
>
> Hiding barriers isn't necessarily a good idea anyway.
> If you are writing a driver you need to understand when and where they
> are needed.
>
> Maybe you need a new (weaker) barrier to replace rmb() on some
> architectures.
>
> 	...
>
> 	David

Yeah, I think I might explore creating some lightweight barriers. The
load/acquire stuff is a bit overkill for what is needed.
Thanks,

Alex
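The "lightweight barrier" idea floated here — ordering CPU accesses to coherent system memory without also paying for memory-vs-MMIO ordering — maps loosely onto the distinction C11 fences already make. A hedged sketch (the lw_* names are the hypothetical ones from this discussion, not an existing kernel API, and the C11 fences are only rough analogues of what an architecture-specific implementation would use):

```c
#include <stdatomic.h>

/* Hypothetical lightweight barriers: order this CPU's accesses to
 * coherent system memory (e.g. DMA descriptor rings) against each
 * other, making no claim about ordering versus MMIO. */
#define lw_rmb() atomic_thread_fence(memory_order_acquire)
#define lw_wmb() atomic_thread_fence(memory_order_release)
#define lw_mb()  atomic_thread_fence(memory_order_acq_rel)

static int ring_slot;   /* stand-in for a descriptor field in system memory */
static _Atomic int own; /* stand-in for the ownership flag */

/* Publish a descriptor: only memory-memory ordering is needed between
 * the data write and the flag write, so lw_wmb() would suffice where a
 * full wmb() is overkill. */
static void publish(int value)
{
    ring_slot = value;
    lw_wmb(); /* order the data before the ownership flag */
    atomic_store_explicit(&own, 1, memory_order_relaxed);
}
```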
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..bbdcd34 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
 #define smp_wmb()	dmb(ishst)
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..c91571c 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,33 +32,7 @@
 #define rmb()		dsb(ld)
 #define wmb()		dsb(st)
 
-#ifndef CONFIG_SMP
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-
-#define smp_store_release(p, v)						\
-do {									\
-	compiletime_assert_atomic_type(*p);				\
-	barrier();							\
-	ACCESS_ONCE(*p) = (v);						\
-} while (0)
-
-#define smp_load_acquire(p)						\
-({									\
-	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
-	compiletime_assert_atomic_type(*p);				\
-	barrier();							\
-	___p1;								\
-})
-
-#else
-
-#define smp_mb()	dmb(ish)
-#define smp_rmb()	dmb(ishld)
-#define smp_wmb()	dmb(ishst)
-
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	switch (sizeof(*p)) {						\
@@ -73,7 +47,7 @@ do {									\
 	}								\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1;						\
 	compiletime_assert_atomic_type(*p);				\
@@ -90,6 +64,35 @@ do {									\
 	___p1;								\
 })
 
+#ifndef CONFIG_SMP
+#define smp_mb()	barrier()
+#define smp_rmb()	barrier()
+#define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
+#else
+
+#define smp_mb()	dmb(ish)
+#define smp_rmb()	dmb(ishld)
+#define smp_wmb()	dmb(ishst)
+
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index a48957c..d7fe208 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -63,14 +63,14 @@
  * need for asm trickery!
  */
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -78,6 +78,9 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 /*
  * XXX check on this ---I suspect what Linus really wants here is
  * acquire vs release semantics but we can't discuss this stuff with
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c7591e8..9beb687 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -85,6 +85,21 @@ static inline void fence(void)
 #define smp_read_barrier_depends()	do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index d0101dd..fc7323c 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,6 +180,21 @@
 #define nudge_writes() mb()
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index bab79a1..f2a0d73 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -37,6 +37,23 @@
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
+#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	___p1;								\
+})
+
 #ifdef CONFIG_SMP
 
 #ifdef __SUBARCH_HAS_LWSYNC
@@ -45,15 +62,12 @@
 # define SMPWMB      eieio
 #endif
 
-#define __lwsync() __asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
 #define smp_mb()	mb()
 #define smp_rmb()	__lwsync()
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
-#define __lwsync()	barrier()
-
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -72,7 +86,7 @@
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
@@ -80,7 +94,7 @@ do {									\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	___p1;								\
 })
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index b5dce65..637d7a9 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -35,14 +35,14 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -50,4 +50,7 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 305dcc3..7de3c69 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,14 +53,14 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 
 #define smp_read_barrier_depends()	do { } while(0)
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -68,6 +68,8 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
 
 #define smp_mb__before_atomic()	barrier()
 #define smp_mb__after_atomic()	barrier()
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 0f4460b..3d2aa18 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -103,6 +103,21 @@
  * model and we should fall back to full barriers.
  */
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
@@ -120,14 +135,14 @@ do {									\
 
 #else /* regular x86 TSO memory ordering */
 
-#define smp_store_release(p, v)						\
+#define store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
 	barrier();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
-#define smp_load_acquire(p)						\
+#define load_acquire(p)							\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
@@ -135,6 +150,9 @@ do {									\
 	___p1;								\
 })
 
+#define smp_store_release(p, v)	store_release(p, v)
+#define smp_load_acquire(p)	load_acquire(p)
+
 #endif
 
 /* Atomic operations are already serializing on x86 */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c6e4b99 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,21 @@
 #define smp_mb__after_atomic()	smp_mb()
 #endif
 
+#define store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define load_acquire(p)							\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	mb();								\
+	___p1;								\
+})
+
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
It is common for device drivers to make use of acquire/release semantics
when dealing with descriptors stored in device memory. On reviewing the
documentation and code for smp_load_acquire() and smp_store_release(), as
well as reviewing an IBM website that goes over the use of PowerPC barriers
at http://www.ibm.com/developerworks/systems/articles/powerpc.html, it
occurred to me that the same code could likely be applied to device drivers.

As a result this patch introduces load_acquire() and store_release(). The
load_acquire() function can be used in situations where a test for
ownership must be followed by a memory barrier. The below example is from
ixgbe:

	if (!rx_desc->wb.upper.status_error)
		break;

	/* This memory barrier is needed to keep us from reading
	 * any other fields out of the rx_desc until we know the
	 * descriptor has been written back
	 */
	rmb();

With load_acquire() this can be changed to:

	if (!load_acquire(&rx_desc->wb.upper.status_error))
		break;

A similar change can be made in the release path of many drivers. For
example, in the Realtek r8169 driver there are a number of flows that
consist of something like the following:

	wmb();

	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
	txd->opts1 = cpu_to_le32(status);

	tp->cur_tx += frags + 1;

	wmb();

With store_release() this can be changed to the following:

	status = opts[0] | len | (RingEnd * !((entry + 1) % NUM_TX_DESC));
	store_release(&txd->opts1, cpu_to_le32(status));

	tp->cur_tx += frags + 1;

	wmb();

The resulting assembler code can be significantly less expensive on
architectures such as x86 and s390 that support strong ordering.
On architectures that are able to use different primitives than their
rmb/wmb(), such as powerpc, ia64, and arm64, we should see gains as we are
able to use less expensive barriers. On other architectures we end up using
a mb(), which may cost as much as or more than a rmb/wmb(), since we must
ensure Load/Store ordering.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 arch/arm/include/asm/barrier.h      |   15 +++++++++
 arch/arm64/include/asm/barrier.h    |   59 ++++++++++++++++++-----------------
 arch/ia64/include/asm/barrier.h     |    7 +++-
 arch/metag/include/asm/barrier.h    |   15 +++++++++
 arch/mips/include/asm/barrier.h     |   15 +++++++++
 arch/powerpc/include/asm/barrier.h  |   24 +++++++++++---
 arch/s390/include/asm/barrier.h     |    7 +++-
 arch/sparc/include/asm/barrier_64.h |    6 ++--
 arch/x86/include/asm/barrier.h      |   22 ++++++++++++-
 include/asm-generic/barrier.h       |   15 +++++++++
 10 files changed, 144 insertions(+), 41 deletions(-)