
[2/4] arch: Add lightweight memory barriers fast_rmb() and fast_wmb()

Message ID 20141117171812.22333.90395.stgit@ahduyck-server
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Alexander Duyck Nov. 17, 2014, 5:18 p.m. UTC
There are a number of situations where the mandatory barriers rmb() and
wmb() are used to order memory/memory operations in device drivers, and
those barriers are much heavier than they actually need to be.  For
example, in the case of PowerPC, wmb() calls the heavy-weight sync
instruction when all that is really needed for memory/memory operations
is an lwsync or eieio instruction.

This commit adds a fast (and loose) version of the mandatory memory
barriers rmb() and wmb().  The prefix of the name is actually based on the
version of the functions that already exists in the mips and tile trees.
However, I thought it applicable since it gets at what we are trying to
accomplish with these barriers and somewhat implies their risky nature.

These new barriers are not as safe as the standard rmb() and wmb().
Specifically they do not guarantee ordering between cache-enabled and
cache-inhibited memories.  The primary use case for these would be to
enforce ordering of memory reads/writes when accessing cache-enabled memory
that is shared between the CPU and a device.

It may also be noted that there is no fast_mb().  This is due to the fact
that most architectures didn't seem to have a good way to do a full memory
barrier quickly, and so they usually resorted to an mb() for their smp_mb()
call.  As such, there is no point in adding a fast_mb() function if it is
going to map to mb() for all architectures anyway.
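
As a rough sketch of the intended conversion (the descriptor layout, flag
names, and helper below are hypothetical, not taken from any real driver),
a polling path that currently pays for a full rmb() could do:

	/* hypothetical descriptor shared with a device in coherent memory */
	struct ring_desc {
		u32 status;		/* DESC_DONE is set by the device */
		u32 len;
		u64 addr;
	};

	static bool ring_desc_done(const struct ring_desc *desc)
	{
		if (!(ACCESS_ONCE(desc->status) & DESC_DONE))
			return false;
		/*
		 * Order the status check before any reads of the payload.
		 * rmb() would also order against MMIO (sync on PowerPC,
		 * lfence on x86); fast_rmb() only orders coherent-memory
		 * accesses (lwsync on PowerPC, a compiler barrier on x86).
		 */
		fast_rmb();
		return true;
	}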

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 Documentation/memory-barriers.txt   |   41 +++++++++++++++++++++++++++++++++++
 arch/arm/include/asm/barrier.h      |    4 +++
 arch/arm64/include/asm/barrier.h    |    3 +++
 arch/ia64/include/asm/barrier.h     |    3 +++
 arch/metag/include/asm/barrier.h    |   14 ++++++------
 arch/powerpc/include/asm/barrier.h  |   22 +++++++++++--------
 arch/s390/include/asm/barrier.h     |    2 ++
 arch/sparc/include/asm/barrier_64.h |    3 +++
 arch/x86/include/asm/barrier.h      |   11 ++++++---
 arch/x86/um/asm/barrier.h           |   13 ++++++-----
 include/asm-generic/barrier.h       |    8 +++++++
 11 files changed, 98 insertions(+), 26 deletions(-)



Comments

Benjamin Herrenschmidt Nov. 17, 2014, 8:04 p.m. UTC | #1
On Mon, 2014-11-17 at 09:18 -0800, Alexander Duyck wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be.  For
> example in the case of PowerPC wmb() calls the heavy-weight sync
> instruction when for memory/memory operations all that is really needed is
> an lsync or eieio instruction.

So essentially those are the same as the smp_* variants but not nop'ed
out on !CONFIG_SMP right ? 

Ben.

> This commit adds a fast (and loose) version of the mandatory memory
> barriers rmb() and wmb().  The prefix to the name is actually based on the
> version of the functions that already exist in the mips and tile trees.
> However I thought it applicable since it gets at what we are trying to
> accomplish with these barriers and somethat implies their risky nature.
> 
> These new barriers are not as safe as the standard rmb() and wmb().
> Specifically they do not guarantee ordering between cache-enabled and
> cache-inhibited memories.  The primary use case for these would be to
> enforce ordering of memory reads/writes when accessing cache-enabled memory
> that is shared between the CPU and a device.
> 
> It may also be noted that there is no fast_mb().  This is due to the fact
> that most architectures didn't seem to have a good way to do a full memory
> barrier quickly and so they usually resorted to an mb() for their smp_mb
> call.  As such there no point in adding a fast_mb() function if it is going
> to map to mb() for all architectures anyway.
> 
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> ---
>  Documentation/memory-barriers.txt   |   41 +++++++++++++++++++++++++++++++++++
>  arch/arm/include/asm/barrier.h      |    4 +++
>  arch/arm64/include/asm/barrier.h    |    3 +++
>  arch/ia64/include/asm/barrier.h     |    3 +++
>  arch/metag/include/asm/barrier.h    |   14 ++++++------
>  arch/powerpc/include/asm/barrier.h  |   22 +++++++++++--------
>  arch/s390/include/asm/barrier.h     |    2 ++
>  arch/sparc/include/asm/barrier_64.h |    3 +++
>  arch/x86/include/asm/barrier.h      |   11 ++++++---
>  arch/x86/um/asm/barrier.h           |   13 ++++++-----
>  include/asm-generic/barrier.h       |    8 +++++++
>  11 files changed, 98 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 22a969c..fe55dea 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
>       operations" subsection for information on where to use these.
>  
> 
> + (*) fast_wmb();
> + (*) fast_rmb();
> +
> +     These are for use with memory based device I/O to guarantee the ordering
> +     of cache-enabled writes or reads with respect to other writes or reads
> +     to cache-enabled memory.
> +
> +     For example, consider a device driver that shares memory with a device
> +     and uses a descriptor status value to indicate if the descriptor belongs
> +     to the device or the CPU, and a doorbell to notify it when new
> +     descriptors are available:
> +
> +	if (desc->status != DEVICE_OWN) {
> +		/* do not read data until we own descriptor */
> +		fast_rmb();
> +
> +		/* read/modify data */
> +		read_data = desc->data;
> +		desc->data = write_data;
> +
> +		/* flush modifications before status update */
> +		fast_wmb();
> +
> +		/* assign ownership */
> +		desc->status = DEVICE_OWN;
> +
> +		/* force memory to sync before notifying device via MMIO */
> +		wmb();
> +
> +		/* notify device of new descriptors */
> +		writel(DESC_NOTIFY, doorbell);
> +	}
> +
> +     The fast_rmb() allows us guarantee the device has released ownership
> +     before we read the data from the descriptor, and he fast_wmb() allows
> +     us to guarantee the data is written to the descriptor before the device
> +     can see it now has ownership.  The wmb() is needed to guarantee that the
> +     cache-enabled memory writes have completed before attempting a write to
> +     the cache-inhibited MMIO region.
> +
> +
>  MMIO WRITE BARRIER
>  ------------------
>  
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index c6a3e73..c57903c 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -43,10 +43,14 @@
>  #define mb()		do { dsb(); outer_sync(); } while (0)
>  #define rmb()		dsb()
>  #define wmb()		do { dsb(st); outer_sync(); } while (0)
> +#define fast_rmb()	dmb()
> +#define fast_wmb()	dmb(st)
>  #else
>  #define mb()		barrier()
>  #define rmb()		barrier()
>  #define wmb()		barrier()
> +#define fast_rmb()	barrier()
> +#define fast_wmb()	barrier()
>  #endif
>  
>  #ifndef CONFIG_SMP
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 6389d60..3ca1a15 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -32,6 +32,9 @@
>  #define rmb()		dsb(ld)
>  #define wmb()		dsb(st)
>  
> +#define fast_rmb()	dmb(ld)
> +#define fast_wmb()	dmb(st)
> +
>  #ifndef CONFIG_SMP
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index e8fffb0..997f5b0 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -39,6 +39,9 @@
>  #define rmb()		mb()
>  #define wmb()		mb()
>  
> +#define fast_rmb()	mb()
> +#define fast_wmb()	mb()
> +
>  #ifdef CONFIG_SMP
>  # define smp_mb()	mb()
>  #else
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index 6d8b8c9..edddb6d 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -4,8 +4,6 @@
>  #include <asm/metag_mem.h>
>  
>  #define nop()		asm volatile ("NOP")
> -#define mb()		wmb()
> -#define rmb()		barrier()
>  
>  #ifdef CONFIG_METAG_META21
>  
> @@ -41,11 +39,13 @@ static inline void wr_fence(void)
>  
>  #endif /* !CONFIG_METAG_META21 */
>  
> -static inline void wmb(void)
> -{
> -	/* flush writes through the write combiner */
> -	wr_fence();
> -}
> +/* flush writes through the write combiner */
> +#define mb()		wr_fence()
> +#define rmb()		barrier()
> +#define wmb()		mb()
> +
> +#define fast_rmb()	rmb()
> +#define fast_wmb()	wmb()
>  
>  #ifndef CONFIG_SMP
>  #define fence()		do { } while (0)
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index cb6d66c..f480097 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,22 +36,20 @@
>  
>  #define set_mb(var, value)	do { var = value; mb(); } while (0)
>  
> -#ifdef CONFIG_SMP
> -
>  #ifdef __SUBARCH_HAS_LWSYNC
>  #    define SMPWMB      LWSYNC
>  #else
>  #    define SMPWMB      eieio
>  #endif
>  
> -#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>  
> +#ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> -#define smp_rmb()	__lwsync()
> -#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> +#define smp_rmb()	fast_rmb()
> +#define smp_wmb()	fast_wmb()
>  #else
> -#define __lwsync()	barrier()
> -
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> @@ -69,10 +67,16 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>  
> +/*
> + * The use of smp_rmb() in these functions are actually meant to map from
> + * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> + * map to the more heavy-weight sync.
> + */
>  #define smp_store_release(p, v)						\
>  do {									\
>  	compiletime_assert_atomic_type(*p);				\
> -	__lwsync();							\
> +	smp_rmb();							\
>  	ACCESS_ONCE(*p) = (v);						\
>  } while (0)
>  
> @@ -80,7 +84,7 @@ do {									\
>  ({									\
>  	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
>  	compiletime_assert_atomic_type(*p);				\
> -	__lwsync();							\
> +	smp_rmb();							\
>  	___p1;								\
>  })
>  
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 33d191d..9a31301 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -24,6 +24,8 @@
>  
>  #define rmb()				mb()
>  #define wmb()				mb()
> +#define fast_rmb()			rmb()
> +#define fast_wmb()			wmb()
>  #define smp_mb()			mb()
>  #define smp_rmb()			rmb()
>  #define smp_wmb()			wmb()
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 6c974c0..eac2777 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -37,6 +37,9 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
>  #define rmb()	__asm__ __volatile__("":::"memory")
>  #define wmb()	__asm__ __volatile__("":::"memory")
>  
> +#define fast_rmb()	rmb()
> +#define fast_wmb()	wmb()
> +
>  #define set_mb(__var, __value) \
>  	do { __var = __value; membar_safe("#StoreLoad"); } while(0)
>  
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 5238000..cf440e7 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -24,13 +24,16 @@
>  #define wmb()	asm volatile("sfence" ::: "memory")
>  #endif
>  
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
>  #ifdef CONFIG_X86_PPRO_FENCE
> -# define smp_rmb()	rmb()
> +#define fast_rmb()	rmb()
>  #else
> -# define smp_rmb()	barrier()
> +#define fast_rmb()	barrier()
>  #endif
> +#define fast_wmb()	barrier()
> +
> +#ifdef CONFIG_SMP
> +#define smp_mb()	mb()
> +#define smp_rmb()	fast_rmb()
>  #define smp_wmb()	barrier()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>  #else /* !SMP */
> diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
> index d6511d9..2c0405d 100644
> --- a/arch/x86/um/asm/barrier.h
> +++ b/arch/x86/um/asm/barrier.h
> @@ -29,17 +29,18 @@
>  
>  #endif /* CONFIG_X86_32 */
>  
> -#ifdef CONFIG_SMP
> -
> -#define smp_mb()	mb()
>  #ifdef CONFIG_X86_PPRO_FENCE
> -#define smp_rmb()	rmb()
> +#define fast_rmb()	rmb()
>  #else /* CONFIG_X86_PPRO_FENCE */
> -#define smp_rmb()	barrier()
> +#define fast_rmb()	barrier()
>  #endif /* CONFIG_X86_PPRO_FENCE */
> +#define fast_wmb()	barrier()
>  
> -#define smp_wmb()	barrier()
> +#ifdef CONFIG_SMP
>  
> +#define smp_mb()	mb()
> +#define smp_rmb()	fast_rmb()
> +#define smp_wmb()	fast_wmb()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>  
>  #else /* CONFIG_SMP */
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 1402fa8..c1d60b9 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -42,6 +42,14 @@
>  #define wmb()	mb()
>  #endif
>  
> +#ifndef fast_rmb
> +#define fast_rmb()	rmb()
> +#endif
> +
> +#ifndef fast_wmb
> +#define fast_wmb()	wmb()
> +#endif
> +
>  #ifndef read_barrier_depends
>  #define read_barrier_depends()		do { } while (0)
>  #endif
> 


Paul E. McKenney Nov. 17, 2014, 8:18 p.m. UTC | #2
On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be.  For
> example in the case of PowerPC wmb() calls the heavy-weight sync
> instruction when for memory/memory operations all that is really needed is
> an lsync or eieio instruction.

Is this still the case if one of the memory operations is MMIO?  Last
I knew, it was not.

> This commit adds a fast (and loose) version of the mandatory memory
> barriers rmb() and wmb().  The prefix to the name is actually based on the
> version of the functions that already exist in the mips and tile trees.
> However I thought it applicable since it gets at what we are trying to
> accomplish with these barriers and somethat implies their risky nature.
> 
> These new barriers are not as safe as the standard rmb() and wmb().
> Specifically they do not guarantee ordering between cache-enabled and
> cache-inhibited memories.  The primary use case for these would be to
> enforce ordering of memory reads/writes when accessing cache-enabled memory
> that is shared between the CPU and a device.
> 
> It may also be noted that there is no fast_mb().  This is due to the fact
> that most architectures didn't seem to have a good way to do a full memory
> barrier quickly and so they usually resorted to an mb() for their smp_mb
> call.  As such there no point in adding a fast_mb() function if it is going
> to map to mb() for all architectures anyway.

I must confess that I still don't entirely understand the motivation.

Some problems in PowerPC barrier.h called out below.

							Thanx, Paul

> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: David Miller <davem@davemloft.net>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
> ---
>  Documentation/memory-barriers.txt   |   41 +++++++++++++++++++++++++++++++++++
>  arch/arm/include/asm/barrier.h      |    4 +++
>  arch/arm64/include/asm/barrier.h    |    3 +++
>  arch/ia64/include/asm/barrier.h     |    3 +++
>  arch/metag/include/asm/barrier.h    |   14 ++++++------
>  arch/powerpc/include/asm/barrier.h  |   22 +++++++++++--------
>  arch/s390/include/asm/barrier.h     |    2 ++
>  arch/sparc/include/asm/barrier_64.h |    3 +++
>  arch/x86/include/asm/barrier.h      |   11 ++++++---
>  arch/x86/um/asm/barrier.h           |   13 ++++++-----
>  include/asm-generic/barrier.h       |    8 +++++++
>  11 files changed, 98 insertions(+), 26 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 22a969c..fe55dea 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
>       operations" subsection for information on where to use these.
> 
> 
> + (*) fast_wmb();
> + (*) fast_rmb();
> +
> +     These are for use with memory based device I/O to guarantee the ordering
> +     of cache-enabled writes or reads with respect to other writes or reads
> +     to cache-enabled memory.
> +
> +     For example, consider a device driver that shares memory with a device
> +     and uses a descriptor status value to indicate if the descriptor belongs
> +     to the device or the CPU, and a doorbell to notify it when new
> +     descriptors are available:
> +
> +	if (desc->status != DEVICE_OWN) {
> +		/* do not read data until we own descriptor */
> +		fast_rmb();
> +
> +		/* read/modify data */
> +		read_data = desc->data;
> +		desc->data = write_data;
> +
> +		/* flush modifications before status update */
> +		fast_wmb();
> +
> +		/* assign ownership */
> +		desc->status = DEVICE_OWN;
> +
> +		/* force memory to sync before notifying device via MMIO */
> +		wmb();
> +
> +		/* notify device of new descriptors */
> +		writel(DESC_NOTIFY, doorbell);
> +	}
> +
> +     The fast_rmb() allows us guarantee the device has released ownership
> +     before we read the data from the descriptor, and he fast_wmb() allows
> +     us to guarantee the data is written to the descriptor before the device
> +     can see it now has ownership.  The wmb() is needed to guarantee that the
> +     cache-enabled memory writes have completed before attempting a write to
> +     the cache-inhibited MMIO region.
> +
> +
>  MMIO WRITE BARRIER
>  ------------------
> 
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index c6a3e73..c57903c 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -43,10 +43,14 @@
>  #define mb()		do { dsb(); outer_sync(); } while (0)
>  #define rmb()		dsb()
>  #define wmb()		do { dsb(st); outer_sync(); } while (0)
> +#define fast_rmb()	dmb()
> +#define fast_wmb()	dmb(st)
>  #else
>  #define mb()		barrier()
>  #define rmb()		barrier()
>  #define wmb()		barrier()
> +#define fast_rmb()	barrier()
> +#define fast_wmb()	barrier()
>  #endif
> 
>  #ifndef CONFIG_SMP
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 6389d60..3ca1a15 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -32,6 +32,9 @@
>  #define rmb()		dsb(ld)
>  #define wmb()		dsb(st)
> 
> +#define fast_rmb()	dmb(ld)
> +#define fast_wmb()	dmb(st)
> +
>  #ifndef CONFIG_SMP
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index e8fffb0..997f5b0 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -39,6 +39,9 @@
>  #define rmb()		mb()
>  #define wmb()		mb()
> 
> +#define fast_rmb()	mb()
> +#define fast_wmb()	mb()
> +
>  #ifdef CONFIG_SMP
>  # define smp_mb()	mb()
>  #else
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index 6d8b8c9..edddb6d 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -4,8 +4,6 @@
>  #include <asm/metag_mem.h>
> 
>  #define nop()		asm volatile ("NOP")
> -#define mb()		wmb()
> -#define rmb()		barrier()
> 
>  #ifdef CONFIG_METAG_META21
> 
> @@ -41,11 +39,13 @@ static inline void wr_fence(void)
> 
>  #endif /* !CONFIG_METAG_META21 */
> 
> -static inline void wmb(void)
> -{
> -	/* flush writes through the write combiner */
> -	wr_fence();
> -}
> +/* flush writes through the write combiner */
> +#define mb()		wr_fence()
> +#define rmb()		barrier()
> +#define wmb()		mb()
> +
> +#define fast_rmb()	rmb()
> +#define fast_wmb()	wmb()
> 
>  #ifndef CONFIG_SMP
>  #define fence()		do { } while (0)
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index cb6d66c..f480097 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -36,22 +36,20 @@
> 
>  #define set_mb(var, value)	do { var = value; mb(); } while (0)
> 
> -#ifdef CONFIG_SMP
> -
>  #ifdef __SUBARCH_HAS_LWSYNC
>  #    define SMPWMB      LWSYNC
>  #else
>  #    define SMPWMB      eieio
>  #endif
> 
> -#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> 
> +#ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> -#define smp_rmb()	__lwsync()
> -#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> +#define smp_rmb()	fast_rmb()
> +#define smp_wmb()	fast_wmb()
>  #else
> -#define __lwsync()	barrier()
> -
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> @@ -69,10 +67,16 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> 
> +/*
> + * The use of smp_rmb() in these functions are actually meant to map from
> + * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> + * map to the more heavy-weight sync.
> + */
>  #define smp_store_release(p, v)						\
>  do {									\
>  	compiletime_assert_atomic_type(*p);				\
> -	__lwsync();							\
> +	smp_rmb();							\

This is not good at all.  For smp_store_release(), we absolutely
must order prior loads and stores against the assignment on the following
line.  This is not something that smp_rmb() does, nor is it something
that smp_wmb() does.  Yes, it might happen to work now, but this could easily
break in the future -- plus this change is extremely misleading.

The original __lwsync() is much more clear.

>  	ACCESS_ONCE(*p) = (v);						\
>  } while (0)
> 
> @@ -80,7 +84,7 @@ do {									\
>  ({									\
>  	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
>  	compiletime_assert_atomic_type(*p);				\
> -	__lwsync();							\
> +	smp_rmb();							\
>  	___p1;								\
>  })
> 
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 33d191d..9a31301 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -24,6 +24,8 @@
> 
>  #define rmb()				mb()
>  #define wmb()				mb()
> +#define fast_rmb()			rmb()
> +#define fast_wmb()			wmb()
>  #define smp_mb()			mb()
>  #define smp_rmb()			rmb()
>  #define smp_wmb()			wmb()
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 6c974c0..eac2777 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -37,6 +37,9 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
>  #define rmb()	__asm__ __volatile__("":::"memory")
>  #define wmb()	__asm__ __volatile__("":::"memory")
> 
> +#define fast_rmb()	rmb()
> +#define fast_wmb()	wmb()
> +
>  #define set_mb(__var, __value) \
>  	do { __var = __value; membar_safe("#StoreLoad"); } while(0)
> 
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index 5238000..cf440e7 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -24,13 +24,16 @@
>  #define wmb()	asm volatile("sfence" ::: "memory")
>  #endif
> 
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
>  #ifdef CONFIG_X86_PPRO_FENCE
> -# define smp_rmb()	rmb()
> +#define fast_rmb()	rmb()
>  #else
> -# define smp_rmb()	barrier()
> +#define fast_rmb()	barrier()
>  #endif
> +#define fast_wmb()	barrier()
> +
> +#ifdef CONFIG_SMP
> +#define smp_mb()	mb()
> +#define smp_rmb()	fast_rmb()
>  #define smp_wmb()	barrier()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>  #else /* !SMP */
> diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
> index d6511d9..2c0405d 100644
> --- a/arch/x86/um/asm/barrier.h
> +++ b/arch/x86/um/asm/barrier.h
> @@ -29,17 +29,18 @@
> 
>  #endif /* CONFIG_X86_32 */
> 
> -#ifdef CONFIG_SMP
> -
> -#define smp_mb()	mb()
>  #ifdef CONFIG_X86_PPRO_FENCE
> -#define smp_rmb()	rmb()
> +#define fast_rmb()	rmb()
>  #else /* CONFIG_X86_PPRO_FENCE */
> -#define smp_rmb()	barrier()
> +#define fast_rmb()	barrier()
>  #endif /* CONFIG_X86_PPRO_FENCE */
> +#define fast_wmb()	barrier()
> 
> -#define smp_wmb()	barrier()
> +#ifdef CONFIG_SMP
> 
> +#define smp_mb()	mb()
> +#define smp_rmb()	fast_rmb()
> +#define smp_wmb()	fast_wmb()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> 
>  #else /* CONFIG_SMP */
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 1402fa8..c1d60b9 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -42,6 +42,14 @@
>  #define wmb()	mb()
>  #endif
> 
> +#ifndef fast_rmb
> +#define fast_rmb()	rmb()
> +#endif
> +
> +#ifndef fast_wmb
> +#define fast_wmb()	wmb()
> +#endif
> +
>  #ifndef read_barrier_depends
>  #define read_barrier_depends()		do { } while (0)
>  #endif
> 

Alexander H Duyck Nov. 17, 2014, 8:24 p.m. UTC | #3
On 11/17/2014 12:04 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2014-11-17 at 09:18 -0800, Alexander Duyck wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be.  For
>> example in the case of PowerPC wmb() calls the heavy-weight sync
>> instruction when for memory/memory operations all that is really needed is
>> an lsync or eieio instruction.
> So essentially those are the same as the smp_* variants but not nop'ed
> out on !CONFIG_SMP right ? 
>
> Ben.
>

Yes and no.  For example, on ARM I used the dmb() operation; however, I
have to use the barrier at the system level instead of just the inner
shareable domain.  On many other architectures they are just the same as
the smp_* variants.

Basically, the resultant code is somewhere between the smp and non-smp
barriers in terms of what they cover.

Thanks,

Alex
Linus Torvalds Nov. 17, 2014, 8:52 p.m. UTC | #4
On Mon, Nov 17, 2014 at 9:18 AM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
> There are a number of situations where the mandatory barriers rmb() and
> wmb() are used to order memory/memory operations in the device drivers
> and those barriers are much heavier than they actually need to be.

Ugh. I absolutely despise the name.

It's not "fast". It's just limited. It's the same as "smp_*mb()", in
that it works on cacheable memory, but it actually stays around even
for non-SMP builds.

So I think the name is actively misleading.

Naming should be about what it does, not about some kind of PR thing
that confuses people into thinking it's "better".

Maybe "dma_*mb()" would be acceptable, and ends up having the same
naming convention as "smb_*mb()", and explains what it's about.

And yes, in the same spirit, it would probably be good to try to
eventually get rid of the plain "*mb()" functions, and perhaps call
them "mmio_*mb()" to clarify that they are about ordering memory wrt
mmio.

Hmm?

                        Linus
Alexander Duyck Nov. 17, 2014, 9:11 p.m. UTC | #5
On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be.  For
>> example in the case of PowerPC wmb() calls the heavy-weight sync
>> instruction when for memory/memory operations all that is really needed is
>> an lsync or eieio instruction.
>
> Is this still the case if one of the memory operations is MMIO?  Last
> I knew, it was not.

This barrier is not meant for use in MMIO operations; for that you still
need a full barrier, as I call out in the documentation section. What the
barrier does is allow for a lightweight barrier for accesses to coherent
system memory. For example, many device drivers have to perform a read
of the descriptor to see if the device is done with it. We need an rmb()
following that check to keep the subsequent accesses from being reordered
ahead of it.

Right now on x86 that rmb() becomes an lfence instruction and is quite
expensive, and as it turns out we don't need it since x86 doesn't
reorder reads. The same kind of thing applies to PowerPC, only in that
case we use a sync when what we really need is an lwsync.

>> This commit adds a fast (and loose) version of the mandatory memory
>> barriers rmb() and wmb().  The prefix to the name is actually based on the
>> version of the functions that already exist in the mips and tile trees.
>> However I thought it applicable since it gets at what we are trying to
>> accomplish with these barriers and somethat implies their risky nature.
>>
>> These new barriers are not as safe as the standard rmb() and wmb().
>> Specifically they do not guarantee ordering between cache-enabled and
>> cache-inhibited memories.  The primary use case for these would be to
>> enforce ordering of memory reads/writes when accessing cache-enabled memory
>> that is shared between the CPU and a device.
>>
>> It may also be noted that there is no fast_mb().  This is due to the fact
>> that most architectures didn't seem to have a good way to do a full memory
>> barrier quickly and so they usually resorted to an mb() for their smp_mb
>> call.  As such there no point in adding a fast_mb() function if it is going
>> to map to mb() for all architectures anyway.
>
> I must confess that I still don't entirely understand the motivation.

The motivation is to provide finer-grained barriers. This provides an
in-between option that allows us to "choose the right hammer". In the case
of PowerPC it is the difference between sync/lwsync, on ARM it is
dsb()/dmb(), and on x86 it is lfence/barrier().
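
For reference, per the per-arch hunks in this patch, that works out roughly
to:

	PowerPC:  rmb()/wmb() = sync;
	          fast_rmb() = lwsync, fast_wmb() = lwsync (eieio without lwsync)
	ARM:      rmb() = dsb(), wmb() = dsb(st) + outer_sync();
	          fast_rmb() = dmb(), fast_wmb() = dmb(st)
	x86:      rmb() = lfence, wmb() = sfence;
	          fast_rmb() = barrier() (rmb() under CONFIG_X86_PPRO_FENCE),
	          fast_wmb() = barrier()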

> Some problems in PowerPC barrier.h called out below.
>
> 							Thanx, Paul
>

<snip>

>> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
>> index 22a969c..fe55dea 100644
>> --- a/Documentation/memory-barriers.txt
>> +++ b/Documentation/memory-barriers.txt
>> @@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
>>        operations" subsection for information on where to use these.
>>
>>
>> + (*) fast_wmb();
>> + (*) fast_rmb();
>> +
>> +     These are for use with memory based device I/O to guarantee the ordering
>> +     of cache-enabled writes or reads with respect to other writes or reads
>> +     to cache-enabled memory.
>> +
>> +     For example, consider a device driver that shares memory with a device
>> +     and uses a descriptor status value to indicate if the descriptor belongs
>> +     to the device or the CPU, and a doorbell to notify it when new
>> +     descriptors are available:
>> +
>> +	if (desc->status != DEVICE_OWN) {
>> +		/* do not read data until we own descriptor */
>> +		fast_rmb();
>> +
>> +		/* read/modify data */
>> +		read_data = desc->data;
>> +		desc->data = write_data;
>> +
>> +		/* flush modifications before status update */
>> +		fast_wmb();
>> +
>> +		/* assign ownership */
>> +		desc->status = DEVICE_OWN;
>> +
>> +		/* force memory to sync before notifying device via MMIO */
>> +		wmb();
>> +
>> +		/* notify device of new descriptors */
>> +		writel(DESC_NOTIFY, doorbell);
>> +	}
>> +
>> +     The fast_rmb() allows us guarantee the device has released ownership
>> +     before we read the data from the descriptor, and he fast_wmb() allows
>> +     us to guarantee the data is written to the descriptor before the device
>> +     can see it now has ownership.  The wmb() is needed to guarantee that the
>> +     cache-enabled memory writes have completed before attempting a write to
>> +     the cache-inhibited MMIO region.
>> +
>> +
>>   MMIO WRITE BARRIER
>>   ------------------

The general idea is that the device/CPU follow acquire/release style 
semantics and we need the lightweight barriers to enforce ordering to 
prevent us from accessing the descriptors during those periods where we 
do not own them.  As the example shows we still need a full barrier when 
going between MMIO and standard memory.  Hopefully that is what is 
conveyed in the documentation bits I have above.

<snip>

>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>> index cb6d66c..f480097 100644
>> --- a/arch/powerpc/include/asm/barrier.h
>> +++ b/arch/powerpc/include/asm/barrier.h
>> @@ -36,22 +36,20 @@
>>
>>   #define set_mb(var, value)	do { var = value; mb(); } while (0)
>>
>> -#ifdef CONFIG_SMP
>> -
>>   #ifdef __SUBARCH_HAS_LWSYNC
>>   #    define SMPWMB      LWSYNC
>>   #else
>>   #    define SMPWMB      eieio
>>   #endif
>>
>> -#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>> +#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>> +#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>
>> +#ifdef CONFIG_SMP
>>   #define smp_mb()	mb()
>> -#define smp_rmb()	__lwsync()
>> -#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>> +#define smp_rmb()	fast_rmb()
>> +#define smp_wmb()	fast_wmb()
>>   #else
>> -#define __lwsync()	barrier()
>> -
>>   #define smp_mb()	barrier()
>>   #define smp_rmb()	barrier()
>>   #define smp_wmb()	barrier()
>> @@ -69,10 +67,16 @@
>>   #define data_barrier(x)	\
>>   	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>>
>> +/*
>> + * The use of smp_rmb() in these functions are actually meant to map from
>> + * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
>> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
>> + * map to the more heavy-weight sync.
>> + */
>>   #define smp_store_release(p, v)						\
>>   do {									\
>>   	compiletime_assert_atomic_type(*p);				\
>> -	__lwsync();							\
>> +	smp_rmb();							\
>
> This is not good at all.  For smp_store_release(), we absolutely
> must order prior loads and stores against the assignment on the following
> line.  This is not something that smp_rmb() does, nor is it something
> that smp_wmb() does.  Yes, it might happen to now, but this could easily
> break in the future -- plus this change is extremely misleading.
>
> The original __lwsync() is much more clear.

The problem I had with __lwsync is that it really wasn't all that clear.
It was the lwsync instruction if SMP was enabled; otherwise it was just
a barrier call. What I did is move the SMP-case definition of __lwsync
into fast_rmb(), which in turn is used by smp_rmb(). I tried to make
this clear in the comment just above the two calls. The resultant
assembly code should be exactly the same.

What I could do is add it back as an smp_lwsync() if that works for
you. That way there is something there to give you a hint that it
becomes a barrier() call as soon as SMP is disabled.
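
A minimal sketch of that option (assuming the LWSYNC/SMPWMB machinery
stays as in this patch; this is not necessarily the exact form it would
take):

	#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")

	#ifdef CONFIG_SMP
	#define smp_lwsync()	fast_rmb()
	#else
	#define smp_lwsync()	barrier()
	#endif

	#define smp_store_release(p, v)						\
	do {									\
		compiletime_assert_atomic_type(*p);				\
		smp_lwsync();							\
		ACCESS_ONCE(*p) = (v);						\
	} while (0)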

Thanks,

Alex
Alexander Duyck Nov. 17, 2014, 9:54 p.m. UTC | #6
On 11/17/2014 12:52 PM, Linus Torvalds wrote:
> On Mon, Nov 17, 2014 at 9:18 AM, Alexander Duyck
> <alexander.h.duyck@redhat.com> wrote:
>> There are a number of situations where the mandatory barriers rmb() and
>> wmb() are used to order memory/memory operations in the device drivers
>> and those barriers are much heavier than they actually need to be.
> Ugh. I absolutely despise the name.
>
> It's not "fast". It's just limited. It's the same as "smp_*mb()", in
> that it works on cacheable memory, but it actually stays around even
> for non-SMP builds.
>
> So I think the name is actively misleading.
>
> Naming should be about what it does, not about some kind of PR thing
> that confuses people into thinking it's "better".
>
> Maybe "dma_*mb()" would be acceptable, and ends up having the same
> naming convention as "smb_*mb()", and explains what it's about.

What would you think of the name "coherent_*mb()"?  I would prefer to 
avoid dma in the name since, at least in my mind, that implies MMIO.

It also ties in well with dma_alloc_coherent()/dma_free_coherent(), which
is what would typically be used to allocate the memory we would be using
the barrier to protect anyway.
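
A usage sketch with that naming (the ring and descriptor structures here
are hypothetical, and the ring is assumed to have been allocated with
dma_alloc_coherent()):

	static void ring_post_desc(struct hyp_ring *ring, dma_addr_t buf, u32 len)
	{
		struct hyp_desc *desc = &ring->desc[ring->next];

		desc->addr = buf;
		desc->len = len;

		/* publish the payload before handing ownership to the device */
		coherent_wmb();
		desc->status = DESC_DEVICE_OWN;

		/* a full wmb() is still needed before the MMIO doorbell write */
		wmb();
		writel(ring->next, ring->doorbell);
	}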

> And yes, in the same spirit, it would probably be good to try to
> eventually get rid of the plain "*mb()" functions, and perhaps call
> them "mmio_*mb()" to clarify that they are about ordering memory wrt
> mmio.
>
> Hmm?
>
>                          Linus

I will work on pulling all of the coherent barrier cases out of using
the plain "*mb()" calls first.  We need to sort that out before we can
look at renaming the plain barrier functions.

- Alex
Paul E. McKenney Nov. 17, 2014, 11:17 p.m. UTC | #7
On Mon, Nov 17, 2014 at 01:11:57PM -0800, Alexander Duyck wrote:
> On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
> >On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> >>There are a number of situations where the mandatory barriers rmb() and
> >>wmb() are used to order memory/memory operations in the device drivers
> >>and those barriers are much heavier than they actually need to be.  For
> >>example in the case of PowerPC wmb() calls the heavy-weight sync
> >>instruction when for memory/memory operations all that is really needed is
> >>an lsync or eieio instruction.
> >
> >Is this still the case if one of the memory operations is MMIO?  Last
> >I knew, it was not.
> 
> This barrier is not meant for use in MMIO operations, for that you
> still need a full barrier as I call out in the documentation
> section. What the barrier does is allow for a lightweight barrier
> for accesses to coherent system memory. So for example many device
> drivers have to perform a read of the descriptor to see if the
> device is done with it. We need an rmb() following that check to
> prevent any other accesses.
> 
> Right now on x86 that rmb() becomes an lfence instruction and is
> quite expensive, and as it turns out we don't need it since the x86
> doesn't reorder reads. The same kind of thing applies to PowerPC,
> only in that case we use a sync when what we really need is a
> lwsync.

Would it make sense to have a memory barrier that enforced the
non-store-buffer orderings, that is, prior reads before later
reads and writes, and prior writes before later writes?  This was
discussed earlier this year (http://lwn.net/Articles/586838/,
https://lwn.net/Articles/588300/).  If I recall correctly, one of
the biggest obstacles was the name.  ;-)

> >>This commit adds a fast (and loose) version of the mandatory memory
> >>barriers rmb() and wmb().  The prefix to the name is actually based on the
> >>version of the functions that already exist in the mips and tile trees.
> >>However I thought it applicable since it gets at what we are trying to
> >>accomplish with these barriers and somethat implies their risky nature.
> >>
> >>These new barriers are not as safe as the standard rmb() and wmb().
> >>Specifically they do not guarantee ordering between cache-enabled and
> >>cache-inhibited memories.  The primary use case for these would be to
> >>enforce ordering of memory reads/writes when accessing cache-enabled memory
> >>that is shared between the CPU and a device.
> >>
> >>It may also be noted that there is no fast_mb().  This is due to the fact
> >>that most architectures didn't seem to have a good way to do a full memory
> >>barrier quickly and so they usually resorted to an mb() for their smp_mb
> >>call.  As such there no point in adding a fast_mb() function if it is going
> >>to map to mb() for all architectures anyway.
> >
> >I must confess that I still don't entirely understand the motivation.
> 
> The motivation is to provide finer grained barriers. So this
> provides an in-between that allows us to "choose the right hammer".
> In the case of PowerPC it is the difference between sync/lwsync, on
> ARM it is dsb()/dmb(), and on x86 it is lfence/barrier().

Ah, so ARM will motivate a fast_wmb(), given its instruction set.

> >Some problems in PowerPC barrier.h called out below.
> >
> >							Thanx, Paul
> >
> 
> <snip>
> 
> >>diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> >>index 22a969c..fe55dea 100644
> >>--- a/Documentation/memory-barriers.txt
> >>+++ b/Documentation/memory-barriers.txt
> >>@@ -1615,6 +1615,47 @@ There are some more advanced barrier functions:
> >>       operations" subsection for information on where to use these.
> >>
> >>
> >>+ (*) fast_wmb();
> >>+ (*) fast_rmb();
> >>+
> >>+     These are for use with memory based device I/O to guarantee the ordering
> >>+     of cache-enabled writes or reads with respect to other writes or reads
> >>+     to cache-enabled memory.
> >>+
> >>+     For example, consider a device driver that shares memory with a device
> >>+     and uses a descriptor status value to indicate if the descriptor belongs
> >>+     to the device or the CPU, and a doorbell to notify it when new
> >>+     descriptors are available:
> >>+
> >>+	if (desc->status != DEVICE_OWN) {
> >>+		/* do not read data until we own descriptor */
> >>+		fast_rmb();
> >>+
> >>+		/* read/modify data */
> >>+		read_data = desc->data;
> >>+		desc->data = write_data;
> >>+
> >>+		/* flush modifications before status update */
> >>+		fast_wmb();
> >>+
> >>+		/* assign ownership */
> >>+		desc->status = DEVICE_OWN;
> >>+
> >>+		/* force memory to sync before notifying device via MMIO */
> >>+		wmb();
> >>+
> >>+		/* notify device of new descriptors */
> >>+		writel(DESC_NOTIFY, doorbell);
> >>+	}
> >>+
> >>+     The fast_rmb() allows us guarantee the device has released ownership
> >>+     before we read the data from the descriptor, and he fast_wmb() allows
> >>+     us to guarantee the data is written to the descriptor before the device
> >>+     can see it now has ownership.  The wmb() is needed to guarantee that the
> >>+     cache-enabled memory writes have completed before attempting a write to
> >>+     the cache-inhibited MMIO region.
> >>+
> >>+
> >>  MMIO WRITE BARRIER
> >>  ------------------
> 
> The general idea is that the device/CPU follow acquire/release style
> semantics and we need the lightweight barriers to enforce ordering
> to prevent us from accessing the descriptors during those periods
> where we do not own them.  As the example shows we still need a full
> barrier when going between MMIO and standard memory.  Hopefully that
> is what is conveyed in the documentation bits I have above.

Thank you for the clarification.

> <snip>
> 
> >>diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> >>index cb6d66c..f480097 100644
> >>--- a/arch/powerpc/include/asm/barrier.h
> >>+++ b/arch/powerpc/include/asm/barrier.h
> >>@@ -36,22 +36,20 @@
> >>
> >>  #define set_mb(var, value)	do { var = value; mb(); } while (0)
> >>
> >>-#ifdef CONFIG_SMP
> >>-
> >>  #ifdef __SUBARCH_HAS_LWSYNC
> >>  #    define SMPWMB      LWSYNC
> >>  #else
> >>  #    define SMPWMB      eieio
> >>  #endif
> >>
> >>-#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >>+#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>
> >>+#ifdef CONFIG_SMP
> >>  #define smp_mb()	mb()
> >>-#define smp_rmb()	__lwsync()
> >>-#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >>+#define smp_rmb()	fast_rmb()
> >>+#define smp_wmb()	fast_wmb()
> >>  #else
> >>-#define __lwsync()	barrier()
> >>-
> >>  #define smp_mb()	barrier()
> >>  #define smp_rmb()	barrier()
> >>  #define smp_wmb()	barrier()
> >>@@ -69,10 +67,16 @@
> >>  #define data_barrier(x)	\
> >>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> >>
> >>+/*
> >>+ * The use of smp_rmb() in these functions are actually meant to map from
> >>+ * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
> >>+ * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
> >>+ * map to the more heavy-weight sync.
> >>+ */
> >>  #define smp_store_release(p, v)						\
> >>  do {									\
> >>  	compiletime_assert_atomic_type(*p);				\
> >>-	__lwsync();							\
> >>+	smp_rmb();							\
> >
> >This is not good at all.  For smp_store_release(), we absolutely
> >must order prior loads and stores against the assignment on the following
> >line.  This is not something that smp_rmb() does, nor is it something
> >that smp_wmb() does.  Yes, it might happen to now, but this could easily
> >break in the future -- plus this change is extremely misleading.
> >
> >The original __lwsync() is much more clear.
> 
> The problem I had with __lwsync is that it really wasn't all that
> clear. It was the lwsync instruction if SMP was enabled, otherwise
> it was just a barrier call. What I did is move the definition of
> __lwsync in the SMP case into fast_rmb, which in turn is accessed by
> smp_rmb. I tried to make this clear in the comment just above the
> two calls. The resultant assembly code should be exactly the same.
> 
> What I could do is have it added back as a smp_lwsync if that works
> for you. That way there is something there to give you a hint that
> it becomes a barrier() call as soon as SMP is disabled.

An smp_lwsync() would be a great improvement!

							Thanx, Paul

Benjamin Herrenschmidt Nov. 18, 2014, 12:38 a.m. UTC | #8
On Mon, 2014-11-17 at 12:18 -0800, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
> > There are a number of situations where the mandatory barriers rmb() and
> > wmb() are used to order memory/memory operations in the device drivers
> > and those barriers are much heavier than they actually need to be.  For
> > example in the case of PowerPC wmb() calls the heavy-weight sync
> > instruction when for memory/memory operations all that is really needed is
> > an lsync or eieio instruction.
> 
> Is this still the case if one of the memory operations is MMIO?  Last
> I knew, it was not.

I *think* (Alexander, correct me if I'm wrong) that what he wants is
the memory<->memory barriers (the smp_* ones), basically for ordering his
loads or stores from/to the DMA area.

The problem is that the smp_* ones aren't compiled for !CONFIG_SMP

IE. Something like:

  - Read valid bit from descriptor

  - Read rest of descriptor

That needs an rmb of some sort in between, but a full-blown "rmb" will
also order vs. MMIOs and end up being a full sync, while an smp_rmb is an
lwsync, which is more lightweight.

Similarily:

 - Populate descriptor

 - Write valid bit

Same deal with wmb ...

Basically, rmb and wmb order both cacheable and non-cacheable accesses
(memory and MMIO), which makes them needlessly heavy on powerpc and
possibly others when all you need is to order memory accesses to some DMA
data structures. In that case you really want the normal smp_* variants,
except they may not be around...

Cheers,
Ben.


Benjamin Herrenschmidt Nov. 18, 2014, 12:39 a.m. UTC | #9
On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
> Yes and no.  So for example on ARM I used the dmb() operation, however
> I
> have to use the barrier at the system level instead of just the inner
> shared domain.  However on many other architectures they are just the
> same as the smp_* variants.
> 
> Basically the resultant code is somewhere between the smp and non-smp
> barriers in terms of what they cover.

There I don't quite follow you. You need to explain this better, especially
in the documentation, because otherwise people will get it wrong...

If it's ordering in the coherent domain, I fail to see how a DMA agent
is different from another processor when it comes to barriers, so I fail
to see the difference with smp_*

I understand the MMIO vs. memory issue, we do have the same on powerpc,
but that other aspect eludes me.

Ben.


Benjamin Herrenschmidt Nov. 18, 2014, 12:41 a.m. UTC | #10
On Mon, 2014-11-17 at 12:52 -0800, Linus Torvalds wrote:

> Maybe "dma_*mb()" would be acceptable, and ends up having the same
> naming convention as "smb_*mb()", and explains what it's about.

Yes, that was what I was about to suggest as well.

> And yes, in the same spirit, it would probably be good to try to
> eventually get rid of the plain "*mb()" functions, and perhaps call
> them "mmio_*mb()" to clarify that they are about ordering memory wrt
> mmio.

It will always be somewhat unclear to users who don't read the doc
anyway :)

I.e., the dma_* ones only order DMA vs. DMA (or vs. other processors), but
the mmio_* ones order anything vs. anything. Not a huge deal, though. I
still like dma_* for Alexander's new stuff, but I wouldn't bother with
changing the existing ones.

Ben.


Benjamin Herrenschmidt Nov. 18, 2014, 12:43 a.m. UTC | #11
On Mon, 2014-11-17 at 13:54 -0800, Alexander Duyck wrote:

> What would you think of the name "coherent_*mb()"?  I would prefer to 
> avoid dma in the name since, at least in my mind, that implies MMIO.

I'm lazy, I like typing less, so I like dma_* but I don't object to
coherent_* if at least one more person thinks it makes the semantics
clearer :)

> It also ties in well with dma_alloc_coherent/dma_free_coherent which is 
> what would typically be used to allocate the memory we would be using 
> the barrier to protect anyway.

Agreed.

> > And yes, in the same spirit, it would probably be good to try to
> > eventually get rid of the plain "*mb()" functions, and perhaps call
> > them "mmio_*mb()" to clarify that they are about ordering memory wrt
> > mmio.
> >
> > Hmm?
> >
> >                          Linus
> 
> I will work on pulling all of the coherent barrier cases out of using 
> the plain "*mb()" calls first.  We need to sort that out before we could 
> look at renaming the plain barrier functions.

Makes sense.

Cheers,
Ben.

> - Alex


Alexander Duyck Nov. 18, 2014, 3:13 a.m. UTC | #12
On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
>> Yes and no.  So for example on ARM I used the dmb() operation, however
>> I
>> have to use the barrier at the system level instead of just the inner
>> shared domain.  However on many other architectures they are just the
>> same as the smp_* variants.
>>
>> Basically the resultant code is somewhere between the smp and non-smp
>> barriers in terms of what they cover.
> There I don't quite follow you. You need to explain better especially in
> the documentation because otherwise people will get it wrong...
>
> If it's ordering in the coherent domain, I fail to see how a DMA agent
> is different than another processor when it comes to barriers, so I fail
> to see the difference with smp_*
>
> I understand the MMIO vs. memory issue, we do have the same on powerpc,
> but that other aspect eludes me.
>
> Ben.

ARM adds some funky things.  It has two different types of primitives:
dmb(), which is a data memory barrier, and dsb(), which is a data
synchronization barrier.  Each of those then takes one of the "domains"
within which the barrier is effective.

So, for example, on ARM an rmb() is dsb(sy), a system-wide synchronization
barrier that stops execution on the CPU core until the read completes.
However, smp_rmb() is dmb(ish), which is only a barrier as far as the
inner shareable domain; I believe that only goes as far as the local
shared cache hierarchy and only guarantees read ordering, without
necessarily halting the CPU or stopping in-order speculative reads.  So
what a coherent_rmb() would be in my setup is dmb(sy), which means the
barrier runs all the way out to memory, and it is still allowed to read
speculatively as long as it does so in order.

If it is still unclear you might check out Will Deacon's talk on the 
topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in 
he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
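
Roughly, the ARMv7 mapping being described, as a simplified sketch
(coherent_rmb() stands in for the patch's fast_rmb() under the naming
discussed above; the real definitions are in
arch/arm/include/asm/barrier.h):

#define rmb()		dsb()		/* system wide, waits for the read    */
#define smp_rmb()	dmb(ish)	/* ordering only, inner shareable     */
#define coherent_rmb()	dmb(sy)		/* ordering only, out to memory       */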

- Alex

Alexander Duyck Nov. 18, 2014, 3:33 a.m. UTC | #13
On 11/17/2014 03:17 PM, Paul E. McKenney wrote:
> On Mon, Nov 17, 2014 at 01:11:57PM -0800, Alexander Duyck wrote:
>> On 11/17/2014 12:18 PM, Paul E. McKenney wrote:
>>> On Mon, Nov 17, 2014 at 09:18:13AM -0800, Alexander Duyck wrote:
>>>> There are a number of situations where the mandatory barriers rmb() and
>>>> wmb() are used to order memory/memory operations in the device drivers
>>>> and those barriers are much heavier than they actually need to be.  For
>>>> example in the case of PowerPC wmb() calls the heavy-weight sync
>>>> instruction when for memory/memory operations all that is really needed is
>>>> an lsync or eieio instruction.
>>> Is this still the case if one of the memory operations is MMIO?  Last
>>> I knew, it was not.
>> This barrier is not meant for use in MMIO operations, for that you
>> still need a full barrier as I call out in the documentation
>> section. What the barrier does is allow for a lightweight barrier
>> for accesses to coherent system memory. So for example many device
>> drivers have to perform a read of the descriptor to see if the
>> device is done with it. We need an rmb() following that check to
>> prevent any other accesses.
>>
>> Right now on x86 that rmb() becomes an lfence instruction and is
>> quite expensive, and as it turns out we don't need it since the x86
>> doesn't reorder reads. The same kind of thing applies to PowerPC,
>> only in that case we use a sync when what we really need is a
>> lwsync.
> Would it make sense to have a memory barrier that enforced the
> non-store-buffer orderings, that is prior reads before later
> reads and writes and prior writes before later writes?  This was
> discussed earlier this year ((http://lwn.net/Articles/586838/,
> https://lwn.net/Articles/588300/).  If I recall correctly, one of
> the biggest obstacles was the name.  ;-)

You're talking about acquire and release barriers, or something else?
For most devices the two barriers I have defined should do the job.  I
had tried doing load_acquire/store_release type functions in the
previous patch set, and that was shot down as the preference seemed to
be for barriers instead, to remove some of the abstraction as to what
was occurring.

>>>> This commit adds a fast (and loose) version of the mandatory memory
>>>> barriers rmb() and wmb().  The prefix to the name is actually based on the
>>>> version of the functions that already exist in the mips and tile trees.
>>>> However I thought it applicable since it gets at what we are trying to
>>>> accomplish with these barriers and somethat implies their risky nature.
>>>>
>>>> These new barriers are not as safe as the standard rmb() and wmb().
>>>> Specifically they do not guarantee ordering between cache-enabled and
>>>> cache-inhibited memories.  The primary use case for these would be to
>>>> enforce ordering of memory reads/writes when accessing cache-enabled memory
>>>> that is shared between the CPU and a device.
>>>>
>>>> It may also be noted that there is no fast_mb().  This is due to the fact
>>>> that most architectures didn't seem to have a good way to do a full memory
>>>> barrier quickly and so they usually resorted to an mb() for their smp_mb
>>>> call.  As such there no point in adding a fast_mb() function if it is going
>>>> to map to mb() for all architectures anyway.
>>> I must confess that I still don't entirely understand the motivation.
>> The motivation is to provide finer grained barriers. So this
>> provides an in-between that allows us to "choose the right hammer".
>> In the case of PowerPC it is the difference between sync/lwsync, on
>> ARM it is dsb()/dmb(), and on x86 it is lfence/barrier().
> Ah, so ARM will motivate a fast_wmb(), given its instruction set.

Actually it was x86 that started this: lfence or sfence is much more
expensive than just barrier().  From there I realized we had issues on
PowerPC as well with sync vs. lwsync, and on ARM with dsb() vs. dmb().
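
As a rough per-architecture summary of the difference (a sketch only;
the hunks in the patch below are the authoritative definitions):

/*
 * Mandatory vs. proposed lightweight barriers, roughly:
 *
 *             rmb() / wmb()         fast_rmb() / fast_wmb()
 *   x86:      lfence / sfence       barrier() / barrier()
 *   powerpc:  sync / sync           lwsync / lwsync (or eieio)
 *   arm:      dsb() / dsb(st)       dmb() / dmb(st)
 */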

>> <snip>
>>
>>>> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
>>>> index cb6d66c..f480097 100644
>>>> --- a/arch/powerpc/include/asm/barrier.h
>>>> +++ b/arch/powerpc/include/asm/barrier.h
>>>> @@ -36,22 +36,20 @@
>>>>
>>>>   #define set_mb(var, value)	do { var = value; mb(); } while (0)
>>>>
>>>> -#ifdef CONFIG_SMP
>>>> -
>>>>   #ifdef __SUBARCH_HAS_LWSYNC
>>>>   #    define SMPWMB      LWSYNC
>>>>   #else
>>>>   #    define SMPWMB      eieio
>>>>   #endif
>>>>
>>>> -#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>>>> +#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>>>> +#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>>>
>>>> +#ifdef CONFIG_SMP
>>>>   #define smp_mb()	mb()
>>>> -#define smp_rmb()	__lwsync()
>>>> -#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>>>> +#define smp_rmb()	fast_rmb()
>>>> +#define smp_wmb()	fast_wmb()
>>>>   #else
>>>> -#define __lwsync()	barrier()
>>>> -
>>>>   #define smp_mb()	barrier()
>>>>   #define smp_rmb()	barrier()
>>>>   #define smp_wmb()	barrier()
>>>> @@ -69,10 +67,16 @@
>>>>   #define data_barrier(x)	\
>>>>   	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
>>>>
>>>> +/*
>>>> + * The use of smp_rmb() in these functions are actually meant to map from
>>>> + * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
>>>> + * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
>>>> + * map to the more heavy-weight sync.
>>>> + */
>>>>   #define smp_store_release(p, v)						\
>>>>   do {									\
>>>>   	compiletime_assert_atomic_type(*p);				\
>>>> -	__lwsync();							\
>>>> +	smp_rmb();							\
>>> This is not good at all.  For smp_store_release(), we absolutely
>>> must order prior loads and stores against the assignment on the following
>>> line.  This is not something that smp_rmb() does, nor is it something
>>> that smp_wmb() does.  Yes, it might happen to now, but this could easily
>>> break in the future -- plus this change is extremely misleading.
>>>
>>> The original __lwsync() is much more clear.
>> The problem I had with __lwsync is that it really wasn't all that
>> clear. It was the lwsync instruction if SMP was enabled, otherwise
>> it was just a barrier call. What I did is move the definition of
>> __lwsync in the SMP case into fast_rmb, which in turn is accessed by
>> smp_rmb. I tried to make this clear in the comment just above the
>> two calls. The resultant assembly code should be exactly the same.
>>
>> What I could do is have it added back as a smp_lwsync if that works
>> for you. That way there is something there to give you a hint that
>> it becomes a barrier() call as soon as SMP is disabled.
> An smp_lwsync() would be a great improvement!
>
> 							Thanx, Paul
>

Okay, that will be in the next patch then.
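
Presumably something along these lines (a sketch only, not the actual
follow-up revision):

/*
 * Sketch: keep an explicit smp_lwsync() so that smp_store_release()
 * still documents that it orders prior loads *and* stores, and still
 * collapses to barrier() when CONFIG_SMP is off.
 */
#ifdef CONFIG_SMP
#define smp_lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
#else
#define smp_lwsync()	barrier()
#endif

#define smp_store_release(p, v)						\
do {									\
	compiletime_assert_atomic_type(*p);				\
	smp_lwsync();							\
	ACCESS_ONCE(*p) = (v);						\
} while (0)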

Thanks,

Alex
Will Deacon Nov. 18, 2014, 11:58 a.m. UTC | #14
On Tue, Nov 18, 2014 at 03:13:29AM +0000, Alexander Duyck wrote:
> On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
> >> Yes and no.  So for example on ARM I used the dmb() operation, however
> >> I
> >> have to use the barrier at the system level instead of just the inner
> >> shared domain.  However on many other architectures they are just the
> >> same as the smp_* variants.
> >>
> >> Basically the resultant code is somewhere between the smp and non-smp
> >> barriers in terms of what they cover.
> > There I don't quite follow you. You need to explain better especially in
> > the documentation because otherwise people will get it wrong...
> >
> > If it's ordering in the coherent domain, I fail to see how a DMA agent
> > is different than another processor when it comes to barriers, so I fail
> > to see the difference with smp_*
> >
> > I understand the MMIO vs. memory issue, we do have the same on powerpc,
> > but that other aspect eludes me.
> >
> 
> ARM adds some funky things.  They have two different types of 
> primitives, a dmb() which is a data memory barrier, and a dsb() which is 
> a data synchronization barrier.  Then with each of those they have the 
> "domains" the barriers are effective within.
> 
> So for example on ARM a rmb() is dsb(sy) which means it is a system wide 
> synchronization barrier which stops execution on the CPU core until the 
> read completes.  However the smp_rmb() is a dmb(ish) which means it is 
> only a barrier as far as the inner shareable domain which I believe only 
> goes as far as the local shared cache hierarchy and only guarantees read 
> ordering without necessarily halting the CPU or stopping in-order 
> speculative reads.  So what a coherent_rmb() would be in my setup is 
> dmb(sy) which means the barrier runs all the way out to memory, and it 
> is allowed to speculative read as long as it does it in order.
> 
> If it is still unclear you might check out Will Deacon's talk on the 
> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in 
> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().

So actually, this is an interesting case where the barrier would like to
know whether the memory returned by dma_alloc_coherent is h/w coherent
(normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
is thinking of the h/w coherent case (i.e. actual snooping into the CPU
caches by the DMA master).

For the former, we could use inner-shareable barriers. For the latter, we'd
need to use outer-shareable barriers.

If we can't tell, then these should be dmb(osh), which will work for both.
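
In macro terms the conservative option might look like this sketch
(using the coherent_* naming from the thread rather than the patch's
fast_*):

/*
 * Sketch: outer-shareable barriers cover both the h/w-coherent
 * (cacheable) and the s/w-coherent (non-cacheable) flavours of
 * memory handed back by dma_alloc_coherent().
 */
#define coherent_rmb()	dmb(osh)	/* order prior accesses vs. later ones */
#define coherent_wmb()	dmb(oshst)	/* order prior stores vs. later stores */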

Will
Alexander Duyck Nov. 18, 2014, 4:20 p.m. UTC | #15
On 11/18/2014 03:58 AM, Will Deacon wrote:
> On Tue, Nov 18, 2014 at 03:13:29AM +0000, Alexander Duyck wrote:
>> On 11/17/2014 04:39 PM, Benjamin Herrenschmidt wrote:
>>> On Mon, 2014-11-17 at 12:24 -0800, Alexander Duyck wrote:
>>>> Yes and no.  So for example on ARM I used the dmb() operation, however
>>>> I
>>>> have to use the barrier at the system level instead of just the inner
>>>> shared domain.  However on many other architectures they are just the
>>>> same as the smp_* variants.
>>>>
>>>> Basically the resultant code is somewhere between the smp and non-smp
>>>> barriers in terms of what they cover.
>>> There I don't quite follow you. You need to explain better especially in
>>> the documentation because otherwise people will get it wrong...
>>>
>>> If it's ordering in the coherent domain, I fail to see how a DMA agent
>>> is different than another processor when it comes to barriers, so I fail
>>> to see the difference with smp_*
>>>
>>> I understand the MMIO vs. memory issue, we do have the same on powerpc,
>>> but that other aspect eludes me.
>>>
>> ARM adds some funky things.  They have two different types of
>> primitives, a dmb() which is a data memory barrier, and a dsb() which is
>> a data synchronization barrier.  Then with each of those they have the
>> "domains" the barriers are effective within.
>>
>> So for example on ARM a rmb() is dsb(sy) which means it is a system wide
>> synchronization barrier which stops execution on the CPU core until the
>> read completes.  However the smp_rmb() is a dmb(ish) which means it is
>> only a barrier as far as the inner shareable domain which I believe only
>> goes as far as the local shared cache hierarchy and only guarantees read
>> ordering without necessarily halting the CPU or stopping in-order
>> speculative reads.  So what a coherent_rmb() would be in my setup is
>> dmb(sy) which means the barrier runs all the way out to memory, and it
>> is allowed to speculative read as long as it does it in order.
>>
>> If it is still unclear you might check out Will Deacon's talk on the
>> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in
>> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().
> So actually, this is an interesting case where the barrier would like to
> know whether the memory returned by dma_alloc_coherent is h/w coherent
> (normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
> is thinking of the h/w coherent case (i.e. actual snooping into the CPU
> caches by the DMA master).
>
> For the former, we could use inner-shareable barriers. For the latter, we'd
> need to use outer-shareable barriers.
>
> If we can't tell, then these should be dmb(osh), which will work for both.
>
> Will

Okay, so I will update the ARM portion of my patches to use osh and
oshst then, since it sounds like I was using stronger barriers than
necessary.

- Alex
Will Deacon Nov. 18, 2014, 4:48 p.m. UTC | #16
On Tue, Nov 18, 2014 at 04:20:46PM +0000, Alexander Duyck wrote:
> On 11/18/2014 03:58 AM, Will Deacon wrote:
> > So actually, this is an interesting case where the barrier would like to
> > know whether the memory returned by dma_alloc_coherent is h/w coherent
> > (normal, cacheable) or s/w coherent (normal, non-cacheable). I think Ben
> > is thinking of the h/w coherent case (i.e. actual snooping into the CPU
> > caches by the DMA master).
> >
> > For the former, we could use inner-shareable barriers. For the latter, we'd
> > need to use outer-shareable barriers.
> >
> > If we can't tell, then these should be dmb(osh), which will work for both.
> >
> 
> Okay, so I will update the ARM portion of my patches to use osh and 
> oshst then since it sounds like I was using too strong of barriers.

Sounds good. Another reason this is interesting is that the native
acquire/release instructions on ARMv8 actually take into account the
shareability domain of the virtual address, so using them would give you
the shareability domain you want, but with slightly stronger ordering
guarantees within that domain.

Still, either of them will be a damn sight better than the dsb we currently
have courtesy of the mandatory barriers.
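
As a sketch of what that would look like at the C level (hypothetical
helper name; smp_load_acquire() is the generic kernel primitive, which
an ARMv8 implementation can back with a single LDAR):

/*
 * Hypothetical helper: an acquire load of a descriptor status word.
 * On ARMv8 this can be a single LDAR, whose ordering scope follows
 * the shareability attributes of the mapping behind 'status' rather
 * than a fixed barrier domain.
 */
static inline u32 my_read_status_acquire(const u32 *status)
{
	return smp_load_acquire(status);
}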

Will
Benjamin Herrenschmidt Nov. 18, 2014, 9:07 p.m. UTC | #17
On Mon, 2014-11-17 at 19:13 -0800, Alexander Duyck wrote:
> 
> ARM adds some funky things.  They have two different types of 
> primitives, a dmb() which is a data memory barrier, and a dsb() which is 
> a data synchronization barrier.  Then with each of those they have the 
> "domains" the barriers are effective within.
> 
> So for example on ARM a rmb() is dsb(sy) which means it is a system wide 
> synchronization barrier which stops execution on the CPU core until the 
> read completes.  

That's amazingly heavy-handed ... I can see that being useful for MMIO;
we do something similar in our MMIO accessors by using a special variant
of the trap instruction that never traps, to make the core think the
load value has been consumed. But that's typically only needed to
guarantee MMIO timings.

> However the smp_rmb() is a dmb(ish) which means it is 
> only a barrier as far as the inner shareable domain which I believe only 
> goes as far as the local shared cache hierarchy and only guarantees read 
> ordering without necessarily halting the CPU or stopping in-order 
> speculative reads.  So what a coherent_rmb() would be in my setup is 
> dmb(sy) which means the barrier runs all the way out to memory, and it 
> is allowed to speculative read as long as it does it in order.

Correct, which is thus the same as smp_rmb() ... which was my original
point, or am I missing something else?

> If it is still unclear you might check out Will Deacon's talk on the 
> topic at https://www.youtube.com/watch?v=6ORn6_35kKo, at about 7:00 in 
> he explains the whole domains thing, and at 13:30 he explains dmb()/dsb().

Ok, I'll try to watch that when I get a chance.

Cheers,
Ben.


diff mbox

Patch

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 22a969c..fe55dea 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1615,6 +1615,47 @@  There are some more advanced barrier functions:
      operations" subsection for information on where to use these.
 
 
+ (*) fast_wmb();
+ (*) fast_rmb();
+
+     These are for use with memory based device I/O to guarantee the ordering
+     of cache-enabled writes or reads with respect to other writes or reads
+     to cache-enabled memory.
+
+     For example, consider a device driver that shares memory with a device
+     and uses a descriptor status value to indicate if the descriptor belongs
+     to the device or the CPU, and a doorbell to notify it when new
+     descriptors are available:
+
+	if (desc->status != DEVICE_OWN) {
+		/* do not read data until we own descriptor */
+		fast_rmb();
+
+		/* read/modify data */
+		read_data = desc->data;
+		desc->data = write_data;
+
+		/* flush modifications before status update */
+		fast_wmb();
+
+		/* assign ownership */
+		desc->status = DEVICE_OWN;
+
+		/* force memory to sync before notifying device via MMIO */
+		wmb();
+
+		/* notify device of new descriptors */
+		writel(DESC_NOTIFY, doorbell);
+	}
+
+     The fast_rmb() allows us to guarantee the device has released ownership
+     before we read the data from the descriptor, and the fast_wmb() allows
+     us to guarantee the data is written to the descriptor before the device
+     can see it now has ownership.  The wmb() is needed to guarantee that the
+     cache-enabled memory writes have completed before attempting a write to
+     the cache-inhibited MMIO region.
+
+
 MMIO WRITE BARRIER
 ------------------
 
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index c6a3e73..c57903c 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -43,10 +43,14 @@ 
 #define mb()		do { dsb(); outer_sync(); } while (0)
 #define rmb()		dsb()
 #define wmb()		do { dsb(st); outer_sync(); } while (0)
+#define fast_rmb()	dmb()
+#define fast_wmb()	dmb(st)
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
 #define wmb()		barrier()
+#define fast_rmb()	barrier()
+#define fast_wmb()	barrier()
 #endif
 
 #ifndef CONFIG_SMP
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 6389d60..3ca1a15 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -32,6 +32,9 @@ 
 #define rmb()		dsb(ld)
 #define wmb()		dsb(st)
 
+#define fast_rmb()	dmb(ld)
+#define fast_wmb()	dmb(st)
+
 #ifndef CONFIG_SMP
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index e8fffb0..997f5b0 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -39,6 +39,9 @@ 
 #define rmb()		mb()
 #define wmb()		mb()
 
+#define fast_rmb()	mb()
+#define fast_wmb()	mb()
+
 #ifdef CONFIG_SMP
 # define smp_mb()	mb()
 #else
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index 6d8b8c9..edddb6d 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -4,8 +4,6 @@ 
 #include <asm/metag_mem.h>
 
 #define nop()		asm volatile ("NOP")
-#define mb()		wmb()
-#define rmb()		barrier()
 
 #ifdef CONFIG_METAG_META21
 
@@ -41,11 +39,13 @@  static inline void wr_fence(void)
 
 #endif /* !CONFIG_METAG_META21 */
 
-static inline void wmb(void)
-{
-	/* flush writes through the write combiner */
-	wr_fence();
-}
+/* flush writes through the write combiner */
+#define mb()		wr_fence()
+#define rmb()		barrier()
+#define wmb()		mb()
+
+#define fast_rmb()	rmb()
+#define fast_wmb()	wmb()
 
 #ifndef CONFIG_SMP
 #define fence()		do { } while (0)
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index cb6d66c..f480097 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -36,22 +36,20 @@ 
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
-#ifdef CONFIG_SMP
-
 #ifdef __SUBARCH_HAS_LWSYNC
 #    define SMPWMB      LWSYNC
 #else
 #    define SMPWMB      eieio
 #endif
 
-#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define fast_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define fast_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 
+#ifdef CONFIG_SMP
 #define smp_mb()	mb()
-#define smp_rmb()	__lwsync()
-#define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
+#define smp_rmb()	fast_rmb()
+#define smp_wmb()	fast_wmb()
 #else
-#define __lwsync()	barrier()
-
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -69,10 +67,16 @@ 
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+/*
+ * The use of smp_rmb() in these functions are actually meant to map from
+ * smp_rmb()->fast_rmb()->LWSYNC.  This way if smp is disabled then
+ * smp_rmb()->barrier(), or if the platform doesn't support lwsync it will
+ * map to the more heavy-weight sync.
+ */
 #define smp_store_release(p, v)						\
 do {									\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	ACCESS_ONCE(*p) = (v);						\
 } while (0)
 
@@ -80,7 +84,7 @@  do {									\
 ({									\
 	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
 	compiletime_assert_atomic_type(*p);				\
-	__lwsync();							\
+	smp_rmb();							\
 	___p1;								\
 })
 
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 33d191d..9a31301 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,6 +24,8 @@ 
 
 #define rmb()				mb()
 #define wmb()				mb()
+#define fast_rmb()			rmb()
+#define fast_wmb()			wmb()
 #define smp_mb()			mb()
 #define smp_rmb()			rmb()
 #define smp_wmb()			wmb()
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 6c974c0..eac2777 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -37,6 +37,9 @@  do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 #define rmb()	__asm__ __volatile__("":::"memory")
 #define wmb()	__asm__ __volatile__("":::"memory")
 
+#define fast_rmb()	rmb()
+#define fast_wmb()	wmb()
+
 #define set_mb(__var, __value) \
 	do { __var = __value; membar_safe("#StoreLoad"); } while(0)
 
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 5238000..cf440e7 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,13 +24,16 @@ 
 #define wmb()	asm volatile("sfence" ::: "memory")
 #endif
 
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
 #ifdef CONFIG_X86_PPRO_FENCE
-# define smp_rmb()	rmb()
+#define fast_rmb()	rmb()
 #else
-# define smp_rmb()	barrier()
+#define fast_rmb()	barrier()
 #endif
+#define fast_wmb()	barrier()
+
+#ifdef CONFIG_SMP
+#define smp_mb()	mb()
+#define smp_rmb()	fast_rmb()
 #define smp_wmb()	barrier()
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 #else /* !SMP */
diff --git a/arch/x86/um/asm/barrier.h b/arch/x86/um/asm/barrier.h
index d6511d9..2c0405d 100644
--- a/arch/x86/um/asm/barrier.h
+++ b/arch/x86/um/asm/barrier.h
@@ -29,17 +29,18 @@ 
 
 #endif /* CONFIG_X86_32 */
 
-#ifdef CONFIG_SMP
-
-#define smp_mb()	mb()
 #ifdef CONFIG_X86_PPRO_FENCE
-#define smp_rmb()	rmb()
+#define fast_rmb()	rmb()
 #else /* CONFIG_X86_PPRO_FENCE */
-#define smp_rmb()	barrier()
+#define fast_rmb()	barrier()
 #endif /* CONFIG_X86_PPRO_FENCE */
+#define fast_wmb()	barrier()
 
-#define smp_wmb()	barrier()
+#ifdef CONFIG_SMP
 
+#define smp_mb()	mb()
+#define smp_rmb()	fast_rmb()
+#define smp_wmb()	fast_wmb()
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
 #else /* CONFIG_SMP */
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 1402fa8..c1d60b9 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -42,6 +42,14 @@ 
 #define wmb()	mb()
 #endif
 
+#ifndef fast_rmb
+#define fast_rmb()	rmb()
+#endif
+
+#ifndef fast_wmb
+#define fast_wmb()	wmb()
+#endif
+
 #ifndef read_barrier_depends
 #define read_barrier_depends()		do { } while (0)
 #endif