
[RFC,3/4] barriers: convert a control to a data dependency

Message ID 20190102205715.14054-4-mst@redhat.com
State RFC
Delegated to: David Miller
Series barriers using data dependency

Commit Message

Michael S. Tsirkin Jan. 2, 2019, 8:57 p.m. UTC
It's not uncommon to have to access two unrelated memory locations in a
specific order.  At the moment one has to use a memory barrier for this.

However, if the first access is a read and the second uses an address
that depends on the value read, there is a data dependency between the
two and no barrier is necessary.

This adds a new interface, dependent_ptr_mb(), which does exactly that:
it returns a pointer with a data dependency on the supplied value.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
 arch/alpha/include/asm/barrier.h  |  1 +
 include/asm-generic/barrier.h     | 18 ++++++++++++++++++
 include/linux/compiler.h          |  4 ++++
 4 files changed, 43 insertions(+)
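
For a sense of the intended use, here is a minimal sketch, not part of
the patch, of a ring-buffer consumer; struct ring, struct desc, avail_idx,
last_seen and RING_SIZE are illustrative names.  Instead of an smp_rmb()
between the index load and the descriptor access, the descriptor pointer
carries a data dependency on the loaded index:

	struct desc *ring_next(struct ring *r)
	{
		u16 idx = READ_ONCE(r->avail_idx);	/* first load */
		struct desc *d;

		if (idx == r->last_seen)
			return NULL;

		/* Reads through d are ordered after the avail_idx load,
		 * because the pointer value depends on idx. */
		d = dependent_ptr_mb(&r->desc[r->last_seen % RING_SIZE], idx);
		r->last_seen++;
		return d;
	}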

Comments

Matthew Wilcox Jan. 2, 2019, 9 p.m. UTC | #1
On Wed, Jan 02, 2019 at 03:57:58PM -0500, Michael S. Tsirkin wrote:
> @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
>  for more information.
>  
>  
> +
> +
>  In summary:
>  
>    (*) Control dependencies can order prior loads against later stores.

Was this hunk intentional?
Michael S. Tsirkin Jan. 2, 2019, 9:24 p.m. UTC | #2
On Wed, Jan 02, 2019 at 01:00:24PM -0800, Matthew Wilcox wrote:
> On Wed, Jan 02, 2019 at 03:57:58PM -0500, Michael S. Tsirkin wrote:
> > @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
> >  for more information.
> >  
> >  
> > +
> > +
> >  In summary:
> >  
> >    (*) Control dependencies can order prior loads against later stores.
> 
> Was this hunk intentional?

Nope, thanks for catching this.
Jason Wang Jan. 7, 2019, 3:58 a.m. UTC | #3
On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> It's not uncommon to have to access two unrelated memory locations in a
> specific order.  At the moment one has to use a memory barrier for this.
>
> However, if the first access is a read and the second uses an address
> that depends on the value read, there is a data dependency between the
> two and no barrier is necessary.
>
> This adds a new interface, dependent_ptr_mb(), which does exactly that:
> it returns a pointer with a data dependency on the supplied value.
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>   Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
>   arch/alpha/include/asm/barrier.h  |  1 +
>   include/asm-generic/barrier.h     | 18 ++++++++++++++++++
>   include/linux/compiler.h          |  4 ++++
>   4 files changed, 43 insertions(+)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index c1d913944ad8..9dbaa2e1dbf6 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -691,6 +691,18 @@ case what's actually required is:
>   		p = READ_ONCE(b);
>   	}
>   
> +Alternatively, a control dependency can be converted to a data dependency,
> +e.g.:
> +
> +	q = READ_ONCE(a);
> +	if (q) {
> +		b = dependent_ptr_mb(b, q);
> +		p = READ_ONCE(b);
> +	}
> +
> +Note how the result of dependent_ptr_mb must be used with the following
> +accesses in order to have an effect.
> +
>   However, stores are not speculated.  This means that ordering -is- provided
>   for load-store control dependencies, as in the following example:
>   
> @@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() does force
>   the compiler to actually emit code for a given load, it does not force
>   the compiler to use the results.
>   
> +Converting to a data dependency helps with this too:
> +
> +	q = READ_ONCE(a);
> +	b = dependent_ptr_mb(b, q);
> +	WRITE_ONCE(b, 1);
> +
>   In addition, control dependencies apply only to the then-clause and
>   else-clause of the if-statement in question.  In particular, it does
>   not necessarily apply to code following the if-statement:
> @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
>   for more information.
>   
>   
> +
> +
>   In summary:
>   
>     (*) Control dependencies can order prior loads against later stores.
> diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> index 92ec486a4f9e..b4934e8c551b 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -59,6 +59,7 @@
>    * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
>    * in cases like this where there are no data dependencies.
>    */
> +#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
>   #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
>   
>   #ifdef CONFIG_SMP
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index 2cafdbb9ae4c..fa2e2ef72b68 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -70,6 +70,24 @@
>   #define __smp_read_barrier_depends()	read_barrier_depends()
>   #endif
>   
> +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> +
> +#define dependent_ptr_mb(ptr, val) ({					\
> +	long dependent_ptr_mb_val = (long)(val);			\
> +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> +									\
> +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> +})
> +
> +#else
> +
> +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })


So for the example of patch 4, we'd better fall back to rmb() or need a 
dependent_ptr_rmb()?

Thanks


> +
> +#endif
> +
>   #ifdef CONFIG_SMP
>   
>   #ifndef smp_mb
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 6601d39e8c48..f599c30f1b28 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -152,9 +152,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
>   #endif
>   
>   #ifndef OPTIMIZER_HIDE_VAR
> +
>   /* Make the optimizer believe the variable can be manipulated arbitrarily. */
>   #define OPTIMIZER_HIDE_VAR(var)						\
>   	__asm__ ("" : "=rm" (var) : "0" (var))
> +
> +#define COMPILER_HAS_OPTIMIZER_HIDE_VAR 1
> +
>   #endif
>   
>   /* Not-quite-unique ID. */
Michael S. Tsirkin Jan. 7, 2019, 4:23 a.m. UTC | #4
On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> 
> On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> > It's not uncommon to have to access two unrelated memory locations in a
> > specific order.  At the moment one has to use a memory barrier for this.
> > 
> > However, if the first access is a read and the second uses an address
> > that depends on the value read, there is a data dependency between the
> > two and no barrier is necessary.
> > 
> > This adds a new interface, dependent_ptr_mb(), which does exactly that:
> > it returns a pointer with a data dependency on the supplied value.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> >   Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
> >   arch/alpha/include/asm/barrier.h  |  1 +
> >   include/asm-generic/barrier.h     | 18 ++++++++++++++++++
> >   include/linux/compiler.h          |  4 ++++
> >   4 files changed, 43 insertions(+)
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index c1d913944ad8..9dbaa2e1dbf6 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -691,6 +691,18 @@ case what's actually required is:
> >   		p = READ_ONCE(b);
> >   	}
> > +Alternatively, a control dependency can be converted to a data dependency,
> > +e.g.:
> > +
> > +	q = READ_ONCE(a);
> > +	if (q) {
> > +		b = dependent_ptr_mb(b, q);
> > +		p = READ_ONCE(b);
> > +	}
> > +
> > +Note how the result of dependent_ptr_mb must be used with the following
> > +accesses in order to have an effect.
> > +
> >   However, stores are not speculated.  This means that ordering -is- provided
> >   for load-store control dependencies, as in the following example:
> > @@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() does force
> >   the compiler to actually emit code for a given load, it does not force
> >   the compiler to use the results.
> > +Converting to a data dependency helps with this too:
> > +
> > +	q = READ_ONCE(a);
> > +	b = dependent_ptr_mb(b, q);
> > +	WRITE_ONCE(b, 1);
> > +
> >   In addition, control dependencies apply only to the then-clause and
> >   else-clause of the if-statement in question.  In particular, it does
> >   not necessarily apply to code following the if-statement:
> > @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
> >   for more information.
> > +
> > +
> >   In summary:
> >     (*) Control dependencies can order prior loads against later stores.
> > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> > index 92ec486a4f9e..b4934e8c551b 100644
> > --- a/arch/alpha/include/asm/barrier.h
> > +++ b/arch/alpha/include/asm/barrier.h
> > @@ -59,6 +59,7 @@
> >    * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
> >    * in cases like this where there are no data dependencies.
> >    */
> > +#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
> >   #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
> >   #ifdef CONFIG_SMP
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index 2cafdbb9ae4c..fa2e2ef72b68 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -70,6 +70,24 @@
> >   #define __smp_read_barrier_depends()	read_barrier_depends()
> >   #endif
> > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > +
> > +#define dependent_ptr_mb(ptr, val) ({					\
> > +	long dependent_ptr_mb_val = (long)(val);			\
> > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > +									\
> > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > +})
> > +
> > +#else
> > +
> > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> 
> 
> So for the example of patch 4, we'd better fall back to rmb() or need a
> dependent_ptr_rmb()?
> 
> Thanks

You mean for strongly ordered architectures like Intel?
Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.

mb variant is unused right now so I'll remove it.


> 
> > +
> > +#endif
> > +
> >   #ifdef CONFIG_SMP
> >   #ifndef smp_mb
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 6601d39e8c48..f599c30f1b28 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -152,9 +152,13 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
> >   #endif
> >   #ifndef OPTIMIZER_HIDE_VAR
> > +
> >   /* Make the optimizer believe the variable can be manipulated arbitrarily. */
> >   #define OPTIMIZER_HIDE_VAR(var)						\
> >   	__asm__ ("" : "=rm" (var) : "0" (var))
> > +
> > +#define COMPILER_HAS_OPTIMIZER_HIDE_VAR 1
> > +
> >   #endif
> >   /* Not-quite-unique ID. */
Jason Wang Jan. 7, 2019, 6:50 a.m. UTC | #5
On 2019/1/7 12:23 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
>> On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
>>> It's not uncommon to have to access two unrelated memory locations in a
>>> specific order.  At the moment one has to use a memory barrier for this.
>>>
>>> However, if the first access is a read and the second uses an address
>>> that depends on the value read, there is a data dependency between the
>>> two and no barrier is necessary.
>>>
>>> This adds a new interface, dependent_ptr_mb(), which does exactly that:
>>> it returns a pointer with a data dependency on the supplied value.
>>>
>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>> ---
>>>    Documentation/memory-barriers.txt | 20 ++++++++++++++++++++
>>>    arch/alpha/include/asm/barrier.h  |  1 +
>>>    include/asm-generic/barrier.h     | 18 ++++++++++++++++++
>>>    include/linux/compiler.h          |  4 ++++
>>>    4 files changed, 43 insertions(+)
>>>
>>> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
>>> index c1d913944ad8..9dbaa2e1dbf6 100644
>>> --- a/Documentation/memory-barriers.txt
>>> +++ b/Documentation/memory-barriers.txt
>>> @@ -691,6 +691,18 @@ case what's actually required is:
>>>    		p = READ_ONCE(b);
>>>    	}
>>> +Alternatively, a control dependency can be converted to a data dependency,
>>> +e.g.:
>>> +
>>> +	q = READ_ONCE(a);
>>> +	if (q) {
>>> +		b = dependent_ptr_mb(b, q);
>>> +		p = READ_ONCE(b);
>>> +	}
>>> +
>>> +Note how the result of dependent_ptr_mb must be used with the following
>>> +accesses in order to have an effect.
>>> +
>>>    However, stores are not speculated.  This means that ordering -is- provided
>>>    for load-store control dependencies, as in the following example:
>>> @@ -836,6 +848,12 @@ out-guess your code.  More generally, although READ_ONCE() does force
>>>    the compiler to actually emit code for a given load, it does not force
>>>    the compiler to use the results.
>>> +Converting to a data dependency helps with this too:
>>> +
>>> +	q = READ_ONCE(a);
>>> +	b = dependent_ptr_mb(b, q);
>>> +	WRITE_ONCE(b, 1);
>>> +
>>>    In addition, control dependencies apply only to the then-clause and
>>>    else-clause of the if-statement in question.  In particular, it does
>>>    not necessarily apply to code following the if-statement:
>>> @@ -875,6 +893,8 @@ to the CPU containing it.  See the section on "Multicopy atomicity"
>>>    for more information.
>>> +
>>> +
>>>    In summary:
>>>      (*) Control dependencies can order prior loads against later stores.
>>> diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
>>> index 92ec486a4f9e..b4934e8c551b 100644
>>> --- a/arch/alpha/include/asm/barrier.h
>>> +++ b/arch/alpha/include/asm/barrier.h
>>> @@ -59,6 +59,7 @@
>>>     * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
>>>     * in cases like this where there are no data dependencies.
>>>     */
>>> +#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
>>>    #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
>>>    #ifdef CONFIG_SMP
>>> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
>>> index 2cafdbb9ae4c..fa2e2ef72b68 100644
>>> --- a/include/asm-generic/barrier.h
>>> +++ b/include/asm-generic/barrier.h
>>> @@ -70,6 +70,24 @@
>>>    #define __smp_read_barrier_depends()	read_barrier_depends()
>>>    #endif
>>> +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
>>> +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
>>> +
>>> +#define dependent_ptr_mb(ptr, val) ({					\
>>> +	long dependent_ptr_mb_val = (long)(val);			\
>>> +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
>>> +									\
>>> +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
>>> +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
>>> +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
>>> +})
>>> +
>>> +#else
>>> +
>>> +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
>> So for the example of patch 4, we'd better fall back to rmb() or need a
>> dependent_ptr_rmb()?
>>
>> Thanks
> You mean for strongly ordered architectures like Intel?
> Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
>
> mb variant is unused right now so I'll remove it.
>
>

Yes.

Thanks
Peter Zijlstra Jan. 7, 2019, 9:46 a.m. UTC | #6
On Sun, Jan 06, 2019 at 11:23:07PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> > On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:

> > > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > > +
> > > +#define dependent_ptr_mb(ptr, val) ({					\
> > > +	long dependent_ptr_mb_val = (long)(val);			\
> > > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > > +									\
> > > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > > +})
> > > +
> > > +#else
> > > +
> > > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> > 
> > 
> > So for the example of patch 4, we'd better fall back to rmb() or need a
> > dependent_ptr_rmb()?
> > 
> > Thanks
> 
> You mean for strongly ordered architectures like Intel?
> Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
> 
> mb variant is unused right now so I'll remove it.

How about naming the thing: dependent_ptr()? That is without any (r)mb
implications at all. The address dependency is strictly weaker than an
rmb in that it will only order the two loads in question and not, like
rmb, any prior load against any later load.
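
To make the distinction concrete, a sketch (a, b and c are hypothetical
variables; dependent_ptr() is assumed to behave like the posted
dependent_ptr_mb(), minus the barrier fallback):

	q = READ_ONCE(a);		/* load 1 */
	p = dependent_ptr(b, q);	/* p carries a dependency on q */
	x = READ_ONCE(*p);		/* load 2: ordered after load 1 */
	y = READ_ONCE(c);		/* unrelated load: NOT ordered by the
					   dependency, though an rmb() between
					   the loads would have ordered it */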
Michael S. Tsirkin Jan. 7, 2019, 1:36 p.m. UTC | #7
On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:
> On Sun, Jan 06, 2019 at 11:23:07PM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> > > On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> 
> > > > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > > > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > > > +
> > > > +#define dependent_ptr_mb(ptr, val) ({					\
> > > > +	long dependent_ptr_mb_val = (long)(val);			\
> > > > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > > > +									\
> > > > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > > > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > > > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > > > +})
> > > > +
> > > > +#else
> > > > +
> > > > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> > > 
> > > 
> > > So for the example of patch 4, we'd better fall back to rmb() or need a
> > > dependent_ptr_rmb()?
> > > 
> > > Thanks
> > 
> > You mean for strongly ordered architectures like Intel?
> > Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> > dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
> > 
> > mb variant is unused right now so I'll remove it.
> 
> How about naming the thing: dependent_ptr()? That is without any (r)mb
> implications at all. The address dependency is strictly weaker than an
> rmb in that it will only order the two loads in question and not, like
> rmb, any prior load against any later load.

So I'm fine with this as it's enough for virtio, but I would like to point out two things:

1. E.g. on x86 both SMP and DMA variants can be NOPs but
the mandatory one can't, so assuming we do not want
it to be stronger than rmb, then either we want
smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
or we will just specify that dependent_ptr() works for
both DMA and SMP.

2. Down the road, someone might want to order a store after a load.
Address dependency does that for us too. Assuming we make
dependent_ptr a NOP on x86, we will want an mb variant
which isn't a NOP on x86. Will we want to rename
dependent_ptr to dependent_ptr_rmb at that point?

Thanks,
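
As a sketch of point 2 (again with hypothetical variables, and
dependent_ptr() as proposed above): making the store address depend on
the loaded value orders the store after the load without any barrier.

	q = READ_ONCE(a);		/* load */
	p = dependent_ptr(&b, q);	/* store address depends on q */
	WRITE_ONCE(*p, 1);		/* store ordered after the load of a */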
Peter Zijlstra Jan. 7, 2019, 3:54 p.m. UTC | #8
On Mon, Jan 07, 2019 at 08:36:36AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:

> > How about naming the thing: dependent_ptr()? That is without any (r)mb
> > implications at all. The address dependency is strictly weaker than an
> > rmb in that it will only order the two loads in question and not, like
> > rmb, any prior load against any later load.
> 
> So I'm fine with this as it's enough for virtio, but I would like to point out two things:
> 
> 1. E.g. on x86 both SMP and DMA variants can be NOPs but
> the mandatory one can't, so assuming we do not want
> it to be stronger than rmb, then either we want
> smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
> or we will just specify that dependent_ptr() works for
> both DMA and SMP.

The latter; the construct simply generates dependent loads. It is up to
the CPU as to what all that works for.

> 2. Down the road, someone might want to order a store after a load.
> Address dependency does that for us too. Assuming we make
> dependent_ptr a NOP on x86, we will want an mb variant
> which isn't a NOP on x86. Will we want to rename
> dependent_ptr to dependent_ptr_rmb at that point?

Not sure; what is the actual overhead of the construct on x86 vs the
NOP?
Michael S. Tsirkin Jan. 7, 2019, 4:22 p.m. UTC | #9
On Mon, Jan 07, 2019 at 04:54:23PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 07, 2019 at 08:36:36AM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:
> 
> > > How about naming the thing: dependent_ptr()? That is without any (r)mb
> > > implications at all. The address dependency is strictly weaker than an
> > > rmb in that it will only order the two loads in question and not, like
> > > rmb, any prior load against any later load.
> > 
> > So I'm fine with this as it's enough for virtio, but I would like to point out two things:
> > 
> > 1. E.g. on x86 both SMP and DMA variants can be NOPs but
> > the mandatory one can't, so assuming we do not want
> > it to be stronger than rmb, then either we want
> > smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
> > or we will just specify that dependent_ptr() works for
> > both DMA and SMP.
> 
> The latter; the construct simply generates dependent loads. It is up to
> the CPU as to what all that works for.

But not on Intel, right? On Intel, loads are ordered so it can be a NOP.

> > 2. Down the road, someone might want to order a store after a load.
> > Address dependency does that for us too. Assuming we make
> > dependent_ptr a NOP on x86, we will want an mb variant
> > which isn't a NOP on x86. Will we want to rename
> > dependent_ptr to dependent_ptr_rmb at that point?
> 
> Not sure; what is the actual overhead of the construct on x86 vs the
> NOP?

I'll have to check. There's a pipeline stall almost for sure - that's
why we put it there after all :).
Paul E. McKenney Jan. 7, 2019, 7:02 p.m. UTC | #10
On Mon, Jan 07, 2019 at 08:36:36AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:
> > On Sun, Jan 06, 2019 at 11:23:07PM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> > 
> > > > > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > > > > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > > > > +
> > > > > +#define dependent_ptr_mb(ptr, val) ({					\
> > > > > +	long dependent_ptr_mb_val = (long)(val);			\
> > > > > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > > > > +									\
> > > > > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > > > > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > > > > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > > > > +})
> > > > > +
> > > > > +#else
> > > > > +
> > > > > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> > > > 
> > > > 
> > > > So for the example of patch 4, we'd better fall back to rmb() or need a
> > > > dependent_ptr_rmb()?
> > > > 
> > > > Thanks
> > > 
> > > You mean for strongly ordered architectures like Intel?
> > > Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> > > dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
> > > 
> > > mb variant is unused right now so I'll remove it.
> > 
> > How about naming the thing: dependent_ptr()? That is without any (r)mb
> > implications at all. The address dependency is strictly weaker than an
> > rmb in that it will only order the two loads in question and not, like
> > rmb, any prior load against any later load.
> 
> So I'm fine with this as it's enough for virtio, but I would like to point out two things:
> 
> 1. E.g. on x86 both SMP and DMA variants can be NOPs but
> the mandatory one can't, so assuming we do not want
> it to be stronger than rmb, then either we want
> smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
> or we will just specify that dependent_ptr() works for
> both DMA and SMP.
> 
> 2. Down the road, someone might want to order a store after a load.
> Address dependency does that for us too. Assuming we make
> dependent_ptr a NOP on x86, we will want an mb variant
> which isn't a NOP on x86. Will we want to rename
> dependent_ptr to dependent_ptr_rmb at that point?

But x86 preserves store-after-load orderings anyway, and even Alpha
respects ordering from loads to dependent stores.  So what am I missing
here?

							Thanx, Paul
Michael S. Tsirkin Jan. 7, 2019, 7:13 p.m. UTC | #11
On Mon, Jan 07, 2019 at 11:02:36AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 07, 2019 at 08:36:36AM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:
> > > On Sun, Jan 06, 2019 at 11:23:07PM -0500, Michael S. Tsirkin wrote:
> > > > On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> > > > > On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> > > 
> > > > > > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > > > > > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > > > > > +
> > > > > > +#define dependent_ptr_mb(ptr, val) ({					\
> > > > > > +	long dependent_ptr_mb_val = (long)(val);			\
> > > > > > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > > > > > +									\
> > > > > > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > > > > > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > > > > > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > > > > > +})
> > > > > > +
> > > > > > +#else
> > > > > > +
> > > > > > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> > > > > 
> > > > > 
> > > > > So for the example of patch 4, we'd better fall back to rmb() or need a
> > > > > dependent_ptr_rmb()?
> > > > > 
> > > > > Thanks
> > > > 
> > > > You mean for strongly ordered architectures like Intel?
> > > > Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> > > > dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
> > > > 
> > > > mb variant is unused right now so I'll remove it.
> > > 
> > > How about naming the thing: dependent_ptr()? That is without any (r)mb
> > > implications at all. The address dependency is strictly weaker than an
> > > rmb in that it will only order the two loads in question and not, like
> > > rmb, any prior load against any later load.
> > 
> > So I'm fine with this as it's enough for virtio, but I would like to point out two things:
> > 
> > 1. E.g. on x86 both SMP and DMA variants can be NOPs but
> > the mandatory one can't, so assuming we do not want
> > it to be stronger than rmb, then either we want
> > smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
> > or we will just specify that dependent_ptr() works for
> > both DMA and SMP.
> > 
> > 2. Down the road, someone might want to order a store after a load.
> > Address dependency does that for us too. Assuming we make
> > dependent_ptr a NOP on x86, we will want an mb variant
> > which isn't a NOP on x86. Will we want to rename
> > dependent_ptr to dependent_ptr_rmb at that point?
> 
> But x86 preserves store-after-load orderings anyway, and even Alpha
> respects ordering from loads to dependent stores.  So what am I missing
> here?
> 
> 							Thanx, Paul

Oh you are right. Stores are not reordered with older loads on x86.

So point 2 is moot. Sorry about the noise.

I guess at this point the only sticking point is the ECC compiler.
I'm inclined to stick an mb() there, seeing as it doesn't even
have Spectre protection enabled. Slow but safe.
Paul E. McKenney Jan. 7, 2019, 7:25 p.m. UTC | #12
On Mon, Jan 07, 2019 at 02:13:29PM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 07, 2019 at 11:02:36AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 07, 2019 at 08:36:36AM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 07, 2019 at 10:46:10AM +0100, Peter Zijlstra wrote:
> > > > On Sun, Jan 06, 2019 at 11:23:07PM -0500, Michael S. Tsirkin wrote:
> > > > > On Mon, Jan 07, 2019 at 11:58:23AM +0800, Jason Wang wrote:
> > > > > > On 2019/1/3 4:57 AM, Michael S. Tsirkin wrote:
> > > > 
> > > > > > > +#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
> > > > > > > +	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
> > > > > > > +
> > > > > > > +#define dependent_ptr_mb(ptr, val) ({					\
> > > > > > > +	long dependent_ptr_mb_val = (long)(val);			\
> > > > > > > +	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
> > > > > > > +									\
> > > > > > > +	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
> > > > > > > +	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
> > > > > > > +	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
> > > > > > > +})
> > > > > > > +
> > > > > > > +#else
> > > > > > > +
> > > > > > > +#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
> > > > > > 
> > > > > > 
> > > > > > So for the example of patch 4, we'd better fall back to rmb() or need a
> > > > > > dependent_ptr_rmb()?
> > > > > > 
> > > > > > Thanks
> > > > > 
> > > > > You mean for strongly ordered architectures like Intel?
> > > > > Yes, maybe it makes sense to have dependent_ptr_smp_rmb,
> > > > > dependent_ptr_dma_rmb and dependent_ptr_virt_rmb.
> > > > > 
> > > > > mb variant is unused right now so I'll remove it.
> > > > 
> > > > How about naming the thing: dependent_ptr()? That is without any (r)mb
> > > > implications at all. The address dependency is strictly weaker than an
> > > > rmb in that it will only order the two loads in question and not, like
> > > > rmb, any prior load against any later load.
> > > 
> > > So I'm fine with this as it's enough for virtio, but I would like to point out two things:
> > > 
> > > 1. E.g. on x86 both SMP and DMA variants can be NOPs but
> > > the mandatory one can't, so assuming we do not want
> > > it to be stronger than rmb, then either we want
> > > smp_dependent_ptr(), dma_dependent_ptr(), dependent_ptr(),
> > > or we will just specify that dependent_ptr() works for
> > > both DMA and SMP.
> > > 
> > > 2. Down the road, someone might want to order a store after a load.
> > > Address dependency does that for us too. Assuming we make
> > > dependent_ptr a NOP on x86, we will want an mb variant
> > > which isn't a NOP on x86. Will we want to rename
> > > dependent_ptr to dependent_ptr_rmb at that point?
> > 
> > But x86 preserves store-after-load orderings anyway, and even Alpha
> > respects ordering from loads to dependent stores.  So what am I missing
> > here?
> > 
> > 							Thanx, Paul
> 
> Oh you are right. Stores are not reordered with older loads on x86.
> 
> So point 2 is moot. Sorry about the noise.
> 
> I guess at this point the only sticking point is the ECC compiler.
> I'm inclined to stick an mb() there, seeing as it doesn't even
> have spectre protection enabled. Slow but safe.

Well, there is a mention of DMA above, which on some systems throws in
a wild card.  I would certainly hope that DMA would integrate nicely
with the cache-coherence protocols these days, unlike 25 years ago,
but who knows?

							Thanx, Paul

Patch

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index c1d913944ad8..9dbaa2e1dbf6 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -691,6 +691,18 @@  case what's actually required is:
 		p = READ_ONCE(b);
 	}
 
+Alternatively, a control dependency can be converted to a data dependency,
+e.g.:
+
+	q = READ_ONCE(a);
+	if (q) {
+		b = dependent_ptr_mb(b, q);
+		p = READ_ONCE(b);
+	}
+
+Note how the result of dependent_ptr_mb must be used with the following
+accesses in order to have an effect.
+
 However, stores are not speculated.  This means that ordering -is- provided
 for load-store control dependencies, as in the following example:
 
@@ -836,6 +848,12 @@  out-guess your code.  More generally, although READ_ONCE() does force
 the compiler to actually emit code for a given load, it does not force
 the compiler to use the results.
 
+Converting to a data dependency helps with this too:
+
+	q = READ_ONCE(a);
+	b = dependent_ptr_mb(b, q);
+	WRITE_ONCE(b, 1);
+
 In addition, control dependencies apply only to the then-clause and
 else-clause of the if-statement in question.  In particular, it does
 not necessarily apply to code following the if-statement:
@@ -875,6 +893,8 @@  to the CPU containing it.  See the section on "Multicopy atomicity"
 for more information.
 
 
+
+
 In summary:
 
   (*) Control dependencies can order prior loads against later stores.
diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index 92ec486a4f9e..b4934e8c551b 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -59,6 +59,7 @@ 
  * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
  * in cases like this where there are no data dependencies.
  */
+#define ARCH_NEEDS_READ_BARRIER_DEPENDS 1
 #define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
 
 #ifdef CONFIG_SMP
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index 2cafdbb9ae4c..fa2e2ef72b68 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -70,6 +70,24 @@ 
 #define __smp_read_barrier_depends()	read_barrier_depends()
 #endif
 
+#if defined(COMPILER_HAS_OPTIMIZER_HIDE_VAR) && \
+	!defined(ARCH_NEEDS_READ_BARRIER_DEPENDS)
+
+#define dependent_ptr_mb(ptr, val) ({					\
+	long dependent_ptr_mb_val = (long)(val);			\
+	long dependent_ptr_mb_ptr = (long)(ptr) - dependent_ptr_mb_val;	\
+									\
+	BUILD_BUG_ON(sizeof(val) > sizeof(long));			\
+	OPTIMIZER_HIDE_VAR(dependent_ptr_mb_val);			\
+	(typeof(ptr))(dependent_ptr_mb_ptr + dependent_ptr_mb_val);	\
+})
+
+#else
+
+#define dependent_ptr_mb(ptr, val) ({ mb(); (ptr); })
+
+#endif
+
 #ifdef CONFIG_SMP
 
 #ifndef smp_mb
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 6601d39e8c48..f599c30f1b28 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -152,9 +152,13 @@  void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 #endif
 
 #ifndef OPTIMIZER_HIDE_VAR
+
 /* Make the optimizer believe the variable can be manipulated arbitrarily. */
 #define OPTIMIZER_HIDE_VAR(var)						\
 	__asm__ ("" : "=rm" (var) : "0" (var))
+
+#define COMPILER_HAS_OPTIMIZER_HIDE_VAR 1
+
 #endif
 
 /* Not-quite-unique ID. */
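
For reference, hand-expanding the generic dependent_ptr_mb() at a
hypothetical call site shows how the dependency is manufactured: the
value is subtracted from and added back to the pointer, and
OPTIMIZER_HIDE_VAR() keeps the compiler from cancelling the two
operations, so the final address computation really consumes the loaded
value:

	/* d = dependent_ptr_mb(desc, idx); expands to roughly: */
	long val = (long)idx;
	long tmp = (long)desc - val;
	OPTIMIZER_HIDE_VAR(val);	/* compiler can no longer prove
					 * that tmp + val == (long)desc */
	d = (typeof(desc))(tmp + val);	/* address now depends on idx */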