Linux: Implement membarrier function

Message ID 8736rldyzm.fsf@oldenburg.str.redhat.com
State: New
Series: Linux: Implement membarrier function

Commit Message

Florian Weimer Nov. 28, 2018, 3:05 p.m. UTC
This is essentially a repost of last year's patch, rebased to the glibc
2.29 symbol version and reflecting the introduction of
MEMBARRIER_CMD_GLOBAL.

I'm not including any changes to manual/ here because the set of
supported operations is evolving rapidly, we could not get consensus for
the language I proposed the last time, and I do not want to contribute
to the manual for the time being.

Thanks,
Florian

2018-11-28  Florian Weimer  <fweimer@redhat.com>

	Linux: Implement membarrier function.
	* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Add
	sys/membarrier.h.
	(tests): Add tst-membarrier.
	* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Add membarrier.
	* sysdeps/unix/sysv/linux/sys/membarrier.h: New file.
	* sysdeps/unix/sysv/linux/tst-membarrier.c: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/libc.abilist (GLIBC_2.29): Add
	membarrier.
	* sysdeps/unix/sysv/linux/alpha/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/arm/libc.abilist (GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/hppa/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/i386/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/ia64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/microblaze/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/nios2/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/sh/libc.abilist (GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist (GLIBC_2.29):
	Likewise.
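
For illustration, a minimal usage sketch (not part of the submitted
patch), built only from the membarrier prototype and the
MEMBARRIER_CMD_* constants declared in the proposed <sys/membarrier.h>;
the tst-membarrier.c test below exercises the same sequence with proper
error checking:

#include <stdio.h>
#include <sys/membarrier.h>

int
main (void)
{
  /* Query the supported commands; the result is a bit mask, or -1 with
     errno set to ENOSYS if the kernel lacks the system call.  */
  int cmds = membarrier (MEMBARRIER_CMD_QUERY, 0);
  if (cmds < 0)
    {
      perror ("membarrier");
      return 1;
    }
  /* Issue a system-wide memory barrier if the global command is
     advertised.  */
  if (cmds & MEMBARRIER_CMD_GLOBAL)
    membarrier (MEMBARRIER_CMD_GLOBAL, 0);
  return 0;
}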

Comments

Torvald Riegel Nov. 28, 2018, 10:34 p.m. UTC | #1
On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> This is essentially a repost of last year's patch, rebased to the glibc
> 2.29 symbol version and reflecting the introduction of
> MEMBARRIER_CMD_GLOBAL.
> 
> I'm not including any changes to manual/ here because the set of
> supported operations is evolving rapidly, we could not get consensus for
> the language I proposed the last time, and I do not want to contribute
> to the manual for the time being.

Fair enough.  Nonetheless, can you summarize how far you're along with
properly defining the semantics (eg, based on the C/C++ memory model)?
Florian Weimer Nov. 29, 2018, 1:50 p.m. UTC | #2
* Torvald Riegel:

> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
>> This is essentially a repost of last year's patch, rebased to the glibc
>> 2.29 symbol version and reflecting the introduction of
>> MEMBARRIER_CMD_GLOBAL.
>> 
>> I'm not including any changes to manual/ here because the set of
>> supported operations is evolving rapidly, we could not get consensus for
>> the language I proposed the last time, and I do not want to contribute
>> to the manual for the time being.
>
> Fair enough.  Nonetheless, can you summarize how far you're along with
> properly defining the semantics (eg, based on the C/C++ memory model)?

I wrote down what I could, but no one liked it.

<https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>

I expect that a formalization would interact in non-trivial ways with
any potential formalization of usable relaxed memory order semantics,
and I'm not sure if anyone knows how to do the latter today.

Thanks,
Florian
Mathieu Desnoyers Nov. 29, 2018, 2:44 p.m. UTC | #3
----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:

> * Torvald Riegel:
> 
>> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
>>> This is essentially a repost of last year's patch, rebased to the glibc
>>> 2.29 symbol version and reflecting the introduction of
>>> MEMBARRIER_CMD_GLOBAL.
>>> 
>>> I'm not including any changes to manual/ here because the set of
>>> supported operations is evolving rapidly, we could not get consensus for
>>> the language I proposed the last time, and I do not want to contribute
>>> to the manual for the time being.
>>
>> Fair enough.  Nonetheless, can you summarize how far you're along with
>> properly defining the semantics (eg, based on the C/C++ memory model)?
> 
> I wrote down what I could, but no one liked it.
> 
> <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> 
> I expect that a formalization would interact in non-trivial ways with
> any potential formalization of usable relaxed memory order semantics,
> and I'm not sure if anyone knows how to do the latter today.

Adding Paul E. McKenney in CC.

Thanks,

Mathieu

> 
> Thanks,
> Florian
Paul E. McKenney Nov. 29, 2018, 3:04 p.m. UTC | #4
On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> ----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:
> 
> > * Torvald Riegel:
> > 
> >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> >>> This is essentially a repost of last year's patch, rebased to the glibc
> >>> 2.29 symbol version and reflecting the introduction of
> >>> MEMBARRIER_CMD_GLOBAL.
> >>> 
> >>> I'm not including any changes to manual/ here because the set of
> >>> supported operations is evolving rapidly, we could not get consensus for
> >>> the language I proposed the last time, and I do not want to contribute
> >>> to the manual for the time being.
> >>
> >> Fair enough.  Nonetheless, can you summarize how far you're along with
> >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > 
> > I wrote down what I could, but no one liked it.
> > 
> > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > 
> > I expect that a formalization would interact in non-trivial ways with
> > any potential formalization of usable relaxed memory order semantics,
> > and I'm not sure if anyone knows how to do the latter today.
> 
> Adding Paul E. McKenney in CC.

There is some prototype C++ memory model wording from David Goldblatt (CCed)
here (search for "Standarese"):

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf

David's key insight is that (in Linuxese) light fences cannot pair with
each other.

							Thanx, Paul
David Goldblatt Nov. 29, 2018, 7:02 p.m. UTC | #5
One note with the suggested patch is that
`atomic_thread_fence(memory_order_acq_rel)` should probably be
`atomic_thread_fence (memory_order_seq_cst)` (otherwise the call would
be a no-op on, say, x86, which it very much isn't).

The non-transitivity thing makes the resulting description arguably
incorrect, but this is informal enough that it might not be a big deal
to add something after "For these threads, the membarrier function
call turns an existing compiler barrier (see above) executed by these
threads into full memory barriers" that clarifies it. E.g. you could
make it into "turns an existing compiler barrier [...] into full
memory barriers, with respect to the calling thread".

Since this is targeting the description of the OS call (and doesn't
have to concern itself with also being implementable by other
asymmetric techniques or degrading to architectural barriers), I think
that the description in "approach 2" in P1202 would also make sense
for a formal description of the syscall. (Of course, without the
kernel itself committing to a rigorous semantics, anything specified
on top of it will be on slightly shaky ground).

- David
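
A rough sketch of the asymmetric pairing described above may be useful
here; the helper names and the one-time query are illustrative
assumptions, not taken from the patch or from the proposed manual text.
The rarely executed side issues membarrier, the frequently executed
side needs only a compiler barrier, and when membarrier is unavailable
both sides must fall back to a memory_order_seq_cst fence; a
memory_order_acq_rel fence would degrade to a plain compiler barrier on
x86, which is the problem noted above.

#include <stdatomic.h>
#include <stdbool.h>
#include <sys/membarrier.h>

static bool use_membarrier;

/* Call once at startup.  */
static void
init_fences (void)
{
  int cmds = membarrier (MEMBARRIER_CMD_QUERY, 0);
  use_membarrier = cmds > 0 && (cmds & MEMBARRIER_CMD_GLOBAL) != 0;
}

/* Frequently executed side: a compiler barrier suffices as long as the
   rarely executed side uses membarrier.  */
static void
fast_side_fence (void)
{
  if (use_membarrier)
    atomic_signal_fence (memory_order_seq_cst); /* Compiler barrier only.  */
  else
    atomic_thread_fence (memory_order_seq_cst); /* Must be seq_cst.  */
}

/* Rarely executed side: pays for the system-wide barrier.  */
static void
slow_side_fence (void)
{
  if (use_membarrier)
    membarrier (MEMBARRIER_CMD_GLOBAL, 0);
  else
    atomic_thread_fence (memory_order_seq_cst);
}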

On Thu, Nov 29, 2018 at 7:04 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> > ----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:
> >
> > > * Torvald Riegel:
> > >
> > >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> > >>> This is essentially a repost of last year's patch, rebased to the glibc
> > >>> 2.29 symbol version and reflecting the introduction of
> > >>> MEMBARRIER_CMD_GLOBAL.
> > >>>
> > >>> I'm not including any changes to manual/ here because the set of
> > >>> supported operations is evolving rapidly, we could not get consensus for
> > >>> the language I proposed the last time, and I do not want to contribute
> > >>> to the manual for the time being.
> > >>
> > >> Fair enough.  Nonetheless, can you summarize how far you're along with
> > >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > >
> > > I wrote down what I could, but no one liked it.
> > >
> > > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > >
> > > I expect that a formalization would interact in non-trivial ways with
> > > any potential formalization of usable relaxed memory order semantics,
> > > and I'm not sure if anyone knows how to do the latter today.
> >
> > Adding Paul E. McKenney in CC.
>
> There is some prototype C++ memory model wording from David Goldblatt (CCed)
> here (search for "Standarese"):
>
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
>
> David's key insight is that (in Linuxese) light fences cannot pair with
> each other.
>
>                                                         Thanx, Paul
>
Florian Weimer Dec. 5, 2018, 1:20 p.m. UTC | #6
* Florian Weimer:

> This is essentially a repost of last year's patch, rebased to the glibc
> 2.29 symbol version and reflecting the introduction of
> MEMBARRIER_CMD_GLOBAL.
>
> I'm not including any changes to manual/ here because the set of
> supported operations is evolving rapidly, we could not get consensus for
> the language I proposed the last time, and I do not want to contribute
> to the manual for the time being.

Any comments on the technical aspects of the patch?  I would like to
commit this.

  <https://sourceware.org/ml/libc-alpha/2018-11/msg00750.html>

Thanks,
Florian
Adhemerval Zanella Netto Dec. 5, 2018, 2:12 p.m. UTC | #7
On 28/11/2018 13:05, Florian Weimer wrote:
> This is essentially a repost of last year's patch, rebased to the glibc
> 2.29 symbol version and reflecting the introduction of
> MEMBARRIER_CMD_GLOBAL.
> 
> I'm not including any changes to manual/ here because the set of
> supported operations is evolving rapidly, we could not get consensus for
> the language I proposed the last time, and I do not want to contribute
> to the manual for the time being.

I agree that documentation might be added in a subsequent patch.  The only
issue I have with this patch is how to handle the kernel header; see the
comments below.

> 
> Thanks,
> Florian
> 
> 2018-11-28  Florian Weimer  <fweimer@redhat.com>
> 
> 	Linux: Implement membarrier function.
> 	* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Add
> 	sys/membarrier.h.
> 	(tests): Add tst-membarrier.
> 	* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Add membarrier.
> 	* sysdeps/unix/sysv/linux/sys/membarrier.h: New file.
> 	* sysdeps/unix/sysv/linux/tst-membarrier.c: Likewise.
> 	* sysdeps/unix/sysv/linux/aarch64/libc.abilist (GLIBC_2.29): Add
> 	membarrier.
> 	* sysdeps/unix/sysv/linux/alpha/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/arm/libc.abilist (GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/hppa/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/i386/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/ia64/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/microblaze/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/nios2/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
> 	(GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/sh/libc.abilist (GLIBC_2.29): Likewise.
> 	* sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/64/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 	* sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist (GLIBC_2.29):
> 	Likewise.
> 
> diff --git a/NEWS b/NEWS
> index 1098be1afb..d5786f4eab 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -35,6 +35,9 @@ Major new features:
>    different directory.  This is a GNU extension and similar to the
>    Solaris function of the same name.
>  
> +* The membarrier function and the <sys/membarrier.h> header file have been
> +  added.
> +
>  Deprecated and removed features, and other changes affecting compatibility:
>  
>  * The glibc.tune tunable namespace has been renamed to glibc.cpu and the

Please make it explicit that this is a Linux-only interface.

> diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
> index 362cf3b950..f8c4843ab6 100644
> --- a/sysdeps/unix/sysv/linux/Makefile
> +++ b/sysdeps/unix/sysv/linux/Makefile
> @@ -43,12 +43,13 @@ sysdep_headers += sys/mount.h sys/acct.h sys/sysctl.h \
>  		  bits/siginfo-arch.h bits/siginfo-consts-arch.h \
>  		  bits/procfs.h bits/procfs-id.h bits/procfs-extra.h \
>  		  bits/procfs-prregset.h bits/mman-map-flags-generic.h \
> -		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h
> +		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h \
> +		  sys/membarrier.h
>  
>  tests += tst-clone tst-clone2 tst-clone3 tst-fanotify tst-personality \
>  	 tst-quota tst-sync_file_range tst-sysconf-iov_max tst-ttyname \
>  	 test-errno-linux tst-memfd_create tst-mlock2 tst-pkey \
> -	 tst-rlimit-infinity tst-ofdlocks
> +	 tst-rlimit-infinity tst-ofdlocks tst-membarrier
>  tests-internal += tst-ofdlocks-compat
>  
>  

Ok.

> diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
> index 336c13b57d..86db06f403 100644
> --- a/sysdeps/unix/sysv/linux/Versions
> +++ b/sysdeps/unix/sysv/linux/Versions
> @@ -171,6 +171,9 @@ libc {
>      mlock2;
>      pkey_alloc; pkey_free; pkey_set; pkey_get; pkey_mprotect;
>    }
> +  GLIBC_2.29 {
> +    membarrier;
> +  }
>    GLIBC_PRIVATE {
>      # functions used in other libraries
>      __syscall_rt_sigqueueinfo;

Ok.

> diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> index e66c741d04..c73c731eec 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
> @@ -2138,4 +2138,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/alpha/libc.abilist b/sysdeps/unix/sysv/linux/alpha/libc.abilist
> index 8df162fe99..c9488b3c18 100644
> --- a/sysdeps/unix/sysv/linux/alpha/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/alpha/libc.abilist
> @@ -2033,6 +2033,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/arm/libc.abilist b/sysdeps/unix/sysv/linux/arm/libc.abilist
> index 43c804f9dc..2524b6545b 100644
> --- a/sysdeps/unix/sysv/linux/arm/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/arm/libc.abilist
> @@ -123,6 +123,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.4 _Exit F
>  GLIBC_2.4 _IO_2_1_stderr_ D 0xa0

Ok.

> diff --git a/sysdeps/unix/sysv/linux/hppa/libc.abilist b/sysdeps/unix/sysv/linux/hppa/libc.abilist
> index 88b01c2e75..9baaa34b62 100644
> --- a/sysdeps/unix/sysv/linux/hppa/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/hppa/libc.abilist
> @@ -1880,6 +1880,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/i386/libc.abilist b/sysdeps/unix/sysv/linux/i386/libc.abilist
> index 6d02f31612..1b91873f65 100644
> --- a/sysdeps/unix/sysv/linux/i386/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/i386/libc.abilist
> @@ -2045,6 +2045,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/ia64/libc.abilist b/sysdeps/unix/sysv/linux/ia64/libc.abilist
> index 4249712611..1b3465c6f4 100644
> --- a/sysdeps/unix/sysv/linux/ia64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/ia64/libc.abilist
> @@ -1914,6 +1914,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
> index d47b808862..db22eb4a12 100644
> --- a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
> @@ -124,6 +124,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.4 _Exit F
>  GLIBC_2.4 _IO_2_1_stderr_ D 0x98

Ok.

> diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
> index d5e38308be..5a93b98e76 100644
> --- a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
> @@ -1989,6 +1989,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/microblaze/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
> index 8596b84399..0142b573ca 100644
> --- a/sysdeps/unix/sysv/linux/microblaze/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
> @@ -2130,4 +2130,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
> index 88e0f896d5..a84db6e3f4 100644
> --- a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
> @@ -1967,6 +1967,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
> index aff7462c34..aed9d20fa5 100644
> --- a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
> @@ -1965,6 +1965,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
> index 71d82444aa..d117ad299e 100644
> --- a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
> @@ -1973,6 +1973,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
> index de6c53d293..3598a33eca 100644
> --- a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
> @@ -1968,6 +1968,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/nios2/libc.abilist b/sysdeps/unix/sysv/linux/nios2/libc.abilist
> index e724bab9fb..7ce4aa1841 100644
> --- a/sysdeps/unix/sysv/linux/nios2/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/nios2/libc.abilist
> @@ -2171,4 +2171,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
> index e9ecbccb71..50fedd0fab 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
> @@ -1993,6 +1993,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
> index da83ea6028..86862b6e16 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
> @@ -1997,6 +1997,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
> index 4535b40d15..57c0e67347 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
> +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
> @@ -2228,4 +2228,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
> index 65725de4f0..7658e2c2b9 100644
> --- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
> @@ -123,6 +123,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 _Exit F
>  GLIBC_2.3 _IO_2_1_stderr_ D 0xe0

Ok.

> diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
> index bbb3c4a8e7..f4020e881d 100644
> --- a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
> @@ -2100,4 +2100,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
> index e85ac2a178..9476770e5b 100644
> --- a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
> @@ -2002,6 +2002,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
> index d56931022c..ee9a87659f 100644
> --- a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
> @@ -1908,6 +1908,7 @@ GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
>  GLIBC_2.29 __fentry__ F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/sh/libc.abilist b/sysdeps/unix/sysv/linux/sh/libc.abilist
> index ff939a15c4..c2dd2a5f7e 100644
> --- a/sysdeps/unix/sysv/linux/sh/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/sh/libc.abilist
> @@ -1884,6 +1884,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
> index 64fa9e10a5..55c6496b96 100644
> --- a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
> @@ -1996,6 +1996,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
> index db909d1506..a0e7f2f221 100644
> --- a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
> @@ -1937,6 +1937,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/sys/membarrier.h b/sysdeps/unix/sysv/linux/sys/membarrier.h
> new file mode 100644
> index 0000000000..4c3e6164f7
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/sys/membarrier.h
> @@ -0,0 +1,55 @@
> +/* Memory barriers.
> +   Copyright (C) 2018 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef _SYS_MEMBARRIER_H
> +#define _SYS_MEMBARRIER_H 1
> +
> +#include <features.h>
> +
> +__BEGIN_DECLS
> +
> +/* Perform a memory barrier on multiple threads.  */
> +int membarrier (int __op, int __flags) __THROW;
> +
> +__END_DECLS
> +
> +/* Obtain the definitions of the MEMBARRIER_CMD_* constants.  */
> +
> +#include <linux/version.h>
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0)
> +# include <linux/membarrier.h>
> +#else
> +
> +/* Definitions from Linux 4.16 follow.  */
> +
> +enum membarrier_cmd
> +{
> +  MEMBARRIER_CMD_QUERY = 0,
> +  MEMBARRIER_CMD_SHARED = 1,
> +  MEMBARRIER_CMD_GLOBAL = 1,
> +  MEMBARRIER_CMD_GLOBAL_EXPEDITED = 2,
> +  MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = 4,
> +  MEMBARRIER_CMD_PRIVATE_EXPEDITED = 8,
> +  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 16,
> +  MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = 32,
> +  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = 64,
> +};
> +
> +#endif

If we are replicating the values, meaning the idea is to keep it sync at least
when we have the minimum supported kernel of 4.16, why just not add a comment
to add linux/membarrier.h once the minimum supported kernel provides this
header and not rely on linux/membarrier.h? 

Also, I think the minimum kernel that provides this header is 4.3;
however, using 4.3 as the condition to include the kernel header adds
another issue: glibc will have different semantics depending on the
installed header.  This fallback enum definition is also lacking
MEMBARRIER_CMD_SHARED, and although it is provided by the kernel headers
just for compatibility, it is another interface difference that depends
on the installed kernel header.

Personally I prefer to decouple glibc headers from kernel ones, so I would
go just with define membarrier_cmd (and the flags once its is implemented
by kernel) with the burden of keep them in sync with kernel releases (as
we do for various syscalls and headers).

> +
> +#endif /* _SYS_MEMBARRIER_H */
> diff --git a/sysdeps/unix/sysv/linux/syscalls.list b/sysdeps/unix/sysv/linux/syscalls.list
> index e24ea29e35..3deee2bc19 100644
> --- a/sysdeps/unix/sysv/linux/syscalls.list
> +++ b/sysdeps/unix/sysv/linux/syscalls.list
> @@ -112,3 +112,4 @@ process_vm_writev EXTRA	process_vm_writev i:ipipii process_vm_writev
>  memfd_create    EXTRA	memfd_create	i:si    memfd_create
>  pkey_alloc	EXTRA	pkey_alloc	i:ii	pkey_alloc
>  pkey_free	EXTRA	pkey_free	i:i	pkey_free
> +membarrier	EXTRA	membarrier	i:ii	membarrier

Ok.

> diff --git a/sysdeps/unix/sysv/linux/tst-membarrier.c b/sysdeps/unix/sysv/linux/tst-membarrier.c
> new file mode 100644
> index 0000000000..aeaccad578
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/tst-membarrier.c
> @@ -0,0 +1,61 @@
> +/* Tests for the membarrier function.
> +   Copyright (C) 2017 Free Software Foundation, Inc.

Update copyright year.

> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <errno.h>
> +#include <stdio.h>
> +#include <support/check.h>
> +#include <sys/membarrier.h>
> +
> +static int
> +do_test (void)
> +{
> +  int supported = membarrier (MEMBARRIER_CMD_QUERY, 0);
> +  if (supported == -1)
> +    {
> +      if (errno == ENOSYS)
> +        FAIL_UNSUPPORTED ("membarrier system call not implemented");
> +      else
> +        FAIL_EXIT1 ("membarrier: %m");
> +    }
> +
> +  if ((supported & MEMBARRIER_CMD_GLOBAL) == 0)
> +    FAIL_UNSUPPORTED ("global memory barriers not supported");
> +
> +  puts ("info: membarrier is supported on this system");
> +
> +  /* The global barrier is always implemented.  */
> +  TEST_COMPARE (supported & MEMBARRIER_CMD_GLOBAL, MEMBARRIER_CMD_GLOBAL);
> +  TEST_COMPARE (membarrier (MEMBARRIER_CMD_GLOBAL, 0), 0);
> +
> +  /* If the private-expedited barrier is advertised, execute it after
> +     registering the intent.  */
> +  if (supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED)
> +    {
> +      TEST_COMPARE (supported & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
> +                    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED);
> +      TEST_COMPARE (membarrier (MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0),
> +                    0);
> +      TEST_COMPARE (membarrier (MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0), 0);
> +    }
> +  else
> +    puts ("info: MEMBARRIER_CMD_PRIVATE_EXPEDITED not supported");
> +
> +  return 0;
> +}
> +
> +#include <support/test-driver.c>

Ok.

> diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
> index 3b175f104b..f097eb86e2 100644
> --- a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
> @@ -1895,6 +1895,7 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
>  GLIBC_2.3 __ctype_b_loc F
>  GLIBC_2.3 __ctype_tolower_loc F

Ok.

> diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
> index 1b57710477..afb1a196fa 100644
> --- a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
> +++ b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
> @@ -2146,4 +2146,5 @@ GLIBC_2.28 thrd_current F
>  GLIBC_2.28 thrd_equal F
>  GLIBC_2.28 thrd_sleep F
>  GLIBC_2.28 thrd_yield F
> +GLIBC_2.29 membarrier F
>  GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
> 

Ok.
Florian Weimer Dec. 5, 2018, 2:51 p.m. UTC | #8
* Adhemerval Zanella:

> If we are replicating the values, meaning the idea is to keep it sync at least
> when we have the minimum supported kernel of 4.16, why just not add a comment
> to add linux/membarrier.h once the minimum supported kernel provides this
> header and not rely on linux/membarrier.h?

This way, we can compile the test with any supported kernel headers for
glibc.

If we defer to <linux/membarrier.h> unconditionally, we cannot build the
test with all kernel headers.  The current approach definitely makes the
test case much cleaner.

> Also, I think the minimum kernel that provides this header is 4.3;
> however, using 4.3 as the condition to include the kernel header adds
> another issue: glibc will have different semantics depending on the
> installed header.  This fallback enum definition is also lacking
> MEMBARRIER_CMD_SHARED, and although it is provided by the kernel headers
> just for compatibility, it is another interface difference that depends
> on the installed kernel header.

MEMBARRIER_CMD_SHARED is included.

Thanks,
Florian
Joseph Myers Dec. 5, 2018, 4:28 p.m. UTC | #9
Are you going to open a bug report in Bugzilla for the lack of 
documentation (discussing exactly what the issues are that need to be 
clarified to document the function)?

If in any particular case there is sufficient justification for not 
documenting a new function, I'd expect such a bug report as a minimum to 
say what needs to be done to add documentation.  (The most common case for 
not adding documentation is probably that the function joins a whole 
family of existing functions lacking documentation, in which case one bug 
report for the whole family suffices, like bug 10891 covering all the *_l 
functions missing documentation.)
Adhemerval Zanella Netto Dec. 5, 2018, 4:54 p.m. UTC | #10
On 05/12/2018 12:51, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> If we are replicating the values, meaning the idea is to keep it sync at least
>> when we have the minimum supported kernel of 4.16, why just not add a comment
>> to add linux/membarrier.h once the minimum supported kernel provides this
>> header and not rely on linux/membarrier.h?
> 
> This way, we can compile the test with any supported kernel headers for
> glibc.
> 
> If we defer to <linux/membarrier.h> unconditionally, we cannot build the
> test with all kernel headers.  The current approach definitely makes the
> test case much cleaner.
> 

My point is exactly to *not* rely on kernel headers.

>> Also, I think the minimum kernel that provides this header is 4.3;
>> however, using 4.3 as the condition to include the kernel header adds
>> another issue: glibc will have different semantics depending on the
>> installed header.  This fallback enum definition is also lacking
>> MEMBARRIER_CMD_SHARED, and although it is provided by the kernel headers
>> just for compatibility, it is another interface difference that depends
>> on the installed kernel header.
> 
> MEMBARRIER_CMD_SHARED is included.
> 

Right, my mistake here.
Florian Weimer Dec. 5, 2018, 6:11 p.m. UTC | #11
* Adhemerval Zanella:

> On 05/12/2018 12:51, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> If we are replicating the values, meaning the idea is to keep it sync at least
>>> when we have the minimum supported kernel of 4.16, why just not add a comment
>>> to add linux/membarrier.h once the minimum supported kernel provides this
>>> header and not rely on linux/membarrier.h?
>> 
>> This way, we can compile the test with any supported kernel headers for
>> glibc.
>> 
>> If we defer to <linux/membarrier.h> unconditionally, we cannot build the
>> test with all kernel headers.  The current approach definitely makes the
>> test case much cleaner.
>> 
>
> My point is exactly to *not* rely on kernel headers.

I thought there was a general desire to move in the opposite direction,
avoiding copying declarations and definitions in clean/new headers that
are dedicated to a specific purpose?

I don't think we can add much value by copying the contents of those
headers.  We can even relax the version check if we use __has_include
(which has to remain optional in installed headers, of course).

Thanks,
Florian
Adhemerval Zanella Netto Dec. 5, 2018, 7:15 p.m. UTC | #12
On 05/12/2018 16:11, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 05/12/2018 12:51, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> If we are replicating the values, meaning the idea is to keep it sync at least
>>>> when we have the minimum supported kernel of 4.16, why just not add a comment
>>>> to add linux/membarrier.h once the minimum supported kernel provides this
>>>> header and not rely on linux/membarrier.h?
>>>
>>> This way, we can compile the test with any supported kernel headers for
>>> glibc.
>>>
>>> If we defer to <linux/membarrier.h> unconditionally, we cannot build the
>>> test with all kernel headers.  The current approach definitely makes the
>>> test case much cleaner.
>>>
>>
>> My point is exactly to *not* rely on kernel headers.
> 
> I thought there was a general desire to move in the opposite direction,
> avoiding copying declarations and definitions in clean/new headers that
> are dedicated to a specific purpose?

I don't recall this discussion, do you have the link? In any case I still
think mixing header inclusion with the definition duplication is just
a double effort, it will still require syncing the definitions on each
kernel release.

> 
> I don't think we can add much value by copying the contents of those
> headers.  We can even relax the version check if we use __has_include
> (which has to remain optional in installed headers, of course).

This example shows why trying to rely on kernel headers to provide
glibc-exported interfaces is at least tricky, imho.  The issues I can
think of are:

  - The header would need to be include-safe in all possible releases
    and for all possible permutations, which means that if something is
    broken we need to rely on the kernel to actually fix it.

  - glibc might provide different semantics to users depending on the
    underlying installed kernel header version.  It means that for two
    identical glibc versions, some programs might fail to build
    depending on the installed kernel version.
Florian Weimer Dec. 5, 2018, 7:53 p.m. UTC | #13
* Adhemerval Zanella:

> On 05/12/2018 16:11, Florian Weimer wrote:
>> * Adhemerval Zanella:
>> 
>>> On 05/12/2018 12:51, Florian Weimer wrote:
>>>> * Adhemerval Zanella:
>>>>
>>>>> If we are replicating the values, meaning the idea is to keep it sync at least
>>>>> when we have the minimum supported kernel of 4.16, why just not add a comment
>>>>> to add linux/membarrier.h once the minimum supported kernel provides this
>>>>> header and not rely on linux/membarrier.h?
>>>>
>>>> This way, we can compile the test with any supported kernel headers for
>>>> glibc.
>>>>
>>>> If we defer to <linux/membarrier.h> unconditionally, we cannot build the
>>>> test with all kernel headers.  The current approach definitely makes the
>>>> test case much cleaner.
>>>>
>>>
>>> My point is exactly to *not* rely on kernel headers.
>> 
>> I thought there was a general desire to move in the opposite direction,
>> avoiding copying declarations and definitions in clean/new headers that
>> are dedicated to a specific purpose?
>
> I don't recall this discussion, do you have the link?

I will try to find it.

> In any case I still think mixing header inclusion with the definition
> duplication is just a double effort, it will still require syncing the
> definitions on each kernel release.

I think that's not actually true.  We only need to do this if we need
newer definitions for building glibc itself (including the test suite).

>> I don't think we can add much value by copying the contents of those
>> headers.  We can even relax the version check if we use __has_include
>> (which has to remain optional in installed headers, of course).

> This example shows why trying to rely on kernel headers to provide
> glibc-exported interfaces is at least tricky, imho.  The issues I can
> think of are:
>
>   - The header would need to be include-safe in all possible releases
>     and for all possible permutations, which means that if something is
>     broken we need to rely on the kernel to actually fix it.

I'm not too worried about this for new headers with a dedicated purpose.
Using the UAPI will actually make things work in more cases; see below.

>   - glibc might provide different semantics to users depending on the
>     underlying installed kernel header version.  It means that for two
>     identical glibc versions, some programs might fail to build
>     depending on the installed kernel version.

Well, that already happens with the duplicated header approach if users
want to use UAPI headers, too.  It requires complicated synchronization
to make it work.

We can tweak the conditional for the header inclusion further, like
this:

+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0) \
+  || __glibc_has_include (<linux/membarrier.h>)
+# include <linux/membarrier.h>
+#else

With GCC 5 and later, this would always prefer the kernel header if
available.  There would never be a conflict between the definitions,
irrespective of header file inclusion order.

What do you think?

Thanks,
Florian
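
As a sketch of how the check can remain optional, assuming
__glibc_has_include is a thin wrapper around the compiler's
__has_include (the actual definition in glibc's installed headers, if
any, may differ):

/* Hypothetical wrapper; shown only to illustrate the fallback.  */
#if defined __has_include
# define __glibc_has_include(header)  __has_include (header)
#else
# define __glibc_has_include(header)  0
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0) \
  || __glibc_has_include (<linux/membarrier.h>)
# include <linux/membarrier.h>
#else
/* Copied MEMBARRIER_CMD_* definitions, as in the patch.  */
#endif

With a fallback of 0, compilers that lack __has_include simply keep
using the kernel-version test together with the copied constants.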
Florian Weimer Dec. 5, 2018, 7:59 p.m. UTC | #14
* Joseph Myers:

> Are you going to open a bug report in Bugzilla for the lack of 
> documentation (discussing exactly what the issues are that need to be 
> clarified to document the function)?

I filed a bug for the general issue of lack of documentation for the
memory model:

  <https://sourceware.org/bugzilla/show_bug.cgi?id=23955>

Thanks,
Florian
Adhemerval Zanella Netto Dec. 5, 2018, 8:17 p.m. UTC | #15
On 05/12/2018 17:53, Florian Weimer wrote:
> * Adhemerval Zanella:
> 
>> On 05/12/2018 16:11, Florian Weimer wrote:
>>> * Adhemerval Zanella:
>>>
>>>> On 05/12/2018 12:51, Florian Weimer wrote:
>>>>> * Adhemerval Zanella:
>>>>>
>>>>>> If we are replicating the values, meaning the idea is to keep it sync at least
>>>>>> when we have the minimum supported kernel of 4.16, why just not add a comment
>>>>>> to add linux/membarrier.h once the minimum supported kernel provides this
>>>>>> header and not rely on linux/membarrier.h?
>>>>>
>>>>> This way, we can compile the test with any supported kernel headers for
>>>>> glibc.
>>>>>
>>>>> If we defer to <linux/membarrier.h> unconditionally, we cannot build the
>>>>> test with all kernel headers.  The current approach definitely makes the
>>>>> test case much cleaner.
>>>>>
>>>>
>>>> My point is exactly to *not* rely on kernel headers.
>>>
>>> I thought there was a general desire to move in the opposite direction,
>>> avoiding copying declarations and definitions in clean/new headers that
>>> are dedicated to a specific purpose?
>>
>> I don't recall this discussion, do you have the link?
> 
> I will try to find it.
> 
>> In any case I still think mixing header inclusion with the definition
>> duplication is just a double effort, it will still require syncing the
>> definitions on each kernel release.
> 
> I think that's not actually true.  We only need to do this if we need
> newer definitions for building glibc itself (including the test suite).
> 
>>> I don't think we can add much value by copying the contents of those
>>> headers.  We can even relax the version check if we use __has_include
>>> (which has to remain optional in installed headers, of course).
> 
>> This example shows why trying to rely on kernel headers to provide
>> glibc-exported interfaces is at least tricky, imho.  The issues I can
>> think of are:
>>
>>   - The header would need to be include-safe in all possible releases
>>     and for all possible permutations, which means that if something is
>>     broken we need to rely on the kernel to actually fix it.
> 
> I'm not too worried about this for new headers with a dedicated purpose.
> Using the UAPI will actually make things work in more cases; see below.
> 
>>   - glibc might provide different semantics to users depending on the
>>     underlying installed kernel header version.  It means that for two
>>     identical glibc versions, some programs might fail to build
>>     depending on the installed kernel version.
> 
> Well, that already happens with the duplicated header approach if users
> want to use UAPI headers, too.  It requires complicated synchronization
> to make it work.

But at least glibc is not imposing it: if the user wants to use the
kernel UAPI header, they will include it explicitly, and it is up to the
kernel to provide a sane implementation.

> 
> We can tweak the conditional for the header inclusion further, like
> this:
> 
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0) \
> +  || __glibc_has_include (<linux/membarrier.h>)
> +# include <linux/membarrier.h>
> +#else
> 
> With GCC 5 and later, this would always prefer the kernel header if
> available.  There would never be a conflict between the definitions,
> irrespective of header file inclusion order.
> 
> What do you think?

In any case, I don't think my suggestion should be a blocker; we already
rely on Linux headers in some cases, and it seems I am alone in trying
to decouple glibc from kernel headers.
Paul E. McKenney Dec. 6, 2018, 9:54 p.m. UTC | #16
Hello, David,

I took a crack at extending LKMM to accommodate what I think would
support what you have in your paper.  Please see the very end of this
email for a patch against the "dev" branch of my -rcu tree.

This gives the expected result for the following three litmus tests,
but is probably deficient or otherwise misguided in other ways.  I have
added the LKMM maintainers on CC for their amusement.  ;-)

Thoughts?

						Thanx, Paul

------------------------------------------------------------------------

C C-Goldblat-memb-1
{
}

P0(int *x0, int *x1)
{
	WRITE_ONCE(*x0, 1);
	r1 = READ_ONCE(*x1);
}


P1(int *x0, int *x1)
{
	WRITE_ONCE(*x1, 1);
	smp_memb();
	r2 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r2=0)

------------------------------------------------------------------------

C C-Goldblat-memb-2
{
}

P0(int *x0, int *x1)
{
	WRITE_ONCE(*x0, 1);
	r1 = READ_ONCE(*x1);
}


P1(int *x1, int *x2)
{
	WRITE_ONCE(*x1, 1);
	smp_memb();
	r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x0)
{
	WRITE_ONCE(*x2, 1);
	r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)

------------------------------------------------------------------------

C C-Goldblat-memb-3
{
}

P0(int *x0, int *x1)
{
	WRITE_ONCE(*x0, 1);
	r1 = READ_ONCE(*x1);
}


P1(int *x1, int *x2)
{
	WRITE_ONCE(*x1, 1);
	smp_memb();
	r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x3)
{
	WRITE_ONCE(*x2, 1);
	r1 = READ_ONCE(*x3);
}

P3(int *x3, int *x0)
{
	WRITE_ONCE(*x3, 1);
	smp_memb();
	r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)

------------------------------------------------------------------------
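
Purely as an illustration of how these kernel litmus tests relate to the
user-space interface in this patch, C-Goldblat-memb-1 can be transcribed
as follows: smp_memb() becomes membarrier (MEMBARRIER_CMD_GLOBAL, 0),
READ_ONCE/WRITE_ONCE become relaxed C11 atomics, and a compiler barrier
is added on P0 because relaxed atomics, unlike READ_ONCE/WRITE_ONCE, may
otherwise be reordered by the compiler.  Whether the final state
r1 == 0 && r2 == 0 is forbidden is exactly the question the model
extension is meant to answer.

#include <pthread.h>
#include <stdatomic.h>
#include <sys/membarrier.h>

static atomic_int x0, x1;
static int r1, r2;

static void *
p0 (void *arg)
{
  atomic_store_explicit (&x0, 1, memory_order_relaxed);
  atomic_signal_fence (memory_order_seq_cst); /* Compiler barrier only.  */
  r1 = atomic_load_explicit (&x1, memory_order_relaxed);
  return NULL;
}

static void *
p1 (void *arg)
{
  atomic_store_explicit (&x1, 1, memory_order_relaxed);
  membarrier (MEMBARRIER_CMD_GLOBAL, 0); /* Corresponds to smp_memb ().  */
  r2 = atomic_load_explicit (&x0, memory_order_relaxed);
  return NULL;
}

int
main (void)
{
  pthread_t t0, t1;
  pthread_create (&t0, NULL, p0, NULL);
  pthread_create (&t1, NULL, p1, NULL);
  pthread_join (t0, NULL);
  pthread_join (t1, NULL);
  return 0;
}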

On Thu, Nov 29, 2018 at 11:02:17AM -0800, David Goldblatt wrote:
> One note with the suggested patch is that
> `atomic_thread_fence(memory_order_acq_rel)` should probably be
> `atomic_thread_fence (memory_order_seq_cst)` (otherwise the call would
> be a no-op on, say, x86, which it very much isn't).
> 
> The non-transitivity thing makes the resulting description arguably
> incorrect, but this is informal enough that it might not be a big deal
> to add something after "For these threads, the membarrier function
> call turns an existing compiler barrier (see above) executed by these
> threads into full memory barriers" that clarifies it. E.g. you could
> make it into "turns an existing compiler barrier [...] into full
> memory barriers, with respect to the calling thread".
> 
> Since this is targeting the description of the OS call (and doesn't
> have to concern itself with also being implementable by other
> asymmetric techniques or degrading to architectural barriers), I think
> that the description in "approach 2" in P1202 would also make sense
> for a formal description of the syscall. (Of course, without the
> kernel itself committing to a rigorous semantics, anything specified
> on top of it will be on slightly shaky ground).
> 
> - David
> 
> On Thu, Nov 29, 2018 at 7:04 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >
> > On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> > > ----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:
> > >
> > > > * Torvald Riegel:
> > > >
> > > >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> > > >>> This is essentially a repost of last year's patch, rebased to the glibc
> > > >>> 2.29 symbol version and reflecting the introduction of
> > > >>> MEMBARRIER_CMD_GLOBAL.
> > > >>>
> > > >>> I'm not including any changes to manual/ here because the set of
> > > >>> supported operations is evolving rapidly, we could not get consensus for
> > > >>> the language I proposed the last time, and I do not want to contribute
> > > >>> to the manual for the time being.
> > > >>
> > > >> Fair enough.  Nonetheless, can you summarize how far you're along with
> > > >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > > >
> > > > I wrote down what I could, but no one liked it.
> > > >
> > > > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > > >
> > > > I expect that a formalization would interact in non-trivial ways with
> > > > any potential formalization of usable relaxed memory order semantics,
> > > > and I'm not sure if anyone knows how to do the latter today.
> > >
> > > Adding Paul E. McKenney in CC.
> >
> > There is some prototype C++ memory model wording from David Goldblatt (CCed)
> > here (search for "Standarese"):
> >
> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
> >
> > David's key insight is that (in Linuxese) light fences cannot pair with
> > each other.

------------------------------------------------------------------------

commit 17e3b6b60e57d1cb791f68a1a6a36e942cb2baad
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Thu Dec 6 13:40:40 2018 -0800

    EXP tools/memory-model: Add semantics for sys_membarrier()
    
    This prototype commit extends LKMM to accommodate sys_membarrier(),
    which is a asymmetric barrier with a limited ability to insert full
    ordering into tasks that provide only compiler ordering.  This commit
    currently uses the "po" relation for this purpose, but something more
    sophisticated will be required when plain accesses are added, which
    the compiler can reorder.
    
    For more detail, please see David Goldblatt's C++ working paper:
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
index 9c42cd9ddcb4..4ef41453f569 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'release}]
 enum Barriers = 'wmb (*smp_wmb*) ||
 		'rmb (*smp_rmb*) ||
 		'mb (*smp_mb*) ||
+		'memb (*sys_membarrier*) ||
 		'rcu-lock (*rcu_read_lock*)  ||
 		'rcu-unlock (*rcu_read_unlock*) ||
 		'sync-rcu (*synchronize_rcu*) ||
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index 8dcb37835b61..837c3ee20bea 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -33,9 +33,10 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
 	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
 		fencerel(After-unlock-lock) ; [M])
+let memb = [M] ; fencerel(Memb) ; [M]
 let gp = po ; [Sync-rcu | Sync-srcu] ; po?
 
-let strong-fence = mb | gp
+let strong-fence = mb | gp | memb
 
 (* Release Acquire *)
 let acq-po = [Acquire] ; po ; [M]
@@ -86,6 +87,13 @@ acyclic hb as happens-before
 let pb = prop ; strong-fence ; hb*
 acyclic pb as propagation
 
+(********************)
+(* sys_membarrier() *)
+(********************)
+
+let memb-step = ( prop ; po ; prop )? ; memb
+acyclic memb-step as memb-before
+
 (*******)
 (* RCU *)
 (*******)
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
index 1d6a120cde14..9ff0691c5f2c 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -17,6 +17,7 @@ rcu_dereference(X) __load{once}(X)
 smp_store_mb(X,V) { __store{once}(X,V); __fence{mb}; }
 
 // Fences
+smp_memb() { __fence{memb}; }
 smp_mb() { __fence{mb}; }
 smp_rmb() { __fence{rmb}; }
 smp_wmb() { __fence{wmb}; }
Alan Stern Dec. 10, 2018, 4:22 p.m. UTC | #17
On Thu, 6 Dec 2018, Paul E. McKenney wrote:

> Hello, David,
> 
> I took a crack at extending LKMM to accommodate what I think would
> support what you have in your paper.  Please see the very end of this
> email for a patch against the "dev" branch of my -rcu tree.
> 
> This gives the expected result for the following three litmus tests,
> but is probably deficient or otherwise misguided in other ways.  I have
> added the LKMM maintainers on CC for their amusement.  ;-)
> 
> Thoughts?

Since sys_membarrier() provides a heavyweight barrier comparable to 
synchronize_rcu(), the memory model should treat the two in the same 
way.  That's what this patch does.

The corresponding critical section would be any region of code bounded
by compiler barriers.  Since the LKMM doesn't currently handle plain
accesses, the effect is the same as if a compiler barrier were present
between each pair of instructions.  Basically, each instruction acts as
its own critical section.  Therefore the patch below defines memb-rscsi
as the trivial identity relation.  When plain accesses and compiler 
barriers are added to the memory model, a different definition will be 
needed.

This gives the correct results for the three C-Goldblat-memb-* litmus 
tests in Paul's email.

Alan

PS: The patch below is meant to apply on top of the SRCU patches, which
are not yet in the mainline kernel.



Index: usb-4.x/tools/memory-model/linux-kernel.bell
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.bell
+++ usb-4.x/tools/memory-model/linux-kernel.bell
@@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'releas
 enum Barriers = 'wmb (*smp_wmb*) ||
 		'rmb (*smp_rmb*) ||
 		'mb (*smp_mb*) ||
+		'memb (*sys_membarrier*) ||
 		'rcu-lock (*rcu_read_lock*)  ||
 		'rcu-unlock (*rcu_read_unlock*) ||
 		'sync-rcu (*synchronize_rcu*) ||
Index: usb-4.x/tools/memory-model/linux-kernel.cat
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.cat
+++ usb-4.x/tools/memory-model/linux-kernel.cat
@@ -33,7 +33,7 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
 	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
 		fencerel(After-unlock-lock) ; [M])
-let gp = po ; [Sync-rcu | Sync-srcu] ; po?
+let gp = po ; [Sync-rcu | Sync-srcu | Memb] ; po?
 
 let strong-fence = mb | gp
 
@@ -102,8 +102,10 @@ acyclic pb as propagation
  *)
 let rcu-gp = [Sync-rcu]		(* Compare with gp *)
 let srcu-gp = [Sync-srcu]
+let memb-gp = [Memb]
 let rcu-rscsi = rcu-rscs^-1
 let srcu-rscsi = srcu-rscs^-1
+let memb-rscsi = id
 
 (*
  * The synchronize_rcu() strong fence is special in that it can order not
@@ -119,15 +121,19 @@ let rcu-link = po? ; hb* ; pb* ; prop ;
  * the synchronize_srcu() and srcu_read_[un]lock() calls refer to the same
  * struct srcu_struct location.
  *)
-let rec rcu-fence = rcu-gp | srcu-gp |
+let rec rcu-fence = rcu-gp | srcu-gp | memb-gp |
 	(rcu-gp ; rcu-link ; rcu-rscsi) |
 	((srcu-gp ; rcu-link ; srcu-rscsi) & loc) |
+	(memb-gp ; rcu-link ; memb-rscsi) |
 	(rcu-rscsi ; rcu-link ; rcu-gp) |
 	((srcu-rscsi ; rcu-link ; srcu-gp) & loc) |
+	(memb-rscsi ; rcu-link ; memb-gp) |
 	(rcu-gp ; rcu-link ; rcu-fence ; rcu-link ; rcu-rscsi) |
 	((srcu-gp ; rcu-link ; rcu-fence ; rcu-link ; srcu-rscsi) & loc) |
+	(memb-gp ; rcu-link ; rcu-fence ; rcu-link ; memb-rscsi) |
 	(rcu-rscsi ; rcu-link ; rcu-fence ; rcu-link ; rcu-gp) |
 	((srcu-rscsi ; rcu-link ; rcu-fence ; rcu-link ; srcu-gp) & loc) |
+	(memb-rscsi ; rcu-link ; rcu-fence ; rcu-link ; memb-gp) |
 	(rcu-fence ; rcu-link ; rcu-fence)
 
 (* rb orders instructions just as pb does *)
Index: usb-4.x/tools/memory-model/linux-kernel.def
===================================================================
--- usb-4.x.orig/tools/memory-model/linux-kernel.def
+++ usb-4.x/tools/memory-model/linux-kernel.def
@@ -20,6 +20,7 @@ smp_store_mb(X,V) { __store{once}(X,V);
 smp_mb() { __fence{mb}; }
 smp_rmb() { __fence{rmb}; }
 smp_wmb() { __fence{wmb}; }
+smp_memb() { __fence{memb}; }
 smp_mb__before_atomic() { __fence{before-atomic}; }
 smp_mb__after_atomic() { __fence{after-atomic}; }
 smp_mb__after_spinlock() { __fence{after-spinlock}; }
Paul E. McKenney Dec. 10, 2018, 6:25 p.m. UTC | #18
On Mon, Dec 10, 2018 at 11:22:31AM -0500, Alan Stern wrote:
> On Thu, 6 Dec 2018, Paul E. McKenney wrote:
> 
> > Hello, David,
> > 
> > I took a crack at extending LKMM to accommodate what I think would
> > support what you have in your paper.  Please see the very end of this
> > email for a patch against the "dev" branch of my -rcu tree.
> > 
> > This gives the expected result for the following three litmus tests,
> > but is probably deficient or otherwise misguided in other ways.  I have
> > added the LKMM maintainers on CC for their amusement.  ;-)
> > 
> > Thoughts?
> 
> Since sys_membarrier() provides a heavyweight barrier comparable to 
> synchronize_rcu(), the memory model should treat the two in the same 
> way.  That's what this patch does.
> 
> The corresponding critical section would be any region of code bounded
> by compiler barriers.  Since the LKMM doesn't currently handle plain
> accesses, the effect is the same as if a compiler barrier were present
> between each pair of instructions.  Basically, each instruction acts as
> its own critical section.  Therefore the patch below defines memb-rscsi
> as the trivial identity relation.  When plain accesses and compiler 
> barriers are added to the memory model, a different definition will be 
> needed.
> 
> This gives the correct results for the three C-Goldblat-memb-* litmus 
> tests in Paul's email.

Yow!!!

My first reaction was that this cannot possibly be correct because
sys_membarrier(), which is probably what we should call it, does not
wait for anything.  But your formulation has the corresponding readers
being "id", which as you say above is just a single event.

But what makes this work for the following litmus test?

------------------------------------------------------------------------

C membrcu

{
}

P0(intptr_t *x0, intptr_t *x1)
{
	WRITE_ONCE(*x0, 2);
	smp_memb();
	intptr_t r2 = READ_ONCE(*x1);
}


P1(intptr_t *x1, intptr_t *x2)
{
	WRITE_ONCE(*x1, 2);
	smp_memb();
	intptr_t r2 = READ_ONCE(*x2);
}


P2(intptr_t *x2, intptr_t *x3)
{
	WRITE_ONCE(*x2, 2);
	smp_memb();
	intptr_t r2 = READ_ONCE(*x3);
}


P3(intptr_t *x3, intptr_t *x4)
{
	rcu_read_lock();
	WRITE_ONCE(*x3, 2);
	intptr_t r2 = READ_ONCE(*x4);
	rcu_read_unlock();
}


P4(intptr_t *x4, intptr_t *x5)
{
	rcu_read_lock();
	WRITE_ONCE(*x4, 2);
	intptr_t r2 = READ_ONCE(*x5);
	rcu_read_unlock();
}


P5(intptr_t *x0, intptr_t *x5)
{
	rcu_read_lock();
	WRITE_ONCE(*x5, 2);
	intptr_t r2 = READ_ONCE(*x0);
	rcu_read_unlock();
}

exists
(5:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0 /\ 3:r2=0 /\ 4:r2=0)

------------------------------------------------------------------------

For this, herd gives "Never".  Of course, if I reverse the write and
read in any of P3(), P4(), or P5(), I get "Sometimes", which does make
sense.  But what is preserving the order between P3() and P4() and
between P4() and P5()?  I am not immediately seeing how the analogy
with RCU carries over to this case.

							Thanx, Paul

> Alan
> 
> PS: The patch below is meant to apply on top of the SRCU patches, which
> are not yet in the mainline kernel.
> 
> 
> 
> Index: usb-4.x/tools/memory-model/linux-kernel.bell
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.bell
> +++ usb-4.x/tools/memory-model/linux-kernel.bell
> @@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'releas
>  enum Barriers = 'wmb (*smp_wmb*) ||
>  		'rmb (*smp_rmb*) ||
>  		'mb (*smp_mb*) ||
> +		'memb (*sys_membarrier*) ||
>  		'rcu-lock (*rcu_read_lock*)  ||
>  		'rcu-unlock (*rcu_read_unlock*) ||
>  		'sync-rcu (*synchronize_rcu*) ||
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -33,7 +33,7 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>  	([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>  	([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>  		fencerel(After-unlock-lock) ; [M])
> -let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> +let gp = po ; [Sync-rcu | Sync-srcu | Memb] ; po?
>  
>  let strong-fence = mb | gp
>  
> @@ -102,8 +102,10 @@ acyclic pb as propagation
>   *)
>  let rcu-gp = [Sync-rcu]		(* Compare with gp *)
>  let srcu-gp = [Sync-srcu]
> +let memb-gp = [Memb]
>  let rcu-rscsi = rcu-rscs^-1
>  let srcu-rscsi = srcu-rscs^-1
> +let memb-rscsi = id
>  
>  (*
>   * The synchronize_rcu() strong fence is special in that it can order not
> @@ -119,15 +121,19 @@ let rcu-link = po? ; hb* ; pb* ; prop ;
>   * the synchronize_srcu() and srcu_read_[un]lock() calls refer to the same
>   * struct srcu_struct location.
>   *)
> -let rec rcu-fence = rcu-gp | srcu-gp |
> +let rec rcu-fence = rcu-gp | srcu-gp | memb-gp |
>  	(rcu-gp ; rcu-link ; rcu-rscsi) |
>  	((srcu-gp ; rcu-link ; srcu-rscsi) & loc) |
> +	(memb-gp ; rcu-link ; memb-rscsi) |
>  	(rcu-rscsi ; rcu-link ; rcu-gp) |
>  	((srcu-rscsi ; rcu-link ; srcu-gp) & loc) |
> +	(memb-rscsi ; rcu-link ; memb-gp) |
>  	(rcu-gp ; rcu-link ; rcu-fence ; rcu-link ; rcu-rscsi) |
>  	((srcu-gp ; rcu-link ; rcu-fence ; rcu-link ; srcu-rscsi) & loc) |
> +	(memb-gp ; rcu-link ; rcu-fence ; rcu-link ; memb-rscsi) |
>  	(rcu-rscsi ; rcu-link ; rcu-fence ; rcu-link ; rcu-gp) |
>  	((srcu-rscsi ; rcu-link ; rcu-fence ; rcu-link ; srcu-gp) & loc) |
> +	(memb-rscsi ; rcu-link ; rcu-fence ; rcu-link ; memb-gp) |
>  	(rcu-fence ; rcu-link ; rcu-fence)
>  
>  (* rb orders instructions just as pb does *)
> Index: usb-4.x/tools/memory-model/linux-kernel.def
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.def
> +++ usb-4.x/tools/memory-model/linux-kernel.def
> @@ -20,6 +20,7 @@ smp_store_mb(X,V) { __store{once}(X,V);
>  smp_mb() { __fence{mb}; }
>  smp_rmb() { __fence{rmb}; }
>  smp_wmb() { __fence{wmb}; }
> +smp_memb() { __fence{memb}; }
>  smp_mb__before_atomic() { __fence{before-atomic}; }
>  smp_mb__after_atomic() { __fence{after-atomic}; }
>  smp_mb__after_spinlock() { __fence{after-spinlock}; }
>
David Goldblatt Dec. 11, 2018, 6:42 a.m. UTC | #19
Hi Paul, thank you for thinking about all this.

I think the modelling you suggest captures most of the algorithms I
would want to write. I think it's slightly too weak, though, to
implement the model suggested in P1202R0[1], which permits the SC
outcome to be recovered in C-Goldblat-memb-2[2] by inserting a second
smp_memb() after the first, which is a rather nice property (and I
believe is supported by the underlying implementation options). I
am afraid though that I'm not familiar enough with the Linux herd
definitions to suggest a tweak (or know how easy a tweak might be).

- David

[1] Which I think may be strengthened a little bit more even in R1.
[2] As a nit, my name has two "t"'s in it, although I'd throw into the
ring "memb-pairwise", "memb-nontransitive", and "memb-sequenced" if
these get non-placeholder names.
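
For concreteness, here is one reading of "inserting a second smp_memb()
after the first", applied to C-Goldblat-memb-2 (quoted below).  The test
name and the exact placement of the added barrier are merely
illustrative; under the P1202R0 semantics David describes, the "exists"
clause of this variant would be forbidden, while the prototype LKMM
patch earlier in the thread is, on David's reading, too weak to forbid
it.

C C-Goldblatt-memb-2-double

{
}

P0(int *x0, int *x1)
{
	WRITE_ONCE(*x0, 1);
	r1 = READ_ONCE(*x1);
}

P1(int *x1, int *x2)
{
	WRITE_ONCE(*x1, 1);
	smp_memb();
	smp_memb();	/* hypothetical second barrier per David's remark */
	r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x0)
{
	WRITE_ONCE(*x2, 1);
	r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)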

On Thu, Dec 6, 2018 at 1:54 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> Hello, David,
>
> I took a crack at extending LKMM to accommodate what I think would
> support what you have in your paper.  Please see the very end of this
> email for a patch against the "dev" branch of my -rcu tree.
>
> This gives the expected result for the following three litmus tests,
> but is probably deficient or otherwise misguided in other ways.  I have
> added the LKMM maintainers on CC for their amusement.  ;-)
>
> Thoughts?
>
>                                                 Thanx, Paul
>
> ------------------------------------------------------------------------
>
> C C-Goldblat-memb-1
> {
> }
>
> P0(int *x0, int *x1)
> {
>         WRITE_ONCE(*x0, 1);
>         r1 = READ_ONCE(*x1);
> }
>
>
> P1(int *x0, int *x1)
> {
>         WRITE_ONCE(*x1, 1);
>         smp_memb();
>         r2 = READ_ONCE(*x0);
> }
>
> exists (0:r1=0 /\ 1:r2=0)
>
> ------------------------------------------------------------------------
>
> C C-Goldblat-memb-2
> {
> }
>
> P0(int *x0, int *x1)
> {
>         WRITE_ONCE(*x0, 1);
>         r1 = READ_ONCE(*x1);
> }
>
>
> P1(int *x1, int *x2)
> {
>         WRITE_ONCE(*x1, 1);
>         smp_memb();
>         r1 = READ_ONCE(*x2);
> }
>
> P2(int *x2, int *x0)
> {
>         WRITE_ONCE(*x2, 1);
>         r1 = READ_ONCE(*x0);
> }
>
> exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)
>
> ------------------------------------------------------------------------
>
> C C-Goldblat-memb-3
> {
> }
>
> P0(int *x0, int *x1)
> {
>         WRITE_ONCE(*x0, 1);
>         r1 = READ_ONCE(*x1);
> }
>
>
> P1(int *x1, int *x2)
> {
>         WRITE_ONCE(*x1, 1);
>         smp_memb();
>         r1 = READ_ONCE(*x2);
> }
>
> P2(int *x2, int *x3)
> {
>         WRITE_ONCE(*x2, 1);
>         r1 = READ_ONCE(*x3);
> }
>
> P3(int *x3, int *x0)
> {
>         WRITE_ONCE(*x3, 1);
>         smp_memb();
>         r1 = READ_ONCE(*x0);
> }
>
> exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)
>
> ------------------------------------------------------------------------
>
> On Thu, Nov 29, 2018 at 11:02:17AM -0800, David Goldblatt wrote:
> > One note with the suggested patch is that
> > `atomic_thread_fence(memory_order_acq_rel)` should probably be
> > `atomic_thread_fence (memory_order_seq_cst)` (otherwise the call would
> > be a no-op on, say, x86, which it very much isn't).
> >
> > The non-transitivity thing makes the resulting description arguably
> > incorrect, but this is informal enough that it might not be a big deal
> > to add something after "For these threads, the membarrier function
> > call turns an existing compiler barrier (see above) executed by these
> > threads into full memory barriers" that clarifies it. E.g. you could
> > make it into "turns an existing compiler barrier [...] into full
> > memory barriers, with respect to the calling thread".
> >
> > Since this is targeting the description of the OS call (and doesn't
> > have to concern itself with also being implementable by other
> > asymmetric techniques or degrading to architectural barriers), I think
> > that the description in "approach 2" in P1202 would also make sense
> > for a formal description of the syscall. (Of course, without the
> > kernel itself committing to a rigorous semantics, anything specified
> > on top of it will be on slightly shaky ground).
> >
> > - David
> >
> > On Thu, Nov 29, 2018 at 7:04 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > >
> > > On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> > > > ----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:
> > > >
> > > > > * Torvald Riegel:
> > > > >
> > > > >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> > > > >>> This is essentially a repost of last year's patch, rebased to the glibc
> > > > >>> 2.29 symbol version and reflecting the introduction of
> > > > >>> MEMBARRIER_CMD_GLOBAL.
> > > > >>>
> > > > >>> I'm not including any changes to manual/ here because the set of
> > > > >>> supported operations is evolving rapidly, we could not get consensus for
> > > > >>> the language I proposed the last time, and I do not want to contribute
> > > > >>> to the manual for the time being.
> > > > >>
> > > > >> Fair enough.  Nonetheless, can you summarize how far you're along with
> > > > >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > > > >
> > > > > I wrote down what you could, but no one liked it.
> > > > >
> > > > > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > > > >
> > > > > I expect that a formalization would interact in non-trivial ways with
> > > > > any potential formalization of usable relaxed memory order semantics,
> > > > > and I'm not sure if anyone knows how to do the latter today.
> > > >
> > > > Adding Paul E. McKenney in CC.
> > >
> > > There is some prototype C++ memory model wording from David Goldblatt (CCed)
> > > here (search for "Standarese"):
> > >
> > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
> > >
> > > David's key insight is that (in Linuxese) light fences cannot pair with
> > > each other.
>
> ------------------------------------------------------------------------
>
> commit 17e3b6b60e57d1cb791f68a1a6a36e942cb2baad
> Author: Paul E. McKenney <paulmck@linux.ibm.com>
> Date:   Thu Dec 6 13:40:40 2018 -0800
>
>     EXP tools/memory-model: Add semantics for sys_membarrier()
>
>     This prototype commit extends LKMM to accommodate sys_membarrier(),
>     which is a asymmetric barrier with a limited ability to insert full
>     ordering into tasks that provide only compiler ordering.  This commit
>     currently uses the "po" relation for this purpose, but something more
>     sophisticated will be required when plain accesses are added, which
>     the compiler can reorder.
>
>     For more detail, please see David Goldblatt's C++ working paper:
>     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
>
>     Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
>
> diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
> index 9c42cd9ddcb4..4ef41453f569 100644
> --- a/tools/memory-model/linux-kernel.bell
> +++ b/tools/memory-model/linux-kernel.bell
> @@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'release}]
>  enum Barriers = 'wmb (*smp_wmb*) ||
>                 'rmb (*smp_rmb*) ||
>                 'mb (*smp_mb*) ||
> +               'memb (*sys_membarrier*) ||
>                 'rcu-lock (*rcu_read_lock*)  ||
>                 'rcu-unlock (*rcu_read_unlock*) ||
>                 'sync-rcu (*synchronize_rcu*) ||
> diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> index 8dcb37835b61..837c3ee20bea 100644
> --- a/tools/memory-model/linux-kernel.cat
> +++ b/tools/memory-model/linux-kernel.cat
> @@ -33,9 +33,10 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
>         ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
>         ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
>                 fencerel(After-unlock-lock) ; [M])
> +let memb = [M] ; fencerel(Memb) ; [M]
>  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
>
> -let strong-fence = mb | gp
> +let strong-fence = mb | gp | memb
>
>  (* Release Acquire *)
>  let acq-po = [Acquire] ; po ; [M]
> @@ -86,6 +87,13 @@ acyclic hb as happens-before
>  let pb = prop ; strong-fence ; hb*
>  acyclic pb as propagation
>
> +(********************)
> +(* sys_membarrier() *)
> +(********************)
> +
> +let memb-step = ( prop ; po ; prop )? ; memb
> +acyclic memb-step as memb-before
> +
>  (*******)
>  (* RCU *)
>  (*******)
> diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
> index 1d6a120cde14..9ff0691c5f2c 100644
> --- a/tools/memory-model/linux-kernel.def
> +++ b/tools/memory-model/linux-kernel.def
> @@ -17,6 +17,7 @@ rcu_dereference(X) __load{once}(X)
>  smp_store_mb(X,V) { __store{once}(X,V); __fence{mb}; }
>
>  // Fences
> +smp_memb() { __fence{memb}; }
>  smp_mb() { __fence{mb}; }
>  smp_rmb() { __fence{rmb}; }
>  smp_wmb() { __fence{wmb}; }
>
Paul E. McKenney Dec. 11, 2018, 2:49 p.m. UTC | #20
On Mon, Dec 10, 2018 at 10:42:25PM -0800, David Goldblatt wrote:
> Hi Paul, thank you for thinking about all this.
> 
> I think the modelling you suggest captures most of the algorithms I
> would want to write. I think it's slightly too weak, though, to
> implement the model suggested in P1202R0[1], which permits the SC
> outcome to be recovered in C-Goldblat-memb-2[2] by inserting a second
> smp_memb() after the first, which is a rather nice property (and I
> believe is supported by the underlying implementation options). I
> afraid though that I'm not familiar enough with the Linux herd
> definitions to suggest a tweak (or know how easy a tweak might be).

Actually, there has been an offlist discussion on exactly this.

What is the general rule?  Is it that a given cycle must have at least as
many heavy barriers as it does light ones?  Either way, why?

Gah!  I updated the tests to add the second "t", apologies!!!

							Thanx, Paul

> - David
> 
> [1] Which I think may be strengthened a little bit more even in R1.
> [2] As a nit, my name has two "t"'s in it, although I'd throw into the
> ring "memb-pairwise", "memb-nontransitive", and "memb-sequenced" if
> these get non-placeholder names.
> 
> On Thu, Dec 6, 2018 at 1:54 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> >
> > Hello, David,
> >
> > I took a crack at extending LKMM to accommodate what I think would
> > support what you have in your paper.  Please see the very end of this
> > email for a patch against the "dev" branch of my -rcu tree.
> >
> > This gives the expected result for the following three litmus tests,
> > but is probably deficient or otherwise misguided in other ways.  I have
> > added the LKMM maintainers on CC for their amusement.  ;-)
> >
> > Thoughts?
> >
> >                                                 Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > C C-Goldblat-memb-1
> > {
> > }
> >
> > P0(int *x0, int *x1)
> > {
> >         WRITE_ONCE(*x0, 1);
> >         r1 = READ_ONCE(*x1);
> > }
> >
> >
> > P1(int *x0, int *x1)
> > {
> >         WRITE_ONCE(*x1, 1);
> >         smp_memb();
> >         r2 = READ_ONCE(*x0);
> > }
> >
> > exists (0:r1=0 /\ 1:r2=0)
> >
> > ------------------------------------------------------------------------
> >
> > C C-Goldblat-memb-2
> > {
> > }
> >
> > P0(int *x0, int *x1)
> > {
> >         WRITE_ONCE(*x0, 1);
> >         r1 = READ_ONCE(*x1);
> > }
> >
> >
> > P1(int *x1, int *x2)
> > {
> >         WRITE_ONCE(*x1, 1);
> >         smp_memb();
> >         r1 = READ_ONCE(*x2);
> > }
> >
> > P2(int *x2, int *x0)
> > {
> >         WRITE_ONCE(*x2, 1);
> >         r1 = READ_ONCE(*x0);
> > }
> >
> > exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)
> >
> > ------------------------------------------------------------------------
> >
> > C C-Goldblat-memb-3
> > {
> > }
> >
> > P0(int *x0, int *x1)
> > {
> >         WRITE_ONCE(*x0, 1);
> >         r1 = READ_ONCE(*x1);
> > }
> >
> >
> > P1(int *x1, int *x2)
> > {
> >         WRITE_ONCE(*x1, 1);
> >         smp_memb();
> >         r1 = READ_ONCE(*x2);
> > }
> >
> > P2(int *x2, int *x3)
> > {
> >         WRITE_ONCE(*x2, 1);
> >         r1 = READ_ONCE(*x3);
> > }
> >
> > P3(int *x3, int *x0)
> > {
> >         WRITE_ONCE(*x3, 1);
> >         smp_memb();
> >         r1 = READ_ONCE(*x0);
> > }
> >
> > exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)
> >
> > ------------------------------------------------------------------------
> >
> > On Thu, Nov 29, 2018 at 11:02:17AM -0800, David Goldblatt wrote:
> > > One note with the suggested patch is that
> > > `atomic_thread_fence(memory_order_acq_rel)` should probably be
> > > `atomic_thread_fence (memory_order_seq_cst)` (otherwise the call would
> > > be a no-op on, say, x86, which it very much isn't).
> > >
> > > The non-transitivity thing makes the resulting description arguably
> > > incorrect, but this is informal enough that it might not be a big deal
> > > to add something after "For these threads, the membarrier function
> > > call turns an existing compiler barrier (see above) executed by these
> > > threads into full memory barriers" that clarifies it. E.g. you could
> > > make it into "turns an existing compiler barrier [...] into full
> > > memory barriers, with respect to the calling thread".
> > >
> > > Since this is targeting the description of the OS call (and doesn't
> > > have to concern itself with also being implementable by other
> > > asymmetric techniques or degrading to architectural barriers), I think
> > > that the description in "approach 2" in P1202 would also make sense
> > > for a formal description of the syscall. (Of course, without the
> > > kernel itself committing to a rigorous semantics, anything specified
> > > on top of it will be on slightly shaky ground).
> > >
> > > - David
> > >
> > > On Thu, Nov 29, 2018 at 7:04 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > > >
> > > > On Thu, Nov 29, 2018 at 09:44:22AM -0500, Mathieu Desnoyers wrote:
> > > > > ----- On Nov 29, 2018, at 8:50 AM, Florian Weimer fweimer@redhat.com wrote:
> > > > >
> > > > > > * Torvald Riegel:
> > > > > >
> > > > > >> On Wed, 2018-11-28 at 16:05 +0100, Florian Weimer wrote:
> > > > > >>> This is essentially a repost of last year's patch, rebased to the glibc
> > > > > >>> 2.29 symbol version and reflecting the introduction of
> > > > > >>> MEMBARRIER_CMD_GLOBAL.
> > > > > >>>
> > > > > >>> I'm not including any changes to manual/ here because the set of
> > > > > >>> supported operations is evolving rapidly, we could not get consensus for
> > > > > >>> the language I proposed the last time, and I do not want to contribute
> > > > > >>> to the manual for the time being.
> > > > > >>
> > > > > >> Fair enough.  Nonetheless, can you summarize how far you're along with
> > > > > >> properly defining the semantics (eg, based on the C/C++ memory model)?
> > > > > >
> > > > > > I wrote down what you could, but no one liked it.
> > > > > >
> > > > > > <https://sourceware.org/ml/libc-alpha/2017-12/msg00796.html>
> > > > > >
> > > > > > I expect that a formalization would interact in non-trivial ways with
> > > > > > any potential formalization of usable relaxed memory order semantics,
> > > > > > and I'm not sure if anyone knows how to do the latter today.
> > > > >
> > > > > Adding Paul E. McKenney in CC.
> > > >
> > > > There is some prototype C++ memory model wording from David Goldblatt (CCed)
> > > > here (search for "Standarese"):
> > > >
> > > > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
> > > >
> > > > David's key insight is that (in Linuxese) light fences cannot pair with
> > > > each other.
> >
> > ------------------------------------------------------------------------
> >
> > commit 17e3b6b60e57d1cb791f68a1a6a36e942cb2baad
> > Author: Paul E. McKenney <paulmck@linux.ibm.com>
> > Date:   Thu Dec 6 13:40:40 2018 -0800
> >
> >     EXP tools/memory-model: Add semantics for sys_membarrier()
> >
> >     This prototype commit extends LKMM to accommodate sys_membarrier(),
> >     which is a asymmetric barrier with a limited ability to insert full
> >     ordering into tasks that provide only compiler ordering.  This commit
> >     currently uses the "po" relation for this purpose, but something more
> >     sophisticated will be required when plain accesses are added, which
> >     the compiler can reorder.
> >
> >     For more detail, please see David Goldblatt's C++ working paper:
> >     http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1202r0.pdf
> >
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
> >
> > diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
> > index 9c42cd9ddcb4..4ef41453f569 100644
> > --- a/tools/memory-model/linux-kernel.bell
> > +++ b/tools/memory-model/linux-kernel.bell
> > @@ -24,6 +24,7 @@ instructions RMW[{'once,'acquire,'release}]
> >  enum Barriers = 'wmb (*smp_wmb*) ||
> >                 'rmb (*smp_rmb*) ||
> >                 'mb (*smp_mb*) ||
> > +               'memb (*sys_membarrier*) ||
> >                 'rcu-lock (*rcu_read_lock*)  ||
> >                 'rcu-unlock (*rcu_read_unlock*) ||
> >                 'sync-rcu (*synchronize_rcu*) ||
> > diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
> > index 8dcb37835b61..837c3ee20bea 100644
> > --- a/tools/memory-model/linux-kernel.cat
> > +++ b/tools/memory-model/linux-kernel.cat
> > @@ -33,9 +33,10 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
> >         ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) |
> >         ([M] ; po ; [UL] ; (co | po) ; [LKW] ;
> >                 fencerel(After-unlock-lock) ; [M])
> > +let memb = [M] ; fencerel(Memb) ; [M]
> >  let gp = po ; [Sync-rcu | Sync-srcu] ; po?
> >
> > -let strong-fence = mb | gp
> > +let strong-fence = mb | gp | memb
> >
> >  (* Release Acquire *)
> >  let acq-po = [Acquire] ; po ; [M]
> > @@ -86,6 +87,13 @@ acyclic hb as happens-before
> >  let pb = prop ; strong-fence ; hb*
> >  acyclic pb as propagation
> >
> > +(********************)
> > +(* sys_membarrier() *)
> > +(********************)
> > +
> > +let memb-step = ( prop ; po ; prop )? ; memb
> > +acyclic memb-step as memb-before
> > +
> >  (*******)
> >  (* RCU *)
> >  (*******)
> > diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
> > index 1d6a120cde14..9ff0691c5f2c 100644
> > --- a/tools/memory-model/linux-kernel.def
> > +++ b/tools/memory-model/linux-kernel.def
> > @@ -17,6 +17,7 @@ rcu_dereference(X) __load{once}(X)
> >  smp_store_mb(X,V) { __store{once}(X,V); __fence{mb}; }
> >
> >  // Fences
> > +smp_memb() { __fence{memb}; }
> >  smp_mb() { __fence{mb}; }
> >  smp_rmb() { __fence{rmb}; }
> >  smp_wmb() { __fence{wmb}; }
> >
>
Alan Stern Dec. 11, 2018, 4:21 p.m. UTC | #21
On Mon, 10 Dec 2018, Paul E. McKenney wrote:

> On Mon, Dec 10, 2018 at 11:22:31AM -0500, Alan Stern wrote:
> > On Thu, 6 Dec 2018, Paul E. McKenney wrote:
> > 
> > > Hello, David,
> > > 
> > > I took a crack at extending LKMM to accommodate what I think would
> > > support what you have in your paper.  Please see the very end of this
> > > email for a patch against the "dev" branch of my -rcu tree.
> > > 
> > > This gives the expected result for the following three litmus tests,
> > > but is probably deficient or otherwise misguided in other ways.  I have
> > > added the LKMM maintainers on CC for their amusement.  ;-)
> > > 
> > > Thoughts?
> > 
> > Since sys_membarrier() provides a heavyweight barrier comparable to 
> > synchronize_rcu(), the memory model should treat the two in the same 
> > way.  That's what this patch does.
> > 
> > The corresponding critical section would be any region of code bounded
> > by compiler barriers.  Since the LKMM doesn't currently handle plain
> > accesses, the effect is the same as if a compiler barrier were present
> > between each pair of instructions.  Basically, each instruction acts as
> > its own critical section.  Therefore the patch below defines memb-rscsi
> > as the trivial identity relation.  When plain accesses and compiler 
> > barriers are added to the memory model, a different definition will be 
> > needed.
> > 
> > This gives the correct results for the three C-Goldblat-memb-* litmus 
> > tests in Paul's email.
> 
> Yow!!!
> 
> My first reaction was that this cannot possibly be correct because
> sys_membarrier(), which is probably what we should call it, does not
> wait for anything.  But your formulation has the corresponding readers
> being "id", which as you say above is just a single event.
> 
> But what makes this work for the following litmus test?
> 
> ------------------------------------------------------------------------
> 
> C membrcu
> 
> {
> }
> 
> P0(intptr_t *x0, intptr_t *x1)
> {
> 	WRITE_ONCE(*x0, 2);
> 	smp_memb();
> 	intptr_t r2 = READ_ONCE(*x1);
> }
> 
> 
> P1(intptr_t *x1, intptr_t *x2)
> {
> 	WRITE_ONCE(*x1, 2);
> 	smp_memb();
> 	intptr_t r2 = READ_ONCE(*x2);
> }
> 
> 
> P2(intptr_t *x2, intptr_t *x3)
> {
> 	WRITE_ONCE(*x2, 2);
> 	smp_memb();
> 	intptr_t r2 = READ_ONCE(*x3);
> }
> 
> 
> P3(intptr_t *x3, intptr_t *x4)
> {
> 	rcu_read_lock();
> 	WRITE_ONCE(*x3, 2);
> 	intptr_t r2 = READ_ONCE(*x4);
> 	rcu_read_unlock();
> }
> 
> 
> P4(intptr_t *x4, intptr_t *x5)
> {
> 	rcu_read_lock();
> 	WRITE_ONCE(*x4, 2);
> 	intptr_t r2 = READ_ONCE(*x5);
> 	rcu_read_unlock();
> }
> 
> 
> P5(intptr_t *x0, intptr_t *x5)
> {
> 	rcu_read_lock();
> 	WRITE_ONCE(*x5, 2);
> 	intptr_t r2 = READ_ONCE(*x0);
> 	rcu_read_unlock();
> }
> 
> exists
> (5:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0 /\ 3:r2=0 /\ 4:r2=0)
> 
> ------------------------------------------------------------------------
> 
> For this, herd gives "Never".  Of course, if I reverse the write and
> read in any of P3(), P4(), or P5(), I get "Sometimes", which does make
> sense.  But what is preserving the order between P3() and P4() and
> between P4() and P5()?  I am not immediately seeing how the analogy
> with RCU carries over to this case.

That isn't how it works.  Nothing preserves the orders you mentioned.
It's more like: the order between P1 and P4 is preserved, as is the
order between P0 and P5.  You'll see below...

(I readily agree that this result is not simple or obvious.  It took me
quite a while to formulate the following analysis.)

To begin with, since there aren't any synchronize_rcu calls in the
test, the rcu_read_lock and rcu_read_unlock calls do nothing.  They
can be eliminated.

Also, I find the variable names "x0" - "x5" to be a little hard to
work with.  If you don't mind, I'll replace them with "a" - "f".

Now, a little digression on how sys_membarrier works.  It starts by
executing a full memory barrier.  Then it injects memory barriers into
the instruction streams of all the other CPUs and waits for them all
to complete.  Then it executes an ending memory barrier.

These barriers are ordered as described.  Therefore we have

	mb0s < mb05 < mb0e,
	mb1s < mb14 < mb1e,  and
	mb2s < mb23 < mb2e,

where mb0s is the starting barrier of the sys_memb call on P0, mb05 is
the barrier that it injects into P5, mb0e is the ending barrier of the
call, and similarly for the other sys_memb calls.  The '<' signs mean
that the thing on their left finishes before the thing on their right
does.
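
For readers who have not looked at the code, here is a minimal C-style
sketch of the structure just described.  This is not the kernel's actual
membarrier implementation (which differs in many details); the helper
names are placeholders, and smp_call_function() stands in for whatever
mechanism the real code uses to run a handler on the other CPUs and wait
for it to complete.

/* Sketch only; labels follow the naming scheme above for P0's call. */
static void membarrier_ipi(void *info)
{
	smp_mb();		/* e.g. mb05: barrier injected into P5 */
}

static int sys_membarrier_sketch(void)
{
	smp_mb();		/* mb0s: starting barrier */

	/*
	 * Run membarrier_ipi() on every other CPU and wait until all
	 * of the handlers have finished executing.
	 */
	smp_call_function(membarrier_ipi, NULL, 1);

	smp_mb();		/* mb0e: ending barrier */
	return 0;
}

Because the call waits for every handler to finish, mb0s is ordered
before each injected barrier, and each injected barrier is ordered
before mb0e, which is exactly the "mb0s < mb05 < mb0e" relation above.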

Rewriting the litmus test in these terms gives:

        P0      P1      P2      P3      P4      P5
        Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
        mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
        mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
        Rb=0    Rc=0    Rd=0

Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
positions of these barriers in their respective threads' program
orderings are undetermined; they need not come at the top as shown.

(Also, in case David is unfamiliar with it, the "Wa=2" notation is
shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)

Finally, here are a few facts which may be well known and obvious, but
I'll state them anyway:

	A CPU cannot reorder instructions across a memory barrier.
	If x is po-after a barrier then x executes after the barrier
	is finished.

	If a store is po-before a barrier then the store propagates
	to every CPU before the barrier finishes.

	If a store propagates to some CPU before a load on that CPU
	reads from the same location, then the load will obtain the
	value from that store or a co-later store.  This implies that
	if a load obtains a value co-earlier than some store then the
	load must have executed before the store propagated to the
	load's CPU.

The proof consists of three main stages, each requiring three steps.
Using the facts that b - f are all read as 0, I'll show that P1
executes Rc before P3 executes Re, then that P0 executes Rb before P4
executes Rf, and lastly that P5's Ra must obtain 2, not 0.  This will
demonstrate that the litmus test is not allowed.

1.	Suppose that mb23 ends up coming po-later than Wd in P3.
	Then we would have:

		Wd propagates to P2 < mb23 < mb2e < Rd,

	and so Rd would obtain 2, not 0.  Hence mb23 must come
	po-before Wd (as shown in the listing):  mb23 < Wd.

2.	Since mb23 therefore occurs po-before Re and instructions
	cannot be reordered across barriers,  mb23 < Re.

3.	Since Rc obtains 0, we must have:

		Rc < Wc propagates to P1 < mb2s < mb23 < Re.

	Thus Rc < Re.

4.	Suppose that mb14 ends up coming po-later than We in P4.
	Then we would have:

		We propagates to P3 < mb14 < mb1e < Rc < Re,

	and so Re would obtain 2, not 0.  Hence mb14 must come
	po-before We (as shown in the listing):  mb14 < We.

5.	Since mb14 therefore occurs po-before Rf and instructions
	cannot be reordered across barriers,  mb14 < Rf.

6.	Since Rb obtains 0, we must have:

		Rb < Wb propagates to P0 < mb1s < mb14 < Rf.

	Thus Rb < Rf.

7.	Suppose that mb05 ends up coming po-later than Wf in P5.
	Then we would have:

		Wf propagates to P4 < mb05 < mb0e < Rb < Rf,

	and so Rf would obtain 2, not 0.  Hence mb05 must come
	po-before Wf (as shown in the listing):  mb05 < Wf.

8.	Since mb05 therefore occurs po-before Ra and instructions
	cannot be reordered across barriers,  mb05 < Ra.

9.	Now we have:

		Wa propagates to P5 < mb0s < mb05 < Ra,

	and so Ra must obtain 2, not 0.  QED.

Alan
Paul E. McKenney Dec. 11, 2018, 7:08 p.m. UTC | #22
On Tue, Dec 11, 2018 at 11:21:15AM -0500, Alan Stern wrote:
> On Mon, 10 Dec 2018, Paul E. McKenney wrote:
> 
> > On Mon, Dec 10, 2018 at 11:22:31AM -0500, Alan Stern wrote:
> > > On Thu, 6 Dec 2018, Paul E. McKenney wrote:
> > > 
> > > > Hello, David,
> > > > 
> > > > I took a crack at extending LKMM to accommodate what I think would
> > > > support what you have in your paper.  Please see the very end of this
> > > > email for a patch against the "dev" branch of my -rcu tree.
> > > > 
> > > > This gives the expected result for the following three litmus tests,
> > > > but is probably deficient or otherwise misguided in other ways.  I have
> > > > added the LKMM maintainers on CC for their amusement.  ;-)
> > > > 
> > > > Thoughts?
> > > 
> > > Since sys_membarrier() provides a heavyweight barrier comparable to 
> > > synchronize_rcu(), the memory model should treat the two in the same 
> > > way.  That's what this patch does.
> > > 
> > > The corresponding critical section would be any region of code bounded
> > > by compiler barriers.  Since the LKMM doesn't currently handle plain
> > > accesses, the effect is the same as if a compiler barrier were present
> > > between each pair of instructions.  Basically, each instruction acts as
> > > its own critical section.  Therefore the patch below defines memb-rscsi
> > > as the trivial identity relation.  When plain accesses and compiler 
> > > barriers are added to the memory model, a different definition will be 
> > > needed.
> > > 
> > > This gives the correct results for the three C-Goldblat-memb-* litmus 
> > > tests in Paul's email.
> > 
> > Yow!!!
> > 
> > My first reaction was that this cannot possibly be correct because
> > sys_membarrier(), which is probably what we should call it, does not
> > wait for anything.  But your formulation has the corresponding readers
> > being "id", which as you say above is just a single event.
> > 
> > But what makes this work for the following litmus test?
> > 
> > ------------------------------------------------------------------------
> > 
> > C membrcu
> > 
> > {
> > }
> > 
> > P0(intptr_t *x0, intptr_t *x1)
> > {
> > 	WRITE_ONCE(*x0, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x1);
> > }
> > 
> > 
> > P1(intptr_t *x1, intptr_t *x2)
> > {
> > 	WRITE_ONCE(*x1, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x2);
> > }
> > 
> > 
> > P2(intptr_t *x2, intptr_t *x3)
> > {
> > 	WRITE_ONCE(*x2, 2);
> > 	smp_memb();
> > 	intptr_t r2 = READ_ONCE(*x3);
> > }
> > 
> > 
> > P3(intptr_t *x3, intptr_t *x4)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x3, 2);
> > 	intptr_t r2 = READ_ONCE(*x4);
> > 	rcu_read_unlock();
> > }
> > 
> > 
> > P4(intptr_t *x4, intptr_t *x5)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x4, 2);
> > 	intptr_t r2 = READ_ONCE(*x5);
> > 	rcu_read_unlock();
> > }
> > 
> > 
> > P5(intptr_t *x0, intptr_t *x5)
> > {
> > 	rcu_read_lock();
> > 	WRITE_ONCE(*x5, 2);
> > 	intptr_t r2 = READ_ONCE(*x0);
> > 	rcu_read_unlock();
> > }
> > 
> > exists
> > (5:r2=0 /\ 0:r2=0 /\ 1:r2=0 /\ 2:r2=0 /\ 3:r2=0 /\ 4:r2=0)
> > 
> > ------------------------------------------------------------------------
> > 
> > For this, herd gives "Never".  Of course, if I reverse the write and
> > read in any of P3(), P4(), or P5(), I get "Sometimes", which does make
> > sense.  But what is preserving the order between P3() and P4() and
> > between P4() and P5()?  I am not immediately seeing how the analogy
> > with RCU carries over to this case.
> 
> That isn't how it works.  Nothing preserves the orders you mentioned.
> It's more like: the order between P1 and P4 is preserved, as is the
> order between P0 and P5.  You'll see below...
> 
> (I readily agree that this result is not simple or obvious.  It took me
> quite a while to formulate the following analysis.)

For whatever it is worth, David Goldblatt agrees with you to at
least some extent.  I have sent him an inquiry.  ;-)

> To begin with, since there aren't any synchronize_rcu calls in the
> test, the rcu_read_lock and rcu_read_unlock calls do nothing.  They
> can be eliminated.

Agreed.  I was just being lazy.

> Also, I find the variable names "x0" - "x5" to be a little hard to
> work with.  If you don't mind, I'll replace them with "a" - "f".

Easy enough to translate, so have at it!

> Now, a little digression on how sys_membarrier works.  It starts by
> executing a full memory barrier.  Then it injects memory barriers into
> the instruction streams of all the other CPUs and waits for them all
> to complete.  Then it executes an ending memory barrier.
> 
> These barriers are ordered as described.  Therefore we have
> 
> 	mb0s < mb05 < mb0e,
> 	mb1s < mb14 < mb1e,  and
> 	mb2s < mb23 < mb2e,
> 
> where mb0s is the starting barrier of the sys_memb call on P0, mb05 is
> the barrier that it injects into P5, mb0e is the ending barrier of the
> call, and similarly for the other sys_memb calls.  The '<' signs mean
> that the thing on their left finishes before the thing on their right
> does.
> 
> Rewriting the litmus test in these terms gives:
> 
>         P0      P1      P2      P3      P4      P5
>         Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
>         mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
>         mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
>         Rb=0    Rc=0    Rd=0
> 
> Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
> positions of these barriers in their respective threads' program
> orderings is undetermined; they need not come at the top as shown.
> 
> (Also, in case David is unfamiliar with it, the "Wa=2" notation is
> shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)
> 
> Finally, here are a few facts which may be well known and obvious, but
> I'll state them anyway:
> 
> 	A CPU cannot reorder instructions across a memory barrier.
> 	If x is po-after a barrier then x executes after the barrier
> 	is finished.
> 
> 	If a store is po-before a barrier then the store propagates
> 	to every CPU before the barrier finishes.
> 
> 	If a store propagates to some CPU before a load on that CPU
> 	reads from the same location, then the load will obtain the
> 	value from that store or a co-later store.  This implies that
> 	if a load obtains a value co-earlier than some store then the
> 	load must have executed before the store propagated to the
> 	load's CPU.
> 
> The proof consists of three main stages, each requiring three steps.
> Using the facts that b - f are all read as 0, I'll show that P1
> executes Rc before P3 executes Re, then that P0 executes Rb before P4
> executes Rf, and lastly that P5's Ra must obtain 2, not 0.  This will
> demonstrate that the litmus test is not allowed.
> 
> 1.	Suppose that mb23 ends up coming po-later than Wd in P3.
> 	Then we would have:
> 
> 		Wd propagates to P2 < mb23 < mb2e < Rd,
> 
> 	and so Rd would obtain 2, not 0.  Hence mb23 must come
> 	po-before Wd (as shown in the listing):  mb23 < Wd.
> 
> 2.	Since mb23 therefore occurs po-before Re and instructions
> 	cannot be reordered across barriers,  mb23 < Re.
> 
> 3.	Since Rc obtains 0, we must have:
> 
> 		Rc < Wc propagates to P1 < mb2s < mb23 < Re.
> 
> 	Thus Rc < Re.
> 
> 4.	Suppose that mb14 ends up coming po-later than We in P4.
> 	Then we would have:
> 
> 		We propagates to P3 < mb14 < mb1e < Rc < Re,
> 
> 	and so Re would obtain 2, not 0.  Hence mb14 must come
> 	po-before We (as shown in the listing):  mb14 < We.
> 
> 5.	Since mb14 therefore occurs po-before Rf and instructions
> 	cannot be reordered across barriers,  mb14 < Rf.
> 
> 6.	Since Rb obtains 0, we must have:
> 
> 		Rb < Wb propagates to P0 < mb1s < mb14 < Rf.
> 
> 	Thus Rb < Rf.
> 
> 7.	Suppose that mb05 ends up coming po-later than Wf in P5.
> 	Then we would have:
> 
> 		Wf propagates to P4 < mb05 < mb0e < Rb < Rf,
> 
> 	and so Rf would obtain 2, not 0.  Hence mb05 must come
> 	po-before Wf (as shown in the listing):  mb05 < Wf.
> 
> 8.	Since mb05 therefore occurs po-before Ra and instructions
> 	cannot be reordered across barriers,  mb05 < Ra.
> 
> 9.	Now we have:
> 
> 		Wa propagates to P5 < mb0s < mb05 < Ra,
> 
> 	and so Ra must obtain 2, not 0.  QED.

Like this, then, with maximal reordering of P3-P5's reads?

         P0      P1      P2      P3      P4      P5
         Wa=2
         mb0s
                                                 [mb05]
         mb0e                                    Ra=0
         Rb=0    Wb=2
                 mb1s
                                         [mb14]
                 mb1e                    Rf=0
                 Rc=0    Wc=2                    Wf=2
                         mb2s
                                 [mb23]
                         mb2e    Re=0
                         Rd=0            We=2
                                 Wd=2

But don't the sys_membarrier() calls affect everyone, especially given
the shared-variable communication?  If so, why wouldn't this more strict
variant hold?

         P0      P1      P2      P3      P4      P5
         Wa=2
         mb0s
                                 [mb05]  [mb05]  [mb05]
         mb0e
         Rb=0    Wb=2
                 mb1s
                                 [mb14]  [mb14]  [mb14]
                 mb1e
                 Rc=0    Wc=2
                         mb2s
                                 [mb23]  [mb23]  [mb23]
                         mb2e    Re=0    Rf=0    Ra=0
                         Rd=0            We=2    Wf=2
                                 Wd=2

In which case, wouldn't this cycle be forbidden even if it had only one
sys_membarrier() call?

Ah, but the IPIs are not necessarily synchronized across the CPUs,
so that the following could happen:

         P0      P1      P2      P3      P4      P5
         Wa=2
         mb0s
                                 [mb05]  [mb05]  [mb05]
         mb0e                                    Ra=0
         Rb=0    Wb=2
                 mb1s
                                 [mb14]  [mb14]
                                         Rf=0
                                                 Wf=2
                                                 [mb14]
                 mb1e
                 Rc=0    Wc=2
                         mb2s
                                 [mb23]
                                 Re=0
                                         We=2
                                         [mb23]  [mb23]
                         mb2e
                         Rd=0
                                 Wd=2

I guess in light of this post in 2001, I really don't have an excuse,
do I?  ;-)

	https://lists.gt.net/linux/kernel/223555

Or am I still missing something here?

							Thanx, Paul
Alan Stern Dec. 11, 2018, 8:09 p.m. UTC | #23
On Tue, 11 Dec 2018, Paul E. McKenney wrote:

> > Rewriting the litmus test in these terms gives:
> > 
> >         P0      P1      P2      P3      P4      P5
> >         Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
> >         mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
> >         mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
> >         Rb=0    Rc=0    Rd=0
> > 
> > Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
> > positions of these barriers in their respective threads' program
> > orderings is undetermined; they need not come at the top as shown.
> > 
> > (Also, in case David is unfamiliar with it, the "Wa=2" notation is
> > shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)
> > 
> > Finally, here are a few facts which may be well known and obvious, but
> > I'll state them anyway:
> > 
> > 	A CPU cannot reorder instructions across a memory barrier.
> > 	If x is po-after a barrier then x executes after the barrier
> > 	is finished.
> > 
> > 	If a store is po-before a barrier then the store propagates
> > 	to every CPU before the barrier finishes.
> > 
> > 	If a store propagates to some CPU before a load on that CPU
> > 	reads from the same location, then the load will obtain the
> > 	value from that store or a co-later store.  This implies that
> > 	if a load obtains a value co-earlier than some store then the
> > 	load must have executed before the store propagated to the
> > 	load's CPU.
> > 
> > The proof consists of three main stages, each requiring three steps.
> > Using the facts that b - f are all read as 0, I'll show that P1
> > executes Rc before P3 executes Re, then that P0 executes Rb before P4
> > executes Rf, and lastly that P5's Ra must obtain 2, not 0.  This will
> > demonstrate that the litmus test is not allowed.
> > 
> > 1.	Suppose that mb23 ends up coming po-later than Wd in P3.
> > 	Then we would have:
> > 
> > 		Wd propagates to P2 < mb23 < mb2e < Rd,
> > 
> > 	and so Rd would obtain 2, not 0.  Hence mb23 must come
> > 	po-before Wd (as shown in the listing):  mb23 < Wd.
> > 
> > 2.	Since mb23 therefore occurs po-before Re and instructions
> > 	cannot be reordered across barriers,  mb23 < Re.
> > 
> > 3.	Since Rc obtains 0, we must have:
> > 
> > 		Rc < Wc propagates to P1 < mb2s < mb23 < Re.
> > 
> > 	Thus Rc < Re.
> > 
> > 4.	Suppose that mb14 ends up coming po-later than We in P4.
> > 	Then we would have:
> > 
> > 		We propagates to P3 < mb14 < mb1e < Rc < Re,
> > 
> > 	and so Re would obtain 2, not 0.  Hence mb14 must come
> > 	po-before We (as shown in the listing):  mb14 < We.
> > 
> > 5.	Since mb14 therefore occurs po-before Rf and instructions
> > 	cannot be reordered across barriers,  mb14 < Rf.
> > 
> > 6.	Since Rb obtains 0, we must have:
> > 
> > 		Rb < Wb propagates to P0 < mb1s < mb14 < Rf.
> > 
> > 	Thus Rb < Rf.
> > 
> > 7.	Suppose that mb05 ends up coming po-later than Wf in P5.
> > 	Then we would have:
> > 
> > 		Wf propagates to P4 < mb05 < mb0e < Rb < Rf,
> > 
> > 	and so Rf would obtain 2, not 0.  Hence mb05 must come
> > 	po-before Wf (as shown in the listing):  mb05 < Wf.
> > 
> > 8.	Since mb05 therefore occurs po-before Ra and instructions
> > 	cannot be reordered across barriers,  mb05 < Ra.
> > 
> > 9.	Now we have:
> > 
> > 		Wa propagates to P5 < mb0s < mb05 < Ra,
> > 
> > 	and so Ra must obtain 2, not 0.  QED.
> 
> Like this, then, with maximal reordering of P3-P5's reads?
> 
>          P0      P1      P2      P3      P4      P5
>          Wa=2
>          mb0s
>                                                  [mb05]
>          mb0e                                    Ra=0
>          Rb=0    Wb=2
>                  mb1s
>                                          [mb14]
>                  mb1e                    Rf=0
>                  Rc=0    Wc=2                    Wf=2
>                          mb2s
>                                  [mb23]
>                          mb2e    Re=0
>                          Rd=0            We=2
>                                  Wd=2

Yes, that's right.  This shows how P5's Ra must obtain 2 instead of 0.

> But don't the sys_membarrier() calls affect everyone, especially given
> the shared-variable communication?

They do, but the other effects are irrelevant for this proof.

>  If so, why wouldn't this more strict
> variant hold?
> 
>          P0      P1      P2      P3      P4      P5
>          Wa=2
>          mb0s
>                                  [mb05]  [mb05]  [mb05]

You have misunderstood the naming scheme.  mb05 is the barrier injected 
by P0's sys_membarrier call into P5.  So the three barriers above 
should be named "mb03", "mb04", and "mb05".  And you left out mb01 and 
mb02.

>          mb0e
>          Rb=0    Wb=2
>                  mb1s
>                                  [mb14]  [mb14]  [mb14]
>                  mb1e
>                  Rc=0    Wc=2
>                          mb2s
>                                  [mb23]  [mb23]  [mb23]
>                          mb2e    Re=0    Rf=0    Ra=0
>                          Rd=0            We=2    Wf=2
>                                  Wd=2

Yes, this does hold.  But since it doesn't affect the end result, 
there's no point in mentioning all those other barriers.

> In which case, wouldn't this cycle be forbidden even if it had only one
> sys_membarrier() call?

No, it wouldn't.  I don't understand why you might think it would.  

This is just like RCU, if you imagine a tiny critical section between 
each adjacent pair of instructions.  You wouldn't expect RCU to enforce 
ordering among six CPUs with only one synchronize_rcu call.
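
As a concrete (if smaller) illustration of that point, consider the
following sketch, which is not from the thread and has not been run
through herd: one synchronize_rcu() and two read-side critical sections
in the cycle.  Under LKMM's usual counting rule (a cycle is forbidden
only if it contains at least as many grace periods as read-side critical
sections), the "exists" clause below should be allowed ("Sometimes").

C rcu-1gp-2rscs

{
}

P0(int *a, int *b)
{
	WRITE_ONCE(*a, 1);
	synchronize_rcu();
	int r1 = READ_ONCE(*b);
}

P1(int *b, int *c)
{
	rcu_read_lock();
	WRITE_ONCE(*b, 1);
	int r1 = READ_ONCE(*c);
	rcu_read_unlock();
}

P2(int *c, int *a)
{
	rcu_read_lock();
	WRITE_ONCE(*c, 1);
	int r1 = READ_ONCE(*a);
	rcu_read_unlock();
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0)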

> Ah, but the IPIs are not necessarily synchronized across the CPUs,
> so that the following could happen:
> 
>          P0      P1      P2      P3      P4      P5
>          Wa=2
>          mb0s
>                                  [mb05]  [mb05]  [mb05]
>          mb0e                                    Ra=0
>          Rb=0    Wb=2
>                  mb1s
>                                  [mb14]  [mb14]
>                                          Rf=0
>                                                  Wf=2
>                                                  [mb14]
>                  mb1e
>                  Rc=0    Wc=2
>                          mb2s
>                                  [mb23]
>                                  Re=0
>                                          We=2
>                                          [mb23]  [mb23]
>                          mb2e
>                          Rd=0
>                                  Wd=2

Yes it could.  But even in this execution you would end up with Ra=2 
instead of Ra=0.

> I guess in light of this post in 2001, I really don't have an excuse,
> do I?  ;-)
> 
> 	https://lists.gt.net/linux/kernel/223555
> 
> Or am I still missing something here?

You tell me...

Alan
Paul E. McKenney Dec. 11, 2018, 9:22 p.m. UTC | #24
On Tue, Dec 11, 2018 at 03:09:33PM -0500, Alan Stern wrote:
> On Tue, 11 Dec 2018, Paul E. McKenney wrote:
> 
> > > Rewriting the litmus test in these terms gives:
> > > 
> > >         P0      P1      P2      P3      P4      P5
> > >         Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
> > >         mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
> > >         mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
> > >         Rb=0    Rc=0    Rd=0
> > > 
> > > Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
> > > positions of these barriers in their respective threads' program
> > > orderings is undetermined; they need not come at the top as shown.
> > > 
> > > (Also, in case David is unfamiliar with it, the "Wa=2" notation is
> > > shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)
> > > 
> > > Finally, here are a few facts which may be well known and obvious, but
> > > I'll state them anyway:
> > > 
> > > 	A CPU cannot reorder instructions across a memory barrier.
> > > 	If x is po-after a barrier then x executes after the barrier
> > > 	is finished.
> > > 
> > > 	If a store is po-before a barrier then the store propagates
> > > 	to every CPU before the barrier finishes.
> > > 
> > > 	If a store propagates to some CPU before a load on that CPU
> > > 	reads from the same location, then the load will obtain the
> > > 	value from that store or a co-later store.  This implies that
> > > 	if a load obtains a value co-earlier than some store then the
> > > 	load must have executed before the store propagated to the
> > > 	load's CPU.
> > > 
> > > The proof consists of three main stages, each requiring three steps.
> > > Using the facts that b - f are all read as 0, I'll show that P1
> > > executes Rc before P3 executes Re, then that P0 executes Rb before P4
> > > executes Rf, and lastly that P5's Ra must obtain 2, not 0.  This will
> > > demonstrate that the litmus test is not allowed.
> > > 
> > > 1.	Suppose that mb23 ends up coming po-later than Wd in P3.
> > > 	Then we would have:
> > > 
> > > 		Wd propagates to P2 < mb23 < mb2e < Rd,
> > > 
> > > 	and so Rd would obtain 2, not 0.  Hence mb23 must come
> > > 	po-before Wd (as shown in the listing):  mb23 < Wd.
> > > 
> > > 2.	Since mb23 therefore occurs po-before Re and instructions
> > > 	cannot be reordered across barriers,  mb23 < Re.
> > > 
> > > 3.	Since Rc obtains 0, we must have:
> > > 
> > > 		Rc < Wc propagates to P1 < mb2s < mb23 < Re.
> > > 
> > > 	Thus Rc < Re.
> > > 
> > > 4.	Suppose that mb14 ends up coming po-later than We in P4.
> > > 	Then we would have:
> > > 
> > > 		We propagates to P3 < mb14 < mb1e < Rc < Re,
> > > 
> > > 	and so Re would obtain 2, not 0.  Hence mb14 must come
> > > 	po-before We (as shown in the listing):  mb14 < We.
> > > 
> > > 5.	Since mb14 therefore occurs po-before Rf and instructions
> > > 	cannot be reordered across barriers,  mb14 < Rf.
> > > 
> > > 6.	Since Rb obtains 0, we must have:
> > > 
> > > 		Rb < Wb propagates to P0 < mb1s < mb14 < Rf.
> > > 
> > > 	Thus Rb < Rf.
> > > 
> > > 7.	Suppose that mb05 ends up coming po-later than Wf in P5.
> > > 	Then we would have:
> > > 
> > > 		Wf propagates to P4 < mb05 < mb0e < Rb < Rf,
> > > 
> > > 	and so Rf would obtain 2, not 0.  Hence mb05 must come
> > > 	po-before Wf (as shown in the listing):  mb05 < Wf.
> > > 
> > > 8.	Since mb05 therefore occurs po-before Ra and instructions
> > > 	cannot be reordered across barriers,  mb05 < Ra.
> > > 
> > > 9.	Now we have:
> > > 
> > > 		Wa propagates to P5 < mb0s < mb05 < Ra,
> > > 
> > > 	and so Ra must obtain 2, not 0.  QED.
> > 
> > Like this, then, with maximal reordering of P3-P5's reads?
> > 
> >          P0      P1      P2      P3      P4      P5
> >          Wa=2
> >          mb0s
> >                                                  [mb05]
> >          mb0e                                    Ra=0
> >          Rb=0    Wb=2
> >                  mb1s
> >                                          [mb14]
> >                  mb1e                    Rf=0
> >                  Rc=0    Wc=2                    Wf=2
> >                          mb2s
> >                                  [mb23]
> >                          mb2e    Re=0
> >                          Rd=0            We=2
> >                                  Wd=2
> 
> Yes, that's right.  This shows how P5's Ra must obtain 2 instead of 0.
> 
> > But don't the sys_membarrier() calls affect everyone, especially given
> > the shared-variable communication?
> 
> They do, but the other effects are irrelevant for this proof.

If I understand correctly, the shared-variable communication within
sys_membarrier() is included in your proof in the form of ordering
between memory barriers in the mainline sys_membarrier() code and
in the IPI handlers.

> >  If so, why wouldn't this more strict
> > variant hold?
> > 
> >          P0      P1      P2      P3      P4      P5
> >          Wa=2
> >          mb0s
> >                                  [mb05]  [mb05]  [mb05]
> 
> You have misunderstood the naming scheme.  mb05 is the barrier injected 
> by P0's sys_membarrier call into P5.  So the three barriers above 
> should be named "mb03", "mb04", and "mb05".  And you left out mb01 and 
> mb02.

The former is a copy-and-paste error on my part, the latter was
intentional because the IPIs among P0, P1, and P2 don't seem to
strengthen the ordering.

> >          mb0e
> >          Rb=0    Wb=2
> >                  mb1s
> >                                  [mb14]  [mb14]  [mb14]
> >                  mb1e
> >                  Rc=0    Wc=2
> >                          mb2s
> >                                  [mb23]  [mb23]  [mb23]
> >                          mb2e    Re=0    Rf=0    Ra=0
> >                          Rd=0            We=2    Wf=2
> >                                  Wd=2
> 
> Yes, this does hold.  But since it doesn't affect the end result, 
> there's no point in mentioning all those other barriers.
> 
> > In which case, wouldn't this cycle be forbidden even if it had only one
> > sys_membarrier() call?
> 
> No, it wouldn't.  I don't understand why you might think it would.  

Because I hadn't yet thought of the scenario I showed below.

> This is just like RCU, if you imagine a tiny critical section between 
> each adjacent pair of instructions.  You wouldn't expect RCU to enforce 
> ordering among six CPUs with only one synchronize_rcu call.

Yes, I do now agree in light of the scenario shown below.

> > Ah, but the IPIs are not necessarily synchronized across the CPUs,
> > so that the following could happen:
> > 
> >          P0      P1      P2      P3      P4      P5
> >          Wa=2
> >          mb0s
> >                                  [mb05]  [mb05]  [mb05]
> >          mb0e                                    Ra=0
> >          Rb=0    Wb=2
> >                  mb1s
> >                                  [mb14]  [mb14]
> >                                          Rf=0
> >                                                  Wf=2
> >                                                  [mb14]
> >                  mb1e
> >                  Rc=0    Wc=2
> >                          mb2s
> >                                  [mb23]
> >                                  Re=0
> >                                          We=2
> >                                          [mb23]  [mb23]
> >                          mb2e
> >                          Rd=0
> >                                  Wd=2
> 
> Yes it could.  But even in this execution you would end up with Ra=2 
> instead of Ra=0.

Agreed.  Or I should have said that the above execution is forbidden,
either way.

> > I guess in light of this post in 2001, I really don't have an excuse,
> > do I?  ;-)
> > 
> > 	https://lists.gt.net/linux/kernel/223555
> > 
> > Or am I still missing something here?
> 
> You tell me...

I think I am on board.  ;-)

							Thanx, Paul
Paul E. McKenney Dec. 12, 2018, 5:07 p.m. UTC | #25
On Tue, Dec 11, 2018 at 01:22:04PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 11, 2018 at 03:09:33PM -0500, Alan Stern wrote:
> > On Tue, 11 Dec 2018, Paul E. McKenney wrote:
> > 
> > > > Rewriting the litmus test in these terms gives:
> > > > 
> > > >         P0      P1      P2      P3      P4      P5
> > > >         Wa=2    Wb=2    Wc=2    [mb23]  [mb14]  [mb05]
> > > >         mb0s    mb1s    mb2s    Wd=2    We=2    Wf=2
> > > >         mb0e    mb1e    mb2e    Re=0    Rf=0    Ra=0
> > > >         Rb=0    Rc=0    Rd=0
> > > > 
> > > > Here the brackets in "[mb23]", "[mb14]", and "[mb05]" mean that the
> > > > positions of these barriers in their respective threads' program
> > > > orderings is undetermined; they need not come at the top as shown.
> > > > 
> > > > (Also, in case David is unfamiliar with it, the "Wa=2" notation is
> > > > shorthand for "Write 2 to a" and "Rb=0" is short for "Read 0 from b".)
> > > > 
> > > > Finally, here are a few facts which may be well known and obvious, but
> > > > I'll state them anyway:
> > > > 
> > > > 	A CPU cannot reorder instructions across a memory barrier.
> > > > 	If x is po-after a barrier then x executes after the barrier
> > > > 	is finished.
> > > > 
> > > > 	If a store is po-before a barrier then the store propagates
> > > > 	to every CPU before the barrier finishes.
> > > > 
> > > > 	If a store propagates to some CPU before a load on that CPU
> > > > 	reads from the same location, then the load will obtain the
> > > > 	value from that store or a co-later store.  This implies that
> > > > 	if a load obtains a value co-earlier than some store then the
> > > > 	load must have executed before the store propagated to the
> > > > 	load's CPU.
> > > > 
> > > > The proof consists of three main stages, each requiring three steps.
> > > > Using the facts that b - f are all read as 0, I'll show that P1
> > > > executes Rc before P3 executes Re, then that P0 executes Rb before P4
> > > > executes Rf, and lastly that P5's Ra must obtain 2, not 0.  This will
> > > > demonstrate that the litmus test is not allowed.
> > > > 
> > > > 1.	Suppose that mb23 ends up coming po-later than Wd in P3.
> > > > 	Then we would have:
> > > > 
> > > > 		Wd propagates to P2 < mb23 < mb2e < Rd,
> > > > 
> > > > 	and so Rd would obtain 2, not 0.  Hence mb23 must come
> > > > 	po-before Wd (as shown in the listing):  mb23 < Wd.
> > > > 
> > > > 2.	Since mb23 therefore occurs po-before Re and instructions
> > > > 	cannot be reordered across barriers,  mb23 < Re.
> > > > 
> > > > 3.	Since Rc obtains 0, we must have:
> > > > 
> > > > 		Rc < Wc propagates to P1 < mb2s < mb23 < Re.
> > > > 
> > > > 	Thus Rc < Re.
> > > > 
> > > > 4.	Suppose that mb14 ends up coming po-later than We in P4.
> > > > 	Then we would have:
> > > > 
> > > > 		We propagates to P3 < mb14 < mb1e < Rc < Re,
> > > > 
> > > > 	and so Re would obtain 2, not 0.  Hence mb14 must come
> > > > 	po-before We (as shown in the listing):  mb14 < We.
> > > > 
> > > > 5.	Since mb14 therefore occurs po-before Rf and instructions
> > > > 	cannot be reordered across barriers,  mb14 < Rf.
> > > > 
> > > > 6.	Since Rb obtains 0, we must have:
> > > > 
> > > > 		Rb < Wb propagates to P0 < mb1s < mb14 < Rf.
> > > > 
> > > > 	Thus Rb < Rf.
> > > > 
> > > > 7.	Suppose that mb05 ends up coming po-later than Wf in P5.
> > > > 	Then we would have:
> > > > 
> > > > 		Wf propagates to P4 < mb05 < mb0e < Rb < Rf,
> > > > 
> > > > 	and so Rf would obtain 2, not 0.  Hence mb05 must come
> > > > 	po-before Wf (as shown in the listing):  mb05 < Wf.
> > > > 
> > > > 8.	Since mb05 therefore occurs po-before Ra and instructions
> > > > 	cannot be reordered across barriers,  mb05 < Ra.
> > > > 
> > > > 9.	Now we have:
> > > > 
> > > > 		Wa propagates to P5 < mb0s < mb05 < Ra,
> > > > 
> > > > 	and so Ra must obtain 2, not 0.  QED.
> > > 
> > > Like this, then, with maximal reordering of P3-P5's reads?
> > > 
> > >          P0      P1      P2      P3      P4      P5
> > >          Wa=2
> > >          mb0s
> > >                                                  [mb05]
> > >          mb0e                                    Ra=0
> > >          Rb=0    Wb=2
> > >                  mb1s
> > >                                          [mb14]
> > >                  mb1e                    Rf=0
> > >                  Rc=0    Wc=2                    Wf=2
> > >                          mb2s
> > >                                  [mb23]
> > >                          mb2e    Re=0
> > >                          Rd=0            We=2
> > >                                  Wd=2
> > 
> > Yes, that's right.  This shows how P5's Ra must obtain 2 instead of 0.
> > 
> > > But don't the sys_membarrier() calls affect everyone, especially given
> > > the shared-variable communication?
> > 
> > They do, but the other effects are irrelevant for this proof.
> 
> If I understand correctly, the shared-variable communication within
> sys_membarrier() is included in your proof in the form of ordering
> between memory barriers in the mainline sys_membarrier() code and
> in the IPI handlers.
> 
> > >  If so, why wouldn't this more strict
> > > variant hold?
> > > 
> > >          P0      P1      P2      P3      P4      P5
> > >          Wa=2
> > >          mb0s
> > >                                  [mb05]  [mb05]  [mb05]
> > 
> > You have misunderstood the naming scheme.  mb05 is the barrier injected 
> > by P0's sys_membarrier call into P5.  So the three barriers above 
> > should be named "mb03", "mb04", and "mb05".  And you left out mb01 and 
> > mb02.
> 
> The former is a copy-and-paste error on my part, the latter was
> intentional because the IPIs among P0, P1, and P2 don't seem to
> strengthen the ordering.
> 
> > >          mb0e
> > >          Rb=0    Wb=2
> > >                  mb1s
> > >                                  [mb14]  [mb14]  [mb14]
> > >                  mb1e
> > >                  Rc=0    Wc=2
> > >                          mb2s
> > >                                  [mb23]  [mb23]  [mb23]
> > >                          mb2e    Re=0    Rf=0    Ra=0
> > >                          Rd=0            We=2    Wf=2
> > >                                  Wd=2
> > 
> > Yes, this does hold.  But since it doesn't affect the end result, 
> > there's no point in mentioning all those other barriers.
> > 
> > > In which case, wouldn't this cycle be forbidden even if it had only one
> > > sys_membarrier() call?
> > 
> > No, it wouldn't.  I don't understand why you might think it would.  
> 
> Because I hadn't yet thought of the scenario I showed below.
> 
> > This is just like RCU, if you imagine a tiny critical section between 
> > each adjacent pair of instructions.  You wouldn't expect RCU to enforce 
> > ordering among six CPUs with only one synchronize_rcu call.
> 
> Yes, I do now agree in light of the scenario shown below.
> 
> > > Ah, but the IPIs are not necessarily synchronized across the CPUs,
> > > so that the following could happen:
> > > 
> > >          P0      P1      P2      P3      P4      P5
> > >          Wa=2
> > >          mb0s
> > >                                  [mb05]  [mb05]  [mb05]
> > >          mb0e                                    Ra=0
> > >          Rb=0    Wb=2
> > >                  mb1s
> > >                                  [mb14]  [mb14]
> > >                                          Rf=0
> > >                                                  Wf=2
> > >                                                  [mb14]
> > >                  mb1e
> > >                  Rc=0    Wc=2
> > >                          mb2s
> > >                                  [mb23]
> > >                                  Re=0
> > >                                          We=2
> > >                                          [mb23]  [mb23]
> > >                          mb2e
> > >                          Rd=0
> > >                                  Wd=2
> > 
> > Yes it could.  But even in this execution you would end up with Ra=2 
> > instead of Ra=0.
> 
> Agreed.  Or I should have said that the above execution is forbidden,
> either way.
> 
> > > I guess in light of this post in 2001, I really don't have an excuse,
> > > do I?  ;-)
> > > 
> > > 	https://lists.gt.net/linux/kernel/223555
> > > 
> > > Or am I still missing something here?
> > 
> > You tell me...
> 
> I think I am on board.  ;-)

And more to the point, here is a three-process variant showing a cycle
that is permitted:


         P0      P1      P2
         Wa=2    Wb=2    Wc=2
         mb0s
                 [mb01]  [mb02]
         mb0e
         Rb=0    Rc=0    Ra=0

As can be seen by reordering it as follows:

         P0      P1      P2
                         Ra=0
         Wa=2
         mb0s
                 [mb01]
                 Rc=0
                         Wc=2
                         [mb02]
         mb0e
         Rb=0
                 Wb=2

Make sense?

							Thanx, Paul
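
In litmus-test form, that three-process variant might look like the following sketch, written in the same style as the tests posted later in this thread.  It assumes the smp_memb() primitive and the proposed sys_membarrier patch to linux-kernel.cat; the test and variable names are illustrative.  The permitted cycle corresponds to the exists clause being satisfied, so the expected result is Sometimes.

C C-memb-3proc
(*
 * Expected result: Sometimes, matching the permitted cycle above.
 *
 * Sketch only: requires the proposed sys_membarrier patch to
 * linux-kernel.cat.  Only P0 executes sys_membarrier() (modeled
 * as smp_memb()); a single call does not order the accesses of
 * P1 and P2.
 *)

{
}

P0(int *a, int *b)
{
	int r0;

	WRITE_ONCE(*a, 2);
	smp_memb();
	r0 = READ_ONCE(*b);
}

P1(int *b, int *c)
{
	int r0;

	WRITE_ONCE(*b, 2);
	r0 = READ_ONCE(*c);
}

P2(int *c, int *a)
{
	int r0;

	WRITE_ONCE(*c, 2);
	r0 = READ_ONCE(*a);
}

exists (0:r0=0 /\ 1:r0=0 /\ 2:r0=0)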
Alan Stern Dec. 12, 2018, 6:04 p.m. UTC | #26
On Wed, 12 Dec 2018, Paul E. McKenney wrote:

> > > > Or am I still missing something here?
> > > 
> > > You tell me...
> > 
> > I think I am on board.  ;-)
> 
> And more to the point, here is a three-process variant showing a cycle
> that is permitted:
> 
> 
>          P0      P1      P2
>          Wa=2    Wb=2    Wc=2
>          mb0s
>                  [mb01]  [mb02]
>          mb0e
>          Rb=0    Rc=0    Ra=0
> 
> As can be seen by reordering it as follows:
> 
>          P0      P1      P2
>                          Ra=0
>          Wa=2
>          mb0s
>                  [mb01]
>                  Rc=0
>                          Wc=2
>                          [mb02]
>          mb0e
>          Rb=0
>                  Wb=2
> 
> Make sense?

You got it!

Alan
Paul E. McKenney Dec. 12, 2018, 7:42 p.m. UTC | #27
On Wed, Dec 12, 2018 at 01:04:44PM -0500, Alan Stern wrote:
> On Wed, 12 Dec 2018, Paul E. McKenney wrote:
> 
> > > > > Or am I still missing something here?
> > > > 
> > > > You tell me...
> > > 
> > > I think I am on board.  ;-)
> > 
> > And more to the point, here is a three-process variant showing a cycle
> > that is permitted:
> > 
> > 
> >          P0      P1      P2
> >          Wa=2    Wb=2    Wc=2
> >          mb0s
> >                  [mb01]  [mb02]
> >          mb0e
> >          Rb=0    Rc=0    Ra=0
> > 
> > As can be seen by reordering it as follows:
> > 
> >          P0      P1      P2
> >                          Ra=0
> >          Wa=2
> >          mb0s
> >                  [mb01]
> >                  Rc=0
> >                          Wc=2
> >                          [mb02]
> >          mb0e
> >          Rb=0
> >                  Wb=2
> > 
> > Make sense?
> 
> You got it!

OK.  How about this one?

         P0      P1                 P2      P3
         Wa=2    rcu_read_lock()    Wc=2    Wd=2
         memb    Wb=2               Rd=0    synchronize_rcu();
         Rb=0    Rc=0                       Ra=0
	         rcu_read_unlock()

The model should say that it is allowed.  Taking a look...

         P0      P1                 P2      P3
				    Rd=0
					    Wd=2
					    synchronize_rcu();
	                                    Ra=0
	 Wa=2
	 membs
	         rcu_read_lock()
		 [m01]
		 Rc=0
		 		    Wc=2
				    [m02]   [m03]
	 membe
	 Rb=0
	         Wb=2
		 rcu_read_unlock()

Looks allowed to me.  If the synchronization of P1 and P2 were
interchanged, it should be forbidden:

         P0      P1      P2                 P3
         Wa=2    Wb=2    rcu_read_lock()    Wd=2
         memb    Rc=0    Wc=2               synchronize_rcu();
         Rb=0            Rd=0               Ra=0
                         rcu_read_unlock()

Taking a look...

         P0      P1      P2                 P3
                         rcu_read_lock()
                         Rd=0
         Wa=2    Wb=2                       Wd=2
         membs                              synchronize_rcu();
                 [m01]
                 Rc=0
                         Wc=2
                         rcu_read_unlock()
			 [m02]              Ra=0 [Forbidden?]
	 membe
         Rb=0

I believe that this ordering forbids the cycle:

	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
		return from synchronize_rcu() -> Ra

Does this make sense, or am I missing something?

							Thanx, Paul
Alan Stern Dec. 12, 2018, 9:32 p.m. UTC | #28
On Wed, 12 Dec 2018, Paul E. McKenney wrote:

> OK.  How about this one?
> 
>          P0      P1                 P2      P3
>          Wa=2    rcu_read_lock()    Wc=2    Wd=2
>          memb    Wb=2               Rd=0    synchronize_rcu();
>          Rb=0    Rc=0                       Ra=0
> 	         rcu_read_unlock()
> 
> The model should say that it is allowed.  Taking a look...
> 
>          P0      P1                 P2      P3
> 				    Rd=0
> 					    Wd=2
> 					    synchronize_rcu();
> 	                                    Ra=0
> 	 Wa=2
> 	 membs
> 	         rcu_read_lock()
> 		 [m01]
> 		 Rc=0
> 		 		    Wc=2
> 				    [m02]   [m03]
> 	 membe
> 	 Rb=0
> 	         Wb=2
> 		 rcu_read_unlock()
> 
> Looks allowed to me.  If the synchronization of P1 and P2 were
> interchanged, it should be forbidden:
> 
>          P0      P1      P2                 P3
>          Wa=2    Wb=2    rcu_read_lock()    Wd=2
>          memb    Rc=0    Wc=2               synchronize_rcu();
>          Rb=0            Rd=0               Ra=0
>                          rcu_read_unlock()
> 
> Taking a look...
> 
>          P0      P1      P2                 P3
>                          rcu_read_lock()
>                          Rd=0
>          Wa=2    Wb=2                       Wd=2
>          membs                              synchronize_rcu();
>                  [m01]
>                  Rc=0
>                          Wc=2
>                          rcu_read_unlock()
> 			 [m02]              Ra=0 [Forbidden?]
> 	 membe
>          Rb=0

Have you tried writing these as real litmus tests and running them 
through herd?

> I believe that this ordering forbids the cycle:
> 
> 	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
> 		return from synchronize_rcu() -> Ra
> 
> Does this make sense, or am I missing something?

It's hard to tell.  What you have written here isn't justified by the
litmus test source code, since the position of m01 in P1's program
order is undetermined.  How do you justify m01 -> Rc, for example?

Write it this way instead, using the relations defined in the 
sys_membarrier patch for linux-kernel.cat:

	memb ->memb-gp memb ->rcu-link Rc ->memb-rscsi Rc ->rcu-link
		
		rcu_read_unlock ->rcu-rscsi rcu_read_lock ->rcu-link 

		synchronize_rcu ->rcu-gp synchronize_rcu ->rcu-link memb

Recall that:

	memb-gp is the identity relation on sys_membarrier events,

	rcu-link includes (po? ; fre ; po),

	memb-rscsi is the identity relation on all events,

	rcu-rscsi links unlocks to their corresponding locks, and

	rcu-gp is the identity relation on synchronize_rcu events.

These facts justify the cycle above.

Leaving off the final rcu-link step, the sequence matches the
definition of rcu-fence (the relations are memb-gp, memb-rscsi, 
rcu-rscsi, rcu-gp with rcu-links in between).  Therefore the cycle is 
forbidden.

Alan
Paul E. McKenney Dec. 12, 2018, 9:52 p.m. UTC | #29
On Wed, Dec 12, 2018 at 04:32:50PM -0500, Alan Stern wrote:
> On Wed, 12 Dec 2018, Paul E. McKenney wrote:
> 
> > OK.  How about this one?
> > 
> >          P0      P1                 P2      P3
> >          Wa=2    rcu_read_lock()    Wc=2    Wd=2
> >          memb    Wb=2               Rd=0    synchronize_rcu();
> >          Rb=0    Rc=0                       Ra=0
> > 	         rcu_read_unlock()
> > 
> > The model should say that it is allowed.  Taking a look...
> > 
> >          P0      P1                 P2      P3
> > 				    Rd=0
> > 					    Wd=2
> > 					    synchronize_rcu();
> > 	                                    Ra=0
> > 	 Wa=2
> > 	 membs
> > 	         rcu_read_lock()
> > 		 [m01]
> > 		 Rc=0
> > 		 		    Wc=2
> > 				    [m02]   [m03]
> > 	 membe
> > 	 Rb=0
> > 	         Wb=2
> > 		 rcu_read_unlock()
> > 
> > Looks allowed to me.  If the synchronization of P1 and P2 were
> > interchanged, it should be forbidden:
> > 
> >          P0      P1      P2                 P3
> >          Wa=2    Wb=2    rcu_read_lock()    Wd=2
> >          memb    Rc=0    Wc=2               synchronize_rcu();
> >          Rb=0            Rd=0               Ra=0
> >                          rcu_read_unlock()
> > 
> > Taking a look...
> > 
> >          P0      P1      P2                 P3
> >                          rcu_read_lock()
> >                          Rd=0
> >          Wa=2    Wb=2                       Wd=2
> >          membs                              synchronize_rcu();
> >                  [m01]
> >                  Rc=0
> >                          Wc=2
> >                          rcu_read_unlock()
> > 			 [m02]              Ra=0 [Forbidden?]
> > 	 membe
> >          Rb=0

For one thing, Wb=2 needs to be down here, apologies!  Which then ...

> Have you tried writing these as real litmus tests and running them 
> through herd?

That comes later, but yes, I will do that.

> > I believe that this ordering forbids the cycle:
> > 
> > 	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
> > 		return from synchronize_rcu() -> Ra
> > 
> > Does this make sense, or am I missing something?
> 
> It's hard to tell.  What you have written here isn't justified by the
> litmus test source code, since the position of m01 in P1's program
> order is undetermined.  How do you justify m01 -> Rc, for example?

... justifies Rc=0 following [m01].

> Write it this way instead, using the relations defined in the 
> sys_membarrier patch for linux-kernel.cat:
> 
> 	memb ->memb-gp memb ->rcu-link Rc ->memb-rscsi Rc ->rcu-link
> 		
> 		rcu_read_unlock ->rcu-rscsi rcu_read_lock ->rcu-link 
> 
> 		synchronize_rcu ->rcu-gp synchronize_rcu ->rcu-link memb
> 
> Recall that:
> 
> 	memb-gp is the identity relation on sys_membarrier events,
> 
> 	rcu-link includes (po? ; fre ; po),
> 
> 	memb-rscsi is the identity relation on all events,
> 
> 	rcu-rscsi links unlocks to their corresponding locks, and
> 
> 	rcu-gp is the identity relation on synchronize_rcu events.
> 
> These facts justify the cycle above.
> 
> Leaving off the final rcu-link step, the sequence matches the
> definition of rcu-fence (the relations are memb-gp, memb-rscsi, 
> rcu-rscsi, rcu-gp with rcu-links in between).  Therefore the cycle is 
> forbidden.

Understood, but that would be using the model to check the model.  ;-)

							Thanx, Paul
Alan Stern Dec. 12, 2018, 10:12 p.m. UTC | #30
On Wed, 12 Dec 2018, Paul E. McKenney wrote:

> > > I believe that this ordering forbids the cycle:
> > > 
> > > 	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
> > > 		return from synchronize_rcu() -> Ra
> > > 
> > > Does this make sense, or am I missing something?
> > 
> > It's hard to tell.  What you have written here isn't justified by the
> > litmus test source code, since the position of m01 in P1's program
> > order is undetermined.  How do you justify m01 -> Rc, for example?
> 
> ... justifies Rc=0 following [m01].
> 
> > Write it this way instead, using the relations defined in the 
> > sys_membarrier patch for linux-kernel.cat:
> > 
> > 	memb ->memb-gp memb ->rcu-link Rc ->memb-rscsi Rc ->rcu-link
> > 		
> > 		rcu_read_unlock ->rcu-rscsi rcu_read_lock ->rcu-link 
> > 
> > 		synchronize_rcu ->rcu-gp synchronize_rcu ->rcu-link memb
> > 
> > Recall that:
> > 
> > 	memb-gp is the identity relation on sys_membarrier events,
> > 
> > 	rcu-link includes (po? ; fre ; po),
> > 
> > 	memb-rscsi is the identity relation on all events,
> > 
> > 	rcu-rscsi links unlocks to their corresponding locks, and
> > 
> > 	rcu-gp is the identity relation on synchronize_rcu events.
> > 
> > These facts justify the cycle above.
> > 
> > Leaving off the final rcu-link step, the sequence matches the
> > definition of rcu-fence (the relations are memb-gp, memb-rscsi, 
> > rcu-rscsi, rcu-gp with rcu-links in between).  Therefore the cycle is 
> > forbidden.
> 
> Understood, but that would be using the model to check the model.  ;-)

Well, what are you trying to accomplish?  Do you want to find an 
argument similar to the one I posted for the 6-CPU test to show that 
this test should be forbidden?

Alan
Paul E. McKenney Dec. 12, 2018, 10:19 p.m. UTC | #31
On Wed, Dec 12, 2018 at 01:52:45PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 12, 2018 at 04:32:50PM -0500, Alan Stern wrote:
> > On Wed, 12 Dec 2018, Paul E. McKenney wrote:
> > 
> > > OK.  How about this one?
> > > 
> > >          P0      P1                 P2      P3
> > >          Wa=2    rcu_read_lock()    Wc=2    Wd=2
> > >          memb    Wb=2               Rd=0    synchronize_rcu();
> > >          Rb=0    Rc=0                       Ra=0
> > > 	         rcu_read_unlock()
> > > 
> > > The model should say that it is allowed.  Taking a look...
> > > 
> > >          P0      P1                 P2      P3
> > > 				    Rd=0
> > > 					    Wd=2
> > > 					    synchronize_rcu();
> > > 	                                    Ra=0
> > > 	 Wa=2
> > > 	 membs
> > > 	         rcu_read_lock()
> > > 		 [m01]
> > > 		 Rc=0
> > > 		 		    Wc=2
> > > 				    [m02]   [m03]
> > > 	 membe
> > > 	 Rb=0
> > > 	         Wb=2
> > > 		 rcu_read_unlock()
> > > 
> > > Looks allowed to me.  If the synchronization of P1 and P2 were
> > > interchanged, it should be forbidden:
> > > 
> > >          P0      P1      P2                 P3
> > >          Wa=2    Wb=2    rcu_read_lock()    Wd=2
> > >          memb    Rc=0    Wc=2               synchronize_rcu();
> > >          Rb=0            Rd=0               Ra=0
> > >                          rcu_read_unlock()
> > > 
> > > Taking a look...
> > > 
> > >          P0      P1      P2                 P3
> > >                          rcu_read_lock()
> > >                          Rd=0
> > >          Wa=2    Wb=2                       Wd=2
> > >          membs                              synchronize_rcu();
> > >                  [m01]
> > >                  Rc=0
> > >                          Wc=2
> > >                          rcu_read_unlock()
> > > 			 [m02]              Ra=0 [Forbidden?]
> > > 	 membe
> > >          Rb=0
> 
> For one thing, Wb=2 needs to be down here, apologies!  Which then ...
> 
> > Have you tried writing these as real litmus tests and running them 
> > through herd?
> 
> That comes later, but yes, I will do that.
> 
> > > I believe that this ordering forbids the cycle:
> > > 
> > > 	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
> > > 		return from synchronize_rcu() -> Ra
> > > 
> > > Does this make sense, or am I missing something?
> > 
> > It's hard to tell.  What you have written here isn't justified by the
> > litmus test source code, since the position of m01 in P1's program
> > order is undetermined.  How do you justify m01 -> Rc, for example?
> 
> ... justifies Rc=0 following [m01].
> 
> > Write it this way instead, using the relations defined in the 
> > sys_membarrier patch for linux-kernel.cat:
> > 
> > 	memb ->memb-gp memb ->rcu-link Rc ->memb-rscsi Rc ->rcu-link
> > 		
> > 		rcu_read_unlock ->rcu-rscsi rcu_read_lock ->rcu-link 
> > 
> > 		synchronize_rcu ->rcu-gp synchronize_rcu ->rcu-link memb
> > 
> > Recall that:
> > 
> > 	memb-gp is the identity relation on sys_membarrier events,
> > 
> > 	rcu-link includes (po? ; fre ; po),
> > 
> > 	memb-rscsi is the identity relation on all events,
> > 
> > 	rcu-rscsi links unlocks to their corresponding locks, and
> > 
> > 	rcu-gp is the identity relation on synchronize_rcu events.
> > 
> > These facts justify the cycle above.
> > 
> > Leaving off the final rcu-link step, the sequence matches the
> > definition of rcu-fence (the relations are memb-gp, memb-rscsi, 
> > rcu-rscsi, rcu-gp with rcu-links in between).  Therefore the cycle is 
> > forbidden.
> 
> Understood, but that would be using the model to check the model.  ;-)

And here are the litmus tests in the same order as above.  They do give
the results we both called out above, which is encouraging.

							Thanx, Paul

------------------------------------------------------------------------

C C-memb-RCU-1
(*
 * Result: Sometimes
 *
 * P1's accesses are inside an RCU read-side critical section;
 * P0 uses sys_membarrier() (modeled as smp_memb()) and P3 uses
 * synchronize_rcu().
 *)

{
}

P0(int *x0, int *x1)
{
	int r1;

	WRITE_ONCE(*x0, 1);
	smp_memb();
	r1 = READ_ONCE(*x1);
}

P1(int *x1, int *x2)
{
	int r1;

	rcu_read_lock();
	WRITE_ONCE(*x1, 1);
	r1 = READ_ONCE(*x2);
	rcu_read_unlock();
}

P2(int *x2, int *x3)
{
	int r1;

	WRITE_ONCE(*x2, 1);
	r1 = READ_ONCE(*x3);
}

P3(int *x3, int *x0)
{
	int r1;

	WRITE_ONCE(*x3, 1);
	synchronize_rcu();
	r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)

------------------------------------------------------------------------

C C-memb-RCU-2
(*
 * Result: Never
 *
 * Same as C-memb-RCU-1, except that the RCU read-side critical
 * section is in P2 rather than P1.
 *)

{
}

P0(int *x0, int *x1)
{
	int r1;

	WRITE_ONCE(*x0, 1);
	smp_memb();
	r1 = READ_ONCE(*x1);
}

P1(int *x1, int *x2)
{
	int r1;

	WRITE_ONCE(*x1, 1);
	r1 = READ_ONCE(*x2);
}

P2(int *x2, int *x3)
{
	int r1;

	rcu_read_lock();
	WRITE_ONCE(*x2, 1);
	r1 = READ_ONCE(*x3);
	rcu_read_unlock();
}

P3(int *x3, int *x0)
{
	int r1;

	WRITE_ONCE(*x3, 1);
	synchronize_rcu();
	r1 = READ_ONCE(*x0);
}

exists (0:r1=0 /\ 1:r1=0 /\ 2:r1=0 /\ 3:r1=0)
Paul E. McKenney Dec. 12, 2018, 10:49 p.m. UTC | #32
On Wed, Dec 12, 2018 at 05:12:18PM -0500, Alan Stern wrote:
> On Wed, 12 Dec 2018, Paul E. McKenney wrote:
> 
> > > > I believe that this ordering forbids the cycle:
> > > > 
> > > > 	Wa=2 -> membs -> [m01] -> Rc=0 -> Wc=2 -> rcu_read_unlock() ->
> > > > 		return from synchronize_rcu() -> Ra
> > > > 
> > > > Does this make sense, or am I missing something?
> > > 
> > > It's hard to tell.  What you have written here isn't justified by the
> > > litmus test source code, since the position of m01 in P1's program
> > > order is undetermined.  How do you justify m01 -> Rc, for example?
> > 
> > ... justifies Rc=0 following [m01].
> > 
> > > Write it this way instead, using the relations defined in the 
> > > sys_membarrier patch for linux-kernel.cat:
> > > 
> > > 	memb ->memb-gp memb ->rcu-link Rc ->memb-rscsi Rc ->rcu-link
> > > 		
> > > 		rcu_read_unlock ->rcu-rscsi rcu_read_lock ->rcu-link 
> > > 
> > > 		synchronize_rcu ->rcu-gp synchronize_rcu ->rcu-link memb
> > > 
> > > Recall that:
> > > 
> > > 	memb-gp is the identity relation on sys_membarrier events,
> > > 
> > > 	rcu-link includes (po? ; fre ; po),
> > > 
> > > 	memb-rscsi is the identity relation on all events,
> > > 
> > > 	rcu-rscsi links unlocks to their corresponding locks, and
> > > 
> > > 	rcu-gp is the identity relation on synchronize_rcu events.
> > > 
> > > These facts justify the cycle above.
> > > 
> > > Leaving off the final rcu-link step, the sequence matches the
> > > definition of rcu-fence (the relations are memb-gp, memb-rscsi, 
> > > rcu-rscsi, rcu-gp with rcu-links in between).  Therefore the cycle is 
> > > forbidden.
> > 
> > Understood, but that would be using the model to check the model.  ;-)
> 
> Well, what are you trying to accomplish?  Do you want to find an 
> argument similar to the one I posted for the 6-CPU test to show that 
> this test should be forbidden?

I am trying to check odd corner cases.  Your sys_membarrier() model
is quite nice and certainly fits nicely with the rest of the model,
but where I come from, that is actually reason for suspicion.  ;-)

All kidding aside, your argument for the 6-CPU test was extremely
valuable, as it showed me a way to think of that test from an
implementation viewpoint.  Then the question is whether or not that
viewpoint actually matches the model, which seems to be the case thus far.

A good next step would be to automatically generate random tests along
with an automatically generated prediction, like I did for RCU a few
years back.  I should be able to generalize my time-based cheat for RCU to
also cover SRCU, though sys_membarrier() will require a bit more thought.
(The time-based cheat was to have fixed duration RCU grace periods and
RCU read-side critical sections, with the grace period duration being
slightly longer than that of the critical sections.  The number of
processes is of course limited by the chosen durations, but that limit
can easily be made insanely large.)

I guess that I still haven't gotten over being a bit surprised that the
RCU counting rule also applies to sys_membarrier().  ;-)

							Thanx, Paul
Alan Stern Dec. 13, 2018, 3:49 p.m. UTC | #33
On Wed, 12 Dec 2018, Paul E. McKenney wrote:

> > Well, what are you trying to accomplish?  Do you want to find an 
> > argument similar to the one I posted for the 6-CPU test to show that 
> > this test should be forbidden?
> 
> I am trying to check odd corner cases.  Your sys_membarrier() model
> is quite nice and certainly fits nicely with the rest of the model,
> but where I come from, that is actually reason for suspicion.  ;-)
> 
> All kidding aside, your argument for the 6-CPU test was extremely
> valuable, as it showed me a way to think of that test from an
> implementation viewpoint.  Then the question is whether or not that
> viewpoint actually matches the model, which seems to be the case thus far.

It should, since I formulated the reasoning behind that viewpoint 
directly from the model.  The basic idea is this:

	By induction, show that whenever we have A ->rcu-fence B then
	anything po-before A executes before anything po-after B, and
	furthermore, any write which propagates to A's CPU before A
	executes will propagate to every CPU before B finishes (i.e.,
	before anything po-after B executes).

	Using this, show that whenever X ->rb Y holds then X must
	execute before Y.

That's what the 6-CPU argument did.  In that litmus test we have
mb2 ->rcu-fence mb23, Rc ->rb Re, mb1 ->rcu-fence mb14, Rb ->rb Rf,
mb0 ->rcu-fence mb05, and lastly Ra ->rb Ra.  The last one is what 
shows that the test is forbidden.

> A good next step would be to automatically generate random tests along
> with an automatically generated prediction, like I did for RCU a few
> years back.  I should be able to generalize my time-based cheat for RCU to
> also cover SRCU, though sys_membarrier() will require a bit more thought.
> (The time-based cheat was to have fixed duration RCU grace periods and
> RCU read-side critical sections, with the grace period duration being
> slightly longer than that of the critical sections.  The number of
> processes is of course limited by the chosen durations, but that limit
> can easily be made insanely large.)

Imagine that each sys_membarrier call takes a fixed duration and each 
other instruction takes slightly less (the idea being that each 
instruction is a critical section).  Instructions can be reordered 
(although not across a sys_membarrier call), but no matter how the 
reordering is done, the result is disallowed.

> I guess that I still haven't gotten over being a bit surprised that the
> RCU counting rule also applies to sys_membarrier().  ;-)

Why not?  They are both synchronization mechanisms with heavy-weight
write sides and light-weight read sides, and most importantly, they
provide the same Guarantee.

Alan
Paul E. McKenney Dec. 14, 2018, 12:20 a.m. UTC | #34
On Thu, Dec 13, 2018 at 10:49:49AM -0500, Alan Stern wrote:
> On Wed, 12 Dec 2018, Paul E. McKenney wrote:
> 
> > > Well, what are you trying to accomplish?  Do you want to find an 
> > > argument similar to the one I posted for the 6-CPU test to show that 
> > > this test should be forbidden?
> > 
> > I am trying to check odd corner cases.  Your sys_membarrier() model
> > is quite nice and certainly fits nicely with the rest of the model,
> > but where I come from, that is actually reason for suspicion.  ;-)
> > 
> > All kidding aside, your argument for the 6-CPU test was extremely
> > valuable, as it showed me a way to think of that test from an
> > implementation viewpoint.  Then the question is whether or not that
> > viewpoint actually matches the model, which seems to be the case thus far.
> 
> It should, since I formulated the reasoning behind that viewpoint 
> directly from the model.  The basic idea is this:
> 
> 	By induction, show that whenever we have A ->rcu-fence B then
> 	anything po-before A executes before anything po-after B, and
> 	furthermore, any write which propagates to A's CPU before A
> 	executes will propagate to every CPU before B finishes (i.e.,
> 	before anything po-after B executes).
> 
> 	Using this, show that whenever X ->rb Y holds then X must
> 	execute before Y.
> 
> That's what the 6-CPU argument did.  In that litmus test we have
> mb2 ->rcu-fence mb23, Rc ->rb Re, mb1 ->rcu-fence mb14, Rb ->rb Rf,
> mb0 ->rcu-fence mb05, and lastly Ra ->rb Ra.  The last one is what 
> shows that the test is forbidden.

I really am not trying to be difficult.  Well, no more difficult than
I normally am, anyway.  Which admittedly isn't saying much.  ;-)

> > A good next step would be to automatically generate random tests along
> > with an automatically generated prediction, like I did for RCU a few
> > years back.  I should be able to generalize my time-based cheat for RCU to
> > also cover SRCU, though sys_membarrier() will require a bit more thought.
> > (The time-based cheat was to have fixed duration RCU grace periods and
> > RCU read-side critical sections, with the grace period duration being
> > slightly longer than that of the critical sections.  The number of
> > processes is of course limited by the chosen durations, but that limit
> > can easily be made insanely large.)
> 
> Imagine that each sys_membarrier call takes a fixed duration and each 
> other instruction takes slightly less (the idea being that each 
> instruction is a critical section).  Instructions can be reordered 
> (although not across a sys_membarrier call), but no matter how the 
> reordering is done, the result is disallowed.

It gets a bit trickier with interleavings of different combinations
of RCU, SRCU, and sys_membarrier().  Yes, your cat code very elegantly
sorts this out, but my goal is to be able to explain a given example
to someone.

> > I guess that I still haven't gotten over being a bit surprised that the
> > RCU counting rule also applies to sys_membarrier().  ;-)
> 
> Why not?  They are both synchronization mechanisms with heavy-weight
> write sides and light-weight read sides, and most importantly, they
> provide the same Guarantee.

True, but I do feel the need to poke at it.

The zero-size sys_membarrier() read-side critical sections do make
things act a bit differently, for example, interchanging the accesses
in an RCU read-side critical section has no effect, while doing so in
a sys_membarrier() reader can cause the result to be allowed.  One key
point is that everything before the end of a read-side critical section
of any type is ordered before any later grace period of that same type,
and vice versa.

This is why reordering accesses matters for sys_membarrier() readers but
not for RCU and SRCU readers -- in the case of RCU and SRCU readers,
the accesses are inside the read-side critical section, while for
sys_membarrier() readers, the read-side critical sections don't have
an inside.  So yes, ordering also matters in the case of SRCU and
RCU readers for accesses outside of the read-side critical sections.
The reason sys_membarrier() seems surprising to me isn't because it is
any different in theoretical structure, but rather because the practice
is to put RCU and SRCU read-side accesses inside a read-side critical
section, which is impossible for sys_membarrier().

The other thing that took some time to get used to is the possibility
of long delays during sys_membarrier() execution, allowing significant
execution and reordering between different CPUs' IPIs.  This was key
to my understanding of the six-process example, and probably needs to
be clearly called out, including in an example or two.

The interleaving restrictions are straightforward for me, but the
fixed-time approach does have some interesting cross-talk potential
between sys_membarrier() and RCU read-side critical sections whose
accesses have been reversed.  I don't believe that it is possible to
leverage this "order the other guy's read-side critical sections" effect
in the general case, but I could be missing something.

If you are claiming that I am worrying unnecessarily, you are probably
right.  But if I didn't worry unnecessarily, RCU wouldn't work at all!  ;-)

							Thanx, Paul
Alan Stern Dec. 14, 2018, 2:26 a.m. UTC | #35
On Thu, 13 Dec 2018, Paul E. McKenney wrote:

> > > A good next step would be to automatically generate random tests along
> > > with an automatically generated prediction, like I did for RCU a few
> > > years back.  I should be able to generalize my time-based cheat for RCU to
> > > also cover SRCU, though sys_membarrier() will require a bit more thought.
> > > (The time-based cheat was to have fixed duration RCU grace periods and
> > > RCU read-side critical sections, with the grace period duration being
> > > slightly longer than that of the critical sections.  The number of
> > > processes is of course limited by the chosen durations, but that limit
> > > can easily be made insanely large.)
> > 
> > Imagine that each sys_membarrier call takes a fixed duration and each 
> > other instruction takes slightly less (the idea being that each 
> > instruction is a critical section).  Instructions can be reordered 
> > (although not across a sys_membarrier call), but no matter how the 
> > reordering is done, the result is disallowed.

This turns out not to be right.  Instead, imagine that each 
sys_membarrier call takes a fixed duration, T.  Other instructions can 
take arbitrary amounts of time and can be reordered arbitrarily, with
two restrictions:

	Instructions cannot be reordered past a sys_membarrier call;

	If instructions A and B are reordered then the time duration
	from B to A must be less than T.

If you prefer, you can replace the second restriction with something a 
little more liberal:

	If A and B are reordered and A ends up executing after a 
	sys_membarrier call (on any CPU) then B cannot execute before 
	that sys_membarrier call.

Of course, this form is a consequence of the more restrictive form.

> It gets a bit trickier with interleavings of different combinations
> of RCU, SRCU, and sys_membarrier().  Yes, your cat code very elegantly
> sorts this out, but my goal is to be able to explain a given example
> to someone.

I don't think you're going to be able to fit different combinations of
RCU, SRCU, and sys_membarrier into this picture.  How would you allow
tests with incorrect interleaving, such as GP - memb - RSCS - nothing,
while forbidding similar tests with correct interleaving?

Alan
Paul E. McKenney Dec. 14, 2018, 5:20 a.m. UTC | #36
On Thu, Dec 13, 2018 at 09:26:47PM -0500, Alan Stern wrote:
> On Thu, 13 Dec 2018, Paul E. McKenney wrote:
> 
> > > > A good next step would be to automatically generate random tests along
> > > > with an automatically generated prediction, like I did for RCU a few
> > > > years back.  I should be able to generalize my time-based cheat for RCU to
> > > > also cover SRCU, though sys_membarrier() will require a bit more thought.
> > > > (The time-based cheat was to have fixed duration RCU grace periods and
> > > > RCU read-side critical sections, with the grace period duration being
> > > > slightly longer than that of the critical sections.  The number of
> > > > processes is of course limited by the chosen durations, but that limit
> > > > can easily be made insanely large.)
> > > 
> > > Imagine that each sys_membarrier call takes a fixed duration and each 
> > > other instruction takes slightly less (the idea being that each 
> > > instruction is a critical section).  Instructions can be reordered 
> > > (although not across a sys_membarrier call), but no matter how the 
> > > reordering is done, the result is disallowed.
> 
> This turns out not to be right.  Instead, imagine that each 
> sys_membarrier call takes a fixed duration, T.  Other instructions can 
> take arbitrary amounts of time and can be reordered arbitrarily, with
> two restrictions:
> 
> 	Instructions cannot be reordered past a sys_membarrier call;
> 
> 	If instructions A and B are reordered then the time duration
> 	from B to A must be less than T.
> 
> If you prefer, you can replace the second restriction with something a 
> little more liberal:
> 
> 	If A and B are reordered and A ends up executing after a 
> 	sys_membarrier call (on any CPU) then B cannot execute before 
> 	that sys_membarrier call.
> 
> Of course, this form is a consequence of the more restrictive form.

Makes sense.  And the zero-size critical sections are why sys_membarrier()
cannot be directly used for classic deferred reclamation.

> > It gets a bit trickier with interleavings of different combinations
> > of RCU, SRCU, and sys_membarrier().  Yes, your cat code very elegantly
> > sorts this out, but my goal is to be able to explain a given example
> > to someone.
> 
> I don't think you're going to be able to fit different combinations of
> RCU, SRCU, and sys_membarrier into this picture.  How would you allow
> tests with incorrect interleaving, such as GP - memb - RSCS - nothing,
> while forbidding similar tests with correct interleaving?

Well, no, I cannot do a simple linear scan tracking time, which is what
the current scripts do.  I must instead find the longest sequence with all
operations of the same type (RCU, SRCU, or memb) and work out their
worst-case timing.  If the overall effect of a given sequence is to
go backwards in time, the result is allowed.  Otherwise eliminate that
sequence from the cycle and repeat.  If everything is eliminated, the
cycle is forbidden.

Which can be thought of as an iterative process similar to something
called "rcu-fence", can't it?  ;-)

							Thanx, Paul
Alan Stern Dec. 14, 2018, 3:31 p.m. UTC | #37
On Thu, 13 Dec 2018, Paul E. McKenney wrote:

> > > I guess that I still haven't gotten over being a bit surprised that the
> > > RCU counting rule also applies to sys_membarrier().  ;-)
> > 
> > Why not?  They are both synchronization mechanisms with heavy-weight
> > write sides and light-weight read sides, and most importantly, they
> > provide the same Guarantee.
> 
> True, but I do feel the need to poke at it.
> 
> The zero-size sys_membarrier() read-side critical sections do make
> things act a bit differently, for example, interchanging the accesses
> in an RCU read-side critical section has no effect, while doing so in
> a sys_membarrier() reader can cause the result to be allowed.  One key
> point is that everything before the end of a read-side critical section
> of any type is ordered before any later grace period of that same type,
> and vice versa.
> 
> This is why reordering accesses matters for sys_membarrier() readers but
> not for RCU and SRCU readers -- in the case of RCU and SRCU readers,
> the accesses are inside the read-side critical section, while for
> sys_membarrier() readers, the read-side critical sections don't have
> an inside.  So yes, ordering also matters in the case of SRCU and
> RCU readers for accesses outside of the read-side critical sections.
> The reason sys_membarrier() seems surprising to me isn't because it is
> any different in theoretical structure, but rather because the practice
> is to put RCU and SRCU read-side accesses inside a read-side critical
> section, which is impossible for sys_membarrier().

RCU and sys_membarrier are more similar than you might think at first.  
For one thing, if there were primitives for blocking and unblocking
reception of IPIs, those primitives would delimit critical sections for
sys_membarrier.  (Maybe such things do exist; I wouldn't know.)

For another, the way we model RCU isn't fully accurate for the Linux
kernel, as you know.  Since individual instructions cannot be
preempted, each instruction is a tiny read-side critical section.
Thus, litmus tests like this one:

	P0			P1
	Wa=1			Wb=1
	synchronize_rcu()	Ra=0
	Rb=0

actually are forbidden in the kernel (provided P1 isn't part of the
idle loop!), even though the LKMM allows them.  However, it wouldn't
be forbidden if the accesses in P1 were swapped -- just like with
sys_membarrier.

Put these two observations together and you see that sys_membarrier is
almost exactly the same as RCU without explicit read-side critical
sections. Perhaps this isn't surprising, given that the initial
implementation of sys_membarrier() was pretty much the same as
synchronize_rcu().

> The other thing that took some time to get used to is the possibility
> of long delays during sys_membarrier() execution, allowing significant
> execution and reordering between different CPUs' IPIs.  This was key
> to my understanding of the six-process example, and probably needs to
> be clearly called out, including in an example or two.

In all the examples I'm aware of, no more than one of the IPIs
generated by each sys_membarrier call really matters.  (Of course,
there's no way to know in advance which one it will be, so you have to
send an IPI to every CPU.)  The execution delays and reordering
between different CPUs' IPIs don't appear to be significant.

> The interleaving restrictions are straightforward for me, but the
> fixed-time approach does have some interesting cross-talk potential
> between sys_membarrier() and RCU read-side critical sections whose
> accesses have been reversed.  I don't believe that it is possible to
> leverage this "order the other guy's read-side critical sections" effect
> in the general case, but I could be missing something.

I regard the fixed-time approach as nothing more than a heuristic
aid.  It's not an accurate explanation of what's really going on.

> If you are claiming that I am worrying unnecessarily, you are probably
> right.  But if I didn't worry unnecessarily, RCU wouldn't work at all!  ;-)

Alan
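
The two-process synchronize_rcu() example above can be written out as a litmus test roughly as follows.  This is a sketch with illustrative names; it needs only the stock LKMM, not the sys_membarrier patch.  As noted, the LKMM says Sometimes even though the kernel implementation forbids the outcome.

C C-sync-rcu-no-reader
(*
 * Expected LKMM result: Sometimes.
 *
 * The kernel implementation forbids this outcome, since P1's
 * instructions behave as tiny read-side critical sections, but the
 * model allows it because P1 has no explicit rcu_read_lock() and
 * rcu_read_unlock().  Swapping P1's two accesses would make the
 * outcome allowed in the kernel as well.
 *)

{
}

P0(int *a, int *b)
{
	int r0;

	WRITE_ONCE(*a, 1);
	synchronize_rcu();
	r0 = READ_ONCE(*b);
}

P1(int *a, int *b)
{
	int r0;

	WRITE_ONCE(*b, 1);
	r0 = READ_ONCE(*a);
}

exists (0:r0=0 /\ 1:r0=0)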
Paul E. McKenney Dec. 14, 2018, 6:43 p.m. UTC | #38
On Fri, Dec 14, 2018 at 10:31:51AM -0500, Alan Stern wrote:
> On Thu, 13 Dec 2018, Paul E. McKenney wrote:
> 
> > > > I guess that I still haven't gotten over being a bit surprised that the
> > > > RCU counting rule also applies to sys_membarrier().  ;-)
> > > 
> > > Why not?  They are both synchronization mechanisms with heavy-weight
> > > write sides and light-weight read sides, and most importantly, they
> > > provide the same Guarantee.
> > 
> > True, but I do feel the need to poke at it.
> > 
> > The zero-size sys_membarrier() read-side critical sections do make
> > things act a bit differently, for example, interchanging the accesses
> > in an RCU read-side critical section has no effect, while doing so in
> > a sys_membarrier() reader can cause the result to be allowed.  One key
> > point is that everything before the end of a read-side critical section
> > of any type is ordered before any later grace period of that same type,
> > and vice versa.
> > 
> > This is why reordering accesses matters for sys_membarrier() readers but
> > not for RCU and SRCU readers -- in the case of RCU and SRCU readers,
> > the accesses are inside the read-side critical section, while for
> > sys_membarrier() readers, the read-side critical sections don't have
> > an inside.  So yes, ordering also matters in the case of SRCU and
> > RCU readers for accesses outside of the read-side critical sections.
> > The reason sys_membarrier() seems surprising to me isn't because it is
> > any different in theoretical structure, but rather because the practice
> > is to put RCU and SRCU read-side accesses inside a read-side critical
> > section, which is impossible for sys_membarrier().
> 
> RCU and sys_membarrier are more similar than you might think at first.  
> For one thing, if there were primitives for blocking and unblocking
> reception of IPIs, those primitives would delimit critical sections for
> sys_membarrier.  (Maybe such things do exist; I wouldn't know.)

Within the kernel, of course, there are local_irq_disable() and friends.  In
userspace, there have been proposals to make the IPI handler interact
with rseq or equivalent, which would have a roughly similar effect.

> For another, the way we model RCU isn't fully accurate for the Linux
> kernel, as you know.  Since individual instructions cannot be
> preempted, each instruction is a tiny read-side critical section.
> Thus, litmus tests like this one:
> 
> 	P0			P1
> 	Wa=1			Wb=1
> 	synchronize_rcu()	Ra=0
> 	Rb=0
> 
> actually are forbidden in the kernel (provided P1 isn't part of the
> idle loop!), even though the LKMM allows them.  However, it wouldn't
> be forbidden if the accesses in P1 were swapped -- just like with
> sys_membarrier.

And that P1 isn't executing on a CPU that RCU believes to be offline,
but yes.

But this is an implementation choice, and SRCU makes a different choice,
which would allow the litmus test shown above.  And it would be good to
keep this freedom for the implementation, in other words, this difference
is a good thing, so let's please keep it.  ;-)

> Put these two observations together and you see that sys_membarrier is
> almost exactly the same as RCU without explicit read-side critical
> sections. Perhaps this isn't surprising, given that the initial
> implementation of sys_membarrier() was pretty much the same as
> synchronize_rcu().

Heh!  The initial implementation in the Linux kernel was exactly
synchronize_sched().  ;-)

I would say that sys_membarrier() has zero-sized read-side critical
sections, either comprising a single instruction (as is the case for
synchronize_sched(), actually), preempt-disable regions of code
(which are irrelevant to userspace execution), or the spaces between
consecutive pairs of instructions (as is the case for the newer
IPI-based implementation).

The model picks the single-instruction option, and I haven't yet found
a problem with this -- which is no surprise given that, as you say,
an actual implementation makes this same choice.

> > The other thing that took some time to get used to is the possibility
> > of long delays during sys_membarrier() execution, allowing significant
> > execution and reordering between different CPUs' IPIs.  This was key
> > to my understanding of the six-process example, and probably needs to
> > be clearly called out, including in an example or two.
> 
> In all the examples I'm aware of, no more than one of the IPIs
> generated by each sys_membarrier call really matters.  (Of course,
> there's no way to know in advance which one it will be, so you have to
> send an IPI to every CPU.)  The execution delays and reordering
> between different CPUs' IPIs don't appear to be significant.

Well, there are litmus tests that are allowed in which the allowed
execution is more easily explained in terms of delays between different
CPUs' IPIs, so it seems worth keeping track of.

There might be a litmus test that can tell the difference between
simultaneous and non-simultaneous IPIs, but I cannot immediately think of
one that matters.  Might be a failure of imagination on my part, though.

> > The interleaving restrictions are straightforward for me, but the
> > fixed-time approach does have some interesting cross-talk potential
> > between sys_membarrier() and RCU read-side critical sections whose
> > accesses have been reversed.  I don't believe that it is possible to
> > leverage this "order the other guy's read-side critical sections" effect
> > in the general case, but I could be missing something.
> 
> I regard the fixed-time approach as nothing more than a heuristic
> aid.  It's not an accurate explanation of what's really going on.

Agreed, albeit a useful heuristic aid in scripts generating litmus tests.

							Thanx, Paul

> > If you are claiming that I am worrying unnecessarily, you are probably
> > right.  But if I didn't worry unnecessarily, RCU wouldn't work at all!  ;-)
> 
> Alan
>
Alan Stern Dec. 14, 2018, 9:39 p.m. UTC | #39
On Fri, 14 Dec 2018, Paul E. McKenney wrote:

> I would say that sys_membarrier() has zero-sized read-side critical
> sections, either comprising a single instruction (as is the case for
> synchronize_sched(), actually), preempt-disable regions of code
> (which are irrelevant to userspace execution), or the spaces between
> consecutive pairs of instructions (as is the case for the newer
> IPI-based implementation).
> 
> The model picks the single-instruction option, and I haven't yet found
> a problem with this -- which is no surprise given that, as you say,
> an actual implementation makes this same choice.

I believe that for RCU tests the LKMM gives the same results for
length-zero critical sections interspersed between all the instructions
and length-one critical sections surrounding all instructions (except
synchronize_rcu).  But the proof is tricky and I haven't checked it
carefully.

> > > The other thing that took some time to get used to is the possibility
> > > of long delays during sys_membarrier() execution, allowing significant
> > > execution and reordering between different CPUs' IPIs.  This was key
> > > to my understanding of the six-process example, and probably needs to
> > > be clearly called out, including in an example or two.
> > 
> > In all the examples I'm aware of, no more than one of the IPIs
> > generated by each sys_membarrier call really matters.  (Of course,
> > there's no way to know in advance which one it will be, so you have to
> > send an IPI to every CPU.)  The execution delays and reordering
> > between different CPUs' IPIs don't appear to be significant.
> 
> Well, there are litmus tests that are allowed in which the allowed
> execution is more easily explained in terms of delays between different
> CPUs' IPIs, so it seems worth keeping track of.
> 
> There might be a litmus test that can tell the difference between
> simultaneous and non-simultaneous IPIs, but I cannot immediately think of
> one that matters.  Might be a failure of imagination on my part, though.

	P0	P1	P2
	Wc=1	[mb01]	Rb=1
	memb	Wa=1	Rc=0
	Ra=0	Wb=1	[mb02]

The IPIs have to appear in the positions shown, which means they cannot
be simultaneous.  The test is allowed because P2's reads can be
reordered.

Alan
Paul E. McKenney Dec. 16, 2018, 6:51 p.m. UTC | #40
On Fri, Dec 14, 2018 at 04:39:34PM -0500, Alan Stern wrote:
> On Fri, 14 Dec 2018, Paul E. McKenney wrote:
> 
> > I would say that sys_membarrier() has zero-sized read-side critical
> > sections, either comprising a single instruction (as is the case for
> > synchronize_sched(), actually), preempt-disable regions of code
> > (which are irrelevant to userspace execution), or the spaces between
> > consecutive pairs of instructions (as is the case for the newer
> > IPI-based implementation).
> > 
> > The model picks the single-instruction option, and I haven't yet found
> > a problem with this -- which is no surprise given that, as you say,
> > an actual implementation makes this same choice.
> 
> I believe that for RCU tests the LKMM gives the same results for
> length-zero critical sections interspersed between all the instructions
> and length-one critical sections surrounding all instructions (except
> synchronize_rcu).  But the proof is tricky and I haven't checked it
> carefully.

That assertion is completely consistent with my implementation experience,
give or take the usual caveats about idle and offline execution.

> > > > The other thing that took some time to get used to is the possibility
> > > > of long delays during sys_membarrier() execution, allowing significant
> > > > execution and reordering between different CPUs' IPIs.  This was key
> > > > to my understanding of the six-process example, and probably needs to
> > > > be clearly called out, including in an example or two.
> > > 
> > > In all the examples I'm aware of, no more than one of the IPIs
> > > generated by each sys_membarrier call really matters.  (Of course,
> > > there's no way to know in advance which one it will be, so you have to
> > > send an IPI to every CPU.)  The execution delays and reordering
> > > between different CPUs' IPIs don't appear to be significant.
> > 
> > Well, there are litmus tests that are allowed in which the allowed
> > execution is more easily explained in terms of delays between different
> > CPUs' IPIs, so it seems worth keeping track of.
> > 
> > There might be a litmus test that can tell the difference between
> > simultaneous and non-simultaneous IPIs, but I cannot immediately think of
> > one that matters.  Might be a failure of imagination on my part, though.
> 
> 	P0	P1	P2
> 	Wc=1	[mb01]	Rb=1
> 	memb	Wa=1	Rc=0
> 	Ra=0	Wb=1	[mb02]
> 
> The IPIs have to appear in the positions shown, which means they cannot
> be simultaneous.  The test is allowed because P2's reads can be
> reordered.

OK, so "simultaneous" IPIs could be emulated in a real implementation by
having sys_membarrier() send each IPI (but not wait for a response), then
execute a full memory barrier and set a shared variable.  Each IPI handler
would spin waiting for the shared variable to be set, then execute a full
memory barrier and atomically increment yet another shared variable and
return from interrupt.  When that other shared variable's value reached
the number of IPIs sent, the sys_membarrier() would execute its final
(already existing) full memory barrier and return.  Horribly expensive
and definitely not recommended, but eminently doable.

The difference between current sys_membarrier() and the "simultaneous"
variant described above is similar to the difference between
non-multicopy-atomic and multicopy-atomic memory ordering.  So, after
thinking it through, my guess is that pretty much any litmus test that
can discern between multicopy-atomic and  non-multicopy-atomic should
be transformable into something that can distinguish between the current
and the "simultaneous" sys_membarrier() implementation.

Seem reasonable?

Or alternatively, may I please apply your Signed-off-by to your earlier
sys_membarrier() patch so that I can queue it?  I will probably also
change smp_memb() to membarrier() or some such.  Again, within the
Linux kernel, membarrier() can be emulated with smp_call_function()
invoking a handler that does smp_mb().
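
As a minimal sketch (the function names here are illustrative, not taken
from any actual patch), that in-kernel emulation could look roughly like
this:

	#include <linux/smp.h>

	/* Illustrative emulation: run a full memory barrier on every
	   other online CPU and on the caller.  */
	static void membarrier_ipi_handler(void *unused)
	{
		smp_mb();	/* Order each CPU's prior accesses against later ones.  */
	}

	static void membarrier_emulated(void)
	{
		smp_mb();	/* Order the caller's prior accesses against the IPIs.  */
		/* Run the handler on all other online CPUs, waiting for completion.  */
		smp_call_function(membarrier_ipi_handler, NULL, 1);
		smp_mb();	/* Order the IPIs against the caller's later accesses.  */
	}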

							Thanx, Paul
Alan Stern Dec. 17, 2018, 4:02 p.m. UTC | #41
On Sun, 16 Dec 2018, Paul E. McKenney wrote:

> OK, so "simultaneous" IPIs could be emulated in a real implementation by
> having sys_membarrier() send each IPI (but not wait for a response), then
> execute a full memory barrier and set a shared variable.  Each IPI handler
> would spin waiting for the shared variable to be set, then execute a full
> memory barrier and atomically increment yet another shared variable and
> return from interrupt.  When that other shared variable's value reached
> the number of IPIs sent, the sys_membarrier() would execute its final
> (already existing) full memory barrier and return.  Horribly expensive
> and definitely not recommended, but eminently doable.

I don't think that's right.  What would make the IPIs "simultaneous"  
would be if none of the handlers return until all of them have started
executing.  For example, you could have each handler increment a shared
variable and then spin, waiting for the variable to reach the number of
CPUs, before returning.

What you wrote was to have each handler wait until all the IPIs had 
been sent, which is not the same thing at all.

> The difference between current sys_membarrier() and the "simultaneous"
> variant described above is similar to the difference between
> non-multicopy-atomic and multicopy-atomic memory ordering.  So, after
> thinking it through, my guess is that pretty much any litmus test that
> can discern between multicopy-atomic and  non-multicopy-atomic should
> be transformable into something that can distinguish between the current
> and the "simultaneous" sys_membarrier() implementation.
> 
> Seem reasonable?

Yes.

> Or alternatively, may I please apply your Signed-off-by to your earlier
> sys_membarrier() patch so that I can queue it?  I will probably also
> change smp_memb() to membarrier() or some such.  Again, within the
> Linux kernel, membarrier() can be emulated with smp_call_function()
> invoking a handler that does smp_mb().

Do you really want to put sys_membarrier into the LKMM?  I'm not so 
sure it's appropriate.

Alan
Florian Weimer Dec. 17, 2018, 5:47 p.m. UTC | #42
* Adhemerval Zanella:

>> Well, that already happens with the duplicated header approach if users
>> want to use UAPI headers, too.  It requires complicated synchronization
>> to make it work.
>
> But at least glibc is not imposing it; if the user wants to use a
> kernel UAPI header, he will explicitly include it, and it is up to the
> kernel to provide a sane implementation.

This only works if the header does not include the system call wrapper.

>> We can tweak the conditional for the header inclusion further, like
>> this:
>> 
>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0) \
>> +  || __glibc_has_include (<linux/membarrier.h>)
>> +# include <linux/membarrier.h>
>> +#else
>> 
>> With GCC 5 and later, this would always prefer the kernel header if
>> available.  There would never be a conflict between the definitions,
>> irrespective of header file inclusion order.
>> 
>> What do you think?
>
> In any case I don't think my suggestion should be a blocker; we
> already rely on Linux headers for some cases, and it seems I am alone
> in trying to decouple glibc from kernel headers.

Okay.  I've reconsidered the header baseline approach and I'm now
proposing to go with the 4.3 header version, the first one that had the
system call.  With a later baseline, you'd get an error if you include
<linux/membarrier.h> before <sys/membarrier.h> and the kernel is older
than the chosen baseline.  This cannot happen if we pick the first
supported kernel version as the baseline.  (This will not matter if we
eventually choose to use the __has_include trick and the compiler
supports __has_include.)
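
For illustration, a hypothetical consumer (the function name is made up)
that includes both headers; with the 4.3 baseline this compiles
regardless of the inclusion order, because the fallback enum in
<sys/membarrier.h> is only used when <linux/membarrier.h> does not exist
at all:

	/* Hypothetical consumer translation unit.  */
	#include <linux/membarrier.h>	/* UAPI MEMBARRIER_CMD_* constants (kernel >= 4.3).  */
	#include <sys/membarrier.h>	/* glibc declaration of membarrier.  */

	int
	query_membarrier_commands (void)
	{
	  /* Returns a bit mask of the supported commands, or -1 on error.  */
	  return membarrier (MEMBARRIER_CMD_QUERY, 0);
	}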

I had to adjust the test to define a private copy of the subsequently
added constants.

Thanks,
Florian

2018-12-17  Florian Weimer  <fweimer@redhat.com>

	Linux: Implement membarrier function.
	* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Add
	sys/membarrier.h.
	(tests): Add tst-membarrier.
	* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Add membarrier.
	* sysdeps/unix/sysv/linux/sys/membarrier.h: New file.
	* sysdeps/unix/sysv/linux/tst-membarrier.c: Likewise.
	* sysdeps/unix/sysv/linux/aarch64/libc.abilist (GLIBC_2.29): Add
	membarrier.
	* sysdeps/unix/sysv/linux/alpha/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/arm/libc.abilist (GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/hppa/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/i386/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/ia64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/microblaze/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/nios2/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
	(GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/sh/libc.abilist (GLIBC_2.29): Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/64/libc.abilist (GLIBC_2.29):
	Likewise.
	* sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist (GLIBC_2.29):
	Likewise.

diff --git a/NEWS b/NEWS
index ae80818df4..d8cefe90dd 100644
--- a/NEWS
+++ b/NEWS
@@ -46,6 +46,9 @@ Major new features:
   incosistent mutex state after fork call in multithread environment.
   In both popen and system there is no direct access to user-defined mutexes.
 
+* On Linux, the membarrier function and the <sys/membarrier.h> header file
+  have been added.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index 2c1a7dd274..74138429d6 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -43,12 +43,13 @@ sysdep_headers += sys/mount.h sys/acct.h sys/sysctl.h \
 		  bits/siginfo-arch.h bits/siginfo-consts-arch.h \
 		  bits/procfs.h bits/procfs-id.h bits/procfs-extra.h \
 		  bits/procfs-prregset.h bits/mman-map-flags-generic.h \
-		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h
+		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h \
+		  sys/membarrier.h
 
 tests += tst-clone tst-clone2 tst-clone3 tst-fanotify tst-personality \
 	 tst-quota tst-sync_file_range tst-sysconf-iov_max tst-ttyname \
 	 test-errno-linux tst-memfd_create tst-mlock2 tst-pkey \
-	 tst-rlimit-infinity tst-ofdlocks
+	 tst-rlimit-infinity tst-ofdlocks tst-membarrier
 tests-internal += tst-ofdlocks-compat
 
 
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index f1e12d9c69..39a61b29eb 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -173,6 +173,7 @@ libc {
   }
   GLIBC_2.29 {
     getcpu;
+    membarrier;
   }
   GLIBC_PRIVATE {
     # functions used in other libraries
diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
index 9c330f325e..eff4b1d055 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
@@ -2139,5 +2139,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
diff --git a/sysdeps/unix/sysv/linux/alpha/libc.abilist b/sysdeps/unix/sysv/linux/alpha/libc.abilist
index f630fa4c6f..6f37593acc 100644
--- a/sysdeps/unix/sysv/linux/alpha/libc.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libc.abilist
@@ -2034,6 +2034,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/arm/libc.abilist b/sysdeps/unix/sysv/linux/arm/libc.abilist
index b96f45590f..b2e5cc4113 100644
--- a/sysdeps/unix/sysv/linux/arm/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libc.abilist
@@ -124,6 +124,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.4 _Exit F
diff --git a/sysdeps/unix/sysv/linux/hppa/libc.abilist b/sysdeps/unix/sysv/linux/hppa/libc.abilist
index 088a8ee369..a45c266c33 100644
--- a/sysdeps/unix/sysv/linux/hppa/libc.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libc.abilist
@@ -1881,6 +1881,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/i386/libc.abilist b/sysdeps/unix/sysv/linux/i386/libc.abilist
index f7ff2c57b9..3597737fb2 100644
--- a/sysdeps/unix/sysv/linux/i386/libc.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libc.abilist
@@ -2046,6 +2046,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/ia64/libc.abilist b/sysdeps/unix/sysv/linux/ia64/libc.abilist
index becd8b1033..908e8b4d52 100644
--- a/sysdeps/unix/sysv/linux/ia64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libc.abilist
@@ -1915,6 +1915,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
index 74e42a5209..8fc66fd450 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
@@ -125,6 +125,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.4 _Exit F
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
index 4af5a74e8a..e4c60d6f83 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
@@ -1990,6 +1990,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
index ccef673fd2..377e5d7049 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libc.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
@@ -2131,5 +2131,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
index 1054bb599e..4f93740486 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
@@ -1968,6 +1968,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
index 4f5b5ffebf..c8e956a922 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
@@ -1966,6 +1966,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
index 943aee58d4..6f69ce6478 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
@@ -1974,6 +1974,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
index 17a5d17ef9..eaf6fdd1a4 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
@@ -1969,6 +1969,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/nios2/libc.abilist b/sysdeps/unix/sysv/linux/nios2/libc.abilist
index 4d62a540fd..d6becb39db 100644
--- a/sysdeps/unix/sysv/linux/nios2/libc.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libc.abilist
@@ -2172,5 +2172,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
index ecc2d6fa13..2b45a9dedf 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
@@ -1994,6 +1994,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
index f5830f9c33..a9cb22ec2e 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
@@ -1998,6 +1998,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
index 633d8f4792..26216a660b 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/be/libc.abilist
@@ -124,6 +124,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 _Exit F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
index 2c712636ef..066985b4a4 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/le/libc.abilist
@@ -2229,5 +2229,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
index 195bc8b2cf..6b52ed0eeb 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
@@ -2101,5 +2101,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
index 334def033c..80db25db0c 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
@@ -2003,6 +2003,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
index 536f4c4ced..8fcc88ef9a 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
@@ -1909,6 +1909,7 @@ GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 __fentry__ F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/sh/libc.abilist b/sysdeps/unix/sysv/linux/sh/libc.abilist
index 30ae3b6ebb..8a49fa8a4b 100644
--- a/sysdeps/unix/sysv/linux/sh/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libc.abilist
@@ -1885,6 +1885,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
index 68b107d080..e17b9e72da 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
@@ -1997,6 +1997,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
index e5b6a4da50..34f741d00b 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
@@ -1938,6 +1938,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/sys/membarrier.h b/sysdeps/unix/sysv/linux/sys/membarrier.h
new file mode 100644
index 0000000000..0e0941bc1c
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/sys/membarrier.h
@@ -0,0 +1,48 @@
+/* Memory barriers.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_MEMBARRIER_H
+#define _SYS_MEMBARRIER_H 1
+
+#include <features.h>
+
+__BEGIN_DECLS
+
+/* Perform a memory barrier on multiple threads.  */
+int membarrier (int __op, int __flags) __THROW;
+
+__END_DECLS
+
+/* Obtain the definitions of the MEMBARRIER_CMD_* constants.  */
+
+#include <linux/version.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION (4, 3, 0)
+# include <linux/membarrier.h>
+#else
+
+/* Definitions from Linux 4.3 follow.  */
+
+enum membarrier_cmd
+{
+  MEMBARRIER_CMD_QUERY = 0,
+  MEMBARRIER_CMD_SHARED = 1
+};
+
+#endif
+
+#endif /* _SYS_MEMBARRIER_H */
diff --git a/sysdeps/unix/sysv/linux/syscalls.list b/sysdeps/unix/sysv/linux/syscalls.list
index e24ea29e35..3deee2bc19 100644
--- a/sysdeps/unix/sysv/linux/syscalls.list
+++ b/sysdeps/unix/sysv/linux/syscalls.list
@@ -112,3 +112,4 @@ process_vm_writev EXTRA	process_vm_writev i:ipipii process_vm_writev
 memfd_create    EXTRA	memfd_create	i:si    memfd_create
 pkey_alloc	EXTRA	pkey_alloc	i:ii	pkey_alloc
 pkey_free	EXTRA	pkey_free	i:i	pkey_free
+membarrier	EXTRA	membarrier	i:ii	membarrier
diff --git a/sysdeps/unix/sysv/linux/tst-membarrier.c b/sysdeps/unix/sysv/linux/tst-membarrier.c
new file mode 100644
index 0000000000..48f4cddf02
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/tst-membarrier.c
@@ -0,0 +1,66 @@
+/* Tests for the membarrier function.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <stdio.h>
+#include <support/check.h>
+#include <sys/membarrier.h>
+
+static int
+do_test (void)
+{
+  int supported = membarrier (MEMBARRIER_CMD_QUERY, 0);
+  if (supported == -1)
+    {
+      if (errno == ENOSYS)
+        FAIL_UNSUPPORTED ("membarrier system call not implemented");
+      else
+        FAIL_EXIT1 ("membarrier: %m");
+    }
+
+  if ((supported & MEMBARRIER_CMD_SHARED) == 0)
+    FAIL_UNSUPPORTED ("shared memory barriers not supported");
+
+  puts ("info: membarrier is supported on this system");
+
+  /* This was not included in the original implementation in Linux
+     4.3.  */
+  const enum membarrier_cmd cmd_private_expedited = 8;
+  const enum membarrier_cmd cmd_register_private_expedited = 16;
+
+  /* The shared barrier is always implemented.  */
+  TEST_COMPARE (supported & MEMBARRIER_CMD_SHARED, MEMBARRIER_CMD_SHARED);
+  TEST_COMPARE (membarrier (MEMBARRIER_CMD_SHARED, 0), 0);
+
+  /* If the private-expedited barrier is advertised, execute it after
+     registering the intent.  */
+  if (supported & cmd_private_expedited)
+    {
+      puts ("info: MEMBARRIER_CMD_PRIVATE_EXPEDITED is supported");
+      TEST_COMPARE (supported & cmd_register_private_expedited,
+                    cmd_register_private_expedited);
+      TEST_COMPARE (membarrier (cmd_register_private_expedited, 0), 0);
+      TEST_COMPARE (membarrier (cmd_private_expedited, 0), 0);
+    }
+  else
+    puts ("info: MEMBARRIER_CMD_PRIVATE_EXPEDITED not supported");
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
index 86dfb0c94d..6f15b7b371 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
@@ -1896,6 +1896,7 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
 GLIBC_2.3 __ctype_b_loc F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
index dd688263aa..88e1e1f36d 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
@@ -2147,5 +2147,6 @@ GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 getcpu F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.29 posix_spawn_file_actions_addfchdir_np F
Paul E. McKenney Dec. 17, 2018, 6:32 p.m. UTC | #43
On Mon, Dec 17, 2018 at 11:02:40AM -0500, Alan Stern wrote:
> On Sun, 16 Dec 2018, Paul E. McKenney wrote:
> 
> > OK, so "simultaneous" IPIs could be emulated in a real implementation by
> > having sys_membarrier() send each IPI (but not wait for a response), then
> > execute a full memory barrier and set a shared variable.  Each IPI handler
> > would spin waiting for the shared variable to be set, then execute a full
> > memory barrier and atomically increment yet another shared variable and
> > return from interrupt.  When that other shared variable's value reached
> > the number of IPIs sent, the sys_membarrier() would execute its final
> > (already existing) full memory barrier and return.  Horribly expensive
> > and definitely not recommended, but eminently doable.
> 
> I don't think that's right.  What would make the IPIs "simultaneous"  
> would be if none of the handlers return until all of them have started
> executing.  For example, you could have each handler increment a shared
> variable and then spin, waiting for the variable to reach the number of
> CPUs, before returning.
> 
> What you wrote was to have each handler wait until all the IPIs had 
> been sent, which is not the same thing at all.

You are right, the handlers need to do the atomic increment before
waiting for the shared variable to be set, and the sys_membarrier()
must wait for the incremented variable to reach its final value before
setting the shared variable.
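
In sketch form (kernel-style pseudocode with made-up names; not meant as
a real or recommended implementation), the corrected emulation would be
something like:

	#include <linux/atomic.h>
	#include <linux/smp.h>

	static atomic_t memb_started;	/* Number of handlers that have started.  */
	static int memb_go;		/* Set once every handler has started.  */

	static void memb_simultaneous_handler(void *unused)
	{
		atomic_inc(&memb_started);	/* Announce that this handler has started.  */
		while (!READ_ONCE(memb_go))	/* Spin until *all* handlers have started.  */
			cpu_relax();
		smp_mb();			/* The barrier each CPU must execute.  */
	}

	static void membarrier_simultaneous(void)
	{
		int nr_ipis = num_online_cpus() - 1;

		atomic_set(&memb_started, 0);
		WRITE_ONCE(memb_go, 0);
		smp_mb();
		/* Send the IPIs without waiting for the handlers to finish.  */
		smp_call_function(memb_simultaneous_handler, NULL, 0);
		/* Wait until every handler has announced itself ...  */
		while (atomic_read(&memb_started) < nr_ipis)
			cpu_relax();
		smp_mb();
		/* ... and only then allow the handlers to return.  */
		WRITE_ONCE(memb_go, 1);
		smp_mb();	/* The final (already existing) full memory barrier.  */
	}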

> > The difference between current sys_membarrier() and the "simultaneous"
> > variant described above is similar to the difference between
> > non-multicopy-atomic and multicopy-atomic memory ordering.  So, after
> > thinking it through, my guess is that pretty much any litmus test that
> > can discern between multicopy-atomic and  non-multicopy-atomic should
> > be transformable into something that can distinguish between the current
> > and the "simultaneous" sys_membarrier() implementation.
> > 
> > Seem reasonable?
> 
> Yes.
> 
> > Or alternatively, may I please apply your Signed-off-by to your earlier
> > sys_membarrier() patch so that I can queue it?  I will probably also
> > change smp_memb() to membarrier() or some such.  Again, within the
> > Linux kernel, membarrier() can be emulated with smp_call_function()
> > invoking a handler that does smp_mb().
> 
> Do you really want to put sys_membarrier into the LKMM?  I'm not so 
> sure it's appropriate.

We do need it for the benefit of the C++ folks, but you are right that
it need not be accepted into the kernel to be useful to them.

So agreed, let's hold off for the time being.

							Thanx, Paul
Rich Felker Feb. 22, 2019, 8:39 a.m. UTC | #44
On Wed, Nov 28, 2018 at 04:05:01PM +0100, Florian Weimer wrote:
> This is essentially a repost of last year's patch, rebased to the glibc
> 2.29 symbol version and reflecting the introduction of
> MEMBARRIER_CMD_GLOBAL.
> 
> I'm not including any changes to manual/ here because the set of
> supported operations is evolving rapidly, we could not get consensus for
> the language I proposed the last time, and I do not want to contribute
> to the manual for the time being.
> 
> Thanks,
> Florian
> 
> 2018-11-28  Florian Weimer  <fweimer@redhat.com>
> 
> 	Linux: Implement membarrier function.
> 	* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Add
> 	sys/membarrier.h.
> 	(tests): Add tst-membarrier.
> 	* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Add membarrier.
> 	* sysdeps/unix/sysv/linux/sys/membarrier.h: New file.

I notice that the declaration moved from sys/mman.h to this new header
since the previous version of this patch. Is this an intentional
change, and is it where everyone now intends/agrees for it to be when
it gets merged?

Rich
Florian Weimer Feb. 22, 2019, 10:25 a.m. UTC | #45
* Rich Felker:

> On Wed, Nov 28, 2018 at 04:05:01PM +0100, Florian Weimer wrote:
>> This is essentially a repost of last year's patch, rebased to the glibc
>> 2.29 symbol version and reflecting the introduction of
>> MEMBARRIER_CMD_GLOBAL.
>> 
>> I'm not including any changes to manual/ here because the set of
>> supported operations is evolving rapidly, we could not get consensus for
>> the language I proposed the last time, and I do not want to contribute
>> to the manual for the time being.
>> 
>> Thanks,
>> Florian
>> 
>> 2018-11-28  Florian Weimer  <fweimer@redhat.com>
>> 
>> 	Linux: Implement membarrier function.
>> 	* sysdeps/unix/sysv/linux/Makefile (sysdep_headers): Add
>> 	sys/membarrier.h.
>> 	(tests): Add tst-membarrier.
>> 	* sysdeps/unix/sysv/linux/Versions (GLIBC_2.27): Add membarrier.
>> 	* sysdeps/unix/sysv/linux/sys/membarrier.h: New file.
>
> I notice that the declaration moved from sys/mman.h to this new header
> since the previous version of this patch. Is this an intentional
> change,

Yes, it makes it clearer how we avoid maintaining a separate list of
constants for this.

> and is it where everyone now intends/agrees for it to be when
> it gets merged?

I don't know if there is consensus, sorry.

Thanks,
Florian

Patch

diff --git a/NEWS b/NEWS
index 1098be1afb..d5786f4eab 100644
--- a/NEWS
+++ b/NEWS
@@ -35,6 +35,9 @@  Major new features:
   different directory.  This is a GNU extension and similar to the
   Solaris function of the same name.
 
+* The membarrier function and the <sys/membarrier.h> header file have been
+  added.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * The glibc.tune tunable namespace has been renamed to glibc.cpu and the
diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
index 362cf3b950..f8c4843ab6 100644
--- a/sysdeps/unix/sysv/linux/Makefile
+++ b/sysdeps/unix/sysv/linux/Makefile
@@ -43,12 +43,13 @@  sysdep_headers += sys/mount.h sys/acct.h sys/sysctl.h \
 		  bits/siginfo-arch.h bits/siginfo-consts-arch.h \
 		  bits/procfs.h bits/procfs-id.h bits/procfs-extra.h \
 		  bits/procfs-prregset.h bits/mman-map-flags-generic.h \
-		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h
+		  bits/msq-pad.h bits/sem-pad.h bits/shmlba.h bits/shm-pad.h \
+		  sys/membarrier.h
 
 tests += tst-clone tst-clone2 tst-clone3 tst-fanotify tst-personality \
 	 tst-quota tst-sync_file_range tst-sysconf-iov_max tst-ttyname \
 	 test-errno-linux tst-memfd_create tst-mlock2 tst-pkey \
-	 tst-rlimit-infinity tst-ofdlocks
+	 tst-rlimit-infinity tst-ofdlocks tst-membarrier
 tests-internal += tst-ofdlocks-compat
 
 
diff --git a/sysdeps/unix/sysv/linux/Versions b/sysdeps/unix/sysv/linux/Versions
index 336c13b57d..86db06f403 100644
--- a/sysdeps/unix/sysv/linux/Versions
+++ b/sysdeps/unix/sysv/linux/Versions
@@ -171,6 +171,9 @@  libc {
     mlock2;
     pkey_alloc; pkey_free; pkey_set; pkey_get; pkey_mprotect;
   }
+  GLIBC_2.29 {
+    membarrier;
+  }
   GLIBC_PRIVATE {
     # functions used in other libraries
     __syscall_rt_sigqueueinfo;
diff --git a/sysdeps/unix/sysv/linux/aarch64/libc.abilist b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
index e66c741d04..c73c731eec 100644
--- a/sysdeps/unix/sysv/linux/aarch64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/aarch64/libc.abilist
@@ -2138,4 +2138,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
diff --git a/sysdeps/unix/sysv/linux/alpha/libc.abilist b/sysdeps/unix/sysv/linux/alpha/libc.abilist
index 8df162fe99..c9488b3c18 100644
--- a/sysdeps/unix/sysv/linux/alpha/libc.abilist
+++ b/sysdeps/unix/sysv/linux/alpha/libc.abilist
@@ -2033,6 +2033,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/arm/libc.abilist b/sysdeps/unix/sysv/linux/arm/libc.abilist
index 43c804f9dc..2524b6545b 100644
--- a/sysdeps/unix/sysv/linux/arm/libc.abilist
+++ b/sysdeps/unix/sysv/linux/arm/libc.abilist
@@ -123,6 +123,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0xa0
diff --git a/sysdeps/unix/sysv/linux/hppa/libc.abilist b/sysdeps/unix/sysv/linux/hppa/libc.abilist
index 88b01c2e75..9baaa34b62 100644
--- a/sysdeps/unix/sysv/linux/hppa/libc.abilist
+++ b/sysdeps/unix/sysv/linux/hppa/libc.abilist
@@ -1880,6 +1880,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/i386/libc.abilist b/sysdeps/unix/sysv/linux/i386/libc.abilist
index 6d02f31612..1b91873f65 100644
--- a/sysdeps/unix/sysv/linux/i386/libc.abilist
+++ b/sysdeps/unix/sysv/linux/i386/libc.abilist
@@ -2045,6 +2045,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/ia64/libc.abilist b/sysdeps/unix/sysv/linux/ia64/libc.abilist
index 4249712611..1b3465c6f4 100644
--- a/sysdeps/unix/sysv/linux/ia64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/ia64/libc.abilist
@@ -1914,6 +1914,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
index d47b808862..db22eb4a12 100644
--- a/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/coldfire/libc.abilist
@@ -124,6 +124,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.4 _Exit F
 GLIBC_2.4 _IO_2_1_stderr_ D 0x98
diff --git a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
index d5e38308be..5a93b98e76 100644
--- a/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
+++ b/sysdeps/unix/sysv/linux/m68k/m680x0/libc.abilist
@@ -1989,6 +1989,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/microblaze/libc.abilist b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
index 8596b84399..0142b573ca 100644
--- a/sysdeps/unix/sysv/linux/microblaze/libc.abilist
+++ b/sysdeps/unix/sysv/linux/microblaze/libc.abilist
@@ -2130,4 +2130,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
index 88e0f896d5..a84db6e3f4 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/fpu/libc.abilist
@@ -1967,6 +1967,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
index aff7462c34..aed9d20fa5 100644
--- a/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips32/nofpu/libc.abilist
@@ -1965,6 +1965,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
index 71d82444aa..d117ad299e 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n32/libc.abilist
@@ -1973,6 +1973,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
index de6c53d293..3598a33eca 100644
--- a/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/mips/mips64/n64/libc.abilist
@@ -1968,6 +1968,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/nios2/libc.abilist b/sysdeps/unix/sysv/linux/nios2/libc.abilist
index e724bab9fb..7ce4aa1841 100644
--- a/sysdeps/unix/sysv/linux/nios2/libc.abilist
+++ b/sysdeps/unix/sysv/linux/nios2/libc.abilist
@@ -2171,4 +2171,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
index e9ecbccb71..50fedd0fab 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/fpu/libc.abilist
@@ -1993,6 +1993,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
index da83ea6028..86862b6e16 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc32/nofpu/libc.abilist
@@ -1997,6 +1997,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
index 4535b40d15..57c0e67347 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc-le.abilist
@@ -2228,4 +2228,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
diff --git a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
index 65725de4f0..7658e2c2b9 100644
--- a/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/powerpc/powerpc64/libc.abilist
@@ -123,6 +123,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 _Exit F
 GLIBC_2.3 _IO_2_1_stderr_ D 0xe0
diff --git a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
index bbb3c4a8e7..f4020e881d 100644
--- a/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/riscv/rv64/libc.abilist
@@ -2100,4 +2100,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
index e85ac2a178..9476770e5b 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-32/libc.abilist
@@ -2002,6 +2002,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
index d56931022c..ee9a87659f 100644
--- a/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/s390/s390-64/libc.abilist
@@ -1908,6 +1908,7 @@  GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
 GLIBC_2.29 __fentry__ F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/sh/libc.abilist b/sysdeps/unix/sysv/linux/sh/libc.abilist
index ff939a15c4..c2dd2a5f7e 100644
--- a/sysdeps/unix/sysv/linux/sh/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sh/libc.abilist
@@ -1884,6 +1884,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
index 64fa9e10a5..55c6496b96 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc32/libc.abilist
@@ -1996,6 +1996,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
index db909d1506..a0e7f2f221 100644
--- a/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/sparc/sparc64/libc.abilist
@@ -1937,6 +1937,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/sys/membarrier.h b/sysdeps/unix/sysv/linux/sys/membarrier.h
new file mode 100644
index 0000000000..4c3e6164f7
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/sys/membarrier.h
@@ -0,0 +1,55 @@ 
+/* Memory barriers.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _SYS_MEMBARRIER_H
+#define _SYS_MEMBARRIER_H 1
+
+#include <features.h>
+
+__BEGIN_DECLS
+
+/* Perform a memory barrier on multiple threads.  */
+int membarrier (int __op, int __flags) __THROW;
+
+__END_DECLS
+
+/* Obtain the definitions of the MEMBARRIER_CMD_* constants.  */
+
+#include <linux/version.h>
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(4, 16, 0)
+# include <linux/membarrier.h>
+#else
+
+/* Definitions from Linux 4.16 follow.  */
+
+enum membarrier_cmd
+{
+  MEMBARRIER_CMD_QUERY = 0,
+  MEMBARRIER_CMD_SHARED = 1,
+  MEMBARRIER_CMD_GLOBAL = 1,
+  MEMBARRIER_CMD_GLOBAL_EXPEDITED = 2,
+  MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED = 4,
+  MEMBARRIER_CMD_PRIVATE_EXPEDITED = 8,
+  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED = 16,
+  MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE = 32,
+  MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE = 64,
+};
+
+#endif
+
+#endif /* _SYS_MEMBARRIER_H */
diff --git a/sysdeps/unix/sysv/linux/syscalls.list b/sysdeps/unix/sysv/linux/syscalls.list
index e24ea29e35..3deee2bc19 100644
--- a/sysdeps/unix/sysv/linux/syscalls.list
+++ b/sysdeps/unix/sysv/linux/syscalls.list
@@ -112,3 +112,4 @@  process_vm_writev EXTRA	process_vm_writev i:ipipii process_vm_writev
 memfd_create    EXTRA	memfd_create	i:si    memfd_create
 pkey_alloc	EXTRA	pkey_alloc	i:ii	pkey_alloc
 pkey_free	EXTRA	pkey_free	i:i	pkey_free
+membarrier	EXTRA	membarrier	i:ii	membarrier
diff --git a/sysdeps/unix/sysv/linux/tst-membarrier.c b/sysdeps/unix/sysv/linux/tst-membarrier.c
new file mode 100644
index 0000000000..aeaccad578
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/tst-membarrier.c
@@ -0,0 +1,61 @@ 
+/* Tests for the membarrier function.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <errno.h>
+#include <stdio.h>
+#include <support/check.h>
+#include <sys/membarrier.h>
+
+static int
+do_test (void)
+{
+  int supported = membarrier (MEMBARRIER_CMD_QUERY, 0);
+  if (supported == -1)
+    {
+      if (errno == ENOSYS)
+        FAIL_UNSUPPORTED ("membarrier system call not implemented");
+      else
+        FAIL_EXIT1 ("membarrier: %m");
+    }
+
+  if ((supported & MEMBARRIER_CMD_GLOBAL) == 0)
+    FAIL_UNSUPPORTED ("global memory barriers not supported");
+
+  puts ("info: membarrier is supported on this system");
+
+  /* The global barrier is always implemented.  */
+  TEST_COMPARE (supported & MEMBARRIER_CMD_GLOBAL, MEMBARRIER_CMD_GLOBAL);
+  TEST_COMPARE (membarrier (MEMBARRIER_CMD_GLOBAL, 0), 0);
+
+  /* If the private-expedited barrier is advertised, execute it after
+     registering the intent.  */
+  if (supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED)
+    {
+      TEST_COMPARE (supported & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
+                    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED);
+      TEST_COMPARE (membarrier (MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0),
+                    0);
+      TEST_COMPARE (membarrier (MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0), 0);
+    }
+  else
+    puts ("info: MEMBARRIER_CMD_PRIVATE_EXPEDITED not supported");
+
+  return 0;
+}
+
+#include <support/test-driver.c>
diff --git a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
index 3b175f104b..f097eb86e2 100644
--- a/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/64/libc.abilist
@@ -1895,6 +1895,7 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F
 GLIBC_2.3 __ctype_b_loc F
 GLIBC_2.3 __ctype_tolower_loc F
diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
index 1b57710477..afb1a196fa 100644
--- a/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
+++ b/sysdeps/unix/sysv/linux/x86_64/x32/libc.abilist
@@ -2146,4 +2146,5 @@  GLIBC_2.28 thrd_current F
 GLIBC_2.28 thrd_equal F
 GLIBC_2.28 thrd_sleep F
 GLIBC_2.28 thrd_yield F
+GLIBC_2.29 membarrier F
 GLIBC_2.29 posix_spawn_file_actions_addchdir_np F