[RFC] glibc: Perform rseq(2) registration at nptl init and thread creation

Message ID 20180919144438.1066-1-mathieu.desnoyers@efficios.com
State New
Headers show
Series
  • [RFC] glibc: Perform rseq(2) registration at nptl init and thread creation
Related show

Commit Message

Mathieu Desnoyers Sept. 19, 2018, 2:44 p.m.
Here is a rough prototype registering rseq(2) TLS for each thread
(including main), and unregistering for each thread (excluding
main). "rseq" stands for Restartable Sequences.

Things to consider:

- Move __rseq_refcount to an extra field at the end of __rseq_abi to
  eliminate one symbol. This would require to wrap struct rseq into
  e.g. struct rseq_lib or such, e.g.:

struct rseq_lib {
  struct rseq kabi;
  int refcount;
};

All libraries/programs which try to register rseq (glibc, early-adopter
applications, early-adopter libraries) should use the rseq refcount.
It becomes part of the ABI within a user-space process, but it's not
part of the ABI shared with the kernel per se.

- Restructure how this code is organized so glibc keeps building on
  non-Linux targets.

- We do not need an atomic increment/decrement for the refcount per
  se. Just being atomic with respect to the current thread (and nested
  signals) would be enough. What is the proper API to use there ?
  Should we expose struct rseq_lib in a public glibc header ? Should
  we create a rseq(3) man page ?

- Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we
  want a non-weak symbol there ? (and let all other early user
  libraries use weak)

- Should we pull linux/rseq.h from the Linux kernel into the glibc
  tree or keep a dependency on installed kernel headers ?

- I plan to host a librseq library containing mostly helper headers
  that vastly simplify using rseq. It will contain a librseq.so
  which provides an API to explicitly register rseq to the
  kernel (for early adopters). The refcounting mechanism in the TLS
  would be used to ensure combining librseq.so and libc.so.6 with rseq
  support works fine.

- How early do we want to register rseq and how late do we want to
  unregister it ? It's important to consider if we expect rseq to
  be used by the memory allocator and within destructor callbacks.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Carlos O'Donell <carlos@redhat.com>
CC: Florian Weimer <fweimer@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ben Maurer <bmaurer@fb.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
CC: Boqun Feng <boqun.feng@gmail.com>
CC: Will Deacon <will.deacon@arm.com>
CC: Dave Watson <davejwatson@fb.com>
CC: Paul Turner <pjt@google.com>
CC: libc-alpha@sourceware.org
CC: linux-kernel@vger.kernel.org
CC: linux-api@vger.kernel.org
---
 nptl/Versions                  |   1 +
 nptl/nptl-init.c               |   3 +
 nptl/pthreadP.h                |  45 ++++++++++
 nptl/pthread_create.c          |  15 ++++
 sysdeps/unix/sysv/linux/rseq.h | 147 +++++++++++++++++++++++++++++++++
 5 files changed, 211 insertions(+)
 create mode 100644 sysdeps/unix/sysv/linux/rseq.h

Comments

Joseph Myers Sept. 19, 2018, 4:37 p.m. | #1
On Wed, 19 Sep 2018, Mathieu Desnoyers wrote:

> Here is a rough prototype registering rseq(2) TLS for each thread
> (including main), and unregistering for each thread (excluding
> main). "rseq" stands for Restartable Sequences.

A final patch would need to add documentation and tests and a NEWS entry 
and fix various coding style issues.

> diff --git a/nptl/Versions b/nptl/Versions
> index e7f691da7a..7316c2815d 100644
> --- a/nptl/Versions
> +++ b/nptl/Versions
> @@ -275,6 +275,7 @@ libpthread {
>      mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy;
>      call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal;
>      cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set;
> +    __rseq_abi; __rseq_refcount;

That's the GLIBC_2.28 section, but 2.28 is already out.  New symbols would 
need to go in GLIBC_2.29 or later (and the ABI test baselines would all 
need updating).

> diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h

This looks like it's coming from the Linux kernel.  Can't the relevant 
uapi header just be used directly without copying into glibc (with due 
care to ensure that glibc still builds if the kernel headers used for the 
build are too old - you need such conditionals anyway if they don't define 
the relevant syscall number)?
Mathieu Desnoyers Sept. 19, 2018, 4:53 p.m. | #2
----- On Sep 19, 2018, at 12:37 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Wed, 19 Sep 2018, Mathieu Desnoyers wrote:
> 
>> Here is a rough prototype registering rseq(2) TLS for each thread
>> (including main), and unregistering for each thread (excluding
>> main). "rseq" stands for Restartable Sequences.
> 
> A final patch would need to add documentation and tests and a NEWS entry
> and fix various coding style issues.
> 
>> diff --git a/nptl/Versions b/nptl/Versions
>> index e7f691da7a..7316c2815d 100644
>> --- a/nptl/Versions
>> +++ b/nptl/Versions
>> @@ -275,6 +275,7 @@ libpthread {
>>      mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy;
>>      call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal;
>>      cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set;
>> +    __rseq_abi; __rseq_refcount;
> 
> That's the GLIBC_2.28 section, but 2.28 is already out.  New symbols would
> need to go in GLIBC_2.29 or later (and the ABI test baselines would all
> need updating).

Certainly. I'm just not sure where I need to poke to update the ABI version
from 2.28 to 2.29 in my dev branch.

> 
>> diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h
> 
> This looks like it's coming from the Linux kernel.  Can't the relevant
> uapi header just be used directly without copying into glibc (with due
> care to ensure that glibc still builds if the kernel headers used for the
> build are too old - you need such conditionals anyway if they don't define
> the relevant syscall number)?

This is indeed in the list of "things to consider" I've put in the patch
commit message. If the usual practice is to build against uapi kernel headers
outside of the glibc tree, I'm fine with that.

Thanks,

Mathieu


> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Joseph Myers Sept. 19, 2018, 5:10 p.m. | #3
On Wed, 19 Sep 2018, Mathieu Desnoyers wrote:

> > This looks like it's coming from the Linux kernel.  Can't the relevant
> > uapi header just be used directly without copying into glibc (with due
> > care to ensure that glibc still builds if the kernel headers used for the
> > build are too old - you need such conditionals anyway if they don't define
> > the relevant syscall number)?
> 
> This is indeed in the list of "things to consider" I've put in the patch
> commit message. If the usual practice is to build against uapi kernel headers
> outside of the glibc tree, I'm fine with that.

We build with, currently, 3.2 or later headers (since 3.2 is EOL there's a 
case for updating the minimum in glibc for both compile time and runtime, 
but I haven't proposed that since there isn't much cleanup that would 
enable and there's the open question of Carlos's proposal to eliminate the 
runtime check on the kernel version and just let things try to run anyway 
even if it's older than the configured minimum).  Functions depending on 
new syscalls may return ENOSYS errors if the headers used to build glibc 
were too old.  Since this patch is providing a data interface rather than 
functions that can set errno to ENOSYS, presumably you have some other way 
of signalling unavailability which would apply both with a too-old kernel 
at runtime and too-old headers at compile time.
Szabolcs Nagy Sept. 19, 2018, 5:38 p.m. | #4
On 19/09/18 15:44, Mathieu Desnoyers wrote:
> Things to consider:
> 
> - Move __rseq_refcount to an extra field at the end of __rseq_abi to
>    eliminate one symbol. This would require to wrap struct rseq into
>    e.g. struct rseq_lib or such, e.g.:
> 
> struct rseq_lib {
>    struct rseq kabi;
>    int refcount;
> };
> 
> All libraries/programs which try to register rseq (glibc, early-adopter
> applications, early-adopter libraries) should use the rseq refcount.
> It becomes part of the ABI within a user-space process, but it's not
> part of the ABI shared with the kernel per se.
> 
> - Restructure how this code is organized so glibc keeps building on
>    non-Linux targets.
> 
> - We do not need an atomic increment/decrement for the refcount per
>    se. Just being atomic with respect to the current thread (and nested
>    signals) would be enough. What is the proper API to use there ?
>    Should we expose struct rseq_lib in a public glibc header ? Should
>    we create a rseq(3) man page ?
> 
> - Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we
>    want a non-weak symbol there ? (and let all other early user
>    libraries use weak)
> 

i don't think there is precedent for exposing tls symbol in glibc
(e.g. errno is exposed via __errno_location function) so there
might be issues with this (but i don't have immediate concerns).

> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
> index fe75d04113..20ee197d94 100644
> --- a/nptl/pthread_create.c
> +++ b/nptl/pthread_create.c
> @@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__;
>   /* Number of threads running.  */
>   unsigned int __nptl_nthreads = 1;
>   
> +__attribute__((weak, tls_model("initial-exec"))) __thread
> +volatile struct rseq __rseq_abi = {
> +	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
> +};
> +
> +__attribute__((weak, tls_model("initial-exec"))) __thread
> +volatile int __rseq_refcount;
>  

note that libpthread.so is built with -ftls-model=initial-exec

(and if it wasn't then you'd want to put the attribute on the
declaration in the internal header file, not on the definition,
so the actual tls accesses generate the right code)
Joseph Myers Sept. 19, 2018, 7:49 p.m. | #5
On Wed, 19 Sep 2018, Szabolcs Nagy wrote:

> i don't think there is precedent for exposing tls symbol in glibc
> (e.g. errno is exposed via __errno_location function) so there
> might be issues with this (but i don't have immediate concerns).

There have been suggestions to expose TLS errno - but also suggestions 
that use of __errno_location is more efficient, at least in terms of code 
size everywhere errno is accessed (for some ABIs, anyway).

The ABI tests have code that would list .tbss symbols as "T" in ABI test 
baselines, but no existing ABI baselines use that.
Mathieu Desnoyers Sept. 19, 2018, 8:10 p.m. | #6
----- On Sep 19, 2018, at 3:49 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Wed, 19 Sep 2018, Szabolcs Nagy wrote:
> 
>> i don't think there is precedent for exposing tls symbol in glibc
>> (e.g. errno is exposed via __errno_location function) so there
>> might be issues with this (but i don't have immediate concerns).
> 
> There have been suggestions to expose TLS errno - but also suggestions
> that use of __errno_location is more efficient, at least in terms of code
> size everywhere errno is accessed (for some ABIs, anyway).

AFAIU, the trade-off is different between the errno use-case and the
rseq use-case.

If my understanding is correct, errno is not supposed to be used in
fast-paths, only when an error is returned. So size is more important than
speed there.

Comparatively, rseq is _meant_ to speed up fast-paths. A per-cpu statistics
counter can be incremented in 2ns with rseq on a Intel E5-2630, which is faster
than a simple function call.

So for rseq, we should favor speed over space, which means the user applications
and libraries would need access to the TLS symbol without requiring an
accessor function. That's also why I'm using the initial-exec tls model
rather than the global-dynamic: I want to make sure no function call is
generated there.

> The ABI tests have code that would list .tbss symbols as "T" in ABI test
> baselines, but no existing ABI baselines use that.

Thanks,

Mathieu
Mathieu Desnoyers Sept. 19, 2018, 9:01 p.m. | #7
----- On Sep 19, 2018, at 1:38 PM, Szabolcs Nagy szabolcs.nagy@arm.com wrote:

> On 19/09/18 15:44, Mathieu Desnoyers wrote:
>> Things to consider:
>> 
>> - Move __rseq_refcount to an extra field at the end of __rseq_abi to
>>    eliminate one symbol. This would require to wrap struct rseq into
>>    e.g. struct rseq_lib or such, e.g.:
>> 
>> struct rseq_lib {
>>    struct rseq kabi;
>>    int refcount;
>> };
>> 
>> All libraries/programs which try to register rseq (glibc, early-adopter
>> applications, early-adopter libraries) should use the rseq refcount.
>> It becomes part of the ABI within a user-space process, but it's not
>> part of the ABI shared with the kernel per se.
>> 
>> - Restructure how this code is organized so glibc keeps building on
>>    non-Linux targets.
>> 
>> - We do not need an atomic increment/decrement for the refcount per
>>    se. Just being atomic with respect to the current thread (and nested
>>    signals) would be enough. What is the proper API to use there ?
>>    Should we expose struct rseq_lib in a public glibc header ? Should
>>    we create a rseq(3) man page ?
>> 
>> - Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we
>>    want a non-weak symbol there ? (and let all other early user
>>    libraries use weak)
>> 
> 
> i don't think there is precedent for exposing tls symbol in glibc
> (e.g. errno is exposed via __errno_location function) so there
> might be issues with this (but i don't have immediate concerns).
> 
>> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
>> index fe75d04113..20ee197d94 100644
>> --- a/nptl/pthread_create.c
>> +++ b/nptl/pthread_create.c
>> @@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__;
>>   /* Number of threads running.  */
>>   unsigned int __nptl_nthreads = 1;
>>   
>> +__attribute__((weak, tls_model("initial-exec"))) __thread
>> +volatile struct rseq __rseq_abi = {
>> +	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
>> +};
>> +
>> +__attribute__((weak, tls_model("initial-exec"))) __thread
>> +volatile int __rseq_refcount;
>>  
> 
> note that libpthread.so is built with -ftls-model=initial-exec

Which would indeed make these annotations redundant. I'll remove
them.

> (and if it wasn't then you'd want to put the attribute on the
> declaration in the internal header file, not on the definition,
> so the actual tls accesses generate the right code)

This area is one where I'm still uneasy on my comprehension of
the details, especially that it goes in a different direction than
what you are recommending.

I've read through https://www.akkadia.org/drepper/tls.pdf Section 5
"Linker Optimizations" to try to figure it out, and I end up being
under the impression that applying the tls_model("initial-exec")
attribute to a symbol declaration in a header file does not have
much impact on the accesses that use that variable. Reading through
that section, it seems that the variable definition is the one that
matters, and then the compiler/linker/loader are tweaking the sites
that reference the TLS variable through code rewrite based on the
most efficient mechanism that each phase knows can be used at each
stage.

What am I missing ?

Thanks,

Mathieu
Szabolcs Nagy Sept. 20, 2018, 10:28 a.m. | #8
On 19/09/18 22:01, Mathieu Desnoyers wrote:
> ----- On Sep 19, 2018, at 1:38 PM, Szabolcs Nagy szabolcs.nagy@arm.com wrote:
>> note that libpthread.so is built with -ftls-model=initial-exec
> 
> Which would indeed make these annotations redundant. I'll remove
> them.
> 
>> (and if it wasn't then you'd want to put the attribute on the
>> declaration in the internal header file, not on the definition,
>> so the actual tls accesses generate the right code)
> 
> This area is one where I'm still uneasy on my comprehension of
> the details, especially that it goes in a different direction than
> what you are recommending.
> 
> I've read through https://www.akkadia.org/drepper/tls.pdf Section 5
> "Linker Optimizations" to try to figure it out, and I end up being
> under the impression that applying the tls_model("initial-exec")
> attribute to a symbol declaration in a header file does not have
> much impact on the accesses that use that variable. Reading through
> that section, it seems that the variable definition is the one that
> matters, and then the compiler/linker/loader are tweaking the sites
> that reference the TLS variable through code rewrite based on the
> most efficient mechanism that each phase knows can be used at each
> stage.
> 
> What am I missing ?

in general if you rely on linker relaxations you may not
get optimal code because the linker cannot remove
instructions, just nop them out.

(e.g. on aarch64 an initial-exec access is 4 instructions
a general dynamic (tlsdesc) access is 6 instructions +
it involves a call, so the return address has to be saved
and restored (+ 3 instructions for stack operations if
there were none otherwise, which the linker cannot change))
Mathieu Desnoyers Sept. 20, 2018, 8:04 p.m. | #9
----- On Sep 19, 2018, at 12:37 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Wed, 19 Sep 2018, Mathieu Desnoyers wrote:
> 
>> Here is a rough prototype registering rseq(2) TLS for each thread
>> (including main), and unregistering for each thread (excluding
>> main). "rseq" stands for Restartable Sequences.
> 
> A final patch would need to add documentation and tests and a NEWS entry
> and fix various coding style issues.

Sure,

> 
>> diff --git a/nptl/Versions b/nptl/Versions
>> index e7f691da7a..7316c2815d 100644
>> --- a/nptl/Versions
>> +++ b/nptl/Versions
>> @@ -275,6 +275,7 @@ libpthread {
>>      mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy;
>>      call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal;
>>      cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set;
>> +    __rseq_abi; __rseq_refcount;
> 
> That's the GLIBC_2.28 section, but 2.28 is already out.  New symbols would
> need to go in GLIBC_2.29 or later (and the ABI test baselines would all
> need updating).

OK

> 
>> diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h
> 
> This looks like it's coming from the Linux kernel.  Can't the relevant
> uapi header just be used directly without copying into glibc (with due
> care to ensure that glibc still builds if the kernel headers used for the
> build are too old - you need such conditionals anyway if they don't define
> the relevant syscall number)?

Something like this in pthreadP.h ?

+#ifdef __NR_rseq
+#include <sysdeps/unix/sysv/linux/rseq-internal.h>
+#else
+#include <sysdeps/nptl/rseq-internal.h>
+#endif  /* __NR_rseq.  */

where sysdeps/unix/sysv/linux/rseq-internal.h contains the linux
implementation of rseq_register_current_thread () and
rseq_unregister_current_thread (), and sysdeps/nptl/rseq-internal.h
contains stubs.

Am I on the right track ?

Thanks,

Mathieu



> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Mathieu Desnoyers Sept. 20, 2018, 8:14 p.m. | #10
----- On Sep 19, 2018, at 1:10 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Wed, 19 Sep 2018, Mathieu Desnoyers wrote:
> 
>> > This looks like it's coming from the Linux kernel.  Can't the relevant
>> > uapi header just be used directly without copying into glibc (with due
>> > care to ensure that glibc still builds if the kernel headers used for the
>> > build are too old - you need such conditionals anyway if they don't define
>> > the relevant syscall number)?
>> 
>> This is indeed in the list of "things to consider" I've put in the patch
>> commit message. If the usual practice is to build against uapi kernel headers
>> outside of the glibc tree, I'm fine with that.
> 
> We build with, currently, 3.2 or later headers (since 3.2 is EOL there's a
> case for updating the minimum in glibc for both compile time and runtime,
> but I haven't proposed that since there isn't much cleanup that would
> enable and there's the open question of Carlos's proposal to eliminate the
> runtime check on the kernel version and just let things try to run anyway
> even if it's older than the configured minimum).

Are you saying glibc has an explicit check for the kernel version visible
from /proc before using specific features ? If so, how can this work with
the variety of feature backports we find in the distribution kernels out
there ?

Checking whether specific system calls return ENOSYS errors seems more
flexible.

> Functions depending on
> new syscalls may return ENOSYS errors if the headers used to build glibc
> were too old.  Since this patch is providing a data interface rather than
> functions that can set errno to ENOSYS, presumably you have some other way
> of signalling unavailability which would apply both with a too-old kernel
> at runtime and too-old headers at compile time.

For too-old kernel at runtime, having rseq registration return ENOSYS
leaves the the content of __rseq_abi->cpu_id at its initial value
(RSEQ_CPU_ID_UNINITIALIZED = -1).

For too-old headers at compile time, one possibility is that we don't event
expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway
for ABI consistency purposes, then we'd leave its cpu_id field at the initial
value (-1). But that would require that we copy linux/rseq.h into the glibc
source tree.

Thoughts ?

Thanks,

Mathieu

> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Joseph Myers Sept. 20, 2018, 8:20 p.m. | #11
On Thu, 20 Sep 2018, Mathieu Desnoyers wrote:

> Are you saying glibc has an explicit check for the kernel version visible
> from /proc before using specific features ? If so, how can this work with
> the variety of feature backports we find in the distribution kernels out
> there ?

See sysdeps/unix/sysv/linux/dl-sysdep.c and 
sysdeps/unix/sysv/linux/dl-osinfo.h.  As I said, Carlos has proposed 
removing that check.

> For too-old headers at compile time, one possibility is that we don't event
> expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway
> for ABI consistency purposes, then we'd leave its cpu_id field at the initial
> value (-1). But that would require that we copy linux/rseq.h into the glibc
> source tree.

The ABI needs to be independent of the kernel headers used.  I don't think 
you need to copy linux/rseq.h; all you should need is to e.g. define an 
array of suitable size and alignment with the relevant member initialized 
and a suitable explanatory comment.
Joseph Myers Sept. 20, 2018, 8:29 p.m. | #12
On Thu, 20 Sep 2018, Mathieu Desnoyers wrote:

> Something like this in pthreadP.h ?
> 
> +#ifdef __NR_rseq
> +#include <sysdeps/unix/sysv/linux/rseq-internal.h>
> +#else
> +#include <sysdeps/nptl/rseq-internal.h>
> +#endif  /* __NR_rseq.  */
> 
> where sysdeps/unix/sysv/linux/rseq-internal.h contains the linux
> implementation of rseq_register_current_thread () and
> rseq_unregister_current_thread (), and sysdeps/nptl/rseq-internal.h
> contains stubs.
> 
> Am I on the right track ?

It's hard to define the right abstractions for what goes where given that 
only Linux uses NPTL since the removal of the NaCl port.  I suppose it 
does make logical sense for a #include of <linux/rseq.h> to go somewhere 
under sysdeps/unix/sysv/linux/.
Mathieu Desnoyers Sept. 21, 2018, 4:29 p.m. | #13
----- On Sep 20, 2018, at 4:20 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Thu, 20 Sep 2018, Mathieu Desnoyers wrote:
> 
>> Are you saying glibc has an explicit check for the kernel version visible
>> from /proc before using specific features ? If so, how can this work with
>> the variety of feature backports we find in the distribution kernels out
>> there ?
> 
> See sysdeps/unix/sysv/linux/dl-sysdep.c and
> sysdeps/unix/sysv/linux/dl-osinfo.h.  As I said, Carlos has proposed
> removing that check.

For the system calls I implement and maintain, I typically ensure there is
a set of parameters that can be used when issuing the system call so it
can either succeed or fail with ENOSYS without having side-effects. It's
specifically meant to be used for feature discovery in a library
initialization phase. It's especially useful if the application needs to
keep state around related to the system call across its execution, e.g.
robust futexes.

> 
>> For too-old headers at compile time, one possibility is that we don't event
>> expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway
>> for ABI consistency purposes, then we'd leave its cpu_id field at the initial
>> value (-1). But that would require that we copy linux/rseq.h into the glibc
>> source tree.
> 
> The ABI needs to be independent of the kernel headers used.  I don't think
> you need to copy linux/rseq.h; all you should need is to e.g. define an
> array of suitable size and alignment with the relevant member initialized
> and a suitable explanatory comment.

In that case, I'm thinking declaring a minimal structure in glibc code may be
clearer than the array, e.g.:

[pthreadP.h]

enum libc_rseq_cpu_id_state {
  LIBC_RSEQ_CPU_ID_UNINITIALIZED = -1,
  LIBC_RSEQ_CPU_ID_REGISTRATION_FAILED = -2,
};

/* linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI
   size is 20 bytes. For support of multiple rseq users within a process,
   user-space defines an extra 4 bytes field as a reference count, for a
   total of 24 bytes.  */
struct libc_rseq {
  /* kernel-userspace ABI. */
  uint32_t cpu_id_start;
  uint32_t cpu_id;
  uint64_t rseq_cs;
  uint32_t flags;
  /* user-space ABI. */
  uint32_t refcount;
} __attribute__((aligned(4 * sizeof(uint64_t))));

[pthread_create.h]

__thread volatile struct libc_rseq __rseq_abi = {
  .cpu_id = LIBC_RSEQ_CPU_ID_UNINITIALIZED,
};

Thanks,

Mathieu

Patch

diff --git a/nptl/Versions b/nptl/Versions
index e7f691da7a..7316c2815d 100644
--- a/nptl/Versions
+++ b/nptl/Versions
@@ -275,6 +275,7 @@  libpthread {
     mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy;
     call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal;
     cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set;
+    __rseq_abi; __rseq_refcount;
   }
 
   GLIBC_PRIVATE {
diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c
index 907411d5bc..0c056c3fce 100644
--- a/nptl/nptl-init.c
+++ b/nptl/nptl-init.c
@@ -431,6 +431,9 @@  __pthread_initialize_minimal_internal (void)
 
   /* Determine whether the machine is SMP or not.  */
   __is_smp = is_smp_system ();
+
+    /* Register rseq ABI to the kernel. */
+  (void) rseq_register_current_thread();
 }
 strong_alias (__pthread_initialize_minimal_internal,
 	      __pthread_initialize_minimal)
diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h
index 13bdb11133..94f54c630e 100644
--- a/nptl/pthreadP.h
+++ b/nptl/pthreadP.h
@@ -33,6 +33,8 @@ 
 #include <kernel-features.h>
 #include <errno.h>
 #include <internal-signals.h>
+//TODO: fix sysdeps include to make it portable to non-Linux systems.
+#include <sysdeps/unix/sysv/linux/rseq.h>
 
 
 /* Atomic operations on TLS memory.  */
@@ -660,4 +662,47 @@  check_stacksize_attr (size_t st)
 		  "offset of " #member " field of " #type " != "	\
 		  ASSERT_PTHREAD_STRING (offset))
 
+//TODO: we may want to move the rseq code to a linux sysdeps file.
+#define RSEQ_SIG 0x53053053
+
+extern __thread volatile struct rseq __rseq_abi;
+extern __thread volatile int __rseq_refcount;
+
+static inline int
+rseq_register_current_thread (void)
+{
+  int rc, ret = 0;
+  INTERNAL_SYSCALL_DECL (err);
+
+  if (atomic_increment_val(&__rseq_refcount) - 1)
+    goto end;
+  rc = INTERNAL_SYSCALL_CALL(rseq, err, &__rseq_abi, sizeof(struct rseq),
+                             0, RSEQ_SIG);
+  if (!rc)
+    goto end;
+  if (INTERNAL_SYSCALL_ERRNO(rc, err) != EBUSY)
+    __rseq_abi.cpu_id = -2;
+  ret = -1;
+  atomic_decrement(&__rseq_refcount);
+end:
+  return ret;
+}
+
+static inline int
+rseq_unregister_current_thread (void)
+{
+  int rc, ret = 0;
+  INTERNAL_SYSCALL_DECL (err);
+
+  if (atomic_decrement_val(&__rseq_refcount))
+    goto end;
+  rc = INTERNAL_SYSCALL_CALL(rseq, err, &__rseq_abi, sizeof(struct rseq),
+                             RSEQ_FLAG_UNREGISTER, RSEQ_SIG);
+  if (!rc)
+    goto end;
+  ret = -1;
+end:
+  return ret;
+}
+
 #endif	/* pthreadP.h */
diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c
index fe75d04113..20ee197d94 100644
--- a/nptl/pthread_create.c
+++ b/nptl/pthread_create.c
@@ -52,6 +52,13 @@  static struct pthread *__nptl_last_event __attribute_used__;
 /* Number of threads running.  */
 unsigned int __nptl_nthreads = 1;
 
+__attribute__((weak, tls_model("initial-exec"))) __thread
+volatile struct rseq __rseq_abi = {
+	.cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+};
+
+__attribute__((weak, tls_model("initial-exec"))) __thread
+volatile int __rseq_refcount;
 
 /* Code to allocate and deallocate a stack.  */
 #include "allocatestack.c"
@@ -378,6 +385,7 @@  __free_tcb (struct pthread *pd)
 START_THREAD_DEFN
 {
   struct pthread *pd = START_THREAD_SELF;
+  bool has_rseq = false;
 
 #if HP_TIMING_AVAIL
   /* Remember the time when the thread was started.  */
@@ -445,6 +453,9 @@  START_THREAD_DEFN
   unwind_buf.priv.data.prev = NULL;
   unwind_buf.priv.data.cleanup = NULL;
 
+  /* Register rseq TLS to the kernel. */
+  has_rseq = !rseq_register_current_thread();
+
   if (__glibc_likely (! not_first_call))
     {
       /* Store the new cleanup handler info.  */
@@ -487,6 +498,10 @@  START_THREAD_DEFN
       THREAD_SETMEM (pd, result, ret);
     }
 
+  /* Unregister rseq TLS from kernel. */
+  if (has_rseq && rseq_unregister_current_thread())
+    abort();
+
   /* Call destructors for the thread_local TLS variables.  */
 #ifndef SHARED
   if (&__call_tls_dtors != NULL)
diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h
new file mode 100644
index 0000000000..9a402fdb60
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/rseq.h
@@ -0,0 +1,147 @@ 
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_RSEQ_H
+#define _UAPI_LINUX_RSEQ_H
+
+/*
+ * linux/rseq.h
+ *
+ * Restartable sequences system call API
+ *
+ * Copyright (c) 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+ */
+
+#include <linux/types.h>
+#include <asm/byteorder.h>
+
+enum rseq_cpu_id_state {
+	RSEQ_CPU_ID_UNINITIALIZED		= -1,
+	RSEQ_CPU_ID_REGISTRATION_FAILED		= -2,
+};
+
+enum rseq_flags {
+	RSEQ_FLAG_UNREGISTER = (1 << 0),
+};
+
+enum rseq_cs_flags_bit {
+	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT	= 0,
+	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT	= 1,
+	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT	= 2,
+};
+
+enum rseq_cs_flags {
+	RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT	=
+		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT),
+	RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL	=
+		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT),
+	RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE	=
+		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
+};
+
+/*
+ * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line. It is usually declared as
+ * link-time constant data.
+ */
+struct rseq_cs {
+	/* Version of this structure. */
+	__u32 version;
+	/* enum rseq_cs_flags */
+	__u32 flags;
+	__u64 start_ip;
+	/* Offset from start_ip. */
+	__u64 post_commit_offset;
+	__u64 abort_ip;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+/*
+ * struct rseq is aligned on 4 * 8 bytes to ensure it is always
+ * contained within a single cache-line.
+ *
+ * A single struct rseq per thread is allowed.
+ */
+struct rseq {
+	/*
+	 * Restartable sequences cpu_id_start field. Updated by the
+	 * kernel. Read by user-space with single-copy atomicity
+	 * semantics. This field should only be read by the thread which
+	 * registered this data structure. Aligned on 32-bit. Always
+	 * contains a value in the range of possible CPUs, although the
+	 * value may not be the actual current CPU (e.g. if rseq is not
+	 * initialized). This CPU number value should always be compared
+	 * against the value of the cpu_id field before performing a rseq
+	 * commit or returning a value read from a data structure indexed
+	 * using the cpu_id_start value.
+	 */
+	__u32 cpu_id_start;
+	/*
+	 * Restartable sequences cpu_id field. Updated by the kernel.
+	 * Read by user-space with single-copy atomicity semantics. This
+	 * field should only be read by the thread which registered this
+	 * data structure. Aligned on 32-bit. Values
+	 * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED
+	 * have a special semantic: the former means "rseq uninitialized",
+	 * and latter means "rseq initialization failed". This value is
+	 * meant to be read within rseq critical sections and compared
+	 * with the cpu_id_start value previously read, before performing
+	 * the commit instruction, or read and compared with the
+	 * cpu_id_start value before returning a value loaded from a data
+	 * structure indexed using the cpu_id_start value.
+	 */
+	__u32 cpu_id;
+	/*
+	 * Restartable sequences rseq_cs field.
+	 *
+	 * Contains NULL when no critical section is active for the current
+	 * thread, or holds a pointer to the currently active struct rseq_cs.
+	 *
+	 * Updated by user-space, which sets the address of the currently
+	 * active rseq_cs at the beginning of assembly instruction sequence
+	 * block, and set to NULL by the kernel when it restarts an assembly
+	 * instruction sequence block, as well as when the kernel detects that
+	 * it is preempting or delivering a signal outside of the range
+	 * targeted by the rseq_cs. Also needs to be set to NULL by user-space
+	 * before reclaiming memory that contains the targeted struct rseq_cs.
+	 *
+	 * Read and set by the kernel. Set by user-space with single-copy
+	 * atomicity semantics. This field should only be updated by the
+	 * thread which registered this data structure. Aligned on 64-bit.
+	 */
+	union {
+		__u64 ptr64;
+#ifdef __LP64__
+		__u64 ptr;
+#else
+		struct {
+#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN)
+			__u32 padding;		/* Initialized to zero. */
+			__u32 ptr32;
+#else /* LITTLE */
+			__u32 ptr32;
+			__u32 padding;		/* Initialized to zero. */
+#endif /* ENDIAN */
+		} ptr;
+#endif
+	} rseq_cs;
+
+	/*
+	 * Restartable sequences flags field.
+	 *
+	 * This field should only be updated by the thread which
+	 * registered this data structure. Read by the kernel.
+	 * Mainly used for single-stepping through rseq critical sections
+	 * with debuggers.
+	 *
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *     Inhibit instruction sequence block restart on preemption
+	 *     for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *     Inhibit instruction sequence block restart on signal
+	 *     delivery for this thread.
+	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
+	 *     Inhibit instruction sequence block restart on migration for
+	 *     this thread.
+	 */
+	__u32 flags;
+} __attribute__((aligned(4 * sizeof(__u64))));
+
+#endif /* _UAPI_LINUX_RSEQ_H */