Message ID | 20180919144438.1066-1-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Headers | show |
Series | [RFC] glibc: Perform rseq(2) registration at nptl init and thread creation | expand |
On Wed, 19 Sep 2018, Mathieu Desnoyers wrote: > Here is a rough prototype registering rseq(2) TLS for each thread > (including main), and unregistering for each thread (excluding > main). "rseq" stands for Restartable Sequences. A final patch would need to add documentation and tests and a NEWS entry and fix various coding style issues. > diff --git a/nptl/Versions b/nptl/Versions > index e7f691da7a..7316c2815d 100644 > --- a/nptl/Versions > +++ b/nptl/Versions > @@ -275,6 +275,7 @@ libpthread { > mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy; > call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal; > cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set; > + __rseq_abi; __rseq_refcount; That's the GLIBC_2.28 section, but 2.28 is already out. New symbols would need to go in GLIBC_2.29 or later (and the ABI test baselines would all need updating). > diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h This looks like it's coming from the Linux kernel. Can't the relevant uapi header just be used directly without copying into glibc (with due care to ensure that glibc still builds if the kernel headers used for the build are too old - you need such conditionals anyway if they don't define the relevant syscall number)?
----- On Sep 19, 2018, at 12:37 PM, Joseph Myers joseph@codesourcery.com wrote: > On Wed, 19 Sep 2018, Mathieu Desnoyers wrote: > >> Here is a rough prototype registering rseq(2) TLS for each thread >> (including main), and unregistering for each thread (excluding >> main). "rseq" stands for Restartable Sequences. > > A final patch would need to add documentation and tests and a NEWS entry > and fix various coding style issues. > >> diff --git a/nptl/Versions b/nptl/Versions >> index e7f691da7a..7316c2815d 100644 >> --- a/nptl/Versions >> +++ b/nptl/Versions >> @@ -275,6 +275,7 @@ libpthread { >> mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy; >> call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal; >> cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set; >> + __rseq_abi; __rseq_refcount; > > That's the GLIBC_2.28 section, but 2.28 is already out. New symbols would > need to go in GLIBC_2.29 or later (and the ABI test baselines would all > need updating). Certainly. I'm just not sure where I need to poke to update the ABI version from 2.28 to 2.29 in my dev branch. > >> diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h > > This looks like it's coming from the Linux kernel. Can't the relevant > uapi header just be used directly without copying into glibc (with due > care to ensure that glibc still builds if the kernel headers used for the > build are too old - you need such conditionals anyway if they don't define > the relevant syscall number)? This is indeed in the list of "things to consider" I've put in the patch commit message. If the usual practice is to build against uapi kernel headers outside of the glibc tree, I'm fine with that. Thanks, Mathieu > > -- > Joseph S. Myers > joseph@codesourcery.com
On Wed, 19 Sep 2018, Mathieu Desnoyers wrote: > > This looks like it's coming from the Linux kernel. Can't the relevant > > uapi header just be used directly without copying into glibc (with due > > care to ensure that glibc still builds if the kernel headers used for the > > build are too old - you need such conditionals anyway if they don't define > > the relevant syscall number)? > > This is indeed in the list of "things to consider" I've put in the patch > commit message. If the usual practice is to build against uapi kernel headers > outside of the glibc tree, I'm fine with that. We build with, currently, 3.2 or later headers (since 3.2 is EOL there's a case for updating the minimum in glibc for both compile time and runtime, but I haven't proposed that since there isn't much cleanup that would enable and there's the open question of Carlos's proposal to eliminate the runtime check on the kernel version and just let things try to run anyway even if it's older than the configured minimum). Functions depending on new syscalls may return ENOSYS errors if the headers used to build glibc were too old. Since this patch is providing a data interface rather than functions that can set errno to ENOSYS, presumably you have some other way of signalling unavailability which would apply both with a too-old kernel at runtime and too-old headers at compile time.
On 19/09/18 15:44, Mathieu Desnoyers wrote: > Things to consider: > > - Move __rseq_refcount to an extra field at the end of __rseq_abi to > eliminate one symbol. This would require to wrap struct rseq into > e.g. struct rseq_lib or such, e.g.: > > struct rseq_lib { > struct rseq kabi; > int refcount; > }; > > All libraries/programs which try to register rseq (glibc, early-adopter > applications, early-adopter libraries) should use the rseq refcount. > It becomes part of the ABI within a user-space process, but it's not > part of the ABI shared with the kernel per se. > > - Restructure how this code is organized so glibc keeps building on > non-Linux targets. > > - We do not need an atomic increment/decrement for the refcount per > se. Just being atomic with respect to the current thread (and nested > signals) would be enough. What is the proper API to use there ? > Should we expose struct rseq_lib in a public glibc header ? Should > we create a rseq(3) man page ? > > - Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we > want a non-weak symbol there ? (and let all other early user > libraries use weak) > i don't think there is precedent for exposing tls symbol in glibc (e.g. errno is exposed via __errno_location function) so there might be issues with this (but i don't have immediate concerns). > diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c > index fe75d04113..20ee197d94 100644 > --- a/nptl/pthread_create.c > +++ b/nptl/pthread_create.c > @@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__; > /* Number of threads running. */ > unsigned int __nptl_nthreads = 1; > > +__attribute__((weak, tls_model("initial-exec"))) __thread > +volatile struct rseq __rseq_abi = { > + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, > +}; > + > +__attribute__((weak, tls_model("initial-exec"))) __thread > +volatile int __rseq_refcount; > note that libpthread.so is built with -ftls-model=initial-exec (and if it wasn't then you'd want to put the attribute on the declaration in the internal header file, not on the definition, so the actual tls accesses generate the right code)
On Wed, 19 Sep 2018, Szabolcs Nagy wrote: > i don't think there is precedent for exposing tls symbol in glibc > (e.g. errno is exposed via __errno_location function) so there > might be issues with this (but i don't have immediate concerns). There have been suggestions to expose TLS errno - but also suggestions that use of __errno_location is more efficient, at least in terms of code size everywhere errno is accessed (for some ABIs, anyway). The ABI tests have code that would list .tbss symbols as "T" in ABI test baselines, but no existing ABI baselines use that.
----- On Sep 19, 2018, at 3:49 PM, Joseph Myers joseph@codesourcery.com wrote: > On Wed, 19 Sep 2018, Szabolcs Nagy wrote: > >> i don't think there is precedent for exposing tls symbol in glibc >> (e.g. errno is exposed via __errno_location function) so there >> might be issues with this (but i don't have immediate concerns). > > There have been suggestions to expose TLS errno - but also suggestions > that use of __errno_location is more efficient, at least in terms of code > size everywhere errno is accessed (for some ABIs, anyway). AFAIU, the trade-off is different between the errno use-case and the rseq use-case. If my understanding is correct, errno is not supposed to be used in fast-paths, only when an error is returned. So size is more important than speed there. Comparatively, rseq is _meant_ to speed up fast-paths. A per-cpu statistics counter can be incremented in 2ns with rseq on a Intel E5-2630, which is faster than a simple function call. So for rseq, we should favor speed over space, which means the user applications and libraries would need access to the TLS symbol without requiring an accessor function. That's also why I'm using the initial-exec tls model rather than the global-dynamic: I want to make sure no function call is generated there. > The ABI tests have code that would list .tbss symbols as "T" in ABI test > baselines, but no existing ABI baselines use that. Thanks, Mathieu
----- On Sep 19, 2018, at 1:38 PM, Szabolcs Nagy szabolcs.nagy@arm.com wrote: > On 19/09/18 15:44, Mathieu Desnoyers wrote: >> Things to consider: >> >> - Move __rseq_refcount to an extra field at the end of __rseq_abi to >> eliminate one symbol. This would require to wrap struct rseq into >> e.g. struct rseq_lib or such, e.g.: >> >> struct rseq_lib { >> struct rseq kabi; >> int refcount; >> }; >> >> All libraries/programs which try to register rseq (glibc, early-adopter >> applications, early-adopter libraries) should use the rseq refcount. >> It becomes part of the ABI within a user-space process, but it's not >> part of the ABI shared with the kernel per se. >> >> - Restructure how this code is organized so glibc keeps building on >> non-Linux targets. >> >> - We do not need an atomic increment/decrement for the refcount per >> se. Just being atomic with respect to the current thread (and nested >> signals) would be enough. What is the proper API to use there ? >> Should we expose struct rseq_lib in a public glibc header ? Should >> we create a rseq(3) man page ? >> >> - Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we >> want a non-weak symbol there ? (and let all other early user >> libraries use weak) >> > > i don't think there is precedent for exposing tls symbol in glibc > (e.g. errno is exposed via __errno_location function) so there > might be issues with this (but i don't have immediate concerns). > >> diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c >> index fe75d04113..20ee197d94 100644 >> --- a/nptl/pthread_create.c >> +++ b/nptl/pthread_create.c >> @@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__; >> /* Number of threads running. */ >> unsigned int __nptl_nthreads = 1; >> >> +__attribute__((weak, tls_model("initial-exec"))) __thread >> +volatile struct rseq __rseq_abi = { >> + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, >> +}; >> + >> +__attribute__((weak, tls_model("initial-exec"))) __thread >> +volatile int __rseq_refcount; >> > > note that libpthread.so is built with -ftls-model=initial-exec Which would indeed make these annotations redundant. I'll remove them. > (and if it wasn't then you'd want to put the attribute on the > declaration in the internal header file, not on the definition, > so the actual tls accesses generate the right code) This area is one where I'm still uneasy on my comprehension of the details, especially that it goes in a different direction than what you are recommending. I've read through https://www.akkadia.org/drepper/tls.pdf Section 5 "Linker Optimizations" to try to figure it out, and I end up being under the impression that applying the tls_model("initial-exec") attribute to a symbol declaration in a header file does not have much impact on the accesses that use that variable. Reading through that section, it seems that the variable definition is the one that matters, and then the compiler/linker/loader are tweaking the sites that reference the TLS variable through code rewrite based on the most efficient mechanism that each phase knows can be used at each stage. What am I missing ? Thanks, Mathieu
On 19/09/18 22:01, Mathieu Desnoyers wrote: > ----- On Sep 19, 2018, at 1:38 PM, Szabolcs Nagy szabolcs.nagy@arm.com wrote: >> note that libpthread.so is built with -ftls-model=initial-exec > > Which would indeed make these annotations redundant. I'll remove > them. > >> (and if it wasn't then you'd want to put the attribute on the >> declaration in the internal header file, not on the definition, >> so the actual tls accesses generate the right code) > > This area is one where I'm still uneasy on my comprehension of > the details, especially that it goes in a different direction than > what you are recommending. > > I've read through https://www.akkadia.org/drepper/tls.pdf Section 5 > "Linker Optimizations" to try to figure it out, and I end up being > under the impression that applying the tls_model("initial-exec") > attribute to a symbol declaration in a header file does not have > much impact on the accesses that use that variable. Reading through > that section, it seems that the variable definition is the one that > matters, and then the compiler/linker/loader are tweaking the sites > that reference the TLS variable through code rewrite based on the > most efficient mechanism that each phase knows can be used at each > stage. > > What am I missing ? in general if you rely on linker relaxations you may not get optimal code because the linker cannot remove instructions, just nop them out. (e.g. on aarch64 an initial-exec access is 4 instructions a general dynamic (tlsdesc) access is 6 instructions + it involves a call, so the return address has to be saved and restored (+ 3 instructions for stack operations if there were none otherwise, which the linker cannot change))
----- On Sep 19, 2018, at 12:37 PM, Joseph Myers joseph@codesourcery.com wrote: > On Wed, 19 Sep 2018, Mathieu Desnoyers wrote: > >> Here is a rough prototype registering rseq(2) TLS for each thread >> (including main), and unregistering for each thread (excluding >> main). "rseq" stands for Restartable Sequences. > > A final patch would need to add documentation and tests and a NEWS entry > and fix various coding style issues. Sure, > >> diff --git a/nptl/Versions b/nptl/Versions >> index e7f691da7a..7316c2815d 100644 >> --- a/nptl/Versions >> +++ b/nptl/Versions >> @@ -275,6 +275,7 @@ libpthread { >> mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy; >> call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal; >> cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set; >> + __rseq_abi; __rseq_refcount; > > That's the GLIBC_2.28 section, but 2.28 is already out. New symbols would > need to go in GLIBC_2.29 or later (and the ABI test baselines would all > need updating). OK > >> diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h > > This looks like it's coming from the Linux kernel. Can't the relevant > uapi header just be used directly without copying into glibc (with due > care to ensure that glibc still builds if the kernel headers used for the > build are too old - you need such conditionals anyway if they don't define > the relevant syscall number)? Something like this in pthreadP.h ? +#ifdef __NR_rseq +#include <sysdeps/unix/sysv/linux/rseq-internal.h> +#else +#include <sysdeps/nptl/rseq-internal.h> +#endif /* __NR_rseq. */ where sysdeps/unix/sysv/linux/rseq-internal.h contains the linux implementation of rseq_register_current_thread () and rseq_unregister_current_thread (), and sysdeps/nptl/rseq-internal.h contains stubs. Am I on the right track ? Thanks, Mathieu > > -- > Joseph S. Myers > joseph@codesourcery.com
----- On Sep 19, 2018, at 1:10 PM, Joseph Myers joseph@codesourcery.com wrote: > On Wed, 19 Sep 2018, Mathieu Desnoyers wrote: > >> > This looks like it's coming from the Linux kernel. Can't the relevant >> > uapi header just be used directly without copying into glibc (with due >> > care to ensure that glibc still builds if the kernel headers used for the >> > build are too old - you need such conditionals anyway if they don't define >> > the relevant syscall number)? >> >> This is indeed in the list of "things to consider" I've put in the patch >> commit message. If the usual practice is to build against uapi kernel headers >> outside of the glibc tree, I'm fine with that. > > We build with, currently, 3.2 or later headers (since 3.2 is EOL there's a > case for updating the minimum in glibc for both compile time and runtime, > but I haven't proposed that since there isn't much cleanup that would > enable and there's the open question of Carlos's proposal to eliminate the > runtime check on the kernel version and just let things try to run anyway > even if it's older than the configured minimum). Are you saying glibc has an explicit check for the kernel version visible from /proc before using specific features ? If so, how can this work with the variety of feature backports we find in the distribution kernels out there ? Checking whether specific system calls return ENOSYS errors seems more flexible. > Functions depending on > new syscalls may return ENOSYS errors if the headers used to build glibc > were too old. Since this patch is providing a data interface rather than > functions that can set errno to ENOSYS, presumably you have some other way > of signalling unavailability which would apply both with a too-old kernel > at runtime and too-old headers at compile time. For too-old kernel at runtime, having rseq registration return ENOSYS leaves the the content of __rseq_abi->cpu_id at its initial value (RSEQ_CPU_ID_UNINITIALIZED = -1). For too-old headers at compile time, one possibility is that we don't event expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway for ABI consistency purposes, then we'd leave its cpu_id field at the initial value (-1). But that would require that we copy linux/rseq.h into the glibc source tree. Thoughts ? Thanks, Mathieu > > -- > Joseph S. Myers > joseph@codesourcery.com
On Thu, 20 Sep 2018, Mathieu Desnoyers wrote: > Are you saying glibc has an explicit check for the kernel version visible > from /proc before using specific features ? If so, how can this work with > the variety of feature backports we find in the distribution kernels out > there ? See sysdeps/unix/sysv/linux/dl-sysdep.c and sysdeps/unix/sysv/linux/dl-osinfo.h. As I said, Carlos has proposed removing that check. > For too-old headers at compile time, one possibility is that we don't event > expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway > for ABI consistency purposes, then we'd leave its cpu_id field at the initial > value (-1). But that would require that we copy linux/rseq.h into the glibc > source tree. The ABI needs to be independent of the kernel headers used. I don't think you need to copy linux/rseq.h; all you should need is to e.g. define an array of suitable size and alignment with the relevant member initialized and a suitable explanatory comment.
On Thu, 20 Sep 2018, Mathieu Desnoyers wrote: > Something like this in pthreadP.h ? > > +#ifdef __NR_rseq > +#include <sysdeps/unix/sysv/linux/rseq-internal.h> > +#else > +#include <sysdeps/nptl/rseq-internal.h> > +#endif /* __NR_rseq. */ > > where sysdeps/unix/sysv/linux/rseq-internal.h contains the linux > implementation of rseq_register_current_thread () and > rseq_unregister_current_thread (), and sysdeps/nptl/rseq-internal.h > contains stubs. > > Am I on the right track ? It's hard to define the right abstractions for what goes where given that only Linux uses NPTL since the removal of the NaCl port. I suppose it does make logical sense for a #include of <linux/rseq.h> to go somewhere under sysdeps/unix/sysv/linux/.
----- On Sep 20, 2018, at 4:20 PM, Joseph Myers joseph@codesourcery.com wrote: > On Thu, 20 Sep 2018, Mathieu Desnoyers wrote: > >> Are you saying glibc has an explicit check for the kernel version visible >> from /proc before using specific features ? If so, how can this work with >> the variety of feature backports we find in the distribution kernels out >> there ? > > See sysdeps/unix/sysv/linux/dl-sysdep.c and > sysdeps/unix/sysv/linux/dl-osinfo.h. As I said, Carlos has proposed > removing that check. For the system calls I implement and maintain, I typically ensure there is a set of parameters that can be used when issuing the system call so it can either succeed or fail with ENOSYS without having side-effects. It's specifically meant to be used for feature discovery in a library initialization phase. It's especially useful if the application needs to keep state around related to the system call across its execution, e.g. robust futexes. > >> For too-old headers at compile time, one possibility is that we don't event >> expose the __rseq_abi TLS symbol. OTOH, if we need to keep exposing it anyway >> for ABI consistency purposes, then we'd leave its cpu_id field at the initial >> value (-1). But that would require that we copy linux/rseq.h into the glibc >> source tree. > > The ABI needs to be independent of the kernel headers used. I don't think > you need to copy linux/rseq.h; all you should need is to e.g. define an > array of suitable size and alignment with the relevant member initialized > and a suitable explanatory comment. In that case, I'm thinking declaring a minimal structure in glibc code may be clearer than the array, e.g.: [pthreadP.h] enum libc_rseq_cpu_id_state { LIBC_RSEQ_CPU_ID_UNINITIALIZED = -1, LIBC_RSEQ_CPU_ID_REGISTRATION_FAILED = -2, }; /* linux/rseq.h defines struct rseq as aligned on 32 bytes. The kernel ABI size is 20 bytes. For support of multiple rseq users within a process, user-space defines an extra 4 bytes field as a reference count, for a total of 24 bytes. */ struct libc_rseq { /* kernel-userspace ABI. */ uint32_t cpu_id_start; uint32_t cpu_id; uint64_t rseq_cs; uint32_t flags; /* user-space ABI. */ uint32_t refcount; } __attribute__((aligned(4 * sizeof(uint64_t)))); [pthread_create.h] __thread volatile struct libc_rseq __rseq_abi = { .cpu_id = LIBC_RSEQ_CPU_ID_UNINITIALIZED, }; Thanks, Mathieu
diff --git a/nptl/Versions b/nptl/Versions index e7f691da7a..7316c2815d 100644 --- a/nptl/Versions +++ b/nptl/Versions @@ -275,6 +275,7 @@ libpthread { mtx_init; mtx_lock; mtx_timedlock; mtx_trylock; mtx_unlock; mtx_destroy; call_once; cnd_broadcast; cnd_destroy; cnd_init; cnd_signal; cnd_timedwait; cnd_wait; tss_create; tss_delete; tss_get; tss_set; + __rseq_abi; __rseq_refcount; } GLIBC_PRIVATE { diff --git a/nptl/nptl-init.c b/nptl/nptl-init.c index 907411d5bc..0c056c3fce 100644 --- a/nptl/nptl-init.c +++ b/nptl/nptl-init.c @@ -431,6 +431,9 @@ __pthread_initialize_minimal_internal (void) /* Determine whether the machine is SMP or not. */ __is_smp = is_smp_system (); + + /* Register rseq ABI to the kernel. */ + (void) rseq_register_current_thread(); } strong_alias (__pthread_initialize_minimal_internal, __pthread_initialize_minimal) diff --git a/nptl/pthreadP.h b/nptl/pthreadP.h index 13bdb11133..94f54c630e 100644 --- a/nptl/pthreadP.h +++ b/nptl/pthreadP.h @@ -33,6 +33,8 @@ #include <kernel-features.h> #include <errno.h> #include <internal-signals.h> +//TODO: fix sysdeps include to make it portable to non-Linux systems. +#include <sysdeps/unix/sysv/linux/rseq.h> /* Atomic operations on TLS memory. */ @@ -660,4 +662,47 @@ check_stacksize_attr (size_t st) "offset of " #member " field of " #type " != " \ ASSERT_PTHREAD_STRING (offset)) +//TODO: we may want to move the rseq code to a linux sysdeps file. +#define RSEQ_SIG 0x53053053 + +extern __thread volatile struct rseq __rseq_abi; +extern __thread volatile int __rseq_refcount; + +static inline int +rseq_register_current_thread (void) +{ + int rc, ret = 0; + INTERNAL_SYSCALL_DECL (err); + + if (atomic_increment_val(&__rseq_refcount) - 1) + goto end; + rc = INTERNAL_SYSCALL_CALL(rseq, err, &__rseq_abi, sizeof(struct rseq), + 0, RSEQ_SIG); + if (!rc) + goto end; + if (INTERNAL_SYSCALL_ERRNO(rc, err) != EBUSY) + __rseq_abi.cpu_id = -2; + ret = -1; + atomic_decrement(&__rseq_refcount); +end: + return ret; +} + +static inline int +rseq_unregister_current_thread (void) +{ + int rc, ret = 0; + INTERNAL_SYSCALL_DECL (err); + + if (atomic_decrement_val(&__rseq_refcount)) + goto end; + rc = INTERNAL_SYSCALL_CALL(rseq, err, &__rseq_abi, sizeof(struct rseq), + RSEQ_FLAG_UNREGISTER, RSEQ_SIG); + if (!rc) + goto end; + ret = -1; +end: + return ret; +} + #endif /* pthreadP.h */ diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c index fe75d04113..20ee197d94 100644 --- a/nptl/pthread_create.c +++ b/nptl/pthread_create.c @@ -52,6 +52,13 @@ static struct pthread *__nptl_last_event __attribute_used__; /* Number of threads running. */ unsigned int __nptl_nthreads = 1; +__attribute__((weak, tls_model("initial-exec"))) __thread +volatile struct rseq __rseq_abi = { + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, +}; + +__attribute__((weak, tls_model("initial-exec"))) __thread +volatile int __rseq_refcount; /* Code to allocate and deallocate a stack. */ #include "allocatestack.c" @@ -378,6 +385,7 @@ __free_tcb (struct pthread *pd) START_THREAD_DEFN { struct pthread *pd = START_THREAD_SELF; + bool has_rseq = false; #if HP_TIMING_AVAIL /* Remember the time when the thread was started. */ @@ -445,6 +453,9 @@ START_THREAD_DEFN unwind_buf.priv.data.prev = NULL; unwind_buf.priv.data.cleanup = NULL; + /* Register rseq TLS to the kernel. */ + has_rseq = !rseq_register_current_thread(); + if (__glibc_likely (! not_first_call)) { /* Store the new cleanup handler info. */ @@ -487,6 +498,10 @@ START_THREAD_DEFN THREAD_SETMEM (pd, result, ret); } + /* Unregister rseq TLS from kernel. */ + if (has_rseq && rseq_unregister_current_thread()) + abort(); + /* Call destructors for the thread_local TLS variables. */ #ifndef SHARED if (&__call_tls_dtors != NULL) diff --git a/sysdeps/unix/sysv/linux/rseq.h b/sysdeps/unix/sysv/linux/rseq.h new file mode 100644 index 0000000000..9a402fdb60 --- /dev/null +++ b/sysdeps/unix/sysv/linux/rseq.h @@ -0,0 +1,147 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_RSEQ_H +#define _UAPI_LINUX_RSEQ_H + +/* + * linux/rseq.h + * + * Restartable sequences system call API + * + * Copyright (c) 2015-2018 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> + */ + +#include <linux/types.h> +#include <asm/byteorder.h> + +enum rseq_cpu_id_state { + RSEQ_CPU_ID_UNINITIALIZED = -1, + RSEQ_CPU_ID_REGISTRATION_FAILED = -2, +}; + +enum rseq_flags { + RSEQ_FLAG_UNREGISTER = (1 << 0), +}; + +enum rseq_cs_flags_bit { + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT = 0, + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT = 1, + RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT = 2, +}; + +enum rseq_cs_flags { + RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT), + RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT), + RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE = + (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), +}; + +/* + * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. It is usually declared as + * link-time constant data. + */ +struct rseq_cs { + /* Version of this structure. */ + __u32 version; + /* enum rseq_cs_flags */ + __u32 flags; + __u64 start_ip; + /* Offset from start_ip. */ + __u64 post_commit_offset; + __u64 abort_ip; +} __attribute__((aligned(4 * sizeof(__u64)))); + +/* + * struct rseq is aligned on 4 * 8 bytes to ensure it is always + * contained within a single cache-line. + * + * A single struct rseq per thread is allowed. + */ +struct rseq { + /* + * Restartable sequences cpu_id_start field. Updated by the + * kernel. Read by user-space with single-copy atomicity + * semantics. This field should only be read by the thread which + * registered this data structure. Aligned on 32-bit. Always + * contains a value in the range of possible CPUs, although the + * value may not be the actual current CPU (e.g. if rseq is not + * initialized). This CPU number value should always be compared + * against the value of the cpu_id field before performing a rseq + * commit or returning a value read from a data structure indexed + * using the cpu_id_start value. + */ + __u32 cpu_id_start; + /* + * Restartable sequences cpu_id field. Updated by the kernel. + * Read by user-space with single-copy atomicity semantics. This + * field should only be read by the thread which registered this + * data structure. Aligned on 32-bit. Values + * RSEQ_CPU_ID_UNINITIALIZED and RSEQ_CPU_ID_REGISTRATION_FAILED + * have a special semantic: the former means "rseq uninitialized", + * and latter means "rseq initialization failed". This value is + * meant to be read within rseq critical sections and compared + * with the cpu_id_start value previously read, before performing + * the commit instruction, or read and compared with the + * cpu_id_start value before returning a value loaded from a data + * structure indexed using the cpu_id_start value. + */ + __u32 cpu_id; + /* + * Restartable sequences rseq_cs field. + * + * Contains NULL when no critical section is active for the current + * thread, or holds a pointer to the currently active struct rseq_cs. + * + * Updated by user-space, which sets the address of the currently + * active rseq_cs at the beginning of assembly instruction sequence + * block, and set to NULL by the kernel when it restarts an assembly + * instruction sequence block, as well as when the kernel detects that + * it is preempting or delivering a signal outside of the range + * targeted by the rseq_cs. Also needs to be set to NULL by user-space + * before reclaiming memory that contains the targeted struct rseq_cs. + * + * Read and set by the kernel. Set by user-space with single-copy + * atomicity semantics. This field should only be updated by the + * thread which registered this data structure. Aligned on 64-bit. + */ + union { + __u64 ptr64; +#ifdef __LP64__ + __u64 ptr; +#else + struct { +#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN) + __u32 padding; /* Initialized to zero. */ + __u32 ptr32; +#else /* LITTLE */ + __u32 ptr32; + __u32 padding; /* Initialized to zero. */ +#endif /* ENDIAN */ + } ptr; +#endif + } rseq_cs; + + /* + * Restartable sequences flags field. + * + * This field should only be updated by the thread which + * registered this data structure. Read by the kernel. + * Mainly used for single-stepping through rseq critical sections + * with debuggers. + * + * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT + * Inhibit instruction sequence block restart on preemption + * for this thread. + * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL + * Inhibit instruction sequence block restart on signal + * delivery for this thread. + * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE + * Inhibit instruction sequence block restart on migration for + * this thread. + */ + __u32 flags; +} __attribute__((aligned(4 * sizeof(__u64)))); + +#endif /* _UAPI_LINUX_RSEQ_H */
Here is a rough prototype registering rseq(2) TLS for each thread (including main), and unregistering for each thread (excluding main). "rseq" stands for Restartable Sequences. Things to consider: - Move __rseq_refcount to an extra field at the end of __rseq_abi to eliminate one symbol. This would require to wrap struct rseq into e.g. struct rseq_lib or such, e.g.: struct rseq_lib { struct rseq kabi; int refcount; }; All libraries/programs which try to register rseq (glibc, early-adopter applications, early-adopter libraries) should use the rseq refcount. It becomes part of the ABI within a user-space process, but it's not part of the ABI shared with the kernel per se. - Restructure how this code is organized so glibc keeps building on non-Linux targets. - We do not need an atomic increment/decrement for the refcount per se. Just being atomic with respect to the current thread (and nested signals) would be enough. What is the proper API to use there ? Should we expose struct rseq_lib in a public glibc header ? Should we create a rseq(3) man page ? - Revisit use of "weak" symbol for __rseq_abi in glibc. Perhaps we want a non-weak symbol there ? (and let all other early user libraries use weak) - Should we pull linux/rseq.h from the Linux kernel into the glibc tree or keep a dependency on installed kernel headers ? - I plan to host a librseq library containing mostly helper headers that vastly simplify using rseq. It will contain a librseq.so which provides an API to explicitly register rseq to the kernel (for early adopters). The refcounting mechanism in the TLS would be used to ensure combining librseq.so and libc.so.6 with rseq support works fine. - How early do we want to register rseq and how late do we want to unregister it ? It's important to consider if we expect rseq to be used by the memory allocator and within destructor callbacks. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> CC: Carlos O'Donell <carlos@redhat.com> CC: Florian Weimer <fweimer@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ben Maurer <bmaurer@fb.com> CC: Peter Zijlstra <peterz@infradead.org> CC: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> CC: Boqun Feng <boqun.feng@gmail.com> CC: Will Deacon <will.deacon@arm.com> CC: Dave Watson <davejwatson@fb.com> CC: Paul Turner <pjt@google.com> CC: libc-alpha@sourceware.org CC: linux-kernel@vger.kernel.org CC: linux-api@vger.kernel.org --- nptl/Versions | 1 + nptl/nptl-init.c | 3 + nptl/pthreadP.h | 45 ++++++++++ nptl/pthread_create.c | 15 ++++ sysdeps/unix/sysv/linux/rseq.h | 147 +++++++++++++++++++++++++++++++++ 5 files changed, 211 insertions(+) create mode 100644 sysdeps/unix/sysv/linux/rseq.h