[RFC] pthread_setspecific: Provide signal-safety across keys
diff mbox series

Message ID 20171017212919.741-1-mathieu.desnoyers@efficios.com
State New
Headers show
Series
  • [RFC] pthread_setspecific: Provide signal-safety across keys
Related show

Commit Message

Mathieu Desnoyers Oct. 17, 2017, 9:29 p.m. UTC
The intent here is to provide signal-safety against callers to
pthread_{get/set}specific that work on different pthread keys,
without hurting performance of the normal use-cases that do not
care about signal-safety.

One thing to keep in mind is that callers of pthread_{get/set}specific
on a given key can disable signals if they require signal-safety
against operations on their key.

The real problem in my situation (lttng-ust tracer) is having
pthread_setspecific (invoked by the application) do memory allocation
and update pointers without disabling signals when it needs to allocate
"level2". This only happens the first time a key >=
PTHREAD_KEY_2NDLEVEL_SIZE is encountered by a thread. If lttng-ust
tracing inserted into a signal handler nests over this allocation, the
application pthread key specific may be lost, and memory leaked.

So I wonder how acceptable it would be to make just the memory
allocation part of pthread_setspecific signal-safe ? Technically, this
is not required very often, so it should not cause a significant
overhead.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: "Carlos O'Donell" <carlos@redhat.com>
CC: "Ben Maurer" <bmaurer@fb.com>
CC: libc-alpha@sourceware.org
---
 nptl/pthread_setspecific.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Comments

Florian Weimer Oct. 17, 2017, 9:54 p.m. UTC | #1
* Mathieu Desnoyers:

>  	  if (value == NULL)
>  	    /* We don't have to do anything.  The value would in any case
>  	       be NULL.  We can save the memory allocation.  */
>  	    return 0;
>  
> +	  /* Ensure that pthread_setspecific and pthread_getspecific are
> +             signal-safe when used on different keys.  */
> +          sigfillset (&ss);
> +          pthread_sigmask (SIG_BLOCK, &ss, &old_ss);
> +          /* Read level2 again with signals disabled.  */
> +          level2 = THREAD_GETMEM_NC (self, specific, idx1st);
> +          if (level2 != NULL)
> +            goto skip_resize;
> +
>  	  level2
>  	    = (struct pthread_key_data *) calloc (PTHREAD_KEY_2NDLEVEL_SIZE,
>  						  sizeof (*level2));

I don't see how this adds signal safety.  If the function is called
from a signal handler, interrupting the malloc code, then this will
still result in undefined behavior (heap corruption or deadlocks).

Does lttng-ust call these functions even if the application does not?
Then you really need to switch to real thread-local variables with the
intial-exec model, possibly with indirection through a thread-local
pointer variable and on-demand allocation using mmap in case you do
not want to have these allocations in every thread.
Joseph Myers Oct. 17, 2017, 9:55 p.m. UTC | #2
On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:

> The intent here is to provide signal-safety against callers to
> pthread_{get/set}specific that work on different pthread keys,
> without hurting performance of the normal use-cases that do not
> care about signal-safety.

If we wish to provide particular safety guarantees beyond whatever is 
required by POSIX, I think they should be documented in the glibc manual 
as a GNU extension.
Carlos O'Donell Oct. 17, 2017, 10:01 p.m. UTC | #3
On 10/17/2017 02:54 PM, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
>>  	  if (value == NULL)
>>  	    /* We don't have to do anything.  The value would in any case
>>  	       be NULL.  We can save the memory allocation.  */
>>  	    return 0;
>>  
>> +	  /* Ensure that pthread_setspecific and pthread_getspecific are
>> +             signal-safe when used on different keys.  */
>> +          sigfillset (&ss);
>> +          pthread_sigmask (SIG_BLOCK, &ss, &old_ss);
>> +          /* Read level2 again with signals disabled.  */
>> +          level2 = THREAD_GETMEM_NC (self, specific, idx1st);
>> +          if (level2 != NULL)
>> +            goto skip_resize;
>> +
>>  	  level2
>>  	    = (struct pthread_key_data *) calloc (PTHREAD_KEY_2NDLEVEL_SIZE,
>>  						  sizeof (*level2));
> 
> I don't see how this adds signal safety.  If the function is called
> from a signal handler, interrupting the malloc code, then this will
> still result in undefined behavior (heap corruption or deadlocks).
> 
> Does lttng-ust call these functions even if the application does not?
> Then you really need to switch to real thread-local variables with the
> intial-exec model, possibly with indirection through a thread-local
> pointer variable and on-demand allocation using mmap in case you do
> not want to have these allocations in every thread.
 
Agreed. The calloc proceeds in AS context and that leads to undefined
behaviour.

Please also read Florian's recommendation very carefully, you must us
IE model TLS vars to ensure pre-allocation and avoid potential OOM
during first variable touch, and to avoid first variable touch in a
signal handler resulting in the same problem as above (though we are
looking to fix this).
Mathieu Desnoyers Oct. 17, 2017, 10:09 p.m. UTC | #4
----- On Oct 17, 2017, at 5:54 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>>  	  if (value == NULL)
>>  	    /* We don't have to do anything.  The value would in any case
>>  	       be NULL.  We can save the memory allocation.  */
>>  	    return 0;
>>  
>> +	  /* Ensure that pthread_setspecific and pthread_getspecific are
>> +             signal-safe when used on different keys.  */
>> +          sigfillset (&ss);
>> +          pthread_sigmask (SIG_BLOCK, &ss, &old_ss);
>> +          /* Read level2 again with signals disabled.  */
>> +          level2 = THREAD_GETMEM_NC (self, specific, idx1st);
>> +          if (level2 != NULL)
>> +            goto skip_resize;
>> +
>>  	  level2
>>  	    = (struct pthread_key_data *) calloc (PTHREAD_KEY_2NDLEVEL_SIZE,
>>  						  sizeof (*level2));
> 
> I don't see how this adds signal safety.  If the function is called
> from a signal handler, interrupting the malloc code, then this will
> still result in undefined behavior (heap corruption or deadlocks).

Good point, I should probably turn this calloc into a on-demand mmap
or such then.

> 
> Does lttng-ust call these functions even if the application does not?

Yes.

> Then you really need to switch to real thread-local variables with the
> intial-exec model, possibly with indirection through a thread-local
> pointer variable and on-demand allocation using mmap in case you do
> not want to have these allocations in every thread.

Actually, liburcu-bp, which is the library actually using
pthread_setspecific, does use a TLS pointer to a mmap'd region.

The __thread variable sits in a library shared object, so it's not
possible to use the initial-exec model. I work around this by touching
the object once in a library constructor before other threads are
launched. AFAIU, this takes care of performing the TLS lazy fixup.

Thanks,

Mathieu
Carlos O'Donell Oct. 17, 2017, 10:09 p.m. UTC | #5
On 10/17/2017 03:09 PM, Mathieu Desnoyers wrote:
> The __thread variable sits in a library shared object, so it's not
> possible to use the initial-exec model. I work around this by touching
> the object once in a library constructor before other threads are
> launched. AFAIU, this takes care of performing the TLS lazy fixup.

You *can* use IE model, but doing so *might* cause your library not to
load if we run out of static TLS space (some amount is reserved by the
static linker for loading such libraries that use IE model TLS).
The static TLS space is therefore something like a distro-wide global
resources :-)

However, we don't recommend IE model if you can get away with the
performance of GD model and can manage to touch the variable early
before it might be used in a signal handler.
Carlos O'Donell Oct. 17, 2017, 10:10 p.m. UTC | #6
On 10/17/2017 03:10 PM, Mathieu Desnoyers wrote:
>> Please also read Florian's recommendation very carefully, you must us
>> IE model TLS vars to ensure pre-allocation and avoid potential OOM
>> during first variable touch, and to avoid first variable touch in a
>> signal handler resulting in the same problem as above (though we are
>> looking to fix this).
> 
> Hopefully touching the TLS variable from a library constructor sides-steps
> this issue. This has worked well in practice so far.

It does. Thank you for the clarification.
Mathieu Desnoyers Oct. 17, 2017, 10:10 p.m. UTC | #7
----- On Oct 17, 2017, at 6:01 PM, Carlos O'Donell carlos@redhat.com wrote:

> On 10/17/2017 02:54 PM, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>> 
>>>  	  if (value == NULL)
>>>  	    /* We don't have to do anything.  The value would in any case
>>>  	       be NULL.  We can save the memory allocation.  */
>>>  	    return 0;
>>>  
>>> +	  /* Ensure that pthread_setspecific and pthread_getspecific are
>>> +             signal-safe when used on different keys.  */
>>> +          sigfillset (&ss);
>>> +          pthread_sigmask (SIG_BLOCK, &ss, &old_ss);
>>> +          /* Read level2 again with signals disabled.  */
>>> +          level2 = THREAD_GETMEM_NC (self, specific, idx1st);
>>> +          if (level2 != NULL)
>>> +            goto skip_resize;
>>> +
>>>  	  level2
>>>  	    = (struct pthread_key_data *) calloc (PTHREAD_KEY_2NDLEVEL_SIZE,
>>>  						  sizeof (*level2));
>> 
>> I don't see how this adds signal safety.  If the function is called
>> from a signal handler, interrupting the malloc code, then this will
>> still result in undefined behavior (heap corruption or deadlocks).
>> 
>> Does lttng-ust call these functions even if the application does not?
>> Then you really need to switch to real thread-local variables with the
>> intial-exec model, possibly with indirection through a thread-local
>> pointer variable and on-demand allocation using mmap in case you do
>> not want to have these allocations in every thread.
> 
> Agreed. The calloc proceeds in AS context and that leads to undefined
> behaviour.

I indeed plan to spin an updated patch using mmap instead.

> 
> Please also read Florian's recommendation very carefully, you must us
> IE model TLS vars to ensure pre-allocation and avoid potential OOM
> during first variable touch, and to avoid first variable touch in a
> signal handler resulting in the same problem as above (though we are
> looking to fix this).

Hopefully touching the TLS variable from a library constructor sides-steps
this issue. This has worked well in practice so far.

Thanks,

Mathieu

> 
> --
> Cheers,
> Carlos.
Florian Weimer Oct. 17, 2017, 10:19 p.m. UTC | #8
* Mathieu Desnoyers:

>> Then you really need to switch to real thread-local variables with the
>> intial-exec model, possibly with indirection through a thread-local
>> pointer variable and on-demand allocation using mmap in case you do
>> not want to have these allocations in every thread.
>
> Actually, liburcu-bp, which is the library actually using
> pthread_setspecific, does use a TLS pointer to a mmap'd region.
>
> The __thread variable sits in a library shared object, so it's not
> possible to use the initial-exec model.

glibc supports this, and there shouldn't be any issues if the library
is loaded into the initial process image (LD_PRELOAD or DT_NEEDED
reference).  dlopen can be problematic, though.  With current
upstream, additional initial-exec TLS variables do not impact later
dlopen.  (This was not always true.)

> I work around this by touching
> the object once in a library constructor before other threads are
> launched. AFAIU, this takes care of performing the TLS lazy fixup.

Global-dynamic currently has lazy allocation for each thread, so a
library constructor is not good enough to ensure initialization.  We
probably want to fix this, but major language standards say that TLS
access from signal handlers is undefined, so it's not even a real bug.

But I have to admit that TLS access in signal handlers can be very
useful and is often the only way to implement certain things without
introducing a full managed runtime.
Mathieu Desnoyers Oct. 17, 2017, 10:27 p.m. UTC | #9
----- On Oct 17, 2017, at 5:55 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:
> 
>> The intent here is to provide signal-safety against callers to
>> pthread_{get/set}specific that work on different pthread keys,
>> without hurting performance of the normal use-cases that do not
>> care about signal-safety.
> 
> If we wish to provide particular safety guarantees beyond whatever is
> required by POSIX, I think they should be documented in the glibc manual
> as a GNU extension.

Good point! I'm narrowing it down to this text:

manual/threads.texi

@deftypefun int pthread_setspecific (pthread_key_t @var{key}, const void *@var{value})
@standards{POSIX, pthread.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{} @ascuheap{}}@acunsafe{@acucorrupt{} @acsmem{}}}
@c pthread_setspecific @asucorrupt @ascuheap @acucorrupt @acsmem
@c   a level2 block may be allocated by a signal handler after
@c   another call already made a decision to allocate it, thus losing
@c   the allocated value.  the seq number is updated before the
@c   value, which might cause an earlier-generation value to seem
@c   current if setspecific is cancelled or interrupted by a signal
@c  KEY_UNUSED ok
@c  calloc dup @ascuheap @acsmem
Associate the thread-specific @var{value} with @var{key} in the calling thread.
@end deftypefun

We'd need to edit the part about level2 block and allocation.

The part about seq number should stay there. It's the part that states
non-async-signal-safety across concurrent uses of pthread_setspecific
for a given key.

Would this edited version work ? I'm not sure I grasp the full meaning
of @acsmem there.

@deftypefun int pthread_setspecific (pthread_key_t @var{key}, const void *@var{value})
@standards{POSIX, pthread.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{}}@acunsafe{@acucorrupt{} @acsmem{}}}
@c pthread_setspecific @asucorrupt @acucorrupt @acsmem
@c   the seq number is updated before the value, which might cause
@c   an earlier-generation value to seem current if setspecific is
@c   cancelled or interrupted by a signal
@c  KEY_UNUSED ok
@c  dup @acsmem
Associate the thread-specific @var{value} with @var{key} in the calling thread.
@end deftypefun

Thanks,

Mathieu

> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Joseph Myers Oct. 17, 2017, 10:32 p.m. UTC | #10
On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:

> @c pthread_setspecific @asucorrupt @ascuheap @acucorrupt @acsmem
> @c   a level2 block may be allocated by a signal handler after
> @c   another call already made a decision to allocate it, thus losing
> @c   the allocated value.  the seq number is updated before the
> @c   value, which might cause an earlier-generation value to seem
> @c   current if setspecific is cancelled or interrupted by a signal
> @c  KEY_UNUSED ok
> @c  calloc dup @ascuheap @acsmem
> Associate the thread-specific @var{value} with @var{key} in the calling thread.
> @end deftypefun
> 
> We'd need to edit the part about level2 block and allocation.

That part is *comments* explaining to people reading the manual source 
code why the particular user-visible @safety{} annotations were determined 
from the library sources.

It's true we should keep those up to date when the library sources change.  
But my point was that there should be *user-visible* documentation - not 
just comments in the manual sources - of any safety guarantees we choose 
to provide beyond POSIX.
Mathieu Desnoyers Oct. 17, 2017, 10:48 p.m. UTC | #11
----- On Oct 17, 2017, at 6:32 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:
> 
>> @c pthread_setspecific @asucorrupt @ascuheap @acucorrupt @acsmem
>> @c   a level2 block may be allocated by a signal handler after
>> @c   another call already made a decision to allocate it, thus losing
>> @c   the allocated value.  the seq number is updated before the
>> @c   value, which might cause an earlier-generation value to seem
>> @c   current if setspecific is cancelled or interrupted by a signal
>> @c  KEY_UNUSED ok
>> @c  calloc dup @ascuheap @acsmem
>> Associate the thread-specific @var{value} with @var{key} in the calling thread.
>> @end deftypefun
>> 
>> We'd need to edit the part about level2 block and allocation.
> 
> That part is *comments* explaining to people reading the manual source
> code why the particular user-visible @safety{} annotations were determined
> from the library sources.
> 
> It's true we should keep those up to date when the library sources change.
> But my point was that there should be *user-visible* documentation - not
> just comments in the manual sources - of any safety guarantees we choose
> to provide beyond POSIX.

Can you provide a general hint on which repo and which documentation file
I should update ?

Thanks,

Mathieu

> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Joseph Myers Oct. 17, 2017, 10:50 p.m. UTC | #12
On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:

> Can you provide a general hint on which repo and which documentation file
> I should update ?

That same file in the manual.  But rather than just editing comments 
referring to details of the implementation, you need to write non-comment 
text explaining at the user API level, without reference to implementation 
internals, what API guarantees are provided in this regard.  The text 
should be sufficient for users to understand what uses they can make of 
the API without needing to refer to implementation sources.
Mathieu Desnoyers Oct. 17, 2017, 11:03 p.m. UTC | #13
----- On Oct 17, 2017, at 6:19 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>>> Then you really need to switch to real thread-local variables with the
>>> intial-exec model, possibly with indirection through a thread-local
>>> pointer variable and on-demand allocation using mmap in case you do
>>> not want to have these allocations in every thread.
>>
>> Actually, liburcu-bp, which is the library actually using
>> pthread_setspecific, does use a TLS pointer to a mmap'd region.
>>
>> The __thread variable sits in a library shared object, so it's not
>> possible to use the initial-exec model.
> 
> glibc supports this, and there shouldn't be any issues if the library
> is loaded into the initial process image (LD_PRELOAD or DT_NEEDED
> reference).  dlopen can be problematic, though.  With current
> upstream, additional initial-exec TLS variables do not impact later
> dlopen.  (This was not always true.)

Unfortunately, it is allowed to dlopen() the lttng-ust tracer library
as well.

> 
>> I work around this by touching
>> the object once in a library constructor before other threads are
>> launched. AFAIU, this takes care of performing the TLS lazy fixup.
> 
> Global-dynamic currently has lazy allocation for each thread, so a
> library constructor is not good enough to ensure initialization.  We
> probably want to fix this, but major language standards say that TLS
> access from signal handlers is undefined, so it's not even a real bug.
> 
> But I have to admit that TLS access in signal handlers can be very
> useful and is often the only way to implement certain things without
> introducing a full managed runtime.

Hrm, if with GD there is indeed memory allocation performed within each
thread even after the initial explicit access done from the library
constructor, then I may indeed need to use IE instead, even though it's
a somewhat global resource.

But since lttng-ust can be dlopen'd, I understand that IE is not
appropriate neither.

It looks like there is no good solution there ?

Thanks,

Mathieu
Mathieu Desnoyers Oct. 17, 2017, 11:15 p.m. UTC | #14
----- On Oct 17, 2017, at 6:50 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:
> 
>> Can you provide a general hint on which repo and which documentation file
>> I should update ?
> 
> That same file in the manual.  But rather than just editing comments
> referring to details of the implementation, you need to write non-comment
> text explaining at the user API level, without reference to implementation
> internals, what API guarantees are provided in this regard.  The text
> should be sufficient for users to understand what uses they can make of
> the API without needing to refer to implementation sources.

Gotcha. So I guess we need to update pthread_getspecific too
to document its async-signal safetyness, to make all this
somewhat useful.

How about the following ?

@deftypefun void *pthread_getspecific (pthread_key_t @var{key})
@standards{POSIX, pthread.h}
@safety{@prelim{}@mtsafe{}@assafe{}@acsafe{}}
@c pthread_getspecific ok
Return the thread-specific data associated with @var{key} in the calling
thread.
As a non-POSIX extension, pthread_getspecific is async-signal safe.
@end deftypefun

@deftypefun int pthread_setspecific (pthread_key_t @var{key}, const void *@var{value})
@standards{POSIX, pthread.h}
@safety{@prelim{}@mtsafe{}@asunsafe{@asucorrupt{}}@acunsafe{@acucorrupt{} @acsmem{}}}
@c pthread_setspecific @asucorrupt @acucorrupt @acsmem
@c   the seq number is updated before the value, which might cause
@c   an earlier-generation value to seem current if setspecific is
@c   cancelled or interrupted by a signal
@c  KEY_UNUSED ok
@c  dup @acsmem
Associate the thread-specific @var{value} with @var{key} in the calling thread.
As a non-POSIX extension, pthread_setspecific is async-signal safe for
concurrent invocations against different @var{key}, but not async-signal
safe for concurrent invocations against the same @var{key}.
@end deftypefun

Thanks,

Mathieu


> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Joseph Myers Oct. 17, 2017, 11:16 p.m. UTC | #15
On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:

> As a non-POSIX extension, pthread_getspecific is async-signal safe.

Well, you should use @code{} around the function name in the 
documentation.  But I think that the documentation covers the sorts of 
things I'd expect.  This is not a review of the merits of the proposed 
change or of whether the patch actually achieves the semantics you 
document.
Mathieu Desnoyers Oct. 17, 2017, 11:19 p.m. UTC | #16
----- On Oct 17, 2017, at 6:09 PM, Carlos O'Donell carlos@redhat.com wrote:

> On 10/17/2017 03:09 PM, Mathieu Desnoyers wrote:
>> The __thread variable sits in a library shared object, so it's not
>> possible to use the initial-exec model. I work around this by touching
>> the object once in a library constructor before other threads are
>> launched. AFAIU, this takes care of performing the TLS lazy fixup.
> 
> You *can* use IE model, but doing so *might* cause your library not to
> load if we run out of static TLS space (some amount is reserved by the
> static linker for loading such libraries that use IE model TLS).
> The static TLS space is therefore something like a distro-wide global
> resources :-)
> 
> However, we don't recommend IE model if you can get away with the
> performance of GD model and can manage to touch the variable early
> before it might be used in a signal handler.

Touching it from a constructor only does it from the "main" thread
perspective. Is there a need to touch it outside of a signal handler
in every thread ?

Thanks,

Mathieu

> 
> --
> Cheers,
> Carlos.
Mathieu Desnoyers Oct. 17, 2017, 11:21 p.m. UTC | #17
----- On Oct 17, 2017, at 7:16 PM, Joseph Myers joseph@codesourcery.com wrote:

> On Tue, 17 Oct 2017, Mathieu Desnoyers wrote:
> 
>> As a non-POSIX extension, pthread_getspecific is async-signal safe.
> 
> Well, you should use @code{} around the function name in the
> documentation.  But I think that the documentation covers the sorts of
> things I'd expect.  This is not a review of the merits of the proposed
> change or of whether the patch actually achieves the semantics you
> document.

Fixed. Thanks for the feedback!

Mathieu

> 
> --
> Joseph S. Myers
> joseph@codesourcery.com
Carlos O'Donell Oct. 18, 2017, 4:34 a.m. UTC | #18
On 10/17/2017 04:19 PM, Mathieu Desnoyers wrote:
> ----- On Oct 17, 2017, at 6:09 PM, Carlos O'Donell carlos@redhat.com wrote:
> 
>> On 10/17/2017 03:09 PM, Mathieu Desnoyers wrote:
>>> The __thread variable sits in a library shared object, so it's not
>>> possible to use the initial-exec model. I work around this by touching
>>> the object once in a library constructor before other threads are
>>> launched. AFAIU, this takes care of performing the TLS lazy fixup.
>>
>> You *can* use IE model, but doing so *might* cause your library not to
>> load if we run out of static TLS space (some amount is reserved by the
>> static linker for loading such libraries that use IE model TLS).
>> The static TLS space is therefore something like a distro-wide global
>> resources :-)
>>
>> However, we don't recommend IE model if you can get away with the
>> performance of GD model and can manage to touch the variable early
>> before it might be used in a signal handler.
> 
> Touching it from a constructor only does it from the "main" thread
> perspective. Is there a need to touch it outside of a signal handler
> in every thread ?

Yes, you must touch the TLS variable in every new thread *outside*
of the signal handler, since each access in each new thread is
required to force the deferred allocation to be carried out.

I apologize for my earlier statement, I thought perhaps this was
just one thread of execution.
Florian Weimer Oct. 18, 2017, 5:36 a.m. UTC | #19
* Mathieu Desnoyers:

> ----- On Oct 17, 2017, at 6:19 PM, Florian Weimer fw@deneb.enyo.de wrote:
>
>> * Mathieu Desnoyers:
>> 
>>>> Then you really need to switch to real thread-local variables with the
>>>> intial-exec model, possibly with indirection through a thread-local
>>>> pointer variable and on-demand allocation using mmap in case you do
>>>> not want to have these allocations in every thread.
>>>
>>> Actually, liburcu-bp, which is the library actually using
>>> pthread_setspecific, does use a TLS pointer to a mmap'd region.
>>>
>>> The __thread variable sits in a library shared object, so it's not
>>> possible to use the initial-exec model.
>> 
>> glibc supports this, and there shouldn't be any issues if the library
>> is loaded into the initial process image (LD_PRELOAD or DT_NEEDED
>> reference).  dlopen can be problematic, though.  With current
>> upstream, additional initial-exec TLS variables do not impact later
>> dlopen.  (This was not always true.)
>
> Unfortunately, it is allowed to dlopen() the lttng-ust tracer library
> as well.

The question is whether this is a common use case, or if it is
acceptable to have somewhat limited support for this (in the sense
that if you dlopen many different libraries which use initial-exec
TLS, you might see a failure eventually).
Mathieu Desnoyers Oct. 18, 2017, 4:21 p.m. UTC | #20
----- On Oct 18, 2017, at 1:36 AM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>> ----- On Oct 17, 2017, at 6:19 PM, Florian Weimer fw@deneb.enyo.de wrote:
>>
>>> * Mathieu Desnoyers:
>>> 
>>>>> Then you really need to switch to real thread-local variables with the
>>>>> intial-exec model, possibly with indirection through a thread-local
>>>>> pointer variable and on-demand allocation using mmap in case you do
>>>>> not want to have these allocations in every thread.
>>>>
>>>> Actually, liburcu-bp, which is the library actually using
>>>> pthread_setspecific, does use a TLS pointer to a mmap'd region.
>>>>
>>>> The __thread variable sits in a library shared object, so it's not
>>>> possible to use the initial-exec model.
>>> 
>>> glibc supports this, and there shouldn't be any issues if the library
>>> is loaded into the initial process image (LD_PRELOAD or DT_NEEDED
>>> reference).  dlopen can be problematic, though.  With current
>>> upstream, additional initial-exec TLS variables do not impact later
>>> dlopen.  (This was not always true.)
>>
>> Unfortunately, it is allowed to dlopen() the lttng-ust tracer library
>> as well.
> 
> The question is whether this is a common use case, or if it is
> acceptable to have somewhat limited support for this (in the sense
> that if you dlopen many different libraries which use initial-exec
> TLS, you might see a failure eventually).

My use-case is a library called from an application or library instrumentation.
The objective is to minimally change the application/library (basically only
instrument it). It should be possible to only instrument and trace a
linked .so without changing the application.

I need the tracer to run fine in a signal handler, and it happens to
use TLS variables.

I would assume this use-case is not so common, as the requirement
of being used in a signal handler is more specific to a tracer than
a generic library.

So let's assume dlopen failure is acceptable if there are too many
initial-exec TLS. Would it work today to fix signal handler tracing
by changing lttng-ust to use TLS_IE, or would it require changes to
glibc ?

Thanks,

Mathieu
Carlos O'Donell Oct. 18, 2017, 5:40 p.m. UTC | #21
On 10/18/2017 09:21 AM, Mathieu Desnoyers wrote:
> So let's assume dlopen failure is acceptable if there are too many
> initial-exec TLS. Would it work today to fix signal handler tracing
> by changing lttng-ust to use TLS_IE, or would it require changes to
> glibc ?

No changes to glibc are required. We already use TLS IE in glibc
specifically to fix this issue. Also libGL also uses TLS IE in
the distro for performance reasons, so we get testing there.

How much TLS data do you have? With libGL it's just a few words,
and there is lots of spare space in the static TLS space for
that.
Mathieu Desnoyers Oct. 18, 2017, 6:16 p.m. UTC | #22
----- On Oct 18, 2017, at 1:40 PM, carlos carlos@redhat.com wrote:

> On 10/18/2017 09:21 AM, Mathieu Desnoyers wrote:
>> So let's assume dlopen failure is acceptable if there are too many
>> initial-exec TLS. Would it work today to fix signal handler tracing
>> by changing lttng-ust to use TLS_IE, or would it require changes to
>> glibc ?
> 
> No changes to glibc are required. We already use TLS IE in glibc
> specifically to fix this issue. Also libGL also uses TLS IE in
> the distro for performance reasons, so we get testing there.
> 
> How much TLS data do you have? With libGL it's just a few words,
> and there is lots of spare space in the static TLS space for
> that.

For liburcu-bp, used by lttng-ust, a single pointer:

DEFINE_URCU_TLS(struct rcu_reader *, rcu_reader);

I make sure to do place the bulk of data into mmap'd memory.
So, very little TLS space is required.

Thanks,

Mathieu

> 
> --
> Cheers,
> Carlos.
Carlos O'Donell Oct. 18, 2017, 6:38 p.m. UTC | #23
On 10/18/2017 11:16 AM, Mathieu Desnoyers wrote:
> ----- On Oct 18, 2017, at 1:40 PM, carlos carlos@redhat.com wrote:
> 
>> On 10/18/2017 09:21 AM, Mathieu Desnoyers wrote:
>>> So let's assume dlopen failure is acceptable if there are too many
>>> initial-exec TLS. Would it work today to fix signal handler tracing
>>> by changing lttng-ust to use TLS_IE, or would it require changes to
>>> glibc ?
>>
>> No changes to glibc are required. We already use TLS IE in glibc
>> specifically to fix this issue. Also libGL also uses TLS IE in
>> the distro for performance reasons, so we get testing there.
>>
>> How much TLS data do you have? With libGL it's just a few words,
>> and there is lots of spare space in the static TLS space for
>> that.
> 
> For liburcu-bp, used by lttng-ust, a single pointer:
> 
> DEFINE_URCU_TLS(struct rcu_reader *, rcu_reader);
> 
> I make sure to do place the bulk of data into mmap'd memory.
> So, very little TLS space is required.

Perfect. Just like libGL then, which just has two pointers.

The probability of your library not loading is going to be
vanishingly small, so I think you can switch to TLS IE without
a problem.
Mathieu Desnoyers Oct. 18, 2017, 9:30 p.m. UTC | #24
----- On Oct 18, 2017, at 2:38 PM, carlos carlos@redhat.com wrote:

> On 10/18/2017 11:16 AM, Mathieu Desnoyers wrote:
>> ----- On Oct 18, 2017, at 1:40 PM, carlos carlos@redhat.com wrote:
>> 
>>> On 10/18/2017 09:21 AM, Mathieu Desnoyers wrote:
>>>> So let's assume dlopen failure is acceptable if there are too many
>>>> initial-exec TLS. Would it work today to fix signal handler tracing
>>>> by changing lttng-ust to use TLS_IE, or would it require changes to
>>>> glibc ?
>>>
>>> No changes to glibc are required. We already use TLS IE in glibc
>>> specifically to fix this issue. Also libGL also uses TLS IE in
>>> the distro for performance reasons, so we get testing there.
>>>
>>> How much TLS data do you have? With libGL it's just a few words,
>>> and there is lots of spare space in the static TLS space for
>>> that.
>> 
>> For liburcu-bp, used by lttng-ust, a single pointer:
>> 
>> DEFINE_URCU_TLS(struct rcu_reader *, rcu_reader);
>> 
>> I make sure to do place the bulk of data into mmap'd memory.
>> So, very little TLS space is required.
> 
> Perfect. Just like libGL then, which just has two pointers.
> 
> The probability of your library not loading is going to be
> vanishingly small, so I think you can switch to TLS IE without
> a problem.

I'm trying to figure out the right TLS model for each liburcu
libraries. Given that it means a major soname bump, I prefer to
get it right from the get go.

For TLS meant to be touched by a signal handler at first use
in a thread, the hard requirement seems to be TLS initial-exec.
This takes care of urcu-bp, the call-rcu, and rcu-defer TLS.

I have other liburcu flavors where the application is required to
explicitly register the TLS data structure at thread start. This
lifts the "hard" requirement on making the first use signal-handler
safe, given that I can mandate the registration to be performed
outside of signal handler.

The largest data structure I have is 64 bytes. Is it still
acceptable for the IE model ?

I understand that there is a performance benefit in using
the initial-exec model over the global-dynamic. Given liburcu
tends to be used on fast-paths, should I simply move all
liburcu TLS variables to the IE model ?

Thanks,

Mathieu



> 
> --
> Cheers,
> Carlos.
Florian Weimer Oct. 18, 2017, 9:34 p.m. UTC | #25
* Mathieu Desnoyers:

> I understand that there is a performance benefit in using
> the initial-exec model over the global-dynamic. Given liburcu
> tends to be used on fast-paths, should I simply move all
> liburcu TLS variables to the IE model ?

That should be okay.

The situation for dlopen will improve because we will eventually offer
some tuning knobs, so that it's possible to increase the reserve in
anticipation of future dlopen calls.

Right now, if your libraries are compatible with LD_PRELOAD, that
would also be a workable solution, I think.
Szabolcs Nagy Oct. 19, 2017, 9:44 a.m. UTC | #26
On 17/10/17 23:19, Florian Weimer wrote:
> Global-dynamic currently has lazy allocation for each thread, so a
> library constructor is not good enough to ensure initialization.  We
> probably want to fix this, but major language standards say that TLS
> access from signal handlers is undefined, so it's not even a real bug.

the relevant standard for a libc implementation is iso c which has

7.14.1.1p5
"If the signal occurs other than as the result of calling the abort
or raise function, the behavior is undefined if the signal handler
refers to any object with static or thread storage duration that is
not a lock-free atomic object other than by assigning a value to an
object declared as volatile sig_atomic_t, or the signal handler calls
any function in the standard library other than the abort function,
the _Exit function, the quick_exit function, or the signal function
with the first argument equal to the signal number corresponding to
the signal that caused the invocation of the handler."

i.e. lock-free atomic and volatile sig_atomic_t tls access should work.
Florian Weimer Oct. 19, 2017, 10:37 a.m. UTC | #27
On 10/19/2017 11:44 AM, Szabolcs Nagy wrote:
> On 17/10/17 23:19, Florian Weimer wrote:
>> Global-dynamic currently has lazy allocation for each thread, so a
>> library constructor is not good enough to ensure initialization.  We
>> probably want to fix this, but major language standards say that TLS
>> access from signal handlers is undefined, so it's not even a real bug.
> 
> the relevant standard for a libc implementation is iso c which has
> 
> 7.14.1.1p5
> "If the signal occurs other than as the result of calling the abort
> or raise function, the behavior is undefined if the signal handler
> refers to any object with static or thread storage duration that is
> not a lock-free atomic object other than by assigning a value to an
> object declared as volatile sig_atomic_t, or the signal handler calls
> any function in the standard library other than the abort function,
> the _Exit function, the quick_exit function, or the signal function
> with the first argument equal to the signal number corresponding to
> the signal that caused the invocation of the handler."
> 
> i.e. lock-free atomic and volatile sig_atomic_t tls access should work.

For us, it depends on the TLS model.  Async-signal-safe TLS access is 
considered unimplementable in general, which is why C++ makes it undefined:

“
\pnum
\indextext{signal-safe!evaluation|see{evaluation, signal-safe}}%
An evaluation is \defnx{signal-safe}{evaluation!signal-safe} unless it 
includes one of the following:

\begin{itemize}
[…]
\item
an access to an object with thread storage duration;
”

Thanks,
Florian
Szabolcs Nagy Oct. 19, 2017, 11:04 a.m. UTC | #28
On 19/10/17 11:37, Florian Weimer wrote:
> On 10/19/2017 11:44 AM, Szabolcs Nagy wrote:
>> On 17/10/17 23:19, Florian Weimer wrote:
>>> Global-dynamic currently has lazy allocation for each thread, so a
>>> library constructor is not good enough to ensure initialization.  We
>>> probably want to fix this, but major language standards say that TLS
>>> access from signal handlers is undefined, so it's not even a real bug.
>>
>> the relevant standard for a libc implementation is iso c which has
>>
>> 7.14.1.1p5
>> "If the signal occurs other than as the result of calling the abort
>> or raise function, the behavior is undefined if the signal handler
>> refers to any object with static or thread storage duration that is
>> not a lock-free atomic object other than by assigning a value to an
>> object declared as volatile sig_atomic_t, or the signal handler calls
>> any function in the standard library other than the abort function,
>> the _Exit function, the quick_exit function, or the signal function
>> with the first argument equal to the signal number corresponding to
>> the signal that caused the invocation of the handler."
>>
>> i.e. lock-free atomic and volatile sig_atomic_t tls access should work.
> 
> For us, it depends on the TLS model.  Async-signal-safe TLS access is considered unimplementable in general,

musl implements it: you just have to allocate tls for
all dsos and all threads in dlopen and pthread_create.
(if unbounded dlopen/dlclose calls are supported then
it's a bit trickier since dtv slots need to be reused
then, but should be possible in principle)

and i think shared libraries are unimplementable for
c++ already (ctors can execute arbitrary code during
dlopen and can make it crash and there are odr issues
with vague linkage vs RTLD_LOCAL), so general dynamic
tls should not come up in a strictly conforming c++
program, so that's not a good reason to drop as-safety.

> which is why C++ makes it undefined:
> 
> “
> \pnum
> \indextext{signal-safe!evaluation|see{evaluation, signal-safe}}%
> An evaluation is \defnx{signal-safe}{evaluation!signal-safe} unless it includes one of the following:
> 
> \begin{itemize}
> […]
> \item
> an access to an object with thread storage duration;
> ”
> 

that's new text in c++17 (and yet another incompatibility
with c).
Mathieu Desnoyers Oct. 19, 2017, 11:19 a.m. UTC | #29
----- On Oct 18, 2017, at 5:34 PM, Florian Weimer fw@deneb.enyo.de wrote:

> * Mathieu Desnoyers:
> 
>> I understand that there is a performance benefit in using
>> the initial-exec model over the global-dynamic. Given liburcu
>> tends to be used on fast-paths, should I simply move all
>> liburcu TLS variables to the IE model ?
> 
> That should be okay.
> 
> The situation for dlopen will improve because we will eventually offer
> some tuning knobs, so that it's possible to increase the reserve in
> anticipation of future dlopen calls.
> 
> Right now, if your libraries are compatible with LD_PRELOAD, that
> would also be a workable solution, I think.

I also figured I can introduce that TLS model change without bumping the
library soname, even though those TLS symbols are exposed, because the
link-editor figures out at runtime which reloc to apply based on the
TLS model.

This basically means the __attribute__((tls_model("initial-exec"))) should
be applied to the definition, and the declaration of the TLS does not need
any attribute. The link-editor will figure out the right reloc to apply
at link-time, thus modifying the code of the TLS accesses.

Let me know if I misunderstand that part,

Thanks!

Mathieu
Florian Weimer Oct. 19, 2017, 12:49 p.m. UTC | #30
On 10/19/2017 01:04 PM, Szabolcs Nagy wrote:
> On 19/10/17 11:37, Florian Weimer wrote:
>> On 10/19/2017 11:44 AM, Szabolcs Nagy wrote:
>>> On 17/10/17 23:19, Florian Weimer wrote:
>>>> Global-dynamic currently has lazy allocation for each thread, so a
>>>> library constructor is not good enough to ensure initialization.  We
>>>> probably want to fix this, but major language standards say that TLS
>>>> access from signal handlers is undefined, so it's not even a real bug.
>>>
>>> the relevant standard for a libc implementation is iso c which has
>>>
>>> 7.14.1.1p5
>>> "If the signal occurs other than as the result of calling the abort
>>> or raise function, the behavior is undefined if the signal handler
>>> refers to any object with static or thread storage duration that is
>>> not a lock-free atomic object other than by assigning a value to an
>>> object declared as volatile sig_atomic_t, or the signal handler calls
>>> any function in the standard library other than the abort function,
>>> the _Exit function, the quick_exit function, or the signal function
>>> with the first argument equal to the signal number corresponding to
>>> the signal that caused the invocation of the handler."
>>>
>>> i.e. lock-free atomic and volatile sig_atomic_t tls access should work.
>>
>> For us, it depends on the TLS model.  Async-signal-safe TLS access is considered unimplementable in general,
> 
> musl implements it: you just have to allocate tls for
> all dsos and all threads in dlopen and pthread_create.
> (if unbounded dlopen/dlclose calls are supported then
> it's a bit trickier since dtv slots need to be reused
> then, but should be possible in principle)

We have received guidance that this is not desirable for all TLS usage 
scenarios.  Some applications might not want to allocate backing storage 
for the full set of TLS variables for those threads which do not need 
all of them.

But the entire world is not Linux.

>> which is why C++ makes it undefined:
>>
>> “
>> \pnum
>> \indextext{signal-safe!evaluation|see{evaluation, signal-safe}}%
>> An evaluation is \defnx{signal-safe}{evaluation!signal-safe} unless it includes one of the following:
>>
>> \begin{itemize}
>> […]
>> \item
>> an access to an object with thread storage duration;
>> ”
>>
> 
> that's new text in c++17 (and yet another incompatibility
> with c).

In general, the C++ standard adheres much more to implementation 
experience than the C standard.  This incompatibility is probably a 
defect in the C standard.

Thanks,
Florian
Joseph Myers Oct. 19, 2017, 3:43 p.m. UTC | #31
On Thu, 19 Oct 2017, Szabolcs Nagy wrote:

> the relevant standard for a libc implementation is iso c which has
> 
> 7.14.1.1p5
> "If the signal occurs other than as the result of calling the abort
> or raise function, the behavior is undefined if the signal handler
> refers to any object with static or thread storage duration that is
> not a lock-free atomic object other than by assigning a value to an
> object declared as volatile sig_atomic_t, or the signal handler calls
> any function in the standard library other than the abort function,
> the _Exit function, the quick_exit function, or the signal function
> with the first argument equal to the signal number corresponding to
> the signal that caused the invocation of the handler."

Note that this changes following DR#462 to also allow calling 
<stdatomic.h> functions for lock-free types - but that doesn't change the 
principle that TLS access should work and so TLS should be allocated at 
dlopen / pthread_create time.
Joseph Myers Oct. 19, 2017, 3:49 p.m. UTC | #32
On Thu, 19 Oct 2017, Florian Weimer wrote:

> > musl implements it: you just have to allocate tls for
> > all dsos and all threads in dlopen and pthread_create.
> > (if unbounded dlopen/dlclose calls are supported then
> > it's a bit trickier since dtv slots need to be reused
> > then, but should be possible in principle)
> 
> We have received guidance that this is not desirable for all TLS usage
> scenarios.  Some applications might not want to allocate backing storage for
> the full set of TLS variables for those threads which do not need all of them.

My assertion is that the default, for new binaries, should be to do the 
allocation at dlopen / pthread_create time.  This allows for the 
possibility of having a dlopen flag / pthread attribute / other API to 
allow lazy allocation in a particular case, with new symbol versions being 
used to keep existing binaries defaulting to lazy allocation.

Patch
diff mbox series

diff --git a/nptl/pthread_setspecific.c b/nptl/pthread_setspecific.c
index 214af3b661..aa5a1f17c8 100644
--- a/nptl/pthread_setspecific.c
+++ b/nptl/pthread_setspecific.c
@@ -61,18 +61,33 @@  __pthread_setspecific (pthread_key_t key, const void *value)
       level2 = THREAD_GETMEM_NC (self, specific, idx1st);
       if (level2 == NULL)
 	{
+          sigset_t ss, old_ss;
+
 	  if (value == NULL)
 	    /* We don't have to do anything.  The value would in any case
 	       be NULL.  We can save the memory allocation.  */
 	    return 0;
 
+	  /* Ensure that pthread_setspecific and pthread_getspecific are
+             signal-safe when used on different keys.  */
+          sigfillset (&ss);
+          pthread_sigmask (SIG_BLOCK, &ss, &old_ss);
+          /* Read level2 again with signals disabled.  */
+          level2 = THREAD_GETMEM_NC (self, specific, idx1st);
+          if (level2 != NULL)
+            goto skip_resize;
+
 	  level2
 	    = (struct pthread_key_data *) calloc (PTHREAD_KEY_2NDLEVEL_SIZE,
 						  sizeof (*level2));
-	  if (level2 == NULL)
+	  if (level2 == NULL) {
+            pthread_sigmask (SIG_SETMASK, &old_ss, NULL);
 	    return ENOMEM;
+	  }
 
 	  THREAD_SETMEM_NC (self, specific, idx1st, level2);
+skip_resize:
+          pthread_sigmask (SIG_SETMASK, &old_ss, NULL);
 	}
 
       /* Pointer to the right array element.  */