diff mbox series

New TLS usage in libgcc_s.so.1, compatibility impact

Message ID 8734v1ieke.fsf@oldenburg.str.redhat.com
State New
Headers show
Series New TLS usage in libgcc_s.so.1, compatibility impact | expand

Commit Message

Florian Weimer Jan. 13, 2024, 12:49 p.m. UTC
This commit

commit 8abddb187b33480d8827f44ec655f45734a1749d
Author: Andrew Burgess <andrew.burgess@embecosm.com>
Date:   Sat Aug 5 14:31:06 2023 +0200

    libgcc: support heap-based trampolines
    
    Add support for heap-based trampolines on x86_64-linux, aarch64-linux,
    and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
    __builtin_nested_func_ptr_deleted functions for these targets.
    
    Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
    Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
    Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>

added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
built, it ends up using a dynamic TLS variant on the Linux targets.
This means that there is no up-front TLS allocation with glibc (but
there would be one with musl).

There is still a compatibility impact because glibc assigns a TLS module
ID upfront.  This seems to be what causes the
ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
up with an infinite regress during process termination because
libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
happens, the bottom of the stack looks like this:

#4447 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4448 0x00007ffff7fdb142 in free (ptr=<optimized out>)
    at ../include/rtld-malloc.h:50
#4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
#4450 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f2bfc0, 
    gen=<optimized out>) at ../elf/dl-tls.c:916
#4451 0x00007ffff7fddccc in __tls_get_addr ()
    at ../sysdeps/x86_64/tls_get_addr.S:55
#4452 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4453 0x00007ffff7fdb142 in free (ptr=<optimized out>)
    at ../include/rtld-malloc.h:50
#4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#4455 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f39fa0, 
    gen=<optimized out>) at ../elf/dl-tls.c:916
#4456 0x00007ffff7fddccc in __tls_get_addr ()
    at ../sysdeps/x86_64/tls_get_addr.S:55
#4457 0x00007ffff7f36113 in lttng_ust_cancelstate_disable_push ()
   from /lib64/liblttng-ust-common.so.1
#4458 0x00007ffff7f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
#4459 0x00007ffff7f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
#4460 0x00007ffff7fca0f2 in _dl_call_fini (
    closure_map=closure_map@entry=0x7ffff7fbe000) at dl-call_fini.c:43
#4461 0x00007ffff7fce06e in _dl_fini () at dl-fini.c:114
#4462 0x00007ffff7d82fe6 in __run_exit_handlers () from /lib64/libc.so.6

Cc:ing <lttng-dev@lists.lttng.org> for awareness.

The issue also requires a recent glibc with changes to DTV management:
commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
access after dlopen [BZ #19924]").  If I understand things correctly,
before this glibc change, we didn't deallocate the old DTV, so there was
no call to the free function.

On the glibc side, we should recommend that intercepting mallocs and its
dependencies use initial-exec TLS because that kind of TLS does not use
malloc.  If intercepting mallocs using dynamic TLS work at all, that's
totally by accident, and was in the past helped by glibc bug 19924.  (I
don't think there is anything special about libgcc_s.so.1 that triggers
the test failure above, it is just an object with dynamic TLS that is
implicitly loaded via dlopen at the right stage of the test.)  In this
particular case, we can also paper over the test failure in glibc by not
call free at all because the argument is a null pointer:

As the comment hints, we shouldn't be using malloc for TLS memory at all
because it is not AS-safe, but that's a long-term change.  This change
seems rather specific to this particular test case failure because it
relies on libgcc_s.so.1 never using TLS before it gets unloaded.

Regarding the libgcc_s side, I'm not sure if the TLS usage there should
be considered a real problem, although I'm a bit nervous about it.
However, the current implementation caches one page of trampolines past
the outermost nested function pointer deallocation (otherwise creating
one function pointer per thread in a loop would be really expensive).
It looks to me that is never freed, so if the thread exits even with
proper unwinding (e.g., on glibc with code compiled with -fexceptions),
there is a memory leak.  Integration with glibc could avoid this issue,
and also help with the longjmp problem, and fix setcontext/swapcontext,
too.

Thanks,
Florian

Comments

Szabolcs Nagy Jan. 15, 2024, 12:46 p.m. UTC | #1
The 01/13/2024 13:49, Florian Weimer wrote:
> This commit
> 
> commit 8abddb187b33480d8827f44ec655f45734a1749d
> Author: Andrew Burgess <andrew.burgess@embecosm.com>
> Date:   Sat Aug 5 14:31:06 2023 +0200
> 
>     libgcc: support heap-based trampolines
>     
>     Add support for heap-based trampolines on x86_64-linux, aarch64-linux,
>     and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
>     __builtin_nested_func_ptr_deleted functions for these targets.
>     
>     Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
>     Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
>     Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
> 
> added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
> built, it ends up using a dynamic TLS variant on the Linux targets.
> This means that there is no up-front TLS allocation with glibc (but
> there would be one with musl).
> 
> There is still a compatibility impact because glibc assigns a TLS module
> ID upfront.  This seems to be what causes the
> ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
> up with an infinite regress during process termination because
> libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
> happens, the bottom of the stack looks like this:
> 
> #4447 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
> #4448 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>     at ../include/rtld-malloc.h:50
> #4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
> #4450 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f2bfc0, 
>     gen=<optimized out>) at ../elf/dl-tls.c:916
> #4451 0x00007ffff7fddccc in __tls_get_addr ()
>     at ../sysdeps/x86_64/tls_get_addr.S:55
> #4452 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
> #4453 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>     at ../include/rtld-malloc.h:50
> #4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
> #4455 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f39fa0, 
>     gen=<optimized out>) at ../elf/dl-tls.c:916
> #4456 0x00007ffff7fddccc in __tls_get_addr ()
>     at ../sysdeps/x86_64/tls_get_addr.S:55
> #4457 0x00007ffff7f36113 in lttng_ust_cancelstate_disable_push ()
>    from /lib64/liblttng-ust-common.so.1
> #4458 0x00007ffff7f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
> #4459 0x00007ffff7f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
> #4460 0x00007ffff7fca0f2 in _dl_call_fini (
>     closure_map=closure_map@entry=0x7ffff7fbe000) at dl-call_fini.c:43
> #4461 0x00007ffff7fce06e in _dl_fini () at dl-fini.c:114
> #4462 0x00007ffff7d82fe6 in __run_exit_handlers () from /lib64/libc.so.6
> 
> Cc:ing <lttng-dev@lists.lttng.org> for awareness.
> 
> The issue also requires a recent glibc with changes to DTV management:
> commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
> access after dlopen [BZ #19924]").  If I understand things correctly,
> before this glibc change, we didn't deallocate the old DTV, so there was
> no call to the free function.

with 19924 fixed, after a dlopen or dlclose every thread updates
its dtv on the next dynamic tls access.

before that, dtv was only updated up to the generation of the
module being accessed for a particular tls access.

so hitting the free in the dtv update path is now more likely
but the free is not new, it was there before.

also note that this is unlikely to happen on aarch64 since
tlsdesc only does dynamic tls access after a 512byte static
tls reservation runs out.

> 
> On the glibc side, we should recommend that intercepting mallocs and its
> dependencies use initial-exec TLS because that kind of TLS does not use
> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
> totally by accident, and was in the past helped by glibc bug 19924.  (I

right.

> don't think there is anything special about libgcc_s.so.1 that triggers
> the test failure above, it is just an object with dynamic TLS that is
> implicitly loaded via dlopen at the right stage of the test.)  In this
> particular case, we can also paper over the test failure in glibc by not
> call free at all because the argument is a null pointer:
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 7b3dd9ab60..14c71cbd06 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -819,7 +819,8 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
>  		 dtv entry free it.  Note: this is not AS-safe.  */
>  	      /* XXX Ideally we will at some point create a memory
>  		 pool.  */
> -	      free (dtv[modid].pointer.to_free);
> +	      if (dtv[modid].pointer.to_free != NULL)
> +		free (dtv[modid].pointer.to_free);
>  	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>  	      dtv[modid].pointer.to_free = NULL;

can be done, but !=NULL is more likely since we do modid reuse
after dlclose.

there is also a realloc in dtv resizing which happens when more
than 16 modules with tls are loaded after thread creation
(DTV_SURPLUS).

i'm not sure if it's worth supporting malloc interposers that
only work sometimes.

> 
> As the comment hints, we shouldn't be using malloc for TLS memory at all
> because it is not AS-safe, but that's a long-term change.  This change
> seems rather specific to this particular test case failure because it
> relies on libgcc_s.so.1 never using TLS before it gets unloaded.
> 
> Regarding the libgcc_s side, I'm not sure if the TLS usage there should
> be considered a real problem, although I'm a bit nervous about it.
> However, the current implementation caches one page of trampolines past
> the outermost nested function pointer deallocation (otherwise creating
> one function pointer per thread in a loop would be really expensive).
> It looks to me that is never freed, so if the thread exits even with
> proper unwinding (e.g., on glibc with code compiled with -fexceptions),
> there is a memory leak.  Integration with glibc could avoid this issue,
> and also help with the longjmp problem, and fix setcontext/swapcontext,
> too.
> 
> Thanks,
> Florian
>
Adhemerval Zanella Netto Jan. 15, 2024, 1:55 p.m. UTC | #2
On 15/01/24 09:46, Szabolcs Nagy wrote:
> The 01/13/2024 13:49, Florian Weimer wrote:
>> This commit
>>
>> commit 8abddb187b33480d8827f44ec655f45734a1749d
>> Author: Andrew Burgess <andrew.burgess@embecosm.com>
>> Date:   Sat Aug 5 14:31:06 2023 +0200
>>
>>     libgcc: support heap-based trampolines
>>     
>>     Add support for heap-based trampolines on x86_64-linux, aarch64-linux,
>>     and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
>>     __builtin_nested_func_ptr_deleted functions for these targets.
>>     
>>     Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
>>     Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
>>     Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
>>
>> added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
>> built, it ends up using a dynamic TLS variant on the Linux targets.
>> This means that there is no up-front TLS allocation with glibc (but
>> there would be one with musl).
>>
>> There is still a compatibility impact because glibc assigns a TLS module
>> ID upfront.  This seems to be what causes the
>> ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
>> up with an infinite regress during process termination because
>> libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
>> happens, the bottom of the stack looks like this:
>>
>> #4447 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
>> #4448 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>>     at ../include/rtld-malloc.h:50
>> #4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
>> #4450 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f2bfc0, 
>>     gen=<optimized out>) at ../elf/dl-tls.c:916
>> #4451 0x00007ffff7fddccc in __tls_get_addr ()
>>     at ../sysdeps/x86_64/tls_get_addr.S:55
>> #4452 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
>> #4453 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>>     at ../include/rtld-malloc.h:50
>> #4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
>> #4455 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f39fa0, 
>>     gen=<optimized out>) at ../elf/dl-tls.c:916
>> #4456 0x00007ffff7fddccc in __tls_get_addr ()
>>     at ../sysdeps/x86_64/tls_get_addr.S:55
>> #4457 0x00007ffff7f36113 in lttng_ust_cancelstate_disable_push ()
>>    from /lib64/liblttng-ust-common.so.1
>> #4458 0x00007ffff7f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
>> #4459 0x00007ffff7f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
>> #4460 0x00007ffff7fca0f2 in _dl_call_fini (
>>     closure_map=closure_map@entry=0x7ffff7fbe000) at dl-call_fini.c:43
>> #4461 0x00007ffff7fce06e in _dl_fini () at dl-fini.c:114
>> #4462 0x00007ffff7d82fe6 in __run_exit_handlers () from /lib64/libc.so.6
>>
>> Cc:ing <lttng-dev@lists.lttng.org> for awareness.
>>
>> The issue also requires a recent glibc with changes to DTV management:
>> commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
>> access after dlopen [BZ #19924]").  If I understand things correctly,
>> before this glibc change, we didn't deallocate the old DTV, so there was
>> no call to the free function.
> 
> with 19924 fixed, after a dlopen or dlclose every thread updates
> its dtv on the next dynamic tls access.
> 
> before that, dtv was only updated up to the generation of the
> module being accessed for a particular tls access.
> 
> so hitting the free in the dtv update path is now more likely
> but the free is not new, it was there before.
> 
> also note that this is unlikely to happen on aarch64 since
> tlsdesc only does dynamic tls access after a 512byte static
> tls reservation runs out.
> 
>>
>> On the glibc side, we should recommend that intercepting mallocs and its
>> dependencies use initial-exec TLS because that kind of TLS does not use
>> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
>> totally by accident, and was in the past helped by glibc bug 19924.  (I
> 
> right.
> 
>> don't think there is anything special about libgcc_s.so.1 that triggers
>> the test failure above, it is just an object with dynamic TLS that is
>> implicitly loaded via dlopen at the right stage of the test.)  In this
>> particular case, we can also paper over the test failure in glibc by not
>> call free at all because the argument is a null pointer:
>>
>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>> index 7b3dd9ab60..14c71cbd06 100644
>> --- a/elf/dl-tls.c
>> +++ b/elf/dl-tls.c
>> @@ -819,7 +819,8 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
>>  		 dtv entry free it.  Note: this is not AS-safe.  */
>>  	      /* XXX Ideally we will at some point create a memory
>>  		 pool.  */
>> -	      free (dtv[modid].pointer.to_free);
>> +	      if (dtv[modid].pointer.to_free != NULL)
>> +		free (dtv[modid].pointer.to_free);
>>  	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>>  	      dtv[modid].pointer.to_free = NULL;
> 
> can be done, but !=NULL is more likely since we do modid reuse
> after dlclose.
> 
> there is also a realloc in dtv resizing which happens when more
> than 16 modules with tls are loaded after thread creation
> (DTV_SURPLUS).
> 
> i'm not sure if it's worth supporting malloc interposers that
> only work sometimes.
> 

Maybe one option would to try reinstate the async-signal-safe TLS
code to avoid malloc/free in dynamic TLS altogether.  We revert it on
2.14 release cause it broke ASAN/LSAN [1], but I think we might try 
to reinstate on 2.40 and work with sanitizer project to get this sort 
out.

[1] https://sourceware.org/pipermail/libc-alpha/2014-January/047931.html
Carlos O'Donell Jan. 15, 2024, 2:47 p.m. UTC | #3
On 1/15/24 08:55, Adhemerval Zanella Netto wrote:
> 
> 
> On 15/01/24 09:46, Szabolcs Nagy wrote:
>> The 01/13/2024 13:49, Florian Weimer wrote:
>>> This commit
>>>
>>> commit 8abddb187b33480d8827f44ec655f45734a1749d
>>> Author: Andrew Burgess <andrew.burgess@embecosm.com>
>>> Date:   Sat Aug 5 14:31:06 2023 +0200
>>>
>>>     libgcc: support heap-based trampolines
>>>     
>>>     Add support for heap-based trampolines on x86_64-linux, aarch64-linux,
>>>     and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
>>>     __builtin_nested_func_ptr_deleted functions for these targets.
>>>     
>>>     Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
>>>     Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
>>>     Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
>>>
>>> added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
>>> built, it ends up using a dynamic TLS variant on the Linux targets.
>>> This means that there is no up-front TLS allocation with glibc (but
>>> there would be one with musl).
>>>
>>> There is still a compatibility impact because glibc assigns a TLS module
>>> ID upfront.  This seems to be what causes the
>>> ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
>>> up with an infinite regress during process termination because
>>> libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
>>> happens, the bottom of the stack looks like this:
>>>
>>> #4447 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
>>> #4448 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>>>     at ../include/rtld-malloc.h:50
>>> #4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
>>> #4450 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f2bfc0, 
>>>     gen=<optimized out>) at ../elf/dl-tls.c:916
>>> #4451 0x00007ffff7fddccc in __tls_get_addr ()
>>>     at ../sysdeps/x86_64/tls_get_addr.S:55
>>> #4452 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
>>> #4453 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>>>     at ../include/rtld-malloc.h:50
>>> #4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
>>> #4455 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f39fa0, 
>>>     gen=<optimized out>) at ../elf/dl-tls.c:916
>>> #4456 0x00007ffff7fddccc in __tls_get_addr ()
>>>     at ../sysdeps/x86_64/tls_get_addr.S:55
>>> #4457 0x00007ffff7f36113 in lttng_ust_cancelstate_disable_push ()
>>>    from /lib64/liblttng-ust-common.so.1
>>> #4458 0x00007ffff7f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
>>> #4459 0x00007ffff7f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
>>> #4460 0x00007ffff7fca0f2 in _dl_call_fini (
>>>     closure_map=closure_map@entry=0x7ffff7fbe000) at dl-call_fini.c:43
>>> #4461 0x00007ffff7fce06e in _dl_fini () at dl-fini.c:114
>>> #4462 0x00007ffff7d82fe6 in __run_exit_handlers () from /lib64/libc.so.6
>>>
>>> Cc:ing <lttng-dev@lists.lttng.org> for awareness.
>>>
>>> The issue also requires a recent glibc with changes to DTV management:
>>> commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
>>> access after dlopen [BZ #19924]").  If I understand things correctly,
>>> before this glibc change, we didn't deallocate the old DTV, so there was
>>> no call to the free function.
>>
>> with 19924 fixed, after a dlopen or dlclose every thread updates
>> its dtv on the next dynamic tls access.
>>
>> before that, dtv was only updated up to the generation of the
>> module being accessed for a particular tls access.
>>
>> so hitting the free in the dtv update path is now more likely
>> but the free is not new, it was there before.
>>
>> also note that this is unlikely to happen on aarch64 since
>> tlsdesc only does dynamic tls access after a 512byte static
>> tls reservation runs out.
>>
>>>
>>> On the glibc side, we should recommend that intercepting mallocs and its
>>> dependencies use initial-exec TLS because that kind of TLS does not use
>>> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
>>> totally by accident, and was in the past helped by glibc bug 19924.  (I
>>
>> right.
>>
>>> don't think there is anything special about libgcc_s.so.1 that triggers
>>> the test failure above, it is just an object with dynamic TLS that is
>>> implicitly loaded via dlopen at the right stage of the test.)  In this
>>> particular case, we can also paper over the test failure in glibc by not
>>> call free at all because the argument is a null pointer:
>>>
>>> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
>>> index 7b3dd9ab60..14c71cbd06 100644
>>> --- a/elf/dl-tls.c
>>> +++ b/elf/dl-tls.c
>>> @@ -819,7 +819,8 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
>>>  		 dtv entry free it.  Note: this is not AS-safe.  */
>>>  	      /* XXX Ideally we will at some point create a memory
>>>  		 pool.  */
>>> -	      free (dtv[modid].pointer.to_free);
>>> +	      if (dtv[modid].pointer.to_free != NULL)
>>> +		free (dtv[modid].pointer.to_free);
>>>  	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>>>  	      dtv[modid].pointer.to_free = NULL;
>>
>> can be done, but !=NULL is more likely since we do modid reuse
>> after dlclose.
>>
>> there is also a realloc in dtv resizing which happens when more
>> than 16 modules with tls are loaded after thread creation
>> (DTV_SURPLUS).
>>
>> i'm not sure if it's worth supporting malloc interposers that
>> only work sometimes.
>>
> 
> Maybe one option would to try reinstate the async-signal-safe TLS
> code to avoid malloc/free in dynamic TLS altogether.  We revert it on
> 2.14 release cause it broke ASAN/LSAN [1], but I think we might try 
> to reinstate on 2.40 and work with sanitizer project to get this sort 
> out.

I agree. TLS should be seen more like .bss/.data rather than something that is allocated
with malloc().

If we leak memory via TLS that is a glibc bug that we can deal with, but making it easier
to find glibc bugs is also a  benefit to the community, but not as valuable a benefit as
making TLS correctly async-signal safe.

Likewise we need to discuss when the memory is allocated, regardless of which allocator is
used, including allocation up-front at dlopen() time. 
> [1] https://sourceware.org/pipermail/libc-alpha/2014-January/047931.html
>
Florian Weimer Jan. 15, 2024, 3:35 p.m. UTC | #4
* Carlos O'Donell:

> I agree. TLS should be seen more like .bss/.data rather than something
> that is allocated with malloc().

There wasn't consensus regarding this in 2014.  See below.

> If we leak memory via TLS that is a glibc bug that we can deal with,

This is something that libgcc_s.so.1 does in GCC 14 if the heap
trampolines are used.

> but making it easier to find glibc bugs is also a benefit to the
> community, but not as valuable a benefit as making TLS correctly
> async-signal safe.
>
> Likewise we need to discuss when the memory is allocated, regardless
> of which allocator is used, including allocation up-front at dlopen()
> time.
>> [1] https://sourceware.org/pipermail/libc-alpha/2014-January/047931.html

The change conflated multiple issues: sanitizer support,
async-signal-safe TLS access, and eager allocation of all TLS-related
memory, so that subsequent accesses cannot fail.  My impression was the
main point of contention was eager allocation because it was perceived
as a breaking semantic change.  Nowadays, as long as we are willing to
maintain both allocator variants, we could offer a choice between them
controlled by a tunable.  For sanitizer compatibility, we could perform
eager allocation using malloc.  It's probably a good idea to do it this
way anyway because a separate mmap-based allocator would increase TLB
pressure.

Thanks,
Florian
Iain Sandoe Jan. 15, 2024, 3:38 p.m. UTC | #5
> On 15 Jan 2024, at 15:35, Florian Weimer <fweimer@redhat.com> wrote:
> 
> * Carlos O'Donell:
> 
>> I agree. TLS should be seen more like .bss/.data rather than something
>> that is allocated with malloc().
> 
> There wasn't consensus regarding this in 2014.  See below.
> 
>> If we leak memory via TLS that is a glibc bug that we can deal with,
> 
> This is something that libgcc_s.so.1 does in GCC 14 if the heap
> trampolines are used.

Is there a GCC BZ for this?
(if there is something we should address in GCC, it would be better sooner)

Iain


>> but making it easier to find glibc bugs is also a benefit to the
>> community, but not as valuable a benefit as making TLS correctly
>> async-signal safe.
>> 
>> Likewise we need to discuss when the memory is allocated, regardless
>> of which allocator is used, including allocation up-front at dlopen()
>> time.
>>> [1] https://sourceware.org/pipermail/libc-alpha/2014-January/047931.html
> 
> The change conflated multiple issues: sanitizer support,
> async-signal-safe TLS access, and eager allocation of all TLS-related
> memory, so that subsequent accesses cannot fail.  My impression was the
> main point of contention was eager allocation because it was perceived
> as a breaking semantic change.  Nowadays, as long as we are willing to
> maintain both allocator variants, we could offer a choice between them
> controlled by a tunable.  For sanitizer compatibility, we could perform
> eager allocation using malloc.  It's probably a good idea to do it this
> way anyway because a separate mmap-based allocator would increase TLB
> pressure.
> 
> Thanks,
> Florian
>
Joseph Myers Jan. 15, 2024, 4:29 p.m. UTC | #6
On Mon, 15 Jan 2024, Florian Weimer via Gcc wrote:

> The change conflated multiple issues: sanitizer support,
> async-signal-safe TLS access, and eager allocation of all TLS-related
> memory, so that subsequent accesses cannot fail.  My impression was the
> main point of contention was eager allocation because it was perceived
> as a breaking semantic change.  Nowadays, as long as we are willing to
> maintain both allocator variants, we could offer a choice between them
> controlled by a tunable.  For sanitizer compatibility, we could perform
> eager allocation using malloc.  It's probably a good idea to do it this
> way anyway because a separate mmap-based allocator would increase TLB
> pressure.

Related to eager allocation is the question of whether libgcc_s.so.1 
should be loaded unconditionally by glibc at startup - doing so has much 
the same motivation (avoiding subsequent failures from interfaces that 
don't have any good way to signal failure, when glibc currently loads 
libgcc_s.so.1 dynamically), but also no doubt compatibility risk.
Florian Weimer Jan. 15, 2024, 4:44 p.m. UTC | #7
* Iain Sandoe:

>> On 15 Jan 2024, at 15:35, Florian Weimer <fweimer@redhat.com> wrote:
>> 
>> * Carlos O'Donell:
>> 
>>> I agree. TLS should be seen more like .bss/.data rather than something
>>> that is allocated with malloc().
>> 
>> There wasn't consensus regarding this in 2014.  See below.
>> 
>>> If we leak memory via TLS that is a glibc bug that we can deal with,
>> 
>> This is something that libgcc_s.so.1 does in GCC 14 if the heap
>> trampolines are used.
>
> Is there a GCC BZ for this?
> (if there is something we should address in GCC, it would be better sooner)

Sorry, I wanted to write a reproducer first.  With it, I found two more
issue.

  Memory (resource) leak in -ftrampoline-impl=heap
  <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113401>

  Incorrect symbol versions for __builtin_nested_func_ptr_created,
  __builtin_nested_func_ptr in libgcc_s.so.1
  <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113402>

  __builtin_nested_func_ptr_created, __builtin_nested_func_ptr should be
  dynamically linked by default
  <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113403>

Thanks,
Florian
Mathieu Desnoyers Jan. 15, 2024, 7:05 p.m. UTC | #8
On 2024-01-13 07:49, Florian Weimer via lttng-dev wrote:
> This commit
> 
> commit 8abddb187b33480d8827f44ec655f45734a1749d
> Author: Andrew Burgess <andrew.burgess@embecosm.com>
> Date:   Sat Aug 5 14:31:06 2023 +0200
> 
>      libgcc: support heap-based trampolines
>      
>      Add support for heap-based trampolines on x86_64-linux, aarch64-linux,
>      and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
>      __builtin_nested_func_ptr_deleted functions for these targets.
>      
>      Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
>      Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
>      Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
> 
> added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
> built, it ends up using a dynamic TLS variant on the Linux targets.
> This means that there is no up-front TLS allocation with glibc (but
> there would be one with musl).

Trying to wrap my head around this:

If I get this right, the previous behavior was that glibc did allocate
global-dynamic variables from libraries which are preloaded and loaded
on c startup as if they were initial-exec, but now that libgcc_s.so.1
has a dynamic TLS variable, all those libraries loaded on c startup that
have global-dynamic TLS do not get the initial allocation special
treatment anymore. Is that more or less correct ?

(note: it's entirely possible that my understanding is entirely wrong,
please correct me if it's the case)

> 
> There is still a compatibility impact because glibc assigns a TLS module
> ID upfront.  This seems to be what causes the
> ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
> up with an infinite regress during process termination because
> libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
> happens, the bottom of the stack looks like this:
> 
> #4447 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
> #4448 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>      at ../include/rtld-malloc.h:50
> #4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
> #4450 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f2bfc0,
>      gen=<optimized out>) at ../elf/dl-tls.c:916
> #4451 0x00007ffff7fddccc in __tls_get_addr ()
>      at ../sysdeps/x86_64/tls_get_addr.S:55
> #4452 0x00007ffff7f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
> #4453 0x00007ffff7fdb142 in free (ptr=<optimized out>)
>      at ../include/rtld-malloc.h:50
> #4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
> #4455 0x00007ffff7fdb214 in update_get_addr (ti=0x7ffff7f39fa0,
>      gen=<optimized out>) at ../elf/dl-tls.c:916
> #4456 0x00007ffff7fddccc in __tls_get_addr ()
>      at ../sysdeps/x86_64/tls_get_addr.S:55
> #4457 0x00007ffff7f36113 in lttng_ust_cancelstate_disable_push ()
>     from /lib64/liblttng-ust-common.so.1
> #4458 0x00007ffff7f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
> #4459 0x00007ffff7f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
> #4460 0x00007ffff7fca0f2 in _dl_call_fini (
>      closure_map=closure_map@entry=0x7ffff7fbe000) at dl-call_fini.c:43
> #4461 0x00007ffff7fce06e in _dl_fini () at dl-fini.c:114
> #4462 0x00007ffff7d82fe6 in __run_exit_handlers () from /lib64/libc.so.6
> 
> Cc:ing <lttng-dev@lists.lttng.org> for awareness.

I've prepared a change for lttng-ust to move the lttng-ust libc wrapper
"malloc nesting" guard variable from global-dynamic to initial-exec:

https://review.lttng.org/c/lttng-ust/+/11677 Fix: libc wrapper: use initial-exec for malloc_nesting TLS

This should help for the infinite recursion issue, but if my understanding
is correct about the impact of effectively changing the behavior used
for global-dynamic variables in preloaded and on-startup-loaded libraries
introduced by this libgcc change, I suspect we have other new issues here,
such as problems with async-signal safety of other global-dynamic variables
within LTTng-UST.

But moving all TLS variables used by lttng-ust from global-dynamic to
initial-exec is tricky, because a prior attempt to do so introduced regressions
in use-cases where lttng-ust was dlopen'd by Java or Python, AFAIU situations
where the runtimes were already using most of the extra memory pool for
dlopen'd libraries initial-exec variables, causing dlopen of lttng-ust
to fail.

Thanks Florian for letting us know about this,

Mathieu

> 
> The issue also requires a recent glibc with changes to DTV management:
> commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
> access after dlopen [BZ #19924]").  If I understand things correctly,
> before this glibc change, we didn't deallocate the old DTV, so there was
> no call to the free function.
> 
> On the glibc side, we should recommend that intercepting mallocs and its
> dependencies use initial-exec TLS because that kind of TLS does not use
> malloc.  If intercepting mallocs using dynamic TLS work at all, that's
> totally by accident, and was in the past helped by glibc bug 19924.  (I
> don't think there is anything special about libgcc_s.so.1 that triggers
> the test failure above, it is just an object with dynamic TLS that is
> implicitly loaded via dlopen at the right stage of the test.)  In this
> particular case, we can also paper over the test failure in glibc by not
> call free at all because the argument is a null pointer:
> 
> diff --git a/elf/dl-tls.c b/elf/dl-tls.c
> index 7b3dd9ab60..14c71cbd06 100644
> --- a/elf/dl-tls.c
> +++ b/elf/dl-tls.c
> @@ -819,7 +819,8 @@ _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
>   		 dtv entry free it.  Note: this is not AS-safe.  */
>   	      /* XXX Ideally we will at some point create a memory
>   		 pool.  */
> -	      free (dtv[modid].pointer.to_free);
> +	      if (dtv[modid].pointer.to_free != NULL)
> +		free (dtv[modid].pointer.to_free);
>   	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
>   	      dtv[modid].pointer.to_free = NULL;
> 
> As the comment hints, we shouldn't be using malloc for TLS memory at all
> because it is not AS-safe, but that's a long-term change.  This change
> seems rather specific to this particular test case failure because it
> relies on libgcc_s.so.1 never using TLS before it gets unloaded.
> 
> Regarding the libgcc_s side, I'm not sure if the TLS usage there should
> be considered a real problem, although I'm a bit nervous about it.
> However, the current implementation caches one page of trampolines past
> the outermost nested function pointer deallocation (otherwise creating
> one function pointer per thread in a loop would be really expensive).
> It looks to me that is never freed, so if the thread exits even with
> proper unwinding (e.g., on glibc with code compiled with -fexceptions),
> there is a memory leak.  Integration with glibc could avoid this issue,
> and also help with the longjmp problem, and fix setcontext/swapcontext,
> too.
> 
> Thanks,
> Florian
> 
> _______________________________________________
> lttng-dev mailing list
> lttng-dev@lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Florian Weimer Jan. 15, 2024, 7:42 p.m. UTC | #9
* Mathieu Desnoyers:

> On 2024-01-13 07:49, Florian Weimer via lttng-dev wrote:
>> This commit
>> commit 8abddb187b33480d8827f44ec655f45734a1749d
>> Author: Andrew Burgess <andrew.burgess@embecosm.com>
>> Date:   Sat Aug 5 14:31:06 2023 +0200
>>      libgcc: support heap-based trampolines
>>           Add support for heap-based trampolines on x86_64-linux,
>> aarch64-linux,
>>      and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
>>      __builtin_nested_func_ptr_deleted functions for these targets.
>>           Co-Authored-By: Maxim Blinov <maxim.blinov@embecosm.com>
>>      Co-Authored-By: Iain Sandoe <iain@sandoe.co.uk>
>>      Co-Authored-By: Francois-Xavier Coudert <fxcoudert@gcc.gnu.org>
>> added TLS usage to libgcc_s.so.1.  The way that libgcc_s is
>> currently
>> built, it ends up using a dynamic TLS variant on the Linux targets.
>> This means that there is no up-front TLS allocation with glibc (but
>> there would be one with musl).
>
> Trying to wrap my head around this:
>
> If I get this right, the previous behavior was that glibc did allocate
> global-dynamic variables from libraries which are preloaded and loaded
> on c startup as if they were initial-exec, but now that libgcc_s.so.1
> has a dynamic TLS variable, all those libraries loaded on c startup that
> have global-dynamic TLS do not get the initial allocation special
> treatment anymore. Is that more or less correct ?

Ahh.  I had forgotten about this aspect.  The allocation from the static
TLS area still happens as before.

> I've prepared a change for lttng-ust to move the lttng-ust libc wrapper
> "malloc nesting" guard variable from global-dynamic to initial-exec:
>
> https://review.lttng.org/c/lttng-ust/+/11677 Fix: libc wrapper: use initial-exec for malloc_nesting TLS

I don't know if this is completely sufficient if there are other TLS
variables in the lttng stack.

> This should help for the infinite recursion issue, but if my understanding
> is correct about the impact of effectively changing the behavior used
> for global-dynamic variables in preloaded and on-startup-loaded libraries
> introduced by this libgcc change, I suspect we have other new issues here,
> such as problems with async-signal safety of other global-dynamic variables
> within LTTng-UST.

This didn't change, and the allocation is not done lazily (contrary to
what I might have written before).  But even on the __tls_get_addr fast
path, we check the TLS generation counter, and if that has changed, we
do extra bookkeeping work.  TLS usage in libgcc_s.so.1 means that in the
now-failing test, the generation counter changed.  Before bug 19924

  TLS performance degradation after dlopen 
  <https://sourceware.org/bugzilla/show_bug.cgi?id=19924>

was fixed, we did not do this bookkeeping work, which is why the problem
didn't occur.

General use of lttng should be fine, I think, only the malloc wrapper
has this problem.

> But moving all TLS variables used by lttng-ust from global-dynamic to
> initial-exec is tricky, because a prior attempt to do so introduced
> regressions in use-cases where lttng-ust was dlopen'd by Java or
> Python, AFAIU situations where the runtimes were already using most of
> the extra memory pool for dlopen'd libraries initial-exec variables,
> causing dlopen of lttng-ust to fail.

Oh, right, that makes it quite difficult.  Could you link a private copy
of the libraries into the wrapper that uses initial-exec TLS?

Thanks,
Florian
Mathieu Desnoyers Jan. 15, 2024, 8:08 p.m. UTC | #10
On 2024-01-15 14:42, Florian Weimer wrote:
> * Mathieu Desnoyers:
> 
[...]
> 
> General use of lttng should be fine, I think, only the malloc wrapper
> has this problem.

The purpose of the nesting counter TLS variable in the malloc wrapper
is to catch situations like this where a global-dynamic TLS access
(or any unexpected memory access done as a side-effect from calling
libc) from within LTTng-UST instrumentation would internally attempt to
call recursively into the malloc wrapper. In that nested case, we skip
the instrumentation and call the libc function directly.

I agree with your conclusion that only this nesting counter gating variable
actually needs to be initial-exec.

> 
>> But moving all TLS variables used by lttng-ust from global-dynamic to
>> initial-exec is tricky, because a prior attempt to do so introduced
>> regressions in use-cases where lttng-ust was dlopen'd by Java or
>> Python, AFAIU situations where the runtimes were already using most of
>> the extra memory pool for dlopen'd libraries initial-exec variables,
>> causing dlopen of lttng-ust to fail.
> 
> Oh, right, that makes it quite difficult.  Could you link a private copy
> of the libraries into the wrapper that uses initial-exec TLS?

Unfortunately not easily, because by design LTTng-UST is meant to be a
singleton per-process. Changing this would have far-reaching impacts on
interactions with the LTTng-UST tracepoint instrumentation, as well as
impacts on synchronization between the LTTng-UST agent thread and
application calling fork/clone. Also AFAIR, the LTTng session daemon
(at least until recently) does not expect multiple concurrent
registrations from a given process.

Thanks,

Mathieu
diff mbox series

Patch

diff --git a/elf/dl-tls.c b/elf/dl-tls.c
index 7b3dd9ab60..14c71cbd06 100644
--- a/elf/dl-tls.c
+++ b/elf/dl-tls.c
@@ -819,7 +819,8 @@  _dl_update_slotinfo (unsigned long int req_modid, size_t new_gen)
 		 dtv entry free it.  Note: this is not AS-safe.  */
 	      /* XXX Ideally we will at some point create a memory
 		 pool.  */
-	      free (dtv[modid].pointer.to_free);
+	      if (dtv[modid].pointer.to_free != NULL)
+		free (dtv[modid].pointer.to_free);
 	      dtv[modid].pointer.val = TLS_DTV_UNALLOCATED;
 	      dtv[modid].pointer.to_free = NULL;