Message ID | CAMe9rOrOvwrCr5stJ+TkNFczmFYP6+yQLRwmTjjEdQCOguXyXg@mail.gmail.com |
---|---|
State | New |
Headers | show |
On 07/04/2017 10:28 PM, H.J. Lu wrote: > 2017-07-04 Florian Weimer <fweimer@redhat.com> > H.J. Lu <hongjiu.lu@intel.com> > > [BZ #21609] > * sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr. > (gen-as-const-headers): Add rtld-offsets.sym. > * sysdeps/x86_64/dl-tls.c: New file. > * sysdeps/x86_64/rtld-offsets.sym: Likwise. > * sysdeps/x86_64/tls_get_addr.S: Likewise. > * sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards. > * sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New. > (TI_OFFSET_OFFSET): Likwise. I verified that the addition of __tls_get_addr_slow does not affect the inlining decisions made by GCC 6, so I think this patch is okay. Would you please commit it? I would like to backport it to older glibc branches as well (at least 2.25 and 2.24). I'm not sure if the second patch is actually needed, and if the new symbol is the right way to do it. It penalizes those who use modern and fixed compilers by making their lives more complicated, while we should do it the other way round (those who use unfixed compilers should suffer, at least if the choice is between the two outcomes). Thanks, Florian
On 07/06/2017 07:34 AM, Florian Weimer wrote: > On 07/04/2017 10:28 PM, H.J. Lu wrote: >> 2017-07-04 Florian Weimer <fweimer@redhat.com> >> H.J. Lu <hongjiu.lu@intel.com> >> >> [BZ #21609] >> * sysdeps/x86_64/Makefile (sysdep-dl-routines): Add tls_get_addr. >> (gen-as-const-headers): Add rtld-offsets.sym. >> * sysdeps/x86_64/dl-tls.c: New file. >> * sysdeps/x86_64/rtld-offsets.sym: Likwise. >> * sysdeps/x86_64/tls_get_addr.S: Likewise. >> * sysdeps/x86_64/dl-tls.h: Add multiple inclusion guards. >> * sysdeps/x86_64/tlsdesc.sym (TI_MODULE_OFFSET): New. >> (TI_OFFSET_OFFSET): Likwise. > > I verified that the addition of __tls_get_addr_slow does not affect the > inlining decisions made by GCC 6, so I think this patch is okay. Would > you please commit it? > > I would like to backport it to older glibc branches as well (at least > 2.25 and 2.24). > > I'm not sure if the second patch is actually needed, and if the new > symbol is the right way to do it. It penalizes those who use modern and > fixed compilers by making their lives more complicated, while we should > do it the other way round (those who use unfixed compilers should > suffer, at least if the choice is between the two outcomes). Isn't the "suffering" just that you cannot build a compiler that is both compatible with the old glibc (performance loss) and new glibc (fastest performance)? You have to choose one.
On 07/06/2017 04:30 PM, Carlos O'Donell wrote: >> I'm not sure if the second patch is actually needed, and if the new >> symbol is the right way to do it. It penalizes those who use modern and >> fixed compilers by making their lives more complicated, while we should >> do it the other way round (those who use unfixed compilers should >> suffer, at least if the choice is between the two outcomes). > > Isn't the "suffering" just that you cannot build a compiler that is both > compatible with the old glibc (performance loss) and new glibc (fastest > performance)? No, there is also the programmer inconvenience that they'd have to deal with yet another -m flag. If it's unconditional, you wouldn't be able to use the new GCC with the alternate symbol enabled for glibc development (stable branch maintenance, to be specific). It's that kind of complexity I'm objecting to. Thanks, Florian
On 07/06/2017 10:35 AM, Florian Weimer wrote: > On 07/06/2017 04:30 PM, Carlos O'Donell wrote: > >>> I'm not sure if the second patch is actually needed, and if the new >>> symbol is the right way to do it. It penalizes those who use modern and >>> fixed compilers by making their lives more complicated, while we should >>> do it the other way round (those who use unfixed compilers should >>> suffer, at least if the choice is between the two outcomes). >> >> Isn't the "suffering" just that you cannot build a compiler that is both >> compatible with the old glibc (performance loss) and new glibc (fastest >> performance)? > > No, there is also the programmer inconvenience that they'd have to deal > with yet another -m flag. > > If it's unconditional, you wouldn't be able to use the new GCC with the > alternate symbol enabled for glibc development (stable branch > maintenance, to be specific). > > It's that kind of complexity I'm objecting to. The complexity is a consequence of the bug and our desire to attain optimal performance for __tls_get_addr calls, which are integral to __thread usage. Objecting abstractly to the complexity does not take into account the entire set of consequences surrounding this problem. The objections must be balanced against the gains being sought here, and the performance our users want. As far as I understand it there are two paths we could take: (a) __tls_get_addr aligns the stack for you if required, with a fast and slow path, everyone pays the small cost to check fir misalignment, and only the broken old binaries are forced to realign. This change works for old and new binaries without any need to coordinate between the compiler and the runtime. Pro: No gcc/glibc coordination (no additional flags). Con: All applications pay a small cost to check. (b) We add a ___tls_get_addr which assumes alignment of the stack and a newly configured gcc that notices the system glibc provides this symbol automatically defaults to building with this option enabled. One of: (b.1) A configure option to disable the use of the new ___tls_get_addr. or (b.2) A new machine option -m<disable tls align>/-m<enable tls align> option is used to control this feature. Both b.1 and b.2 change the ABI. Pro: Fastest solution. Only old applications pay the check+align cost. Con: All application developers must ensure gcc/glibc coordination. When building a new gcc with a new glibc, the resulting compiler defaults to the new ABI, and if you need to build an old glibc you must add to your CFLAGS the right -m<disable tls align>. So the question becomes: How many users build their gcc/glibc and need to handle the coordination? Are we asking all future users to pay a cost to simplify developer workflow? As much as I don't like the complexity, I see the inherent value to our users in what HJ is proposing.
On 07/06/2017 04:52 PM, Carlos O'Donell wrote: > As far as I understand it there are two paths we could take: > > (a) __tls_get_addr aligns the stack for you if required, with a fast and > slow path, everyone pays the small cost to check fir misalignment, and > only the broken old binaries are forced to realign. This change works > for old and new binaries without any need to coordinate between the > compiler and the runtime. > > Pro: No gcc/glibc coordination (no additional flags). > Con: All applications pay a small cost to check. > > (b) We add a ___tls_get_addr which assumes alignment of the stack and a > newly configured gcc that notices the system glibc provides this symbol > automatically defaults to building with this option enabled. > One of: > (b.1) A configure option to disable the use of the new ___tls_get_addr. > or > (b.2) A new machine option -m<disable tls align>/-m<enable tls align> > option is used to control this feature. > Both b.1 and b.2 change the ABI. > > Pro: Fastest solution. Only old applications pay the check+align cost. > Con: All application developers must ensure gcc/glibc coordination. > When building a new gcc with a new glibc, the resulting compiler defaults > to the new ABI, and if you need to build an old glibc you must add to > your CFLAGS the right -m<disable tls align>. There's also this option: (c) glibc provides __tls_get_addr (two leading underscores) under a new symbol version. Pro: No GCC changes required for upstream maintained versions. No problems for glibc development, no new -m flag. As fast as (b) after relinking against a newer glibc. Breakage with older/unfixed GCC is detectable during development (unlike the reason for the workaround, which would have needed clairvoyance on the developer's part). Con: Relinking code compiled with the old, unfixed GCC against the new glibc still produces broken binaries. I think (c) is better than (b). But (c) vs (a) is a tough call. Thanks, Florian
On Thu, Jul 6, 2017 at 8:07 AM, Florian Weimer <fweimer@redhat.com> wrote: > On 07/06/2017 04:52 PM, Carlos O'Donell wrote: > >> As far as I understand it there are two paths we could take: >> >> (a) __tls_get_addr aligns the stack for you if required, with a fast and >> slow path, everyone pays the small cost to check fir misalignment, and >> only the broken old binaries are forced to realign. This change works >> for old and new binaries without any need to coordinate between the >> compiler and the runtime. >> >> Pro: No gcc/glibc coordination (no additional flags). >> Con: All applications pay a small cost to check. >> >> (b) We add a ___tls_get_addr which assumes alignment of the stack and a >> newly configured gcc that notices the system glibc provides this symbol >> automatically defaults to building with this option enabled. >> One of: >> (b.1) A configure option to disable the use of the new ___tls_get_addr. >> or >> (b.2) A new machine option -m<disable tls align>/-m<enable tls align> >> option is used to control this feature. >> Both b.1 and b.2 change the ABI. >> >> Pro: Fastest solution. Only old applications pay the check+align cost. >> Con: All application developers must ensure gcc/glibc coordination. >> When building a new gcc with a new glibc, the resulting compiler defaults >> to the new ABI, and if you need to build an old glibc you must add to >> your CFLAGS the right -m<disable tls align>. > > There's also this option: > > (c) glibc provides __tls_get_addr (two leading underscores) > under a new symbol version. > > Pro: No GCC changes required for upstream maintained versions. > No problems for glibc development, no new -m flag. > As fast as (b) after relinking against a newer glibc. > Breakage with older/unfixed GCC is detectable during development > (unlike the reason for the workaround, which would have needed > clairvoyance on the developer's part). > > Con: Relinking code compiled with the old, unfixed GCC against the > new glibc still produces broken binaries. > (c) doesn't fix the problem for unfixed GCC at all. As far as unfixed GCC is concerned, nothing is changed. For fixed GCC, it is the same as without the bug for [BZ #21609]. I think (c) is the least suitable option.
On 07/06/2017 05:11 PM, H.J. Lu wrote: >> There's also this option: >> >> (c) glibc provides __tls_get_addr (two leading underscores) >> under a new symbol version. >> >> Pro: No GCC changes required for upstream maintained versions. >> No problems for glibc development, no new -m flag. >> As fast as (b) after relinking against a newer glibc. >> Breakage with older/unfixed GCC is detectable during development >> (unlike the reason for the workaround, which would have needed >> clairvoyance on the developer's part). >> >> Con: Relinking code compiled with the old, unfixed GCC against the >> new glibc still produces broken binaries. >> > > (c) doesn't fix the problem for unfixed GCC at all. As far as unfixed GCC > is concerned, nothing is changed. It fixes existing binaries by addressing the implied ABI breakage by the glibc update. It's my strong belief that compiler bugs should be fixed in the compiler, not in glibc. Thanks, Florian
On Thu, 6 Jul 2017, Carlos O'Donell wrote: > (a) __tls_get_addr aligns the stack for you if required, with a fast and > slow path, everyone pays the small cost to check fir misalignment, and > only the broken old binaries are forced to realign. This change works > for old and new binaries without any need to coordinate between the > compiler and the runtime. > > Pro: No gcc/glibc coordination (no additional flags). > Con: All applications pay a small cost to check. No, this is not entirely accurate. Only *on first access to a given TLS object in a given thread* you go via the slow path and need to realign the stack. In which case the work to realign the stack is _tiny_ compared to the amount of work Glibc already does on initial TLS access. Repeated accesses to TLS data can go via the fast path and *not even check stack alignment*. There's no additional cost there. There is no need to introduce new symbols nor symbol versions. It's sufficient to implement amd64 __tls_get_attr in assembly, rename generic __tls_get_attr, and call the renamed generic implementation from the slow path after realigning the stack. (it's might be even possible to avoid "in assembly" part by relying on optimization and GCC extensions to achieve the same) Alexander
On Thu, 6 Jul 2017, Alexander Monakov wrote: > There is no need to introduce new symbols nor symbol versions. It's > sufficient to implement amd64 __tls_get_attr in assembly, rename generic > __tls_get_attr, and call the renamed generic implementation from the slow > path after realigning the stack. Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need aligned stack, and its callees update_get_addr and tls_get_addr_tail are already marked noinline, so isn't it sufficient to additionally mark those with __attribute__((force_align_arg_pointer)) on x86-64? Alexander
On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote: > On Thu, 6 Jul 2017, Alexander Monakov wrote: >> There is no need to introduce new symbols nor symbol versions. It's >> sufficient to implement amd64 __tls_get_attr in assembly, rename generic >> __tls_get_attr, and call the renamed generic implementation from the slow >> path after realigning the stack. > > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need > aligned stack, and its callees update_get_addr and tls_get_addr_tail are > already marked noinline, so isn't it sufficient to additionally mark those > with __attribute__((force_align_arg_pointer)) on x86-64? > Please take a look at the current master branch. I have pushed a fix today: https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5
On Thu, 6 Jul 2017, H.J. Lu wrote: > On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote: > > On Thu, 6 Jul 2017, Alexander Monakov wrote: > > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need > > aligned stack, and its callees update_get_addr and tls_get_addr_tail are > > already marked noinline, so isn't it sufficient to additionally mark those > > with __attribute__((force_align_arg_pointer)) on x86-64? > > > > Please take a look at the current master branch. I have pushed a fix > today: > > https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5 Right, but my point was that by simply adding the attribute to two existing noinline functions all that new complexity could have been avoided. What's the point of the proposed new ___tls_get_addr? Avoiding realigning the stack in cases when Glibc has to do rather complex work anyway? Is it even possible to measure the difference? Alexander
On 07/06/2017 11:15 AM, Florian Weimer wrote: > On 07/06/2017 05:11 PM, H.J. Lu wrote: > >>> There's also this option: >>> >>> (c) glibc provides __tls_get_addr (two leading underscores) >>> under a new symbol version. >>> >>> Pro: No GCC changes required for upstream maintained versions. >>> No problems for glibc development, no new -m flag. >>> As fast as (b) after relinking against a newer glibc. >>> Breakage with older/unfixed GCC is detectable during development >>> (unlike the reason for the workaround, which would have needed >>> clairvoyance on the developer's part). >>> >>> Con: Relinking code compiled with the old, unfixed GCC against the >>> new glibc still produces broken binaries. >>> >> >> (c) doesn't fix the problem for unfixed GCC at all. As far as unfixed GCC >> is concerned, nothing is changed. > > It fixes existing binaries by addressing the implied ABI breakage by the > glibc update. > > It's my strong belief that compiler bugs should be fixed in the > compiler, not in glibc. The GNU C Library is part of the GNU Toolchain and the implementation. We should work together to offer the best solution to our users. In the case of (a), and (b) (assuming you don't downgrade your glibc, which is dangerous for other reasons) it works even if the packages are rebuilt, and that's an important "Pro" benefit. I'd say choosing between (a) and (b) is what's really up for discussion. Yes (c) is a simple option, but in the end, given how distributions operate, symbol versioning should be our last choice (one which we don't have to make in this case). I think we've pretty much agreed that (b) is going to be the solution for glibc 2.26, and HJ has checked in changes to that effect. Future optimizations about ___tls_get_addr can be discussed later.
On 07/06/2017 11:44 AM, Alexander Monakov wrote: > On Thu, 6 Jul 2017, H.J. Lu wrote: > >> On Thu, Jul 6, 2017 at 8:32 AM, Alexander Monakov <amonakov@ispras.ru> wrote: >>> On Thu, 6 Jul 2017, Alexander Monakov wrote: >>> Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need >>> aligned stack, and its callees update_get_addr and tls_get_addr_tail are >>> already marked noinline, so isn't it sufficient to additionally mark those >>> with __attribute__((force_align_arg_pointer)) on x86-64? >>> >> >> Please take a look at the current master branch. I have pushed a fix >> today: >> >> https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5 > > Right, but my point was that by simply adding the attribute to two existing > noinline functions all that new complexity could have been avoided. > > What's the point of the proposed new ___tls_get_addr? Avoiding realigning > the stack in cases when Glibc has to do rather complex work anyway? Is it > even possible to measure the difference? Option (b) has been chosen for this release, so that's what we're doing for the release, but we can continue to discuss the optimization here. (1) Cost. No matter what you do it's extra instructions on the hot path forever. (2) Small changes add up. Measuring small changes like this is hard, and what about the sum of the impact on a whole system, or systems? Power? CPU cycles? All of those have costs for something as fundamental as __tls_get_addr. The extra instructions do have i-cache pressure (which might be worked around by using an 'unlikely' section far away to store code, which complicates jumping to it). (3) Better benchmarks. It would be nice to add a TLS microbenchmark into the suite and see if it's measurable in a longer running test. Objective data would help settle the decision on if the optimization is worth the developer complexity.
On Thu, 6 Jul 2017, Carlos O'Donell wrote: > Option (b) has been chosen for this release, so that's what we're doing for > the release, but we can continue to discuss the optimization here. > > (1) Cost. > > No matter what you do it's extra instructions on the hot path forever. What extra instructions on the hot path? I don't see any. There's no need to check stack alignment on the hot path -- only on the slow path. Alexadner
On 07/06/2017 05:32 PM, Alexander Monakov wrote: > On Thu, 6 Jul 2017, Alexander Monakov wrote: >> There is no need to introduce new symbols nor symbol versions. It's >> sufficient to implement amd64 __tls_get_attr in assembly, rename generic >> __tls_get_attr, and call the renamed generic implementation from the slow >> path after realigning the stack. > > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need > aligned stack, and its callees update_get_addr and tls_get_addr_tail are > already marked noinline, so isn't it sufficient to additionally mark those > with __attribute__((force_align_arg_pointer)) on x86-64? I'm not sure if GCC will honor force_align_arg_pointer on a static function which provably called with a properly aligned stack only. We'd need another attribute on __tls_get_addr to tell GCC that it is called with a misaligned stack pointer. Thanks, Florian
On 07/06/2017 05:44 PM, Alexander Monakov wrote: > On Thu, 6 Jul 2017, H.J. Lu wrote: >> Please take a look at the current master branch. I have pushed a fix >> today: >> >> https://sourceware.org/git/?p=glibc.git;a=commit;h=031e519c95c069abe4e4c7c59e2b4b67efccdee5 > Right, but my point was that by simply adding the attribute to two existing > noinline functions all that new complexity could have been avoided. At least the complexity is restricted to the x86-64 port. Injecting the x86-specific function attribute into generic code is not exactly trivial, either. Thanks, Florian
On 07/06/2017 06:23 PM, Alexander Monakov wrote: > On Thu, 6 Jul 2017, Carlos O'Donell wrote: >> Option (b) has been chosen for this release, so that's what we're doing for >> the release, but we can continue to discuss the optimization here. >> >> (1) Cost. >> >> No matter what you do it's extra instructions on the hot path forever. > > What extra instructions on the hot path? I don't see any. There's no > need to check stack alignment on the hot path -- only on the slow path. Indeed, the x86-64 fast path is identical to the GCC-compiled version in current master. Florian
On Thu, Jul 06, 2017 at 10:15:36PM +0200, Florian Weimer wrote: > On 07/06/2017 05:32 PM, Alexander Monakov wrote: > > On Thu, 6 Jul 2017, Alexander Monakov wrote: > >> There is no need to introduce new symbols nor symbol versions. It's > >> sufficient to implement amd64 __tls_get_attr in assembly, rename generic > >> __tls_get_attr, and call the renamed generic implementation from the slow > >> path after realigning the stack. > > > > Actually, looking at elf/dl-tls.c, I see that __tls_get_addr doesn't need > > aligned stack, and its callees update_get_addr and tls_get_addr_tail are > > already marked noinline, so isn't it sufficient to additionally mark those > > with __attribute__((force_align_arg_pointer)) on x86-64? > > I'm not sure if GCC will honor force_align_arg_pointer on a static > function which provably called with a properly aligned stack only. We'd > need another attribute on __tls_get_addr to tell GCC that it is called > with a misaligned stack pointer. If it is not inlined, it will honor the attribute. But if you want to be double-sure, add used attribute to the function, that means it may be called in means not known to the compiler. Jakub
On 07/06/2017 06:08 PM, Carlos O'Donell wrote: > On 07/06/2017 11:15 AM, Florian Weimer wrote: >> On 07/06/2017 05:11 PM, H.J. Lu wrote: >> >>>> There's also this option: >>>> >>>> (c) glibc provides __tls_get_addr (two leading underscores) >>>> under a new symbol version. >>>> >>>> Pro: No GCC changes required for upstream maintained versions. >>>> No problems for glibc development, no new -m flag. >>>> As fast as (b) after relinking against a newer glibc. >>>> Breakage with older/unfixed GCC is detectable during development >>>> (unlike the reason for the workaround, which would have needed >>>> clairvoyance on the developer's part). >>>> >>>> Con: Relinking code compiled with the old, unfixed GCC against the >>>> new glibc still produces broken binaries. >>>> >>> >>> (c) doesn't fix the problem for unfixed GCC at all. As far as unfixed GCC >>> is concerned, nothing is changed. >> >> It fixes existing binaries by addressing the implied ABI breakage by the >> glibc update. >> >> It's my strong belief that compiler bugs should be fixed in the >> compiler, not in glibc. > > The GNU C Library is part of the GNU Toolchain and the implementation. > > We should work together to offer the best solution to our users. Easier said than done when the guilty party has moved on and no longer supports the buggy releases. > In the case of (a), and (b) (assuming you don't downgrade your glibc, which > is dangerous for other reasons) it works even if the packages are rebuilt, > and that's an important "Pro" benefit. But that's *never* been a goal with the way we maintain glibc. We are quite eager to break this in many other contexts. > I'd say choosing between (a) and (b) is what's really up for discussion. I still think (c) is superior to the ___tls_get_addr (three _) solution. The three _s probably have to be made known to the sanitizers and to valgrind, too. > Yes (c) is a simple option, but in the end, given how distributions operate, > symbol versioning should be our last choice (one which we don't have to make > in this case). Would you please expand on that? For RPM-based distributions, the new ___tls_get_addr (3 _) symbol would also add a GLIBC_2.26 or GLIBC_2.27 dependency, causing the same issue as any other versioned symbol. > I think we've pretty much agreed that (b) is going to be the solution for > glibc 2.26, and HJ has checked in changes to that effect. Future optimizations > about ___tls_get_addr can be discussed later. Surely you mean (a) here? No new symbol for 2.26? Thanks, Florian
On 07/06/2017 04:17 PM, Florian Weimer wrote: > On 07/06/2017 06:23 PM, Alexander Monakov wrote: >> On Thu, 6 Jul 2017, Carlos O'Donell wrote: >>> Option (b) has been chosen for this release, so that's what we're doing for >>> the release, but we can continue to discuss the optimization here. >>> >>> (1) Cost. >>> >>> No matter what you do it's extra instructions on the hot path forever. >> >> What extra instructions on the hot path? I don't see any. There's no >> need to check stack alignment on the hot path -- only on the slow path. > > Indeed, the x86-64 fast path is identical to the GCC-compiled version in > current master. My apologies, when I wrote 'hot path' I was thinking of process startup where the first call to __tls_get_addr (for a given dtv entry) always goes through the slow path. The reason I still want to highlight this is that there is a non-zero cost paid, and it adds up over time with other decisions we make. So while profiling is important, eventually we don't see any particular peak in the profile, just a lot of costly noise which is hard to speed up without a deeper understanding of the infrastructure and how we want it to work.
On 07/06/2017 04:26 PM, Florian Weimer wrote: > On 07/06/2017 06:08 PM, Carlos O'Donell wrote: >> In the case of (a), and (b) (assuming you don't downgrade your glibc, which >> is dangerous for other reasons) it works even if the packages are rebuilt, >> and that's an important "Pro" benefit. > > But that's *never* been a goal with the way we maintain glibc. We are > quite eager to break this in many other contexts. I do not mean to set precedent for other technical discussions, and I am talking specifically about the problem of __tls_get_addr. I think that symbol versioning is a good solution to this problem, but it's the simplest solution, and if we can do better, then we should. >> I'd say choosing between (a) and (b) is what's really up for discussion. > > I still think (c) is superior to the ___tls_get_addr (three _) solution. > The three _s probably have to be made known to the sanitizers and to > valgrind, too. That is a good 'Con' against (b). Is it true? We should have someone like Mark Wielaard comment on this? >> Yes (c) is a simple option, but in the end, given how distributions operate, >> symbol versioning should be our last choice (one which we don't have to make >> in this case). > > Would you please expand on that? For RPM-based distributions, the new > ___tls_get_addr (3 _) symbol would also add a GLIBC_2.26 or GLIBC_2.27 > dependency, causing the same issue as any other versioned symbol. My comment above was only to say that distributions rebuild packages for many reasons, and after that rebuild is complete the package may fail to compile, or operate correctly at runtime because of a semantic change we made using symbol versioning. That is OK though, we allowed existing users to continue to user their packages for as long as they wanted, and that is the ABI guarantee we give. I do not want to remove value from symbol versioning, or header changes, all things which we are allowed to do. I want us to do better than symbol versioning, if we can, and if the cost/reward is in favour of doing that. When comparing (b) to (c) from a distro perspective we see: (b) New ___tls_get_addr (3x_) API. Pro: Link failures if you get the coordination wrong. Con: Need to coordinate a glibc/gcc release in the distro. (c) New symbol version. Pro: No need to coordinate a glibc/gcc release in the distro. Con: Can create broken packages depending on which glibc/gcc combination is in the buildroot, or was used by the user. >> I think we've pretty much agreed that (b) is going to be the solution for >> glibc 2.26, and HJ has checked in changes to that effect. Future optimizations >> about ___tls_get_addr can be discussed later. > > Surely you mean (a) here? No new symbol for 2.26? Sorry, yes (a). No new symbol for 2.26, which is what HJ checked in. In closing I would like to point out that it is not as cut-and-dry as we make it out to be and that choosing between (a), (b), and (c) should be done with due consideration. Thankfully we can always revisit an optimization given our choice with (a). For example we should probably look at TLSDESC again for x86_64.
On Fri, 2017-07-07 at 12:19 -0400, Carlos O'Donell wrote: > On 07/06/2017 04:26 PM, Florian Weimer wrote: > > I still think (c) is superior to the ___tls_get_addr (three _) solution. > > The three _s probably have to be made known to the sanitizers and to > > valgrind, too. > > That is a good 'Con' against (b). Is it true? We should have someone like > Mark Wielaard comment on this? valgrind doesn't treat __tls_get_addr special. I haven't checked what __tls_get_addr does precisely. But apparently it operates on well defined memory locations, so doesn't need any special treatment.
On Fri, 7 Jul 2017, Carlos O'Donell wrote: > My apologies, when I wrote 'hot path' I was thinking of process startup > where the first call to __tls_get_addr (for a given dtv entry) always goes > through the slow path. > > The reason I still want to highlight this is that there is a non-zero > cost paid, and it adds up over time with other decisions we make. In this case you're talking about literally 5 instructions on paths that involve syscalls and thousands of other instructions. Consider whether bloating the dynamic table by 1 entry and causing additional work for the dynamic linker (for all programs by the way, not only those using __tls_get_addr) outweighs the cost of those 5 instructions. Alexander
On Fri, Jul 07, 2017 at 08:13:49PM +0300, Alexander Monakov wrote: > On Fri, 7 Jul 2017, Carlos O'Donell wrote: > > My apologies, when I wrote 'hot path' I was thinking of process startup > > where the first call to __tls_get_addr (for a given dtv entry) always goes > > through the slow path. > > > > The reason I still want to highlight this is that there is a non-zero > > cost paid, and it adds up over time with other decisions we make. > > In this case you're talking about literally 5 instructions on paths that > involve syscalls and thousands of other instructions. And it could be even 2 (have the fast path done for when __tls_get_addr has been called on that TLS object already first, then testb $15, %spl je __tls_get_addr_slow ! Now do the actual stack realignment and __tls_get_addr_slow call (or testl $15, %esp; whatever is faster). Jakub
On 07/07/2017 01:13 PM, Alexander Monakov wrote: > On Fri, 7 Jul 2017, Carlos O'Donell wrote: >> My apologies, when I wrote 'hot path' I was thinking of process startup >> where the first call to __tls_get_addr (for a given dtv entry) always goes >> through the slow path. >> >> The reason I still want to highlight this is that there is a non-zero >> cost paid, and it adds up over time with other decisions we make. > > In this case you're talking about literally 5 instructions on paths that > involve syscalls and thousands of other instructions. > > Consider whether bloating the dynamic table by 1 entry and causing > additional work for the dynamic linker (for all programs by the way, > not only those using __tls_get_addr) outweighs the cost of those 5 > instructions. The answer is not black and white. One must pay the slow path cost for every dtv entry that needs to be initialized, but one only pays the symbol lookup cost once for the call to ___tls_get_addr. One always has to lookup a version of __tls_get_addr, so it might or might not, cost a little more to have an extra symbol (depends on the symbol order choices made by the static linker). The extra .dynamic entry for the triple underscore or versioned symbol is added to a hash bucket for lookup, and so should be as easy to find as the original symbol if the new version sorts first. I think the final decision (a), to keep the original symbol and alter the slow path to realign the stack (as designed by Florian) is a good solution. It is a better solution than (c) symbol versioning, but only because it lacks the downside that recompiling bumps you forward to the new version which breaks. However, I don't throw out HJ's suggestion for a new API without giving it due consideration, and it has attractive benefits. Though I don't see the additional symbol in the .dynsym as a significant "Con" for that solution, but then again I should try to experiment with some microbenchmarks.
* Jakub Jelinek: > On Fri, Jul 07, 2017 at 08:13:49PM +0300, Alexander Monakov wrote: >> On Fri, 7 Jul 2017, Carlos O'Donell wrote: >> > My apologies, when I wrote 'hot path' I was thinking of process startup >> > where the first call to __tls_get_addr (for a given dtv entry) always goes >> > through the slow path. >> > >> > The reason I still want to highlight this is that there is a non-zero >> > cost paid, and it adds up over time with other decisions we make. >> >> In this case you're talking about literally 5 instructions on paths that >> involve syscalls and thousands of other instructions. > > And it could be even 2 (have the fast path done for when __tls_get_addr > has been called on that TLS object already first, then > testb $15, %spl > je __tls_get_addr_slow > ! Now do the actual stack realignment and __tls_get_addr_slow call > > (or testl $15, %esp; whatever is faster). I think the condition needs to be reversed, at the very least. I don't know the exact nature of the GCC bug. If it only causes misalignment by 8 bytes, this could work: testl $8, %esp /* Jump if the stack is already properly aligned for the function entry. */ jne __tls_get_addr_slow /* Otherwise, __tls_get_addr was called with a misaligned stack, but this means the stack is properly aligned for a call instruction. */ call __tls_get_addr_slow ret But it really assumes that the stack alignment is either 0 or 8 (mod 16) at the entry of the __tls_get_addr function.
From b59d4b240c18673a9018e338702e51536d6f25d2 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" <hjl.tools@gmail.com> Date: Tue, 4 Jul 2017 09:24:09 -0700 Subject: [PATCH 2/2] x86-64: Export ___tls_get_addr On x86-64, __tls_get_addr is changed to realign stack to support GCC older than GCC 4.9.4: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58066 This patch exports ___tls_get_addr as the alternative x86-64 runtime interface to TLS with aligned stack. This requires linker support for TLS optimization with ___tls_get_addr. The GCC 4.9.4 and newer can generate call to ___tls_get_addr, instead of __tls_get_addr. We can use the weakref assembly directive to redirect __tls_get_addr to ___tls_get_addr if linker supports ___tls_get_addr and GCC doesn't. * sysdeps/unix/sysv/linux/x86_64/64/ld.abilist: Add GLIBC_2.26 and ___tls_get_addr. * sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist: Likewise. * sysdeps/x86_64/Versions: Export ___tls_get_addr in GLIBC_2.26. --- sysdeps/unix/sysv/linux/x86_64/64/ld.abilist | 2 ++ sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist | 2 ++ sysdeps/x86_64/Versions | 6 ++++++ 3 files changed, 10 insertions(+) diff --git a/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist b/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist index 07cab4b..884e52c 100644 --- a/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/64/ld.abilist @@ -6,6 +6,8 @@ GLIBC_2.2.5 calloc F GLIBC_2.2.5 free F GLIBC_2.2.5 malloc F GLIBC_2.2.5 realloc F +GLIBC_2.26 GLIBC_2.26 A +GLIBC_2.26 ___tls_get_addr F GLIBC_2.3 GLIBC_2.3 A GLIBC_2.3 __tls_get_addr F GLIBC_2.4 GLIBC_2.4 A diff --git a/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist b/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist index 236357b..d42a7a4 100644 --- a/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist +++ b/sysdeps/unix/sysv/linux/x86_64/x32/ld.abilist @@ -7,3 +7,5 @@ GLIBC_2.16 calloc F GLIBC_2.16 free F GLIBC_2.16 malloc F GLIBC_2.16 realloc F +GLIBC_2.26 GLIBC_2.26 A +GLIBC_2.26 ___tls_get_addr F diff --git a/sysdeps/x86_64/Versions b/sysdeps/x86_64/Versions index a437f85..b8c25c9 100644 --- a/sysdeps/x86_64/Versions +++ b/sysdeps/x86_64/Versions @@ -10,3 +10,9 @@ libm { exp2l; } } +ld { + GLIBC_2.26 { + # The alternative x86-64 runtime interface to TLS with aligned stack. + ___tls_get_addr; + } +} -- 2.9.4