Message ID | 20191121021040.14554-1-sandra@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | Compile elf/rtld.c with -fno-tree-loop-distribute-patterns. | expand |
* Sandra Loosemore: > In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns. > This optimization causes GCC to "helpfully" convert the hand-written > loop in _dl_start into a call to memset, which is not available that > early in program startup. Similar problems in other places in GLIBC > have been addressed by explicitly building with > -fno-tree-loop-distribute-patterns, but this one may have been > overlooked previously because it only affects targets where > HAVE_BUILTIN_MEMSET is not defined. Thanks for pointing out this problem. > +# On targets without __builtin_memset, rtld.c uses a hand-coded loop > +# in _dl_start. Make sure this isn't turned into a call to regular memset. > +ifeq (yes,$(have-loop-to-function)) > +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns > +endif Is it possible to do this via a pragma? If I understand things correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets (where initialization of the dynamic loader is simpler).
On 11/23/19 6:25 AM, Florian Weimer wrote: > * Sandra Loosemore: > >> In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns. >> This optimization causes GCC to "helpfully" convert the hand-written >> loop in _dl_start into a call to memset, which is not available that >> early in program startup. Similar problems in other places in GLIBC >> have been addressed by explicitly building with >> -fno-tree-loop-distribute-patterns, but this one may have been >> overlooked previously because it only affects targets where >> HAVE_BUILTIN_MEMSET is not defined. > > Thanks for pointing out this problem. > >> +# On targets without __builtin_memset, rtld.c uses a hand-coded loop >> +# in _dl_start. Make sure this isn't turned into a call to regular memset. >> +ifeq (yes,$(have-loop-to-function)) >> +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns >> +endif > > Is it possible to do this via a pragma? If I understand things > correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets > (where initialization of the dynamic loader is simpler). I see that the definitions of memset and memmove use "inhibit_loop_to_libcall" (which expands into an optimize attribute) to prevent recursion, but I didn't think the header where that is defined (include/libc-symbols.h) is supposed to be included in the dynamic linker? Also, already in elf/Makefile there is another instance where it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just copied that. I don't work with glibc internals enough to have a good feel for what the preferred solution is but I'll test a different solution if this one isn't good enough. -Sandra
* Sandra Loosemore: >> Is it possible to do this via a pragma? If I understand things >> correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets >> (where initialization of the dynamic loader is simpler). > > I see that the definitions of memset and memmove use > "inhibit_loop_to_libcall" (which expands into an optimize attribute) to > prevent recursion, but I didn't think the header where that is defined > (include/libc-symbols.h) is supposed to be included in the dynamic > linker? elf/rtld.os is built with -include ../include/libc-symbols.h, so the declaration should be in scope. I don't know why it is not effective. It probably only applies to the implementations of memset and memmove themselves (if the generic ones written in C are used). > Also, already in elf/Makefile there is another instance where > it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just > copied that. I don't work with glibc internals enough to have a good > feel for what the preferred solution is but I'll test a different > solution if this one isn't good enough. I had hoped we could write something like this at the start of elf/rtld.c: #ifndef PI_STATIC_AND_HIDDEN # pragma GCC optimize ("no-tree-loop-distribute-patterns") #endif Then the optimization would still be applied on the targets where it is safe to do so. But I don't have a strong opinion about this and would appreciate feedback from others.
On 25/11/2019 05:08, Florian Weimer wrote: > * Sandra Loosemore: > >>> Is it possible to do this via a pragma? If I understand things >>> correctly, this is not necessary for PI_STATIC_AND_HIDDEN targets >>> (where initialization of the dynamic loader is simpler). >> >> I see that the definitions of memset and memmove use >> "inhibit_loop_to_libcall" (which expands into an optimize attribute) to >> prevent recursion, but I didn't think the header where that is defined >> (include/libc-symbols.h) is supposed to be included in the dynamic >> linker? > > elf/rtld.os is built with -include ../include/libc-symbols.h, so the > declaration should be in scope. I don't know why it is not effective. > It probably only applies to the implementations of memset and memmove > themselves (if the generic ones written in C are used). > >> Also, already in elf/Makefile there is another instance where >> it adds -fno-tree-loop-distribute-patterns to the CFLAGS, so I just >> copied that. I don't work with glibc internals enough to have a good >> feel for what the preferred solution is but I'll test a different >> solution if this one isn't good enough. > > I had hoped we could write something like this at the start of > elf/rtld.c: > > #ifndef PI_STATIC_AND_HIDDEN > # pragma GCC optimize ("no-tree-loop-distribute-patterns") > #endif > > Then the optimization would still be applied on the targets where it > is safe to do so. > > But I don't have a strong opinion about this and would appreciate > feedback from others. > We already have a similar code to handle a similar issue: 484 /* Partly clean the `bootstrap_map' structure up. Don't use 485 `memset' since it might not be built in or inlined and we cannot 486 make function calls at this point. Use '__builtin_memset' if we 487 know it is available. We do not have to clear the memory if we 488 do not have to use the temporary bootstrap_map. Global variables 489 are initialized to zero by default. */ 490 #ifndef DONT_USE_BOOTSTRAP_MAP 491 # ifdef HAVE_BUILTIN_MEMSET 492 __builtin_memset (bootstrap_map.l_info, '\0', sizeof (bootstrap_map.l_info)); 493 # else 494 for (size_t cnt = 0; 495 cnt < sizeof (bootstrap_map.l_info) / sizeof (bootstrap_map.l_info[0]); 496 ++cnt) 497 bootstrap_map.l_info[cnt] = 0; 498 # endif 499 #endif The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is stack allocated for !PI_STATIC_AND_HIDDEN. (As a side-note, I really think these kind of micro-optimization is just over-complicate for minimal gain) However, I am not sure there is really a restriction regarding PI_STATIC_AND_HIDDEN and internal function calls. Just after this memset call it has: 516 if (bootstrap_map.l_addr || ! bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)]) 517 { 518 /* Relocate ourselves so we can do normal function calls and 519 data access using the global offset table. */ 520 521 ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0); 522 } 523 bootstrap_map.l_relocated = 1; So it should be safe to call memcpy/memset calls in _dl_start after this point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) . Is gcc creating a mem* calls before ld.so reallocate itself? If it were, where exactly?
* Adhemerval Zanella: > We already have a similar code to handle a similar issue: > > 484 /* Partly clean the `bootstrap_map' structure up. Don't use > 485 `memset' since it might not be built in or inlined and we cannot > 486 make function calls at this point. Use '__builtin_memset' if we > 487 know it is available. We do not have to clear the memory if we > 488 do not have to use the temporary bootstrap_map. Global variables > 489 are initialized to zero by default. */ > 490 #ifndef DONT_USE_BOOTSTRAP_MAP > 491 # ifdef HAVE_BUILTIN_MEMSET > 492 __builtin_memset (bootstrap_map.l_info, '\0', sizeof > (bootstrap_map.l_info)); > 493 # else > 494 for (size_t cnt = 0; > 495 cnt < sizeof (bootstrap_map.l_info) / sizeof > (bootstrap_map.l_info[0]); > 496 ++cnt) > 497 bootstrap_map.l_info[cnt] = 0; > 498 # endif > 499 #endif > > The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself > calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is > stack allocated for !PI_STATIC_AND_HIDDEN. > > (As a side-note, I really think these kind of micro-optimization is just > over-complicate for minimal gain) > > However, I am not sure there is really a restriction regarding > PI_STATIC_AND_HIDDEN and internal function calls. Just after this memset > call it has: > > 516 if (bootstrap_map.l_addr || ! > bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)]) > 517 { > 518 /* Relocate ourselves so we can do normal function calls and > 519 data access using the global offset table. */ > 520 > 521 ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0); > 522 } > 523 bootstrap_map.l_relocated = 1; > > So it should be safe to call memcpy/memset calls in _dl_start after this > point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) . > Is gcc creating a mem* calls before ld.so reallocate itself? If it were, > where exactly? In the code you quoted. It reintroduces the memset call we so carefully tried to remove.
On 11/25/19 1:08 AM, Florian Weimer wrote: > > I had hoped we could write something like this at the start of > elf/rtld.c: > > #ifndef PI_STATIC_AND_HIDDEN > # pragma GCC optimize ("no-tree-loop-distribute-patterns") > #endif > > Then the optimization would still be applied on the targets where it > is safe to do so. Well, I could certainly hack up and test a new patch to do that, if that is the recommended approach. But I see no other existing uses of "pragma GCC optimize" anywhere in glibc other than in a few test cases, while elf/Makefile already specifies -fno-tree-loop-distribute-patterns on dl-tunables.c in the same way I added it for rtld.c, so my original patch seems more consistent with current practice. > But I don't have a strong opinion about this and would appreciate > feedback from others. Same here. I just want this bug fixed and will do it in whatever way the community agrees on. As I said, the dynamic linker is currently completely broken on nios2 when built with GCC 10. -Sandra
On 25/11/2019 17:05, Florian Weimer wrote: > * Adhemerval Zanella: > >> We already have a similar code to handle a similar issue: >> >> 484 /* Partly clean the `bootstrap_map' structure up. Don't use >> 485 `memset' since it might not be built in or inlined and we cannot >> 486 make function calls at this point. Use '__builtin_memset' if we >> 487 know it is available. We do not have to clear the memory if we >> 488 do not have to use the temporary bootstrap_map. Global variables >> 489 are initialized to zero by default. */ >> 490 #ifndef DONT_USE_BOOTSTRAP_MAP >> 491 # ifdef HAVE_BUILTIN_MEMSET >> 492 __builtin_memset (bootstrap_map.l_info, '\0', sizeof >> (bootstrap_map.l_info)); >> 493 # else >> 494 for (size_t cnt = 0; >> 495 cnt < sizeof (bootstrap_map.l_info) / sizeof >> (bootstrap_map.l_info[0]); >> 496 ++cnt) >> 497 bootstrap_map.l_info[cnt] = 0; >> 498 # endif >> 499 #endif >> >> The HAVE_BUILTIN_MEMSET on configure.ac check if __builtin_memset itself >> calls memset and it is within DONT_USE_BOOTSTRAP_MAP mainly because it is >> stack allocated for !PI_STATIC_AND_HIDDEN. >> >> (As a side-note, I really think these kind of micro-optimization is just >> over-complicate for minimal gain) >> >> However, I am not sure there is really a restriction regarding >> PI_STATIC_AND_HIDDEN and internal function calls. Just after this memset >> call it has: >> >> 516 if (bootstrap_map.l_addr || ! >> bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)]) >> 517 { >> 518 /* Relocate ourselves so we can do normal function calls and >> 519 data access using the global offset table. */ >> 520 >> 521 ELF_DYNAMIC_RELOCATE (&bootstrap_map, 0, 0, 0); >> 522 } >> 523 bootstrap_map.l_relocated = 1; >> >> So it should be safe to call memcpy/memset calls in _dl_start after this >> point (as some ports that with !PI_STATIC_AND_HIDDEN does with memcpy) . >> Is gcc creating a mem* calls before ld.so reallocate itself? If it were, >> where exactly? > > In the code you quoted. It reintroduces the memset call we so > carefully tried to remove. > Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc for instance, slight larger on mips). I wonder if we could just create a static global on rtld.c to simplify it, since the stack usage won't be released anyway. Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant?
* Adhemerval Zanella: > Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc > for instance, slight larger on mips). I wonder if we could just create a > static global on rtld.c to simplify it, since the stack usage won't be > released anyway. I thought the stack is reset by the startup stub? At one point, we need to rewrite the bootstrap relocation in something else besides C, but I suppose we can just stick in -fno-tree-loop-distribute-patterns today. > Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant? Not sure if I understand the question. The memset is likely redundant on Linux because the stack is fresh and unperturbed. If the startup stub allocated it, clearing it wouldn't be necessary. But the situation is a bit iffy at the C level.
On 25/11/2019 19:28, Florian Weimer wrote: > * Adhemerval Zanella: > >> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc >> for instance, slight larger on mips). I wonder if we could just create a >> static global on rtld.c to simplify it, since the stack usage won't be >> released anyway. > > I thought the stack is reset by the startup stub? It is and also it seems we are stuck on using stack allocated for bootstrap_map for some architectures. For instance, the powerpc32/fpic always generate a GOT access even for static variables (and some comments on rltd.c about it do not grasp all architecture requirements). > > At one point, we need to rewrite the bootstrap relocation in something > else besides C, but I suppose we can just stick in > -fno-tree-loop-distribute-patterns today. Maybe with some reorganization and refactoring we can simplify the include hack done with dynamic-link.h while taking care of not generating the required relocations (maybe a compiler flag could help us, at least to warn when such relocation are being emitted?). But I see your point where we are using C to produce code that is a specific subset of language standard. > >> Also, isn't the memset on !DONT_USE_BOOTSTRAP_MAP redundant? > > Not sure if I understand the question. The memset is likely redundant > on Linux because the stack is fresh and unperturbed. If the startup > stub allocated it, clearing it wouldn't be necessary. But the > situation is a bit iffy at the C level. > Never mind, my confusion here.
* Adhemerval Zanella: > On 25/11/2019 19:28, Florian Weimer wrote: >> * Adhemerval Zanella: >> >>> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc >>> for instance, slight larger on mips). I wonder if we could just create a >>> static global on rtld.c to simplify it, since the stack usage won't be >>> released anyway. >> >> I thought the stack is reset by the startup stub? > > It is and also it seems we are stuck on using stack allocated for > bootstrap_map for some architectures. For instance, the > powerpc32/fpic always generate a GOT access even for static > variables (and some comments on rltd.c about it do not grasp all > architecture requirements). So should we just apply Sandra's patch? >> At one point, we need to rewrite the bootstrap relocation in something >> else besides C, but I suppose we can just stick in >> -fno-tree-loop-distribute-patterns today. > > Maybe with some reorganization and refactoring we can simplify the include > hack done with dynamic-link.h while taking care of not generating the > required relocations (maybe a compiler flag could help us, at least to > warn when such relocation are being emitted?). But I see your point where > we are using C to produce code that is a specific subset of language > standard. It's even worse—we try to restrict what we feed to the front end, hoping that what comes out of the back end, assembler and linker matches our requirements. That's not how we should do things today.
On 26/11/2019 11:14, Florian Weimer wrote: > * Adhemerval Zanella: > >> On 25/11/2019 19:28, Florian Weimer wrote: >>> * Adhemerval Zanella: >>> >>>> Sigh... The dl_start_final_info struct is roughly 660 bytes (664 on powerpc >>>> for instance, slight larger on mips). I wonder if we could just create a >>>> static global on rtld.c to simplify it, since the stack usage won't be >>>> released anyway. >>> >>> I thought the stack is reset by the startup stub? >> >> It is and also it seems we are stuck on using stack allocated for >> bootstrap_map for some architectures. For instance, the >> powerpc32/fpic always generate a GOT access even for static >> variables (and some comments on rltd.c about it do not grasp all >> architecture requirements). > > So should we just apply Sandra's patch? I don't have a strong preference: either disabling for whole rltd.c (Sandra's patch), your suggestion to restrict !PI_STATIC_AND_HIDDEN, or to use inhibit_loop_to_libcall on _dl_start only. I don't think -ftree-loop-distribute-patterns would have a large performance difference. > >>> At one point, we need to rewrite the bootstrap relocation in something >>> else besides C, but I suppose we can just stick in >>> -fno-tree-loop-distribute-patterns today. >> >> Maybe with some reorganization and refactoring we can simplify the include >> hack done with dynamic-link.h while taking care of not generating the >> required relocations (maybe a compiler flag could help us, at least to >> warn when such relocation are being emitted?). But I see your point where >> we are using C to produce code that is a specific subset of language >> standard. > > It's even worse—we try to restrict what we feed to the front end, > hoping that what comes out of the back end, assembler and linker > matches our requirements. That's not how we should do things today. Yeah, this specific code is fragile and simple changes such as changing a variable scope can break it. I think we ca at least document the restrictions and organize better the code (the include in the middle of the function is really hacky).
* Sandra Loosemore: > In GCC 10, the default at -O2 is now -ftree-loop-distribute-patterns. > This optimization causes GCC to "helpfully" convert the hand-written > loop in _dl_start into a call to memset, which is not available that > early in program startup. Similar problems in other places in GLIBC > have been addressed by explicitly building with > -fno-tree-loop-distribute-patterns, but this one may have been > overlooked previously because it only affects targets where > HAVE_BUILTIN_MEMSET is not defined. > > This patch fixes a bug observed on nios2-linux-gnu target that caused > all programs to segv on startup. > --- > elf/Makefile | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/elf/Makefile b/elf/Makefile > index 0668818..b05af5c 100644 > --- a/elf/Makefile > +++ b/elf/Makefile > @@ -65,6 +65,12 @@ CFLAGS-dl-runtime.c += -fexceptions > -fasynchronous-unwind-tables > CFLAGS-dl-lookup.c += -fexceptions -fasynchronous-unwind-tables > CFLAGS-dl-iterate-phdr.c += $(uses-callbacks) > > +# On targets without __builtin_memset, rtld.c uses a hand-coded loop > +# in _dl_start. Make sure this isn't turned into a call to regular memset. > +ifeq (yes,$(have-loop-to-function)) > +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns > +endif > + > # Compile rtld itself without stack protection. > # Also compile all routines in the static library that are elided from > # the shared libc because they are in libc.a in the same way. I've pushed this now, given that Adhemerval had not objections either, and it's a solution we have available today. Thanks.
diff --git a/elf/Makefile b/elf/Makefile index 0668818..b05af5c 100644 --- a/elf/Makefile +++ b/elf/Makefile @@ -65,6 +65,12 @@ CFLAGS-dl-runtime.c += -fexceptions -fasynchronous-unwind-tables CFLAGS-dl-lookup.c += -fexceptions -fasynchronous-unwind-tables CFLAGS-dl-iterate-phdr.c += $(uses-callbacks) +# On targets without __builtin_memset, rtld.c uses a hand-coded loop +# in _dl_start. Make sure this isn't turned into a call to regular memset. +ifeq (yes,$(have-loop-to-function)) +CFLAGS-rtld.c += -fno-tree-loop-distribute-patterns +endif + # Compile rtld itself without stack protection. # Also compile all routines in the static library that are elided from # the shared libc because they are in libc.a in the same way.