Message ID | 1430757479-14241-5-git-send-email-amonakov@ispras.ru |
---|---|
State | New |
Ping? Any comment about this patch? On Mon, 4 May 2015, Alexander Monakov wrote: > With -fno-plt, we don't have to reject even direct calls as sibcall > candidates. > > This patch depends on '-fplt' flag that is introduced in another patch. > > This patch requires that with -fno-plt all sibcall candidates go through > prepare_call_address that transforms the call to a GOT lookup. > > OK? > * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt. > > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > index f29e053..b734350 100644 > --- a/gcc/config/i386/i386.c > +++ b/gcc/config/i386/i386.c > @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) > /* If we are generating position-independent code, we cannot sibcall > optimize any indirect call, or a direct call to a global function, > as the PLT requires %ebx be live. (Darwin does not have a PLT.) */ > if (!TARGET_MACHO > && !TARGET_64BIT > && flag_pic > + && flag_plt > && (decl && !targetm.binds_local_p (decl))) > return false; > > /* If we need to align the outgoing stack, then sibcalling would > unalign the stack, which may break the called function. */ > if (ix86_minimum_incoming_stack_boundary (true) >
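As a reading aid, the condition touched by the patch can be modelled as a standalone predicate. This is only a sketch: `struct sibcall_query` and its field names are hypothetical stand-ins for GCC's `TARGET_MACHO`, `TARGET_64BIT`, `flag_pic`, `flag_plt` and `targetm.binds_local_p (decl)`, and the real `ix86_function_ok_for_sibcall` performs further checks that are not modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the compiler state consulted by the check. */
struct sibcall_query {
    bool target_macho;  /* models TARGET_MACHO */
    bool target_64bit;  /* models TARGET_64BIT */
    bool flag_pic;      /* models flag_pic */
    bool flag_plt;      /* models flag_plt; false under -fno-plt */
    bool has_decl;      /* models decl != NULL, i.e. a direct call */
    bool binds_local;   /* models targetm.binds_local_p (decl) */
};

/* Returns true when the PLT/%ebx test alone would reject the sibcall.
   With flag_plt false the call goes through the GOT instead of the PLT,
   so this particular reason for rejecting it no longer applies. */
bool plt_check_rejects_sibcall(const struct sibcall_query *q)
{
    return !q->target_macho
        && !q->target_64bit
        && q->flag_pic
        && q->flag_plt
        && (q->has_decl && !q->binds_local);
}
```

With every other field held fixed, clearing `flag_plt` is what turns a rejected 32-bit PIC call to a global function into a sibcall candidate.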
On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote: > Ping? Any comment about this patch? > > On Mon, 4 May 2015, Alexander Monakov wrote: > >> With -fno-plt, we don't have to reject even direct calls as sibcall >> candidates. >> >> This patch depends on '-fplt' flag that is introduced in another patch. >> >> This patch requires that with -fno-plt all sibcall candidates go through >> prepare_call_address that transforms the call to a GOT lookup. >> >> OK? >> * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt. >> >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c >> index f29e053..b734350 100644 >> --- a/gcc/config/i386/i386.c >> +++ b/gcc/config/i386/i386.c >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) >> /* If we are generating position-independent code, we cannot sibcall >> optimize any indirect call, or a direct call to a global function, >> as the PLT requires %ebx be live. (Darwin does not have a PLT.) */ >> if (!TARGET_MACHO >> && !TARGET_64BIT >> && flag_pic >> + && flag_plt >> && (decl && !targetm.binds_local_p (decl))) >> return false; >> >> /* If we need to align the outgoing stack, then sibcalling would >> unalign the stack, which may break the called function. */ >> if (ix86_minimum_incoming_stack_boundary (true) >> I think it should be done via psABI change similar to https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI which I have implemented on users/hjl/relax branch in binutils.
> On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote: > > Ping? Any comment about this patch? > > > > On Mon, 4 May 2015, Alexander Monakov wrote: > > > >> With -fno-plt, we don't have to reject even direct calls as sibcall > >> candidates. > >> > >> This patch depends on '-fplt' flag that is introduced in another patch. > >> > >> This patch requires that with -fno-plt all sibcall candidates go through > >> prepare_call_address that transforms the call to a GOT lookup. > >> > >> OK? > >> * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt. > >> > >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c > >> index f29e053..b734350 100644 > >> --- a/gcc/config/i386/i386.c > >> +++ b/gcc/config/i386/i386.c > >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) > >> /* If we are generating position-independent code, we cannot sibcall > >> optimize any indirect call, or a direct call to a global function, > >> as the PLT requires %ebx be live. (Darwin does not have a PLT.) */ > >> if (!TARGET_MACHO > >> && !TARGET_64BIT > >> && flag_pic > >> + && flag_plt > >> && (decl && !targetm.binds_local_p (decl))) > >> return false; > >> > >> /* If we need to align the outgoing stack, then sibcalling would > >> unalign the stack, which may break the called function. */ > >> if (ix86_minimum_incoming_stack_boundary (true) > >> > > I think it should be done via psABI change similar to > > https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI > > which I have implemented on users/hjl/relax branch in binutils. OK, I am trying to understand how relax branch works and what difference it makes. As I underestand it, the main purpose is to be able to make relaxed call of call function that will, in 64bit mode, either result to RIP relative call with extra NOP just before the instruction if FUNCTION binds within the DSO or to indirect call through GOT bypassing the PLT. 
This saves the overhead of the PLT, at the cost of growing every such call by an extra NOP, for non-LTO builds and even with LTO when the symbol is defined but interposable. This is actually a really nice trick.

Now, this patch is about 32-bit mode, where an explicit GOT pointer register is needed (how does this work with the large code model on x86-64?). It is needed by the PLT, but I suppose that implementing the same relaxation for 32-bit would also need EBX to look up the GOT pointer, so the check above would still be valid.

The patch makes sense to me, given that we support -fno-plt now.

Honza

> --
> H.J.
On Fri, May 15, 2015 at 12:48 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >> On Fri, May 15, 2015 at 9:27 AM, Alexander Monakov <amonakov@ispras.ru> wrote: >> > Ping? Any comment about this patch? >> > >> > On Mon, 4 May 2015, Alexander Monakov wrote: >> > >> >> With -fno-plt, we don't have to reject even direct calls as sibcall >> >> candidates. >> >> >> >> This patch depends on '-fplt' flag that is introduced in another patch. >> >> >> >> This patch requires that with -fno-plt all sibcall candidates go through >> >> prepare_call_address that transforms the call to a GOT lookup. >> >> >> >> OK? >> >> * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt. >> >> >> >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c >> >> index f29e053..b734350 100644 >> >> --- a/gcc/config/i386/i386.c >> >> +++ b/gcc/config/i386/i386.c >> >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) >> >> /* If we are generating position-independent code, we cannot sibcall >> >> optimize any indirect call, or a direct call to a global function, >> >> as the PLT requires %ebx be live. (Darwin does not have a PLT.) */ >> >> if (!TARGET_MACHO >> >> && !TARGET_64BIT >> >> && flag_pic >> >> + && flag_plt >> >> && (decl && !targetm.binds_local_p (decl))) >> >> return false; >> >> >> >> /* If we need to align the outgoing stack, then sibcalling would >> >> unalign the stack, which may break the called function. */ >> >> if (ix86_minimum_incoming_stack_boundary (true) >> >> >> >> I think it should be done via psABI change similar to >> >> https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI >> >> which I have implemented on users/hjl/relax branch in binutils. > > OK, I am trying to understand how relax branch works and what difference it makes. 
> As I underestand it, the main purpose is to be able to make relaxed call of > > call function > > that will, in 64bit mode, either result to RIP relative call with extra NOP just > before the instruction if FUNCTION binds within the DSO or to indirect call through > GOT bypassing the PLT. This saves overhead of PLT and increase every such call > by extra NOP for no-LTO builds and even in LTO when the symbol is defined but > interposable. This is actually really nice trick. > > Now this is about 32bit mode where explicit GOT pointer register is needed > (how this work with large code model on x86-64?). It is needed by PLT, but I suppose > to implement the same relaxation for 32bit it would need to use EBX to lookup the > GOT pointer, too, so the check above would still be valid. > With relax branch in 32-bit, there are 2 cases: 1. PIC or PIE: We generate set up EBX relax call foo@PLT It is almost the same as we do now, except for the relax prefix. If foo is defined in another shared library or may be preempted, linker will generate call *foo@GOTPLT(%ebx) If foo turns out local, linker will output relax call foo 2. Non PIC/PIE: We generate relax call foo If foo is defined in a DSO, linker will generate call/jmp *foo@GOTPLT We don't set up EBX in this case. If foo turns out local, linker will output relax call foo
On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote: > With relax branch in 32-bit, there are 2 cases: > > 1. PIC or PIE: We generate > > set up EBX > relax call foo@PLT > > It is almost the same as we do now, except for the relax prefix. > If foo is defined in another shared library or may be preempted, > linker will generate > > call *foo@GOTPLT(%ebx) > > If foo turns out local, linker will output > > relax call foo This does not address the initial and primary motivation for no-plt on 32-bit: eliminating the awful codegen constraint costs of the GOT-register (ebx, and equivalent on other targets) ABI for calling PLT entries. If instead you generated code that sets up an expression for the GOT slot using arbitrary registers, and relaxed it to a direct call (possibly rendering the register setup useless), it would be comparable to the no-plt approach. So for example: set up ecx (or whatever register) relax call *foo@GOT(%ecx) and relax to: set up ecx (or whatever register; now useless) relax call foo But the no-plt approach is still superior in that the address load from the GOT can be hoisted out of loops, etc., resulting in something like: call *%esi This could be valuable in loops calling a math function repeatedly, for example. Overall I'm still not a fan of the relaxation approach. There are very few places it would actually help that couldn't already be improved better with use of visibility, and it can't give codegen as good as no-plt option. Rich
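The hoisting argument above can be approximated at source level. The sketch below is an analogy, not compiler output: reading the callee address once into `fn` plays the role of loading the GOT slot into a register before the loop, and each iteration then performs only an indirect call (the `call *%esi` form). The names `half` and `apply_sum` are hypothetical.

```c
#include <stddef.h>

typedef double (*unary_fn)(double);

/* Hypothetical callee standing in for an external math function. */
double half(double x)
{
    return x / 2.0;
}

/* The callee address arrives once in `fn` (the analogue of a GOT load
   hoisted out of the loop); the loop body only does an indirect call. */
double apply_sum(unary_fn fn, const double *xs, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += fn(xs[i]);
    return total;
}
```

A call through a PLT entry gives the compiler no such load to move, which is the codegen difference being argued for.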
On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote: > On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote: >> With relax branch in 32-bit, there are 2 cases: >> >> 1. PIC or PIE: We generate >> >> set up EBX >> relax call foo@PLT >> >> It is almost the same as we do now, except for the relax prefix. >> If foo is defined in another shared library or may be preempted, >> linker will generate >> >> call *foo@GOTPLT(%ebx) >> >> If foo turns out local, linker will output >> >> relax call foo > > This does not address the initial and primary motivation for no-plt on > 32-bit: eliminating the awful codegen constraint costs of the > GOT-register (ebx, and equivalent on other targets) ABI for calling > PLT entries. If instead you generated code that sets up an expression > for the GOT slot using arbitrary registers, and relaxed it to a direct > call (possibly rendering the register setup useless), it would be > comparable to the no-plt approach. So for example: > > set up ecx (or whatever register) > relax call *foo@GOT(%ecx) > > and relax to: > > set up ecx (or whatever register; now useless) > relax call foo > > But the no-plt approach is still superior in that the address load > from the GOT can be hoisted out of loops, etc., resulting in something > like: > > call *%esi > > This could be valuable in loops calling a math function repeatedly, > for example. > > Overall I'm still not a fan of the relaxation approach. There are very > few places it would actually help that couldn't already be improved > better with use of visibility, and it can't give codegen as good as > no-plt option. With no-plt option, compiler has to know if a function is external or may be preempted. If compiler guessed wrong, the generated DSO or executable will always go through indirect branch even though the target is local. With relax branch, the decision is left to linker. 
Of course, EBX must be used unless we add a new PLT relocation for each register used to hold the GOT base, like:

    relax call foo@PLT_ECX
    relax call foo@PLT_EDX
    ...
On Fri, May 15, 2015 at 01:35:14PM -0700, H.J. Lu wrote: > On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote: > > On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote: > >> With relax branch in 32-bit, there are 2 cases: > >> > >> 1. PIC or PIE: We generate > >> > >> set up EBX > >> relax call foo@PLT > >> > >> It is almost the same as we do now, except for the relax prefix. > >> If foo is defined in another shared library or may be preempted, > >> linker will generate > >> > >> call *foo@GOTPLT(%ebx) > >> > >> If foo turns out local, linker will output > >> > >> relax call foo > > > > This does not address the initial and primary motivation for no-plt on > > 32-bit: eliminating the awful codegen constraint costs of the > > GOT-register (ebx, and equivalent on other targets) ABI for calling > > PLT entries. If instead you generated code that sets up an expression > > for the GOT slot using arbitrary registers, and relaxed it to a direct > > call (possibly rendering the register setup useless), it would be > > comparable to the no-plt approach. So for example: > > > > set up ecx (or whatever register) > > relax call *foo@GOT(%ecx) > > > > and relax to: > > > > set up ecx (or whatever register; now useless) > > relax call foo > > > > But the no-plt approach is still superior in that the address load > > from the GOT can be hoisted out of loops, etc., resulting in something > > like: > > > > call *%esi > > > > This could be valuable in loops calling a math function repeatedly, > > for example. > > > > Overall I'm still not a fan of the relaxation approach. There are very > > few places it would actually help that couldn't already be improved > > better with use of visibility, and it can't give codegen as good as > > no-plt option. > > With no-plt option, compiler has to know if a function is external > or may be preempted. I still don't see significant practical cases where the linker would know this but the compiler can't. 
If you use visibility properly, the compiler knows, and if you do LTO and -Bsymbolic[-functions], the compiler should have that information available at LTO time (this is an enhancement that needs to be made, though). > If compiler guessed wrong, the generated > DSO or executable will always go through indirect branch even > though the target is local. The only way this is avoided now is with -Bsymbolic[-functions] which is not widely used. Otherwise interposition is always allowed for default-visibility functions, so I don't see how the indirect branch here is suboptimal. > With relax branch, the decision is left > to linker. Of course, EBX must be used unless we add a new PLT > relocation for each register used to to hold GOT base, like > > relax call foo@PLT_ECX > relax call foo@PLT_EDX No, that's not needed. If the linker doesn't make the relaxation, the instruction the compiler generated remains in place, and has the effective address expression using whichever register it wanted: relax call *foo@GOT(%ecx) relax call *foo@GOT(%edx) etc. If the linker chooses to relax it to a direct call, no register at all is needed, so the linker can just throw this away and use: call foo for all of them. Rich
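A minimal sketch of what "using visibility properly" can look like in C, assuming GCC or Clang, which both accept the `visibility` attribute. The function names are hypothetical, and in a real library the hidden function would normally be defined in a different translation unit of the same DSO; it is defined here only so the fragment is self-contained.

```c
/* Hidden visibility: the symbol is not exported from the DSO, so it
   cannot be interposed and the compiler may emit a direct call. */
__attribute__((visibility("hidden")))
int helper_hidden(void)
{
    return 42;
}

/* Default visibility: exported, and interposable unless the link uses
   something like -Bsymbolic[-functions]. */
int api_entry(void)
{
    return helper_hidden();
}
```

This is the case where the compiler already has the binding information at compile time, with no linker relaxation involved.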
On Fri, May 15, 2015 at 1:42 PM, Rich Felker <dalias@libc.org> wrote: > On Fri, May 15, 2015 at 01:35:14PM -0700, H.J. Lu wrote: >> On Fri, May 15, 2015 at 1:23 PM, Rich Felker <dalias@libc.org> wrote: >> > On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote: >> >> With relax branch in 32-bit, there are 2 cases: >> >> >> >> 1. PIC or PIE: We generate >> >> >> >> set up EBX >> >> relax call foo@PLT >> >> >> >> It is almost the same as we do now, except for the relax prefix. >> >> If foo is defined in another shared library or may be preempted, >> >> linker will generate >> >> >> >> call *foo@GOTPLT(%ebx) >> >> >> >> If foo turns out local, linker will output >> >> >> >> relax call foo >> > >> > This does not address the initial and primary motivation for no-plt on >> > 32-bit: eliminating the awful codegen constraint costs of the >> > GOT-register (ebx, and equivalent on other targets) ABI for calling >> > PLT entries. If instead you generated code that sets up an expression >> > for the GOT slot using arbitrary registers, and relaxed it to a direct >> > call (possibly rendering the register setup useless), it would be >> > comparable to the no-plt approach. So for example: >> > >> > set up ecx (or whatever register) >> > relax call *foo@GOT(%ecx) >> > >> > and relax to: >> > >> > set up ecx (or whatever register; now useless) >> > relax call foo >> > >> > But the no-plt approach is still superior in that the address load >> > from the GOT can be hoisted out of loops, etc., resulting in something >> > like: >> > >> > call *%esi >> > >> > This could be valuable in loops calling a math function repeatedly, >> > for example. >> > >> > Overall I'm still not a fan of the relaxation approach. There are very >> > few places it would actually help that couldn't already be improved >> > better with use of visibility, and it can't give codegen as good as >> > no-plt option. >> >> With no-plt option, compiler has to know if a function is external >> or may be preempted. 
> > I still don't see significant practical cases where the linker would > know this but the compiler can't. If you use visibility properly, the > compiler knows, and if you do LTO and -Bsymbolic[-functions], the > compiler should have that information available at LTO time (this is > an enhancement that needs to be made, though). There are codes like extern void foo (void); void bar (void) { foo (); } Even with LTO, compiler may have to assume foo is external when foo is compiled with LTO. >> If compiler guessed wrong, the generated >> DSO or executable will always go through indirect branch even >> though the target is local. > > The only way this is avoided now is with -Bsymbolic[-functions] which > is not widely used. Otherwise interposition is always allowed for > default-visibility functions, so I don't see how the indirect branch > here is suboptimal. Relax branch is to avoid indirect branch to local targets. If you don't think indirect branch to local targets is a performance issue, relax branch isn't for you. >> With relax branch, the decision is left >> to linker. Of course, EBX must be used unless we add a new PLT >> relocation for each register used to to hold GOT base, like >> >> relax call foo@PLT_ECX >> relax call foo@PLT_EDX > > No, that's not needed. If the linker doesn't make the relaxation, the > instruction the compiler generated remains in place, and has the > effective address expression using whichever register it wanted: > > relax call *foo@GOT(%ecx) > relax call *foo@GOT(%edx) > etc. relax branch is only used for direct branch and it isn't for indirect branch. I will implement relax call foo@PLT(%reg) The compiler can pick any registers to hold GOT base. Lazy binding is supported only when EBX is used. > If the linker chooses to relax it to a direct call, no register at all > is needed, so the linker can just throw this away and use: > > call foo > > for all of them. > > Rich
Hello, > > There are codes like > > extern void foo (void); > > void > bar (void) > { > foo (); > } > > Even with LTO, compiler may have to assume foo is external > when foo is compiled with LTO. This is not exactly true if FOO is defined in other translation unit compiled with LTO and hidden visibility. OK, so as I get it, we get the following cases: 1) compiler knows it is generating call to a local symbol a current unit (binds_to_current_def_p returns true). We handle this correctly by doing IP relative call. 2) compiler knows it is generating call to a local symbol in DSO (binds_local_p return true) Currently I think this is only the -fno-pic case or case of explicit hidden visibility and in this case we do IP relative call. We may want to propose plugin API update adding PREVAILING_DEF_EXP. So copiler would be able to default to this case for PREVAILING_DEF and we will also catch cases where the symbol is defined in current DSO as weak symbol, but the definition is not LTO. This would be also way to communicate -Bsymbolic[-functions] across the plugin API. 3) compiler knows there is going to be definition in the current DSO (by seeing a COMDAT function body or resolution info) that is interposable but because the function is inline or -fno-semantic-interposition happens, the semantics will not change. In this case it would be nice to arrange IP relative call to the hidden alias. This may require an extension both on compiler and linker side. I was thinking of doing so for comdats by adding hidden alias with fixed mangling, like __gnu_<function>.hiddenalias, and referring it. But I think it is not safe as linker may throw away section that is produced by GCC and prevail section that is not leaving to an undefined symbol? I think this is rather common case in C++ (never made any stats) because uninlined comdats are quite common. 4) compiler has no clue but linker may know better Here we traditionally always produce a PLT call. 
When the call is known to be hot in the program, it makes sense to trade lazy binding for performance and produce the call via a GOT reference (-fno-plt). I also see that H.J.'s branch helps us actually avoid the GOT reference when the symbol ends up binding locally. How does lazy binding work with relaxation?

We may try to communicate to the linker whether the symbol can or cannot be semantically interposed, so that it can apply -Bsymbolic by default to inline and COMDAT functions. Actually, perhaps the linker can just default to this for all COMDAT-defined symbols?

I think it still makes sense to work on non-LTO codegen improvements. As much as I would like everyone to use LTO and FDO, most people don't.

5) The compiler knows it is generating a call to an external function.

We do not special-case this, but we could add binds_external_p and have it determine this case from resolution info during LTO.

I do not see how this case differs from 4 from the PIC codegen perspective, except that perhaps the relax relocation will allow us to lazy bind?

Honza
On Fri, May 15, 2015 at 4:08 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > Hello, >> >> There are codes like >> >> extern void foo (void); >> >> void >> bar (void) >> { >> foo (); >> } >> >> Even with LTO, compiler may have to assume foo is external >> when foo is compiled with LTO. > > This is not exactly true if FOO is defined in other translation unit > compiled with LTO and hidden visibility. I was meant to say " when foo is compiled without LTO.". > OK, so as I get it, we get the following cases: > > 1) compiler knows it is generating call to a local symbol a current > unit (binds_to_current_def_p returns true). > > We handle this correctly by doing IP relative call. > > 2) compiler knows it is generating call to a local symbol in DSO > (binds_local_p return true) > Currently I think this is only the -fno-pic case or case of explicit > hidden visibility and in this case we do IP relative call. > > We may want to propose plugin API update adding PREVAILING_DEF_EXP. > So copiler would be able to default to this case for PREVAILING_DEF > and we will also catch cases where the symbol is defined in current > DSO as weak symbol, but the definition is not LTO. > This would be also way to communicate -Bsymbolic[-functions] across > the plugin API. > > 3) compiler knows there is going to be definition in the current DSO > (by seeing a COMDAT function body or resolution info) that is interposable > but because the function is inline or -fno-semantic-interposition happens, > the semantics will not change. > > In this case it would be nice to arrange IP relative call to the > hidden alias. This may require an extension both on compiler and linker > side. > > I was thinking of doing so for comdats by adding hidden alias with > fixed mangling, like __gnu_<function>.hiddenalias, and referring it. > But I think it is not safe as linker may throw away section that > is produced by GCC and prevail section that is not leaving to an undefined > symbol? 
> > I think this is rather common case in C++ (never made any stats) because > uninlined comdats are quite common. > > 4) compiler has no clue but linker may know better > > Here we traditionally always produce a PLT call. In cases the call > is known to be hot in the program it makes sense to trade lazy binding > for performance and produce call via GOT reference (-fno-plt). > I also see that H.J.'s branch helps us to actually avoid the GOT > reference in cases the symbol ends up binding locally. How the lazy > binding with relaxation works? If there is no GOT slot allocated for symbol foo, linker should resolve foo@GOTPLT(%ebx) to to its PLT slot address + 6, which is the push instruction, to support lazy binding. Otherwise, linker should resolve it to its GOT slot address. > We may try to communicate down the information whether the symbol can > or can not semantically interpose to the linker, so it can do > -Bsymbolic by default for inline and COMDAT functions. > Actually perhaps the linker can just default to this for all comdat > defined symbols? > > I think it still make sense to work on non-LTO codegen improvements. > As much as I would like everyone to LTO and FDO, most people don't. > > 5) Compiler knows it is generating call to external function. > We do not special case this, but we could add binds_external_p and > make it to determine this case from resolution info during LTO. > > I do not see if this case is any different from 4 from PIC codegen > perspective except that perhaps the relax relocation will allow us to lazy > bind? My relax branch proposal works even without LTO.
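The "PLT slot address + 6" figure can be read as simple address arithmetic. The sketch below assumes the conventional i386 PIC PLT entry layout, in which the entry starts with a 6-byte `jmp *slot(%ebx)` (opcode bytes `ff a3` plus a 32-bit displacement) followed by the `push`; the function and constant names are hypothetical.

```c
#include <stdint.h>

/* Assumed size of the leading `jmp *slot(%ebx)` in an i386 PIC PLT
   entry: two opcode/ModRM bytes plus a 4-byte displacement. */
enum { I386_PLT_JMP_SIZE = 6 };

/* Initial value of a lazily bound slot: the address of the `push`
   instruction that follows the jmp inside the same PLT entry. */
uint32_t lazy_slot_initial_value(uint32_t plt_entry_addr)
{
    return plt_entry_addr + I386_PLT_JMP_SIZE;
}
```

On the first call the indirect jump therefore lands on the `push`, which hands control to the dynamic linker's resolver; once the slot has been rewritten, later calls reach the target directly.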
On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> My relax branch proposal works even without LTO.
>

I will borrow GOTPCREL from x86-64 and do

    [hjl@gnu-6 relax-4]$ cat b.S
    call *foo@GOTPCREL(%eax)
    [hjl@gnu-6 relax-4]$ ./as -32 -o b.o b.S
    [hjl@gnu-6 relax-4]$ ./objdump -dwr b.o

    b.o: file format elf32-i386

    Disassembly of section .text:

    00000000 <.text>:
    0: ff 90 fc ff ff ff call *-0x4(%eax) 2: R_386_RELAX_GOT32 foo
    [hjl@gnu-6 relax-4]$

And linker can turn it into

    relax call foo

if foo is defined locally.
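For readers unfamiliar with the encoding in the objdump output, the six bytes `ff 90 fc ff ff ff` can be taken apart by hand. The sketch below only decodes the ModRM byte and the little-endian 32-bit displacement; the helper names are hypothetical and it is not a general x86 decoder.

```c
#include <stdint.h>

/* The three bit fields of an x86 ModRM byte. */
struct modrm {
    unsigned mod;  /* bits 7..6: addressing mode */
    unsigned reg;  /* bits 5..3: opcode extension for opcode 0xff */
    unsigned rm;   /* bits 2..0: base register */
};

struct modrm split_modrm(uint8_t byte)
{
    struct modrm m = { (byte >> 6) & 3u, (byte >> 3) & 7u, byte & 7u };
    return m;
}

/* Reads a little-endian 32-bit displacement as a signed value without
   relying on implementation-defined integer conversions. */
int32_t read_disp32_le(const uint8_t *p)
{
    uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return (v & 0x80000000u) ? -(int32_t)(~v) - 1 : (int32_t)v;
}
```

For the bytes above, ModRM `0x90` splits into mod = 2 (register plus disp32), reg = 2 (the near indirect call form of opcode `ff`) and rm = 0 (`%eax`), and the displacement is -4. The relocation is reported at offset 2 because that is where the 4-byte displacement field starts.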
On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> My relax branch proposal works even without LTO. >> > > I will borrow GOTPCREL from x86-64 and do > > [hjl@gnu-6 relax-4]$ cat b.S > call *foo@GOTPCREL(%eax) call *foo@GOTPLT(%eax) is a better choice. > [hjl@gnu-6 relax-4]$ ./as -32 -o b.o b.S > [hjl@gnu-6 relax-4]$ ./objdump -dwr b.o > > b.o: file format elf32-i386 > > > Disassembly of section .text: > > 00000000 <.text>: > 0: ff 90 fc ff ff ff call *-0x4(%eax) 2: R_386_RELAX_GOT32 foo > [hjl@gnu-6 relax-4]$ > > And linker can turn it into > > relax call foo > > if foo is defined locally.
On Fri, May 15, 2015 at 04:14:07PM -0700, H.J. Lu wrote: > On Fri, May 15, 2015 at 4:08 PM, Jan Hubicka <hubicka@ucw.cz> wrote: > > Hello, > >> > >> There are codes like > >> > >> extern void foo (void); > >> > >> void > >> bar (void) > >> { > >> foo (); > >> } > >> > >> Even with LTO, compiler may have to assume foo is external > >> when foo is compiled with LTO. > > > > This is not exactly true if FOO is defined in other translation unit > > compiled with LTO and hidden visibility. > > I was meant to say " when foo is compiled without LTO.". > > > OK, so as I get it, we get the following cases: > > > > 1) compiler knows it is generating call to a local symbol a current > > unit (binds_to_current_def_p returns true). > > > > We handle this correctly by doing IP relative call. > > > > 2) compiler knows it is generating call to a local symbol in DSO > > (binds_local_p return true) > > Currently I think this is only the -fno-pic case or case of explicit > > hidden visibility and in this case we do IP relative call. > > > > We may want to propose plugin API update adding PREVAILING_DEF_EXP. > > So copiler would be able to default to this case for PREVAILING_DEF > > and we will also catch cases where the symbol is defined in current > > DSO as weak symbol, but the definition is not LTO. > > This would be also way to communicate -Bsymbolic[-functions] across > > the plugin API. > > > > 3) compiler knows there is going to be definition in the current DSO > > (by seeing a COMDAT function body or resolution info) that is interposable > > but because the function is inline or -fno-semantic-interposition happens, > > the semantics will not change. > > > > In this case it would be nice to arrange IP relative call to the > > hidden alias. This may require an extension both on compiler and linker > > side. > > > > I was thinking of doing so for comdats by adding hidden alias with > > fixed mangling, like __gnu_<function>.hiddenalias, and referring it. 
> > But I think it is not safe as linker may throw away section that > > is produced by GCC and prevail section that is not leaving to an undefined > > symbol? > > > > I think this is rather common case in C++ (never made any stats) because > > uninlined comdats are quite common. > > > > 4) compiler has no clue but linker may know better > > > > Here we traditionally always produce a PLT call. In cases the call > > is known to be hot in the program it makes sense to trade lazy binding > > for performance and produce call via GOT reference (-fno-plt). > > I also see that H.J.'s branch helps us to actually avoid the GOT > > reference in cases the symbol ends up binding locally. How the lazy > > binding with relaxation works? > > If there is no GOT slot allocated for symbol foo, linker should resolve > foo@GOTPLT(%ebx) to to its PLT slot address + 6, which is the push > instruction, to support lazy binding. Otherwise, linker should resolve it > to its GOT slot address. Forget lazy binding. It's dead anyway because serious distros want PIE+relro+bindnow+... If people really want lazy binding, they can use options which support it, but I don't want to keep suffering the codegen cost of lazy binding despite never using it. There should be an option to generate optimal code equivalent to what you get with Alexander Monakov's patches for those of us who aren't trying to support this legacy feature that precludes good performance and precludes hardening. Rich
On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >> My relax branch proposal works even without LTO.
> >>
> >
> > I will borrow GOTPCREL from x86-64 and do
> >
> > [hjl@gnu-6 relax-4]$ cat b.S
> > call *foo@GOTPCREL(%eax)
>
> call *foo@GOTPLT(%eax)
>
> is a better choice.

foo@GOTPCREL is preferable (but does not yet exist for ia32, so the reloc type would have to be added) since it saves a useless add. Instead of:

    call __x86.get_pc_thunk.ax
    addl $_GLOBAL_OFFSET_TABLE_, %eax
    call *foo@GOTPLT(%eax)

you can just do:

    call __x86.get_pc_thunk.ax
    call *foo@GOTPCREL(%eax)

Note that it also works to have extra instructions between:

    call __x86.get_pc_thunk.ax
    1: ...
    call *foo@GOTPCREL+(1b-.)(%eax)

I may not have gotten the syntax quite right, but hopefully you get the idea. This same approach (with GOTPCREL) can be used for _all_ GOT accesses, including global data, to eliminate the useless add.

Rich
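The "useless add" argument is address arithmetic, which the following sketch models with plain integers. It is a simplification under stated assumptions: `pc` stands for the value returned by `__x86.get_pc_thunk.ax`, the real `addl` immediate also accounts for the position of the `addl` itself, and the function names and addresses are hypothetical.

```c
#include <stdint.h>

/* Two-step form: turn the pc into the GOT base, then index the slot. */
uint32_t slot_via_got_base(uint32_t pc, uint32_t got_base, uint32_t slot_off)
{
    uint32_t reg = pc;        /* call __x86.get_pc_thunk.ax */
    reg += got_base - pc;     /* addl $_GLOBAL_OFFSET_TABLE_, %eax */
    return reg + slot_off;    /* address used by call *foo@GOT...(%eax) */
}

/* Folded form: one pc-relative displacement goes straight to the slot. */
uint32_t slot_via_pcrel(uint32_t pc, uint32_t got_base, uint32_t slot_off)
{
    uint32_t reg = pc;                           /* call __x86.get_pc_thunk.ax */
    uint32_t disp = (got_base + slot_off) - pc;  /* computed at link time */
    return reg + disp;                           /* call *foo@GOTPCREL(%eax) */
}
```

Both forms reach the same slot. The difference is that after the folded form the register still holds the pc rather than the GOT base, which matters to any other relocation that expects the GOT base in that register.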
On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote: > On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote: >> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> >> My relax branch proposal works even without LTO. >> >> >> > >> > I will borrow GOTPCREL from x86-64 and do >> > >> > [hjl@gnu-6 relax-4]$ cat b.S >> > call *foo@GOTPCREL(%eax) >> >> call *foo@GOTPLT(%eax) >> >> is a better choice. > > foo@GOTPCREL is preferable (but does not yet exist for ia32, so the > reloc type would have to be added) since it saves a useless add. > Instead of: > > call __x86.get_pc_thunk.ax > addl $_GLOBAL_OFFSET_TABLE_, %eax > call *foo@GOTPLT(%eax) > > you can just do: > > call __x86.get_pc_thunk.ax > call *foo@GOTPCREL(%eax) > > Note that it also works to have extra instructions between: > > call __x86.get_pc_thunk.ax > 1: ... > call *foo@GOTPCREL+(1b-.)(%eax) > > I may not have gotten the syntax quite right, but hopefully yoy get > the idea. This same approach (with GOTPCREL) can be used for _all_ GOT > accesses, including global data, to eliminate the useless add. > This is a good idea. But I'd like to use something for both i386 and x86-64. I am proposing call/jmp *foo@GOTPCRELAX+addend(%reg) It is similar to @GOTPCREL, but with a new relax relocation. Before I can do that, I need to fix https://sourceware.org/bugzilla/show_bug.cgi?id=18423 first.
On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
>> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> >> My relax branch proposal works even without LTO.
>>> >>
>>> >
>>> > I will borrow GOTPCREL from x86-64 and do
>>> >
>>> > [hjl@gnu-6 relax-4]$ cat b.S
>>> > call *foo@GOTPCREL(%eax)
>>>
>>> call *foo@GOTPLT(%eax)
>>>
>>> is a better choice.
>>
>> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
>> reloc type would have to be added) since it saves a useless add.
>> Instead of:
>>
>>     call __x86.get_pc_thunk.ax
>>     addl $_GLOBAL_OFFSET_TABLE_, %eax
>>     call *foo@GOTPLT(%eax)
>>
>> you can just do:
>>
>>     call __x86.get_pc_thunk.ax
>>     call *foo@GOTPCREL(%eax)
>>
>> Note that it also works to have extra instructions between:
>>
>>     call __x86.get_pc_thunk.ax
>>     1: ...
>>     call *foo@GOTPCREL+(1b-.)(%eax)
>>
>> I may not have gotten the syntax quite right, but hopefully you get
>> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
>> accesses, including global data, to eliminate the useless add.
>>
>
> This is a good idea.  But I'd like to use something for both i386 and
> x86-64.  I am proposing
>
>     call/jmp *foo@GOTPCRELAX+addend(%reg)
>
> It is similar to @GOTPCREL, but with a new relax relocation.  Before
> I can do that, I need to fix

It doesn't work.  REG must hold GOT base for other GOT relocations.
We need to keep

    addl $_GLOBAL_OFFSET_TABLE_, %eax
On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote: > On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote: > > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote: > >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote: > >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > >>> >> My relax branch proposal works even without LTO. > >>> >> > >>> > > >>> > I will borrow GOTPCREL from x86-64 and do > >>> > > >>> > [hjl@gnu-6 relax-4]$ cat b.S > >>> > call *foo@GOTPCREL(%eax) > >>> > >>> call *foo@GOTPLT(%eax) > >>> > >>> is a better choice. > >> > >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the > >> reloc type would have to be added) since it saves a useless add. > >> Instead of: > >> > >> call __x86.get_pc_thunk.ax > >> addl $_GLOBAL_OFFSET_TABLE_, %eax > >> call *foo@GOTPLT(%eax) > >> > >> you can just do: > >> > >> call __x86.get_pc_thunk.ax > >> call *foo@GOTPCREL(%eax) > >> > >> Note that it also works to have extra instructions between: > >> > >> call __x86.get_pc_thunk.ax > >> 1: ... > >> call *foo@GOTPCREL+(1b-.)(%eax) > >> > >> I may not have gotten the syntax quite right, but hopefully yoy get > >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT > >> accesses, including global data, to eliminate the useless add. > >> > > > > This is a good idea. But I'd like to use something for both i386 and > > x86-64. I am proposing > > > > call/jmp *foo@GOTPCRELAX+addend(%reg) > > > > It is similar to @GOTPCREL, but with a new relax relocation. Before > > I can do that, I need to fix > > It doesn't work. REG must hold GOT base for other GOT relocations. 
> We need to keep
>
>     addl $_GLOBAL_OFFSET_TABLE_, %eax

Like I just said, all foo@GOT(%gotreg) can be replaced with
foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
pointing to the referenced label (the point at which the program
counter was saved). This is a minor but useful optimization that can
be made for all GOT accesses, not just ones for [relaxable] function
calls.

Rich
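The address arithmetic behind this suggestion can be checked with plain integers. The following is only a sketch under stated assumptions, not assembler or linker behaviour: the function names are hypothetical, the addresses are made up, and `foo@GOTPCREL` on ia32 is the proposed (not yet existing) relocation discussed above. `label` stands for the address the PC thunk returns, `got_base` for `_GLOBAL_OFFSET_TABLE_`, and `slot` for the GOT entry of `foo`.

```c
#include <stdint.h>

/* Current i386 PIC sequence, modelled with integers (hypothetical helper).
   The thunk leaves the address of `label` in the register, the addl turns
   it into the GOT base, and foo@GOT supplies the slot's offset from that
   base.  */
uint32_t slot_via_got_base (uint32_t label, uint32_t got_base, uint32_t slot)
{
  uint32_t reg = label;             /* call __x86.get_pc_thunk.ax */
  reg += got_base - label;          /* addl $_GLOBAL_OFFSET_TABLE_, %eax */
  return reg + (slot - got_base);   /* foo@GOT(%eax) */
}

/* Suggested form: the register keeps pointing at `label`, and the
   displacement is the slot's offset from that label, so no addl is
   needed.  */
uint32_t slot_via_pcrel (uint32_t label, uint32_t slot)
{
  uint32_t reg = label;             /* call __x86.get_pc_thunk.ax */
  return reg + (slot - label);      /* foo@GOTPCREL+[label-.](%eax) */
}

/* Nonzero when both forms compute the same GOT slot address.  */
int forms_agree (uint32_t label, uint32_t got_base, uint32_t slot)
{
  return slot_via_got_base (label, got_base, slot)
         == slot_via_pcrel (label, slot);
}
```

Unsigned 32-bit arithmetic wraps, so the two forms agree whether the GOT sits above or below the label. The sketch says nothing about the case H.J. raises, where the same register must also serve as the GOT base for other relocations.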
On Sat, May 16, 2015 at 12:03 PM, Rich Felker <dalias@libc.org> wrote: > On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote: >> On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote: >> > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote: >> >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote: >> >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote: >> >>> >> My relax branch proposal works even without LTO. >> >>> >> >> >>> > >> >>> > I will borrow GOTPCREL from x86-64 and do >> >>> > >> >>> > [hjl@gnu-6 relax-4]$ cat b.S >> >>> > call *foo@GOTPCREL(%eax) >> >>> >> >>> call *foo@GOTPLT(%eax) >> >>> >> >>> is a better choice. >> >> >> >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the >> >> reloc type would have to be added) since it saves a useless add. >> >> Instead of: >> >> >> >> call __x86.get_pc_thunk.ax >> >> addl $_GLOBAL_OFFSET_TABLE_, %eax >> >> call *foo@GOTPLT(%eax) >> >> >> >> you can just do: >> >> >> >> call __x86.get_pc_thunk.ax >> >> call *foo@GOTPCREL(%eax) >> >> >> >> Note that it also works to have extra instructions between: >> >> >> >> call __x86.get_pc_thunk.ax >> >> 1: ... >> >> call *foo@GOTPCREL+(1b-.)(%eax) >> >> >> >> I may not have gotten the syntax quite right, but hopefully yoy get >> >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT >> >> accesses, including global data, to eliminate the useless add. >> >> >> > >> > This is a good idea. But I'd like to use something for both i386 and >> > x86-64. I am proposing >> > >> > call/jmp *foo@GOTPCRELAX+addend(%reg) >> > >> > It is similar to @GOTPCREL, but with a new relax relocation. Before >> > I can do that, I need to fix >> >> It doesn't work. REG must hold GOT base for other GOT relocations. 
>> We need to keep
>>
>>     addl $_GLOBAL_OFFSET_TABLE_, %eax
>
> Like I just said, all foo@GOT(%gotreg) can be replaced with
> foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
> pointing to the referenced label (the point at which the program
> counter was saved). This is a minor but useful optimization that can
> be made for all GOT accesses, not just ones for [relaxable] function
> calls.

There is also foo@GOTOFF(%reg).  Removing the addl is independent of the
relax branch, so I will leave it out.  The relax branch will support

    call/jmp *bar@GOTRELAX(%reg)

for both i386 and x86-64.
On Fri, 15 May 2015, Jan Hubicka wrote:
> > >> With -fno-plt, we don't have to reject even direct calls as sibcall
> > >> candidates.
> > >>
> > >> This patch depends on '-fplt' flag that is introduced in another patch.
> > >>
> > >> This patch requires that with -fno-plt all sibcall candidates go through
> > >> prepare_call_address that transforms the call to a GOT lookup.
> > >>
> > >> OK?
> > >> * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> > >>
> > >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > >> index f29e053..b734350 100644
> > >> --- a/gcc/config/i386/i386.c
> > >> +++ b/gcc/config/i386/i386.c
> > >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
> > >>    /* If we are generating position-independent code, we cannot sibcall
> > >>       optimize any indirect call, or a direct call to a global function,
> > >>       as the PLT requires %ebx be live. (Darwin does not have a PLT.) */
> > >>    if (!TARGET_MACHO
> > >>        && !TARGET_64BIT
> > >>        && flag_pic
> > >> +      && flag_plt
> > >>        && (decl && !targetm.binds_local_p (decl)))
> > >>      return false;
> > >>
> > >>    /* If we need to align the outgoing stack, then sibcalling would
> > >>       unalign the stack, which may break the called function. */
> > >>    if (ix86_minimum_incoming_stack_boundary (true)
> > >>
> >
> > I think it should be done via psABI change similar to
> >
> > https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI
> >
> > which I have implemented on users/hjl/relax branch in binutils.
>
> OK, I am trying to understand how relax branch works and what difference it makes.
> As I understand it, the main purpose is to be able to make relaxed call of
>
> call function
>
> that will, in 64bit mode, either result to RIP relative call with extra NOP just
> before the instruction if FUNCTION binds within the DSO or to indirect call through
> GOT bypassing the PLT. This saves overhead of PLT and increase every such call
> by extra NOP for no-LTO builds and even in LTO when the symbol is defined but
> interposable. This is actually really nice trick.
>
> Now this is about 32bit mode where explicit GOT pointer register is needed
> (how this work with large code model on x86-64?). It is needed by PLT, but I suppose
> to implement the same relaxation for 32bit it would need to use EBX to lookup the
> GOT pointer, too, so the check above would still be valid.
>
> The patches makes sense to be given that we support -fno-plt now.

After this message the discussion diverged in the direction of H.J.Lu's
proposed relaxation scheme involving new type of relocations.

I'm not clear if my patch is actually approved. I'd like to point out that it
doesn't clash with H.J.Lu's work. It improves codegen by allowing sibcalls in
more circumstances.

Alexander
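To make the effect of the one-line patch concrete, here is a minimal C model of the PIC/PLT condition it modifies. This is a sketch, not GCC code: `struct sibcall_ctx` and `pic_check_rejects_sibcall` are hypothetical names, each boolean merely stands in for the macro or flag named in its comment, and the real `ix86_function_ok_for_sibcall` performs further checks that are not modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the macros and flags tested in the patch.  */
struct sibcall_ctx
{
  bool target_macho;  /* TARGET_MACHO */
  bool target_64bit;  /* TARGET_64BIT */
  bool flag_pic;      /* flag_pic: generating position-independent code */
  bool flag_plt;      /* flag_plt: true by default, false with -fno-plt */
  bool have_decl;     /* decl != NULL, i.e. a direct call */
  bool binds_local;   /* targetm.binds_local_p (decl) */
};

/* Returns true when the PIC/PLT condition from the patch would make the
   function reject the sibcall.  Only this one condition is modelled.  */
bool pic_check_rejects_sibcall (const struct sibcall_ctx *c)
{
  return !c->target_macho
         && !c->target_64bit
         && c->flag_pic
         && c->flag_plt
         && (c->have_decl && !c->binds_local);
}
```

With `flag_plt` set, a direct call to a global function in 32-bit PIC code is rejected as before; clearing it (the `-fno-plt` case) lets the same call through this particular check.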
>
> After this message the discussion diverged in the direction of H.J.Lu's
> proposed relaxation scheme involving new type of relocations.
>
> I'm not clear if my patch is actually approved. I'd like to point out that it
> doesn't clash with H.J.Lu's work. It improves codegen by allowing sibcalls in
> more circumstances.

Yes, the original patch is OK.

Honza
>
> Alexander
Hi,

On Fri, 15 May 2015, Rich Felker wrote:

> Forget lazy binding. It's dead anyway because serious distros want
> PIE+relro+bindnow+...

You keep saying this, but I can't help the feeling it's mostly because
musl doesn't support it ;-)

No, you don't have to use bindnow to get the effects of relro. Sure
there's more parts of the GOT protected with it, but if that's really that
much more hardened is up for debate.

> If people really want lazy binding, they can use options which support
> it, but I don't want to keep suffering the codegen cost of lazy binding
> despite never using it.

> There should be an option to generate optimal code equivalent to what
> you get with Alexander Monakov's patches for those of us who aren't
> trying to support this legacy feature that precludes good performance
> and precludes hardening.

H.J.'s branch is for _improving_ code on top of the no-plt code, it's not
replacing it or an alternative for it.

Ciao,
Michael.
On 05/19/2015 08:43 AM, Michael Matz wrote:
> Hi,
>
> On Fri, 15 May 2015, Rich Felker wrote:
>
>> Forget lazy binding. It's dead anyway because serious distros want
>> PIE+relro+bindnow+...
>
> You keep saying this, but I can't help the feeling it's mostly because
> musl doesn't support it ;-)

FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the
distribution. It's not clear yet how far bindnow will go though.

jeff
Hi,

On Tue, 19 May 2015, Jeff Law wrote:

> > > Forget lazy binding. It's dead anyway because serious distros want
> > > PIE+relro+bindnow+...
> >
> > You keep saying this, but I can't help the feeling it's mostly because
> > musl doesn't support it ;-)
>
> FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the
> distribution.

Yeah, us as well, though I don't necessarily see the point for most
packages; feels a bit like a checkmark item :)

Ciao,
Michael.
On Tue, May 19, 2015 at 04:43:53PM +0200, Michael Matz wrote:
> Hi,
>
> On Fri, 15 May 2015, Rich Felker wrote:
>
> > Forget lazy binding. It's dead anyway because serious distros want
> > PIE+relro+bindnow+...
>
> You keep saying this, but I can't help the feeling it's mostly because
> musl doesn't support it ;-)

Well the reasons musl doesn't support it are partly the above, and
partly that it's been a continuous source of subtle bugs in glibc --
things like clobbering new vector registers, missing synchronization,
failures to be async-signal-safe, etc. So it's not that I think lazy
binding is bad because musl doesn't support it, but rather that musl
doesn't support lazy binding because I think it's bad. :-)

> No, you don't have to use bindnow to get the effects of relro. Sure
> there's more parts of the GOT protected with it, but if that's really that
> much more hardened is up for debate.

Normally it's function addresses that you care about protecting --
they're the easy vector for arbitrary code execution -- and they're
unprotected without bindnow. Addresses of global data could also be an
attack vector, but a more difficult one to exploit.

> > If people really want lazy binding, they can use options which support
> > it, but I don't want to keep suffering the codegen cost of lazy binding
> > despite never using it.
>
> > There should be an option to generate optimal code equivalent to what
> > you get with Alexander Monakov's patches for those of us who aren't
> > trying to support this legacy feature that precludes good performance
> > and precludes hardening.
>
> H.J.'s branch is for _improving_ code on top of the no-plt code, it's not
> replacing it or an alternative for it.

Thanks for the clarification -- this was the part I was failing to
understand. I'm still mildly worried that concerns for supporting
relaxation might lead to decisions not to optimize code in ways that
would be difficult to relax (e.g. certain types of address load
reordering or hoisting) but I don't understand GCC internals
sufficiently to know if this concern is warranted or not.

As long as his work isn't interfering with the ability of -fno-plt to
generate optimal code, I agree it's both inappropriate and
counter-productive for me to be objecting to part or all of it.

I would still like to see the @GOTPCREL stuff added and used instead
of @GOT, as I mentioned earlier in the thread, but I agree that's
independent of relaxation support and shouldn't block it.

Rich
On 05/19/2015 11:06 AM, Rich Felker wrote:
> I'm still mildly worried that concerns for supporting
> relaxation might lead to decisions not to optimize code in ways that
> would be difficult to relax (e.g. certain types of address load
> reordering or hoisting) but I don't understand GCC internals
> sufficiently to know if this concern is warranted or not.

It is. The relaxation that HJ is working on requires that the reads from the
got not be hoisted. I'm not especially convinced that what he's working on is
a win.

With LTO, the compiler can do the same job that he's attempting in the linker,
without an extra nop. Without LTO, leaving it to the linker means that you
can't hoist the load and hide the memory latency.

> I would still like to see the @GOTPCREL stuff added and used instead
> of @GOT, as I mentioned earlier in the thread, but I agree that's
> independent of relaxation support and shouldn't block it.

I don't think that @GOTPCREL for 32-bit is a good idea. This is the scheme
that Darwin uses, so we do have some experience with it.

In order for it to work you've got to have a pointer to a random address in the
function. It means that you can only "easily" compute the address once. If
you need the value again you wind up with the same "extra" addl insn that we
have with the current GOT pointer.

We've just started to do inter-function register allocation. The next step
along those lines is to share the computation of GOT between multiple
functions. At which point it really helps to have one global base address to
talk about.

r~
On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> I'm still mildly worried that concerns for supporting
>> relaxation might lead to decisions not to optimize code in ways that
>> would be difficult to relax (e.g. certain types of address load
>> reordering or hoisting) but I don't understand GCC internals
>> sufficiently to know if this concern is warranted or not.
>
> It is. The relaxation that HJ is working on requires that the reads from the
> got not be hoisted. I'm not especially convinced that what he's working on is
> a win.
>
> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop. Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.
>

My relax approach won't take away any optimization done by compiler.
It simply turns indirect branch into direct branch with a nop prefix at
link-time. I am having a hard time to understand why we shouldn't do it.
On Tue, May 19, 2015 at 06:01:07PM +0200, Michael Matz wrote:
> Hi,
>
> On Tue, 19 May 2015, Jeff Law wrote:
>
> > > > Forget lazy binding. It's dead anyway because serious distros want
> > > > PIE+relro+bindnow+...
> > >
> > > You keep saying this, but I can't help the feeling it's mostly because
> > > musl doesn't support it ;-)
> >
> > FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the
> > distribution.
>
> Yeah, us as well, though I don't necessarily see the point for most
> packages; feels a bit like a checkmark item :)

These days it's fairly rare to have software which does not interact
at all with untrusted data. Consider how much user-facing application
software that was not previously considered security-critical is
making network connections using complex protocols (e.g. anything with
TLS, IM protocols, ...), opening image files from random sources
(attachments, files that happen to be in a browsed-to directory, on
USB sticks, etc.), and so on. I think it's smart to be hardening
everything, at least for distros providing all sorts of random
unvetted software.

Rich
On 05/19/2015 12:06 PM, H.J. Lu wrote:
> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>>> I'm still mildly worried that concerns for supporting
>>> relaxation might lead to decisions not to optimize code in ways that
>>> would be difficult to relax (e.g. certain types of address load
>>> reordering or hoisting) but I don't understand GCC internals
>>> sufficiently to know if this concern is warranted or not.
>>
>> It is. The relaxation that HJ is working on requires that the reads from the
>> got not be hoisted. I'm not especially convinced that what he's working on is
>> a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> without an extra nop. Without LTO, leaving it to the linker means that you
>> can't hoist the load and hide the memory latency.
>>
>
> My relax approach won't take away any optimization done by compiler.
> It simply turns indirect branch into direct branch with a nop prefix at
> link-time. I am having a hard time to understand why we shouldn't do it.

I well understand what you're doing.

But my point is that the only time the compiler should present you with the
form of indirect branch you're looking for is when there's no place to hoist
the load.

At which point, is it really worth adding a new relocation to the ABI? Is it
really worth adding new code to the linker that won't be exercised often?

r~
On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote: > On 05/19/2015 12:06 PM, H.J. Lu wrote: >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote: >>> On 05/19/2015 11:06 AM, Rich Felker wrote: >>>> I'm still mildly worried that concerns for supporting >>>> relaxation might lead to decisions not to optimize code in ways that >>>> would be difficult to relax (e.g. certain types of address load >>>> reordering or hoisting) but I don't understand GCC internals >>>> sufficiently to know if this concern is warranted or not. >>> >>> It is. The relaxation that HJ is working on requires that the reads from the >>> got not be hoisted. I'm not especially convinced that what he's working on is >>> a win. >>> >>> With LTO, the compiler can do the same job that he's attempting in the linker, >>> without an extra nop. Without LTO, leaving it to the linker means that you >>> can't hoist the load and hide the memory latency. >>> >> >> My relax approach won't take away any optimization done by compiler. >> It simply turns indirect branch into direct branch with a nop prefix at >> link-time. I am having a hard time to understand why we shouldn't do it. > > I well understand what you're doing. > > But my point is that the only time the compiler should present you with the > form of indirect branch you're looking for is when there's no place to hoist > the load. > > At which point, is it really worth adding a new relocation to the ABI? Is it > really worth adding new code to the linker that won't be exercised often? 

I believe there are plenty of indirect branches via GOT when compiling
PIE/PIC with -fno-plt:

[hjl@gnu-6 gcc]$ cat /tmp/x.c
extern void foo (void);

void
bar (void)
{
  foo ();
}
[hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
[hjl@gnu-6 gcc]$ cat x.s
    .file "x.c"
    .section .text.unlikely,"ax",@progbits
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4,,15
    .globl bar
    .type bar, @function
bar:
.LFB0:
    .cfi_startproc
    jmp *foo@GOTPCREL(%rip)
    .cfi_endproc
.LFE0:
    .size bar, .-bar
On Tue, May 19, 2015 at 11:59:00AM -0700, Richard Henderson wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
> > I'm still mildly worried that concerns for supporting
> > relaxation might lead to decisions not to optimize code in ways that
> > would be difficult to relax (e.g. certain types of address load
> > reordering or hoisting) but I don't understand GCC internals
> > sufficiently to know if this concern is warranted or not.
>
> It is. The relaxation that HJ is working on requires that the reads from the
> got not be hoisted. I'm not especially convinced that what he's working on is
> a win.

Well as long as -fno-plt actually generates a load from the GOT like
what would be done for data access, and does not go out of its way to
produce something compatible with relaxation, my hope is that it would
not be affected by the pessimization. I'm not sure if that's the case
though.

> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop. Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.

Yes, this is my feeling too. Alexander Monakov and I have been
discussing it on #musl a bit and I think the conclusion we reached is
that relaxation is possibly a significant real-world win for non-PIC
main executables, where it's very likely that addresses will be
resolved at ld-time and for the programmer not to specifically
annotate this with protected visibility. In such a case, you get
either a direct call or a direct address load and indirect call,
rather than hitting an extra cache line in the PLT thunk to do the
address load and indirect call. Note that, being non-PIC, there is no
GOT register involved here.

> > I would still like to see the @GOTPCREL stuff added and used instead
> > of @GOT, as I mentioned earlier in the thread, but I agree that's
> > independent of relaxation support and shouldn't block it.
>
> I don't think that @GOTPCREL for 32-bit is a good idea. This is the scheme
> that Darwin uses, so we do have some experience with it.
>
> In order for it to work you've got to have a pointer to a random address in the
> function. It means that you can only "easily" compute the address once. If
> you need the value again you wind up with the same "extra" addl insn that we
> have with the current GOT pointer.

Why would you recompute it (this requires a fairly expensive call that
reads or pops its own return address) rather than simply spilling the
already-computed value and reloading it from the stack?

The only example I can think of where it might make sense is when you
don't want to load the address unconditionally because there are
shrink-wrappable code paths that don't need it, but multiple code paths
that do, in which case they would each load different values. Is this
the concern you have in mind?

> We've just started to do inter-function register allocation. The next step
> along those lines is to share the computation of GOT between multiple
> functions. At which point it really helps to have one global base address to
> talk about.

I see -- that would be another case where it simplifies things.

Rich
On 05/19/2015 12:17 PM, H.J. Lu wrote:
>> But my point is that the only time the compiler should present you with the
>> form of indirect branch you're looking for is when there's no place to hoist
>> the load.
>>
>> At which point, is it really worth adding a new relocation to the ABI? Is it
>> really worth adding new code to the linker that won't be exercised often?
>
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
>
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
>
> void
> bar (void)
> {
>   foo ();
> }

Sure, as I said, when there's no place to hoist the load.

Try anything more complicated,

void bar (void)
{
  int i;
  for (i = 0; i < 10; ++i)
    foo ();
}

void baz (void)
{
  foo ();
  foo ();
}

and you'll not see the

    call *foo@GOTPCREL(%rip)

form.

Of course there's also plenty of times where combine recreates exactly that
form when perhaps the scheduler might have preferred otherwise. Those are
optimization choices to be addressed under separate cover.

My point that we can already do what you want via LTO, without adding new
relocations, is still relevant.

r~
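For readers unfamiliar with the term, "hoisting the load" can be pictured at the source level. The sketch below is only an analogy under stated assumptions: `foo` is defined locally so the example links, the `foo_calls` counter and its accessor are hypothetical additions, and whether GCC actually emits one address load ahead of the loop depends on version, flags and target.

```c
/* Stand-in for the external function; defined here so the sketch links.  */
static int foo_calls;

void foo (void)
{
  foo_calls++;
}

/* As written in the source: every iteration calls foo.  With -fPIC
   -fno-plt each call conceptually needs foo's address from the GOT.  */
void bar (void)
{
  int i;
  for (i = 0; i < 10; ++i)
    foo ();
}

/* Source-level picture of the hoisted form: the address is loaded once
   ahead of the loop and each call goes through that local copy, so no
   call site has the load-and-branch shape that a linker could relax.  */
void bar_hoisted (void)
{
  void (*fn) (void) = foo;  /* single address load */
  int i;
  for (i = 0; i < 10; ++i)
    fn ();                  /* indirect call through the copy */
}

/* Accessor so callers can observe how often foo ran.  */
int foo_call_count (void)
{
  return foo_calls;
}
```

Both functions call `foo` ten times; the difference is only in where the address is obtained, which is the property the relaxation debate turns on.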
On 05/19/2015 12:35 PM, Rich Felker wrote:
> Why would you recompute it (this requires a fairly expensive call that
> reads or pops its own return address) rather than simply spilling the
> already-computed value and reloading it from the stack?
>
> The only example I can think of where it might make sense is when you
> don't want to load the address unconditionally because there are
> shrink-wrappable code paths that don't need it, but multiple code paths
> that do, in which case they would each load different values. Is this
> the concern you have in mind?

That too. I was thinking of exception landing pads, i.e. catches and cleanups,
where in the past we've had to re-compute the GOT address.

Though now that I think on that more, it wasn't x86 that had that particular
landing pad trouble.

r~
On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote: > On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote: > > On 05/19/2015 12:06 PM, H.J. Lu wrote: > >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote: > >>> On 05/19/2015 11:06 AM, Rich Felker wrote: > >>>> I'm still mildly worried that concerns for supporting > >>>> relaxation might lead to decisions not to optimize code in ways that > >>>> would be difficult to relax (e.g. certain types of address load > >>>> reordering or hoisting) but I don't understand GCC internals > >>>> sufficiently to know if this concern is warranted or not. > >>> > >>> It is. The relaxation that HJ is working on requires that the reads from the > >>> got not be hoisted. I'm not especially convinced that what he's working on is > >>> a win. > >>> > >>> With LTO, the compiler can do the same job that he's attempting in the linker, > >>> without an extra nop. Without LTO, leaving it to the linker means that you > >>> can't hoist the load and hide the memory latency. > >>> > >> > >> My relax approach won't take away any optimization done by compiler. > >> It simply turns indirect branch into direct branch with a nop prefix at > >> link-time. I am having a hard time to understand why we shouldn't do it. > > > > I well understand what you're doing. > > > > But my point is that the only time the compiler should present you with the > > form of indirect branch you're looking for is when there's no place to hoist > > the load. > > > > At which point, is it really worth adding a new relocation to the ABI? Is it > > really worth adding new code to the linker that won't be exercised often? 
>
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
>
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
>
> void
> bar (void)
> {
>   foo ();
> }
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> [hjl@gnu-6 gcc]$ cat x.s
>     .file "x.c"
>     .section .text.unlikely,"ax",@progbits
> .LCOLDB0:
>     .text
> .LHOTB0:
>     .p2align 4,,15
>     .globl bar
>     .type bar, @function
> bar:
> .LFB0:
>     .cfi_startproc
>     jmp *foo@GOTPCREL(%rip)
>     .cfi_endproc
> .LFE0:
>     .size bar, .-bar

I agree these exist. What I question is whether the savings from the
linker being able to relax this to a direct call in the case where the
programmer failed to let the compiler make it a direct call to begin
with (by using hidden or protected visibility) are worth the cost of
not being able to hoist the load out of loops or schedule it earlier
in cases where relaxation is not possible because the call target is
not defined in the same DSO.

Rich
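The visibility annotation mentioned here is the GCC/Clang `visibility` attribute for ELF targets. A minimal sketch follows; `helper` and `api_entry` are hypothetical names, and the remarks about the resulting call sequences describe typical behaviour for a shared library built with -fPIC rather than a guarantee.

```c
/* Hidden visibility: the symbol is not exported from the shared object
   and cannot be interposed, so the compiler is free to emit a direct
   call to it even in -fPIC code.  "protected" is the related choice
   that keeps the symbol exported but still bound locally.  */
__attribute__ ((visibility ("hidden")))
int helper (int x)
{
  return x + 1;
}

/* Default visibility: exported and, in a shared library, potentially
   interposable.  Calls to such a function from other code in the same
   library typically go through the PLT, or through the GOT with
   -fno-plt.  */
int api_entry (int x)
{
  return helper (x) * 2;
}
```

Projects that cannot annotate every declaration often build with `-fvisibility=hidden` and mark only the public entry points as default, which gives the compiler the same information without relying on link-time relaxation.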
On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote: > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote: >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote: >> > On 05/19/2015 12:06 PM, H.J. Lu wrote: >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote: >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote: >> >>>> I'm still mildly worried that concerns for supporting >> >>>> relaxation might lead to decisions not to optimize code in ways that >> >>>> would be difficult to relax (e.g. certain types of address load >> >>>> reordering or hoisting) but I don't understand GCC internals >> >>>> sufficiently to know if this concern is warranted or not. >> >>> >> >>> It is. The relaxation that HJ is working on requires that the reads from the >> >>> got not be hoisted. I'm not especially convinced that what he's working on is >> >>> a win. >> >>> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker, >> >>> without an extra nop. Without LTO, leaving it to the linker means that you >> >>> can't hoist the load and hide the memory latency. >> >>> >> >> >> >> My relax approach won't take away any optimization done by compiler. >> >> It simply turns indirect branch into direct branch with a nop prefix at >> >> link-time. I am having a hard time to understand why we shouldn't do it. >> > >> > I well understand what you're doing. >> > >> > But my point is that the only time the compiler should present you with the >> > form of indirect branch you're looking for is when there's no place to hoist >> > the load. >> > >> > At which point, is it really worth adding a new relocation to the ABI? Is it >> > really worth adding new code to the linker that won't be exercised often? 
>>
>> I believe there are plenty of indirect branches via GOT when compiling
>> PIE/PIC with -fno-plt:
>>
>> [hjl@gnu-6 gcc]$ cat /tmp/x.c
>> extern void foo (void);
>>
>> void
>> bar (void)
>> {
>>   foo ();
>> }
>> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
>> [hjl@gnu-6 gcc]$ cat x.s
>>     .file "x.c"
>>     .section .text.unlikely,"ax",@progbits
>> .LCOLDB0:
>>     .text
>> .LHOTB0:
>>     .p2align 4,,15
>>     .globl bar
>>     .type bar, @function
>> bar:
>> .LFB0:
>>     .cfi_startproc
>>     jmp *foo@GOTPCREL(%rip)
>>     .cfi_endproc
>> .LFE0:
>>     .size bar, .-bar
>
> I agree these exist. What I question is whether the savings from the
> linker being able to relax this to a direct call in the case where the
> programmer failed to let the compiler make it a direct call to begin
> with (by using hidden or protected visibility) are worth the cost of
> not being able to hoist the load out of loops or schedule it earlier
> in cases where relaxation is not possible because the call target is
> not defined in the same DSO.

Just for fun. I compiled binutils as PIE with -fno-plt -flto:

[hjl@gnu-mic-2 gas]$ file as-new
as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
stripped
[hjl@gnu-mic-2 gas]$

There are 43:

ff 25 21 93 2d 00 jmpq *0x2d9321(%rip) # 3d5f58 <_DYNAMIC+0x1e8>

and 1983

ff 15 eb f4 38 00 callq *0x38f4eb(%rip) # 3d60e0 <_DYNAMIC+0x370>
On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> Just for fun. I compiled binutils as PIE with -fno-plt -flto:
>
> [hjl@gnu-mic-2 gas]$ file as-new
> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
> stripped
> [hjl@gnu-mic-2 gas]$
>
> There are 43:
>
> ff 25 21 93 2d 00 jmpq *0x2d9321(%rip) # 3d5f58 <_DYNAMIC+0x1e8>
>
> and 1983
>
> ff 15 eb f4 38 00 callq *0x38f4eb(%rip) # 3d60e0 <_DYNAMIC+0x370>

How many of those would be relaxed? I suspect it depends a lot on
whether libbfd is static or shared.

Rich
On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
>> There are 43:
>>
>> ff 25 21 93 2d 00 jmpq *0x2d9321(%rip) # 3d5f58 <_DYNAMIC+0x1e8>
>>
>> and 1983
>>
>> ff 15 eb f4 38 00 callq *0x38f4eb(%rip) # 3d60e0 <_DYNAMIC+0x370>
>
> How many of those would be relaxed? I suspect it depends a lot on
> whether libbfd is static or shared.

When shared libraries are enabled, there are 177 indirect branches
to locally defined functions. Call to any locally defined functions,
which aren't compiled with LTO, is indirect.
On Tue, May 19, 2015 at 05:10:11PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> > On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> >> There are 43:
> >>
> >> ff 25 21 93 2d 00 jmpq *0x2d9321(%rip) # 3d5f58 <_DYNAMIC+0x1e8>
> >>
> >> and 1983
> >>
> >> ff 15 eb f4 38 00 callq *0x38f4eb(%rip) # 3d60e0 <_DYNAMIC+0x370>
> >
> > How many of those would be relaxed? I suspect it depends a lot on
> > whether libbfd is static or shared.
>
> When shared libraries are enabled, there are 177 indirect branches
> to locally defined functions. Call to any locally defined functions,
> which aren't compiled with LTO, is indirect.

And are the above indirect calls/jumps (1983+43) candidates for
scheduling/hoisting the address load (that's not being done yet), or
are they the ones the compiler opted not to schedule/hoist? The win
from relaxation seems small here, but as long as you're not going to
block optimizations that would preclude relaxing, I don't see any
disadvantages to doing it.

Rich
Hi,

On Tue, 19 May 2015, Richard Henderson wrote:

> It is. The relaxation that HJ is working on requires that the reads
> from the got not be hoisted. I'm not especially convinced that what
> he's working on is a win.
>
> With LTO, the compiler can do the same job that he's attempting in the
> linker, without an extra nop. Without LTO, leaving it to the linker
> means that you can't hoist the load and hide the memory latency.

Well, hoisting always needs a register, and if hoisted out of a loop
(which you all seem to be after) that register is live through the whole
loop body. You need a register for each different called function in such
loop, trading the one GOT pointer with N other registers. For
register-starved machines this is a real problem, even x86-64 doesn't have
that many. I.e. I'm not convinced that this hoisting will really be much
of a win that often, outside toy examples. Sure, the compiler can hoist
function addresses trivially, but I think it will lead to spilling more
often than not, or alternatively the hoisting will be undone by the
register allocators rematerialization. Of course, this would have to be
measured for real not hand-waved, but, well, I'd be surprised if it's not
so.

Ciao,
Michael.
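The register cost described above can be pictured with a hand-hoisted loop. This is a source-level sketch only: `scale_val` and `add_one` are hypothetical stand-ins for functions that would live in another DSO, and a real compiler makes its own hoisting and rematerialization decisions regardless of how the source is written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callees standing in for functions defined in another DSO. */
double scale_val(double x) { return 2.0 * x; }
double add_one(double x)   { return x + 1.0; }

/* Hand-hoisted form: each callee address is loaded once before the loop
   and then has to stay live, in a register or a spill slot, for the whole
   loop body. Two distinct callees mean two long-lived pointer values. */
double sum_hoisted(const double *v, size_t n)
{
    double (*const f)(double) = scale_val;  /* first address, loaded once */
    double (*const g)(double) = add_one;    /* second address, loaded once */
    double total = 0.0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += g(f(v[i]));
    return total;
}
```

With N different callees in the loop, the same pattern needs N such pointers, which is the trade against the single GOT pointer that the message describes.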
On Wed, May 20, 2015 at 5:10 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Tue, 19 May 2015, Richard Henderson wrote:
>
>> It is. The relaxation that HJ is working on requires that the reads
>> from the got not be hoisted. I'm not especially convinced that what
>> he's working on is a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the
>> linker, without an extra nop. Without LTO, leaving it to the linker
>> means that you can't hoist the load and hide the memory latency.
>
> Well, hoisting always needs a register, and if hoisted out of a loop
> (which you all seem to be after) that register is live through the whole
> loop body. You need a register for each different called function in such
> loop, trading the one GOT pointer with N other registers. For
> register-starved machines this is a real problem, even x86-64 doesn't have
> that many. I.e. I'm not convinced that this hoisting will really be much
> of a win that often, outside toy examples. Sure, the compiler can hoist
> function addresses trivially, but I think it will lead to spilling more
> often than not, or alternatively the hoisting will be undone by the
> register allocators rematerialization. Of course, this would have to be
> measured for real not hand-waved, but, well, I'd be surprised if it's not
> so.
>

We should replace "call/jmp *foo@GOTPCREL(%rip)" with
"call/jmp *foo@GOTRELAX(%rip)". As an option, we apply -fno-plt to
both PIC and non-PIC codes, if foo is externally defined. It will
save one indirect branch if GCC is right. If GCC is wrong and
foo is defined locally, we get a nop prefix/suffix. We have nothing
to lose.
On Wed, May 20, 2015 at 02:10:41PM +0200, Michael Matz wrote:
> Hi,
>
> On Tue, 19 May 2015, Richard Henderson wrote:
>
> > It is. The relaxation that HJ is working on requires that the reads
> > from the got not be hoisted. I'm not especially convinced that what
> > he's working on is a win.
> >
> > With LTO, the compiler can do the same job that he's attempting in the
> > linker, without an extra nop. Without LTO, leaving it to the linker
> > means that you can't hoist the load and hide the memory latency.
>
> Well, hoisting always needs a register, and if hoisted out of a loop
> (which you all seem to be after) that register is live through the whole
> loop body. You need a register for each different called function in such
> loop, trading the one GOT pointer with N other registers. For
> register-starved machines this is a real problem, even x86-64 doesn't have
> that many. I.e. I'm not convinced that this hoisting will really be much
> of a win that often, outside toy examples. Sure, the compiler can hoist
> function addresses trivially, but I think it will lead to spilling more
> often than not, or alternatively the hoisting will be undone by the
> register allocators rematerialization. Of course, this would have to be
> measured for real not hand-waved, but, well, I'd be surprised if it's not
> so.

The obvious example where it's useful on x86_64 is a major class:
anything where the majority of the callee's data is floating point and
thus kept in xmm registers. In that case register pressure is a lot
lower, and there's also an obvious class of cross-DSO functions calls
you'd be making over and over again: anything from libm.

Rich
Hi,

On Wed, 20 May 2015, Rich Felker wrote:

> > of a win that often, outside toy examples. Sure, the compiler can hoist
> > function addresses trivially, but I think it will lead to spilling more
> > often than not, or alternatively the hoisting will be undone by the
> > register allocators rematerialization. Of course, this would have to be
> > measured for real not hand-waved, but, well, I'd be surprised if it's not
> > so.
>
> The obvious example where it's useful on x86_64 is a major class:

Yes, I can construct all kinds of examples where it's useful. That
doesn't touch the topic of real-world cases or hard numbers actually
comparing the number of hoisted callee addresses, the number that stay
hoisted until after register allocation and the number of spills added by
hoisting, on some relevant code base, like gcc itself, or SPEC.

> anything where the majority of the callee's data is floating point and
> thus kept in xmm registers.

This code tends to work on multiple arrays in practice, and hence integer
registers are required for all the addresses and offsets and loop
counters.

> In that case register pressure is a lot lower,

Register pressure on x86 is never low :) Yes, x86-64 and others are much
better in this regard.

> and there's also an obvious class of cross-DSO functions calls you'd be
> making over and over again: anything from libm.

Ciao,
Michael.
On 05/19/2015 06:06 PM, Rich Felker wrote:
> And are the above indirect calls/jumps (1983+43) candidates for
> scheduling/hoisting the address load (that's not being done yet), or
> are they the ones the compiler opted not to schedule/hoist? The win
> from relaxation seems small here, but as long as you're not going to
> block optimizations that would preclude relaxing, I don't see any
> disadvantages to doing it.

FWIW, I bootstrapped gcc with lto and -fpie -fno-plt:

  total calls      252436
  total indirect    21198  (8.4%)
  via got           10128  (4.0% / 48%)
  via reg            9007  (3.6% / 42%)
  via data           2063  (0.8% / 10%)

Those via data are things like

  callq *0x145fdc4(%rip) # 19c0ea8 <lang_hooks+0x1e8>
  callq *0x14517cc(%rip) # 19c0388 <targetm+0x328>

where we have a call to a hook at a known address.

Those via reg (or complex address) are also self explanatory -- we have
all sorts of hooks and indirection inside gcc, so this is unsurprising.
That said, the very first one I examined,

000000000056735e <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334>:
...
  56736f: mov 0x144f6f2(%rip),%r13 # 19b6a68 <_DYNAMIC+0x928>
...
  567380: sub $0x18,%r12
  567384: test %ebx,%ebx
  567386: js 567394 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x36>
  567388: mov 0x28(%rbp,%r12,1),%rdi
  56738d: dec %ebx
  56738f: callq *%r13
  567392: jmp 567380 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x22>
...

does in fact hoist the address of "free" out of the loop.

Those via got can be identified by comparing the address against
readelf -r to examine the dynamic relocations.

There are plenty of truly non-local calls, e.g. to libc. These
obviously cannot be relaxed.

Of those 10128 calls via the got, I found EXACTLY ONE that was local,

  to   _Z22const_0_to_255_operandP7rtx_def12machine_mode
  from _ZL19ix86_expand_builtinP9tree_nodeP7rtx_defS2_12machine_modei.lto_priv.2163

This is certain to be a bug, though I don't know where. There are
plenty of other calls to const_0_to_255_operand elsewhere, and they
are all, as expected, direct.

This will likely take significant detective work...


r~
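The percentages in the breakdown above can be rechecked with a few lines of integer arithmetic. The helper name `tenths_of_percent` is made up for this sketch; the first figure in each pair is the share of all calls and the second is the share of indirect calls.

```c
#include <assert.h>

/* Share of `part` in `whole`, rounded to the nearest tenth of a percent
   and returned as an integer count of tenths (84 means 8.4%). */
long tenths_of_percent(long part, long whole)
{
    assert(whole > 0 && part >= 0);
    return (part * 1000 + whole / 2) / whole;
}
```

For example, `tenths_of_percent(10128, 252436)` gives the "via got" share of all calls, and `tenths_of_percent(10128, 21198)` gives its share of the indirect calls. The three indirect categories also add up to the reported total: 10128 + 9007 + 2063 = 21198.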
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f29e053..b734350 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
   /* If we are generating position-independent code, we cannot sibcall
      optimize any indirect call, or a direct call to a global function,
      as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
+      && flag_plt
       && (decl && !targetm.binds_local_p (decl)))
     return false;
 
   /* If we need to align the outgoing stack, then sibcalling would
      unalign the stack, which may break the called function.  */
   if (ix86_minimum_incoming_stack_boundary (true)