[i386] Allow sibcalls in no-PLT PIC

Message ID 1430757479-14241-5-git-send-email-amonakov@ispras.ru
State New

Commit Message

Alexander Monakov May 4, 2015, 4:37 p.m. UTC
With -fno-plt, we don't have to reject even direct calls as sibcall
candidates.

This patch depends on the '-fplt' flag, which is introduced in another patch.

This patch requires that, with -fno-plt, all sibcall candidates go through
prepare_call_address, which transforms the call into a GOT lookup.

OK?
	* config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.

Comments

Alexander Monakov May 15, 2015, 4:27 p.m. UTC | #1
Ping?  Any comment about this patch?

On Mon, 4 May 2015, Alexander Monakov wrote:

> With -fno-plt, we don't have to reject even direct calls as sibcall
> candidates.
> 
> This patch depends on '-fplt' flag that is introduced in another patch.
> 
> This patch requires that with -fno-plt all sibcall candidates go through
> prepare_call_address that transforms the call to a GOT lookup.
> 
> OK?
> 	* config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> 
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index f29e053..b734350 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>    /* If we are generating position-independent code, we cannot sibcall
>       optimize any indirect call, or a direct call to a global function,
>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
>    if (!TARGET_MACHO
>        && !TARGET_64BIT
>        && flag_pic
> +      && flag_plt
>        && (decl && !targetm.binds_local_p (decl)))
>      return false;
>  
>    /* If we need to align the outgoing stack, then sibcalling would
>       unalign the stack, which may break the called function.  */
>    if (ix86_minimum_incoming_stack_boundary (true)
>
H.J. Lu May 15, 2015, 4:37 p.m. UTC | #2
I think it should be done via psABI change similar to

https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI

which I have implemented on users/hjl/relax branch in binutils.
Jan Hubicka May 15, 2015, 7:48 p.m. UTC | #3
OK, I am trying to understand how the relax branch works and what difference it makes.
As I understand it, the main purpose is to be able to emit a relaxed call of

   call function

that will, in 64-bit mode, result either in a RIP-relative call with an extra NOP just
before the instruction if FUNCTION binds within the DSO, or in an indirect call through
the GOT, bypassing the PLT.  This saves the overhead of the PLT and increases every such call
by an extra NOP for non-LTO builds, and even in LTO when the symbol is defined but
interposable.  This is actually a really nice trick.

Now this is about 32-bit mode, where an explicit GOT pointer register is needed
(how does this work with the large code model on x86-64?).  It is needed by the PLT, but I suppose
that implementing the same relaxation for 32-bit would need to use EBX to look up the
GOT pointer, too, so the check above would still be valid.

The patch makes sense to me given that we support -fno-plt now.

Honza
H.J. Lu May 15, 2015, 8:08 p.m. UTC | #4
With relax branch in 32-bit, there are 2 cases:

1. PIC or PIE:  We generate

set up EBX
relax call foo@PLT

It is almost the same as we do now, except for the relax prefix.
If foo is defined in another shared library or may be preempted,
linker will generate

call *foo@GOTPLT(%ebx)

If foo turns out local, linker will output

relax call foo

2. Non PIC/PIE: We generate

relax call foo

If foo is defined in a DSO,  linker will generate

call/jmp *foo@GOTPLT

We don't set up EBX in this case.  If foo turns out local, linker will output

relax call foo
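
The two cases above amount to a small decision table on the linker side. The sketch below restates that table in C as an illustration only: `relaxed_call` and its parameters are hypothetical, and the returned strings are the pseudo-assembly from this message, not actual linker output.

```c
#include <stdbool.h>

/* Linker output for a "relax call foo" site, following the two cases
   described above.  `pic` models PIC/PIE input; `local` models the
   linker finding that foo is defined locally and cannot be preempted. */
static const char *
relaxed_call (bool pic, bool local)
{
  if (local)
    return "relax call foo";             /* direct call in both cases */
  return pic ? "call *foo@GOTPLT(%ebx)"  /* case 1: EBX holds the GOT base */
             : "call *foo@GOTPLT";       /* case 2: no EBX setup */
}
```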
Rich Felker May 15, 2015, 8:23 p.m. UTC | #5
On Fri, May 15, 2015 at 01:08:15PM -0700, H.J. Lu wrote:
> With relax branch in 32-bit, there are 2 cases:
> 
> 1. PIC or PIE:  We generate
> 
> set up EBX
> relax call foo@PLT
> 
> It is almost the same as we do now, except for the relax prefix.
> If foo is defined in another shared library or may be preempted,
> linker will generate
> 
> call *foo@GOTPLT(%ebx)
> 
> If foo turns out local, linker will output
> 
> relax call foo

This does not address the initial and primary motivation for no-plt on
32-bit: eliminating the awful codegen constraint costs of the
GOT-register (ebx, and equivalent on other targets) ABI for calling
PLT entries. If instead you generated code that sets up an expression
for the GOT slot using arbitrary registers, and relaxed it to a direct
call (possibly rendering the register setup useless), it would be
comparable to the no-plt approach. So for example:

set up ecx (or whatever register)
relax call *foo@GOT(%ecx)

and relax to:

set up ecx (or whatever register; now useless)
relax call foo

But the no-plt approach is still superior in that the address load
from the GOT can be hoisted out of loops, etc., resulting in something
like:

call *%esi

This could be valuable in loops calling a math function repeatedly,
for example.

Overall I'm still not a fan of the relaxation approach. There are very
few places it would actually help that couldn't already be improved
better with use of visibility, and it can't give codegen as good as
the no-plt option.

Rich
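
The loop-hoisting point can be illustrated at the source level. In the sketch below the address of the callee is loaded once before the loop and every iteration makes an indirect call through it; this is roughly the shape a compiler could produce by itself once the GOT load is an ordinary, hoistable memory access. The function name `sum_abs` and the choice of `labs` as the callee are assumptions made for the example, and the code a given compiler actually emits will depend on its options and version.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sums |x[i]|, loading the address of labs() once before the loop. */
long
sum_abs (const long *x, size_t n)
{
  long (*const fn) (long) = labs;  /* single address load, hoisted by hand */
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += fn (x[i]);              /* indirect call through the saved address */
  return sum;
}
```

For example, `sum_abs` applied to the array `{ -1, 2, -3 }` with `n` equal to 3 returns 6.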
H.J. Lu May 15, 2015, 8:35 p.m. UTC | #6
With the no-plt option, the compiler has to know whether a function is external
or may be preempted.  If the compiler guesses wrong, the generated
DSO or executable will always go through an indirect branch even
though the target is local.  With relax branch, the decision is left
to the linker.  Of course, EBX must be used unless we add a new PLT
relocation for each register used to hold the GOT base, like

relax call foo@PLT_ECX
relax call foo@PLT_EDX
...
Rich Felker May 15, 2015, 8:42 p.m. UTC | #7
On Fri, May 15, 2015 at 01:35:14PM -0700, H.J. Lu wrote:
> With no-plt option, compiler has to know if a function is external
> or may be preempted.

I still don't see significant practical cases where the linker would
know this but the compiler can't. If you use visibility properly, the
compiler knows, and if you do LTO and -Bsymbolic[-functions], the
compiler should have that information available at LTO time (this is
an enhancement that needs to be made, though).

> If compiler guessed wrong, the generated
> DSO or executable will always go through indirect branch even
> though the target is local.

The only way this is avoided now is with -Bsymbolic[-functions] which
is not widely used. Otherwise interposition is always allowed for
default-visibility functions, so I don't see how the indirect branch
here is suboptimal.

> With relax branch, the decision is left
> to linker.  Of course, EBX must be used unless we add a new PLT
> relocation for each register used to hold GOT base, like
> 
> relax call foo@PLT_ECX
> relax call foo@PLT_EDX

No, that's not needed. If the linker doesn't make the relaxation, the
instruction the compiler generated remains in place, and has the
effective address expression using whichever register it wanted:

relax call *foo@GOT(%ecx)
relax call *foo@GOT(%edx)
etc.

If the linker chooses to relax it to a direct call, no register at all
is needed, so the linker can just throw this away and use:

call foo

for all of them.

Rich
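
The remark about using visibility properly can be made concrete with the GCC/Clang `visibility` attribute. This is a minimal sketch: `helper` and `api` are made-up names, and it assumes a compiler that supports the attribute.

```c
/* Hidden visibility: the symbol is not exported from the DSO, so it
   cannot be interposed and the compiler may bind calls to it locally. */
__attribute__ ((visibility ("hidden")))
int
helper (int x)
{
  return x + 1;
}

/* Default visibility: this remains part of the DSO's exported interface. */
int
api (int x)
{
  return helper (x) * 2;
}
```

Because `helper` is known at compile time to bind within the DSO, the call in `api` needs neither a PLT entry nor a GOT load.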
H.J. Lu May 15, 2015, 9:55 p.m. UTC | #8
On Fri, May 15, 2015 at 1:42 PM, Rich Felker <dalias@libc.org> wrote:
> I still don't see significant practical cases where the linker would
> know this but the compiler can't. If you use visibility properly, the
> compiler knows, and if you do LTO and -Bsymbolic[-functions], the
> compiler should have that information available at LTO time (this is
> an enhancement that needs to be made, though).

There is code like

extern void foo (void);

void
bar (void)
{
  foo ();
}

Even with LTO, compiler may have to assume foo is external
when foo is compiled with LTO.

>> If compiler guessed wrong, the generated
>> DSO or executable will always go through indirect branch even
>> though the target is local.
>
> The only way this is avoided now is with -Bsymbolic[-functions] which
> is not widely used. Otherwise interposition is always allowed for
> default-visibility functions, so I don't see how the indirect branch
> here is suboptimal.

Relax branch is meant to avoid indirect branches to local targets.  If
you don't think an indirect branch to a local target is a performance
issue, relax branch isn't for you.

>> With relax branch, the decision is left
>> to linker.  Of course, EBX must be used unless we add a new PLT
>> relocation for each register used to hold GOT base, like
>>
>> relax call foo@PLT_ECX
>> relax call foo@PLT_EDX
>
> No, that's not needed. If the linker doesn't make the relaxation, the
> instruction the compiler generated remains in place, and has the
> effective address expression using whichever register it wanted:
>
> relax call *foo@GOT(%ecx)
> relax call *foo@GOT(%edx)
> etc.

relax branch is only used for direct branch and it isn't for indirect
branch. I will implement

relax call foo@PLT(%reg)

The compiler can pick any registers to hold GOT base.  Lazy
binding is supported only when EBX is used.

> If the linker chooses to relax it to a direct call, no register at all
> is needed, so the linker can just throw this away and use:
>
> call foo
>
> for all of them.
>
> Rich
Jan Hubicka May 15, 2015, 11:08 p.m. UTC | #9
Hello,
> 
> There is code like
> 
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }
> 
> Even with LTO, compiler may have to assume foo is external
> when foo is compiled with LTO.

This is not exactly true if FOO is defined in another translation unit
compiled with LTO and hidden visibility.

OK, so as I understand it, we have the following cases:

 1) compiler knows it is generating a call to a local symbol in the current
    unit (binds_to_current_def_p returns true).

    We handle this correctly by doing an IP-relative call.

 2) compiler knows it is generating a call to a local symbol in the DSO
    (binds_local_p returns true).
    Currently I think this is only the -fno-pic case or the case of explicit
    hidden visibility, and in this case we do an IP-relative call.

    We may want to propose a plugin API update adding PREVAILING_DEF_EXP,
    so the compiler would be able to default to this case for PREVAILING_DEF
    and we would also catch cases where the symbol is defined in the current
    DSO as a weak symbol, but the definition is not LTO.
    This would also be a way to communicate -Bsymbolic[-functions] across
    the plugin API.

 3) compiler knows there is going to be a definition in the current DSO
    (by seeing a COMDAT function body or resolution info) that is interposable,
    but because the function is inline or -fno-semantic-interposition is in
    effect, the semantics will not change.

    In this case it would be nice to arrange an IP-relative call to a
    hidden alias.  This may require an extension on both the compiler and
    the linker side.

    I was thinking of doing so for comdats by adding a hidden alias with
    fixed mangling, like __gnu_<function>.hiddenalias, and referring to it.
    But I think it is not safe, as the linker may throw away the section that
    is produced by GCC and let a section that is not prevail, leading to an
    undefined symbol?

    I think this is a rather common case in C++ (I never made any stats) because
    uninlined comdats are quite common.

 4) compiler has no clue but linker may know better

    Here we traditionally always produce a PLT call.  In cases where the call
    is known to be hot in the program, it makes sense to trade lazy binding
    for performance and produce a call via a GOT reference (-fno-plt).
    I also see that H.J.'s branch helps us to actually avoid the GOT
    reference in cases where the symbol ends up binding locally. How does
    lazy binding work with relaxation?

    We may try to communicate down to the linker whether the symbol can
    or cannot semantically interpose, so it can do
    -Bsymbolic by default for inline and COMDAT functions.
    Actually, perhaps the linker can just default to this for all
    comdat-defined symbols?

    I think it still makes sense to work on non-LTO codegen improvements.
    As much as I would like everyone to use LTO and FDO, most people don't.

 5) Compiler knows it is generating a call to an external function.
    We do not special-case this, but we could add binds_external_p and
    make it determine this case from resolution info during LTO.

    I do not see that this case is any different from 4 from the PIC codegen
    perspective, except that perhaps the relax relocation will allow us to
    lazy bind?

Honza
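
The five cases above can be read as a classification over what the compiler knows about the callee. The sketch below is an illustration only, not GCC code: `struct callee_info`, `enum call_kind` and `classify_call` are hypothetical, the first two fields loosely mirror `binds_to_current_def_p` and `binds_local_p`, and `known_external` stands in for the `binds_external_p` predicate that the message merely proposes.

```c
#include <stdbool.h>

enum call_kind {
  CALL_CURRENT_UNIT = 1,    /* case 1: binds to the definition in this unit */
  CALL_LOCAL_DSO,           /* case 2: binds within the DSO */
  CALL_NONINTERPOSING_DEF,  /* case 3: interposable, but semantics are fixed */
  CALL_UNKNOWN,             /* case 4: only the linker may know better */
  CALL_EXTERNAL             /* case 5: known to be defined elsewhere */
};

/* Hypothetical summary of what the compiler knows about a callee. */
struct callee_info {
  bool binds_to_current_def;
  bool binds_local;
  bool noninterposing_def_in_dso;  /* inline/COMDAT body or -fno-semantic-interposition */
  bool known_external;
};

/* Picks the most specific case that applies; case 4 is the fallback. */
static enum call_kind
classify_call (const struct callee_info *c)
{
  if (c->binds_to_current_def)
    return CALL_CURRENT_UNIT;
  if (c->binds_local)
    return CALL_LOCAL_DSO;
  if (c->noninterposing_def_in_dso)
    return CALL_NONINTERPOSING_DEF;
  if (c->known_external)
    return CALL_EXTERNAL;
  return CALL_UNKNOWN;
}
```

Cases 1 and 2 get an IP-relative call today; the discussion in this thread concerns what to emit for the remaining three.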
H.J. Lu May 15, 2015, 11:14 p.m. UTC | #10
On Fri, May 15, 2015 at 4:08 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hello,
>>
>> There is code like
>>
>> extern void foo (void);
>>
>> void
>> bar (void)
>> {
>>   foo ();
>> }
>>
>> Even with LTO, compiler may have to assume foo is external
>> when foo is compiled with LTO.
>
> This is not exactly true if FOO is defined in other translation unit
> compiled with LTO and hidden visibility.

I meant to say "when foo is compiled without LTO".

>  4) compiler has no clue but linker may know better
>
>     Here we traditionally always produce a PLT call.  In cases the call
>     is known to be hot in the program it makes sense to trade lazy binding
>     for performance and produce call via GOT reference (-fno-plt).
>     I also see that H.J.'s branch helps us to actually avoid the GOT
>     reference in cases the symbol ends up binding locally. How does lazy
>     binding work with relaxation?

If there is no GOT slot allocated for symbol foo, the linker should resolve
foo@GOTPLT(%ebx) to its PLT slot address + 6, which is the push
instruction, to support lazy binding.  Otherwise, the linker should resolve it
to its GOT slot address.
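
The "+ 6" comes from the layout of an i386 PLT entry, whose first instruction is a 6-byte indirect jmp (opcode ff /4 with a 32-bit displacement) followed by the push. A minimal sketch of that arithmetic, where `lazy_slot_initial_value` and `I386_PLT_JMP_SIZE` are hypothetical names chosen for the example:

```c
#include <stdint.h>

/* Size of the leading indirect jmp in an i386 PLT entry. */
enum { I386_PLT_JMP_SIZE = 6 };

/* Address the reference resolves to for lazy binding: the push
   instruction that follows the jmp in the symbol's PLT entry. */
static uint32_t
lazy_slot_initial_value (uint32_t plt_entry_addr)
{
  return plt_entry_addr + I386_PLT_JMP_SIZE;
}
```

A first call therefore lands on the push and falls into the dynamic linker's resolver.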

>     We may try to communicate down the information whether the symbol can
>     or can not semantically interpose to the linker, so it can do
>     -Bsymbolic by default for inline and COMDAT functions.
>     Actually perhaps the linker can just default to this for all comdat
>     defined symbols?
>
>     I think it still make sense to work on non-LTO codegen improvements.
>     As much as I would like everyone to LTO and FDO, most people don't.
>
>  5) Compiler knows it is generating call to external function.
>     We do not special case this, but we could add binds_external_p and
>     make it to determine this case from resolution info during LTO.
>
>     I do not see if this case is any different from 4 from PIC codegen
>     perspective except that perhaps the relax relocation will allow us to lazy
>     bind?

My relax branch proposal works even without LTO.
H.J. Lu May 15, 2015, 11:30 p.m. UTC | #11
On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> My relax branch proposal works even without LTO.
>

I will borrow GOTPCREL from x86-64 and do

[hjl@gnu-6 relax-4]$ cat b.S
call *foo@GOTPCREL(%eax)
[hjl@gnu-6 relax-4]$ ./as -32 -o b.o b.S
[hjl@gnu-6 relax-4]$ ./objdump -dwr b.o

b.o:     file format elf32-i386


Disassembly of section .text:

00000000 <.text>:
   0: ff 90 fc ff ff ff     call   *-0x4(%eax) 2: R_386_RELAX_GOT32 foo
[hjl@gnu-6 relax-4]$

And linker can turn it into

relax call foo

if foo is defined locally.
H.J. Lu May 15, 2015, 11:34 p.m. UTC | #12
On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> My relax branch proposal works even without LTO.
>>
>
> I will borrow GOTPCREL from x86-64 and do
>
> [hjl@gnu-6 relax-4]$ cat b.S
> call *foo@GOTPCREL(%eax)

call *foo@GOTPLT(%eax)

is a better choice.

Rich Felker May 15, 2015, 11:44 p.m. UTC | #13
On Fri, May 15, 2015 at 04:14:07PM -0700, H.J. Lu wrote:
> If there is no GOT slot allocated for symbol foo, linker should resolve
> foo@GOTPLT(%ebx) to its PLT slot address + 6, which is the push
> instruction, to support  lazy binding.  Otherwise, linker should resolve it
> to its GOT slot address.

Forget lazy binding. It's dead anyway because serious distros want
PIE+relro+bindnow+... If people really want lazy binding, they can use
options which support it, but I don't want to keep suffering the
codegen cost of lazy binding despite never using it. There should be
an option to generate optimal code equivalent to what you get with
Alexander Monakov's patches for those of us who aren't trying to
support this legacy feature that precludes good performance and
precludes hardening.

Rich
Rich Felker May 15, 2015, 11:49 p.m. UTC | #14
On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >> My relax branch proposal works even without LTO.
> >>
> >
> > I will borrow GOTPCREL from x86-64 and do
> >
> > [hjl@gnu-6 relax-4]$ cat b.S
> > call *foo@GOTPCREL(%eax)
> 
> call *foo@GOTPLT(%eax)
> 
> is a better choice.

foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
reloc type would have to be added) since it saves a useless add.
Instead of:

	call __x86.get_pc_thunk.ax
	addl $_GLOBAL_OFFSET_TABLE_, %eax
	call *foo@GOTPLT(%eax)

you can just do:

	call __x86.get_pc_thunk.ax
	call *foo@GOTPCREL(%eax)

Note that it also works to have extra instructions between:

	call __x86.get_pc_thunk.ax
1:	...
	call *foo@GOTPCREL+(1b-.)(%eax)

I may not have gotten the syntax quite right, but hopefully you get
the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
accesses, including global data, to eliminate the useless add.
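
As a rough illustration of why the add is redundant, the address arithmetic can be modelled in plain C. This is only a sketch with made-up addresses and hypothetical function names; it models what the two instruction sequences compute and says nothing about how an actual ia32 GOTPCREL relocation would be encoded.

```c
#include <stdint.h>

/* Existing sequence: the thunk yields the label address, the addl turns
   it into the GOT base, and the slot is addressed relative to that base.  */
static uint32_t slot_via_got_base(uint32_t label, uint32_t got_base,
                                  uint32_t slot)
{
    uint32_t reg = label;            /* call __x86.get_pc_thunk.ax        */
    reg += got_base - label;         /* addl $_GLOBAL_OFFSET_TABLE_, %eax */
    return reg + (slot - got_base);  /* displacement from the GOT base    */
}

/* Proposed sequence: the displacement is taken from the label itself,
   so the register never has to be adjusted.  */
static uint32_t slot_via_label(uint32_t label, uint32_t slot)
{
    uint32_t reg = label;            /* call __x86.get_pc_thunk.ax        */
    return reg + (slot - label);     /* displacement from the saved PC    */
}
```

Both functions return the same slot address for any inputs (unsigned wrap-around included); the only difference is whether the intermediate add exists.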

Rich
H.J. Lu May 16, 2015, 2:19 p.m. UTC | #15
On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >> My relax branch proposal works even without LTO.
>> >>
>> >
>> > I will borrow GOTPCREL from x86-64 and do
>> >
>> > [hjl@gnu-6 relax-4]$ cat b.S
>> > call *foo@GOTPCREL(%eax)
>>
>> call *foo@GOTPLT(%eax)
>>
>> is a better choice.
>
> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
> reloc type would have to be added) since it saves a useless add.
> Instead of:
>
>         call __x86.get_pc_thunk.ax
>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>         call *foo@GOTPLT(%eax)
>
> you can just do:
>
>         call __x86.get_pc_thunk.ax
>         call *foo@GOTPCREL(%eax)
>
> Note that it also works to have extra instructions between:
>
>         call __x86.get_pc_thunk.ax
> 1:      ...
>         call *foo@GOTPCREL+(1b-.)(%eax)
>
> I may not have gotten the syntax quite right, but hopefully you get
> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
> accesses, including global data, to eliminate the useless add.
>

This is a good idea.  But I'd like to use something for both i386 and
x86-64.  I am proposing

call/jmp *foo@GOTPCRELAX+addend(%reg)

It is similar to @GOTPCREL, but with a new relax relocation.  Before
I can do that, I need to fix

https://sourceware.org/bugzilla/show_bug.cgi?id=18423

first.
H.J. Lu May 16, 2015, 6:59 p.m. UTC | #16
On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
>> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> >> My relax branch proposal works even without LTO.
>>> >>
>>> >
>>> > I will borrow GOTPCREL from x86-64 and do
>>> >
>>> > [hjl@gnu-6 relax-4]$ cat b.S
>>> > call *foo@GOTPCREL(%eax)
>>>
>>> call *foo@GOTPLT(%eax)
>>>
>>> is a better choice.
>>
>> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
>> reloc type would have to be added) since it saves a useless add.
>> Instead of:
>>
>>         call __x86.get_pc_thunk.ax
>>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>>         call *foo@GOTPLT(%eax)
>>
>> you can just do:
>>
>>         call __x86.get_pc_thunk.ax
>>         call *foo@GOTPCREL(%eax)
>>
>> Note that it also works to have extra instructions between:
>>
>>         call __x86.get_pc_thunk.ax
>> 1:      ...
>>         call *foo@GOTPCREL+(1b-.)(%eax)
>>
>> I may not have gotten the syntax quite right, but hopefully you get
>> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
>> accesses, including global data, to eliminate the useless add.
>>
>
> This is a good idea.  But I'd like to use something for both i386 and
> x86-64.  I am proposing
>
> call/jmp *foo@GOTPCRELAX+addend(%reg)
>
> It is similar to @GOTPCREL, but with a new relax relocation.  Before
> I can do that, I need to fix

It doesn't work.  REG must hold GOT base for other GOT relocations.
We need to keep

addl $_GLOBAL_OFFSET_TABLE_, %eax
Rich Felker May 16, 2015, 7:03 p.m. UTC | #17
On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote:
> On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
> >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
> >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >>> >> My relax branch proposal works even without LTO.
> >>> >>
> >>> >
> >>> > I will borrow GOTPCREL from x86-64 and do
> >>> >
> >>> > [hjl@gnu-6 relax-4]$ cat b.S
> >>> > call *foo@GOTPCREL(%eax)
> >>>
> >>> call *foo@GOTPLT(%eax)
> >>>
> >>> is a better choice.
> >>
> >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
> >> reloc type would have to be added) since it saves a useless add.
> >> Instead of:
> >>
> >>         call __x86.get_pc_thunk.ax
> >>         addl $_GLOBAL_OFFSET_TABLE_, %eax
> >>         call *foo@GOTPLT(%eax)
> >>
> >> you can just do:
> >>
> >>         call __x86.get_pc_thunk.ax
> >>         call *foo@GOTPCREL(%eax)
> >>
> >> Note that it also works to have extra instructions between:
> >>
> >>         call __x86.get_pc_thunk.ax
> >> 1:      ...
> >>         call *foo@GOTPCREL+(1b-.)(%eax)
> >>
> >> I may not have gotten the syntax quite right, but hopefully you get
> >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
> >> accesses, including global data, to eliminate the useless add.
> >>
> >
> > This is a good idea.  But I'd like to use something for both i386 and
> > x86-64.  I am proposing
> >
> > call/jmp *foo@GOTPCRELAX+addend(%reg)
> >
> > It is similar to @GOTPCREL, but with a new relax relocation.  Before
> > I can do that, I need to fix
> 
> It doesn't work.  REG must hold GOT base for other GOT relocations.
> We need to keep
> 
> addl $_GLOBAL_OFFSET_TABLE_, %eax

Like I just said, all foo@GOT(%gotreg) can be replaced with
foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
pointing to the referenced label (the point at which the program
counter was saved). This is a minor but useful optimization that can
be made for all GOT accesses, not just ones for [relaxable] function
calls.

Rich
H.J. Lu May 16, 2015, 8:33 p.m. UTC | #18
On Sat, May 16, 2015 at 12:03 PM, Rich Felker <dalias@libc.org> wrote:
> On Sat, May 16, 2015 at 11:59:56AM -0700, H.J. Lu wrote:
>> On Sat, May 16, 2015 at 7:19 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> > On Fri, May 15, 2015 at 4:49 PM, Rich Felker <dalias@libc.org> wrote:
>> >> On Fri, May 15, 2015 at 04:34:57PM -0700, H.J. Lu wrote:
>> >>> On Fri, May 15, 2015 at 4:30 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>> > On Fri, May 15, 2015 at 4:14 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >>> >> My relax branch proposal works even without LTO.
>> >>> >>
>> >>> >
>> >>> > I will borrow GOTPCREL from x86-64 and do
>> >>> >
>> >>> > [hjl@gnu-6 relax-4]$ cat b.S
>> >>> > call *foo@GOTPCREL(%eax)
>> >>>
>> >>> call *foo@GOTPLT(%eax)
>> >>>
>> >>> is a better choice.
>> >>
>> >> foo@GOTPCREL is preferable (but does not yet exist for ia32, so the
>> >> reloc type would have to be added) since it saves a useless add.
>> >> Instead of:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >>         addl $_GLOBAL_OFFSET_TABLE_, %eax
>> >>         call *foo@GOTPLT(%eax)
>> >>
>> >> you can just do:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >>         call *foo@GOTPCREL(%eax)
>> >>
>> >> Note that it also works to have extra instructions between:
>> >>
>> >>         call __x86.get_pc_thunk.ax
>> >> 1:      ...
>> >>         call *foo@GOTPCREL+(1b-.)(%eax)
>> >>
>> >> I may not have gotten the syntax quite right, but hopefully you get
>> >> the idea. This same approach (with GOTPCREL) can be used for _all_ GOT
>> >> accesses, including global data, to eliminate the useless add.
>> >>
>> >
>> > This is a good idea.  But I'd like to use something for both i386 and
>> > x86-64.  I am proposing
>> >
>> > call/jmp *foo@GOTPCRELAX+addend(%reg)
>> >
>> > It is similar to @GOTPCREL, but with a new relax relocation.  Before
>> > I can do that, I need to fix
>>
>> It doesn't work.  REG must hold GOT base for other GOT relocations.
>> We need to keep
>>
>> addl $_GLOBAL_OFFSET_TABLE_, %eax
>
> Like I just said, all foo@GOT(%gotreg) can be replaced with
> foo@GOTPCREL+[label-.](%labelreg) where %labelreg is a register
> pointing to the referenced label (the point at which the program
> counter was saved). This is a minor but useful optimization that can
> be made for all GOT accesses, not just ones for [relaxable] function
> calls.

There is also foo@GOTOFF(%reg).  Removing the addl is independent of
the relax branch.  I will leave it out.  The relax branch will support

call/jmp   *bar@GOTRELAX(%reg)

for both i386 and x86-64.
Alexander Monakov May 18, 2015, 6:24 p.m. UTC | #19
On Fri, 15 May 2015, Jan Hubicka wrote:
> > >> With -fno-plt, we don't have to reject even direct calls as sibcall
> > >> candidates.
> > >>
> > >> This patch depends on '-fplt' flag that is introduced in another patch.
> > >>
> > >> This patch requires that with -fno-plt all sibcall candidates go through
> > >> prepare_call_address that transforms the call to a GOT lookup.
> > >>
> > >> OK?
> > >>       * config/i386/i386.c (ix86_function_ok_for_sibcall): Check flag_plt.
> > >>
> > >> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> > >> index f29e053..b734350 100644
> > >> --- a/gcc/config/i386/i386.c
> > >> +++ b/gcc/config/i386/i386.c
> > >> @@ -5448,12 +5448,13 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
> > >>    /* If we are generating position-independent code, we cannot sibcall
> > >>       optimize any indirect call, or a direct call to a global function,
> > >>       as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
> > >>    if (!TARGET_MACHO
> > >>        && !TARGET_64BIT
> > >>        && flag_pic
> > >> +      && flag_plt
> > >>        && (decl && !targetm.binds_local_p (decl)))
> > >>      return false;
> > >>
> > >>    /* If we need to align the outgoing stack, then sibcalling would
> > >>       unalign the stack, which may break the called function.  */
> > >>    if (ix86_minimum_incoming_stack_boundary (true)
> > >>
> > 
> > I think it should be done via psABI change similar to
> > 
> > https://groups.google.com/forum/#!topic/x86-64-abi/n8GYMpqvBxI
> > 
> > which I have implemented on users/hjl/relax branch in binutils.
> 
> OK, I am trying to understand how relax branch works and what difference it makes.
> As I understand it, the main purpose is to be able to make relaxed call of
> 
>    call function
> 
> that will, in 64bit mode, either result to RIP relative call with extra NOP just
> before the instruction if FUNCTION binds within the DSO or to indirect call through
> GOT bypassing the PLT.  This saves overhead of PLT and increase every such call
> by extra NOP for no-LTO builds and even in LTO when the symbol is defined but
> interposable.  This is actually really nice trick.
> 
> Now this is about 32bit mode where explicit GOT pointer register is needed
> (how this work with large code model on x86-64?). It is needed by PLT, but I suppose
> to implement the same relaxation for 32bit it would need to use EBX to lookup the
> GOT pointer, too, so the check above would still be valid.
> 
> The patches make sense to me given that we support -fno-plt now.

After this message the discussion diverged in the direction of H.J.Lu's
proposed relaxation scheme involving new type of relocations.

I'm not clear if my patch is actually approved.  I'd like to point out that it
doesn't clash with H.J.Lu's work.  It improves codegen by allowing sibcalls in
more circumstances.
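
To make the affected case concrete, here is a minimal sketch of a sibcall candidate. The function names are invented for the example, and callee is defined locally only so the snippet links on its own; in the scenario the patch is about it would be a global function that does not bind locally, compiled for 32-bit x86 with -fPIC -fno-plt.

```c
/* Stand-in for a function that would normally live in another DSO.  */
int callee(int x)
{
    return x * 2;
}

/* The call is in tail position and the signatures match, so the
   compiler may turn it into a jump instead of call + ret.  With a PLT
   on i386 PIC this was rejected for non-local callees because the PLT
   needs %ebx live; without a PLT the jump can go through the GOT.  */
int caller(int x)
{
    return callee(x + 1);
}
```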

Alexander
Jan Hubicka May 18, 2015, 6:31 p.m. UTC | #20
> 
> After this message the discussion diverged in the direction of H.J.Lu's
> proposed relaxation scheme involving new type of relocations.
> 
> I'm not clear if my patch is actually approved.  I'd like to point out that it
> doesn't clash with H.J.Lu's work.  It improves codegen by allowing sibcalls in
> more circumstances.

Yes, the original patch is OK.

Honza
> 
> Alexander
Michael Matz May 19, 2015, 2:43 p.m. UTC | #21
Hi,

On Fri, 15 May 2015, Rich Felker wrote:

> Forget lazy binding. It's dead anyway because serious distros want
> PIE+relro+bindnow+...

You keep saying this, but I can't help the feeling it's mostly because 
musl doesn't support it ;-)

No, you don't have to use bindnow to get the effects of relro.  Sure 
there's more parts of the GOT protected with it, but if that's really that 
much more hardened is up for debate.

> If people really want lazy binding, they can use options which support 
> it, but I don't want to keep suffering the codegen cost of lazy binding 
> despite never using it.

> There should be an option to generate optimal code equivalent to what 
> you get with Alexander Monakov's patches for those of us who aren't 
> trying to support this legacy feature that precludes good performance 
> and precludes hardening.

H.J.'s branch is for _improving_ code on top of the no-plt code, it's not 
replacing it or an alternative for it.


Ciao,
Michael.
Jeff Law May 19, 2015, 3:02 p.m. UTC | #22
On 05/19/2015 08:43 AM, Michael Matz wrote:
> Hi,
>
> On Fri, 15 May 2015, Rich Felker wrote:
>
>> Forget lazy binding. It's dead anyway because serious distros want
>> PIE+relro+bindnow+...
>
> You keep saying this, but I can't help the feeling it's mostly because
> musl doesn't support it ;-)
FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
distribution.  It's not clear yet how far bindnow will go though.

jeff
Michael Matz May 19, 2015, 4:01 p.m. UTC | #23
Hi,

On Tue, 19 May 2015, Jeff Law wrote:

> > > Forget lazy binding. It's dead anyway because serious distros want 
> > > PIE+relro+bindnow+...
> > 
> > You keep saying this, but I can't help the feeling it's mostly because 
> > musl doesn't support it ;-)
> 
> FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
> distribution.

Yeah, us as well, though I don't necessarily see the point for most 
packages; feels a bit like a checkmark item :)


Ciao,
Michael.
Rich Felker May 19, 2015, 6:06 p.m. UTC | #24
On Tue, May 19, 2015 at 04:43:53PM +0200, Michael Matz wrote:
> Hi,
> 
> On Fri, 15 May 2015, Rich Felker wrote:
> 
> > Forget lazy binding. It's dead anyway because serious distros want
> > PIE+relro+bindnow+...
> 
> You keep saying this, but I can't help the feeling it's mostly because 
> musl doesn't support it ;-)

Well the reasons musl doesn't support it are partly the above, and
partly that it's been a continuous source of subtle bugs in glibc --
things like clobbering new vector registers, missing synchronization,
failures to be async-signal-safe, etc. So it's not that I think lazy
binding is bad because musl doesn't support it, but rather that musl
doesn't support lazy binding because I think it's bad. :-)

> No, you don't have to use bindnow to get the effects of relro.  Sure 
> there's more parts of the GOT protected with it, but if that's really that 
> much more hardened is up for debate.

Normally it's function addresses that you care about protecting --
they're the easy vector for arbitrary code execution -- and they're
unprotected without bindnow. Addresses of global data could also be an
attack vector, but a more difficult one to exploit.

> > If people really want lazy binding, they can use options which support 
> > it, but I don't want to keep suffering the codegen cost of lazy binding 
> > despite never using it.
> 
> > There should be an option to generate optimal code equivalent to what 
> > you get with Alexander Monakov's patches for those of us who aren't 
> > trying to support this legacy feature that precludes good performance 
> > and precludes hardening.
> 
> H.J.'s branch is for _improving_ code on top of the no-plt code, it's not 
> replacing it or an alternative for it.

Thanks for the clarification -- this was the part I was failing to
understand. I'm still mildly worried that concerns for supporting
relaxation might lead to decisions not to optimize code in ways that
would be difficult to relax (e.g. certain types of address load
reordering or hoisting) but I don't understand GCC internals
sufficiently to know if this concern is warranted or not. As long as
his work isn't interfering with the ability of -fno-plt to generate
optimal code, I agree it's both inappropriate and counter-productive
for me to be objecting to part or all of it.

I would still like to see the @GOTPCREL stuff added and used instead
of @GOT, as I mentioned earlier in the thread, but I agree that's
independent of relaxation support and shouldn't block it.

Rich
Richard Henderson May 19, 2015, 6:59 p.m. UTC | #25
On 05/19/2015 11:06 AM, Rich Felker wrote:
> I'm still mildly worried that concerns for supporting
> relaxation might lead to decisions not to optimize code in ways that
> would be difficult to relax (e.g. certain types of address load
> reordering or hoisting) but I don't understand GCC internals
> sufficiently to know if this concern is warranted or not.

It is.  The relaxation that HJ is working on requires that the reads from the
got not be hoisted.  I'm not especially convinced that what he's working on is
a win.

With LTO, the compiler can do the same job that he's attempting in the linker,
without an extra nop.  Without LTO, leaving it to the linker means that you
can't hoist the load and hide the memory latency.

> I would still like to see the @GOTPCREL stuff added and used instead
> of @GOT, as I mentioned earlier in the thread, but I agree that's
> independent of relaxation support and shouldn't block it.

I don't think that @GOTPCREL for 32-bit is a good idea.  This is the scheme
that Darwin uses, so we do have some experience with it.

In order for it to work you've got to have a pointer to a random address in the
function.  It means that you can only "easily" compute the address once.  If
you need the value again you wind up with the same "extra" addl insn that we
have with the current GOT pointer.

We've just started to do inter-function register allocation.  The next step
along those lines is to share the computation of GOT between multiple
functions.  At which point it really helps to have one global base address to
talk about.


r~
H.J. Lu May 19, 2015, 7:06 p.m. UTC | #26
On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> I'm still mildly worried that concerns for supporting
>> relaxation might lead to decisions not to optimize code in ways that
>> would be difficult to relax (e.g. certain types of address load
>> reordering or hoisting) but I don't understand GCC internals
>> sufficiently to know if this concern is warranted or not.
>
> It is.  The relaxation that HJ is working on requires that the reads from the
> got not be hoisted.  I'm not especially convinced that what he's working on is
> a win.
>
> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop.  Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.
>

My relax approach won't take away any optimization done by compiler.
It simply turns indirect branch into direct branch with a nop prefix at
link-time.  I am having a hard time to understand why we shouldn't do it.
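
A minimal sketch of what such a link-time rewrite could look like for the x86-64 call form, assuming the 6-byte `call *disp32(%rip)` encoding (FF 15 disp32) is replaced in place by a one-byte nop followed by `call rel32` (90 E8 rel32). The helper name and the exact choice of padding byte are assumptions made for this example, not what the relax branch necessarily emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rewrite the 6 bytes at insn, which sit at virtual
   address insn_addr, from `call *disp32(%rip)` into `nop; call rel32`
   targeting target_addr.  Returns 0 on success, -1 if the bytes are not
   the expected opcode or the target is out of rel32 range.  */
static int relax_indirect_call(uint8_t *insn, uint64_t insn_addr,
                               uint64_t target_addr)
{
    assert(insn != NULL);

    if (insn[0] != 0xFF || insn[1] != 0x15)
        return -1;

    /* rel32 is measured from the end of the instruction, which is the
       end of the original 6-byte sequence.  */
    int64_t delta = (int64_t)(target_addr - (insn_addr + 6));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return -1;

    uint32_t rel = (uint32_t)(int32_t)delta;
    insn[0] = 0x90;                    /* nop                */
    insn[1] = 0xE8;                    /* call rel32         */
    insn[2] = (uint8_t)(rel & 0xFF);   /* little-endian rel32 */
    insn[3] = (uint8_t)((rel >> 8) & 0xFF);
    insn[4] = (uint8_t)((rel >> 16) & 0xFF);
    insn[5] = (uint8_t)((rel >> 24) & 0xFF);
    return 0;
}
```

The instruction keeps its length, so nothing else in the section moves; that is what makes the rewrite cheap for the linker.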
Rich Felker May 19, 2015, 7:10 p.m. UTC | #27
On Tue, May 19, 2015 at 06:01:07PM +0200, Michael Matz wrote:
> Hi,
> 
> On Tue, 19 May 2015, Jeff Law wrote:
> 
> > > > Forget lazy binding. It's dead anyway because serious distros want 
> > > > PIE+relro+bindnow+...
> > > 
> > > You keep saying this, but I can't help the feeling it's mostly because 
> > > musl doesn't support it ;-)
> > 
> > FWIW, Red Hat is pushing PIE & partial RELRO deeper and deeper into the 
> > distribution.
> 
> Yeah, us as well, though I don't necessarily see the point for most 
> packages; feels a bit like a checkmark item :)

These days it's fairly rare to have software which does not interact
at all with untrusted data. Consider how much user-facing application
software that was not previously considered security-critical is
making network connections using complex protocols (e.g. anything with
TLS, IM protocols, ...), opening image files from random sources
(attachments, files that happen to be in a browsed-to directory, on
USB sticks, etc.), and so on. I think it's smart to be hardening
everything, at least for distros providing all sorts of random
unvetted software.

Rich
Richard Henderson May 19, 2015, 7:11 p.m. UTC | #28
On 05/19/2015 12:06 PM, H.J. Lu wrote:
> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>>> I'm still mildly worried that concerns for supporting
>>> relaxation might lead to decisions not to optimize code in ways that
>>> would be difficult to relax (e.g. certain types of address load
>>> reordering or hoisting) but I don't understand GCC internals
>>> sufficiently to know if this concern is warranted or not.
>>
>> It is.  The relaxation that HJ is working on requires that the reads from the
>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> can't hoist the load and hide the memory latency.
>>
> 
> My relax approach won't take away any optimization done by compiler.
> It simply turns indirect branch into direct branch with a nop prefix at
> link-time.  I am having a hard time to understand why we shouldn't do it.

I well understand what you're doing.

But my point is that the only time the compiler should present you with the
form of indirect branch you're looking for is when there's no place to hoist
the load.

At which point, is it really worth adding a new relocation to the ABI?  Is it
really worth adding new code to the linker that won't be exercised often?


r~
H.J. Lu May 19, 2015, 7:17 p.m. UTC | #29
On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>>>> I'm still mildly worried that concerns for supporting
>>>> relaxation might lead to decisions not to optimize code in ways that
>>>> would be difficult to relax (e.g. certain types of address load
>>>> reordering or hoisting) but I don't understand GCC internals
>>>> sufficiently to know if this concern is warranted or not.
>>>
>>> It is.  The relaxation that HJ is working on requires that the reads from the
>>> got not be hoisted.  I'm not especially convinced that what he's working on is
>>> a win.
>>>
>>> With LTO, the compiler can do the same job that he's attempting in the linker,
>>> without an extra nop.  Without LTO, leaving it to the linker means that you
>>> can't hoist the load and hide the memory latency.
>>>
>>
>> My relax approach won't take away any optimization done by compiler.
>> It simply turns indirect branch into direct branch with a nop prefix at
>> link-time.  I am having a hard time to understand why we shouldn't do it.
>
> I well understand what you're doing.
>
> But my point is that the only time the compiler should present you with the
> form of indirect branch you're looking for is when there's no place to hoist
> the load.
>
> At which point, is it really worth adding a new relocation to the ABI?  Is it
> really worth adding new code to the linker that won't be exercised often?

I believe there are plenty of indirect branches via GOT when compiling
PIE/PIC with -fno-plt:

[hjl@gnu-6 gcc]$ cat /tmp/x.c
extern void foo (void);

void
bar (void)
{
  foo ();
}
[hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
[hjl@gnu-6 gcc]$ cat x.s
.file "x.c"
.section .text.unlikely,"ax",@progbits
.LCOLDB0:
.text
.LHOTB0:
.p2align 4,,15
.globl bar
.type bar, @function
bar:
.LFB0:
.cfi_startproc
jmp *foo@GOTPCREL(%rip)
.cfi_endproc
.LFE0:
.size bar, .-bar
Rich Felker May 19, 2015, 7:35 p.m. UTC | #30
On Tue, May 19, 2015 at 11:59:00AM -0700, Richard Henderson wrote:
> On 05/19/2015 11:06 AM, Rich Felker wrote:
> > I'm still mildly worried that concerns for supporting
> > relaxation might lead to decisions not to optimize code in ways that
> > would be difficult to relax (e.g. certain types of address load
> > reordering or hoisting) but I don't understand GCC internals
> > sufficiently to know if this concern is warranted or not.
> 
> It is.  The relaxation that HJ is working on requires that the reads from the
> got not be hoisted.  I'm not especially convinced that what he's working on is
> a win.

Well as long as -fno-plt actually generates a load from the GOT like
what would be done for data access, and does not go out of its way to
produce something compatible with relaxation, my hope is that it would
not be affected by the pessimization. I'm not sure if that's the case
though.

> With LTO, the compiler can do the same job that he's attempting in the linker,
> without an extra nop.  Without LTO, leaving it to the linker means that you
> can't hoist the load and hide the memory latency.

Yes, this is my feeling too. Alexander Monakov and I have been discussing it
on #musl a bit and I think the conclusion we reached is that
relaxation is possibly a significant real-world win for non-PIC main
executables, where it's very likely that addresses will be resolved at
ld-time and for the programmer not to specifically annotate this with
protected visibility. In such a case, you get either a direct call or
a direct address load and indirect call, rather than hitting an extra
cache line in the PLT thunk to do the address load and indirect call.
Note that, being non-PIC, there is no GOT register involved here.

> > I would still like to see the @GOTPCREL stuff added and used instead
> > of @GOT, as I mentioned earlier in the thread, but I agree that's
> > independent of relaxation support and shouldn't block it.
> 
> I don't think that @GOTPCREL for 32-bit is a good idea.  This is the scheme
> that Darwin uses, so we do have some experience with it.
> 
> In order for it to work you've got to have a pointer to a random address in the
> function.  It means that you can only "easily" compute the address once.  If
> you need the value again you wind up with the same "extra" addl insn that we
> have with the current GOT pointer.

Why would you recompute it (this requires a fairly expensive call that
reads or pops its own return address) rather than simply spilling the
already-computed value and reloading it from the stack?

The only example I can think of where it might make sense is when you
don't want to load the address unconditionally because there are
shrink-wrappable code paths that don't need it, but multiple code paths
that do, in which case they would each load different values. Is this
the concern you have in mind?

> We've just started to do inter-function register allocation.  The next step
> along those lines is to share the computation of GOT between multiple
> functions.  At which point it really helps to have one global base address to
> talk about.

I see -- that would be another case where it simplifies things.

Rich
Richard Henderson May 19, 2015, 7:47 p.m. UTC | #31
On 05/19/2015 12:17 PM, H.J. Lu wrote:
>> But my point is that the only time the compiler should present you with the
>> form of indirect branch you're looking for is when there's no place to hoist
>> the load.
>>
>> At which point, is it really worth adding a new relocation to the ABI?  Is it
>> really worth adding new code to the linker that won't be exercised often?
> 
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
> 
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }

Sure, as I said, when there's no place to hoist the load.

Try anything more complicated,

void bar (void)
{
  int i;
  for (i = 0; i < 10; ++i)
    foo ();
}

void baz (void)
{
  foo ();
  foo ();
}

and you'll not see the call *foo@GOTPCREL(%rip) form.
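
Written out by hand at the source level, the hoisting amounts to roughly the following. This is only a sketch: foo is defined locally (with a call counter) so the snippet is self-contained, and the function names other than foo are invented here. With an external foo under -fPIC -fno-plt, the local pointer corresponds to a single load from the GOT.

```c
static int calls;

/* Stand-in for the external function from the examples above.  */
void foo(void)
{
    ++calls;
}

/* What the source says: ten calls to foo.  */
void bar_direct(void)
{
    int i;
    for (i = 0; i < 10; ++i)
        foo();
}

/* What the compiler is free to do: read the address once, before the
   loop, and make ten indirect calls through a register.  */
void bar_hoisted(void)
{
    void (*fp)(void) = foo;
    int i;
    for (i = 0; i < 10; ++i)
        fp();
}
```

Once the address lives in a register, the call is no longer of the memory-operand form that a linker could rewrite.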

Of course there's also plenty of times where combine recreates exactly that
form when perhaps the scheduler might have preferred otherwise.  Those are
optimization choices to be addressed under separate cover.

My point that we can already do what you want via LTO, without adding new
relocations, is still relevant.


r~
Richard Henderson May 19, 2015, 7:54 p.m. UTC | #32
On 05/19/2015 12:35 PM, Rich Felker wrote:
> Why would you recompute it (this requires a fairly expensive call that
> reads or pops its own return address) rather than simply spilling the
> already-computed value and reloading it from the stack?
> 
> The only example I can think of where it might make sense is when you
> don't want to load the address unconditionally because there are
> shrink-wrappable code paths that don't need it, but multiple code paths
> that do, in which case they would each load different values. Is this
> the concern you have in mind?

That too.  I was thinking of exception landing pads, i.e. catches and cleanups,
where in the past we've had to re-compute the GOT address.  Though now that I
think on that more, it wasn't x86 that had that particular landing pad trouble.


r~
Rich Felker May 19, 2015, 8:15 p.m. UTC | #33
On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >>>> I'm still mildly worried that concerns for supporting
> >>>> relaxation might lead to decisions not to optimize code in ways that
> >>>> would be difficult to relax (e.g. certain types of address load
> >>>> reordering or hoisting) but I don't understand GCC internals
> >>>> sufficiently to know if this concern is warranted or not.
> >>>
> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >>> a win.
> >>>
> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >>> can't hoist the load and hide the memory latency.
> >>>
> >>
> >> My relax approach won't take away any optimization done by compiler.
> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >
> > I well understand what you're doing.
> >
> > But my point is that the only time the compiler should present you with the
> > form of indirect branch you're looking for is when there's no place to hoist
> > the load.
> >
> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> > really worth adding new code to the linker that won't be exercised often?
> 
> I believe there are plenty of indirect branches via GOT when compiling
> PIE/PIC with -fno-plt:
> 
> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> extern void foo (void);
> 
> void
> bar (void)
> {
>   foo ();
> }
> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> [hjl@gnu-6 gcc]$ cat x.s
> .file "x.c"
> .section .text.unlikely,"ax",@progbits
> .LCOLDB0:
> .text
> .LHOTB0:
> .p2align 4,,15
> .globl bar
> .type bar, @function
> bar:
> .LFB0:
> .cfi_startproc
> jmp *foo@GOTPCREL(%rip)
> .cfi_endproc
> .LFE0:
> .size bar, .-bar

I agree these exist. What I question is whether the savings from the
linker being able to relax this to a direct call in the case where the
programmer failed to let the compiler make it a direct call to begin
with (by using hidden or protected visibility) are worth the cost of
not being able to hoist the load out of loops or schedule it earlier
in cases where relaxation is not possible because the call target is
not defined in the same DSO.
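
To make the visibility point concrete, here is a minimal sketch, assuming GCC's visibility attribute; the function names are made up for illustration and do not come from binutils or GCC. Compiled with something like "gcc -fPIC -fno-plt -O2 -S", the call to the hidden function can typically be emitted as a direct call, while the call to the default-visibility one would typically go through the GOT because it may be interposed.

```c
/* Hypothetical names for illustration only. */

/* Hidden visibility: the symbol cannot be preempted by another DSO,
   so the compiler is free to emit a direct call even under -fPIC. */
__attribute__((visibility("hidden")))
int helper_hidden(int x)
{
    return x + 1;
}

/* Default visibility: in a shared object this definition may be
   interposed at load time, so the compiler has to assume the call
   target is not known until then. */
int helper_default(int x)
{
    return x + 1;
}

/* Calls one function of each kind; compare the two call sites in the
   generated assembly. */
int caller(int x)
{
    return helper_hidden(x) + helper_default(x);
}
```

The exact instructions depend on the compiler version and flags, so inspect the assembly rather than relying on this description.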

Rich
H.J. Lu May 19, 2015, 8:27 p.m. UTC | #34
On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
>> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> >>>> I'm still mildly worried that concerns for supporting
>> >>>> relaxation might lead to decisions not to optimize code in ways that
>> >>>> would be difficult to relax (e.g. certain types of address load
>> >>>> reordering or hoisting) but I don't understand GCC internals
>> >>>> sufficiently to know if this concern is warranted or not.
>> >>>
>> >>> It is.  The relaxation that HJ is working on requires that the reads from the
>> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> >>> a win.
>> >>>
>> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> >>> can't hoist the load and hide the memory latency.
>> >>>
>> >>
>> >> My relax approach won't take away any optimization done by compiler.
>> >> It simply turns indirect branch into direct branch with a nop prefix at
>> >> link-time.  I am having a hard time to understand why we shouldn't do it.
>> >
>> > I well understand what you're doing.
>> >
>> > But my point is that the only time the compiler should present you with the
>> > form of indirect branch you're looking for is when there's no place to hoist
>> > the load.
>> >
>> > At which point, is it really worth adding a new relocation to the ABI?  Is it
>> > really worth adding new code to the linker that won't be exercised often?
>>
>> I believe there are plenty of indirect branches via GOT when compiling
>> PIE/PIC with -fno-plt:
>>
>> [hjl@gnu-6 gcc]$ cat /tmp/x.c
>> extern void foo (void);
>>
>> void
>> bar (void)
>> {
>>   foo ();
>> }
>> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
>> [hjl@gnu-6 gcc]$ cat x.s
>> .file "x.c"
>> .section .text.unlikely,"ax",@progbits
>> .LCOLDB0:
>> .text
>> .LHOTB0:
>> .p2align 4,,15
>> .globl bar
>> .type bar, @function
>> bar:
>> .LFB0:
>> .cfi_startproc
>> jmp *foo@GOTPCREL(%rip)
>> .cfi_endproc
>> .LFE0:
>> .size bar, .-bar
>
> I agree these exist. What I question is whether the savings from the
> linker being able to relax this to a direct call in the case where the
> programmer failed to let the compiler make it a direct call to begin
> with (by using hidden or protected visibility) are worth the cost of
> not being able to hoist the load out of loops or schedule it earlier
> in cases where relaxation is not possible because the call target is
> not defined in the same DSO.

Just for fun.  I compiled binutils as PIE with -fno-plt -flto:

[hjl@gnu-mic-2 gas]$ file as-new
as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
stripped
[hjl@gnu-mic-2 gas]$

There are 43:

ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>

and 1983

ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>
Rich Felker May 19, 2015, 8:54 p.m. UTC | #35
On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >> >>>> I'm still mildly worried that concerns for supporting
> >> >>>> relaxation might lead to decisions not to optimize code in ways that
> >> >>>> would be difficult to relax (e.g. certain types of address load
> >> >>>> reordering or hoisting) but I don't understand GCC internals
> >> >>>> sufficiently to know if this concern is warranted or not.
> >> >>>
> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >> >>> a win.
> >> >>>
> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >> >>> can't hoist the load and hide the memory latency.
> >> >>>
> >> >>
> >> >> My relax approach won't take away any optimization done by compiler.
> >> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >> >
> >> > I well understand what you're doing.
> >> >
> >> > But my point is that the only time the compiler should present you with the
> >> > form of indirect branch you're looking for is when there's no place to hoist
> >> > the load.
> >> >
> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> >> > really worth adding new code to the linker that won't be exercised often?
> >>
> >> I believe there are plenty of indirect branches via GOT when compiling
> >> PIE/PIC with -fno-plt:
> >>
> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> >> extern void foo (void);
> >>
> >> void
> >> bar (void)
> >> {
> >>   foo ();
> >> }
> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> >> [hjl@gnu-6 gcc]$ cat x.s
> >> .file "x.c"
> >> .section .text.unlikely,"ax",@progbits
> >> .LCOLDB0:
> >> .text
> >> .LHOTB0:
> >> .p2align 4,,15
> >> .globl bar
> >> .type bar, @function
> >> bar:
> >> .LFB0:
> >> .cfi_startproc
> >> jmp *foo@GOTPCREL(%rip)
> >> .cfi_endproc
> >> .LFE0:
> >> .size bar, .-bar
> >
> > I agree these exist. What I question is whether the savings from the
> > linker being able to relax this to a direct call in the case where the
> > programmer failed to let the compiler make it a direct call to begin
> > with (by using hidden or protected visibility) are worth the cost of
> > not being able to hoist the load out of loops or schedule it earlier
> > in cases where relaxation is not possible because the call target is
> > not defined in the same DSO.
> 
> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
> 
> [hjl@gnu-mic-2 gas]$ file as-new
> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
> stripped
> [hjl@gnu-mic-2 gas]$
> 
> There are 43:
> 
> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
> 
> and 1983
> 
> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>

How many of those would be relaxed? I suspect it depends a lot on
whether libbfd is static or shared.

Rich
H.J. Lu May 20, 2015, 12:10 a.m. UTC | #36
On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
>> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
>> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
>> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
>> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
>> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
>> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
>> >> >>>> I'm still mildly worried that concerns for supporting
>> >> >>>> relaxation might lead to decisions not to optimize code in ways that
>> >> >>>> would be difficult to relax (e.g. certain types of address load
>> >> >>>> reordering or hoisting) but I don't understand GCC internals
>> >> >>>> sufficiently to know if this concern is warranted or not.
>> >> >>>
>> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
>> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
>> >> >>> a win.
>> >> >>>
>> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
>> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
>> >> >>> can't hoist the load and hide the memory latency.
>> >> >>>
>> >> >>
>> >> >> My relax approach won't take away any optimization done by compiler.
>> >> >> It simply turns indirect branch into direct branch with a nop prefix at
>> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
>> >> >
>> >> > I well understand what you're doing.
>> >> >
>> >> > But my point is that the only time the compiler should present you with the
>> >> > form of indirect branch you're looking for is when there's no place to hoist
>> >> > the load.
>> >> >
>> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
>> >> > really worth adding new code to the linker that won't be exercised often?
>> >>
>> >> I believe there are plenty of indirect branches via GOT when compiling
>> >> PIE/PIC with -fno-plt:
>> >>
>> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
>> >> extern void foo (void);
>> >>
>> >> void
>> >> bar (void)
>> >> {
>> >>   foo ();
>> >> }
>> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
>> >> [hjl@gnu-6 gcc]$ cat x.s
>> >> .file "x.c"
>> >> .section .text.unlikely,"ax",@progbits
>> >> .LCOLDB0:
>> >> .text
>> >> .LHOTB0:
>> >> .p2align 4,,15
>> >> .globl bar
>> >> .type bar, @function
>> >> bar:
>> >> .LFB0:
>> >> .cfi_startproc
>> >> jmp *foo@GOTPCREL(%rip)
>> >> .cfi_endproc
>> >> .LFE0:
>> >> .size bar, .-bar
>> >
>> > I agree these exist. What I question is whether the savings from the
>> > linker being able to relax this to a direct call in the case where the
>> > programmer failed to let the compiler make it a direct call to begin
>> > with (by using hidden or protected visibility) are worth the cost of
>> > not being able to hoist the load out of loops or schedule it earlier
>> > in cases where relaxation is not possible because the call target is
>> > not defined in the same DSO.
>>
>> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
>>
>> [hjl@gnu-mic-2 gas]$ file as-new
>> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
>> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
>> stripped
>> [hjl@gnu-mic-2 gas]$
>>
>> There are 43:
>>
>> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
>>
>> and 1983
>>
>> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>
>
> How many of those would be relaxed? I suspect it depends a lot on
> whether libbfd is static or shared.

When shared libraries are enabled, there are 177 indirect branches
to locally defined functions.  A call to any locally defined function
that isn't compiled with LTO is indirect.
Rich Felker May 20, 2015, 1:06 a.m. UTC | #37
On Tue, May 19, 2015 at 05:10:11PM -0700, H.J. Lu wrote:
> On Tue, May 19, 2015 at 1:54 PM, Rich Felker <dalias@libc.org> wrote:
> > On Tue, May 19, 2015 at 01:27:06PM -0700, H.J. Lu wrote:
> >> On Tue, May 19, 2015 at 1:15 PM, Rich Felker <dalias@libc.org> wrote:
> >> > On Tue, May 19, 2015 at 12:17:18PM -0700, H.J. Lu wrote:
> >> >> On Tue, May 19, 2015 at 12:11 PM, Richard Henderson <rth@redhat.com> wrote:
> >> >> > On 05/19/2015 12:06 PM, H.J. Lu wrote:
> >> >> >> On Tue, May 19, 2015 at 11:59 AM, Richard Henderson <rth@redhat.com> wrote:
> >> >> >>> On 05/19/2015 11:06 AM, Rich Felker wrote:
> >> >> >>>> I'm still mildly worried that concerns for supporting
> >> >> >>>> relaxation might lead to decisions not to optimize code in ways that
> >> >> >>>> would be difficult to relax (e.g. certain types of address load
> >> >> >>>> reordering or hoisting) but I don't understand GCC internals
> >> >> >>>> sufficiently to know if this concern is warranted or not.
> >> >> >>>
> >> >> >>> It is.  The relaxation that HJ is working on requires that the reads from the
> >> >> >>> got not be hoisted.  I'm not especially convinced that what he's working on is
> >> >> >>> a win.
> >> >> >>>
> >> >> >>> With LTO, the compiler can do the same job that he's attempting in the linker,
> >> >> >>> without an extra nop.  Without LTO, leaving it to the linker means that you
> >> >> >>> can't hoist the load and hide the memory latency.
> >> >> >>>
> >> >> >>
> >> >> >> My relax approach won't take away any optimization done by compiler.
> >> >> >> It simply turns indirect branch into direct branch with a nop prefix at
> >> >> >> link-time.  I am having a hard time to understand why we shouldn't do it.
> >> >> >
> >> >> > I well understand what you're doing.
> >> >> >
> >> >> > But my point is that the only time the compiler should present you with the
> >> >> > form of indirect branch you're looking for is when there's no place to hoist
> >> >> > the load.
> >> >> >
> >> >> > At which point, is it really worth adding a new relocation to the ABI?  Is it
> >> >> > really worth adding new code to the linker that won't be exercised often?
> >> >>
> >> >> I believe there are plenty of indirect branches via GOT when compiling
> >> >> PIE/PIC with -fno-plt:
> >> >>
> >> >> [hjl@gnu-6 gcc]$ cat /tmp/x.c
> >> >> extern void foo (void);
> >> >>
> >> >> void
> >> >> bar (void)
> >> >> {
> >> >>   foo ();
> >> >> }
> >> >> [hjl@gnu-6 gcc]$ ./xgcc -B./ -fPIC -O3 -S /tmp/x.c -fno-plt
> >> >> [hjl@gnu-6 gcc]$ cat x.s
> >> >> .file "x.c"
> >> >> .section .text.unlikely,"ax",@progbits
> >> >> .LCOLDB0:
> >> >> .text
> >> >> .LHOTB0:
> >> >> .p2align 4,,15
> >> >> .globl bar
> >> >> .type bar, @function
> >> >> bar:
> >> >> .LFB0:
> >> >> .cfi_startproc
> >> >> jmp *foo@GOTPCREL(%rip)
> >> >> .cfi_endproc
> >> >> .LFE0:
> >> >> .size bar, .-bar
> >> >
> >> > I agree these exist. What I question is whether the savings from the
> >> > linker being able to relax this to a direct call in the case where the
> >> > programmer failed to let the compiler make it a direct call to begin
> >> > with (by using hidden or protected visibility) are worth the cost of
> >> > not being able to hoist the load out of loops or schedule it earlier
> >> > in cases where relaxation is not possible because the call target is
> >> > not defined in the same DSO.
> >>
> >> Just for fun.  I compiled binutils as PIE with -fno-plt -flto:
> >>
> >> [hjl@gnu-mic-2 gas]$ file as-new
> >> as-new: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV),
> >> dynamically linked (uses shared libs), for GNU/Linux 2.6.32, not
> >> stripped
> >> [hjl@gnu-mic-2 gas]$
> >>
> >> There are 43:
> >>
> >> ff 25 21 93 2d 00     jmpq   *0x2d9321(%rip)        # 3d5f58 <_DYNAMIC+0x1e8>
> >>
> >> and 1983
> >>
> >> ff 15 eb f4 38 00     callq  *0x38f4eb(%rip)        # 3d60e0 <_DYNAMIC+0x370>
> >
> > How many of those would be relaxed? I suspect it depends a lot on
> > whether libbfd is static or shared.
> 
> When shared libraries are enabled, there are 177 indirect branches
> to locally defined functions.  A call to any locally defined function
> that isn't compiled with LTO is indirect.

And are the above indirect calls/jumps (1983+43) candidates for
scheduling/hoisting the address load (that's not being done yet), or
are they the ones the compiler opted not to schedule/hoist? The win
from relaxation seems small here, but as long as you're not going to
block optimizations that would preclude relaxing, I don't see any
disadvantages to doing it.

Rich
Michael Matz May 20, 2015, 12:10 p.m. UTC | #38
Hi,

On Tue, 19 May 2015, Richard Henderson wrote:

> It is.  The relaxation that HJ is working on requires that the reads 
> from the got not be hoisted.  I'm not especially convinced that what 
> he's working on is a win.
> 
> With LTO, the compiler can do the same job that he's attempting in the 
> linker, without an extra nop.  Without LTO, leaving it to the linker 
> means that you can't hoist the load and hide the memory latency.

Well, hoisting always needs a register, and if hoisted out of a loop 
(which you all seem to be after) that register is live through the whole 
loop body.  You need a register for each different called function in such 
loop, trading the one GOT pointer with N other registers.  For 
register-starved machines this is a real problem, even x86-64 doesn't have 
that many.  I.e. I'm not convinced that this hoisting will really be much 
of a win that often, outside toy examples.  Sure, the compiler can hoist 
function addresses trivially, but I think it will lead to spilling more 
often than not, or alternatively the hoisting will be undone by the 
register allocator's rematerialization.  Of course, this would have to be 
measured for real not hand-waved, but, well, I'd be surprised if it's not 
so.
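
The register trade-off can be sketched in C. This is only an illustration under stated assumptions: the struct below is a made-up stand-in for two GOT slots (a real GOT is filled in by the dynamic linker), and all names are hypothetical. The hoisted variant keeps one extra pointer live across the loop for each distinct callee, which is exactly the pressure in question.

```c
#include <stddef.h>

typedef double (*unary_fn)(double);

/* Hypothetical stand-in for two GOT slots, one per external function. */
struct fake_got {
    unary_fn f;
    unary_fn g;
};

static double twice(double x) { return 2.0 * x; }
static double half(double x)  { return x / 2.0; }

/* Loads each slot at the call site, like "call *foo@GOTPCREL(%rip)":
   only the table pointer needs to stay live across the loop. */
double sum_reload(const struct fake_got *got, const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += got->f(v[i]) + got->g(v[i]);
    return acc;
}

/* Loads both slots once before the loop: saves the repeated loads but
   keeps two more pointers live through the whole loop body. */
double sum_hoisted(const struct fake_got *got, const double *v, size_t n)
{
    unary_fn f = got->f;
    unary_fn g = got->g;
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += f(v[i]) + g(v[i]);
    return acc;
}
```

Both variants compute the same result; whether the hoisted form is faster depends on how many registers the rest of the loop body needs, which is the part that would have to be measured.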


Ciao,
Michael.
H.J. Lu May 20, 2015, 12:35 p.m. UTC | #39
On Wed, May 20, 2015 at 5:10 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Tue, 19 May 2015, Richard Henderson wrote:
>
>> It is.  The relaxation that HJ is working on requires that the reads
>> from the got not be hoisted.  I'm not especially convinced that what
>> he's working on is a win.
>>
>> With LTO, the compiler can do the same job that he's attempting in the
>> linker, without an extra nop.  Without LTO, leaving it to the linker
>> means that you can't hoist the load and hide the memory latency.
>
> Well, hoisting always needs a register, and if hoisted out of a loop
> (which you all seem to be after) that register is live through the whole
> loop body.  You need a register for each different called function in such
> loop, trading the one GOT pointer with N other registers.  For
> register-starved machines this is a real problem, even x86-64 doesn't have
> that many.  I.e. I'm not convinced that this hoisting will really be much
> of a win that often, outside toy examples.  Sure, the compiler can hoist
> function addresses trivially, but I think it will lead to spilling more
> often than not, or alternatively the hoisting will be undone by the
> register allocator's rematerialization.  Of course, this would have to be
> measured for real not hand-waved, but, well, I'd be surprised if it's not
> so.
>

We should replace "call/jmp *foo@GOTPCREL(%rip)" with
"call/jmp *foo@GOTRELAX(%rip)".  As an option, we can apply
-fno-plt to both PIC and non-PIC code if foo is externally defined.
It will save one indirect branch if GCC is right.  If GCC is wrong
and foo is defined locally, we get a nop prefix/suffix.  We have
nothing to lose.
Rich Felker May 20, 2015, 2:09 p.m. UTC | #40
On Wed, May 20, 2015 at 02:10:41PM +0200, Michael Matz wrote:
> Hi,
> 
> On Tue, 19 May 2015, Richard Henderson wrote:
> 
> > It is.  The relaxation that HJ is working on requires that the reads 
> > from the got not be hoisted.  I'm not especially convinced that what 
> > he's working on is a win.
> > 
> > With LTO, the compiler can do the same job that he's attempting in the 
> > linker, without an extra nop.  Without LTO, leaving it to the linker 
> > means that you can't hoist the load and hide the memory latency.
> 
> Well, hoisting always needs a register, and if hoisted out of a loop 
> (which you all seem to be after) that register is live through the whole 
> loop body.  You need a register for each different called function in such 
> loop, trading the one GOT pointer with N other registers.  For 
> register-starved machines this is a real problem, even x86-64 doesn't have 
> that many.  I.e. I'm not convinced that this hoisting will really be much 
> of a win that often, outside toy examples.  Sure, the compiler can hoist 
> function addresses trivially, but I think it will lead to spilling more 
> often than not, or alternatively the hoisting will be undone by the 
> register allocator's rematerialization.  Of course, this would have to be 
> measured for real not hand-waved, but, well, I'd be surprised if it's not 
> so.

The obvious example where it's useful on x86_64 is a major class:
anything where the majority of the callee's data is floating point and
thus kept in xmm registers. In that case register pressure is a lot
lower, and there's also an obvious class of cross-DSO function calls
you'd be making over and over again: anything from libm.

Rich
Michael Matz May 20, 2015, 2:19 p.m. UTC | #41
Hi,

On Wed, 20 May 2015, Rich Felker wrote:

> > of a win that often, outside toy examples.  Sure, the compiler can hoist 
> > function addresses trivially, but I think it will lead to spilling more 
> > often than not, or alternatively the hoisting will be undone by the 
> > register allocator's rematerialization.  Of course, this would have to be 
> > measured for real not hand-waved, but, well, I'd be surprised if it's not 
> > so.
> 
> The obvious example where it's useful on x86_64 is a major class: 

Yes, I can construct all kinds of examples where it's useful.  That 
doesn't touch the topic of real-world cases or hard numbers actually 
comparing the number of hoisted callee addresses, the number that stay 
hoisted until after register allocation and the number of spills added by 
hoisting, on some relevant code base, like gcc itself, or SPEC.

> anything where the majority of the callee's data is floating point and 
> thus kept in xmm registers.

This code tends to work on multiple arrays in practice, and hence integer 
registers are required for all the addresses and offsets and loop 
counters.

> In that case register pressure is a lot lower,

Register pressure on x86 is never low :)  Yes, x86-64 and others are much 
better in this regard.

> and there's also an obvious class of cross-DSO function calls you'd be 
> making over and over again: anything from libm.


Ciao,
Michael.
Richard Henderson May 22, 2015, 6:19 p.m. UTC | #42
On 05/19/2015 06:06 PM, Rich Felker wrote:
> And are the above indirect calls/jumps (1983+43) candidates for
> scheduling/hoisting the address load (that's not being done yet), or
> are they the ones the compiler opted not to schedule/hoist? The win
> from relaxation seems small here, but as long as you're not going to
> block optimizations that would preclude relaxing, I don't see any
> disadvantages to doing it.

FWIW, I bootstrapped gcc with lto and -fpie -fno-plt:

	total calls	252436
	total indirect	21198	(8.4%)
	via got		10128	(4.0% / 48%)
	via reg		9007	(3.6% / 42%)
	via data	2063	(0.8% / 10%)

Those via data are things like

        callq  *0x145fdc4(%rip) # 19c0ea8 <lang_hooks+0x1e8>
        callq  *0x14517cc(%rip) # 19c0388 <targetm+0x328>

where we have a call to a hook at a known address.

Those via reg (or complex address) are also self explanatory -- we have all
sorts of hooks and indirection inside gcc, so this is unsurprising.  That said,
the very first one I examined,

000000000056735e <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334>:
  ...
  56736f: mov    0x144f6f2(%rip),%r13        # 19b6a68 <_DYNAMIC+0x928>
  ...
  567380: sub    $0x18,%r12
  567384: test   %ebx,%ebx
  567386: js     567394 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x36>
  567388: mov    0x28(%rbp,%r12,1),%rdi
  56738d: dec    %ebx
  56738f: callq  *%r13
  567392: jmp    567380 <_ZL15omega_free_eqnsP5eqn_di.lto_priv.3334+0x22>
  ...

does in fact hoist the address of "free" out of the loop.
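
For reference, a loop of roughly the following shape would be expected to produce code like the above under -fno-plt. This is a hypothetical sketch, not the actual source of omega_free_eqns; the struct layout and names are assumptions chosen only to mirror the backwards walk and the per-element call to "free".

```c
#include <stdlib.h>

/* Hypothetical element type; the real layout in GCC may differ. */
struct row {
    int key;
    int color;
    int *coef;
};

/* Walks the array backwards and frees each coefficient vector.
   The callee is the same on every iteration, so the compiler may load
   the address of free from the GOT once, before the loop.
   Returns the number of elements released. */
int free_rows(struct row *rows, int n)
{
    int released = 0;
    for (int i = n - 1; i >= 0; i--) {
        free(rows[i].coef);
        released++;
    }
    return released;
}
```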


Those via got can be identified by comparing the address against readelf -r to
examine the dynamic relocations.  There are plenty of truly non-local calls,
e.g. to libc.  These obviously cannot be relaxed.

Of those 10128 calls via the got, I found EXACTLY ONE that was local, to

  _Z22const_0_to_255_operandP7rtx_def12machine_mode

from

  _ZL19ix86_expand_builtinP9tree_nodeP7rtx_defS2_12machine_modei.lto_priv.2163

This is certain to be a bug, though I don't know where.  There are plenty of
other calls to const_0_to_255_operand elsewhere, and they are all, as expected,
direct.  This will likely take significant detective work...



r~

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index f29e053..b734350 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -5448,12 +5448,13 @@  ix86_function_ok_for_sibcall (tree decl, tree exp)
   /* If we are generating position-independent code, we cannot sibcall
      optimize any indirect call, or a direct call to a global function,
      as the PLT requires %ebx be live. (Darwin does not have a PLT.)  */
   if (!TARGET_MACHO
       && !TARGET_64BIT
       && flag_pic
+      && flag_plt
       && (decl && !targetm.binds_local_p (decl)))
     return false;
 
   /* If we need to align the outgoing stack, then sibcalling would
      unalign the stack, which may break the called function.  */
   if (ix86_minimum_incoming_stack_boundary (true)