Expand PIC calls without PLT with -fno-plt

Message ID alpine.LNX.2.11.1505061730460.22867@monopod.intra.ispras.ru
State New

Commit Message

Alexander Monakov May 6, 2015, 3:24 p.m. UTC
On Mon, 4 May 2015, Jeff Law wrote:
> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> > On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> > > On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> > > > This patch introduces option -fno-plt that allows to expand calls that
> > > > would
> > > > go via PLT to load the address of the function immediately at call site
> > > > (which
> > > > introduces a GOT load).  Cover letter explains the motivation for this
> > > > patch.
> > > >
> > > > New option documentation for invoke.texi is missing from the patch; if
> > > > this is
> > > > accepted I'll be happy to send a v2 with documentation added.
> > > >
> > > >  * calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> > > >  indirect call by forcing address into a pseudo with -fno-plt.
> > > >  * common.opt (flag_plt): New option.
> > > OK once you cobble together the invoke.texi changes.
> >
> > Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> > inline the plt slot's first part, then lazy binding will work fine.
> I must have missed Alan/Michael's message.
> 
> ISTM the win here is that by going through the GOT, you can CSE the GOT
> reference and possibly get some more register allocation freedom.  Is that
> still the case with Alan/Michael's approach?

If the same PLT stubs as today are to be used, it constrains the compiler on
32-bit x86 and possibly other arches where PLT stubs need the GOT pointer in a
specific register.  It's possible to imagine more complex PLT stubs that
obtain the GOT pointer on their own, but in that case you can't let optimizations
such as loop invariant motion move the GOT load away from the call in a
fashion that could result in the PLT stub pointer being reused many times.

Going ahead with this patch now allows anyone to play with no-PLT codegen on
any architecture.  As you can see from this series, on x86 it uncovered several
codegen blunders (and fixing those should improve normal codegen as well -- so
everybody wins).

Below is my proposed patch for invoke.texi.  Still OK to check in?

	* doc/invoke.texi (Code Generation Options): Add -fno-plt.
	([-fno-plt]): Document.
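
As a rough illustration of the transformation described above (a sketch, not
the patch itself), the C fragment below models the GOT as an ordinary function
pointer.  The names got_slot, external_fn and sum_calls are hypothetical; a
real GOT entry is filled in by the dynamic linker.  The point is only that once
the call address is an explicit load, it can be CSE'd or hoisted like any
other load.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a function that would live in another shared object. */
static int external_fn(int x)
{
    return x + 1;
}

/* Hypothetical one-entry "GOT": with -fno-plt the call site loads the
   callee's address from such a slot instead of jumping to a PLT stub. */
static int (*got_slot)(int) = external_fn;

/* PLT-free style call: load the address once, then call indirectly.
   Because the load is explicit, it can be moved out of the loop. */
int sum_calls(int n)
{
    int (*fn)(int) = got_slot;   /* single "GOT load", hoisted */
    int total = 0;

    assert(fn != NULL);
    for (int i = 0; i < n; i++)
        total += fn(i);          /* indirect call through a local pointer */
    return total;
}
```

Real code would simply call the external function by name and be built with
-fPIC -fno-plt; the explicit pointer here only makes the GOT load visible.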

Comments

Jakub Jelinek May 6, 2015, 3:45 p.m. UTC | #1
On Wed, May 06, 2015 at 06:24:58PM +0300, Alexander Monakov wrote:
> If the same PLT stubs as today are to be used, it constrains the compiler on
> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
> specific register.  It's possible to imagine more complex PLT stubs that
> obtain GOT pointer on their own, but in that case you can't let optimizations
> such as loop invariant motion move the GOT load away from the call in a
> fashion that could result in PLT stub pointer be reused many times.

Why?
32-bit x86 (shouldn't we care much more about x86-64, where this is a
non-issue?) PLT looks like:

4c2b7310 <_Unwind_Find_FDE@plt-0x10>:
4c2b7310:       ff b3 04 00 00 00       pushl  0x4(%ebx)
4c2b7316:       ff a3 08 00 00 00       jmp    *0x8(%ebx)
4c2b731c:       00 00                   add    %al,(%eax)
        ...

4c2b7320 <_Unwind_Find_FDE@plt>:
4c2b7320:       ff a3 0c 00 00 00       jmp    *0xc(%ebx)
4c2b7326:       68 00 00 00 00          push   $0x0
4c2b732b:       e9 e0 ff ff ff          jmp    4c2b7310

4c2b7330 <realloc@plt>:
4c2b7330:       ff a3 10 00 00 00       jmp    *0x10(%ebx)
4c2b7336:       68 08 00 00 00          push   $0x8
4c2b733b:       e9 d0 ff ff ff          jmp    4c2b7310

The linker would know very well what kind of relocations are used for
particular PLT slot, and for the new relocations which would resolve to the
address of the .got.plt slot it could just tweak corresponding 3rd insn
in the slot, to not jump to first plt slot - 16, but a few bytes before that
that would just load the address of _G_O_T_ into %ebx and then fallthru
into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
slower in that case, but no requirement on %ebx to contain _G_O_T_.

As for hoisting the load of the call address before the loop, with lazy
binding that has the obvious disadvantage that you'd resolve the slot again
and again, if you are unlucky enough that the function hasn't been resolved
yet.  Unless the shared PLT stub after computing _G_O_T_ (for x86) also
rechecks the .got.plt address.
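
To make the hoisting concern concrete, here is a small C simulation of lazy
binding (a sketch under simplifying assumptions; got_plt_slot, resolver_stub
and the counters are hypothetical stand-ins for the real .got.plt slot and the
dynamic linker's resolver).  A caller that reloads the slot on every call
resolves once; a caller that hoisted the load keeps hitting the resolver.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*fn_t)(int);

static int real_target(int x) { return x * 2; }
static int resolver_stub(int x);

/* Hypothetical .got.plt slot: starts out pointing at the resolver. */
static fn_t got_plt_slot = resolver_stub;
static int resolve_count = 0;

/* Models lazy binding: count the (expensive) lookup, patch the slot,
   then forward the call to the real function. */
static int resolver_stub(int x)
{
    resolve_count++;
    got_plt_slot = real_target;
    return real_target(x);
}

/* Put the slot back into its unresolved state. */
void reset_binding(void)
{
    got_plt_slot = resolver_stub;
    resolve_count = 0;
}

int resolutions(void) { return resolve_count; }

/* Slot reloaded on every call: the resolver runs at most once. */
int call_reloading(int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += got_plt_slot(i);
    return total;
}

/* Slot load hoisted out of the loop: if the slot was still unresolved,
   every iteration goes through the resolver again. */
int call_hoisted(int n)
{
    fn_t fn = got_plt_slot;
    int total = 0;

    assert(fn != NULL);
    for (int i = 0; i < n; i++)
        total += fn(i);
    return total;
}
```

Both loops compute the same result; only the number of simulated resolutions
differs, which is the cost being weighed against the benefit of hoisting.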

	Jakub
Jeff Law May 6, 2015, 3:55 p.m. UTC | #2
On 05/06/2015 09:45 AM, Jakub Jelinek wrote:

> As for hoisting the load of the call address before the loop, with lazy
> binding that has the obvious disadvantage that you'd resolve the slot again
> and again, if you are unlucky enough that the function hasn't been resolved
> yet.  Unless the shared PLT stub after computing _G_O_T_ (for x86) also
> rechecks the .got.plt address.
Yea, but I suspect that's the rare case rather than the common case.

Of course, it's so bloody expensive when it happens, it might totally 
outweigh the aggregated benefits from all the other profitable hoisted 
GOT loads.

jeff
Alexander Monakov May 6, 2015, 4:43 p.m. UTC | #3
On Wed, 6 May 2015, Jakub Jelinek wrote:
> The linker would know very well what kind of relocations are used for
> particular PLT slot, and for the new relocations which would resolve to the
> address of the .got.plt slot it could just tweak corresponding 3rd insn
> in the slot, to not jump to first plt slot - 16, but a few bytes before that
> that would just load the address of _G_O_T_ into %ebx and then fallthru
> into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> slower in that case, but no requirement on %ebx to contain _G_O_T_.

No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.

Alexander
Rich Felker May 6, 2015, 5:35 p.m. UTC | #4
On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> On Wed, 6 May 2015, Jakub Jelinek wrote:
> > The linker would know very well what kind of relocations are used for
> > particular PLT slot, and for the new relocations which would resolve to the
> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> 
> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.

Indeed. And the situation is the same on almost all targets. The only
exceptions are those with direct PC-relative addressing (like x86_64)
and those with reserved inter-procedural linkage registers and
efficient PC-relative address loading via them (like ARM and AArch64).
MIPS (o32) is also an interesting exception in that the normal ABI is
already PLT-free, and while callees need a PIC register loaded, it's a
call-clobbered register, not a call-saved one, so it doesn't make the
same kind of trouble.

I really don't see a need to make no-PLT code gen support lazy binding
when it's necessarily going to be costly to do so, and precludes most
of the benefits of the no-PLT approach. Anyone still wanting/needing
lazy binding semantics can use PLT, and can even choose on a per-TU
basis (or maybe even more fine-grained with pragmas/attributes?).
Those of us who are suffering the cost of PLT with no benefits
(because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
adding -fno-plt) and enjoy something like a 10% performance boost in
PIC/PIE.

Rich
H.J. Lu May 6, 2015, 6:26 p.m. UTC | #5
On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> > The linker would know very well what kind of relocations are used for
>> > particular PLT slot, and for the new relocations which would resolve to the
>> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>>
>> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>
> Indeed. And the situation is the same on almost all targets. The only
> exceptions are those with direct PC-relative addressing (like x86_64)
> and those with reserved inter-procedural linkage registers and
> efficient PC-relative address loading via them (like ARM and AArch64).
> MIPS (o32) is also an interesting exception in that the normal ABI is
> already PLT-free, and while callees need a PIC register loaded, it's a
> call-clobbered register, not a call-saved one, so it doesn't make the
> same kind of trouble,
>
> I really don't see a need to make no-PLT code gen support lazy binding
> when it's necessarily going to be costly to do so, and precludes most
> of the benefits of the no-PLT approach. Anyone still wanting/needing
> lazy binding semantics can use PLT, and can even choose on a per-TU
> basis (or maybe even more fine-grained with pragmas/attributes?).
> Those of us who are suffering the cost of PLT with no benefits
> (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> adding -fno-plt) and enjoy something like a 10% performance boost in
> PIC/PIE.
>

There are things the compiler can do for performance and correctness
if it is told what options will be passed to the linker.  -z now is one and
-Bsymbolic is another:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886

I think we should add -fnow and -fsymbolic.  Together with LTO,
we can generate faster executables as well as shared libraries.
Rich Felker May 6, 2015, 6:37 p.m. UTC | #6
On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
> >> > The linker would know very well what kind of relocations are used for
> >> > particular PLT slot, and for the new relocations which would resolve to the
> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> >>
> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
> >
> > Indeed. And the situation is the same on almost all targets. The only
> > exceptions are those with direct PC-relative addressing (like x86_64)
> > and those with reserved inter-procedural linkage registers and
> > efficient PC-relative address loading via them (like ARM and AArch64).
> > MIPS (o32) is also an interesting exception in that the normal ABI is
> > already PLT-free, and while callees need a PIC register loaded, it's a
> > call-clobbered register, not a call-saved one, so it doesn't make the
> > same kind of trouble,
> >
> > I really don't see a need to make no-PLT code gen support lazy binding
> > when it's necessarily going to be costly to do so, and precludes most
> > of the benefits of the no-PLT approach. Anyone still wanting/needing
> > lazy binding semantics can use PLT, and can even choose on a per-TU
> > basis (or maybe even more fine-grained with pragmas/attributes?).
> > Those of us who are suffering the cost of PLT with no benefits
> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> > adding -fno-plt) and enjoy something like a 10% performance boost in
> > PIC/PIE.
> >
> 
> There are things compiler can do for performance and correctness
> if it is told what options will be passed to linker.  -z now is one and
> -Bsymbolic is another one:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
> 
> I think we should add -fnow and -fsymbolic.  Together with LTO,
> we can generate faster executables as well as shared libraries.

I don't see how knowing about -Bsymbolic can help the compiler
optimize. Without visibility, it can't know whether the symbols will
be defined in the same DSO. With visibility, it can already do the
equivalent hints. Perhaps it helps in the case where the symbol is
already defined (and non-weak) in the same TU, but I think in this
case it should already be optimizing the reference. Symbol
interposition over top of a non-weak symbol from the same TU is always
invalid and the compiler should not be pessimizing code to make it
work.
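
For reference, the visibility hint mentioned above can look like the sketch
below.  The function names are made up for illustration; the attribute syntax
is the GCC one.  A hidden symbol cannot be interposed from outside the DSO, so
the compiler may bind calls to it locally without any linker-option knowledge.

```c
#include <assert.h>

/* Hidden: not exported from the shared object, so it cannot be
   interposed and calls to it may be bound locally (or inlined). */
__attribute__((visibility("hidden")))
int helper_square(int x)
{
    /* Guard against signed overflow in x * x for 32-bit int. */
    assert(x > -46341 && x < 46341);
    return x * x;
}

/* Default visibility: exported and reachable from other DSOs. */
__attribute__((visibility("default")))
int api_square_plus_one(int x)
{
    return helper_square(x) + 1;   /* needs no PLT or GOT indirection */
}
```

Building this as part of a shared library with -fPIC should show a direct call
(or an inlined body) for helper_square, though the exact output depends on the
target and optimization level.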

As for -fnow, I haven't thought about it much but I also don't see
many places where it could help. The only benefit that comes to mind
is on targets with weak memory order, where it would eliminate some of
the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
work on AArch64). It might also benefit PLT calls on such targets, but
you would get a lot more benefit from -fno-plt, and in that case -fnow
would not allow any further optimization.

Rich
H.J. Lu May 6, 2015, 6:44 p.m. UTC | #7
On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
>> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
>> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> >> > The linker would know very well what kind of relocations are used for
>> >> > particular PLT slot, and for the new relocations which would resolve to the
>> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>> >>
>> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>> >
>> > Indeed. And the situation is the same on almost all targets. The only
>> > exceptions are those with direct PC-relative addressing (like x86_64)
>> > and those with reserved inter-procedural linkage registers and
>> > efficient PC-relative address loading via them (like ARM and AArch64).
>> > MIPS (o32) is also an interesting exception in that the normal ABI is
>> > already PLT-free, and while callees need a PIC register loaded, it's a
>> > call-clobbered register, not a call-saved one, so it doesn't make the
>> > same kind of trouble,
>> >
>> > I really don't see a need to make no-PLT code gen support lazy binding
>> > when it's necessarily going to be costly to do so, and precludes most
>> > of the benefits of the no-PLT approach. Anyone still wanting/needing
>> > lazy binding semantics can use PLT, and can even choose on a per-TU
>> > basis (or maybe even more fine-grained with pragmas/attributes?).
>> > Those of us who are suffering the cost of PLT with no benefits
>> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
>> > adding -fno-plt) and enjoy something like a 10% performance boost in
>> > PIC/PIE.
>> >
>>
>> There are things compiler can do for performance and correctness
>> if it is told what options will be passed to linker.  -z now is one and
>> -Bsymbolic is another one:
>>
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
>>
>> I think we should add -fnow and -fsymbolic.  Together with LTO,
>> we can generate faster executables as well as shared libraries.
>
> I don't see how knowing about -Bsymbolic can help the compiler
> optimize. Without visibility, it can't know whether the symbols will
> be defined in the same DSO. With visibility, it can already do the
> equivalent hints. Perhaps it helps in the case where the symbol is
> already defined (and non-weak) in the same TU, but I think in this
> case it should already be optimizing the reference. Symbol
> interposition over top of a non-weak symbol from the same TU is always
> invalid and the compiler should not be pessimizing code to make it
> work.

-Bsymbolic will bind all references to local definitions in shared libraries,
with and without visibility, weak or non-weak.  The compiler can use it
in binds_tls_local_p and we can generate much better code in shared
libraries.

> As for -fnow, I haven't thought about it much but I also don't see
> many places where it could help. The only benefit that comes to mind
> is on targets with weak memory order, where it would eliminate some of
> the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
> work on AArch64). It might also benefit PLT calls on such targets, but
> you would get a lot more benefit from -fno-plt, and in that case -fnow
> would not allow any further optimization.
>

-fno-plt doesn't work with lazy binding.  -fnow tells the compiler that
lazy binding is not used and it can optimize without PLT.  With
-flto -fnow, the compiler can make much better choices.
Rich Felker May 6, 2015, 7:01 p.m. UTC | #8
On Wed, May 06, 2015 at 11:44:57AM -0700, H.J. Lu wrote:
> On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
> > On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
> >> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
> >> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
> >> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
> >> >> > The linker would know very well what kind of relocations are used for
> >> >> > particular PLT slot, and for the new relocations which would resolve to the
> >> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
> >> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
> >> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
> >> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
> >> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
> >> >>
> >> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
> >> >
> >> > Indeed. And the situation is the same on almost all targets. The only
> >> > exceptions are those with direct PC-relative addressing (like x86_64)
> >> > and those with reserved inter-procedural linkage registers and
> >> > efficient PC-relative address loading via them (like ARM and AArch64).
> >> > MIPS (o32) is also an interesting exception in that the normal ABI is
> >> > already PLT-free, and while callees need a PIC register loaded, it's a
> >> > call-clobbered register, not a call-saved one, so it doesn't make the
> >> > same kind of trouble,
> >> >
> >> > I really don't see a need to make no-PLT code gen support lazy binding
> >> > when it's necessarily going to be costly to do so, and precludes most
> >> > of the benefits of the no-PLT approach. Anyone still wanting/needing
> >> > lazy binding semantics can use PLT, and can even choose on a per-TU
> >> > basis (or maybe even more fine-grained with pragmas/attributes?).
> >> > Those of us who are suffering the cost of PLT with no benefits
> >> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
> >> > adding -fno-plt) and enjoy something like a 10% performance boost in
> >> > PIC/PIE.
> >> >
> >>
> >> There are things compiler can do for performance and correctness
> >> if it is told what options will be passed to linker.  -z now is one and
> >> -Bsymbolic is another one:
> >>
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
> >>
> >> I think we should add -fnow and -fsymbolic.  Together with LTO,
> >> we can generate faster executables as well as shared libraries.
> >
> > I don't see how knowing about -Bsymbolic can help the compiler
> > optimize. Without visibility, it can't know whether the symbols will
> > be defined in the same DSO. With visibility, it can already do the
> > equivalent hints. Perhaps it helps in the case where the symbol is
> > already defined (and non-weak) in the same TU, but I think in this
> > case it should already be optimizing the reference. Symbol
> > interposition over top of a non-weak symbol from the same TU is always
> > invalid and the compiler should not be pessimizing code to make it
> > work.
> 
> -Bsymbolic will bind all references to local definitions in shared libraries,
> with and without visibility, weak or non-weak.  Compiler can use it
> in binds_tls_local_p and we can generate much better codes in shared
> libraries.

Yes, I'm aware of what it does. But at compile-time the compiler can't
know whether the referenced symbol will be defined in the same DSO
unless there is a visibility annotation telling it. Even when linking a
shared library using -Bsymbolic, the library code can still make calls
(or data references) to symbols in other DSOs.

> > As for -fnow, I haven't thought about it much but I also don't see
> > many places where it could help. The only benefit that comes to mind
> > is on targets with weak memory order, where it would eliminate some of
> > the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
> > work on AArch64). It might also benefit PLT calls on such targets, but
> > you would get a lot more benefit from -fno-plt, and in that case -fnow
> > would not allow any further optimization.
> 
> -fno-plt doesn't work with lazy binding.  -fnow tells compiler that
> lazy binding is not used and it can optimize without PLT.  With
> -flto -fnow, compiler can make much better choices.

Ah, I see now you had LTO in mind. In that case the compiler does know
when the symbol is defined in the same DSO for -Bsymbolic. So that
clears up the usefulness of your proposed -fsymbolic. I still don't
see how -fnow would have a lot of practical usefulness, but I'm
certainly not opposed to it.

Rich
H.J. Lu May 6, 2015, 7:05 p.m. UTC | #9
On Wed, May 6, 2015 at 12:01 PM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 11:44:57AM -0700, H.J. Lu wrote:
>> On Wed, May 6, 2015 at 11:37 AM, Rich Felker <dalias@libc.org> wrote:
>> > On Wed, May 06, 2015 at 11:26:29AM -0700, H.J. Lu wrote:
>> >> On Wed, May 6, 2015 at 10:35 AM, Rich Felker <dalias@libc.org> wrote:
>> >> > On Wed, May 06, 2015 at 07:43:58PM +0300, Alexander Monakov wrote:
>> >> >> On Wed, 6 May 2015, Jakub Jelinek wrote:
>> >> >> > The linker would know very well what kind of relocations are used for
>> >> >> > particular PLT slot, and for the new relocations which would resolve to the
>> >> >> > address of the .got.plt slot it could just tweak corresponding 3rd insn
>> >> >> > in the slot, to not jump to first plt slot - 16, but a few bytes before that
>> >> >> > that would just load the address of _G_O_T_ into %ebx and then fallthru
>> >> >> > into the 0x4c2b7310 snippet above.  The lazy binding would be a few ticks
>> >> >> > slower in that case, but no requirement on %ebx to contain _G_O_T_.
>> >> >>
>> >> >> No, %ebx is callee-saved, so you can't outright overwrite it in the PLT stub.
>> >> >
>> >> > Indeed. And the situation is the same on almost all targets. The only
>> >> > exceptions are those with direct PC-relative addressing (like x86_64)
>> >> > and those with reserved inter-procedural linkage registers and
>> >> > efficient PC-relative address loading via them (like ARM and AArch64).
>> >> > MIPS (o32) is also an interesting exception in that the normal ABI is
>> >> > already PLT-free, and while callees need a PIC register loaded, it's a
>> >> > call-clobbered register, not a call-saved one, so it doesn't make the
>> >> > same kind of trouble,
>> >> >
>> >> > I really don't see a need to make no-PLT code gen support lazy binding
>> >> > when it's necessarily going to be costly to do so, and precludes most
>> >> > of the benefits of the no-PLT approach. Anyone still wanting/needing
>> >> > lazy binding semantics can use PLT, and can even choose on a per-TU
>> >> > basis (or maybe even more fine-grained with pragmas/attributes?).
>> >> > Those of us who are suffering the cost of PLT with no benefits
>> >> > (because we use -Wl,-z,relro -Wl,-z,now) can just be rid of it (by
>> >> > adding -fno-plt) and enjoy something like a 10% performance boost in
>> >> > PIC/PIE.
>> >> >
>> >>
>> >> There are things compiler can do for performance and correctness
>> >> if it is told what options will be passed to linker.  -z now is one and
>> >> -Bsymbolic is another one:
>> >>
>> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65886
>> >>
>> >> I think we should add -fnow and -fsymbolic.  Together with LTO,
>> >> we can generate faster executables as well as shared libraries.
>> >
>> > I don't see how knowing about -Bsymbolic can help the compiler
>> > optimize. Without visibility, it can't know whether the symbols will
>> > be defined in the same DSO. With visibility, it can already do the
>> > equivalent hints. Perhaps it helps in the case where the symbol is
>> > already defined (and non-weak) in the same TU, but I think in this
>> > case it should already be optimizing the reference. Symbol
>> > interposition over top of a non-weak symbol from the same TU is always
>> > invalid and the compiler should not be pessimizing code to make it
>> > work.
>>
>> -Bsymbolic will bind all references to local definitions in shared libraries,
>> with and without visibility, weak or non-weak.  Compiler can use it
>> in binds_tls_local_p and we can generate much better codes in shared
>> libraries.
>
> Yes, I'm aware of what it does. But at compile-time the compiler can't
> know whether the referenced symbol will be defined in the same DSO
> unless this is visibility annotation telling it. Even when linking a
> shared library using -Bsymbolic, the library code can still make calls
> (or data references) to symbols in other DSOs.

Even without LTO, -fsymbolic -fPIC will generate better code for

---
int glob_a = 1;

int foo ()
{
  return glob_a;
}
---

and

---
int glob_a (void)
{
  return -1;
}

int foo ()
{
  return glob_a ();
}
---


>> > As for -fnow, I haven't thought about it much but I also don't see
>> > many places where it could help. The only benefit that comes to mind
>> > is on targets with weak memory order, where it would eliminate some of
>> > the cost of synchronizing TLSDESC lazy bindings (see Szabolcs Nagy's
>> > work on AArch64). It might also benefit PLT calls on such targets, but
>> > you would get a lot more benefit from -fno-plt, and in that case -fnow
>> > would not allow any further optimization.
>>
>> -fno-plt doesn't work with lazy binding.  -fnow tells compiler that
>> lazy binding is not used and it can optimize without PLT.  With
>> -flto -fnow, compiler can make much better choices.
>
> Ah, I see now you had LTO in mind. In that case the compiler does know
> when the symbol is defined in the same DSO for -Bsymbolic. So that
> clears up the usefulness of your proposed -fsymbolic. I still don't
> see how -fnow would have a lot of practical usefulness, but I'm
> certainly not opposed to it.
>
> Rich
Rich Felker May 6, 2015, 7:17 p.m. UTC | #10
On Wed, May 06, 2015 at 12:05:20PM -0700, H.J. Lu wrote:
> >> -Bsymbolic will bind all references to local definitions in shared libraries,
> >> with and without visibility, weak or non-weak.  Compiler can use it
> >> in binds_tls_local_p and we can generate much better codes in shared
> >> libraries.
> >
> > Yes, I'm aware of what it does. But at compile-time the compiler can't
> > know whether the referenced symbol will be defined in the same DSO
> > unless this is visibility annotation telling it. Even when linking a
> > shared library using -Bsymbolic, the library code can still make calls
> > (or data references) to symbols in other DSOs.
> 
> Even without LTO, -fsymbolic -fPIC will generate better codes for
> 
> ---
> int glob_a = 1;
> 
> int foo ()
> {
>   return glob_a;
> }
> ---

I see how this case is improved, but it depends on the dubious (and
undocumented?) behavior of -Bsymbolic breaking copy relocations.

> and
> 
> ---
> int glob_a (void)
> {
>   return -1;
> }
> 
> int foo ()
> {
>   return glob_a ();
> }
> ---

I don't see how this case is improved unless GCC is failing to
consider strong definitions in the same TU as locally-binding. If this
is the case, is there a reason for that behavior? IMO it's wrong.

Rich
H.J. Lu May 6, 2015, 7:24 p.m. UTC | #11
On Wed, May 6, 2015 at 12:17 PM, Rich Felker <dalias@libc.org> wrote:
> On Wed, May 06, 2015 at 12:05:20PM -0700, H.J. Lu wrote:
>> >> -Bsymbolic will bind all references to local definitions in shared libraries,
>> >> with and without visibility, weak or non-weak.  Compiler can use it
>> >> in binds_tls_local_p and we can generate much better codes in shared
>> >> libraries.
>> >
>> > Yes, I'm aware of what it does. But at compile-time the compiler can't
>> > know whether the referenced symbol will be defined in the same DSO
>> > unless this is visibility annotation telling it. Even when linking a
>> > shared library using -Bsymbolic, the library code can still make calls
>> > (or data references) to symbols in other DSOs.
>>
>> Even without LTO, -fsymbolic -fPIC will generate better codes for
>>
>> ---
>> int glob_a = 1;
>>
>> int foo ()
>> {
>>   return glob_a;
>> }
>> ---
>
> I see how this case is improved, but it depends on the dubious (and
> undocumented?) behavior of -Bsymbolic breaking copy relocations.

-Bsymbolic breaks copy relocations, independent of the compiler.
However, we can pass -fsymbolic when building PIE to avoid
copy relocations.  With -fsymbolic -fPIE -pie -flto, we can generate
direct references for locally defined symbols.


>> and
>>
>> ---
>> int glob_a (void)
>> {
>>   return -1;
>> }
>>
>> int foo ()
>> {
>>   return glob_a ();
>> }
>> ---
>
> I don't see how this case is improved unless GCC is failing to
> consider strong definitions in the same TU as locally-binding. If this
> is the case, is there a reason for that behavior? IMO it's wrong.

glob_a is a strong definition.  If you have another strong definition,
you will get a linker error.
Jeff Law May 7, 2015, 6:22 p.m. UTC | #12
On 05/06/2015 09:24 AM, Alexander Monakov wrote:
> On Mon, 4 May 2015, Jeff Law wrote:
>> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
>>> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>>>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>>>> This patch introduces option -fno-plt that allows to expand calls that
>>>>> would
>>>>> go via PLT to load the address of the function immediately at call site
>>>>> (which
>>>>> introduces a GOT load).  Cover letter explains the motivation for this
>>>>> patch.
>>>>>
>>>>> New option documentation for invoke.texi is missing from the patch; if
>>>>> this is
>>>>> accepted I'll be happy to send a v2 with documentation added.
>>>>>
>>>>>   * calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>>>>>   indirect call by forcing address into a pseudo with -fno-plt.
>>>>>   * common.opt (flag_plt): New option.
>>>> OK once you cobble together the invoke.texi changes.
>>>
>>> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
>>> inline the plt slot's first part, then lazy binding will work fine.
>> I must have missed Alan/Michael's message.
>>
>> ISTM the win here is that by going through the GOT, you can CSE the GOT
>> reference and possibly get some more register allocation freedom.  Is that
>> still the case with Alan/Michael's approach?
>
> If the same PLT stubs as today are to be used, it constrains the compiler on
> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
> specific register.  It's possible to imagine more complex PLT stubs that
> obtain GOT pointer on their own, but in that case you can't let optimizations
> such as loop invariant motion move the GOT load away from the call in a
> fashion that could result in PLT stub pointer be reused many times.
>
> Going ahead with this patch now allows anyone to play with no-PLT codegen on
> any architecture.  As you can see from this series, on x86 it uncovered several
> codegen blunders (and fixing those should improve normal codegen as well -- so
> everybody wins).
>
> Below is my proposed patch for invoke.texi.  Still OK to check in?
>
> 	* doc/invoke.texi (Code Generation Options): Add -fno-plt.
> 	([-fno-plt]): Document.
We're not changing the defaults, so I think this is fine.  Whether or 
not it proves useful is still to be determined.

jeff
H.J. Lu May 7, 2015, 7:13 p.m. UTC | #13
On Thu, May 7, 2015 at 11:22 AM, Jeff Law <law@redhat.com> wrote:
> On 05/06/2015 09:24 AM, Alexander Monakov wrote:
>>
>> On Mon, 4 May 2015, Jeff Law wrote:
>>>
>>> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
>>>>
>>>> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>>>>>
>>>>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>>>>>
>>>>>> This patch introduces option -fno-plt that allows to expand calls that
>>>>>> would
>>>>>> go via PLT to load the address of the function immediately at call
>>>>>> site
>>>>>> (which
>>>>>> introduces a GOT load).  Cover letter explains the motivation for this
>>>>>> patch.
>>>>>>
>>>>>> New option documentation for invoke.texi is missing from the patch; if
>>>>>> this is
>>>>>> accepted I'll be happy to send a v2 with documentation added.
>>>>>>
>>>>>>   * calls.c (prepare_call_address): Transform PLT call to GOT lookup
>>>>>> and
>>>>>>   indirect call by forcing address into a pseudo with -fno-plt.
>>>>>>   * common.opt (flag_plt): New option.
>>>>>
>>>>> OK once you cobble together the invoke.texi changes.
>>>>
>>>>
>>>> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes
>>>> to
>>>> inline the plt slot's first part, then lazy binding will work fine.
>>>
>>> I must have missed Alan/Michael's message.
>>>
>>> ISTM the win here is that by going through the GOT, you can CSE the GOT
>>> reference and possibly get some more register allocation freedom.  Is
>>> that
>>> still the case with Alan/Michael's approach?
>>
>>
>> If the same PLT stubs as today are to be used, it constrains the compiler
>> on
>> 32-bit x86 and possibly other arches where PLT stubs need GOT pointer in a
>> specific register.  It's possible to imagine more complex PLT stubs that
>> obtain GOT pointer on their own, but in that case you can't let
>> optimizations
>> such as loop invariant motion move the GOT load away from the call in a
>> fashion that could result in PLT stub pointer be reused many times.
>>
>> Going ahead with this patch now allows anyone to play with no-PLT codegen
>> on
>> any architecture.  As you can see from this series, on x86 it uncovered
>> several
>> codegen blunders (and fixing those should improve normal codegen as well
>> -- so
>> everybody wins).
>>
>> Below is my proposed patch for invoke.texi.  Still OK to check in?
>>
>>         * doc/invoke.texi (Code Generation Options): Add -fno-plt.
>>         ([-fno-plt]): Document.
>
> We're not changing the defaults, so I think this is fine.  Whether or not it
> proves useful is still to be determined.
>

We should do it if we know -z now will be passed to the linker and function
foo is defined in a shared library.  Without the new relocation, we will
only know for sure if foo is defined in a shared library when we do LTO.
With the new relocation, we can do it for all non-local functions via a
compiler switch.
Michael Matz May 11, 2015, 11:48 a.m. UTC | #14
Hi,

On Wed, 6 May 2015, Rich Felker wrote:

> I don't see how this case is improved unless GCC is failing to consider 
> strong definitions in the same TU as locally-binding.

Interposition of non-static non-inline non-weak symbols is supported 
independent of whether they are defined in the same TU or not (if you're 
producing a shared lib, that is).  I.e. no, they are not considered 
locally-binding (for instance, they aren't automatically inlined).

> If this is the case, is there a reason for that behavior?

Because IMHO interposition is orthogonal to TU placement, and hence 
shouldn't be influenced by it.  There's visibility, inline hints or 
static-ness to achieve different effects.  (perhaps the real reason is: 
because it always worked like that :) )
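
A minimal sketch of those alternatives, assuming GCC or Clang on an ELF
target and a -fPIC -shared build.  All function names here are made up
for illustration; only the default-visibility case is subject to
interposition, and so only that call has to go through the PLT or GOT.

```c
/* Default visibility: exported from the shared library and interposable.
   A caller in this TU may not assume this definition is the one that
   ends up being used at run time. */
int glob_default(void)
{
  return -1;
}

/* static: internal linkage, so calls always bind to this definition. */
static int glob_static(void)
{
  return -2;
}

/* Hidden visibility: usable from other TUs of the same library, but not
   exported, so it cannot be interposed from outside. */
__attribute__((visibility("hidden"))) int glob_hidden(void)
{
  return -3;
}

int call_default(void) { return glob_default(); }
int call_static(void)  { return glob_static(); }
int call_hidden(void)  { return glob_hidden(); }
```

Comparing the assembly for the three callers (gcc -O2 -fPIC -S) should
show the difference in how each call is emitted.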

> IMO it's wrong.

Why?  I think it's right.


Ciao,
Michael.
Rich Felker May 11, 2015, 2:19 p.m. UTC | #15
On Mon, May 11, 2015 at 01:48:03PM +0200, Michael Matz wrote:
> Hi,
> 
> On Wed, 6 May 2015, Rich Felker wrote:
> 
> > I don't see how this case is improved unless GCC is failing to consider 
> > strong definitions in the same TU as locally-binding.
> 
> Interposition of non-static non-inline non-weak symbols is supported 
> independent of whether they are defined in the same TU or not (if you're 
> producing a shared lib, that is).  I.e. no, they are not considered 
> locally-binding (for instance, they aren't automatically inlined).
>
> > If this is the case, is there a reason for that behavior?
> 
> Because IMHO interposition is orthogonal to TU placement, and hence 
> shouldn't be influenced by it.  There's visibility, inline hints or 
> static-ness to achieve different effects.  (perhaps the real reason is: 
> because it always worked like that :) )
> 
> > IMO it's wrong.
> 
> Why?  I think it's right.

I see it as an unnecessary pessimization. The ELF shared library
semantics for allowing interposition were designed to avoid behavioral
regressions versus static linking, and this is not such a case. With
an archive-type library, it's possible to cause whole TUs to be
omitted when linking as long as whatever symbol(s) may have been
needed from them are already defined elsewhere; interposition makes
the same possible with dynamic linking. But if symbols A and B were
both in the same TU, having A defined prior to searching an archive
but B undefined will cause the TU that defines both to be pulled in,
and is thus a linking error (multiple definitions). So I'm not sure
why it's desirable to support this.

The "it always worked like that" argument may suffice if people are
depending on this behavior now (OTOH I'd rather see it break so they
fix their breakage of static linking) but I suspect the historical
reason it worked like that is that compilers were not smart enough to
process whole TUs at a time but just worked with one function at a
time and did not know that a referenced symbol was in the same TU.

BTW visibility can't really address the issue except with hacks
(hidden aliases) or protected visibility (which is hard to use because
it's broken on lots of toolchain versions).
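
For reference, the hidden-alias hack looks roughly like the sketch below,
reusing glob_a and foo from the earlier example.  It assumes GCC or Clang
on an ELF target (the alias attribute is not available everywhere), and
the name glob_a_local is hypothetical.

```c
/* Exported, interposable entry point, as in the earlier example. */
int glob_a(void)
{
  return -1;
}

/* Second name for the same function.  Because the alias has hidden
   visibility it is not exported and cannot be interposed, so references
   to it bind to the definition in this TU.  (glob_a_local is a made-up
   name for this sketch.) */
extern __typeof__(glob_a) glob_a_local
  __attribute__((alias("glob_a"), visibility("hidden")));

int foo(void)
{
  /* Calls through the hidden alias instead of the exported symbol, so
     the compiler is free to emit a direct call. */
  return glob_a_local();
}
```

Every exported function that is also called internally needs its own
alias, and every internal caller has to use the alias name, which is why
this does not scale well.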

Rich

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 520c2c5..fd4199c 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1122,7 +1122,7 @@  See S/390 and zSeries Options.
 -finstrument-functions-exclude-function-list=@var{sym},@var{sym},@dots{} @gol
 -finstrument-functions-exclude-file-list=@var{file},@var{file},@dots{} @gol
 -fno-common  -fno-ident @gol
--fpcc-struct-return  -fpic  -fPIC -fpie -fPIE @gol
+-fpcc-struct-return  -fpic  -fPIC -fpie -fPIE -fno-plt @gol
 -fno-jump-tables @gol
 -frecord-gcc-switches @gol
 -freg-struct-return  -fshort-enums @gol
@@ -23615,6 +23615,16 @@  used during linking.
 @code{__pie__} and @code{__PIE__}.  The macros have the value 1
 for @option{-fpie} and 2 for @option{-fPIE}.
 
+@item -fno-plt
+@opindex fno-plt
+Do not use PLT for external function calls in position-independent code.
+Instead, load callee address at call site from GOT and branch to it.
+This leads to more efficient code by eliminating PLT stubs and exposing
+GOT load to optimizations.  On architectures such as 32-bit x86 where
+PLT stubs expect GOT pointer in a specific register, this gives more
+register allocation freedom to the compiler.  Lazy binding requires PLT:
+with @option{-fno-plt} all external symbols are resolved at load time.
+
 @item -fno-jump-tables
 @opindex fno-jump-tables
 Do not use jump tables for switch statements even where it would be