diff mbox

Expand PIC calls without PLT with -fno-plt

Message ID 1430757479-14241-6-git-send-email-amonakov@ispras.ru
State New

Commit Message

Alexander Monakov May 4, 2015, 4:37 p.m. UTC
This patch introduces option -fno-plt that allows to expand calls that would
go via PLT to load the address of the function immediately at call site (which
introduces a GOT load).  Cover letter explains the motivation for this patch.

New option documentation for invoke.texi is missing from the patch; if this is
accepted I'll be happy to send a v2 with documentation added.

	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
	indirect call by forcing address into a pseudo with -fno-plt.
	* common.opt (flag_plt): New option.

Comments

Jeff Law May 4, 2015, 5:34 p.m. UTC | #1
On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
>
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
>
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.
OK once you cobble together the invoke.texi changes.

Jeff
Jakub Jelinek May 4, 2015, 5:39 p.m. UTC | #2
On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> >This patch introduces option -fno-plt that allows to expand calls that would
> >go via PLT to load the address of the function immediately at call site (which
> >introduces a GOT load).  Cover letter explains the motivation for this patch.
> >
> >New option documentation for invoke.texi is missing from the patch; if this is
> >accepted I'll be happy to send a v2 with documentation added.
> >
> >	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> >	indirect call by forcing address into a pseudo with -fno-plt.
> >	* common.opt (flag_plt): New option.
> OK once you cobble together the invoke.texi changes.

Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
inline the plt slot's first part, then lazy binding will work fine.

	Jakub
Jeff Law May 4, 2015, 5:42 p.m. UTC | #3
On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
>> On 05/04/2015 10:37 AM, Alexander Monakov wrote:
>>> This patch introduces option -fno-plt that allows to expand calls that would
>>> go via PLT to load the address of the function immediately at call site (which
>>> introduces a GOT load).  Cover letter explains the motivation for this patch.
>>>
>>> New option documentation for invoke.texi is missing from the patch; if this is
>>> accepted I'll be happy to send a v2 with documentation added.
>>>
>>> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>>> 	indirect call by forcing address into a pseudo with -fno-plt.
>>> 	* common.opt (flag_plt): New option.
>> OK once you cobble together the invoke.texi changes.
>
> Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> inline the plt slot's first part, then lazy binding will work fine.
I must have missed Alan/Michael's message.

ISTM the win here is that by going through the GOT, you can CSE the GOT 
reference and possibly get some more register allocation freedom.  Is 
that still the case with Alan/Michael's approach?

jeff
Rich Felker May 6, 2015, 3:08 a.m. UTC | #4
On Mon, May 04, 2015 at 11:42:20AM -0600, Jeff Law wrote:
> On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> >On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> >>On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> >>>This patch introduces option -fno-plt that allows to expand calls that would
> >>>go via PLT to load the address of the function immediately at call site (which
> >>>introduces a GOT load).  Cover letter explains the motivation for this patch.
> >>>
> >>>New option documentation for invoke.texi is missing from the patch; if this is
> >>>accepted I'll be happy to send a v2 with documentation added.
> >>>
> >>>	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> >>>	indirect call by forcing address into a pseudo with -fno-plt.
> >>>	* common.opt (flag_plt): New option.
> >>OK once you cobble together the invoke.texi changes.
> >
> >Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> >inline the plt slot's first part, then lazy binding will work fine.
> I must have missed Alan/Michael's message.
> 
> ISTM the win here is that by going through the GOT, you can CSE the
> GOT reference and possibly get some more register allocation
> freedom.  Is that still the case with Alan/Michael's approach?

There are many advantages to 'going through the GOT'. CSE'ing the
reference is just one. The biggest (IMO) is that you can avoid the bad
PLT ABI that most targets have, where making a call to a PLT slot
requires the GOT address to be pre-loaded into a fixed, call-saved
register. This precludes sibcalls and forces many functions which
otherwise would not need their own stack frames to create one for
saving the old value of the GOT register. See my blog entry on the
topic here: http://ewontfix.com/18/

Anyone who really wants lazy binding can use -fplt (which is
presumably still the default; I didn't check) but lazy binding should
largely be considered deprecated anyway since effective use of relro
protection requires -z now too, in which case you're paying all the
costs (which are considerable!) for lazy binding support even though
you won't get it.

Rich
Jan Hubicka May 10, 2015, 4:59 p.m. UTC | #5
> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
> 
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
> 
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.
> 
> diff --git a/gcc/common.opt b/gcc/common.opt
> index b49ac46..cd8b256 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>  Generate position-independent code if possible (small mode)
>  
>  fpie
>  Common Report Var(flag_pie,1) Negative(fPIC)
>  Generate position-independent code for executables if possible (small mode)
>  
> +fplt
> +Common Report Var(flag_plt) Init(1)
> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
> +

This won't play well with LTO since fplt will become another global flag while
it affects codegen.

I still have not caught up with the other thread and H.J.'s work on doing this
transparently in the linker, but if this is getting in, please add Optimization to
fplt, so the PLT usage can be decided with per-function granularity.

Honza
Jan Hubicka May 10, 2015, 5:07 p.m. UTC | #6
> On Mon, May 04, 2015 at 11:42:20AM -0600, Jeff Law wrote:
> > On 05/04/2015 11:39 AM, Jakub Jelinek wrote:
> > >On Mon, May 04, 2015 at 11:34:05AM -0600, Jeff Law wrote:
> > >>On 05/04/2015 10:37 AM, Alexander Monakov wrote:
> > >>>This patch introduces option -fno-plt that allows to expand calls that would
> > >>>go via PLT to load the address of the function immediately at call site (which
> > >>>introduces a GOT load).  Cover letter explains the motivation for this patch.
> > >>>
> > >>>New option documentation for invoke.texi is missing from the patch; if this is
> > >>>accepted I'll be happy to send a v2 with documentation added.
> > >>>
> > >>>	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> > >>>	indirect call by forcing address into a pseudo with -fno-plt.
> > >>>	* common.opt (flag_plt): New option.
> > >>OK once you cobble together the invoke.texi changes.
> > >
> > >Isn't what Michael/Alan suggested better?  I mean as/ld/compiler changes to
> > >inline the plt slot's first part, then lazy binding will work fine.
> > I must have missed Alan/Michael's message.
> > 
> > ISTM the win here is that by going through the GOT, you can CSE the
> > GOT reference and possibly get some more register allocation
> > freedom.  Is that still the case with Alan/Michael's approach?
> 
> There are many advantages to 'going through the GOT'. CSE'ing the
> reference is just one. The biggest (IMO) is that you can avoid the bad
> PLT ABI that most targets have, where making a call to a PLT slot
> requires the GOT address to be pre-loaded into a fixed, call-saved
> register. This precludes sibcalls and forces many functions which
> otherwise would not need their own stack frames to create one for
> saving the old value of the GOT register. See my blog entry on the
> topic here: http://ewontfix.com/18/

One common pattern I noticed while looking at codegen for speculative devirtualization
is that in case we do not inline the virtual call we end up with

if (ptr == &foo)
  foo()

which leads to both a GOT lookup to figure out the address of foo and a call across the PLT.
It would be nice to handle this gracefully.

Note that one of the improvements I want to make to the devirt machinery is to change
the code sequence to:

 if (vptr == &expected_vtable)
   foo ()
 else
   vptr[token]();

To save the vtable lookup. But this is not possible in all cases - it happens
that there are multiple predicted vtables all agreeing on the particular slot.

Honza
> 
> Anyone who really wants lazy binding can use -fplt (which is
> presumably still the default; I didn't check) but lazy binding should
> largely be considered deprecated anyway since effective use of relro
> protection requires -z now too, in which case you're paying all the
> costs (which are considerable!) for lazy binding support even though
> you won't get it.
> 
> Rich
Jeff Law May 11, 2015, 8:36 p.m. UTC | #7
On 05/10/2015 10:59 AM, Jan Hubicka wrote:
>> This patch introduces option -fno-plt that allows to expand calls that would
>> go via PLT to load the address of the function immediately at call site (which
>> introduces a GOT load).  Cover letter explains the motivation for this patch.
>>
>> New option documentation for invoke.texi is missing from the patch; if this is
>> accepted I'll be happy to send a v2 with documentation added.
>>
>> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
>> 	indirect call by forcing address into a pseudo with -fno-plt.
>> 	* common.opt (flag_plt): New option.
>>
>> diff --git a/gcc/common.opt b/gcc/common.opt
>> index b49ac46..cd8b256 100644
>> --- a/gcc/common.opt
>> +++ b/gcc/common.opt
>> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>>   Generate position-independent code if possible (small mode)
>>
>>   fpie
>>   Common Report Var(flag_pie,1) Negative(fPIC)
>>   Generate position-independent code for executables if possible (small mode)
>>
>> +fplt
>> +Common Report Var(flag_plt) Init(1)
>> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
>> +
>
> This won't play well with LTO since fplt will become another global flag while
> it affects codegen.
I know Richi explained this to me in the past, but I can't remember the 
details of why this is bad.  Can you walk me through it again?

jeff
H.J. Lu May 11, 2015, 8:55 p.m. UTC | #8
On Mon, May 11, 2015 at 1:36 PM, Jeff Law <law@redhat.com> wrote:
> On 05/10/2015 10:59 AM, Jan Hubicka wrote:
>>>
>>> This patch introduces option -fno-plt that allows to expand calls that
>>> would
>>> go via PLT to load the address of the function immediately at call site
>>> (which
>>> introduces a GOT load).  Cover letter explains the motivation for this
>>> patch.
>>>
>>> New option documentation for invoke.texi is missing from the patch; if
>>> this is
>>> accepted I'll be happy to send a v2 with documentation added.
>>>
>>>         * calls.c (prepare_call_address): Transform PLT call to GOT
>>> lookup and
>>>         indirect call by forcing address into a pseudo with -fno-plt.
>>>         * common.opt (flag_plt): New option.
>>>
>>> diff --git a/gcc/common.opt b/gcc/common.opt
>>> index b49ac46..cd8b256 100644
>>> --- a/gcc/common.opt
>>> +++ b/gcc/common.opt
>>> @@ -1773,12 +1773,16 @@ Common Report Var(flag_pic,1) Negative(fpie)
>>>   Generate position-independent code if possible (small mode)
>>>
>>>   fpie
>>>   Common Report Var(flag_pie,1) Negative(fPIC)
>>>   Generate position-independent code for executables if possible (small
>>> mode)
>>>
>>> +fplt
>>> +Common Report Var(flag_plt) Init(1)
>>> +Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
>>> +
>>
>>
>> This won't play well with LTO since fplt will become another global flag
>> while
>> it affects codegen.
>
> I know Richi explained this to me in the past, but I can't remember the
> details of why this is bad.  Can you walk me through it again?
>

I have proposed a different approach:

https://gcc.gnu.org/ml/gcc/2015-05/msg00086.html
Jan Hubicka May 11, 2015, 9:59 p.m. UTC | #9
> >> This won't play well with LTO since fplt will become another global flag
> >> while
> >> it affects codegen.
> >
> > I know Richi explained this to me in the past, but I can't remember the
> > details of why this is bad.  Can you walk me through it again?
> >
> 
> I have proposed a different approach:
> 
> https://gcc.gnu.org/ml/gcc/2015-05/msg00086.html

The RELAX_PC* approach indeed looks interesting (I still need to catch up
with the thread), but to answer Jeff's question:
With LTO we need to handle stuff like
gcc a.c -fplt -flto -c -O2
gcc b.c -fno-plt -flto -c -Os
gcc a.o b.o -flto

and generally we would like to mimic as closely as possible what happens with
non-LTO builds. That is, functions originating from a.c should be -O2 optimized
with PLT and functions from b.c should be size optimized w/o PLT (which makes
cross-module inlining fun).
To do so we now attach an implicit optimization/target node to each function that
stores the flags used to build the unit.  Optimization nodes contain only
those flags that are defined as Optimization.

So in general if we have a flag that is about function codegen and we are able
to produce functions with different values of the flag in one unit, we really
want to mark it as Optimization (and decide what we want to do about inlining
across the flag boundary). Not all flags work like this; for example -fPIC is
a global flag, and then there is Richi's code in lto-wrapper that reads those
options from all .o files first and somehow chooses the prevailing one for the
whole program.

In the longer term we want to eliminate as many of those global flags as possible
(for example -m32 can stay global as you cannot mix it with -m64) and also
to explicitly represent some of the flags in IL, so inlining across boundaries
works as expected.

Honza
> 
> 
> -- 
> H.J.
Jiong Wang June 22, 2015, 3:51 p.m. UTC | #10
On 04/05/15 17:37, Alexander Monakov wrote:
> This patch introduces option -fno-plt that allows to expand calls that would
> go via PLT to load the address of the function immediately at call site (which
> introduces a GOT load).  Cover letter explains the motivation for this patch.
>
> New option documentation for invoke.texi is missing from the patch; if this is
> accepted I'll be happy to send a v2 with documentation added.
>
> 	* calls.c (prepare_call_address): Transform PLT call to GOT lookup and
> 	indirect call by forcing address into a pseudo with -fno-plt.
> 	* common.opt (flag_plt): New option.

I have done a quick experiment: -fno-plt doesn't work on AArch64.

Although this patch forces the function address into a register, the combine
pass, which runs later, combines it back into a direct call, because AArch64
defines such an insn pattern.

On x86 it is not combined back. From the RTL dump, this is because the RTL PRE
pass has moved the address load instruction into another basic block, and the
combine pass doesn't combine across basic blocks. Also, the x86 backend checks
flag_plt in the newly added ix86_nopic_noplt_attribute_p, which could help
generate correct insns.

The fix I can think of for AArch64 is to allow the call-symbol pattern only
under "flag_plt == true", so that a call via register can't be combined into a
direct call to the symbol.

Or would it be better to prohibit the combine pass from doing such combining?
A generic fix in combine may fix other broken targets as well.

Thoughts?

Regards,
Jiong

Patch

diff --git a/gcc/calls.c b/gcc/calls.c
index 970415d..0c3b9aa 100644
--- a/gcc/calls.c
+++ b/gcc/calls.c
@@ -222,12 +222,18 @@  prepare_call_address (tree fndecl_or_type, rtx funexp, rtx static_chain_value,
     /* If we are using registers for parameters, force the
        function address into a register now.  */
     funexp = ((reg_parm_seen
 	       && targetm.small_register_classes_for_mode_p (FUNCTION_MODE))
 	      ? force_not_mem (memory_address (FUNCTION_MODE, funexp))
 	      : memory_address (FUNCTION_MODE, funexp));
+  else if (flag_pic && !flag_plt && fndecl_or_type
+	   && TREE_CODE (fndecl_or_type) == FUNCTION_DECL
+	   && !targetm.binds_local_p (fndecl_or_type))
+    {
+      funexp = force_reg (Pmode, funexp);
+    }
   else if (! sibcallp)
     {
 #ifndef NO_FUNCTION_CSE
       if (optimize && ! flag_no_function_cse)
 	funexp = force_reg (Pmode, funexp);
 #endif
diff --git a/gcc/common.opt b/gcc/common.opt
index b49ac46..cd8b256 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1773,12 +1773,16 @@  Common Report Var(flag_pic,1) Negative(fpie)
 Generate position-independent code if possible (small mode)
 
 fpie
 Common Report Var(flag_pie,1) Negative(fPIC)
 Generate position-independent code for executables if possible (small mode)
 
+fplt
+Common Report Var(flag_plt) Init(1)
+Use PLT for PIC calls (-fno-plt: load the address from GOT at call site)
+
 fplugin=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Specify a plugin to load
 
 fplugin-arg-
 Common Joined RejectNegative Var(common_deferred_options) Defer