diff mbox

[RFC,X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=

Message ID CAAs8HmycgqrcMGq9P1uXTX-2fcLKUfTT4BLDxj7ov+47jjfsHQ@mail.gmail.com
State New
Headers show

Commit Message

Sriraman Tallam May 21, 2015, 9:31 p.m. UTC
On Thu, May 21, 2015 at 2:12 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> On Sun, May 10, 2015 at 10:01 AM, Sriraman Tallam <tmsriram@google.com> wrote:
>>
>> On Sun, May 10, 2015, 8:19 AM H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>> On Sat, May 9, 2015 at 9:34 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>> On Mon, May 4, 2015 at 7:45 AM, Michael Matz <matz@suse.de> wrote:
>>>> Hi,
>>>>
>>>> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>>>>
>>>>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated
>>>>> PLT stubs for some of the hot external library functions like memcmp,
>>>>> pow.  The win was from better icache and itlb performance. The main
>>>>> reason was that the PLT stubs had no spatial locality with the
>>>>> call-sites. I have started looking at ways to tell the compiler to
>>>>> eliminate PLT stubs (in-effect inline them) for specified external
>>>>> functions, for x86_64. I have a proposal and a patch and I would like to
>>>>> hear what you think.
>>>>>
>>>>> This comes with caveats.  This cannot be generally done for all
>>>>> functions marked extern as it is impossible for the compiler to say if a
>>>>> function is "truly extern" (defined in a shared library). If a function
>>>>> is not truly extern(ends up defined in the final executable), then
>>>>> calling it indirectly is a performance penalty as it could have been a
>>>>> direct call.
>>>>
>>>> This can be fixed by Alans idea.
>>>>
>>>>> Further, the newly created GOT entries are fixed up at
>>>>> start-up and do not get lazily bound.
>>>>
>>>> And this can be fixed by some enhancements in the linker and dynamic
>>>> linker.  The idea is to still generate a PLT stub and make its GOT entry
>>>> point to it initially (like a normal got.plt slot).  Then the first
>>>> indirect call will use the address of PLT entry (starting lazy
>>>> resolution)
>>>> and update the GOT slot with the real address, so further indirect calls
>>>> will directly go to the function.
>>>>
>>>> This requires a new asm marker (and hence new reloc) as normally if
>>>> there's a GOT slot it's filled by the real symbols address, unlike if
>>>> there's only a got.plt slot.  E.g. a
>>>>
>>>>   call *foo@GOTPLT(%rip)
>>>>
>>>> would generate a GOT slot (and fill its address into above call insn),
>>>> but
>>>> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>>>>
>>>
>>> I added the "relax" prefix support to x86 assembler on users/hjl/relax
>>> branch
>>>
>>> at
>>>
>>> https://sourceware.org/git/?p=binutils-gdb.git;a=summary
>>>
>>> [hjl@gnu-tools-1 relax-3]$ cat r.S
>>> .text
>>> relax jmp foo
>>> relax call foo
>>> relax jmp foo@plt
>>> relax call foo@plt
>>> [hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
>>> [hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o
>>>
>>> r.o:     file format elf64-x86-64
>>>
>>>
>>> Disassembly of section .text:
>>>
>>> 0000000000000000 <.text>:
>>>    0: 66 e9 00 00 00 00     data16 jmpq 0x6 2: R_X86_64_RELAX_PC32 foo-0x4
>>>    6: 66 e8 00 00 00 00     data16 callq 0xc 8: R_X86_64_RELAX_PC32
>>> foo-0x4
>>>    c: 66 e9 00 00 00 00     data16 jmpq 0x12 e:
>>> R_X86_64_RELAX_PLT32foo-0x4
>>>   12: 66 e8 00 00 00 00     data16 callq 0x18 14:
>>> R_X86_64_RELAX_PLT32foo-0x4
>>> [hjl@gnu-tools-1 relax-3]$
>>>
>>> Right now, the relax relocations are treated as PC32/PLT32 relocations.
>>> I am working on linker support.
>>>
>>
>> I implemented the linker support for x86-64:
>>
>> 00000000 <main>:
>>    0: 48 83 ec 08           sub    $0x8,%rsp
>>    4: e8 00 00 00 00       callq  9 <main+0x9> 5: R_X86_64_PC32 plt-0x4
>>    9: e8 00 00 00 00       callq  e <main+0xe> a: R_X86_64_PLT32 plt-0x4
>>    e: e8 00 00 00 00       callq  13 <main+0x13> f: R_X86_64_PC32 bar-0x4
>>   13: 66 e8 00 00 00 00     data16 callq 19 <main+0x19> 15:
>> R_X86_64_RELAX_PC32 bar-0x4
>>   19: 66 e8 00 00 00 00     data16 callq 1f <main+0x1f> 1b:
>> R_X86_64_RELAX_PLT32 bar-0x4
>>   1f: 66 e8 00 00 00 00     data16 callq 25 <main+0x25> 21:
>> R_X86_64_RELAX_PC32 foo-0x4
>>   25: 66 e8 00 00 00 00     data16 callq 2b <main+0x2b> 27:
>> R_X86_64_RELAX_PLT32 foo-0x4
>>   2b: 31 c0                 xor    %eax,%eax
>>   2d: 48 83 c4 08           add    $0x8,%rsp
>>   31: c3                   retq
>>
>> 00400460 <main>:
>>   400460: 48 83 ec 08           sub    $0x8,%rsp
>>   400464: e8 d7 ff ff ff       callq  400440 <plt@plt>
>>   400469: e8 d2 ff ff ff       callq  400440 <plt@plt>
>>   40046e: e8 ad ff ff ff       callq  400420 <bar@plt>
>>   400473: ff 15 ff 03 20 00     callq  *0x2003ff(%rip)        # 600878
>> <_DYNAMIC+0xf8>
>>   400479: ff 15 f9 03 20 00     callq  *0x2003f9(%rip)        # 600878
>> <_DYNAMIC+0xf8>
>>   40047f: 66 e8 f3 00 00 00     data16 callq 400578 <foo>
>>   400485: 66 e8 ed 00 00 00     data16 callq 400578 <foo>
>>   40048b: 31 c0                 xor    %eax,%eax
>>   40048d: 48 83 c4 08           add    $0x8,%rsp
>>   400491: c3                   retq
>>
>> Sriraman, can you give it a try?
>
>
> I like HJ's proposal here and it is important that the linker fixes
> unnecessary indirect calls to direct ones.
>
> However, independently I think my original proposal is still useful
> and I want to pitch it again for the following reasons.
>
> AFAIU, Alexander Monakov's -fno-plt does not solve the following:
>
> * Does not do anything for non-PIC code. The compiler does not
> generate a @PLT call but the linker will route all external calls via
> PLT.  We noticed a problem with non-PIC executables where the PLT
> stubs were causing too many icache misses and are interested in a
> solution for this.
> * Aggressively uses indirect calls even if the final symbol is not
> truly external.  This needs HJ's linker patch to fix unnecessary
> indirect calls to direct calls.
>
> My original proposal, for x86_64 only, was to add
> -fno-plt=<function-name>. This lets the user decide for which
> functions PLT must be avoided.  Let the compiler always generate an
> indirect call using call *func@GOTPCREL(%rip).  We could do this for
> non-PIC code too.  No need for linker fixups since this relies on the
> user to know that func is from a shared object.
>
> I am reattaching the patch.

I also want to add that my proposal with -fno-plt=<fn_name> does not
take away anything from the newly added -fno-plt option or the linker
patch HJ proposed.  It is orthogonal and lets the user decide the
subset of external functions for which PLT must be avoided.

Thanks
Sri

>
> Thanks
> Sri
>
>
>>
>> Thanks! Will do!
>>
>> Sri
>>
>> --
>> H.J.
>>
>>
* common.opt (-fno-plt=): New option.
	* config/i386/i386.c (avoid_plt_to_call): New function.
	(ix86_output_call_insn):  Check if PLT needs to be avoided
	and call or jump indirectly if true.
	* opts-global.c (htab_str_eq): New function.
	(avoid_plt_fnsymbol_names_tab): New htab.
	(handle_common_deferred_options): Handle -fno-plt=
diff mbox

Patch

Index: common.opt
===================================================================
--- common.opt	(revision 222892)
+++ common.opt	(working copy)
@@ -1087,6 +1087,11 @@  fdbg-cnt=
 Common RejectNegative Joined Var(common_deferred_options) Defer
 -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...]	Set the debug counter limit.   
 
+fno-plt=
+Common RejectNegative Joined Var(common_deferred_options) Defer
+-fno-plt=<symbol1>  Avoid going through the PLT when calling the specified function.
+Allow multiple instances of this option with different function names.
+
 fdebug-prefix-map=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Map one directory name to another in debug information
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 222892)
+++ config/i386/i386.c	(working copy)
@@ -25282,6 +25282,25 @@  ix86_expand_call (rtx retval, rtx fnaddr, rtx call
   return call;
 }
 
+extern htab_t avoid_plt_fnsymbol_names_tab;
+/* If the function referenced by call_op is to a external function
+   and calls via PLT must be avoided as specified by -fno-plt=, then
+   return true.  */
+
+static int
+avoid_plt_to_call(rtx call_op)
+{
+  const char *name;
+  if (GET_CODE (call_op) != SYMBOL_REF
+      || SYMBOL_REF_LOCAL_P (call_op)
+      || avoid_plt_fnsymbol_names_tab == NULL)
+    return 0;
+  name = XSTR (call_op, 0);
+  if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL)
+    return 1;
+  return 0;
+}
+
 /* Output the assembly for a call instruction.  */
 
 const char *
@@ -25294,7 +25313,12 @@  ix86_output_call_insn (rtx insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-	xasm = "jmp\t%P0";
+	{
+	  if (avoid_plt_to_call (call_op))
+	    xasm = "jmp\t*%p0@GOTPCREL(%%rip)";
+	  else
+	    xasm = "jmp\t%P0";
+	}
       /* SEH epilogue detection requires the indirect branch case
 	 to include REX.W.  */
       else if (TARGET_SEH)
@@ -25346,9 +25370,15 @@  ix86_output_call_insn (rtx insn, rtx call_op)
     }
 
   if (direct_p)
-    xasm = "call\t%P0";
+    {
+      if (avoid_plt_to_call (call_op))
+        xasm = "call\t*%p0@GOTPCREL(%%rip)";
+      else
+        xasm = "call\t%P0";
+    }
   else
     xasm = "call\t%A0";
+ 
 
   output_asm_insn (xasm, &call_op);
 
Index: opts-global.c
===================================================================
--- opts-global.c	(revision 222892)
+++ opts-global.c	(working copy)
@@ -47,6 +47,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "xregex.h"
 #include "attribs.h"
 #include "stringpool.h"
+#include "hash-table.h"
 
 typedef const char *const_char_p; /* For DEF_VEC_P.  */
 
@@ -420,6 +421,17 @@  decode_options (struct gcc_options *opts, struct g
   finish_options (opts, opts_set, loc);
 }
 
+/* Helper function for the hash table that compares the
+   existing entry (S1) with the given string (S2).  */
+
+static int
+htab_str_eq (const void *s1, const void *s2)
+{
+  return !strcmp ((const char *)s1, (const char *) s2);
+}
+
+htab_t avoid_plt_fnsymbol_names_tab = NULL;
+
 /* Process common options that have been deferred until after the
    handlers have been called for all options.  */
 
@@ -539,6 +551,15 @@  handle_common_deferred_options (void)
 	  stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg));
 	  break;
 
+        case OPT_fno_plt_:
+	  void **slot;
+	  if (avoid_plt_fnsymbol_names_tab == NULL)
+	    avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string,
+							htab_str_eq, NULL);
+          slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT);
+          *slot = (void *)opt->arg;
+          break;
+
 	default:
 	  gcc_unreachable ();
 	}