Message ID | 20200312024850.GE5384@bubble.grove.modra.org |
---|---|
State | New |
Headers | show |
Series | [RS6000] make PLT loads volatile | expand |
Hi! On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote: > With lazy PLT resolution the first load of a PLT entry may be a value > pointing at a resolver stub. gcc's loop processing can result in the > PLT load in inline PLT calls being hoisted out of a loop in the > mistaken idea that this is an optimisation. It isn't. If the value > hoisted was that for a resolver stub then every call to that function > in the loop will go via the resolver, slowing things down quite > dramatically. > > The PLT really is volatile, so teach gcc about that. It would be nice if we could keep it cached after it has been resolved once, this has potential for regressing performance if we don't? And LD_BIND_NOW should keep working just as fast as it is now, too? Segher
On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote: > Hi! > > On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote: > > With lazy PLT resolution the first load of a PLT entry may be a value > > pointing at a resolver stub. gcc's loop processing can result in the > > PLT load in inline PLT calls being hoisted out of a loop in the > > mistaken idea that this is an optimisation. It isn't. If the value > > hoisted was that for a resolver stub then every call to that function > > in the loop will go via the resolver, slowing things down quite > > dramatically. > > > > The PLT really is volatile, so teach gcc about that. > > It would be nice if we could keep it cached after it has been resolved > once, this has potential for regressing performance if we don't? And > LD_BIND_NOW should keep working just as fast as it is now, too? Using a call-saved register to cache a load out of the PLT looks really silly when the inline PLT call is turned back into a direct call by the linker. You end up with an unnecessary save and restore of the register, plus copies from the register to r12. What's the chance of someone reporting that as a gcc "bug"? :-) Then there's the possibility that shortening the number of instructions between two calls of a small function runs into stalls. How can we teach gcc about these unknowns? ie. How to weight use of a call-saved register to cache PLT loads against other possible uses of that register in a loop? It's quite likely not a good use, even when gcc knows the PLT entry has been resolved.. Which means some gcc infrastructure would be needed to do this sensibly and without the necessary infrastructure, I think gcc hoisting a PLT load out of a loop should never be done.
Hi! On Fri, Mar 13, 2020 at 10:06:01AM +1030, Alan Modra wrote: > On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote: > > On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote: > > > With lazy PLT resolution the first load of a PLT entry may be a value > > > pointing at a resolver stub. gcc's loop processing can result in the > > > PLT load in inline PLT calls being hoisted out of a loop in the > > > mistaken idea that this is an optimisation. It isn't. If the value > > > hoisted was that for a resolver stub then every call to that function > > > in the loop will go via the resolver, slowing things down quite > > > dramatically. > > > > > > The PLT really is volatile, so teach gcc about that. > > > > It would be nice if we could keep it cached after it has been resolved > > once, this has potential for regressing performance if we don't? And > > LD_BIND_NOW should keep working just as fast as it is now, too? > > Using a call-saved register to cache a load out of the PLT looks > really silly Who said anything about using call-saved registers? GCC will usually make a stack slot for this, and only use a non-volatile register when that is profitable. (I know it is a bit too aggressive with it, but that is a generic problem). > when the inline PLT call is turned back into a direct > call by the linker. Ah, so yeah, for direct calls we do not want this. I was thinking this was about indirect calls (via a bctrl that is), dunno how I got that misperception. Sorry. What is this like for indirect calls (at C level)? Does your patch do anything to those? Segher
On Fri, Mar 13, 2020 at 10:40:38AM -0500, Segher Boessenkool wrote: > Hi! > > On Fri, Mar 13, 2020 at 10:06:01AM +1030, Alan Modra wrote: > > On Thu, Mar 12, 2020 at 11:57:17AM -0500, Segher Boessenkool wrote: > > > On Thu, Mar 12, 2020 at 01:18:50PM +1030, Alan Modra wrote: > > > > With lazy PLT resolution the first load of a PLT entry may be a value > > > > pointing at a resolver stub. gcc's loop processing can result in the > > > > PLT load in inline PLT calls being hoisted out of a loop in the > > > > mistaken idea that this is an optimisation. It isn't. If the value > > > > hoisted was that for a resolver stub then every call to that function > > > > in the loop will go via the resolver, slowing things down quite > > > > dramatically. > > > > > > > > The PLT really is volatile, so teach gcc about that. > > > > > > It would be nice if we could keep it cached after it has been resolved > > > once, this has potential for regressing performance if we don't? And > > > LD_BIND_NOW should keep working just as fast as it is now, too? > > > > Using a call-saved register to cache a load out of the PLT looks > > really silly > > Who said anything about using call-saved registers? GCC will usually > make a stack slot for this, and only use a non-volatile register when > that is profitable. (I know it is a bit too aggressive with it, but > that is a generic problem). Using a stack slot comes about due to hoisting then running out of call-saved registers in the loop. Score another reason not to hoist PLT loads. > > when the inline PLT call is turned back into a direct > > call by the linker. > > Ah, so yeah, for direct calls we do not want this. I was thinking this > was about indirect calls (via a bctrl that is), dunno how I got that > misperception. Sorry. > > What is this like for indirect calls (at C level)? Does your patch do > anything to those? No effect at all. To put your mind at rest on this point you can verify quite easily by noticing that UNSPECV_PLT* is only generated in rs6000_longcall_ref, and calls to that function are conditional on GET_CODE (func_desc) == SYMBOL_REF.
Hi! On Sat, Mar 14, 2020 at 09:30:02AM +1030, Alan Modra wrote: > On Fri, Mar 13, 2020 at 10:40:38AM -0500, Segher Boessenkool wrote: > > > Using a call-saved register to cache a load out of the PLT looks > > > really silly > > > > Who said anything about using call-saved registers? GCC will usually > > make a stack slot for this, and only use a non-volatile register when > > that is profitable. (I know it is a bit too aggressive with it, but > > that is a generic problem). > > Using a stack slot comes about due to hoisting then running out of > call-saved registers in the loop. Score another reason not to hoist > PLT loads. LRA can do this directly, without ever using call-saved registers. There are some other passes that can do rematerialisation as well. Not enough yet, but GCC does *not* use non-volatile registers to save values, unless it thinks that is cheaper (which it currently thinks too often, but that is a separate problem). > > > when the inline PLT call is turned back into a direct > > > call by the linker. > > > > Ah, so yeah, for direct calls we do not want this. I was thinking this > > was about indirect calls (via a bctrl that is), dunno how I got that > > misperception. Sorry. > > > > What is this like for indirect calls (at C level)? Does your patch do > > anything to those? > > No effect at all. To put your mind at rest on this point you can > verify quite easily by noticing that UNSPECV_PLT* is only generated in > rs6000_longcall_ref, and calls to that function are conditional on > GET_CODE (func_desc) == SYMBOL_REF. Could you please send a new patch (could be the same patch even) that is easier to review for me? With things like all of the above info in the message (describing the setting, the problem, and the solution). Or should I read the original message another ten times until it clicks? It certainly is possible this matter just is hard to grasp :-/ Thanks in advance, Segher
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 46b7dec2abd..2d6790877fa 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -19264,8 +19264,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg) if (rs6000_pcrel_p (cfun)) { rtx reg = gen_rtx_REG (Pmode, regno); - rtx u = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg), - UNSPEC_PLT_PCREL); + rtx u = gen_rtx_UNSPEC_VOLATILE (Pmode, + gen_rtvec (3, base, call_ref, arg), + UNSPECV_PLT_PCREL); emit_insn (gen_rtx_SET (reg, u)); return reg; } @@ -19284,8 +19285,9 @@ rs6000_longcall_ref (rtx call_ref, rtx arg) rtx reg = gen_rtx_REG (Pmode, regno); rtx hi = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, base, call_ref, arg), UNSPEC_PLT16_HA); - rtx lo = gen_rtx_UNSPEC (Pmode, gen_rtvec (3, reg, call_ref, arg), - UNSPEC_PLT16_LO); + rtx lo = gen_rtx_UNSPEC_VOLATILE (Pmode, + gen_rtvec (3, reg, call_ref, arg), + UNSPECV_PLT16_LO); emit_insn (gen_rtx_SET (reg, hi)); emit_insn (gen_rtx_SET (reg, lo)); return reg; diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index ad88b6783af..5a8e9de670b 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -148,8 +148,6 @@ UNSPEC_SI_FROM_SF UNSPEC_PLTSEQ UNSPEC_PLT16_HA - UNSPEC_PLT16_LO - UNSPEC_PLT_PCREL ]) ;; @@ -178,6 +176,8 @@ UNSPECV_MTFSB1 ; Set FPSCR Field bit to 1 UNSPECV_SPLIT_STACK_RETURN ; A camouflaged return UNSPECV_SPEC_BARRIER ; Speculation barrier + UNSPECV_PLT16_LO + UNSPECV_PLT_PCREL ]) ; The three different kinds of epilogue. @@ -10359,10 +10359,10 @@ (define_insn "*pltseq_plt16_lo_<mode>" [(set (match_operand:P 0 "gpc_reg_operand" "=r") - (unspec:P [(match_operand:P 1 "gpc_reg_operand" "b") - (match_operand:P 2 "symbol_ref_operand" "s") - (match_operand:P 3 "" "")] - UNSPEC_PLT16_LO))] + (unspec_volatile:P [(match_operand:P 1 "gpc_reg_operand" "b") + (match_operand:P 2 "symbol_ref_operand" "s") + (match_operand:P 3 "" "")] + UNSPECV_PLT16_LO))] "TARGET_PLTSEQ" { return rs6000_pltseq_template (operands, RS6000_PLTSEQ_PLT16_LO); @@ -10382,10 +10382,10 @@ (define_insn "*pltseq_plt_pcrel<mode>" [(set (match_operand:P 0 "gpc_reg_operand" "=r") - (unspec:P [(match_operand:P 1 "" "") - (match_operand:P 2 "symbol_ref_operand" "s") - (match_operand:P 3 "" "")] - UNSPEC_PLT_PCREL))] + (unspec_volatile:P [(match_operand:P 1 "" "") + (match_operand:P 2 "symbol_ref_operand" "s") + (match_operand:P 3 "" "")] + UNSPECV_PLT_PCREL))] "HAVE_AS_PLTSEQ && TARGET_ELF && rs6000_pcrel_p (cfun)" {