Message ID | 777706000.8424762.1401128436051.JavaMail.zimbra@redhat.com |
---|---|
State | New |
Headers | show |
On Mon, May 26, 2014 at 02:20:36PM -0400, Kai Tietz wrote: > --- i386.c (revision 210936) > +++ i386.c (working copy) > @@ -5298,6 +5298,12 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) > decl_or_type = type; > } > > + /* We need to reject stdarg-function for x86_64 ABI as accumulator > + is used as argument. */ > + if (TARGET_64BIT && stdarg_p (type) > + && ix86_function_type_abi (type) == SYSV_ABI) > + return false; > + > /* Check that the return value locations are the same. Like > if we are returning floats on the 80387 register stack, we cannot > make a sibcall from a function that doesn't return a float to a > @@ -24916,8 +24922,19 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call > ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode) > : !call_insn_operand (XEXP (fnaddr, 0), word_mode)) > { > + rtx r; > fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1); > - fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr)); > + if (!sibcall) > + r = copy_to_mode_reg (word_mode, fnaddr); > + else > + { > + r = gen_rtx_REG (word_mode, AX_REG); > + if (! general_operand (fnaddr, VOIDmode)) > + fnaddr = force_operand (fnaddr, r); Wrong formatting. > + if (fnaddr != r) > + emit_move_insn (r, fnaddr); > + } > + fnaddr = gen_rtx_MEM (QImode, r); > } > > call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1); In any case, I still can't understand how limiting the choices of the register allocator can improve code rather than making it worse. If the accumulator is available there, why doesn't the RA choose it if it is beneficial? And why aren't other registers similarly suitable for that? Say r10, r11... Jakub
----- Original Message ----- > On Mon, May 26, 2014 at 02:20:36PM -0400, Kai Tietz wrote: > > --- i386.c (revision 210936) > > +++ i386.c (working copy) > > @@ -5298,6 +5298,12 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) > > decl_or_type = type; > > } > > > > + /* We need to reject stdarg-function for x86_64 ABI as accumulator > > + is used as argument. */ > > + if (TARGET_64BIT && stdarg_p (type) > > + && ix86_function_type_abi (type) == SYSV_ABI) > > + return false; > > + > > /* Check that the return value locations are the same. Like > > if we are returning floats on the 80387 register stack, we cannot > > make a sibcall from a function that doesn't return a float to a > > @@ -24916,8 +24922,19 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call > > ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode) > > : !call_insn_operand (XEXP (fnaddr, 0), word_mode)) > > { > > + rtx r; > > fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1); > > - fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr)); > > + if (!sibcall) > > + r = copy_to_mode_reg (word_mode, fnaddr); > > + else > > + { > > + r = gen_rtx_REG (word_mode, AX_REG); > > + if (! general_operand (fnaddr, VOIDmode)) > > + fnaddr = force_operand (fnaddr, r); > > Wrong formatting. Thanks for the heads up. Fix the superflous tab. > > + if (fnaddr != r) > > + emit_move_insn (r, fnaddr); > > + } > > + fnaddr = gen_rtx_MEM (QImode, r); > > } > > > > call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1); > > In any case, I still can't understand how limiting the choices of the > register allocator can improve code rather than making it worse. > If the accumulator is available there, why doesn't the RA choose it > if it is beneficial? And why aren't other registers similarly suitable for > that? Say r10, r11... I don't see it as limiting. The intend of this is more to have fixed patterns on epilogue. And in fact is accumulator that register which can be used as scratch-register for all i386-targets. Beside for varardic-functions, which anyway aren't any good candidates for sibling-call-optimization (on x86_64 due ABI). Well, for x86_64 ABI we might could consider to use R11_REG instead of AX_REG. Is there any advantage in special-case for x86_64 ABI? The R10-register isn't a good choice due it might be used as drap-register and therefore can't be loaded before epilogue gets destroyed. > Jakub > Kai
On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote: > > In any case, I still can't understand how limiting the choices of the > > register allocator can improve code rather than making it worse. > > If the accumulator is available there, why doesn't the RA choose it > > if it is beneficial? And why aren't other registers similarly suitable for > > that? Say r10, r11... > > I don't see it as limiting. The intend of this is more to have fixed > patterns on epilogue. And in fact is accumulator that register which can > be used as scratch-register for all i386-targets. Beside for > varardic-functions, which anyway aren't any good candidates for > sibling-call-optimization (on x86_64 due ABI). Well, for x86_64 ABI we > might could consider to use R11_REG instead of AX_REG. Is there any > advantage in special-case for x86_64 ABI? The R10-register isn't a good > choice due it might be used as drap-register and therefore can't be loaded > before epilogue gets destroyed. It is limiting. If r11/rax and often also r10 can be chosen, telling the RA it can only choose rax is a limitation. Can you show some testcase where your patch is actually beneficial? We should analyze why the RA made that choice. Jakub
----- Original Message ----- > On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote: > > > In any case, I still can't understand how limiting the choices of the > > > register allocator can improve code rather than making it worse. > > > If the accumulator is available there, why doesn't the RA choose it > > > if it is beneficial? And why aren't other registers similarly suitable > > > for > > > that? Say r10, r11... > > > > I don't see it as limiting. The intend of this is more to have fixed > > patterns on epilogue. And in fact is accumulator that register which can > > be used as scratch-register for all i386-targets. Beside for > > varardic-functions, which anyway aren't any good candidates for > > sibling-call-optimization (on x86_64 due ABI). Well, for x86_64 ABI we > > might could consider to use R11_REG instead of AX_REG. Is there any > > advantage in special-case for x86_64 ABI? The R10-register isn't a good > > choice due it might be used as drap-register and therefore can't be loaded > > before epilogue gets destroyed. > > It is limiting. If r11/rax and often also r10 can be chosen, telling the RA > it can only choose rax is a limitation. No, it isn't. For sure. The code-branch choosing the accu to call after epilogue isn't used as memories for sibling-calls aren't allowed. This code-branch will get active with my other sibling-tail-call patch. > Can you show some testcase where your patch is actually beneficial? We > should analyze why the RA made that choice. So it is obvious I can't provide you samples you asked for. Nevertheless I am pretty interested to see a sample by you (with activated sibling-tail-call memories) which chooses for tail-call-register for memory something else then accu. > Jakub > Kai
On 05/26/14 14:32, Kai Tietz wrote: > ----- Original Message ----- >> On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote: >>>> In any case, I still can't understand how limiting the choices >>>> of the register allocator can improve code rather than making >>>> it worse. If the accumulator is available there, why doesn't >>>> the RA choose it if it is beneficial? And why aren't other >>>> registers similarly suitable for that? Say r10, r11... >>> >>> I don't see it as limiting. The intend of this is more to have >>> fixed patterns on epilogue. And in fact is accumulator that >>> register which can be used as scratch-register for all >>> i386-targets. Beside for varardic-functions, which anyway aren't >>> any good candidates for sibling-call-optimization (on x86_64 due >>> ABI). Well, for x86_64 ABI we might could consider to use >>> R11_REG instead of AX_REG. Is there any advantage in >>> special-case for x86_64 ABI? The R10-register isn't a good >>> choice due it might be used as drap-register and therefore can't >>> be loaded before epilogue gets destroyed. >> >> It is limiting. If r11/rax and often also r10 can be chosen, >> telling the RA it can only choose rax is a limitation. > > No, it isn't. For sure. The code-branch choosing the accu to call > after epilogue isn't used as memories for sibling-calls aren't > allowed. This code-branch will get active with my other > sibling-tail-call patch. But the value we want may be sitting around in a convenient register (such as r11). So if we force the sibcall to use %rax, then we have to generate a copy. Yet if we have a constraint for the set of registers allowed here, then we give the register allocator the opportunity to coalesce away the copies. Jeff
Ok, so just the part ok the patch remains to prevent varardic-functions being a sibling-tail-call remains. Kai
On 05/27/2014 08:39 AM, Jeff Law wrote: > But the value we want may be sitting around in a convenient register (such as > r11). So if we force the sibcall to use %rax, then we have to generate a > copy. Yet if we have a constraint for the set of registers allowed here, then > we give the register allocator the opportunity to coalesce away the copies. We do have an existing call-clobbered register class that we currently do use for indirect sibcalls. The problem that Kai is trying to work around is that he doesn't want to load the value into a register, but its address. But the register allocator only has BASE_REG_CLASS and INDEX_REG_CLASS, and has no possibility to force the address components into registers that are call clobbered. My thinking is that, with few exceptions, this doesn't help anything. If we have to load the full address into a register: lea ofs(base, index, scale), %eax ... call *0(%eax) we might as well include the memory load mov ofs(base, index, scale), %eax ... call *%eax As far as I can see, the only time we can actually save an instruction is when the address is fully symbolic: mov foo, %eax ... call *%eax becomes call *foo r~
On 05/27/14 10:22, Richard Henderson wrote: > On 05/27/2014 08:39 AM, Jeff Law wrote: >> But the value we want may be sitting around in a convenient register (such as >> r11). So if we force the sibcall to use %rax, then we have to generate a >> copy. Yet if we have a constraint for the set of registers allowed here, then >> we give the register allocator the opportunity to coalesce away the copies. > > We do have an existing call-clobbered register class that we currently do use > for indirect sibcalls. > > The problem that Kai is trying to work around is that he doesn't want to load > the value into a register, but its address. But the register allocator only > has BASE_REG_CLASS and INDEX_REG_CLASS, and has no possibility to force the > address components into registers that are call clobbered. > > My thinking is that, with few exceptions, this doesn't help anything. If we > have to load the full address into a register: > > lea ofs(base, index, scale), %eax > ... > call *0(%eax) > > we might as well include the memory load > > mov ofs(base, index, scale), %eax > ... > call *%eax Ok. My misunderstanding. Granted, this probably doesn't happen enough to matter, but isn't it likely profitable from a pipeline standpoint to go ahead and do the memory load separate from the indirect call/jump as well? > > As far as I can see, the only time we can actually save an instruction is when > the address is fully symbolic: > > mov foo, %eax > ... > call *%eax > > becomes > > call *foo Noted. jeff
On 05/27/2014 09:48 AM, Jeff Law wrote: >> lea ofs(base, index, scale), %eax >> ... >> call *0(%eax) >> >> we might as well include the memory load >> >> mov ofs(base, index, scale), %eax >> ... >> call *%eax > Ok. My misunderstanding. > > Granted, this probably doesn't happen enough to matter, but isn't it likely > profitable from a pipeline standpoint to go ahead and do the memory load > separate from the indirect call/jump as well? I'm sure. Especially if the data is currently living in L2, and the scheduler has enough room to move the load sufficiently early to hide the latency. As I've said before, I think this is interesting solely as a space optimization. r~
Index: i386.c =================================================================== --- i386.c (revision 210936) +++ i386.c (working copy) @@ -5298,6 +5298,12 @@ ix86_function_ok_for_sibcall (tree decl, tree exp) decl_or_type = type; } + /* We need to reject stdarg-function for x86_64 ABI as accumulator + is used as argument. */ + if (TARGET_64BIT && stdarg_p (type) + && ix86_function_type_abi (type) == SYSV_ABI) + return false; + /* Check that the return value locations are the same. Like if we are returning floats on the 80387 register stack, we cannot make a sibcall from a function that doesn't return a float to a @@ -24916,8 +24922,19 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode) : !call_insn_operand (XEXP (fnaddr, 0), word_mode)) { + rtx r; fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1); - fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr)); + if (!sibcall) + r = copy_to_mode_reg (word_mode, fnaddr); + else + { + r = gen_rtx_REG (word_mode, AX_REG); + if (! general_operand (fnaddr, VOIDmode)) + fnaddr = force_operand (fnaddr, r); + if (fnaddr != r) + emit_move_insn (r, fnaddr); + } + fnaddr = gen_rtx_MEM (QImode, r); } call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1);