diff mbox

[i386] : Expand sibling-tail-calls via accumulator register

Message ID 777706000.8424762.1401128436051.JavaMail.zimbra@redhat.com
State New
Headers show

Commit Message

Kai Tietz May 26, 2014, 6:20 p.m. UTC
Hi,

adjusted patch to make it more bullet-proved and tested it.

2014-05-26  Kai Tietz  <ktietz@redhat.com>

	* config/i386/i386.c (ix86_expand_call): Enforce for sibcalls
	on memory the use of accumulator-register.
	(ix86_function_ok_for_sibcall): Reject for x86_64 ABI sibling
	calls for varardic-functions.

Regression tested for x86_64-unknown-linux-gnu, x86_64-w64-mingw32, and i686-pc-cygwin.
Ok for apply?

Regards,
Kai

Comments

Jakub Jelinek May 26, 2014, 6:32 p.m. UTC | #1
On Mon, May 26, 2014 at 02:20:36PM -0400, Kai Tietz wrote:
> --- i386.c	(revision 210936)
> +++ i386.c	(working copy)
> @@ -5298,6 +5298,12 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
>        decl_or_type = type;
>      }
>  
> +  /* We need to reject stdarg-function for x86_64 ABI as accumulator
> +     is used as argument.  */
> +  if (TARGET_64BIT && stdarg_p (type)
> +      && ix86_function_type_abi (type) == SYSV_ABI)
> +    return false;
> +
>    /* Check that the return value locations are the same.  Like
>       if we are returning floats on the 80387 register stack, we cannot
>       make a sibcall from a function that doesn't return a float to a
> @@ -24916,8 +24922,19 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call
>  	   ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode)
>  	   : !call_insn_operand (XEXP (fnaddr, 0), word_mode))
>      {
> +      rtx r;
>        fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1);
> -      fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr));
> +      if (!sibcall)
> +	r = copy_to_mode_reg (word_mode, fnaddr);
> +      else
> +	{
> +	  r = gen_rtx_REG (word_mode, AX_REG);
> +	  if (! general_operand (fnaddr, VOIDmode))
> +		    fnaddr = force_operand (fnaddr, r);

Wrong formatting.

> +	  if (fnaddr != r)
> +	    emit_move_insn (r, fnaddr);
> +	}
> +      fnaddr = gen_rtx_MEM (QImode, r);
>      }
>  
>    call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1);

In any case, I still can't understand how limiting the choices of the
register allocator can improve code rather than making it worse.
If the accumulator is available there, why doesn't the RA choose it
if it is beneficial?  And why aren't other registers similarly suitable for
that?  Say r10, r11...

	Jakub
Kai Tietz May 26, 2014, 7:22 p.m. UTC | #2
----- Original Message -----
> On Mon, May 26, 2014 at 02:20:36PM -0400, Kai Tietz wrote:
> > --- i386.c	(revision 210936)
> > +++ i386.c	(working copy)
> > @@ -5298,6 +5298,12 @@ ix86_function_ok_for_sibcall (tree decl, tree exp)
> >        decl_or_type = type;
> >      }
> >  
> > +  /* We need to reject stdarg-function for x86_64 ABI as accumulator
> > +     is used as argument.  */
> > +  if (TARGET_64BIT && stdarg_p (type)
> > +      && ix86_function_type_abi (type) == SYSV_ABI)
> > +    return false;
> > +
> >    /* Check that the return value locations are the same.  Like
> >       if we are returning floats on the 80387 register stack, we cannot
> >       make a sibcall from a function that doesn't return a float to a
> > @@ -24916,8 +24922,19 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call
> >  	   ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode)
> >  	   : !call_insn_operand (XEXP (fnaddr, 0), word_mode))
> >      {
> > +      rtx r;
> >        fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1);
> > -      fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr));
> > +      if (!sibcall)
> > +	r = copy_to_mode_reg (word_mode, fnaddr);
> > +      else
> > +	{
> > +	  r = gen_rtx_REG (word_mode, AX_REG);
> > +	  if (! general_operand (fnaddr, VOIDmode))
> > +		    fnaddr = force_operand (fnaddr, r);
> 
> Wrong formatting.

Thanks for the heads up.  Fix the superflous tab.

> > +	  if (fnaddr != r)
> > +	    emit_move_insn (r, fnaddr);
> > +	}
> > +      fnaddr = gen_rtx_MEM (QImode, r);
> >      }
> >  
> >    call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1);
> 
> In any case, I still can't understand how limiting the choices of the
> register allocator can improve code rather than making it worse.
> If the accumulator is available there, why doesn't the RA choose it
> if it is beneficial?  And why aren't other registers similarly suitable for
> that?  Say r10, r11...

I don't see it as limiting.  The intend of this is more to have fixed patterns on epilogue.  And in fact is accumulator that register which can be used as scratch-register for all i386-targets.  Beside for varardic-functions, which anyway aren't any good candidates for sibling-call-optimization (on x86_64 due ABI).
Well, for x86_64 ABI we might could consider to use R11_REG instead of AX_REG.  Is there any advantage in special-case for x86_64 ABI?
The R10-register isn't a good choice due it might be used as drap-register and therefore can't be loaded before epilogue gets destroyed.
 
> 	Jakub
> 

Kai
Jakub Jelinek May 26, 2014, 7:35 p.m. UTC | #3
On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote:
> > In any case, I still can't understand how limiting the choices of the
> > register allocator can improve code rather than making it worse.
> > If the accumulator is available there, why doesn't the RA choose it
> > if it is beneficial?  And why aren't other registers similarly suitable for
> > that?  Say r10, r11...
> 
> I don't see it as limiting.  The intend of this is more to have fixed
> patterns on epilogue.  And in fact is accumulator that register which can
> be used as scratch-register for all i386-targets.  Beside for
> varardic-functions, which anyway aren't any good candidates for
> sibling-call-optimization (on x86_64 due ABI).  Well, for x86_64 ABI we
> might could consider to use R11_REG instead of AX_REG.  Is there any
> advantage in special-case for x86_64 ABI?  The R10-register isn't a good
> choice due it might be used as drap-register and therefore can't be loaded
> before epilogue gets destroyed.

It is limiting.  If r11/rax and often also r10 can be chosen, telling the RA
it can only choose rax is a limitation.

Can you show some testcase where your patch is actually beneficial?  We
should analyze why the RA made that choice.

	Jakub
Kai Tietz May 26, 2014, 8:32 p.m. UTC | #4
----- Original Message -----
> On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote:
> > > In any case, I still can't understand how limiting the choices of the
> > > register allocator can improve code rather than making it worse.
> > > If the accumulator is available there, why doesn't the RA choose it
> > > if it is beneficial?  And why aren't other registers similarly suitable
> > > for
> > > that?  Say r10, r11...
> > 
> > I don't see it as limiting.  The intend of this is more to have fixed
> > patterns on epilogue.  And in fact is accumulator that register which can
> > be used as scratch-register for all i386-targets.  Beside for
> > varardic-functions, which anyway aren't any good candidates for
> > sibling-call-optimization (on x86_64 due ABI).  Well, for x86_64 ABI we
> > might could consider to use R11_REG instead of AX_REG.  Is there any
> > advantage in special-case for x86_64 ABI?  The R10-register isn't a good
> > choice due it might be used as drap-register and therefore can't be loaded
> > before epilogue gets destroyed.
> 
> It is limiting.  If r11/rax and often also r10 can be chosen, telling the RA
> it can only choose rax is a limitation.

No, it isn't.  For sure.  The code-branch choosing the accu to call after epilogue isn't used as memories for sibling-calls aren't allowed.  This code-branch will get active with my other sibling-tail-call patch.
 
> Can you show some testcase where your patch is actually beneficial?  We
> should analyze why the RA made that choice.

So it is obvious I can't provide you samples you asked for.  Nevertheless I am pretty interested to see a sample by you (with activated sibling-tail-call memories) which chooses for tail-call-register for memory something else then accu.
 
> 	Jakub
> 

Kai
Jeff Law May 27, 2014, 3:39 p.m. UTC | #5
On 05/26/14 14:32, Kai Tietz wrote:
> ----- Original Message -----
>> On Mon, May 26, 2014 at 03:22:50PM -0400, Kai Tietz wrote:
>>>> In any case, I still can't understand how limiting the choices
>>>> of the register allocator can improve code rather than making
>>>> it worse. If the accumulator is available there, why doesn't
>>>> the RA choose it if it is beneficial?  And why aren't other
>>>> registers similarly suitable for that?  Say r10, r11...
>>>
>>> I don't see it as limiting.  The intend of this is more to have
>>> fixed patterns on epilogue.  And in fact is accumulator that
>>> register which can be used as scratch-register for all
>>> i386-targets.  Beside for varardic-functions, which anyway aren't
>>> any good candidates for sibling-call-optimization (on x86_64 due
>>> ABI).  Well, for x86_64 ABI we might could consider to use
>>> R11_REG instead of AX_REG.  Is there any advantage in
>>> special-case for x86_64 ABI?  The R10-register isn't a good
>>> choice due it might be used as drap-register and therefore can't
>>> be loaded before epilogue gets destroyed.
>>
>> It is limiting.  If r11/rax and often also r10 can be chosen,
>> telling the RA it can only choose rax is a limitation.
>
> No, it isn't.  For sure.  The code-branch choosing the accu to call
> after epilogue isn't used as memories for sibling-calls aren't
> allowed.  This code-branch will get active with my other
> sibling-tail-call patch.
But the value we want may be sitting around in a convenient register 
(such as r11).  So if we force the sibcall to use %rax, then we have to 
generate a copy.  Yet if we have a constraint for the set of registers 
allowed here, then we give the register allocator the opportunity to 
coalesce away the copies.

Jeff
Kai Tietz May 27, 2014, 3:51 p.m. UTC | #6
Ok, so just the part ok the patch remains to prevent varardic-functions being a sibling-tail-call remains.

Kai
Richard Henderson May 27, 2014, 4:22 p.m. UTC | #7
On 05/27/2014 08:39 AM, Jeff Law wrote:
> But the value we want may be sitting around in a convenient register (such as
> r11).  So if we force the sibcall to use %rax, then we have to generate a
> copy.  Yet if we have a constraint for the set of registers allowed here, then
> we give the register allocator the opportunity to coalesce away the copies.

We do have an existing call-clobbered register class that we currently do use
for indirect sibcalls.

The problem that Kai is trying to work around is that he doesn't want to load
the value into a register, but its address.  But the register allocator only
has BASE_REG_CLASS and INDEX_REG_CLASS, and has no possibility to force the
address components into registers that are call clobbered.

My thinking is that, with few exceptions, this doesn't help anything.  If we
have to load the full address into a register:

	lea	ofs(base, index, scale), %eax
	...
	call	*0(%eax)

we might as well include the memory load

	mov	ofs(base, index, scale), %eax
	...
	call	*%eax

As far as I can see, the only time we can actually save an instruction is when
the address is fully symbolic:

	mov	foo, %eax
	...
	call	*%eax

becomes

	call	*foo


r~
Jeff Law May 27, 2014, 4:48 p.m. UTC | #8
On 05/27/14 10:22, Richard Henderson wrote:
> On 05/27/2014 08:39 AM, Jeff Law wrote:
>> But the value we want may be sitting around in a convenient register (such as
>> r11).  So if we force the sibcall to use %rax, then we have to generate a
>> copy.  Yet if we have a constraint for the set of registers allowed here, then
>> we give the register allocator the opportunity to coalesce away the copies.
>
> We do have an existing call-clobbered register class that we currently do use
> for indirect sibcalls.
>
> The problem that Kai is trying to work around is that he doesn't want to load
> the value into a register, but its address.  But the register allocator only
> has BASE_REG_CLASS and INDEX_REG_CLASS, and has no possibility to force the
> address components into registers that are call clobbered.
>
> My thinking is that, with few exceptions, this doesn't help anything.  If we
> have to load the full address into a register:
>
> 	lea	ofs(base, index, scale), %eax
> 	...
> 	call	*0(%eax)
>
> we might as well include the memory load
>
> 	mov	ofs(base, index, scale), %eax
> 	...
> 	call	*%eax
Ok.  My misunderstanding.

Granted, this probably doesn't happen enough to matter, but isn't it 
likely profitable from a pipeline standpoint to go ahead and do the 
memory load separate from the indirect call/jump as well?


>
> As far as I can see, the only time we can actually save an instruction is when
> the address is fully symbolic:
>
> 	mov	foo, %eax
> 	...
> 	call	*%eax
>
> becomes
>
> 	call	*foo
Noted.
jeff
Richard Henderson May 27, 2014, 5:25 p.m. UTC | #9
On 05/27/2014 09:48 AM, Jeff Law wrote:
>>     lea    ofs(base, index, scale), %eax
>>     ...
>>     call    *0(%eax)
>>
>> we might as well include the memory load
>>
>>     mov    ofs(base, index, scale), %eax
>>     ...
>>     call    *%eax
> Ok.  My misunderstanding.
> 
> Granted, this probably doesn't happen enough to matter, but isn't it likely
> profitable from a pipeline standpoint to go ahead and do the memory load
> separate from the indirect call/jump as well?

I'm sure.  Especially if the data is currently living in L2, and the scheduler
has enough room to move the load sufficiently early to hide the latency.

As I've said before, I think this is interesting solely as a space optimization.


r~
diff mbox

Patch

Index: i386.c
===================================================================
--- i386.c	(revision 210936)
+++ i386.c	(working copy)
@@ -5298,6 +5298,12 @@  ix86_function_ok_for_sibcall (tree decl, tree exp)
       decl_or_type = type;
     }
 
+  /* We need to reject stdarg-function for x86_64 ABI as accumulator
+     is used as argument.  */
+  if (TARGET_64BIT && stdarg_p (type)
+      && ix86_function_type_abi (type) == SYSV_ABI)
+    return false;
+
   /* Check that the return value locations are the same.  Like
      if we are returning floats on the 80387 register stack, we cannot
      make a sibcall from a function that doesn't return a float to a
@@ -24916,8 +24922,19 @@  ix86_expand_call (rtx retval, rtx fnaddr, rtx call
 	   ? !sibcall_insn_operand (XEXP (fnaddr, 0), word_mode)
 	   : !call_insn_operand (XEXP (fnaddr, 0), word_mode))
     {
+      rtx r;
       fnaddr = convert_to_mode (word_mode, XEXP (fnaddr, 0), 1);
-      fnaddr = gen_rtx_MEM (QImode, copy_to_mode_reg (word_mode, fnaddr));
+      if (!sibcall)
+	r = copy_to_mode_reg (word_mode, fnaddr);
+      else
+	{
+	  r = gen_rtx_REG (word_mode, AX_REG);
+	  if (! general_operand (fnaddr, VOIDmode))
+		    fnaddr = force_operand (fnaddr, r);
+	  if (fnaddr != r)
+	    emit_move_insn (r, fnaddr);
+	}
+      fnaddr = gen_rtx_MEM (QImode, r);
     }
 
   call = gen_rtx_CALL (VOIDmode, fnaddr, callarg1);