Patchwork [15/15] tcg: use ext op for deposit

login
register
mail settings
Submitter Alexander Graf
Date April 4, 2011, 2:32 p.m.
Message ID <1301927544-32767-16-git-send-email-agraf@suse.de>
Download mbox | patch
Permalink /patch/89667/
State New
Headers show

Comments

Alexander Graf - April 4, 2011, 2:32 p.m.
With the s390x target we use the deposit instruction to store 32bit values
into 64bit registers without clobbering the upper 32 bits.

This specific operation can be optimized slightly by using the ext operation
instead of an explicit and in the deposit instruction. This patch adds that
special case to the generic deposit implementation.

Signed-off-by: Alexander Graf <agraf@suse.de>
---
 tcg/tcg-op.h |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)
Aurelien Jarno - April 5, 2011, 4:54 a.m.
On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
> With the s390x target we use the deposit instruction to store 32bit values
> into 64bit registers without clobbering the upper 32 bits.
> 
> This specific operation can be optimized slightly by using the ext operation
> instead of an explicit and in the deposit instruction. This patch adds that
> special case to the generic deposit implementation.
> 
> Signed-off-by: Alexander Graf <agraf@suse.de>
> ---
>  tcg/tcg-op.h |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)

Have you really measuring a difference here? This should already be
handled, at least on x86, by this code:

        if (TCG_TARGET_REG_BITS == 64) {
            if (val == 0xffffffffu) {
                tcg_out_ext32u(s, r0, r0);
                return;
            }
            if (val == (uint32_t)val) {
                /* AND with no high bits set can use a 32-bit operation.  */
                rexw = 0;
            }
        }

> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> index 207a89f..88575e7 100644
> --- a/tcg/tcg-op.h
> +++ b/tcg/tcg-op.h
> @@ -2124,7 +2124,11 @@ static inline void tcg_gen_deposit_i64(TCGv_i64 ret, TCGv_i64 arg1,
>    uint64_t mask = (1ull << len) - 1;
>    TCGv_i64 t1 = tcg_temp_new_i64 ();
>  
> -  tcg_gen_andi_i64(t1, arg2, mask);
> +  if (len == 32) {
> +       tcg_gen_ext32u_i64(t1, arg2);
> +  } else {
> +      tcg_gen_andi_i64(t1, arg2, mask);
> +  }
>    tcg_gen_shli_i64(t1, t1, ofs);
>    tcg_gen_andi_i64(ret, arg1, ~(mask << ofs));
>    tcg_gen_or_i64(ret, ret, t1);
> -- 
> 1.6.0.2
> 
> 
>
Alexander Graf - April 5, 2011, 7:55 a.m.
On 05.04.2011, at 06:54, Aurelien Jarno wrote:

> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
>> With the s390x target we use the deposit instruction to store 32bit values
>> into 64bit registers without clobbering the upper 32 bits.
>> 
>> This specific operation can be optimized slightly by using the ext operation
>> instead of an explicit and in the deposit instruction. This patch adds that
>> special case to the generic deposit implementation.
>> 
>> Signed-off-by: Alexander Graf <agraf@suse.de>
>> ---
>> tcg/tcg-op.h |    6 +++++-
>> 1 files changed, 5 insertions(+), 1 deletions(-)
> 
> Have you really measuring a difference here? This should already be
> handled, at least on x86, by this code:
> 
>        if (TCG_TARGET_REG_BITS == 64) {
>            if (val == 0xffffffffu) {
>                tcg_out_ext32u(s, r0, r0);
>                return;
>            }
>            if (val == (uint32_t)val) {
>                /* AND with no high bits set can use a 32-bit operation.  */
>                rexw = 0;
>            }
>        }

I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.


Alex
Aurelien Jarno - April 10, 2011, 7:23 p.m.
On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
> 
> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
> 
> > On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
> >> With the s390x target we use the deposit instruction to store 32bit values
> >> into 64bit registers without clobbering the upper 32 bits.
> >> 
> >> This specific operation can be optimized slightly by using the ext operation
> >> instead of an explicit and in the deposit instruction. This patch adds that
> >> special case to the generic deposit implementation.
> >> 
> >> Signed-off-by: Alexander Graf <agraf@suse.de>
> >> ---
> >> tcg/tcg-op.h |    6 +++++-
> >> 1 files changed, 5 insertions(+), 1 deletions(-)
> > 
> > Have you really measuring a difference here? This should already be
> > handled, at least on x86, by this code:
> > 
> >        if (TCG_TARGET_REG_BITS == 64) {
> >            if (val == 0xffffffffu) {
> >                tcg_out_ext32u(s, r0, r0);
> >                return;
> >            }
> >            if (val == (uint32_t)val) {
> >                /* AND with no high bits set can use a 32-bit operation.  */
> >                rexw = 0;
> >            }
> >        }
> 
> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
> 

The question there is looking at -d out_asm. They should be the same at
the end as the code I pasted above is from tcg/i386/tcg-target.c.
Alexander Graf - April 10, 2011, 7:25 p.m.
On 10.04.2011, at 21:23, Aurelien Jarno wrote:

> On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
>> 
>> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
>> 
>>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
>>>> With the s390x target we use the deposit instruction to store 32bit values
>>>> into 64bit registers without clobbering the upper 32 bits.
>>>> 
>>>> This specific operation can be optimized slightly by using the ext operation
>>>> instead of an explicit and in the deposit instruction. This patch adds that
>>>> special case to the generic deposit implementation.
>>>> 
>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>> ---
>>>> tcg/tcg-op.h |    6 +++++-
>>>> 1 files changed, 5 insertions(+), 1 deletions(-)
>>> 
>>> Have you really measuring a difference here? This should already be
>>> handled, at least on x86, by this code:
>>> 
>>>       if (TCG_TARGET_REG_BITS == 64) {
>>>           if (val == 0xffffffffu) {
>>>               tcg_out_ext32u(s, r0, r0);
>>>               return;
>>>           }
>>>           if (val == (uint32_t)val) {
>>>               /* AND with no high bits set can use a 32-bit operation.  */
>>>               rexw = 0;
>>>           }
>>>       }
>> 
>> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
>> 
> 
> The question there is looking at -d out_asm. They should be the same at
> the end as the code I pasted above is from tcg/i386/tcg-target.c.

Yes. I was trying to optimize for maximum op length. TCG defines a maximum number of tcg ops to be issued by each target instruction. Since s390 is very CISCy, there are instructions that translate into lots of microops, but are still faster than a C call (register save/restore mostly).

Without this patch, there are some places where we hit that number :).


Alex
Aurelien Jarno - April 10, 2011, 8:08 p.m.
On Sun, Apr 10, 2011 at 09:25:33PM +0200, Alexander Graf wrote:
> 
> On 10.04.2011, at 21:23, Aurelien Jarno wrote:
> 
> > On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
> >> 
> >> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
> >> 
> >>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
> >>>> With the s390x target we use the deposit instruction to store 32bit values
> >>>> into 64bit registers without clobbering the upper 32 bits.
> >>>> 
> >>>> This specific operation can be optimized slightly by using the ext operation
> >>>> instead of an explicit and in the deposit instruction. This patch adds that
> >>>> special case to the generic deposit implementation.
> >>>> 
> >>>> Signed-off-by: Alexander Graf <agraf@suse.de>
> >>>> ---
> >>>> tcg/tcg-op.h |    6 +++++-
> >>>> 1 files changed, 5 insertions(+), 1 deletions(-)
> >>> 
> >>> Have you really measuring a difference here? This should already be
> >>> handled, at least on x86, by this code:
> >>> 
> >>>       if (TCG_TARGET_REG_BITS == 64) {
> >>>           if (val == 0xffffffffu) {
> >>>               tcg_out_ext32u(s, r0, r0);
> >>>               return;
> >>>           }
> >>>           if (val == (uint32_t)val) {
> >>>               /* AND with no high bits set can use a 32-bit operation.  */
> >>>               rexw = 0;
> >>>           }
> >>>       }
> >> 
> >> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
> >> 
> > 
> > The question there is looking at -d out_asm. They should be the same at
> > the end as the code I pasted above is from tcg/i386/tcg-target.c.
> 
> Yes. I was trying to optimize for maximum op length. TCG defines a maximum number of tcg ops to be issued by each target instruction. Since s390 is very CISCy, there are instructions that translate into lots of microops, but are still faster than a C call (register save/restore mostly).
> 
> Without this patch, there are some places where we hit that number :).

Is it on 32-bit on or 64-bit? If we reach this number, it's probably
better to either implement this instruction with an helper, or maybe
increase the number of maximum ops. What is this instruction?
Alexander Graf - April 10, 2011, 8:17 p.m.
On 10.04.2011, at 22:08, Aurelien Jarno wrote:

> On Sun, Apr 10, 2011 at 09:25:33PM +0200, Alexander Graf wrote:
>> 
>> On 10.04.2011, at 21:23, Aurelien Jarno wrote:
>> 
>>> On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
>>>> 
>>>> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
>>>> 
>>>>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
>>>>>> With the s390x target we use the deposit instruction to store 32bit values
>>>>>> into 64bit registers without clobbering the upper 32 bits.
>>>>>> 
>>>>>> This specific operation can be optimized slightly by using the ext operation
>>>>>> instead of an explicit and in the deposit instruction. This patch adds that
>>>>>> special case to the generic deposit implementation.
>>>>>> 
>>>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
>>>>>> ---
>>>>>> tcg/tcg-op.h |    6 +++++-
>>>>>> 1 files changed, 5 insertions(+), 1 deletions(-)
>>>>> 
>>>>> Have you really measuring a difference here? This should already be
>>>>> handled, at least on x86, by this code:
>>>>> 
>>>>>      if (TCG_TARGET_REG_BITS == 64) {
>>>>>          if (val == 0xffffffffu) {
>>>>>              tcg_out_ext32u(s, r0, r0);
>>>>>              return;
>>>>>          }
>>>>>          if (val == (uint32_t)val) {
>>>>>              /* AND with no high bits set can use a 32-bit operation.  */
>>>>>              rexw = 0;
>>>>>          }
>>>>>      }
>>>> 
>>>> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
>>>> 
>>> 
>>> The question there is looking at -d out_asm. They should be the same at
>>> the end as the code I pasted above is from tcg/i386/tcg-target.c.
>> 
>> Yes. I was trying to optimize for maximum op length. TCG defines a maximum number of tcg ops to be issued by each target instruction. Since s390 is very CISCy, there are instructions that translate into lots of microops, but are still faster than a C call (register save/restore mostly).
>> 
>> Without this patch, there are some places where we hit that number :).
> 
> Is it on 32-bit on or 64-bit? If we reach this number, it's probably
> better to either implement this instruction with an helper, or maybe
> increase the number of maximum ops. What is this instruction?

This was on x86_64. I hit limits with LMH and LM, but reduced them to fit into the picture with this optimization :). If you like, I can give you a statically linked binary that could exceed the limits.


Alex
Aurelien Jarno - April 10, 2011, 8:28 p.m.
On Sun, Apr 10, 2011 at 10:17:26PM +0200, Alexander Graf wrote:
> 
> On 10.04.2011, at 22:08, Aurelien Jarno wrote:
> 
> > On Sun, Apr 10, 2011 at 09:25:33PM +0200, Alexander Graf wrote:
> >> 
> >> On 10.04.2011, at 21:23, Aurelien Jarno wrote:
> >> 
> >>> On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
> >>>> 
> >>>>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
> >>>>>> With the s390x target we use the deposit instruction to store 32bit values
> >>>>>> into 64bit registers without clobbering the upper 32 bits.
> >>>>>> 
> >>>>>> This specific operation can be optimized slightly by using the ext operation
> >>>>>> instead of an explicit and in the deposit instruction. This patch adds that
> >>>>>> special case to the generic deposit implementation.
> >>>>>> 
> >>>>>> Signed-off-by: Alexander Graf <agraf@suse.de>
> >>>>>> ---
> >>>>>> tcg/tcg-op.h |    6 +++++-
> >>>>>> 1 files changed, 5 insertions(+), 1 deletions(-)
> >>>>> 
> >>>>> Have you really measuring a difference here? This should already be
> >>>>> handled, at least on x86, by this code:
> >>>>> 
> >>>>>      if (TCG_TARGET_REG_BITS == 64) {
> >>>>>          if (val == 0xffffffffu) {
> >>>>>              tcg_out_ext32u(s, r0, r0);
> >>>>>              return;
> >>>>>          }
> >>>>>          if (val == (uint32_t)val) {
> >>>>>              /* AND with no high bits set can use a 32-bit operation.  */
> >>>>>              rexw = 0;
> >>>>>          }
> >>>>>      }
> >>>> 
> >>>> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
> >>>> 
> >>> 
> >>> The question there is looking at -d out_asm. They should be the same at
> >>> the end as the code I pasted above is from tcg/i386/tcg-target.c.
> >> 
> >> Yes. I was trying to optimize for maximum op length. TCG defines a maximum number of tcg ops to be issued by each target instruction. Since s390 is very CISCy, there are instructions that translate into lots of microops, but are still faster than a C call (register save/restore mostly).
> >> 
> >> Without this patch, there are some places where we hit that number :).
> > 
> > Is it on 32-bit on or 64-bit? If we reach this number, it's probably
> > better to either implement this instruction with an helper, or maybe
> > increase the number of maximum ops. What is this instruction?
> 
> This was on x86_64. I hit limits with LMH and LM, but reduced them to fit into the picture with this optimization :). If you like, I can give you a statically linked binary that could exceed the limits.
> 

Yeah for what I see it's the loop is unrolled there. Not sure it is the
best to do. Also if the limit is exceeded on 64-bit it is for sure
exceeded on 32-bit hosts.
Alexander Graf - April 14, 2011, 4:27 p.m.
On 04/10/2011 10:28 PM, Aurelien Jarno wrote:
> On Sun, Apr 10, 2011 at 10:17:26PM +0200, Alexander Graf wrote:
>> On 10.04.2011, at 22:08, Aurelien Jarno wrote:
>>
>>> On Sun, Apr 10, 2011 at 09:25:33PM +0200, Alexander Graf wrote:
>>>> On 10.04.2011, at 21:23, Aurelien Jarno wrote:
>>>>
>>>>> On Tue, Apr 05, 2011 at 09:55:09AM +0200, Alexander Graf wrote:
>>>>>> On 05.04.2011, at 06:54, Aurelien Jarno wrote:
>>>>>>
>>>>>>> On Mon, Apr 04, 2011 at 04:32:24PM +0200, Alexander Graf wrote:
>>>>>>>> With the s390x target we use the deposit instruction to store 32bit values
>>>>>>>> into 64bit registers without clobbering the upper 32 bits.
>>>>>>>>
>>>>>>>> This specific operation can be optimized slightly by using the ext operation
>>>>>>>> instead of an explicit and in the deposit instruction. This patch adds that
>>>>>>>> special case to the generic deposit implementation.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexander Graf<agraf@suse.de>
>>>>>>>> ---
>>>>>>>> tcg/tcg-op.h |    6 +++++-
>>>>>>>> 1 files changed, 5 insertions(+), 1 deletions(-)
>>>>>>> Have you really measuring a difference here? This should already be
>>>>>>> handled, at least on x86, by this code:
>>>>>>>
>>>>>>>       if (TCG_TARGET_REG_BITS == 64) {
>>>>>>>           if (val == 0xffffffffu) {
>>>>>>>               tcg_out_ext32u(s, r0, r0);
>>>>>>>               return;
>>>>>>>           }
>>>>>>>           if (val == (uint32_t)val) {
>>>>>>>               /* AND with no high bits set can use a 32-bit operation.  */
>>>>>>>               rexw = 0;
>>>>>>>           }
>>>>>>>       }
>>>>>> I've certainly looked at the -d op logs and seen that instead of creating a const tcg variable plus an AND there was now an extu opcode issued, yes. No idea why the case up there didn't trigger.
>>>>>>
>>>>> The question there is looking at -d out_asm. They should be the same at
>>>>> the end as the code I pasted above is from tcg/i386/tcg-target.c.
>>>> Yes. I was trying to optimize for maximum op length. TCG defines a maximum number of tcg ops to be issued by each target instruction. Since s390 is very CISCy, there are instructions that translate into lots of microops, but are still faster than a C call (register save/restore mostly).
>>>>
>>>> Without this patch, there are some places where we hit that number :).
>>> Is it on 32-bit on or 64-bit? If we reach this number, it's probably
>>> better to either implement this instruction with an helper, or maybe
>>> increase the number of maximum ops. What is this instruction?
>> This was on x86_64. I hit limits with LMH and LM, but reduced them to fit into the picture with this optimization :). If you like, I can give you a statically linked binary that could exceed the limits.
>>
> Yeah for what I see it's the loop is unrolled there. Not sure it is the
> best to do. Also if the limit is exceeded on 64-bit it is for sure
> exceeded on 32-bit hosts.


Ok, I've optimized some code paths for 32bit hosts now and bumped up the 
limit on 32bit to 128 ops. I've also dropped the ext op patch - it's not 
necessary anymore. I've optimized the 64bit version well enough so we 
don't require it anymore :)


Alex

Patch

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 207a89f..88575e7 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -2124,7 +2124,11 @@  static inline void tcg_gen_deposit_i64(TCGv_i64 ret, TCGv_i64 arg1,
   uint64_t mask = (1ull << len) - 1;
   TCGv_i64 t1 = tcg_temp_new_i64 ();
 
-  tcg_gen_andi_i64(t1, arg2, mask);
+  if (len == 32) {
+       tcg_gen_ext32u_i64(t1, arg2);
+  } else {
+      tcg_gen_andi_i64(t1, arg2, mask);
+  }
   tcg_gen_shli_i64(t1, t1, ofs);
   tcg_gen_andi_i64(ret, arg1, ~(mask << ofs));
   tcg_gen_or_i64(ret, ret, t1);