
[1/3] tcg: add ext{8,16,32}u_i{32,64} TCG ops

Message ID e388f3bbe334e3781ff318ce92d1474be6aaa99a.1254512633.git.aurelien@aurel32.net
State Superseded

Commit Message

Aurelien Jarno Sept. 30, 2009, 9:09 p.m. UTC
Currently zero extension ops are implemented by an and op with a
constant. This is then caught in some backends, and replaced by
a zero extension instruction. While this works well on RISC
machines, it adds a useless register move on non-RISC machines.

Example on x86:
  ext16u_i32 r1, r2
is translated into
  mov    %eax,%ebx
  movzwl %bx, %ebx
while the optimized version should be:
  movzwl %ax, %ebx

This patch adds ext{8,16,32}u_i{32,64} TCG ops that can be
implemented in the backends to avoid emitting useless register
moves.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 tcg/tcg-op.h  |   24 +++++++++++++++++++++---
 tcg/tcg-opc.h |   15 +++++++++++++++
 2 files changed, 36 insertions(+), 3 deletions(-)

Comments

Nathan Froyd Oct. 2, 2009, 8:07 p.m. UTC | #1
On Wed, Sep 30, 2009 at 11:09:35PM +0200, Aurelien Jarno wrote:
> Currently zero extension ops are implemented by an and op with a
> constant. This is then caught in some backends, and replaced by
> a zero extension instruction. While this works well on RISC
> machines, it adds a useless register move on non-RISC machines.
> 
> This patch adds ext{8,16,32}u_i{32,64} TCG ops that can be
> implemented in the backends to avoid emitting useless register
> moves.

I have to ask--does this make things go faster?

-Nathan
Aurelien Jarno Oct. 2, 2009, 8:54 p.m. UTC | #2
On Fri, Oct 02, 2009 at 01:07:12PM -0700, Nathan Froyd wrote:
> On Wed, Sep 30, 2009 at 11:09:35PM +0200, Aurelien Jarno wrote:
> > Currently zero extension ops are implemented by an and op with a
> > constant. This is then caught in some backends, and replaced by
> > a zero extension instruction. While this works well on RISC
> > machines, it adds a useless register move on non-RISC machines.
> > 
> > This patch adds ext{8,16,32}u_i{32,64} TCG ops that can be
> > implemented in the backends to avoid emitting useless register
> > moves.
> 
> I have to ask--does this make things go faster?
> 

It depends on the target; it needs to use zero extension (MIPS, for
example, almost only does sign extension). It gives a 1.5% gain on
my test with qemu-x86_64.

It should also give a gain on 64-bit system targets running a 32-bit
OS (i386 on x86_64, ppc on ppc64), as they do a zero extension on a
lot of instructions.
Paul Brook Nov. 10, 2009, 2:51 p.m. UTC | #3
On Wednesday 30 September 2009, Aurelien Jarno wrote:
> Currently zero extension ops are implemented by an and op with a
> constant. This is then caught in some backends, and replaced by
> a zero extension instruction. While this works well on RISC
> machines, it adds a useless register move on non-RISC machines.
> 
> Example on x86:
>   ext16u_i32 r1, r2
> is translated into
>   mov    %eax,%ebx
>   movzwl %bx, %ebx
> while the optimized version should be:
>   movzwl %ax, %ebx

I don't like your solution. Having two operations that do the same thing is 
bad, especially when we have no way of converting one into the other, and no 
clear indication which is best.

I think we need to understand why the original code introduces an extra
copy. At first glance there's no good reason for it to be there.

Paul
Aurelien Jarno Nov. 10, 2009, 3:38 p.m. UTC | #4
On Tue, Nov 10, 2009 at 02:51:21PM +0000, Paul Brook wrote:
> On Wednesday 30 September 2009, Aurelien Jarno wrote:
> > Currently zero extension ops are implemented by an and op with a
> > constant. This is then caught in some backends, and replaced by
> > a zero extension instruction. While this works well on RISC
> > machines, it adds a useless register move on non-RISC machines.
> > 
> > Example on x86:
> >   ext16u_i32 r1, r2
> > is translated into
> >   mov    %eax,%ebx
> >   movzwl %bx, %ebx
> > while the optimized version should be:
> >   movzwl %ax, %ebx
> 
> I don't like your solution. Having two operations that do the same thing is 
> bad, especially when we have no way of converting one into the other, and no 
> clear indication which is best.

We don't have two operations, we have a few optional operations (ext*u)
that are implemented with andi on TCG targets that don't support them.
This is the same as ext*s, which is not always implemented natively.
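
As an illustration (a hypothetical sketch only, not part of this patch),
a TCG target that supports the op natively would advertise it in its
tcg-target.h and handle it in tcg_out_op(), while all other targets keep
the andi expansion from tcg-op.h:

  /* tcg-target.h: advertise native support (macro name from this patch) */
  #define TCG_TARGET_HAS_ext16u_i32

  /* tcg-target.c, inside tcg_out_op(); the encoding below assumes the
     i386 backend's existing tcg_out_modrm()/P_EXT helpers */
  case INDEX_op_ext16u_i32:
      tcg_out_modrm(s, 0xb7 | P_EXT, args[0], args[1]);   /* movzwl */
      break;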

> I think we need to understand why the original code introduces an extra
> copy. At first glance there's no good reason for it to be there.

That's easy to understand. ext16u_i32 r1, r2 is translated into 
andi r1, r2, 0xffff, that is:

  movi_i32 tmp0, 0xffff
  and r1, r2, tmp0

On x86*, and is defined as:

  { INDEX_op_and_i32, { "r", "0", "ri" } },

That is, the first argument is an alias of the output. This is
therefore converted into (even if it is never represented this way
inside TCG):

  movi_i32 tmp0, 0xffff
  mov r1, r2
  and r1, r1, tmp0

The x86 TCG target has some special cases in its and implementation
and converts this one into a movzwl instruction instead of an and with
an immediate.
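
With a dedicated ext16u_i32 op, the backend is free to use a non-aliased
constraint, for example (a hypothetical definition, not taken from this
patch):

  { INDEX_op_ext16u_i32, { "r", "r" } },

so the output no longer has to alias the input, and the movzwl can read
%ax and write %ebx directly, without the extra mov.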

Patch

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 7cb6934..faf2e8b 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -1189,16 +1189,22 @@  static inline void tcg_gen_ext16s_i32(TCGv_i32 ret, TCGv_i32 arg)
 #endif
 }
 
-/* These are currently just for convenience.
-   We assume a target will recognise these automatically .  */
 static inline void tcg_gen_ext8u_i32(TCGv_i32 ret, TCGv_i32 arg)
 {
+#ifdef TCG_TARGET_HAS_ext8u_i32
+    tcg_gen_op2_i32(INDEX_op_ext8u_i32, ret, arg);
+#else
     tcg_gen_andi_i32(ret, arg, 0xffu);
+#endif
 }
 
 static inline void tcg_gen_ext16u_i32(TCGv_i32 ret, TCGv_i32 arg)
 {
+#ifdef TCG_TARGET_HAS_ext16u_i32
+    tcg_gen_op2_i32(INDEX_op_ext16u_i32, ret, arg);
+#else
     tcg_gen_andi_i32(ret, arg, 0xffffu);
+#endif
 }
 
 /* Note: we assume the two high bytes are set to zero */
@@ -1358,17 +1364,29 @@  static inline void tcg_gen_ext32s_i64(TCGv_i64 ret, TCGv_i64 arg)
 
 static inline void tcg_gen_ext8u_i64(TCGv_i64 ret, TCGv_i64 arg)
 {
+#ifdef TCG_TARGET_HAS_ext8u_i64
+    tcg_gen_op2_i64(INDEX_op_ext8u_i64, ret, arg);
+#else
     tcg_gen_andi_i64(ret, arg, 0xffu);
+#endif
 }
 
 static inline void tcg_gen_ext16u_i64(TCGv_i64 ret, TCGv_i64 arg)
 {
+#ifdef TCG_TARGET_HAS_ext16u_i64
+    tcg_gen_op2_i64(INDEX_op_ext16u_i64, ret, arg);
+#else
     tcg_gen_andi_i64(ret, arg, 0xffffu);
+#endif
 }
 
 static inline void tcg_gen_ext32u_i64(TCGv_i64 ret, TCGv_i64 arg)
 {
+#ifdef TCG_TARGET_HAS_ext32u_i64
+    tcg_gen_op2_i64(INDEX_op_ext32u_i64, ret, arg);
+#else
     tcg_gen_andi_i64(ret, arg, 0xffffffffu);
+#endif
 }
 
 /* Note: we assume the target supports move between 32 and 64 bit
@@ -1382,7 +1400,7 @@  static inline void tcg_gen_trunc_i64_i32(TCGv_i32 ret, TCGv_i64 arg)
    registers */
 static inline void tcg_gen_extu_i32_i64(TCGv_i64 ret, TCGv_i32 arg)
 {
-    tcg_gen_andi_i64(ret, MAKE_TCGV_I64(GET_TCGV_I32(arg)), 0xffffffffu);
+    tcg_gen_ext32u_i64(ret, MAKE_TCGV_I64(GET_TCGV_I32(arg)));
 }
 
 /* Note: we assume the target supports move between 32 and 64 bit
diff --git a/tcg/tcg-opc.h b/tcg/tcg-opc.h
index 3a095fc..b7f3fd7 100644
--- a/tcg/tcg-opc.h
+++ b/tcg/tcg-opc.h
@@ -89,6 +89,12 @@  DEF2(ext8s_i32, 1, 1, 0, 0)
 #ifdef TCG_TARGET_HAS_ext16s_i32
 DEF2(ext16s_i32, 1, 1, 0, 0)
 #endif
+#ifdef TCG_TARGET_HAS_ext8u_i32
+DEF2(ext8u_i32, 1, 1, 0, 0)
+#endif
+#ifdef TCG_TARGET_HAS_ext16u_i32
+DEF2(ext16u_i32, 1, 1, 0, 0)
+#endif
 #ifdef TCG_TARGET_HAS_bswap16_i32
 DEF2(bswap16_i32, 1, 1, 0, 0)
 #endif
@@ -152,6 +158,15 @@  DEF2(ext16s_i64, 1, 1, 0, 0)
 #ifdef TCG_TARGET_HAS_ext32s_i64
 DEF2(ext32s_i64, 1, 1, 0, 0)
 #endif
+#ifdef TCG_TARGET_HAS_ext8u_i64
+DEF2(ext8u_i64, 1, 1, 0, 0)
+#endif
+#ifdef TCG_TARGET_HAS_ext16u_i64
+DEF2(ext16u_i64, 1, 1, 0, 0)
+#endif
+#ifdef TCG_TARGET_HAS_ext32u_i64
+DEF2(ext32u_i64, 1, 1, 0, 0)
+#endif
 #ifdef TCG_TARGET_HAS_bswap16_i64
 DEF2(bswap16_i64, 1, 1, 0, 0)
 #endif