Patchwork [7/7] tcg: Streamline movcond_i64 using movcond_i32

login
register
mail settings
Submitter Richard Henderson
Date Sept. 21, 2012, 5:13 p.m.
Message ID <1348247620-12734-8-git-send-email-rth@twiddle.net>
Download mbox | patch
Permalink /patch/185861/
State New
Headers show

Comments

Richard Henderson - Sept. 21, 2012, 5:13 p.m.
When movcond_i32 is available we can further reduce the generated
op count from 12 to 6, and the generated code size on i686 from
88 to 74 bytes.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/tcg-op.h | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)
Aurelien Jarno - Sept. 21, 2012, 9:23 p.m.
On Fri, Sep 21, 2012 at 10:13:40AM -0700, Richard Henderson wrote:
> When movcond_i32 is available we can further reduce the generated
> op count from 12 to 6, and the generated code size on i686 from
> 88 to 74 bytes.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
>  tcg/tcg-op.h | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
> index 3e375ea..0145a09 100644
> --- a/tcg/tcg-op.h
> +++ b/tcg/tcg-op.h
> @@ -2147,16 +2147,24 @@ static inline void tcg_gen_movcond_i64(TCGCond cond, TCGv_i64 ret,
>          tcg_gen_op6i_i32(INDEX_op_setcond2_i32, t0,
>                           TCGV_LOW(c1), TCGV_HIGH(c1),
>                           TCGV_LOW(c2), TCGV_HIGH(c2), cond);
> -        tcg_gen_neg_i32(t0, t0);
>  
> -        tcg_gen_and_i32(t1, TCGV_LOW(v1), t0);
> -        tcg_gen_andc_i32(TCGV_LOW(ret), TCGV_LOW(v2), t0);
> -        tcg_gen_or_i32(TCGV_LOW(ret), TCGV_LOW(ret), t1);
> +        if (TCG_TARGET_HAS_movcond_i32) {
> +            tcg_gen_movi_i32(t1, 0);
> +            tcg_gen_movcond_i32(TCG_COND_NE, TCGV_LOW(ret), t0, t1,
> +                                TCGV_LOW(v1), TCGV_LOW(v2));
> +            tcg_gen_movcond_i32(TCG_COND_NE, TCGV_HIGH(ret), t0, t1,
> +                                TCGV_HIGH(v1), TCGV_HIGH(v2));
> +        } else {
> +            tcg_gen_neg_i32(t0, t0);
>  
> -        tcg_gen_and_i32(t1, TCGV_HIGH(v1), t0);
> -        tcg_gen_andc_i32(TCGV_HIGH(ret), TCGV_HIGH(v2), t0);
> -        tcg_gen_or_i32(TCGV_HIGH(ret), TCGV_HIGH(ret), t1);
> +            tcg_gen_and_i32(t1, TCGV_LOW(v1), t0);
> +            tcg_gen_andc_i32(TCGV_LOW(ret), TCGV_LOW(v2), t0);
> +            tcg_gen_or_i32(TCGV_LOW(ret), TCGV_LOW(ret), t1);
>  
> +            tcg_gen_and_i32(t1, TCGV_HIGH(v1), t0);
> +            tcg_gen_andc_i32(TCGV_HIGH(ret), TCGV_HIGH(v2), t0);
> +            tcg_gen_or_i32(TCGV_HIGH(ret), TCGV_HIGH(ret), t1);
> +        }
>          tcg_temp_free_i32(t0);
>          tcg_temp_free_i32(t1);
>      } else {

At some point I tried to think how to implement movcond_i64 for 
MIPS directly in the backend. I just tried your patch, and I got
this kind of code:

| 0x2bb2ae58:  sltu       at,zero,s4
| 0x2bb2ae5c:  sltu       t0,zero,s3
| 0x2bb2ae60:  or s3,at,t0
| 0x2bb2ae64:  movz       s1,s5,s3
| 0x2bb2ae68:  movz       s2,s6,s3
|
| (in some cases some constants/globals loading appear in the middle, but
| that's not due to movcond).

It's basically the kind of code I would have written. It's clearly
better to implement it directly in TCG.

Now I wonder if it wouldn't be better to write brcond2 as setcond2 +
brcond. And even setcond2 as a pair of setcond in TCG, which would allow
some optimizations in case both high parts are zero.

Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Aurelien Jarno <aurelien@aurel32.net>
Richard Henderson - Sept. 21, 2012, 10:27 p.m.
On 09/21/2012 02:23 PM, Aurelien Jarno wrote:
> Now I wonder if it wouldn't be better to write brcond2 as setcond2 +
> brcond. And even setcond2 as a pair of setcond in TCG, which would allow
> some optimizations in case both high parts are zero.

I think brcond2 vs setcond2 is a choice that has to be made on a
host-by-host basis.  E.g. for i386 we implement setcond2 with branches.
E.g. for hppa setcond2, while not using branches, is twice the size of
brcond2.  But there's nothing saying you couldn't have the mips
version of brcond2 use setcond2 internals to do its job.

On the other hand, having tcg/optimize.c reduce both setcond2 and brcond2
to setcond and brcond with the appropriate values of zero would be a most
welcome improvement.

Also, I've been thinking about having a tcg.h function that produces the
cond-without-equality table that I introduced to fix hppa recently.  Using
that could reduce code size in some of the other backends as well.


r~
Aurelien Jarno - Sept. 22, 2012, 9:30 a.m.
On Fri, Sep 21, 2012 at 03:27:52PM -0700, Richard Henderson wrote:
> On 09/21/2012 02:23 PM, Aurelien Jarno wrote:
> > Now I wonder if it wouldn't be better to write brcond2 as setcond2 +
> > brcond. And even setcond2 as a pair of setcond in TCG, which would allow
> > some optimizations in case both high parts are zero.
> 
> I think brcond2 vs setcond2 is a choice that has to be made on a
> host-by-host basis.  E.g. for i386 we implement setcond2 with branches.
> E.g. for hppa setcond2, while not using branches, is twice the size of
> brcond2.  But there's nothing saying you couldn't have the mips
> version of brcond2 use setcond2 internals to do its job.

Yes and now. Currently the MIPS version of setcond2, uses 2 temporary
registers and the return argument. This is therefore not possible to
make brcond2 to use setcond2. 

I have just seen it's possible to reorder the instructions to use two 
temporaries only, but you see the problem there. Doing that as the TCG
level means you can use more temporaries, at the backend level means you
have to play with the registers you have.

> On the other hand, having tcg/optimize.c reduce both setcond2 and brcond2
> to setcond and brcond with the appropriate values of zero would be a most
> welcome improvement.

That's a good idea.

> Also, I've been thinking about having a tcg.h function that produces the
> cond-without-equality table that I introduced to fix hppa recently.  Using
> that could reduce code size in some of the other backends as well.
> 

I am using such a table in the MIPS backend, but done with a big switch.

Patch

diff --git a/tcg/tcg-op.h b/tcg/tcg-op.h
index 3e375ea..0145a09 100644
--- a/tcg/tcg-op.h
+++ b/tcg/tcg-op.h
@@ -2147,16 +2147,24 @@  static inline void tcg_gen_movcond_i64(TCGCond cond, TCGv_i64 ret,
         tcg_gen_op6i_i32(INDEX_op_setcond2_i32, t0,
                          TCGV_LOW(c1), TCGV_HIGH(c1),
                          TCGV_LOW(c2), TCGV_HIGH(c2), cond);
-        tcg_gen_neg_i32(t0, t0);
 
-        tcg_gen_and_i32(t1, TCGV_LOW(v1), t0);
-        tcg_gen_andc_i32(TCGV_LOW(ret), TCGV_LOW(v2), t0);
-        tcg_gen_or_i32(TCGV_LOW(ret), TCGV_LOW(ret), t1);
+        if (TCG_TARGET_HAS_movcond_i32) {
+            tcg_gen_movi_i32(t1, 0);
+            tcg_gen_movcond_i32(TCG_COND_NE, TCGV_LOW(ret), t0, t1,
+                                TCGV_LOW(v1), TCGV_LOW(v2));
+            tcg_gen_movcond_i32(TCG_COND_NE, TCGV_HIGH(ret), t0, t1,
+                                TCGV_HIGH(v1), TCGV_HIGH(v2));
+        } else {
+            tcg_gen_neg_i32(t0, t0);
 
-        tcg_gen_and_i32(t1, TCGV_HIGH(v1), t0);
-        tcg_gen_andc_i32(TCGV_HIGH(ret), TCGV_HIGH(v2), t0);
-        tcg_gen_or_i32(TCGV_HIGH(ret), TCGV_HIGH(ret), t1);
+            tcg_gen_and_i32(t1, TCGV_LOW(v1), t0);
+            tcg_gen_andc_i32(TCGV_LOW(ret), TCGV_LOW(v2), t0);
+            tcg_gen_or_i32(TCGV_LOW(ret), TCGV_LOW(ret), t1);
 
+            tcg_gen_and_i32(t1, TCGV_HIGH(v1), t0);
+            tcg_gen_andc_i32(TCGV_HIGH(ret), TCGV_HIGH(v2), t0);
+            tcg_gen_or_i32(TCGV_HIGH(ret), TCGV_HIGH(ret), t1);
+        }
         tcg_temp_free_i32(t0);
         tcg_temp_free_i32(t1);
     } else {