[simplify-rtx] Simplify vec_merge of vec_duplicates into vec_concat

Message ID 593669EA.9090005@foss.arm.com
State New

Commit Message

Kyrill Tkachov June 6, 2017, 8:38 a.m. UTC
Hi all,

Another vec_merge simplification that's missing from simplify-rtx.c is transforming
a vec_merge of two vec_duplicates into a vec_concat. For example:
(set (reg:V2DF 80)
     (vec_merge:V2DF (vec_duplicate:V2DF (reg:DF 84))
         (vec_duplicate:V2DF (reg:DF 81))
         (const_int 2)))

can be transformed into the simpler:
(set (reg:V2DF 80)
     (vec_concat:V2DF (reg:DF 81)
                 (reg:DF 84)))
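
To make the lane bookkeeping concrete, here is a small standalone C++ model of the three RTL operations involved. It is only a sketch: V2 and the helper functions are hypothetical stand-ins written for illustration, not GCC internals, and it assumes the documented vec_merge convention that bit i of the selector picks lane i from the first operand when set and from the second operand when clear.

```cpp
#include <array>
#include <cassert>

// Hypothetical two-lane vector standing in for a V2DF value.
using V2 = std::array<double, 2>;

// vec_duplicate: broadcast one scalar to every lane.
V2 vec_duplicate(double x) { return {x, x}; }

// vec_concat: lane 0 is the first operand, lane 1 is the second.
V2 vec_concat(double lo, double hi) { return {lo, hi}; }

// vec_merge: lane i comes from a when bit i of sel is set, else from b.
V2 vec_merge(const V2 &a, const V2 &b, unsigned sel) {
  V2 r{};
  for (unsigned i = 0; i < 2; ++i)
    r[i] = ((sel >> i) & 1u) ? a[i] : b[i];
  return r;
}

// Mirrors the example above: selector 2 takes lane 1 from the first
// operand (reg 84) and lane 0 from the second operand (reg 81).
bool example_holds(double r84, double r81) {
  return vec_merge(vec_duplicate(r84), vec_duplicate(r81), 2)
         == vec_concat(r81, r84);
}
```

With selector 1 the roles flip, so the same model gives vec_concat of the first scalar followed by the second.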

I believe this should always be beneficial.
I'm still looking into finding a small testcase demonstrating this, but on aarch64 SPEC
I've seen this eliminate some really bizarre codegen where GCC was generating nonsense like:
   ldr q18, [sp, 448]
   ins v18.d[0], v23.d[0]
   ins v18.d[1], v22.d[0]

with q18 being pushed onto and popped off the stack in the prologue and epilogue of the function!
These are large files from SPEC, and I haven't yet been able to analyse why GCC even attempts
to do that, but with this patch it no longer loads a register only to overwrite all of its lanes.
This patch shaves off about 5k of code size from zeusmp on aarch64 at -O3, so I believe it's a good
thing to do.

Ok?

Thanks,
Kyrill

2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
     of two vec_duplicates into a vec_concat.

Comments

Jeff Law June 27, 2017, 10:26 p.m. UTC | #1
On 06/06/2017 02:38 AM, Kyrill Tkachov wrote:
> Hi all,
> 
> Another vec_merge simplification that's missing from simplify-rtx.c is
> transforming
> a vec_merge of two vec_duplicates. For example:
> (set (reg:V2DF 80)
>     (vec_merge:V2DF (vec_duplicate:V2DF (reg:DF 84))
>         (vec_duplicate:V2DF (reg:DF 81))
>         (const_int 2)))
> 
> Can be transformed into the simpler:
> (set (reg:V2DF 80)
>     (vec_concat:V2DF (reg:DF 81)
>                 (reg:DF 84)))
> 
> I believe this should always be beneficial.
> I'm still looking into finding a small testcase demonstrating this, but
> on aarch64 SPEC
> I've seen this eliminate some really bizarre codegen where GCC was
> generating nonsense like:
>   ldr q18, [sp, 448]
>   ins v18.d[0], v23.d[0]
>   ins v18.d[1], v22.d[0]
> 
> With q18 being pushed and popped off the stack in the prologue and
> epilogue of the function!
> These are large files from SPEC that I haven't been able to analyse yet
> as to why GCC even attempts
> to do that, but with this patch it doesn't try to load a register and
> overwrite all its lanes.
> This patch shaves off about 5k of code size from zeusmp on aarch64 at
> -O3, so I believe it's a good
> thing to do.
> 
> Ok?
> 
> Thanks,
> Kyrill
> 
> 2017-06-06  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     * simplify-rtx.c (simplify_ternary_operation): Simplify vec_merge
>     of two vec_duplicates into a vec_concat.
OK.  Though I'd really like to see a testcase to exercise the
simplification.

jeff

Patch

diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 0727ca690e9d7f2c14907e3888e67da31ecb1ed6..ac7c4131c2ffef44e66cdc95f09b7bf4d4ce5192 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5760,6 +5760,24 @@  simplify_ternary_operation (enum rtx_code code, machine_mode mode,
 	      if (!side_effects_p (otherop))
 		return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
 	    }
+
+	  /* Replace (vec_merge (vec_duplicate x) (vec_duplicate y)
+				 (const_int n))
+	     with (vec_concat x y) or (vec_concat y x) depending on value
+	     of N.  */
+	  if (GET_CODE (op0) == VEC_DUPLICATE
+	      && GET_CODE (op1) == VEC_DUPLICATE
+	      && GET_MODE_NUNITS (GET_MODE (op0)) == 2
+	      && GET_MODE_NUNITS (GET_MODE (op1)) == 2
+	      && IN_RANGE (sel, 1, 2))
+	    {
+	      rtx newop0 = XEXP (op0, 0);
+	      rtx newop1 = XEXP (op1, 0);
+	      if (sel == 2)
+		std::swap (newop0, newop1);
+
+	      return simplify_gen_binary (VEC_CONCAT, mode, newop0, newop1);
+	    }
 	}
 
       if (rtx_equal_p (op0, op1)
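
The guard in the new hunk can be read as a small decision function. The sketch below restates it outside of GCC; Order and concat_order are hypothetical names introduced here, and x and y refer to the scalars inside the first and second vec_duplicate as in the patch comment. Selectors 0 and 3 are rejected because, under the vec_merge lane convention, they take both lanes from a single operand, so the result is just one of the duplicates rather than a concat of two different scalars.

```cpp
#include <cassert>
#include <optional>

// Hypothetical result type: which scalar ends up in lane 0 of the vec_concat.
enum class Order { XThenY, YThenX };

// Restates the patch's condition: only two-lane modes and selectors 1 or 2
// qualify. Selector 1 keeps (x, y); selector 2 swaps to (y, x), matching the
// std::swap in the patch.
std::optional<Order> concat_order(unsigned sel, unsigned nunits) {
  if (nunits != 2 || sel < 1 || sel > 2)
    return std::nullopt;  // leave the expression for other simplifications
  return sel == 1 ? Order::XThenY : Order::YThenX;
}
```

For the V2DF example in the commit message, concat_order(2, 2) yields Order::YThenX, i.e. (vec_concat (reg:DF 81) (reg:DF 84)).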