Patchwork [i386] scalar ops that preserve the high part of a vector

login
register
mail settings
Submitter Marc Glisse
Date Oct. 13, 2012, 8:52 a.m.
Message ID <alpine.DEB.2.02.1210131032460.9651@stedding.saclay.inria.fr>
Download mbox | patch
Permalink /patch/191287/
State New
Headers show

Comments

Marc Glisse - Oct. 13, 2012, 8:52 a.m.
Hello,

this patch provides an alternate pattern to let combine recognize scalar 
operations that preserve the high part of a vector. If the strategy is all 
right, I could do the same for more operations (mul, div, ...). Something 
similar is also possible for V4SF (different pattern though), but probably 
not as useful.

bootstrap+testsuite ok.

2012-10-13  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855

gcc/
 	* config/i386/sse.md (*sse2_vm<plusminus_insn>v2df3): New define_insn.

gcc/testsuite/
 	* gcc.target/i386/pr54855.c: New testcase.
Uros Bizjak - Oct. 14, 2012, 8:45 a.m.
On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Hello,
>
> this patch provides an alternate pattern to let combine recognize scalar
> operations that preserve the high part of a vector. If the strategy is all
> right, I could do the same for more operations (mul, div, ...). Something
> similar is also possible for V4SF (different pattern though), but probably
> not as useful.

But, we _do_ have vec_merge pattern that describes the operation.
Adding another one to each operation just to satisfy combine is IMO
not correct approach. I'd rather see generic RTX simplification that
simplifies your proposed pattern to vec_merge pattern. Also, as you
mention in PR54855, Comment #5, the approach is too fragile...

Uros.
Marc Glisse - Oct. 14, 2012, 11:42 a.m.
On Sun, 14 Oct 2012, Uros Bizjak wrote:

> On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> Hello,
>>
>> this patch provides an alternate pattern to let combine recognize scalar
>> operations that preserve the high part of a vector. If the strategy is all
>> right, I could do the same for more operations (mul, div, ...). Something
>> similar is also possible for V4SF (different pattern though), but probably
>> not as useful.
>
> But, we _do_ have vec_merge pattern that describes the operation.
> Adding another one to each operation just to satisfy combine is IMO
> not correct approach.

At some point I wondered about _replacing_ the existing pattern, so there 
would only be one ;-)

The vec_merge pattern takes as argument 2 vectors instead of a vector and 
a scalar, and describes the operation as a vector operation where we drop 
half of the result, instead of a scalar operation where we re-add the top 
half of the vector. I don't know if that's the most convenient choice. 
Adding code in simplify-rtx to replace vec_merge with vec_concat / 
vec_select might be easier than the other way around.


If the middle-end somehow gave us:
(plus X (vec_concat Y 0))
it would seem a bit strange to add an optimization that turns it into:
(vec_merge (plus X (subreg:V2DF Y)) X 1)
but then producing:
(vec_concat (plus (vec_select X 0) Y) (vec_select X 1))
would be strange as well.
(ignoring the signed zero issues here)

> I'd rather see generic RTX simplification that
> simplifies your proposed pattern to vec_merge pattern.

Ok, I'll see what I can do.

> Also, as you mention in PR54855, Comment #5, the approach is too 
> fragile...

I am not sure I can make the RTX simplification much less fragile... 
Whenever I see (vec_concat X (vec_select Y 1)), I would have to check 
whether X is some (possibly large) tree of scalar computations involving 
Y[0], move it all to vec_merge computations, and fix other users of some 
of those scalars to now use S[0]. Seems too hard, I would stop at 
single-operation X that is used only once. Besides, the gain is larger in 
proportion when there is a single operation :-)

Thank you for your comments,
Marc Glisse - Nov. 30, 2012, 12:34 p.m.
On Sun, 14 Oct 2012, Marc Glisse wrote:

> On Sun, 14 Oct 2012, Uros Bizjak wrote:
>
>> On Sat, Oct 13, 2012 at 10:52 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>>> Hello,
>>> 
>>> this patch provides an alternate pattern to let combine recognize scalar
>>> operations that preserve the high part of a vector. If the strategy is all
>>> right, I could do the same for more operations (mul, div, ...). Something
>>> similar is also possible for V4SF (different pattern though), but probably
>>> not as useful.
>> 
>> But, we _do_ have vec_merge pattern that describes the operation.
>> Adding another one to each operation just to satisfy combine is IMO
>> not correct approach.
>
> At some point I wondered about _replacing_ the existing pattern, so there 
> would only be one ;-)
>
> The vec_merge pattern takes as argument 2 vectors instead of a vector and a 
> scalar, and describes the operation as a vector operation where we drop half 
> of the result, instead of a scalar operation where we re-add the top half of 
> the vector. I don't know if that's the most convenient choice. Adding code in 
> simplify-rtx to replace vec_merge with vec_concat / vec_select might be 
> easier than the other way around.
>
>
> If the middle-end somehow gave us:
> (plus X (vec_concat Y 0))
> it would seem a bit strange to add an optimization that turns it into:
> (vec_merge (plus X (subreg:V2DF Y)) X 1)
> but then producing:
> (vec_concat (plus (vec_select X 0) Y) (vec_select X 1))
> would be strange as well.
> (ignoring the signed zero issues here)
>
>> I'd rather see generic RTX simplification that
>> simplifies your proposed pattern to vec_merge pattern.
>
> Ok, I'll see what I can do.
>
>> Also, as you mention in PR54855, Comment #5, the approach is too fragile...
>
> I am not sure I can make the RTX simplification much less fragile... Whenever 
> I see (vec_concat X (vec_select Y 1)), I would have to check whether X is 
> some (possibly large) tree of scalar computations involving Y[0], move it all 
> to vec_merge computations, and fix other users of some of those scalars to 
> now use S[0]. Seems too hard, I would stop at single-operation X that is used 
> only once. Besides, the gain is larger in proportion when there is a single 
> operation :-)
>
> Thank you for your comments,

Hello,

I experimented with the simplify-rtx transformation you suggested, see:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855

It works when the argument is a register, but not for memory (which is 
where the constant is in the testcase). And the description of the 
operation in sse.md does seem problematic. It says the second argument is:

             (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))

but Intel's documentation says "The source operand can be an XMM register 
or a 64-bit memory location", not quite the same.

Do you think the .md description should really stay this way, or could we 
change it to something that better reflects "64-bit memory location"?
Uros Bizjak - Nov. 30, 2012, 1:42 p.m.
On Fri, Nov 30, 2012 at 1:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

> Hello,
>
> I experimented with the simplify-rtx transformation you suggested, see:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54855
>
> It works when the argument is a register, but not for memory (which is where
> the constant is in the testcase). And the description of the operation in
> sse.md does seem problematic. It says the second argument is:
>
>             (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
>
> but Intel's documentation says "The source operand can be an XMM register or
> a 64-bit memory location", not quite the same.
>
> Do you think the .md description should really stay this way, or could we
> change it to something that better reflects "64-bit memory location"?

For reference, we are talking about:

(define_insn "<sse>_vm<plusminus_insn><mode>3"
  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
	(vec_merge:VF_128
	  (plusminus:VF_128
	    (match_operand:VF_128 1 "register_operand" "0,x")
	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
	  (match_dup 1)
	  (const_int 1)))]
  "TARGET_SSE"
  "@
   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
  [(set_attr "isa" "noavx,avx")
   (set_attr "type" "sseadd")
   (set_attr "prefix" "orig,vex")
   (set_attr "mode" "<ssescalarmode>")])

No, looking at your description, the operand 2 should be scalar
operand (we use _s{s,d} scalar instruction here), and for doubles this
should refer to 64bit memory location. I don't remember all the
details about vec_merge scalar instructions, but it looks to me that
canonical representation should be more like your proposal:

+(define_insn "*sse2_vm<plusminus_insn>v2df3"
+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
+    (vec_concat:V2DF
+      (plusminus:DF
+        (vec_select:DF
+          (match_operand:V2DF 1 "register_operand" "0,x")
+          (parallel [(const_int 0)]))
+        (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
+      (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
+  "TARGET_SSE2"

Uros.
Marc Glisse - Nov. 30, 2012, 10:32 p.m.
On Fri, 30 Nov 2012, Uros Bizjak wrote:

> For reference, we are talking about:
>
> (define_insn "<sse>_vm<plusminus_insn><mode>3"
>  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
> 	(vec_merge:VF_128
> 	  (plusminus:VF_128
> 	    (match_operand:VF_128 1 "register_operand" "0,x")
> 	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> 	  (match_dup 1)
> 	  (const_int 1)))]
>  "TARGET_SSE"
>  "@
>   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
>   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
>  [(set_attr "isa" "noavx,avx")
>   (set_attr "type" "sseadd")
>   (set_attr "prefix" "orig,vex")
>   (set_attr "mode" "<ssescalarmode>")])
>
> No, looking at your description, the operand 2 should be scalar
> operand (we use _s{s,d} scalar instruction here), and for doubles this
> should refer to 64bit memory location. I don't remember all the
> details about vec_merge scalar instructions, but it looks to me that
> canonical representation should be more like your proposal:
>
> +(define_insn "*sse2_vm<plusminus_insn>v2df3"
> +  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
> +    (vec_concat:V2DF
> +      (plusminus:DF
> +        (vec_select:DF
> +          (match_operand:V2DF 1 "register_operand" "0,x")
> +          (parallel [(const_int 0)]))
> +        (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
> +      (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
> +  "TARGET_SSE2"

Thank you.

Among the following possible patterns, my choice (if nobody objects) is to 
use 4) for V2DF and 3) (rewritten without iterators) for V4SF. The 
question is then what should be done about the builtins and intrinsics. 
_mm_add_sd takes two __m128. If I change the signature of 
__builtin_ia32_addsd, I can make _mm_add_sd pass __B[0] as second 
argument, but I don't know if I am allowed to change that signature. 
Otherwise I guess I'll need to keep a separate expander for it (I'd rather 
not). And then there are several other operations than +/- to handle.


1) Current pattern:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (plusminus:VF_128
 	    (match_operand:VF_128 1 "register_operand" "0,x")
 	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
 	  (match_dup 1)
 	  (const_int 1)))]

2) Minimal fix:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (plusminus:VF_128
 	    (match_operand:VF_128 1 "register_operand" "0,x")
 	    (vec_duplicate:VF_128
 	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]

3) With the operation in scalar mode:

   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
 	  (vec_duplicate:VF_128
 	    (plusminus:<ssescalarmode>
 	      (vec_select:<ssescalarmode>
 		(match_operand:VF_128 1 "register_operand" "0,x")
 		(parallel [(const_int 0)]))
 	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))))
 	  (match_dup 1)
 	  (const_int 1)))]

4) Special version which only makes sense for vectors of 2 elements:

   [(set (match_operand:V2DF 0 "register_operand" "=x,x")
 	(vec_concat:V2DF
 	  (plusminus:DF
 	    (vec_select:DF
 	      (match_operand:V2DF 1 "register_operand" "0,x")
 	      (parallel [(const_int 0)]))
 	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
 	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]

Patch

Index: config/i386/sse.md

===================================================================
--- config/i386/sse.md	(revision 192420)

+++ config/i386/sse.md	(working copy)

@@ -812,20 +812,38 @@ 

 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
 
+(define_insn "*sse2_vm<plusminus_insn>v2df3"

+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")

+	(vec_concat:V2DF

+	  (plusminus:DF

+	    (vec_select:DF 

+	      (match_operand:V2DF 1 "register_operand" "0,x")

+	      (parallel [(const_int 0)]))

+	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))

+	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]

+  "TARGET_SSE2"

+  "@

+   <plusminus_mnemonic>sd\t{%2, %0|%0, %2}

+   v<plusminus_mnemonic>sd\t{%2, %1, %0|%0, %1, %2}"

+  [(set_attr "isa" "noavx,avx")

+   (set_attr "type" "sseadd")

+   (set_attr "prefix" "orig,vex")

+   (set_attr "mode" "DF")])

+

 (define_expand "mul<mode>3"
   [(set (match_operand:VF 0 "register_operand")
 	(mult:VF
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE"
   "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")
 
 (define_insn "*mul<mode>3"
   [(set (match_operand:VF 0 "register_operand" "=x,x")
Index: testsuite/gcc.target/i386/pr54855.c

===================================================================
--- testsuite/gcc.target/i386/pr54855.c	(revision 0)

+++ testsuite/gcc.target/i386/pr54855.c	(revision 0)

@@ -0,0 +1,18 @@ 

+/* { dg-do compile } */

+/* { dg-options "-O -msse2" } */

+

+typedef double vec __attribute__((vector_size(16)));

+

+vec f (vec x)

+{

+  x[0] += 2;

+  return x;

+}

+

+vec g (vec x)

+{

+  x[0] -= 1;

+  return x;

+}

+

+/* { dg-final { scan-assembler-not "mov" } } */