diff mbox series

[v2] rs6000: Prefer assigning the MMA vector operands to altivec registers [PR105556]

Message ID 90380588-1f25-3efb-376b-a9820fe3bd46@linux.ibm.com
State New
Headers show
Series [v2] rs6000: Prefer assigning the MMA vector operands to altivec registers [PR105556] | expand

Commit Message

Peter Bergner May 16, 2022, 10:31 p.m. UTC
On 5/10/22 5:35 PM, Segher Boessenkool wrote:
> Out of interest, did you try using v,?wa (so just two alternatives, not
> four)?  Or did you think it wouldresult in  measurably worse code?  Or
> did you decide it is not such bad backend code size explosion after
> all :-)

So I tried using just "v,?wa" instead of the 4 alternative "v,v,?d,?d"
version and that fixes the performance issue too and is simpler too.
The other option is "better", in that it can allow one operand to get
a "v" reg when the other gets a "d" reg, but I think that's just a
micro-optimization and not worth the extra complexity in the pattern.
Thanks for the suggestion!

Changes since v1:
	* Use v,?wa constraints rather than v,v,?d,?d.
	* Update git log entry and ChangeLog text.


When optimizing the DGEMM kernel in OpenBLAS to use MMA, the MMA code
uses all 8 accumulators, which overlap all vs0-vs31 vector registers.
Current trunk assigns one of the normal vector inputs to one of the MMA
instructions, which forces us to spill one of the accumulators to memory,
leading to poor performance.  The solution here is to replace the "wa"
constraints for the vector input operands in the MMA instruction patterns
with "v,?wa" so that we prefer using the altivec registers vs32-vs63
over the vs0-vs31 registers.

Bootstrap and regtesting on powerpc64le-linux showed no regressions.
Ok for trunk and the GCC12 release branch after some burn-in time on
trunk?

Technically, the same issue exists in GCC11 and GCC10, but the RA
assignment is OK with the current code, so unless/until we have a
test case that exhibits the issue, I'm only asking for a backport
to GCC12 which does show the performance problem.

Peter


gcc/
	PR target/105556
	* config/rs6000/mma.md (mma_<vv>, mma_<avv>, mma_<pv>, mma_<apv>,
	mma_<vvi4i4i8>, mma_<avvi4i4i8>, mma_<vvi4i4i2>, mma_<avvi4i4i2>,
	mma_<vvi4i4>, mma_<avvi4i4>, mma_<pvi4i2>, mma_<apvi4i2>,
	mma_<vvi4i4i4>, mma_<avvi4i4i4>): Replace "wa" constraints with "v,?wa".
	Update other operands accordingly.

Comments

Segher Boessenkool May 17, 2022, 11:41 p.m. UTC | #1
On Mon, May 16, 2022 at 05:31:31PM -0500, Peter Bergner wrote:
> On 5/10/22 5:35 PM, Segher Boessenkool wrote:
> > Out of interest, did you try using v,?wa (so just two alternatives, not
> > four)?  Or did you think it wouldresult in  measurably worse code?  Or
> > did you decide it is not such bad backend code size explosion after
> > all :-)
> 
> So I tried using just "v,?wa" instead of the 4 alternative "v,v,?d,?d"
> version and that fixes the performance issue too and is simpler too.
> The other option is "better", in that it can allow one operand to get
> a "v" reg when the other gets a "d" reg, but I think that's just a
> micro-optimization and not worth the extra complexity in the pattern.
> Thanks for the suggestion!

The difference is that "v,?wa" makes no difference between one or more
lower VSRs used.  But whenever you would see that there is so much
register allocation pressure already that it does not change anything
materially.

> gcc/
> 	PR target/105556
> 	* config/rs6000/mma.md (mma_<vv>, mma_<avv>, mma_<pv>, mma_<apv>,
> 	mma_<vvi4i4i8>, mma_<avvi4i4i8>, mma_<vvi4i4i2>, mma_<avvi4i4i2>,
> 	mma_<vvi4i4>, mma_<avvi4i4>, mma_<pvi4i2>, mma_<apvi4i2>,
> 	mma_<vvi4i4i4>, mma_<avvi4i4i4>): Replace "wa" constraints with "v,?wa".
> 	Update other operands accordingly.

>  (define_insn "mma_<vv>"
> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
> -		    (match_operand:V16QI 2 "vsx_register_operand" "wa")]
> +  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
> +		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]

You now have two "?" on alternative 1, instead of just one.  This is the
same as if you had had
  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,??wa")
		    (match_operand:V16QI 2 "vsx_register_operand" "v,wa")]
The "?" are per alternative, not really per operand.  It won't change
much here of course, just penalise more than you perhaps expected.

With or without that changed: okay for trunk and for 12 (after the usual
cooldown).  Thanks!  Also okay for 11 and 10, shoukd you want that later
anyway.


Segher
Peter Bergner May 18, 2022, 2:40 a.m. UTC | #2
On 5/17/22 6:41 PM, Segher Boessenkool wrote:
> On Mon, May 16, 2022 at 05:31:31PM -0500, Peter Bergner wrote:
>>  (define_insn "mma_<vv>"
>> -  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
>> -	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
>> -		    (match_operand:V16QI 2 "vsx_register_operand" "wa")]
>> +  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
>> +	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
>> +		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
> 
> You now have two "?" on alternative 1, instead of just one.  This is the
> same as if you had had
>   [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
> 	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,??wa")
> 		    (match_operand:V16QI 2 "vsx_register_operand" "v,wa")]
> The "?" are per alternative, not really per operand.  It won't change
> much here of course, just penalise more than you perhaps expected.

Ak, ok.  I think giving an extra penalty is fine here, since we really
really want to use altivec regs here, so I went with the patch as is.
Pushed.

I'll wait a few days before backporting to GCC 12.  As for GCC 11 & 10,
I'll wait until someone actually has a test case that shows the same
problem.  I suspect a patch went into GCC 12 that changed the costs
slightly and that's why we don't see the problem on the older branches.
Thanks!

Peter
diff mbox series

Patch

diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 907c9d6d516..a183b6a168a 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -490,50 +490,50 @@  (define_insn "mma_xxsetaccz"
   [(set_attr "type" "mma")])
 
 (define_insn "mma_<vv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
 		    MMA_VV))]
   "TARGET_MMA"
   "<vv> %A0,%x1,%x2"
   [(set_attr "type" "mma")])
 
 (define_insn "mma_<avv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
 		    MMA_AVV))]
   "TARGET_MMA"
   "<avv> %A0,%x2,%x3"
   [(set_attr "type" "mma")])
 
 (define_insn "mma_<pv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")]
 		    MMA_PV))]
   "TARGET_MMA"
   "<pv> %A0,%x1,%x2"
   [(set_attr "type" "mma")])
 
 (define_insn "mma_<apv>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:OO 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")]
 		    MMA_APV))]
   "TARGET_MMA"
   "<apv> %A0,%x2,%x3"
   [(set_attr "type" "mma")])
 
 (define_insn "mma_<vvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "u8bit_cint_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "u8bit_cint_operand" "n,n")]
 		    MMA_VVI4I4I8))]
   "TARGET_MMA"
   "<vvi4i4i8> %A0,%x1,%x2,%3,%4,%5"
@@ -541,13 +541,13 @@  (define_insn "mma_<vvi4i4i8>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<avvi4i4i8>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n")
-		    (match_operand:SI 6 "u8bit_cint_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 6 "u8bit_cint_operand" "n,n")]
 		    MMA_AVVI4I4I8))]
   "TARGET_MMA"
   "<avvi4i4i8> %A0,%x2,%x3,%4,%5,%6"
@@ -555,12 +555,12 @@  (define_insn "mma_<avvi4i4i8>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<vvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
 		    MMA_VVI4I4I2))]
   "TARGET_MMA"
   "<vvi4i4i2> %A0,%x1,%x2,%3,%4,%5"
@@ -568,13 +568,13 @@  (define_insn "mma_<vvi4i4i2>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<avvi4i4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n")
-		    (match_operand:SI 6 "const_0_to_3_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 6 "const_0_to_3_operand" "n,n")]
 		    MMA_AVVI4I4I2))]
   "TARGET_MMA"
   "<avvi4i4i2> %A0,%x2,%x3,%4,%5,%6"
@@ -582,11 +582,11 @@  (define_insn "mma_<avvi4i4i2>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<vvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")]
 		    MMA_VVI4I4))]
   "TARGET_MMA"
   "<vvi4i4> %A0,%x1,%x2,%3,%4"
@@ -594,12 +594,12 @@  (define_insn "mma_<vvi4i4>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<avvi4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
 		    MMA_AVVI4I4))]
   "TARGET_MMA"
   "<avvi4i4> %A0,%x2,%x3,%4,%5"
@@ -607,11 +607,11 @@  (define_insn "mma_<avvi4i4>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<pvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n")
-		    (match_operand:SI 4 "const_0_to_3_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:OO 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 4 "const_0_to_3_operand" "n,n")]
 		    MMA_PVI4I2))]
   "TARGET_MMA"
   "<pvi4i2> %A0,%x1,%x2,%3,%4"
@@ -619,12 +619,12 @@  (define_insn "mma_<pvi4i2>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<apvi4i2>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:OO 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_3_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:OO 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_3_operand" "n,n")]
 		    MMA_APVI4I2))]
   "TARGET_MMA"
   "<apvi4i2> %A0,%x2,%x3,%4,%5"
@@ -632,12 +632,12 @@  (define_insn "mma_<apvi4i2>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<vvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:SI 3 "const_0_to_15_operand" "n")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:V16QI 1 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 3 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")]
 		    MMA_VVI4I4I4))]
   "TARGET_MMA"
   "<vvi4i4i4> %A0,%x1,%x2,%3,%4,%5"
@@ -645,13 +645,13 @@  (define_insn "mma_<vvi4i4i4>"
    (set_attr "prefixed" "yes")])
 
 (define_insn "mma_<avvi4i4i4>"
-  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d")
-	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0")
-		    (match_operand:V16QI 2 "vsx_register_operand" "wa")
-		    (match_operand:V16QI 3 "vsx_register_operand" "wa")
-		    (match_operand:SI 4 "const_0_to_15_operand" "n")
-		    (match_operand:SI 5 "const_0_to_15_operand" "n")
-		    (match_operand:SI 6 "const_0_to_15_operand" "n")]
+  [(set (match_operand:XO 0 "fpr_reg_operand" "=&d,&d")
+	(unspec:XO [(match_operand:XO 1 "fpr_reg_operand" "0,0")
+		    (match_operand:V16QI 2 "vsx_register_operand" "v,?wa")
+		    (match_operand:V16QI 3 "vsx_register_operand" "v,?wa")
+		    (match_operand:SI 4 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 5 "const_0_to_15_operand" "n,n")
+		    (match_operand:SI 6 "const_0_to_15_operand" "n,n")]
 		    MMA_AVVI4I4I4))]
   "TARGET_MMA"
   "<avvi4i4i4> %A0,%x2,%x3,%4,%5,%6"