diff mbox

[10/11,RS6000] Migrate reduction optabs to reduc_..._scal

Message ID 54635096.1040508@arm.com
State New
Headers show

Commit Message

Alan Lawrence Nov. 12, 2014, 12:20 p.m. UTC
So I'm no expert on RS6000 here, but following on from Segher's observation 
about the change in pattern...so the difference in 'expand' is exactly that, a 
vsx_reduc_splus_v2df followed by a vec_extract to DF, becomes a 
vsx_reduc_splus_v2df_scalar - as I expected the combiner to produce by combining 
the two previous insns.

However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, 
when the combiner tries to put those two together, I see:

Trying 30 -> 31:
Failed to match this instruction:
(set (reg:DF 179 [ stmp_s_5.7D.2196 ])
     (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
                 (parallel [
                         (const_int 1 [0x1])
                         (const_int 0 [0])
                     ]))
             (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
         (parallel [
                 (const_int 1 [0x1])
             ])))

That is, it looks like combine_simplify_rtx has transformed the (vec_concat 
(vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a 
single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.

So despite the comment (in vsx.md):

;; Combiner patterns with the vector reduction patterns that knows we can get
;; to the top element of the V2DF array without doing an extract.

It looks like the code generation prior to my patch, considered better, was 
because the combiner didn't actually use the pattern?

In that case whilst you may want to dig into register allocation, 
cannot_change_mode_class, etc., for other reasons, I think the best fix for 
migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns 
and just emit two insns, the old pattern followed by a vec_extract. The attached 
snippet does this (I won't call it a patch yet, and it applies on top of the 
previous patch - I went the route of calling the two gen functions rather than 
copying their RTL sequences, but could do the latter if that were 
preferable???), and restores code generation to the original form on your 
example above; it bootstraps OK but I'm still running check-gcc on the Compile 
Farm...

However, again on your example above, I note that if I *remove* the 
reduc_plus_scal_v2df pattern altogether, I get:

.sum:
         li 10,512        # 52   *movdi_internal64/4     [length = 4]
         ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
         xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
         mtctr 10         # 48   *movdi_internal64/11    [length = 4]
         .align 4
.L2:
         lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
         addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
         xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
         bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
         xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
         xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
         nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
         blr      # 55   return  [length = 4]

this is presumably using gcc's scalar reduction code, but (to my untrained eye 
on powerpc!) it looks even better than the first form above (the same in the 
loop, and in the reduction, an xxpermdi is replaced by a nop !)...

--Alan


Segher Boessenkool wrote:
> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
>> However, the double pattern is completely broken.  This cannot go in.
> 
> [snip]
> 
>> It is unacceptable to have to do the inner loop doing a load, vector add, and
>> store in the loop.
> 
> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
> add, the latter a float add.  And it uses the same pseudoregister for the
> accumulator throughout.  IRA decides a register is more expensive than
> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
> seem to like the subreg very much.
> 
> The new code does look nicer otherwise :-)
> 
> 
> Segher
>

Comments

Alan Lawrence Nov. 12, 2014, 6:49 p.m. UTC | #1
Have run check-gcc on gcc110.fsffrance.org (powerpc64-unknown-linux-gnu) using 
this snippet on top of original patch; no regressions.

Alan Lawrence wrote:
> So I'm no expert on RS6000 here, but following on from Segher's observation 
> about the change in pattern...so the difference in 'expand' is exactly that, a 
> vsx_reduc_splus_v2df followed by a vec_extract to DF, becomes a 
> vsx_reduc_splus_v2df_scalar - as I expected the combiner to produce by combining 
> the two previous insns.
> 
> However, inspecting the logs from -fdump-rtl-combine-all, *without* my patch, 
> when the combiner tries to put those two together, I see:
> 
> Trying 30 -> 31:
> Failed to match this instruction:
> (set (reg:DF 179 [ stmp_s_5.7D.2196 ])
>      (vec_select:DF (plus:V2DF (vec_select:V2DF (reg:V2DF 173 [ vect_s_5.6D.2195 ])
>                  (parallel [
>                          (const_int 1 [0x1])
>                          (const_int 0 [0])
>                      ]))
>              (reg:V2DF 173 [ vect_s_5.6D.2195 ]))
>          (parallel [
>                  (const_int 1 [0x1])
>              ])))
> 
> That is, it looks like combine_simplify_rtx has transformed the (vec_concat 
> (vec_select ... 1) (vec_select ... 0)) from the vsx_reduc_plus_v2df insn, into a 
> single vec_select, which does not match the vsx_reduc_plus_v2df_scalar insn.
> 
> So despite the comment (in vsx.md):
> 
> ;; Combiner patterns with the vector reduction patterns that knows we can get
> ;; to the top element of the V2DF array without doing an extract.
> 
> It looks like the code generation prior to my patch, considered better, was 
> because the combiner didn't actually use the pattern?
> 
> In that case whilst you may want to dig into register allocation, 
> cannot_change_mode_class, etc., for other reasons, I think the best fix for 
> migrating to reduc_plus_scal... is simply to avoid using the "Combiner" patterns 
> and just emit two insns, the old pattern followed by a vec_extract. The attached 
> snippet does this (I won't call it a patch yet, and it applies on top of the 
> previous patch - I went the route of calling the two gen functions rather than 
> copying their RTL sequences, but could do the latter if that were 
> preferable???), and restores code generation to the original form on your 
> example above; it bootstraps OK but I'm still running check-gcc on the Compile 
> Farm...
> 
> However, again on your example above, I note that if I *remove* the 
> reduc_plus_scal_v2df pattern altogether, I get:
> 
> .sum:
>          li 10,512        # 52   *movdi_internal64/4     [length = 4]
>          ld 9,.LC2@toc(2)         # 20   *movdi_internal64/2     [length = 4]
>          xxlxor 0,0,0     # 17   *vsx_movv2df/12 [length = 4]
>          mtctr 10         # 48   *movdi_internal64/11    [length = 4]
>          .align 4
> .L2:
>          lxvd2x 12,0,9    # 23   *vsx_movv2df/2  [length = 4]
>          addi 9,9,16      # 25   *adddi3_internal1/2     [length = 4]
>          xvadddp 0,0,12   # 24   *vsx_addv2df3/1 [length = 4]
>          bdnz .L2         # 47   *ctrdi_internal1/1      [length = 4]
>          xxsldwi 12,0,0,2         # 30   vsx_xxsldwi_v2df        [length = 4]
>          xvadddp 1,0,12   # 31   *vsx_addv2df3/1 [length = 4]
>          nop      # 37   *vsx_extract_v2df_internal2/1   [length = 4]
>          blr      # 55   return  [length = 4]
> 
> this is presumably using gcc's scalar reduction code, but (to my untrained eye 
> on powerpc!) it looks even better than the first form above (the same in the 
> loop, and in the reduction, an xxpermdi is replaced by a nop !)...
> 
> --Alan
> 
> 
> Segher Boessenkool wrote:
>> On Mon, Nov 10, 2014 at 05:36:24PM -0500, Michael Meissner wrote:
>>> However, the double pattern is completely broken.  This cannot go in.
>> [snip]
>>
>>> It is unacceptable to have to do the inner loop doing a load, vector add, and
>>> store in the loop.
>> Before the patch, the final reduction used *vsx_reduc_splus_v2df; after
>> the patch, it is *vsx_reduc_plus_v2df_scalar.  The former does a vector
>> add, the latter a float add.  And it uses the same pseudoregister for the
>> accumulator throughout.  IRA decides a register is more expensive than
>> memory for this, I suppose because it wants both V2DF and DF?  It doesn't
>> seem to like the subreg very much.
>>
>> The new code does look nicer otherwise :-)
>>
>>
>> Segher
diff mbox

Patch

diff --git a/gcc/config/rs6000/vector.md b/gcc/config/rs6000/vector.md
index 54b18aa..18a7f08 100644
--- a/gcc/config/rs6000/vector.md
+++ b/gcc/config/rs6000/vector.md
@@ -1078,21 +1078,15 @@ 
 ;; Vector reduction expanders for VSX
 
 (define_expand "reduc_<VEC_reduc_name>_scal_v2df"
-  [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
-		   (vec_select:DF
-		    (VEC_reduc:V2DF
-		     (vec_concat:V2DF
-		      (vec_select:DF
-		       (match_operand:V2DF 1 "vfloat_operand" "")
-		       (parallel [(const_int 1)]))
-		      (vec_select:DF
-		       (match_dup 1)
-		       (parallel [(const_int 0)])))
-		     (match_dup 1))
-		    (parallel [(const_int 1)])))
-	      (clobber (match_scratch:DF 2 ""))])]
+  [(match_operand:DF 0 "register_operand" "")
+   (VEC_reduc:V2DF (match_operand:V2DF 1 "vfloat_operand" "") (const_int 0))]
   "VECTOR_UNIT_VSX_P (V2DFmode)"
-  "")
+  {
+    rtx vec = gen_reg_rtx (V2DFmode);
+    emit_insn (gen_vsx_reduc_<VEC_reduc_name>_v2df (vec, operand1));
+    emit_insn (gen_vsx_extract_v2df (operand0, vec, const1_rtx));
+    DONE;
+  })
 
 ; The (VEC_reduc:V4SF
 ;	(op1)
diff --git a/gcc/config/rs6000/vsx.md b/gcc/config/rs6000/vsx.md
index 7aa0f12..8df6a45 100644
--- a/gcc/config/rs6000/vsx.md
+++ b/gcc/config/rs6000/vsx.md
@@ -2150,7 +2150,7 @@ 
 
 ;; Vector reduction insns and splitters
 
-(define_insn_and_split "*vsx_reduc_<VEC_reduc_name>_v2df"
+(define_insn_and_split "vsx_reduc_<VEC_reduc_name>_v2df"
   [(set (match_operand:V2DF 0 "vfloat_operand" "=&wd,&?wa,wd,?wa")
 	(VEC_reduc:V2DF
 	 (vec_concat:V2DF