
rs6000: Support doubleword swaps removal in rot64 load store [PR100085]

Message ID 20210602081932.2683429-1-luoxhu@linux.ibm.com
State New
Series rs6000: Support doubleword swaps removal in rot64 load store [PR100085]

Commit Message

Xionghu Luo June 2, 2021, 8:19 a.m. UTC
On P8LE, extra rot64+rot64 load or store instructions are generated
in __float128 to vector __int128 conversion.

This patch teaches the swaps pass to also handle such patterns to
remove the extra swap instructions.

(insn 7 6 8 2 (set (subreg:V1TI (reg:KF 123) 0)
        (rotate:V1TI (mem/u/c:V1TI (reg/f:DI 121) [0  S16 A128])
	            (const_int 64 [0x40]))) {*vsx_le_permute_v1ti})
(insn 8 7 9 2 (set (subreg:V1TI (reg:KF 122) 0)
        (rotate:V1TI (subreg:V1TI (reg:KF 123) 0)
	            (const_int 64 [0x40])))  {*vsx_le_permute_v1ti})
=>
(insn 22 6 23 2 (set (subreg:V1TI (reg:KF 123) 0)
        (mem/u/c:V1TI (and:DI (reg/f:DI 121)
	          (const_int -16 [0xfffffffffffffff0])) [0  S16 A128])))
(insn 23 22 25 2 (set (subreg:V1TI (reg:KF 122) 0)
        (subreg:V1TI (reg:KF 123) 0)))
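
The shape of source that produces this sequence is the union-based
reinterpretation used in the new testcase (pr100085.c) below:

typedef __vector unsigned __int128 vui128_t;

typedef union
{
  __float128 vf1;
  vui128_t vx1;
} __VF_128;

vui128_t
vec_xfer_bin128_2_vui128t (__float128 f128)
{
  __VF_128 vunion;
  vunion.vf1 = f128;    /* bitwise reinterpretation through the union  */
  return (vunion.vx1);
}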

gcc/ChangeLog:

	* config/rs6000/rs6000-p8swap.c (pattern_is_rotate64_p): New.
	(insn_is_load_p): Use pattern_is_rotate64_p.
	(insn_is_swap_p): Likewise.
	(quad_aligned_load_p): Likewise.
	(const_load_sequence_p): Likewise.
	(replace_swapped_aligned_load): Likewise.
	(recombine_lvx_pattern): Likewise.
	(recombine_stvx_pattern): Likewise.

gcc/testsuite/ChangeLog:

	* gcc.target/powerpc/float128-call.c: Adjust.
	* gcc.target/powerpc/pr100085.c: New test.
---
 gcc/config/rs6000/rs6000-p8swap.c             | 37 +++++++++++++++----
 .../gcc.target/powerpc/float128-call.c        |  4 +-
 gcc/testsuite/gcc.target/powerpc/pr100085.c   | 28 ++++++++++++++
 3 files changed, 60 insertions(+), 9 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr100085.c

Comments

Segher Boessenkool June 2, 2021, 10:20 p.m. UTC | #1
On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
> On P8LE, extra rot64+rot64 load or store instructions are generated
> in __float128 to vector __int128 conversion.
> 
> This patch teaches the swaps pass to also handle such patterns to
> remove the extra swap instructions.

Did you check if this is already handled by simplify-rtx if the mode had
been TImode (not V1TImode)?  If not, why do you not handle it there?

> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile { target { powerpc*-*-linux* } } } */

Just
  /* { dg-do compile } */
please (or is there any reason to do this on linux only?)

> +/* { dg-require-effective-target powerpc_float128_sw_ok } */
> +/* { dg-options "-O2 -mdejagnu-cpu=power8 -mfloat128 -mno-regnames" } */

-mfloat128 is implied by -mcpu=power8.  Why do you use -mno-regnames?

> +#ifndef __FLOAT128__
> +#error "-mfloat128 is not supported."
> +#endif

So this can be deleted as well.


Segher
Xionghu Luo June 3, 2021, 12:46 a.m. UTC | #2
Hi,

On 2021/6/3 06:20, Segher Boessenkool wrote:
> On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
>> On P8LE, extra rot64+rot64 load or store instructions are generated
>> in __float128 to vector __int128 conversion.
>>
>> This patch teaches the swaps pass to also handle such patterns to
>> remove the extra swap instructions.
> 
> Did you check if this is already handled by simplify-rtx if the mode had
> been TImode (not V1TImode)?  If not, why do you not handle it there?

I tried to do it in combine or a peephole, but the later split2 or
split3 pass will still split it into rotate + rotate again, since we
have splits after reload, and this pattern is quite P8LE-specific, so I
put it in the swaps pass.  simplify-rtx can already simplify
r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
operations.
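
In full RTL, that abbreviated double rotate is:

(set (subreg:V1TI (reg:KF 124) 0)
    (rotate:V1TI (rotate:V1TI (subreg:V1TI (reg:KF 123) 0)
            (const_int 64 [0x40]))
        (const_int 64 [0x40])))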


vsx.md:

;; The post-reload split requires that we re-permute the source
;; register in case it is still live.
(define_split
  [(set (match_operand:VSX_LE_128 0 "memory_operand")
        (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
  "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
   && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
  [(const_int 0)]
{
  rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
  rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
  rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
  DONE;
})

 
Thanks,
Xionghu
Xionghu Luo June 3, 2021, 6:49 a.m. UTC | #3
On 2021/6/3 08:46, Xionghu Luo via Gcc-patches wrote:
> Hi,
> 
> On 2021/6/3 06:20, Segher Boessenkool wrote:
>> On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
>>> On P8LE, extra rot64+rot64 load or store instructions are generated
>>> in __float128 to vector __int128 conversion.
>>>
>>> This patch teaches the swaps pass to also handle such patterns to
>>> remove the extra swap instructions.
>>
>> Did you check if this is already handled by simplify-rtx if the mode had
>> been TImode (not V1TImode)?  If not, why do you not handle it there?
> 
> I tried to do it in combine or a peephole, but the later split2 or
> split3 pass will still split it into rotate + rotate again, since we
> have splits after reload, and this pattern is quite P8LE-specific, so I
> put it in the swaps pass.  simplify-rtx can already simplify
> r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
> operations.
> 

And, I forgot to mention: after the swaps pass removes the rotates with
this patch, the following dse1 pass can remove the stack operations,
which avoids splitting into rotate load/store again in later passes.

If we remove the rotate in simplify-rtx like below:

+++ b/gcc/simplify-rtx.c
@@ -3830,10 +3830,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
     case ROTATE:
       if (trueop1 == CONST0_RTX (mode))
        return op0;
+
+      if (GET_CODE (trueop0) == ROTATE && trueop1 == GEN_INT (64)
+         && CONST_INT_P (XEXP (trueop0, 1))
+         && INTVAL (XEXP (trueop0, 1)) == 64)
+       return XEXP (trueop0, 0);

Combine still fail to merge the two instructions:

Trying 6 -> 7:
    6: r120:KF#0=r125:KF#0<-<0x40
      REG_DEAD r125:KF
    7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
      REG_DEAD r120:KF
Successfully matched this instruction:
(set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
            (reg:DI 123)) [1  S16 A128])
    (subreg:V1TI (reg:KF 125) 0))
rejecting combination of insns 6 and 7
original costs 4 + 4 = 8
replacement cost 12

By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,
it could merge the instructions:

    21: r125:KF=%v2:KF
      REG_DEAD %v2:KF
    2: NOTE_INSN_DELETED
    3: NOTE_INSN_FUNCTION_BEG
    6: NOTE_INSN_DELETED
   17: r123:DI=0x20
    7: [sfp:DI+r123:DI]=r125:KF#0
      REG_DEAD r125:KF
   19: NOTE_INSN_DELETED
   14: %v2:V1TI=[sfp:DI+r123:DI]
      REG_DEAD r123:DI
   15: use %v2:V1TI

Then the following split1 pass will still split it, since there is no
dse pass in between to remove the memory operations on the stack;
removing the rotate in the swaps pass doesn't hit this problem because
it runs before dse and there is no split pass between them:

   21: r125:KF=%v2:KF
      REG_DEAD %v2:KF
    2: NOTE_INSN_DELETED
    3: NOTE_INSN_FUNCTION_BEG
    6: NOTE_INSN_DELETED
   17: r123:DI=0x20
   22: r126:V1TI=r125:KF#0<-<0x40
   23: [sfp:DI+r123:DI]=r126:V1TI<-<0x40
   19: NOTE_INSN_DELETED
   24: r127:V1TI=[sfp:DI+r123:DI]<-<0x40
   25: %v2:V1TI=r127:V1TI<-<0x40
   15: use %v2:V1TI

Thanks,
Xionghu
Li, Pan2 via Gcc-patches June 3, 2021, 1:09 p.m. UTC | #4
On 6/2/21 7:46 PM, Xionghu Luo wrote:
> Hi,
>
> On 2021/6/3 06:20, Segher Boessenkool wrote:
>> On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
>>> On P8LE, extra rot64+rot64 load or store instructions are generated
>>> in __float128 to vector __int128 conversion.
>>>
>>> This patch teaches the swaps pass to also handle such patterns to
>>> remove the extra swap instructions.
>> Did you check if this is already handled by simplify-rtx if the mode had
>> been TImode (not V1TImode)?  If not, why do you not handle it there?
> I tried to do it in combine or a peephole, but the later split2 or
> split3 pass will still split it into rotate + rotate again, since we
> have splits after reload, and this pattern is quite P8LE-specific, so I
> put it in the swaps pass.  simplify-rtx can already simplify
> r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
> operations.
>
>
> vsx.md:
>
> ;; The post-reload split requires that we re-permute the source
> ;; register in case it is still live.
> (define_split
>    [(set (match_operand:VSX_LE_128 0 "memory_operand")
>          (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>    "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>     && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>    [(const_int 0)]
> {
>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>    rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>    DONE;
> })

Note also that swap optimization can handle more general cases than 
simplify-rtx.  In my view it's best to have it covered in both places.

Bill

>   
> Thanks,
> Xionghu
Segher Boessenkool June 3, 2021, 8:16 p.m. UTC | #5
Hi!

On Thu, Jun 03, 2021 at 08:46:46AM +0800, Xionghu Luo wrote:
> On 2021/6/3 06:20, Segher Boessenkool wrote:
> > On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
> >> On P8LE, extra rot64+rot64 load or store instructions are generated
> >> in __float128 to vector __int128 conversion.
> >>
> >> This patch teaches the swaps pass to also handle such patterns to
> >> remove the extra swap instructions.
> > 
> > Did you check if this is already handled by simplify-rtx if the mode had
> > been TImode (not V1TImode)?  If not, why do you not handle it there?
> 
> I tried to do it in combine or a peephole, but the later split2 or
> split3 pass will still split it into rotate + rotate again, since we
> have splits after reload, and this pattern is quite P8LE-specific, so I
> put it in the swaps pass.  simplify-rtx can already simplify
> r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
> operations.

What mode are those subregs?  Abbreviated RTL printouts are very lossy.
Assuming those are TImode (please check), then yes, that is what I
asked, thanks.

> ;; The post-reload split requires that we re-permute the source
> ;; register in case it is still live.
> (define_split
>   [(set (match_operand:VSX_LE_128 0 "memory_operand")
>         (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>   "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>    && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>   [(const_int 0)]
> {
>   rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>   rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>   rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>   DONE;
> })

Yes, that needs improvement itself.

The thing to realise is that TImode is optimised by generic code just
fine (as all scalar integer modes are), but V1TImode is not.  We have
that mode because we really needed to not put TImode in vector registers
so much on older cpus, but that balance may have changed by now.  Worth
experimenting with; we now can do pretty much all normal operations in
vector registers!


Segher
Segher Boessenkool June 3, 2021, 8:19 p.m. UTC | #6
On Thu, Jun 03, 2021 at 08:09:36AM -0500, Bill Schmidt wrote:
> Note also that swap optimization can handle more general cases than 
> simplify-rtx.

Do you have examples?  That should be fixed (unless it is something
Power-specific?)

> In my view it's best to have it covered in both places.

Oh certainly, and we need that p8swaps pass on at least p8 anyway (but
perhaps we can allow TImode in vector regs on later cpus).


Segher
Segher Boessenkool June 3, 2021, 8:31 p.m. UTC | #7
On Thu, Jun 03, 2021 at 02:49:15PM +0800, Xionghu Luo wrote:
> If we remove the rotate in simplify-rtx like below:
> 
> +++ b/gcc/simplify-rtx.c
> @@ -3830,10 +3830,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>      case ROTATE:
>        if (trueop1 == CONST0_RTX (mode))
>         return op0;
> +
> +      if (GET_CODE (trueop0) == ROTATE && trueop1 == GEN_INT (64)
> +         && CONST_INT_P (XEXP (trueop0, 1))
> +         && INTVAL (XEXP (trueop0, 1)) == 64)
> +       return XEXP (trueop0, 0);

(The hardcoded 64 need improving -- but this is just a proof of concept
I'll assume :-) )
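
For instance, the check could key off the element precision instead of
a literal 64 -- a minimal sketch, with GET_MODE_UNIT_PRECISION assumed
to be the right width query for both TImode and V1TImode here:

      /* (rotate (rotate X C1) C2) is X when C1 + C2 covers the full
	 element width, since the two rotates cancel out.  */
      if (GET_CODE (trueop0) == ROTATE
	  && CONST_INT_P (trueop1)
	  && CONST_INT_P (XEXP (trueop0, 1))
	  && (INTVAL (trueop1) + INTVAL (XEXP (trueop0, 1))
	      == GET_MODE_UNIT_PRECISION (mode)))
	return XEXP (trueop0, 0);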

> Combine still fail to merge the two instructions:
> 
> Trying 6 -> 7:
>     6: r120:KF#0=r125:KF#0<-<0x40
>       REG_DEAD r125:KF
>     7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
>       REG_DEAD r120:KF
> Successfully matched this instruction:
> (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
>             (reg:DI 123)) [1  S16 A128])
>     (subreg:V1TI (reg:KF 125) 0))
> rejecting combination of insns 6 and 7
> original costs 4 + 4 = 8
> replacement cost 12

So what instructions were these?  Why did the store cost 4 but the new
one costs 12?

> By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,

It should be the same cost as the other store!

> it could merge the instructions:
> 
>     21: r125:KF=%v2:KF
>       REG_DEAD %v2:KF
>     2: NOTE_INSN_DELETED
>     3: NOTE_INSN_FUNCTION_BEG
>     6: NOTE_INSN_DELETED
>    17: r123:DI=0x20
>     7: [sfp:DI+r123:DI]=r125:KF#0
>       REG_DEAD r125:KF
>    19: NOTE_INSN_DELETED
>    14: %v2:V1TI=[sfp:DI+r123:DI]
>       REG_DEAD r123:DI
>    15: use %v2:V1TI
> 
> Then the following split1 pass will still split it, since there is no
> dse pass in between to remove the memory operations on the stack;
> removing the rotate in the swaps pass doesn't hit this problem because
> it runs before dse and there is no split pass between them:

Sure, but none of that is the point.  I asked if we did this for TImode
properly, and maybe we do, but:

>    22: r126:V1TI=r125:KF#0<-<0x40
>    23: [sfp:DI+r123:DI]=r126:V1TI<-<0x40

... this is V1TI mode.


Segher
Li, Pan2 via Gcc-patches June 3, 2021, 8:34 p.m. UTC | #8
On 6/3/21 3:19 PM, Segher Boessenkool wrote:
> On Thu, Jun 03, 2021 at 08:09:36AM -0500, Bill Schmidt wrote:
>> Note also that swap optimization can handle more general cases than
>> simplify-rtx.
> Do you have examples?  That should be fixed (unless it is something
> Power-specific?)
It is Power-specific.  This optimization looks at entire webs of 
computation to determine whether the computation can be done in the 
"wrong" lanes without problems, and removes all the extra swaps at 
once.  That is beyond what simplify-rtx is capable of.  It takes full 
dataflow analysis to see what can legally be done.  As a side effect, 
p8swaps takes care of some cases that simplify-rtx would catch at that 
particular point in compilation, but not all.
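
As a concrete illustration (a minimal sketch, assuming power8 LE and
the usual lxvd2x/stxvd2x expansion): an element-wise operation gives
the same result in either doubleword order, so the pass can delete the
swap corrections around the whole load-compute-store web at once:

#include <altivec.h>

/* The loads and stores become lxvd2x/stxvd2x, which access the two
   doublewords in swapped order; vector addition is lane-insensitive,
   so the xxpermdi fix-ups in this whole web can be removed.  */
void
vadd (vector unsigned long long *dst,
      vector unsigned long long *a,
      vector unsigned long long *b)
{
  *dst = *a + *b;
}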
>>   In my view it's best to have it covered in both places.
> Oh certainly, and we need that p8swaps pass on at least p8 anyway (but
> perhaps we can allow TImode in vector regs on later cpus).

Yes, I agree that should potentially be revisited, but we still have the 
ABI requirement to keep them in GPRs for parameter passing, so there are 
limits on what we can do.

Bill

>
>
> Segher
Xionghu Luo June 4, 2021, 1:40 a.m. UTC | #9
On 2021/6/4 04:31, Segher Boessenkool wrote:
> On Thu, Jun 03, 2021 at 02:49:15PM +0800, Xionghu Luo wrote:
>> If we remove the rotate in simplify-rtx like below:
>>
>> +++ b/gcc/simplify-rtx.c
>> @@ -3830,10 +3830,16 @@ simplify_context::simplify_binary_operation_1 (rtx_code code,
>>       case ROTATE:
>>         if (trueop1 == CONST0_RTX (mode))
>>          return op0;
>> +
>> +      if (GET_CODE (trueop0) == ROTATE && trueop1 == GEN_INT (64)
>> +         && CONST_INT_P (XEXP (trueop0, 1))
>> +         && INTVAL (XEXP (trueop0, 1)) == 64)
>> +       return XEXP (trueop0, 0);
> 
> (The hardcoded 64 need improving -- but this is just a proof of concept
> I'll assume :-) )
> 
>> Combine still fail to merge the two instructions:
>>
>> Trying 6 -> 7:
>>      6: r120:KF#0=r125:KF#0<-<0x40
>>        REG_DEAD r125:KF
>>      7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
>>        REG_DEAD r120:KF
>> Successfully matched this instruction:
>> (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
>>              (reg:DI 123)) [1  S16 A128])
>>      (subreg:V1TI (reg:KF 125) 0))
>> rejecting combination of insns 6 and 7
>> original costs 4 + 4 = 8
>> replacement cost 12
> 
> So what instructions were these?  Why did the store cost 4 but the new
> one costs 12?

For this case of __float128 to vector __int128:

typedef union
{
  __float128 vf1;
  vector __int128 vi128;
  __int128 i128;
} VF_128;

vector __int128
foo1 (__float128 f128)
{
  VF_128 vunion;

  vunion.vf1 = f128;
  return vunion.vi128;
}

Without this patch, the RTL in combine is:

(insn 6 3 17 2 (set (subreg:V1TI (reg:KF 120 [ f128 ]) 0)
        (rotate:V1TI (subreg:V1TI (reg:KF 125) 0)
            (const_int 64 [0x40]))) "pr100085.c":258:14 1113 {*vsx_le_permute_v1ti}
     (expr_list:REG_DEAD (reg:KF 125)
        (nil)))
(insn 17 6 7 2 (set (reg:DI 123)
        (const_int 32 [0x20])) "pr100085.c":258:14 636 {*movdi_internal64}
     (nil))
(insn 7 17 19 2 (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
                (reg:DI 123)) [1  S16 A128])
        (rotate:V1TI (subreg:V1TI (reg:KF 120 [ f128 ]) 0)
            (const_int 64 [0x40]))) "pr100085.c":258:14 1113 {*vsx_le_permute_v1ti}
     (expr_list:REG_DEAD (reg:KF 120 [ f128 ])
        (nil)))
(note 19 7 14 2 NOTE_INSN_DELETED)
(insn 14 19 15 2 (set (reg/i:V1TI 66 %v2)
        (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
                (reg:DI 123)) [1  S16 A128])) "pr100085.c":260:1 1119 {*vsx_le_perm_load_v1ti}
     (expr_list:REG_DEAD (reg:DI 123)
        (nil)))
(insn 15 14 0 2 (use (reg/i:V1TI 66 %v2)) "pr100085.c":260:1 -1
     (nil))

insn 6 and insn 7 are two vsx_le_permute_v1ti instructions, each with
cost 4.  (The two instructions are VSX- and LE-specific, as Bill said;
the swaps pass tries to remove such insns when legal.)  If we remove the
rotates in simplify-rtx.c (simplify_context::simplify_binary_operation_1)
as in my last reply, combine will try to merge them into
vsx_le_perm_store_v1ti, whose insn cost is 12, and we hit "rejecting
combination".  They are all V1TImode.


> 
>> By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,
> 
> It should be the same cost as the other store!

vsx_le_permute_v1ti's cost is defined as 4 in vsx.md:

;; Little endian word swapping for 128-bit types that are either scalars or the
;; special V1TI container class, which it is not appropriate to use vec_select
;; for the type.
(define_insn "*vsx_le_permute_<mode>"
  [(set (match_operand:VSX_TI 0 "nonimmediate_operand" "=wa,wa,Z,&r,&r,Q")
	(rotate:VSX_TI
	 (match_operand:VSX_TI 1 "input_operand" "wa,Z,wa,r,Q,r")
	 (const_int 64)))]
  "!BYTES_BIG_ENDIAN && TARGET_VSX && !TARGET_P9_VECTOR"
  "@
   xxpermdi %x0,%x1,%x1,2
   lxvd2x %x0,%y1
   stxvd2x %x1,%y0
   mr %0,%L1\;mr %L0,%1
   ld%U1%X1 %0,%L1\;ld%U1%X1 %L0,%1
   std%U0%X0 %L1,%0\;std%U0%X0 %1,%L0"
  [(set_attr "length" "*,*,*,8,8,8")
   (set_attr "type" "vecperm,vecload,vecstore,*,load,store")])

> 
>> it could merge the instructions:
>>
>>      21: r125:KF=%v2:KF
>>        REG_DEAD %v2:KF
>>      2: NOTE_INSN_DELETED
>>      3: NOTE_INSN_FUNCTION_BEG
>>      6: NOTE_INSN_DELETED
>>     17: r123:DI=0x20
>>      7: [sfp:DI+r123:DI]=r125:KF#0
>>        REG_DEAD r125:KF
>>     19: NOTE_INSN_DELETED
>>     14: %v2:V1TI=[sfp:DI+r123:DI]
>>        REG_DEAD r123:DI
>>     15: use %v2:V1TI
>>
>> Then the following split1 pass will still split it, since there is no
>> dse pass in between to remove the memory operations on the stack;
>> removing the rotate in the swaps pass doesn't hit this problem because
>> it runs before dse and there is no split pass between them:
> 
> Sure, but none of that is the point.  I asked if we did this for TImode
> properly, and maybe we do, but:
> 
>>     22: r126:V1TI=r125:KF#0<-<0x40
>>     23: [sfp:DI+r123:DI]=r126:V1TI<-<0x40
> 
> ... this is V1TI mode.
> 
> 
> Segher
>
Xionghu Luo June 4, 2021, 2:15 a.m. UTC | #10
Hi,


On 2021/6/3 21:09, Bill Schmidt wrote:
> On 6/2/21 7:46 PM, Xionghu Luo wrote:
>> Hi,
>>
>> On 2021/6/3 06:20, Segher Boessenkool wrote:
>>> On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
>>>> On P8LE, extra rot64+rot64 load or store instructions are generated
>>>> in __float128 to vector __int128 conversion.
>>>>
>>>> This patch teaches the swaps pass to also handle such patterns to
>>>> remove the extra swap instructions.
>>> Did you check if this is already handled by simplify-rtx if the mode had
>>> been TImode (not V1TImode)?  If not, why do you not handle it there?
>> I tried to do it in combine or a peephole, but the later split2 or
>> split3 pass will still split it into rotate + rotate again, since we
>> have splits after reload, and this pattern is quite P8LE-specific, so I
>> put it in the swaps pass.  simplify-rtx can already simplify
>> r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
>> operations.
>>
>>
>> vsx.md:
>>
>> ;; The post-reload split requires that we re-permute the source
>> ;; register in case it is still live.
>> (define_split
>>    [(set (match_operand:VSX_LE_128 0 "memory_operand")
>>          (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>>    "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && 
>> !TARGET_P9_VECTOR
>>     && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>>    [(const_int 0)]
>> {
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    DONE;
>> })
> 
> Note also that swap optimization can handle more general cases than 
> simplify-rtx.  In my view it's best to have it covered in both places.
> 

But this pattern is matched after reload, much later than the swaps
optimization, so it can't remove the swap operations as expected.  I
have an example below that matched the above pattern in pass split2;
it may not be quite appropriate, as there is a function call between
the load and store.

extern vector __int128 foo1 (__float128 a);

int foo2 ()
{
  __binary128 f128 = {3.1415926535897932384626433832795028841971693993751058Q};
  vector __int128 ret = foo1 (f128);
  return ret[0];
}


295r.split (*see insn 35, 36, 37*):

...
Splitting with gen_split_558 (vsx.md:1079)
...

(insn 33 12 34 2 (set (reg/f:DI 9 %r9 [121])
        (high:DI (unspec:DI [
                    (symbol_ref:DI ("*.LANCHOR0") [flags 0x182])
                    (reg:DI 2 %r2)
                ] UNSPEC_TOCREL))) "pr100085.c":279:25 715 {*largetoc_high}
     (nil))
(insn 34 33 6 2 (set (reg/f:DI 9 %r9 [121])
        (lo_sum:DI (reg/f:DI 9 %r9 [121])
            (unspec:DI [
                    (symbol_ref:DI ("*.LANCHOR0") [flags 0x182])
                    (reg:DI 2 %r2)
                ] UNSPEC_TOCREL))) "pr100085.c":279:25 717 {*largetoc_low}
     (expr_list:REG_EQUAL (symbol_ref:DI ("*.LANCHOR0") [flags 0x182])
        (nil)))
(insn 6 34 8 2 (set (reg:V1TI 66 %v2 [123])
        (rotate:V1TI (mem/c:V1TI (reg/f:DI 9 %r9 [121]) [1 f128+0 S16 A128])
            (const_int 64 [0x40]))) "pr100085.c":279:25 1113 {*vsx_le_permute_v1ti}
     (nil))
(insn 8 6 9 2 (set (reg:V1TI 66 %v2)
        (rotate:V1TI (reg:V1TI 66 %v2 [123])
            (const_int 64 [0x40]))) "pr100085.c":279:25 1113 {*vsx_le_permute_v1ti}
     (nil))
(call_insn 9 8 32 2 (parallel [
            (set (reg:V1TI 66 %v2)
                (call (mem:SI (symbol_ref:DI ("foo1") [flags 0x41]  <function_decl 0x7ffff4fb6f00 foo1>) [0 foo
1 S4 A8])
                    (const_int 0 [0])))
            (use (const_int 0 [0]))
            (clobber (reg:DI 96 lr))
        ]) "pr100085.c":279:25 735 {*call_value_nonlocal_aixdi}
     (expr_list:REG_CALL_DECL (symbol_ref:DI ("foo1") [flags 0x41]  <function_decl 0x7ffff4fb6f00 foo1>)
        (nil))
    (expr_list (use (reg:DI 2 %r2))
        (expr_list:KF (use (reg:KF 66 %v2))
            (nil))))
(insn 32 9 35 2 (set (reg:DI 9 %r9 [138])
        (plus:DI (reg/f:DI 1 %r1)
            (const_int 32 [0x20]))) "pr100085.c":279:25 66 {*adddi3}
     (nil))
(insn 35 32 36 2 (set (reg:V1TI 66 %v2)
        (rotate:V1TI (reg:V1TI 66 %v2)
            (const_int 64 [0x40]))) "pr100085.c":279:25 1113 {*vsx_le_permute_v1ti}
     (nil))
(insn 36 35 37 2 (set (mem/c:V1TI (reg:DI 9 %r9 [138]) [2 %sfp+32 S16 A128])
        (rotate:V1TI (reg:V1TI 66 %v2)
            (const_int 64 [0x40]))) "pr100085.c":279:25 1113 {*vsx_le_permute_v1ti}
     (nil))
(insn 37 36 28 2 (set (reg:V1TI 66 %v2)
        (rotate:V1TI (reg:V1TI 66 %v2)
            (const_int 64 [0x40]))) "pr100085.c":279:25 1113 {*vsx_le_permute_v1ti}
     (nil))
(insn 28 37 17 2 (set (reg:DI 3 %r3 [133])
        (mem/c:DI (plus:DI (reg/f:DI 1 %r1)
                (const_int 32 [0x20])) [2 %sfp+32 S8 A128])) "pr100085.c":279:25 636 {*movdi_internal64}
     (nil))
(insn 17 28 18 2 (set (reg/i:DI 3 %r3)
        (sign_extend:DI (reg:SI 3 %r3 [129]))) "pr100085.c":281:1 31 {extendsidi2}
     (nil))
(insn 18 17 30 2 (use (reg/i:DI 3 %r3)) "pr100085.c":281:1 -1
     (nil))
Xionghu Luo June 4, 2021, 2:45 a.m. UTC | #11
On 2021/6/4 04:16, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jun 03, 2021 at 08:46:46AM +0800, Xionghu Luo wrote:
>> On 2021/6/3 06:20, Segher Boessenkool wrote:
>>> On Wed, Jun 02, 2021 at 03:19:32AM -0500, Xionghu Luo wrote:
>>>> On P8LE, extra rot64+rot64 load or store instructions are generated
>>>> in __float128 to vector __int128 conversion.
>>>>
>>>> This patch teaches the swaps pass to also handle such patterns to
>>>> remove the extra swap instructions.
>>>
>>> Did you check if this is already handled by simplify-rtx if the mode had
>>> been TImode (not V1TImode)?  If not, why do you not handle it there?
>>
>> I tried to do it in combine or a peephole, but the later split2 or
>> split3 pass will still split it into rotate + rotate again, since we
>> have splits after reload, and this pattern is quite P8LE-specific, so I
>> put it in the swaps pass.  simplify-rtx can already simplify
>> r124:KF#0=r123:KF#0<-<0x40<-<0x40 to r124:KF#0=r123:KF#0 for register
>> operations.
> 
> What mode are those subregs?  Abbreviated RTL printouts are very lossy.
> Assuming those are TImode (please check), then yes, that is what I
> asked, thanks.

typedef union
{
  __float128 vf1;
  vector __int128 vi128;
  __int128 i128;
} VF_128;


VF_128 vu;

int foo3 ()
{
  __float128 f128 = {3.1415926535897932384626433832795028841971693993751058Q};
  vu.vf1 = f128;
  vector __int128 ret = vu.vi128;
  return ret[0];
}

This case triggers that optimization; the subregs are also V1TImode:

Trying 8 -> 9:
    8: r122:KF#0=r123:KF#0<-<0x40
      REG_DEAD r123:KF
    9: r124:KF#0=r122:KF#0<-<0x40
      REG_DEAD r122:KF
Successfully matched this instruction:
(set (subreg:V1TI (reg:KF 124) 0)
    (rotate:V1TI (rotate:V1TI (subreg:V1TI (reg:KF 123) 0)
            (const_int 64 [0x40]))
        (const_int 64 [0x40])))
allowing combination of insns 8 and 9
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 8.
modifying insn i3     9: r124:KF#0=r123:KF#0<-<0x40<-<0x40
      REG_DEAD r123:KF
deferring rescan insn with uid = 9.


After checking, it was actually optimized by this pattern from vsx.md
in the split1 pass, not by simplify-rtx.


(define_insn_and_split "*vsx_le_undo_permute_<mode>"
  [(set (match_operand:VSX_TI 0 "vsx_register_operand" "=wa,wa")
	(rotate:VSX_TI
	 (rotate:VSX_TI
	  (match_operand:VSX_TI 1 "vsx_register_operand" "0,wa")
	  (const_int 64))
	 (const_int 64)))]
  "!BYTES_BIG_ENDIAN && TARGET_VSX"
  "@
   #
   xxlor %x0,%x1"
  "&& 1"
  [(set (match_dup 0) (match_dup 1))]
{
  if (reload_completed && REGNO (operands[0]) == REGNO (operands[1]))
    {
      emit_note (NOTE_INSN_DELETED);
      DONE;
    }
}
  [(set_attr "length" "0,4")
   (set_attr "type" "veclogical")])


> 
>> ;; The post-reload split requires that we re-permute the source
>> ;; register in case it is still live.
>> (define_split
>>    [(set (match_operand:VSX_LE_128 0 "memory_operand")
>>          (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>>    "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>>     && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>>    [(const_int 0)]
>> {
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    DONE;
>> })
> 
> Yes, that needs improvement itself.
> 
> The thing to realise is that TImode is optimised by generic code just
> fine (as all scalar integer modes are), but V1TImode is not.  We have
> that mode because we really needed to not put TImode in vector registers
> so much on older cpus, but that balance may have changed by now.  Worth
> experimenting with; we now can do pretty much all normal operations in
> vector registers!
>
We have two issues stated in PR100085: one is __float128 to vector
__int128 (V1TImode), the other is __float128 to __int128 (TImode).  The
first one can be solved by this patch (by the swaps pass optimization).
To also cover the TImode case, we should generate rotate permute
instructions in gen_movti for P8LE, like gen_movkf in vector.md does
(the change below is copied exactly from "mov<mode>"...):

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 84820d3b5cb..f35c235a39e 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -7385,7 +7385,22 @@ (define_expand "mov<mode>"
        (match_operand:INT 1 "any_operand"))]
   ""
 {
-  rs6000_emit_move (operands[0], operands[1], <MODE>mode);
+  /* When generating load/store instructions to/from VSX registers on
+     pre-power9 hardware in little endian mode, we need to emit register
+     permute instructions to byte swap the contents, since the VSX load/store
+     instructions do not include a byte swap as part of their operation.
+     Altivec loads and stores have no such problem, so we skip them below.  */
+  if (!BYTES_BIG_ENDIAN
+      && VECTOR_MEM_VSX_P (<MODE>mode)
+      && !TARGET_P9_VECTOR
+      && !gpr_or_gpr_p (operands[0], operands[1])
+      && ((memory_operand (operands[0], <MODE>mode)
+          && !altivec_indexed_or_indirect_operand(operands[0], <MODE>mode))
+         ^ (memory_operand (operands[1], <MODE>mode)
+            && !altivec_indexed_or_indirect_operand(operands[1], <MODE>mode))))
+      rs6000_emit_le_vsx_move (operands[0], operands[1], <MODE>mode);
+    else
+      rs6000_emit_move (operands[0], operands[1], <MODE>mode);
   DONE;
 })

test case:

__int128
foo4 (__float128 f128)
{
  VF_128 vunion;

  vunion.vf1 = f128;
  return vunion.i128;
}

trunk vs. this patch + the above change (sorry, still using slim RTL
for the comparison; the instruction patterns were added manually):


diff pr100085.trunk.c.245r.expand pr100085.c.245r.expand
<     8: r121:TI=[r112:DI]        {*vsx_le_perm_load_ti}
<     9: r117:TI=r121:TI
<    13: %r3:TI=r117:TI
<    14: use %r3:TI
---
>     8: r122:V2DI=vec_select([r112:DI],parallel)    {vsx_ld_elemrev_v2di}
>     9: r121:TI#0=vec_select(r122:V2DI,parallel)    {*vsx_xxpermdi2_le_v2di}
>    10: r117:TI=r121:TI   
>    14: %r3:TI=r117:TI
>    15: use %r3:TI


diff pr100085.trunk.c.251r.swaps pr100085.c.251r.swaps:
<     6: r120:KF#0=r118:KF#0<-<0x40
<    16: r122:DI=0x20
<     7: [sfp:DI+r122:DI]=r120:KF#0<-<0x40
<    17: r123:DI=sfp:DI+0x20
<     8: r121:TI=[r123:DI]
<     9: r117:TI=r121:TI
<    13: %r3:TI=r117:TI
<    14: use %r3:TI
---
>    20: r120:KF#0=r118:KF#0
>    17: r123:DI=0x20
>    19: [sfp:DI+r123:DI&0xfffffffffffffff0]=r120:KF#0
>    18: r124:DI=0x20
>    21: r122:V2DI=[sfp:DI+r124:DI&0xfffffffffffffff0]
>    22: r121:TI#0=r122:V2DI
>    10: r117:TI=r121:TI
>    14: %r3:TI=r117:TI
>    15: use %r3:TI


diff pr100085.trunk.s pr100085.s
<       xxpermdi %vs34,%vs34,%vs34,2
<       li %r10,32
<       addi %r8,%r1,-48
<       addi %r9,%r1,-16
<       stxvd2x %vs34,%r8,%r10
<       ld %r3,0(%r9)
<       ld %r4,8(%r9)
---
>       xxpermdi %vs0,%vs34,%vs34,3
>       mfvsrd %r4,%vs34
>       mfvsrd %r3,%vs0
Segher Boessenkool June 8, 2021, 8:11 p.m. UTC | #12
Hi!

On Fri, Jun 04, 2021 at 09:40:58AM +0800, Xionghu Luo wrote:
> >> Combine still fail to merge the two instructions:
> >>
> >> Trying 6 -> 7:
> >>      6: r120:KF#0=r125:KF#0<-<0x40
> >>        REG_DEAD r125:KF
> >>      7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
> >>        REG_DEAD r120:KF
> >> Successfully matched this instruction:
> >> (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
> >>              (reg:DI 123)) [1  S16 A128])
> >>      (subreg:V1TI (reg:KF 125) 0))
> >> rejecting combination of insns 6 and 7
> >> original costs 4 + 4 = 8
> >> replacement cost 12
> > 
> > So what instructions were these?  Why did the store cost 4 but the new
> > one costs 12?

The *vsx_le_perm_store_<mode> instruction has the *preferred*
alternative with cost 12, while the other alternative has cost 8.  Why
is that?  That looks like a bug.
   (set_attr "length" "12,8")

> >> By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,
> > 
> > It should be the same cost as the other store!
> 
> vsx_le_permute_v1ti's cost is defined as 4 in vsx.md:

Yes.  Why is alternative 0 of *vsx_le_perm_store_<mode> said to have a
length of 3 insns?


Segher
Xionghu Luo June 9, 2021, 3:20 a.m. UTC | #13
On 2021/6/9 04:11, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Jun 04, 2021 at 09:40:58AM +0800, Xionghu Luo wrote:
>>>> Combine still fail to merge the two instructions:
>>>>
>>>> Trying 6 -> 7:
>>>>       6: r120:KF#0=r125:KF#0<-<0x40
>>>>         REG_DEAD r125:KF
>>>>       7: [sfp:DI+r123:DI]=r120:KF#0<-<0x40
>>>>         REG_DEAD r120:KF
>>>> Successfully matched this instruction:
>>>> (set (mem/c:V1TI (plus:DI (reg/f:DI 110 sfp)
>>>>               (reg:DI 123)) [1  S16 A128])
>>>>       (subreg:V1TI (reg:KF 125) 0))
>>>> rejecting combination of insns 6 and 7
>>>> original costs 4 + 4 = 8
>>>> replacement cost 12
>>>
>>> So what instructions were these?  Why did the store cost 4 but the new
>>> one costs 12?
> 
> The *vsx_le_perm_store_<mode> instruction has the *preferred*
> alternative with cost 12, while the other alternative has cost 8.  Why
> is that?  That looks like a bug.
>     (set_attr "length" "12,8")

12 was introduced by Mike's commit c477a6674364 (r6-2577), and all 5
vsx_le_perm_store_<mode> patterns are set to 12 for modes
VSX_D/VSX_W/V8HI/V16QI/VSX_LE_128.  I guess it is split into two
rs6000_emit_le_vsx_permute calls before reload, but 3 after reload, so
the length is 12; then it also seems unreasonable to change it from 12
to 8?  And I am not sure when alternative 1 will be chosen?

vsx.md:
;; The post-reload split requires that we re-permute the source
;; register in case it is still live.
(define_split
  [(set (match_operand:VSX_LE_128 0 "memory_operand")
        (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
  "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
   && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
  [(const_int 0)]
{
  rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
  rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
  rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
  DONE;
}) 

> 
>>>> By hacking the vsx_le_perm_store_v1ti INSN_COST from 12 to 8,
>>>
>>> It should be the same cost as the other store!
>>
>> vsx_le_permute_v1ti's cost is defined as 4 in vsx.md:
> 
> Yes.  Why is alternative 0 of *vsx_le_perm_store_<mode> said to have a
> length of 3 insns?
> 
> 
> Segher
>
Segher Boessenkool June 9, 2021, 4:24 p.m. UTC | #14
On Wed, Jun 09, 2021 at 11:20:20AM +0800, Xionghu Luo wrote:
> On 2021/6/9 04:11, Segher Boessenkool wrote:
> > On Fri, Jun 04, 2021 at 09:40:58AM +0800, Xionghu Luo wrote:
> >>>> rejecting combination of insns 6 and 7
> >>>> original costs 4 + 4 = 8
> >>>> replacement cost 12
> >>>
> >>> So what instructions were these?  Why did the store cost 4 but the new
> >>> one costs 12?
> > 
> > The *vsx_le_perm_store_<mode> instruction has the *preferred*
> > alternative with cost 12, while the other alternative has cost 8.  Why
> > is that?  That looks like a bug.
> >     (set_attr "length" "12,8")
> 
> 12 was introduced by Mike's commit c477a6674364 (r6-2577), and all 5
> vsx_le_perm_store_<mode> patterns are set to 12 for modes
> VSX_D/VSX_W/V8HI/V16QI/VSX_LE_128.  I guess it is split into two
> rs6000_emit_le_vsx_permute calls before reload, but 3 after reload, so
> the length is 12; then it also seems unreasonable to change it from 12
> to 8?  And I am not sure when alternative 1 will be chosen?

This is the instruction *length*, not the cost directly.  The length
has to be correct, not lower than it will turn out to be that is, or on
some big testcases you will get branches that cannot reach their target,
and the resulting ICEs.

Alternatives are chosen by register allocation.  Before register
allocation attributes are taken as if alternative 0 is selected (well,
the first enabled alternative is selected, same thing here).

Which alternative is the expected (or wanted) one?  Either put that one
first, or if it is the longer one, give it an explicit cost.

> ;; The post-reload split requires that we re-permute the source
> ;; register in case it is still live.
> (define_split
>   [(set (match_operand:VSX_LE_128 0 "memory_operand")
>         (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>   "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>    && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>   [(const_int 0)]
> {
>   rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>   rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>   rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>   DONE;
> }) 

So it seems like it is only 3 insns in the very unlucky case?  Normally
it will end up as just one simple store?  So you want an explicit cost
here then:

  ;; What is the insn_cost for this insn?  The target hook can still override
  ;; this.  For optimizing for size the "length" attribute is used instead.
  (define_attr "cost" "" (const_int 0))

So you would use something like

     (set_attr "cost" "4,*")

here (if I got that right, please check :-) )

HtH,


Segher
Xionghu Luo June 10, 2021, 7:11 a.m. UTC | #15
On 2021/6/10 00:24, Segher Boessenkool wrote:
> On Wed, Jun 09, 2021 at 11:20:20AM +0800, Xionghu Luo wrote:
>> On 2021/6/9 04:11, Segher Boessenkool wrote:
>>> On Fri, Jun 04, 2021 at 09:40:58AM +0800, Xionghu Luo wrote:
>>>>>> rejecting combination of insns 6 and 7
>>>>>> original costs 4 + 4 = 8
>>>>>> replacement cost 12
>>>>>
>>>>> So what instructions were these?  Why did the store cost 4 but the new
>>>>> one costs 12?
>>>
>>> The *vsx_le_perm_store_<mode> instruction has the *preferred*
>>> alternative with cost 12, while the other alternative has cost 8.  Why
>>> is that?  That looks like a bug.
>>>      (set_attr "length" "12,8")
>>
>> 12 was introduced by Mike's commit c477a6674364 (r6-2577), and all 5
>> vsx_le_perm_store_<mode> patterns are set to 12 for modes
>> VSX_D/VSX_W/V8HI/V16QI/VSX_LE_128.  I guess it is split into two
>> rs6000_emit_le_vsx_permute calls before reload, but 3 after reload, so
>> the length is 12; then it also seems unreasonable to change it from 12
>> to 8?  And I am not sure when alternative 1 will be chosen?
> 
> This is the instruction *length*, not the cost directly.  The length
> has to be correct, not lower than it will turn out to be that is, or on
> some big testcases you will get branches that cannot reach their target,
> and the resulting ICEs.
> 
> Alternatives are chosen by register allocation.  Before register
> allocation attributes are taken as if alternative 0 is selected (well,
> the first enabled alternative is selected, same thing here).
> 
> Which alternative is the expected (or wanted) one?  Either put that one
> first, or if it is the longer one, give it an explicit cost.
> 
>> ;; The post-reload split requires that we re-permute the source
>> ;; register in case it is still live.
>> (define_split
>>    [(set (match_operand:VSX_LE_128 0 "memory_operand")
>>          (match_operand:VSX_LE_128 1 "vsx_register_operand"))]
>>    "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>>     && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>>    [(const_int 0)]
>> {
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>    DONE;
>> })
> 
> So it seems like it is only 3 insns in the very unlucky case?  Normally
> it will end up as just one simple store?

I am afraid there is no "simple store" for *TImode on P8LE*.  There is
only stxvd2x, which rotates the elements (stvx requires the memory to be
aligned, so it is not a suitable pattern), so every vsx_le_perm_store_v1ti
must be split into 3 instructions for alternative 0; it seems incorrect
to force the cost to be 4.
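
Concretely, alternative 0 corresponds to a sequence like this after the
post-reload split (a sketch; register names are illustrative):

	xxpermdi vsY,vsY,vsY,2    # pre-swap the doublewords of the source
	stxvd2x  vsY,rA,rB        # doubleword-swapping store
	xxpermdi vsY,vsY,vsY,2    # restore vsY in case it is still live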


> So you want an explicit cost
> here then:
> 
>    ;; What is the insn_cost for this insn?  The target hook can still override
>    ;; this.  For optimizing for size the "length" attribute is used instead.
>    (define_attr "cost" "" (const_int 0))
> 
> So you would use something like
> 
>       (set_attr "cost" "4,*")
> 
> here (if I got that right, please check :-) )
> 
> HtH,
> 
> 
> Segher
>
Segher Boessenkool June 11, 2021, 8:16 p.m. UTC | #16
On Thu, Jun 10, 2021 at 03:11:08PM +0800, Xionghu Luo wrote:
> On 2021/6/10 00:24, Segher Boessenkool wrote:
> >>    "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
> >>     && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
> >>    [(const_int 0)]
> >> {
> >>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
> >>    rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
> >>    rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
> >>    DONE;
> >> })
> > 
> > So it seems like it is only 3 insns in the very unlucky case?  Normally
> > it will end up as just one simple store?
> 
> I am afraid there is no "simple store" for *TImode on P8LE*.  There is
> only stxvd2x, which rotates the elements (stvx requires the memory to be
> aligned, so it is not a suitable pattern), so every vsx_le_perm_store_v1ti
> must be split into 3 instructions for alternative 0; it seems incorrect
> to force the cost to be 4.

Often it could be done as just two insns though?  If the value stored is
not used elsewhere?

So we could make the first alternative cost 8 then as well, which will
also work out for combine, right?

Alternatively we could have what is now the second alternative be the
first, if that is realistic -- that one already has cost 8 (it is just
two machine instructions).


Segher
Xionghu Luo June 16, 2021, 1:39 a.m. UTC | #17
On 2021/6/12 04:16, Segher Boessenkool wrote:
> On Thu, Jun 10, 2021 at 03:11:08PM +0800, Xionghu Luo wrote:
>> On 2021/6/10 00:24, Segher Boessenkool wrote:
>>>>     "!BYTES_BIG_ENDIAN && TARGET_VSX && reload_completed && !TARGET_P9_VECTOR
>>>>      && !altivec_indexed_or_indirect_operand (operands[0], <MODE>mode)"
>>>>     [(const_int 0)]
>>>> {
>>>>     rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>>>     rs6000_emit_le_vsx_permute (operands[0], operands[1], <MODE>mode);
>>>>     rs6000_emit_le_vsx_permute (operands[1], operands[1], <MODE>mode);
>>>>     DONE;
>>>> })
>>>
>>> So it seems like it is only 3 insns in the very unlucky case?  Normally
>>> it will end up as just one simple store?
>>
>> I am afraid there is not "simple store" for *TImode on P8LE*.  There is only
>> stxvd2x that rotates the element(stvx requires memory to be aligned, not
>> suitable pattern), so every vsx_le_perm_store_v1ti must be split to 3
>> instructions for alternative 0, it seems incorrect to force the cost to be 4.
> 
> Often it could be done as just two insns though?  If the value stored is
> not used elsewhere?
> 
> So we could make the first alternative cost 8 then as well, which will
> also work out for combine, right?
> 
> Alternatively we could have what is now the second alternative be the
> first, if that is realistic -- that one already has cost 8 (it is just
> two machine instructions).

Attached is the patch updating the costs of the 5 *vsx_le_perm_store_<mode>
patterns from 12 to 8.
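
For one of the patterns the change has this shape -- a sketch along the
lines of the "cost" attribute suggestion above; the attached patch is
the authoritative version for all five:

  ;; keep the worst-case length, but cost alternative 0 as two insns
  (set_attr "length" "12,8")
  (set_attr "cost" "8,*")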

Patch

diff --git a/gcc/config/rs6000/rs6000-p8swap.c b/gcc/config/rs6000/rs6000-p8swap.c
index ec503ab742f..3b74e05e396 100644
--- a/gcc/config/rs6000/rs6000-p8swap.c
+++ b/gcc/config/rs6000/rs6000-p8swap.c
@@ -250,6 +250,20 @@  union_uses (swap_web_entry *insn_entry, rtx insn, df_ref def)
     }
 }
 
+/* Return true iff PAT is a SET whose source is a rotate by 64 bits.  */
+
+static bool
+pattern_is_rotate64_p (rtx pat)
+{
+  rtx rot = SET_SRC (pat);
+
+  if (GET_CODE (rot) == ROTATE && CONST_INT_P (XEXP (rot, 1))
+      && INTVAL (XEXP (rot, 1)) == 64)
+    return true;
+
+  return false;
+}
+
 /* Return 1 iff INSN is a load insn, including permuting loads that
    represent an lvxd2x instruction; else return 0.  */
 static unsigned int
@@ -266,6 +280,9 @@  insn_is_load_p (rtx insn)
 	  && MEM_P (XEXP (SET_SRC (body), 0)))
 	return 1;
 
+      if (pattern_is_rotate64_p (body) && MEM_P (XEXP (SET_SRC (body), 0)))
+	return 1;
+
       return 0;
     }
 
@@ -305,6 +322,8 @@  insn_is_swap_p (rtx insn)
   if (GET_CODE (body) != SET)
     return 0;
   rtx rhs = SET_SRC (body);
+  if (pattern_is_rotate64_p (body))
+    return 1;
   if (GET_CODE (rhs) != VEC_SELECT)
     return 0;
   rtx parallel = XEXP (rhs, 1);
@@ -392,7 +411,8 @@  quad_aligned_load_p (swap_web_entry *insn_entry, rtx_insn *insn)
      false.  */
   rtx body = PATTERN (def_insn);
   if (GET_CODE (body) != SET
-      || GET_CODE (SET_SRC (body)) != VEC_SELECT
+      || !(GET_CODE (SET_SRC (body)) == VEC_SELECT
+	  || pattern_is_rotate64_p (body))
       || !MEM_P (XEXP (SET_SRC (body), 0)))
     return false;
 
@@ -531,7 +551,8 @@  const_load_sequence_p (swap_web_entry *insn_entry, rtx insn)
 	 false.  */
       rtx body = PATTERN (def_insn);
       if (GET_CODE (body) != SET
-	  || GET_CODE (SET_SRC (body)) != VEC_SELECT
+	  || !(GET_CODE (SET_SRC (body)) == VEC_SELECT
+	       || pattern_is_rotate64_p (body))
 	  || !MEM_P (XEXP (SET_SRC (body), 0)))
 	return false;
 
@@ -1730,7 +1751,8 @@  replace_swapped_aligned_load (swap_web_entry *insn_entry, rtx swap_insn)
      swap (indicated by code VEC_SELECT).  */
   rtx body = PATTERN (def_insn);
   gcc_assert ((GET_CODE (body) == SET)
-	      && (GET_CODE (SET_SRC (body)) == VEC_SELECT)
+	      && (GET_CODE (SET_SRC (body)) == VEC_SELECT
+		  || pattern_is_rotate64_p (body))
 	      && MEM_P (XEXP (SET_SRC (body), 0)));
 
   rtx src_exp = XEXP (SET_SRC (body), 0);
@@ -2148,7 +2170,8 @@  recombine_lvx_pattern (rtx_insn *insn, del_info *to_delete)
 {
   rtx body = PATTERN (insn);
   gcc_assert (GET_CODE (body) == SET
-	      && GET_CODE (SET_SRC (body)) == VEC_SELECT
+	      && (GET_CODE (SET_SRC (body)) == VEC_SELECT
+		  || pattern_is_rotate64_p (body))
 	      && MEM_P (XEXP (SET_SRC (body), 0)));
 
   rtx mem = XEXP (SET_SRC (body), 0);
@@ -2223,9 +2246,9 @@  static void
 recombine_stvx_pattern (rtx_insn *insn, del_info *to_delete)
 {
   rtx body = PATTERN (insn);
-  gcc_assert (GET_CODE (body) == SET
-	      && MEM_P (SET_DEST (body))
-	      && GET_CODE (SET_SRC (body)) == VEC_SELECT);
+  gcc_assert (GET_CODE (body) == SET && MEM_P (SET_DEST (body))
+	      && (GET_CODE (SET_SRC (body)) == VEC_SELECT
+		  || pattern_is_rotate64_p (body)));
   rtx mem = SET_DEST (body);
   rtx base_reg = XEXP (mem, 0);
 
diff --git a/gcc/testsuite/gcc.target/powerpc/float128-call.c b/gcc/testsuite/gcc.target/powerpc/float128-call.c
index 5895416e985..a1f09df8a57 100644
--- a/gcc/testsuite/gcc.target/powerpc/float128-call.c
+++ b/gcc/testsuite/gcc.target/powerpc/float128-call.c
@@ -21,5 +21,5 @@ 
 TYPE one (void) { return ONE; }
 void store (TYPE a, TYPE *p) { *p = a; }
 
-/* { dg-final { scan-assembler "lxvd2x 34"  } } */
-/* { dg-final { scan-assembler "stxvd2x 34" } } */
+/* { dg-final { scan-assembler "lvx 2"  } } */
+/* { dg-final { scan-assembler "stvx 2" } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/pr100085.c b/gcc/testsuite/gcc.target/powerpc/pr100085.c
new file mode 100644
index 00000000000..0fb466b4a6a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr100085.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile { target { powerpc*-*-linux* } } } */
+/* { dg-require-effective-target powerpc_float128_sw_ok } */
+/* { dg-options "-O2 -mdejagnu-cpu=power8 -mfloat128 -mno-regnames" } */
+
+#ifndef __FLOAT128__
+#error "-mfloat128 is not supported."
+#endif
+
+typedef __vector unsigned __int128 vui128_t;
+
+typedef union
+{
+  __float128 vf1;
+  vui128_t vx1;
+} __VF_128;
+
+vui128_t
+vec_xfer_bin128_2_vui128t (__float128 f128)
+{
+  __VF_128 vunion;
+  vunion.vf1 = f128;
+  return (vunion.vx1);
+}
+
+/* { dg-final { scan-assembler-not "xxpermdi" } } */
+/* { dg-final { scan-assembler-not "stxvd2x" } } */
+/* { dg-final { scan-assembler-not "lxvd2x" } } */
+