diff mbox series

hppa: Improve expansion of ashldi3 when !TARGET_64BIT

Message ID 044501d677ba$099a5520$1cceff60$@nextmovesoftware.com
State New
Headers show
Series hppa: Improve expansion of ashldi3 when !TARGET_64BIT | expand

Commit Message

Roger Sayle Aug. 21, 2020, 12:53 p.m. UTC
This patch improves the code generated on PA-RISC for DImode
(double word) left shifts by small constants (1-31).  This target
has a very cool shd instruction that can be recognized by combine
for simple shifts, but relying on combine is fragile for more
complicated functions.  This patch tweaks pa.md's ashldi3 expander,
to form the optimal two instruction shd/zdep sequence at RTL
expansion time.

As an example of the benefits of this approach, the simple function

unsigned long long u9(unsigned long long x) { return x*9; }

currently generates 9 instructions

u9:     copy %r25,%r28
        copy %r26,%r29
        extru %r26,2,3,%r21
        zdep %r25,28,29,%r19
        zdep %r26,28,29,%r20
        or %r21,%r19,%r19
        add %r29,%r20,%r29
        addc %r28,%r19,%r28
        bv,n %r0(%r2)

and with this patch now requires only 7:

u9:     copy %r25,%r28
        copy %r26,%r29
        shd %r26,%r25,29,%r19
        zdep %r26,28,29,%r20
        add %r29,%r20,%r29
        addc %r28,%r19,%r28
        bv,n %r0(%r2)


This improvement is a first step towards getting synth_mult to
behave sanely on hppa (PR middle-end/87256).

Unfortunately, it's been a long while since I've had access to a
hppa system, so apart from building a cross-compiler and looking at
the assembler it generates, this patch is completely untested.
I was wondering whether Dave or Jeff (or someone else with access
to real hardware) might "spin" this patch for me?

2020-08-21  Roger Sayle  <roger@nextmovesoftware.com>

	* config/pa/pa.md (ashldi3): Additionally, on !TARGET_64BIT
	generate a two instruction shd/zdep sequence when shifting
	registers by suitable constants.
	(shd_internal): New define_expand to provide gen_shd_internal.


Thanks in advance,
Roger
--
Roger Sayle
NextMove Software
Cambridge, UK

Comments

John David Anglin Aug. 21, 2020, 7 p.m. UTC | #1
Hi Roger,

On 2020-08-21 8:53 a.m., Roger Sayle wrote:
> I was wondering whether Dave or Jeff (or someone else with access
> to real hardware) might "spin" this patch for me?
This may be totally unrelated to this patch but I hit this error in stage2 testing your change:
build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md insn-conditions.md \
        -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c
genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383

It's great that you are back helpting with the middle-end.

Regards,
Dave
Roger Sayle Aug. 22, 2020, 8:52 a.m. UTC | #2
Hi Dave,

It's great to hear from you.  It's been a long while.

Sorry, doh! yes, there's a mistake in my patch (that I introduced when I
renumbered
the operands in the shd's define_expand to be the more logical 0, 1, 2, 3,
then 4).
Sorry for the inconvenience [due to my lack of familiarity with PA-RISC
assembly].
Hopefully you should get much better mileage out of the attached revision.

Thanks again (and my sincere apologies),
Roger
--

-----Original Message-----
From: John David Anglin <dave.anglin@bell.net> 
Sent: 21 August 2020 20:00
To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches'
<gcc-patches@gcc.gnu.org>
Cc: 'Jeff Law' <law@redhat.com>
Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT

Hi Roger,

On 2020-08-21 8:53 a.m., Roger Sayle wrote:
> I was wondering whether Dave or Jeff (or someone else with access to 
> real hardware) might "spin" this patch for me?
This may be totally unrelated to this patch but I hit this error in stage2
testing your change:
build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md
insn-conditions.md \
        -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c
genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383

It's great that you are back helpting with the middle-end.

Regards,
Dave

--
John David Anglin  dave.anglin@bell.net
diff --git a/gcc/config/pa/pa.md b/gcc/config/pa/pa.md
index 6350c68..713ff17 100644
--- a/gcc/config/pa/pa.md
+++ b/gcc/config/pa/pa.md
@@ -6416,9 +6416,32 @@
   [(set (match_operand:DI 0 "register_operand" "")
 	(ashift:DI (match_operand:DI 1 "lhs_lshift_operand" "")
 		   (match_operand:DI 2 "arith32_operand" "")))]
-  "TARGET_64BIT"
+  ""
   "
 {
+  if (!TARGET_64BIT)
+    {
+      if (REG_P (operands[0]) && GET_CODE (operands[2]) == CONST_INT)
+	{
+	  unsigned HOST_WIDE_INT shift = UINTVAL (operands[2]);
+	  if (shift >= 1 && shift <= 31)
+	    {
+	      rtx dst = operands[0];
+	      rtx src = force_reg (DImode, operands[1]);
+	      emit_insn (gen_shd_internal (gen_highpart (SImode, dst),
+					   gen_lowpart (SImode, src),
+					   GEN_INT (32-shift),
+					   gen_highpart (SImode, src),
+					   GEN_INT (shift)));
+	      emit_insn (gen_ashlsi3 (gen_lowpart (SImode, dst),
+				      gen_lowpart (SImode, src),
+				      GEN_INT (shift)));
+	      DONE;
+	    }
+	}
+      /* Fallback to using optabs.c's expand_doubleword_shift.  */
+      FAIL;
+    }
   if (GET_CODE (operands[2]) != CONST_INT)
     {
       rtx temp = gen_reg_rtx (DImode);
@@ -6705,6 +6728,15 @@
   [(set_attr "type" "shift")
    (set_attr "length" "4")])
 
+(define_expand "shd_internal"
+  [(set (match_operand:SI 0 "register_operand")
+	(ior:SI
+	  (lshiftrt:SI (match_operand:SI 1 "register_operand")
+		       (match_operand:SI 2 "const_int_operand"))
+	  (ashift:SI (match_operand:SI 3 "register_operand")
+		     (match_operand:SI 4 "const_int_operand"))))]
+  "")
+
 (define_insn ""
   [(set (match_operand:SI 0 "register_operand" "=r")
 	(and:SI (ashift:SI (match_operand:SI 1 "register_operand" "r")
John David Anglin Aug. 22, 2020, 12:57 p.m. UTC | #3
Hi Roger,

Started a test of your latest version.

It appears we miss 64-bit patterns similar to these:
(define_insn ""
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operator:SI 5 "plus_xor_ior_operator"
          [(ashift:SI (match_operand:SI 1 "register_operand" "r")
                      (match_operand:SI 3 "const_int_operand" "n"))
           (lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
                        (match_operand:SI 4 "const_int_operand" "n"))]))]
  "INTVAL (operands[3]) + INTVAL (operands[4]) == 32"
  "{shd|shrpw} %1,%2,%4,%0"
  [(set_attr "type" "shift")
   (set_attr "length" "4")])

(define_insn ""
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operator:SI 5 "plus_xor_ior_operator"
          [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
                        (match_operand:SI 4 "const_int_operand" "n"))
           (ashift:SI (match_operand:SI 1 "register_operand" "r")
                      (match_operand:SI 3 "const_int_operand" "n"))]))]
  "INTVAL (operands[3]) + INTVAL (operands[4]) == 32"
  "{shd|shrpw} %1,%2,%4,%0"
  [(set_attr "type" "shift")
   (set_attr "length" "4")])

I'm wondering if it would be useful to define the 64-bit equivalents using the "shrpd" instruction.
If that's the case, then I think it would be good to rename "shd_internal" to "shrpw_internal".

Regards,
Dave

On 2020-08-22 4:52 a.m., Roger Sayle wrote:
> Hi Dave,
>
> It's great to hear from you.  It's been a long while.
>
> Sorry, doh! yes, there's a mistake in my patch (that I introduced when I
> renumbered
> the operands in the shd's define_expand to be the more logical 0, 1, 2, 3,
> then 4).
> Sorry for the inconvenience [due to my lack of familiarity with PA-RISC
> assembly].
> Hopefully you should get much better mileage out of the attached revision.
>
> Thanks again (and my sincere apologies),
> Roger
> --
>
> -----Original Message-----
> From: John David Anglin <dave.anglin@bell.net> 
> Sent: 21 August 2020 20:00
> To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches'
> <gcc-patches@gcc.gnu.org>
> Cc: 'Jeff Law' <law@redhat.com>
> Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
>
> Hi Roger,
>
> On 2020-08-21 8:53 a.m., Roger Sayle wrote:
>> I was wondering whether Dave or Jeff (or someone else with access to 
>> real hardware) might "spin" this patch for me?
> This may be totally unrelated to this patch but I hit this error in stage2
> testing your change:
> build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md
> insn-conditions.md \
>         -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c
> genattrtab: Internal error: abort in attr_alt_union, at genattrtab.c:2383
>
> It's great that you are back helpting with the middle-end.
>
> Regards,
> Dave
>
> --
> John David Anglin  dave.anglin@bell.net
>
Roger Sayle Aug. 22, 2020, 4:01 p.m. UTC | #4
Hi Dave,

Many thanks for your help.

I suspect that the issue with the 64-bit patterns is that the second variant
of 
pa.md's define_insn "shrpdi4" is unlikely ever to match as (minus:DI
(const_int 64) x)
is never "canonical" when x is itself a CONST_INT.  Splitting this
define_insn
into two (or three see below) separate forms; the first as it currently is
and the
second (as you suggest) with 
	"TARGET_64BIT
	  && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
should do the trick.

My first impression was that the DImode shrpd instructions would be most
useful for implementing TI mode shifts, but that TI mode isn't supported by
hppa64.  But then I noticed that the more immediate benefit would be in
supporting rotrdi3 and rotldi3 on TARGET_64BIT that currently don't have
expanders nor insns defined.  Here GCC currently generates three
instructions
where a single shrpd would be optimal.

I'm sure that folks are aware of this but something I learned/found strange
was
the implications of using (match_operator:SI 5 "plus_xor_ior_operator" x y).
Clearly it's nice to have these patterns match IOR, PLUS and XOR (as
recently
pointed out by Jakub in a recent review), but using match_operator has two
side-effects.  The first is that recog's machinery doesn't know these
operators 
are commutative, so two variants of a define_insn using match_operator need
to be specified, i.e. (... y x).  The second side-effect is that for
generation/expansion
it's impossible (inefficient?) to use the corresponding gen_foo functions.
It would
be nice if match_operator:5 resulted in the gen_function taking an enum
rtx_code as the corresponding argument, to specify IOR or PLUS or XOR.
Alas no.  So instead I added the extremely light weight define_expand
for SImode shd_internal to select IOR arbitrarily as the preferred RTL.
You're right, that a similar strategy using a DImode shrpd_internal would
be required for the 64-bit form; but the two are different; it's not that
the
current name would be affected (or should be changed).

Hence the define_expand for rotrdi3 and rotldi3 would/should (for constant
shifts) call gen_shrpd_internal to create an insn that matches one of the
define_insns.

I'd be happy to write that patch, but I personally have no way of testing
it.
It's a shame there isn't a hppa64 machine on the GCC compile farm.

[p.s. I've ordered Gerry Kane's "PA-RISC 2.0 architecture" book from amazon,
so I'll hopefully understand more of what I'm talking about when it
arrives].

Thanks again.
Roger
--

-----Original Message-----
From: John David Anglin <dave.anglin@bell.net> 
Sent: 22 August 2020 13:58
To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches'
<gcc-patches@gcc.gnu.org>
Cc: 'Jeff Law' <law@redhat.com>
Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT

Hi Roger,

Started a test of your latest version.

It appears we miss 64-bit patterns similar to these:
(define_insn ""
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operator:SI 5 "plus_xor_ior_operator"
          [(ashift:SI (match_operand:SI 1 "register_operand" "r")
                      (match_operand:SI 3 "const_int_operand" "n"))
           (lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
                        (match_operand:SI 4 "const_int_operand" "n"))]))]
  "INTVAL (operands[3]) + INTVAL (operands[4]) == 32"
  "{shd|shrpw} %1,%2,%4,%0"
  [(set_attr "type" "shift")
   (set_attr "length" "4")])

(define_insn ""
  [(set (match_operand:SI 0 "register_operand" "=r")
        (match_operator:SI 5 "plus_xor_ior_operator"
          [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
                        (match_operand:SI 4 "const_int_operand" "n"))
           (ashift:SI (match_operand:SI 1 "register_operand" "r")
                      (match_operand:SI 3 "const_int_operand" "n"))]))]
  "INTVAL (operands[3]) + INTVAL (operands[4]) == 32"
  "{shd|shrpw} %1,%2,%4,%0"
  [(set_attr "type" "shift")
   (set_attr "length" "4")])

I'm wondering if it would be useful to define the 64-bit equivalents using
the "shrpd" instruction.
If that's the case, then I think it would be good to rename "shd_internal"
to "shrpw_internal".

Regards,
Dave

On 2020-08-22 4:52 a.m., Roger Sayle wrote:
> Hi Dave,
>
> It's great to hear from you.  It's been a long while.
>
> Sorry, doh! yes, there's a mistake in my patch (that I introduced when 
> I renumbered the operands in the shd's define_expand to be the more 
> logical 0, 1, 2, 3, then 4).
> Sorry for the inconvenience [due to my lack of familiarity with 
> PA-RISC assembly].
> Hopefully you should get much better mileage out of the attached revision.
>
> Thanks again (and my sincere apologies), Roger
> --
>
> -----Original Message-----
> From: John David Anglin <dave.anglin@bell.net>
> Sent: 21 August 2020 20:00
> To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches'
> <gcc-patches@gcc.gnu.org>
> Cc: 'Jeff Law' <law@redhat.com>
> Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when 
> !TARGET_64BIT
>
> Hi Roger,
>
> On 2020-08-21 8:53 a.m., Roger Sayle wrote:
>> I was wondering whether Dave or Jeff (or someone else with access to 
>> real hardware) might "spin" this patch for me?
> This may be totally unrelated to this patch but I hit this error in 
> stage2 testing your change:
> build/genattrtab ../../gcc/gcc/common.md ../../gcc/gcc/config/pa/pa.md 
> insn-conditions.md \
>         -Atmp-attrtab.c -Dtmp-dfatab.c -Ltmp-latencytab.c
> genattrtab: Internal error: abort in attr_alt_union, at 
> genattrtab.c:2383
>
> It's great that you are back helpting with the middle-end.
>
> Regards,
> Dave
>
> --
> John David Anglin  dave.anglin@bell.net
>


--
John David Anglin  dave.anglin@bell.net
John David Anglin Aug. 22, 2020, 10:09 p.m. UTC | #5
On 2020-08-22 12:01 p.m., Roger Sayle wrote:
> I suspect that the issue with the 64-bit patterns is that the second variant
> of 
> pa.md's define_insn "shrpdi4" is unlikely ever to match as (minus:DI
> (const_int 64) x)
> is never "canonical" when x is itself a CONST_INT.  Splitting this
> define_insn
> into two (or three see below) separate forms; the first as it currently is
> and the
> second (as you suggest) with 
> 	"TARGET_64BIT
> 	  && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
> should do the trick.
I will go ahead and add the basic patterns.  It seems it would be best if I avoid
using the "plus_xor_ior_operator".  It also seems the 32-bit patterns should avoid it.
>
> My first impression was that the DImode shrpd instructions would be most
> useful for implementing TI mode shifts, but that TI mode isn't supported by
> hppa64.  But then I noticed that the more immediate benefit would be in
> supporting rotrdi3 and rotldi3 on TARGET_64BIT that currently don't have
> expanders nor insns defined.  Here GCC currently generates three
> instructions
> where a single shrpd would be optimal.
It turns out we now need to support TI mode and __int128 for libgomp.  The hppa64-hpux target won't
boot without it.  I had just added a change to support TI mode but it's untested.

Regards,
Dave
Roger Sayle Aug. 22, 2020, 11:24 p.m. UTC | #6
Hi Dave,
I actually think using plus_xor_ior operator is useful.  It means that if combine,
inlining or some other RTL simplification generates these variants, these forms
will still be recognized by the backend.  It's more typing, but the compiler produces
better code.

Here's what I have so far, but please feel free to modify anything.  I'll leave the
rest to you.

With this patch:

unsigned long long rotl4(unsigned long long x)
{
  return (x<<4) | (x>>60);
}

unsigned long long rotr4(unsigned long long x)
{
  return (x<<60) | (x>>4);
}

which previously generated:

rotl4:	depd,z %r26,59,60,%r28
	extrd,u %r26,3,4,%r26
	bve (%r2)
	or %r26,%r28,%r28

rotr4:	extrd,u %r26,59,60,%r28
	depd,z %r26,3,4,%r26
	bve (%r2)
	or %r26,%r28,%r28

now produces:

rotl4:	bve (%r2)
	shrpd %r26,%r26,60,%r28

rotr4:	bve (%r2)
	shrpd %r26,%r26,4,%r28


I'm guessing this is very similar to what you were thinking (or what I described previously).

Many thanks again for trying out these patches/suggestions for me.

Best regards,
Roger
--

-----Original Message-----
From: John David Anglin <dave.anglin@bell.net> 
Sent: 22 August 2020 23:09
To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches' <gcc-patches@gcc.gnu.org>
Cc: 'Jeff Law' <law@redhat.com>
Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT

On 2020-08-22 12:01 p.m., Roger Sayle wrote:
> I suspect that the issue with the 64-bit patterns is that the second 
> variant of pa.md's define_insn "shrpdi4" is unlikely ever to match as 
> (minus:DI (const_int 64) x) is never "canonical" when x is itself a 
> CONST_INT.  Splitting this define_insn into two (or three see below) 
> separate forms; the first as it currently is and the second (as you 
> suggest) with
> 	"TARGET_64BIT
> 	  && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
> should do the trick.
I will go ahead and add the basic patterns.  It seems it would be best if I avoid using the "plus_xor_ior_operator".  It also seems the 32-bit patterns should avoid it.
>
> My first impression was that the DImode shrpd instructions would be 
> most useful for implementing TI mode shifts, but that TI mode isn't 
> supported by hppa64.  But then I noticed that the more immediate 
> benefit would be in supporting rotrdi3 and rotldi3 on TARGET_64BIT 
> that currently don't have expanders nor insns defined.  Here GCC 
> currently generates three instructions where a single shrpd would be 
> optimal.
It turns out we now need to support TI mode and __int128 for libgomp.  The hppa64-hpux target won't boot without it.  I had just added a change to support TI mode but it's untested.

Regards,
Dave

--
John David Anglin  dave.anglin@bell.net
diff --git a/gcc/config/pa/pa.md b/gcc/config/pa/pa.md
index 6350c68..5f04c02 100644
--- a/gcc/config/pa/pa.md
+++ b/gcc/config/pa/pa.md
@@ -6604,32 +6604,82 @@
    (set_attr "length" "4")])
 
 ; Shift right pair word 0 to 31 bits.
-(define_insn "shrpsi4"
-  [(set (match_operand:SI 0 "register_operand" "=r,r")
-	(ior:SI (ashift:SI (match_operand:SI 1 "register_operand" "r,r")
-			   (minus:SI (const_int 32)
-			     (match_operand:SI 3 "shift5_operand" "q,n")))
-		(lshiftrt:SI (match_operand:SI 2 "register_operand" "r,r")
-			     (match_dup 3))))]
+(define_insn "*shrpsi4_1"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+	(match_operator:SI 4 "plus_xor_ior_operator"
+	  [(ashift:SI (match_operand:SI 1 "register_operand" "r")
+		      (minus:SI (const_int 32)
+				(match_operand:SI 3 "register_operand" "q")))
+	   (lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
+			(match_dup 3))]))]
   ""
-  "@
-   {vshd %1,%2,%0|shrpw %1,%2,%%sar,%0}
-   {shd|shrpw} %1,%2,%3,%0"
+  "{vshd %1,%2,%0|shrpw %1,%2,%%sar,%0}"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
+(define_insn "*shrpsi4_2"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+	(match_operator:SI 4 "plus_xor_ior_operator"
+	  [(lshiftrt:SI (match_operand:SI 2 "register_operand" "r")
+			(match_operand:SI 3 "register_operand" "q"))
+	   (ashift:SI (match_operand:SI 1 "register_operand" "r")
+		      (minus:SI (const_int 32)
+				(match_dup 3)))]))]
+  ""
+  "{vshd %1,%2,%0|shrpw %1,%2,%%sar,%0}"
   [(set_attr "type" "shift")
    (set_attr "length" "4")])
 
 ; Shift right pair doubleword 0 to 63 bits.
-(define_insn "shrpdi4"
-  [(set (match_operand:DI 0 "register_operand" "=r,r")
-	(ior:DI (ashift:DI (match_operand:SI 1 "register_operand" "r,r")
-			   (minus:DI (const_int 64)
-			     (match_operand:DI 3 "shift6_operand" "q,n")))
-		(lshiftrt:DI (match_operand:DI 2 "register_operand" "r,r")
-			     (match_dup 3))))]
+(define_insn "*shrpdi4_1"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(match_operator:DI 4 "plus_xor_ior_operator"
+	  [(ashift:DI (match_operand:DI 1 "register_operand" "r")
+		      (minus:DI (const_int 64)
+				(match_operand:DI 3 "register_operand" "q")))
+	   (lshiftrt:DI (match_operand:DI 2 "register_operand" "r")
+			(match_dup 3))]))]
   "TARGET_64BIT"
-  "@
-   shrpd %1,%2,%%sar,%0
-   shrpd %1,%2,%3,%0"
+  "shrpd %1,%2,%%sar,%0"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
+(define_insn "*shrpdi4_2"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(match_operator:DI 4 "plus_xor_ior_operator"
+	  [(lshiftrt:DI (match_operand:DI 2 "register_operand" "r")
+			(match_operand:DI 3 "shift6_operand" "q"))
+	   (ashift:DI (match_operand:SI 1 "register_operand" "r")
+		      (minus:DI (const_int 64)
+				(match_dup 3)))]))]
+  "TARGET_64BIT"
+  "shrpd %1,%2,%%sar,%0"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
+(define_insn "*shrpdi4_3"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(match_operator:DI 5 "plus_xor_ior_operator"
+	  [(ashift:DI (match_operand:DI 1 "register_operand" "r")
+		      (match_operand:DI 3 "const_int_operand" "n"))
+	   (lshiftrt:DI (match_operand:DI 2 "register_operand" "r")
+			(match_operand:DI 4 "const_int_operand" "n"))]))]
+  "TARGET_64BIT
+   && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
+  "shrpd %1,%2,%4,%0"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
+(define_insn "*shrpdi4_4"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+	(match_operator:DI 5 "plus_xor_ior_operator"
+	  [(lshiftrt:DI (match_operand:DI 2 "register_operand" "r")
+			(match_operand:DI 4 "const_int_operand" "n"))
+	   (ashift:DI (match_operand:DI 1 "register_operand" "r")
+		      (match_operand:DI 3 "const_int_operand" "n"))]))]
+  "TARGET_64BIT
+   && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
+  "shrpd %1,%2,%4,%0"
   [(set_attr "type" "shift")
    (set_attr "length" "4")])
 
@@ -6668,7 +6718,7 @@
   /* Else expand normally.  */
 }")
 
-(define_insn ""
+(define_insn "*rotlsi3_internal"
   [(set (match_operand:SI 0 "register_operand" "=r")
         (rotate:SI (match_operand:SI 1 "register_operand" "r")
                    (match_operand:SI 2 "const_int_operand" "n")))]
@@ -6681,6 +6731,54 @@
   [(set_attr "type" "shift")
    (set_attr "length" "4")])
 
+(define_insn "rotrdi3"
+  [(set (match_operand:DI 0 "register_operand" "=r,r")
+	(rotatert:DI (match_operand:DI 1 "register_operand" "r,r")
+		     (match_operand:DI 2 "shift6_operand" "q,n")))]
+  "TARGET_64BIT"
+  "*
+{
+  if (GET_CODE (operands[2]) == CONST_INT)
+    {
+      operands[2] = GEN_INT (INTVAL (operands[2]) & 63);
+      return \"shrpd %1,%1,%2,%0\";
+    }
+  else
+    return \"shrpd %1,%1,%%sar,%0\";
+}"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
+(define_expand "rotldi3"
+  [(set (match_operand:DI 0 "register_operand" "")
+        (rotate:DI (match_operand:DI 1 "register_operand" "")
+                   (match_operand:DI 2 "arith32_operand" "")))]
+  "TARGET_64BIT"
+  "
+{
+  if (GET_CODE (operands[2]) != CONST_INT)
+    {
+      rtx temp = gen_reg_rtx (DImode);
+      emit_insn (gen_subdi3 (temp, GEN_INT (64), operands[2]));
+      emit_insn (gen_rotrdi3 (operands[0], operands[1], temp));
+      DONE;
+    }
+  /* Else expand normally.  */
+}")
+
+(define_insn "*rotldi3_internal"
+  [(set (match_operand:DI 0 "register_operand" "=r")
+        (rotate:DI (match_operand:DI 1 "register_operand" "r")
+                   (match_operand:DI 2 "const_int_operand" "n")))]
+  "TARGET_64BIT"
+  "*
+{
+  operands[2] = GEN_INT ((64 - INTVAL (operands[2])) & 63);
+  return \"shrpd %1,%1,%2,%0\";
+}"
+  [(set_attr "type" "shift")
+   (set_attr "length" "4")])
+
 (define_insn ""
   [(set (match_operand:SI 0 "register_operand" "=r")
 	(match_operator:SI 5 "plus_xor_ior_operator"
Li, Pan2 via Gcc-patches Aug. 24, 2020, 3:42 p.m. UTC | #7
On Fri, 2020-08-21 at 13:53 +0100, Roger Sayle wrote:
> This patch improves the code generated on PA-RISC for DImode
> (double word) left shifts by small constants (1-31).  This target
> has a very cool shd instruction that can be recognized by combine
> for simple shifts, but relying on combine is fragile for more
> complicated functions.  This patch tweaks pa.md's ashldi3 expander,
> to form the optimal two instruction shd/zdep sequence at RTL
> expansion time.
> 
> As an example of the benefits of this approach, the simple function
> 
> unsigned long long u9(unsigned long long x) { return x*9; }
> 
> currently generates 9 instructions
> 
> u9:     copy %r25,%r28
>         copy %r26,%r29
>         extru %r26,2,3,%r21
>         zdep %r25,28,29,%r19
>         zdep %r26,28,29,%r20
>         or %r21,%r19,%r19
>         add %r29,%r20,%r29
>         addc %r28,%r19,%r28
>         bv,n %r0(%r2)
> 
> and with this patch now requires only 7:
> 
> u9:     copy %r25,%r28
>         copy %r26,%r29
>         shd %r26,%r25,29,%r19
>         zdep %r26,28,29,%r20
>         add %r29,%r20,%r29
>         addc %r28,%r19,%r28
>         bv,n %r0(%r2)
> 
> 
> This improvement is a first step towards getting synth_mult to
> behave sanely on hppa (PR middle-end/87256).
> 
> Unfortunately, it's been a long while since I've had access to a
> hppa system, so apart from building a cross-compiler and looking at
> the assembler it generates, this patch is completely untested.
> I was wondering whether Dave or Jeff (or someone else with access
> to real hardware) might "spin" this patch for me?
So while I have an R390 down in the basement, it overheats and I don't think I've
turned it on in a few years.

What I do for testing is exploit QEMU user mode emulation.  I have a little root
filesystem with HPPA binaries -- there's enough bits in that filesystem to build
binutils, gcc, glibc & the linux kernel as well as run the testsuite.

The root filesystems are here:

https://github.com/JeffreyALaw/chroots

The Jenkins tester spins the PA once week.  It failed its last run due to
regressions in the analyzer.  I didn't realize it was running the static analyzer
for the qemu emulated targets, which I turned off yesterday and re-started the PA
build.

http://3.14.90.209:8080/job/hppa-linux-gnu/



Jeff
>
Li, Pan2 via Gcc-patches Aug. 25, 2020, 2:05 p.m. UTC | #8
On Sat, 2020-08-22 at 09:52 +0100, Roger Sayle wrote:
> Hi Dave,
> 
> It's great to hear from you.  It's been a long while.
> 
> Sorry, doh! yes, there's a mistake in my patch (that I introduced when I
> renumbered
> the operands in the shd's define_expand to be the more logical 0, 1, 2, 3,
> then 4).
> Sorry for the inconvenience [due to my lack of familiarity with PA-RISC
> assembly].
> Hopefully you should get much better mileage out of the attached revision.
FWIW this version 3-staged in my tester and is currently running the testsuite.
 It should finish testing later today.

If David doesn't have objections, then I suggest going forward with the patch
independent of the others you've got queued up.

jeff
>
John David Anglin Aug. 25, 2020, 2:35 p.m. UTC | #9
On 2020-08-25 10:05 a.m., Jeff Law wrote:
> On Sat, 2020-08-22 at 09:52 +0100, Roger Sayle wrote:
>> Hi Dave,
>>
>> It's great to hear from you.  It's been a long while.
>>
>> Sorry, doh! yes, there's a mistake in my patch (that I introduced when I
>> renumbered
>> the operands in the shd's define_expand to be the more logical 0, 1, 2, 3,
>> then 4).
>> Sorry for the inconvenience [due to my lack of familiarity with PA-RISC
>> assembly].
>> Hopefully you should get much better mileage out of the attached revision.
> FWIW this version 3-staged in my tester and is currently running the testsuite.
>  It should finish testing later today.
>
> If David doesn't have objections, then I suggest going forward with the patch
> independent of the others you've got queued up.
The patch is fine.  Testing completed this morning on hpux and there are no regressions.

Please install on trunk and gcc-10 branch.

Thanks very much,
Dave
Li, Pan2 via Gcc-patches Aug. 26, 2020, 8:08 p.m. UTC | #10
On Sun, 2020-08-23 at 00:24 +0100, Roger Sayle wrote:
> Hi Dave,
> I actually think using plus_xor_ior operator is useful.  It means that if combine,
> inlining or some other RTL simplification generates these variants, these forms
> will still be recognized by the backend.  It's more typing, but the compiler produces
> better code.
> 
> Here's what I have so far, but please feel free to modify anything.  I'll leave the
> rest to you.
> 
> With this patch:
> 
> unsigned long long rotl4(unsigned long long x)
> {
>   return (x<<4) | (x>>60);
> }
> 
> unsigned long long rotr4(unsigned long long x)
> {
>   return (x<<60) | (x>>4);
> }
> 
> which previously generated:
> 
> rotl4:	depd,z %r26,59,60,%r28
> 	extrd,u %r26,3,4,%r26
> 	bve (%r2)
> 	or %r26,%r28,%r28
> 
> rotr4:	extrd,u %r26,59,60,%r28
> 	depd,z %r26,3,4,%r26
> 	bve (%r2)
> 	or %r26,%r28,%r28
> 
> now produces:
> 
> rotl4:	bve (%r2)
> 	shrpd %r26,%r26,60,%r28
> 
> rotr4:	bve (%r2)
> 	shrpd %r26,%r26,4,%r28
> 
> 
> I'm guessing this is very similar to what you were thinking (or what I described previously).
> 
> Many thanks again for trying out these patches/suggestions for me.
So I put this one into the tester overnight. 

It 3-stages, but trips:
Tests that now fail, but worked before (5 tests):

gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2
gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2
gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0
gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0
gcc.target/hppa/shadd-2.c scan-assembler-times sh.add 2

I think we've already discussed shadd-2.  It's not immediately clear if the slsr
failure is due to this change or something different -- the latter seems like a
definite possibility as we're changing things as we leave gimple, but the test is
checking the result of the gimple optimizers.

Your call on how to proceed.  I can put additional patches into the tester
easily, so if you've got something you want investigated, don't hesitate to reach
out.

jeff
John David Anglin Aug. 26, 2020, 8:33 p.m. UTC | #11
On 2020-08-26 4:08 p.m., Jeff Law wrote:
> It 3-stages, but trips:
> Tests that now fail, but worked before (5 tests):
>
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0
In my last Linux build, these failed:

FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\\\* 4" 2
FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\\\* 5" 0

The above didn't include patchh3.txt.  Seems to be a difference in quoting.  I'll check.
Linux build is close to running this test.

These tests didn't fail on 32-bit hpux without patchh2.txt or in previous Linux build.

Dave
Roger Sayle Aug. 26, 2020, 9:23 p.m. UTC | #12
The failure of slsr-13.c is not caused by the patchh3.txt, but the previous patchh2.txt
that's now on mainline and the gcc-10 branch.  That change provided more accurate
rtx_costs for hppa, and solved the performance problems with synth_mult.

These more accurate target rtx_costs are used by the gimple-ssa-strength-reduction.c
(via a call to mult_by_coeff_cost) to decide whether applying strength reduction would
be profitable.  This test case, slsr-13.c, assumes that two multiplications by four are
cheaper than two multiplications by five.   (I believe) This is not the case on hppa which
has a sh2add instruction, that performs a multiplication by five in one cycle, or exactly
the same cost as performing a left shift by two (i.e. a multiplication by four).  Oddly, I
also believe this isn't the case on x86_64, where the similar lea instruction is (sometimes)
as efficient as left shift by two bits.

I suspect that slsr-13.c should be expected to fail on some platforms depending upon 
a targets instruction set/timings.

Unfortunately, to complicate things in our case, it appears that after RTL optimizations,
performing this strength reduction actually does results in fewer instructions on the PA,
so it's the right thing to do.  I'll need to study the logic in gimple-ssa-strength to see
how mult_by_coeff cost is being used; cost(x*4) == cost(x*5), but cost(x*4+y) < cost(x*5+y).

My apologies for the inconvenience.

Roger
--

-----Original Message-----
From: John David Anglin <dave.anglin@bell.net> 
Sent: 26 August 2020 21:34
To: law@redhat.com; Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches' <gcc-patches@gcc.gnu.org>
Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT

On 2020-08-26 4:08 p.m., Jeff Law wrote:
> It 3-stages, but trips:
> Tests that now fail, but worked before (5 tests):
>
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 4" 2 
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0 
> gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\* 5" 0
In my last Linux build, these failed:

FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\\\* 4" 2
FAIL: gcc.dg/tree-ssa/slsr-13.c scan-tree-dump-times optimized " \\\\* 5" 0

The above didn't include patchh3.txt.  Seems to be a difference in quoting.  I'll check.
Linux build is close to running this test.

These tests didn't fail on 32-bit hpux without patchh2.txt or in previous Linux build.

Dave

--
John David Anglin  dave.anglin@bell.net
Li, Pan2 via Gcc-patches Aug. 26, 2020, 9:30 p.m. UTC | #13
On Wed, 2020-08-26 at 22:23 +0100, Roger Sayle wrote:
> The failure of slsr-13.c is not caused by the patchh3.txt, but the previous patchh2.txt
> that's now on mainline and the gcc-10 branch.  That change provided more accurate
> rtx_costs for hppa, and solved the performance problems with synth_mult.
> 
> These more accurate target rtx_costs are used by the gimple-ssa-strength-reduction.c
> (via a call to mult_by_coeff_cost) to decide whether applying strength reduction would
> be profitable.  This test case, slsr-13.c, assumes that two multiplications by four are
> cheaper than two multiplications by five.   (I believe) This is not the case on hppa which
> has a sh2add instruction, that performs a multiplication by five in one cycle, or exactly
> the same cost as performing a left shift by two (i.e. a multiplication by four).  Oddly, I
> also believe this isn't the case on x86_64, where the similar lea instruction is (sometimes)
> as efficient as left shift by two bits.
Yea, you can do a multiplication by 5 cheap on the PA.  While the x86 can too, I
don't think it's as clear cut a win as the PA, so they may not cost the same as a
multiply by 4 or left shift by 2.


> 
> I suspect that slsr-13.c should be expected to fail on some platforms depending upon 
> a targets instruction set/timings.
Sounds like you're right since it depends on mult_by_coeff_cost under the hood :(
 I presume you or John will xfail it for the PA.


> 
> Unfortunately, to complicate things in our case, it appears that after RTL optimizations,
> performing this strength reduction actually does results in fewer instructions on the PA,
> so it's the right thing to do.  I'll need to study the logic in gimple-ssa-strength to see
> how mult_by_coeff cost is being used; cost(x*4) == cost(x*5), but cost(x*4+y) < cost(x*5+y).
Yea, x*4+y is cheaper than x*5+y on the PA.  The first is a single sh2add, the
second requires an additional add instruction.

Jeff
John David Anglin Aug. 27, 2020, 12:31 a.m. UTC | #14
On 2020-08-26 5:23 p.m., Roger Sayle wrote:
> These more accurate target rtx_costs are used by the gimple-ssa-strength-reduction.c
> (via a call to mult_by_coeff_cost) to decide whether applying strength reduction would
> be profitable.  This test case, slsr-13.c, assumes that two multiplications by four are
> cheaper than two multiplications by five.   (I believe) This is not the case on hppa which
> has a sh2add instruction, that performs a multiplication by five in one cycle, or exactly
> the same cost as performing a left shift by two (i.e. a multiplication by four).  Oddly, I
> also believe this isn't the case on x86_64, where the similar lea instruction is (sometimes)
> as efficient as left shift by two bits.
This looks like a regression.

gcc-10 (prepatch):

        addl %r25,%r26,%r28
        sh2addl %r25,%r28,%r25
        sh2addl %r26,%r28,%r26
        addl %r26,%r28,%r28
        bv %r0(%r2)
        addl %r28,%r25,%r28

  <bb 2> [local count: 1073741824]:
  x1_4 = c_2(D) + s_3(D);
  slsr_11 = s_3(D) * 4;
  x2_6 = x1_4 + slsr_11;
  slsr_12 = c_2(D) * 4;
  x3_8 = x1_4 + slsr_12;
  _1 = x1_4 + x2_6;
  x_9 = _1 + x3_8;
  return x_9;

gcc-11 (with patch):

        addl %r25,%r26,%r19
        sh2addl %r26,%r26,%r28
        addl %r28,%r25,%r28
        sh2addl %r25,%r25,%r25
        addl %r28,%r19,%r28
        addl %r25,%r26,%r26
        bv %r0(%r2)
        addl %r28,%r26,%r28

  <bb 2> [local count: 1073741824]:
  x1_4 = c_2(D) + s_3(D);
  a2_5 = s_3(D) * 5;
  x2_6 = c_2(D) + a2_5;
  a3_7 = c_2(D) * 5;
  x3_8 = s_3(D) + a3_7;
  _1 = x1_4 + x2_6;
  x_9 = _1 + x3_8;
  return x_9;

Regards,
Dave
Roger Sayle Aug. 27, 2020, 7:16 a.m. UTC | #15
>On 2020-08-26 5:23 p.m., Roger Sayle wrote:
>> These more accurate target rtx_costs are used by the 
>> gimple-ssa-strength-reduction.c (via a call to mult_by_coeff_cost) to 
>> decide whether applying strength reduction would be profitable.  This test case, slsr-13.c, assumes that two multiplications by four are
>> cheaper than two multiplications by five.   (I believe) This is not the case on hppa which
>> has a sh2add instruction, that performs a multiplication by five in 
>> one cycle, or exactly the same cost as performing a left shift by two 
>> (i.e. a multiplication by four).  Oddly, I also believe this isn't the 
>> case on x86_64, where the similar lea instruction is (sometimes) as efficient as left shift by two bits.
>This looks like a regression.
>
>gcc-10 (prepatch):
>
>        addl %r25,%r26,%r28
>        sh2addl %r25,%r28,%r25
>        sh2addl %r26,%r28,%r26
>        addl %r26,%r28,%r28
>        bv %r0(%r2)
>        addl %r28,%r25,%r28
>
>  <bb 2> [local count: 1073741824]:
>  x1_4 = c_2(D) + s_3(D);
>  slsr_11 = s_3(D) * 4;
>  x2_6 = x1_4 + slsr_11;
>  slsr_12 = c_2(D) * 4;
>  x3_8 = x1_4 + slsr_12;
>  _1 = x1_4 + x2_6;
>  x_9 = _1 + x3_8;
>  return x_9;
>
>gcc-11 (with patch):
>
>        addl %r25,%r26,%r19
>        sh2addl %r26,%r26,%r28
>        addl %r28,%r25,%r28
>        sh2addl %r25,%r25,%r25
>        addl %r28,%r19,%r28
>        addl %r25,%r26,%r26
>        bv %r0(%r2)
>        addl %r28,%r26,%r28
>
>  <bb 2> [local count: 1073741824]:
>  x1_4 = c_2(D) + s_3(D);
>  a2_5 = s_3(D) * 5;
>  x2_6 = c_2(D) + a2_5;
>  a3_7 = c_2(D) * 5;
>  x3_8 = s_3(D) + a3_7;
>  _1 = x1_4 + x2_6;
>  x_9 = _1 + x3_8;
>  return x_9;
>
> Regards,
> Dave

There are two interesting (tree-optimization) observations here.  The first is that at the tree-ssa
level both of these gimple sequences look to have exactly the same cost, seven assignments on
a target where *4 is the same cost as *5.  The gimple doesn't attempt to model the sh?add/lea
instructions that combine may find, so at RTL expansion both sequences look equivalent.  One
fix may be to have gimple-ssa-strength-reduction.c just prefer multiplications by 2, 4 and 8,
even on targets that have a single cycle "mul" instruction.

The second observation is why isn't tree-ssa-reassoc.c doing something here.  The test case
is evaluating (s+c)+(s+5*c)+(5*s+c), and this strength reduction test is expecting this to turn
into "tmp=s+c;  return tmp+(tmp+4*c)+(4*s+tmp" which is clever and an improvement, but
overlooks the obvious reassociation 7*(s+c).  Indeed LLVM does this in three instructions:

	tmp1 = s+c;
	tmp2 = tmp1<<3;
	return tmp2-tmp1;

Although the PA backend is (mostly) innocent in this, the lowest impact fix/work around is
to have multiplications by 2, 4 and 8 return COSTS_N_INSNS(1)-1, to indicate a preference
when splitting ties.  I'll prepare a patch.

Roger
--
Richard Biener Aug. 27, 2020, 10:33 a.m. UTC | #16
On Thu, Aug 27, 2020 at 9:17 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> >On 2020-08-26 5:23 p.m., Roger Sayle wrote:
> >> These more accurate target rtx_costs are used by the
> >> gimple-ssa-strength-reduction.c (via a call to mult_by_coeff_cost) to
> >> decide whether applying strength reduction would be profitable.  This test case, slsr-13.c, assumes that two multiplications by four are
> >> cheaper than two multiplications by five.   (I believe) This is not the case on hppa which
> >> has a sh2add instruction, that performs a multiplication by five in
> >> one cycle, or exactly the same cost as performing a left shift by two
> >> (i.e. a multiplication by four).  Oddly, I also believe this isn't the
> >> case on x86_64, where the similar lea instruction is (sometimes) as efficient as left shift by two bits.
> >This looks like a regression.
> >
> >gcc-10 (prepatch):
> >
> >        addl %r25,%r26,%r28
> >        sh2addl %r25,%r28,%r25
> >        sh2addl %r26,%r28,%r26
> >        addl %r26,%r28,%r28
> >        bv %r0(%r2)
> >        addl %r28,%r25,%r28
> >
> >  <bb 2> [local count: 1073741824]:
> >  x1_4 = c_2(D) + s_3(D);
> >  slsr_11 = s_3(D) * 4;
> >  x2_6 = x1_4 + slsr_11;
> >  slsr_12 = c_2(D) * 4;
> >  x3_8 = x1_4 + slsr_12;
> >  _1 = x1_4 + x2_6;
> >  x_9 = _1 + x3_8;
> >  return x_9;
> >
> >gcc-11 (with patch):
> >
> >        addl %r25,%r26,%r19
> >        sh2addl %r26,%r26,%r28
> >        addl %r28,%r25,%r28
> >        sh2addl %r25,%r25,%r25
> >        addl %r28,%r19,%r28
> >        addl %r25,%r26,%r26
> >        bv %r0(%r2)
> >        addl %r28,%r26,%r28
> >
> >  <bb 2> [local count: 1073741824]:
> >  x1_4 = c_2(D) + s_3(D);
> >  a2_5 = s_3(D) * 5;
> >  x2_6 = c_2(D) + a2_5;
> >  a3_7 = c_2(D) * 5;
> >  x3_8 = s_3(D) + a3_7;
> >  _1 = x1_4 + x2_6;
> >  x_9 = _1 + x3_8;
> >  return x_9;
> >
> > Regards,
> > Dave
>
> There are two interesting (tree-optimization) observations here.  The first is that at the tree-ssa
> level both of these gimple sequences look to have exactly the same cost, seven assignments on
> a target where *4 is the same cost as *5.  The gimple doesn't attempt to model the sh?add/lea
> instructions that combine may find, so at RTL expansion both sequences look equivalent.  One
> fix may be to have gimple-ssa-strength-reduction.c just prefer multiplications by 2, 4 and 8,
> even on targets that have a single cycle "mul" instruction.
>
> The second observation is why isn't tree-ssa-reassoc.c doing something here.  The test case
> is evaluating (s+c)+(s+5*c)+(5*s+c), and this strength reduction test is expecting this to turn
> into "tmp=s+c;  return tmp+(tmp+4*c)+(4*s+tmp" which is clever and an improvement, but
> overlooks the obvious reassociation 7*(s+c).  Indeed LLVM does this in three instructions:

reassoc doesn't work on signed types

>
>         tmp1 = s+c;
>         tmp2 = tmp1<<3;
>         return tmp2-tmp1;
>
> Although the PA backend is (mostly) innocent in this, the lowest impact fix/work around is
> to have multiplications by 2, 4 and 8 return COSTS_N_INSNS(1)-1, to indicate a preference
> when splitting ties.  I'll prepare a patch.
>
> Roger
> --
>
>
John David Anglin Sept. 12, 2020, 3:41 p.m. UTC | #17
Committed.

Regards,
Dave

On 2020-08-22 7:24 p.m., Roger Sayle wrote:
> Hi Dave,
> I actually think using plus_xor_ior operator is useful.  It means that if combine,
> inlining or some other RTL simplification generates these variants, these forms
> will still be recognized by the backend.  It's more typing, but the compiler produces
> better code.
>
> Here's what I have so far, but please feel free to modify anything.  I'll leave the
> rest to you.
>
> With this patch:
>
> unsigned long long rotl4(unsigned long long x)
> {
>   return (x<<4) | (x>>60);
> }
>
> unsigned long long rotr4(unsigned long long x)
> {
>   return (x<<60) | (x>>4);
> }
>
> which previously generated:
>
> rotl4:	depd,z %r26,59,60,%r28
> 	extrd,u %r26,3,4,%r26
> 	bve (%r2)
> 	or %r26,%r28,%r28
>
> rotr4:	extrd,u %r26,59,60,%r28
> 	depd,z %r26,3,4,%r26
> 	bve (%r2)
> 	or %r26,%r28,%r28
>
> now produces:
>
> rotl4:	bve (%r2)
> 	shrpd %r26,%r26,60,%r28
>
> rotr4:	bve (%r2)
> 	shrpd %r26,%r26,4,%r28
>
>
> I'm guessing this is very similar to what you were thinking (or what I described previously).
>
> Many thanks again for trying out these patches/suggestions for me.
>
> Best regards,
> Roger
> --
>
> -----Original Message-----
> From: John David Anglin <dave.anglin@bell.net> 
> Sent: 22 August 2020 23:09
> To: Roger Sayle <roger@nextmovesoftware.com>; 'GCC Patches' <gcc-patches@gcc.gnu.org>
> Cc: 'Jeff Law' <law@redhat.com>
> Subject: Re: [PATCH] hppa: Improve expansion of ashldi3 when !TARGET_64BIT
>
> On 2020-08-22 12:01 p.m., Roger Sayle wrote:
>> I suspect that the issue with the 64-bit patterns is that the second 
>> variant of pa.md's define_insn "shrpdi4" is unlikely ever to match as 
>> (minus:DI (const_int 64) x) is never "canonical" when x is itself a 
>> CONST_INT.  Splitting this define_insn into two (or three see below) 
>> separate forms; the first as it currently is and the second (as you 
>> suggest) with
>> 	"TARGET_64BIT
>> 	  && INTVAL (operands[3]) + INTVAL (operands[4]) == 64"
>> should do the trick.
> I will go ahead and add the basic patterns.  It seems it would be best if I avoid using the "plus_xor_ior_operator".  It also seems the 32-bit patterns should avoid it.
>> My first impression was that the DImode shrpd instructions would be 
>> most useful for implementing TI mode shifts, but that TI mode isn't 
>> supported by hppa64.  But then I noticed that the more immediate 
>> benefit would be in supporting rotrdi3 and rotldi3 on TARGET_64BIT 
>> that currently don't have expanders nor insns defined.  Here GCC 
>> currently generates three instructions where a single shrpd would be 
>> optimal.
> It turns out we now need to support TI mode and __int128 for libgomp.  The hppa64-hpux target won't boot without it.  I had just added a change to support TI mode but it's untested.
>
> Regards,
> Dave
>
> --
> John David Anglin  dave.anglin@bell.net
>
diff mbox series

Patch

diff --git a/gcc/config/pa/pa.md b/gcc/config/pa/pa.md
index 6350c68..e7b7635 100644
--- a/gcc/config/pa/pa.md
+++ b/gcc/config/pa/pa.md
@@ -6416,9 +6416,32 @@ 
   [(set (match_operand:DI 0 "register_operand" "")
 	(ashift:DI (match_operand:DI 1 "lhs_lshift_operand" "")
 		   (match_operand:DI 2 "arith32_operand" "")))]
-  "TARGET_64BIT"
+  ""
   "
 {
+  if (!TARGET_64BIT)
+    {
+      if (REG_P (operands[0]) && GET_CODE (operands[2]) == CONST_INT)
+	{
+          unsigned HOST_WIDE_INT shift = UINTVAL (operands[2]);
+	  if (shift >= 1 && shift <= 31)
+	    {
+	      rtx dst = operands[0];
+	      rtx src = force_reg (DImode, operands[1]);
+	      emit_insn (gen_shd_internal (gen_highpart (SImode, dst),
+					   gen_highpart (SImode, src),
+					   GEN_INT (32-shift),
+					   gen_lowpart (SImode, src),
+					   GEN_INT (shift)));
+	      emit_insn (gen_ashlsi3 (gen_lowpart (SImode, dst),
+				      gen_lowpart (SImode, src),
+				      GEN_INT (shift)));
+	      DONE;
+	    }
+	}
+      /* Fallback to using optabs.c's expand_doubleword_shift.  */
+      FAIL;
+    }
   if (GET_CODE (operands[2]) != CONST_INT)
     {
       rtx temp = gen_reg_rtx (DImode);
@@ -6705,6 +6728,15 @@ 
   [(set_attr "type" "shift")
    (set_attr "length" "4")])
 
+(define_expand "shd_internal"
+  [(set (match_operand:SI 0 "register_operand")
+	(ior:SI
+	  (lshiftrt:SI (match_operand:SI 1 "register_operand")
+		       (match_operand:SI 2 "const_int_operand"))
+	  (ashift:SI (match_operand:SI 3 "register_operand")
+		     (match_operand:SI 4 "const_int_operand"))))]
+  "")
+
 (define_insn ""
   [(set (match_operand:SI 0 "register_operand" "=r")
 	(and:SI (ashift:SI (match_operand:SI 1 "register_operand" "r")