[aarch64] Fix target/pr77729 - missed optimization related to zero extension

Message ID 1505321438.2286.7.camel@cavium.com
State New
Headers show
Series
  • [aarch64] Fix target/pr77729 - missed optimization related to zero extension
Related show

Commit Message

Steve Ellcey Sept. 13, 2017, 4:50 p.m.
This is a patch for PR target/77729 on aarch64.  The code is doing an
unneeded zero extend ('uxtb' in the original report, 'and' in the ToT sources).

The patch looks a bit odd, it is a specialized define_insn for the combine
pass.  At some point in combine (I never did find out where), the zero_extend
is converted to an AND so my instruction looks for an OR of a constant
and an AND expression where one operand of the AND is a subreg and the other
is a constant.  If the two constants add up to 255 that means that the AND
is being used to mask out the upper bits of the register while not messing
up the constant we are using in the OR expression.

I also had to recognize this in the aarch64 cost function or combine would
not use the new expression even when it recognized it as it thought it cost
more than the original uncombined expressions.

Tested on aarch64 with a bootstrap and testsuite run that had no regressions.

OK to checkin?


2017-09-13  Steve Ellcey  <sellcey@cavium.com>

	PR target/77729
	* config/aarch64/aarch64.c (aarch64_rtx_costs):
	Handle cost of *iorqi3_uxtw instruction.
	* config/aarch64/aarch64.md (*iorqi3_uxtw): New
	instruction for combine phase.


2017-09-13  Steve Ellcey  <sellcey@cavium.com>

	* gcc.target/aarch64/pr77729.c: New test.

Comments

Kyrill Tkachov Sept. 13, 2017, 5:13 p.m. | #1
Hi Steve,

On 13/09/17 17:50, Steve Ellcey wrote:
> This is a patch for PR target/77729 on aarch64.  The code is doing an
> unneeded zero extend ('uxtb' in the original report, 'and' in the ToT 
> sources).
>
> The patch looks a bit odd, it is a specialized define_insn for the combine
> pass.  At some point in combine (I never did find out where), the 
> zero_extend
> is converted to an AND so my instruction looks for an OR of a constant
> and an AND expression where one operand of the AND is a subreg and the 
> other
> is a constant.  If the two constants add up to 255 that means that the AND
> is being used to mask out the upper bits of the register while not messing
> up the constant we are using in the OR expression.
>
> I also had to recognize this in the aarch64 cost function or combine would
> not use the new expression even when it recognized it as it thought it 
> cost
> more than the original uncombined expressions.
>
> Tested on aarch64 with a bootstrap and testsuite run that had no 
> regressions.
>
> OK to checkin?
>
>
> 2017-09-13  Steve Ellcey  <sellcey@cavium.com>
>
>         PR target/77729
>         * config/aarch64/aarch64.c (aarch64_rtx_costs):
>         Handle cost of *iorqi3_uxtw instruction.
>         * config/aarch64/aarch64.md (*iorqi3_uxtw): New
>         instruction for combine phase.
>
>
> 2017-09-13  Steve Ellcey  <sellcey@cavium.com>
>
>         * gcc.target/aarch64/pr77729.c: New test.

+;; Specialized OR instruction for combiner.  The AND is masking out bits
+;; not needed in the OR (doing a zero_extend).  The zero_extend is not
+;; needed because we know from the subreg that the upper part of the reg
+;; is zero.
+(define_insn "*iorqi3_uxtw"
+  [(set (match_operand:SI 0 "register_operand" "=r")
+        (ior:SI (and:SI
+		  (subreg:SI (match_operand:QI 1 "register_operand" "r") 0)
+		  (match_operand:SI 2 "const_int_operand" "n"))
+		(match_operand:SI 3 "aarch64_logical_operand" "K")))]
+  "INTVAL (operands[2]) + INTVAL (operands[3]) == 255"
+  "orr\\t%w0, %w1, %3"
+  [(set_attr "type" "logic_imm")]
+)

We are usually hesitant to add explicit subreg matching in the MD pattern
(though I don't remember if there's a hard rule against it).
In this case this looks like a missing simplification from combine (simplify-rtx) so
I think adding it there would be better.

Also, in this pattern operands[3] has the aarch64_logical_operand predicate which allows registers
but in the final instruction check below you unconditionally take its INTVAL which will throw an error
if the compiler is configured with RTL checking enabled.

Thanks,
Kyrill
Segher Boessenkool Sept. 13, 2017, 7:46 p.m. | #2
Hi!

On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote:
> +;; Specialized OR instruction for combiner.  The AND is masking out bits
> +;; not needed in the OR (doing a zero_extend).  The zero_extend is not
> +;; needed because we know from the subreg that the upper part of the reg
> +;; is zero.
> +(define_insn "*iorqi3_uxtw"
> +  [(set (match_operand:SI 0 "register_operand" "=r")
> +        (ior:SI (and:SI
> +		  (subreg:SI (match_operand:QI 1 "register_operand" "r") 0)
> +		  (match_operand:SI 2 "const_int_operand" "n"))
> +		(match_operand:SI 3 "aarch64_logical_operand" "K")))]
> +  "INTVAL (operands[2]) + INTVAL (operands[3]) == 255"
> +  "orr\\t%w0, %w1, %3"
> +  [(set_attr "type" "logic_imm")]
> +)
> 
> We are usually hesitant to add explicit subreg matching in the MD pattern
> (though I don't remember if there's a hard rule against it).
> In this case this looks like a missing simplification from combine 
> (simplify-rtx) so
> I think adding it there would be better.

Yes, it probably belongs as a generic simplification in simplify-rtx.c;
if there is a reason not to do that, it can be done in combine.c instead.


Segher
Steve Ellcey Sept. 13, 2017, 9:46 p.m. | #3
On Wed, 2017-09-13 at 14:46 -0500, Segher Boessenkool wrote:
> On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote:
> > 
> > We are usually hesitant to add explicit subreg matching in the MD pattern
> > (though I don't remember if there's a hard rule against it).
> > In this case this looks like a missing simplification from combine 
> > (simplify-rtx) so
> > I think adding it there would be better.

> Yes, it probably belongs as a generic simplification in simplify-rtx.c;
> if there is a reason not to do that, it can be done in combine.c
> instead.

Actually, now that I look at it some more and compare it to the arm32
version (where we do not have this problem) I think the problem starts
well before combine.

In arm32 rtl expansion, when reading the QI memory location, I see
these instructions get generated:

(insn 10 3 11 2 (set (reg:SI 119)
        (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1
     (nil))
(insn 11 10 12 2 (set (reg:QI 118)
        (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1
     (nil))

And in aarch64 rtl expansion I see:

(insn 10 9 11 (set (reg:QI 81)
        (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8])) "pr77729.c":3 -1
     (nil))

Both of these sequences expand to ldrb but in the arm32 case I know
that I set all 32 bits of the register (even though I only want the
bottom 8 bits), but for aarch64 I only know that I set the bottom 8
bits and I don't know anything about the higher bits, meaning I have to
keep the AND instruction to mask out the upper bits on aarch64.

I think we should change the movqi/movhi expansions on aarch64 to
recognize that the ldrb/ldrh instructions zero out the upper bits in
the register by generating rtl like arm32 does.

Steve Ellcey
sellcey@cavium.com
Jeff Law Sept. 14, 2017, 3:03 p.m. | #4
On 09/13/2017 03:46 PM, Steve Ellcey wrote:
> On Wed, 2017-09-13 at 14:46 -0500, Segher Boessenkool wrote:
>> On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote:
>>>  
>>> We are usually hesitant to add explicit subreg matching in the MD pattern
>>> (though I don't remember if there's a hard rule against it).
>>> In this case this looks like a missing simplification from combine 
>>> (simplify-rtx) so
>>> I think adding it there would be better.
> 
>> Yes, it probably belongs as a generic simplification in simplify-rtx.c;
>> if there is a reason not to do that, it can be done in combine.c
>> instead.
> 
> Actually, now that I look at it some more and compare it to the arm32
> version (where we do not have this problem) I think the problem starts
> well before combine.
> 
> In arm32 rtl expansion, when reading the QI memory location, I see
> these instructions get generated:
> 
> (insn 10 3 11 2 (set (reg:SI 119)
>         (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1
>      (nil))
> (insn 11 10 12 2 (set (reg:QI 118)
>         (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1
>      (nil))
> 
> And in aarch64 rtl expansion I see:
> 
> (insn 10 9 11 (set (reg:QI 81)
>         (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8])) "pr77729.c":3 -1
>      (nil))
> 
> Both of these sequences expand to ldrb but in the arm32 case I know
> that I set all 32 bits of the register (even though I only want the
> bottom 8 bits), but for aarch64 I only know that I set the bottom 8
> bits and I don't know anything about the higher bits, meaning I have to
> keep the AND instruction to mask out the upper bits on aarch64.
It's one of the reasons I discourage subregs -- the number of cases
where we can optimize based on the "don't care" semantics are relatively
small in my experience and I consistently see cases where the "don't
care" property of the subreg turns into "don't know" and suppresses
downstream optimizations.

It's always a judgment call, but more and more often I find myself
pushing towards defining those bits using a zero/sign extension, bit
operation or whatever rather than using subregs.


> 
> I think we should change the movqi/movhi expansions on aarch64 to
> recognize that the ldrb/ldrh instructions zero out the upper bits in
> the register by generating rtl like arm32 does.
Is LOAD_EXTEND_OP defined for aarch64?

It may also be worth looking at ree.c -- my recollection is that it
didn't handle subregs, but it could and probably should.

jeff
Steve Ellcey Sept. 14, 2017, 4:33 p.m. | #5
On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote:
> On 09/13/2017 03:46 PM, Steve Ellcey wrote:
> > 
> > In arm32 rtl expansion, when reading the QI memory location, I see
> > these instructions get generated:
> > 
> > (insn 10 3 11 2 (set (reg:SI 119)
> >         (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0
> > *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1
> >      (nil))
> > (insn 11 10 12 2 (set (reg:QI 118)
> >         (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1
> >      (nil))
> > 
> > And in aarch64 rtl expansion I see:
> > 
> > (insn 10 9 11 (set (reg:QI 81)
> >         (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1
> > A8])) "pr77729.c":3 -1
> >      (nil))
> > 
> > Both of these sequences expand to ldrb but in the arm32 case I know
> > that I set all 32 bits of the register (even though I only want the
> > bottom 8 bits), but for aarch64 I only know that I set the bottom 8
> > bits and I don't know anything about the higher bits, meaning I have to
> > keep the AND instruction to mask out the upper bits on aarch64.

> It's one of the reasons I discourage subregs -- the number of cases
> where we can optimize based on the "don't care" semantics are relatively
> small in my experience and I consistently see cases where the "don't
> care" property of the subreg turns into "don't know" and suppresses
> downstream optimizations.
> 
> It's always a judgment call, but more and more often I find myself
> pushing towards defining those bits using a zero/sign extension, bit
> operation or whatever rather than using subregs.

So if I were loading a QImode to a register (zeroing out the upper
bits) would you generate something like:

(truncate:QI (zero_extend:SI (reg:QI)))

 instead of:

(subreg:QI (reg:SI))

> > I think we should change the movqi/movhi expansions on aarch64 to
> > recognize that the ldrb/ldrh instructions zero out the upper bits in
> > the register by generating rtl like arm32 does.

> Is LOAD_EXTEND_OP defined for aarch64?

Yes, aarch64 defines LOAD_EXTEND_OP to be ZERO_EXTEND.

> It may also be worth looking at ree.c -- my recollection is that it
> didn't handle subregs, but it could and probably should.

I only see a couple of references to subregs in ree.c.  I think they
both involve searching for all uses of a register.

Steve Ellcey
Jeff Law Sept. 14, 2017, 5:53 p.m. | #6
On 09/14/2017 10:33 AM, Steve Ellcey wrote:
> On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote:
>> On 09/13/2017 03:46 PM, Steve Ellcey wrote:
>>>  
>>> In arm32 rtl expansion, when reading the QI memory location, I see
>>> these instructions get generated:
>>>
>>> (insn 10 3 11 2 (set (reg:SI 119)
>>>         (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0
>>> *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1
>>>      (nil))
>>> (insn 11 10 12 2 (set (reg:QI 118)
>>>         (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1
>>>      (nil))
>>>
>>> And in aarch64 rtl expansion I see:
>>>
>>> (insn 10 9 11 (set (reg:QI 81)
>>>         (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1
>>> A8])) "pr77729.c":3 -1
>>>      (nil))
>>>
>>> Both of these sequences expand to ldrb but in the arm32 case I know
>>> that I set all 32 bits of the register (even though I only want the
>>> bottom 8 bits), but for aarch64 I only know that I set the bottom 8
>>> bits and I don't know anything about the higher bits, meaning I have to
>>> keep the AND instruction to mask out the upper bits on aarch64.
> 
>> It's one of the reasons I discourage subregs -- the number of cases
>> where we can optimize based on the "don't care" semantics are relatively
>> small in my experience and I consistently see cases where the "don't
>> care" property of the subreg turns into "don't know" and suppresses
>> downstream optimizations.
>>
>> It's always a judgment call, but more and more often I find myself
>> pushing towards defining those bits using a zero/sign extension, bit
>> operation or whatever rather than using subregs.
> 
> So if I were loading a QImode to a register (zeroing out the upper
> bits) would you generate something like:
> 
> (truncate:QI (zero_extend:SI (reg:QI)))
On a LOAD_EXTEND_OP target which zero extends and
WORD_REGISTER_OPERATIONS as 1 I'd load memory with just

(set (reg:QI) (mem:QI (whatever))

If that object is later used elsewhere it probably will be explicitly
sign/zero extended.  But combine ought to be able to eliminate that
explicit extension.

And I think that's starting to zero in on the problem --
WORD_REGISTER_OPERATIONS is zero on aarch64 as you don't get extension
to word_mode for W form registers.

I wonder if what needs to happen is somehow look to extend that code
somehow so that combine and friends know that the value is zero extended
to 32 bits, even if it's not extended to word_mode.

Jeff
Jeff Law Sept. 14, 2017, 6:03 p.m. | #7
On 09/14/2017 10:33 AM, Steve Ellcey wrote:
> On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote:
>> On 09/13/2017 03:46 PM, Steve Ellcey wrote:
>>>  
>>> In arm32 rtl expansion, when reading the QI memory location, I see
>>> these instructions get generated:
>>>
>>> (insn 10 3 11 2 (set (reg:SI 119)
>>>         (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0
>>> *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1
>>>      (nil))
>>> (insn 11 10 12 2 (set (reg:QI 118)
>>>         (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1
>>>      (nil))
>>>
>>> And in aarch64 rtl expansion I see:
>>>
>>> (insn 10 9 11 (set (reg:QI 81)
>>>         (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1
>>> A8])) "pr77729.c":3 -1
>>>      (nil))
>>>
>>> Both of these sequences expand to ldrb but in the arm32 case I know
>>> that I set all 32 bits of the register (even though I only want the
>>> bottom 8 bits), but for aarch64 I only know that I set the bottom 8
>>> bits and I don't know anything about the higher bits, meaning I have to
>>> keep the AND instruction to mask out the upper bits on aarch64.
> 
>> It's one of the reasons I discourage subregs -- the number of cases
>> where we can optimize based on the "don't care" semantics are relatively
>> small in my experience and I consistently see cases where the "don't
>> care" property of the subreg turns into "don't know" and suppresses
>> downstream optimizations.
>>
>> It's always a judgment call, but more and more often I find myself
>> pushing towards defining those bits using a zero/sign extension, bit
>> operation or whatever rather than using subregs.
> 
> So if I were loading a QImode to a register (zeroing out the upper
> bits) would you generate something like:
> 
> (truncate:QI (zero_extend:SI (reg:QI)))
> 
>  instead of:
> 
> (subreg:QI (reg:SI))
> 
>>> I think we should change the movqi/movhi expansions on aarch64 to
>>> recognize that the ldrb/ldrh instructions zero out the upper bits in
>>> the register by generating rtl like arm32 does.
> 
>> Is LOAD_EXTEND_OP defined for aarch64?
> 
> Yes, aarch64 defines LOAD_EXTEND_OP to be ZERO_EXTEND.
> 
>> It may also be worth looking at ree.c -- my recollection is that it
>> didn't handle subregs, but it could and probably should.
> 
> I only see a couple of references to subregs in ree.c.  I think they
> both involve searching for all uses of a register.
Right.  But the subreg expressions are also forms of extension -- we
just don't know (or care) if it's zero or sign extension.

We'd start by recognizing the paradoxical subreg in
add_removable_extension as a form of an extension similar to zero_extend
and sign_extend.

When we go to combine the "extension" into the defining insn, we would
test 3 forms

(set (target) (zero_extend (exp)))

(set (target) (sign_extend (exp)))

(set (target) (subreg (exp)))

If any form matches an insn on the target, then we're done.


This may require adding some new patterns to aarch64 -- I believe we've
got patterns on x86 to match some of these forms to aid redundant
extension eliminmation.

It might also be helpful to teach ree about LOAD_EXTEND_OP which would
allow combining one of the extension forms with a memory reference.

Jeff
Steve Ellcey Sept. 14, 2017, 6:43 p.m. | #8
On Thu, 2017-09-14 at 11:53 -0600, Jeff Law wrote:

> 
> And I think that's starting to zero in on the problem --
> WORD_REGISTER_OPERATIONS is zero on aarch64 as you don't get extension
> to word_mode for W form registers.
> 
> I wonder if what needs to happen is somehow look to extend that code
> somehow so that combine and friends know that the value is zero extended
> to 32 bits, even if it's not extended to word_mode.
> 
> Jeff

This might be a good long term direction to move but in the mean time
it sure does seem a lot easier to just generate a subreg.  Here is a
patch that does that, it passes bootstrap and has no regressions and
fixes the bug in question (and most likely improves other code as
well).

The "LOAD_EXTEND_OP (<MODE>mode) == ZERO_EXTEND" part of the if
statement is not really necessary since we know this is true on aarch64
but I thought it helped make it clear what we were doing and the
compiler should optimize it away anyway.

OK to checkin this fix while we consider longer term options?

Steve Ellcey
sellcey@cavium.com


2017-09-14  Steve Ellcey  <sellcey@cavium.com>

        PR target/77729
        * config/aarch64/aarch64.md (mov<mode>): Generate subreg for
        short loads to reflect that upper bits are zeroed out on load.


2017-09-14  Steve Ellcey  <sellcey@cavium.com>

	* gcc.target/aarch64/pr77729.c: New test.
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index f8cdb06..bca4cf5 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -864,6 +864,15 @@
 	(match_operand:SHORT 1 "general_operand" ""))]
   ""
   "
+    if (LOAD_EXTEND_OP (<MODE>mode) == ZERO_EXTEND && MEM_P (operands[1])
+	&& can_create_pseudo_p () && optimize > 0)
+      {
+	/* Generate subreg of SImode so we know that the upper bits
+	of the reg are zero and do not need to masked out later.  */
+	rtx reg = gen_reg_rtx (SImode);
+	emit_insn (gen_zero_extend<mode>si2 (reg, operands[1]));
+	operands[1] = gen_lowpart (<MODE>mode, reg);
+      }
     if (GET_CODE (operands[0]) == MEM && operands[1] != const0_rtx)
       operands[1] = force_reg (<MODE>mode, operands[1]);
   "
diff --git a/gcc/testsuite/gcc.target/aarch64/pr77729.c b/gcc/testsuite/gcc.target/aarch64/pr77729.c
index e69de29..2fcda9a 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr77729.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr77729.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int TrieCase3_v1(const char *string)
+{
+    if((string[0] | 32) == 't') {
+        if((string[1] | 32) == 'a') {
+            if((string[2] | 32) == 'g') {
+                return 42;
+            }
+        }
+    }
+    return -1;
+}
+
+int TrieCase3_v2(const char *string)
+{
+    switch(string[0] | 32) {
+    case 't':
+        switch(string[1] | 32) {
+        case 'a':
+            switch(string[2] | 32) {
+            case 'g':
+                return 42;
+            }
+        }
+    }
+    return -1;
+}
+
+/* { dg-final { scan-assembler-not "and" } } */
+/* { dg-final { scan-assembler-not "uxtb" } } */

Patch

diff --git a/gcc/testsuite/gcc.target/aarch64/pr77729.c b/gcc/testsuite/gcc.target/aarch64/pr77729.c
index e69de29..2fcda9a 100644
--- a/gcc/testsuite/gcc.target/aarch64/pr77729.c
+++ b/gcc/testsuite/gcc.target/aarch64/pr77729.c
@@ -0,0 +1,32 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int TrieCase3_v1(const char *string)
+{
+    if((string[0] | 32) == 't') {
+        if((string[1] | 32) == 'a') {
+            if((string[2] | 32) == 'g') {
+                return 42;
+            }
+        }
+    }
+    return -1;
+}
+
+int TrieCase3_v2(const char *string)
+{
+    switch(string[0] | 32) {
+    case 't':
+        switch(string[1] | 32) {
+        case 'a':
+            switch(string[2] | 32) {
+            case 'g':
+                return 42;
+            }
+        }
+    }
+    return -1;
+}
+
+/* { dg-final { scan-assembler-not "and" } } */
+/* { dg-final { scan-assembler-not "uxtb" } } */