diff mbox

[combine,RFC,2/2] PR rtl-optimization/68796: Perfer zero_extract comparison against zero rather than unsupported shorter modes

Message ID 5672D68F.3030408@foss.arm.com
State New
Headers show

Commit Message

Kyrill Tkachov Dec. 17, 2015, 3:36 p.m. UTC
Hi all,

The documentation on RTL canonical forms in md.texi says:

"Equality comparisons of a group of bits (usually a single bit) with zero
  will be written using @code{zero_extract} rather than the equivalent
  @code{and} or @code{sign_extract} operations. "

However, this is not always followed in combine. If it's trying to optimise
a comparison against zero of a bitmask that is the mode mask of some mode
(255 for QImode and 65535 for HImode in the testcases of this patch)
it will instead create a subreg to that shorter mode.
This means that for the example:
int
f255 (int x)
{
   if (x & 255)
     return 1;
   return x;
}

it ends up trying to make a QImode comparison against zero, for which targets like
aarch64 have no pattern.

This patch attempts to fix this in two places in combine.
First is simplify_comparison when handling the and-bitmask case.
Currently it will call gen_lowpart_or_truncate on the argument to produce the short subreg.
With this patch we don't do that when comparing against zero.
This way the and-bitmask form is preserved for make_extraction later on to convert
into a zero_extract.
The second place is in make_extraction itself where it tries to avoid creating a zero_extract,
but the canonicalisation rules and the function comment for make_extraction say that it should
try hard create a zero_extraction when inside a comparison in particular
(" IN_COMPARE is nonzero if we are in a COMPARE.  This means that a
    ZERO_EXTRACT should be built even for bits starting at bit 0.")

With this patch for the testcases:
int
f255 (int x)
{
   if (x & 255)
     return 1;
   return x;
}

int
foo (long x)
{
    return ((short) x != 0) ? x : 1;
}

we now generate for aarch64 at -O2:
f255:
         tst     x0, 255
         csinc   w0, w0, wzr, eq
         ret

and
foo:
         tst     x0, 65535
         csinc   x0, x0, xzr, ne
         ret


instead of the previous:
f255:
         and     w1, w0, 255
         cmp     w1, wzr
         csinc   w0, w0, wzr, eq
         ret

foo:
         sxth    w1, w0
         cmp     w1, wzr
         csinc   x0, x0, xzr, ne
         ret


Bootstrapped and tested on arm, aarch64, x86_64.
To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64 pattern
for comparing a zero_extract with zero.
On aarch64 this greatly increases the usage of the TST instruction by about 54% on SPEC2006.
Performance-wise there were no regressions and slight improvements on SPECINT that may just
be above normal noise (overall 0.5% improvement).
On arm it makes very little difference (arm already defines QI and HImode comparisons against zero)
but makes more use of the lsrs-immediate instruction in place of the arm tst instruction, which has
a shorter encoding in Thumb2 state.
On x86_64 I saw no difference in code size for SPEC2006 on my setup.

What do people think of this approach?
I hope this just enforces the already documented canonicalisation rules with minimal(none?) negative
fallout.

Thanks,
Kyrill


2015-12-17  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68796
     * combine.c (make_extraction): Don't try to avoid the extraction if
     inside a compare.
     (simplify_comparison): Don't truncate to lowpart if comparing against
     zero and target doesn't have a native compare instruction in the
     required short mode.

2015-12-17  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68796
     * gcc.target/aarch64/tst_5.c: New test.
     * gcc.target/aarch64/tst_6.c: Likewise.

Comments

Bernd Schmidt Dec. 17, 2015, 3:58 p.m. UTC | #1
On 12/17/2015 04:36 PM, Kyrill Tkachov wrote:
> The documentation on RTL canonical forms in md.texi says:
>
> "Equality comparisons of a group of bits (usually a single bit) with zero
>   will be written using @code{zero_extract} rather than the equivalent
>   @code{and} or @code{sign_extract} operations. "
>
> However, this is not always followed in combine. If it's trying to optimise
> a comparison against zero of a bitmask that is the mode mask of some mode
> (255 for QImode and 65535 for HImode in the testcases of this patch)
> it will instead create a subreg to that shorter mode.

I suspect that this is an oversight in the documentation, and if given 
two choices the simpler form is intended to be the canonical one.

> it ends up trying to make a QImode comparison against zero, for which
> targets like
> aarch64 have no pattern.

So, can you define a pattern for it...

> To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64
> pattern
> for comparing a zero_extract with zero.

... instead of this one?

> What do people think of this approach?
> I hope this just enforces the already documented canonicalisation rules
> with minimal(none?) negative
> fallout.

I'm not so sure about this. Other ports have QImode comparisons and I 
would want to see some evidence that there are no code quality 
regressions. This is not stage 3 material in any case.


Bernd
Kyrill Tkachov Dec. 17, 2015, 4:10 p.m. UTC | #2
On 17/12/15 15:58, Bernd Schmidt wrote:
> On 12/17/2015 04:36 PM, Kyrill Tkachov wrote:
>> The documentation on RTL canonical forms in md.texi says:
>>
>> "Equality comparisons of a group of bits (usually a single bit) with zero
>>   will be written using @code{zero_extract} rather than the equivalent
>>   @code{and} or @code{sign_extract} operations. "
>>
>> However, this is not always followed in combine. If it's trying to optimise
>> a comparison against zero of a bitmask that is the mode mask of some mode
>> (255 for QImode and 65535 for HImode in the testcases of this patch)
>> it will instead create a subreg to that shorter mode.
>
> I suspect that this is an oversight in the documentation, and if given two choices the simpler form is intended to be the canonical one.
>
>> it ends up trying to make a QImode comparison against zero, for which
>> targets like
>> aarch64 have no pattern.
>
> So, can you define a pattern for it...
>
>> To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64
>> pattern
>> for comparing a zero_extract with zero.
>
> ... instead of this one?
>

Yes, I had investigated that approach and it has the same effect (on aarch64).
My motivation for this approach was to try avoiding defining multiple patterns for what should
be equivalent expressions. But if the short subreg form is intended to be the canonical form...

>> What do people think of this approach?
>> I hope this just enforces the already documented canonicalisation rules
>> with minimal(none?) negative
>> fallout.
>
> I'm not so sure about this. Other ports have QImode comparisons and I would want to see some evidence that there are no code quality regressions. This is not stage 3 material in any case.
>

Well, this patch still produces the QImode comparison if the target has a QImode comparison
(the have_insn_for check in the simplify_comparison hunk).
As I said, the effects on arm and aarch64 were strictly beneficial.
On x86_64 I saw no codegen difference on SPEC2006.
If this is considered too risky at this stage I can propose a QImode pattern for
aarch64 instead to isolate this fix to that backend.

Thanks,
Kyrill

>
> Bernd
Bernd Schmidt Dec. 17, 2015, 4:12 p.m. UTC | #3
On 12/17/2015 05:10 PM, Kyrill Tkachov wrote:
> Well, this patch still produces the QImode comparison if the target has
> a QImode comparison
> (the have_insn_for check in the simplify_comparison hunk).

Ok, I didn't look that closely because I had doubts about the approach. 
This kind of check also goes somewhat against the principles of just 
producing canonical forms of RTL.


Bernd
Kyrill Tkachov Dec. 17, 2015, 4:26 p.m. UTC | #4
On 17/12/15 16:12, Bernd Schmidt wrote:
> On 12/17/2015 05:10 PM, Kyrill Tkachov wrote:
>> Well, this patch still produces the QImode comparison if the target has
>> a QImode comparison
>> (the have_insn_for check in the simplify_comparison hunk).
>
> Ok, I didn't look that closely because I had doubts about the approach. This kind of check also goes somewhat against the principles of just producing canonical forms of RTL.
>

One could argue that if the target has (or advertises having) a native
QImode register comparison then it's objectively a simplification to transform a comparison in a wider mode
to a comparison in the shorter mode.

If, however, the target doesn't have such an instruction (like aarch64 doesn't have QImode registers) then
truncating the wider mode to QImode through a subreg is not less complex than a zero_extract, as both will
involve some form of extracting/masking the desired QImode bits. So picking a canonical form there makes sense,
and the documentation already specifies the zero_extract form as the canonical.

Would be nice to get a definite clarification on whether the subreg form is indeed the canonical one.
Then we can document it and I can just add a QI/HImode compare pattern to aarch64.

Thanks,
Kyrill

>
> Bernd
>
Jeff Law Dec. 17, 2015, 4:46 p.m. UTC | #5
On 12/17/2015 08:58 AM, Bernd Schmidt wrote:
> On 12/17/2015 04:36 PM, Kyrill Tkachov wrote:
>> The documentation on RTL canonical forms in md.texi says:
>>
>> "Equality comparisons of a group of bits (usually a single bit) with zero
>>   will be written using @code{zero_extract} rather than the equivalent
>>   @code{and} or @code{sign_extract} operations. "
>>
>> However, this is not always followed in combine. If it's trying to
>> optimise
>> a comparison against zero of a bitmask that is the mode mask of some mode
>> (255 for QImode and 65535 for HImode in the testcases of this patch)
>> it will instead create a subreg to that shorter mode.
>
> I suspect that this is an oversight in the documentation, and if given
> two choices the simpler form is intended to be the canonical one.
It's also the case that sometimes a SUBREG is preferred because it 
conveys that certain bits are "don't care".  In theory this may allow 
things to optimize better.

However, in practice, I'm not sure that's regularly the case because 
various passes are weak in trying to exploit the semantics of the SUBREG 
and passes are generally pretty strong in their handling of zero_extract 
and friends.

IIRC I actually bumped against this in the gcc-5 cycle when fixing some 
suboptimal code generation issues.  I think it was BZ15184.  I'd check 
the archives for Dec 2014 and Jan 2015.  There may be a mention of this 
issue in there from me (I can recall bumping into it, but can't recall 
if I ever did mentioned it publicly or if I ever submitted the change to 
prefer the zero_extract form over the subreg form.

Jeff
Jeff Law Dec. 17, 2015, 4:52 p.m. UTC | #6
On 12/17/2015 08:58 AM, Bernd Schmidt wrote:
>
> I suspect that this is an oversight in the documentation, and if given
> two choices the simpler form is intended to be the canonical one.
The other BZ I was looking at in this space was 15596.  It's PPC, but 
shows a generic weakness in how we identify extractions and insertions. 
  Fixing it would probably help all the ports that have relatively 
strong methods to set/clear a series of bits in the middle of a word.

It feels like combine has all the information necessary to improve 
things, but the overall combiner flow and APIs are extremely uncooperative.

jeff
Jeff Law Dec. 17, 2015, 4:59 p.m. UTC | #7
On 12/17/2015 09:26 AM, Kyrill Tkachov wrote:
> One could argue that if the target has (or advertises having) a native
> QImode register comparison then it's objectively a simplification to
> transform a comparison in a wider mode
> to a comparison in the shorter mode.
Generally true.

The most commonly cited exception is any port that defines 
WORD_REGISTER_OPERATIONS.  However, I would be comfortable with the idea 
that defining QImode comparisons on a target with 
WORD_REGISTER_OPERATIONS is a pretty explicit indication that it wants 
to try and shorten comparisons for one reason or another.




>
> If, however, the target doesn't have such an instruction (like aarch64
> doesn't have QImode registers) then
> truncating the wider mode to QImode through a subreg is not less complex
> than a zero_extract, as both will
> involve some form of extracting/masking the desired QImode bits. So
> picking a canonical form there makes sense,
> and the documentation already specifies the zero_extract form as the
> canonical.
>
> Would be nice to get a definite clarification on whether the subreg form
> is indeed the canonical one.
The subreg style "extension" isn't really an extension.  It is a way to 
say that we want to look at the object in a wider mode, but we don't 
actually care about the upper bits.  It's generally expected that the 
subreg won't result in the generation of any code.

A zero extract defines all the bits.

In theory the optimizers can use a SUBREG just like they could a REG, 
which should enable additional optimization.  In practice I don't think 
that's been as true as we'd like.

jeff
Kyrill Tkachov Dec. 17, 2015, 5:04 p.m. UTC | #8
Hi Jeff,

On 17/12/15 16:59, Jeff Law wrote:
> On 12/17/2015 09:26 AM, Kyrill Tkachov wrote:
>> One could argue that if the target has (or advertises having) a native
>> QImode register comparison then it's objectively a simplification to
>> transform a comparison in a wider mode
>> to a comparison in the shorter mode.
> Generally true.
>
> The most commonly cited exception is any port that defines WORD_REGISTER_OPERATIONS.  However, I would be comfortable with the idea that defining QImode comparisons on a target with WORD_REGISTER_OPERATIONS is a pretty explicit indication 
> that it wants to try and shorten comparisons for one reason or another.
>
>

I was investigating WORD_REGISTER_OPERATIONS as part of this. But we can't define it for aarch64.
In any case, aarch64 doesn't have QImode registers so I thought we'd try to avoid creating them.

>
>
>>
>> If, however, the target doesn't have such an instruction (like aarch64
>> doesn't have QImode registers) then
>> truncating the wider mode to QImode through a subreg is not less complex
>> than a zero_extract, as both will
>> involve some form of extracting/masking the desired QImode bits. So
>> picking a canonical form there makes sense,
>> and the documentation already specifies the zero_extract form as the
>> canonical.
>>
>> Would be nice to get a definite clarification on whether the subreg form
>> is indeed the canonical one.
> The subreg style "extension" isn't really an extension.  It is a way to say that we want to look at the object in a wider mode, but we don't actually care about the upper bits.  It's generally expected that the subreg won't result in the 
> generation of any code.
>
> A zero extract defines all the bits.

In this case, I'm expecting a QImode compare with zero to map down to the aarch64 TST reg, #255 instruction which
definitely zeroes out any bits outside of QImode (as it is a bitwise AND with a bitmask),
so zero_extract is the more correct expression here, no?

>
>
> In theory the optimizers can use a SUBREG just like they could a REG, which should enable additional optimization.  In practice I don't think that's been as true as we'd like.
>
> jeff
>
Segher Boessenkool Dec. 17, 2015, 5:27 p.m. UTC | #9
On Thu, Dec 17, 2015 at 05:12:16PM +0100, Bernd Schmidt wrote:
> On 12/17/2015 05:10 PM, Kyrill Tkachov wrote:
> >Well, this patch still produces the QImode comparison if the target has
> >a QImode comparison
> >(the have_insn_for check in the simplify_comparison hunk).
> 
> Ok, I didn't look that closely because I had doubts about the approach. 
> This kind of check also goes somewhat against the principles of just 
> producing canonical forms of RTL.

The canonicalisation rules exist so that optimisers only need to match
one form instead of several, and machine descriptions only need to
describe one form instead of several.  For this bitmasking case it
perversely forces you to describe the same instruction in many ways,
for many targets.  This is what the change_zero_ext was about as well.

It's not so easy to fix for the compare case.  Maybe the idea of making
genrecog make code that recognises more forms of the same insn will work
out.  GCC 7 in any case...


Segher
Jeff Law Dec. 17, 2015, 5:33 p.m. UTC | #10
On 12/17/2015 10:04 AM, Kyrill Tkachov wrote:

> In this case, I'm expecting a QImode compare with zero to map down to
> the aarch64 TST reg, #255 instruction which
> definitely zeroes out any bits outside of QImode (as it is a bitwise AND
> with a bitmask),
> so zero_extract is the more correct expression here, no?
It's more about the semantics of the code and how it interacts with RTL 
generation, optimization and analysis than it is with the final assembly 
generated by the backend that drives SUBREG vs zero_extract.

The backend assembly code generator is free to implement stricter 
semantics (such as defining all the bits for a paradoxical subreg), but 
the rest of the compiler can not depend on those stricter semantics.

The easiest way to think about the subreg case here is that it's used 
when we've got a narrow object that we want to view in a wider mode, but 
we don't actually care about the upper bits.  The widening is merely to 
make the mode match another operand.


zero_extract is still the canonical form.  subreg is a specialized form 
for cases where the upper bits are "don't care" values.  This should 
probably be documented as the current state of the world.

I think it's an open question whether or not to drop the subreg form and 
always use zero-extract.  I've certainly seen cases where the former is 
*supposed* to allow better code generation, but in fact actually gets in 
the way resulting in poorer code generation.

Jeff
Kyrill Tkachov Dec. 17, 2015, 5:44 p.m. UTC | #11
On 17/12/15 17:27, Segher Boessenkool wrote:
> On Thu, Dec 17, 2015 at 05:12:16PM +0100, Bernd Schmidt wrote:
>> On 12/17/2015 05:10 PM, Kyrill Tkachov wrote:
>>> Well, this patch still produces the QImode comparison if the target has
>>> a QImode comparison
>>> (the have_insn_for check in the simplify_comparison hunk).
>> Ok, I didn't look that closely because I had doubts about the approach.
>> This kind of check also goes somewhat against the principles of just
>> producing canonical forms of RTL.
> The canonicalisation rules exist so that optimisers only need to match
> one form instead of several, and machine descriptions only need to
> describe one form instead of several.  For this bitmasking case it
> perversely forces you to describe the same instruction in many ways,
> for many targets.  This is what the change_zero_ext was about as well.
>
> It's not so easy to fix for the compare case.  Maybe the idea of making
> genrecog make code that recognises more forms of the same insn will work
> out.  GCC 7 in any case...

Perhaps I had underestimated how involved this issue is :)
So if I want to improve the aarch64 situation for GCC 6,
would the recommended course of action be to just define the
QI and HImode compare against zero patterns?

Note that I think the make_extraction hunk from my patch is in line
with the function comment of make_extraction that says:
"   IN_COMPARE is nonzero if we are in a COMPARE.  This means that a
     ZERO_EXTRACT should be built even for bits starting at bit 0."

whereas the condition that I'm adding "&& !in_compare" is explicitly trying
to avoid an extraction.

But anyway, if this has the potential to cause negative fallout that I
had not anticipated, it can wait for later.

Thanks,
Kyrill

>
> Segher
Bernd Schmidt Dec. 17, 2015, 6:05 p.m. UTC | #12
On 12/17/2015 06:44 PM, Kyrill Tkachov wrote:
> Perhaps I had underestimated how involved this issue is :)
> So if I want to improve the aarch64 situation for GCC 6,
> would the recommended course of action be to just define the
> QI and HImode compare against zero patterns?

For GCC 6 I think this is the only approach.


Bernd
diff mbox

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index 8601d8983ce345e2129dd047b3520d98c0582842..345e63f9a05f2310a5c9e5b239ed069d22565d1c 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -7337,10 +7337,13 @@  make_extraction (machine_mode mode, rtx inner, HOST_WIDE_INT pos,
      low-order bit and this is either not in the destination or we have the
      appropriate STRICT_LOW_PART operation available.
 
+     Don't do this if we are inside a comparison, as the canonicalization
+     rules call for a zero_extract form.
      For MEM, we can avoid an extract if the field starts on an appropriate
      boundary and we can change the mode of the memory reference.  */
 
   if (tmode != BLKmode
+      && !in_compare
       && ((pos_rtx == 0 && (pos % BITS_PER_WORD) == 0
 	   && !MEM_P (inner)
 	   && (inner_mode == tmode
@@ -12108,14 +12111,19 @@  simplify_comparison (enum rtx_code code, rtx *pop0, rtx *pop1)
 
 	     unless TRULY_NOOP_TRUNCATION allows it or the register is
 	     known to hold a value of the required mode the
-	     transformation is invalid.  */
+	     transformation is invalid.
+	     If the target does not have a compare instruction of that mode
+	     don't do this when comparing against 0 since the canonicalization
+	     rules require such an operation to be represented as a
+	     zero_extract, which make_extraction will produce later on.  */
 	  if ((equality_comparison_p || unsigned_comparison_p)
 	      && CONST_INT_P (XEXP (op0, 1))
 	      && (i = exact_log2 ((UINTVAL (XEXP (op0, 1))
 				   & GET_MODE_MASK (mode))
 				  + 1)) >= 0
 	      && const_op >> i == 0
-	      && (tmode = mode_for_size (i, MODE_INT, 1)) != BLKmode)
+	      && (tmode = mode_for_size (i, MODE_INT, 1)) != BLKmode
+	      && (const_op != 0 || have_insn_for (COMPARE, tmode)))
 	    {
 	      op0 = gen_lowpart_or_truncate (tmode, XEXP (op0, 0));
 	      continue;
diff --git a/gcc/testsuite/gcc.target/aarch64/tst_5.c b/gcc/testsuite/gcc.target/aarch64/tst_5.c
new file mode 100644
index 0000000000000000000000000000000000000000..868a3be502e1468592c17b8a9c6a359af00e61a8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/tst_5.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+f255 (int x)
+{
+  if (x & 255)
+    return 1;
+  return x;
+}
+
+int
+f65535 (int x)
+{
+  if (x & 65535)
+    return 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*255" } } */
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*65535" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/tst_6.c b/gcc/testsuite/gcc.target/aarch64/tst_6.c
new file mode 100644
index 0000000000000000000000000000000000000000..2f1c78d873445b11562c5e2863a3f642ff11f318
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/tst_6.c
@@ -0,0 +1,10 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int
+foo (long x)
+{
+   return ((short) x != 0) ? x : 1;
+}
+
+/* { dg-final { scan-assembler "tst\t(x|w)\[0-9\]*.*65535" } } */