[combine,RFC,2/2] PR rtl-optimization/68796: Prefer zero_extract comparison against zero rather than unsupported shorter modes

Hi all,

The documentation on RTL canonical forms in md.texi says:

"Equality comparisons of a group of bits (usually a single bit) with zero
  will be written using @code{zero_extract} rather than the equivalent
  @code{and} or @code{sign_extract} operations. "

However, this is not always followed in combine. When it optimises a comparison
against zero of an AND with a bitmask that is the mode mask of some mode
(255 for QImode and 65535 for HImode in the testcases of this patch),
it instead creates a subreg in that shorter mode.
This means that for the example:
int
f255 (int x)
{
   if (x & 255)
     return 1;
   return x;
}

it ends up trying to make a QImode comparison against zero, for which targets like
aarch64 have no pattern.
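
As an illustration of why combine is free to pick either form, here is a minimal C sketch
showing that the mask form and the narrow-mode form always agree for a comparison against zero.
The helper names are made up for this example and do not come from combine; the narrow form is
modelled with a cast to uint8_t/uint16_t, which is only an approximation of a QImode/HImode lowpart.

```c
#include <assert.h>
#include <stdint.h>

/* Mask form: models comparing (and x 255) against zero.  */
static int low_byte_nonzero_mask (int x)
{
  return (x & 255) != 0;
}

/* Narrow form: models comparing the QImode lowpart of x against zero.  */
static int low_byte_nonzero_trunc (int x)
{
  return (uint8_t) x != 0;
}

/* The same pair for the 65535 (HImode) mask.  */
static int low_half_nonzero_mask (long x)
{
  return (x & 65535) != 0;
}

static int low_half_nonzero_trunc (long x)
{
  return (uint16_t) x != 0;
}

/* Hypothetical helper: abort if the two forms ever disagree for x.  */
static void check_forms_agree (int x)
{
  assert (low_byte_nonzero_mask (x) == low_byte_nonzero_trunc (x));
  assert (low_half_nonzero_mask ((long) x) == low_half_nonzero_trunc ((long) x));
}
```

Since both forms compute the same condition, the choice between them only matters for whether
the target has a pattern that matches the resulting RTL.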

This patch attempts to fix this in two places in combine.
First is simplify_comparison when handling the and-bitmask case.
Currently it will call gen_lowpart_or_truncate on the argument to produce the short subreg.
With this patch we don't do that when comparing against zero.
This way the and-bitmask form is preserved for make_extraction later on to convert
into a zero_extract.
The second place is in make_extraction itself, where it tries to avoid creating a zero_extract,
but the canonicalisation rules and the function comment for make_extraction say that it should
try hard to create a zero_extract when inside a comparison in particular
(" IN_COMPARE is nonzero if we are in a COMPARE.  This means that a
    ZERO_EXTRACT should be built even for bits starting at bit 0.")
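
For reference, a rough C model of what a zero_extract starting at bit 0 computes, to show that
it is the same value as the and-bitmask form. This is only a sketch: zero_extract_bits is a
hypothetical helper, not a GCC function, and it assumes bit 0 is the least significant bit
(i.e. it ignores BITS_BIG_ENDIAN numbering).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of (zero_extract x len pos): take LEN bits of X
   starting at bit POS (bit 0 = least significant) and zero-extend them.  */
static uint64_t zero_extract_bits (uint64_t x, unsigned len, unsigned pos)
{
  /* Keep the shift amounts in range so the behaviour is well defined.  */
  assert (len >= 1 && len <= 63 && pos + len <= 64);
  return (x >> pos) & ((UINT64_C (1) << len) - 1);
}

/* Comparing the low LEN bits against zero, written as a zero_extract
   at position 0.  For LEN == 8 this matches (x & 255) != 0, and for
   LEN == 16 it matches (x & 65535) != 0.  */
static int low_bits_nonzero (uint64_t x, unsigned len)
{
  return zero_extract_bits (x, len, 0) != 0;
}
```

With len set to 8 or 16 and pos set to 0, this is the shape of comparison the patch wants
combine to keep, rather than a compare in the narrower mode.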

With this patch for the testcases:
int
f255 (int x)
{
   if (x & 255)
     return 1;
   return x;
}

int
foo (long x)
{
    return ((short) x != 0) ? x : 1;
}

we now generate for aarch64 at -O2:
f255:
         tst     x0, 255
         csinc   w0, w0, wzr, eq
         ret

and
foo:
         tst     x0, 65535
         csinc   x0, x0, xzr, ne
         ret

instead of the previous:
f255:
         and     w1, w0, 255
         cmp     w1, wzr
         csinc   w0, w0, wzr, eq
         ret

foo:
         sxth    w1, w0
         cmp     w1, wzr
         csinc   x0, x0, xzr, ne
         ret

Bootstrapped and tested on arm, aarch64, x86_64.
To get the benefit on aarch64 this needs patch 1/2 that adds an aarch64 pattern
for comparing a zero_extract with zero.
On aarch64 this increases the usage of the TST instruction by about 54% on SPEC2006.
Performance-wise there were no regressions, and there were slight improvements on SPECINT
that may be just above normal noise (overall 0.5% improvement).
On arm it makes very little difference (arm already defines QImode and HImode comparisons against zero),
but it makes more use of the lsrs-immediate instruction, which has a shorter encoding in Thumb2 state,
in place of the arm tst instruction.
On x86_64 I saw no difference in code size for SPEC2006 on my setup.

What do people think of this approach?
I hope this just enforces the already documented canonicalisation rules with minimal (none?) negative
fallout.

Thanks,
Kyrill

2015-12-17  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68796
     * combine.c (make_extraction): Don't try to avoid the extraction if
     inside a compare.
     (simplify_comparison): Don't truncate to lowpart if comparing against
     zero and target doesn't have a native compare instruction in the
     required short mode.

2015-12-17  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR rtl-optimization/68796
     * gcc.target/aarch64/tst_5.c: New test.
     * gcc.target/aarch64/tst_6.c: Likewise.
