
PR middle-end/98865: Optimize (a>>63)*b as -(a>>63)&b in match.pd.

Message ID 012f01d854e5$5fb82fe0$1f288fa0$@nextmovesoftware.com

Commit Message

Roger Sayle April 20, 2022, 6:35 p.m. UTC
This patch implements the constant folding optimization(s) described in
PR middle-end/98865, which should help address the serious performance
regression of Botan AES-128/XTS mentioned in PR tree-optimization/98856.
This combines aspects of both Jakub Jelinek's patch in comment #2 and
Andrew Pinski's patch in comment #4, so both are listed as co-authors.

Alas truth_valued_p is not quite what we want (and tweaking its
definition has undesirable side-effects), so instead this patch
introduces a new zero_one_valued_p predicate based on tree_nonzero_bits
that extends truth_valued_p (which is restricted to Boolean types with
single-bit precision).  This is then used to simplify X*Y into X&Y when
both X and Y are zero_one_valued_p, and to simplify X*Y into (-X)&Y when
only X is zero_one_valued_p, in both cases replacing an integer
multiplication with a cheaper bit-wise AND.

This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
and make -k check, both with and without --target_board=unix{-m32}, with
no new failures, except for a tweak required to tree-ssa/vrp116.c.
The recently proposed cmove patch ensures the i386 backend continues
to generate identical code for vrp116.c as before.

Ok, either for mainline or when stage 1 reopens?


2022-04-20  Roger Sayle  <roger@nextmovesoftware.com>
            Andrew Pinski  <apinski@marvell.com>
            Jakub Jelinek  <jakub@redhat.com>

gcc/ChangeLog
	PR middle-end/98865
	* match.pd (match zero_one_valued_p): New predicate.
	(mult @0 @1): Use zero_one_valued_p for transforming into
	(and @0 @1).
	(mult zero_one_valued_p@0 @1): Convert integer multiplication into
	a negation and a bit-wise AND, if it can't be cheaply implemented by
	a single left shift.

gcc/testsuite/ChangeLog
	PR middle-end/98865
	* gcc.dg/pr98865.c: New test case.
	* gcc.dg/tree-ssa/vrp116.c: Tweak test to check that the integer
	multiplication has been eliminated, not for the specific
	replacement implementation.

Thanks,
Roger
--

Comments

Richard Biener May 2, 2022, 11:48 a.m. UTC | #1
On Wed, Apr 20, 2022 at 8:35 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> This patch implements the constant folding optimization(s) described in
> PR middle-end/98865, which should help address the serious performance
> regression of Botan AES-128/XTS mentioned in PR tree-optimization/98856.
> This combines aspects of both Jakub Jelinek's patch in comment #2 and
> Andrew Pinski's patch in comment #4, so both are listed as co-authors.
>
> Alas truth_valued_p is not quite what we want (and tweaking its
> definition has undesirable side-effects), so instead this patch
> introduces a new zero_one_valued_p predicate based on tree_nonzero_bits
> that extends truth_valued_p (which is restricted to Boolean types with
> single-bit precision).  This is then used to simplify X*Y into X&Y when
> both X and Y are zero_one_valued_p, and to simplify X*Y into (-X)&Y when
> only X is zero_one_valued_p, in both cases replacing an integer
> multiplication with a cheaper bit-wise AND.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}, with
> no new failures, except for a tweak required to tree-ssa/vrp116.c.
> The recently proposed cmove patch ensures the i386 backend continues
> to generate identical code for vrp116.c as before.
>
> Ok, either for mainline or when stage 1 reopens?

One issue is that on GIMPLE we consider (bit_and (negate @0) @1) to be
more costly than (mult @0 @1) because it is two operations rather than one.
So can we at least do (bit_and (negate! @0) @1) and thus require the negate
to be simplified?

Also the

+      && (TREE_CODE (@1) != INTEGER_CST
+          || wi::popcount (wi::to_wide (@1)) > 1))

exception is odd without providing the desired canonicalization to a shift?

In the end all this looks more like something for RTL (expansion?) where we
can query costs rather than canonicalization (to simpler expressions) which
is what we should do on GIMPLE.

Richard.

>
>
> 2022-04-20  Roger Sayle  <roger@nextmovesoftware.com>
>             Andrew Pinski  <apinski@marvell.com>
>             Jakub Jelinek  <jakub@redhat.com>
>
> gcc/ChangeLog
>         PR middle-end/98865
>         * match.pd (match zero_one_valued_p): New predicate.
>         (mult @0 @1): Use zero_one_valued_p for transforming into
>         (and @0 @1).
>         (mult zero_one_valued_p@0 @1): Convert integer multiplication into
>         a negation and a bit-wise AND, if it can't be cheaply implemented by
>         a single left shift.
>
> gcc/testsuite/ChangeLog
>         PR middle-end/98865
>         * gcc.dg/pr98865.c: New test case.
>         * gcc.dg/tree-ssa/vrp116.c: Tweak test to check that the integer
>         multiplication has been eliminated, not for the specific
>         replacement implementation.
>
> Thanks,
> Roger
> --
>

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 6d691d3..16a1203 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -285,14 +285,6 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
            || !COMPLEX_FLOAT_TYPE_P (type)))
    (negate @0)))
 
-/* Transform { 0 or 1 } * { 0 or 1 } into { 0 or 1 } & { 0 or 1 } */
-(simplify
- (mult SSA_NAME@1 SSA_NAME@2)
-  (if (INTEGRAL_TYPE_P (type)
-       && get_nonzero_bits (@1) == 1
-       && get_nonzero_bits (@2) == 1)
-   (bit_and @1 @2)))
-
 /* Transform x * { 0 or 1, 0 or 1, ... } into x & { 0 or -1, 0 or -1, ...},
    unless the target has native support for the former but not the latter.  */
 (simplify
@@ -1789,6 +1781,28 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
   (bit_not (bit_not @0))
   @0)
 
+(match zero_one_valued_p
+ @0
+ (if (INTEGRAL_TYPE_P (type) && tree_nonzero_bits (@0) == 1)))
+(match zero_one_valued_p
+ truth_valued_p@0)
+
+/* Transform { 0 or 1 } * { 0 or 1 } into { 0 or 1 } & { 0 or 1 } */
+(simplify
+ (mult zero_one_valued_p@0 zero_one_valued_p@1)
+ (if (INTEGRAL_TYPE_P (type))
+  (bit_and @0 @1)))
+
+/* Transform x * { 0 or 1 } into x & { 0 or -1 }, i.e. an integer
+   multiplication into negate/bitwise and.  Don't do this if the
+   multiplication is cheap, i.e. may be implemented by a single shift.  */
+(simplify
+ (mult:c zero_one_valued_p@0 @1)
+ (if (INTEGRAL_TYPE_P (type)
+      && (TREE_CODE (@1) != INTEGER_CST
+          || wi::popcount (wi::to_wide (@1)) > 1))
+  (bit_and (negate @0) @1)))
+
 /* Convert ~ (-A) to A - 1.  */
 (simplify
  (bit_not (convert? (negate @0)))
diff --git a/gcc/testsuite/gcc.dg/pr98865.c b/gcc/testsuite/gcc.dg/pr98865.c
new file mode 100644
index 0000000..e7599d3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr98865.c
@@ -0,0 +1,60 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+#if __SIZEOF_INT__ == 4
+unsigned int foo(unsigned int a, unsigned int b)
+{
+  return (a >> 31) * b;
+}
+
+int bar(int a, int b)
+{
+  return -(a >> 31) * b;
+}
+
+int baz(int a, int b)
+{
+  int c = a >> 31;
+  int d = -c;
+  return d * b;
+}
+#endif
+
+#if __SIZEOF_LONG_LONG__ == 8
+unsigned long long fool (unsigned long long a, unsigned long long b)
+{
+  return (a >> 63) * b;
+}
+
+long long barl (long long a, long long b)
+{
+  return -(a >> 63) * b;
+}
+
+long long bazl (long long a, long long b)
+{
+  long long c = a >> 63;
+  long long d = -c;
+  return d * b;
+}
+#endif
+
+unsigned int pin (int a, unsigned int b)
+{
+  unsigned int t =  a & 1;
+  return t * b;
+}
+
+unsigned long pinl (long a, unsigned long b)
+{
+  unsigned long t =  a & 1;
+  return t * b;
+}
+
+unsigned long long pinll (long long a, unsigned long long b)
+{
+  unsigned long long t =  a & 1;
+  return t * b;
+}
+
+/* { dg-final { scan-tree-dump-not " \\* " "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
index 9e68a77..469b232 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/vrp116.c
@@ -9,4 +9,5 @@  f (int m1, int m2, int c)
   return e ? m1 : m2;
 }
 
-/* { dg-final { scan-tree-dump-times "\\? c_\[0-9\]\\(D\\) : 0" 1 "optimized" } } */
+/* The MULT_EXPR should become BIT_AND_EXPR or COND_EXPR.  */
+/* { dg-final { scan-tree-dump-not " \\* " "optimized" } } */