Fix compile/memory hog in the combiner (PR rtl-optimization/69592)

Message ID 20160201203449.GN3017@tucnak.redhat.com
State New

Commit Message

Jakub Jelinek Feb. 1, 2016, 8:34 p.m. UTC
Hi!

On the following testcase we completely uselessly consume about 5.5GB
of RAM and lots of compile time.  The problem is the code to avoid
exponential behavior of nonzero_bits/num_sign_bit_copies on binary
arithmetic rtxes, which causes us to recurse even when the handling
of those rtxes is going to ignore their arguments.
So, this patch limits the workaround to the cases where we are going
to recurse on both arguments; for rtxes where we don't look at the
arguments at all, or where we recurse on only a single argument, it
doesn't make sense.  On the testcase, one of the rtxes where the new
predicates return false but ARITHMETIC_P is true is COMPARE, in
particular (compare:CCC (plus (x) (y)) (x)), where it is known even
without looking at the operands that only one bit is possibly non-zero
and the number of sign bit copies is always 1.  But without the patch
we needlessly recurse on x, which is set by another similar operation,
and so on.
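For reference, that (compare:CCC (plus ...) ...) shape typically comes
from the unsigned-overflow check the testcase below is built from.  A
minimal C sketch of the carry-detection idiom (function name here is
illustrative, not from the patch):

```c
#include <assert.h>
#include <limits.h>

/* Unsigned wraparound check: d = a + b wraps iff d < a.  On targets
   like x86 the comparison maps onto the carry flag, which is what
   (compare:CCC (plus (x) (y)) (x)) expresses at the RTL level: its
   result is a single bit of information, no matter what the operands
   are.  */
static unsigned int
add_carry (unsigned int a, unsigned int b, unsigned int *carry)
{
  unsigned int d = a + b;	/* may wrap modulo UINT_MAX + 1 */
  if (d < a)			/* true exactly when the add wrapped */
    ++*carry;
  return d;
}
```

Each A(n) step in the testcase is one instance of this pattern, so the
chain of them produces a long sequence of such compares feeding each
other.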

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2016-02-01  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/69592
	* rtlanal.c (nonzero_bits_binary_arith_p): New inline function.
	(cached_nonzero_bits): Use it instead of ARITHMETIC_P.
	(num_sign_bit_copies_binary_arith_p): New inline function.
	(cached_num_sign_bit_copies): Use it instead of ARITHMETIC_P.

	* gcc.dg/pr69592.c: New test.


	Jakub

Comments

Jeff Law Feb. 1, 2016, 10:17 p.m. UTC | #1
On 02/01/2016 01:34 PM, Jakub Jelinek wrote:
> Hi!
>
> On the following testcase we completely uselessly consume about 5.5GB
> of RAM and lots of compile time.  The problem is the code to avoid
> exponential behavior of nonzero_bits/num_sign_bit_copies on binary
> arithmetic rtxes, which causes us to recurse even when the handling
> of those rtxes is going to ignore their arguments.
> So, this patch limits the workaround to the cases where we are going
> to recurse on both arguments; for rtxes where we don't look at the
> arguments at all, or where we recurse on only a single argument, it
> doesn't make sense.  On the testcase, one of the rtxes where the new
> predicates return false but ARITHMETIC_P is true is COMPARE, in
> particular (compare:CCC (plus (x) (y)) (x)), where it is known even
> without looking at the operands that only one bit is possibly non-zero
> and the number of sign bit copies is always 1.  But without the patch
> we needlessly recurse on x, which is set by another similar operation,
> and so on.
>
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>
> 2016-02-01  Jakub Jelinek  <jakub@redhat.com>
>
> 	PR rtl-optimization/69592
> 	* rtlanal.c (nonzero_bits_binary_arith_p): New inline function.
> 	(cached_nonzero_bits): Use it instead of ARITHMETIC_P.
> 	(num_sign_bit_copies_binary_arith_p): New inline function.
> 	(cached_num_sign_bit_copies): Use it instead of ARITHMETIC_P.
>
> 	* gcc.dg/pr69592.c: New test.
I do wonder if we want a comment in nonzero_bits1 pointing back to
nonzero_bits_binary_arith_p to keep the two in sync, and similarly for
the num_sign_bit_copies version.  Presumably just before the
appropriate switch statement in those functions.

OK with that change.

jeff
Bernd Schmidt Feb. 1, 2016, 11:50 p.m. UTC | #2
On 02/01/2016 09:34 PM, Jakub Jelinek wrote:
> On the following testcase we completely uselessly consume about 5.5GB
> of RAM and lots of compile time.  The problem is the code to avoid
> exponential behavior of nonzero_bits/num_sign_bit_copies on binary
> arithmetic rtxes, which causes us to recurse even when the handling
> of those rtxes is going to ignore their arguments.
> So, this patch limits the workaround to the cases where we are going
> to recurse on both arguments; for rtxes where we don't look at the
> arguments at all, or where we recurse on only a single argument, it
> doesn't make sense.  On the testcase, one of the rtxes where the new
> predicates return false but ARITHMETIC_P is true is COMPARE, in
> particular (compare:CCC (plus (x) (y)) (x)), where it is known even
> without looking at the operands that only one bit is possibly non-zero
> and the number of sign bit copies is always 1.  But without the patch
> we needlessly recurse on x, which is set by another similar operation,
> and so on.

Hmm, so the code we have in place to eliminate performance problems is
itself causing them?
I don't see any code handling COMPARE in nonzero_bits1, only the various 
EQ/NE/etc. codes.

> +static inline bool
> +nonzero_bits_binary_arith_p (const_rtx x)
> +{
> +  if (!ARITHMETIC_P (x))
> +    return false;
> +  switch (GET_CODE (x))
> +    {
> +    case AND:
> +    case XOR:
> +    case IOR:
> +    case UMIN:
> +    case UMAX:
> +    case SMIN:
> +    case SMAX:
> +    case PLUS:
> +    case MINUS:
> +    case MULT:
> +    case DIV:
> +    case UDIV:
> +    case MOD:
> +    case UMOD:
> +      return true;
> +    default:
> +      return false;
> +    }

I think I have a slight preference for listing the cases where we know 
we can avoid the exponential behaviour workaround - i.e. just test for 
compares and return false for them. Otherwise someone might add another 
of the arithmetic codes to nonzero_bits without noticing they have to 
adjust this function as well.
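Bernd's exclusion-based alternative could look roughly like the toy
sketch below.  Everything here is mocked up for illustration (the enum
and predicate names are invented, not GCC's real rtx machinery); the
point is only that listing the codes to *exclude* means newly added
arithmetic codes get the workaround by default:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for GCC's rtx codes and ARITHMETIC_P.  */
enum toy_code { TOY_REG, TOY_PLUS, TOY_MULT, TOY_COMPARE };

static bool
toy_arithmetic_p (enum toy_code c)
{
  return c != TOY_REG;	/* everything but REG is "arithmetic" here */
}

/* Exclusion form: name only the codes known not to look at their
   operands, and keep the exponential-behavior workaround for every
   other arithmetic code by default.  */
static bool
toy_binary_arith_p (enum toy_code c)
{
  if (c == TOY_COMPARE)		/* known to ignore its operands */
    return false;
  return toy_arithmetic_p (c);	/* default: keep the workaround */
}
```

The trade-off Jakub raises below is that the real exclusion list would
be longer than the inclusion list, since most ARITHMETIC_P codes are
not handled by recursing on both operands.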


Bernd
Jakub Jelinek Feb. 2, 2016, 8:59 a.m. UTC | #3
On Tue, Feb 02, 2016 at 12:50:39AM +0100, Bernd Schmidt wrote:
> On 02/01/2016 09:34 PM, Jakub Jelinek wrote:
> >On the following testcase we completely uselessly consume about 5.5GB
> >of RAM and lots of compile time.  The problem is the code to avoid
> >exponential behavior of nonzero_bits/num_sign_bit_copies on binary
> >arithmetic rtxes, which causes us to recurse even when the handling
> >of those rtxes is going to ignore their arguments.
> >So, this patch limits the workaround to the cases where we are going
> >to recurse on both arguments; for rtxes where we don't look at the
> >arguments at all, or where we recurse on only a single argument, it
> >doesn't make sense.  On the testcase, one of the rtxes where the new
> >predicates return false but ARITHMETIC_P is true is COMPARE, in
> >particular (compare:CCC (plus (x) (y)) (x)), where it is known even
> >without looking at the operands that only one bit is possibly non-zero
> >and the number of sign bit copies is always 1.  But without the patch
> >we needlessly recurse on x, which is set by another similar operation,
> >and so on.
> 
> Hmm, so the code we have in place to eliminate performance problems is
> itself causing them?

Yes.

> I don't see any code handling COMPARE in nonzero_bits1, only the various
> EQ/NE/etc. codes.

Right.

> I think I have a slight preference for listing the cases where we know we
> can avoid the exponential behaviour workaround - i.e. just test for compares
> and return false for them. Otherwise someone might add another of the
> arithmetic codes to nonzero_bits without noticing they have to adjust this
> function as well.

The problem is that more of the codes aren't handled than are, so
listing those that need to be handled is shorter and IMO more
maintainable.  There are 32 ARITHMETIC_P codes; nonzero_bits1 recurses
on both operands for only 14 of them, and num_sign_bit_copies1 for
only 10.

I wonder if it wouldn't be better to pass around some structure,
containing a fixed-size cache for the common case and perhaps falling
back to a hash_map if there are more values to cache than that.  Plus
perhaps a recursion depth, so that we avoid other pathological cases.
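A minimal sketch of the structure described above (all names, sizes,
and limits are hypothetical, not from GCC): a small fixed-size cache
tried first, plus a recursion-depth counter; a real version would
spill into a hash_map when the array fills up, where this sketch
simply drops the entry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NZ_CACHE_SIZE 16
#define NZ_MAX_DEPTH 64

struct nz_ctx
{
  const void *keys[NZ_CACHE_SIZE];	/* rtx pointers in real code */
  unsigned long vals[NZ_CACHE_SIZE];
  int n;	/* entries used */
  int depth;	/* current recursion depth */
};

static bool
nz_lookup (const struct nz_ctx *c, const void *x, unsigned long *val)
{
  for (int i = 0; i < c->n; i++)
    if (c->keys[i] == x)
      {
	*val = c->vals[i];
	return true;
      }
  return false;
}

static void
nz_insert (struct nz_ctx *c, const void *x, unsigned long val)
{
  /* Real code would fall back to a hash_map here instead of
     silently dropping the entry.  */
  if (c->n < NZ_CACHE_SIZE)
    {
      c->keys[c->n] = x;
      c->vals[c->n] = val;
      c->n++;
    }
}

static bool
nz_enter (struct nz_ctx *c)
{
  /* Past the depth limit, the caller would return a conservative
     answer instead of recursing further.  */
  if (c->depth >= NZ_MAX_DEPTH)
    return false;
  c->depth++;
  return true;
}
```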

	Jakub
Bernd Schmidt Feb. 3, 2016, 5:07 p.m. UTC | #4
On 02/02/2016 09:59 AM, Jakub Jelinek wrote:

> I wonder if it wouldn't be better to pass around some structure,
> containing a fixed-size cache for the common case and perhaps falling
> back to a hash_map if there are more values to cache than that.  Plus
> perhaps a recursion depth, so that we avoid other pathological cases.

I had the same thought. Maybe for stage1?


Bernd
Jakub Jelinek Feb. 3, 2016, 5:09 p.m. UTC | #5
On Wed, Feb 03, 2016 at 06:07:25PM +0100, Bernd Schmidt wrote:
> On 02/02/2016 09:59 AM, Jakub Jelinek wrote:
> 
> >I wonder if it wouldn't be better to pass around some structure,
> >containing a fixed-size cache for the common case and perhaps falling
> >back to a hash_map if there are more values to cache than that.  Plus
> >perhaps a recursion depth, so that we avoid other pathological cases.
> 
> I had the same thought. Maybe for stage1?

Yeah.

	Jakub

Patch

--- gcc/rtlanal.c.jj	2016-01-21 21:27:57.000000000 +0100
+++ gcc/rtlanal.c	2016-02-01 18:53:06.130934333 +0100
@@ -4163,6 +4163,36 @@  num_sign_bit_copies (const_rtx x, machin
   return cached_num_sign_bit_copies (x, mode, NULL_RTX, VOIDmode, 0);
 }
 
+/* Return true if nonzero_bits1 might recurse into both operands
+   of X.  */
+
+static inline bool
+nonzero_bits_binary_arith_p (const_rtx x)
+{
+  if (!ARITHMETIC_P (x))
+    return false;
+  switch (GET_CODE (x))
+    {
+    case AND:
+    case XOR:
+    case IOR:
+    case UMIN:
+    case UMAX:
+    case SMIN:
+    case SMAX:
+    case PLUS:
+    case MINUS:
+    case MULT:
+    case DIV:
+    case UDIV:
+    case MOD:
+    case UMOD:
+      return true;
+    default:
+      return false;
+    }
+}
+
 /* The function cached_nonzero_bits is a wrapper around nonzero_bits1.
    It avoids exponential behavior in nonzero_bits1 when X has
    identical subexpressions on the first or the second level.  */
@@ -4179,7 +4209,7 @@  cached_nonzero_bits (const_rtx x, machin
      nonzero_bits1 on X with the subexpressions as KNOWN_X and the
      precomputed value for the subexpression as KNOWN_RET.  */
 
-  if (ARITHMETIC_P (x))
+  if (nonzero_bits_binary_arith_p (x))
     {
       rtx x0 = XEXP (x, 0);
       rtx x1 = XEXP (x, 1);
@@ -4191,13 +4221,13 @@  cached_nonzero_bits (const_rtx x, machin
 						   known_mode, known_ret));
 
       /* Check the second level.  */
-      if (ARITHMETIC_P (x0)
+      if (nonzero_bits_binary_arith_p (x0)
 	  && (x1 == XEXP (x0, 0) || x1 == XEXP (x0, 1)))
 	return nonzero_bits1 (x, mode, x1, mode,
 			      cached_nonzero_bits (x1, mode, known_x,
 						   known_mode, known_ret));
 
-      if (ARITHMETIC_P (x1)
+      if (nonzero_bits_binary_arith_p (x1)
 	  && (x0 == XEXP (x1, 0) || x0 == XEXP (x1, 1)))
 	return nonzero_bits1 (x, mode, x0, mode,
 			      cached_nonzero_bits (x0, mode, known_x,
@@ -4672,6 +4702,33 @@  nonzero_bits1 (const_rtx x, machine_mode
 #undef cached_num_sign_bit_copies
 
 
+/* Return true if num_sign_bit_copies1 might recurse into both operands
+   of X.  */
+
+static inline bool
+num_sign_bit_copies_binary_arith_p (const_rtx x)
+{
+  if (!ARITHMETIC_P (x))
+    return false;
+  switch (GET_CODE (x))
+    {
+    case IOR:
+    case AND:
+    case XOR:
+    case SMIN:
+    case SMAX:
+    case UMIN:
+    case UMAX:
+    case PLUS:
+    case MINUS:
+    case MULT:
+      return true;
+    default:
+      return false;
+    }
+}
+
+
 /* The function cached_num_sign_bit_copies is a wrapper around
    num_sign_bit_copies1.  It avoids exponential behavior in
    num_sign_bit_copies1 when X has identical subexpressions on the
@@ -4689,7 +4746,7 @@  cached_num_sign_bit_copies (const_rtx x,
      num_sign_bit_copies1 on X with the subexpressions as KNOWN_X and
      the precomputed value for the subexpression as KNOWN_RET.  */
 
-  if (ARITHMETIC_P (x))
+  if (num_sign_bit_copies_binary_arith_p (x))
     {
       rtx x0 = XEXP (x, 0);
       rtx x1 = XEXP (x, 1);
@@ -4703,7 +4760,7 @@  cached_num_sign_bit_copies (const_rtx x,
 							    known_ret));
 
       /* Check the second level.  */
-      if (ARITHMETIC_P (x0)
+      if (num_sign_bit_copies_binary_arith_p (x0)
 	  && (x1 == XEXP (x0, 0) || x1 == XEXP (x0, 1)))
 	return
 	  num_sign_bit_copies1 (x, mode, x1, mode,
@@ -4711,7 +4768,7 @@  cached_num_sign_bit_copies (const_rtx x,
 							    known_mode,
 							    known_ret));
 
-      if (ARITHMETIC_P (x1)
+      if (num_sign_bit_copies_binary_arith_p (x1)
 	  && (x0 == XEXP (x1, 0) || x0 == XEXP (x1, 1)))
 	return
 	  num_sign_bit_copies1 (x, mode, x0, mode,
--- gcc/testsuite/gcc.dg/pr69592.c.jj	2016-02-01 19:02:23.122251761 +0100
+++ gcc/testsuite/gcc.dg/pr69592.c	2016-02-01 19:00:30.000000000 +0100
@@ -0,0 +1,16 @@ 
+/* PR rtl-optimization/69592 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+unsigned int
+foo (unsigned int a, unsigned int *b, unsigned int c)
+{
+  unsigned int d;
+#define A(n) d = a + b[n]; if (d < a) c++; a = d;
+#define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
+#define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
+#define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
+#define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
+  C(1) C(2) C(3) C(4) C(5) C(6)
+  return d + c;
+}