diff mbox

match.pd: optimize unsigned mul overflow check

Message ID alpine.LNX.2.20.1605282232330.2043@monopod.intra.ispras.ru
State New
Headers show

Commit Message

Alexander Monakov May 28, 2016, 8:01 p.m. UTC
Hello,

For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether 'A*B'
overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
invocation of __builtin_mul_overflow to avoid the divide operation.

The following patch implements that as a match.pd transformation.  It looks
like there's a similar thing going on for add/sub overflow check in
tree-ssa-math-opts, but handling this in match.pd seems reasonable.

(user code using the above test would probably also compute 'A*B' a few
statements later; notably, today GCC cannot CSE plain 'A*B' to REALPART_EXPR
of the builtin call on GIMPLE; on x86 it gets cleaned up on RTL)

Thanks to Marc for helping get this in shape on the Bugzilla.

Bootstrapped and regtested on x86_64, OK for trunk?

gcc/
2016-05-28  Alexander Monakov  <amonakov@ispras.ru>
	Marc Glisse  <marc.glisse@inria.fr>

	PR tree-optimization/71289
	* match.pd (-1 / B < A, A > -1 / B): New transformations.

gcc/testsuite/
2016-05-28  Alexander Monakov  <amonakov@ispras.ru>

	PR tree-optimization/71289
	* gcc.dg/pr71289.c: New test.

Comments

Marc Glisse May 29, 2016, 4:41 p.m. UTC | #1
On Sat, 28 May 2016, Alexander Monakov wrote:

> For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether 'A*B'
> overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
> invocation of __builtin_mul_overflow to avoid the divide operation.

Hmm, that division by zero thing is a good point. I may be confusing with 
dereferencing a null pointer, but I believe that some languages catch the 
corresponding signal, so by removing that division you would be changing 
the behavior. I wish we had a -fno-divisions-by-zero or equivalent, but 
otherwise this may require an extra check like tree_expr_nonzero_p, 
although we are quite inconsistent about this (we don't simplify x/x to 1, 
but we do simplify 0%x to 0 if x is not (yet) known to be the constant 0). 
We'll see what the reviewers think...

Any plan on optimizing the 'B && ovf' form?
Marc Glisse May 29, 2016, 9:17 p.m. UTC | #2
On Sat, 28 May 2016, Alexander Monakov wrote:

> For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether 'A*B'
> overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
> invocation of __builtin_mul_overflow to avoid the divide operation.

I forgot to ask earlier: what does this give for modes / platforms where 
umulv4 does not have a specific implementation? Is the generic 
implementation worse than A>-1/B, in which case we may want to check 
optab_handler before doing the transformation? Or is it always at least as 
good?

(I didn't ask because I was assuming the latter, but I am not 100% 
certain)
Alexander Monakov May 30, 2016, 7:14 a.m. UTC | #3
On Sun, 29 May 2016, Marc Glisse wrote:
> On Sat, 28 May 2016, Alexander Monakov wrote:
> 
> > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether
> > 'A*B'
> > overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
> > invocation of __builtin_mul_overflow to avoid the divide operation.
> 
> I forgot to ask earlier: what does this give for modes / platforms where
> umulv4 does not have a specific implementation? Is the generic implementation
> worse than A>-1/B, in which case we may want to check optab_handler before
> doing the transformation? Or is it always at least as good?

If umulv<mode>4 is unavailable (which today is everywhere except x86), gcc
falls back as follows.  First, it tries to see if doing a multiplication in a
2x wider type is possible (which it usually is, as gcc supports __int128_t on
64-bit platforms and 64-bit long long on 32-bit platforms), then it looks at
high bits of the 2x wide product.  This should boil down to doing a 'high
multiply' instruction if original operands' type matches register size, and a
normal multiply + masking high bits if the type is smaller than register.

Second, if the above fails (e.g. with 64-bit operands on a 32-bit platform),
then gcc emits a sequence that performs the multiplication by parts in a 2x
narrower type.

I think the first, more commonly taken, fallback path results in an
always-good code. In the second case, the eliminated 64-bit divide is unlikely
to have a direct hw support; e.g., on i386 it's a library call to __udivdi3.
This makes the transformation a likely loss for code size, a likely win for
performance.  It could be better if GCC could CSE REALPART (IFN_MUL_OVERFLOW)
with A*B on gimple.

Thanks.
Alexander
Richard Biener May 30, 2016, 10:55 a.m. UTC | #4
On Mon, May 30, 2016 at 9:14 AM, Alexander Monakov <amonakov@ispras.ru> wrote:
> On Sun, 29 May 2016, Marc Glisse wrote:
>> On Sat, 28 May 2016, Alexander Monakov wrote:
>>
>> > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether
>> > 'A*B'
>> > overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
>> > invocation of __builtin_mul_overflow to avoid the divide operation.
>>
>> I forgot to ask earlier: what does this give for modes / platforms where
>> umulv4 does not have a specific implementation? Is the generic implementation
>> worse than A>-1/B, in which case we may want to check optab_handler before
>> doing the transformation? Or is it always at least as good?
>
> If umulv<mode>4 is unavailable (which today is everywhere except x86), gcc
> falls back as follows.  First, it tries to see if doing a multiplication in a
> 2x wider type is possible (which it usually is, as gcc supports __int128_t on
> 64-bit platforms and 64-bit long long on 32-bit platforms), then it looks at
> high bits of the 2x wide product.  This should boil down to doing a 'high
> multiply' instruction if original operands' type matches register size, and a
> normal multiply + masking high bits if the type is smaller than register.
>
> Second, if the above fails (e.g. with 64-bit operands on a 32-bit platform),
> then gcc emits a sequence that performs the multiplication by parts in a 2x
> narrower type.
>
> I think the first, more commonly taken, fallback path results in an
> always-good code. In the second case, the eliminated 64-bit divide is unlikely
> to have a direct hw support; e.g., on i386 it's a library call to __udivdi3.
> This makes the transformation a likely loss for code size, a likely win for
> performance.  It could be better if GCC could CSE REALPART (IFN_MUL_OVERFLOW)
> with A*B on gimple.

CCing Jakub who wrote the tree-ssa-math-opts.c code last year.  I remember we
discussed using match.pd but ended up with not doing it there but I
don't remember
the exact reason.  It would be nice to have these in one place rather
than split.

As of umulv<mode>4 handling - users can write overflow checks using the builtins
so we better expand them to the optimal form for each target, thus
canonicalizing
them to the IFN looks reasonable to me.  The plus/minus case in
tree-ssa-math-opts.c
_does_ disable itself if no uaddv is available though.

As for the division by zero thing - division by zero invokes undefined
behavior.  Yes,
with -fnon-call-exceptions you could catch this as exception, but you shouldn't
be able to w/o a -fdivision-by-zero.  And yes, we're quite
inconsistent in folding
of, say, 0/0 - but that is due to warning code (historically).  I'd
rather ignore that
bit in folding and make us consistent here (hopefully w/o regressing
in diagnostics).

Richard.

> Thanks.
> Alexander
Jakub Jelinek May 30, 2016, 11:15 a.m. UTC | #5
On Mon, May 30, 2016 at 12:55:22PM +0200, Richard Biener wrote:
> CCing Jakub who wrote the tree-ssa-math-opts.c code last year.  I remember we
> discussed using match.pd but ended up with not doing it there but I
> don't remember
> the exact reason.

<jakub> richi: does match.pd support turning non-call code into internal call?
<richi> jakub: yes
<richi> jakub: (simplify (plus @1 @2) (fn @1 @2))
<jakub> richi: where fn is IFN_... or something else?
<richi> jakub: IFN_ should work, yes
<richi> jakub: CFN_ as well as far as I understand the code richard added
<jakub> but then I don't control the type of the lhs, right?
<jakub> want to transform
<jakub>   r_4 = x_2(D) - y_3(D);
<jakub>   if (x_2(D) < r_4)
<jakub> into:
<richi> well, the lhs is always the original type of the expression (obviously)
<jakub>   _4 = SUB_OVERFLOW (x_2(D), y_3(D));
<jakub>   r.0_5 = REALPART_EXPR <_4>;
<jakub>   _7 = IMAGPART_EXPR <_4>;
<jakub>   if (_7 != 0)
<jakub> unsigned only
<richi> jakub: hmm, you have two results, the compare and the subtract, that's not possible to handle right now (unless r_4 is unused otherwise)
<richi> jakub: you can write a matcher and do the transform in a pass somewhere
<jakub> richi: ah, ok; any suggestion which pass is best for this?  forwprop, something else?
<richi> jakub: I'd say forwprop is good enough, yes
<jakub> richi: FYI, this is for PR67089, rth is working on the optab side for that
<richi> hmm, the (match ...) code doesn't deal well with values not having an SSA name (the compare in this case)
<richi> "deal well" as in, not implemented
<richi> it would require to output some auxiliary entry points
<jakub> richi: if it is done in forwprop or some other pass, it can do everything in there by hand, doesn't need match.pd help I guess
<richi> jakub: yes

>  It would be nice to have these in one place rather
> than split.
> 
> As of umulv<mode>4 handling - users can write overflow checks using the builtins
> so we better expand them to the optimal form for each target, thus
> canonicalizing
> them to the IFN looks reasonable to me.

> The plus/minus case in tree-ssa-math-opts.c
> _does_ disable itself if no uaddv is available though.

That is because the uaddv/usubv case is so simple without the HW support that
it is unlikely to be beneficial then.  That is not the case for multiplication.

	Jakub
Richard Biener May 30, 2016, 11:31 a.m. UTC | #6
On Mon, May 30, 2016 at 1:15 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Mon, May 30, 2016 at 12:55:22PM +0200, Richard Biener wrote:
>> CCing Jakub who wrote the tree-ssa-math-opts.c code last year.  I remember we
>> discussed using match.pd but ended up with not doing it there but I
>> don't remember
>> the exact reason.
>
> <jakub> richi: does match.pd support turning non-call code into internal call?
> <richi> jakub: yes
> <richi> jakub: (simplify (plus @1 @2) (fn @1 @2))
> <jakub> richi: where fn is IFN_... or something else?
> <richi> jakub: IFN_ should work, yes
> <richi> jakub: CFN_ as well as far as I understand the code richard added
> <jakub> but then I don't control the type of the lhs, right?
> <jakub> want to transform
> <jakub>   r_4 = x_2(D) - y_3(D);
> <jakub>   if (x_2(D) < r_4)
> <jakub> into:
> <richi> well, the lhs is always the original type of the expression (obviously)
> <jakub>   _4 = SUB_OVERFLOW (x_2(D), y_3(D));
> <jakub>   r.0_5 = REALPART_EXPR <_4>;
> <jakub>   _7 = IMAGPART_EXPR <_4>;
> <jakub>   if (_7 != 0)
> <jakub> unsigned only
> <richi> jakub: hmm, you have two results, the compare and the subtract, that's not possible to handle right now (unless r_4 is unused otherwise)
> <richi> jakub: you can write a matcher and do the transform in a pass somewhere
> <jakub> richi: ah, ok; any suggestion which pass is best for this?  forwprop, something else?
> <richi> jakub: I'd say forwprop is good enough, yes
> <jakub> richi: FYI, this is for PR67089, rth is working on the optab side for that
> <richi> hmm, the (match ...) code doesn't deal well with values not having an SSA name (the compare in this case)
> <richi> "deal well" as in, not implemented
> <richi> it would require to output some auxiliary entry points
> <jakub> richi: if it is done in forwprop or some other pass, it can do everything in there by hand, doesn't need match.pd help I guess
> <richi> jakub: yes

Ok, so Alexander - you transform the condition only as the
multiplication value is not available
anyway with your formulation of the overflow check.  That's then
different to Jakubs case,
so match.pd is ok (that is, the patch is ok for trunk).

>>  It would be nice to have these in one place rather
>> than split.
>>
>> As of umulv<mode>4 handling - users can write overflow checks using the builtins
>> so we better expand them to the optimal form for each target, thus
>> canonicalizing
>> them to the IFN looks reasonable to me.
>
>> The plus/minus case in tree-ssa-math-opts.c
>> _does_ disable itself if no uaddv is available though.
>
> That is because the uaddv/usubv case is so simple without the HW support that
> it is unlikely to be beneficial then.  That is not the case for multiplication.

Richard.

>         Jakub
Alexander Monakov May 30, 2016, 3:12 p.m. UTC | #7
On Sun, 29 May 2016, Marc Glisse wrote:
> On Sat, 28 May 2016, Alexander Monakov wrote:
> > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether
> > 'A*B'
> > overflows (or 'B && A > -1 / B' if B may be zero).  Let's optimize it to an
> > invocation of __builtin_mul_overflow to avoid the divide operation.
> 
[snip]
> Any plan on optimizing the 'B && ovf' form?

Not in the immediate future; I don't have a good approach in mind to do that.

(if I get back to this topic I'd try to CSE the product with the IFN result)

Thanks for the discussion and help -- I've applied the patch.

Alexander
diff mbox

Patch

diff --git a/gcc/match.pd b/gcc/match.pd
index 8d05e86..953c070 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2657,6 +2657,25 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
    (out (imagpart @2) { build_zero_cst (TREE_TYPE (@0)); }))))
 
+/* For unsigned operands, A > -1 / B checks whether A * B would overflow.
+   Simplify it to __builtin_mul_overflow (A, B, <unused>).  */
+/* -1 / B < A */
+(for cmp (lt ge)
+     out (ne eq)
+ (simplify
+  (cmp (trunc_div:s integer_all_onesp @1) @0)
+  (if (TYPE_UNSIGNED (TREE_TYPE (@0)) && !VECTOR_TYPE_P (TREE_TYPE (@0)))
+   (with { tree t = TREE_TYPE (@0), cpx = build_complex_type (t); }
+    (out (imagpart (IFN_MUL_OVERFLOW:cpx @0 @1)) { build_zero_cst (t); })))))
+
+/* A > -1 / B */
+(for cmp (gt le)
+     out (ne eq)
+ (simplify
+  (cmp @0 (trunc_div:s integer_all_onesp @1))
+  (if (TYPE_UNSIGNED (TREE_TYPE (@0)) && !VECTOR_TYPE_P (TREE_TYPE (@0)))
+   (with { tree t = TREE_TYPE (@0), cpx = build_complex_type (t); }
+    (out (imagpart (IFN_MUL_OVERFLOW:cpx @0 @1)) { build_zero_cst (t); })))))
 
 /* Simplification of math builtins.  These rules must all be optimizations
    as well as IL simplifications.  If there is a possibility that the new
diff --git a/gcc/testsuite/gcc.dg/pr71289.c b/gcc/testsuite/gcc.dg/pr71289.c
new file mode 100644
index 0000000..39837b9
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr71289.c
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O -fdump-tree-optimized-raw" } */
+
+int f(unsigned a, unsigned b, unsigned *c)
+{
+  if (a > -1 / b)
+    return -1;
+  *c = a * b;
+  return 0;
+}
+
+void g(unsigned long long a, unsigned long long b, unsigned long long *c)
+{
+  if (a <= -1 / b)
+    *c = a * b;
+}
+
+/* { dg-final { scan-tree-dump-not "trunc_div_expr" "optimized" } } */