Message ID | alpine.LNX.2.20.1605282232330.2043@monopod.intra.ispras.ru |
---|---|
State | New |
Headers | show |
On Sat, 28 May 2016, Alexander Monakov wrote: > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether 'A*B' > overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it to an > invocation of __builtin_mul_overflow to avoid the divide operation. Hmm, that division by zero thing is a good point. I may be confusing with dereferencing a null pointer, but I believe that some languages catch the corresponding signal, so by removing that division you would be changing the behavior. I wish we had a -fno-divisions-by-zero or equivalent, but otherwise this may require an extra check like tree_expr_nonzero_p, although we are quite inconsistent about this (we don't simplify x/x to 1, but we do simplify 0%x to 0 if x is not (yet) known to be the constant 0). We'll see what the reviewers think... Any plan on optimizing the 'B && ovf' form?
On Sat, 28 May 2016, Alexander Monakov wrote: > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether 'A*B' > overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it to an > invocation of __builtin_mul_overflow to avoid the divide operation. I forgot to ask earlier: what does this give for modes / platforms where umulv4 does not have a specific implementation? Is the generic implementation worse than A>-1/B, in which case we may want to check optab_handler before doing the transformation? Or is it always at least as good? (I didn't ask because I was assuming the latter, but I am not 100% certain)
On Sun, 29 May 2016, Marc Glisse wrote: > On Sat, 28 May 2016, Alexander Monakov wrote: > > > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether > > 'A*B' > > overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it to an > > invocation of __builtin_mul_overflow to avoid the divide operation. > > I forgot to ask earlier: what does this give for modes / platforms where > umulv4 does not have a specific implementation? Is the generic implementation > worse than A>-1/B, in which case we may want to check optab_handler before > doing the transformation? Or is it always at least as good? If umulv<mode>4 is unavailable (which today is everywhere except x86), gcc falls back as follows. First, it tries to see if doing a multiplication in a 2x wider type is possible (which it usually is, as gcc supports __int128_t on 64-bit platforms and 64-bit long long on 32-bit platforms), then it looks at high bits of the 2x wide product. This should boil down to doing a 'high multiply' instruction if original operands' type matches register size, and a normal multiply + masking high bits if the type is smaller than register. Second, if the above fails (e.g. with 64-bit operands on a 32-bit platform), then gcc emits a sequence that performs the multiplication by parts in a 2x narrower type. I think the first, more commonly taken, fallback path results in an always-good code. In the second case, the eliminated 64-bit divide is unlikely to have a direct hw support; e.g., on i386 it's a library call to __udivdi3. This makes the transformation a likely loss for code size, a likely win for performance. It could be better if GCC could CSE REALPART (IFN_MUL_OVERFLOW) with A*B on gimple. Thanks. Alexander
On Mon, May 30, 2016 at 9:14 AM, Alexander Monakov <amonakov@ispras.ru> wrote: > On Sun, 29 May 2016, Marc Glisse wrote: >> On Sat, 28 May 2016, Alexander Monakov wrote: >> >> > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether >> > 'A*B' >> > overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it to an >> > invocation of __builtin_mul_overflow to avoid the divide operation. >> >> I forgot to ask earlier: what does this give for modes / platforms where >> umulv4 does not have a specific implementation? Is the generic implementation >> worse than A>-1/B, in which case we may want to check optab_handler before >> doing the transformation? Or is it always at least as good? > > If umulv<mode>4 is unavailable (which today is everywhere except x86), gcc > falls back as follows. First, it tries to see if doing a multiplication in a > 2x wider type is possible (which it usually is, as gcc supports __int128_t on > 64-bit platforms and 64-bit long long on 32-bit platforms), then it looks at > high bits of the 2x wide product. This should boil down to doing a 'high > multiply' instruction if original operands' type matches register size, and a > normal multiply + masking high bits if the type is smaller than register. > > Second, if the above fails (e.g. with 64-bit operands on a 32-bit platform), > then gcc emits a sequence that performs the multiplication by parts in a 2x > narrower type. > > I think the first, more commonly taken, fallback path results in an > always-good code. In the second case, the eliminated 64-bit divide is unlikely > to have a direct hw support; e.g., on i386 it's a library call to __udivdi3. > This makes the transformation a likely loss for code size, a likely win for > performance. It could be better if GCC could CSE REALPART (IFN_MUL_OVERFLOW) > with A*B on gimple. CCing Jakub who wrote the tree-ssa-math-opts.c code last year. I remember we discussed using match.pd but ended up with not doing it there but I don't remember the exact reason. It would be nice to have these in one place rather than split. As of umulv<mode>4 handling - users can write overflow checks using the builtins so we better expand them to the optimal form for each target, thus canonicalizing them to the IFN looks reasonable to me. The plus/minus case in tree-ssa-math-opts.c _does_ disable itself if no uaddv is available though. As for the division by zero thing - division by zero invokes undefined behavior. Yes, with -fnon-call-exceptions you could catch this as exception, but you shouldn't be able to w/o a -fdivision-by-zero. And yes, we're quite inconsistent in folding of, say, 0/0 - but that is due to warning code (historically). I'd rather ignore that bit in folding and make us consistent here (hopefully w/o regressing in diagnostics). Richard. > Thanks. > Alexander
On Mon, May 30, 2016 at 12:55:22PM +0200, Richard Biener wrote: > CCing Jakub who wrote the tree-ssa-math-opts.c code last year. I remember we > discussed using match.pd but ended up with not doing it there but I > don't remember > the exact reason. <jakub> richi: does match.pd support turning non-call code into internal call? <richi> jakub: yes <richi> jakub: (simplify (plus @1 @2) (fn @1 @2)) <jakub> richi: where fn is IFN_... or something else? <richi> jakub: IFN_ should work, yes <richi> jakub: CFN_ as well as far as I understand the code richard added <jakub> but then I don't control the type of the lhs, right? <jakub> want to transform <jakub> r_4 = x_2(D) - y_3(D); <jakub> if (x_2(D) < r_4) <jakub> into: <richi> well, the lhs is always the original type of the expression (obviously) <jakub> _4 = SUB_OVERFLOW (x_2(D), y_3(D)); <jakub> r.0_5 = REALPART_EXPR <_4>; <jakub> _7 = IMAGPART_EXPR <_4>; <jakub> if (_7 != 0) <jakub> unsigned only <richi> jakub: hmm, you have two results, the compare and the subtract, that's not possible to handle right now (unless r_4 is unused otherwise) <richi> jakub: you can write a matcher and do the transform in a pass somewhere <jakub> richi: ah, ok; any suggestion which pass is best for this? forwprop, something else? <richi> jakub: I'd say forwprop is good enough, yes <jakub> richi: FYI, this is for PR67089, rth is working on the optab side for that <richi> hmm, the (match ...) code doesn't deal well with values not having an SSA name (the compare in this case) <richi> "deal well" as in, not implemented <richi> it would require to output some auxiliary entry points <jakub> richi: if it is done in forwprop or some other pass, it can do everything in there by hand, doesn't need match.pd help I guess <richi> jakub: yes > It would be nice to have these in one place rather > than split. > > As of umulv<mode>4 handling - users can write overflow checks using the builtins > so we better expand them to the optimal form for each target, thus > canonicalizing > them to the IFN looks reasonable to me. > The plus/minus case in tree-ssa-math-opts.c > _does_ disable itself if no uaddv is available though. That is because the uaddv/usubv case is so simple without the HW support that it is unlikely to be beneficial then. That is not the case for multiplication. Jakub
On Mon, May 30, 2016 at 1:15 PM, Jakub Jelinek <jakub@redhat.com> wrote: > On Mon, May 30, 2016 at 12:55:22PM +0200, Richard Biener wrote: >> CCing Jakub who wrote the tree-ssa-math-opts.c code last year. I remember we >> discussed using match.pd but ended up with not doing it there but I >> don't remember >> the exact reason. > > <jakub> richi: does match.pd support turning non-call code into internal call? > <richi> jakub: yes > <richi> jakub: (simplify (plus @1 @2) (fn @1 @2)) > <jakub> richi: where fn is IFN_... or something else? > <richi> jakub: IFN_ should work, yes > <richi> jakub: CFN_ as well as far as I understand the code richard added > <jakub> but then I don't control the type of the lhs, right? > <jakub> want to transform > <jakub> r_4 = x_2(D) - y_3(D); > <jakub> if (x_2(D) < r_4) > <jakub> into: > <richi> well, the lhs is always the original type of the expression (obviously) > <jakub> _4 = SUB_OVERFLOW (x_2(D), y_3(D)); > <jakub> r.0_5 = REALPART_EXPR <_4>; > <jakub> _7 = IMAGPART_EXPR <_4>; > <jakub> if (_7 != 0) > <jakub> unsigned only > <richi> jakub: hmm, you have two results, the compare and the subtract, that's not possible to handle right now (unless r_4 is unused otherwise) > <richi> jakub: you can write a matcher and do the transform in a pass somewhere > <jakub> richi: ah, ok; any suggestion which pass is best for this? forwprop, something else? > <richi> jakub: I'd say forwprop is good enough, yes > <jakub> richi: FYI, this is for PR67089, rth is working on the optab side for that > <richi> hmm, the (match ...) code doesn't deal well with values not having an SSA name (the compare in this case) > <richi> "deal well" as in, not implemented > <richi> it would require to output some auxiliary entry points > <jakub> richi: if it is done in forwprop or some other pass, it can do everything in there by hand, doesn't need match.pd help I guess > <richi> jakub: yes Ok, so Alexander - you transform the condition only as the multiplication value is not available anyway with your formulation of the overflow check. That's then different to Jakubs case, so match.pd is ok (that is, the patch is ok for trunk). >> It would be nice to have these in one place rather >> than split. >> >> As of umulv<mode>4 handling - users can write overflow checks using the builtins >> so we better expand them to the optimal form for each target, thus >> canonicalizing >> them to the IFN looks reasonable to me. > >> The plus/minus case in tree-ssa-math-opts.c >> _does_ disable itself if no uaddv is available though. > > That is because the uaddv/usubv case is so simple without the HW support that > it is unlikely to be beneficial then. That is not the case for multiplication. Richard. > Jakub
On Sun, 29 May 2016, Marc Glisse wrote: > On Sat, 28 May 2016, Alexander Monakov wrote: > > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether > > 'A*B' > > overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it to an > > invocation of __builtin_mul_overflow to avoid the divide operation. > [snip] > Any plan on optimizing the 'B && ovf' form? Not in the immediate future; I don't have a good approach in mind to do that. (if I get back to this topic I'd try to CSE the product with the IFN result) Thanks for the discussion and help -- I've applied the patch. Alexander
diff --git a/gcc/match.pd b/gcc/match.pd index 8d05e86..953c070 100644 --- a/gcc/match.pd +++ b/gcc/match.pd @@ -2657,6 +2657,25 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) && types_match (TREE_TYPE (@0), TREE_TYPE (@1))) (out (imagpart @2) { build_zero_cst (TREE_TYPE (@0)); })))) +/* For unsigned operands, A > -1 / B checks whether A * B would overflow. + Simplify it to __builtin_mul_overflow (A, B, <unused>). */ +/* -1 / B < A */ +(for cmp (lt ge) + out (ne eq) + (simplify + (cmp (trunc_div:s integer_all_onesp @1) @0) + (if (TYPE_UNSIGNED (TREE_TYPE (@0)) && !VECTOR_TYPE_P (TREE_TYPE (@0))) + (with { tree t = TREE_TYPE (@0), cpx = build_complex_type (t); } + (out (imagpart (IFN_MUL_OVERFLOW:cpx @0 @1)) { build_zero_cst (t); }))))) + +/* A > -1 / B */ +(for cmp (gt le) + out (ne eq) + (simplify + (cmp @0 (trunc_div:s integer_all_onesp @1)) + (if (TYPE_UNSIGNED (TREE_TYPE (@0)) && !VECTOR_TYPE_P (TREE_TYPE (@0))) + (with { tree t = TREE_TYPE (@0), cpx = build_complex_type (t); } + (out (imagpart (IFN_MUL_OVERFLOW:cpx @0 @1)) { build_zero_cst (t); }))))) /* Simplification of math builtins. These rules must all be optimizations as well as IL simplifications. If there is a possibility that the new diff --git a/gcc/testsuite/gcc.dg/pr71289.c b/gcc/testsuite/gcc.dg/pr71289.c new file mode 100644 index 0000000..39837b9 --- /dev/null +++ b/gcc/testsuite/gcc.dg/pr71289.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O -fdump-tree-optimized-raw" } */ + +int f(unsigned a, unsigned b, unsigned *c) +{ + if (a > -1 / b) + return -1; + *c = a * b; + return 0; +} + +void g(unsigned long long a, unsigned long long b, unsigned long long *c) +{ + if (a <= -1 / b) + *c = a * b; +} + +/* { dg-final { scan-tree-dump-not "trunc_div_expr" "optimized" } } */