Fix PR85726 (div-div suboptimization) and a rant on match.pd :s-flag

Message ID 201805092341.w49Nfr4o030638@ignucius.se.axis.com
State New
Headers show
Series
  • Fix PR85726 (div-div suboptimization) and a rant on match.pd :s-flag
Related show

Commit Message

Hans-Peter Nilsson May 9, 2018, 11:41 p.m.
Replacing a division feeding a division helps only when the
second division is the only user, and "fusing" the divisions is
downright bad if another user of the result of first division is
a modulus of the same value as the second division, forming a
divmod pair.  See the test-case, where for the tested
architectures (which all fail the test-case before the patch)
the div and mod are implemented using the high-part of a widened
multiplication and shift, emitted separately but combined as
late as in rtl, with the multiplicaton and shift re-used.  That
of course does not happen if later passes see (y / 48; y % 3).
While in match.pd, I noticed the corresponding mul-mul match,
which I believe should be guarded the same way.

I noticed this spot investigating performance regressions for
mipsisa32r2el-linux-gnu comparing to gcc-4.7-era code generated
for a compute-intensive library.  I'd say the pattern in the
test-case is common in cases like implementing a 3x3 filter
using SIMD with a vector size 16.  The suboptimization was more
of an (the first, or second) eye-catcher in a hot degraded
function than an actual cause though; fixing it seems to amount
for no more than 2% (where there's a 13% regression) in that
function.  As a contrast, I see e.g. four times as many local
call-saved registers used for some loop-intensive functions, but
I can't dive into the larger problem, at least not right now.
(And no, it's not LRA, AFAICT.)

Tested using the gcc-8.1.0 release on aarch64-unknown-linux-gnu,
powerpc64-unknown-linux-gnu, x86_64-pc-linux-gnu and partly a
cross to mipsisa32r2el-linux-gnu.

The patch is generated using "-w" because otherwise
re-indentation for the single_use-wrapping makes the diff
practically unreadable.  The test-case uses rtl-scanning so the
result can be verified closer to the actual code but still not
restricting it to assembly code.  Scanning just after the (last)
match.pd pass would still be far from the divmod-combining
effects done in rtl.  Unfortunately, the pattern for the
truncated multiplication is observable in various numbers
depending on both the target and the bug, so to avoid littering
I restrict it to my target of interest at the moment.  At least
the presence and absence of the "1/48"-constant is stable across
the line with/without the bug with no false positives.

This is a optimization regression counting from at least 4.9.2,
most likely since r218211, so: ok for for all open branches?

Please also see the ":s"-rant after the patch.

gcc/testsuite:
	PR tree-optimization/85726
	* pr85726.c: New test.

gcc:
	* match.pd (div (div C1) C2): Guard with single_use of inner div.
	(mult (mult C1) C2): Similar.


Now a rant on the match.pd ":s" flag, which reasonable people
may reasonably suggest I should have used in the patch instead
of the (if (single_use ...)).

Initially, I got the match.pd-language (which deserves a proper
name) all wrong.  Then I read the documentation and still got it
wrong.  I "misunderstood" that the ":s" on an operand "O" was
supposed to have the effect of conditionalize the replacement
"R" by wrapping it in "(if (single_use (O)) R)" as in the
suggested patch (above).  To wit, this does not work; it will
*not* stop the replacement as seen in the test-case (THIS IS NOT
A SUGGESTED PATCH):

 (for div (trunc_div exact_div)
  (simplify
-  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
+  (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2)
   (with {
     bool overflow_p;
     wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),

In PR69556, it seems other people seem to have read the
documentation of ":s" the same way, but are corrected by other
comments there, so I guess it's not my reading that's flawed.

I suggest preferably (1) correcting the semantics of ":s" to do
as the documentation says because I don't understand the
explanation in PR69556 comment #4 that the replacement "is still
allowed if it is a single operation as that replaces at least
one other (the one we are simplifying)"; I see that as a
complete nullification of the :s flag, making it a nop.
Alternatively (2), if the :s is *not* a nop in some case add an
example of that case and sufficient explanation to
match-and-simplify.texi *and* the suggested ":S" flag
(i.e. *really* conditionalize the replacement by a single-use of
that operand).

That's just a detail, I do like that language.

brgds, H-P

Comments

Marc Glisse May 10, 2018, 8:33 a.m. | #1
(not a review)

On Thu, 10 May 2018, Hans-Peter Nilsson wrote:

> Replacing a division feeding a division helps only when the
> second division is the only user, and "fusing" the divisions is

Well, that's not quite true.
int x, y;
void f(int n){
   int c = 3 << 20;
   x = n / c;
   y = n / c;
}

Here we can optimize the last division to y = 0. After your patch, we 
likely need VRP to do that simplification. There are probably more 
complicated transformations this disables.

> downright bad if another user of the result of first division is
> a modulus of the same value as the second division, forming a
> divmod pair.  See the test-case, where for the tested
> architectures (which all fail the test-case before the patch)
> the div and mod are implemented using the high-part of a widened
> multiplication and shift, emitted separately but combined as
> late as in rtl, with the multiplicaton and shift re-used.  That
> of course does not happen if later passes see (y / 48; y % 3).
> While in match.pd, I noticed the corresponding mul-mul match,
> which I believe should be guarded the same way.

Did you notice bad codegen because of the multiplication? You are only 
adding a test for divisions. I am asking because I know such a change will 
help some cases and hurt others...

> Now a rant on the match.pd ":s" flag, which reasonable people
> may reasonably suggest I should have used in the patch instead
> of the (if (single_use ...)).
>
> Initially, I got the match.pd-language (which deserves a proper
> name) all wrong.  Then I read the documentation and still got it
> wrong.  I "misunderstood" that the ":s" on an operand "O" was
> supposed to have the effect of conditionalize the replacement
> "R" by wrapping it in "(if (single_use (O)) R)" as in the
> suggested patch (above).  To wit, this does not work; it will
> *not* stop the replacement as seen in the test-case (THIS IS NOT
> A SUGGESTED PATCH):
>
> (for div (trunc_div exact_div)
>  (simplify
> -  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
> +  (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2)
>   (with {
>     bool overflow_p;
>     wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>
> In PR69556, it seems other people seem to have read the
> documentation of ":s" the same way, but are corrected by other
> comments there, so I guess it's not my reading that's flawed.
>
> I suggest preferably (1) correcting the semantics of ":s" to do
> as the documentation says because I don't understand the
> explanation in PR69556 comment #4 that the replacement "is still
> allowed if it is a single operation as that replaces at least
> one other (the one we are simplifying)"; I see that as a
> complete nullification of the :s flag, making it a nop.
> Alternatively (2), if the :s is *not* a nop in some case add an
> example of that case and sufficient explanation to
> match-and-simplify.texi *and* the suggested ":S" flag
> (i.e. *really* conditionalize the replacement by a single-use of
> that operand).

To simplify, the goal of :s is to avoid increasing the number of 
instructions. Normally, the transformation output is smaller (or the same 
size but cheaper, simpler, more canonical) than the input. But if we can't 
get rid of some of the input intermediate results, the size may still 
increase. :s does test single_use, but it has a special case. If the 
output (possibly after several rounds of resimplifications) is at most one 
instruction, then it cannot be larger than the input (we are at least 
getting rid of the instruction we called the simplification on), so the 
transformation does happen even if !single_use. Originally this was only 
done when the simplification led to an SSA_NAME or a constant, IIRC.

Then people start wanting single_use restrictions to reduce register 
pressure, reduce the size / number of constants, etc. And not all of those 
want exactly the same conditions.

It is useful for high-level transformations to push the canonicalization 
as far as possible, to notice equivalent quantities or constant bounds in 
particular. So on a case by case basis, we use :s or single_use or 
whatever...

If we use both y=x/3 and z=x/15 in the same function, should we make an 
effort to detect it and rewrite to z=y/5?
Segher Boessenkool May 10, 2018, 12:23 p.m. | #2
On Thu, May 10, 2018 at 10:33:39AM +0200, Marc Glisse wrote:
> (not a review)
> 
> On Thu, 10 May 2018, Hans-Peter Nilsson wrote:
> 
> >Replacing a division feeding a division helps only when the
> >second division is the only user, and "fusing" the divisions is
> 
> Well, that's not quite true.
> int x, y;
> void f(int n){
>   int c = 3 << 20;
>   x = n / c;
>   y = x / c;
> }

[ fixed the typo ]

> Here we can optimize the last division to y = 0. After your patch, we 
> likely need VRP to do that simplification. There are probably more 
> complicated transformations this disables.

Without the replacement we have two dependent divisions; with the
replacement we have two independent divisions, and that is much faster
on some (many?) systems (that can do two fixed-point divisions in parallel),
even if things do not simplify further.


Segher
Richard Biener May 11, 2018, 12:17 p.m. | #3
On Thu, May 10, 2018 at 1:41 AM, Hans-Peter Nilsson
<hans-peter.nilsson@axis.com> wrote:

[...]

With all the followups this generated I'm still going to reply here.

> Now a rant on the match.pd ":s" flag, which reasonable people
> may reasonably suggest I should have used in the patch instead
> of the (if (single_use ...)).
>
> Initially, I got the match.pd-language (which deserves a proper
> name) all wrong.  Then I read the documentation and still got it
> wrong.  I "misunderstood" that the ":s" on an operand "O" was
> supposed to have the effect of conditionalize the replacement
> "R" by wrapping it in "(if (single_use (O)) R)" as in the
> suggested patch (above).  To wit, this does not work; it will
> *not* stop the replacement as seen in the test-case (THIS IS NOT
> A SUGGESTED PATCH):
>
>  (for div (trunc_div exact_div)
>   (simplify
> -  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
> +  (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2)
>    (with {
>      bool overflow_p;
>      wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>
> In PR69556, it seems other people seem to have read the
> documentation of ":s" the same way, but are corrected by other
> comments there, so I guess it's not my reading that's flawed.

The documentation doesn't mention a few details, yes ... I will
amend it.

> I suggest preferably (1) correcting the semantics of ":s" to do
> as the documentation says because I don't understand the
> explanation in PR69556 comment #4 that the replacement "is still
> allowed if it is a single operation as that replaces at least
> one other (the one we are simplifying)"; I see that as a
> complete nullification of the :s flag, making it a nop.

It doesn't make it a nop in all cases.  One motivation of the
match.pd language was to allow pattern-matching and simplification
in during value-numbering in an efficient way.  So when I originally
added :s it failed to detect equivalences it could otherwise handle,
like computing that (x / 5) / 6 is equal to x / 30 because when
one (rightfully so!) adds :s to the x / 5 then for the purpose of
canonicalization (which is what value-numbering is interested in)
the "simplification" no longer happens.  So based on what
our value-numbering implementation handles I restricted :s to
be in effect only if the simplification causes extra expressions
beside the toplevel one to be emitted.  Marc gave good examples
for the (rdiv (rdiv:s ...) ...) pattern.

That's not perfect and we've added explicit single_use () checks
in (too many) places.

Most issues pop up with stmt-local foldings which indeed should
better treat :s "literally".  That's because we are lacking a stmt
folder with a global cost model - if you have two expressions like
tem = x/5; (tem / 7) and (tem / 8) then you _do_ want to simplify
them to x / 35 and x / 40 because tem becomes dead.

I suppose we need a way to make :s behave depending on
context (value numbering or stmt folding).  The current
implementation doesn't make that too easy - in fact the
current semantic is very simple in the implementation.

I also think that if the result evaluates to a non-operation
(constant or operand) we do not want to "honor" :s either.

> Alternatively (2), if the :s is *not* a nop in some case add an
> example of that case and sufficient explanation to
> match-and-simplify.texi *and* the suggested ":S" flag
> (i.e. *really* conditionalize the replacement by a single-use of
> that operand).

I didn't want to add :S.  Beacuse it's not about :s or :S it's really
about the context in which the patterns are used.  In some
sense :s should even be implicitely present on all expressions
since originally we just moved fold-const.c patterns to match.pd
and there's no such thing as multi-uses in GENERIC (ok, only if you
ignore SAVE_EXPR).

That said, without actually reviewing the patch, there's room for
improvement.  That is not adding :S.  What it exactly is remains
to be determined ;)

Richard.

> That's just a detail, I do like that language.
>
> brgds, H-P
Richard Biener May 11, 2018, 12:30 p.m. | #4
On Fri, May 11, 2018 at 2:17 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, May 10, 2018 at 1:41 AM, Hans-Peter Nilsson
> <hans-peter.nilsson@axis.com> wrote:
>
> [...]
>
> With all the followups this generated I'm still going to reply here.
>
>> Now a rant on the match.pd ":s" flag, which reasonable people
>> may reasonably suggest I should have used in the patch instead
>> of the (if (single_use ...)).
>>
>> Initially, I got the match.pd-language (which deserves a proper
>> name) all wrong.  Then I read the documentation and still got it
>> wrong.  I "misunderstood" that the ":s" on an operand "O" was
>> supposed to have the effect of conditionalize the replacement
>> "R" by wrapping it in "(if (single_use (O)) R)" as in the
>> suggested patch (above).  To wit, this does not work; it will
>> *not* stop the replacement as seen in the test-case (THIS IS NOT
>> A SUGGESTED PATCH):
>>
>>  (for div (trunc_div exact_div)
>>   (simplify
>> -  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
>> +  (div (div:s @0 INTEGER_CST@1) INTEGER_CST@2)
>>    (with {
>>      bool overflow_p;
>>      wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
>>
>> In PR69556, it seems other people seem to have read the
>> documentation of ":s" the same way, but are corrected by other
>> comments there, so I guess it's not my reading that's flawed.
>
> The documentation doesn't mention a few details, yes ... I will
> amend it.

As follows, does this make things more clear?

Btw, since your case is about not disrupting a div-mod pair a fix would be
to expose that pairing earlier (very early actually...).  We do expose it
during the widen_mul pass which more or less runs right before RTL
expansion only.  Jakubs suggestion to explicitely check for a
2nd use in a modulo operation would also work but that gets a bit
awkward given it requires immediate uses which are _not_ required
to be present or up-to-date during the folding machinery (only SSA use->def
chains are required to be valid).

Thanks,
Richard.

diff --git a/gcc/doc/match-and-simplify.texi b/gcc/doc/match-and-simplify.texi
index 6bc2c47bee7..024d747cafd 100644
--- a/gcc/doc/match-and-simplify.texi
+++ b/gcc/doc/match-and-simplify.texi
@@ -250,7 +250,9 @@ come second in commutative expressions.

 The second supported flag is @code{s} which tells the code
 generator to fail the pattern if the expression marked with
-@code{s} does have more than one use.  For example in
+@code{s} does have more than one use and the simplification
+results in an expression with more than one operator.
+For example in

 @smallexample
 (simplify
@@ -261,6 +263,14 @@ generator to fail the pattern if the expression marked with
 this avoids the association if @code{(pointer_plus @@0 @@1)} is
 used outside of the matched expression and thus it would stay
 live and not trivially removed by dead code elimination.
+Now consider @code{((x + 3) + -3)} with the temporary
+holding @code{(x + 3)} used elsewhere.  This simplifies down
+to @code{x} which is desirable and thus flagging with @code{s}
+does not prevent the transform.  Now consider @code{((x + 3) + 1)}
+which simplifies to @code{(x + 4)}.  Despite being flagged with
+@code{s} the simplification will be performed.  The
+simplification of @code{((x + a) + 1)} to @code{(x + (a + 1))} will
+not performed in this case though.

 More features exist to avoid too much repetition.



>> I suggest preferably (1) correcting the semantics of ":s" to do
>> as the documentation says because I don't understand the
>> explanation in PR69556 comment #4 that the replacement "is still
>> allowed if it is a single operation as that replaces at least
>> one other (the one we are simplifying)"; I see that as a
>> complete nullification of the :s flag, making it a nop.
>
> It doesn't make it a nop in all cases.  One motivation of the
> match.pd language was to allow pattern-matching and simplification
> in during value-numbering in an efficient way.  So when I originally
> added :s it failed to detect equivalences it could otherwise handle,
> like computing that (x / 5) / 6 is equal to x / 30 because when
> one (rightfully so!) adds :s to the x / 5 then for the purpose of
> canonicalization (which is what value-numbering is interested in)
> the "simplification" no longer happens.  So based on what
> our value-numbering implementation handles I restricted :s to
> be in effect only if the simplification causes extra expressions
> beside the toplevel one to be emitted.  Marc gave good examples
> for the (rdiv (rdiv:s ...) ...) pattern.
>
> That's not perfect and we've added explicit single_use () checks
> in (too many) places.
>
> Most issues pop up with stmt-local foldings which indeed should
> better treat :s "literally".  That's because we are lacking a stmt
> folder with a global cost model - if you have two expressions like
> tem = x/5; (tem / 7) and (tem / 8) then you _do_ want to simplify
> them to x / 35 and x / 40 because tem becomes dead.
>
> I suppose we need a way to make :s behave depending on
> context (value numbering or stmt folding).  The current
> implementation doesn't make that too easy - in fact the
> current semantic is very simple in the implementation.
>
> I also think that if the result evaluates to a non-operation
> (constant or operand) we do not want to "honor" :s either.
>
>> Alternatively (2), if the :s is *not* a nop in some case add an
>> example of that case and sufficient explanation to
>> match-and-simplify.texi *and* the suggested ":S" flag
>> (i.e. *really* conditionalize the replacement by a single-use of
>> that operand).
>
> I didn't want to add :S.  Beacuse it's not about :s or :S it's really
> about the context in which the patterns are used.  In some
> sense :s should even be implicitely present on all expressions
> since originally we just moved fold-const.c patterns to match.pd
> and there's no such thing as multi-uses in GENERIC (ok, only if you
> ignore SAVE_EXPR).
>
> That said, without actually reviewing the patch, there's room for
> improvement.  That is not adding :S.  What it exactly is remains
> to be determined ;)
>
> Richard.
>
>> That's just a detail, I do like that language.
>>
>> brgds, H-P

Patch

--- /dev/null	Tue Oct 29 15:57:07 2002
+++ gcc.dg/pr85726.c	Tue May  8 07:18:19 2018
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-rtl-fwprop1" } */
+
+/* There should just be one mult-as-div sequence, re-used for the mod,
+   not one for each of the y / 3 and y % 3 (as in due to suboptimal
+   simplification as y / 48 and y % 3) and there should no be a trace of
+   the constant used for the 1 / 48 mult-as-div. */
+int ww, vv;
+int x(int y)
+{
+  int z = y / 16;
+  int w = z / 3;
+  int v = z % 3;
+  ww = w;
+  return v;
+}
+/* { dg-final { scan-rtl-dump-times "truncate:SI .lshiftrt:DI .mult:DI .sign_extend:DI" 1 "fwprop1" { target mips*-*-* } } }  */
+/* { dg-final { scan-rtl-dump-times " .0x2aaaaaab" 0 "fwprop1" } }  */

diff --git a/gcc/match.pd b/gcc/match.pd
index 0de4432..e8ebeac 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -278,11 +278,14 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
       && TYPE_UNSIGNED (type))
   (trunc_div @0 @1)))
 
-/* Combine two successive divisions.  Note that combining ceil_div
-   and floor_div is trickier and combining round_div even more so.  */
+/* Combine two successive divisions, if the second is the only
+   user of the result of the first, or else we'll just end up
+   with two divisions anyway.  Note that combining ceil_div and
+   floor_div is trickier and combining round_div even more so.  */
 (for div (trunc_div exact_div)
  (simplify
-  (div (div @0 INTEGER_CST@1) INTEGER_CST@2)
+  (div (div@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+  (if (single_use (@3))
    (with {
      bool overflow_p;
      wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
@@ -292,12 +295,13 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
      (div @0 { wide_int_to_tree (type, mul); })
      (if (TYPE_UNSIGNED (type)
 	  || mul != wi::min_value (TYPE_PRECISION (type), SIGNED))
-     { build_zero_cst (type); })))))
+      { build_zero_cst (type); }))))))
 
 /* Combine successive multiplications.  Similar to above, but handling
    overflow is different.  */
 (simplify
- (mult (mult @0 INTEGER_CST@1) INTEGER_CST@2)
+ (mult (mult@3 @0 INTEGER_CST@1) INTEGER_CST@2)
+ (if (single_use (@3))
   (with {
     bool overflow_p;
     wide_int mul = wi::mul (wi::to_wide (@1), wi::to_wide (@2),
@@ -306,7 +310,7 @@  DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
    /* Skip folding on overflow: the only special case is @1 * @2 == -INT_MIN,
       otherwise undefined overflow implies that @0 must be zero.  */
    (if (!overflow_p || TYPE_OVERFLOW_WRAPS (type))
-   (mult @0 { wide_int_to_tree (type, mul); }))))
+    (mult @0 { wide_int_to_tree (type, mul); })))))
 
 /* Optimize A / A to 1.0 if we don't care about
    NaNs or Infinities.  */