
[WIP] Re: Inefficient end-of-loop value computation - missed optimization somewhere?

Message ID 201203081429.q28ETB5T000824@d06av02.portsmouth.uk.ibm.com
State New

Commit Message

Ulrich Weigand March 8, 2012, 2:29 p.m. UTC
Richard Guenther wrote:
> On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> > I'll still need to do proper testing and benchmarking, but I thought
> > I'd post the patch anyway just as a heads-up ...
> >
> > Does this look reasonable to you?
> 
> Yes, that looks reasonable.  Though instead of unsigned_type_for
> please use build_nonstandard_integer_type instead (if the type
> is not already unsigned).  unsigned_type_for does not necessarily
> preserve type-precision and may even return NULL.

OK, I've changed the patch to use build_nonstandard_integer_type, and it
still fixes the original problem.  However, on further examination it
turns out that this "fix" is not because the expression is actually
simplified at tree level.  Rather, it is just that because of some
minor reassociation (the +1 is moved to the end), the *RTL* optimizers
now see a (just as complex but) slightly different RTL sequence, and
therefore combine now just happens to be able to fold everything
together (using the RTL-level simplify_plus_minus machinery).

Even worse, the patch causes a number of regressions:

> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "\\+" 0
> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "n_. \\* n_." 1
> FAIL: gcc.dg/vect/no-scevccp-noreassoc-outer-4.c scan-tree-dump-times vect "OUTER LOOP VECTORIZED." 1


The loop-15.c problem is probably just a missed optimization
elsewhere that can be fixed.  The problem is that a loop-end
value used to be computed as "(n - 1) * n + n" which got
simplified via fold_plusminus_mult_expr to "n * n".  When
computing in unsigned, however, the expression becomes something
along the lines of

  "[...] * (unsigned) n + (unsigned) n"

and the top-level of fold_binary_loc removes the *outermost* NOP_EXPRs
resulting in

  "[...] * (unsigned) n + n"

which fold_plusminus_mult_expr no longer recognizes as something
it can handle (since "(unsigned) n" is not operand_equal_p to "n").
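
For illustration, a loop of roughly the following shape gives rise to
such an end value (an illustrative sketch only, not the actual loop-15.c
testcase):

  int
  foo (int n)
  {
    int i, s = 0;
    for (i = 0; i < n; i++)
      s += n;   /* Final value n * n; SCEV computes it from the latch
                   count n - 1 as (n - 1) * n + n.  */
    return s;
  }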

I think this can be fixed by having fold_plusminus_mult_expr call
STRIP_NOPS on sub-operands, as is already done elsewhere in fold_binary_loc.

This basically fixes the test case (due to some remaining casts, the
scan-tree-dump patterns still don't match as-is, but the expected code
is again generated).


The no-scevccp-noreassoc-outer-4.c problem is more difficult, and would
seem to point to a fundamental problem with performing the computation
in unsigned mode.  In this case, the end value of an inner loop forms
part of a scalar evolution of the outer loop.  When computing the end
value in unsigned type, that scalar evolution is no longer detected.
In part, this is because the scalar evolution machinery might need to
be improved in handling casts.  In particular, in some places it only
does STRIP_USELESS_TYPE_CONVERSION; changing those to STRIP_NOPS makes
it detect the evolution again.  But I'm not fully convinced this is
a safe change ...   In any case, this still doesn't make the outer
loop vectorizable, since it is no longer provably finite.  Without
the patch, this could be proven *because* the computation of the
inner loop's end value used signed arithmetic which cannot overflow.
By performing the computation in unsigned, this information is in
fact lost.


This would seem to indicate that unconditionally using unsigned
arithmetic even if not strictly necessary might not always be
the best solution either.  I've gone back and looked at the
original problem, namely that

  (start + 1) + (int)((unsigned int) ~start + (unsigned int) end)

is not simplified by fold-const.

Interestingly enough, it *is* valid to simplify this expression
to just plain "end", even with signed arithmetic, since the
transformation does not introduce any new potential overflows.
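
To spell out the reasoning: in N-bit two's complement arithmetic we have
~start == -start - 1, and the inner sum is evaluated modulo 2^N, so

  (start + 1) + (int)((unsigned int) ~start + (unsigned int) end)
    == (start + 1) + (int)((unsigned int)(-start - 1) + (unsigned int) end)
    == (start + 1) + (-start - 1 + end)
    == end

Since "end" is itself an operand of the original expression, replacing
the whole computation by it cannot introduce an overflow that was not
already present.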

This is actually recognized in fold_binary_loc, but only in one special
case.  See the code around the following comment:
                  /* The only case we can still associate with two variables
                     is if they are the same, modulo negation.  */

Unfortunately, this handles only A and -A; it doesn't recognize that
A+1 is in fact the negation of ~A ...

There is also code in tree-ssa-forwprop.c:associate_plusminus that
handles quite a number of special cases where re-association *is*
allowed even for signed arithmetic:

        (A +- B) - A       ->  +- B
        (A +- B) -+ B      ->  A
        (CST +- A) +- CST  ->  CST +- A
        (A + CST) +- CST   ->  A + CST
        ~A + A             ->  -1
        ~A + 1             ->  -A 
        A - (A +- B)       ->  -+ B
        A +- (B +- A)      ->  +- B
        CST +- (CST +- A)  ->  CST +- A
        CST +- (A +- CST)  ->  CST +- A
        A + ~A             ->  -1

This handles some cases involving ~, but still not quite the case
we need for this expression.  In addition, the forward propagation logic
doesn't seem to handle casts very well (as opposed to fold-const, which
makes liberal use of STRIP_NOPS).
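
As a concrete instance of the cast issue, the end-of-loop value above is
emitted as a GIMPLE sequence along these lines (SSA names invented for
illustration):

  t1 = ~start;
  t2 = (unsigned int) t1;
  t3 = (unsigned int) end;
  t4 = t2 + t3;
  t5 = (int) t4;
  t6 = start + 1;
  i  = t6 + t5;    /* ought to simplify to just i = end */

so the pattern matching would have to look through the (unsigned int)
and (int) conversions before anything can cancel.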


I'm wondering where to go from there:

- Why are those special re-association cases handled only in forwprop,
  and not in fold-const?  I would have expected forwprop to just propagate
  subexpressions into a tree and then use fold-const to actually simplify ...

- Should I try to improve forwprop to handle casts and additional re-
  association cases until it handles the above expression?

- Or should the logic in fold-const be improved to add additional cases
  of re-association?

Thanks for any further suggestions!


I've attached some of the changes mentioned above for reference.

Bye,
Ulrich


Patch: Compute loop-end value using unsigned type.

Comments

Richard Biener March 12, 2012, 10:56 a.m. UTC | #1
On Thu, Mar 8, 2012 at 3:29 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> Richard Guenther wrote:
>> On Tue, Feb 28, 2012 at 4:10 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
>> > I'll still need to do proper testing and benchmarking, but I thought
>> > I'd post the patch anyway just as a heads-up ...
>> >
>> > Does this look reasonable to you?
>>
>> Yes, that looks reasonable.  Though instead of unsigned_type_for
>> please use build_nonstandard_integer_type instead (if the type
>> is not already unsigned).  unsigned_type_for does not necessarily
>> preserve type-precision and may even return NULL.
>
> OK, I've changed the patch to use build_nonstandard_integer_type, and it
> still fixes the original problem.  However, on further examination it
> turns out that this "fix" is not because the expression is actually
> simplified at tree level.  Rather, it is just that because of some
> minor reassociation (the +1 is moved to the end), the *RTL* optimizers
> now see a (just as complex but) slightly different RTL sequence, and
> therefore combine now just happens to be able to fold everything
> together (using the RTL-level simplify_plus_minus machinery).
>
> Even worse, the patch causes a number of regressions:
>
>> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "\\+" 0
>> FAIL: gcc.dg/tree-ssa/loop-15.c scan-tree-dump-times optimized "n_. \\* n_." 1
>> FAIL: gcc.dg/vect/no-scevccp-noreassoc-outer-4.c scan-tree-dump-times vect "OUTER LOOP VECTORIZED." 1
>
>
> The loop-15.c problem is probably just a missed optimization
> elsewhere that can be fixed.  The problem is that a loop-end
> value used to be computed as "(n - 1) * n + n" which got
> simplified via fold_plusminus_mult_expr to "n * n".  When
> computing in unsigned, however, the expression becomes something
> along the lines of
>
>  "[...] * (unsigned) n + (unsigned) n"
>
> and the top-level of fold_binary_loc removes the *outermost* NOP_EXPRs
> resulting in
>
>  "[...] * (unsigned) n + n"
>
> which fold_plusminus_mult_expr no longer recognizes as something
> it can handle (since "(unsigned) n" is not operand_equal_p to "n").
>
> I think this can be fixed by having fold_plusminus_mult_expr call
> STRIP_NOPS on sub-operands, as is already done elsewhere in fold_binary_loc.
>
> This basically fixes the test case (due to some remaining casts, the
> scan-tree-dump patterns still don't match as-is, but the expected code
> is again generated).
>
>
> The no-scevccp-noreassoc-outer-4.c problem is more difficult, and would
> seem to point to a fundamental problem with performing the computation
> in unsigned mode.  In this case, the end value of an inner loop forms
> part of a scalar evolution of the outer loop.  When computing the end
> value in unsigned type, that scalar evolution is no longer detected.
> In part, this is because the scalar evolution machinery might need to
> be improved in handling casts.  In particular, in some places it only
> does STRIP_USELESS_TYPE_CONVERSION; changing those to STRIP_NOPS makes
> it detect the evolution again.  But I'm not fully convinced this is
> a safe change ...   In any case, this still doesn't make the outer
> loop vectorizable, since it is no longer provably finite.  Without
> the patch, this could be proven *because* the computation of the
> inner loop's end value used signed arithmetic which cannot overflow.
> By performing the computation in unsigned, this information is in
> fact lost.
>
>
> This would seem to indicate that unconditionally using unsigned
> arithmetic even if not strictly necessary might not always be
> the best solution either.  I've gone back and looked at the
> original problem, namely that
>
>  (start + 1) + (int)((unsigned int) ~start + (unsigned int) end)
>
> is not simplified by fold-const.
>
> Interestingly enough, it *is* valid to simplify this expression
> to just plain "end", even with signed arithmetic, since the
> transformation does not introduce any new potential overflows.
>
> This is actually recognized in fold_binary_loc, but only in one special
> case.  See the code around the following comment:
>                  /* The only case we can still associate with two variables
>                     is if they are the same, modulo negation.  */
>
> Unfortunately, this handles only A and -A; it doesn't recognize that
> A+1 is in fact the negation of ~A ...
>
> There is also code in tree-ssa-forwprop.c:associate_plusminus that
> handles quite a number of special cases where re-association *is*
> allowed even for signed arithmetic:
>
>        (A +- B) - A       ->  +- B
>        (A +- B) -+ B      ->  A
>        (CST +- A) +- CST  ->  CST +- A
>        (A + CST) +- CST   ->  A + CST
>        ~A + A             ->  -1
>        ~A + 1             ->  -A
>        A - (A +- B)       ->  -+ B
>        A +- (B +- A)      ->  +- B
>        CST +- (CST +- A)  ->  CST +- A
>        CST +- (A +- CST)  ->  CST +- A
>        A + ~A             ->  -1
>
> This handles some cases involving ~, but still not quite the case
> we need for this expression.  In addition, the forward propagation logic
> doesn't seem to handle casts very well (as opposed to fold-const, which
> makes liberal use of STRIP_NOPS).
>
>
> I'm wondering where to go from there:
>
> - Why are those special re-association cases handled only in forwprop,
>  and not in fold-const?  I would have expected forwprop to just propagate
>  subexpressions into a tree and then use fold-const to actually simplify ...

We are moving away from GENERIC expression simplification towards
simplifications/combines on GIMPLE SSA (Andrew Pinski is working
on this a lot on his branch at the moment).

> - Should I try to improve forwprop to handle casts and additional re-
>  association cases until it handles the above expression?

Yes, ideally by trying to sub-divide this task into separate profitable
transforms.  Maybe Andrew can chime in here?

> - Or should the logic in fold-const be improved to add additional cases
>  of re-association?

I'd say no.

> Thanks for any further suggestions!
>
>
> I've attached some of the changes mentioned above for reference.
>
> Bye,
> Ulrich
>
>
> Patch: Compute loop-end value using unsigned type.
>
> Index: gcc/tree-scalar-evolution.c
> ===================================================================
> --- gcc/tree-scalar-evolution.c (revision 184625)
> +++ gcc/tree-scalar-evolution.c (working copy)
> @@ -474,11 +474,22 @@
>            return chrec_dont_know;
>          else
>            {
> +             tree type = TREE_TYPE (evolution_fn), utype = type;
>              tree res;
>
> +             /* Since the number of iterations is always unsigned, we perform
> +                the computation in an unsigned type (if feasible) to allow for
> +                improved simplification of subexpressions.  */
> +             if ((INTEGRAL_TYPE_P (type) && !TYPE_UNSIGNED (type))
> +                 || POINTER_TYPE_P (type))
> +               utype = build_nonstandard_integer_type
> +                         (TYPE_PRECISION (type), 1);
> +
>              /* evolution_fn is the evolution function in LOOP.  Get
>                 its value in the nb_iter-th iteration.  */
> -             res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
> +             res = chrec_convert (utype, evolution_fn, NULL);
> +             res = chrec_apply (inner_loop->num, res, nb_iter);
> +             res = chrec_convert (type, res, NULL);
>
>              if (chrec_contains_symbols_defined_in_loop (res, loop->num))
>                res = instantiate_parameters (loop, res);
>
>
>
> Patch: Use STRIP_NOPS on subexpressions in fold_plusminus_mult_expr
>
> Index: gcc/fold-const.c
> ===================================================================
> --- gcc/fold-const.c    (revision 184625)
> +++ gcc/fold-const.c    (working copy)
> @@ -7081,6 +7081,8 @@
>     {
>       arg00 = TREE_OPERAND (arg0, 0);
>       arg01 = TREE_OPERAND (arg0, 1);
> +      STRIP_NOPS (arg00);
> +      STRIP_NOPS (arg01);
>     }
>   else if (TREE_CODE (arg0) == INTEGER_CST)
>     {
> @@ -7099,6 +7101,8 @@
>     {
>       arg10 = TREE_OPERAND (arg1, 0);
>       arg11 = TREE_OPERAND (arg1, 1);
> +      STRIP_NOPS (arg10);
> +      STRIP_NOPS (arg11);
>     }
>   else if (TREE_CODE (arg1) == INTEGER_CST)
>     {
>
>
>
> Patch: Use STRIP_NOPS when following scalar evolutions
>
> Index: gcc/tree-scalar-evolution.c
> ===================================================================
> --- gcc/tree-scalar-evolution.c (revision 184625)
> +++ gcc/tree-scalar-evolution.c (working copy)
> @@ -1154,8 +1165,8 @@
>       rhs0 = TREE_OPERAND (expr, 0);
>       rhs1 = TREE_OPERAND (expr, 1);
>       type = TREE_TYPE (rhs0);
> -      STRIP_USELESS_TYPE_CONVERSION (rhs0);
> -      STRIP_USELESS_TYPE_CONVERSION (rhs1);
> +      STRIP_NOPS (rhs0);
> +      STRIP_NOPS (rhs1);
>       res = follow_ssa_edge_binary (loop, at_stmt, type, rhs0, code, rhs1,
>                                    halting_phi, evolution_of_loop, limit);
>       break;
> @@ -1168,8 +1179,8 @@
>          rhs0 = TREE_OPERAND (expr, 0);
>          rhs1 = TREE_OPERAND (expr, 1);
>          type = TREE_TYPE (rhs0);
> -         STRIP_USELESS_TYPE_CONVERSION (rhs0);
> -         STRIP_USELESS_TYPE_CONVERSION (rhs1);
> +         STRIP_NOPS (rhs0);
> +         STRIP_NOPS (rhs1);
>          res = follow_ssa_edge_binary (loop, at_stmt, type,
>                                        rhs0, POINTER_PLUS_EXPR, rhs1,
>                                        halting_phi, evolution_of_loop, limit);
>
>
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  Ulrich.Weigand@de.ibm.com
>
Richard Biener May 22, 2012, 11:14 a.m. UTC | #2
On Fri, May 18, 2012 at 10:27 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> Richard Guenther wrote:
>> On Thu, Mar 8, 2012 at 3:29 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
>> > - Should I try to improve forwprop to handle casts and additional re-
>> > association cases until it handles the above expression?
>>
>> Yes, ideally by trying to sub-divide this task into separate profitable
>> transforms.  Maybe Andrew can chime in here?
>
> I finally got some time to look into this in detail.  The various special-
> case transforms in associate_plusminus all transform a plus/minus expression
> tree into either a single operand, a negated operand, or a single plus or
> minus of two operands.  This is valid as long as we can prove that the
> newly introduced expression can never overflow (if we're doing signed
> arithmetic).  This is true in those special cases due to either:
>
>  - If all sub-expressions of the contracted plus/minus tree are themselves
>   performed in signed arithmetic and thus are guaranteed to never overflow,
>   their result equals the "true" arithmetic result if we were to perform
>   the computation in the actual integers with unlimited precision.
>
>   Now, "true" arithmetic is associative.  Thus, if we re-associate the
>   expression tree such that everything cancels out and just a single
>   operation A +/- B (or -A) remains, the "true" result of that operation
>   must equal the true result of the original expression -- which means
>   in particular that it lies within the range of the underlying data
>   type, and hence that final operation itself cannot overflow.
>
>  - In the case of introducing a negation, we only need to be sure that
>   its operand can never be the minimum value of its type range.  This
>   can on occasion be proven via a form of value range tracking.
>
>   For example, when performing the transformation ~A + 1 --> -A, we note
>   that ~A cannot be INT_MAX, since we can add 1 to it without overflow;
>   and therefore A cannot be INT_MIN.
>
>
> Now, using those two rules, we can actually prove many more cases where
> reassociation is possible.  For example, we can transform (A + (B + C)) - C
> to A + B  (due to the first rule).   We can also transform the original
> expression resulting from end-of-loop value computation
>   (A + 1) + (int) ((unsigned) ~A + (unsigned) B)
> into just "B" -- this can never overflow since we aren't even introducing
> any new expression.
>
> The following patch rewrites associate_plusminus to remove all the
> explicitly coded special cases, and instead performs a scan of the
> plus/minus tree similar to what is done in tree-ssa-reassoc (and also
> in simplify-rtx for that matter).  If this results in an expression
> tree that collapses to just a single operand, or just a single newly
> introduced operation, and -in the latter case- one of the two rules
> above ensure the new operation is safe, the transformation is performed.
>
> This still passes all reassoc tests, and in fact allows removing XFAILs
> from two of them.  It also catches the end-of-loop value computation case.
>
> Tested on i386-linux with no regressions.
>
> OK for mainline?

The point of the special-cases in forwprop was to make them fast to
detect - forwprop should be a pattern-matching thing, much like
combine on RTL.

So, instead of changing forwprop this way can you adjust tree-ssa-reassoc.c
to (again) associate !TYPE_OVERFLOW_WRAPS operations but make
sure we throw away results that would possibly introduce undefined overflow
for !TYPE_OVERFLOW_WRAPS types?  There is a reassoc pass after
loop optimizations, so that should fix it as well, no?
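
Concretely, the check that suggestion implies might look like the
following sketch (reassociation_would_introduce_overflow is a
hypothetical helper, not an existing tree-ssa-reassoc.c interface):

  /* After building the reassociated form of OPS for a type with
     undefined overflow, keep it only if no newly created operation
     can overflow; otherwise leave the original statements in place.  */
  if (INTEGRAL_TYPE_P (type)
      && !TYPE_OVERFLOW_WRAPS (type)
      && reassociation_would_introduce_overflow (ops))  /* hypothetical */
    return;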

Thanks,
Richard.

> Bye,
> Ulrich
>
>
> ChangeLog:
>
>        gcc/
>        * tree-ssa-forwprop.c (enum plus_minus_op_range,
>        struct plus_minus_op_data, struct plus_minus_data): New types.
>        (next_plus_minus_op, remove_plus_minus_op, add_plus_minus_op,
>        add_plus_minus_ops, setup_plus_minus_ops): New functions.
>        (associate_plusminus): Reimplement.
>
>        gcc/testsuite/
>        * gcc.dg/tree-ssa/reassoc-2.c: Remove XFAIL.
>        * gcc.dg/tree-ssa/reassoc-9.c: Update test, remove XFAIL.
>        * gcc.dg/tree-ssa/reassoc-23.c: Update test.
>        * gcc.dg/tree-ssa/reassoc-26.c: New test.
>
>
> Index: gcc/testsuite/gcc.dg/tree-ssa/reassoc-23.c
> ===================================================================
> *** gcc/testsuite/gcc.dg/tree-ssa/reassoc-23.c  (revision 187653)
> --- gcc/testsuite/gcc.dg/tree-ssa/reassoc-23.c  (working copy)
> ***************
> *** 1,5 ****
>  /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-reassoc1" } */
>
>  unsigned int
>  foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d,
> --- 1,5 ----
>  /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-optimized" } */
>
>  unsigned int
>  foo(unsigned int a, unsigned int b, unsigned int c, unsigned int d,
> *************** foo(unsigned int a, unsigned int b, unsi
> *** 13,17 ****
>    return e;
>  }
>
> ! /* { dg-final { scan-tree-dump-times "= 20" 1 "reassoc1"} } */
> ! /* { dg-final { cleanup-tree-dump "reassoc1" } } */
> --- 13,17 ----
>    return e;
>  }
>
> ! /* { dg-final { scan-tree-dump-times "return 20" 1 "optimized"} } */
> ! /* { dg-final { cleanup-tree-dump "optimized" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/reassoc-26.c
> ===================================================================
> *** gcc/testsuite/gcc.dg/tree-ssa/reassoc-26.c  (revision 0)
> --- gcc/testsuite/gcc.dg/tree-ssa/reassoc-26.c  (revision 0)
> ***************
> *** 0 ****
> --- 1,17 ----
> + /* { dg-do compile } */
> + /* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> + int
> + foo (int start, int end)
> + {
> +   int i;
> +
> +   for (i = start; i < end; i++)
> +     ;
> +
> +   return i;
> + }
> +
> + /* Verify that the end-of-loop value is simplified to just "end".  */
> + /* { dg-final { scan-tree-dump-times "i_. = end_." 1 "optimized"} } */
> + /* { dg-final { cleanup-tree-dump "optimized" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/reassoc-2.c
> ===================================================================
> *** gcc/testsuite/gcc.dg/tree-ssa/reassoc-2.c   (revision 187653)
> --- gcc/testsuite/gcc.dg/tree-ssa/reassoc-2.c   (working copy)
> *************** int b0, b1, b2, b3, b4,e;
> *** 14,20 ****
>    return e;
>  }
>
> ! /* We can't reassociate the expressions due to undefined signed overflow.  */
> !
> ! /* { dg-final { scan-tree-dump-times "return 0" 1 "optimized" { xfail *-*-* } } } */
>  /* { dg-final { cleanup-tree-dump "optimized" } } */
> --- 14,18 ----
>    return e;
>  }
>
> ! /* { dg-final { scan-tree-dump-times "return 0" 1 "optimized" } } */
>  /* { dg-final { cleanup-tree-dump "optimized" } } */
> Index: gcc/testsuite/gcc.dg/tree-ssa/reassoc-9.c
> ===================================================================
> *** gcc/testsuite/gcc.dg/tree-ssa/reassoc-9.c   (revision 187653)
> --- gcc/testsuite/gcc.dg/tree-ssa/reassoc-9.c   (working copy)
> ***************
> *** 1,5 ****
>  /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-reassoc1" } */
>
>  int main(int a, int b, int c, int d, int e, int f, int g, int h)
>  {
> --- 1,5 ----
>  /* { dg-do compile } */
> ! /* { dg-options "-O2 -fdump-tree-optimized" } */
>
>  int main(int a, int b, int c, int d, int e, int f, int g, int h)
>  {
> *************** int main(int a, int b, int c, int d, int
> *** 11,18 ****
>    return e;
>  }
>
> ! /* We can always re-associate to a final constant but the current
> !    implementation does not allow easy roll-back without IL changes.  */
> !
> ! /* { dg-final { scan-tree-dump-times "= 20" 1 "reassoc1" { xfail *-*-* } } } */
> ! /* { dg-final { cleanup-tree-dump "reassoc1" } } */
> --- 11,15 ----
>    return e;
>  }
>
> ! /* { dg-final { scan-tree-dump-times "return 20" 1 "optimized" } } */
> ! /* { dg-final { cleanup-tree-dump "optimized" } } */
> Index: gcc/tree-ssa-forwprop.c
> ===================================================================
> *** gcc/tree-ssa-forwprop.c     (revision 187653)
> --- gcc/tree-ssa-forwprop.c     (working copy)
> *************** simplify_bitwise_binary (gimple_stmt_ite
> *** 2188,2462 ****
>  }
>
>
> ! /* Perform re-associations of the plus or minus statement STMT that are
> !    always permitted.  Returns true if the CFG was changed.  */
>
>  static bool
> ! associate_plusminus (gimple_stmt_iterator *gsi)
>  {
> !   gimple stmt = gsi_stmt (*gsi);
> !   tree rhs1 = gimple_assign_rhs1 (stmt);
> !   tree rhs2 = gimple_assign_rhs2 (stmt);
> !   enum tree_code code = gimple_assign_rhs_code (stmt);
> !   bool changed;
>
> !   /* We can't reassociate at all for saturating types.  */
> !   if (TYPE_SATURATING (TREE_TYPE (rhs1)))
> !     return false;
>
> !   /* First contract negates.  */
> !   do
>      {
> !       changed = false;
>
> !       /* A +- (-B) -> A -+ B.  */
> !       if (TREE_CODE (rhs2) == SSA_NAME)
>        {
> !         gimple def_stmt = SSA_NAME_DEF_STMT (rhs2);
> !         if (is_gimple_assign (def_stmt)
> !             && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
> !             && can_propagate_from (def_stmt))
> !           {
> !             code = (code == MINUS_EXPR) ? PLUS_EXPR : MINUS_EXPR;
> !             gimple_assign_set_rhs_code (stmt, code);
> !             rhs2 = gimple_assign_rhs1 (def_stmt);
> !             gimple_assign_set_rhs2 (stmt, rhs2);
> !             gimple_set_modified (stmt, true);
> !             changed = true;
> !           }
>        }
>
> !       /* (-A) + B -> B - A.  */
> !       if (TREE_CODE (rhs1) == SSA_NAME
> !         && code == PLUS_EXPR)
>        {
> !         gimple def_stmt = SSA_NAME_DEF_STMT (rhs1);
> !         if (is_gimple_assign (def_stmt)
> !             && gimple_assign_rhs_code (def_stmt) == NEGATE_EXPR
> !             && can_propagate_from (def_stmt))
>            {
> !             code = MINUS_EXPR;
> !             gimple_assign_set_rhs_code (stmt, code);
> !             rhs1 = rhs2;
> !             gimple_assign_set_rhs1 (stmt, rhs1);
> !             rhs2 = gimple_assign_rhs1 (def_stmt);
> !             gimple_assign_set_rhs2 (stmt, rhs2);
> !             gimple_set_modified (stmt, true);
> !             changed = true;
>            }
>        }
>      }
> -   while (changed);
>
> !   /* We can't reassociate floating-point or fixed-point plus or minus
> !      because of saturation to +-Inf.  */
> !   if (FLOAT_TYPE_P (TREE_TYPE (rhs1))
> !       || FIXED_POINT_TYPE_P (TREE_TYPE (rhs1)))
> !     goto out;
> !
> !   /* Second match patterns that allow contracting a plus-minus pair
> !      irrespective of overflow issues.
> !
> !       (A +- B) - A       ->  +- B
> !       (A +- B) -+ B      ->  A
> !       (CST +- A) +- CST  ->  CST +- A
> !       (A + CST) +- CST   ->  A + CST
> !       ~A + A             ->  -1
> !       ~A + 1             ->  -A
> !       A - (A +- B)       ->  -+ B
> !       A +- (B +- A)      ->  +- B
> !       CST +- (CST +- A)  ->  CST +- A
> !       CST +- (A +- CST)  ->  CST +- A
> !       A + ~A             ->  -1
>
> !      via commutating the addition and contracting operations to zero
> !      by reassociation.  */
>
> !   if (TREE_CODE (rhs1) == SSA_NAME)
>      {
> !       gimple def_stmt = SSA_NAME_DEF_STMT (rhs1);
> !       if (is_gimple_assign (def_stmt) && can_propagate_from (def_stmt))
>        {
> !         enum tree_code def_code = gimple_assign_rhs_code (def_stmt);
> !         if (def_code == PLUS_EXPR
> !             || def_code == MINUS_EXPR)
> !           {
> !             tree def_rhs1 = gimple_assign_rhs1 (def_stmt);
> !             tree def_rhs2 = gimple_assign_rhs2 (def_stmt);
> !             if (operand_equal_p (def_rhs1, rhs2, 0)
> !                 && code == MINUS_EXPR)
> !               {
> !                 /* (A +- B) - A -> +- B.  */
> !                 code = ((def_code == PLUS_EXPR)
> !                         ? TREE_CODE (def_rhs2) : NEGATE_EXPR);
> !                 rhs1 = def_rhs2;
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
> !             else if (operand_equal_p (def_rhs2, rhs2, 0)
> !                      && code != def_code)
> !               {
> !                 /* (A +- B) -+ B -> A.  */
> !                 code = TREE_CODE (def_rhs1);
> !                 rhs1 = def_rhs1;
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
> !             else if (TREE_CODE (rhs2) == INTEGER_CST
> !                      && TREE_CODE (def_rhs1) == INTEGER_CST)
> !               {
> !                 /* (CST +- A) +- CST -> CST +- A.  */
> !                 tree cst = fold_binary (code, TREE_TYPE (rhs1),
> !                                         def_rhs1, rhs2);
> !                 if (cst && !TREE_OVERFLOW (cst))
> !                   {
> !                     code = def_code;
> !                     gimple_assign_set_rhs_code (stmt, code);
> !                     rhs1 = cst;
> !                     gimple_assign_set_rhs1 (stmt, rhs1);
> !                     rhs2 = def_rhs2;
> !                     gimple_assign_set_rhs2 (stmt, rhs2);
> !                     gimple_set_modified (stmt, true);
> !                   }
> !               }
> !             else if (TREE_CODE (rhs2) == INTEGER_CST
> !                      && TREE_CODE (def_rhs2) == INTEGER_CST
> !                      && def_code == PLUS_EXPR)
> !               {
> !                 /* (A + CST) +- CST -> A + CST.  */
> !                 tree cst = fold_binary (code, TREE_TYPE (rhs1),
> !                                         def_rhs2, rhs2);
> !                 if (cst && !TREE_OVERFLOW (cst))
> !                   {
> !                     code = PLUS_EXPR;
> !                     gimple_assign_set_rhs_code (stmt, code);
> !                     rhs1 = def_rhs1;
> !                     gimple_assign_set_rhs1 (stmt, rhs1);
> !                     rhs2 = cst;
> !                     gimple_assign_set_rhs2 (stmt, rhs2);
> !                     gimple_set_modified (stmt, true);
> !                   }
> !               }
> !           }
> !         else if (def_code == BIT_NOT_EXPR
> !                  && INTEGRAL_TYPE_P (TREE_TYPE (rhs1)))
>            {
> !             tree def_rhs1 = gimple_assign_rhs1 (def_stmt);
> !             if (code == PLUS_EXPR
> !                 && operand_equal_p (def_rhs1, rhs2, 0))
> !               {
> !                 /* ~A + A -> -1.  */
> !                 code = INTEGER_CST;
> !                 rhs1 = build_int_cst_type (TREE_TYPE (rhs2), -1);
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
> !             else if (code == PLUS_EXPR
> !                      && integer_onep (rhs1))
> !               {
> !                 /* ~A + 1 -> -A.  */
> !                 code = NEGATE_EXPR;
> !                 rhs1 = def_rhs1;
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
>            }
>        }
>      }
>
> !   if (rhs2 && TREE_CODE (rhs2) == SSA_NAME)
>      {
> !       gimple def_stmt = SSA_NAME_DEF_STMT (rhs2);
> !       if (is_gimple_assign (def_stmt) && can_propagate_from (def_stmt))
> !       {
> !         enum tree_code def_code = gimple_assign_rhs_code (def_stmt);
> !         if (def_code == PLUS_EXPR
> !             || def_code == MINUS_EXPR)
> !           {
> !             tree def_rhs1 = gimple_assign_rhs1 (def_stmt);
> !             tree def_rhs2 = gimple_assign_rhs2 (def_stmt);
> !             if (operand_equal_p (def_rhs1, rhs1, 0)
> !                 && code == MINUS_EXPR)
> !               {
> !                 /* A - (A +- B) -> -+ B.  */
> !                 code = ((def_code == PLUS_EXPR)
> !                         ? NEGATE_EXPR : TREE_CODE (def_rhs2));
> !                 rhs1 = def_rhs2;
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
> !             else if (operand_equal_p (def_rhs2, rhs1, 0)
> !                      && code != def_code)
> !               {
> !                 /* A +- (B +- A) -> +- B.  */
> !                 code = ((code == PLUS_EXPR)
> !                         ? TREE_CODE (def_rhs1) : NEGATE_EXPR);
> !                 rhs1 = def_rhs1;
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
> !             else if (TREE_CODE (rhs1) == INTEGER_CST
> !                      && TREE_CODE (def_rhs1) == INTEGER_CST)
> !               {
> !                 /* CST +- (CST +- A) -> CST +- A.  */
> !                 tree cst = fold_binary (code, TREE_TYPE (rhs2),
> !                                         rhs1, def_rhs1);
> !                 if (cst && !TREE_OVERFLOW (cst))
>                    {
> !                     code = (code == def_code ? PLUS_EXPR : MINUS_EXPR);
> !                     gimple_assign_set_rhs_code (stmt, code);
> !                     rhs1 = cst;
> !                     gimple_assign_set_rhs1 (stmt, rhs1);
> !                     rhs2 = def_rhs2;
> !                     gimple_assign_set_rhs2 (stmt, rhs2);
> !                     gimple_set_modified (stmt, true);
> !                   }
> !               }
> !             else if (TREE_CODE (rhs1) == INTEGER_CST
> !                      && TREE_CODE (def_rhs2) == INTEGER_CST)
> !               {
> !                 /* CST +- (A +- CST) -> CST +- A.  */
> !                 tree cst = fold_binary (def_code == code
> !                                         ? PLUS_EXPR : MINUS_EXPR,
> !                                         TREE_TYPE (rhs2),
> !                                         rhs1, def_rhs2);
> !                 if (cst && !TREE_OVERFLOW (cst))
> !                   {
> !                     rhs1 = cst;
> !                     gimple_assign_set_rhs1 (stmt, rhs1);
> !                     rhs2 = def_rhs1;
> !                     gimple_assign_set_rhs2 (stmt, rhs2);
> !                     gimple_set_modified (stmt, true);
>                    }
>                }
>            }
> !         else if (def_code == BIT_NOT_EXPR
> !                  && INTEGRAL_TYPE_P (TREE_TYPE (rhs2)))
> !           {
> !             tree def_rhs1 = gimple_assign_rhs1 (def_stmt);
> !             if (code == PLUS_EXPR
> !                 && operand_equal_p (def_rhs1, rhs1, 0))
> !               {
> !                 /* A + ~A -> -1.  */
> !                 code = INTEGER_CST;
> !                 rhs1 = build_int_cst_type (TREE_TYPE (rhs1), -1);
> !                 rhs2 = NULL_TREE;
> !                 gimple_assign_set_rhs_with_ops (gsi, code, rhs1, NULL_TREE);
> !                 gcc_assert (gsi_stmt (*gsi) == stmt);
> !                 gimple_set_modified (stmt, true);
> !               }
>            }
>        }
>      }
> --- 2188,2661 ----
>  }
>
>
> ! /* Data structures to hold plus/minus operand information during
> !    re-association.  */
> !
> ! /* Range information.  We only track whether we are sure that an operand
> !    cannot equal the minimum value of its type (RANGE_NOT_MIN), or that it
> !    cannot equal the maximum value of its type (RANGE_NOT_MAX).  */
> !
> ! enum plus_minus_op_range
> ! {
> !   RANGE_UNKNOWN = 0,
> !   RANGE_NOT_MIN,
> !   RANGE_NOT_MAX
> ! };
> !
> ! /* A single operand.  OP is the operand, NEG is true if the operand needs
> !    to be negated, and RANGE holds operand range information we were able
> !    to track.  */
> !
> ! struct plus_minus_op_data
> ! {
> !   tree op;
> !   bool neg;
> !   enum plus_minus_op_range range;
> ! };
> !
> ! /* All operands of the expression.  The value of the expression is considered
> !    the sum of all operands, negated if indicated.  There is no implicit order
> !    of the operands; this data structure is only used in contexts where we
> !    know addition is associative.  */
> !
> ! struct plus_minus_data
> ! {
> !   /* Operand array and number of current operands.  We do not bother to
> !      dynamically resize the array; 8 operands ought to be enough for the
> !      vast majority of cases.  */
> !   struct plus_minus_op_data ops[8];
> !   unsigned int n_ops;
> !
> !   /* Used for iterating through all operands.  */
> !   unsigned int curr_op;
> !
> !   /* True if we have successfully applied some simplification operation.  */
> !   bool simplified;
> !
> !   /* True if we ran into a failure (e.g. the ops array overflowed).  The
> !      rest of the data structure is no longer reliable in that case.  */
> !   bool failed;
> ! };
> !
> ! /* Iterate through the OPS operand array.  Return false if we are already
> !    at the end of the array.  Otherwise, return true and set OP to the index
> !    of the next operand to be considered.  */
>
>  static bool
> ! next_plus_minus_op (struct plus_minus_data *ops, unsigned int *op)
>  {
> !   if (ops->curr_op < ops->n_ops)
> !     {
> !       *op = ops->curr_op++;
> !       return true;
> !     }
> !
> !   return false;
> ! }
> !
> ! /* Remove operand with index OP from the OPS operand array.  */
> !
> ! static void
> ! remove_plus_minus_op (struct plus_minus_data *ops, unsigned int op)
> ! {
> !   if (op + 1 < ops->n_ops)
> !     memmove (&ops->ops[op], &ops->ops[op + 1],
> !            sizeof (struct plus_minus_op_data) * (ops->n_ops - op - 1));
> !
> !   ops->n_ops--;
> !
> !   if (op < ops->curr_op)
> !     ops->curr_op--;
> !
> !   ops->simplified = true;
> ! }
>
> ! /* Add NEW_OP at the end of the OPS operand array.  If possible, perform
> !    simplifications that are guaranteed to leave the total value of the
> !    expression encoded by OPS unchanged:
> !      - cancel NEW_OP against an equivalent but negated operand
> !      - simplify constant integer operands (without overflow).
> !    Assumes operands can be freely re-associated.  */
> !
> ! static void
> ! add_plus_minus_op (struct plus_minus_data *ops,
> !                  struct plus_minus_op_data *new_op)
> ! {
> !   struct plus_minus_op_data cst_op;
> !   unsigned int i;
> !
> !   if (ops->failed)
> !     return;
>
> !   if (TREE_CODE (new_op->op) == INTEGER_CST && new_op->neg)
>      {
> !       tree cst = fold_unary_to_constant (NEGATE_EXPR,
> !                                        TREE_TYPE (new_op->op), new_op->op);
> !       if (cst && !TREE_OVERFLOW (cst))
> !       {
> !         new_op = &cst_op;
> !         cst_op.op = cst;
> !         cst_op.neg = false;
> !         cst_op.range = RANGE_UNKNOWN;
> !         ops->simplified = true;
> !       }
> !     }
>
> !   for (i = 0; i < ops->n_ops; i++)
> !     {
> !       if (operand_equal_p (new_op->op, ops->ops[i].op, 0)
> !         && new_op->neg != ops->ops[i].neg)
>        {
> !         remove_plus_minus_op (ops, i);
> !         ops->simplified = true;
> !         return;
>        }
>
> !       if (TREE_CODE (new_op->op) == INTEGER_CST && !new_op->neg
> !         && TREE_CODE (ops->ops[i].op) == INTEGER_CST && !ops->ops[i].neg)
>        {
> !         tree cst = fold_binary (PLUS_EXPR, TREE_TYPE (new_op->op),
> !                                 new_op->op, ops->ops[i].op);
> !         if (cst && !TREE_OVERFLOW (cst))
>            {
> !             if (integer_zerop (cst))
> !               remove_plus_minus_op (ops, i);
> !             else
> !               ops->ops[i].op = cst;
> !
> !             ops->simplified = true;
> !             return;
>            }
>        }
>      }
>
> !   if (ops->n_ops < ARRAY_SIZE (ops->ops))
> !     {
> !       ops->ops[ops->n_ops++] = *new_op;
> !       return;
> !     }
> !
> !   /* If we run out of room in the array, give up.  This should
> !      almost never happen.  */
> !   ops->n_ops = 0;
> !   ops->failed = true;
> !   return;
> ! }
> !
> ! /* Add up to two operands LHS/RHS (as indicated by N_OPS) to the
> !    OPS operand array.  If OUTER_NEG is true, operands are to be
> !    negated before adding them to the array.  */
> !
> ! static void
> ! add_plus_minus_ops (struct plus_minus_data *ops,
> !                   unsigned int n_ops,
> !                   struct plus_minus_op_data *lhs,
> !                   struct plus_minus_op_data *rhs,
> !                   bool outer_neg)
> ! {
> !   lhs->neg ^= outer_neg;
> !   rhs->neg ^= outer_neg;
> !
> !   if (n_ops >= 1)
> !     add_plus_minus_op (ops, lhs);
> !   if (n_ops >= 2)
> !     add_plus_minus_op (ops, rhs);
> ! }
> !
> ! /* Extract plus/minus operands from STMT.  Stores up to two operands in
> !    LHS and RHS and returns the number of operands stored.  If no operands
> !    could be extracted, returns 0.
> !
> !    The routine guarantees that a plus/minus expression formed from LHS
> !    and RHS will evaluate to the same value as STMT, using modulo arithmetic.
> !    If STMT is of integral type and TRUE_ARITHMETIC is true, the routine
> !    guarantees in addition that this property holds true when performing
> !    operations in "true" integer arithmetic.
>
> !    OUTER_RANGE encodes range information known about the result of STMT;
> !    this may be used to infer range data on LHS and/or RHS.  */
>
> ! static unsigned int
> ! setup_plus_minus_ops (gimple stmt, struct plus_minus_op_data *lhs,
> !                     struct plus_minus_op_data *rhs,
> !                     enum plus_minus_op_range outer_range,
> !                     bool true_arithmetic)
> ! {
> !   enum tree_code this_code = gimple_assign_rhs_code (stmt);
> !   tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> !
> !   switch (this_code)
>      {
> !     case PLUS_EXPR:
> !     case MINUS_EXPR:
> !       /* If we have an integral type where overflow wraps, we can only
> !        guarantee that LHS plus RHS equal STMT in modulo arithmetic,
> !        so fail if "true" arithmetic is requested.  For integral types
> !        where overflow does *not* wrap, we can assume that STMT does
> !        not overflow, and thus the equality holds in "true" arithmetic
> !        as well.
> !
> !        For floating point types, we -strictly speaking- could never
> !        extract operands due to rounding effects and saturation at
> !        +/-Inf.  However, the -fassociative-math setting directs
> !        the compiler to ignore such effects; respect it.
> !
> !        Fixed point types can never be re-associated.  */
> !       if ((INTEGRAL_TYPE_P (type)
> !          && (!TYPE_OVERFLOW_WRAPS (type) || !true_arithmetic))
> !         || (FLOAT_TYPE_P (type)
> !             && flag_associative_math))
> !       {
> !         lhs->op = gimple_assign_rhs1 (stmt);
> !         lhs->neg = false;
> !         lhs->range = RANGE_UNKNOWN;
> !
> !         rhs->op = gimple_assign_rhs2 (stmt);
> !         rhs->neg = (this_code == MINUS_EXPR);
> !         rhs->range = RANGE_UNKNOWN;
> !
> !         /* For integer types that cannot overflow, the fact that a constant
> !            is added to / subtracted from an operand allows us to deduce
> !            range information about that operand.  */
> !         if (INTEGRAL_TYPE_P (type) && !TYPE_OVERFLOW_WRAPS (type)
> !             && TREE_CODE (rhs->op) == INTEGER_CST)
> !           {
> !             int sgn = tree_int_cst_sgn (rhs->op);
> !             if (sgn != 0)
> !               {
> !                 bool neg_p = (sgn < 0) ^ (this_code == MINUS_EXPR);
> !                 lhs->range = neg_p ? RANGE_NOT_MIN : RANGE_NOT_MAX;
> !               }
> !           }
> !
> !         return 2;
> !       }
> !       break;
> !
> !     case NEGATE_EXPR:
> !       /* For integral types, we can extract a negation in the same set
> !        of circumstances we could extract a PLUS or MINUS, see above.
> !
> !        We can always extract a negation for types that use a separate
> !        sign bit: floating-point types and signed fixed-point types.  */
> !       if ((INTEGRAL_TYPE_P (type)
> !          && (!TYPE_OVERFLOW_WRAPS (type) || !true_arithmetic))
> !         || FLOAT_TYPE_P (type)
> !         || (FIXED_POINT_TYPE_P (type) && !TYPE_UNSIGNED (type)))
> !       {
> !         lhs->op = gimple_assign_rhs1 (stmt);
> !         lhs->neg = true;
> !         lhs->range = RANGE_UNKNOWN;
> !
> !         /* For integer types that cannot overflow, the fact that a
> !            negation is performed allows us to deduce range information
> !            about its operand.  */
> !         if (INTEGRAL_TYPE_P (type) && !TYPE_OVERFLOW_WRAPS (type))
> !           lhs->range = RANGE_NOT_MIN;
> !
> !         return 1;
> !       }
> !       break;
> !
> !     case BIT_NOT_EXPR:
> !       /* We transform ~A into -A - 1.  This is allowed only for integral
> !        types, and only in modulo arithmetic.  */
> !       if (INTEGRAL_TYPE_P (type) && !true_arithmetic)
>        {
> !         rhs->op = build_int_cst_type (type, -1);
> !         rhs->neg = false;
> !         rhs->range = RANGE_UNKNOWN;
> !
> !         lhs->op = gimple_assign_rhs1 (stmt);
> !         lhs->neg = true;
> !         lhs->range = RANGE_UNKNOWN;
> !
> !         /* For all integer types, BIT_NOT_EXPR transforms the maximum
> !            value of its range to the minimum value and vice versa.  */
> !         if (outer_range == RANGE_NOT_MIN)
> !           lhs->range = RANGE_NOT_MAX;
> !         else if (outer_range == RANGE_NOT_MAX)
> !           lhs->range = RANGE_NOT_MIN;
> !
> !         return 2;
> !       }
> !       break;
> !
> !     CASE_CONVERT:
> !       /* We look through conversions between integral types of the same
> !        precision.  This is in general allowed only in modulo arithmetic.
> !        This case is used in particular to handle constructs like
> !                 (int) ((unsigned) A + (unsigned) B)
> !        which are on occasion produced by other optimization passes that
> !        want to avoid introducing undefined overflows.  */
> !       if (INTEGRAL_TYPE_P (type) && !true_arithmetic)
> !       {
> !         tree op = gimple_assign_rhs1 (stmt);
> !
> !         if (TREE_TYPE (op) && INTEGRAL_TYPE_P (TREE_TYPE (op))
> !             && TYPE_PRECISION (type) == TYPE_PRECISION (TREE_TYPE (op)))
>            {
> !             lhs->op = op;
> !             lhs->neg = false;
> !             lhs->range = RANGE_UNKNOWN;
> !
> !             return 1;
>            }
>        }
> +       break;
> +
> +     default:
> +       break;
>      }
>
> !   return 0;
> ! }
> !
> ! /* Perform re-associations of the plus or minus statement STMT that are
> !    always permitted.  Returns true if the CFG was changed.  */
> !
> ! static bool
> ! associate_plusminus (gimple_stmt_iterator *gsi)
> ! {
> !   gimple stmt = gsi_stmt (*gsi);
> !   tree type = TREE_TYPE (gimple_assign_lhs (stmt));
> !   struct plus_minus_data ops;
> !   struct plus_minus_op_data lhs, rhs;
> !   unsigned int n_ops;
> !   bool no_overflow;
> !   bool true_arithmetic;
> !   unsigned int i;
> !   int pass;
> !
> !   /* The type of STMT determines whether we are allowed to create new
> !      operations in that type that may introduce overflows.  */
> !   no_overflow = (INTEGRAL_TYPE_P (type) && !TYPE_OVERFLOW_WRAPS (type));
> !
> !   /* Initialize OPS array from STMT.  */
> !   memset (&ops, 0, sizeof ops);
> !   n_ops = setup_plus_minus_ops (stmt, &lhs, &rhs, RANGE_UNKNOWN, no_overflow);
> !   add_plus_minus_ops (&ops, n_ops, &lhs, &rhs, false);
> !
> !   /* Iterate over OPS array and perform simplifications.  If we cannot
> !      introduce new overflows, we first perform only simplifications that
> !      are valid in "true" arithmetic, and in a second pass then perform
> !      all simplifications valid in modulo arithmetic.  If we *are* allowed
> !      to introduce new overflows, we skip the first pass.  */
> !   for (pass = 0; pass < 2; pass++)
>      {
> !       if (pass == 0)
> !       true_arithmetic = no_overflow;
> !       else if (true_arithmetic)
> !       true_arithmetic = false;
> !       else
> !       break;
> !
> !       ops.curr_op = 0;
> !       while (next_plus_minus_op (&ops, &i))
> !       {
> !         tree this_op = ops.ops[i].op;
> !         bool this_neg = ops.ops[i].neg;
> !         enum plus_minus_op_range this_range = ops.ops[i].range;
> !
> !         /* For each operand in the array that is an SSA_NAME, see whether
> !            we can extract additional plus/minus operands from its def.  */
> !         if (TREE_CODE (this_op) == SSA_NAME)
> !           {
> !             gimple def_stmt = SSA_NAME_DEF_STMT (this_op);
> !             if (is_gimple_assign (def_stmt)
> !                 && can_propagate_from (def_stmt))
> !               {
> !                 n_ops = setup_plus_minus_ops (def_stmt, &lhs, &rhs,
> !                                               this_range, true_arithmetic);
> !                 if (n_ops > 0)
>                    {
> !                     remove_plus_minus_op (&ops, i);
> !                     add_plus_minus_ops (&ops, n_ops, &lhs, &rhs, this_neg);
>                    }
>                }
>            }
> !
> !         /* After every simplification, check whether OPS array can now be
> !            represented by a new statement.  */
> !
> !         /* If we have just a single, non-negated operand left, we replace
> !            the whole expression by that operand.  */
> !         if (ops.n_ops == 1 && !ops.ops[0].neg
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[0].op), type))
> !           {
> !             gimple_assign_set_rhs_with_ops (gsi, TREE_CODE (ops.ops[0].op),
> !                                             ops.ops[0].op, NULL_TREE);
> !             gcc_assert (gsi_stmt (*gsi) == stmt);
> !             gimple_set_modified (stmt, true);
> !             goto out;
> !           }
> !
> !         /* If we have a single *negated* operand left, check whether we are
> !            allowed to introduce a new NEGATE_EXPR.  We can do that if:
> !               - we may introduce new overflows,
> !               - all simplifications were performed in "true" arithmetic,
> !                 because then the true value of the negated operand is in
> !                 the original range of the type, or
> !               - we were able to otherwise prove the operand cannot be the
> !                 minimum value of its type.  */
> !         if (ops.n_ops == 1 && ops.ops[0].neg
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[0].op), type)
> !             && (!no_overflow || true_arithmetic
> !                 || ops.ops[0].range == RANGE_NOT_MIN))
> !           {
> !             gimple_assign_set_rhs_with_ops (gsi, NEGATE_EXPR,
> !                                             ops.ops[0].op, NULL_TREE);
> !             gcc_assert (gsi_stmt (*gsi) == stmt);
> !             gimple_set_modified (stmt, true);
> !             goto out;
> !           }
> !
> !         /* If we have two operands left (and some simplification took place,
> !            so we're not simply looking at the original two operands), and
> !            at most one of them is negated, check whether we are allowed to
> !            introduce a new PLUS_EXPR or MINUS_EXPR.  This follows the same
> !            rules as above, except that range information doesn't help.
> !
> !            Note that even after we've replaced the original expression with
> !            a new PLUS or MINUS, we continue further simplifications -- maybe
> !            we are still able to find an even simpler equivalence.  */
> !         if (ops.n_ops == 2 && ops.simplified
> !             && !ops.ops[0].neg && !ops.ops[1].neg
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[0].op), type)
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[1].op), type)
> !             && (!no_overflow || true_arithmetic))
> !           {
> !             gimple_assign_set_rhs_code (stmt, PLUS_EXPR);
> !             gimple_assign_set_rhs1 (stmt, ops.ops[0].op);
> !             gimple_assign_set_rhs2 (stmt, ops.ops[1].op);
> !             gimple_set_modified (stmt, true);
> !             ops.simplified = false;
> !           }
> !
> !         if (ops.n_ops == 2 && ops.simplified
> !             && !ops.ops[0].neg && ops.ops[1].neg
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[0].op), type)
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[1].op), type)
> !             && (!no_overflow || true_arithmetic))
> !           {
> !             gimple_assign_set_rhs_code (stmt, MINUS_EXPR);
> !             gimple_assign_set_rhs1 (stmt, ops.ops[0].op);
> !             gimple_assign_set_rhs2 (stmt, ops.ops[1].op);
> !             gimple_set_modified (stmt, true);
> !             ops.simplified = false;
> !           }
> !
> !         if (ops.n_ops == 2 && ops.simplified
> !             && ops.ops[0].neg && !ops.ops[1].neg
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[0].op), type)
> !             && useless_type_conversion_p (TREE_TYPE (ops.ops[1].op), type)
> !             && (!no_overflow || true_arithmetic))
> !           {
> !             gimple_assign_set_rhs_code (stmt, MINUS_EXPR);
> !             gimple_assign_set_rhs1 (stmt, ops.ops[1].op);
> !             gimple_assign_set_rhs2 (stmt, ops.ops[0].op);
> !             gimple_set_modified (stmt, true);
> !             ops.simplified = false;
>            }
>        }
>      }
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  Ulrich.Weigand@de.ibm.com
>
Richard Biener May 22, 2012, 12:04 p.m. UTC | #3
On Tue, May 22, 2012 at 1:58 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> Richard Guenther wrote:
>> On Fri, May 18, 2012 at 10:27 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
>> > The following patch rewrites associate_plusminus to remove all the
>> > explicitly coded special cases, and instead performs a scan of the
>> > plus/minus tree similar to what is done in tree-ssa-reassoc (and also
>> > in simplify-rtx for that matter).  If this results in an expression
>> > tree that collapses to just a single operand, or just a single newly
>> > introduced operation, and (in the latter case) one of the two rules
>> > above ensures the new operation is safe, the transformation is performed.
>> >
>> > This still passes all reassoc tests, and in fact allows removing XFAILs
>> > from two of them.  It also catches the end-of-loop value computation case.
>> >
>> > Tested on i386-linux with no regressions.
>> >
>> > OK for mainline?
>>
>> The point of the special-cases in forwprop was to make them fast to
>> detect - forwprop should be a pattern-matching thing, much like
>> combine on RTL.
>
> Well, the problem is that you can only really make the decision whether or not
> reassociation is allowed after you've seen the whole plus-minus tree.
>
> For example, it is valid to transform "(a + (b + c)) - c" into "a + b" --
> but the only potential "intermediate" transform, "a + (b + c)" into
> "(a + b) + c", is of course not valid in general.  It only becomes valid
> due to the outer context "... - c" in which it is executed ...
>
>> So, instead of changing forwprop this way can you adjust tree-ssa-reassoc.c
>> to (again) associate !TYPE_OVERFLOW_WRAPS operations but make
>> sure we throw away results that would possibly introduce undefined overflow
>> for !TYPE_OVERFLOW_WRAPS types?  There is a reassoc pass after
>> loop optimizations, so that should fix it as well, no?
>
> I had thought of that as well.  But it is not quite that simple -- the
> problem is that tree-ssa-reassoc.c as part of its core algorithm reassociates
> expressions all the time, even while still building up the tree; see e.g.
> linearize_expr or break_up_subtract.  Those steps may all be invalid in
> general, but we only know whether that is true at the very end, once we've
> built up the full tree -- but at that point it is already too late.

Hmm, really?  I had not realized it does something it cannot undo later ...
but well ISTR patches floating around for re-organizing how we do
the break_up_subtract / negate stuff.  Micha?

> I guess it might be possible to re-work tree-ssa-reassoc to *first* build
> up the tree without changing any statements, then make the decision whether
> we can re-associate, and only then actually perform modifications.  I'll
> have to think about that a bit more.

Yes, I think that's what we want.

> If we manage to do that, would you then suggest we should remove the
> associate_plusminus phase in tree-ssa-forwprop.c again?

Not sure about removing it - simplifying the simple cases early enough
might be useful.  But yes, I installed them all to avoid regressing too much
as I "fixed" reassoc not to associate expressions with undefined overflow.

Richard.

> Bye,
> Ulrich
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  Ulrich.Weigand@de.ibm.com
>
Richard Biener May 22, 2012, 12:11 p.m. UTC | #4
On Tue, May 22, 2012 at 2:04 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Tue, May 22, 2012 at 1:58 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
>> Richard Guenther wrote:
>>> On Fri, May 18, 2012 at 10:27 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
>>> > The following patch rewrites associate_plusminus to remove all the
>>> > explicitly coded special cases, and instead performs a scan of the
>>> > plus/minus tree similar to what is done in tree-ssa-reassoc (and also
>>> > in simplify-rtx for that matter).  If this results in an expression
>>> > tree that collapses to just a single operand, or just a single newly
>>> > introduced operation, and -- in the latter case -- one of the two rules
>>> > above ensures the new operation is safe, the transformation is performed.
>>> >
>>> > This still passes all reassoc tests, and in fact allows us to remove XFAILs
>>> > from two of them.  It also catches the end-of-loop value computation case.
>>> >
>>> > Tested on i386-linux with no regressions.
>>> >
>>> > OK for mainline?
>>>
>>> The point of the special cases in forwprop was to make them fast to
>>> detect - forwprop should be a pattern-matching thing, much like
>>> combine on RTL.
>>
>> Well, the problem is that you can only really make the decision whether
>> or not reassociation is allowed after you've seen the whole plus/minus
>> tree.
>>
>> For example, it is valid to transform "(a + (b + c)) - c" into "a + b" --
>> but the only potential "intermediate" transform, "a + (b + c)" into
>> "(a + b) + c", is of course not valid in general.  It only becomes valid
>> due to the outer context "... - c" in which it is executed ...
>>
>>> So, instead of changing forwprop this way can you adjust tree-ssa-reassoc.c
>>> to (again) associate !TYPE_OVERFLOW_WRAPS operations but make
>>> sure we throw away results that would possibly introduce undefined overflow
>>> for !TYPE_OVERFLOW_WRAPS types?  There is a reassoc pass after
>>> loop optimizations, so that should fix it as well, no?
>>
>> I had thought of that as well.  But it is not quite that simple -- the
>> problem is that tree-ssa-reassoc.c, as part of its core algorithm,
>> reassociates expressions on the fly while it is still building up the
>> tree, see e.g. linearize_expr or break_up_subtract.  Those steps may all
>> be invalid in general, but we only know whether that is the case at the
>> very end, once we've built up the full tree -- and at that point it is
>> already too late.
>
> Hmm, really?  I had not realized it does something it cannot undo later ...
> but well ISTR patches floating around for re-organizing how we do
> the break_up_subtract / negate stuff.  Micha?
>
>> I guess it might be possible to re-work tree-ssa-reassoc to *first* build
>> up the tree without changing any statements, then make the decision whether
>> we can re-associate, and only then actually perform modifications.  I'll
>> have to think about that a bit more.
>
> Yes, I think that's what we want.
>
>> If we manage to do that, would you then suggest we should remove the
>> associate_plusminus phase in tree-ssa-forwprop.c again?
>
> Not sure about removing it - simplifying the simple cases early enough
> might be useful.  But yes, I installed them all to avoid regressing too much
> as I "fixed" reassoc not to associate expressions with undefined overflow.

Btw, reassoc (and your patch?) would have the issue that for
a + (b + c) - a it would yield b + c as the result, which is not an atom
(but still ok, since it is a pre-existing value that is already computed
in the IL).  The simple forwprop matching avoids this, as it doesn't
even look into the (b + c) expression but just sees a temporary as the
atom here.

Richard.
Michael Matz May 22, 2012, 12:31 p.m. UTC | #5
Hi,

On Tue, 22 May 2012, Richard Guenther wrote:

> > I had thought of that as well.  But it is not quite that simple -- the
> > problem is that tree-ssa-reassoc.c, as part of its core algorithm,
> > reassociates expressions on the fly while it is still building up the
> > tree, see e.g. linearize_expr or break_up_subtract.  Those steps may all
> > be invalid in general, but we only know whether that is the case at the
> > very end, once we've built up the full tree -- and at that point it is
> > already too late.
> 
> Hmm, really?

Yep, reassoc already rewrites statements at linearize time.  I think that
could be changed without too much hassle.

> I had not realized it does something it cannot undo later ...
> but well ISTR patches floating around for re-organizing how we do
> the break_up_subtract / negate stuff.  Micha?

Yes, I have an incomplete patch for that, but it didn't touch the
fundamental implementation detail of current tree-reassoc, namely that it
already modifies statements at analysis time.


Ciao,
Michael.
Richard Biener May 22, 2012, 12:55 p.m. UTC | #6
On Tue, May 22, 2012 at 2:50 PM, Ulrich Weigand <uweigand@de.ibm.com> wrote:
> Richard Guenther wrote:
>
>> Btw, reassoc (and your patch?) would have the issue that for
>> a + (b + c) - a it would yield b + c as result which is not an atom
>> (but still ok, since it is a pre-existing value that is computed in the IL
>> already).  The simple forwprop matching catches this as it doesn't
>> even look into the (b + c) expression but just sees a temporary as
>> atom here.
>
> Yes, that is a problem, and my patch partially solves it by being
> very eager about performing optimizations and checking after every step
> whether we have found a simplified solution.  For example, in your case
> we'd see
>
>  y = b + c
>  x = a + y
>  r = x - a
>
> and then, when looking at "r":
>
>  - push "x" and "-a" onto the operand stack
>  - notice no simplification yet
>  - expand "x" one stage into "a" and "y"
>  - immediately cancel "a" and "-a" when pushing onto the stack
>  - notice the stack is now just "y", which represents a simplification
>  - stop
>
> This is not a *complete* solution: by expanding "x" we might miss
> simplifications we could have found only by expanding "-a" instead;
> we'd really have to search the full solution space, or backtrack out
> of an already performed expansion.  I have not implemented this due
> to concerns about combinatorial explosion ...
>
>
> Solving this problem in the context of the existing tree-ssa-reassoc
> algorithm is indeed yet another issue that I need to think about.

I suppose we'd have to go the whole way and make the reassoc intermediate
IL a tree, not just a linear list of operands ... which incidentally
would also pave the way for supporting multiple operations.  Each
operation node could then link to a stmt (or have an SSA name) that
represents its value for as long as we have it.  Associating would simply
remove that link, forcing a new stmt computation.

Re-structuring reassoc that way would also make its IL and optimizations
possibly usable by other passes that construct expressions, allowing them
to be optimized without relying on fold and later gimplification ...

Richard.

> Bye,
> Ulrich
>
> --
>  Dr. Ulrich Weigand
>  GNU Toolchain for Linux on System z and Cell BE
>  Ulrich.Weigand@de.ibm.com
>
diff mbox

Patch

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 184625)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -474,11 +474,22 @@ 
 	    return chrec_dont_know;
 	  else
 	    {
+	      tree type = TREE_TYPE (evolution_fn), utype = type;
 	      tree res;
 
+	      /* Since the number of iterations is always unsigned, we perform
+		 the computation in an unsigned type (if feasible) to allow for
+		 improved simplification of subexpressions.  */
+	      if ((INTEGRAL_TYPE_P (type) && !TYPE_UNSIGNED (type))
+		  || POINTER_TYPE_P (type))
+		utype = build_nonstandard_integer_type
+			  (TYPE_PRECISION (type), 1);
+
 	      /* evolution_fn is the evolution function in LOOP.  Get
 		 its value in the nb_iter-th iteration.  */
-	      res = chrec_apply (inner_loop->num, evolution_fn, nb_iter);
+	      res = chrec_convert (utype, evolution_fn, NULL);
+	      res = chrec_apply (inner_loop->num, res, nb_iter);
+	      res = chrec_convert (type, res, NULL);
 
 	      if (chrec_contains_symbols_defined_in_loop (res, loop->num))
 		res = instantiate_parameters (loop, res);



Patch: Use STRIP_NOPS on subexpressions in fold_plusminus_mult_expr

Index: gcc/fold-const.c
===================================================================
--- gcc/fold-const.c	(revision 184625)
+++ gcc/fold-const.c	(working copy)
@@ -7081,6 +7081,8 @@ 
     {
       arg00 = TREE_OPERAND (arg0, 0);
       arg01 = TREE_OPERAND (arg0, 1);
+      STRIP_NOPS (arg00);
+      STRIP_NOPS (arg01);
     }
   else if (TREE_CODE (arg0) == INTEGER_CST)
     {
@@ -7099,6 +7101,8 @@ 
     {
       arg10 = TREE_OPERAND (arg1, 0);
       arg11 = TREE_OPERAND (arg1, 1);
+      STRIP_NOPS (arg10);
+      STRIP_NOPS (arg11);
     }
   else if (TREE_CODE (arg1) == INTEGER_CST)
     {



Patch: Use STRIP_NOPS when following scalar evolutions

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 184625)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -1154,8 +1165,8 @@ 
       rhs0 = TREE_OPERAND (expr, 0);
       rhs1 = TREE_OPERAND (expr, 1);
       type = TREE_TYPE (rhs0);
-      STRIP_USELESS_TYPE_CONVERSION (rhs0);
-      STRIP_USELESS_TYPE_CONVERSION (rhs1);
+      STRIP_NOPS (rhs0);
+      STRIP_NOPS (rhs1);
       res = follow_ssa_edge_binary (loop, at_stmt, type, rhs0, code, rhs1,
 				    halting_phi, evolution_of_loop, limit);
       break;
@@ -1168,8 +1179,8 @@ 
 	  rhs0 = TREE_OPERAND (expr, 0);
 	  rhs1 = TREE_OPERAND (expr, 1);
 	  type = TREE_TYPE (rhs0);
-	  STRIP_USELESS_TYPE_CONVERSION (rhs0);
-	  STRIP_USELESS_TYPE_CONVERSION (rhs1);
+	  STRIP_NOPS (rhs0);
+	  STRIP_NOPS (rhs1);
 	  res = follow_ssa_edge_binary (loop, at_stmt, type,
 					rhs0, POINTER_PLUS_EXPR, rhs1,
 					halting_phi, evolution_of_loop, limit);