Fix PR37916 (unnecessary spilling)

Message ID	alpine.LSU.2.21.1811201323280.5354@wotan.suse.de
State	New
Headers	show Return-Path: <gcc-patches-return-490515-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:subject:message-id:mime-version:content-type; q=dns; s= default; b=jckX7oV7WVyjUqplKaLtruouGFaBBkiaTfLnp6DUcc0k7fHxr93/0 nAIwIXjMmUV2NThalBu8ogy9jVNBcbjRKH0yNMZma5SvfCIUhoRZTL6xxEdtTjPg jZ0ftXoqVcwNZNQg8aCRcBayZvQV5jHdS0Mobh1iQ0vii+JaqPh6ss= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Date: Tue, 20 Nov 2018 13:42:32 +0000 (UTC) From: Michael Matz <matz@suse.de> To: gcc-patches@gcc.gnu.org Subject: Fix PR37916 (unnecessary spilling) Message-ID: <alpine.LSU.2.21.1811201323280.5354@wotan.suse.de> User-Agent: Alpine 2.21 (LSU 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII
Series	Fix PR37916 (unnecessary spilling) \| expand Fix PR37916 (unnecessary spilling)

Michael Matz Nov. 20, 2018, 1:42 p.m. UTC

Hi,

this bug report is about cris generating worse code since tree-ssa.  The 
effect is also visible on x86-64.  The symptom is that the work horse of 
adler32.c (from zlib) needs spills in the inner loop, while gcc 3 did not, 
and those spills go away with -fno-tree-reassoc.

The underlying reason for the spills is register pressure, which could 
either be rectified by the pressure aware scheduling (which cris doesn't 
have), or by simply not generating high pressure code to begin with.  In 
this case it's TER which ultimately causes the register pressure to 
increase, and there are many plans in people minds how to fix this (make 
TER regpressure aware, do some regpressure scheduling on gimple, or even 
more ambitious things), but this patch doesn't tackle this.  Instead it 
makes reassoc not generate the situation which causes TER to run wild.

TER increasing register pressure is a long standing problem and so it has 
some heuristics to avoid that.  One wobbly heuristic is that it doesn't 
TER expressions together that have the same base variable as their LHSs.
But reassoc generates only anonymous SSA names for its temporary 
subexpressions, so that TER heuristic doesn't apply.  In this testcase 
it's even the case that reassoc doesn't really change much code (one 
addition moves from the end to the beginning of the inner loop), so that 
whole rewriting is even pointless.

In any case, let's use copy_ssa_name instead of make_ssa_name, when we 
have an obvious LHS; that leads to TER making the same decisions with and 
without -fno-tree-reassoc, leading to the same register pressure and no
spills.

On x86-64 the effect is:
  before patch: 48 bytes stackframe, 24 stack 
    accesses (most of them in the loops), 576 bytes codesize;
  after patch: no stack frame, no stack accesses, 438 bytes codesize

On cris:
  before patch: 64 bytes stack frame, 27 stack access in loops, size of .s 
    145 lines
  after patch: 20 bytes stack frame (as it uses callee saved regs, which 
    is also complained about in the bug report), but no stack accesses
    in loops, size of .s: 125 lines

I'm wondering about testcase: should I add an x86-64 specific that tests 
for no stack accesses, or would that be too constraining in the future?

Regstrapped on x86-64-linux, no regressions.  Okay for trunk?


Ciao,
Michael.

	PR tree-optimization/37916
	* tree-ssa-reassoc.c (make_new_ssa_for_def): Use copy_ssa_name.
	(rewrite_expr_tree, linearize_expr, negate_value,
	repropagate_negates, attempt_builtin_copysign,
	reassociate_bb): Likewise.

Alexander Monakov Nov. 20, 2018, 2:06 p.m. UTC | #1

On Tue, 20 Nov 2018, Michael Matz wrote:
> 
> I'm wondering about testcase: should I add an x86-64 specific that tests 
> for no stack accesses, or would that be too constraining in the future?
> 
> Regstrapped on x86-64-linux, no regressions.  Okay for trunk?

By the way, this patch helps x86-64 on PR 84681.  Some unnecessary spills
remain in the loop, but not as many as without the patch.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84681

Alexander

Jeff Law Nov. 21, 2018, 12:12 a.m. UTC | #2

On 11/20/18 6:42 AM, Michael Matz wrote:
> Hi,
> 
> this bug report is about cris generating worse code since tree-ssa.  The 
> effect is also visible on x86-64.  The symptom is that the work horse of 
> adler32.c (from zlib) needs spills in the inner loop, while gcc 3 did not, 
> and those spills go away with -fno-tree-reassoc.
> 
> The underlying reason for the spills is register pressure, which could 
> either be rectified by the pressure aware scheduling (which cris doesn't 
> have), or by simply not generating high pressure code to begin with.  In 
> this case it's TER which ultimately causes the register pressure to 
> increase, and there are many plans in people minds how to fix this (make 
> TER regpressure aware, do some regpressure scheduling on gimple, or even 
> more ambitious things), but this patch doesn't tackle this.  Instead it 
> makes reassoc not generate the situation which causes TER to run wild.
> 
> TER increasing register pressure is a long standing problem and so it has 
> some heuristics to avoid that.  One wobbly heuristic is that it doesn't 
> TER expressions together that have the same base variable as their LHSs.
> But reassoc generates only anonymous SSA names for its temporary 
> subexpressions, so that TER heuristic doesn't apply.  In this testcase 
> it's even the case that reassoc doesn't really change much code (one 
> addition moves from the end to the beginning of the inner loop), so that 
> whole rewriting is even pointless.
> 
> In any case, let's use copy_ssa_name instead of make_ssa_name, when we 
> have an obvious LHS; that leads to TER making the same decisions with and 
> without -fno-tree-reassoc, leading to the same register pressure and no
> spills.
> 
> On x86-64 the effect is:
>   before patch: 48 bytes stackframe, 24 stack 
>     accesses (most of them in the loops), 576 bytes codesize;
>   after patch: no stack frame, no stack accesses, 438 bytes codesize
> 
> On cris:
>   before patch: 64 bytes stack frame, 27 stack access in loops, size of .s 
>     145 lines
>   after patch: 20 bytes stack frame (as it uses callee saved regs, which 
>     is also complained about in the bug report), but no stack accesses
>     in loops, size of .s: 125 lines
> 
> I'm wondering about testcase: should I add an x86-64 specific that tests 
> for no stack accesses, or would that be too constraining in the future?
I think that'd be a fine test.  Yea it's constraining, but isn't that
the point, to make sure we never generate something dumb for this code.

If we get to a point where we have to have a stack reference we can
re-evaluate the test.

> 
> Regstrapped on x86-64-linux, no regressions.  Okay for trunk?
> 
> 
> Ciao,
> Michael.
> 
> 	PR tree-optimization/37916
> 	* tree-ssa-reassoc.c (make_new_ssa_for_def): Use copy_ssa_name.
> 	(rewrite_expr_tree, linearize_expr, negate_value,
> 	repropagate_negates, attempt_builtin_copysign,
> 	reassociate_bb): Likewise.
I wonder if this would help 21182.  I don't think anyone ever looked to
see if reassoc was involved in the issues in that BZ.

OK for the trunk.

jeff

Richard Biener Nov. 21, 2018, 2:13 p.m. UTC | #3

On Wed, Nov 21, 2018 at 1:12 AM Jeff Law <law@redhat.com> wrote:
>
> On 11/20/18 6:42 AM, Michael Matz wrote:
> > Hi,
> >
> > this bug report is about cris generating worse code since tree-ssa.  The
> > effect is also visible on x86-64.  The symptom is that the work horse of
> > adler32.c (from zlib) needs spills in the inner loop, while gcc 3 did not,
> > and those spills go away with -fno-tree-reassoc.
> >
> > The underlying reason for the spills is register pressure, which could
> > either be rectified by the pressure aware scheduling (which cris doesn't
> > have), or by simply not generating high pressure code to begin with.  In
> > this case it's TER which ultimately causes the register pressure to
> > increase, and there are many plans in people minds how to fix this (make
> > TER regpressure aware, do some regpressure scheduling on gimple, or even
> > more ambitious things), but this patch doesn't tackle this.  Instead it
> > makes reassoc not generate the situation which causes TER to run wild.
> >
> > TER increasing register pressure is a long standing problem and so it has
> > some heuristics to avoid that.  One wobbly heuristic is that it doesn't
> > TER expressions together that have the same base variable as their LHSs.
> > But reassoc generates only anonymous SSA names for its temporary
> > subexpressions, so that TER heuristic doesn't apply.  In this testcase
> > it's even the case that reassoc doesn't really change much code (one
> > addition moves from the end to the beginning of the inner loop), so that
> > whole rewriting is even pointless.
> >
> > In any case, let's use copy_ssa_name instead of make_ssa_name, when we
> > have an obvious LHS; that leads to TER making the same decisions with and
> > without -fno-tree-reassoc, leading to the same register pressure and no
> > spills.

I don't think this is OK.  Take one example, in rewrite_expr_tree for the final
recursion case we replace

   a_1 = _2 + _3;

with sth else, like

    _4 = _5 + 1;

so we compute a new value that may not have been computed before and
certainly not into the user variable a.  If you change this to instead create

  a_4 = _5 + 1;

then this leads to wrong debug info showing a value for 'a' that never existed.
You can observe this with

unsigned int __attribute__((noinline,noipa))
foo (unsigned int a, unsigned int b, unsigned int c, unsigned int d)
{
  a = a + 1;
  a = a + b;
  a = a + c;
  a = a + d;
  a = a + 3;
  return a;
}
int main()
{
  volatile unsigned x = foo (1, 3, 5, 7);
  return 0;
}

when you build this with -O -g -fno-var-tracking-assignments (VTA seems
to hide the issue, probably due to not enough instructions in the end)
you observe

(gdb) s
foo (a=1, b=3, c=5, d=7) at t.c:6
6         a = a + c;
(gdb) disassemble
Dump of assembler code for function foo:
=> 0x00000000004004b2 <+0>:     lea    0x4(%rdx,%rcx,1),%eax
   0x00000000004004b6 <+4>:     add    %esi,%eax
   0x00000000004004b8 <+6>:     add    %edi,%eax
   0x00000000004004ba <+8>:     retq
End of assembler dump.
(gdb) p a
$1 = 1
(gdb) si
7         a = a + d;
(gdb) p a
$2 = 16
(gdb) si
9         return a;
(gdb) p a
$3 = 19
(gdb) si
0x00000000004004ba      9         return a;
(gdb) p a
$4 = <optimized out>
(gdb) p $eax
$5 = 20
(gdb) quit

printing values for a that never should occur.  The sequence
of "allowed" values is 1, 2, 5, 10, 17, 20.  Both 16 and 19
should never be printed.  It's quite obvious if you look at the
reassoc result which is

  a_11 = d_7(D) + 4;
  a_12 = a_11 + c_5(D);
  a_13 = a_12 + b_3(D);
  a_9 = a_13 + a_1(D);
  return a_9;

and the fact that var-tracking looks at REG_DECL for
debug info location generation.  With VTA we get

  # DEBUG BEGIN_STMT
  # DEBUG D#3 => a_1(D) + 1
  # DEBUG a => D#3
  # DEBUG BEGIN_STMT
  a_11 = d_7(D) + 4;
  # DEBUG D#2 => D#3 + b_3(D)
  # DEBUG a => D#2
  # DEBUG BEGIN_STMT
  a_12 = a_11 + c_5(D);
  # DEBUG D#1 => D#2 + c_5(D)
  # DEBUG a => D#1
  # DEBUG BEGIN_STMT
  a_13 = a_12 + b_3(D);
  # DEBUG a => D#1 + d_7(D)
  # DEBUG BEGIN_STMT
  a_9 = a_13 + a_1(D);
  # DEBUG a => a_9
  # DEBUG BEGIN_STMT
  return a_9;

but DEBUG stmts in itself do not make the code valid from
a debug perspective.  Note that the reassociated stmts can
become DEBUG stmts itself if they are later DCEd.

So your patch has to be much more careful to never change
the LHS of stmts that are adjusted (which I think reassoc already does).

Richard.

> > On x86-64 the effect is:
> >   before patch: 48 bytes stackframe, 24 stack
> >     accesses (most of them in the loops), 576 bytes codesize;
> >   after patch: no stack frame, no stack accesses, 438 bytes codesize
> >
> > On cris:
> >   before patch: 64 bytes stack frame, 27 stack access in loops, size of .s
> >     145 lines
> >   after patch: 20 bytes stack frame (as it uses callee saved regs, which
> >     is also complained about in the bug report), but no stack accesses
> >     in loops, size of .s: 125 lines
> >
> > I'm wondering about testcase: should I add an x86-64 specific that tests
> > for no stack accesses, or would that be too constraining in the future?
> I think that'd be a fine test.  Yea it's constraining, but isn't that
> the point, to make sure we never generate something dumb for this code.
>
> If we get to a point where we have to have a stack reference we can
> re-evaluate the test.
>
> >
> > Regstrapped on x86-64-linux, no regressions.  Okay for trunk?
> >
> >
> > Ciao,
> > Michael.
> >
> >       PR tree-optimization/37916
> >       * tree-ssa-reassoc.c (make_new_ssa_for_def): Use copy_ssa_name.
> >       (rewrite_expr_tree, linearize_expr, negate_value,
> >       repropagate_negates, attempt_builtin_copysign,
> >       reassociate_bb): Likewise.
> I wonder if this would help 21182.  I don't think anyone ever looked to
> see if reassoc was involved in the issues in that BZ.
>
> OK for the trunk.
>
> jeff

Jeff Law Nov. 21, 2018, 2:16 p.m. UTC | #4

On 11/21/18 7:13 AM, Richard Biener wrote:
> On Wed, Nov 21, 2018 at 1:12 AM Jeff Law <law@redhat.com> wrote:
>>
>> On 11/20/18 6:42 AM, Michael Matz wrote:
>>> Hi,
>>>
>>> this bug report is about cris generating worse code since tree-ssa.  The
>>> effect is also visible on x86-64.  The symptom is that the work horse of
>>> adler32.c (from zlib) needs spills in the inner loop, while gcc 3 did not,
>>> and those spills go away with -fno-tree-reassoc.
>>>
>>> The underlying reason for the spills is register pressure, which could
>>> either be rectified by the pressure aware scheduling (which cris doesn't
>>> have), or by simply not generating high pressure code to begin with.  In
>>> this case it's TER which ultimately causes the register pressure to
>>> increase, and there are many plans in people minds how to fix this (make
>>> TER regpressure aware, do some regpressure scheduling on gimple, or even
>>> more ambitious things), but this patch doesn't tackle this.  Instead it
>>> makes reassoc not generate the situation which causes TER to run wild.
>>>
>>> TER increasing register pressure is a long standing problem and so it has
>>> some heuristics to avoid that.  One wobbly heuristic is that it doesn't
>>> TER expressions together that have the same base variable as their LHSs.
>>> But reassoc generates only anonymous SSA names for its temporary
>>> subexpressions, so that TER heuristic doesn't apply.  In this testcase
>>> it's even the case that reassoc doesn't really change much code (one
>>> addition moves from the end to the beginning of the inner loop), so that
>>> whole rewriting is even pointless.
>>>
>>> In any case, let's use copy_ssa_name instead of make_ssa_name, when we
>>> have an obvious LHS; that leads to TER making the same decisions with and
>>> without -fno-tree-reassoc, leading to the same register pressure and no
>>> spills.
> 
> I don't think this is OK.  Take one example, in rewrite_expr_tree for the final
> recursion case we replace
> 
>    a_1 = _2 + _3;
> 
> with sth else, like
> 
>     _4 = _5 + 1;
> 
> so we compute a new value that may not have been computed before and
> certainly not into the user variable a.  If you change this to instead create
> 
>   a_4 = _5 + 1;
> 
> then this leads to wrong debug info showing a value for 'a' that never existed.
> You can observe this with
But isn't the point to use an underlying SSA_NAME_VAR when we have one
that should be appropriate?  Are we just being too aggressive with using
copy_ssa_name?

jeff

Richard Biener Nov. 21, 2018, 2:27 p.m. UTC | #5

On Wed, Nov 21, 2018 at 3:16 PM Jeff Law <law@redhat.com> wrote:
>
> On 11/21/18 7:13 AM, Richard Biener wrote:
> > On Wed, Nov 21, 2018 at 1:12 AM Jeff Law <law@redhat.com> wrote:
> >>
> >> On 11/20/18 6:42 AM, Michael Matz wrote:
> >>> Hi,
> >>>
> >>> this bug report is about cris generating worse code since tree-ssa.  The
> >>> effect is also visible on x86-64.  The symptom is that the work horse of
> >>> adler32.c (from zlib) needs spills in the inner loop, while gcc 3 did not,
> >>> and those spills go away with -fno-tree-reassoc.
> >>>
> >>> The underlying reason for the spills is register pressure, which could
> >>> either be rectified by the pressure aware scheduling (which cris doesn't
> >>> have), or by simply not generating high pressure code to begin with.  In
> >>> this case it's TER which ultimately causes the register pressure to
> >>> increase, and there are many plans in people minds how to fix this (make
> >>> TER regpressure aware, do some regpressure scheduling on gimple, or even
> >>> more ambitious things), but this patch doesn't tackle this.  Instead it
> >>> makes reassoc not generate the situation which causes TER to run wild.
> >>>
> >>> TER increasing register pressure is a long standing problem and so it has
> >>> some heuristics to avoid that.  One wobbly heuristic is that it doesn't
> >>> TER expressions together that have the same base variable as their LHSs.
> >>> But reassoc generates only anonymous SSA names for its temporary
> >>> subexpressions, so that TER heuristic doesn't apply.  In this testcase
> >>> it's even the case that reassoc doesn't really change much code (one
> >>> addition moves from the end to the beginning of the inner loop), so that
> >>> whole rewriting is even pointless.
> >>>
> >>> In any case, let's use copy_ssa_name instead of make_ssa_name, when we
> >>> have an obvious LHS; that leads to TER making the same decisions with and
> >>> without -fno-tree-reassoc, leading to the same register pressure and no
> >>> spills.
> >
> > I don't think this is OK.  Take one example, in rewrite_expr_tree for the final
> > recursion case we replace
> >
> >    a_1 = _2 + _3;
> >
> > with sth else, like
> >
> >     _4 = _5 + 1;
> >
> > so we compute a new value that may not have been computed before and
> > certainly not into the user variable a.  If you change this to instead create
> >
> >   a_4 = _5 + 1;
> >
> > then this leads to wrong debug info showing a value for 'a' that never existed.
> > You can observe this with
> But isn't the point to use an underlying SSA_NAME_VAR when we have one
> that should be appropriate?  Are we just being too aggressive with using
> copy_ssa_name?

Well, sure - there _might_ be places that could use copy_ssa_name in reassoc
but I doubt so.  Usually code-generation should not use copy_ssa_name given
the aforementioned issues.

Richard.

>
> jeff
>

Michael Matz Nov. 22, 2018, 1:30 p.m. UTC | #6

Hi,

On Wed, 21 Nov 2018, Richard Biener wrote:

> then this leads to wrong debug info showing a value for 'a' that never 
> existed.

var-tracking was specifically created so that the base-vars of SSA names 
aren't necessary for debug info anymore ...

> when you build this with -O -g -fno-var-tracking-assignments (VTA seems
> to hide the issue, probably due to not enough instructions in the end)

... which makes this an non-test.  Sure, if you cripple debug info 
generation you get ... crippled debug info.

> and the fact that var-tracking looks at REG_DECL for
> debug info location generation.  With VTA we get
> 
>   # DEBUG BEGIN_STMT
>   # DEBUG D#3 => a_1(D) + 1
>   # DEBUG a => D#3
>   # DEBUG BEGIN_STMT
>   a_11 = d_7(D) + 4;
>   # DEBUG D#2 => D#3 + b_3(D)
>   # DEBUG a => D#2
>   # DEBUG BEGIN_STMT
>   a_12 = a_11 + c_5(D);
>   # DEBUG D#1 => D#2 + c_5(D)
>   # DEBUG a => D#1
>   # DEBUG BEGIN_STMT
>   a_13 = a_12 + b_3(D);
>   # DEBUG a => D#1 + d_7(D)
>   # DEBUG BEGIN_STMT
>   a_9 = a_13 + a_1(D);
>   # DEBUG a => a_9

This seems to be exactly correct and reflecting the original code, i.e. 
VTA works as designed even with "invented" base variables.

> So your patch has to be much more careful to never change the LHS of 
> stmts that are adjusted (which I think reassoc already does).

I can't do much more in reassoc than what I implemented, it's constrained 
to cases where we have a known LHS.  If you think that's still too much 
then the problem needs to remain unfixed in reassoc, and it would need TER 
changes.  An easy way would be to disable TERin also of SSA names having 
no base var (essentially on the ground that it's the same base var then), 
but that probably disables TER too much, though I haven't tried.

In addition: I'd always happily trade less correct debug info (where the 
impact of the incorrectness is fairly trivial) with no stack accesses in 
inner loops; i.e. a tradeoff.

Ciao,
Michael.

Jakub Jelinek Nov. 22, 2018, 1:34 p.m. UTC | #7

On Thu, Nov 22, 2018 at 01:30:03PM +0000, Michael Matz wrote:
> On Wed, 21 Nov 2018, Richard Biener wrote:
> 
> > then this leads to wrong debug info showing a value for 'a' that never 
> > existed.
> 
> var-tracking was specifically created so that the base-vars of SSA names 
> aren't necessary for debug info anymore ...

Not to my knowledge.  The original var-tracking used that information
almost exclusively (whatever ends up in REG_EXPR/MEM_EXPR and that is based
on the base vars of SSA_NAMEs).
VTA doesn't need that for vars for which there are debug stmts, but
otherwise it is still used.

	Jakub

Richard Biener Nov. 22, 2018, 1:56 p.m. UTC | #8

On Thu, Nov 22, 2018 at 2:34 PM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Thu, Nov 22, 2018 at 01:30:03PM +0000, Michael Matz wrote:
> > On Wed, 21 Nov 2018, Richard Biener wrote:
> >
> > > then this leads to wrong debug info showing a value for 'a' that never
> > > existed.
> >
> > var-tracking was specifically created so that the base-vars of SSA names
> > aren't necessary for debug info anymore ...
>
> Not to my knowledge.  The original var-tracking used that information
> almost exclusively (whatever ends up in REG_EXPR/MEM_EXPR and that is based
> on the base vars of SSA_NAMEs).
> VTA doesn't need that for vars for which there are debug stmts, but
> otherwise it is still used.

Also it still uses it even for vars for which there are debug stmts,
no?  We just
happen to be lucky for the too simple testcase that only the correct values
survive for the very few $pc addresses?

Richard.

>         Jakub

Fix PR37916 (unnecessary spilling)

Commit Message

Comments

Patch