
[v2,0/9] Separate shrink-wrapping

Message ID 81710c02-05bf-fb65-dedc-8ba389c0d8e8@redhat.com
State New

Commit Message

Bernd Schmidt Aug. 26, 2016, 1:03 p.m. UTC
On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> This is the second version.  Concern was renamed to component, and all
> other comments were addressed (I hope).

Not really, I'm afraid. There still seems to be no detailed explanation 
of the algorithm. Instead, there is a vague outline (which should be 
expanded to at least the level of detail found e.g. in tree-ssa-pre.c), 
and inside the code there are still meaningless comments of the form

/* Deselect those epilogue components that should not be inserted
    for this edge.  */

which don't tell me anything about what the intention is (why should 
they not be inserted?). The result is that as a reader, I still find 
myself unable to determine whether the algorithm is correct or not.

Worse, I'm still not entirely certain what it is intended to achieve: I 
asked for a motivating example or two, but haven't seen one in the 
comments anywhere. My most general question would be why the algorithm 
for placing individual prologue components would be different from the 
regular shrink-wrapping algorithm for full prologues. Examples and/or 
arguments relating to how this new code acts with regard to loops are 
also particularly needed.

So, I think improvement is necessary in these points, but I also think 
that there's room for experimental verification by way of self-tests. If 
done thoroughly (testing the transformation on several sufficiently 
random CFGs and maybe some manually constructed ones) it would go a long 
way towards showing that at least we don't have to worry too much about 
miscompilations. That's what I've been working on, and an initial patch 
is below. This is incomplete and posted more as an RFD since we're 
getting towards the end of the week: there are gaps in the test 
coverage, and it currently fails the selftest. I assume the latter is a 
problem with my code, but it wouldn't hurt if you could take a look; 
maybe I misunderstood something entirely about what the separate 
shrink-wrapping is supposed to achieve, or maybe I messed up the 
algorithm with my changes.


Bernd

Comments

David Malcolm Aug. 26, 2016, 1:48 p.m. UTC | #1
On Fri, 2016-08-26 at 15:03 +0200, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
> > This is the second version.  Concern was renamed to component, and
> > all
> > other comments were addressed (I hope).
> 
> Not really, I'm afraid. There still seems to be no detailed
> explanation 
> of the algorithm. Instead, there is a vague outline (which should be 
> expanded to at least the level of detail found e.g. in tree-ssa
> -pre.c), 
> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>     for this edge.  */
> 
> which don't tell me anything about what the intention is (why should 
> they not be inserted?). The result is that as a reader, I still find 
> myself unable to determine whether the algorithm is correct or not.
> 
> Worse, I'm still not entirely certain what it is intended to achieve:
> I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere. My most general question would be why the
> algorithm 
> for placing individual prologue components would be different from
> the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.
> 
> So, I think improvement is necessary in these points, but I also
> think 
> that there's room for experimental verification by way of self-tests.
> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a
> long 
> way towards showing that at least we don't have to worry too much
> about 
> miscompilations. That's what I've been working on, and an initial
> patch 
> is below. This is incomplete and posted more as an RFD since we're 
> getting towards the end of the week: there are gaps in the test 
> coverage, and it currently fails the selftest. I assume the latter is
> a 
> problem with my code, but it wouldn't hurt if you could take a look; 
> maybe I misunderstood something entirely about what the separate 
> shrink-wrapping is supposed to achieve, or maybe I messed up the 
> algorithm with my changes.
> 

Bernd: I'm very happy to see someone else using the selftest framework.

(My mailer isn't letting me quote the patch body, sorry).

I'm nervous about the build_random_cfg function: randomness in
selftests seems like a double-edged sword.  On the one hand, we can use
it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
the other hand, does every host generate the same sequence of values?
Presumably we want everyone to be effectively running the same
selftests.

Is this using libiberty's implementation of "random", or can it use the
underlying host libc implementation?  Are there any cross-platform
guarantees on the sequence of values that are returned for a particular
seed?  I don't want us to have host-specific failures that turn out to
be due to host differences in the RNG.

Do we need to re-seed the RNG before each test?  (or to rework
libiberty's random to bundle up the RNG state in a class, and use
that).  Otherwise, the meaning of the test could change if someone adds
another random-using selftest before this one.

Maybe there's a need for some kind of selftest::rng class here?
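
For concreteness, something along these lines is what I have in mind -- every
name below is made up, this is not an existing GCC API -- an object that owns
its own state and uses a fixed LCG, so the sequence is identical on every host
and per-test seeding is trivial:

#include <stdint.h>

namespace selftest {

/* Self-contained pseudo-random number generator for selftests.  Each test
   constructs its own instance with a fixed seed, so tests cannot perturb
   each other, and every host sees the same sequence for a given seed.  */
class rng
{
public:
  explicit rng (uint64_t seed) : m_state (seed) {}

  /* Return the next pseudo-random value in [0, 2^31).  The constants are
     those of Knuth's MMIX LCG.  */
  uint32_t next ()
  {
    m_state = m_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t) (m_state >> 33);
  }

private:
  uint64_t m_state;
};

} // namespace selftest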

On an unrelated note, should the various vfunc implementations be marked
with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
better chance of devirtualizing them in a release build?
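
Purely as an illustration of the shape (the class and member names here are
invented, not from the patch):

/* FINAL and OVERRIDE are the coretypes.h macros that expand to the C++11
   "final"/"override" keywords when the host compiler supports them; on a
   leaf implementation class they give the compiler a better chance to
   devirtualize these calls in a release build.  */
class test_sw_hooks : public sw_hooks
{
public:
  virtual bool needs_component (int which, basic_block bb) FINAL OVERRIDE;
  virtual void emit_prologue (int which, edge e) FINAL OVERRIDE;
};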

Hope this is constructive
Dave
Bernd Schmidt Aug. 26, 2016, 1:55 p.m. UTC | #2
On 08/26/2016 03:48 PM, David Malcolm wrote:
> I'm nervous about the build_random_cfg function: randomness in
> selftests seems like a double-edged sword.  On the one hand, we can use
> it to fuzz-test an optimization to rapidly gain a lot of coverage.  On
> the other hand, does every host generate the same sequence of values?
> Presumably we want everyone to be effectively running the same
> selftests.

I intended to put a srandom call in there at the start but forgot. One 
could argue that there's value in having different tests on different 
hosts, as long as they are stable between runs, but...

> Maybe there's a need for some kind of selftest::rng class here?

That might be a good idea too.

> On a unrelated note, should the various vfunc implementations be marked
> with "FINAL OVERRIDE" (from coretypes.h), so that the compiler has a
> better chance of devirtualizing them in a release build?

Wasn't aware of that - will have a look.

Maybe we could also use a long-running "-fselftest=extensive" for 
stress-testing algorithms such as this (increasing the number of random 
cfgs it tries)?


Bernd
Segher Boessenkool Aug. 26, 2016, 2:50 p.m. UTC | #3
Hi Bernd,

On Fri, Aug 26, 2016 at 03:03:43PM +0200, Bernd Schmidt wrote:
> Not really, I'm afraid. There still seems to be no detailed explanation 
> of the algorithm. Instead, there is a vague outline

Did you read the description of 8/9?  If you think any of that needs to
be in the code, please just say so.

> and inside the code there are still meaningless comments of the form
> 
> /* Deselect those epilogue components that should not be inserted
>    for this edge.  */

You asked for a comment here, see
https://gcc.gnu.org/ml/gcc-patches/2016-07/msg00932.html
("what's the purpose of this", etc.)

> Worse, I'm still not entirely certain what it is intended to achieve: I 
> asked for a motivating example or two, but haven't seen one in the 
> comments anywhere.

"Make faster code", like all other optimisation passes do as well.

I commented repeatedly about (micro-)benchmark results.

The head comment starts with

+/* Separate shrink-wrapping
+
+   Instead of putting all of the prologue and epilogue in one spot, we
+   can put parts of it in places where those components are executed less
+   frequently.

and that is the long and short of it.

> My most general question would be why the algorithm 
> for placing individual prologue components would be different from the 
> regular shrink-wrapping algorithm for full prologues. Examples and/or 
> arguments relating to how this new code acts with regard to loops are 
> also particularly needed.

The full-prologue algorithm makes as many blocks run without prologue as
possible, by duplicating blocks where that helps.  If you do this for
every component you can and up with 2**40 blocks for just 40 components,
not such a hot plan.  The "separate" algo does not duplicate blocks at all
(its only code growth is for all the prologue/epilogue components inserted,
and for some forwarder blocks sometimes).

The full-prologue algorithm only ever inserts exactly one prologue, far
from optimal.  But it *cannot* duplicate it with the semantics we have.
The separate components can of course be duplicated; it's a new abstraction,
and being duplicable is part of the requirements for it.

> So, I think improvement is necessary in these points, but I also think 
> that there's room for experimental verification by way of self-tests.

Since it is used everywhere I think that is a big waste of time (it is
tested a lot already).  Testing specific problem cases can of course help;
but we have "make check" for that already anwyay. Also, I think "self-tests"
looking at the internals are just wrong (the only sane way to update such
tests when changing the code is to delete the tests, since they should be
written independently; otherwise you just have two copies of the same).  And
the useless abstractions are just useless.

The algorithm is procedural; writing it in procedural style is much clearer
than hiding everything.

> If 
> done thoroughly (testing the transformation on several sufficiently 
> random CFGs and maybe some manually constructed ones) it would go a long 
> way towards showing that at least we don't have to worry too much about 
> miscompilations.

If you hadn't noticed, the algorithm is constructed in such a way as to
minimise possible miscompilations: it first determines what blocks need
what components "active", and then makes it so, in a separate (much more
trivial) phase.
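
In pseudo-code the overall shape is roughly the following -- all of the
identifier names here are only illustrative, they are not the ones in the
patch:

static void
do_separate_shrink_wrapping (void)
{
  /* Phase 1: for each component, decide in which basic blocks it should
     be "active", using execution frequencies as the cost metric.  */
  for (component c : all_components ())
    choose_active_blocks (c);

  /* Phase 2: make it so.  Wherever the set of active components differs
     between the two ends of an edge, emit the corresponding prologue or
     epilogue components on that edge (or at a block head/tail when all of
     its incoming/outgoing edges agree).  Keeping the decision separate
     from the transformation is what keeps this phase trivial.  */
  for (edge e : all_edges ())
    emit_changed_components (e);
}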

Wrt your patch...  making each component needed half of the time is not
very realistic, you'll end up with all components active in most blocks.

Using bools as parameters where they do not mean "yes/no" is confusing.

It seems you no longer do the "insert at head" and "insert at tail"
before the "insert on edge"?  This cannot work AFAICS.

Some utility funcs print dump info; the caller should, instead.

You make "components" global (as "m_components").


Segher
Bernd Schmidt Aug. 26, 2016, 3:03 p.m. UTC | #4
On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
> The head comment starts with
>
> +/* Separate shrink-wrapping
> +
> +   Instead of putting all of the prologue and epilogue in one spot, we
> +   can put parts of it in places where those components are executed less
> +   frequently.
>
> and that is the long and short of it.

And that comment puzzles me. Surely prologue and epilogue are executed 
only once currently, so how does frequency come into it? Again - please 
provide an example.

> The full-prologue algorithm makes as many blocks run without prologue as
> possible, by duplicating blocks where that helps.  If you do this for
> every component you can and up with 2**40 blocks for just 40 components,

Ok, so why wouldn't we use the existing code with the duplication part 
disabled? That's a later addition anyway and isn't necessary to do 
shrink-wrapping in the first place.


Bernd
Segher Boessenkool Aug. 26, 2016, 4:27 p.m. UTC | #5
On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:
> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
> >The head comment starts with
> >
> >+/* Separate shrink-wrapping
> >+
> >+   Instead of putting all of the prologue and epilogue in one spot, we
> >+   can put parts of it in places where those components are executed less
> >+   frequently.
> >
> >and that is the long and short of it.
> 
> And that comment puzzles me. Surely prologue and epilogue are executed 
> only once currently, so how does frequency come into it? Again - please 
> provide an example.

If some component is only needed for 0.01% of executions of a function,
running it once for every execution is 10000 times too much.

The trivial example is a function that does an early exit, but uses one
or a few non-volatile registers before that exit.  This happens in e.g.
glibc's malloc, if you want an easily accessed example.  With the current
code, *all* components will be saved and then restored shortly afterwards.
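
A made-up reduction of that shape (this is not the glibc code, just an
illustration):

#include <cstddef>

extern void *expensive_setup (size_t);
extern void finish_setup (void *);

void *cached;

/* The early-exit path is hot and touches no callee-saved registers, yet
   today the prologue still saves every register the cold path below will
   clobber.  With separate shrink-wrapping those saves can move past the
   early return, into the cold path.  */
void *
get_buffer (size_t n)
{
  if (cached)
    return cached;                /* hot: needs no saves at all */

  void *r = expensive_setup (n);  /* cold: clobbers callee-saved registers */
  finish_setup (r);
  cached = r;
  return r;
}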

> >The full-prologue algorithm makes as many blocks run without prologue as
> >possible, by duplicating blocks where that helps.  If you do this for
> >every component you can and up with 2**40 blocks for just 40 components,
> 
> Ok, so why wouldn't we use the existing code with the duplication part 
> disabled?

That would not perform nearly as well.

> That's a later addition anyway and isn't necessary to do 
> shrink-wrapping in the first place.

No, it always did that, just not as often (it only duplicated straight-line
code before).


Segher
Michael Matz Aug. 30, 2016, 12:31 p.m. UTC | #6
Hi,

On Fri, 26 Aug 2016, Bernd Schmidt wrote:

> And that comment puzzles me. Surely prologue and epilogue are executed only
> once currently, so how does frequency come into it? Again - please provide an
> example.

int some_global;
int foo (void) {
  if (!some_global) {
    call_this();
    call_that();
    x = some + large / computation;
    while (x--) { much_more_code(); }
    some_global = 1;
  }
  return some_global;
}

Now you need the full pro/epilogue (saving/restoring all call clobbered 
regs presumably used by the large computation and loop) for only one call 
of that function (the first), and then never again.  But you always do need
a part of it (assuming the right architecture, and that this is a shared
library), namely the setup of the PIC pointer for the access to
some_global.

A prologue does many things, and in some cases many of them aren't 
necessary for all calls (indeed that's often the case for functions 
containing early-outs), so much so that the full prologue including 
unnecessary parts needs more time than the functional body of a function's
particular call.  It's obvious that it would be better to move those parts
of the prologue to places where they actually are needed:

int f2 (void) {
  setup_pic_reg();
  if (!some_global) {
    save_call_clobbers();
    code_from_above();
    restore_call_clobbers();
  }
  retreg = some_global;
  restore_pic_reg();
}

That includes moving parts of the prologue into loops:

int g() {
  int sum = 0;
  for (int i = 0; i < NUM; i++) {
    sum += i;
    if (sum >= CUTOFF) {
      some_long_winded_expression();
      that_eventually_calls_abort();
    }
  }
  return sum;
}

Here all parts of the prologue that somehow make it possible to call other 
functions are necessary only when the program will abort eventually: hence
they are needed in at most one call of g().  Again it's sensible to move
those parts inside the unlikely condition, even though it's inside a loop.


Ciao,
Michael.
Jeff Law Sept. 8, 2016, 4:28 p.m. UTC | #7
On 08/30/2016 06:31 AM, Michael Matz wrote:
> Hi,
>
> On Fri, 26 Aug 2016, Bernd Schmidt wrote:
>
>> And that comment puzzles me. Surely prologue and epilogue are executed only
>> once currently, so how does frequency come into it? Again - please provide an
>> example.
>
> int some_global;
> int foo (void) {
>   if (!some_global) {
>     call_this();
>     call_that();
>     x = some + large / computation;
>     while (x--) { much_more_code(); }
>     some_global = 1;
>   }
>   return some_global;
> }
>
> Now you need the full pro/epilogue (saving/restoring all call clobbered
> regs presumably used by the large computation and loop) for only one call
> of that function (the first), and then never again.  But you do need a
> part of it always assuming the architecture is right, and this is a shared
> library, namely the setup of the PIC pointer for the access to
> some_global.

>
> A prologue does many things, and in some cases many of them aren't
> necessary for all calls (indeed that's often the case for functions
> containing early-outs), so much so that the full prologue including
> unnecessary parts needs more time than the functional body of a functions
> particular call.  It's obvious that it would be better to move those parts
> of the prologue to places where they actually are needed:
[ ... ]
Right.  Essentially Segher's patch introduces the concept of prologue 
components that are independent of each other and which can be 
shrink-wrapped separately.  The degree of independence is highly target 
specific, of course.

I think one of the questions (and I haven't looked through the whole 
thread yet to see if it's answered) is why the basic shrink-wrapping 
algorithm can't be applied to each of the prologue components -- though 
you may have answered it with your loop example below.


>
> int f2 (void) {
>   setup_pic_reg();
>   if (!some_global) {
>     save_call_clobbers();
>     code_from_above();
>     restore_call_clobbers();
>   }
>   retreg = some_global;
>   restore_pic_reg();
> }
>
> That includes moving parts of the prologue into loops:
>
> int g() {
>   int sum = 0;
>   for (int i = 0; i < NUM; i++) {
>     sum += i;
>     if (sum >= CUTOFF) {
>       some_long_winded_expression();
>       that_eventually_calls_abort();
>     }
>   }
>   return sum;
> }
>
> Here all parts of the prologue that somehow make it possible to call other
> functions are necessary only when the program will abort eventually: hence
> is necessary only at one call of g() at most.  Again it's sensible to move
> those parts inside the unlikely condition, even though it's inside a loop.
Thanks.  I'd been wondering about when it'd be useful to push prologue 
code into a loop nest when I saw the patches fly by, but didn't think 
about it too much.  I haven't looked at the shrink-wrapping literature 
in years, but I don't recall it having any concept that there were cases 
where sinking into a loop nest was profitable.

Jeff
Jeff Law Sept. 8, 2016, 4:41 p.m. UTC | #8
On 08/26/2016 10:27 AM, Segher Boessenkool wrote:
> On Fri, Aug 26, 2016 at 05:03:34PM +0200, Bernd Schmidt wrote:
>> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
>>> The head comment starts with
>>>
>>> +/* Separate shrink-wrapping
>>> +
>>> +   Instead of putting all of the prologue and epilogue in one spot, we
>>> +   can put parts of it in places where those components are executed less
>>> +   frequently.
>>>
>>> and that is the long and short of it.
>>
>> And that comment puzzles me. Surely prologue and epilogue are executed
>> only once currently, so how does frequency come into it? Again - please
>> provide an example.
>
> If some component is only needed for 0.01% of executions of a function,
> running it once for every execution is 10000 times too much.
>
> The trivial example is a function that does an early exit, but uses one
> or a few non-volatile registers before that exit.  This happens in e.g.
> glibc's malloc, if you want an easily accessed example.  With the current
> code, *all* components will be saved and then restored shortly afterwards.
So can you expand on the malloc example a bit -- I'm pretty sure I 
understand what you're trying to do, but a concrete example may help 
Bernd and be useful for archival purposes.

I also know that Carlos is interested in the malloc example -- so I'd 
like to be able to pass that along to him.

Given the multiple early exit and fast paths through the allocator, I'm 
not at all surprised that sinking different components of the prologue 
to different locations is useful.

Also if there's a case where sinking into a loop occurs, definitely 
point that out.

>
>>> The full-prologue algorithm makes as many blocks run without prologue as
>>> possible, by duplicating blocks where that helps.  If you do this for
>>> every component you can and up with 2**40 blocks for just 40 components,
>>
>> Ok, so why wouldn't we use the existing code with the duplication part
>> disabled?
>
> That would not perform nearly as well.
>
>> That's a later addition anyway and isn't necessary to do
>> shrink-wrapping in the first place.
>
> No, it always did that, just not as often (it only duplicated straight-line
> code before).
Presumably (I haven't looked yet), the duplication is so that we can 
isolate one or more paths which in turn allows sinking the prologue 
further on some of those paths.

This is something I'll definitely want to look at -- block duplication 
to facilitate code elimination (or in this case avoid code insertion) 
hits several areas of interest to me -- and how we balance duplication 
vs runtime savings is always interesting.

Jeff
Jeff Law Sept. 8, 2016, 4:58 p.m. UTC | #9
On 08/26/2016 09:03 AM, Bernd Schmidt wrote:
> On 08/26/2016 04:50 PM, Segher Boessenkool wrote:
>> The head comment starts with
>>
>> +/* Separate shrink-wrapping
>> +
>> +   Instead of putting all of the prologue and epilogue in one spot, we
>> +   can put parts of it in places where those components are executed
>> less
>> +   frequently.
>>
>> and that is the long and short of it.
>
> And that comment puzzles me. Surely prologue and epilogue are executed
> only once currently, so how does frequency come into it? Again - please
> provide an example.
Right, they're executed once currently.  But the prologue could be sunk 
into locations where it is not executed every time the function is 
called.  That's the basis behind shrink wrapping.

Segher's code essentially allows individual components of the prologue 
to sink to different points within the function rather than forcing the 
prologue to be sunk as an atomic unit.

Conceptually you could run the standard algorithm on each independent 
component.


>
>> The full-prologue algorithm makes as many blocks run without prologue as
>> possible, by duplicating blocks where that helps.  If you do this for
>> every component you can and up with 2**40 blocks for just 40 components,
>
> Ok, so why wouldn't we use the existing code with the duplication part
> disabled? That's a later addition anyway and isn't necessary to do
> shrink-wrapping in the first place.
I think the concern here is the balance between code explosion and the 
runtime gains.

jeff
Segher Boessenkool Sept. 9, 2016, 6:19 a.m. UTC | #10
Thanks for looking at the patches Jeff.

On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:
> Right.  Essentially Segher's patch introduces the concept of prologue 
> components that are independent of each other and which can be 
> shrink-wrapped separately.  The degree of independence is highly target 
> specific, of course.

There can be dependencies as well (e.g. a save to the frame requires that
frame to be set up already), but the current patches have no way to
describe such dependencies.  I haven't found a good way to describe the
dependencies yet.  Finding a good balance between general and useful isn't
easy, as usual.

> I think one of the questions (and I haven't looked through the whole 
> thread yet to see if it's answered) is why the basic shrink-wrapping 
> algorithm can't be applied to each of the prologue components -- though 
> you may have answered it with your loop example below.

You get code size explosion with the "basic" algorithm.  And a lot of
that isn't really worth it: avoiding the whole prologue/epilogue is
usually worth copying some blocks for, but avoiding just a single register
save/restore pair?  More doubtful.

> >That includes moving parts of the prologue into loops:
> >
> >int g() {
> >  int sum = 0;
> >  for (int i = 0; i < NUM; i++) {
> >    sum += i;
> >    if (sum >= CUTOFF) {
> >      some_long_winded_expression();
> >      that_eventually_calls_abort();
> >    }
> >  }
> >  return sum;
> >}
> >
> >Here all parts of the prologue that somehow make it possible to call other
> >functions are necessary only when the program will abort eventually: hence
> >is necessary only at one call of g() at most.  Again it's sensible to move
> >those parts inside the unlikely condition, even though it's inside a loop.
> Thanks.  I'd been wondering about when it'd be useful to push prologue 
> code into a loop nest when I saw the patches fly by, but didn't think 
> about it too much.  I haven't looked at the shrink-wrapping literature 
> in years, but I don't recall it having any concept that there were cases 
> where sinking into a loop nest was profitable.

It isn't common, but it does happen.  If you use a proper cost metric
based on execution frequency with some algorithm then that algorithm will
naturally avoid placing *logues into loops.


Segher
Segher Boessenkool Sept. 9, 2016, 3:17 p.m. UTC | #11
On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> So can you expand on the malloc example a bit -- I'm pretty sure I 
> understand what you're trying to do, but a concrete example may help 
> Bernd and be useful for archival purposes.

Sure, but it's big (which is the problem :-) )

> I also know that Carlos is interested in the malloc example -- so I'd 
> like to be able to pass that along to him.
> 
> Given the multiple early exit and fast paths through the allocator, I'm 
> not at all surprised that sinking different components of the prologue 
> to different locations is useful.
> 
> Also if there's a case where sinking into a loop occurs, definitely 
> point that out.

Not sure that happens in there, I'll find out.

> >>That's a later addition anyway and isn't necessary to do
> >>shrink-wrapping in the first place.
> >
> >No, it always did that, just not as often (it only duplicated straight-line
> >code before).
> Presumably (I haven't looked yet), the duplication is so that we can 
> isolate one or more paths which in turn allows sinking the prologue 
> further on some of those paths.

It duplicates as many blocks as it needs to dup, to make as many exits
as possible reachable without *any* prologue/epilogue.

As the header comment before the older code says:

/* Try to perform a kind of shrink-wrapping, making sure the
   prologue/epilogue is emitted only around those parts of the
   function that require it.

   There will be exactly one prologue, and it will be executed either
   zero or one time, on any path.  Depending on where the prologue is
   placed, some of the basic blocks can be reached via both paths with
   and without a prologue.  Such blocks will be duplicated here, and the
   edges changed to match.

   Paths that go to the exit without going through the prologue will use
   a simple_return instead of the epilogue.  We maximize the number of
   those, making sure to only duplicate blocks that can be duplicated.
   If the prologue can then still be placed in multiple locations, we
   place it as early as possible.


> This is something I'll definitely want to look at -- block duplication 
> to facilitate code elimination (or in this case avoid code insertion) 
> hits several areas of interest to me -- and how we balance duplication 
> vs runtime savings is always interesting.

Yeah.  If you use separate shrink-wrapping you still *also* get the
"old" algorithm, if that wasn't clear.


Segher
Jeff Law Sept. 9, 2016, 3:26 p.m. UTC | #12
On 09/09/2016 12:19 AM, Segher Boessenkool wrote:
> Thanks for looking at the patches Jeff.
>
> On Thu, Sep 08, 2016 at 10:28:59AM -0600, Jeff Law wrote:
>> Right.  Essentially Segher's patch introduces the concept of prologue
>> components that are independent of each other and which can be
>> shrink-wrapped separately.  The degree of independence is highly target
>> specific, of course.
>
> There can be dependencies as well (e.g. a save to the frame requires that
> frame to be set up already), but the current patches have no way to
> describe such dependencies.  I haven't found a good way to describe the
> dependencies yet.  Finding a good balance between general and useful isn't
> easy, as usual.
Right.  I over-simplified a bit here.  Some dependencies can be handled 
via the hooks in the target files.  I think what we've got is sufficient 
for now -- we'll have a much clearer picture of weak spots in the design 
if/when someone enables separate shrink wrapping for another target. 
Until then, I think we're OK.

>
>> I think one of the questions (and I haven't looked through the whole
>> thread yet to see if it's answered) is why the basic shrink-wrapping
>> algorithm can't be applied to each of the prologue components -- though
>> you may have answered it with your loop example below.
>
> You get code size explosion with the "basic" algorithm.  And a lot of
> that isn't really worth it: avoiding the whole prologue/epilogue is
> usually worth copying some blocks for, but avoiding just a single register
> save/restore pair?  More doubtful.
Understood.   We have similar balancing problems elsewhere.  How much 
duplication should we allow to expose a CSE or eliminate a conditional 
jump on a path through the CFG.

So I think sticking with this as a design decision makes sense -- does 
it impact the issue around running a particular component's prologue 
more than once?


>> Thanks.  I'd been wondering about when it'd be useful to push prologue
>> code into a loop nest when I saw the patches fly by, but didn't think
>> about it too much.  I haven't looked at the shrink-wrapping literature
>> in years, but I don't recall it having any concept that there were cases
>> where sinking into a loop nest was profitable.
>
> It isn't common, but it does happen.  If you use a proper cost metric
> based on executiuon frequency with some algorithm then that algorithm will
> naturally avoid placing *logues into loops.
Right and the cases where it's placing them into loops it's going to be 
doing so on paths that are unlikely executed within the loop.  ISTM that 
using frequencies is a significant step forward over the older placement 
algorithms in this space (which IIRC were structured as simple dataflow 
solvers) -- with the added advantage that if we have profiling data we 
can make better decisions.

Does this impact the compile time computation complexity issue that was 
raised elsewhere?

jeff
Segher Boessenkool Sept. 9, 2016, 3:28 p.m. UTC | #13
On Thu, Sep 08, 2016 at 10:58:13AM -0600, Jeff Law wrote:
> >And that comment puzzles me. Surely prologue and epilogue are executed
> >only once currently, so how does frequency come into it? Again - please
> >provide an example.
> Right, they're executed once currently.  But the prologue could be sunk 
> into locations where they are not executed every time the function is 
> called.  That's the basis behind shrink wrapping.

Right.

> Segher's code essentially allows individual components of the prologue 
> to sink to different points within the function rather than forcing the 
> prologue to be sunk as an atomic unit.

It also allows prologue and epilogue components to be placed in multiple
places, and even allows them to be executed more than once, if that is
cheaper.

> >>The full-prologue algorithm makes as many blocks run without prologue as
> >>possible, by duplicating blocks where that helps.  If you do this for
> >>every component you can and up with 2**40 blocks for just 40 components,
> >
> >Ok, so why wouldn't we use the existing code with the duplication part
> >disabled? That's a later addition anyway and isn't necessary to do
> >shrink-wrapping in the first place.
> I think the concern here is the balance between code explosion and the 
> runtime gains.

The code explosion can be terrible even if you have only a few components.
The runtime gains are not so great either.  Here's a simple
example, showing part of a cfg, the exits to the right are calls to
abort and need some prologue component:

|
1
|\
| \
2  X1
|\
| \
|  X2
|

With the "old" algorithm, you need to place the prologue at 1 (because
you can only have one), and then the fall-through path also necessarily
gets that component's prologue.  With the "separate" algorithm, the
component prologues can be placed at X1 and X2 instead (and that will
most likely be cheapest according to the cost model, too).
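
As source code that cfg is roughly this (hypothetical, only to make the
picture concrete):

extern void report_and_abort (int) __attribute__ ((noreturn));

/* X1 and X2 are the two noreturn calls; only they need the component that
   saves the registers those calls clobber.  The fall-through path to the
   return needs none of it.  */
int
f (int a, int b)
{
  if (a < 0)
    report_and_abort (1);   /* X1 */
  if (b < 0)
    report_and_abort (2);   /* X2 */
  return a + b;
}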

You can also put this in a loop.  I'll try to make a simple example
with that.


Segher
Segher Boessenkool Sept. 9, 2016, 3:40 p.m. UTC | #14
On Fri, Sep 09, 2016 at 09:26:39AM -0600, Jeff Law wrote:
> >>I think one of the questions (and I haven't looked through the whole
> >>thread yet to see if it's answered) is why the basic shrink-wrapping
> >>algorithm can't be applied to each of the prologue components -- though
> >>you may have answered it with your loop example below.
> >
> >You get code size explosion with the "basic" algorithm.  And a lot of
> >that isn't really worth it: avoiding the whole prologue/epilogue is
> >usually worth copying some blocks for, but avoiding just a single register
> >save/restore pair?  More doubtful.
> Understood.   We have similar balancing problems elsewhere.  How much 
> duplication should we allow to expose a CSE or eliminate a conditional 
> jump on a path through the CFG.
> 
> So I think sticking with this as a design decision makes sense -- does 
> it impact the issue around running a particular component's prologue 
> more than once?

I don't follow, sorry; could you rephrase?

> >>Thanks.  I'd been wondering about when it'd be useful to push prologue
> >>code into a loop nest when I saw the patches fly by, but didn't think
> >>about it too much.  I haven't looked at the shrink-wrapping literature
> >>in years, but I don't recall it having any concept that there were cases
> >>where sinking into a loop nest was profitable.
> >
> >It isn't common, but it does happen.  If you use a proper cost metric
> >based on executiuon frequency with some algorithm then that algorithm will
> >naturally avoid placing *logues into loops.
> Right and the cases where it's placing them into loops it's going to be 
> doing so on paths that are unlikely executed within the loop.  ISTM that 
> using frequencies is a significant step forward over the older placement 
> algorithms in this space (which IIRC were structured as simple dataflow 
> solvers) -- with the added advantage that if we have profiling data we 
> can make better decisions.

Using the known/guessed execution frequencies is not really new, just not
forty years old :-)

> Does this impact the compile time computation complexity issue that was 
> raised elsewhere?

I'm not sure what you mean here either, sorry.  It is all O(NM) with N
the number of BBs and M the number of components (and assuming dom
lookups are constant time, as usual).


Segher
Jeff Law Sept. 9, 2016, 3:59 p.m. UTC | #15
On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
> On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
>> So can you expand on the malloc example a bit -- I'm pretty sure I
>> understand what you're trying to do, but a concrete example may help
>> Bernd and be useful for archival purposes.
>
> Sure, but it's big (which is the problem :-) )
Yea :(  But it's likely a very compelling example of real world code. 
It's almost certainly too big to turn into a testcase of any kind, but 
just some before/after annotated code would be helpful.

Ideally we'd also have some smaller testcases we could put in the testsuite 
to ensure that the feature keeps working over time in the way intended.

>>
>>>> That's a later addition anyway and isn't necessary to do
>>>> shrink-wrapping in the first place.
>>>
>>> No, it always did that, just not as often (it only duplicated straight-line
>>> code before).
>> Presumably (I haven't looked yet), the duplication is so that we can
>> isolate one or more paths which in turn allows sinking the prologue
>> further on some of those paths.
>
> It duplicates as many blocks as it needs to dup, to make as many exits
> as possible reachable without *any* prologue/epilogue.
>
> As the header comment before the older code says:
>
> /* Try to perform a kind of shrink-wrapping, making sure the
>    prologue/epilogue is emitted only around those parts of the
>    function that require it.
>
>    There will be exactly one prologue, and it will be executed either
>    zero or one time, on any path.
Right.  That's always been my understanding of the key driver for 
placement.  There's exactly one, and it will be executed either once or 
not at all on any path through the CFG.

Essentially this is comparable to PRE-like algorithms for placement of 
expression evaluations.

And in separate shrink-wrapping world, we're leaving that model behind 
and I think that's one of the big things I'm struggling with -- we may 
execute a prologue component more than once if I've read everything 
correctly.


>   Depending on where the prologue is
>    placed, some of the basic blocks can be reached via both paths with
>    and without a prologue.  Such blocks will be duplicated here, and the
>    edges changed to match.
Understood.

This really feels comparable to block duplication for the purposes of 
isolating a particular path through the CFG so that path can be modified 
without affecting the behavior of other paths through the CFG.

It's also directly comparable to block duplication to allow more 
aggressive code motion in PRE-like algorithms.

Jeff
Jeff Law Sept. 9, 2016, 4:48 p.m. UTC | #16
On 09/09/2016 09:28 AM, Segher Boessenkool wrote:
>> Segher's code essentially allows individual components of the prologue
>> to sink to different points within the function rather than forcing the
>> prologue to be sunk as an atomic unit.
>
> It also allows prologue an epilogue components to be placed in multiple
> places,
Right.  I need to keep that in the forefront of my mind.



> and even allows them to be executed more than once, if that is
> cheaper.
This is the part that I'm still struggling with.

>
>>>> The full-prologue algorithm makes as many blocks run without prologue as
>>>> possible, by duplicating blocks where that helps.  If you do this for
>>>> every component you can and up with 2**40 blocks for just 40 components,
>>>
>>> Ok, so why wouldn't we use the existing code with the duplication part
>>> disabled? That's a later addition anyway and isn't necessary to do
>>> shrink-wrapping in the first place.
>> I think the concern here is the balance between code explosion and the
>> runtime gains.
>
> The code explosion can be terrible if you have only a few components,
> already.  The runtime gains are not so great either.  Here's a simple
> example, showing part of a cfg, the exits to the right are calls to
> abort and need some prologue component:
>
> |
> 1
> |\
> | \
> 2  X1
> |\
> | \
> |  X2
> |
   3
>
> With the "old" algorithm, you need to place the prologue at 1 (because
> you can only have one), and then the fall-through path also necessarily
> gets that component's prologue.  With the "separate" algorithm, the
> component prologues can be placed at X1 and X2 instead (and that will
> most likely be cheapest according to the cost model, too).
Thanks.  That really helps highlight some of the key differences between 
the old and new model.


>
> You can also put this in a loop.  I'll try to make a simple example
> with that.
That would be great.  And a case where they execute more than once too.

Jeff
Segher Boessenkool Sept. 9, 2016, 4:49 p.m. UTC | #17
On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:
> On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
> >On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
> >>So can you expand on the malloc example a bit -- I'm pretty sure I
> >>understand what you're trying to do, but a concrete example may help
> >>Bernd and be useful for archival purposes.
> >
> >Sure, but it's big (which is the problem :-) )
> Yea :(  But it's likely a very compelling example of real world code. 
> It's almost certainly too big to turn into a testcase of any kind, but 
> just some before/after annotated code would be helpful.

I'll work on it, but it won't be before tonight, probably quite a bit
later.  Seeing the difference in generated machine code is probably
most useful?  Better than RTL ;-)

> Ideally we'd have some smaller testcases we could put in the testsuite 
> to ensure that the feature works over-time in the way intended would be 
> helpful as well.

I might be able to make some really tiny testcases that will not need
updates all of the time.  We'll see.

> Right.  That's always been my understanding of the key driver for 
> placement.  There's exactly one and will be executed one time or none 
> across all paths in the CFG.

But that is not because it is good to have only one!  GCC expects there
to be only one, instead.  Some ports might even use prologues that cannot
be duplicated at all.

> And in separate shrink-wrapping world, we're leaving that model behind 
> and I think that's one of the big things I'm struggling with -- we may 
> execute a prologue component more than once if I've read everything 
> correctly.

Yes, and that is perhaps radically new in the GCC world, but not anywhere
else.

[ ... ]

> This really feels comparable to block duplication for the purposes of 
> isolating a particular path through the CFG so that path can be modified 
> without affecting the behavior of other paths through the CFG.

Yes, very much.

> It's also directly comparable to block duplication to allow more 
> aggressive code motion in PRE-like algorithms.

Yeah.


Segher
Segher Boessenkool Sept. 9, 2016, 4:57 p.m. UTC | #18
On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:
> > and even allows them to be executed more than once, if that is
> > cheaper.
> This is the part that I'm still struggling with.

The usual example:

1
|\
| \
|  2
| /
|/
3
|\
| \
|  4
| /
|/
5

where 2 and 4 need a certain prologue component (and the rest doesn't).

If the frequency of 2 plus that of 4 is less than the frequency of 1,
it is better to place a prologue (and epilogue) around both 2 and 4
than to place a prologue at 1 and an epilogue at 5.
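
In source form that is roughly (hypothetical code, just to match the
picture):

extern int work (int);

/* Blocks 2 and 4 are the two conditional arms, and they are the only ones
   that need the component (say, the save of some callee-saved register
   used only in those arms).  If freq(2) + freq(4) < freq(1), wrapping each
   arm separately is cheaper, even though a call that takes both arms then
   executes that component's prologue and epilogue twice.  */
int
g (int a, int b, int x)
{
  if (a)
    x = work (x);   /* block 2: needs the component */
  x |= 1;           /* block 3 */
  if (b)
    x = work (x);   /* block 4: needs the component */
  return x;         /* block 5 */
}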

> >You can also put this in a loop.  I'll try to make a simple example
> >with that.
> That would be great.  And a case where they execute more than once too.

Okido.


Segher
Jeff Law Sept. 9, 2016, 5:01 p.m. UTC | #19
On 09/09/2016 10:49 AM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 09:59:03AM -0600, Jeff Law wrote:
>> On 09/09/2016 09:17 AM, Segher Boessenkool wrote:
>>> On Thu, Sep 08, 2016 at 10:41:37AM -0600, Jeff Law wrote:
>>>> So can you expand on the malloc example a bit -- I'm pretty sure I
>>>> understand what you're trying to do, but a concrete example may help
>>>> Bernd and be useful for archival purposes.
>>>
>>> Sure, but it's big (which is the problem :-) )
>> Yea :(  But it's likely a very compelling example of real world code.
>> It's almost certainly too big to turn into a testcase of any kind, but
>> just some before/after annotated code would be helpful.
>
> I'll work on it, but it won't be before tonight, probably quite a bit
> later.  Seeing the difference in generated machine code is probably
> most useful?  Better than RTL ;-)
Yea, machine code is probably most useful.

>
>> Ideally we'd have some smaller testcases we could put in the testsuite
>> to ensure that the feature works over-time in the way intended would be
>> helpful as well.
>
> I might be able to make some really tiny testcases that will not need
> updates all of the time.  We'll see.
Understood, particularly on the maintenance burden for this kind of 
test.  The alternate approach might be to pick up Bernd's work and see 
if we can use that to test the various components.

>
> But that is not because it is good to have only one!  GCC expects there
> to be only one, instead.  Some ports might even use prologues that cannot
> be duplicated at all.
Agreed on all points.

>
>> And in separate shrink-wrapping world, we're leaving that model behind
>> and I think that's one of the big things I'm struggling with -- we may
>> execute a prologue component more than once if I've read everything
>> correctly.
>
> Yes, and that is perhaps radically new in the GCC world, but not anywhere
> else.
And just to be clear I think I'm mostly looking for why multiple 
execution of a particular component is useful.  My first thought in that 
case is that we've got a redundancy and that the redundancy should be 
identified and eliminated :-)

I've got a few not-well-formed ideas in my head about when that might 
happen and why we might not want to squash out the redundancy.  But it'd 
be a hell of a step forward to see that case in action.

Jeff
Jeff Law Sept. 9, 2016, 5:24 p.m. UTC | #20
On 09/09/2016 10:57 AM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 10:48:30AM -0600, Jeff Law wrote:
>> and even allows them to be executed more than once, if that is
>>> cheaper.
>> This is the part that I'm still struggling with.
>
> The usual example:
>
> 1
> |\
> | \
> |  2
> | /
> |/
> 3
> |\
> | \
> |  4
> | /
> |/
> 5
>
> where 2 and 4 need a certain prologue component (and the rest doesn't).
Perfect.

So this is consistent with one of the ideas I was starting to form.  I'm 
going to stay in the PRE world because its model is so damn close to 
what you're doing.

PRE minimizes expression evaluations on paths *without* introducing 
evaluations on paths that didn't already have one.

So if we pretend we had some expression evaluated in 2 & 4.  The path 
1->2->3->4 has two evaluations of the expression, but PRE can't really 
do anything here.  We can't hoist the evaluation into a better spot as 
doing so would introduce evaluations on paths that didn't have one 
before.  2 & 4 are the proper locations for that expression evaluation.

And the same applies to separate shrink wrapping.  2 & 4 are the right 
place.  It's all so clear now.

I'll note that duplicating 3 into 3' and redirecting the edge 2->3 to 
2->3' allows us to do better PRE and prologue insertion.  But I don't 
think that's a requirement for you to go forward :-)

Anyway, it's all clear now.  Thanks so much for that example.


Jeff
Jeff Law Sept. 9, 2016, 6:19 p.m. UTC | #21
On 09/09/2016 09:40 AM, Segher Boessenkool wrote:
>>
>> So I think sticking with this as a design decision makes sense -- does
>> it impact the issue around running a particular component's prologue
>> more than once?
>
> I don't follow, sorry; could you rephrase?
Nevermind -- my question has been resolved.  I was fumbling around 
trying to figure out the circumstances where a prologue/epilogue might 
be run more than once.  With your recent message, that's been cleared up.

>> Right and the cases where it's placing them into loops it's going to be
>> doing so on paths that are unlikely executed within the loop.  ISTM that
>> using frequencies is a significant step forward over the older placement
>> algorithms in this space (which IIRC were structured as simple dataflow
>> solvers) -- with the added advantage that if we have profiling data we
>> can make better decisions.
>
> Using the known/guessed execution frequencies is not really new, just not
> forty years old :-)
Referring to its use in GCC (or lack thereof).

>
>> Does this impact the compile time computation complexity issue that was
>> raised elsewhere?
>
> I'm not sure what you mean here either, sorry.  It is all O(NM) with N
> the number of BBs and M the number of components (and assuming dom
> lookups are constant time, as usual).
You said in a different message that computing optimal placement points 
for prologue/epilogue components was not computationally feasible.   I'm 
just trying to figure out if the switch to utilizing block frequencies 
is a part of what makes solving that problem infeasible.

jeff
Jeff Law Sept. 9, 2016, 6:57 p.m. UTC | #22
On 08/26/2016 07:03 AM, Bernd Schmidt wrote:
> On 08/01/2016 03:42 AM, Segher Boessenkool wrote:
>> This is the second version.  Concern was renamed to component, and all
>> other comments were addressed (I hope).
>
> Not really, I'm afraid. There still seems to be no detailed explanation
> of the algorithm. Instead, there is a vague outline (which should be
> expanded to at least the level of detail found e.g. in tree-ssa-pre.c),
> and inside the code there are still meaningless comments of the form
I think Segher and I can work towards this. The placement algorithm is 
where I think the main focus will need to be.  Segher has some text for 
that, but we may need to supplement with a CFG or two to help clarify.

>
> /* Deselect those epilogue components that should not be inserted
>    for this edge.  */
>
> which don't tell me anything about what the intention is (why should
> they not be inserted?). The result is that as a reader, I still find
> myself unable to determine whether the algorithm is correct or not.
Deselection is more about pruning components that we initially thought 
were separate shrink wrap candidates, but which, upon deeper inspection, 
we cannot handle.

Deselection has to run after placement computation because deselection 
is a property of the desired insertion site (which is why this hook 
passes in the desired edge where we want to insert code).  Deselection 
runs prior to actual generation & insertion of the sequences. 
Deselection affects the ability to separately shrink wrap that component 
globally.

So it does affect correctness, but it's pretty easy to see that its safe.



>
> Worse, I'm still not entirely certain what it is intended to achieve: I
> asked for a motivating example or two, but haven't seen one in the
> comments anywhere. My most general question would be why the algorithm
> for placing individual prologue components would be different from the
> regular shrink-wrapping algorithm for full prologues. Examples and/or
> arguments relating to how this new code acts with regard to loops are
> also particularly needed.
I think Segher and Michael covered this in the various follow-ups.  I'm 
now comfortable with the overall goals.

Right now we have a single copy of the prologue/epilogue.  That single 
copy might get shrink-wrapped, but at the end of the day, it's still a 
single copy that gets executed at most once.

If we relax the single copy and single execution traits, then we get 
more freedom to place the prologues/epilogues at less frequently 
executed points.

We get even more flexibility by handling components of the 
prologue/epilogue separately.

Some comments in these areas will be appropriate, but again, I'm 
comfortable with the overall direction this work has gone.


>
> So, I think improvement is necessary in these points, but I also think
> that there's room for experimental verification by way of self-tests. If
> done thoroughly (testing the transformation on several sufficiently
> random CFGs and maybe some manually constructed ones) it would go a long
> way towards showing that at least we don't have to worry too much about
> miscompilations. That's what I've been working on, and an initial patch
> is below. This is incomplete and posted more as an RFD since we're
> getting towards the end of the week: there are gaps in the test
> coverage, and it currently fails the selftest. I assume the latter is a
> problem with my code, but it wouldn't hurt if you could take a look;
> maybe I misunderstood something entirely about what the separate
> shrink-wrapping is supposed to achieve, or maybe I messed up the
> algorithm with my changes.
I think the lack of test coverage is something we'll want to address.

jeff
Segher Boessenkool Sept. 9, 2016, 8:24 p.m. UTC | #23
On Fri, Sep 09, 2016 at 12:19:03PM -0600, Jeff Law wrote:
> >>Does this impact the compile time computation complexity issue that was
> >>raised elsewhere?
> >
> >I'm not sure what you mean here either, sorry.  It is all O(NM) with N
> >the number of BBs and M the number of components (and assuming dom
> >lookups are constant time, as usual).
> You said in a different message that computing optimal placement points 
> for prologue/epilogue components was not computationally feasible.   I'm 
> just trying to figure out if the switch to utilizing block frequencies 
> is a part of what makes solving that problem infeasible.

Ah, I see.  Allowing multiple prologues (and epilogues) is what makes
it infeasible.  If there is just (at most) one prologue (per component),
calculating the optimal placement is of course linear in # BBs, given
that the cost function is O(1) (or sort of kind of; linear in # edges
really, if you have to insist :-) )

If you allow multiple prologues, i.e. allow any combination of blocks
to run with or without some component "active", you get an exponential
number of possible ways to place things, and the cost for those combinations
is *not* an ordered function: if all predecessors of a block have some
component active, then this block itself does not need a prologue.

I get around that by making the cost function ordered, that is, possibly
overestimating the cost of nodes that are the dest of a cross-jump; in
the first version of the patch, by always using the execution frequency
of a node (so, not considering that a prologue there does not cost
anything for edges where the predecessor already has that component
active); and in the second version of the patch, the same, but with the
cost of back edges subtracted (which makes it clearer that loops are
handled correctly, because the loop header then generally has lower cost
than the nodes in the loop body).
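
Roughly, one way to express that second cost metric is the following --
this is a hypothetical helper, not code from the patch:

static int
component_head_cost (basic_block bb)
{
  /* Overestimate: charge the full execution frequency of the block, except
     that the frequency coming in over back edges is subtracted, since a
     predecessor inside the loop will already have the component active.
     Overestimating this way keeps the cost function ordered.  */
  int cost = bb->frequency;
  edge e;
  edge_iterator ei;
  FOR_EACH_EDGE (e, ei, bb->preds)
    if (e->flags & EDGE_DFS_BACK)
      cost -= EDGE_FREQUENCY (e);
  return cost;
}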


Segher
Segher Boessenkool Sept. 9, 2016, 8:56 p.m. UTC | #24
On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> I think the lack of test coverage is something we'll want to address.

Building and running the compiler, the various target libraries, and the
testsuite is more than enough coverage for correctness in my opinion --
I cannot make up anything that is not already in there.  No doubt some
PR will show up later, and we can (and should) of course add that then.

For checking whether it makes "smart" decisions, I'll make some really
simple testcases that will not fail for random reasons all the time.


Segher
Bernd Schmidt Sept. 12, 2016, 10:50 a.m. UTC | #25
On 09/09/2016 10:56 PM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
>> I think the lack of test coverage is something we'll want to address.
>
> Building and running the compiler, the various target libraries, and the
> testsuite is more than enough coverage for correctness in my opinion --

I'd argue against that:
  - a lot of compiler bugs for passes like this show up only for unusual
    CFGs that you don't really get with normal input code.
  - separate shrink wrapping explicitly tries to modify infrequent
    paths, which may not even get exercised.

Also, unless I've missed it you've not provided any numbers about how 
often this triggers, let's say on a cc1 build.

For these reasons I thought having self-tests would be a good idea, but 
I was informed it's a waste of time, so I guess not.


Bernd
Jeff Law Sept. 12, 2016, 4:49 p.m. UTC | #26
On 09/09/2016 02:56 PM, Segher Boessenkool wrote:
> On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
>> I think the lack of test coverage is something we'll want to address.
>
> Building and running the compiler, the various target libraries, and the
> testsuite is more than enough coverage for correctness in my opinion --
> I cannot make up anything that is not already in there.  No doubt some
> PR will show up later, and we can (and should) of course add that then.
>
> For checking whether it makes "smart" decisions, I'll make some really
> simple testcases that will not fail for random reasons all the time.
I'm moving more and more towards wanting tests for new features. 
There's no reason why verification of basic behavior shouldn't be part 
of the feature patch itself.

I suspect the biggest problem with testing something like separate 
shrink wrapping is making sure the test is stable over time -- and 
that's a legitimate concern.  But it ought to be do-able in one form or 
another, particularly since we have control over the debug dumps. 
Alternately building something on top of Bernd's work is an option too.

Building the compiler & libraries, regression testing, etc are all 
important as well.

jeff
Segher Boessenkool Sept. 14, 2016, 1:18 p.m. UTC | #27
On Mon, Sep 12, 2016 at 10:49:46AM -0600, Jeff Law wrote:
> On 09/09/2016 02:56 PM, Segher Boessenkool wrote:
> >On Fri, Sep 09, 2016 at 12:57:32PM -0600, Jeff Law wrote:
> >>I think the lack of test coverage is something we'll want to address.
> >
> >Building and running the compiler, the various target libraries, and the
> >testsuite is more than enough coverage for correctness in my opinion --
> >I cannot make up anything that is not already in there.  No doubt some
> >PR will show up later, and we can (and should) of course add that then.
> >
> >For checking whether it makes "smart" decisions, I'll make some really
> >simple testcases that will not fail for random reasons all the time.
> I'm moving more and more towards wanting tests for new features. 
> There's no reason why verification of basic behavior shouldn't be part 
> of the feature patch itself.

Yes, but as I said, this is used millions of times during bootstrap
already, and at least *I* cannot think of more complex structures that
would fail.  If any show up we can add them as regression tests.

> I suspect the biggest problem with testing something like separate 
> shrink wrapping is making sure the test is stable over time -- and 
> that's a legitimate concern.

Yes exactly.

> But it ought to be do-able in one form or another,

Yes, as I promised, I'll make up some trivial tests.

> particularly since we have control over the debug dumps. 

I'm not a big fan of using the debug dumps in the testsuite -- very
often tests break just because something in the dump output changed.
Debug dumps are for debugging, not for testing.

If we get to the point where we can generically take some piece of RTL
as input and run just one pass on it, that would be nice.  But we're
not there yet, and the kind of "unit tests" where you essentially write
all the code twice and only test whether the test itself is correct: no
thanks.


Segher
diff mbox

Patch

diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.c with-selftest/gcc/sbitmap.c
--- sw-selftest-base/gcc/sbitmap.c	2016-04-29 10:46:49.189700651 +0200
+++ with-selftest/gcc/sbitmap.c	2016-08-26 11:40:05.043374153 +0200
@@ -319,7 +319,7 @@ 
       *dstp++ = *ap++;
 }
 
-/* Return true if there are any bits set in A are also set in B.
+/* Return true if there are any bits set in A which are also set in B.
    Return false otherwise.  */
 
 bool
@@ -335,6 +335,24 @@ 
       return true;
 
   return false;
+}
+
+/* Return true if there are any bits set in A which are not set in B.
+   Return false otherwise.  */
+
+bool
+bitmap_intersect_compl_p (const_sbitmap a, const_sbitmap b)
+{
+  const_sbitmap_ptr ap = a->elms;
+  const_sbitmap_ptr bp = b->elms;
+  unsigned int i, n;
+
+  n = MIN (a->size, b->size);
+  for (i = 0; i < n; i++)
+    if ((*ap++ & ~*bp++) != 0)
+      return true;
+
+  return false;
 }
 
 /* Set DST to be (A and B).
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/sbitmap.h with-selftest/gcc/sbitmap.h
--- sw-selftest-base/gcc/sbitmap.h	2016-08-01 12:42:29.678435973 +0200
+++ with-selftest/gcc/sbitmap.h	2016-08-26 11:40:14.520040887 +0200
@@ -246,6 +246,7 @@ 
 extern bool bitmap_and_or (sbitmap, const_sbitmap,
 				     const_sbitmap, const_sbitmap);
 extern bool bitmap_intersect_p (const_sbitmap, const_sbitmap);
+extern bool bitmap_intersect_compl_p (const_sbitmap, const_sbitmap);
 extern bool bitmap_and (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_ior (sbitmap, const_sbitmap, const_sbitmap);
 extern bool bitmap_xor (sbitmap, const_sbitmap, const_sbitmap);
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest.h with-selftest/gcc/selftest.h
--- sw-selftest-base/gcc/selftest.h	2016-08-01 12:42:57.738436176 +0200
+++ with-selftest/gcc/selftest.h	2016-08-26 11:49:50.656711729 +0200
@@ -92,6 +92,11 @@ 
 extern void tree_cfg_c_tests ();
 extern void vec_c_tests ();
 extern void wide_int_cc_tests ();
+extern void shrink_wrapping_separate_tests ();
+
+/* Helper functions to set up certain tests.  */
+extern basic_block build_random_cfg (int);
+extern void free_random_cfg ();
 
 extern int num_passes;
 
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/selftest-run-tests.c with-selftest/gcc/selftest-run-tests.c
--- sw-selftest-base/gcc/selftest-run-tests.c	2016-08-01 12:43:07.411769580 +0200
+++ with-selftest/gcc/selftest-run-tests.c	2016-08-26 11:50:34.160045378 +0200
@@ -67,8 +67,9 @@ 
   spellcheck_tree_c_tests ();
   tree_cfg_c_tests ();
 
-  /* This one relies on most of the above.  */
+  /* These rely on most of the above.  */
   function_tests_c_tests ();
+  shrink_wrapping_separate_tests ();
 
   /* Finished running tests.  */
   long finish_time = get_run_time ();
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/shrink-wrap.c with-selftest/gcc/shrink-wrap.c
--- sw-selftest-base/gcc/shrink-wrap.c	2016-08-25 13:50:11.013943584 +0200
+++ with-selftest/gcc/shrink-wrap.c	2016-08-26 13:25:48.843420128 +0200
@@ -41,7 +41,7 @@ 
 #include "regcprop.h"
 #include "rtl-iter.h"
 #include "valtrack.h"
-
+#include "selftest.h"
 
 /* Return true if INSN requires the stack frame to be set up.
    PROLOGUE_USED contains the hard registers used in the function
@@ -1091,6 +1091,58 @@ 
   gcov_type total_cost;
 };
 
+/* An abstract base class which holds the implementation of the algorithm.
+   Derived classes exist for actual shrink-wrapping, and for self-test.  */
+class sw_sep_base
+{
+  virtual sbitmap components_for_bb (basic_block) = 0;
+  virtual void disqualify (edge, sbitmap, bool) = 0;
+  virtual void emit_prologue_into_bb (basic_block, sbitmap, bool) = 0;
+  virtual void emit_epilogue_into_bb (basic_block, sbitmap, bool) = 0;
+  virtual void emit_prologue_on_edge (edge, sbitmap) = 0;
+  virtual void emit_epilogue_on_edge (edge, sbitmap) = 0;
+  virtual void set_handled () = 0;
+
+  void init ();
+  void place_prologue_for_one_component (unsigned int, basic_block);
+  void spread_components (basic_block head);
+  void disqualify_problematic_components ();
+  void emit_common_heads_for_components ();
+  void emit_common_tails_for_components ();
+  void insert_prologue_epilogue_for_components ();
+
+ protected:
+  basic_block m_first_bb;
+  sbitmap m_components;
+
+  sw_sep_base (basic_block first_bb, sbitmap components)
+    : m_first_bb (first_bb), m_components (components)
+  {
+  }
+
+ public:
+  virtual void fini ();
+  sbitmap run ();
+};
+
+/* The derived class that performs the real optimization.  */
+class sw_sep : public sw_sep_base
+{
+  sbitmap components_for_bb (basic_block);
+  void disqualify (edge, sbitmap, bool);
+  void emit_prologue_into_bb (basic_block, sbitmap, bool);
+  void emit_epilogue_into_bb (basic_block, sbitmap, bool);
+  void emit_prologue_on_edge (edge, sbitmap);
+  void emit_epilogue_on_edge (edge, sbitmap);
+  void set_handled ();
+
+ public:
+  sw_sep (basic_block first_bb, sbitmap components)
+    : sw_sep_base (first_bb, components)
+  {
+  }
+};
+
 /* A helper function for accessing the pass-specific info.  */
 static inline struct sw *
 SW (basic_block bb)
@@ -1101,26 +1153,19 @@ 
 
 /* Create the pass-specific data structures for separately shrink-wrapping
    with components COMPONENTS.  */
-static void
-init_separate_shrink_wrap (sbitmap components)
+void
+sw_sep_base::init ()
 {
   basic_block bb;
   FOR_ALL_BB_FN (bb, cfun)
     {
       bb->aux = xcalloc (1, sizeof (struct sw));
 
-      SW (bb)->needs_components = targetm.shrink_wrap.components_for_bb (bb);
-
-      if (dump_file)
-	{
-	  fprintf (dump_file, "bb %d components:", bb->index);
-	  dump_components ("has", SW (bb)->needs_components);
-	  fprintf (dump_file, "\n");
-	}
+      SW (bb)->needs_components = components_for_bb (bb);
 
-      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (components));
-      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (components));
-      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (components));
+      SW (bb)->has_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
+      SW (bb)->head_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
+      SW (bb)->tail_components = sbitmap_alloc (SBITMAP_SIZE (m_components));
       bitmap_clear (SW (bb)->has_components);
       bitmap_clear (SW (bb)->head_components);
       bitmap_clear (SW (bb)->tail_components);
@@ -1128,8 +1173,8 @@ 
 }
 
 /* Destroy the pass-specific data.  */
-static void
-fini_separate_shrink_wrap (void)
+void
+sw_sep_base::fini (void)
 {
   basic_block bb;
   FOR_ALL_BB_FN (bb, cfun)
@@ -1144,13 +1189,132 @@ 
       }
 }
 
+/* Get the set of components BB needs from the target, and dump it.  */
+sbitmap
+sw_sep::components_for_bb (basic_block bb)
+{
+  sbitmap b = targetm.shrink_wrap.components_for_bb (bb);
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "bb %d components:", bb->index);
+      dump_components ("has", b);
+      fprintf (dump_file, "\n");
+    }
+  return b;
+}
+
+/* Ask the target whether it can handle the components in EDGE_COMPONENTS
+   on edge E (as prologue if IS_PROLOGUE, else as epilogue), and remove
+   from m_components those it cannot.  */
+void
+sw_sep::disqualify (edge e, sbitmap edge_components, bool is_prologue)
+{
+  targetm.shrink_wrap.disqualify_components (m_components, e, edge_components,
+					     is_prologue);
+}
+
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns on edge E.  */
+void
+sw_sep::emit_epilogue_on_edge (edge e, sbitmap epi)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_epilogue_components (epi);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (e->flags & EDGE_SIBCALL)
+    {
+      gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
+
+      rtx_insn *insn = BB_END (e->src);
+      gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
+      emit_insn_before (seq, insn);
+    }
+  else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
+    {
+      gcc_assert (e->flags & EDGE_FALLTHRU);
+      basic_block new_bb = split_edge (e);
+      emit_insn_after (seq, BB_END (new_bb));
+    }
+  else
+    insert_insn_on_edge (seq, e);
+}
+
+/* Call the target hook to emit prologue components for PRO, and place
+   the insns on edge E.  */
+void
+sw_sep::emit_prologue_on_edge (edge e, sbitmap pro)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_prologue_components (pro);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  insert_insn_on_edge (seq, e);
+}
+
+/* Call the target hook to emit epilogue components for EPI, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_sep::emit_epilogue_into_bb (basic_block bb, sbitmap epi, bool at_head)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_epilogue_components (epi);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (at_head)
+    emit_insn_after (seq, bb_note (bb));
+  else
+    {
+      rtx_insn *insn = BB_END (bb);
+      if (control_flow_insn_p (insn))
+	emit_insn_before (seq, insn);
+      else
+	emit_insn_after (seq, insn);
+    }
+}
+
+/* Call the target hook to emit prologue components for PRO, and place
+   the insns in basic block BB, at the start if AT_HEAD, at the end
+   otherwise.  */
+void
+sw_sep::emit_prologue_into_bb (basic_block bb, sbitmap pro, bool at_head)
+{
+  start_sequence ();
+  targetm.shrink_wrap.emit_prologue_components (pro);
+  rtx_insn *seq = get_insns ();
+  end_sequence ();
+
+  if (at_head)
+    emit_insn_after (seq, bb_note (bb));
+  else
+    {
+      rtx_insn *insn = BB_END (bb);
+      if (control_flow_insn_p (insn))
+	emit_insn_before (seq, insn);
+      else
+	emit_insn_after (seq, insn);
+    }
+}
+
+/* Called by the engine if separate shrink-wrapping was successful for
+   some components.  */
+void
+sw_sep::set_handled ()
+{
+  targetm.shrink_wrap.set_handled_components (m_components);
+
+  crtl->shrink_wrapped_separate = true;
+}
+
 /* Place the prologue for component WHICH, in the basic blocks dominated
    by HEAD.  Do a DFS over the dominator tree, and set bit WHICH in the
    HAS_COMPONENTS of a block if either the block has that bit set in
    NEEDS_COMPONENTS, or it is cheaper to place the prologue here than in all
    dominator subtrees separately.  */
-static void
-place_prologue_for_one_component (unsigned int which, basic_block head)
+void
+sw_sep_base::place_prologue_for_one_component (unsigned int which,
+					       basic_block head)
 {
   /* The block we are currently dealing with.  */
   basic_block bb = head;
@@ -1246,8 +1410,8 @@ 
 
 /* Mark HAS_COMPONENTS for every block dominated by at least one block with
    PRO_COMPONENTS set for the respective components, starting at HEAD.  */
-static void
-spread_components (basic_block head)
+void
+sw_sep_base::spread_components (basic_block head)
 {
   basic_block bb = head;
   bool first_visit = true;
@@ -1291,11 +1455,11 @@ 
 /* If we cannot handle placing some component's prologues or epilogues where
    we decided we should place them, unmark that component in COMPONENTS so
    that it is not wrapped separately.  */
-static void
-disqualify_problematic_components (sbitmap components)
+void
+sw_sep_base::disqualify_problematic_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1312,11 +1476,9 @@ 
 
 	  /* Ask the target what it thinks about things.  */
 	  if (!bitmap_empty_p (epi))
-	    targetm.shrink_wrap.disqualify_components (components, e, epi,
-						       false);
+	    disqualify (e, epi, false);
 	  if (!bitmap_empty_p (pro))
-	    targetm.shrink_wrap.disqualify_components (components, e, pro,
-						       true);
+	    disqualify (e, pro, true);
 
 	  /* If this edge doesn't need splitting, we're fine.  */
 	  if (single_pred_p (e->dest)
@@ -1336,10 +1498,10 @@ 
 
 	  /* Remove from consideration those components we would need
 	     pro/epilogues for on edges where we cannot insert them.  */
-	  bitmap_and_compl (components, components, epi);
-	  bitmap_and_compl (components, components, pro);
+	  bitmap_and_compl (m_components, m_components, epi);
+	  bitmap_and_compl (m_components, m_components, pro);
 
-	  if (dump_file && !bitmap_subset_p (epi, components))
+	  if (dump_file && !bitmap_subset_p (epi, m_components))
 	    {
 	      fprintf (dump_file, "  BAD epi %d->%d", e->src->index,
 		       e->dest->index);
@@ -1349,7 +1511,7 @@ 
 	      fprintf (dump_file, "\n");
 	    }
 
-	  if (dump_file && !bitmap_subset_p (pro, components))
+	  if (dump_file && !bitmap_subset_p (pro, m_components))
 	    {
 	      fprintf (dump_file, "  BAD pro %d->%d", e->src->index,
 		       e->dest->index);
@@ -1367,12 +1529,12 @@ 
 
 /* Place code for prologues and epilogues for COMPONENTS where we can put
    that code at the start of basic blocks.  */
-static void
-emit_common_heads_for_components (sbitmap components)
+void
+sw_sep_base::emit_common_heads_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1381,8 +1543,8 @@ 
 	 predecessor edges to this block.  */
 
       /* First, select all possible components.  */
-      bitmap_copy (epi, components);
-      bitmap_copy (pro, components);
+      bitmap_copy (epi, m_components);
+      bitmap_copy (pro, m_components);
 
       edge e;
       edge_iterator ei;
@@ -1421,24 +1583,14 @@ 
       /* Place code after the BB note.  */
       if (!bitmap_empty_p (pro))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_prologue_components (pro);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  emit_insn_after (seq, bb_note (bb));
+	  emit_prologue_into_bb (bb, pro, true);
 
 	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, pro);
 	}
 
       if (!bitmap_empty_p (epi))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_epilogue_components (epi);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  emit_insn_after (seq, bb_note (bb));
+	  emit_epilogue_into_bb (bb, epi, true);
 
 	  bitmap_ior (SW (bb)->head_components, SW (bb)->head_components, epi);
 	}
@@ -1451,12 +1603,12 @@ 
 
 /* Place code for prologues and epilogues for COMPONENTS where we can put
    that code at the end of basic blocks.  */
-static void
-emit_common_tails_for_components (sbitmap components)
+void
+sw_sep_base::emit_common_tails_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap tmp = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1467,8 +1619,8 @@ 
 	continue;
 
       /* First, select all possible components.  */
-      bitmap_copy (epi, components);
-      bitmap_copy (pro, components);
+      bitmap_copy (epi, m_components);
+      bitmap_copy (pro, m_components);
 
       edge e;
       edge_iterator ei;
@@ -1510,32 +1662,13 @@ 
       /* Put the code at the end of the BB, but before any final jump.  */
       if (!bitmap_empty_p (epi))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_epilogue_components (epi);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  rtx_insn *insn = BB_END (bb);
-	  if (control_flow_insn_p (insn))
-	    emit_insn_before (seq, insn);
-	  else
-	    emit_insn_after (seq, insn);
-
+	  emit_epilogue_into_bb (bb, epi, false);
 	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, epi);
 	}
 
       if (!bitmap_empty_p (pro))
 	{
-	  start_sequence ();
-	  targetm.shrink_wrap.emit_prologue_components (pro);
-	  rtx_insn *seq = get_insns ();
-	  end_sequence ();
-
-	  rtx_insn *insn = BB_END (bb);
-	  if (JUMP_P (insn) || (CALL_P (insn) && SIBLING_CALL_P (insn)))
-	    emit_insn_before (seq, insn);
-	  else
-	    emit_insn_after (seq, insn);
+	  emit_prologue_into_bb (bb, pro, false);
 
 	  bitmap_ior (SW (bb)->tail_components, SW (bb)->tail_components, pro);
 	}
@@ -1548,11 +1681,11 @@ 
 
 /* Place prologues and epilogues for COMPONENTS on edges, if we haven't already
    placed them inside blocks directly.  */
-static void
-insert_prologue_epilogue_for_components (sbitmap components)
+void
+sw_sep_base::insert_prologue_epilogue_for_components ()
 {
-  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (components));
-  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (components));
+  sbitmap pro = sbitmap_alloc (SBITMAP_SIZE (m_components));
+  sbitmap epi = sbitmap_alloc (SBITMAP_SIZE (m_components));
 
   basic_block bb;
   FOR_EACH_BB_FN (bb, cfun)
@@ -1569,8 +1702,8 @@ 
 			    SW (e->dest)->has_components);
 	  bitmap_and_compl (pro, SW (e->dest)->has_components,
 			    SW (e->src)->has_components);
-	  bitmap_and (epi, epi, components);
-	  bitmap_and (pro, pro, components);
+	  bitmap_and (epi, epi, m_components);
+	  bitmap_and (pro, pro, m_components);
 
 	  /* Deselect those we already have put at the head or tail of the
 	     edge's dest resp. src.  */
@@ -1590,36 +1723,8 @@ 
 		  fprintf (dump_file, "\n");
 		}
 
-	      /* Put the epilogue components in place.  */
-	      start_sequence ();
-	      targetm.shrink_wrap.emit_epilogue_components (epi);
-	      rtx_insn *seq = get_insns ();
-	      end_sequence ();
-
-	      if (e->flags & EDGE_SIBCALL)
-		{
-		  gcc_assert (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun));
-
-		  rtx_insn *insn = BB_END (e->src);
-		  gcc_assert (CALL_P (insn) && SIBLING_CALL_P (insn));
-		  emit_insn_before (seq, insn);
-		}
-	      else if (e->dest == EXIT_BLOCK_PTR_FOR_FN (cfun))
-		{
-		  gcc_assert (e->flags & EDGE_FALLTHRU);
-		  basic_block new_bb = split_edge (e);
-		  emit_insn_after (seq, BB_END (new_bb));
-		}
-	      else
-		insert_insn_on_edge (seq, e);
-
-	      /* Put the prologue components in place.  */
-	      start_sequence ();
-	      targetm.shrink_wrap.emit_prologue_components (pro);
-	      seq = get_insns ();
-	      end_sequence ();
-
-	      insert_insn_on_edge (seq, e);
+	      emit_epilogue_on_edge (e, epi);
+	      emit_prologue_on_edge (e, pro);
 	    }
 	}
     }
@@ -1630,6 +1735,62 @@ 
   commit_edge_insertions ();
 }
 
+/* Run the separate shrink-wrapping algorithm: choose a placement for each
+   component's prologues and epilogues, emit them through the virtual
+   functions of the derived class, and return the set of components that
+   were wrapped separately.  */
+sbitmap
+sw_sep_base::run ()
+{
+  init ();
+
+  sbitmap_iterator sbi;
+  unsigned int j;
+  EXECUTE_IF_SET_IN_BITMAP (m_components, 0, j, sbi)
+    place_prologue_for_one_component (j, m_first_bb);
+
+  spread_components (m_first_bb);
+
+  disqualify_problematic_components ();
+
+  /* Don't separately shrink-wrap anything where the "main" prologue will
+     go; the target code can often optimize things if it is presented with
+     all components together (say, if it generates store-multiple insns).  */
+  bitmap_and_compl (m_components, m_components,
+		    SW (m_first_bb)->has_components);
+
+  if (bitmap_empty_p (m_components))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not wrapping anything separately.\n");
+      return m_components;
+    }
+  if (dump_file)
+    {
+      fprintf (dump_file, "The components we wrap separately are");
+      dump_components ("sep", m_components);
+      fprintf (dump_file, "\n");
+
+      fprintf (dump_file, "... Inserting common heads...\n");
+    }
+
+  emit_common_heads_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Inserting common tails...\n");
+
+  emit_common_tails_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Inserting the more difficult ones...\n");
+
+  insert_prologue_epilogue_for_components ();
+
+  if (dump_file)
+    fprintf (dump_file, "... Done.\n");
+
+  set_handled ();
+
+  fini ();
+  return m_components;
+}
+
 /* The main entry point to this subpass.  FIRST_BB is where the prologue
    would be normally put.  */
 void
@@ -1663,61 +1824,310 @@ 
   calculate_dominance_info (CDI_DOMINATORS);
   calculate_dominance_info (CDI_POST_DOMINATORS);
 
-  init_separate_shrink_wrap (components);
+  sw_sep engine (first_bb, components);
+  components = engine.run ();
+  engine.fini ();
 
-  sbitmap_iterator sbi;
-  unsigned int j;
-  EXECUTE_IF_SET_IN_BITMAP (components, 0, j, sbi)
-    place_prologue_for_one_component (j, first_bb);
+  sbitmap_free (components);
+  free_dominance_info (CDI_DOMINATORS);
+  free_dominance_info (CDI_POST_DOMINATORS);
+}
 
-  spread_components (first_bb);
+#if CHECKING_P
 
-  disqualify_problematic_components (components);
+const int n_components_for_test = 32;
+const int n_cfgs_to_try = 32;
+const int n_paths_to_try = 256;
+const int max_path_len = 32;
 
-  /* Don't separately shrink-wrap anything where the "main" prologue will
-     go; the target code can often optimize things if it is presented with
-     all components together (say, if it generates store-multiple insns).  */
-  bitmap_and_compl (components, components, SW (first_bb)->has_components);
+namespace selftest {
 
-  if (bitmap_empty_p (components))
+/* A derived class that performs a self-test of the separate
+   shrink-wrapping algorithm.  */
+class sw_test_sep : public sw_sep_base
+{
+  hash_map<basic_block, sbitmap> m_orig_components;
+  hash_map<basic_block, sbitmap> m_pro_added_to_head;
+  hash_map<basic_block, sbitmap> m_epi_added_to_head;
+  hash_map<basic_block, sbitmap> m_pro_added_to_end;
+  hash_map<basic_block, sbitmap> m_epi_added_to_end;
+  hash_map<edge, sbitmap> m_pro_added_to_edge;
+  hash_map<edge, sbitmap> m_epi_added_to_edge;
+
+  sbitmap components_for_bb (basic_block);
+  void disqualify (edge, sbitmap, bool);
+  void emit_prologue_into_bb (basic_block, sbitmap, bool);
+  void emit_epilogue_into_bb (basic_block, sbitmap, bool);
+  void emit_prologue_on_edge (edge, sbitmap);
+  void emit_epilogue_on_edge (edge, sbitmap);
+  void set_handled ();
+
+ public:
+  void fini ();
+
+  sw_test_sep (basic_block first_bb, sbitmap components)
+    : sw_sep_base (first_bb, components)
+  {
+  }
+};
+
+/* For the self-test, give every normal block a random set of components it
+   needs, and remember that set so set_handled can verify the result.  */
+sbitmap
+sw_test_sep::components_for_bb (basic_block bb)
+{
+  sbitmap components = sbitmap_alloc (n_components_for_test);
+  bitmap_clear (components);
+  if (bb != ENTRY_BLOCK_PTR_FOR_FN (cfun)
+      && bb != EXIT_BLOCK_PTR_FOR_FN (cfun))
+    for (int i = 0; i < n_components_for_test; i++)
+      if (random () & 1)
+	bitmap_set_bit (components, i);
+  m_orig_components.put (bb, components);
+  return components;
+}
+
+void
+sw_test_sep::disqualify (edge, sbitmap, bool)
+{
+  /* This mechanism isn't tested yet.  */
+}
+
+/* For the self-test, record that epilogue components EPI were placed on
+   edge E, and check that none of them had already been placed there.  */
+void
+sw_test_sep::emit_epilogue_on_edge (edge e, sbitmap epi)
+{
+  sbitmap *p = m_epi_added_to_edge.get (e);
+  sbitmap mask;
+  if (p == NULL)
     {
-      if (dump_file)
-	fprintf (dump_file, "Not wrapping anything separately.\n");
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      m_epi_added_to_edge.put (e, mask);
     }
   else
+    mask = *p;
+
+  ASSERT_FALSE (bitmap_intersect_p (mask, epi));
+  bitmap_ior (mask, mask, epi);
+}
+
+/* For the self-test, record that prologue components PRO were placed on
+   edge E, and check that none of them had already been placed there.  */
+void
+sw_test_sep::emit_prologue_on_edge (edge e, sbitmap pro)
+{
+  sbitmap *p = m_pro_added_to_edge.get (e);
+  sbitmap mask;
+  if (p == NULL)
     {
-      if (dump_file)
-	{
-	  fprintf (dump_file, "The components we wrap separately are");
-	  dump_components ("sep", components);
-	  fprintf (dump_file, "\n");
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      m_pro_added_to_edge.put (e, mask);
+    }
+  else
+    mask = *p;
 
-	  fprintf (dump_file, "... Inserting common heads...\n");
-	}
+  ASSERT_FALSE (bitmap_intersect_p (mask, pro));
+  bitmap_ior (mask, mask, pro);
+}
 
-      emit_common_heads_for_components (components);
+/* For the self-test, record that epilogue components EPI were placed in
+   basic block BB (at the start if AT_HEAD, at the end otherwise), and
+   check that none of them had already been placed there.  */
+void
+sw_test_sep::emit_epilogue_into_bb (basic_block bb, sbitmap epi, bool at_head)
+{
+  hash_map<basic_block, sbitmap> *map;
+  map = at_head ? &m_epi_added_to_head : &m_epi_added_to_end;
+  sbitmap *p = map->get (bb);
+  sbitmap mask;
+  if (p == NULL)
+    {
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      map->put (bb, mask);
+    }
+  else
+    mask = *p;
 
-      if (dump_file)
-	fprintf (dump_file, "... Inserting common tails...\n");
+  ASSERT_FALSE (bitmap_intersect_p (mask, epi));
+  bitmap_ior (mask, mask, epi);
+}
 
-      emit_common_tails_for_components (components);
+/* For the self-test, record that prologue components PRO were placed in
+   basic block BB (at the start if AT_HEAD, at the end otherwise), and
+   check that none of them had already been placed there.  */
+void
+sw_test_sep::emit_prologue_into_bb (basic_block bb, sbitmap pro, bool at_head)
+{
+  hash_map<basic_block, sbitmap> *map;
+  map = at_head ? &m_pro_added_to_head : &m_pro_added_to_end;
+  sbitmap *p = map->get (bb);
+  sbitmap mask;
+  if (p == NULL)
+    {
+      mask = sbitmap_alloc (n_components_for_test);
+      bitmap_clear (mask);
+      map->put (bb, mask);
+    }
+  else
+    mask = *p;
 
-      if (dump_file)
-	fprintf (dump_file, "... Inserting the more difficult ones...\n");
+  ASSERT_FALSE (bitmap_intersect_p (mask, pro));
+  bitmap_ior (mask, mask, pro);
+}
 
-      insert_prologue_epilogue_for_components (components);
+/* Called by the engine if separate shrink-wrapping was successful for
+   some components.  For the test, we use this to verify the result.  */
+void
+sw_test_sep::set_handled ()
+{
+  basic_block bb;
+  /* Verify that we didn't add both a prologue and an epilogue for the same
+     component in the same place.  */
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      sbitmap *pro = m_pro_added_to_head.get (bb);
+      sbitmap *epi = m_epi_added_to_head.get (bb);
+      if (pro && epi)
+	ASSERT_FALSE (bitmap_intersect_p (*pro, *epi));
+      pro = m_pro_added_to_end.get (bb);
+      epi = m_epi_added_to_end.get (bb);
+      if (pro && epi)
+	ASSERT_FALSE (bitmap_intersect_p (*pro, *epi));
+    }
 
-      if (dump_file)
-	fprintf (dump_file, "... Done.\n");
+  /* Try random paths through the function, and verify that
+     (a) if a block needs a component, it has been saved, and
+     (b) no component is saved or restored more than once in
+         a row.  */
 
-      targetm.shrink_wrap.set_handled_components (components);
+  sbitmap active = sbitmap_alloc (n_components_for_test);
+  for (int i = 0; i < n_paths_to_try; i++)
+    {
+      basic_block curr = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+      bitmap_clear (active);
 
-      crtl->shrink_wrapped_separate = true;
+      for (int len = 0; len < max_path_len; len++)
+	{
+	  /* Update for the bb head.  */
+	  sbitmap *pro_entry = m_pro_added_to_head.get (curr);
+	  sbitmap *epi_entry = m_epi_added_to_head.get (curr);
+	  if (epi_entry)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_entry, active));
+	      bitmap_and_compl (active, active, *epi_entry);
+	    }
+	  if (pro_entry)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_entry, active));
+	      bitmap_ior (active, active, *pro_entry);
+	    }
+
+	  /* Make sure the required components are active.  */
+	  sbitmap *required = m_orig_components.get (curr);
+	  if (required)
+	    ASSERT_FALSE (bitmap_intersect_compl_p (*required, active));
+
+	  /* Update for the bb end.  */
+	  sbitmap *pro_end = m_pro_added_to_end.get (curr);
+	  sbitmap *epi_end = m_epi_added_to_end.get (curr);
+	  if (epi_end)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_end, active));
+	      bitmap_and_compl (active, active, *epi_end);
+	    }
+	  if (pro_end)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_end, active));
+	      bitmap_ior (active, active, *pro_end);
+	    }
+
+	  int l = EDGE_COUNT (curr->succs);
+	  if (l == 0)
+	    break;
+	  int n = random () % l;
+	  edge e = EDGE_SUCC (curr, n);
+
+	  /* Update for the edge.  */
+	  sbitmap *pro_edge = m_pro_added_to_edge.get (e);
+	  sbitmap *epi_edge = m_epi_added_to_edge.get (e);
+	  if (epi_edge)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_compl_p (*epi_edge, active));
+	      bitmap_and_compl (active, active, *epi_edge);
+	    }
+	  if (pro_edge)
+	    {
+	      ASSERT_FALSE (bitmap_intersect_p (*pro_edge, active));
+	      bitmap_ior (active, active, *pro_edge);
+	    }
+
+	  curr = e->dest;
+	  if (curr == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    break;
+	}
     }
+}
 
-  fini_separate_shrink_wrap ();
+/* Free the bitmaps recorded during the test, then the pass-specific data.  */
+void
+sw_test_sep::fini ()
+{
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      sbitmap *proh = m_pro_added_to_head.get (bb);
+      sbitmap *epih = m_epi_added_to_head.get (bb);
+      sbitmap *proe = m_pro_added_to_end.get (bb);
+      sbitmap *epie = m_epi_added_to_end.get (bb);
+      if (proh)
+	sbitmap_free (*proh);
+      if (epih)
+	sbitmap_free (*epih);
+      if (proe)
+	sbitmap_free (*proe);
+      if (epie)
+	sbitmap_free (*epie);
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  sbitmap *pro = m_pro_added_to_edge.get (e);
+	  sbitmap *epi = m_epi_added_to_edge.get (e);
+	  if (pro)
+	    sbitmap_free (*pro);
+	  if (epi)
+	    sbitmap_free (*epi);
+	}
+    }
+
+  sw_sep_base::fini ();
+}
+
+/* Run a self-test of the separate shrink-wrapping algorithm on a randomly
+   built CFG.  */
+void
+shrink_wrapping_separate_tests ()
+{
+  /* For the test, pretend there are n_components_for_test components and
+     that all of them are candidates for separate wrapping.  */
+  sbitmap components = sbitmap_alloc (n_components_for_test);
+  bitmap_ones (components);
+
+  basic_block first = build_random_cfg (5);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  calculate_dominance_info (CDI_POST_DOMINATORS);
+
+  sw_test_sep engine (first, components);
+  components = engine.run ();
+  engine.fini ();
 
   sbitmap_free (components);
   free_dominance_info (CDI_DOMINATORS);
   free_dominance_info (CDI_POST_DOMINATORS);
+
+  free_random_cfg ();
 }
+
+} // namespace selftest
+
+#endif
diff '--width=118' -x .svn -dru sw-selftest-base/gcc/tree-cfg.c with-selftest/gcc/tree-cfg.c
--- sw-selftest-base/gcc/tree-cfg.c	2016-08-25 13:49:37.443942684 +0200
+++ with-selftest/gcc/tree-cfg.c	2016-08-26 14:45:06.643454608 +0200
@@ -9234,6 +9234,110 @@ 
   return fndecl;
 }
 
+/* Create a new empty basic block, add an edge to it from every block in
+   *IN_EXITS, push it onto *EXITS, and return it.  */
+static basic_block
+build_single_after (vec<basic_block> *in_exits, vec<basic_block> *exits)
+{
+  basic_block bb_a = create_empty_bb ((*in_exits)[0]);
+  exits->safe_push (bb_a);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (*in_exits, i, bb)
+    make_edge (bb, bb_a, i == 0 ? EDGE_FALLTHRU : 0);
+  return bb_a;
+}
+
+/* Create a conditional block with a true edge into the region headed by
+   THN.  The new block itself is also an exit of the region (its other
+   outgoing edge is added later); return it.  */
+static basic_block
+build_if (function *fun, basic_block thn, vec<basic_block> *exits)
+{
+  basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+  basic_block bb_a = create_empty_bb (entry);
+  make_edge (bb_a, thn, EDGE_TRUE_VALUE);
+  exits->safe_push (bb_a);
+  return bb_a;
+}
+
+/* Create a conditional block with a true edge to THN and a false edge to
+   ELS, and return it.  */
+static basic_block
+build_ifelse (function *fun, basic_block thn, basic_block els)
+{
+  basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+  basic_block bb_a = create_empty_bb (entry);
+  make_edge (bb_a, thn, EDGE_TRUE_VALUE);
+  make_edge (bb_a, els, EDGE_FALSE_VALUE);
+  return bb_a;
+}
+
+/* Make the region headed by INNER_HEAD into a loop: each block in
+   *INNER_EXITS gets a back edge to INNER_HEAD, and a new block after the
+   region becomes the loop exit (pushed onto *EXITS).  Return INNER_HEAD.  */
+static basic_block
+build_loop (basic_block inner_head, vec<basic_block> *inner_exits,
+	    vec<basic_block> *exits)
+{
+  build_single_after (inner_exits, exits);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (*inner_exits, i, bb)
+    make_edge (bb, inner_head, 0);
+
+  return inner_head;
+}
+
+/* Recursively build a random region of up to MAXDEPTH nested constructs in
+   FUN.  Push the region's exit blocks onto *EXITS and return its head
+   block.  */
+static basic_block
+recursive_build_cfg (function *fun, int maxdepth, vec<basic_block> *exits)
+{
+  if (maxdepth == 0)
+    {
+      basic_block entry = ENTRY_BLOCK_PTR_FOR_FN (fun);
+      basic_block bb_a = create_empty_bb (entry);
+      exits->safe_push (bb_a);
+      return bb_a;
+    }
+  int subdepth1 = random () % maxdepth;
+  int subdepth2 = random () % maxdepth;
+  auto_vec<basic_block> inner_exits;
+  basic_block inner, inner2;
+
+  switch (random () % 4)
+    {
+    case 0:
+      inner = recursive_build_cfg (fun, subdepth1, exits);
+      return build_if (fun, inner, exits);
+    case 1:
+      inner = recursive_build_cfg (fun, subdepth1, exits);
+      inner2 = recursive_build_cfg (fun, subdepth2, exits);
+      return build_ifelse (fun, inner, inner2);
+    case 2:
+      inner = recursive_build_cfg (fun, subdepth1, &inner_exits);
+      return build_loop (inner, &inner_exits, exits);
+    case 3:
+      inner = recursive_build_cfg (fun, subdepth1, &inner_exits);
+      build_single_after (&inner_exits, exits);
+      return inner;
+    default:
+      gcc_unreachable ();
+    }
+}
+
+/* Build a random CFG of nesting depth at most MAXDEPTH in a new dummy
+   function, connect it to the entry and exit blocks, and return its head
+   block.  The caller must call free_random_cfg when done.  */
+basic_block
+build_random_cfg (int maxdepth)
+{
+  gimple_register_cfg_hooks ();
+
+  tree fndecl = push_fndecl ("cfg_test_random");
+  function *fun = DECL_STRUCT_FUNCTION (fndecl);
+
+  auto_vec<basic_block> rgn_exits;
+  basic_block head = recursive_build_cfg (fun, maxdepth, &rgn_exits);
+  make_edge (ENTRY_BLOCK_PTR_FOR_FN (fun), head, EDGE_FALLTHRU);
+  basic_block bb;
+  unsigned i;
+  FOR_EACH_VEC_ELT (rgn_exits, i, bb)
+    make_edge (bb, EXIT_BLOCK_PTR_FOR_FN (fun), 0);
+  return head;
+}
+
+/* Tear down the dummy function built by build_random_cfg.  */
+void
+free_random_cfg ()
+{
+  pop_cfun ();
+}
+
 /* These tests directly create CFGs.
    Compare with the static fns within tree-cfg.c:
      - build_gimple_cfg