Gimple loop splitting

Message ID	alpine.LSU.2.20.1511121734040.11029@wotan.suse.de
State	New
Headers	show Return-Path: <gcc-patches-return-413848-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:subject:message-id:mime-version:content-type; q=dns; s= default; b=MEqkOZCJWkP+phExR/9QKTV32hNVDv4ZnPub6BWkOx4ii4lYrv3Dr pyiMd4Hk6VqQTW6kQuQ49LcWrgxKCVXjsKr41GuQHsWiTzlifcX4vAxmwPZkrpfS Hq1Js7uQmrbvzwkRoLBrOcLcphaoVGPrShhMTOV30+a5wrf/+4Okts= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Date: Thu, 12 Nov 2015 17:52:51 +0100 (CET) From: Michael Matz <matz@suse.de> To: gcc-patches@gcc.gnu.org Subject: Gimple loop splitting Message-ID: <alpine.LSU.2.20.1511121734040.11029@wotan.suse.de> User-Agent: Alpine 2.20 (LSU 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII

Michael Matz Nov. 12, 2015, 4:52 p.m. UTC

Hello,

this new pass implements loop iteration space splitting for loops that 
contain a conditional that's always true for one part of the iteration 
space and false for the other, i.e. such situations:

  for (i = beg; i < end; i++)
    if (i < p)
      dothis();
    else
      dothat();

this is transformed into roughly:

  for (i = beg; i < p; i++)
    dothis();
  for (; i < end; i++)
    dothat();

Of course, not quite the above as there needs to be provisions for the 
border conditions, if e.g. 'p' is outside the original iteration space, or 
the conditional doesn't directly use the control IV, but some other, or 
the IV runs backwards.  The testcase checks many of these border 
conditions.

This transformation is in itself a good one but can also be an enabler for 
the vectorizer.  It does increase code size, when the loop body contains 
also unconditional code (that one is duplicated), so we only transform hot 
loops.  I'm a bit unsure of the placement of the new pass, or if it should 
be an own pass at all.  Right now I've placed it after unswitching and 
scev_cprop, before loop distribution.  Ideally I think all three, together 
with loop fusion and an gimple unroller should be integrated into one loop 
nest optimizer, alas, we aren't there yet.

I'm planning to work on loop fusion in the future as well, but that's not 
for GCC 6.

I've regstrapped this pass enabled with -O2 on x86-64-linux, without 
regressions.  I've also checked cpu2006 (the non-fortran part) for 
correctness, not yet for performance.  In the end it should probably only 
be enabled for -O3+ (although if the whole loop body is conditional it 
makes sense to also have it with -O2 because code growth is very small 
then).

So, okay for trunk?


Ciao,
Michael.
	* passes.def (pass_loop_split): Add.
	* timevar.def (TV_LOOP_SPLIT): Add.
	* tree-pass.h (make_pass_loop_split): Declare.
	* tree-ssa-loop-manip.h (rewrite_into_loop_closed_ssa_1): Declare.
	* tree-ssa-loop-unswitch.c: Include tree-ssa-loop-manip.h,
	cfganal.h, tree-chrec.h, tree-affine.h, tree-scalar-evolution.h,
	gimple-pretty-print.h, gimple-fold.h, gimplify-me.h.
	(split_at_bb_p, patch_loop_exit, find_or_create_guard_phi,
	split_loop, tree_ssa_split_loops,
	make_pass_loop_split): New functions.
	(pass_data_loop_split): New.
	(pass_loop_split): New.

testsuite/
	* gcc.dg/loop-split.c: New test.

Jeff Law Nov. 12, 2015, 9:44 p.m. UTC | #1

On 11/12/2015 09:52 AM, Michael Matz wrote:
> Hello,
>
> this new pass implements loop iteration space splitting for loops that
> contain a conditional that's always true for one part of the iteration
> space and false for the other, i.e. such situations:
FWIW, Ajit suggested the same transformation earlier this year.  During 
that discussion Richi indicated that for hmmer this transformation would 
enable vectorization.

>
> This transformation is in itself a good one but can also be an enabler for
> the vectorizer.
Agreed.


   It does increase code size, when the loop body contains
> also unconditional code (that one is duplicated), so we only transform hot
> loops.
Probably ought to be disabled when we're not optimizing for speed as well.




   I'm a bit unsure of the placement of the new pass, or if it should
> be an own pass at all.  Right now I've placed it after unswitching and
> scev_cprop, before loop distribution.  Ideally I think all three, together
> with loop fusion and an gimple unroller should be integrated into one loop
> nest optimizer, alas, we aren't there yet.
Given its impact on the looping structure, I'd think early in the loop 
optimizer.  Given the similarities with unswitching, I think 
before/after unswitching is a natural first cut.  We can always iterate 
if it looks like putting it elsewhere would make sense.



> I've regstrapped this pass enabled with -O2 on x86-64-linux, without
> regressions.  I've also checked cpu2006 (the non-fortran part) for
> correctness, not yet for performance.  In the end it should probably only
> be enabled for -O3+ (although if the whole loop body is conditional it
> makes sense to also have it with -O2 because code growth is very small
> then).
Very curious on the performance side, so if you could get some #s on 
that, it'd be greatly appreciated.

I'd be comfortable with this at -O2, but won't object if you'd prefer -O3.


>
> So, okay for trunk?
>
>
> Ciao,
> Michael.
> 	* passes.def (pass_loop_split): Add.
> 	* timevar.def (TV_LOOP_SPLIT): Add.
> 	* tree-pass.h (make_pass_loop_split): Declare.
> 	* tree-ssa-loop-manip.h (rewrite_into_loop_closed_ssa_1): Declare.
> 	* tree-ssa-loop-unswitch.c: Include tree-ssa-loop-manip.h,
> 	cfganal.h, tree-chrec.h, tree-affine.h, tree-scalar-evolution.h,
> 	gimple-pretty-print.h, gimple-fold.h, gimplify-me.h.
> 	(split_at_bb_p, patch_loop_exit, find_or_create_guard_phi,
> 	split_loop, tree_ssa_split_loops,
> 	make_pass_loop_split): New functions.
> 	(pass_data_loop_split): New.
> 	(pass_loop_split): New.
>
> testsuite/
> 	* gcc.dg/loop-split.c: New test.
Please clean up the #if 0/#if 1 code in the new tests.  You might also 
want to clean out the TRACE stuff.  Essentially the tests look like you 
just dropped in a test you'd been running by hand until now :-)

I don't see any negative tests -- ie tests that should not be split due 
to boundary conditions.  Do you have any from development?  If so it'd 
be good to have those too.

>
> Index: tree-ssa-loop-manip.h
> ===================================================================
> --- tree-ssa-loop-manip.h	(revision 229763)
> +++ tree-ssa-loop-manip.h	(working copy)
> @@ -24,6 +24,8 @@ typedef void (*transform_callback)(struc
>
>   extern void create_iv (tree, tree, tree, struct loop *, gimple_stmt_iterator *,
>   		       bool, tree *, tree *);
> +extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
> +					    struct loop *);
>   extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
>   extern void rewrite_virtuals_into_loop_closed_ssa (struct loop *);
>   extern void verify_loop_closed_ssa (bool);
> Index: tree-ssa-loop-unswitch.c
> ===================================================================
> --- tree-ssa-loop-unswitch.c	(revision 229763)
> +++ tree-ssa-loop-unswitch.c	(working copy)
Given the amount of new code, unless there's a strong need, I'd prefer 
this transformation to be implemented in its own file.



> +
> +/* Give an induction variable GUARD_IV, and its affine descriptor IV,
> +   find the loop phi node in LOOP defining it directly, or create
> +   such phi node.  Return that phi node.  */
> +
> +static gphi *
> +find_or_create_guard_phi (struct loop *loop, tree guard_iv, affine_iv * /*iv*/)
> +{
> +  gimple *def = SSA_NAME_DEF_STMT (guard_iv);
> +  gphi *phi;
> +  if ((phi = dyn_cast <gphi *> (def))
> +      && gimple_bb (phi) == loop->header)
> +    return phi;
> +
> +  /* XXX Create the PHI instead.  */
> +  return NULL;
So right now we just punt if we need to create the PHI?  Does that 
happen with any kind of regularity in practice?


> +}
> +
> +/* Checks if LOOP contains an conditional block whose condition
> +   depends on which side in the iteration space it is, and if so
> +   splits the iteration space into two loops.  Returns true if the
> +   loop was split.  NITER must contain the iteration descriptor for the
> +   single exit of LOOP.  */
> +
> +static bool
> +split_loop (struct loop *loop, struct tree_niter_desc *niter)
This should probably be broken up a bit more.  It's loooong as-is.

Without looking at how much stuff would have to be passed around, 
diddling the exit edge of the first loop, phi updates for the 2nd loop, 
fix iteration space of 2nd loop, exit block fixup might be a good 
initial cut at breaking this down into something of manageable size. 
Not sure if the setup and initial versioning should be broken out or not.


> +	initialize_original_copy_tables ();
> +	basic_block cond_bb;
> +	struct loop *floop = loop_version (loop, cond, &cond_bb,
> +					   REG_BR_PROB_BASE, REG_BR_PROB_BASE,
> +					   REG_BR_PROB_BASE, false);
> +	gcc_assert (floop);
> +	update_ssa (TODO_update_ssa);
> +
> +	/* Now diddle the exit edge of the first loop (floop->join in the
> +	   above) to either go to the common exit (join) or to the second
> +	   loop, depending on if there are still iterations left, or not.
> +	   We split the floop exit edge and insert a copy of the
> +	   original exit expression into the new block, that either
> +	   skips the second loop or goes to it.  */
So after diddling, haven't we mucked up the dominator tree and the SSA 
graph?   You're iterating over each PHI in two loop headers and fixing 
the SSA graph by hand AFAICT.   But ISTM the dominator tree is still 
mucked up, right?  I'm thinking specifically about the 2nd loop.  Though 
perhaps it just works since after all your transformations it'll still 
be immediately dominated by the same block as before your transformations.

Overall I think this looks real good.  THe biggest problem IMHO is 
breaking down that monster function a bit.  I'm a bit concerned by the 
dominator tree state.  Worst case is we have to rebuild the dominators 
before ensuring we're LCSSA form, and even that doesn't seem too bad. 
As I mentioned, it may actually be the case that we're OK on the 
dominator tree, kindof by accident more than design.


Jeff

Michael Matz Nov. 16, 2015, 4:05 p.m. UTC | #2

Hi,

On Thu, 12 Nov 2015, Jeff Law wrote:

> > this new pass implements loop iteration space splitting for loops that
> > contain a conditional that's always true for one part of the iteration
> > space and false for the other, i.e. such situations:
> FWIW, Ajit suggested the same transformation earlier this year.  During that
> discussion Richi indicated that for hmmer this transformation would enable
> vectorization.

It's a prerequisite indeed, but not enough in itself.  The next problem 
will be that only parts of access chains inside the hot loop are 
vectorizable, but for that those parts need to be disambiguated.  ICC is 
doing that by a massive chain of conditionals testing non-overlapping of 
the respective arrays at runtime.  Our vectorizer could also do that 
(possibly by increasing the allowed number of conditionals), but the next 
problem then is that one of these (then indeed separated) parts is not 
vectorizable by our vectorizer: it's a 'a[i] = f(a[i-1])' dependency that 
can't yet be handled by us.  If the separation of parts would be done by 
loop distribution that would be fine (we'd have separate loops for the 
parts, some of them vectorizable), but our loop distribution can't do 
runtime disambiguation, only our vectorizer.

hmmer is actually quite interesting because it's a fairly isolated hot 
loop posing quite challenging problems for us :)

> 
>   It does increase code size, when the loop body contains
> > also unconditional code (that one is duplicated), so we only transform hot
> > loops.
> 
> Probably ought to be disabled when we're not optimizing for speed as well.

That should be dealt with by '!optimize_loop_for_size_p (loop)'.

> > I've regstrapped this pass enabled with -O2 on x86-64-linux, without
> > regressions.  I've also checked cpu2006 (the non-fortran part) for
> > correctness, not yet for performance.  In the end it should probably only
> > be enabled for -O3+ (although if the whole loop body is conditional it
> > makes sense to also have it with -O2 because code growth is very small
> > then).
> 
> Very curious on the performance side, so if you could get some #s on that,
> it'd be greatly appreciated.

My test machine misbehaved over the weekend, but as soon as I have them 
I'll update here.

> > testsuite/
> > 	* gcc.dg/loop-split.c: New test.
> 
> Please clean up the #if 0/#if 1 code in the new tests.

Actually I'd prefer if that test contains the by-hand code and the TRACE 
stuff as well, I'd only change the #if 0 into some #if BYHAND or so ...

> You might also want to clean out the TRACE stuff.  Essentially the tests 
> look like you just dropped in a test you'd been running by hand until 
> now :-)

... the reason being, that bugs in the splitter are somewhat unwieldy to 
debug by just staring at the dumps, you only get a checksum mismatch, so 
TRACE=1 is for finding out which of the params and loops is actually 
miscompiled, TRACE=2 for finding the specific iteration that's broken, and 
the #if0 code for putting that situation into a non-macroized and smaller 
function than dotest.  (That's actually how I've run the testcase after I 
had it basically working, extending dotest with a couple more lines, aka 
example loop sitations, adjusting the checksum, and then making a face and 
scratching my head and mucking with the TRACE and #if0 macros :) ).

> I don't see any negative tests -- ie tests that should not be split due 
> to boundary conditions.  Do you have any from development?

Good point, I had some but only ones where I was able to extend the 
splitters to cover them.  I'll think of some that really shouldn't be 
split.

> > Index: tree-ssa-loop-unswitch.c
> > ===================================================================
> > --- tree-ssa-loop-unswitch.c	(revision 229763)
> > +++ tree-ssa-loop-unswitch.c	(working copy)
> Given the amount of new code, unless there's a strong need, I'd prefer this
> transformation to be implemented in its own file.

Okay.

> > +/* Give an induction variable GUARD_IV, and its affine descriptor IV,
> > +   find the loop phi node in LOOP defining it directly, or create
> > +   such phi node.  Return that phi node.  */
> > +
> > +static gphi *
> > +find_or_create_guard_phi (struct loop *loop, tree guard_iv, affine_iv *
> > /*iv*/)
> > +{
> > +  gimple *def = SSA_NAME_DEF_STMT (guard_iv);
> > +  gphi *phi;
> > +  if ((phi = dyn_cast <gphi *> (def))
> > +      && gimple_bb (phi) == loop->header)
> > +    return phi;
> > +
> > +  /* XXX Create the PHI instead.  */
> > +  return NULL;
> 
> So right now we just punt if we need to create the PHI?  Does that 
> happen with any kind of regularity in practice?

Only with such situations:

  for (int i = start; i < end; i++) {
    if (i + offset < bound)
      ...
  }

Here the condition-IV is not directly defined by a PHI node.  If it 
happens often I don't know, I guess the usual situation is testing the 
control IV directly.  The deficiency is not hard to fix.

> > +static bool
> > +split_loop (struct loop *loop, struct tree_niter_desc *niter)
> This should probably be broken up a bit more.  It's loooong as-is.
> 
> Without looking at how much stuff would have to be passed around, 
> diddling the exit edge of the first loop, phi updates for the 2nd loop, 
> fix iteration space of 2nd loop, exit block fixup might be a good 
> initial cut at breaking this down into something of manageable size.

Thanks, I'll do that.

> > +	initialize_original_copy_tables ();
> > +	basic_block cond_bb;
> > +	struct loop *floop = loop_version (loop, cond, &cond_bb,
> > +					   REG_BR_PROB_BASE, REG_BR_PROB_BASE,
> > +					   REG_BR_PROB_BASE, false);
> > +	gcc_assert (floop);
> > +	update_ssa (TODO_update_ssa);
> > +
> > +	/* Now diddle the exit edge of the first loop (floop->join in the
> > +	   above) to either go to the common exit (join) or to the second
> > +	   loop, depending on if there are still iterations left, or not.
> > +	   We split the floop exit edge and insert a copy of the
> > +	   original exit expression into the new block, that either
> > +	   skips the second loop or goes to it.  */
> 
> So after diddling, haven't we mucked up the dominator tree and the SSA 
> graph? You're iterating over each PHI in two loop headers and fixing the 
> SSA graph by hand AFAICT.  But ISTM the dominator tree is still mucked 
> up, right?

I think I convinced myself on paper that the dominator tree is correct due 
to our helpers doing the right thing (loop_version() for the initial 
loop copying and split_edge for the above diddling).  Let's see if I can 
paint some ASCII art.  So, after loop_version (which updates dom) we 
have:

               .------if (cond)-------.
               v                      v
             pre1                   pre2
              |                      |
             h1<----.               h2<----.
              |     |                |     |
          .--ex1    |        .------ex2    |
          |    \    |        |        \    |
          |    l1---'        |        l2---'
          |                  |
          |                  |
          '--X--------->join<'

At this point dominators are all correct (due to loop_version updating 
them), in particular dom(pre1)==dom(pre2)==if(cond).  Now we split 
ex1->join at X, and split_edge also updates them (trivially), but we 
insert a new edge from split_bb to pre2.  There are no paths from region2 
into region1, and anything in region2 except pre2 is still dominated by 
pre2 (or something further down), so if anything changes, then dom(pre2).

               .------if (cond)----.
               v                   |
             pre1                  |
              |                    |
             h1<----.              |
              |     |              |
             ex1    |              |
              | \   |              |
              |  l1-'              |
              v                    |
          .-split-----------.      |
          |                 v      |
          |               pre2<----'
          |                |
          |               h2<----.
          |                |     |
          |               ex2    |
          |                | \   |
          |                | l2--'
          |              .-'
          '------>join<--'

But there's a path directly to pre2, skipping whole region1, so dom(pre2) 
must be still if(cond), as originally.  Also dom(join) doesn't change, 
because what was first a normal diamond between 
if(cond),region1,region2,join now is a meddled diamond with paths from 
region1 to region2, but not back, so the dominator of the join block still 
is the if(cond) block.

This is all true if the internal structure of region1/region2 is sensible, 
and single_exit() regions are such.  Even multiple exits to something 
behind join wouldn't change this, but we don't even have to think about 
this.

In addition, anything not updating dominators correctly would scream 
loudly in the verifier.

The SSA tree is correct after loop_version() and split_edge.  The new edge 
split_bb->pre2 needs the adjustments in that loop over loop PHI nodes.  
That walk must catch everything, if it wouldn't then that would mean a use 
in region2 that's defined in region1, that wasn't originally dominated by 
the def (and hence must have been a loop-carried value and hence be 
defined in the loop header PHI block).

> Overall I think this looks real good.  THe biggest problem IMHO is 
> breaking down that monster function a bit.  I'm a bit concerned by the 
> dominator tree state.  Worst case is we have to rebuild the dominators 
> before ensuring we're LCSSA form, and even that doesn't seem too bad.

Actually keeping LCSSA form correct is doable as well, but needs another 
loop over one or the other PHI nodes.  I punted for now and called 
rewrite_into_loop_closed_ssa_1, which actually isn't too expensive for a 
single loop.

> As I mentioned, it may actually be the case that we're OK on the 
> dominator tree, kindof by accident more than design.

I'm pretty sure it is correct, and it is so by design :)

Thanks for the feedback, I'll update the patch accordingly.

Ciao,
Michael.

Jeff Law Nov. 16, 2015, 11:27 p.m. UTC | #3

On 11/16/2015 09:05 AM, Michael Matz wrote:
> It's a prerequisite indeed, but not enough in itself.
Sigh.  OK.  One can always hope.



>
> hmmer is actually quite interesting because it's a fairly isolated hot
> loop posing quite challenging problems for us :)
Sounds like it.  Essentially, it's a TODO list :-)

>> Probably ought to be disabled when we're not optimizing for speed as well.
>
> That should be dealt with by '!optimize_loop_for_size_p (loop)'.
Doh, must have missed that.

>>
>> Please clean up the #if 0/#if 1 code in the new tests.
>
> Actually I'd prefer if that test contains the by-hand code and the TRACE
> stuff as well, I'd only change the #if 0 into some #if BYHAND or so ...
>
>> You might also want to clean out the TRACE stuff.  Essentially the tests
>> look like you just dropped in a test you'd been running by hand until
>> now :-)
>
> ... the reason being, that bugs in the splitter are somewhat unwieldy to
> debug by just staring at the dumps, you only get a checksum mismatch, so
> TRACE=1 is for finding out which of the params and loops is actually
> miscompiled, TRACE=2 for finding the specific iteration that's broken, and
> the #if0 code for putting that situation into a non-macroized and smaller
> function than dotest.  (That's actually how I've run the testcase after I
> had it basically working, extending dotest with a couple more lines, aka
> example loop sitations, adjusting the checksum, and then making a face and
> scratching my head and mucking with the TRACE and #if0 macros :) ).
OK, if you want to keep them, then  have a consistent way to turn them 
on/off for future debugging.  if0/if1 doesn't provide much of a clue to 
someone else what to turn on/off if they need to debug this stuff.

>
>> I don't see any negative tests -- ie tests that should not be split due
>> to boundary conditions.  Do you have any from development?
>
> Good point, I had some but only ones where I was able to extend the
> splitters to cover them.  I'll think of some that really shouldn't be
> split.
If you've got them, certainly add them.  Though I realize they may get 
lost over time.

>
> Only with such situations:
>
>    for (int i = start; i < end; i++) {
>      if (i + offset < bound)
>        ...
>    }
>
> Here the condition-IV is not directly defined by a PHI node.  If it
> happens often I don't know, I guess the usual situation is testing the
> control IV directly.  The deficiency is not hard to fix.
I'm comfortable waiting until we see the need.

> I think I convinced myself on paper that the dominator tree is correct due
> to our helpers doing the right thing (loop_version() for the initial
> loop copying and split_edge for the above diddling).  Let's see if I can
> paint some ASCII art.  So, after loop_version (which updates dom) we
> have:
OK.  I was worried about the next step -- where we insert the 
conditional on the exit from pre1 to have it transfer to join or pre2.

But in that case, the immediate dominator of pre2 & join is still the 
initial if statement.  So I think we're OK.  That was the conclusion I 
was starting to come to yesterday, having the ascii art makes it pretty 
clear.  I'm just not good at conceptualizing a CFG.  I have to see it 
explicitly and then everything seems so clear and simple.

jeff

Michael Matz Dec. 1, 2015, 4:46 p.m. UTC | #4

Hi,

On Mon, 16 Nov 2015, Jeff Law wrote:

> OK, if you want to keep them, then have a consistent way to turn them 
> on/off for future debugging.  if0/if1 doesn't provide much of a clue to 
> someone else what to turn on/off if they need to debug this stuff.

> > > I don't see any negative tests -- ie tests that should not be split 
> > > due to boundary conditions.  Do you have any from development?
> > 
> > Good point, I had some but only ones where I was able to extend the 
> > splitters to cover them.  I'll think of some that really shouldn't be 
> > split.
> If you've got them, certainly add them.  Though I realize they may get 
> lost over time.

Actually, thinking a bit more about this, I don't have any that wouldn't 
be merely restrictions in the implementation that couldn't be lifted in 
the future (e.g. unequal step sizes), so I've added no additional ones.

> But in that case, the immediate dominator of pre2 & join is still the 
> initial if statement.  So I think we're OK.  That was the conclusion I 
> was starting to come to yesterday, having the ascii art makes it pretty 
> clear.  I'm just not good at conceptualizing a CFG.  I have to see it 
> explicitly and then everything seems so clear and simple.

So, this second version should reflect the review.  I've moved everything 
to a new file, split the long function into several logically separate 
ones, and even included ascii art in the comments :)  The testcase got a 
comment about what to #define for debugging.  I've included the pass to 
-O3 or alternatively if profile-use is on, similar to funswitch-loops.  
I've also added a proper -fsplit-loops option.

There's two functional changes in v2: a bugfix to not try splitting a 
non-iterating loop (irritatingly such a look returns true from 
number_of_iterations_exit, but with an ERROR_MARK comparator), and a 
limitation to avoid combinatorical explosion in artificial testcases: Once 
we have done a splitting, we don't do any in that loops parents (we may 
still do splitting in siblings or childs of siblings).

I've also done some measurements: first, bootstrap time is unaffected, and 
regstrapping succeeds without regressions when I activate the pass by 
default.  Then SPECcpu2006: build times are unaffected, everything builds 
and works also with -fsplit-loops, performance is mostly unaffected, base 
is -Ofast -funroll-loops -fpeel-loops, peak adds -fsplit-loops.

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
400.perlbench    9770        325       30.1 *    9770        323       30.3 *  
401.bzip2        9650        382       25.2 *    9650        382       25.3 *  
403.gcc          8050        242       33.3 *    8050        241       33.4 *  
429.mcf          9120        311       29.3 *    9120        311       29.3 *  
445.gobmk       10490        392       26.8 *   10490        391       26.8 *  
456.hmmer        9330        345       27.0 *    9330        342       27.3 *  
458.sjeng       12100        422       28.7 *   12100        420       28.8 *  
462.libquantum  20720        308       67.3 *   20720        308       67.3 *  
464.h264ref     22130        423       52.3 *   22130        423       52.3 *  
471.omnetpp      6250        273       22.9 *    6250        273       22.9 *  
473.astar        7020        311       22.6 *    7020        311       22.6 *  
483.xalancbmk    6900        191       36.2 *    6900        190       36.2 *  
 Est. SPECint_base2006                 31.7
 Est. SPECint2006                                                      31.7

                                  Estimated                       Estimated
                Base     Base       Base        Peak     Peak       Peak
Benchmarks      Ref.   Run Time     Ratio       Ref.   Run Time     Ratio
-------------- ------  ---------  ---------    ------  ---------  
---------
410.bwaves      13590        235       57.7 *   13590        235       57.8 *  
416.gamess                                  NR                              NR 
433.milc         9180        347       26.5 *    9180        345       26.6 *  
434.zeusmp       9100        269       33.9 *    9100        268       33.9 *  
435.gromacs      7140        260       27.4 *    7140        262       27.3 *  
436.cactusADM   11950        237       50.5 *   11950        240       49.9 *  
437.leslie3d     9400        228       41.3 *    9400        228       41.2 *  
444.namd         8020        312       25.7 *    8020        311       25.7 *  
447.dealII      11440        254       45.0 *   11440        254       45.0 *  
450.soplex       8340        201       41.4 *    8340        202       41.4 *  
453.povray                                  NR                              NR 
454.calculix     8250        282       29.2 *    8250        283       29.2 *  
459.GemsFDTD    10610        310       34.3 *   10610        309       34.3 *  
465.tonto        9840        683       14.4 *    9840        684       14.4 *  
470.lbm         13740        224       61.2 *   13740        224       61.3 *  
481.wrf         11170        291       38.4 *   11170        291       38.4 *  
482.sphinx3     19490        377       51.7 *   19490        377       51.6 *  
 Est. SPECfp_base2006                  36.3
 Est. SPECfp2006                                                       36.3

The 1% improvements and degradations are all inside the normal result 
variations on this machine (I have the feeling that the hmmer improvement 
is stable, and will recheck this).  Not all of the above had loops split 
at all, only: SPECint: 400.perlbench, 403.gcc, 445.gobmk, 456.hmmer, 
462.libquantum, 464.h264ref, 471.omnetpp and SPECfp: 435.gromacs, 
436.cactusADM, 447.dealII, 454.calculix.

So, okay for trunk?


Ciao,
Michael.

Jeff Law Dec. 1, 2015, 10:57 p.m. UTC | #5

On 12/01/2015 09:46 AM, Michael Matz wrote:
> Hi,
>
> So, okay for trunk?
-ENOPATCH

Jeff

Andrew Pinski July 25, 2016, 6:59 a.m. UTC | #6

On Thu, Nov 12, 2015 at 8:52 AM, Michael Matz <matz@suse.de> wrote:
> Hello,
>
> this new pass implements loop iteration space splitting for loops that
> contain a conditional that's always true for one part of the iteration
> space and false for the other, i.e. such situations:
>
>   for (i = beg; i < end; i++)
>     if (i < p)
>       dothis();
>     else
>       dothat();
>
> this is transformed into roughly:
>
>   for (i = beg; i < p; i++)
>     dothis();
>   for (; i < end; i++)
>     dothat();
>
> Of course, not quite the above as there needs to be provisions for the
> border conditions, if e.g. 'p' is outside the original iteration space, or
> the conditional doesn't directly use the control IV, but some other, or
> the IV runs backwards.  The testcase checks many of these border
> conditions.
>
> This transformation is in itself a good one but can also be an enabler for
> the vectorizer.  It does increase code size, when the loop body contains
> also unconditional code (that one is duplicated), so we only transform hot
> loops.  I'm a bit unsure of the placement of the new pass, or if it should
> be an own pass at all.  Right now I've placed it after unswitching and
> scev_cprop, before loop distribution.  Ideally I think all three, together
> with loop fusion and an gimple unroller should be integrated into one loop
> nest optimizer, alas, we aren't there yet.
>
> I'm planning to work on loop fusion in the future as well, but that's not
> for GCC 6.
>
> I've regstrapped this pass enabled with -O2 on x86-64-linux, without
> regressions.  I've also checked cpu2006 (the non-fortran part) for
> correctness, not yet for performance.  In the end it should probably only
> be enabled for -O3+ (although if the whole loop body is conditional it
> makes sense to also have it with -O2 because code growth is very small
> then).
>
> So, okay for trunk?

What ever happened to this patch?  I was looking into doing this
myself today but I found this patch.
It is stage 1 of GCC 7, it might be a good idea to get this patch into GCC.

Thanks,
Andrew

>
>
> Ciao,
> Michael.
>         * passes.def (pass_loop_split): Add.
>         * timevar.def (TV_LOOP_SPLIT): Add.
>         * tree-pass.h (make_pass_loop_split): Declare.
>         * tree-ssa-loop-manip.h (rewrite_into_loop_closed_ssa_1): Declare.
>         * tree-ssa-loop-unswitch.c: Include tree-ssa-loop-manip.h,
>         cfganal.h, tree-chrec.h, tree-affine.h, tree-scalar-evolution.h,
>         gimple-pretty-print.h, gimple-fold.h, gimplify-me.h.
>         (split_at_bb_p, patch_loop_exit, find_or_create_guard_phi,
>         split_loop, tree_ssa_split_loops,
>         make_pass_loop_split): New functions.
>         (pass_data_loop_split): New.
>         (pass_loop_split): New.
>
> testsuite/
>         * gcc.dg/loop-split.c: New test.
>
> Index: passes.def
> ===================================================================
> --- passes.def  (revision 229763)
> +++ passes.def  (working copy)
> @@ -233,6 +233,7 @@ along with GCC; see the file COPYING3.
>           NEXT_PASS (pass_dce);
>           NEXT_PASS (pass_tree_unswitch);
>           NEXT_PASS (pass_scev_cprop);
> +         NEXT_PASS (pass_loop_split);
>           NEXT_PASS (pass_record_bounds);
>           NEXT_PASS (pass_loop_distribution);
>           NEXT_PASS (pass_copy_prop);
> Index: timevar.def
> ===================================================================
> --- timevar.def (revision 229763)
> +++ timevar.def (working copy)
> @@ -179,6 +179,7 @@ DEFTIMEVAR (TV_LIM                   , "
>  DEFTIMEVAR (TV_TREE_LOOP_IVCANON     , "tree canonical iv")
>  DEFTIMEVAR (TV_SCEV_CONST            , "scev constant prop")
>  DEFTIMEVAR (TV_TREE_LOOP_UNSWITCH    , "tree loop unswitching")
> +DEFTIMEVAR (TV_LOOP_SPLIT            , "loop splitting")
>  DEFTIMEVAR (TV_COMPLETE_UNROLL       , "complete unrolling")
>  DEFTIMEVAR (TV_TREE_PARALLELIZE_LOOPS, "tree parallelize loops")
>  DEFTIMEVAR (TV_TREE_VECTORIZATION    , "tree vectorization")
> Index: tree-pass.h
> ===================================================================
> --- tree-pass.h (revision 229763)
> +++ tree-pass.h (working copy)
> @@ -366,6 +366,7 @@ extern gimple_opt_pass *make_pass_tree_n
>  extern gimple_opt_pass *make_pass_tree_loop_init (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_lim (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_tree_unswitch (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_loop_split (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_predcom (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_iv_canon (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_scev_cprop (gcc::context *ctxt);
> Index: tree-ssa-loop-manip.h
> ===================================================================
> --- tree-ssa-loop-manip.h       (revision 229763)
> +++ tree-ssa-loop-manip.h       (working copy)
> @@ -24,6 +24,8 @@ typedef void (*transform_callback)(struc
>
>  extern void create_iv (tree, tree, tree, struct loop *, gimple_stmt_iterator *,
>                        bool, tree *, tree *);
> +extern void rewrite_into_loop_closed_ssa_1 (bitmap, unsigned, int,
> +                                           struct loop *);
>  extern void rewrite_into_loop_closed_ssa (bitmap, unsigned);
>  extern void rewrite_virtuals_into_loop_closed_ssa (struct loop *);
>  extern void verify_loop_closed_ssa (bool);
> Index: tree-ssa-loop-unswitch.c
> ===================================================================
> --- tree-ssa-loop-unswitch.c    (revision 229763)
> +++ tree-ssa-loop-unswitch.c    (working copy)
> @@ -31,12 +31,20 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssa.h"
>  #include "tree-ssa-loop-niter.h"
>  #include "tree-ssa-loop.h"
> +#include "tree-ssa-loop-manip.h"
>  #include "tree-into-ssa.h"
> +#include "cfganal.h"
>  #include "cfgloop.h"
> +#include "tree-chrec.h"
> +#include "tree-affine.h"
> +#include "tree-scalar-evolution.h"
>  #include "params.h"
>  #include "tree-inline.h"
>  #include "gimple-iterator.h"
> +#include "gimple-pretty-print.h"
>  #include "cfghooks.h"
> +#include "gimple-fold.h"
> +#include "gimplify-me.h"
>
>  /* This file implements the loop unswitching, i.e. transformation of loops like
>
> @@ -842,4 +850,551 @@ make_pass_tree_unswitch (gcc::context *c
>    return new pass_tree_unswitch (ctxt);
>  }
>
> +/* Return true when BB inside LOOP is a potential iteration space
> +   split point, i.e. ends with a condition like "IV < comp", which
> +   is true on one side of the iteration space and false on the other,
> +   and the split point can be computed.  If so, also return the border
> +   point in *BORDER and the comparison induction variable in IV.  */
>
> +static tree
> +split_at_bb_p (struct loop *loop, basic_block bb, tree *border, affine_iv *iv)
> +{
> +  gimple *last;
> +  gcond *stmt;
> +  affine_iv iv2;
> +
> +  /* BB must end in a simple conditional jump.  */
> +  last = last_stmt (bb);
> +  if (!last || gimple_code (last) != GIMPLE_COND)
> +    return NULL_TREE;
> +  stmt = as_a <gcond *> (last);
> +
> +  enum tree_code code = gimple_cond_code (stmt);
> +
> +  /* Only handle relational comparisons, for equality and non-equality
> +     we'd have to split the loop into two loops and a middle statement.  */
> +  switch (code)
> +    {
> +      case LT_EXPR:
> +      case LE_EXPR:
> +      case GT_EXPR:
> +      case GE_EXPR:
> +       break;
> +      default:
> +       return NULL_TREE;
> +    }
> +
> +  if (loop_exits_from_bb_p (loop, bb))
> +    return NULL_TREE;
> +
> +  tree op0 = gimple_cond_lhs (stmt);
> +  tree op1 = gimple_cond_rhs (stmt);
> +
> +  if (!simple_iv (loop, loop, op0, iv, false))
> +    return NULL_TREE;
> +  if (!simple_iv (loop, loop, op1, &iv2, false))
> +    return NULL_TREE;
> +
> +  /* Make it so, that the first argument of the condition is
> +     the looping one.  */
> +  if (integer_zerop (iv->step))
> +    {
> +      std::swap (op0, op1);
> +      std::swap (*iv, iv2);
> +      code = swap_tree_comparison (code);
> +      gimple_cond_set_condition (stmt, code, op0, op1);
> +      update_stmt (stmt);
> +    }
> +
> +  if (integer_zerop (iv->step))
> +    return NULL_TREE;
> +  if (!integer_zerop (iv2.step))
> +    return NULL_TREE;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "Found potential split point: ");
> +      print_gimple_stmt (dump_file, stmt, 0, TDF_SLIM);
> +      fprintf (dump_file, " { ");
> +      print_generic_expr (dump_file, iv->base, TDF_SLIM);
> +      fprintf (dump_file, " + I*");
> +      print_generic_expr (dump_file, iv->step, TDF_SLIM);
> +      fprintf (dump_file, " } %s ", get_tree_code_name (code));
> +      print_generic_expr (dump_file, iv2.base, TDF_SLIM);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  *border = iv2.base;
> +  return op0;
> +}
> +
> +/* Given a GUARD conditional stmt inside LOOP, which we want to make always
> +   true or false depending on INITIAL_TRUE, and adjusted values NEXTVAL
> +   (a post-increment IV) and NEWBOUND (the comparator) adjust the loop
> +   exit test statement to loop back only if the GUARD statement will
> +   also be true/false in the next iteration.  */
> +
> +static void
> +patch_loop_exit (struct loop *loop, gcond *guard, tree nextval, tree newbound,
> +                bool initial_true)
> +{
> +  edge exit = single_exit (loop);
> +  gcond *stmt = as_a <gcond *> (last_stmt (exit->src));
> +  gimple_cond_set_condition (stmt, gimple_cond_code (guard),
> +                            nextval, newbound);
> +  update_stmt (stmt);
> +
> +  edge stay = single_pred_edge (loop->latch);
> +
> +  exit->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
> +  stay->flags &= ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
> +
> +  if (initial_true)
> +    {
> +      exit->flags |= EDGE_FALSE_VALUE;
> +      stay->flags |= EDGE_TRUE_VALUE;
> +    }
> +  else
> +    {
> +      exit->flags |= EDGE_TRUE_VALUE;
> +      stay->flags |= EDGE_FALSE_VALUE;
> +    }
> +}
> +
> +/* Give an induction variable GUARD_IV, and its affine descriptor IV,
> +   find the loop phi node in LOOP defining it directly, or create
> +   such phi node.  Return that phi node.  */
> +
> +static gphi *
> +find_or_create_guard_phi (struct loop *loop, tree guard_iv, affine_iv * /*iv*/)
> +{
> +  gimple *def = SSA_NAME_DEF_STMT (guard_iv);
> +  gphi *phi;
> +  if ((phi = dyn_cast <gphi *> (def))
> +      && gimple_bb (phi) == loop->header)
> +    return phi;
> +
> +  /* XXX Create the PHI instead.  */
> +  return NULL;
> +}
> +
> +/* Checks if LOOP contains an conditional block whose condition
> +   depends on which side in the iteration space it is, and if so
> +   splits the iteration space into two loops.  Returns true if the
> +   loop was split.  NITER must contain the iteration descriptor for the
> +   single exit of LOOP.  */
> +
> +static bool
> +split_loop (struct loop *loop, struct tree_niter_desc *niter)
> +{
> +  basic_block *bbs;
> +  unsigned i;
> +  bool changed = false;
> +  tree guard_iv;
> +  tree border;
> +  affine_iv iv;
> +
> +  bbs = get_loop_body (loop);
> +
> +  /* Find a splitting opportunity.  */
> +  for (i = 0; i < loop->num_nodes; i++)
> +    if ((guard_iv = split_at_bb_p (loop, bbs[i], &border, &iv)))
> +      {
> +       /* Handling opposite steps is not implemented yet.  Neither
> +          is handling different step sizes.  */
> +       if ((tree_int_cst_sign_bit (iv.step)
> +            != tree_int_cst_sign_bit (niter->control.step))
> +           || !tree_int_cst_equal (iv.step, niter->control.step))
> +         continue;
> +
> +       /* Find a loop PHI node that defines guard_iv directly,
> +          or create one doing that.  */
> +       gphi *phi = find_or_create_guard_phi (loop, guard_iv, &iv);
> +       if (!phi)
> +         continue;
> +       gcond *guard_stmt = as_a<gcond *> (last_stmt (bbs[i]));
> +       tree guard_init = PHI_ARG_DEF_FROM_EDGE (phi,
> +                                                loop_preheader_edge (loop));
> +       enum tree_code guard_code = gimple_cond_code (guard_stmt);
> +
> +       /* Loop splitting is implemented by versioning the loop, placing
> +          the new loop in front of the old loop, make the first loop iterate
> +          as long as the conditional stays true (or false) and let the
> +          second (original) loop handle the rest of the iterations.
> +
> +          First we need to determine if the condition will start being true
> +          or false in the first loop.  */
> +       bool initial_true;
> +       switch (guard_code)
> +         {
> +           case LT_EXPR:
> +           case LE_EXPR:
> +             initial_true = !tree_int_cst_sign_bit (iv.step);
> +             break;
> +           case GT_EXPR:
> +           case GE_EXPR:
> +             initial_true = tree_int_cst_sign_bit (iv.step);
> +             break;
> +           default:
> +             gcc_unreachable ();
> +         }
> +
> +       /* Build a condition that will skip the first loop when the
> +          guard condition won't ever be true (or false).  */
> +       tree cond = build2 (guard_code, boolean_type_node, guard_init, border);
> +       if (initial_true)
> +         cond = fold_build1 (TRUTH_NOT_EXPR, boolean_type_node, cond);
> +
> +       /* Now version the loop, we will then have this situation:
> +          if (!cond)
> +            for (...) {body}   //floop
> +          else
> +            for (...) {body}   //loop
> +          join:  */
> +       initialize_original_copy_tables ();
> +       basic_block cond_bb;
> +       struct loop *floop = loop_version (loop, cond, &cond_bb,
> +                                          REG_BR_PROB_BASE, REG_BR_PROB_BASE,
> +                                          REG_BR_PROB_BASE, false);
> +       gcc_assert (floop);
> +       update_ssa (TODO_update_ssa);
> +
> +       /* Now diddle the exit edge of the first loop (floop->join in the
> +          above) to either go to the common exit (join) or to the second
> +          loop, depending on if there are still iterations left, or not.
> +          We split the floop exit edge and insert a copy of the
> +          original exit expression into the new block, that either
> +          skips the second loop or goes to it.  */
> +       edge exit = single_exit (floop);
> +       basic_block skip_bb = split_edge (exit);
> +       gcond *skip_stmt;
> +       gimple_stmt_iterator gsi;
> +       edge new_e, skip_e;
> +
> +       gimple *stmt = last_stmt (exit->src);
> +       skip_stmt = gimple_build_cond (gimple_cond_code (stmt),
> +                                      gimple_cond_lhs (stmt),
> +                                      gimple_cond_rhs (stmt),
> +                                      NULL_TREE, NULL_TREE);
> +       gsi = gsi_last_bb (skip_bb);
> +       gsi_insert_after (&gsi, skip_stmt, GSI_NEW_STMT);
> +
> +       skip_e = EDGE_SUCC (skip_bb, 0);
> +       skip_e->flags &= ~EDGE_FALLTHRU;
> +       new_e = make_edge (skip_bb, loop_preheader_edge (loop)->src, 0);
> +       if (exit->flags & EDGE_TRUE_VALUE)
> +         {
> +           skip_e->flags |= EDGE_TRUE_VALUE;
> +           new_e->flags |= EDGE_FALSE_VALUE;
> +         }
> +       else
> +         {
> +           skip_e->flags |= EDGE_FALSE_VALUE;
> +           new_e->flags |= EDGE_TRUE_VALUE;
> +         }
> +
> +       new_e->count = skip_bb->count;
> +       new_e->probability = PROB_LIKELY;
> +       new_e->count = apply_probability (skip_e->count, PROB_LIKELY);
> +       skip_e->count -= new_e->count;
> +       skip_e->probability = inverse_probability (PROB_LIKELY);
> +
> +       /* Now we have created this situation:
> +            if (!cond) {
> +              for (...) {body; if (cexit) break;}
> +              if (!cexit) goto second;
> +            } else {
> +              second:
> +              for (...) {body; if (cexit) break;}
> +            }
> +            join:
> +
> +          The second loop can now be entered by skipping the first
> +          loop (the inital values of its PHI nodes will be the
> +          original initial values), or by falling in from the first
> +          loop (the initial values will be the continuation values
> +          from the first loop).  Insert PHI nodes reflecting this
> +          in the pre-header of the second loop.  */
> +
> +       basic_block rest = loop_preheader_edge (loop)->src;
> +       edge skip_first = find_edge (cond_bb, rest);
> +       gcc_assert (skip_first);
> +
> +       edge firste = loop_preheader_edge (floop);
> +       edge seconde = loop_preheader_edge (loop);
> +       edge firstn = loop_latch_edge (floop);
> +       gphi *new_guard_phi = 0;
> +       gphi_iterator psi_first, psi_second;
> +       for (psi_first = gsi_start_phis (floop->header),
> +            psi_second = gsi_start_phis (loop->header);
> +            !gsi_end_p (psi_first);
> +            gsi_next (&psi_first), gsi_next (&psi_second))
> +         {
> +           tree init, next, new_init;
> +           use_operand_p op;
> +           gphi *phi_first = psi_first.phi ();
> +           gphi *phi_second = psi_second.phi ();
> +
> +           if (phi_second == phi)
> +             new_guard_phi = phi_first;
> +
> +           init = PHI_ARG_DEF_FROM_EDGE (phi_first, firste);
> +           next = PHI_ARG_DEF_FROM_EDGE (phi_first, firstn);
> +           op = PHI_ARG_DEF_PTR_FROM_EDGE (phi_second, seconde);
> +           gcc_assert (operand_equal_for_phi_arg_p (init, USE_FROM_PTR (op)));
> +
> +           /* Prefer using original variable as a base for the new ssa name.
> +              This is necessary for virtual ops, and useful in order to avoid
> +              losing debug info for real ops.  */
> +           if (TREE_CODE (next) == SSA_NAME
> +               && useless_type_conversion_p (TREE_TYPE (next),
> +                                             TREE_TYPE (init)))
> +             new_init = copy_ssa_name (next);
> +           else if (TREE_CODE (init) == SSA_NAME
> +                    && useless_type_conversion_p (TREE_TYPE (init),
> +                                                  TREE_TYPE (next)))
> +             new_init = copy_ssa_name (init);
> +           else if (useless_type_conversion_p (TREE_TYPE (next),
> +                                               TREE_TYPE (init)))
> +             new_init = make_temp_ssa_name (TREE_TYPE (next), NULL,
> +                                            "unrinittmp");
> +           else
> +             new_init = make_temp_ssa_name (TREE_TYPE (init), NULL,
> +                                            "unrinittmp");
> +
> +           gphi * newphi = create_phi_node (new_init, rest);
> +           add_phi_arg (newphi, init, skip_first, UNKNOWN_LOCATION);
> +           add_phi_arg (newphi, next, new_e, UNKNOWN_LOCATION);
> +           SET_USE (op, new_init);
> +         }
> +
> +       /* The iterations of the second loop is now already
> +          exactly those that the first loop didn't do, but the
> +          iteration space of the first loop is still the original one.
> +          Build a new one, exactly covering those iterations where
> +          the conditional is true (or false).  For example, from such a loop:
> +
> +            for (i = beg, j = beg2; i < end; i++, j++)
> +              if (j < c)  // this is supposed to be true
> +                ...
> +
> +          we build new bounds and change the exit condtions such that
> +          it's effectively this:
> +
> +            newend = min (end+beg2-beg, c)
> +            for (i = beg; j = beg2; j < newend; i++, j++)
> +              if (j < c)
> +                ...
> +
> +          Depending on the direction of the IVs and if the exit tests
> +          are strict or include equality we need to use MIN or MAX,
> +          and add or subtract 1.  */
> +
> +       gimple_seq stmts = NULL;
> +       /* The niter structure contains the after-increment IV, we need
> +          the loop-enter base, so subtract STEP once.  */
> +       tree controlbase = force_gimple_operand (niter->control.base,
> +                                                &stmts, true, NULL_TREE);
> +       tree controlstep = niter->control.step;
> +       tree enddiff;
> +       if (POINTER_TYPE_P (TREE_TYPE (controlbase)))
> +         {
> +           controlstep = gimple_build (&stmts, NEGATE_EXPR,
> +                                       TREE_TYPE (controlstep), controlstep);
> +           enddiff = gimple_build (&stmts, POINTER_PLUS_EXPR,
> +                                   TREE_TYPE (controlbase),
> +                                   controlbase, controlstep);
> +         }
> +       else
> +         enddiff = gimple_build (&stmts, MINUS_EXPR,
> +                                 TREE_TYPE (controlbase),
> +                                 controlbase, controlstep);
> +
> +       /* Compute beg-beg2.  */
> +       if (POINTER_TYPE_P (TREE_TYPE (enddiff)))
> +         {
> +           tree tem = gimple_convert (&stmts, sizetype, guard_init);
> +           tem = gimple_build (&stmts, NEGATE_EXPR, sizetype, tem);
> +           enddiff = gimple_build (&stmts, POINTER_PLUS_EXPR,
> +                                   TREE_TYPE (enddiff),
> +                                   enddiff, tem);
> +         }
> +       else
> +         enddiff = gimple_build (&stmts, MINUS_EXPR, TREE_TYPE (enddiff),
> +                                 enddiff, guard_init);
> +
> +       /* Compute end-(beg-beg2).  */
> +       gimple_seq stmts2;
> +       tree newbound = force_gimple_operand (niter->bound, &stmts2,
> +                                             true, NULL_TREE);
> +       gimple_seq_add_seq_without_update (&stmts, stmts2);
> +
> +       if (POINTER_TYPE_P (TREE_TYPE (enddiff))
> +           || POINTER_TYPE_P (TREE_TYPE (newbound)))
> +         {
> +           enddiff = gimple_convert (&stmts, sizetype, enddiff);
> +           enddiff = gimple_build (&stmts, NEGATE_EXPR, sizetype, enddiff);
> +           newbound = gimple_build (&stmts, POINTER_PLUS_EXPR,
> +                                    TREE_TYPE (newbound),
> +                                    newbound, enddiff);
> +         }
> +       else
> +         newbound = gimple_build (&stmts, MINUS_EXPR, TREE_TYPE (enddiff),
> +                                  newbound, enddiff);
> +
> +       /* Depending on the direction of the IVs the new bound for the first
> +          loop is the minimum or maximum of old bound and border.
> +          Also, if the guard condition isn't strictly less or greater,
> +          we need to adjust the bound.  */
> +       int addbound = 0;
> +       enum tree_code minmax;
> +       if (niter->cmp == LT_EXPR)
> +         {
> +           /* GT and LE are the same, inverted.  */
> +           if (guard_code == GT_EXPR || guard_code == LE_EXPR)
> +             addbound = -1;
> +           minmax = MIN_EXPR;
> +         }
> +       else
> +         {
> +           gcc_assert (niter->cmp == GT_EXPR);
> +           if (guard_code == GE_EXPR || guard_code == LT_EXPR)
> +             addbound = 1;
> +           minmax = MAX_EXPR;
> +         }
> +
> +       if (addbound)
> +         {
> +           tree type2 = TREE_TYPE (newbound);
> +           if (POINTER_TYPE_P (type2))
> +             type2 = sizetype;
> +           newbound = gimple_build (&stmts,
> +                                    POINTER_TYPE_P (TREE_TYPE (newbound))
> +                                    ? POINTER_PLUS_EXPR : PLUS_EXPR,
> +                                    TREE_TYPE (newbound),
> +                                    newbound,
> +                                    build_int_cst (type2, addbound));
> +         }
> +
> +       tree newend = gimple_build (&stmts, minmax, TREE_TYPE (border),
> +                                   border, newbound);
> +       if (stmts)
> +         gsi_insert_seq_on_edge_immediate (loop_preheader_edge (floop),
> +                                           stmts);
> +
> +       /* Now patch the exit block of the first loop to compare
> +          the post-increment value of the guarding IV with the new end
> +          value.  */
> +       tree new_guard_next = PHI_ARG_DEF_FROM_EDGE (new_guard_phi,
> +                                                    loop_latch_edge (floop));
> +       patch_loop_exit (floop, guard_stmt, new_guard_next, newend,
> +                        initial_true);
> +
> +       /* Finally patch out the two copies of the condition to be always
> +          true/false (or opposite).  */
> +       gcond *force_true = as_a<gcond *> (last_stmt (get_bb_copy (bbs[i])));
> +       gcond *force_false = as_a<gcond *> (last_stmt (bbs[i]));
> +       if (!initial_true)
> +         std::swap (force_true, force_false);
> +       gimple_cond_make_true (force_true);
> +       gimple_cond_make_false (force_false);
> +       update_stmt (force_true);
> +       update_stmt (force_false);
> +
> +       free_original_copy_tables ();
> +
> +       /* We destroyed LCSSA form above.  Eventually we might be able
> +          to fix it on the fly, for now simply punt and use the helper.  */
> +       rewrite_into_loop_closed_ssa_1 (NULL, 0, SSA_OP_USE, floop);
> +
> +       changed = true;
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +         fprintf (dump_file, ";; Loop split.\n");
> +
> +       /* Only deal with the first opportunity.  */
> +       break;
> +      }
> +
> +  free (bbs);
> +  return changed;
> +}
> +
> +/* Main entry point.  Perform loop splitting on all suitable loops.  */
> +
> +static unsigned int
> +tree_ssa_split_loops (void)
> +{
> +  struct loop *loop;
> +  bool changed = false;
> +
> +  gcc_assert (scev_initialized_p ());
> +  /* Go through all loops starting from innermost.  */
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +    {
> +      struct tree_niter_desc niter;
> +      if (single_exit (loop)
> +         /* ??? We could handle non-empty latches when we split
> +            the latch edge (not the exit edge), and put the new
> +            exit condition in the new block.  OTOH this executes some
> +            code unconditionally that might have been skipped by the
> +            original exit before.  */
> +         && empty_block_p (loop->latch)
> +         && !optimize_loop_for_size_p (loop)
> +         && number_of_iterations_exit (loop, single_exit (loop), &niter,
> +                                       false, true)
> +         /* We can't yet handle loops controlled by a != predicate.  */
> +         && niter.cmp != NE_EXPR)
> +       changed |= split_loop (loop, &niter);
> +    }
> +
> +  if (changed)
> +    return TODO_cleanup_cfg;
> +  return 0;
> +}
> +
> +/* Loop splitting pass.  */
> +
> +namespace {
> +
> +const pass_data pass_data_loop_split =
> +{
> +  GIMPLE_PASS, /* type */
> +  "lsplit", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_LOOP_SPLIT, /* tv_id */
> +  PROP_cfg, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_loop_split : public gimple_opt_pass
> +{
> +public:
> +  pass_loop_split (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_loop_split, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *) { return optimize >= 2; }
> +  virtual unsigned int execute (function *);
> +
> +}; // class pass_loop_split
> +
> +unsigned int
> +pass_loop_split::execute (function *fun)
> +{
> +  if (number_of_loops (fun) <= 1)
> +    return 0;
> +
> +  return tree_ssa_split_loops ();
> +}
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_loop_split (gcc::context *ctxt)
> +{
> +  return new pass_loop_split (ctxt);
> +}
> Index: testsuite/gcc.dg/loop-split.c
> ===================================================================
> --- testsuite/gcc.dg/loop-split.c       (revision 0)
> +++ testsuite/gcc.dg/loop-split.c       (working copy)
> @@ -0,0 +1,141 @@
> +/* { dg-do run } */
> +/* { dg-options "-O2 -fdump-tree-lsplit-details" } */
> +
> +#ifdef __cplusplus
> +extern "C" int printf (const char *, ...);
> +extern "C" void abort (void);
> +#else
> +extern int printf (const char *, ...);
> +extern void abort (void);
> +#endif
> +
> +#ifndef TRACE
> +#define TRACE 0
> +#endif
> +
> +#define loop(beg,step,beg2,cond1,cond2) \
> +    do \
> +      { \
> +       sum = 0; \
> +        for (i = (beg), j = (beg2); (cond1); i+=(step),j+=(step)) \
> +          { \
> +            if (cond2) { \
> +             if (TRACE > 1) printf ("a: %d %d\n", i, j); \
> +              sum += a[i]; \
> +           } else { \
> +             if (TRACE > 1) printf ("b: %d %d\n", i, j); \
> +              sum += b[i]; \
> +           } \
> +          } \
> +       if (TRACE > 0) printf ("sum: %d\n", sum); \
> +       check = check * 47 + sum; \
> +      } while (0)
> +
> +#if 1
> +unsigned __attribute__((noinline, noclone)) dotest (int beg, int end, int step,
> +                                              int c, int *a, int *b, int beg2)
> +{
> +  unsigned check = 0;
> +  int sum;
> +  int i, j;
> +  loop (beg, 1, beg2, i < end, j < c);
> +  loop (beg, 1, beg2, i <= end, j < c);
> +  loop (beg, 1, beg2, i < end, j <= c);
> +  loop (beg, 1, beg2, i <= end, j <= c);
> +  loop (beg, 1, beg2, i < end, j > c);
> +  loop (beg, 1, beg2, i <= end, j > c);
> +  loop (beg, 1, beg2, i < end, j >= c);
> +  loop (beg, 1, beg2, i <= end, j >= c);
> +  beg2 += end-beg;
> +  loop (end, -1, beg2, i >= beg, j >= c);
> +  loop (end, -1, beg2, i >= beg, j > c);
> +  loop (end, -1, beg2, i > beg, j >= c);
> +  loop (end, -1, beg2, i > beg, j > c);
> +  loop (end, -1, beg2, i >= beg, j <= c);
> +  loop (end, -1, beg2, i >= beg, j < c);
> +  loop (end, -1, beg2, i > beg, j <= c);
> +  loop (end, -1, beg2, i > beg, j < c);
> +  return check;
> +}
> +
> +#else
> +
> +int __attribute__((noinline, noclone)) f (int beg, int end, int step,
> +                                         int c, int *a, int *b, int beg2)
> +{
> +  int sum = 0;
> +  int i, j;
> +  //for (i = beg, j = beg2; i < end; i += 1, j++ /*step*/)
> +  for (i = end, j = beg2 + (end-beg); i > beg; i += -1, j-- /*step*/)
> +    {
> +      // i - j == X --> i = X + j
> +      // --> i < end == X+j < end == j < end - X
> +      // --> newend = end - (i_init - j_init)
> +      // j < end-X && j < c --> j < min(end-X,c)
> +      // j < end-X && j <= c --> j <= min(end-X-1,c) or j < min(end-X,c+1{OF!})
> +      //if (j < c)
> +      if (j >= c)
> +       printf ("a: %d %d\n", i, j);
> +      /*else
> +       printf ("b: %d %d\n", i, j);*/
> +       /*sum += a[i];
> +      else
> +       sum += b[i];*/
> +    }
> +  return sum;
> +}
> +
> +int __attribute__((noinline, noclone)) f2 (int *beg, int *end, int step,
> +                                         int *c, int *a, int *b, int *beg2)
> +{
> +  int sum = 0;
> +  int *i, *j;
> +  for (i = beg, j = beg2; i < end; i += 1, j++ /*step*/)
> +    {
> +      if (j <= c)
> +       printf ("%d %d\n", i - beg, j - beg);
> +       /*sum += a[i];
> +      else
> +       sum += b[i];*/
> +    }
> +  return sum;
> +}
> +#endif
> +
> +extern int printf (const char *, ...);
> +
> +int main ()
> +{
> +  int a[] = {0,0,0,0,0, 1,2,3,4,5,6,7,8,9,          0,0,0,0,0};
> +  int b[] = {0,0,0,0,0, -1,-2,-3,-4,-5,-6,-7,-8,-9, 0,0,0,0,0,};
> +  int c;
> +  int diff = 0;
> +  unsigned check = 0;
> +  //dotest (0, 9, 1, -1, a+5, b+5, -1);
> +  //return 0;
> +  //f (0, 9, 1, -1, a+5, b+5, -1);
> +  //return 0;
> +  for (diff = -5; diff <= 5; diff++)
> +    {
> +      for (c = -1; c <= 10; c++)
> +       {
> +#if 0
> +         int s = f (0, 9, 1, c, a+5, b+5, diff);
> +         //int s = f2 (a+0, a+9, 1, a+c, a+5, b+5, a+diff);
> +         printf ("%d ", s);
> +#else
> +         if (TRACE > 0)
> +           printf ("check %d %d\n", c, diff);
> +         check = check * 51 + dotest (0, 9, 1, c, a+5, b+5, diff);
> +#endif
> +       }
> +      //printf ("\n");
> +    }
> +  //printf ("%u\n", check);
> +  if (check != 3213344948)
> +    abort ();
> +  return 0;
> +}
> +
> +/* All 16 loops in dotest should be split.  */
> +/* { dg-final { scan-tree-dump-times "Loop split" 16 "lsplit" } } */

Michael Matz July 25, 2016, 2:27 p.m. UTC | #7

Hi,

On Sun, 24 Jul 2016, Andrew Pinski wrote:

> What ever happened to this patch?

It got accepted but I deferred inclusion in GCC 6 because it 
was late in the cycle then and performance results didn't show super 
improvements (only looked at cpu2006).  No regressions, but no nice 
speedups either.

> I was looking into doing this myself today but I found this patch. It is 
> stage 1 of GCC 7, it might be a good idea to get this patch into GCC.

Indeed.  If you want to performance test it on something you know where it 
should help, I'm all ears.

Ciao,
Michael.

Gimple loop splitting

Commit Message

Comments

Patch