
[GRAPHITE] More TLC

Message ID: alpine.LSU.2.20.1709221453010.26836@zhemvz.fhfr.qr
State: New

Commit Message

Richard Biener Sept. 22, 2017, 1:03 p.m. UTC
This simplifies canonicalize_loop_closed_ssa and does other minimal
TLC.  It also adds a testcase I reduced from a stupid mistake I made
when reworking canonicalize_loop_closed_ssa.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.

SPEC CPU 2006 is happy with it, current statistics on x86_64 with
-Ofast -march=haswell -floop-nest-optimize are

 61 loop nests "optimized"
 45 loop nest transforms cancelled because of code generation issues
 21 loop nest optimizations timed out at the 350000 ISL "operations" we allow

I say "optimized" because the usual transform I've seen is static tiling
as enforced by GRAPHITE according to --param loop-block-tile-size.
There's no way to automagically figure out what kind of transform ISL did
(usually none, with the identical-schedule check being confused by FILTER
node positioning).  This is also the issue with most GRAPHITE testcases.
We can't easily verify whether we performed loop interchange
or not.  We can probably tell whether we applied loop fusion or
splitting (by counting loops).

I'm not aware of any remaining ICEs / wrong-code issues with GRAPHITE.

I'm aware that the current "black-box" granularity hinders
scheduling freedom (each GIMPLE BB is mapped to an ISL stmt, which
is too coarse to schedule, say, two writes in a BB independently
of each other).  Quick experiments could be done by simply
splitting gimple BBs at some points.

I'm aware that the SCOP detection algorithm assumes that it can
walk loop->next and find loops "in order" -- but while that's
true for the initial flow_loops_find result (DFS walk), it isn't
true for any later created / discovered loops.  Sorting
loop siblings into DFS order should be easy (and would make a good
general cfgloopanal helper).

Richard.

2017-09-22  Richard Biener  <rguenther@suse.de>

	* graphite-isl-ast-to-gimple.c (graphite_verify): Inline into
	single caller.
	(graphite_regenerate_ast_isl): Do not reset SCEV.  Move debug
	print of no dependency loops ...
	* graphite.c (graphite_transform_loops): ... here.
	(canonicalize_loop_closed_ssa_form): Work from inner to outer
	loops.
	(same_close_phi_node, remove_duplicate_close_phi,
	make_close_phi_nodes_unique, defined_in_loop_p): Fold into ...
	(canonicalize_loop_closed_ssa): ... here and simplify.
	* graphite-optimize-isl.c: Include tree-vectorizer.h.
	(optimize_isl): Use dump_printf_loc to tell when we stopped
	optimizing because of an ISL timeout.

	* gcc.dg/graphite/scop-24.c: New testcase.

Comments

Sebastian Pop Sept. 22, 2017, 2:29 p.m. UTC | #1
On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> wrote:

>
> This simplifies canonicalize_loop_closed_ssa and does other minimal
> TLC.  It also adds a testcase I reduced from a stupid mistake I made
> when reworking canonicalize_loop_closed_ssa.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>
> SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> -Ofast -march=haswell -floop-nest-optimize are
>
>  61 loop nests "optimized"
>  45 loop nest transforms cancelled because of code generation issues
>  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
>
> I say "optimized" because the usual transform I've seen is static tiling
> as enforced by GRAPHITE according to --param loop-block-tile-size.
> There's no way to automagically figure what kind of transform ISL did
>

Here is how to automate (without magic) the detection
of the transform that isl did.

The problem solved by isl is the minimization of strides
in memory.  To do this, we need to give the isl scheduler
the validity dependence graph; in graphite-optimize-isl.c,
see the validity (RAW, WAR, WAW) and the proximity
(RAR + validity) maps.  The proximity map does include the
read-after-read dependences, as the isl scheduler needs to
minimize strides between consecutive reads.

When you apply the schedule to the dependence graph,
you can tell from the result the strides in memory.  A good
way to say whether a transform was beneficial is to sum up
all memory strides and make sure that the sum of all strides
decreases after the transform.  We could add a printf with the
sum of strides before and after transforms, and have the
testcases check for that.
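
As a minimal, untested sketch of how such constraints could be fed to the
isl scheduler (roughly the shape used in graphite-optimize-isl.c): the
raw/war/waw/rar union maps and the iteration domain are assumed to come
from dependence analysis, and the separate rar map is hypothetical here,
since the current code simply copies validity into proximity.

  isl_union_map *validity
    = isl_union_map_union (isl_union_map_copy (raw),
			   isl_union_map_union (isl_union_map_copy (war),
						isl_union_map_copy (waw)));
  /* Proximity = validity + read-after-read, so the scheduler also
     minimizes strides between consecutive reads.  */
  isl_union_map *proximity
    = isl_union_map_union (isl_union_map_copy (validity),
			   isl_union_map_copy (rar));
  isl_schedule_constraints *sc
    = isl_schedule_constraints_on_domain (isl_union_set_copy (domain));
  sc = isl_schedule_constraints_set_validity (sc, validity);
  sc = isl_schedule_constraints_set_proximity (sc, proximity);
  isl_schedule *schedule = isl_schedule_constraints_compute_schedule (sc);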

> (usually none with the schedule identical check confused by FILTER
> stuff positioning).  This is also the issue with most GRAPHITE testcases.
> We can't really verify easily whether we performed loop interchange
> or not.  We can probably tell whether we applied loop fusion or
> splitting (by counting loops).
>
> I'm not aware of any remaining ICEs / wrong-code issues with GRAPHITE.
>
> I'm aware that the current "black-box" granularity hinders
> scheduling freedom (each GIMPLE BB is mapped to a ISL stmt, this
> is too coarse to schedule say two writes in a BB independently
> from each other).  Quick experiments could be done by simply
> splitting gimple BBs at some points.
>
> I'm aware that the SCOP detection algorithm assumes that it can
> walk loop->next and find loops "in order" -- but while that's
> true for the initial flow_loops_find result (DFS walk) it isn't
> true for any later created / discovered loops.  Sorting of
> loop siblings in DFS order should be easy (and a general cfgloopanal
> helper).
>
> Richard.
>
> 2017-09-22  Richard Biener  <rguenther@suse.de>
>
>         * graphite-isl-ast-to-gimple.c (graphite_verify): Inline into
>         single caller.
>         (graphite_regenerate_ast_isl): Do not reset SCEV.  Move debug
>         print of no dependency loops ...
>         * graphite.c (graphite_transform_loops): ... here.
>         (canonicalize_loop_closed_ssa_form): Work from inner to outer
>         loops.
>         (same_close_phi_node, remove_duplicate_close_phi,
>         make_close_phi_nodes_unique, defined_in_loop_p): Fold into ...
>         (canonicalize_loop_closed_ssa): ... here and simplify.
>         * graphite-optimize-isl.c: Include tree-vectorizer.h.
>         (optimize_isl): Use dump_printf_loc to tell when we stopped
>         optimizing because of an ISL timeout.
>
>         * gcc.dg/graphite/scop-24.c: New testcase.
>
>
The change looks good to me.

Thanks,
Sebastian
Richard Biener Sept. 25, 2017, 7:39 a.m. UTC | #2
On Fri, 22 Sep 2017, Richard Biener wrote:

> 
> This simplifies canonicalize_loop_closed_ssa and does other minimal
> TLC.  It also adds a testcase I reduced from a stupid mistake I made
> when reworking canonicalize_loop_closed_ssa.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> 
> SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> -Ofast -march=haswell -floop-nest-optimize are
> 
>  61 loop nests "optimized"
>  45 loop nest transforms cancelled because of code generation issues
>  21 loop nest optimizations timed out the 350000 ISL "operations" we allow

Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
and 709 sec. with (this was with release checking).

A single-run has 416.gamess (580s -> 618s),
436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 
379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
do a 3-run for those to confirm (it would be only a single regression
for 416.gamess).

So far I'm positively surprised, given the limitations (and inefficiencies)
I know about.

I'll add some more opt-info stuff to assess the number of SCOPs we
detect but discard during further analysis and the number of transforms
we cancel because they turn out to be a no-op.

Richard.

>      {
> -      if (dump_file && dump_flags)
> -	fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n",
> -		 max_operations);
> +      location_t loc = find_loop_location
> +	(scop->scop_info->region.entry->dest->loop_father);
> +      dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc,
> +		       "loop nest not optimized, optimization timed out "
> +		       "after %d operations [--param max-isl-operations]\n",
> +		       max_operations);
>        return false;
>      }
>  
> Index: gcc/graphite.c
> ===================================================================
> --- gcc/graphite.c	(revision 253091)
> +++ gcc/graphite.c	(working copy)
> @@ -293,86 +293,6 @@ free_scops (vec<scop_p> scops)
>    scops.release ();
>  }
>  
> -/* Returns true when P1 and P2 are close phis with the same
> -   argument.  */
> -
> -static inline bool
> -same_close_phi_node (gphi *p1, gphi *p2)
> -{
> -  return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)),
> -			      TREE_TYPE (gimple_phi_result (p2)))
> -	  && operand_equal_p (gimple_phi_arg_def (p1, 0),
> -			      gimple_phi_arg_def (p2, 0), 0));
> -}
> -
> -static void make_close_phi_nodes_unique (basic_block bb);
> -
> -/* Remove the close phi node at GSI and replace its rhs with the rhs
> -   of PHI.  */
> -
> -static void
> -remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi)
> -{
> -  gimple *use_stmt;
> -  use_operand_p use_p;
> -  imm_use_iterator imm_iter;
> -  tree res = gimple_phi_result (phi);
> -  tree def = gimple_phi_result (gsi->phi ());
> -
> -  gcc_assert (same_close_phi_node (phi, gsi->phi ()));
> -
> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
> -    {
> -      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
> -	SET_USE (use_p, res);
> -
> -      update_stmt (use_stmt);
> -
> -      /* It is possible that we just created a duplicate close-phi
> -	 for an already-processed containing loop.  Check for this
> -	 case and clean it up.  */
> -      if (gimple_code (use_stmt) == GIMPLE_PHI
> -	  && gimple_phi_num_args (use_stmt) == 1)
> -	make_close_phi_nodes_unique (gimple_bb (use_stmt));
> -    }
> -
> -  remove_phi_node (gsi, true);
> -}
> -
> -/* Removes all the close phi duplicates from BB.  */
> -
> -static void
> -make_close_phi_nodes_unique (basic_block bb)
> -{
> -  gphi_iterator psi;
> -
> -  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> -    {
> -      gphi_iterator gsi = psi;
> -      gphi *phi = psi.phi ();
> -
> -      /* At this point, PHI should be a close phi in normal form.  */
> -      gcc_assert (gimple_phi_num_args (phi) == 1);
> -
> -      /* Iterate over the next phis and remove duplicates.  */
> -      gsi_next (&gsi);
> -      while (!gsi_end_p (gsi))
> -	if (same_close_phi_node (phi, gsi.phi ()))
> -	  remove_duplicate_close_phi (phi, &gsi);
> -	else
> -	  gsi_next (&gsi);
> -    }
> -}
> -
> -/* Return true when NAME is defined in LOOP.  */
> -
> -static bool
> -defined_in_loop_p (tree name, loop_p loop)
> -{
> -  gcc_assert (TREE_CODE (name) == SSA_NAME);
> -  return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name));
> -}
> -
>  /* Transforms LOOP to the canonical loop closed SSA form.  */
>  
>  static void
> @@ -380,20 +300,22 @@ canonicalize_loop_closed_ssa (loop_p loo
>  {
>    edge e = single_exit (loop);
>    basic_block bb;
> +  gphi_iterator psi;
>  
>    if (!e || (e->flags & EDGE_COMPLEX))
>      return;
>  
>    bb = e->dest;
>  
> +  /* Make the loop-close PHI node BB contain only PHIs and have a
> +     single predecessor.  */
>    if (single_pred_p (bb))
>      {
>        e = split_block_after_labels (bb);
> -      make_close_phi_nodes_unique (e->src);
> +      bb = e->src;
>      }
>    else
>      {
> -      gphi_iterator psi;
>        basic_block close = split_edge (e);
>        e = single_succ_edge (close);
>        for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> @@ -404,7 +326,7 @@ canonicalize_loop_closed_ssa (loop_p loo
>  
>  	  /* Only add close phi nodes for SSA_NAMEs defined in LOOP.  */
>  	  if (TREE_CODE (arg) != SSA_NAME
> -	      || !defined_in_loop_p (arg, loop))
> +	      || loop_containing_stmt (SSA_NAME_DEF_STMT (arg)) != loop)
>  	    continue;
>  
>  	  tree res = copy_ssa_name (arg);
> @@ -413,8 +335,30 @@ canonicalize_loop_closed_ssa (loop_p loo
>  		       UNKNOWN_LOCATION);
>  	  SET_USE (use_p, res);
>  	}
> +      bb = close;
> +    }
>  
> -      make_close_phi_nodes_unique (close);
> +  /* Eliminate duplicates.  This relies on processing loops from
> +     innermost to outer.  */
> +  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> +    {
> +      gphi_iterator gsi = psi;
> +      gphi *phi = psi.phi ();
> +
> +      /* At this point, PHI should be a close phi in normal form.  */
> +      gcc_assert (gimple_phi_num_args (phi) == 1);
> +
> +      /* Iterate over the next phis and remove duplicates.  */
> +      gsi_next (&gsi);
> +      while (!gsi_end_p (gsi))
> +	if (gimple_phi_arg_def (phi, 0) == gimple_phi_arg_def (gsi.phi (), 0))
> +	  {
> +	    replace_uses_by (gimple_phi_result (gsi.phi ()),
> +			     gimple_phi_result (phi));
> +	    remove_phi_node (&gsi, true);
> +	  }
> +	else
> +	  gsi_next (&gsi);
>      }
>  }
>  
> @@ -443,7 +387,7 @@ static void
>  canonicalize_loop_closed_ssa_form (void)
>  {
>    loop_p loop;
> -  FOR_EACH_LOOP (loop, 0)
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
>      canonicalize_loop_closed_ssa (loop);
>  
>    checking_verify_loop_closed_ssa (true);
> @@ -509,6 +453,19 @@ graphite_transform_loops (void)
>  			   "loop nest optimized\n");
>        }
>  
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      loop_p loop;
> +      int num_no_dependency = 0;
> +
> +      FOR_EACH_LOOP (loop, 0)
> +	if (loop->can_be_parallel)
> +	  num_no_dependency++;
> +
> +      fprintf (dump_file, "%d loops carried no dependency.\n",
> +	       num_no_dependency);
> +    }
> +
>    free_scops (scops);
>    graphite_finalize (need_cfg_cleanup_p);
>    the_isl_ctx = NULL;
> Index: gcc/testsuite/gcc.dg/graphite/scop-24.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/graphite/scop-24.c	(nonexistent)
> +++ gcc/testsuite/gcc.dg/graphite/scop-24.c	(working copy)
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -floop-nest-optimize" } */
> +
> +typedef struct _IO_FILE FILE;
> +extern struct _IO_FILE *stderr;
> +typedef float real;
> +typedef real rvec[3];
> +int rgbset (int);
> +void ps_box (int, int);
> +void plot_phi(char *fn,rvec box,int natoms,rvec x[],real phi[])
> +{
> +  real phi_max,rr,gg,bb,fac,dx,x0,y0;
> +  int i;
> +  for(i=0; (i<natoms); i++) 
> +    phi_max=((phi_max > __builtin_fabs(phi[i]))
> +	     ? phi_max : __builtin_fabs(phi[i]));
> +  if (__builtin_fabs(phi_max)<1.2e-38)
> +      __builtin_fprintf(stderr, "X");
> +  ps_box((real)(fac*box[0]-1),(real)(fac*box[1]-1));
> +  for(i=0; (i<natoms); i++)
> +    {
> +      rr=gg=bb=1.0;
> +      if (phi[i] < 0)
> +	gg=bb=(1.0+(phi[i]/phi_max));
> +      else
> +	rr=gg=(1.0-(phi[i]/phi_max));
> +      rr=rgbset(rr);
> +    }
> +}
>
Richard Biener Sept. 25, 2017, 12:46 p.m. UTC | #3
On Mon, 25 Sep 2017, Richard Biener wrote:

> On Fri, 22 Sep 2017, Richard Biener wrote:
> 
> > 
> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > when reworking canonicalize_loop_closed_ssa.
> > 
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > 
> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > -Ofast -march=haswell -floop-nest-optimize are
> > 
> >  61 loop nests "optimized"
> >  45 loop nest transforms cancelled because of code generation issues
> >  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
> 
> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
> and 709 sec. with (this was with release checking).
> 
> A single-run has 416.gamess (580s -> 618s),
> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 
> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
> do a 3-run for those to confirm (it would be only a single regression
> for 416.gamess).

416.gamess regression confirmed, 450.soplex improvement as well.
In the three-run, 462.libquantum regresses (344s -> 351s), so I suppose
that's noise.

Richard.
Bin.Cheng Sept. 25, 2017, 12:55 p.m. UTC | #4
On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener <rguenther@suse.de> wrote:
> On Mon, 25 Sep 2017, Richard Biener wrote:
>
>> On Fri, 22 Sep 2017, Richard Biener wrote:
>>
>> >
>> > This simplifies canonicalize_loop_closed_ssa and does other minimal
>> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
>> > when reworking canonicalize_loop_closed_ssa.
>> >
>> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>> >
>> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
>> > -Ofast -march=haswell -floop-nest-optimize are
>> >
>> >  61 loop nests "optimized"
>> >  45 loop nest transforms cancelled because of code generation issues
>> >  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
>>
>> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
>> and 709 sec. with (this was with release checking).
>>
>> A single-run has 416.gamess (580s -> 618s),
>> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
>> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s ->
>> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
>> do a 3-run for those to confirm (it would be only a single regression
>> for 416.gamess).
>
> 416.gamess regression confirmed, 450.soplex improvement as well,
436/437 improvements?  450.soplex (229s -> 226s) looks like noise.

Thanks,
bin
> in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> that's noise.
>
> Richard.
Richard Biener Sept. 25, 2017, 1:10 p.m. UTC | #5
On Mon, 25 Sep 2017, Bin.Cheng wrote:

> On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener <rguenther@suse.de> wrote:
> > On Mon, 25 Sep 2017, Richard Biener wrote:
> >
> >> On Fri, 22 Sep 2017, Richard Biener wrote:
> >>
> >> >
> >> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> >> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> >> > when reworking canonicalize_loop_closed_ssa.
> >> >
> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> >> >
> >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> >> > -Ofast -march=haswell -floop-nest-optimize are
> >> >
> >> >  61 loop nests "optimized"
> >> >  45 loop nest transforms cancelled because of code generation issues
> >> >  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
> >>
> >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize
> >> and 709 sec. with (this was with release checking).
> >>
> >> A single-run has 416.gamess (580s -> 618s),
> >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s),
> >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s ->
> >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes.  Will
> >> do a 3-run for those to confirm (it would be only a single regression
> >> for 416.gamess).
> >
> > 416.gamess regression confirmed, 450.soplex improvement as well,
> 436/437 improvements?  450.soplex (229s -> 226s) loops like noise.

base is with -floop-nest-optimize, peak without.

416.gamess      19580        619       31.7 S   19580        576       34.0 *
416.gamess      19580        614       31.9 S   19580        577       33.9 S
416.gamess      19580        618       31.7 *   19580        576       34.0 S
436.cactusADM   11950        194       61.5 S   11950        204       58.5 S
436.cactusADM   11950        184       65.0 S   11950        187       63.8 *
436.cactusADM   11950        186       64.1 *   11950        186       64.1 S
437.leslie3d     9400        219       43.0 S    9400        218       43.1 S
437.leslie3d     9400        219       43.0 *    9400        223       42.1 S
437.leslie3d     9400        218       43.0 S    9400        223       42.2 *
450.soplex       8340        225       37.0 S    8340        231       36.1 S
450.soplex       8340        226       36.9 *    8340        230       36.3 *
450.soplex       8340        227       36.8 S    8340        229       36.4 S
465.tonto        9840        426       23.1 S    9840        427       23.0 *
465.tonto        9840        424       23.2 S    9840        430       22.9 S
465.tonto        9840        425       23.2 *    9840        425       23.2 S
401.bzip2        9650        379       25.5 S    9650        378       25.5 S
401.bzip2        9650        379       25.5 *    9650        380       25.4 *
401.bzip2        9650        379       25.5 S    9650        380       25.4 S
462.libquantum  20720        351       59.0 *   20720        349       59.4 S
462.libquantum  20720        351       59.0 S   20720        345       60.1 *
462.libquantum  20720        352       58.8 S   20720        344       60.2 S



> Thanks,
> bin
> > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> > that's noise.
> >
> > Richard.
> 
>
Richard Biener Sept. 25, 2017, 1:12 p.m. UTC | #6
On Fri, 22 Sep 2017, Sebastian Pop wrote:

> On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> wrote:
> 
> >
> > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > when reworking canonicalize_loop_closed_ssa.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> >
> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > -Ofast -march=haswell -floop-nest-optimize are
> >
> >  61 loop nests "optimized"
> >  45 loop nest transforms cancelled because of code generation issues
> >  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
> >
> > I say "optimized" because the usual transform I've seen is static tiling
> > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > There's no way to automagically figure what kind of transform ISL did
> >
> 
> Here is how to automate (without magic) the detection
> of the transform that isl did.
> 
> The problem solved by isl is the minimization of strides
> in memory, and to do this, we need to tell the isl scheduler
> the validity dependence graph, in graphite-optimize-isl.c
> see the validity (RAW, WAR, WAW) and the proximity
> (RAR + validity) maps.  The proximity does include the
> read after read, as the isl scheduler needs to minimize
> strides between consecutive reads.
> 
> When you apply the schedule to the dependence graph,
> one can tell from the result the strides in memory, a good
> way to say whether a transform was beneficial is to sum up
> all memory strides, and make sure that the sum of all strides
> decreases after transform.  We could add a printf with the
> sum of strides before and after transforms, and have the
> testcases check for that.

Interesting.  Can you perhaps show me in code how to do that?

Thanks,
Richard.
Sebastian Pop Sept. 26, 2017, 2:19 p.m. UTC | #7
On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote:

> On Fri, 22 Sep 2017, Sebastian Pop wrote:
>
> > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de>
> wrote:
> >
> > >
> > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > when reworking canonicalize_loop_closed_ssa.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > >
> > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > -Ofast -march=haswell -floop-nest-optimize are
> > >
> > >  61 loop nests "optimized"
> > >  45 loop nest transforms cancelled because of code generation issues
> > >  21 loop nest optimizations timed out the 350000 ISL "operations" we
> allow
> > >
> > > I say "optimized" because the usual transform I've seen is static
> tiling
> > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > There's no way to automagically figure what kind of transform ISL did
> > >
> >
> > Here is how to automate (without magic) the detection
> > of the transform that isl did.
> >
> > The problem solved by isl is the minimization of strides
> > in memory, and to do this, we need to tell the isl scheduler
> > the validity dependence graph, in graphite-optimize-isl.c
> > see the validity (RAW, WAR, WAW) and the proximity
> > (RAR + validity) maps.  The proximity does include the
> > read after read, as the isl scheduler needs to minimize
> > strides between consecutive reads.
> >
> > When you apply the schedule to the dependence graph,
> > one can tell from the result the strides in memory, a good
> > way to say whether a transform was beneficial is to sum up
> > all memory strides, and make sure that the sum of all strides
> > decreases after transform.  We could add a printf with the
> > sum of strides before and after transforms, and have the
> > testcases check for that.
>
> Interesting.  Can you perhaps show me in code how to do that?
>
>
Sven, is there already a function that computes the sum of all
strides in a proximity map?  Maybe you have code that does
something similar in pet or ppcg?

Thanks,
Sebastian
Sven Verdoolaege Sept. 26, 2017, 3:15 p.m. UTC | #8
On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote:
> Sven, is there already a function that computes the sum of all
> strides in a proximity map?  Maybe you have code that does
> something similar in pet or ppcg?

What exactly do you want to sum?
If this involves any counting, then it cannot currently
be done in pet or ppcg since isl does not support counting yet
and the public version of barvinok is GPL licensed.

Also, it's better to ask such questions on the isl mailing list
isl-development@googlegroups.com

skimo
Richard Biener Sept. 27, 2017, 12:18 p.m. UTC | #9
On Tue, 26 Sep 2017, Sebastian Pop wrote:

> On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote:
> 
> > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> >
> > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de>
> > wrote:
> > >
> > > >
> > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > when reworking canonicalize_loop_closed_ssa.
> > > >
> > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > > >
> > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > -Ofast -march=haswell -floop-nest-optimize are
> > > >
> > > >  61 loop nests "optimized"
> > > >  45 loop nest transforms cancelled because of code generation issues
> > > >  21 loop nest optimizations timed out the 350000 ISL "operations" we
> > allow
> > > >
> > > > I say "optimized" because the usual transform I've seen is static
> > tiling
> > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > There's no way to automagically figure what kind of transform ISL did
> > > >
> > >
> > > Here is how to automate (without magic) the detection
> > > of the transform that isl did.
> > >
> > > The problem solved by isl is the minimization of strides
> > > in memory, and to do this, we need to tell the isl scheduler
> > > the validity dependence graph, in graphite-optimize-isl.c
> > > see the validity (RAW, WAR, WAW) and the proximity
> > > (RAR + validity) maps.  The proximity does include the
> > > read after read, as the isl scheduler needs to minimize
> > > strides between consecutive reads.

Ah, so I now see why we do not perform interchange on trivial cases like

double A[1024][1024], B[1024][1024];

void foo(void)
{
  for (int i = 0; i < 1024; ++i)
    for (int j = 0; j < 1024; ++j)
      A[j][i] = B[j][i];
}

which is probably because

  /* FIXME: proximity should not be validity.  */
  isl_union_map *proximity = isl_union_map_copy (validity);

falls apart when there is _no_ dependence?

I can trick GRAPHITE into performing the interchange for

double A[1024][1024], B[1024][1024];

void foo(void)
{
  for (int i = 1; i < 1023; ++i)
    for (int j = 0; j < 1024; ++j)
      A[j][i] = B[j][i-1] + A[j][i+1];
}

because now there is a dependence.  Any idea on how to rewrite
scop_get_dependences to avoid "simplifying"?  I suppose the
validity constraints _do_ also specify kind-of a proximity;
we just may not prune / optimize them in the same way as
dependences?

Richard.
Richard Biener Sept. 27, 2017, 1:04 p.m. UTC | #10
On Wed, 27 Sep 2017, Richard Biener wrote:

> On Tue, 26 Sep 2017, Sebastian Pop wrote:
> 
> > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote:
> > 
> > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > >
> > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de>
> > > wrote:
> > > >
> > > > >
> > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > > when reworking canonicalize_loop_closed_ssa.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > > > >
> > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > >
> > > > >  61 loop nests "optimized"
> > > > >  45 loop nest transforms cancelled because of code generation issues
> > > > >  21 loop nest optimizations timed out the 350000 ISL "operations" we
> > > allow
> > > > >
> > > > > I say "optimized" because the usual transform I've seen is static
> > > tiling
> > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > There's no way to automagically figure what kind of transform ISL did
> > > > >
> > > >
> > > > Here is how to automate (without magic) the detection
> > > > of the transform that isl did.
> > > >
> > > > The problem solved by isl is the minimization of strides
> > > > in memory, and to do this, we need to tell the isl scheduler
> > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > (RAR + validity) maps.  The proximity does include the
> > > > read after read, as the isl scheduler needs to minimize
> > > > strides between consecutive reads.
> 
> Ah, so I now see why we do not perform interchange on trivial cases like
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i];
> }
> 
> which is probably because
> 
>   /* FIXME: proximity should not be validity.  */
>   isl_union_map *proximity = isl_union_map_copy (validity);
> 
> falls apart when there is _no_ dependence?
> 
> I can trick GRAPHITE into performing the interchange for
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 1; i < 1023; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i-1] + A[j][i+1];
> }
> 
> because now there is a dependence.  Any idea on how to rewrite
> scop_get_dependences to avoid "simplifying"?  I suppose the
> validity constraints _do_ also specify kind-of a proximity
> we just may not prune / optimize them in the same way as
> dependences?

Another thing I notice is that we don't handle the multi-dimensional
accesses the Fortran frontend produces:

(gdb) p debug_data_reference (dr)
#(Data Ref: 
#  bb: 18 
#  stmt: _43 = *a_141(D)[_42];
#  ref: *a_141(D)[_42];
#  base_object: *a_141(D);
#  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, 
stride.88_115}_5

ultimatively we fail here because we try to build a constraint for

{{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5

which ends up computing isl_pw_aff_mul (A, stride.88_115) with
A being the non-constant constraint generated for
{(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
a parameter.  ISL doesn't like that multiplication as the result
isn't affine (well - it is, we just have parameters in there).
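
Spelling that access function out (my reading of the chrec, with i4 and i5
denoting the induction variables of loops 4 and 5):

  {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
    = (_38 + stride.88_115 + 1) + i4 + stride.88_115 * i5

so stride.88_115 * i5 is the product of a parameter with an induction
variable that the isl_pw_aff construction ends up having to represent.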

I suppose ISL doesn't handle this form of accesses given the
two "dimensions" in this scalarized form may overlap?  So we'd
really need to turn those into references with different access
functions (even if that's not 100% a valid semantic transformation
as scalarization isn't reversible without extra information)?

Thanks,
Richard.
Richard Biener Sept. 27, 2017, 2:33 p.m. UTC | #11
On Wed, 27 Sep 2017, Richard Biener wrote:

> On Wed, 27 Sep 2017, Richard Biener wrote:
> 
> > On Tue, 26 Sep 2017, Sebastian Pop wrote:
> > 
> > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote:
> > > 
> > > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > > >
> > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de>
> > > > wrote:
> > > > >
> > > > > >
> > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I made
> > > > > > when reworking canonicalize_loop_closed_ssa.
> > > > > >
> > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
> > > > > >
> > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > > >
> > > > > >  61 loop nests "optimized"
> > > > > >  45 loop nest transforms cancelled because of code generation issues
> > > > > >  21 loop nest optimizations timed out the 350000 ISL "operations" we
> > > > allow
> > > > > >
> > > > > > I say "optimized" because the usual transform I've seen is static
> > > > tiling
> > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > > There's no way to automagically figure what kind of transform ISL did
> > > > > >
> > > > >
> > > > > Here is how to automate (without magic) the detection
> > > > > of the transform that isl did.
> > > > >
> > > > > The problem solved by isl is the minimization of strides
> > > > > in memory, and to do this, we need to tell the isl scheduler
> > > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > > (RAR + validity) maps.  The proximity does include the
> > > > > read after read, as the isl scheduler needs to minimize
> > > > > strides between consecutive reads.
> > 
> > Ah, so I now see why we do not perform interchange on trivial cases like
> > 
> > double A[1024][1024], B[1024][1024];
> > 
> > void foo(void)
> > {
> >   for (int i = 0; i < 1024; ++i)
> >     for (int j = 0; j < 1024; ++j)
> >       A[j][i] = B[j][i];
> > }
> > 
> > which is probably because
> > 
> >   /* FIXME: proximity should not be validity.  */
> >   isl_union_map *proximity = isl_union_map_copy (validity);
> > 
> > falls apart when there is _no_ dependence?
> > 
> > I can trick GRAPHITE into performing the interchange for
> > 
> > double A[1024][1024], B[1024][1024];
> > 
> > void foo(void)
> > {
> >   for (int i = 1; i < 1023; ++i)
> >     for (int j = 0; j < 1024; ++j)
> >       A[j][i] = B[j][i-1] + A[j][i+1];
> > }
> > 
> > because now there is a dependence.  Any idea on how to rewrite
> > scop_get_dependences to avoid "simplifying"?  I suppose the
> > validity constraints _do_ also specify kind-of a proximity
> > we just may not prune / optimize them in the same way as
> > dependences?
> 
> Another thing I notice is that we don't handle the multi-dimensional
> accesses the fortran frontend produces:
> 
> (gdb) p debug_data_reference (dr)
> #(Data Ref: 
> #  bb: 18 
> #  stmt: _43 = *a_141(D)[_42];
> #  ref: *a_141(D)[_42];
> #  base_object: *a_141(D);
> #  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, 
> stride.88_115}_5
> 
> ultimatively we fail here because we try to build a constraint for
> 
> {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
> 
> which ends up computing isl_pw_aff_mul (A, stride.88_115) with
> A being the non-constant constraint generated for
> {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
> a parameter.  ISL doesn't like that multiplication as the result
> isn't affine (well - it is, we just have parameters in there).
> 
> I suppose ISL doesn't handle this form of accesses given the
> two "dimensions" in this scalarized form may overlap?  So we'd
> really need to turn those into references with different access
> functions (even if that's not 100% a valid semantic transformation
> as scalarization isn't reversible without extra information)?

Looks like even when hacking the Fortran FE to produce nested
ARRAY_REFs we run into the same issue for

(gdb) p debug_data_reference (dr)
#(Data Ref: 
#  bb: 17 
#  stmt: 
VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 
1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 
sz: 8} = 0.0;
#  ref: 
VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 
1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 
sz: 8};
#  base_object: 
VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D));
#  Access function 0: {1, +, 1}_4
#  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, 
(unsigned long) stride.88_92}_3;
#  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, 
(unsigned long) stride.90_96}_2;
#  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, 
(unsigned long) stride.92_100}_1;

so it looks like simple strided (where stride is a parameter) access
is not handled either.

GCC's dependence analysis can at least compute distances of two
DRs when the difference of the access CHRECs is constant.  Within
the polyhedral model those cases cannot be handled?

Richard.

> Thanks,
> Richard.
>
Sebastian Pop Sept. 28, 2017, 6:28 p.m. UTC | #12
Hi skimo,

On Tue, Sep 26, 2017 at 10:15 AM, Sven Verdoolaege <
sven.verdoolaege@gmail.com> wrote:

> On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote:
> > Sven, is there already a function that computes the sum of all
> > strides in a proximity map?  Maybe you have code that does
> > something similar in pet or ppcg?
>
> What exactly do you want to sum?

> If this involves any counting, then it cannot currently

I think that it does involve counting: we need to know
the distance between all pairs of array accesses, that is, the
number of points in the dependence polyhedron.


> be done in pet or ppcg since isl does not support counting yet
> and the public version of barvinok is GPL licensed.
>
> Also, it's better to ask such questions on the isl mailing list
> isl-development@googlegroups.com
>
>
We are trying to find a metric that shows that isl's scheduler
did a useful transform.  Something like a diff tool that shows
the strides of the array accesses before and after scheduling.

Could the isl scheduler output a description of what it did?
We would like to use that output to build testcases that match
the behavior of the compiler on different patterns.

Thanks,
Sebastian
Sebastian Pop Sept. 28, 2017, 6:46 p.m. UTC | #13
On Wed, Sep 27, 2017 at 7:18 AM, Richard Biener <rguenther@suse.de> wrote:

> On Tue, 26 Sep 2017, Sebastian Pop wrote:
>
> > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de>
> wrote:
> >
> > > On Fri, 22 Sep 2017, Sebastian Pop wrote:
> > >
> > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de>
> > > wrote:
> > > >
> > > > >
> > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal
> > > > > TLC.  It also adds a testcase I reduced from a stupid mistake I
> made
> > > > > when reworking canonicalize_loop_closed_ssa.
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to
> trunk.
> > > > >
> > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> > > > > -Ofast -march=haswell -floop-nest-optimize are
> > > > >
> > > > >  61 loop nests "optimized"
> > > > >  45 loop nest transforms cancelled because of code generation
> issues
> > > > >  21 loop nest optimizations timed out the 350000 ISL "operations"
> we
> > > allow
> > > > >
> > > > > I say "optimized" because the usual transform I've seen is static
> > > tiling
> > > > > as enforced by GRAPHITE according to --param loop-block-tile-size.
> > > > > There's no way to automagically figure what kind of transform ISL
> did
> > > > >
> > > >
> > > > Here is how to automate (without magic) the detection
> > > > of the transform that isl did.
> > > >
> > > > The problem solved by isl is the minimization of strides
> > > > in memory, and to do this, we need to tell the isl scheduler
> > > > the validity dependence graph, in graphite-optimize-isl.c
> > > > see the validity (RAW, WAR, WAW) and the proximity
> > > > (RAR + validity) maps.  The proximity does include the
> > > > read after read, as the isl scheduler needs to minimize
> > > > strides between consecutive reads.
>
> Ah, so I now see why we do not perform interchange on trivial cases like
>
> double A[1024][1024], B[1024][1024];
>
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i];
> }
>
> which is probably because
>
>   /* FIXME: proximity should not be validity.  */
>   isl_union_map *proximity = isl_union_map_copy (validity);
>
> falls apart when there is _no_ dependence?
>

You are right.  The proximity needs to account for spatial
locality as well if you want to interchange the loop.
To describe the spatial locality, I would recommend adding
to the proximity relation the array accesses from two
successive iterations of the innermost loop:
A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1].
With these two extra relations in the proximity map,
isl should be able to interchange the above loop.
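
A hedged sketch of what such relations could look like in isl notation for
the interchange example above, expressed on the statement instances that
perform those accesses (S names the single statement, ctx is the isl
context, and how this union map is merged into the existing proximity
constraints is an assumption, not what the current code does):

  /* Map each statement instance to the instance touching the neighbouring
     elements of A and B in memory; minimizing the schedule distance
     between them favours making i the innermost loop.  */
  isl_union_map *spatial
    = isl_union_map_read_from_str (ctx,
	"{ S[i, j] -> S[i + 1, j] : 0 <= i < 1023 and 0 <= j < 1024 }");
  proximity = isl_union_map_union (proximity, spatial);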


>
> I can trick GRAPHITE into performing the interchange for
>
> double A[1024][1024], B[1024][1024];
>
> void foo(void)
> {
>   for (int i = 1; i < 1023; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i-1] + A[j][i+1];
> }
>
> because now there is a dependence.  Any idea on how to rewrite
> scop_get_dependences to avoid "simplifying"?  I suppose the
> validity constraints _do_ also specify kind-of a proximity
>

Correct: the validity map specifies a subset (it is missing
RAR dependences) of data reuse.


> we just may not prune / optimize them in the same way as
> dependences?
>

Validity constraints are there to "keep the wind blowing
in the same direction" after the transform (otherwise the
result of the transformed computation may be wrong).

The proximity map should contain a description of
- reuse of memory (temporal locality)
- how close together the accessed elements are (spatial locality).
isl will optimize for both if the proximity map has a description
of both.

For the moment the proximity map is initialized only with the
current validity constraints (as the FIXME comment you quoted says),
which only describes a subset of the temporal locality.

Sebastian
Sebastian Pop Sept. 28, 2017, 6:59 p.m. UTC | #14
On Wed, Sep 27, 2017 at 8:04 AM, Richard Biener <rguenther@suse.de> wrote:
>
> Another thing I notice is that we don't handle the multi-dimensional
> accesses the fortran frontend produces:
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 18
> #  stmt: _43 = *a_141(D)[_42];
> #  ref: *a_141(D)[_42];
> #  base_object: *a_141(D);
> #  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +,
> stride.88_115}_5
>
> ultimatively we fail here because we try to build a constraint for
>
> {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
>
> which ends up computing isl_pw_aff_mul (A, stride.88_115) with
> A being the non-constant constraint generated for
> {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
> a parameter.  ISL doesn't like that multiplication as the result
> isn't affine (well - it is, we just have parameters in there).
>
> I suppose ISL doesn't handle this form of accesses given the
> two "dimensions" in this scalarized form may overlap?  So we'd
> really need to turn those into references with different access
> functions (even if that's not 100% a valid semantic transformation
> as scalarization isn't reversible without extra information)?

You are right.
This multivariate memory access would be better handled in
delinearized form:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66981
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61000
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741

There are two ways to handle this issue:
- fix the Fortran front-end to emit multi-dimensional ARRAY_REFs,
- implement an array delinearization pass, as I implemented in LLVM
http://llvm.org/doxygen/Delinearization_8cpp_source.html
"On Recovering Multi-Dimensional Arrays in Polly"
http://impact.gforge.inria.fr/impact2015/papers/impact2015-grosser.pdf
"Optimistic Delinearization of Parametrically Sized Arrays"
https://dl.acm.org/citation.cfm?id=2751248
LLVM does not have an equivalent of the multi-dim ARRAY_REF description;
it only reasons about linearized memory accesses, like in GCC's RTL
(gep = Get Element Pointer), so we had no other option than to delinearize.

Sebastian
Sebastian Pop Sept. 28, 2017, 7:10 p.m. UTC | #15
On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener <rguenther@suse.de> wrote:
> Looks like even when hacking the Fortran FE to produce nested
> ARRAY_REFs we run into the same issue for
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 17
> #  stmt:
> VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb:
> 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> sz: 8} = 0.0;
> #  ref:
> VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb:
> 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> sz: 8};
> #  base_object:
> VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D));
> #  Access function 0: {1, +, 1}_4
> #  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +,
> (unsigned long) stride.88_92}_3;
> #  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +,
> (unsigned long) stride.90_96}_2;
> #  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +,
> (unsigned long) stride.92_100}_1;
>
> so it looks like simple strided (where stride is a parameter) access
> is not handled either.

Yes, this is the first option I was mentioning: it could work.
Could you please make sure that you don't have a bug in the "hack patch":
the outer dimension should not contain the parameter
(inner array dimension) times the access function.

Example in C:
int A[100][N];
A[i][j] is linearized as *(A + i * N * 4 + j * 4)
and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j].
Could you please check that it would delinearize back to A[i][j]?
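
A small compilable sketch of that check; the VLA parameter stands in for
the non-constant inner dimension N, and the function and variable names
are made up for illustration:

void check_delinearization (int n, int i, int j, int A[][n])
{
  /* The linearized access as the scalarized form computes it.  */
  int *linear = (int *) A + i * n + j;
  /* The delinearized access it should map back to; equal to the
     above whenever 0 <= j < n.  */
  int *delin = &A[i][j];
  /* A bogus delinearization like A[i * n][j] would instead compute
     (int *) A + i * n * n + j.  */
  (void) linear;
  (void) delin;
}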

>
> GCCs dependence analysis can at least compute distances of two
> DRs when the difference of the access CHRECs is constant.  Within
> the polyhedral model those cases cannot be handled?

The difficulty for the polyhedral model is in the representation
of a multiplication of a parameter by a loop index variable.
The delinearization removes this difficulty by creating linear expressions.

Think about multiplication as something introducing exponentiality
and you realize that any such expression would not fit in the
linear model of polyhedra.
A parameter is nothing else than an outer loop index whose loop
we don't have access to, as it may be outside the current function
in which we receive that parameter.

Sebastian
Richard Biener Sept. 29, 2017, 11:17 a.m. UTC | #16
On Thu, 28 Sep 2017, Sebastian Pop wrote:

> On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener <rguenther@suse.de> wrote:
> > Looks like even when hacking the Fortran FE to produce nested
> > ARRAY_REFs we run into the same issue for
> >
> > (gdb) p debug_data_reference (dr)
> > #(Data Ref:
> > #  bb: 17
> > #  stmt:
> > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb:
> > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> > sz: 8} = 0.0;
> > #  ref:
> > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb:
> > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1
> > sz: 8};
> > #  base_object:
> > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D));
> > #  Access function 0: {1, +, 1}_4
> > #  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +,
> > (unsigned long) stride.88_92}_3;
> > #  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +,
> > (unsigned long) stride.90_96}_2;
> > #  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +,
> > (unsigned long) stride.92_100}_1;
> >
> > so it looks like simple strided (where stride is a parameter) access
> > is not handled either.
> 
> Yes, this is the first option I was mentioning: it could work,
> could you please make sure that you don't have a bug in the "hack patch"
> where the outer dimension should not contain the parameter
> (inner array dimension) times the access function.
> 
> Example in C:
> int A[100][N];
> A[i][j] is linearized as *(A + i * N * 4 + j * 4)
> and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j]
> Could you please check that it would delinearize back to A[i][j]?

I fixed the "hack patch" somewhat but realized it's not really possible
ATM to get back to this form because the array descriptor contains only
the information needed to generate the linearized form.  So while I now
get correct values, they end up with runtime divisions which aren't
handled by SCEV.

I fear emitting delinearized code from the Fortran FE would be an
ABI breakage as we'd have to change the array descriptor contents.

> >
> > GCCs dependence analysis can at least compute distances of two
> > DRs when the difference of the access CHRECs is constant.  Within
> > the polyhedral model those cases cannot be handled?
> 
> The difficulty for the polyhedral model is in the representation
> of a multiplication of parameter times loop index variable.
> The delinearization removes these difficulties by creating linear expressions.
> 
> Think about multiplication as something introducing exponentiality
> and you realize that any such expression would not fit in the
> linear model of polyhedra.
> A parameter is nothing else than an outer loop index to which we don't
> have access to that loop level as it may be outside the current function
> in which we get that parameter in.

Yeah, I see that now.

Richard.
Sebastian Pop Sept. 29, 2017, 2:58 p.m. UTC | #17
On Fri, Sep 29, 2017 at 6:17 AM, Richard Biener <rguenther@suse.de> wrote:
> I fixed the "hack patch" somewhat but realized it's not really possible
> ATM to get back at this form because the array descriptor contains only
> information to generate the linearized form.  So while I get now correct
> values they end up with runtime divisions which aren't handled by
> SCEV.

You are right, SCEV has some limits on representing and folding
those division expressions.

There is a proposal in LLVM from Johannes Doerfert
https://reviews.llvm.org/D38255
to use isl as a representation and expression folder instead of the
chains of recurrences for the scalar evolution analysis.  isl would
be able to handle some of the semantics of the div_exprs, and the
semantics of wrap-around variables, and of course it would have
some other limits to represent multiplications (as we spoke about
yesterday, i * N or M * N for example), and thus that polyhedral
analysis would need to rely on the delinearization of array indices.
Sven Verdoolaege Sept. 29, 2017, 7:34 p.m. UTC | #18
On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> Ah, so I now see why we do not perform interchange on trivial cases like
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i];
> }

I didn't see you mentioning _why_ you expect an interchange here.
Are you perhaps interested in spatial locality?
If so, then there are several approaches for taking
that into account.
- pluto performs an intra-tile loop interchange to
  improve temporal and/or spatial locality.  It shouldn't
  be too hard to do something similar on an isl generated
  schedule
- Alex (Oleksandr) has been working on an extension of
  the isl scheduler that takes into account spatial locality.
  I'm not sure if it's publicly available.
- I've been working on a special case of spatial locality
  (consecutivity).  The current version is available in
  the consecutivity branch.  Note that it may get rebased and
  it may not necessarily get merged into master.

There are also other approaches, but they may not be that
easy to combine with the isl scheduler.

skimo
Sven Verdoolaege Sept. 29, 2017, 7:37 p.m. UTC | #19
[Sorry for the resend; I used the wrong email address to CC Alex]

On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> Ah, so I now see why we do not perform interchange on trivial cases like
> 
> double A[1024][1024], B[1024][1024];
> 
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i];
> }

I didn't see you mentioning _why_ you expect an interchange here.
Are you perhaps interested in spatial locality?
If so, then there are several approaches for taking
that into account.
- pluto performs an intra-tile loop interchange to
  improve temporal and/or spatial locality.  It shouldn't
  be too hard to do something similar on an isl generated
  schedule
- Alex (Oleksandr) has been working on an extension of
  the isl scheduler that takes into account spatial locality.
  I'm not sure if it's publicly available.
- I've been working on a special case of spatial locality
  (consecutivity).  The current version is available in
  the consecutivity branch.  Note that it may get rebased and
  it may not necessarily get merged into master.

There are also other approaches, but they may not be that
easy to combine with the isl scheduler.

skimo
Sebastian Pop Sept. 29, 2017, 7:58 p.m. UTC | #20
On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
<sven.verdoolaege@gmail.com> wrote:
> [Sorry for the resend; I used the wrong email address to CC Alex]
>
> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
>> Ah, so I now see why we do not perform interchange on trivial cases like
>>
>> double A[1024][1024], B[1024][1024];
>>
>> void foo(void)
>> {
>>   for (int i = 0; i < 1024; ++i)
>>     for (int j = 0; j < 1024; ++j)
>>       A[j][i] = B[j][i];
>> }
>
> I didn't see you mentioning _why_ you expect an interchange here.
> Are you prehaps interested in spatial locality?
> If so, then there are several approaches for taking
> that into account.
> - pluto performs an intra-tile loop interchange to
>   improve temporal and/or spatial locality.  It shouldn't
>   be too hard to do something similar on an isl generated
>   schedule
> - Alex (Oleksandr) has been working on an extension of
>   the isl scheduler that takes into account spatial locality.
>   I'm not sure if it's publicly available.
> - I've been working on a special case of spatial locality
>   (consecutivity).  The current version is available in
>   the consecutivity branch.  Note that it may get rebased and
>   it may not necessarily get merged into master.
>
> There are also other approaches, but they may not be that
> easy to combine with the isl scheduler.

Would the following work?
Add to the proximity relation the array accesses from two
successive iterations of the innermost loop:
A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1].
With these two extra relations in the proximity map,
isl should be able to interchange the above loop.

Sebastian
Oleksandr Zinenko Sept. 29, 2017, 9:28 p.m. UTC | #21
On 29/09/17 21:58, Sebastian Pop wrote:
> On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
> <sven.verdoolaege@gmail.com> wrote:
>> [Sorry for the resend; I used the wrong email address to CC Alex]
>>
>> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
>>> Ah, so I now see why we do not perform interchange on trivial cases like
>>>
>>> double A[1024][1024], B[1024][1024];
>>>
>>> void foo(void)
>>> {
>>>    for (int i = 0; i < 1024; ++i)
>>>      for (int j = 0; j < 1024; ++j)
>>>        A[j][i] = B[j][i];
>>> }
>> I didn't see you mentioning _why_ you expect an interchange here.
>> Are you perhaps interested in spatial locality?
>> If so, then there are several approaches for taking
>> that into account.
>> - pluto performs an intra-tile loop interchange to
>>    improve temporal and/or spatial locality.  It shouldn't
>>    be too hard to do something similar on an isl generated
>>    schedule
>> - Alex (Oleksandr) has been working on an extension of
>>    the isl scheduler that takes into account spatial locality.
>>    I'm not sure if it's publicly available.
>> - I've been working on a special case of spatial locality
>>    (consecutivity).  The current version is available in
>>    the consecutivity branch.  Note that it may get rebased and
>>    it may not necessarily get merged into master.
>>
>> There are also other approaches, but they may not be that
>> easy to combine with the isl scheduler.
> Would the following work?
> Add to the proximity relation the array accesses from two
> successive iterations of the innermost loop:
> A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
> With these two extra relations in the proximity map,
> isl should be able to interchange the above loop.
>
> Sebastian
Hi,

This looks very close to what we do for spatial locality in the
scheduler, except that we separate proximity and "spatial proximity"
maps.  There are a couple of caveats in just plugging those into
proximity, in particular resolving conflicts between spatial and
temporal locality and avoiding unnecessary skewing.

Cheers,
Alex
Richard Biener Sept. 30, 2017, 5:47 p.m. UTC | #22
On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop <sebpop@gmail.com> wrote:
>On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
><sven.verdoolaege@gmail.com> wrote:
>> [Sorry for the resend; I used the wrong email address to CC Alex]
>>
>> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
>>> Ah, so I now see why we do not perform interchange on trivial cases
>like
>>>
>>> double A[1024][1024], B[1024][1024];
>>>
>>> void foo(void)
>>> {
>>>   for (int i = 0; i < 1024; ++i)
>>>     for (int j = 0; j < 1024; ++j)
>>>       A[j][i] = B[j][i];
>>> }
>>
>> I didn't see you mentioning _why_ you expect an interchange here.
>> Are you perhaps interested in spatial locality?
>> If so, then there are several approaches for taking
>> that into account.
>> - pluto performs an intra-tile loop interchange to
>>   improve temporal and/or spatial locality.  It shouldn't
>>   be too hard to do something similar on an isl generated
>>   schedule
>> - Alex (Oleksandr) has been working on an extension of
>>   the isl scheduler that takes into account spatial locality.
>>   I'm not sure if it's publicly available.
>> - I've been working on a special case of spatial locality
>>   (consecutivity).  The current version is available in
>>   the consecutivity branch.  Note that it may get rebased and
>>   it may not necessarily get merged into master.
>>
>> There are also other approaches, but they may not be that
>> easy to combine with the isl scheduler.
>
>Would the following work?
>Add to the proximity relation the array accesses from two
>successive iterations of the innermost loop:
>A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
>With these two extra relations in the proximity map,
>isl should be able to interchange the above loop.

Can anyone give a hint on how to do that in ISL terms? 

Richard. 

>Sebastian
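
A minimal standalone sketch of what that could look like against the
public isl C API (illustrative only: the statement name S, the
hard-coded relation strings and the final dump are assumptions made
for this example, not anything taken from GCC's graphite code):

#include <isl/ctx.h>
#include <isl/union_set.h>
#include <isl/union_map.h>
#include <isl/schedule.h>

int
main (void)
{
  isl_ctx *ctx = isl_ctx_alloc ();

  /* Iteration domain of the single statement S[i,j] in foo above.  */
  isl_union_set *domain
    = isl_union_set_read_from_str
	(ctx, "{ S[i,j] : 0 <= i <= 1023 and 0 <= j <= 1023 }");

  /* Instances touching consecutive elements of A (and B): S[i,j]
     accesses A[j][i] and S[i+1,j] accesses A[j][i+1].  Asking the
     scheduler to keep them close in time favors making i innermost.  */
  isl_union_map *proximity
    = isl_union_map_read_from_str
	(ctx, "{ S[i,j] -> S[i+1,j] : 0 <= i <= 1022 and 0 <= j <= 1023 }");

  isl_schedule_constraints *sc = isl_schedule_constraints_on_domain (domain);
  sc = isl_schedule_constraints_set_proximity (sc, proximity);

  isl_schedule *sched = isl_schedule_constraints_compute_schedule (sc);
  isl_schedule_dump (sched);

  isl_schedule_free (sched);
  isl_ctx_free (ctx);
  return 0;
}

There are no validity constraints here (the example loop carries no
dependences), so the extra proximity relation should be what pushes
the computed schedule towards an i-innermost band.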
Sven Verdoolaege Oct. 1, 2017, 9:58 a.m. UTC | #23
On Sat, Sep 30, 2017 at 07:47:43PM +0200, Richard Biener wrote:
> On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop <sebpop@gmail.com> wrote:
> >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege
> ><sven.verdoolaege@gmail.com> wrote:
> >> [Sorry for the resend; I used the wrong email address to CC Alex]
> >>
> >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> >>> Ah, so I now see why we do not perform interchange on trivial cases
> >like
> >>>
> >>> double A[1024][1024], B[1024][1024];
> >>>
> >>> void foo(void)
> >>> {
> >>>   for (int i = 0; i < 1024; ++i)
> >>>     for (int j = 0; j < 1024; ++j)
> >>>       A[j][i] = B[j][i];
> >>> }
> >>
> >> I didn't see you mentioning _why_ you expect an interchange here.
> >> Are you perhaps interested in spatial locality?
> >> If so, then there are several approaches for taking
> >> that into account.
> >> - pluto performs an intra-tile loop interchange to
> >>   improve temporal and/or spatial locality.  It shouldn't
> >>   be too hard to do something similar on an isl generated
> >>   schedule
> >> - Alex (Oleksandr) has been working on an extension of
> >>   the isl scheduler that takes into account spatial locality.
> >>   I'm not sure if it's publicly available.
> >> - I've been working on a special case of spatial locality
> >>   (consecutivity).  The current version is available in
> >>   the consecutivity branch.  Note that it may get rebased and
> >>   it may not necessarily get merged into master.
> >>
> >> There are also other approaches, but they may not be that
> >> easy to combine with the isl scheduler.
> >
> >Would the following work?
> >Add to the proximity relation the array accesses from two
> >successive iterations of the innermost loop:
> >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1]
> >With these two extra relations in the proximity map,
> >isl should be able to interchange the above loop.
> 
> Can anyone give a hint on how to do that in ISL terms? 

For the approach pluto is taking, you'll have to look at the source
code, see pluto_intra_tile_optimize_band.
For the other two approaches I mentioned above, reports will
be made available within the next couple of weeks.
For the last one, there is a sample implementation in the
consecutivity branch of PPCG.

skimo
Sven Verdoolaege Nov. 11, 2017, 4:57 p.m. UTC | #24
On Sun, Oct 01, 2017 at 11:58:30AM +0200, Sven Verdoolaege wrote:
> For the approach pluto is taking, you'll have to look at the source
> code, see pluto_intra_tile_optimize_band.
> For the other two approaches I mentioned above, reports will
> be made available within the next couple of weeks.

https://hal.inria.fr/hal-01628798
http://www.cs.kuleuven.be/publicaties/rapporten/cw/CW709.abs.html

skimo
diff mbox series

Patch

Index: gcc/graphite-isl-ast-to-gimple.c
===================================================================
--- gcc/graphite-isl-ast-to-gimple.c	(revision 253091)
+++ gcc/graphite-isl-ast-to-gimple.c	(working copy)
@@ -73,15 +73,6 @@  struct ast_build_info
   bool is_parallelizable;
 };
 
-/* Verifies properties that GRAPHITE should maintain during translation.  */
-
-static inline void
-graphite_verify (void)
-{
-  checking_verify_loop_structure ();
-  checking_verify_loop_closed_ssa (true);
-}
-
 /* IVS_PARAMS maps isl's scattering and parameter identifiers
    to corresponding trees.  */
 
@@ -2997,8 +2988,9 @@  graphite_regenerate_ast_isl (scop_p scop
 	  delete_loop (loop);
     }
 
-  graphite_verify ();
-  scev_reset ();
+  /* Verifies properties that GRAPHITE should maintain during translation.  */
+  checking_verify_loop_structure ();
+  checking_verify_loop_closed_ssa (true);
 
   free (if_region->true_region);
   free (if_region->region);
@@ -3008,19 +3000,6 @@  graphite_regenerate_ast_isl (scop_p scop
   isl_ast_node_free (root_node);
   timevar_pop (TV_GRAPHITE_CODE_GEN);
 
-  if (dump_file && (dump_flags & TDF_DETAILS))
-    {
-      loop_p loop;
-      int num_no_dependency = 0;
-
-      FOR_EACH_LOOP (loop, 0)
-	if (loop->can_be_parallel)
-	  num_no_dependency++;
-
-      fprintf (dump_file, "%d loops carried no dependency.\n",
-	       num_no_dependency);
-    }
-
   return !t.codegen_error_p ();
 }
 
Index: gcc/graphite-optimize-isl.c
===================================================================
--- gcc/graphite-optimize-isl.c	(revision 253091)
+++ gcc/graphite-optimize-isl.c	(working copy)
@@ -37,6 +37,7 @@  along with GCC; see the file COPYING3.
 #include "tree-data-ref.h"
 #include "params.h"
 #include "dumpfile.h"
+#include "tree-vectorizer.h"
 #include "graphite.h"
 
 
@@ -156,9 +157,12 @@  optimize_isl (scop_p scop)
   if (!scop->transformed_schedule
       || isl_ctx_last_error (scop->isl_context) == isl_error_quota)
     {
-      if (dump_file && dump_flags)
-	fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n",
-		 max_operations);
+      location_t loc = find_loop_location
+	(scop->scop_info->region.entry->dest->loop_father);
+      dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc,
+		       "loop nest not optimized, optimization timed out "
+		       "after %d operations [--param max-isl-operations]\n",
+		       max_operations);
       return false;
     }
 
Index: gcc/graphite.c
===================================================================
--- gcc/graphite.c	(revision 253091)
+++ gcc/graphite.c	(working copy)
@@ -293,86 +293,6 @@  free_scops (vec<scop_p> scops)
   scops.release ();
 }
 
-/* Returns true when P1 and P2 are close phis with the same
-   argument.  */
-
-static inline bool
-same_close_phi_node (gphi *p1, gphi *p2)
-{
-  return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)),
-			      TREE_TYPE (gimple_phi_result (p2)))
-	  && operand_equal_p (gimple_phi_arg_def (p1, 0),
-			      gimple_phi_arg_def (p2, 0), 0));
-}
-
-static void make_close_phi_nodes_unique (basic_block bb);
-
-/* Remove the close phi node at GSI and replace its rhs with the rhs
-   of PHI.  */
-
-static void
-remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi)
-{
-  gimple *use_stmt;
-  use_operand_p use_p;
-  imm_use_iterator imm_iter;
-  tree res = gimple_phi_result (phi);
-  tree def = gimple_phi_result (gsi->phi ());
-
-  gcc_assert (same_close_phi_node (phi, gsi->phi ()));
-
-  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
-    {
-      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
-	SET_USE (use_p, res);
-
-      update_stmt (use_stmt);
-
-      /* It is possible that we just created a duplicate close-phi
-	 for an already-processed containing loop.  Check for this
-	 case and clean it up.  */
-      if (gimple_code (use_stmt) == GIMPLE_PHI
-	  && gimple_phi_num_args (use_stmt) == 1)
-	make_close_phi_nodes_unique (gimple_bb (use_stmt));
-    }
-
-  remove_phi_node (gsi, true);
-}
-
-/* Removes all the close phi duplicates from BB.  */
-
-static void
-make_close_phi_nodes_unique (basic_block bb)
-{
-  gphi_iterator psi;
-
-  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
-    {
-      gphi_iterator gsi = psi;
-      gphi *phi = psi.phi ();
-
-      /* At this point, PHI should be a close phi in normal form.  */
-      gcc_assert (gimple_phi_num_args (phi) == 1);
-
-      /* Iterate over the next phis and remove duplicates.  */
-      gsi_next (&gsi);
-      while (!gsi_end_p (gsi))
-	if (same_close_phi_node (phi, gsi.phi ()))
-	  remove_duplicate_close_phi (phi, &gsi);
-	else
-	  gsi_next (&gsi);
-    }
-}
-
-/* Return true when NAME is defined in LOOP.  */
-
-static bool
-defined_in_loop_p (tree name, loop_p loop)
-{
-  gcc_assert (TREE_CODE (name) == SSA_NAME);
-  return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name));
-}
-
 /* Transforms LOOP to the canonical loop closed SSA form.  */
 
 static void
@@ -380,20 +300,22 @@  canonicalize_loop_closed_ssa (loop_p loo
 {
   edge e = single_exit (loop);
   basic_block bb;
+  gphi_iterator psi;
 
   if (!e || (e->flags & EDGE_COMPLEX))
     return;
 
   bb = e->dest;
 
+  /* Make the loop-close PHI node BB contain only PHIs and have a
+     single predecessor.  */
   if (single_pred_p (bb))
     {
       e = split_block_after_labels (bb);
-      make_close_phi_nodes_unique (e->src);
+      bb = e->src;
     }
   else
     {
-      gphi_iterator psi;
       basic_block close = split_edge (e);
       e = single_succ_edge (close);
       for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
@@ -404,7 +326,7 @@  canonicalize_loop_closed_ssa (loop_p loo
 
 	  /* Only add close phi nodes for SSA_NAMEs defined in LOOP.  */
 	  if (TREE_CODE (arg) != SSA_NAME
-	      || !defined_in_loop_p (arg, loop))
+	      || loop_containing_stmt (SSA_NAME_DEF_STMT (arg)) != loop)
 	    continue;
 
 	  tree res = copy_ssa_name (arg);
@@ -413,8 +335,30 @@  canonicalize_loop_closed_ssa (loop_p loo
 		       UNKNOWN_LOCATION);
 	  SET_USE (use_p, res);
 	}
+      bb = close;
+    }
 
-      make_close_phi_nodes_unique (close);
+  /* Eliminate duplicates.  This relies on processing loops from
+     innermost to outer.  */
+  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
+    {
+      gphi_iterator gsi = psi;
+      gphi *phi = psi.phi ();
+
+      /* At this point, PHI should be a close phi in normal form.  */
+      gcc_assert (gimple_phi_num_args (phi) == 1);
+
+      /* Iterate over the next phis and remove duplicates.  */
+      gsi_next (&gsi);
+      while (!gsi_end_p (gsi))
+	if (gimple_phi_arg_def (phi, 0) == gimple_phi_arg_def (gsi.phi (), 0))
+	  {
+	    replace_uses_by (gimple_phi_result (gsi.phi ()),
+			     gimple_phi_result (phi));
+	    remove_phi_node (&gsi, true);
+	  }
+	else
+	  gsi_next (&gsi);
     }
 }
 
@@ -443,7 +387,7 @@  static void
 canonicalize_loop_closed_ssa_form (void)
 {
   loop_p loop;
-  FOR_EACH_LOOP (loop, 0)
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
     canonicalize_loop_closed_ssa (loop);
 
   checking_verify_loop_closed_ssa (true);
@@ -509,6 +453,19 @@  graphite_transform_loops (void)
 			   "loop nest optimized\n");
       }
 
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      loop_p loop;
+      int num_no_dependency = 0;
+
+      FOR_EACH_LOOP (loop, 0)
+	if (loop->can_be_parallel)
+	  num_no_dependency++;
+
+      fprintf (dump_file, "%d loops carried no dependency.\n",
+	       num_no_dependency);
+    }
+
   free_scops (scops);
   graphite_finalize (need_cfg_cleanup_p);
   the_isl_ctx = NULL;
Index: gcc/testsuite/gcc.dg/graphite/scop-24.c
===================================================================
--- gcc/testsuite/gcc.dg/graphite/scop-24.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/graphite/scop-24.c	(working copy)
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast -floop-nest-optimize" } */
+
+typedef struct _IO_FILE FILE;
+extern struct _IO_FILE *stderr;
+typedef float real;
+typedef real rvec[3];
+int rgbset (int);
+void ps_box (int, int);
+void plot_phi(char *fn,rvec box,int natoms,rvec x[],real phi[])
+{
+  real phi_max,rr,gg,bb,fac,dx,x0,y0;
+  int i;
+  for(i=0; (i<natoms); i++) 
+    phi_max=((phi_max > __builtin_fabs(phi[i]))
+	     ? phi_max : __builtin_fabs(phi[i]));
+  if (__builtin_fabs(phi_max)<1.2e-38)
+      __builtin_fprintf(stderr, "X");
+  ps_box((real)(fac*box[0]-1),(real)(fac*box[1]-1));
+  for(i=0; (i<natoms); i++)
+    {
+      rr=gg=bb=1.0;
+      if (phi[i] < 0)
+	gg=bb=(1.0+(phi[i]/phi_max));
+      else
+	rr=gg=(1.0-(phi[i]/phi_max));
+      rr=rgbset(rr);
+    }
+}