Message ID: alpine.LSU.2.20.1709221453010.26836@zhemvz.fhfr.qr
State: New
Series: [GRAPHITE] More TLC
On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> wrote:
>
> This simplifies canonicalize_loop_closed_ssa and does other minimal
> TLC.  It also adds a testcase I reduced from a stupid mistake I made
> when reworking canonicalize_loop_closed_ssa.
>
> Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk.
>
> SPEC CPU 2006 is happy with it, current statistics on x86_64 with
> -Ofast -march=haswell -floop-nest-optimize are
>
>  61 loop nests "optimized"
>  45 loop nest transforms cancelled because of code generation issues
>  21 loop nest optimizations timed out the 350000 ISL "operations" we allow
>
> I say "optimized" because the usual transform I've seen is static tiling
> as enforced by GRAPHITE according to --param loop-block-tile-size.
> There's no way to automagically figure what kind of transform ISL did
> (usually none with the schedule identical check confused by FILTER
> stuff positioning).  This is also the issue with most GRAPHITE testcases.
> We can't really verify easily whether we performed loop interchange
> or not.  We can probably tell whether we applied loop fusion or
> splitting (by counting loops).

Here is how to automate (without magic) the detection of the transform
that isl did.

The problem solved by isl is the minimization of strides in memory, and
to do this, we need to tell the isl scheduler the validity dependence
graph; in graphite-optimize-isl.c see the validity (RAW, WAR, WAW) and
the proximity (RAR + validity) maps.  The proximity map does include the
read-after-read dependences, as the isl scheduler needs to minimize
strides between consecutive reads.

When you apply the schedule to the dependence graph, one can tell from
the result the strides in memory.  A good way to say whether a transform
was beneficial is to sum up all memory strides and make sure that the
sum of all strides decreases after the transform.  We could add a printf
with the sum of strides before and after transforms, and have the
testcases check for that.
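The stride-sum metric Sebastian describes can be sketched in a few lines. This is a toy model, not GCC or isl code: an access is summarized by the coefficients of the loop indices in its linearized subscript (all names here are hypothetical), and the metric sums the byte stride seen by the innermost loop of a given schedule, before and after a transform such as interchange.

```python
def innermost_stride(coeffs, schedule, elem_size=8):
    """Byte stride when the innermost scheduled loop advances by one.

    coeffs maps loop index name -> linearized coefficient (in elements);
    schedule lists the loop order, outermost first."""
    inner = schedule[-1]
    return abs(coeffs.get(inner, 0)) * elem_size

def total_strides(accesses, schedule):
    """Sum of innermost-loop strides over all accesses in the nest."""
    return sum(innermost_stride(c, schedule) for c in accesses)

# A[j][i] = B[j][i] inside "for i: for j:" with 1024-column double
# arrays: each access linearizes to j*1024 + i.
accesses = [{"i": 1, "j": 1024}, {"i": 1, "j": 1024}]
before = total_strides(accesses, ["i", "j"])   # j innermost
after  = total_strides(accesses, ["j", "i"])   # i innermost (interchanged)
print(before, after)
```

A testcase could then simply check that the printed sum decreases across the transform, which is exactly the printf-based check proposed above.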
> > I'm not aware of any remaining ICEs / wrong-code issues with GRAPHITE. > > I'm aware that the current "black-box" granularity hinders > scheduling freedom (each GIMPLE BB is mapped to a ISL stmt, this > is too coarse to schedule say two writes in a BB independently > from each other). Quick experiments could be done by simply > splitting gimple BBs at some points. > > I'm aware that the SCOP detection algorithm assumes that it can > walk loop->next and find loops "in order" -- but while that's > true for the initial flow_loops_find result (DFS walk) it isn't > true for any later created / discovered loops. Sorting of > loop siblings in DFS order should be easy (and a general cfgloopanal > helper). > > Richard. > > 2017-09-22 Richard Biener <rguenther@suse.de> > > * graphite-isl-ast-to-gimple.c (graphite_verify): Inline into > single caller. > (graphite_regenerate_ast_isl): Do not reset SCEV. Move debug > print of no dependency loops ... > * graphite.c (graphite_transform_loops): ... here. > (canonicalize_loop_closed_ssa_form): Work from inner to outer > loops. > (same_close_phi_node, remove_duplicate_close_phi, > make_close_phi_nodes_unique, defined_in_loop_p): Fold into ... > (canonicalize_loop_closed_ssa): ... here and simplify. > * graphite-optimize-isl.c: Include tree-vectorizer.h. > (optimize_isl): Use dump_printf_loc to tell when we stopped > optimizing because of an ISL timeout. > > * gcc.dg/graphite/scop-24.c: New testcase. > > The change looks good to me. Thanks, Sebastian
On Fri, 22 Sep 2017, Richard Biener wrote: > > This simplifies canonicalize_loop_closed_ssa and does other minimal > TLC. It also adds a testcase I reduced from a stupid mistake I made > when reworking canonicalize_loop_closed_ssa. > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > -Ofast -march=haswell -floop-nest-optimize are > > 61 loop nests "optimized" > 45 loop nest transforms cancelled because of code generation issues > 21 loop nest optimizations timed out the 350000 ISL "operations" we allow Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize and 709 sec. with (this was with release checking). A single-run has 416.gamess (580s -> 618s), 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will do a 3-run for those to confirm (it would be only a single regression for 416.gamess). Sofar I'm positively surprised given the limitations (and inefficiencies) I know. I'll add some more opt-info stuff to assess the number of SCOPs we detect but discard during further analysis and the number of transforms we cancel because they turn out as a no-op. Richard. 
> {
> -  if (dump_file && dump_flags)
> -    fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n",
> -	     max_operations);
> +  location_t loc = find_loop_location
> +    (scop->scop_info->region.entry->dest->loop_father);
> +  dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc,
> +		   "loop nest not optimized, optimization timed out "
> +		   "after %d operations [--param max-isl-operations]\n",
> +		   max_operations);
>    return false;
> }
>
> Index: gcc/graphite.c
> ===================================================================
> --- gcc/graphite.c	(revision 253091)
> +++ gcc/graphite.c	(working copy)
> @@ -293,86 +293,6 @@ free_scops (vec<scop_p> scops)
>    scops.release ();
>  }
>
> -/* Returns true when P1 and P2 are close phis with the same
> -   argument.  */
> -
> -static inline bool
> -same_close_phi_node (gphi *p1, gphi *p2)
> -{
> -  return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)),
> -			      TREE_TYPE (gimple_phi_result (p2)))
> -	  && operand_equal_p (gimple_phi_arg_def (p1, 0),
> -			      gimple_phi_arg_def (p2, 0), 0));
> -}
> -
> -static void make_close_phi_nodes_unique (basic_block bb);
> -
> -/* Remove the close phi node at GSI and replace its rhs with the rhs
> -   of PHI.  */
> -
> -static void
> -remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi)
> -{
> -  gimple *use_stmt;
> -  use_operand_p use_p;
> -  imm_use_iterator imm_iter;
> -  tree res = gimple_phi_result (phi);
> -  tree def = gimple_phi_result (gsi->phi ());
> -
> -  gcc_assert (same_close_phi_node (phi, gsi->phi ()));
> -
> -  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
> -    {
> -      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
> -	SET_USE (use_p, res);
> -
> -      update_stmt (use_stmt);
> -
> -      /* It is possible that we just created a duplicate close-phi
> -	 for an already-processed containing loop.  Check for this
> -	 case and clean it up.  */
> -      if (gimple_code (use_stmt) == GIMPLE_PHI
> -	  && gimple_phi_num_args (use_stmt) == 1)
> -	make_close_phi_nodes_unique (gimple_bb (use_stmt));
> -    }
> -
> -  remove_phi_node (gsi, true);
> -}
> -
> -/* Removes all the close phi duplicates from BB.  */
> -
> -static void
> -make_close_phi_nodes_unique (basic_block bb)
> -{
> -  gphi_iterator psi;
> -
> -  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> -    {
> -      gphi_iterator gsi = psi;
> -      gphi *phi = psi.phi ();
> -
> -      /* At this point, PHI should be a close phi in normal form.  */
> -      gcc_assert (gimple_phi_num_args (phi) == 1);
> -
> -      /* Iterate over the next phis and remove duplicates.  */
> -      gsi_next (&gsi);
> -      while (!gsi_end_p (gsi))
> -	if (same_close_phi_node (phi, gsi.phi ()))
> -	  remove_duplicate_close_phi (phi, &gsi);
> -	else
> -	  gsi_next (&gsi);
> -    }
> -}
> -
> -/* Return true when NAME is defined in LOOP.  */
> -
> -static bool
> -defined_in_loop_p (tree name, loop_p loop)
> -{
> -  gcc_assert (TREE_CODE (name) == SSA_NAME);
> -  return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name));
> -}
> -
>  /* Transforms LOOP to the canonical loop closed SSA form.  */
>
>  static void
> @@ -380,20 +300,22 @@ canonicalize_loop_closed_ssa (loop_p loo
>  {
>    edge e = single_exit (loop);
>    basic_block bb;
> +  gphi_iterator psi;
>
>    if (!e || (e->flags & EDGE_COMPLEX))
>      return;
>
>    bb = e->dest;
>
> +  /* Make the loop-close PHI node BB contain only PHIs and have a
> +     single predecessor.  */
>    if (single_pred_p (bb))
>      {
>        e = split_block_after_labels (bb);
> -      make_close_phi_nodes_unique (e->src);
> +      bb = e->src;
>      }
>    else
>      {
> -      gphi_iterator psi;
>        basic_block close = split_edge (e);
>        e = single_succ_edge (close);
>        for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> @@ -404,7 +326,7 @@ canonicalize_loop_closed_ssa (loop_p loo
>
>  	  /* Only add close phi nodes for SSA_NAMEs defined in LOOP.  */
>  	  if (TREE_CODE (arg) != SSA_NAME
> -	      || !defined_in_loop_p (arg, loop))
> +	      || loop_containing_stmt (SSA_NAME_DEF_STMT (arg)) != loop)
>  	    continue;
>
>  	  tree res = copy_ssa_name (arg);
> @@ -413,8 +335,30 @@ canonicalize_loop_closed_ssa (loop_p loo
>  			   UNKNOWN_LOCATION);
>  	  SET_USE (use_p, res);
>  	}
> +      bb = close;
> +    }
>
> -      make_close_phi_nodes_unique (close);
> +  /* Eliminate duplicates.  This relies on processing loops from
> +     innermost to outer.  */
> +  for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi))
> +    {
> +      gphi_iterator gsi = psi;
> +      gphi *phi = psi.phi ();
> +
> +      /* At this point, PHI should be a close phi in normal form.  */
> +      gcc_assert (gimple_phi_num_args (phi) == 1);
> +
> +      /* Iterate over the next phis and remove duplicates.  */
> +      gsi_next (&gsi);
> +      while (!gsi_end_p (gsi))
> +	if (gimple_phi_arg_def (phi, 0) == gimple_phi_arg_def (gsi.phi (), 0))
> +	  {
> +	    replace_uses_by (gimple_phi_result (gsi.phi ()),
> +			     gimple_phi_result (phi));
> +	    remove_phi_node (&gsi, true);
> +	  }
> +	else
> +	  gsi_next (&gsi);
>    }
>  }
>
> @@ -443,7 +387,7 @@ static void
>  canonicalize_loop_closed_ssa_form (void)
>  {
>    loop_p loop;
> -  FOR_EACH_LOOP (loop, 0)
> +  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
>      canonicalize_loop_closed_ssa (loop);
>
>    checking_verify_loop_closed_ssa (true);
> @@ -509,6 +453,19 @@ graphite_transform_loops (void)
>  			 "loop nest optimized\n");
>      }
>
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      loop_p loop;
> +      int num_no_dependency = 0;
> +
> +      FOR_EACH_LOOP (loop, 0)
> +	if (loop->can_be_parallel)
> +	  num_no_dependency++;
> +
> +      fprintf (dump_file, "%d loops carried no dependency.\n",
> +	       num_no_dependency);
> +    }
> +
>    free_scops (scops);
>    graphite_finalize (need_cfg_cleanup_p);
>    the_isl_ctx = NULL;
> Index: gcc/testsuite/gcc.dg/graphite/scop-24.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/graphite/scop-24.c	(nonexistent)
> +++ gcc/testsuite/gcc.dg/graphite/scop-24.c	(working copy)
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -floop-nest-optimize" } */
> +
> +typedef struct _IO_FILE FILE;
> +extern struct _IO_FILE *stderr;
> +typedef float real;
> +typedef real rvec[3];
> +int rgbset (int);
> +void ps_box (int, int);
> +void plot_phi(char *fn,rvec box,int natoms,rvec x[],real phi[])
> +{
> +  real phi_max,rr,gg,bb,fac,dx,x0,y0;
> +  int i;
> +  for(i=0; (i<natoms); i++)
> +    phi_max=((phi_max > __builtin_fabs(phi[i]))
> +	     ? phi_max : __builtin_fabs(phi[i]));
> +  if (__builtin_fabs(phi_max)<1.2e-38)
> +    __builtin_fprintf(stderr, "X");
> +  ps_box((real)(fac*box[0]-1),(real)(fac*box[1]-1));
> +  for(i=0; (i<natoms); i++)
> +    {
> +      rr=gg=bb=1.0;
> +      if (phi[i] < 0)
> +	gg=bb=(1.0+(phi[i]/phi_max));
> +      else
> +	rr=gg=(1.0-(phi[i]/phi_max));
> +      rr=rgbset(rr);
> +    }
> +}
>
On Mon, 25 Sep 2017, Richard Biener wrote: > On Fri, 22 Sep 2017, Richard Biener wrote: > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > when reworking canonicalize_loop_closed_ssa. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > -Ofast -march=haswell -floop-nest-optimize are > > > > 61 loop nests "optimized" > > 45 loop nest transforms cancelled because of code generation issues > > 21 loop nest optimizations timed out the 350000 ISL "operations" we allow > > Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize > and 709 sec. with (this was with release checking). > > A single-run has 416.gamess (580s -> 618s), > 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), > 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> > 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will > do a 3-run for those to confirm (it would be only a single regression > for 416.gamess). 416.gamess regression confirmed, 450.soplex improvement as well, in the three-run 462.libquantum regresses (344s -> 351s) so I suppose that's noise. Richard.
On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener <rguenther@suse.de> wrote: > On Mon, 25 Sep 2017, Richard Biener wrote: > >> On Fri, 22 Sep 2017, Richard Biener wrote: >> >> > >> > This simplifies canonicalize_loop_closed_ssa and does other minimal >> > TLC. It also adds a testcase I reduced from a stupid mistake I made >> > when reworking canonicalize_loop_closed_ssa. >> > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. >> > >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with >> > -Ofast -march=haswell -floop-nest-optimize are >> > >> > 61 loop nests "optimized" >> > 45 loop nest transforms cancelled because of code generation issues >> > 21 loop nest optimizations timed out the 350000 ISL "operations" we allow >> >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize >> and 709 sec. with (this was with release checking). >> >> A single-run has 416.gamess (580s -> 618s), >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will >> do a 3-run for those to confirm (it would be only a single regression >> for 416.gamess). > > 416.gamess regression confirmed, 450.soplex improvement as well, 436/437 improvements? 450.soplex (229s -> 226s) looks like noise. Thanks, bin > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose > that's noise. > > Richard.
On Mon, 25 Sep 2017, Bin.Cheng wrote: > On Mon, Sep 25, 2017 at 1:46 PM, Richard Biener <rguenther@suse.de> wrote: > > On Mon, 25 Sep 2017, Richard Biener wrote: > > > >> On Fri, 22 Sep 2017, Richard Biener wrote: > >> > >> > > >> > This simplifies canonicalize_loop_closed_ssa and does other minimal > >> > TLC. It also adds a testcase I reduced from a stupid mistake I made > >> > when reworking canonicalize_loop_closed_ssa. > >> > > >> > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > >> > > >> > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > >> > -Ofast -march=haswell -floop-nest-optimize are > >> > > >> > 61 loop nests "optimized" > >> > 45 loop nest transforms cancelled because of code generation issues > >> > 21 loop nest optimizations timed out the 350000 ISL "operations" we allow > >> > >> Overall compile time (with -j6) is 695 sec. w/o -floop-nest-optimize > >> and 709 sec. with (this was with release checking). > >> > >> A single-run has 416.gamess (580s -> 618s), > >> 436.cactusADM (206s -> 182s), 437.leslie3d (228s ->218s), > >> 450.soplex (229s -> 226s), 465.tonto (428s -> 425s), 401.bzip2 (383s -> > >> 379s), 462.libquantum (352s -> 343s), ignoring +-2s changes. Will > >> do a 3-run for those to confirm (it would be only a single regression > >> for 416.gamess). > > > > 416.gamess regression confirmed, 450.soplex improvement as well, > 436/437 improvements? 450.soplex (229s -> 226s) loops like noise. base is with -floop-nest-optimize, peak without. 
                      base (-floop-nest-optimize)      peak (without)
                      ref    time  ratio               ref    time  ratio
416.gamess           19580    619   31.7 S            19580    576   34.0 *
416.gamess           19580    614   31.9 S            19580    577   33.9 S
416.gamess           19580    618   31.7 *            19580    576   34.0 S
436.cactusADM        11950    194   61.5 S            11950    204   58.5 S
436.cactusADM        11950    184   65.0 S            11950    187   63.8 *
436.cactusADM        11950    186   64.1 *            11950    186   64.1 S
437.leslie3d          9400    219   43.0 S             9400    218   43.1 S
437.leslie3d          9400    219   43.0 *             9400    223   42.1 S
437.leslie3d          9400    218   43.0 S             9400    223   42.2 *
450.soplex            8340    225   37.0 S             8340    231   36.1 S
450.soplex            8340    226   36.9 *             8340    230   36.3 *
450.soplex            8340    227   36.8 S             8340    229   36.4 S
465.tonto             9840    426   23.1 S             9840    427   23.0 *
465.tonto             9840    424   23.2 S             9840    430   22.9 S
465.tonto             9840    425   23.2 *             9840    425   23.2 S
401.bzip2             9650    379   25.5 S             9650    378   25.5 S
401.bzip2             9650    379   25.5 *             9650    380   25.4 *
401.bzip2             9650    379   25.5 S             9650    380   25.4 S
462.libquantum       20720    351   59.0 *            20720    349   59.4 S
462.libquantum       20720    351   59.0 S            20720    345   60.1 *
462.libquantum       20720    352   58.8 S            20720    344   60.2 S

> Thanks,
> bin
> > in the three-run 462.libquantum regresses (344s -> 351s) so I suppose
> > that's noise.
> >
> > Richard.
On Fri, 22 Sep 2017, Sebastian Pop wrote: > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> wrote: > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > when reworking canonicalize_loop_closed_ssa. > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > -Ofast -march=haswell -floop-nest-optimize are > > > > 61 loop nests "optimized" > > 45 loop nest transforms cancelled because of code generation issues > > 21 loop nest optimizations timed out the 350000 ISL "operations" we allow > > > > I say "optimized" because the usual transform I've seen is static tiling > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > There's no way to automagically figure what kind of transform ISL did > > > > Here is how to automate (without magic) the detection > of the transform that isl did. > > The problem solved by isl is the minimization of strides > in memory, and to do this, we need to tell the isl scheduler > the validity dependence graph, in graphite-optimize-isl.c > see the validity (RAW, WAR, WAW) and the proximity > (RAR + validity) maps. The proximity does include the > read after read, as the isl scheduler needs to minimize > strides between consecutive reads. > > When you apply the schedule to the dependence graph, > one can tell from the result the strides in memory, a good > way to say whether a transform was beneficial is to sum up > all memory strides, and make sure that the sum of all strides > decreases after transform. We could add a printf with the > sum of strides before and after transforms, and have the > testcases check for that. Interesting. Can you perhaps show me in code how to do that? Thanks, Richard.
On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote: > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> > wrote: > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > when reworking canonicalize_loop_closed_ssa. > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > 61 loop nests "optimized" > > > 45 loop nest transforms cancelled because of code generation issues > > > 21 loop nest optimizations timed out the 350000 ISL "operations" we > allow > > > > > > I say "optimized" because the usual transform I've seen is static > tiling > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > Here is how to automate (without magic) the detection > > of the transform that isl did. > > > > The problem solved by isl is the minimization of strides > > in memory, and to do this, we need to tell the isl scheduler > > the validity dependence graph, in graphite-optimize-isl.c > > see the validity (RAW, WAR, WAW) and the proximity > > (RAR + validity) maps. The proximity does include the > > read after read, as the isl scheduler needs to minimize > > strides between consecutive reads. > > > > When you apply the schedule to the dependence graph, > > one can tell from the result the strides in memory, a good > > way to say whether a transform was beneficial is to sum up > > all memory strides, and make sure that the sum of all strides > > decreases after transform. We could add a printf with the > > sum of strides before and after transforms, and have the > > testcases check for that. > > Interesting. 
Can you perhaps show me in code how to do that? > > Sven, is there already a function that computes the sum of all strides in a proximity map? Maybe you have code that does something similar in pet or ppcg? Thanks, Sebastian
On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote: > Sven, is there already a function that computes the sum of all > strides in a proximity map? Maybe you have code that does > something similar in pet or ppcg? What exactly do you want to sum? If this involves any counting, then it cannot currently be done in pet or ppcg since isl does not support counting yet and the public version of barvinok is GPL licensed. Also, it's better to ask such questions on the isl mailing list isl-development@googlegroups.com skimo
On Tue, 26 Sep 2017, Sebastian Pop wrote: > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote: > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> > > wrote: > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > 61 loop nests "optimized" > > > > 45 loop nest transforms cancelled because of code generation issues > > > > 21 loop nest optimizations timed out the 350000 ISL "operations" we > > allow > > > > > > > > I say "optimized" because the usual transform I've seen is static > > tiling > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > Here is how to automate (without magic) the detection > > > of the transform that isl did. > > > > > > The problem solved by isl is the minimization of strides > > > in memory, and to do this, we need to tell the isl scheduler > > > the validity dependence graph, in graphite-optimize-isl.c > > > see the validity (RAW, WAR, WAW) and the proximity > > > (RAR + validity) maps. The proximity does include the > > > read after read, as the isl scheduler needs to minimize > > > strides between consecutive reads. Ah, so I now see why we do not perform interchange on trivial cases like double A[1024][1024], B[1024][1024]; void foo(void) { for (int i = 0; i < 1024; ++i) for (int j = 0; j < 1024; ++j) A[j][i] = B[j][i]; } which is probably because /* FIXME: proximity should not be validity. 
*/ isl_union_map *proximity = isl_union_map_copy (validity); falls apart when there is _no_ dependence? I can trick GRAPHITE into performing the interchange for double A[1024][1024], B[1024][1024]; void foo(void) { for (int i = 1; i < 1023; ++i) for (int j = 0; j < 1024; ++j) A[j][i] = B[j][i-1] + A[j][i+1]; } because now there is a dependence. Any idea on how to rewrite scop_get_dependences to avoid "simplifying"? I suppose the validity constraints _do_ also specify kind-of a proximity we just may not prune / optimize them in the same way as dependences? Richard.
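Richard's observation can be made concrete with a toy model (a Python sketch with hypothetical names; this is not the isl scheduler): when the proximity map is just a copy of the validity map, a kernel with no dependences hands the scheduler an empty objective, so every loop order looks equally good and there is no reason to interchange. Adding read-after-read style self-pairs to proximity restores the stride information that distinguishes the two orders.

```python
def cost(proximity_pairs, schedule, elem_size=8):
    """Sum of innermost-loop byte strides over all proximity pairs.
    Each pair is a dict mapping loop index -> linearized coefficient."""
    inner = schedule[-1]
    return sum(abs(coeffs.get(inner, 0)) * elem_size
               for coeffs in proximity_pairs)

# A[j][i] = B[j][i]: no validity dependences, so a proximity map copied
# from validity is empty -- both schedules cost 0, nothing to optimize.
empty_proximity = []
assert cost(empty_proximity, ["i", "j"]) == cost(empty_proximity, ["j", "i"]) == 0

# Model each access's consecutive-iteration reuse as a proximity pair
# (linearized index j*1024 + i): now "j innermost" is expensive and
# "i innermost" is cheap, so interchange is clearly profitable.
rar_proximity = [{"i": 1, "j": 1024}, {"i": 1, "j": 1024}]
print(cost(rar_proximity, ["i", "j"]), cost(rar_proximity, ["j", "i"]))
```

This matches the FIXME quoted above: validity constraints alone say what must be preserved, not what should be made close.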
On Wed, 27 Sep 2017, Richard Biener wrote: > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote: > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> > > > wrote: > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > 61 loop nests "optimized" > > > > > 45 loop nest transforms cancelled because of code generation issues > > > > > 21 loop nest optimizations timed out the 350000 ISL "operations" we > > > allow > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > tiling > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > of the transform that isl did. > > > > > > > > The problem solved by isl is the minimization of strides > > > > in memory, and to do this, we need to tell the isl scheduler > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > (RAR + validity) maps. The proximity does include the > > > > read after read, as the isl scheduler needs to minimize > > > > strides between consecutive reads. 
> > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 0; i < 1024; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i]; > } > > which is probably because > > /* FIXME: proximity should not be validity. */ > isl_union_map *proximity = isl_union_map_copy (validity); > > falls apart when there is _no_ dependence? > > I can trick GRAPHITE into performing the interchange for > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 1; i < 1023; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i-1] + A[j][i+1]; > } > > because now there is a dependence. Any idea on how to rewrite > scop_get_dependences to avoid "simplifying"? I suppose the > validity constraints _do_ also specify kind-of a proximity > we just may not prune / optimize them in the same way as > dependences? Another thing I notice is that we don't handle the multi-dimensional accesses the fortran frontend produces: (gdb) p debug_data_reference (dr) #(Data Ref: # bb: 18 # stmt: _43 = *a_141(D)[_42]; # ref: *a_141(D)[_42]; # base_object: *a_141(D); # Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 ultimatively we fail here because we try to build a constraint for {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 which ends up computing isl_pw_aff_mul (A, stride.88_115) with A being the non-constant constraint generated for {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being a parameter. ISL doesn't like that multiplication as the result isn't affine (well - it is, we just have parameters in there). I suppose ISL doesn't handle this form of accesses given the two "dimensions" in this scalarized form may overlap? So we'd really need to turn those into references with different access functions (even if that's not 100% a valid semantic transformation as scalarization isn't reversible without extra information)? Thanks, Richard.
On Wed, 27 Sep 2017, Richard Biener wrote: > On Wed, 27 Sep 2017, Richard Biener wrote: > > > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> wrote: > > > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> > > > > wrote: > > > > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I made > > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to trunk. > > > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > > > 61 loop nests "optimized" > > > > > > 45 loop nest transforms cancelled because of code generation issues > > > > > > 21 loop nest optimizations timed out the 350000 ISL "operations" we > > > > allow > > > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > > tiling > > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > > There's no way to automagically figure what kind of transform ISL did > > > > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > > of the transform that isl did. > > > > > > > > > > The problem solved by isl is the minimization of strides > > > > > in memory, and to do this, we need to tell the isl scheduler > > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > > (RAR + validity) maps. The proximity does include the > > > > > read after read, as the isl scheduler needs to minimize > > > > > strides between consecutive reads. 
> > > > Ah, so I now see why we do not perform interchange on trivial cases like > > > > double A[1024][1024], B[1024][1024]; > > > > void foo(void) > > { > > for (int i = 0; i < 1024; ++i) > > for (int j = 0; j < 1024; ++j) > > A[j][i] = B[j][i]; > > } > > > > which is probably because > > > > /* FIXME: proximity should not be validity. */ > > isl_union_map *proximity = isl_union_map_copy (validity); > > > > falls apart when there is _no_ dependence? > > > > I can trick GRAPHITE into performing the interchange for > > > > double A[1024][1024], B[1024][1024]; > > > > void foo(void) > > { > > for (int i = 1; i < 1023; ++i) > > for (int j = 0; j < 1024; ++j) > > A[j][i] = B[j][i-1] + A[j][i+1]; > > } > > > > because now there is a dependence. Any idea on how to rewrite > > scop_get_dependences to avoid "simplifying"? I suppose the > > validity constraints _do_ also specify kind-of a proximity > > we just may not prune / optimize them in the same way as > > dependences? > > Another thing I notice is that we don't handle the multi-dimensional > accesses the fortran frontend produces: > > (gdb) p debug_data_reference (dr) > #(Data Ref: > # bb: 18 > # stmt: _43 = *a_141(D)[_42]; > # ref: *a_141(D)[_42]; > # base_object: *a_141(D); > # Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, > stride.88_115}_5 > > ultimatively we fail here because we try to build a constraint for > > {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5 > > which ends up computing isl_pw_aff_mul (A, stride.88_115) with > A being the non-constant constraint generated for > {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being > a parameter. ISL doesn't like that multiplication as the result > isn't affine (well - it is, we just have parameters in there). > > I suppose ISL doesn't handle this form of accesses given the > two "dimensions" in this scalarized form may overlap? 
So we'd > really need to turn those into references with different access > functions (even if that's not 100% a valid semantic transformation > as scalarization isn't reversible without extra information)? Looks like even when hacking the Fortran FE to produce nested ARRAY_REFs we run into the same issue for (gdb) p debug_data_reference (dr) #(Data Ref: # bb: 17 # stmt: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8} = 0.0; # ref: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8}; # base_object: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D)); # Access function 0: {1, +, 1}_4 # Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, (unsigned long) stride.88_92}_3; # Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, (unsigned long) stride.90_96}_2; # Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, (unsigned long) stride.92_100}_1; so it looks like simple strided (where stride is a parameter) access is not handled either. GCCs dependence analysis can at least compute distances of two DRs when the difference of the access CHRECs is constant. Within the polyhedral model those cases cannot be handled? Richard. > Thanks, > Richard. >
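The kind of reasoning GCC's classic dependence analysis applies to the parametric-stride case can be sketched as follows (a toy model, not the tree-data-ref implementation): two access CHRECs {base, +, step} with the same symbolic step have a constant difference whenever the base difference is a constant multiple of the step, and that constant is the dependence distance, even though step * i is not affine in the polyhedral sense when step is a parameter.

```python
from fractions import Fraction

def chrec_distance(base_a, base_b, step):
    """Distance between accesses {base_a, +, step} and {base_b, +, step}
    sharing one (possibly symbolic) step.  Linear expressions are dicts
    mapping parameter name -> coefficient ("" is the constant term).
    Returns k with base_b - base_a == k * step, or None."""
    diff = {k: base_b.get(k, 0) - base_a.get(k, 0)
            for k in set(base_a) | set(base_b)}
    ratios = set()
    for k in set(diff) | set(step):
        s, d = step.get(k, 0), diff.get(k, 0)
        if s == 0:
            if d != 0:
                return None          # leftover term not a multiple of step
        else:
            ratios.add(Fraction(d, s))
    return ratios.pop() if len(ratios) == 1 else None

# a[s*i] vs a[s*i + s]: bases differ by exactly one step -> distance 1,
# even though the stride s is a parameter.
print(chrec_distance({}, {"s": 1}, {"s": 1}))
# a[s*i] vs a[s*i + 1]: the constant 1 is not a multiple of s -> no
# uniform distance.
print(chrec_distance({}, {"": 1}, {"s": 1}))
```

Within a strict affine framework the first case is rejected up front, which is exactly the gap Richard points out.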
Hi skimo, On Tue, Sep 26, 2017 at 10:15 AM, Sven Verdoolaege < sven.verdoolaege@gmail.com> wrote: > On Tue, Sep 26, 2017 at 09:19:50AM -0500, Sebastian Pop wrote: > > Sven, is there already a function that computes the sum of all > > strides in a proximity map? Maybe you have code that does > > something similar in pet or ppcg? > > What exactly do you want to sum? If this involves any counting, then it cannot currently > I think that it does involve counting: we need to know the distance between all pairs of array accesses, that is the number of points in the dependence polyhedron. > be done in pet or ppcg since isl does not support counting yet > and the public version of barvinok is GPL licensed. > > Also, it's better to ask such questions on the isl mailing list > isl-development@googlegroups.com > > We are trying to find a metric that shows that isl's scheduler did a useful transform. Something like a diff tool that shows before and after scheduling the strides of array accesses. Could the isl scheduler output a description of what it did? We would like to use that output to build testcases that match the behavior of the compiler on different patterns. Thanks, Sebastian
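The sum-of-strides metric proposed above can be made concrete. The following is a hypothetical illustration (the function names and the toy address model are not GRAPHITE code): it sums the absolute byte strides between consecutive accesses of a 2-D array, once for the column-major walk from the copy-loop example elsewhere in the thread and once for the interchanged row-major walk; a transform is profitable under this metric exactly when the sum decreases.

```c
#include <stdlib.h>

/* Hypothetical sketch (not GRAPHITE code) of the sum-of-strides
   metric: the sum of absolute byte strides between consecutive
   accesses.  A schedule with a smaller sum has better spatial
   locality.  */
static long
sum_of_strides (const long *addr, int n)
{
  long sum = 0;
  for (int i = 1; i < n; i++)
    sum += labs (addr[i] - addr[i - 1]);
  return sum;
}

/* Byte addresses touched by A[j][i] over a rows x cols array of
   elt-byte elements; i_outer selects the column-major walk (i as
   the outer loop) versus the interchanged row-major walk.  */
static long
stride_metric (int rows, int cols, int elt, int i_outer)
{
  long addr[256];
  int n = 0;
  if (i_outer)
    for (int i = 0; i < cols; i++)
      for (int j = 0; j < rows; j++)
        addr[n++] = (long) (j * cols + i) * elt;
  else
    for (int j = 0; j < rows; j++)
      for (int i = 0; i < cols; i++)
        addr[n++] = (long) (j * cols + i) * elt;
  return sum_of_strides (addr, n);
}
```

On a 4x4 array of doubles the interchanged (row-major) walk sums to 120 bytes of strides while the column-major walk sums to 648, so the interchange lowers the metric.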
On Wed, Sep 27, 2017 at 7:18 AM, Richard Biener <rguenther@suse.de> wrote: > On Tue, 26 Sep 2017, Sebastian Pop wrote: > > > On Mon, Sep 25, 2017 at 8:12 AM, Richard Biener <rguenther@suse.de> > wrote: > > > > > On Fri, 22 Sep 2017, Sebastian Pop wrote: > > > > > > > On Fri, Sep 22, 2017 at 8:03 AM, Richard Biener <rguenther@suse.de> > > > wrote: > > > > > > > > > > > > > > This simplifies canonicalize_loop_closed_ssa and does other minimal > > > > > TLC. It also adds a testcase I reduced from a stupid mistake I > made > > > > > when reworking canonicalize_loop_closed_ssa. > > > > > > > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied to > trunk. > > > > > > > > > > SPEC CPU 2006 is happy with it, current statistics on x86_64 with > > > > > -Ofast -march=haswell -floop-nest-optimize are > > > > > > > > > > 61 loop nests "optimized" > > > > > 45 loop nest transforms cancelled because of code generation > issues > > > > > 21 loop nest optimizations timed out the 350000 ISL "operations" > we > > > allow > > > > > > > > > > I say "optimized" because the usual transform I've seen is static > > > tiling > > > > > as enforced by GRAPHITE according to --param loop-block-tile-size. > > > > > There's no way to automagically figure what kind of transform ISL > did > > > > > > > > > > > > > Here is how to automate (without magic) the detection > > > > of the transform that isl did. > > > > > > > > The problem solved by isl is the minimization of strides > > > > in memory, and to do this, we need to tell the isl scheduler > > > > the validity dependence graph, in graphite-optimize-isl.c > > > > see the validity (RAW, WAR, WAW) and the proximity > > > > (RAR + validity) maps. The proximity does include the > > > > read after read, as the isl scheduler needs to minimize > > > > strides between consecutive reads. 
> > Ah, so I now see why we do not perform interchange on trivial cases like > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 0; i < 1024; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i]; > } > > which is probably because > > /* FIXME: proximity should not be validity. */ > isl_union_map *proximity = isl_union_map_copy (validity); > > falls apart when there is _no_ dependence? > You are right. The proximity needs to account for spatial locality as well if you want to interchange the loop. To describe the spatial locality, I would recommend adding to the proximity relation the array accesses from two successive iterations of the innermost loop: A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] With these two extra relations in the proximity map, isl should be able to interchange the above loop. > > I can trick GRAPHITE into performing the interchange for > > double A[1024][1024], B[1024][1024]; > > void foo(void) > { > for (int i = 1; i < 1023; ++i) > for (int j = 0; j < 1024; ++j) > A[j][i] = B[j][i-1] + A[j][i+1]; > } > > because now there is a dependence. Any idea on how to rewrite > scop_get_dependences to avoid "simplifying"? I suppose the > validity constraints _do_ also specify kind-of a proximity > Correct: the validity map specifies a subset (it is missing RAR dependences) of data reuse. > we just may not prune / optimize them in the same way as > dependences? > Validity constraints are there to "keep the wind blowing in the same direction" after the transform (otherwise the result of the transformed computation may be wrong.) The proximity map should contain a description of - reuse of memory (temporal locality) - how close together the access elements are (spatial locality.) isl will optimize for both if the proximity map has a description of both. 
For the moment, the proximity map is initialized with just the current
validity constraints (the FIXME comment you quoted), and so it describes
only a subset of the temporal locality.

Sebastian
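To make the proposal concrete: for the copy loop above, write S(i, j) for the statement instance executing A[j][i] = B[j][i] (the name S is illustrative). The elements A[j][i] and A[j][i+1] are touched by S(i, j) and S(i+1, j), so pulled back onto statement instances the extra spatial-locality relation would look, in isl's union-map notation, roughly like:

```
{ S[i, j] -> S[i + 1, j] : 0 <= i <= 1022 and 0 <= j <= 1023 }
```

Minimizing the schedule distance between S[i, j] and S[i + 1, j] pushes the i loop innermost, which is exactly the desired interchange.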
On Wed, Sep 27, 2017 at 8:04 AM, Richard Biener <rguenther@suse.de> wrote:
>
> Another thing I notice is that we don't handle the multi-dimensional
> accesses the fortran frontend produces:
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 18
> #  stmt: _43 = *a_141(D)[_42];
> #  ref: *a_141(D)[_42];
> #  base_object: *a_141(D);
> #  Access function 0: {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
>
> ultimately we fail here because we try to build a constraint for
>
> {{(_38 + stride.88_115) + 1, +, 1}_4, +, stride.88_115}_5
>
> which ends up computing isl_pw_aff_mul (A, stride.88_115) with
> A being the non-constant constraint generated for
> {(_38 + stride.88_115) + 1, +, 1}_4 and stride.88_115 being
> a parameter.  ISL doesn't like that multiplication as the result
> isn't affine (well - it is, we just have parameters in there).
>
> I suppose ISL doesn't handle this form of accesses given the
> two "dimensions" in this scalarized form may overlap?  So we'd
> really need to turn those into references with different access
> functions (even if that's not 100% a valid semantic transformation
> as scalarization isn't reversible without extra information)?

You are right.
This multivariate memory access would be better handled in delinearized form: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66981 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61000 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14741 There are two ways to handle this issue: - fix the FORTRAN front-end to emit multi dimensions ARRAY_REFs, - implement an array delinearization pass, as I implemented in LLVM http://llvm.org/doxygen/Delinearization_8cpp_source.html "On Recovering Multi-Dimensional Arrays in Polly" http://impact.gforge.inria.fr/impact2015/papers/impact2015-grosser.pdf "Optimistic Delinearization of Parametrically Sized Arrays" https://dl.acm.org/citation.cfm?id=2751248 LLVM does not have an equivalent for multi-dim ARRAY_REF description it only reasons about linearized memory accesses like in GCC's RTL: gep = Get Element Pointer, so we had no other option than to delinearize. Sebastian
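The core arithmetic of delinearization can be sketched as follows. This is a toy illustration, not the LLVM implementation: given a linearized subscript off = i * n + j, the two subscripts are recovered by integer division, and the result is valid only under the optimistic assumption 0 <= j < n, which a real implementation must validate (or guard at runtime).

```c
/* Toy sketch of the delinearization step: recover (i, j) from the
   linearized subscript off = i * n + j.  Correct only under the
   optimistic assumption 0 <= j < n, which a real implementation
   must check or guard at runtime.  */
static void
delinearize (long off, long n, long *i, long *j)
{
  *i = off / n;
  *j = off % n;
}
```

Once the subscripts are separated this way, each one is affine in the iterators, which is what puts the access back within reach of the polyhedral model.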
On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener <rguenther@suse.de> wrote:
> Looks like even when hacking the Fortran FE to produce nested
> ARRAY_REFs we run into the same issue for
>
> (gdb) p debug_data_reference (dr)
> #(Data Ref:
> #  bb: 17
> #  stmt: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8} = 0.0;
> #  ref: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 sz: 8};
> #  base_object: VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D));
> #  Access function 0: {1, +, 1}_4
> #  Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, (unsigned long) stride.88_92}_3;
> #  Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, (unsigned long) stride.90_96}_2;
> #  Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, (unsigned long) stride.92_100}_1;
>
> so it looks like simple strided (where stride is a parameter) access
> is not handled either.

Yes, this is the first option I was mentioning.  It could work; please
make sure that you don't have a bug in the "hack patch" where the outer
dimension ends up containing the parameter (the inner array dimension)
times the access function.

Example in C:

  int A[100][N];

A[i][j] is linearized as *(A + i * N * 4 + j * 4), and you may have a bug
if you delinearized it in the Fortran FE as A[i * N][j].  Could you please
check that it delinearizes back to A[i][j]?

> GCC's dependence analysis can at least compute distances of two
> DRs when the difference of the access CHRECs is constant.  Within
> the polyhedral model those cases cannot be handled?
The difficulty for the polyhedral model is in representing the
multiplication of a parameter by a loop index variable.  Delinearization
removes this difficulty by producing linear expressions.

Think of such a multiplication as introducing non-linearity, and you
realize that the resulting expression cannot fit in the linear model of
polyhedra.  A parameter is in effect an index of an outer loop we have
no access to, as that loop may lie outside the current function that
receives the parameter.

Sebastian
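This point can be stated precisely. The polyhedral model requires every access function to be affine in the iterators and parameters, with constant integer coefficients:

```latex
% Affine form required by the polyhedral model: constant coefficients only.
f(i_1,\dots,i_d;\,p_1,\dots,p_m)
  \;=\; \sum_{k=1}^{d} a_k\, i_k \;+\; \sum_{l=1}^{m} b_l\, p_l \;+\; c,
\qquad a_k,\, b_l,\, c \in \mathbb{Z}.
```

The Fortran access function above contains the term $\mathrm{stride} \cdot i$, a parameter multiplied by an iterator: the coefficient of $i$ is not a constant, so no choice of $a_k$, $b_l$, $c$ can express it, while after delinearization each recovered subscript is again of the affine form.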
On Thu, 28 Sep 2017, Sebastian Pop wrote: > On Wed, Sep 27, 2017 at 9:33 AM, Richard Biener <rguenther@suse.de> wrote: > > Looks like even when hacking the Fortran FE to produce nested > > ARRAY_REFs we run into the same issue for > > > > (gdb) p debug_data_reference (dr) > > #(Data Ref: > > # bb: 17 > > # stmt: > > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: > > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > > sz: 8} = 0.0; > > # ref: > > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D))[_24]{lb: > > 1 sz: _20 * 8}[_26]{lb: 1 sz: _21 * 8}[_28]{lb: 1 sz: _22 * 8}[_29]{lb: 1 > > sz: 8}; > > # base_object: > > VIEW_CONVERT_EXPR<real(kind=8)[1:ubound.91][1:ubound.89][1:ubound.87][1:ubound.86]>(*y_117(D)); > > # Access function 0: {1, +, 1}_4 > > # Access function 1: (integer(kind=8)) {(unsigned long) stride.88_92, +, > > (unsigned long) stride.88_92}_3; > > # Access function 2: (integer(kind=8)) {(unsigned long) stride.90_96, +, > > (unsigned long) stride.90_96}_2; > > # Access function 3: (integer(kind=8)) {(unsigned long) stride.92_100, +, > > (unsigned long) stride.92_100}_1; > > > > so it looks like simple strided (where stride is a parameter) access > > is not handled either. > > Yes, this is the first option I was mentioning: it could work, > could you please make sure that you don't have a bug in the "hack patch" > where the outer dimension should not contain the parameter > (inner array dimension) times the access function. > > Example in C: > int A[100][N]; > A[i][j] is linearized as *(A + i * N * 4 + j * 4) > and you may have a bug if you delinearized it in the Fortran FE as A[i * N][j] > Could you please check that it would delinearize back to A[i][j]? 
I fixed the "hack patch" somewhat but realized it's not really possible ATM to get back at this form because the array descriptor contains only information to generate the linearized form. So while I get now correct values they end up with runtime divisions which aren't handled by SCEV. I fear emitting delinearized code from the Fortran FE would be an ABI breakage as we'd have to change the array descriptor contents. > > > > GCCs dependence analysis can at least compute distances of two > > DRs when the difference of the access CHRECs is constant. Within > > the polyhedral model those cases cannot be handled? > > The difficulty for the polyhedral model is in the representation > of a multiplication of parameter times loop index variable. > The delinearization removes these difficulties by creating linear expressions. > > Think about multiplication as something introducing exponentiality > and you realize that any such expression would not fit in the > linear model of polyhedra. > A parameter is nothing else than an outer loop index to which we don't > have access to that loop level as it may be outside the current function > in which we get that parameter in. Yeah, I see that now. Richard.
On Fri, Sep 29, 2017 at 6:17 AM, Richard Biener <rguenther@suse.de> wrote:
> I fixed the "hack patch" somewhat but realized it's not really possible
> ATM to get back at this form because the array descriptor contains only
> information to generate the linearized form.  So while I get now correct
> values they end up with runtime divisions which aren't handled by
> SCEV.

You are right, SCEV has some limits on representing and folding those
division expressions.  There is a proposal in LLVM from Johannes Doerfert,
https://reviews.llvm.org/D38255, to use isl as the representation and
expression folder for the scalar evolution analysis, instead of the chains
of recurrences.  isl would be able to handle some of the semantics of the
div_exprs and of wrap-around variables; of course it would still have
limits in representing multiplications (i * N or M * N, for example, as we
spoke about yesterday), and thus that polyhedral analysis would need to
rely on the delinearization of array indices.
On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote:
> Ah, so I now see why we do not perform interchange on trivial cases like
>
> double A[1024][1024], B[1024][1024];
>
> void foo(void)
> {
>   for (int i = 0; i < 1024; ++i)
>     for (int j = 0; j < 1024; ++j)
>       A[j][i] = B[j][i];
> }

I didn't see you mentioning _why_ you expect an interchange here.
Are you perhaps interested in spatial locality?
If so, then there are several approaches for taking that into account.
- pluto performs an intra-tile loop interchange to improve temporal
  and/or spatial locality.  It shouldn't be too hard to do something
  similar on an isl generated schedule
- Alex (Oleksandr) has been working on an extension of the isl
  scheduler that takes into account spatial locality.  I'm not sure
  if it's publicly available.
- I've been working on a special case of spatial locality
  (consecutivity).  The current version is available in the
  consecutivity branch.  Note that it may get rebased and it may not
  necessarily get merged into master.

There are also other approaches, but they may not be that easy to
combine with the isl scheduler.

skimo
On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege <sven.verdoolaege@gmail.com> wrote: > [Sorry for the resend; I used the wrong email address to CC Alex] > > On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: >> Ah, so I now see why we do not perform interchange on trivial cases like >> >> double A[1024][1024], B[1024][1024]; >> >> void foo(void) >> { >> for (int i = 0; i < 1024; ++i) >> for (int j = 0; j < 1024; ++j) >> A[j][i] = B[j][i]; >> } > > I didn't see you mentioning _why_ you expect an interchange here. > Are you prehaps interested in spatial locality? > If so, then there are several approaches for taking > that into account. > - pluto performs an intra-tile loop interchange to > improve temporal and/or spatial locality. It shouldn't > be too hard to do something similar on an isl generated > schedule > - Alex (Oleksandr) has been working on an extension of > the isl scheduler that takes into account spatial locality. > I'm not sure if it's publicly available. > - I've been working on a special case of spatial locality > (consecutivity). The current version is available in > the consecutivity branch. Note that it may get rebased and > it may not necessarily get merged into master. > > There are also other approaches, but they may not be that > easy to combine with the isl scheduler. Would the following work? Add to the proximity relation the array accesses from two successive iterations of the innermost loop: A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] With these two extra relations in the proximity map, isl should be able to interchange the above loop. Sebastian
On 29/09/17 21:58, Sebastian Pop wrote: > On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege > <sven.verdoolaege@gmail.com> wrote: >> [Sorry for the resend; I used the wrong email address to CC Alex] >> >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: >>> Ah, so I now see why we do not perform interchange on trivial cases like >>> >>> double A[1024][1024], B[1024][1024]; >>> >>> void foo(void) >>> { >>> for (int i = 0; i < 1024; ++i) >>> for (int j = 0; j < 1024; ++j) >>> A[j][i] = B[j][i]; >>> } >> I didn't see you mentioning _why_ you expect an interchange here. >> Are you prehaps interested in spatial locality? >> If so, then there are several approaches for taking >> that into account. >> - pluto performs an intra-tile loop interchange to >> improve temporal and/or spatial locality. It shouldn't >> be too hard to do something similar on an isl generated >> schedule >> - Alex (Oleksandr) has been working on an extension of >> the isl scheduler that takes into account spatial locality. >> I'm not sure if it's publicly available. >> - I've been working on a special case of spatial locality >> (consecutivity). The current version is available in >> the consecutivity branch. Note that it may get rebased and >> it may not necessarily get merged into master. >> >> There are also other approaches, but they may not be that >> easy to combine with the isl scheduler. > Would the following work? > Add to the proximity relation the array accesses from two > successive iterations of the innermost loop: > A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] > With these two extra relations in the proximity map, > isl should be able to interchange the above loop. > > Sebastian Hi, this looks very close to what we do for spatial locality in the scheduler, except that we separate proximity and "spatial proximity" maps. 
There are a couple of caveats in just plugging those into proximity, in
particular resolving conflicts between spatial and temporal locality and
avoiding unnecessary skewing.

Cheers,
Alex
On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop <sebpop@gmail.com> wrote: >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege ><sven.verdoolaege@gmail.com> wrote: >> [Sorry for the resend; I used the wrong email address to CC Alex] >> >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: >>> Ah, so I now see why we do not perform interchange on trivial cases >like >>> >>> double A[1024][1024], B[1024][1024]; >>> >>> void foo(void) >>> { >>> for (int i = 0; i < 1024; ++i) >>> for (int j = 0; j < 1024; ++j) >>> A[j][i] = B[j][i]; >>> } >> >> I didn't see you mentioning _why_ you expect an interchange here. >> Are you prehaps interested in spatial locality? >> If so, then there are several approaches for taking >> that into account. >> - pluto performs an intra-tile loop interchange to >> improve temporal and/or spatial locality. It shouldn't >> be too hard to do something similar on an isl generated >> schedule >> - Alex (Oleksandr) has been working on an extension of >> the isl scheduler that takes into account spatial locality. >> I'm not sure if it's publicly available. >> - I've been working on a special case of spatial locality >> (consecutivity). The current version is available in >> the consecutivity branch. Note that it may get rebased and >> it may not necessarily get merged into master. >> >> There are also other approaches, but they may not be that >> easy to combine with the isl scheduler. > >Would the following work? >Add to the proximity relation the array accesses from two >successive iterations of the innermost loop: >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] >With these two extra relations in the proximity map, >isl should be able to interchange the above loop. Can anyone give a hint on how to do that in ISL terms? Richard. >Sebastian
On Sat, Sep 30, 2017 at 07:47:43PM +0200, Richard Biener wrote: > On September 29, 2017 9:58:41 PM GMT+02:00, Sebastian Pop <sebpop@gmail.com> wrote: > >On Fri, Sep 29, 2017 at 2:37 PM, Sven Verdoolaege > ><sven.verdoolaege@gmail.com> wrote: > >> [Sorry for the resend; I used the wrong email address to CC Alex] > >> > >> On Wed, Sep 27, 2017 at 02:18:51PM +0200, Richard Biener wrote: > >>> Ah, so I now see why we do not perform interchange on trivial cases > >like > >>> > >>> double A[1024][1024], B[1024][1024]; > >>> > >>> void foo(void) > >>> { > >>> for (int i = 0; i < 1024; ++i) > >>> for (int j = 0; j < 1024; ++j) > >>> A[j][i] = B[j][i]; > >>> } > >> > >> I didn't see you mentioning _why_ you expect an interchange here. > >> Are you prehaps interested in spatial locality? > >> If so, then there are several approaches for taking > >> that into account. > >> - pluto performs an intra-tile loop interchange to > >> improve temporal and/or spatial locality. It shouldn't > >> be too hard to do something similar on an isl generated > >> schedule > >> - Alex (Oleksandr) has been working on an extension of > >> the isl scheduler that takes into account spatial locality. > >> I'm not sure if it's publicly available. > >> - I've been working on a special case of spatial locality > >> (consecutivity). The current version is available in > >> the consecutivity branch. Note that it may get rebased and > >> it may not necessarily get merged into master. > >> > >> There are also other approaches, but they may not be that > >> easy to combine with the isl scheduler. > > > >Would the following work? > >Add to the proximity relation the array accesses from two > >successive iterations of the innermost loop: > >A[j][i] -> A[j][i+1] and B[j][i] -> B[j][i+1] > >With these two extra relations in the proximity map, > >isl should be able to interchange the above loop. > > Can anyone give a hint on how to do that in ISL terms? 
For the approach pluto is taking, you'll have to look at the source code, see pluto_intra_tile_optimize_band. For the other two approaches I mentioned above, reports will be made available within the next couple of weeks. For the last one, there is a sample implementation in the consecutivity branch of PPCG. skimo
On Sun, Oct 01, 2017 at 11:58:30AM +0200, Sven Verdoolaege wrote: > For the approach pluto is taking, you'll have to look at the source > code, see pluto_intra_tile_optimize_band. > For the other two approaches I mentioned above, reports will > be made available within the next couple of weeks. https://hal.inria.fr/hal-01628798 http://www.cs.kuleuven.be/publicaties/rapporten/cw/CW709.abs.html skimo
Index: gcc/graphite-isl-ast-to-gimple.c =================================================================== --- gcc/graphite-isl-ast-to-gimple.c (revision 253091) +++ gcc/graphite-isl-ast-to-gimple.c (working copy) @@ -73,15 +73,6 @@ struct ast_build_info bool is_parallelizable; }; -/* Verifies properties that GRAPHITE should maintain during translation. */ - -static inline void -graphite_verify (void) -{ - checking_verify_loop_structure (); - checking_verify_loop_closed_ssa (true); -} - /* IVS_PARAMS maps isl's scattering and parameter identifiers to corresponding trees. */ @@ -2997,8 +2988,9 @@ graphite_regenerate_ast_isl (scop_p scop delete_loop (loop); } - graphite_verify (); - scev_reset (); + /* Verifies properties that GRAPHITE should maintain during translation. */ + checking_verify_loop_structure (); + checking_verify_loop_closed_ssa (true); free (if_region->true_region); free (if_region->region); @@ -3008,19 +3000,6 @@ graphite_regenerate_ast_isl (scop_p scop isl_ast_node_free (root_node); timevar_pop (TV_GRAPHITE_CODE_GEN); - if (dump_file && (dump_flags & TDF_DETAILS)) - { - loop_p loop; - int num_no_dependency = 0; - - FOR_EACH_LOOP (loop, 0) - if (loop->can_be_parallel) - num_no_dependency++; - - fprintf (dump_file, "%d loops carried no dependency.\n", - num_no_dependency); - } - return !t.codegen_error_p (); } Index: gcc/graphite-optimize-isl.c =================================================================== --- gcc/graphite-optimize-isl.c (revision 253091) +++ gcc/graphite-optimize-isl.c (working copy) @@ -37,6 +37,7 @@ along with GCC; see the file COPYING3. 
#include "tree-data-ref.h" #include "params.h" #include "dumpfile.h" +#include "tree-vectorizer.h" #include "graphite.h" @@ -156,9 +157,12 @@ optimize_isl (scop_p scop) if (!scop->transformed_schedule || isl_ctx_last_error (scop->isl_context) == isl_error_quota) { - if (dump_file && dump_flags) - fprintf (dump_file, "isl timed out --param max-isl-operations=%d\n", - max_operations); + location_t loc = find_loop_location + (scop->scop_info->region.entry->dest->loop_father); + dump_printf_loc (MSG_MISSED_OPTIMIZATION, loc, + "loop nest not optimized, optimization timed out " + "after %d operations [--param max-isl-operations]\n", + max_operations); return false; } Index: gcc/graphite.c =================================================================== --- gcc/graphite.c (revision 253091) +++ gcc/graphite.c (working copy) @@ -293,86 +293,6 @@ free_scops (vec<scop_p> scops) scops.release (); } -/* Returns true when P1 and P2 are close phis with the same - argument. */ - -static inline bool -same_close_phi_node (gphi *p1, gphi *p2) -{ - return (types_compatible_p (TREE_TYPE (gimple_phi_result (p1)), - TREE_TYPE (gimple_phi_result (p2))) - && operand_equal_p (gimple_phi_arg_def (p1, 0), - gimple_phi_arg_def (p2, 0), 0)); -} - -static void make_close_phi_nodes_unique (basic_block bb); - -/* Remove the close phi node at GSI and replace its rhs with the rhs - of PHI. */ - -static void -remove_duplicate_close_phi (gphi *phi, gphi_iterator *gsi) -{ - gimple *use_stmt; - use_operand_p use_p; - imm_use_iterator imm_iter; - tree res = gimple_phi_result (phi); - tree def = gimple_phi_result (gsi->phi ()); - - gcc_assert (same_close_phi_node (phi, gsi->phi ())); - - FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def) - { - FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter) - SET_USE (use_p, res); - - update_stmt (use_stmt); - - /* It is possible that we just created a duplicate close-phi - for an already-processed containing loop. Check for this - case and clean it up. 
*/ - if (gimple_code (use_stmt) == GIMPLE_PHI - && gimple_phi_num_args (use_stmt) == 1) - make_close_phi_nodes_unique (gimple_bb (use_stmt)); - } - - remove_phi_node (gsi, true); -} - -/* Removes all the close phi duplicates from BB. */ - -static void -make_close_phi_nodes_unique (basic_block bb) -{ - gphi_iterator psi; - - for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi)) - { - gphi_iterator gsi = psi; - gphi *phi = psi.phi (); - - /* At this point, PHI should be a close phi in normal form. */ - gcc_assert (gimple_phi_num_args (phi) == 1); - - /* Iterate over the next phis and remove duplicates. */ - gsi_next (&gsi); - while (!gsi_end_p (gsi)) - if (same_close_phi_node (phi, gsi.phi ())) - remove_duplicate_close_phi (phi, &gsi); - else - gsi_next (&gsi); - } -} - -/* Return true when NAME is defined in LOOP. */ - -static bool -defined_in_loop_p (tree name, loop_p loop) -{ - gcc_assert (TREE_CODE (name) == SSA_NAME); - return loop == loop_containing_stmt (SSA_NAME_DEF_STMT (name)); -} - /* Transforms LOOP to the canonical loop closed SSA form. */ static void @@ -380,20 +300,22 @@ canonicalize_loop_closed_ssa (loop_p loo { edge e = single_exit (loop); basic_block bb; + gphi_iterator psi; if (!e || (e->flags & EDGE_COMPLEX)) return; bb = e->dest; + /* Make the loop-close PHI node BB contain only PHIs and have a + single predecessor. */ if (single_pred_p (bb)) { e = split_block_after_labels (bb); - make_close_phi_nodes_unique (e->src); + bb = e->src; } else { - gphi_iterator psi; basic_block close = split_edge (e); e = single_succ_edge (close); for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi)) @@ -404,7 +326,7 @@ canonicalize_loop_closed_ssa (loop_p loo /* Only add close phi nodes for SSA_NAMEs defined in LOOP. 
*/ if (TREE_CODE (arg) != SSA_NAME - || !defined_in_loop_p (arg, loop)) + || loop_containing_stmt (SSA_NAME_DEF_STMT (arg)) != loop) continue; tree res = copy_ssa_name (arg); @@ -413,8 +335,30 @@ canonicalize_loop_closed_ssa (loop_p loo UNKNOWN_LOCATION); SET_USE (use_p, res); } + bb = close; + } - make_close_phi_nodes_unique (close); + /* Eliminate duplicates. This relies on processing loops from + innermost to outer. */ + for (psi = gsi_start_phis (bb); !gsi_end_p (psi); gsi_next (&psi)) + { + gphi_iterator gsi = psi; + gphi *phi = psi.phi (); + + /* At this point, PHI should be a close phi in normal form. */ + gcc_assert (gimple_phi_num_args (phi) == 1); + + /* Iterate over the next phis and remove duplicates. */ + gsi_next (&gsi); + while (!gsi_end_p (gsi)) + if (gimple_phi_arg_def (phi, 0) == gimple_phi_arg_def (gsi.phi (), 0)) + { + replace_uses_by (gimple_phi_result (gsi.phi ()), + gimple_phi_result (phi)); + remove_phi_node (&gsi, true); + } + else + gsi_next (&gsi); } } @@ -443,7 +387,7 @@ static void canonicalize_loop_closed_ssa_form (void) { loop_p loop; - FOR_EACH_LOOP (loop, 0) + FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) canonicalize_loop_closed_ssa (loop); checking_verify_loop_closed_ssa (true); @@ -509,6 +453,19 @@ graphite_transform_loops (void) "loop nest optimized\n"); } + if (dump_file && (dump_flags & TDF_DETAILS)) + { + loop_p loop; + int num_no_dependency = 0; + + FOR_EACH_LOOP (loop, 0) + if (loop->can_be_parallel) + num_no_dependency++; + + fprintf (dump_file, "%d loops carried no dependency.\n", + num_no_dependency); + } + free_scops (scops); graphite_finalize (need_cfg_cleanup_p); the_isl_ctx = NULL; Index: gcc/testsuite/gcc.dg/graphite/scop-24.c =================================================================== --- gcc/testsuite/gcc.dg/graphite/scop-24.c (nonexistent) +++ gcc/testsuite/gcc.dg/graphite/scop-24.c (working copy) @@ -0,0 +1,29 @@ +/* { dg-do compile } */ +/* { dg-options "-Ofast -floop-nest-optimize" } */ + +typedef struct 
_IO_FILE FILE; +extern struct _IO_FILE *stderr; +typedef float real; +typedef real rvec[3]; +int rgbset (int); +void ps_box (int, int); +void plot_phi(char *fn,rvec box,int natoms,rvec x[],real phi[]) +{ + real phi_max,rr,gg,bb,fac,dx,x0,y0; + int i; + for(i=0; (i<natoms); i++) + phi_max=((phi_max > __builtin_fabs(phi[i])) + ? phi_max : __builtin_fabs(phi[i])); + if (__builtin_fabs(phi_max)<1.2e-38) + __builtin_fprintf(stderr, "X"); + ps_box((real)(fac*box[0]-1),(real)(fac*box[1]-1)); + for(i=0; (i<natoms); i++) + { + rr=gg=bb=1.0; + if (phi[i] < 0) + gg=bb=(1.0+(phi[i]/phi_max)); + else + rr=gg=(1.0-(phi[i]/phi_max)); + rr=rgbset(rr); + } +}