diff mbox

[gomp4] Propagate independent clause for OpenACC kernels pass

Message ID 55A4A21C.1070004@codesourcery.com
State New
Headers show

Commit Message

Chung-Lin Tang July 14, 2015, 5:46 a.m. UTC
Hi Tom,
this patch provides a 'bool independent' field in struct loop, which
will be switched on by an "independent" clause in a #pragma acc loop directive.
I assume you'll be wiring it to the kernels parloops pass in a followup patch.

Note: there are already a few other similar fields in struct loop, namely
'safelen' and 'can_be_parallel', used by OMP simd safelen and GRAPHITE respectively.
The intention and/or setting of these fields are all a bit different, so I've
decided to add a new bool for OpenACC.

Tested and committed to gomp-4_0-branch.

Chung-Lin

2015-07-14  Chung-Lin Tang  <cltang@codesourcery.com>

        * cfgloop.h (struct loop): Add 'bool marked_independent' field.
        * gimplify.c (gimplify_scan_omp_clauses): Keep OMP_CLAUSE_INDEPENDENT.
        * omp-low.c (struct omp_region): Add 'int kind' and
        'bool independent' fields.
        (expand_omp_for): Set 'marked_independent' field for loop
        corresponding to region.
        (find_omp_for_region_data): New function.
        (find_omp_target_region_data): Set kind field.
        (build_omp_regions_1): Call find_omp_for_region_data() for
        GIMPLE_OMP_FOR statements.

Comments

Jakub Jelinek July 14, 2015, 7 a.m. UTC | #1
On Tue, Jul 14, 2015 at 01:46:04PM +0800, Chung-Lin Tang wrote:
> this patch provides a 'bool independent' field in struct loop, which
> will be switched on by an "independent" clause in a #pragma acc loop directive.
> I assume you'll be wiring it to the kernels parloops pass in a followup patch.
> 
> Note: there are already a few other similar fields in struct loop, namely
> 'safelen' and 'can_be_parallel', used by OMP simd safelen and GRAPHITE respectively.
> The intention and/or setting of these fields are all a bit different, so I've
> decided to add a new bool for OpenACC.

How is it different though?  Can you cite exact definition of the
independent clause vs. safelen (set to INT_MAX)?
The OpenMP definition is:
"A SIMD loop has logical iterations numbered 0,1,...,N-1 where N is the
number of loop iterations, and the logical numbering denotes the sequence in which the iterations would
be executed if the associated loop(s) were executed with no SIMD instructions. If the safelen
clause is used then no two iterations executed concurrently with SIMD instructions can have a
greater distance in the logical iteration space than its value."
...
"Lexical forward dependencies in the iterations of the
original loop must be preserved within each SIMD chunk."

So e.g. safelen >= 32 means for PTX you can safely implement it by
running up to 32 consecutive iterations by all threads in the warp
(assuming code that for some reason must be run by a single thread
(e.g. calls to functions that are marked so that they expect to be run
by the first thread in a warp initially) is run sequentially by increasing
iterator), but it doesn't mean the iterations have no dependencies in between
them whatsoever (see the above note about lexical forward dependencies),
so you can't parallelize it by assigning different iterations to different
threads outside of warp (or pthread_create created threads).
So if OpenACC independent means there are no dependencies in between
iterations, the OpenMP counterpart here is #pragma omp for simd schedule (auto)
or #pragma omp distribute parallel for simd schedule (auto).

	Jakub
Chung-Lin Tang July 14, 2015, 9:35 a.m. UTC | #2
On 15/7/14 3:00 PM, Jakub Jelinek wrote:
> On Tue, Jul 14, 2015 at 01:46:04PM +0800, Chung-Lin Tang wrote:
>> this patch provides a 'bool independent' field in struct loop, which
>> will be switched on by an "independent" clause in a #pragma acc loop directive.
>> I assume you'll be wiring it to the kernels parloops pass in a followup patch.
>>
>> Note: there are already a few other similar fields in struct loop, namely
>> 'safelen' and 'can_be_parallel', used by OMP simd safelen and GRAPHITE respectively.
>> The intention and/or setting of these fields are all a bit different, so I've
>> decided to add a new bool for OpenACC.
> 
> How is it different though?  Can you cite exact definition of the
> independent clause vs. safelen (set to INT_MAX)?
> The OpenMP definition is:
> "A SIMD loop has logical iterations numbered 0,1,...,N-1 where N is the
> number of loop iterations, and the logical numbering denotes the sequence in which the iterations would
> be executed if the associated loop(s) were executed with no SIMD instructions. If the safelen
> clause is used then no two iterations executed concurrently with SIMD instructions can have a
> greater distance in the logical iteration space than its value."
> ...
> "Lexical forward dependencies in the iterations of the
> original loop must be preserved within each SIMD chunk."

The wording of OpenACC independent is more simple:
"... the independent clause tells the implementation that the iterations of this loop
are data-independent with respect to each other." -- OpenACC spec 2.7.9

I would say this implies even more relaxed conditions than OpenMP simd safelen,
essentially saying that the compiler doesn't even need dependence analysis; just
assume independence of iterations.

> So e.g. safelen >= 32 means for PTX you can safely implement it by
> running up to 32 consecutive iterations by all threads in the warp
> (assuming code that for some reason must be run by a single thread
> (e.g. calls to functions that are marked so that they expect to be run
> by the first thread in a warp initially) is run sequentially by increasing
> iterator), but it doesn't mean the iterations have no dependencies in between
> them whatsoever (see the above note about lexical forward dependencies),
> so you can't parallelize it by assigning different iterations to different
> threads outside of warp (or pthread_create created threads).

> So if OpenACC independent means there are no dependencies in between
> iterations, the OpenMP counterpart here is #pragma omp for simd schedule (auto)
> or #pragma omp distribute parallel for simd schedule (auto).

schedule(auto) appears to correspond to the OpenACC 'auto' clause, or
what is implied in a kernels compute construct, but I'm not sure it implies
no dependencies between iterations?

Putting aside the semantic issues, as of currently safelen>0 turns on a certain amount of
vectorization code that we are not currently using (and not likely at all for nvptx).
Right now, we're just trying to pass the new flag to a kernels tree-parloops based pass.

Maybe this can all be reconciled later in a more precise way, e.g. have flags that correspond
specifically to phases of internal compiler passes (and selected by needs of the accel target),
instead of ones that are "sort of" associated with high-level language features.

Chung-Lin
Jakub Jelinek July 14, 2015, 9:48 a.m. UTC | #3
On Tue, Jul 14, 2015 at 05:35:28PM +0800, Chung-Lin Tang wrote:
> The wording of OpenACC independent is more simple:
> "... the independent clause tells the implementation that the iterations of this loop
> are data-independent with respect to each other." -- OpenACC spec 2.7.9
> 
> I would say this implies even more relaxed conditions than OpenMP simd safelen,
> essentially saying that the compiler doesn't even need dependence analysis; just
> assume independence of iterations.

safelen is also saying that the compiler doesn't even need dependence
analysis.  It is just that only some transformations of the loop are ok
without dependence analysis, others need to be with dependence analysis.
Classical vectorization optimizations (instead of doing one iteration
at a time you can do up to safelen consecutive iterations together) for the
first statement in the loop, then second statement, etc. are ok without
dependence analysis, but e.g. reversing the loop and running first the last
iteration and so on up to first, or running the iterations in random orders
is not ok.

> > So if OpenACC independent means there are no dependencies in between
> > iterations, the OpenMP counterpart here is #pragma omp for simd schedule (auto)
> > or #pragma omp distribute parallel for simd schedule (auto).
> 
> schedule(auto) appears to correspond to the OpenACC 'auto' clause, or
> what is implied in a kernels compute construct, but I'm not sure it implies
> no dependencies between iterations?

By the schedule(auto) I meant that the user tells the compiler it can
parallelize the loop with whatever schedule it wants.  Other schedules are
quite well defined, if the team has that many threads, which of the thread
gets which iteration, so user could rely on a particular parallelization and
the loop iterations still could not be 100% independent.  With
schedule(auto) you say it is up to the compiler to schedule them, thus they
really have to be all independent.

> Putting aside the semantic issues, as of currently safelen>0 turns on a certain amount of
> vectorization code that we are not currently using (and not likely at all for nvptx).
> Right now, we're just trying to pass the new flag to a kernels tree-parloops based pass.

In any case, when setting your flag you should also set safelen = INT_MAX,
as the OpenACC independent implies that you can vectorize the loop with any
vectorization factor without performing dependency analysis on the loop.
OpenACC is (hopefully) not just about PTX and most other targets will want
to vectorize such loops.

	Jakub
diff mbox

Patch

Index: cfgloop.h
===================================================================
--- cfgloop.h	(revision 225758)
+++ cfgloop.h	(working copy)
@@ -194,6 +194,10 @@  struct GTY ((chain_next ("%h.next"))) loop {
   /* True if the loop is part of an oacc kernels region.  */
   bool in_oacc_kernels_region;
 
+  /* True if loop is tagged as having independent iterations by user,
+     e.g. the OpenACC independent clause.  */
+  bool marked_independent;
+
   /* For SIMD loops, this is a unique identifier of the loop, referenced
      by IFN_GOMP_SIMD_VF, IFN_GOMP_SIMD_LANE and IFN_GOMP_SIMD_LAST_LANE
      builtins.  */
Index: gimplify.c
===================================================================
--- gimplify.c	(revision 225758)
+++ gimplify.c	(working copy)
@@ -6602,7 +6602,6 @@  gimplify_scan_omp_clauses (tree *list_p, gimple_se
 	  break;
 
 	case OMP_CLAUSE_DEVICE_RESIDENT:
-	case OMP_CLAUSE_INDEPENDENT:
 	  remove = true;
 	  break;
 
@@ -6612,6 +6611,7 @@  gimplify_scan_omp_clauses (tree *list_p, gimple_se
 	case OMP_CLAUSE_COLLAPSE:
 	case OMP_CLAUSE_AUTO:
 	case OMP_CLAUSE_SEQ:
+	case OMP_CLAUSE_INDEPENDENT:
 	case OMP_CLAUSE_MERGEABLE:
 	case OMP_CLAUSE_PROC_BIND:
 	case OMP_CLAUSE_SAFELEN:
Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225758)
+++ omp-low.c	(working copy)
@@ -136,8 +136,16 @@  struct omp_region
   /* True if this is nested inside an OpenACC kernels construct.  */
   bool inside_kernels_p;
 
+  /* Records a generic kind field.  */
+  int kind;
+
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
+
+  /* For an OpenACC loop directive, true if has the 'independent' clause.  */
+  bool independent;
+
+  tree broadcast_array;
 };
 
 /* Context structure.  Used to store information about each parallel
@@ -8273,8 +8281,15 @@  expand_omp_for (struct omp_region *region, gimple
     loops_state_set (LOOPS_NEED_FIXUP);
 
   if (region->inside_kernels_p)
-    expand_omp_for_generic (region, &fd, BUILT_IN_NONE, BUILT_IN_NONE,
-			    inner_stmt);
+    {
+      expand_omp_for_generic (region, &fd, BUILT_IN_NONE, BUILT_IN_NONE,
+			      inner_stmt);
+      if (region->independent && region->cont->loop_father)
+	{
+	  struct loop *loop = region->cont->loop_father; 
+	  loop->marked_independent = true;
+	}
+    }
   else if (gimple_omp_for_kind (fd.for_stmt) & GF_OMP_FOR_SIMD)
     expand_omp_simd (region, &fd);
   else if (gimple_omp_for_kind (fd.for_stmt) == GF_OMP_FOR_KIND_CILKFOR)
@@ -9943,6 +9958,34 @@  find_omp_for_region_gwv (gimple stmt)
   return tmp;
 }
 
+static void
+find_omp_for_region_data (struct omp_region *region, gomp_for *stmt)
+{
+  region->gwv_this = find_omp_for_region_gwv (stmt);
+  region->kind = gimple_omp_for_kind (stmt);
+
+  if (region->kind == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      struct omp_region *target_region = region->outer;
+      while (target_region
+	     && target_region->type != GIMPLE_OMP_TARGET)
+	target_region = target_region->outer;
+      if (!target_region)
+	return;
+
+      tree clauses = gimple_omp_for_clauses (stmt);
+
+      if (target_region->kind == GF_OMP_TARGET_KIND_OACC_PARALLEL
+	  && !find_omp_clause (clauses, OMP_CLAUSE_SEQ))
+	/* In OpenACC parallel constructs, 'independent' is implied on all
+	   loop directives without a 'seq' clause.  */
+	region->independent = true;
+      else if (target_region->kind == GF_OMP_TARGET_KIND_OACC_KERNELS
+	       && find_omp_clause (clauses, OMP_CLAUSE_INDEPENDENT))
+	region->independent = true;
+    }
+}
+
 /* Fill in additional data for a region REGION associated with an
    OMP_TARGET STMT.  */
 
@@ -9960,6 +10003,7 @@  find_omp_target_region_data (struct omp_region *re
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
+  region->kind = gimple_omp_target_kind (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10046,7 +10090,7 @@  build_omp_regions_1 (basic_block bb, struct omp_re
 		}
 	    }
 	  else if (code == GIMPLE_OMP_FOR)
-	    region->gwv_this = find_omp_for_region_gwv (stmt);
+	    find_omp_for_region_data (region, as_a <gomp_for *> (stmt));
 	  /* ..., this directive becomes the parent for a new region.  */
 	  if (region)
 	    parent = region;