
[6/16] Add pass_oacc_kernels

Message ID 5640DA47.2010508@mentor.com
State New

Commit Message

Tom de Vries Nov. 9, 2015, 5:39 p.m. UTC
On 09/11/15 16:35, Tom de Vries wrote:
> Hi,
>
> this patch series for stage1 trunk adds support to:
> - parallelize oacc kernels regions using parloops, and
> - map the loops onto the oacc gang dimension.
>
> The patch series contains these patches:
>
>       1    Insert new exit block only when needed in
>          transform_to_exit_first_loop_alt
>       2    Make create_parallel_loop return void
>       3    Ignore reduction clause on kernels directive
>       4    Implement -foffload-alias
>       5    Add in_oacc_kernels_region in struct loop
>       6    Add pass_oacc_kernels
>       7    Add pass_dominator_oacc_kernels
>       8    Add pass_ch_oacc_kernels
>       9    Add pass_parallelize_loops_oacc_kernels
>      10    Add pass_oacc_kernels pass group in passes.def
>      11    Update testcases after adding kernels pass group
>      12    Handle acc loop directive
>      13    Add c-c++-common/goacc/kernels-*.c
>      14    Add gfortran.dg/goacc/kernels-*.f95
>      15    Add libgomp.oacc-c-c++-common/kernels-*.c
>      16    Add libgomp.oacc-fortran/kernels-*.f95
>
> The first 9 patches are more or less independent, but patches 10-16 are
> intended to be committed at the same time.
>
> Bootstrapped and reg-tested on x86_64.
>
> Build and reg-tested with nvidia accelerator, in combination with a
> patch that enables accelerator testing (which is submitted at
> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>
> I'll post the individual patches in reply to this message.

this patch adds a pass group pass_oacc_kernels (which will be added to 
the pass list as a whole in patch 10).

At the moment, the parallelization behaviour for the kernels region is 
controlled by flag_tree_parallelize_loops, which is also used to control 
generic auto-parallelization by autopar using omp. That is not ideal, 
and we may want a separate flag (or param) to control the behaviour for 
oacc kernels, e.g. -foacc-kernels-gang-parallelize=<n>. I'm open to 
suggestions.

The purpose of the pass group as a whole is to massage the offloaded 
function into a shape that parloops can deal with, and then to run 
parloops on it.

Consider a testcase with a reduction, and a loop counter declared 
outside the offload region:
...
#define n 10000

unsigned int a[n];

unsigned int
foo (void)
{
   int i;
   unsigned int sum = 1;

#pragma acc kernels copyin (a[0:n]) copy (sum)
   {
     for (i = 0; i < n; ++i)
       sum += a[i];
   }

   return sum;
}
...

After ealias, the loop body looks like this:
...
   <bb 5>:
   _8 = *.omp_data_i_3(D).a;
   _9 = *.omp_data_i_3(D).i;
   _10 = *_9;
   _11 = *_8[_10];
   _12 = *.omp_data_i_3(D).sum;
   sum.0_13 = *_12;
   sum.1_14 = _11 + sum.0_13;
   _15 = *.omp_data_i_3(D).sum;
   *_15 = sum.1_14;
   _17 = *.omp_data_i_3(D).i;
   _18 = *_17;
   _19 = *.omp_data_i_3(D).i;
   _20 = _18 + 1;
   *_19 = _20;
   goto <bb 6>;
...
In other words, the iteration variable is in memory, as is the reduction 
variable, and the body contains lots of loop invariant loads.

At the end of the pass group, just before parloops, the body has been 
rewritten to have a local iteration variable and a local reduction 
variable, and all the loop invariant loads have been moved out of the loop:
...
   <bb 4>:
   # _27 = PHI <0(2), _20(5)>
   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
   _11 = *_8[_27];
   sum.1_14 = _11 + D__lsm.7_28;
   _20 = _27 + 1;
   if (_20 <= 9999)
     goto <bb 5>;
   else
     goto <bb 3>;
...
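In C terms, the effect of the pass group on this example corresponds 
roughly to the following hand-written sketch (not compiler output; the 
function names are made up for illustration):

```c
#include <assert.h>

#define N 10000

/* Before the pass group: the iteration and reduction variables live in
   memory behind pointers, mirroring the .omp_data_i accesses in the
   dump above, and the loads are repeated every iteration.  */
static unsigned int
sum_before (unsigned int *a, int *ip, unsigned int *sump)
{
  for (*ip = 0; *ip < N; ++*ip)
    *sump += a[*ip];
  return *sump;
}

/* After the pass group: a local induction variable, a local reduction
   accumulator, and the invariant loads hoisted out of the loop.  */
static unsigned int
sum_after (unsigned int *a, int *ip, unsigned int *sump)
{
  unsigned int *base = a;    /* invariant load, hoisted */
  unsigned int acc = *sump;  /* reduction variable made local */
  int i;
  for (i = 0; i < N; ++i)    /* local induction variable */
    acc += base[i];
  *sump = acc;
  *ip = i;
  return acc;
}
```

Both versions compute the same result; the second is the shape that 
parloops can actually analyze.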

Thanks,
- Tom

Comments

Richard Biener Nov. 11, 2015, 10:58 a.m. UTC | #1
On Mon, 9 Nov 2015, Tom de Vries wrote:

> On 09/11/15 16:35, Tom de Vries wrote:
> > Hi,
> > 
> > this patch series for stage1 trunk adds support to:
> > - parallelize oacc kernels regions using parloops, and
> > - map the loops onto the oacc gang dimension.
> > 
> > The patch series contains these patches:
> > 
> >       1    Insert new exit block only when needed in
> >          transform_to_exit_first_loop_alt
> >       2    Make create_parallel_loop return void
> >       3    Ignore reduction clause on kernels directive
> >       4    Implement -foffload-alias
> >       5    Add in_oacc_kernels_region in struct loop
> >       6    Add pass_oacc_kernels
> >       7    Add pass_dominator_oacc_kernels
> >       8    Add pass_ch_oacc_kernels
> >       9    Add pass_parallelize_loops_oacc_kernels
> >      10    Add pass_oacc_kernels pass group in passes.def
> >      11    Update testcases after adding kernels pass group
> >      12    Handle acc loop directive
> >      13    Add c-c++-common/goacc/kernels-*.c
> >      14    Add gfortran.dg/goacc/kernels-*.f95
> >      15    Add libgomp.oacc-c-c++-common/kernels-*.c
> >      16    Add libgomp.oacc-fortran/kernels-*.f95
> > 
> > The first 9 patches are more or less independent, but patches 10-16 are
> > intended to be committed at the same time.
> > 
> > Bootstrapped and reg-tested on x86_64.
> > 
> > Build and reg-tested with nvidia accelerator, in combination with a
> > patch that enables accelerator testing (which is submitted at
> > https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
> > 
> > I'll post the individual patches in reply to this message.
> 
> this patch adds a pass group pass_oacc_kernels (which will be added to the
> pass list as a whole in patch 10).

Just to understand (while also skimming the HSA patches).

You are basically relying on autopar for what the HSA patches call
"gridification"?  That is, OMP lowering produces loopy kernels
and autopar then will basically strip the outermost loop?

Richard.

> Atm, the parallelization behaviour for the kernels region is controlled by
> flag_tree_parallelize_loops, which is also used to control generic
> auto-parallelization by autopar using omp. That is not ideal, and we may want
> a separate flag (or param) to control the behaviour for oacc kernels, f.i.
> -foacc-kernels-gang-parallelize=<n>. I'm open to suggestions.
> 
> The purpose of the pass group as a whole is to massage the offloaded function
> into a shape that parloops can deal with, and then to run parloops on it.
> 
> Consider a testcase with a reduction, and a loop counter declared outside the
> offload region:
> ...
> unsigned int a[n];
> 
> unsigned int
> foo (void)
> {
>   int i;
>   unsigned int sum = 1;
> 
> #pragma acc kernels copyin (a[0:n]) copy (sum)
>   {
>     for (i = 0; i < n; ++i)
>       sum += a[i];
>   }
> 
>   return sum;
> }
> ...
> 
> After ealias, the loop body looks like this:
> ...
>   <bb 5>:
>   _8 = *.omp_data_i_3(D).a;
>   _9 = *.omp_data_i_3(D).i;
>   _10 = *_9;
>   _11 = *_8[_10];
>   _12 = *.omp_data_i_3(D).sum;
>   sum.0_13 = *_12;
>   sum.1_14 = _11 + sum.0_13;
>   _15 = *.omp_data_i_3(D).sum;
>   *_15 = sum.1_14;
>   _17 = *.omp_data_i_3(D).i;
>   _18 = *_17;
>   _19 = *.omp_data_i_3(D).i;
>   _20 = _18 + 1;
>   *_19 = _20;
>   goto <bb 6>;
> ...
> In other words, the iteration variable is in memory, as is the reduction
> variable, and the body contains lots of loop invariant loads.
> 
> At the end of the pass group, just before parloops, the body has been
> rewritten to have a local iteration variable and a local reduction variable,
> and all the loop invariant loads have been moved out of the loop:
> ...
>   <bb 4>:
>   # _27 = PHI <0(2), _20(5)>
>   # D__lsm.7_28 = PHI <D__lsm.7_29(2), sum.1_14(5)>
>   _11 = *_8[_27];
>   sum.1_14 = _11 + D__lsm.7_28;
>   _20 = _27 + 1;
>   if (_20 <= 9999)
>     goto <bb 5>;
>   else
>     goto <bb 3>;
> ...
> 
> Thanks,
> - Tom
> 
>
Tom de Vries Nov. 19, 2015, 1:50 p.m. UTC | #2
On 11/11/15 11:58, Richard Biener wrote:
> On Mon, 9 Nov 2015, Tom de Vries wrote:
>
>> On 09/11/15 16:35, Tom de Vries wrote:
>>> Hi,
>>>
>>> this patch series for stage1 trunk adds support to:
>>> - parallelize oacc kernels regions using parloops, and
>>> - map the loops onto the oacc gang dimension.
>>>
>>> The patch series contains these patches:
>>>
>>>        1    Insert new exit block only when needed in
>>>           transform_to_exit_first_loop_alt
>>>        2    Make create_parallel_loop return void
>>>        3    Ignore reduction clause on kernels directive
>>>        4    Implement -foffload-alias
>>>        5    Add in_oacc_kernels_region in struct loop
>>>        6    Add pass_oacc_kernels
>>>        7    Add pass_dominator_oacc_kernels
>>>        8    Add pass_ch_oacc_kernels
>>>        9    Add pass_parallelize_loops_oacc_kernels
>>>       10    Add pass_oacc_kernels pass group in passes.def
>>>       11    Update testcases after adding kernels pass group
>>>       12    Handle acc loop directive
>>>       13    Add c-c++-common/goacc/kernels-*.c
>>>       14    Add gfortran.dg/goacc/kernels-*.f95
>>>       15    Add libgomp.oacc-c-c++-common/kernels-*.c
>>>       16    Add libgomp.oacc-fortran/kernels-*.f95
>>>
>>> The first 9 patches are more or less independent, but patches 10-16 are
>>> intended to be committed at the same time.
>>>
>>> Bootstrapped and reg-tested on x86_64.
>>>
>>> Build and reg-tested with nvidia accelerator, in combination with a
>>> patch that enables accelerator testing (which is submitted at
>>> https://gcc.gnu.org/ml/gcc-patches/2015-10/msg01771.html ).
>>>
>>> I'll post the individual patches in reply to this message.
>>
>> this patch adds a pass group pass_oacc_kernels (which will be added to the
>> pass list as a whole in patch 10).
>
> Just to understand (while also skimming the HSA patches).
>
> You are basically relying on autopar for what the HSA patches call
> "gridification"?  That is, OMP lowering produces loopy kernels
> and autopar then will basically strip the outermost loop?

Short answer: no. In more detail...

Existing openmp support maps explicitly independent loops (annotated with 
omp-for) in omp-parallel regions onto pthreads. It generates thread 
functions containing sequential loops that iterate on a subset of data 
of the original loop.

Parloops maps sequential loops onto pthreads by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an omp-for annotated loop
- wrapping the loop in an omp-parallel region
- rewriting the variable accesses in the loop such that they are
   relative to base pointers passed into the region
   (note: this bit is done by omplower for omp-for loops from source)
- rewriting the preloop-read and postloop-write pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the omp-for and omp-parallel
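
At the source level, the steps above amount to roughly this rewrite (a 
hand-written sketch for a simple sum reduction, not actual parloops 
output):

```c
#include <assert.h>

#define N 10000

/* Input to parloops: a sequential loop with a reduction.  */
static unsigned int
seq_sum (const unsigned int *a, unsigned int sum)
{
  for (int i = 0; i < N; ++i)
    sum += a[i];
  return sum;
}

/* Roughly what parloops produces: an omp-for annotated loop inside an
   omp-parallel region, a thread-local accumulator, and the final
   reduction update made atomic.  */
static unsigned int
par_sum (const unsigned int *a, unsigned int sum)
{
#pragma omp parallel shared (sum)
  {
    unsigned int local = 0;
#pragma omp for
    for (int i = 0; i < N; ++i)
      local += a[i];
#pragma omp atomic
    sum += local;
  }
  return sum;
}
```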

The HSA support maps explicitly independent loops in openmp target 
regions onto a shared-memory accelerator. By default, it generates 
kernel functions containing sequential loops that iterate on a subset of 
data of the original loop. The control flow has a performance penalty on 
the accelerator, so there's a concept called gridification (explained 
here: https://gcc.gnu.org/ml/gcc-patches/2015-11/msg00586.html ). [ I'm 
not sure if it is an additional transformation or a different style of 
generation ].  The gridification increases the launch dimensions of the 
kernels to a point that there's only one iteration left in the loop, 
which means that the control flow can be eliminated.
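
As I understand it, the difference can be sketched like this (a 
hand-written illustration, with 'work_item_id' as a made-up stand-in 
for a kernel-instance id):

```c
#include <assert.h>

/* Before gridification: each kernel instance runs a sequential loop
   over its chunk of the iteration space.  */
static void
kernel_loopy (int *out, int chunk_start, int chunk_size)
{
  for (int i = chunk_start; i < chunk_start + chunk_size; ++i)
    out[i] = i * 2;
}

/* After gridification: the launch grid covers the whole iteration
   space, each instance handles exactly one iteration, and the loop
   control disappears.  */
static void
kernel_gridified (int *out, int work_item_id)
{
  out[work_item_id] = work_item_id * 2;
}
```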

The openacc kernels support maps loops in an oacc kernels region onto a 
non-shared memory accelerator. These loops can be unannotated loops, or 
acc-loop annotated loops. If an acc-loop directive contains the 
independent clause, the loop is explicitly independent.

The current oacc kernels implementation mostly ignores the acc-loop 
directive, in order to unify handling of annotated and unannotated 
loops. The patch "Handle acc loop directive" (at 
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg01089.html ) expands the 
annotated loop as a sequential loop.
At the point that we get to pass_parallelize_loops_oacc_kernels, we have 
sequential loops in an offloaded function (atm, there's no support for 
the independent clause yet).

So pass_parallelize_loops_oacc_kernels transforms sequential loops in an 
offloaded function originating from a kernels region into explicitly 
independent loops by:
- proving the loop is independent
- identifying reductions
- rewriting the loop into an acc-loop annotated loop
- annotating the offloaded function with kernel launch dimensions
- rewriting the preloop-load and postloop-store pair of a reduction
   variable into an atomic update
- letting a subsequent ompexpand expand the acc-loop
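
For the reduction in particular, the preloop-load/postloop-store pair 
becomes an atomic update along these lines (a source-level sketch with 
illustrative names, not compiler output):

```c
#include <assert.h>

#define N 10000

/* Before: the reduction variable is read once before the loop and
   written once after it, which is only correct sequentially.  */
static void
reduce_before (const unsigned int *a, unsigned int *sum)
{
  unsigned int acc = *sum;          /* preloop load */
  for (int i = 0; i < N; ++i)
    acc += a[i];
  *sum = acc;                       /* postloop store */
}

/* After: each gang keeps a private accumulator and commits it with a
   single atomic update, so gangs can run concurrently.  */
static void
reduce_after (const unsigned int *a, unsigned int *sum)
{
  unsigned int acc = 0;
#pragma acc loop gang reduction (+:acc)
  for (int i = 0; i < N; ++i)
    acc += a[i];
#pragma acc atomic update
  *sum += acc;
}
```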

I'd say there is no explicit gridification in there.

AFAIU, gridification is something that can result from determining the 
launch dimensions of the offloaded function and optimizing for those 
dimensions. Currently pass_parallelize_loops_oacc_kernels is a place 
where we set launch dimensions, but we're not optimizing for that; that 
happens later on. (And I'm starting to wonder whether I can get rid of 
setting the gang dimension in pass_parallelize_loops_oacc_kernels.)

Thanks,
- Tom

Patch

Add pass_oacc_kernels

2015-11-09  Tom de Vries  <tom@codesourcery.com>

	* tree-pass.h (make_pass_oacc_kernels): Declare.
	* tree-ssa-loop.c (gate_oacc_kernels): New static function.
	(pass_data_oacc_kernels): New pass_data.
	(class pass_oacc_kernels): New pass.
	(make_pass_oacc_kernels): New function.
---
 gcc/tree-pass.h     |  1 +
 gcc/tree-ssa-loop.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 49e22a9..4ed8da6 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -463,6 +463,7 @@  extern gimple_opt_pass *make_pass_strength_reduction (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vtable_verify (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_ubsan (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sanopt (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_oacc_kernels (gcc::context *ctxt);
 
 /* IPA Passes */
 extern simple_ipa_opt_pass *make_pass_ipa_lower_emutls (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-loop.c b/gcc/tree-ssa-loop.c
index 8ecd140..b51cac2 100644
--- a/gcc/tree-ssa-loop.c
+++ b/gcc/tree-ssa-loop.c
@@ -35,6 +35,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-inline.h"
 #include "tree-scalar-evolution.h"
 #include "tree-vectorizer.h"
+#include "omp-low.h"
 
 
 /* A pass making sure loops are fixed up.  */
@@ -141,6 +142,70 @@  make_pass_tree_loop (gcc::context *ctxt)
   return new pass_tree_loop (ctxt);
 }
 
+/* Gate for oacc kernels pass group.  */
+
+static bool
+gate_oacc_kernels (function *fn)
+{
+  if (flag_tree_parallelize_loops <= 1)
+    return false;
+
+  tree oacc_function_attr = get_oacc_fn_attrib (fn->decl);
+  if (oacc_function_attr == NULL_TREE)
+    return false;
+
+  tree val = TREE_VALUE (oacc_function_attr);
+  while (val != NULL_TREE && TREE_VALUE (val) == NULL_TREE)
+    val = TREE_CHAIN (val);
+
+  if (val != NULL_TREE)
+    return false;
+
+  struct loop *loop;
+  FOR_EACH_LOOP (loop, 0)
+    if (loop->in_oacc_kernels_region)
+      return true;
+
+  return false;
+}
+
+/* The oacc kernels superpass.  */
+
+namespace {
+
+const pass_data pass_data_oacc_kernels =
+{
+  GIMPLE_PASS, /* type */
+  "oacc_kernels", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_LOOP, /* tv_id */
+  PROP_cfg, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_oacc_kernels : public gimple_opt_pass
+{
+public:
+  pass_oacc_kernels (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_oacc_kernels, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *fn) { return gate_oacc_kernels (fn); }
+
+}; // class pass_oacc_kernels
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_oacc_kernels (gcc::context *ctxt)
+{
+  return new pass_oacc_kernels (ctxt);
+}
+
 /* The no-loop superpass.  */
 
 namespace {
-- 
1.9.1