diff mbox series

[OpenACC] Add support for gang local storage allocation in shared memory

Message ID 20180813172151.6bfcece3@squid.athome
State New
Headers show
Series [OpenACC] Add support for gang local storage allocation in shared memory | expand

Commit Message

Julian Brown Aug. 13, 2018, 4:21 p.m. UTC
This patch adds support for placing gang-private variables in NVPTX
per-CU shared memory. This is done by marking up addressable variables
declared at the appropriate parallelism level with an attribute ("oacc
gangprivate") in omp-low.c.

Target-dependent code in the NVPTX backend then modifies the symbol
associated with the variable at expand time via a new target hook
(TARGET_GOACC_EXPAND_ACCEL_VAR) in order to place it in shared memory,
which is faster to access than the ".local" memory that would otherwise
be used for such variables. This has (theoretical, at least)
consequences on program semantics, in that the shared memory is also
statically-allocated rather than obeying stack discipline -- but you
can't have recursive routine calls in OpenACC anyway, so that's no big
deal.

Other targets can use the same attribute in different ways, as
appropriate.

OK for trunk?

Thanks,

Julian

2018-08-10  Julian Brown  <julian@codesourcery.com>
            Chung-Lin Tang  <cltang@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
        (gangprivate_shared_size): New global variable.
        (gangprivate_shared_align): Likewise.
        (gangprivate_shared_sym): Likewise.
        (gangprivate_shared_hmap): Likewise.
        (nvptx_option_override): Initialize gangprivate_shared_sym,
        gangprivate_shared_align.
        (nvptx_file_end): Output gangprivate_shared_sym.
        (nvptx_goacc_expand_accel_var): New function.
        (nvptx_set_current_function): New function.
        (TARGET_SET_CURRENT_FUNCTION): Define hook.
        (TARGET_GOACC_EXPAND_ACCEL): Likewise.
        * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
        * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
        * expr.c (expand_expr_real_1): Remap decls marked with the
        "oacc gangprivate" atttribute.
        * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
        fields.
        (new_omp_context): Initialize oacc_decls in new omp_context.
        (delete_omp_context): Delete oacc_decls in old omp_context.
        (lower_oacc_head_tail): Record partitioning-level count in omp context.
        (oacc_record_private_var_clauses, oacc_record_vars_in_bind)
        (mark_oacc_gangprivate): New functions.
        (lower_omp_for): Call oacc_record_private_var_clauses with "for"
        clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
        (lower_omp_target): Call oacc_record_private_var_clauses with "target"
        clauses.
        Call mark_oacc_gangprivate for offloaded target regions.
        (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
        * target.def (expand_accel_var): New hook.

        libgomp/
        * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
        * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
        * testsuite/libgomp.oacc-c/pr85465.c: New test.

Comments

Cesar Philippidis Aug. 13, 2018, 6:42 p.m. UTC | #1
On 08/13/2018 09:21 AM, Julian Brown wrote:

> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
> new file mode 100644
> index 0000000..2fa708a
> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
> @@ -0,0 +1,106 @@
> +/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */

As a quick comment, I like the approach that you've taken with this
patch, but the og8 patch only applies the gangprivate attribute in the
c/c++ FE. I'd have to review the notes, but I seem to recall that
excluding that clause in fortran was deliberate. Chung-Lin, do you
recall the rationale behind that?

With that aside, is the above xfail still necessary? It seems to xpass
for me on nvptx. However, I see this regression on the host:

FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/loop-gwv-2.c
-DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1  -O2  execution test

There could be other regressions, but I only tested the new tests
introduced by the patch so far.

Cesar
Cesar Philippidis Aug. 13, 2018, 7:06 p.m. UTC | #2
On 08/13/2018 11:42 AM, Cesar Philippidis wrote:
> On 08/13/2018 09:21 AM, Julian Brown wrote:
> 
>> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
>> new file mode 100644
>> index 0000000..2fa708a
>> --- /dev/null
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
>> @@ -0,0 +1,106 @@
>> +/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */
> 
> As a quick comment, I like the approach that you've taken with this
> patch, but the og8 patch only applies the gangprivate attribute in the
> c/c++ FE. I'd have to review the notes, but I seem to recall that
> excluding that clause in fortran was deliberate. Chung-Lin, do you
> recall the rationale behind that?

I found this in an old email:

  The older version of fortran that OpenACC supports doesn't have a
  concept of lexically scoped blocks like c/c++, so this isn't relevant
  except for explicit gang private variables.

So in other words, this is safe for fortran. It probably could use a
fortran test, because that functionality wasn't explicitly exercised in
og7/og8.

Cesar
Julian Brown Aug. 13, 2018, 8:41 p.m. UTC | #3
On Mon, 13 Aug 2018 11:42:26 -0700
Cesar Philippidis <cesar@codesourcery.com> wrote:

> On 08/13/2018 09:21 AM, Julian Brown wrote:
> 
> > diff --git
> > a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
> > b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file
> > mode 100644 index 0000000..2fa708a --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
> > @@ -0,0 +1,106 @@
> > +/* { dg-xfail-run-if "gangprivate
> > failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */  
> 
> As a quick comment, I like the approach that you've taken with this
> patch, but the og8 patch only applies the gangprivate attribute in the
> c/c++ FE. I'd have to review the notes, but I seem to recall that
> excluding that clause in fortran was deliberate. Chung-Lin, do you
> recall the rationale behind that?
> 
> With that aside, is the above xfail still necessary? It seems to xpass
> for me on nvptx. However, I see this regression on the host:
> 
> FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/loop-gwv-2.c
> -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1  -O2  execution test
> 
> There could be other regressions, but I only tested the new tests
> introduced by the patch so far.

Oops, this was the version of the patch I meant to post (and the one I
tested). The XFAIL on loop-gwv-2.c isn't necessary, plus that test
needed some other fixes to make it pass for NVPTX (it was written for
GCN to start with).

Everything else is the same. I'll see what I can come up with for a
Fortran test.

Thanks,

Julian
commit 7834b2f0dffec3e56e510c04e1663424b778fdfb
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Aug 9 20:27:04 2018 -0700

    [OpenACC] Add support for gang local storage allocation in shared memory
    
    2018-08-10  Julian Brown  <julian@codesourcery.com>
    	    Chung-Lin Tang  <cltang@codesourcery.com>
    
    	gcc/
    	* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
    	(gangprivate_shared_size): New global variable.
    	(gangprivate_shared_align): Likewise.
    	(gangprivate_shared_sym): Likewise.
    	(gangprivate_shared_hmap): Likewise.
    	(nvptx_option_override): Initialize gangprivate_shared_sym,
    	gangprivate_shared_align.
    	(nvptx_file_end): Output gangprivate_shared_sym.
    	(nvptx_goacc_expand_accel_var): New function.
    	(nvptx_set_current_function): New function.
    	(TARGET_SET_CURRENT_FUNCTION): Define hook.
    	(TARGET_GOACC_EXPAND_ACCEL): Likewise.
    	* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
    	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
    	* expr.c (expand_expr_real_1): Remap decls marked with the
    	"oacc gangprivate" atttribute.
    	* omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
    	fields.
    	(new_omp_context): Initialize oacc_decls in new omp_context.
    	(delete_omp_context): Delete oacc_decls in old omp_context.
    	(lower_oacc_head_tail): Record partitioning-level count in omp context.
    	(oacc_record_private_var_clauses, oacc_record_vars_in_bind)
    	(mark_oacc_gangprivate): New functions.
    	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
    	clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
    	(lower_omp_target): Call oacc_record_private_var_clauses with "target"
    	clauses.
    	Call mark_oacc_gangprivate for offloaded target regions.
    	(lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
    	* target.def (expand_accel_var): New hook.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
    	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
    	* testsuite/libgomp.oacc-c/pr85465.c: New test.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index c0b0a2e..14eb842 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -73,6 +73,7 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -137,6 +138,12 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -210,6 +217,10 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -4968,6 +4979,10 @@ nvptx_file_end (void)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (gangprivate_shared_size)
+    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5915,6 +5930,47 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (TREE_CODE (var) == VAR_DECL
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size =
+	    (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
+static GTY(()) tree nvptx_previous_fndecl;
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  gangprivate_shared_hmap.empty ();
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6051,6 +6107,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index a40f45a..fb87f67 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelarator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 39a214e..beace61 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4151,6 +4151,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index de6709d..2c62bf9 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 843c66f..354e182 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -124,6 +124,12 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
+
+  /* The number of levels of OpenACC partitioning invoked in this context.  */
+  int oacc_partitioning_levels;
+
+  /* Decls in this context.  */
+  vec<tree> *oacc_decls;
 };
 
 static splay_tree all_contexts;
@@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
     }
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
+  ctx->oacc_decls = new vec<tree> ();
 
   return ctx;
 }
@@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value)
   if (is_task_ctx (ctx))
     finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt));
 
+  delete ctx->oacc_decls;
+
   XDELETE (ctx);
 }
 
@@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
   gcc_assert (count);
+
+  ctx->oacc_partitioning_levels = count;
+
   for (unsigned done = 1; count; count--, done++)
     {
       gimple_seq fork_seq = NULL;
@@ -6732,6 +6744,66 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  tree c;
+
+  if (!ctx)
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    switch (OMP_CLAUSE_CODE (c))
+      {
+      case OMP_CLAUSE_PRIVATE:
+	{
+	  tree decl = OMP_CLAUSE_DECL (c);
+	  ctx->oacc_decls->safe_push (decl);
+	}
+	break;
+
+      default:
+	/* Empty.  */;
+      }
+}
+
+/* Record vars declared in BINDVARS in CTX.  This information is used to mark
+   up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  if (!ctx)
+    return;
+
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    ctx->oacc_decls->safe_push (v);
+}
+
+/* Mark variables which are declared implicitly or explicitly as gang private
+   with a special attribute.  These may need to have their declarations altered
+   later on in compilation (e.g. in execute_oacc_device_lower or the backend,
+   depending on how the OpenACC execution model is implemented on a given
+   target) to ensure that sharing semantics are correct.
+   Only variables which have their address taken need to be considered.  */
+
+static void
+mark_oacc_gangprivate (vec<tree> *decls)
+{
+  int i;
+  tree decl;
+
+  FOR_EACH_VEC_ELT (*decls, i, decl)
+    {
+      if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl))
+	DECL_ATTRIBUTES (decl)
+	  = tree_cons (get_identifier ("oacc gangprivate"),
+		       NULL, DECL_ATTRIBUTES (decl));
+    }
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -6748,6 +6820,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -6878,7 +6952,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
   if (oacc_head)
-    gimple_seq_add_seq (&body, oacc_head);
+    {
+      gimple_seq_add_seq (&body, oacc_head);
+
+      int level_total = 0;
+      omp_context *thisctx;
+
+      for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
+        level_total += thisctx->oacc_partitioning_levels;
+
+      /* If the current context and parent contexts are distributed over a
+	 total of one parallelism level, we have gang partitioning.  */
+      if (level_total == 1)
+        mark_oacc_gangprivate (ctx->oacc_decls);
+    }
 
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
@@ -7511,6 +7598,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   clauses = gimple_omp_target_clauses (stmt);
 
+  oacc_record_private_var_clauses (ctx, clauses);
+
   gimple_seq dep_ilist = NULL;
   gimple_seq dep_olist = NULL;
   if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND))
@@ -7761,6 +7850,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      mark_oacc_gangprivate (ctx->oacc_decls);
+
       /* Declare all the variables created by mapping and the variables
 	 declared in the scope of the target body.  */
       record_vars_into (ctx->block_vars, child_fn);
@@ -8755,6 +8846,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/target.def b/gcc/target.def
index c570f38..b3b24b8 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelarator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 0000000..f378346
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 0000000..a4f81a3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
new file mode 100644
index 0000000..329e8a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-w" } */
+
+int
+main (void)
+{
+#pragma acc parallel
+  foo ();
+
+  return 0;
+}
Julian Brown Aug. 15, 2018, 4:46 p.m. UTC | #4
On Mon, 13 Aug 2018 12:06:21 -0700
Cesar Philippidis <cesar@codesourcery.com> wrote:

> So in other words, this is safe for fortran. It probably could use a
> fortran test, because that functionality wasn't explicitly exercised
> in og7/og8.

Here's a new version of the patch with a Fortran test case. It's not
too easy to write a test that depends on whether gang-local variables
actually end up in the right kind of memory, so I wrote one that scans
the omplower dump instead. Many other (including execution) tests will
already trigger the new behaviour.

Tested with offloading to NVPTX.

OK?

Thanks,

Julian

2018-08-10  Julian Brown  <julian@codesourcery.com>
            Chung-Lin Tang  <cltang@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
        (gangprivate_shared_size): New global variable.
        (gangprivate_shared_align): Likewise.
        (gangprivate_shared_sym): Likewise.
        (gangprivate_shared_hmap): Likewise.
        (nvptx_option_override): Initialize gangprivate_shared_sym,
        gangprivate_shared_align.
        (nvptx_file_end): Output gangprivate_shared_sym.
        (nvptx_goacc_expand_accel_var): New function.
        (nvptx_set_current_function): New function.
        (TARGET_SET_CURRENT_FUNCTION): Define hook.
        (TARGET_GOACC_EXPAND_ACCEL): Likewise.
        * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
        * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
        * expr.c (expand_expr_real_1): Remap decls marked with the
        "oacc gangprivate" atttribute.
        * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
        fields.
        (new_omp_context): Initialize oacc_decls in new omp_context.
        (delete_omp_context): Delete oacc_decls in old omp_context.
        (lower_oacc_head_tail): Record partitioning-level count in omp context.
        (oacc_record_private_var_clauses, oacc_record_vars_in_bind)
        (mark_oacc_gangprivate): New functions.
        (lower_omp_for): Call oacc_record_private_var_clauses with "for"
        clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
        (lower_omp_target): Call oacc_record_private_var_clauses with "target"
        clauses.
        Call mark_oacc_gangprivate for offloaded target regions.
        (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
        * target.def (expand_accel_var): New hook.

        libgomp/
        * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
        * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
        * testsuite/libgomp.oacc-c/pr85465.c: New test.
        * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
commit b73428237720be8d5b6e793f8615204356336d30
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Aug 9 20:27:04 2018 -0700

    [OpenACC] Add support for gang local storage allocation in shared memory
    
    2018-08-10  Julian Brown  <julian@codesourcery.com>
    	    Chung-Lin Tang  <cltang@codesourcery.com>
    
    	gcc/
    	* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
    	(gangprivate_shared_size): New global variable.
    	(gangprivate_shared_align): Likewise.
    	(gangprivate_shared_sym): Likewise.
    	(gangprivate_shared_hmap): Likewise.
    	(nvptx_option_override): Initialize gangprivate_shared_sym,
    	gangprivate_shared_align.
    	(nvptx_file_end): Output gangprivate_shared_sym.
    	(nvptx_goacc_expand_accel_var): New function.
    	(nvptx_set_current_function): New function.
    	(TARGET_SET_CURRENT_FUNCTION): Define hook.
    	(TARGET_GOACC_EXPAND_ACCEL): Likewise.
    	* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
    	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
    	* expr.c (expand_expr_real_1): Remap decls marked with the
    	"oacc gangprivate" atttribute.
    	* omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
    	fields.
    	(new_omp_context): Initialize oacc_decls in new omp_context.
    	(delete_omp_context): Delete oacc_decls in old omp_context.
    	(lower_oacc_head_tail): Record partitioning-level count in omp context.
    	(oacc_record_private_var_clauses, oacc_record_vars_in_bind)
    	(mark_oacc_gangprivate): New functions.
    	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
    	clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
    	(lower_omp_target): Call oacc_record_private_var_clauses with "target"
    	clauses.
    	Call mark_oacc_gangprivate for offloaded target regions.
    	(lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
    	* target.def (expand_accel_var): New hook.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
    	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
    	* testsuite/libgomp.oacc-c/pr85465.c: New test.
    	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index c0b0a2e..14eb842 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -73,6 +73,7 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -137,6 +138,12 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -210,6 +217,10 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -4968,6 +4979,10 @@ nvptx_file_end (void)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (gangprivate_shared_size)
+    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5915,6 +5930,47 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (TREE_CODE (var) == VAR_DECL
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size =
+	    (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
+static GTY(()) tree nvptx_previous_fndecl;
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  gangprivate_shared_hmap.empty ();
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6051,6 +6107,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index a40f45a..fb87f67 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelarator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 39a214e..beace61 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4151,6 +4151,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index de6709d..2c62bf9 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 843c66f..b0e173d 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -124,6 +124,12 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
+
+  /* The number of levels of OpenACC partitioning invoked in this context.  */
+  int oacc_partitioning_levels;
+
+  /* Decls in this context.  */
+  vec<tree> *oacc_decls;
 };
 
 static splay_tree all_contexts;
@@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
     }
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
+  ctx->oacc_decls = new vec<tree> ();
 
   return ctx;
 }
@@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value)
   if (is_task_ctx (ctx))
     finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt));
 
+  delete ctx->oacc_decls;
+
   XDELETE (ctx);
 }
 
@@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
   gcc_assert (count);
+
+  ctx->oacc_partitioning_levels = count;
+
   for (unsigned done = 1; count; count--, done++)
     {
       gimple_seq fork_seq = NULL;
@@ -6732,6 +6744,77 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  tree c;
+
+  if (!ctx)
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    switch (OMP_CLAUSE_CODE (c))
+      {
+      case OMP_CLAUSE_PRIVATE:
+	{
+	  tree decl = OMP_CLAUSE_DECL (c);
+	  ctx->oacc_decls->safe_push (decl);
+	}
+	break;
+
+      default:
+	/* Empty.  */;
+      }
+}
+
+/* Record vars declared in BINDVARS in CTX.  This information is used to mark
+   up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  if (!ctx)
+    return;
+
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    ctx->oacc_decls->safe_push (v);
+}
+
+/* Mark variables which are declared implicitly or explicitly as gang private
+   with a special attribute.  These may need to have their declarations altered
+   later on in compilation (e.g. in execute_oacc_device_lower or the backend,
+   depending on how the OpenACC execution model is implemented on a given
+   target) to ensure that sharing semantics are correct.
+   Only variables which have their address taken need to be considered.  */
+
+static void
+mark_oacc_gangprivate (vec<tree> *decls)
+{
+  int i;
+  tree decl;
+
+  FOR_EACH_VEC_ELT (*decls, i, decl)
+    {
+      if (TREE_CODE (decl) == VAR_DECL
+	  && TREE_ADDRESSABLE (decl)
+	  && !lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file,
+		       "Setting 'oacc gangprivate' attribute for decl:");
+	      print_generic_decl (dump_file, decl, TDF_SLIM);
+	      fputc ('\n', dump_file);
+	    }
+	  DECL_ATTRIBUTES (decl)
+	    = tree_cons (get_identifier ("oacc gangprivate"),
+			 NULL, DECL_ATTRIBUTES (decl));
+	}
+    }
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -6748,6 +6831,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -6878,7 +6963,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
   if (oacc_head)
-    gimple_seq_add_seq (&body, oacc_head);
+    {
+      gimple_seq_add_seq (&body, oacc_head);
+
+      int level_total = 0;
+      omp_context *thisctx;
+
+      for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
+        level_total += thisctx->oacc_partitioning_levels;
+
+      /* If the current context and parent contexts are distributed over a
+	 total of one parallelism level, we have gang partitioning.  */
+      if (level_total == 1)
+        mark_oacc_gangprivate (ctx->oacc_decls);
+    }
 
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
@@ -7511,6 +7609,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   clauses = gimple_omp_target_clauses (stmt);
 
+  oacc_record_private_var_clauses (ctx, clauses);
+
   gimple_seq dep_ilist = NULL;
   gimple_seq dep_olist = NULL;
   if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND))
@@ -7761,6 +7861,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      mark_oacc_gangprivate (ctx->oacc_decls);
+
       /* Declare all the variables created by mapping and the variables
 	 declared in the scope of the target body.  */
       record_vars_into (ctx->block_vars, child_fn);
@@ -8755,6 +8857,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/target.def b/gcc/target.def
index c570f38..b3b24b8 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelarator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 0000000..f378346
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 0000000..a4f81a3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
new file mode 100644
index 0000000..329e8a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-w" } */
+
+int
+main (void)
+{
+#pragma acc parallel
+  foo ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
new file mode 100644
index 0000000..5f8a5e6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
@@ -0,0 +1,25 @@
+! Test for "oacc gangprivate" attribute on gang-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-omplower-details" }
+! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang private(w)
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
Bernhard Reutner-Fischer Aug. 15, 2018, 7:56 p.m. UTC | #5
On 15 August 2018 18:46:37 CEST, Julian Brown <julian@codesourcery.com> wrote:
>On Mon, 13 Aug 2018 12:06:21 -0700
>Cesar Philippidis <cesar@codesourcery.com> wrote:

atttribute has more t than strictly necessary. 
Don't like signed integer levels where they should be some unsigned. 
Also don't like single switch cases instead of if.
And omitting function comments even if the hook way above is documented may be ok ish but is a bit lazy ;)

thanks, 

>
>> So in other words, this is safe for fortran. It probably could use a
>> fortran test, because that functionality wasn't explicitly exercised
>> in og7/og8.
>
>Here's a new version of the patch with a Fortran test case. It's not
>too easy to write a test that depends on whether gang-local variables
>actually end up in the right kind of memory, so I wrote one that scans
>the omplower dump instead. Many other (including execution) tests will
>already trigger the new behaviour.
>
>Tested with offloading to NVPTX.
>
>OK?
>
>Thanks,
>
>Julian
>
>2018-08-10  Julian Brown  <julian@codesourcery.com>
>            Chung-Lin Tang  <cltang@codesourcery.com>
>
>        gcc/
>        * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
>        (gangprivate_shared_size): New global variable.
>        (gangprivate_shared_align): Likewise.
>        (gangprivate_shared_sym): Likewise.
>        (gangprivate_shared_hmap): Likewise.
>        (nvptx_option_override): Initialize gangprivate_shared_sym,
>        gangprivate_shared_align.
>        (nvptx_file_end): Output gangprivate_shared_sym.
>        (nvptx_goacc_expand_accel_var): New function.
>        (nvptx_set_current_function): New function.
>        (TARGET_SET_CURRENT_FUNCTION): Define hook.
>        (TARGET_GOACC_EXPAND_ACCEL): Likewise.
>      * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
>        * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
>        * expr.c (expand_expr_real_1): Remap decls marked with the
>        "oacc gangprivate" atttribute.
>  * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
>        fields.
>        (new_omp_context): Initialize oacc_decls in new omp_context.
>        (delete_omp_context): Delete oacc_decls in old omp_context.
>(lower_oacc_head_tail): Record partitioning-level count in omp context.
>        (oacc_record_private_var_clauses, oacc_record_vars_in_bind)
>        (mark_oacc_gangprivate): New functions.
>       (lower_omp_for): Call oacc_record_private_var_clauses with "for"
>       clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
> (lower_omp_target): Call oacc_record_private_var_clauses with "target"
>        clauses.
>        Call mark_oacc_gangprivate for offloaded target regions.
>   (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
>        * target.def (expand_accel_var): New hook.
>
>        libgomp/
>      * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
>        * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
>        * testsuite/libgomp.oacc-c/pr85465.c: New test.
>   * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
Julian Brown Aug. 16, 2018, 3:46 p.m. UTC | #6
On Wed, 15 Aug 2018 21:56:54 +0200
Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:

> On 15 August 2018 18:46:37 CEST, Julian Brown
> <julian@codesourcery.com> wrote:
> >On Mon, 13 Aug 2018 12:06:21 -0700
> >Cesar Philippidis <cesar@codesourcery.com> wrote:  
> 
> atttribute has more t than strictly necessary. 
> Don't like signed integer levels where they should be some unsigned. 
> Also don't like single switch cases instead of if.
> And omitting function comments even if the hook way above is
> documented may be ok ish but is a bit lazy ;)

Here's a new version with those comments addressed. I also changed the
logic around a little to avoid adding decls to the vec in omp_context
which would never be given the gang-private attribute.

Re-tested with offloading to NVPTX.

OK?

Julian

2018-08-10  Julian Brown  <julian@codesourcery.com>
            Chung-Lin Tang  <cltang@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
        (gangprivate_shared_size): New global variable.
        (gangprivate_shared_align): Likewise.
        (gangprivate_shared_sym): Likewise.
        (gangprivate_shared_hmap): Likewise.
        (nvptx_option_override): Initialize gangprivate_shared_sym,
        gangprivate_shared_align.
        (nvptx_file_end): Output gangprivate_shared_sym.
        (nvptx_goacc_expand_accel_var): New function.
        (nvptx_set_current_function): New function.
        (TARGET_SET_CURRENT_FUNCTION): Define hook.
        (TARGET_GOACC_EXPAND_ACCEL): Likewise.
        * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
        * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
        * expr.c (expand_expr_real_1): Remap decls marked with the
        "oacc gangprivate" attribute.
        * omp-low.c (omp_context): Add oacc_partitioning_level and
        oacc_addressable_var_decls fields.
        (new_omp_context): Initialize oacc_addressable_var_decls in new
        omp_context.
        (delete_omp_context): Delete oacc_addressable_var_decls in old
        omp_context.
        (lower_oacc_head_tail): Record partitioning-level count in omp context.
        (oacc_record_private_var_clauses, oacc_record_vars_in_bind)
        (mark_oacc_gangprivate): New functions.
        (lower_omp_for): Call oacc_record_private_var_clauses with "for"
        clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
        (lower_omp_target): Call oacc_record_private_var_clauses with "target"
        clauses.
        Call mark_oacc_gangprivate for offloaded target regions.
        (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
        * target.def (expand_accel_var): New hook.

        libgomp/
        * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
        * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
        * testsuite/libgomp.oacc-c/pr85465.c: New test.
        * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
commit e276442550a85b62866ba13890eacf4e946d1079
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Aug 9 20:27:04 2018 -0700

    [OpenACC] Add support for gang local storage allocation in shared memory
    
    2018-08-10  Julian Brown  <julian@codesourcery.com>
    	    Chung-Lin Tang  <cltang@codesourcery.com>
    
    	gcc/
    	* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
    	(gangprivate_shared_size): New global variable.
    	(gangprivate_shared_align): Likewise.
    	(gangprivate_shared_sym): Likewise.
    	(gangprivate_shared_hmap): Likewise.
    	(nvptx_option_override): Initialize gangprivate_shared_sym,
    	gangprivate_shared_align.
    	(nvptx_file_end): Output gangprivate_shared_sym.
    	(nvptx_goacc_expand_accel_var): New function.
    	(nvptx_set_current_function): New function.
    	(TARGET_SET_CURRENT_FUNCTION): Define hook.
    	(TARGET_GOACC_EXPAND_ACCEL): Likewise.
    	* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
    	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
    	* expr.c (expand_expr_real_1): Remap decls marked with the
    	"oacc gangprivate" attribute.
    	* omp-low.c (omp_context): Add oacc_partitioning_level and
    	oacc_addressable_var_decls fields.
    	(new_omp_context): Initialize oacc_addressable_var_decls in new
    	omp_context.
    	(delete_omp_context): Delete oacc_addressable_var_decls in old
    	omp_context.
    	(lower_oacc_head_tail): Record partitioning-level count in omp context.
    	(oacc_record_private_var_clauses, oacc_record_vars_in_bind)
    	(mark_oacc_gangprivate): New functions.
    	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
    	clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
    	(lower_omp_target): Call oacc_record_private_var_clauses with "target"
    	clauses.
    	Call mark_oacc_gangprivate for offloaded target regions.
    	(lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
    	* target.def (expand_accel_var): New hook.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
    	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
    	* testsuite/libgomp.oacc-c/pr85465.c: New test.
    	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index c0b0a2e..7aeefdb 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -73,6 +73,7 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -137,6 +138,12 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -210,6 +217,10 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -4968,6 +4979,10 @@ nvptx_file_end (void)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (gangprivate_shared_size)
+    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5915,6 +5930,52 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+/* Implement TARGET_GOACC_EXPAND_ACCEL_VAR.  Place "oacc gangprivate"
+   variables in shared memory.  */
+
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (TREE_CODE (var) == VAR_DECL
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size =
+	    (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
+static GTY(()) tree nvptx_previous_fndecl;
+
+/* Implement TARGET_SET_CURRENT_FUNCTION.  Reset per-function context.  */
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  gangprivate_shared_hmap.empty ();
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6051,6 +6112,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index a40f45a..fb87f67 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelarator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 39a214e..beace61 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4151,6 +4151,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index de6709d..f186a41 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may be intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 843c66f..a649d2e 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -124,6 +124,12 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
+
+  /* The number of levels of OpenACC partitioning invoked in this context.  */
+  unsigned oacc_partitioning_levels;
+
+  /* Addressable variable decls in this context.  */
+  vec<tree> *oacc_addressable_var_decls;
 };
 
 static splay_tree all_contexts;
@@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
     }
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
+  ctx->oacc_addressable_var_decls = new vec<tree> ();
 
   return ctx;
 }
@@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value)
   if (is_task_ctx (ctx))
     finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt));
 
+  delete ctx->oacc_addressable_var_decls;
+
   XDELETE (ctx);
 }
 
@@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
   gcc_assert (count);
+
+  ctx->oacc_partitioning_levels = count;
+
   for (unsigned done = 1; count; count--, done++)
     {
       gimple_seq fork_seq = NULL;
@@ -6732,6 +6744,68 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  tree c;
+
+  if (!ctx)
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	tree decl = OMP_CLAUSE_DECL (c);
+	if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl))
+	  ctx->oacc_addressable_var_decls->safe_push (decl);
+      }
+}
+
+/* Record addressable vars declared in BINDVARS in CTX.  This information is
+   used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  if (!ctx)
+    return;
+
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    if (TREE_CODE (v) == VAR_DECL && TREE_ADDRESSABLE (v))
+      ctx->oacc_addressable_var_decls->safe_push (v);
+}
+
+/* Mark addressable variables which are declared implicitly or explicitly as
+   gang private with a special attribute.  These may need to have their
+   declarations altered later on in compilation (e.g. in
+   execute_oacc_device_lower or the backend, depending on how the OpenACC
+   execution model is implemented on a given target) to ensure that sharing
+   semantics are correct.  */
+
+static void
+mark_oacc_gangprivate (vec<tree> *decls)
+{
+  int i;
+  tree decl;
+
+  FOR_EACH_VEC_ELT (*decls, i, decl)
+    if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
+      {
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  {
+	    fprintf (dump_file,
+		     "Setting 'oacc gangprivate' attribute for decl:");
+	    print_generic_decl (dump_file, decl, TDF_SLIM);
+	    fputc ('\n', dump_file);
+	  }
+	DECL_ATTRIBUTES (decl)
+	  = tree_cons (get_identifier ("oacc gangprivate"),
+		       NULL, DECL_ATTRIBUTES (decl));
+      }
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -6748,6 +6822,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -6878,7 +6954,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
   if (oacc_head)
-    gimple_seq_add_seq (&body, oacc_head);
+    {
+      gimple_seq_add_seq (&body, oacc_head);
+
+      unsigned level_total = 0;
+      omp_context *thisctx;
+
+      for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
+        level_total += thisctx->oacc_partitioning_levels;
+
+      /* If the current context and parent contexts are distributed over a
+	 total of one parallelism level, we have gang partitioning.  */
+      if (level_total == 1)
+        mark_oacc_gangprivate (ctx->oacc_addressable_var_decls);
+    }
 
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
@@ -7511,6 +7600,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   clauses = gimple_omp_target_clauses (stmt);
 
+  oacc_record_private_var_clauses (ctx, clauses);
+
   gimple_seq dep_ilist = NULL;
   gimple_seq dep_olist = NULL;
   if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND))
@@ -7761,6 +7852,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      mark_oacc_gangprivate (ctx->oacc_addressable_var_decls);
+
       /* Declare all the variables created by mapping and the variables
 	 declared in the scope of the target body.  */
       record_vars_into (ctx->block_vars, child_fn);
@@ -8755,6 +8848,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/target.def b/gcc/target.def
index c570f38..b3b24b8 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelarator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 0000000..f378346
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 0000000..a4f81a3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
new file mode 100644
index 0000000..329e8a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-w" } */
+
+int
+main (void)
+{
+#pragma acc parallel
+  foo ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
new file mode 100644
index 0000000..5f8a5e6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
@@ -0,0 +1,25 @@
+! Test for "oacc gangprivate" attribute on gang-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-omplower-details" }
+! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang private(w)
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
Bernhard Reutner-Fischer Aug. 17, 2018, 4:39 p.m. UTC | #7
On 16 August 2018 17:46:43 CEST, Julian Brown <julian@codesourcery.com> wrote:
>On Wed, 15 Aug 2018 21:56:54 +0200
>Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:
>
>> On 15 August 2018 18:46:37 CEST, Julian Brown
>> <julian@codesourcery.com> wrote:
>> >On Mon, 13 Aug 2018 12:06:21 -0700
>> >Cesar Philippidis <cesar@codesourcery.com> wrote:  
>> 
>> atttribute has more t than strictly necessary. 
>> Don't like signed integer levels where they should be some unsigned. 
>> Also don't like single switch cases instead of if.
>> And omitting function comments even if the hook way above is
>> documented may be ok ish but is a bit lazy ;)
>
>Here's a new version with those comments addressed. I also changed the
>logic around a little to avoid adding decls to the vec in omp_context
>which would never be given the gang-private attribute.
>
>Re-tested with offloading to NVPTX.
>
>OK?

(TREE_CODE (var) == VAR_DECL
Is nowadays known as VAR_P (decl), FWIW.

ISTM that global variables are not JIT-friendly.
No further comments from me.

Thanks,
Tom de Vries Oct. 5, 2018, 1:41 p.m. UTC | #8
On 8/16/18 5:46 PM, Julian Brown wrote:
> On Wed, 15 Aug 2018 21:56:54 +0200
> Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:
> 
>> On 15 August 2018 18:46:37 CEST, Julian Brown
>> <julian@codesourcery.com> wrote:
>>> On Mon, 13 Aug 2018 12:06:21 -0700
>>> Cesar Philippidis <cesar@codesourcery.com> wrote:  
>>
>> atttribute has more t than strictly necessary. 
>> Don't like signed integer levels where they should be some unsigned. 
>> Also don't like single switch cases instead of if.
>> And omitting function comments even if the hook way above is
>> documented may be ok ish but is a bit lazy ;)
> 
> Here's a new version with those comments addressed. I also changed the
> logic around a little to avoid adding decls to the vec in omp_context
> which would never be given the gang-private attribute.
> 
> Re-tested with offloading to NVPTX.
> 
> OK?

As far as the nvptx part is concerned, I see:
...
=== ERROR type #4: trailing operator (1 error(s)) ===
gcc/config/nvptx/nvptx.c:5946:27:         gangprivate_shared_size =
...

Otherwise, the nvptx part is OK.

Thanks,
- Tom

> 
> Julian
> 
> 2018-08-10  Julian Brown  <julian@codesourcery.com>
>             Chung-Lin Tang  <cltang@codesourcery.com>
> 
>         gcc/
>         * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
>         (gangprivate_shared_size): New global variable.
>         (gangprivate_shared_align): Likewise.
>         (gangprivate_shared_sym): Likewise.
>         (gangprivate_shared_hmap): Likewise.
>         (nvptx_option_override): Initialize gangprivate_shared_sym,
>         gangprivate_shared_align.
>         (nvptx_file_end): Output gangprivate_shared_sym.
>         (nvptx_goacc_expand_accel_var): New function.
>         (nvptx_set_current_function): New function.
>         (TARGET_SET_CURRENT_FUNCTION): Define hook.
>         (TARGET_GOACC_EXPAND_ACCEL): Likewise.
>         * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
>         * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
>         * expr.c (expand_expr_real_1): Remap decls marked with the
>         "oacc gangprivate" attribute.
>         * omp-low.c (omp_context): Add oacc_partitioning_level and
>         oacc_addressable_var_decls fields.
>         (new_omp_context): Initialize oacc_addressable_var_decls in new
>         omp_context.
>         (delete_omp_context): Delete oacc_addressable_var_decls in old
>         omp_context.
>         (lower_oacc_head_tail): Record partitioning-level count in omp context.
>         (oacc_record_private_var_clauses, oacc_record_vars_in_bind)
>         (mark_oacc_gangprivate): New functions.
>         (lower_omp_for): Call oacc_record_private_var_clauses with "for"
>         clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
>         (lower_omp_target): Call oacc_record_private_var_clauses with "target"
>         clauses.
>         Call mark_oacc_gangprivate for offloaded target regions.
>         (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
>         * target.def (expand_accel_var): New hook.
> 
>         libgomp/
>         * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
>         * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
>         * testsuite/libgomp.oacc-c/pr85465.c: New test.
>         * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
>
Julian Brown Dec. 11, 2018, 3:08 p.m. UTC | #9
On Fri, 17 Aug 2018 18:39:00 +0200
Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:

> On 16 August 2018 17:46:43 CEST, Julian Brown
> <julian@codesourcery.com> wrote:
> >On Wed, 15 Aug 2018 21:56:54 +0200
> >Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote:
> >  
> >> On 15 August 2018 18:46:37 CEST, Julian Brown
> >> <julian@codesourcery.com> wrote:  
> >> >On Mon, 13 Aug 2018 12:06:21 -0700
> >> >Cesar Philippidis <cesar@codesourcery.com> wrote:    
> >> 
> >> atttribute has more t than strictly necessary. 
> >> Don't like signed integer levels where they should be some
> >> unsigned. Also don't like single switch cases instead of if.
> >> And omitting function comments even if the hook way above is
> >> documented may be ok ish but is a bit lazy ;)  
> >
> >Here's a new version with those comments addressed. I also changed
> >the logic around a little to avoid adding decls to the vec in
> >omp_context which would never be given the gang-private attribute.
> >
> >Re-tested with offloading to NVPTX.
> >
> >OK?  
> 
> (TREE_CODE (var) == VAR_DECL
> Is nowadays known as VAR_P (decl), FWIW.

Fixed. (And also Tom's formatting nit mentioned in another email.)

> ISTM that global variables are not JIT-friendly.
> No further comments from me.

Probably true, but AFAIK nobody's trying to use the (GCC) JIT with the
PTX backend, and the backend already uses global variables for several
other purposes. Of course PTX code is JIT'ted itself by the NVidia
runtime, but I guess that's not what you were referring to!

Is this version OK? Re-tested with offloading to NVPTX.

Thanks,

Julian
commit 3335ddfa72944be5359280116e8eb4febd4ed3c7
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Aug 9 20:27:04 2018 -0700

    [OpenACC] Add support for gang local storage allocation in shared memory
    
    2018-08-10  Julian Brown  <julian@codesourcery.com>
    	    Chung-Lin Tang  <cltang@codesourcery.com>
    
    	gcc/
    	* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
    	(gangprivate_shared_size): New global variable.
    	(gangprivate_shared_align): Likewise.
    	(gangprivate_shared_sym): Likewise.
    	(gangprivate_shared_hmap): Likewise.
    	(nvptx_option_override): Initialize gangprivate_shared_sym,
    	gangprivate_shared_align.
    	(nvptx_file_end): Output gangprivate_shared_sym.
    	(nvptx_goacc_expand_accel_var): New function.
    	(nvptx_set_current_function): New function.
    	(TARGET_SET_CURRENT_FUNCTION): Define hook.
    	(TARGET_GOACC_EXPAND_ACCEL): Likewise.
    	* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
    	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
    	* expr.c (expand_expr_real_1): Remap decls marked with the
    	"oacc gangprivate" attribute.
    	* omp-low.c (omp_context): Add oacc_partitioning_level and
    	oacc_addressable_var_decls fields.
    	(new_omp_context): Initialize oacc_addressable_var_decls in new
    	omp_context.
    	(delete_omp_context): Delete oacc_addressable_var_decls in old
    	omp_context.
    	(lower_oacc_head_tail): Record partitioning-level count in omp context.
    	(oacc_record_private_var_clauses, oacc_record_vars_in_bind)
    	(mark_oacc_gangprivate): New functions.
    	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
    	clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
    	(lower_omp_target): Call oacc_record_private_var_clauses with "target"
    	clauses.
    	Call mark_oacc_gangprivate for offloaded target regions.
    	(lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
    	* target.def (expand_accel_var): New hook.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
    	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
    	* testsuite/libgomp.oacc-c/pr85465.c: New test.
    	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index 9903a27..02c2847 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -73,6 +73,7 @@
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -137,6 +138,12 @@ static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -210,6 +217,10 @@ nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -4971,6 +4982,10 @@ nvptx_file_end (void)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (gangprivate_shared_size)
+    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5918,6 +5933,52 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+/* Implement TARGET_GOACC_EXPAND_ACCEL_VAR.  Place "oacc gangprivate"
+   variables in shared memory.  */
+
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (VAR_P (var)
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size
+	    = (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
+static GTY(()) tree nvptx_previous_fndecl;
+
+/* Implement TARGET_SET_CURRENT_FUNCTION.  Reset per-function context.  */
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  gangprivate_shared_hmap.empty ();
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6054,6 +6115,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index e348f0a..9164917 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6124,6 +6124,14 @@ like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelarator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index f1ad80d..3cdaca2 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4202,6 +4202,8 @@ address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index 85b7847..0f73deb 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9874,8 +9874,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may be intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index b406ce7..f078110 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -133,6 +133,12 @@ struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
+
+  /* The number of levels of OpenACC partitioning invoked in this context.  */
+  unsigned oacc_partitioning_levels;
+
+  /* Addressable variable decls in this context.  */
+  vec<tree> *oacc_addressable_var_decls;
 };
 
 static splay_tree all_contexts;
@@ -872,6 +878,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
     }
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
+  ctx->oacc_addressable_var_decls = new vec<tree> ();
 
   return ctx;
 }
@@ -953,6 +960,8 @@ delete_omp_context (splay_tree_value value)
       delete ctx->task_reduction_map;
     }
 
+  delete ctx->oacc_addressable_var_decls;
+
   XDELETE (ctx);
 }
 
@@ -6470,6 +6479,9 @@ lower_oacc_head_tail (location_t loc, tree clauses,
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
   gcc_assert (count);
+
+  ctx->oacc_partitioning_levels = count;
+
   for (unsigned done = 1; count; count--, done++)
     {
       gimple_seq fork_seq = NULL;
@@ -8144,6 +8156,68 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  tree c;
+
+  if (!ctx)
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
+      {
+	tree decl = OMP_CLAUSE_DECL (c);
+	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
+	  ctx->oacc_addressable_var_decls->safe_push (decl);
+      }
+}
+
+/* Record addressable vars declared in BINDVARS in CTX.  This information is
+   used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  if (!ctx)
+    return;
+
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    if (VAR_P (v) && TREE_ADDRESSABLE (v))
+      ctx->oacc_addressable_var_decls->safe_push (v);
+}
+
+/* Mark addressable variables which are declared implicitly or explicitly as
+   gang private with a special attribute.  These may need to have their
+   declarations altered later on in compilation (e.g. in
+   execute_oacc_device_lower or the backend, depending on how the OpenACC
+   execution model is implemented on a given target) to ensure that sharing
+   semantics are correct.  */
+
+static void
+mark_oacc_gangprivate (vec<tree> *decls)
+{
+  int i;
+  tree decl;
+
+  FOR_EACH_VEC_ELT (*decls, i, decl)
+    if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
+      {
+	if (dump_file && (dump_flags & TDF_DETAILS))
+	  {
+	    fprintf (dump_file,
+		     "Setting 'oacc gangprivate' attribute for decl:");
+	    print_generic_decl (dump_file, decl, TDF_SLIM);
+	    fputc ('\n', dump_file);
+	  }
+	DECL_ATTRIBUTES (decl)
+	  = tree_cons (get_identifier ("oacc gangprivate"),
+		       NULL, DECL_ATTRIBUTES (decl));
+      }
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -8161,6 +8235,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -8316,7 +8392,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
   if (oacc_head)
-    gimple_seq_add_seq (&body, oacc_head);
+    {
+      gimple_seq_add_seq (&body, oacc_head);
+
+      unsigned level_total = 0;
+      omp_context *thisctx;
+
+      for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
+        level_total += thisctx->oacc_partitioning_levels;
+
+      /* If the current context and parent contexts are distributed over a
+	 total of one parallelism level, we have gang partitioning.  */
+      if (level_total == 1)
+        mark_oacc_gangprivate (ctx->oacc_addressable_var_decls);
+    }
 
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
@@ -9092,6 +9181,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   clauses = gimple_omp_target_clauses (stmt);
 
+  oacc_record_private_var_clauses (ctx, clauses);
+
   gimple_seq dep_ilist = NULL;
   gimple_seq dep_olist = NULL;
   if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND))
@@ -9342,6 +9433,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      mark_oacc_gangprivate (ctx->oacc_addressable_var_decls);
+
       /* Declare all the variables created by mapping and the variables
 	 declared in the scope of the target body.  */
       record_vars_into (ctx->block_vars, child_fn);
@@ -10336,6 +10429,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/target.def b/gcc/target.def
index 96f37e0..e154b17 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1707,6 +1707,16 @@ for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelarator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 0000000..f378346
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 0000000..a4f81a3
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,95 @@
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
+	    copy(ary) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	int g, w, v;
+
+	g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	ary[ix] = (g << 16) | (w << 8) | v;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      int g = (ary[ix] >> 16) & 255;
+      int w = (ary[ix] >> 8) & 255;
+      int v = ary[ix] & 255;
+
+      gangdist[g]++;
+      workerdist[w]++;
+      vectordist[v]++;
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
new file mode 100644
index 0000000..329e8a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-w" } */
+
+int
+main (void)
+{
+#pragma acc parallel
+  foo ();
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
new file mode 100644
index 0000000..5f8a5e6
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
@@ -0,0 +1,25 @@
+! Test for "oacc gangprivate" attribute on gang-private variables
+
+! { dg-do run }
+! { dg-additional-options "-fdump-tree-omplower-details" }
+! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */
+
+program main
+  integer :: w, arr(0:31)
+
+  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
+    !$acc loop gang private(w)
+    do j = 0, 31
+      w = 0
+      !$acc loop seq
+      do i = 0, 31
+        !$acc atomic update
+        w = w + 1
+        !$acc end atomic
+      end do
+      arr(j) = w
+    end do
+  !$acc end parallel
+
+  if (any (arr .ne. 32)) stop 1
+end program main
Julian Brown June 3, 2019, 4:02 p.m. UTC | #10
On Tue, 11 Dec 2018 15:08:11 +0000
Julian Brown <julian@codesourcery.com> wrote:

> Is this version OK? Re-tested with offloading to NVPTX.

This is a ping for the patch posted here:

https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00749.html

This is a new version of the patch, rebased and with a couple of
additional bugfixes, as follows:

Firstly, in mark_oacc_gangprivate, each decl is looked up (using
maybe_lookup_decl) to apply the "oacc gangprivate" attribute to the
innermost-nested copy of the decl.

Secondly, I'd misunderstood when the maximum parallelism level was
calculated for each nested omp_context, meaning that the code to
trigger adding the "oacc gangprivate" attribute could trigger in the
wrong circumstances. I've fixed this by moving the attribute-setting to
execute_lower_omp.

I've also added a new testcase (gangprivate-attrib-2.f90). Re-tested
with offloading to nvptx.

OK for trunk?

Thank you,

Julian

2019-06-03  Julian Brown  <julian@codesourcery.com>
            Chung-Lin Tang  <cltang@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (tree-hash-traits.h): Include.
        (gangprivate_shared_size): New global variable.
        (gangprivate_shared_align): Likewise.
        (gangprivate_shared_sym): Likewise.
        (gangprivate_shared_hmap): Likewise.
        (nvptx_option_override): Initialize gangprivate_shared_sym,
        gangprivate_shared_align.
        (nvptx_file_end): Output gangprivate_shared_sym.
        (nvptx_goacc_expand_accel_var): New function.
        (nvptx_set_current_function): Initialise gangprivate_shared_hmap. Add
        function comment.
        (TARGET_GOACC_EXPAND_ACCEL): Likewise.
        * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
        * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
        * expr.c (expand_expr_real_1): Remap VAR_DECLs marked with the
        "oacc gangprivate" attribute.
        * omp-low.c (omp_context): Add oacc_partitioning_level and
        oacc_addressable_var_decls fields.
        (new_omp_context): Initialize oacc_addressable_var_decls in new
        omp_context.
        (delete_omp_context): Delete oacc_addressable_var_decls in old
        omp_context.
        (lower_oacc_head_tail): Record partitioning-level count in omp context.
        (oacc_record_private_var_clauses, oacc_record_vars_in_bind,
        mark_oacc_gangprivate): New functions.
        (lower_omp_for): Call oacc_record_private_var_clauses with "for"
        clauses.
        (lower_omp_target): Likewise, for "target" clauses.
        Call mark_oacc_gangprivate for offloaded target regions.
        (process_oacc_gangprivate_1): New function.
        (lower_omp_1): Call oacc_record_vars_in_bind for GIMPLE_BIND within OMP
        regions.
        (execute_lower_omp): Call process_oacc_gangprivate_1 for each OMP
        context.
        * target.def (expand_accel_var): New hook.

        libgomp/
        * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
        * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
        * testsuite/libgomp.oacc-c/pr85465.c: New test.
        * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
        * testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
Jakub Jelinek June 3, 2019, 4:23 p.m. UTC | #11
On Mon, Jun 03, 2019 at 05:02:45PM +0100, Julian Brown wrote:
>         * omp-low.c (omp_context): Add oacc_partitioning_level and
>         oacc_addressable_var_decls fields.
>         (new_omp_context): Initialize oacc_addressable_var_decls in new
>         omp_context.
>         (delete_omp_context): Delete oacc_addressable_var_decls in old
>         omp_context.
>         (lower_oacc_head_tail): Record partitioning-level count in omp context.
>         (oacc_record_private_var_clauses, oacc_record_vars_in_bind,
>         mark_oacc_gangprivate): New functions.
>         (lower_omp_for): Call oacc_record_private_var_clauses with "for"
>         clauses.
>         (lower_omp_target): Likewise, for "target" clauses.
>         Call mark_oacc_gangprivate for offloaded target regions.
>         (process_oacc_gangprivate_1): New function.
>         (lower_omp_1): Call oacc_record_vars_in_bind for GIMPLE_BIND within OMP
>         regions.
>         (execute_lower_omp): Call process_oacc_gangprivate_1 for each OMP
>         context.

Just commenting on the above part:

> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -137,6 +137,12 @@ struct omp_context
>  
>    /* True if this construct can be cancelled.  */
>    bool cancellable;
> +
> +  /* The number of levels of OpenACC partitioning invoked in this context.  */
> +  unsigned oacc_partitioning_levels;
> +
> +  /* Addressable variable decls in this context.  */
> +  vec<tree> *oacc_addressable_var_decls;

Why vec<tree> * rather than vec<tree>?

> @@ -878,6 +884,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx)
>      }
>  
>    ctx->cb.decl_map = new hash_map<tree, tree>;
> +  ctx->oacc_addressable_var_decls = new vec<tree> ();

You then don't have to new it here and delete below.  As the context is
cleared with XCNEW, you don't need to do anything here, and just
release when deleting.  Note, even if using a pointer for some reason was
needed (not in this case), using unconditional new for something only used
for small subset of contexts is unacceptable, it would be then desirable to
only create when needed.

>  
>    return ctx;
>  }
> @@ -960,6 +967,7 @@ delete_omp_context (splay_tree_value value)
>      }
>  
>    delete ctx->lastprivate_conditional_map;
> +  delete ctx->oacc_addressable_var_decls;
>  
>    XDELETE (ctx);
>  }
> @@ -8458,6 +8469,79 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
>      }
>  }
>  
> +/* Record vars listed in private clauses in CLAUSES in CTX.  This information
> +   is used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> +{
> +  tree c;
> +
> +  if (!ctx)
> +    return;
> +
> +  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> +      {
> +	tree decl = OMP_CLAUSE_DECL (c);
> +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> +	  ctx->oacc_addressable_var_decls->safe_push (decl);
> +      }
> +}

You don't want to do this for all GOMP_FOR or GOMP_TARGET context, I'd hope
you only want to do that for OpenACC contexts.  Perhaps it is ok
to bail out early if the context isn't OpenACC one.  On the other side, the
if (!ctx) condition makes no sense, the callers of course guarantee that ctx
is non-NULL.

> @@ -10665,6 +10774,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>  		 ctx);
>        break;
>      case GIMPLE_BIND:
> +      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));

Again, why is this done unconditionally?  It should be relevant to gather it
only in some subset of context, so guard that and don't do it otherwise.

>        lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
>        maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
>        break;
> @@ -10905,6 +11015,7 @@ execute_lower_omp (void)
>  
>    if (all_contexts)
>      {
> +      splay_tree_foreach (all_contexts, process_oacc_gangprivate_1, NULL);

Similarly.  Either guard with if (flag_openacc), or have some flag cleared
at the start of the pass and set only if you find something interesting so
that the splay_tree_foreach does something.

	Jakub
Julian Brown June 7, 2019, 2:08 p.m. UTC | #12
Hi Jakub,

Thanks for the review! I believe I've addressed all your comments in
the attached version of the patch.

On Mon, 3 Jun 2019 18:23:00 +0200
Jakub Jelinek <jakub@redhat.com> wrote:

> Why vec<tree> * rather than vec<tree>?

> > @@ -878,6 +884,7 @@ new_omp_context (gimple *stmt, omp_context
> > *outer_ctx) }
> >  
> >    ctx->cb.decl_map = new hash_map<tree, tree>;
> > +  ctx->oacc_addressable_var_decls = new vec<tree> ();  
> 
> You then don't have to new it here and delete below.  As the context
> is cleared with XCNEW, you don't need to do anything here, and just
> release when deleting.  Note, even if using a pointer for some reason
> was needed (not in this case), using unconditional new for something
> only used for small subset of contexts is unacceptable, it would be
> then desirable to only create when needed.

Fixed.

> > +/* Record vars listed in private clauses in CLAUSES in CTX.  This
> > information
> > +   is used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> > +{
> > +  tree c;
> > +
> > +  if (!ctx)
> > +    return;
> > +
> > +  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> > +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> > +      {
> > +	tree decl = OMP_CLAUSE_DECL (c);
> > +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> > +	  ctx->oacc_addressable_var_decls->safe_push (decl);
> > +      }
> > +}  
> 
> You don't want to do this for all GOMP_FOR or GOMP_TARGET context,
> I'd hope you only want to do that for OpenACC contexts.  Perhaps it
> is ok to bail out early if the context isn't OpenACC one.  On the
> other side, the if (!ctx) condition makes no sense, the callers of
> course guarantee that ctx is non-NULL.

I'm not sure where that came from -- ctx can be NULL at the top-level
of lower_omp as called from execute_lower_omp. Maybe that was left over
from an earlier version of the patch. Anyway, I've removed that bit
and fixed the patch to only call oacc_record_private_var_clauses in
OpenACC contexts.

> > @@ -10665,6 +10774,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p,
> > omp_context *ctx) ctx);
> >        break;
> >      case GIMPLE_BIND:
> > +      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind
> > *> (stmt)));  
> 
> Again, why is this done unconditionally?  It should be relevant to
> gather it only in some subset of context, so guard that and don't do
> it otherwise.

And here (where ctx *can* be NULL).

> >        lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)),
> > ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *>
> > (stmt)); break;
> > @@ -10905,6 +11015,7 @@ execute_lower_omp (void)
> >  
> >    if (all_contexts)
> >      {
> > +      splay_tree_foreach (all_contexts,
> > process_oacc_gangprivate_1, NULL);  
> 
> Similarly.  Either guard with if (flag_openacc), or have some flag
> cleared at the start of the pass and set only if you find something
> interesting so that the splay_tree_foreach does something.

I've introduced maybe_oacc_gangprivate_vars, and the splay tree walk is
only called if that's true. It's set whenever something's put in
oacc_addressable_var_decls in some omp context.

Re-tested with offloading to NVPTX. OK?

Thanks,

Julian
Jakub Jelinek June 12, 2019, 10:22 a.m. UTC | #13
On Fri, Jun 07, 2019 at 03:08:37PM +0100, Julian Brown wrote:
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index a7f35ffe416..67e1e82ec00 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -9794,6 +9882,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>  
>    if (offloaded)
>      {
> +      mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx);
> +

The above one still doesn't seem to be guarded for OpenACC constructs only.

As for the rest of the patch, you need Tom to look over the nvptx changes.

	Jakub
Tom de Vries June 12, 2019, 10:32 a.m. UTC | #14
On 12-06-19 12:22, Jakub Jelinek wrote:
> On Fri, Jun 07, 2019 at 03:08:37PM +0100, Julian Brown wrote:
>> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
>> index a7f35ffe416..67e1e82ec00 100644
>> --- a/gcc/omp-low.c
>> +++ b/gcc/omp-low.c
>> @@ -9794,6 +9882,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>>  
>>    if (offloaded)
>>      {
>> +      mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx);
>> +
> 
> The above one still doesn't seem to be guarded for OpenACC constructs only.
> 
> As for the rest of the patch, you need Tom to look over the nvptx changes.

I haven't seen any nvptx changes mentioned since I ok-ed the nvptx part
( https://gcc.gnu.org/ml/gcc-patches/2018-10/msg00324.html ), so on that
basis I'd say it's still ok.

Thanks,
- Tom
Thomas Schwinge June 12, 2019, 11:57 a.m. UTC | #15
Hi!

First, thanks for picking this up, and improving the patch you inherited.


Then, just a few individual comments, not a complete review.

(As far as I concerned, and as far as relevant, these can be addressed
later, incrementally, of course.)


I understand right that this will address some aspects of PR90115
"OpenACC: predetermined private levels for variables declared in blocks"
(so please mention that one in the ChangeLog updates, and commit log),
but it doesn't address all of these aspects (and see also Cesar's list in
<http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
and also not yet PR90114 "Predetermined private levels for variables
declared in OpenACC accelerator routines"?


On Fri, 7 Jun 2019 15:08:37 +0100, Julian Brown <julian@codesourcery.com> wrote:
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c

> @@ -5237,6 +5248,10 @@ nvptx_file_end (void)
>      write_shared_buffer (asm_out_file, vector_red_sym,
>  			 vector_red_align, vector_red_size);
>  
> +  if (gangprivate_shared_size)
> +    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
> +			 gangprivate_shared_align, gangprivate_shared_size);

Curious, what is the reason that we maintain this '__gangprivate_shared'
variable on a per-file basis instead of on a per-function basis (with
names '__gangprivate_shared_[function]', or similar), which should make
it more obvious where each block of '.shared' memory belongs to?


> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi

> +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
> +This hook, if defined, is used by accelerator target back-ends to expand
> +specially handled kinds of VAR_DECL expressions.  A particular use is to
> +place variables with specific attributes inside special accelarator
> +memories.  A return value of NULL indicates that the target does not
> +handle this VAR_DECL, and normal RTL expanding is resumed.
> +@end deftypefn

I guess I'm not terribly happy with the 'goacc.expand_accel_var' name.
Using different "memories" for specially tagged DECLs seems to be a
pretty generic concept (address spaces?), and...

> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
>        exp = SSA_NAME_VAR (ssa_name);
>        goto expand_decl_rtl;
>  
> -    case PARM_DECL:
>      case VAR_DECL:
> +      /* Allow accel compiler to handle specific cases of variables,
> +	 specifically those tagged with the "oacc gangprivate" attribute,
> +	 which may be intended to be placed in special memory in GPUs.  */
> +      if (flag_openacc && targetm.goacc.expand_accel_var)
> +	{
> +	  temp = targetm.goacc.expand_accel_var (exp);
> +	  if (temp)
> +	    return temp;
> +	}
> +      /* ... fall through ...  */
> +
> +    case PARM_DECL:

... I'm thus confused that there isn't already a generic mechanism
available in GCC, that we can just use instead of adding a new one here?
Thinking about the "address spaces" stuff in 'gcc/target.def' -- or is
that the wrong concept?  (I'm not familiar with all that, and haven't
looked closely.)


> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c

> +/* Record vars listed in private clauses in CLAUSES in CTX.  This information
> +   is used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> +{
> +  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> +      {
> +	tree decl = OMP_CLAUSE_DECL (c);
> +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> +	  {
> +	    ctx->oacc_addressable_var_decls.safe_push (decl);
> +	    maybe_oacc_gangprivate_vars = true;
> +	  }
> +      }
> +}

Are all the relevant variables addressable?  And/or, need only those be
considered?

> +/* Record addressable vars declared in BINDVARS in CTX.  This information is
> +   used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> +{
> +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> +      {
> +	ctx->oacc_addressable_var_decls.safe_push (v);
> +	maybe_oacc_gangprivate_vars = true;
> +      }
> +}

Likewise.


> +/* Mark addressable variables which are declared implicitly or explicitly as
> +   gang private with a special attribute.  These may need to have their
> +   declarations altered later on in compilation (e.g. in
> +   execute_oacc_device_lower or the backend, depending on how the OpenACC
> +   execution model is implemented on a given target) to ensure that sharing
> +   semantics are correct.  */
> +
> +static void
> +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> +{
> +  int i;
> +  tree decl;
> +
> +  FOR_EACH_VEC_ELT (*decls, i, decl)
> +    {
> +      for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer)
> +	{
> +	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
> +	  if (inner_decl)
> +	    {
> +	      decl = inner_decl;
> +	      break;
> +	    }
> +	}
> +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    {
> +	      fprintf (dump_file,
> +		       "Setting 'oacc gangprivate' attribute for decl:");
> +	      print_generic_decl (dump_file, decl, TDF_SLIM);
> +	      fputc ('\n', dump_file);
> +	    }
> +	  DECL_ATTRIBUTES (decl)
> +	    = tree_cons (get_identifier ("oacc gangprivate"),
> +			 NULL, DECL_ATTRIBUTES (decl));
> +	}
> +    }
> +}

So I'm confused how that can be done here ('omplower'), given that the
decision about how levels of parallelism (gang, worker, vector) are
assigned is only done later ('oaccdevlow'), separately/differently per
offloading target?

The following seems relevant:

> +/* Find gang-private variables in a context.  */
> +
> +static int
> +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED (data))
> +{
> +  omp_context *ctx = (omp_context *) node->value;
> +  unsigned level_total = 0;
> +  omp_context *thisctx;
> +
> +  for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
> +    level_total += thisctx->oacc_partitioning_levels;
> +
> +  /* If the current context and parent contexts are distributed over a
> +     total of one parallelism level, we have gang partitioning.  */
> +  if (level_total == 1)
> +    mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx);
> +
> +  return 0;
> +}

..., but I didn't quickly manage to grok that.  (I shall try harder,
later on.)

But still then, this looks like it might work for the outer level (gang)
only (because all offloading targets are expected to assign gang level to
the outermost loop -- might that be the underlying assumption?), but it
won't work for inner loop/privatization levels?  (..., which I understand
this patch isn't doing anything about.)


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-w" } */
> +
> +int
> +main (void)
> +{
> +#pragma acc parallel
> +  foo ();
> +
> +  return 0;
> +}

I think that given your re-work of the implementation (move stuff from
front ends into OMP lowering) this test case isn't relevant anymore (was
a front end ICE).


> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
> @@ -0,0 +1,25 @@
> +! Test for "oacc gangprivate" attribute on gang-private variables
> +
> +! { dg-do run }
> +! { dg-additional-options "-fdump-tree-omplower-details" }
> +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */

I prefer if such scanning is placed close to relevant source code
constructs, so I'd move this 'scan-tree-dump-times'...

> +
> +program main
> +  integer :: w, arr(0:31)
> +
> +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> +    !$acc loop gang private(w)

... here.

(Just to make sure, a Fortran 'integer' will always be
'integer(kind=4)'?)

> +    do j = 0, 31
> +      w = 0
> +      !$acc loop seq
> +      do i = 0, 31
> +        !$acc atomic update
> +        w = w + 1
> +        !$acc end atomic
> +      end do
> +      arr(j) = w
> +    end do
> +  !$acc end parallel
> +
> +  if (any (arr .ne. 32)) stop 1
> +end program main

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
> @@ -0,0 +1,23 @@
> +! Test for lack of "oacc gangprivate" attribute on worker-private variables
> +
> +! { dg-do run }
> +! { dg-additional-options "-fdump-tree-omplower-details" }
> +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl" 0 "omplower" } } */

Likewise...

> +
> +program main
> +  integer :: w, arr(0:31)
> +
> +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> +    !$acc loop gang worker private(w)

... here (I suppose).

> +    do j = 0, 31
> +      w = 0
> +      !$acc loop seq
> +      do i = 0, 31
> +        w = w + 1
> +      end do
> +      arr(j) = w
> +    end do
> +  !$acc end parallel
> +
> +  if (any (arr .ne. 32)) stop 1
> +end program main


Grüße
 Thomas
Julian Brown June 12, 2019, 7:42 p.m. UTC | #16
On Wed, 12 Jun 2019 13:57:22 +0200
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi!
> 
> First, thanks for picking this up, and improving the patch you
> inherited.

Thanks for review!

> I understand right that this will address some aspects of PR90115
> "OpenACC: predetermined private levels for variables declared in
> blocks" (so please mention that one in the ChangeLog updates, and
> commit log), but it doesn't address all of these aspects (and see
> also Cesar's list in
> <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
> and also not yet PR90114 "Predetermined private levels for variables
> declared in OpenACC accelerator routines"?

There's two possible reasons for placing gang-private variables in
shared memory: correct implementation of OpenACC semantics, or
optimisation, since shared memory is faster than local memory (on NVidia
devices). Handling of private variables is intimately tied with the
execution model for gangs/workers/vectors implemented by a particular
target: for PTX, that's handled in the backend using a
broadcasting/neutering scheme.

That is sufficient for code that e.g. sets a variable in worker-single
mode and expects to use the value in worker-partitioned mode. The
difficulty (semantics-wise) comes when the user wants to do something
like an atomic operation in worker-partitioned mode and expects a
worker-single variable to be shared across each partitioned worker.
Forcing use of shared memory for such variables makes that work
properly.

It is *not* sufficient for the next level down, though -- expecting to
perform atomic operations in vector-partitioned mode on a variable
that is declared in vector-single mode, i.e. so that it is supposed to
be shared across all vector elements. AFAIK, that's not
straightforward, and we haven't attempted to implement it.

I think the original motivation for this patch was optimisation, though
-- typical code won't try to use atomics in this way. Cesar's list of
caveats that you linked to seems to support that notion.

> On Fri, 7 Jun 2019 15:08:37 +0100, Julian Brown
> <julian@codesourcery.com> wrote:
> > --- a/gcc/config/nvptx/nvptx.c
> > +++ b/gcc/config/nvptx/nvptx.c  
> 
> > @@ -5237,6 +5248,10 @@ nvptx_file_end (void)
> >      write_shared_buffer (asm_out_file, vector_red_sym,
> >  			 vector_red_align, vector_red_size);
> >  
> > +  if (gangprivate_shared_size)
> > +    write_shared_buffer (asm_out_file, gangprivate_shared_sym,
> > +			 gangprivate_shared_align,
> > gangprivate_shared_size);  
> 
> Curious, what is the reason that we maintain this
> '__gangprivate_shared' variable on a per-file basis instead of on a
> per-function basis (with names '__gangprivate_shared_[function]', or
> similar), which should make it more obvious where each block of
> '.shared' memory belongs to?

I can't comment on that, I'm afraid that was a part of the patch that I
inherited and didn't alter much...

> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi  
> 
> > +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree
> > @var{var}) +This hook, if defined, is used by accelerator target
> > back-ends to expand +specially handled kinds of VAR_DECL
> > expressions.  A particular use is to +place variables with specific
> > attributes inside special accelarator +memories.  A return value of
> > NULL indicates that the target does not +handle this VAR_DECL, and
> > normal RTL expanding is resumed. +@end deftypefn  
> 
> I guess I'm not terribly happy with the 'goacc.expand_accel_var' name.
> Using different "memories" for specially tagged DECLs seems to be a
> pretty generic concept (address spaces?), and...

This is partly another NVPTX weirdness -- the target uses address
spaces, but only within the backend, and without using the generic
middle-end address space machinery. The other reason for using an
attribute instead of assigning an address space is that the former can
be detected by the target compiler, but will be ignored by the host
compiler. Forcing use of an address space this early would mean that
the same non-standard address space would have to make sense for both
host and offloaded code.

For AMD GCN, we do use the generic address space support, and I found
that I could re-use the "oacc gangprivate" attribute -- but not the
expand_accel_var hook (expand time is too late for that target).
Instead, another new hook "TARGET_GOACC_ADJUST_GANGPRIVATE_DECL" is
called from omp-offload.c:execute_oacc_device_lower for variables that
have the "oacc gangprivate" attribute set. Those bits haven't been
posted upstream yet, though.

> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target,
> > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name);
> >        goto expand_decl_rtl;
> >  
> > -    case PARM_DECL:
> >      case VAR_DECL:
> > +      /* Allow accel compiler to handle specific cases of
> > variables,
> > +	 specifically those tagged with the "oacc gangprivate"
> > attribute,
> > +	 which may be intended to be placed in special memory in
> > GPUs.  */
> > +      if (flag_openacc && targetm.goacc.expand_accel_var)
> > +	{
> > +	  temp = targetm.goacc.expand_accel_var (exp);
> > +	  if (temp)
> > +	    return temp;
> > +	}
> > +      /* ... fall through ...  */
> > +
> > +    case PARM_DECL:  
> 
> ... I'm thus confused that there isn't already a generic mechanism
> available in GCC, that we can just use instead of adding a new one
> here? Thinking about the "address spaces" stuff in 'gcc/target.def'
> -- or is that the wrong concept?  (I'm not familiar with all that,
> and haven't looked closely.)

Same point again -- the same address space would have to be supported
on the host and offload compiler. I'm happy to accept suggestions for
another name for the hook though?

> > --- a/gcc/omp-low.c
> > +++ b/gcc/omp-low.c  
> 
> > +/* Record vars listed in private clauses in CLAUSES in CTX.  This
> > information
> > +   is used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> > +{
> > +  for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> > +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> > +      {
> > +	tree decl = OMP_CLAUSE_DECL (c);
> > +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> > +	  {
> > +	    ctx->oacc_addressable_var_decls.safe_push (decl);
> > +	    maybe_oacc_gangprivate_vars = true;
> > +	  }
> > +      }
> > +}  
> 
> Are all the relevant variables addressable?  And/or, need only those
> be considered?

Yes, I believe so. At least from a correctness perspective, a
non-addressable variable can't be accessed outside the current thread,
so it can go in a (faster than shared memory) register -- though that
register may need to be broadcast in some circumstances. A variable can
only meaningfully be "shared" across workers or vector lanes if its
address is taken, e.g. by a call to an atomic builtin.

From an optimisation perspective, the answer might be fuzzier: maybe
sometimes, using shared memory directly would be faster than
broadcasting.

> > +/* Record addressable vars declared in BINDVARS in CTX.  This
> > information is
> > +   used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> > +{
> > +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> > +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> > +      {
> > +	ctx->oacc_addressable_var_decls.safe_push (v);
> > +	maybe_oacc_gangprivate_vars = true;
> > +      }
> > +}  
> 
> Likewise.
> 
> 
> > +/* Mark addressable variables which are declared implicitly or
> > explicitly as
> > +   gang private with a special attribute.  These may need to have
> > their
> > +   declarations altered later on in compilation (e.g. in
> > +   execute_oacc_device_lower or the backend, depending on how the
> > OpenACC
> > +   execution model is implemented on a given target) to ensure
> > that sharing
> > +   semantics are correct.  */
> > +
> > +static void
> > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> > +{
> > +  int i;
> > +  tree decl;
> > +
> > +  FOR_EACH_VEC_ELT (*decls, i, decl)
> > +    {
> > +      for (omp_context *thisctx = ctx; thisctx; thisctx =
> > thisctx->outer)
> > +	{
> > +	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
> > +	  if (inner_decl)
> > +	    {
> > +	      decl = inner_decl;
> > +	      break;
> > +	    }
> > +	}
> > +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES
> > (decl)))
> > +	{
> > +	  if (dump_file && (dump_flags & TDF_DETAILS))
> > +	    {
> > +	      fprintf (dump_file,
> > +		       "Setting 'oacc gangprivate' attribute for
> > decl:");
> > +	      print_generic_decl (dump_file, decl, TDF_SLIM);
> > +	      fputc ('\n', dump_file);
> > +	    }
> > +	  DECL_ATTRIBUTES (decl)
> > +	    = tree_cons (get_identifier ("oacc gangprivate"),
> > +			 NULL, DECL_ATTRIBUTES (decl));
> > +	}
> > +    }
> > +}  
> 
> So I'm confused how that can be done here ('omplower'), given that the
> decision about how levels of parallelism (gang, worker, vector) are
> assigned is only done later ('oaccdevlow'), separately/differently per
> offloading target?
> 
> The following seems relevant:
> 
> > +/* Find gang-private variables in a context.  */
> > +
> > +static int
> > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED
> > (data)) +{
> > +  omp_context *ctx = (omp_context *) node->value;
> > +  unsigned level_total = 0;
> > +  omp_context *thisctx;
> > +
> > +  for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
> > +    level_total += thisctx->oacc_partitioning_levels;
> > +
> > +  /* If the current context and parent contexts are distributed
> > over a
> > +     total of one parallelism level, we have gang partitioning.  */
> > +  if (level_total == 1)
> > +    mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx);
> > +
> > +  return 0;
> > +}  
> 
> ..., but I didn't quickly manage to grok that.  (I shall try harder,
> later on.)
> 
> But still then, this looks like it might work for the outer level
> (gang) only (because all offloading targets are expected to assign
> gang level to the outermost loop -- might that be the underlying
> assumption?), but it won't work for inner loop/privatization levels?
> (..., which I understand this patch isn't doing anything about.)

The "oacc gangprivate" only applies to variables that are (addressable
and) private per-gang, but the attribute marking works on both
top-level "acc parallel" directives and "acc loop" directives below
that -- so long as they don't explicitly use parallelism finer than
"gang" level. It also works on variables declared private() using
OpenACC clauses in all supported languages, or those that are declared
in an appropriate C/C++ scope.

At least for loops with reductions, gang-partitioned loops have
different semantics from worker and vector-partitioned loops. So I
think in general, it must be the case that it is possible to analyse
OpenACC code "lexically" to determine which loops are gang partitioned,
and which are partitioned at finer levels. It can't be deferred
entirely to the target. It's been a while since I read those bits of
the standard, though!

But yes, in GCC, omp-low only tries to calculate the maximum
partitioning level for each loop nest. The final determination isn't
made until oaccdevlow time. That's OK if shared memory is being used
only as an optimisation, much less OK if it's a necessary part of
implementing OpenACC semantics properly. It might be more of an issue
if we tried to support "vector-shared" variables properly.

> > --- /dev/null
> > +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
> > @@ -0,0 +1,11 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-w" } */
> > +
> > +int
> > +main (void)
> > +{
> > +#pragma acc parallel
> > +  foo ();
> > +
> > +  return 0;
> > +}  
> 
> I think that given your re-work of the implementation (move stuff from
> front ends into OMP lowering) this test case isn't relevant anymore
> (was a front end ICE).

OK, I can remove that.

> > --- /dev/null
> > +++
> > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90
> > @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on
> > gang-private variables +
> > +! { dg-do run }
> > +! { dg-additional-options "-fdump-tree-omplower-details" }
> > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate'
> > attribute for decl:  integer\\(kind=4\\) w;" 1 "omplower" } } */  
> 
> I prefer if such scanning is placed close to relevant source code
> constructs, so I'd move this 'scan-tree-dump-times'...
> 
> > +
> > +program main
> > +  integer :: w, arr(0:31)
> > +
> > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > +    !$acc loop gang private(w)  
> 
> ... here.
> 
> (Just to make sure, a Fortran 'integer' will always be
> 'integer(kind=4)'?)

No idea! I can check.

> > +    do j = 0, 31
> > +      w = 0
> > +      !$acc loop seq
> > +      do i = 0, 31
> > +        !$acc atomic update
> > +        w = w + 1
> > +        !$acc end atomic
> > +      end do
> > +      arr(j) = w
> > +    end do
> > +  !$acc end parallel
> > +
> > +  if (any (arr .ne. 32)) stop 1
> > +end program main  
> 
> > --- /dev/null
> > +++
> > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90
> > @@ -0,0 +1,23 @@ +! Test for lack of "oacc gangprivate" attribute
> > on worker-private variables +
> > +! { dg-do run }
> > +! { dg-additional-options "-fdump-tree-omplower-details" }
> > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate'
> > attribute for decl" 0 "omplower" } } */  
> 
> Likewise...
> 
> > +
> > +program main
> > +  integer :: w, arr(0:31)
> > +
> > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > +    !$acc loop gang worker private(w)  
> 
> ... here (I suppose).
> 
> > +    do j = 0, 31
> > +      w = 0
> > +      !$acc loop seq
> > +      do i = 0, 31
> > +        w = w + 1
> > +      end do
> > +      arr(j) = w
> > +    end do
> > +  !$acc end parallel
> > +
> > +  if (any (arr .ne. 32)) stop 1
> > +end program main  

Thanks,

Julian
Julian Brown Nov. 6, 2019, 10:59 p.m. UTC | #17
Hi!

This is a new patch that takes a different approach to the last-posted
version in this thread. I have combined the previous incremental patches
on the og9 branch that culminated in the following patch:

https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01220.html

From that email, the following explanation was given of the previous
approaches taken as to how the partitioning level for OpenACC "private"
variables was calculated and represented in the compiler, and how this
patch differs:

 - The first (by Chung-Lin Tang) recorded which variables should be
   made private per-gang in each front end (i.e. separately in C, C++
   and Fortran) using a new attribute "oacc gangprivate". This was
   deemed too early; the final determination about which loops are
   assigned which parallelism level has not yet been made at parse time.

 - The second, last discussed here:

     https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00726.html

   moved the analysis of OpenACC contexts to determine parallelism
   levels to omp-low.c (but kept the "oacc gangprivate" attribute and
   the NVPTX backend parts). However (as mentioned in that mail), this
   is still too early: in fact the final determination of the
   parallelism level for each loop (especially for loops without
   explicit gang/worker/vector clauses) does not happen until we reach
   the device compiler, in the oaccloops pass.

This patch builds on the second approach, but delays fixing the
parallelism level of each "private" variable (those that are
addressable, and declared private using OpenACC clauses or by defining
them in a scope nested within a compute region or partitioned loop)
until the oaccdevlow pass. This is done by adding a new internal UNIQUE
function (OACC_PRIVATE) that lists (the address of) each private
variable as an argument. These new internal functions fit into the
existing scheme for demarking OpenACC loops, as described in comments
in the patch.

Use of the "oacc gangprivate" attribute is now restricted to the NVPTX
backend (and could probably be replaced with some lighter-weight
mechanism as a followup).

I realised I omitted to make some of the cosmetic changes Thomas
highlighted below on starting to write this email, but I can do that
(with suitable retesting) if desired before committing.

On Wed, 12 Jun 2019 20:42:16 +0100
Julian Brown <julian@codesourcery.com> wrote:

> On Wed, 12 Jun 2019 13:57:22 +0200
> Thomas Schwinge <thomas@codesourcery.com> wrote:
> 
> > I understand right that this will address some aspects of PR90115
> > "OpenACC: predetermined private levels for variables declared in
> > blocks" (so please mention that one in the ChangeLog updates, and
> > commit log), but it doesn't address all of these aspects (and see
> > also Cesar's list in
> > <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>),
> > and also not yet PR90114 "Predetermined private levels for variables
> > declared in OpenACC accelerator routines"?  
> 
> There's two possible reasons for placing gang-private variables in
> shared memory: correct implementation of OpenACC semantics, or
> optimisation, since shared memory is faster than local memory (on
> NVidia devices). Handling of private variables is intimately tied
> with the execution model for gangs/workers/vectors implemented by a
> particular target: for PTX, that's handled in the backend using a
> broadcasting/neutering scheme.
> 
> That is sufficient for code that e.g. sets a variable in worker-single
> mode and expects to use the value in worker-partitioned mode. The
> difficulty (semantics-wise) comes when the user wants to do something
> like an atomic operation in worker-partitioned mode and expects a
> worker-single variable to be shared across each partitioned worker.
> Forcing use of shared memory for such variables makes that work
> properly.
> 
> It is *not* sufficient for the next level down, though -- expecting to
> perform atomic operations in vector-partitioned mode on a variable
> that is declared in vector-single mode, i.e. so that it is supposed to
> be shared across all vector elements. AFAIK, that's not
> straightforward, and we haven't attempted to implement it.
> 
> I think the original motivation for this patch was optimisation,
> though -- typical code won't try to use atomics in this way. Cesar's
> list of caveats that you linked to seems to support that notion.

After a little further investigation, I came to the conclusion that the
patch was always originally about correctness, but optimisation. But
that's largely academic now.

> > I guess I'm not terribly happy with the 'goacc.expand_accel_var'
> > name. Using different "memories" for specially tagged DECLs seems
> > to be a pretty generic concept (address spaces?), and...  
> 
> This is partly another NVPTX weirdness -- the target uses address
> spaces, but only within the backend, and without using the generic
> middle-end address space machinery. The other reason for using an
> attribute instead of assigning an address space is that the former can
> be detected by the target compiler, but will be ignored by the host
> compiler. Forcing use of an address space this early would mean that
> the same non-standard address space would have to make sense for both
> host and offloaded code.
> 
> For AMD GCN, we do use the generic address space support, and I found
> that I could re-use the "oacc gangprivate" attribute -- but not the
> expand_accel_var hook (expand time is too late for that target).
> Instead, another new hook "TARGET_GOACC_ADJUST_GANGPRIVATE_DECL" is
> called from omp-offload.c:execute_oacc_device_lower for variables that
> have the "oacc gangprivate" attribute set. Those bits haven't been
> posted upstream yet, though.

This patch uses both target hooks -- the
TARGET_GOACC_ADJUST_PRIVATE_DECL (renamed), and
TARGET_GOACC_EXPAND_ACCEL_VAR. The first can tweak the decl at
oaccdevlow time, and the second at expand time. This version of the
patch doesn't provide full support for gang-private variables on AMD
GCN yet though, since that depends on other code that hasn't been
upstreamed yet. (GCN works with the equivalent patch to this on the og9
branch though.)

> > > --- a/gcc/expr.c
> > > +++ b/gcc/expr.c
> > > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target,
> > > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name);
> > >        goto expand_decl_rtl;
> > >  
> > > -    case PARM_DECL:
> > >      case VAR_DECL:
> > > +      /* Allow accel compiler to handle specific cases of
> > > variables,
> > > +	 specifically those tagged with the "oacc gangprivate"
> > > attribute,
> > > +	 which may be intended to be placed in special memory in
> > > GPUs.  */
> > > +      if (flag_openacc && targetm.goacc.expand_accel_var)
> > > +	{
> > > +	  temp = targetm.goacc.expand_accel_var (exp);
> > > +	  if (temp)
> > > +	    return temp;
> > > +	}
> > > +      /* ... fall through ...  */
> > > +
> > > +    case PARM_DECL:    
> > 
> > ... I'm thus confused that there isn't already a generic mechanism
> > available in GCC, that we can just use instead of adding a new one
> > here? Thinking about the "address spaces" stuff in 'gcc/target.def'
> > -- or is that the wrong concept?  (I'm not familiar with all that,
> > and haven't looked closely.)  
> 
> Same point again -- the same address space would have to be supported
> on the host and offload compiler. I'm happy to accept suggestions for
> another name for the hook though?

(Still not renamed in this version, sorry.)

> > > +/* Mark addressable variables which are declared implicitly or
> > > explicitly as
> > > +   gang private with a special attribute.  These may need to have
> > > their
> > > +   declarations altered later on in compilation (e.g. in
> > > +   execute_oacc_device_lower or the backend, depending on how the
> > > OpenACC
> > > +   execution model is implemented on a given target) to ensure
> > > that sharing
> > > +   semantics are correct.  */
> > > +
> > > +static void
> > > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> > > +{
> > > +  int i;
> > > +  tree decl;
> > > +
> > > +  FOR_EACH_VEC_ELT (*decls, i, decl)
> > > +    {
> > > +      for (omp_context *thisctx = ctx; thisctx; thisctx =
> > > thisctx->outer)
> > > +	{
> > > +	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
> > > +	  if (inner_decl)
> > > +	    {
> > > +	      decl = inner_decl;
> > > +	      break;
> > > +	    }
> > > +	}
> > > +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES
> > > (decl)))
> > > +	{
> > > +	  if (dump_file && (dump_flags & TDF_DETAILS))
> > > +	    {
> > > +	      fprintf (dump_file,
> > > +		       "Setting 'oacc gangprivate' attribute for
> > > decl:");
> > > +	      print_generic_decl (dump_file, decl, TDF_SLIM);
> > > +	      fputc ('\n', dump_file);
> > > +	    }
> > > +	  DECL_ATTRIBUTES (decl)
> > > +	    = tree_cons (get_identifier ("oacc gangprivate"),
> > > +			 NULL, DECL_ATTRIBUTES (decl));
> > > +	}
> > > +    }
> > > +}    
> > 
> > So I'm confused how that can be done here ('omplower'), given that
> > the decision about how levels of parallelism (gang, worker, vector)
> > are assigned is only done later ('oaccdevlow'),
> > separately/differently per offloading target?
> > 
> > The following seems relevant:
> >   
> > > +/* Find gang-private variables in a context.  */
> > > +
> > > +static int
> > > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED
> > > (data)) +{
> > > +  omp_context *ctx = (omp_context *) node->value;
> > > +  unsigned level_total = 0;
> > > +  omp_context *thisctx;
> > > +
> > > +  for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
> > > +    level_total += thisctx->oacc_partitioning_levels;
> > > +
> > > +  /* If the current context and parent contexts are distributed
> > > over a
> > > +     total of one parallelism level, we have gang partitioning.
> > > */
> > > +  if (level_total == 1)
> > > +    mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls,
> > > ctx); +
> > > +  return 0;
> > > +}    
> > 
> > ..., but I didn't quickly manage to grok that.  (I shall try harder,
> > later on.)
> > 
> > But still then, this looks like it might work for the outer level
> > (gang) only (because all offloading targets are expected to assign
> > gang level to the outermost loop -- might that be the underlying
> > assumption?), but it won't work for inner loop/privatization levels?
> > (..., which I understand this patch isn't doing anything about.)  
> 
> The "oacc gangprivate" only applies to variables that are (addressable
> and) private per-gang, but the attribute marking works on both
> top-level "acc parallel" directives and "acc loop" directives below
> that -- so long as they don't explicitly use parallelism finer than
> "gang" level. It also works on variables declared private() using
> OpenACC clauses in all supported languages, or those that are declared
> in an appropriate C/C++ scope.
> 
> At least for loops with reductions, gang-partitioned loops have
> different semantics from worker and vector-partitioned loops. So I
> think in general, it must be the case that it is possible to analyse
> OpenACC code "lexically" to determine which loops are gang
> partitioned, and which are partitioned at finer levels. It can't be
> deferred entirely to the target. It's been a while since I read those
> bits of the standard, though!
> 
> But yes, in GCC, omp-low only tries to calculate the maximum
> partitioning level for each loop nest. The final determination isn't
> made until oaccdevlow time. That's OK if shared memory is being used
> only as an optimisation, much less OK if it's a necessary part of
> implementing OpenACC semantics properly. It might be more of an issue
> if we tried to support "vector-shared" variables properly.

So: this version moves the partitioning-level calculation for private
variables out of omp-low, so this isn't an issue any more. Variables
are privatized according to the "true" partitioning level of the scope
inside the parallel region that they are associated with (i.e.
"parallel" region, or loop).

> > > +
> > > +program main
> > > +  integer :: w, arr(0:31)
> > > +
> > > +  !$acc parallel num_gangs(32) num_workers(32) copyout(arr)
> > > +    !$acc loop gang private(w)    
> > 
> > ... here.
> > 
> > (Just to make sure, a Fortran 'integer' will always be
> > 'integer(kind=4)'?)  
> 
> No idea! I can check.

That's a yes, I think.

Re-tested with offloading to nvptx. OK for mainline?

Thanks,

Julian

2019-11-06  Julian Brown  <julian@codesourcery.com>
	    Chung-Lin Tang  <cltang@codesourcery.com>

	gcc/
	* config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this.  Add and use LEVEL parameter.
	* config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl): Rename
	to...
	(gcn_goacc_adjust_private_decl): ...this. Add LEVEL parameter.
	* config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Delete.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL): Define using renamed
	gcn_goacc_adjust_private_decl.
	* config/nvptx/nvptx.c (tree-hash-traits.h, tree-pretty-print.h):
	Include.
	(gangprivate_shared_size): New global variable.
	(gangprivate_shared_align): Likewise.
	(gangprivate_shared_sym): Likewise.
	(gangprivate_shared_hmap): Likewise.
	(nvptx_option_override): Initialize gangprivate_shared_sym,
	gangprivate_shared_align.
	(nvptx_file_end): Output gangprivate_shared_sym.
	(nvptx_goacc_adjust_private_decl): New function.
	(nvptx_goacc_expand_accel_var): New function.
	(nvptx_set_current_function): New function.
	(TARGET_GOACC_ADJUST_PRIVATE_DECL, TARGET_GOACC_EXPAND_ACCEL_VAR):
	Define hooks.
	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR,
	TARGET_GOACC_ADJUST_PRIVATE_DECL): Place new documentation hooks.
	* doc/tm.texi: Regenerate.
	* expr.c (expand_expr_real_1): Expand decls using the expand_accel_var
	OpenACC hook if defined.
	* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE.
	* internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE.
	* omp-low.c (omp_context): Add oacc_addressable_var_decls field.
	(new_omp_context): Initialize oacc_addressable_var_decls in new
	omp_context.
	(delete_omp_context): Delete oacc_addressable_var_decls in old
	omp_context.
	(lower_oacc_reductions): Add PRIVATE_MARKER parameter.  Insert private
	marker before fork.
	(lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify private
	marker's gimple call arguments, and pass it to lower_oacc_reductions.
	(oacc_record_private_var_clauses, oacc_record_vars_in_bind,
	make_oacc_private_marker): New functions.
	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
	clauses.  Call oacc_record_vars_in_bind for OpenACC contexts.  Create
	private marker and pass to lower_oacc_head_tail.
	(lower_omp_target): Create private marker and pass to
	lower_oacc_reductions.
	(lower_omp_1): Call oacc_record_vars_in_bind for OpenACC bind contexts.
	* omp-offload.c (convert.h): Include.
	(oacc_loop_xform_head_tail): Treat private-variable markers like
	fork/join when transforming head/tail sequences.
	(execute_oacc_device_lower): Use IFN_UNIQUE_OACC_PRIVATE to determine
	partitioning level of private variables, and process any found via
	adjust_private_decl target hook.
	* target.def (expand_accel_var, adjust_private_decl): New target hooks.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
Thomas Schwinge May 21, 2021, 7:05 p.m. UTC | #18
Hi!

On 2019-06-07T15:08:37+0100, Julian Brown <julian@codesourcery.com> wrote:
> Hi Jakub,
>
> Thanks for the review! I believe I've addressed all your comments in
> the attached version of the patch.
>
> On Mon, 3 Jun 2019 18:23:00 +0200
> Jakub Jelinek <jakub@redhat.com> wrote:
>> > +/* Record vars listed in private clauses in CLAUSES in CTX.  This information
>> > +   is used to mark up variables that should be made private per-gang.  */
>> > +
>> > +static void
>> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
>> > +{
>> > +  [...]
>> > +}
>>
>> You don't want to do this for all GOMP_FOR or GOMP_TARGET context,
>> I'd hope you only want to do that for OpenACC contexts.

> I've [...] fixed the patch to only call oacc_record_private_var_clauses in
> OpenACC contexts.

> commit 6c2a018b940d0b132395048b0600f7d897319ee2
> Author: Julian Brown <julian@codesourcery.com>
> Date:   Thu Aug 9 20:27:04 2018 -0700
>
>     [OpenACC] Add support for gang local storage allocation in shared memory

> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c

> @@ -8599,6 +8681,9 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
>
>    push_gimplify_context ();
>
> +  if (is_gimple_omp_oacc (ctx->stmt))
> +    oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
> +
>    lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
>
>    block = make_node (BLOCK);

So, yes -- but then, apparently, that again got lost in a later version
of the patch.  ;-)

I've pushed "[OpenACC privatization] Don't evaluate OpenMP 'for' clauses
[PR90115]" to master branch in commit
3a285ebd0cf5ab762726018515d23280fa6dd445, see attached.


Grüße
 Thomas


-----------------
Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Thomas Schwinge Feb. 14, 2022, 3:56 p.m. UTC | #19
Hi Julian!

Two more questions here, in context of <https://gcc.gnu.org/PR102330>
"[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932 since
r12-980-g29a2f51806c":

On 2019-06-03T17:02:45+0100, Julian Brown <julian@codesourcery.com> wrote:
> This is a new version of the patch, rebased

The code as we've now got it in master branch has changed some more, but
I think the behavior I'm seeing may have been introduced here:

> and with a couple of
> additional bugfixes, as follows:
>
> Firstly, in mark_oacc_gangprivate, each decl is looked up (using
> maybe_lookup_decl) to apply the "oacc gangprivate" attribute to the
> innermost-nested copy of the decl.

> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -137,6 +137,12 @@ struct omp_context

> +  /* Addressable variable decls in this context.  */
> +  vec<tree> *oacc_addressable_var_decls;
>  };

> +/* Record vars listed in private clauses in CLAUSES in CTX.  This information
> +   is used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> +{
> +  tree c;
> +
> +  if (!ctx)
> +    return;
> +
> +  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> +      {
> +     tree decl = OMP_CLAUSE_DECL (c);
> +     if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> +       ctx->oacc_addressable_var_decls->safe_push (decl);
> +      }
> +}

So, here we analyze 'OMP_CLAUSE_DECL (c)' (as is, without translation
through 'lookup_decl (decl, ctx)')...

> +/* Record addressable vars declared in BINDVARS in CTX.  This information is
> +   used to mark up variables that should be made private per-gang.  */
> +
> +static void
> +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> +{
> +  if (!ctx)
> +    return;
> +
> +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> +      ctx->oacc_addressable_var_decls->safe_push (v);
> +}

..., and similarly here analyze 'v' (without 'lookup_decl (v, ctx)')...

> +/* Mark addressable variables which are declared implicitly or explicitly as
> +   gang private with a special attribute.  These may need to have their
> +   declarations altered later on in compilation (e.g. in
> +   execute_oacc_device_lower or the backend, depending on how the OpenACC
> +   execution model is implemented on a given target) to ensure that sharing
> +   semantics are correct.  */
> +
> +static void
> +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> +{
> +  int i;
> +  tree decl;
> +
> +  FOR_EACH_VEC_ELT (*decls, i, decl)
> +    {
> +      for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer)
> +     {
> +       tree inner_decl = maybe_lookup_decl (decl, thisctx);
> +       if (inner_decl)
> +         {
> +           decl = inner_decl;
> +           break;
> +         }
> +     }
> +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl)))
> +     {
> +       if (dump_file && (dump_flags & TDF_DETAILS))
> +         {
> +           fprintf (dump_file,
> +                    "Setting 'oacc gangprivate' attribute for decl:");
> +           print_generic_decl (dump_file, decl, TDF_SLIM);
> +           fputc ('\n', dump_file);
> +         }
> +       DECL_ATTRIBUTES (decl)
> +         = tree_cons (get_identifier ("oacc gangprivate"),
> +                      NULL, DECL_ATTRIBUTES (decl));
> +     }
> +    }
> +}

..., but here we action on the 'maybe_lookup_decl'-translated
'inner_decl', if applicable.  In certain cases that one may be different
from the original 'decl'.  (In particular (only?), when the OMP lowering
has made 'decl' "late 'TREE_ADDRESSABLE'".)  This assymetry I understand
to give rise to <https://gcc.gnu.org/PR102330> "[12 Regression] ICE in
expand_gimple_stmt_1, at cfgexpand.c:3932 since r12-980-g29a2f51806c".

It makes sense to me that we do the OpenACC privatization on the
'lookup_decl' -- but shouldn't we then do that in the analysis phase,
too?  (This appears to work fine for OpenACC 'private' clauses (..., and
avoids marking a few as addressable/gang-private), and for those in
'gimple_bind_vars' it doesn't seem to make a difference (for the current
test cases and/or compiler transformations).)

And, second question: what case did you run into or foresee, that you
here need the 'thisctx' loop and 'maybe_lookup_decl', instead of a plain
'lookup_decl (decl, ctx)'?  Per my testing that's sufficient.

Unless you think this needs more consideration, I suggest to do these two
changes.  (I have a WIP patch in testing.)


Grüße
 Thomas
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Julian Brown Feb. 15, 2022, 1:40 p.m. UTC | #20
On Mon, 14 Feb 2022 16:56:35 +0100
Thomas Schwinge <thomas@codesourcery.com> wrote:

> Hi Julian!
> 
> Two more questions here, in context of <https://gcc.gnu.org/PR102330>
> "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932
> since r12-980-g29a2f51806c":
> 
> On 2019-06-03T17:02:45+0100, Julian Brown <julian@codesourcery.com>
> wrote:
> > +/* Record vars listed in private clauses in CLAUSES in CTX.  This
> > information
> > +   is used to mark up variables that should be made private
> > per-gang.  */ +
> > +static void
> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
> > +{
> > +  tree c;
> > +
> > +  if (!ctx)
> > +    return;
> > +
> > +  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
> > +    if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE)
> > +      {
> > +	tree decl = OMP_CLAUSE_DECL (c);
> > +	if (VAR_P (decl) && TREE_ADDRESSABLE (decl))
> > +	  ctx->oacc_addressable_var_decls->safe_push (decl);
> > +      }
> > +}  
> 
> So, here we analyze 'OMP_CLAUSE_DECL (c)' (as is, without translation
> through 'lookup_decl (decl, ctx)')...

I think you're right that this one should be using lookup_decl, but...

> > +/* Record addressable vars declared in BINDVARS in CTX.  This
> > information is
> > +   used to mark up variables that should be made private per-gang.
> >  */ +
> > +static void
> > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
> > +{
> > +  if (!ctx)
> > +    return;
> > +
> > +  for (tree v = bindvars; v; v = DECL_CHAIN (v))
> > +    if (VAR_P (v) && TREE_ADDRESSABLE (v))
> > +      ctx->oacc_addressable_var_decls->safe_push (v);
> > +}  
> 
> ..., and similarly here analyze 'v' (without 'lookup_decl (v,
> ctx)')...

I'm not so sure about this one: if the variables are declared at a
particular binding level, I think they have to be in the current OMP
context (and thus shadow any definitions that might be present in the
parent context)? Maybe that can be confirmed via an assertion.

> > +/* Mark addressable variables which are declared implicitly or
> > explicitly as
> > +   gang private with a special attribute.  These may need to have
> > their
> > +   declarations altered later on in compilation (e.g. in
> > +   execute_oacc_device_lower or the backend, depending on how the
> > OpenACC
> > +   execution model is implemented on a given target) to ensure
> > that sharing
> > +   semantics are correct.  */
> > +
> > +static void
> > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx)
> > +{
> > +  int i;
> > +  tree decl;
> > +
> > +  FOR_EACH_VEC_ELT (*decls, i, decl)
> > +    {
> > +      for (omp_context *thisctx = ctx; thisctx; thisctx =
> > thisctx->outer)
> > +	{
> > +	  tree inner_decl = maybe_lookup_decl (decl, thisctx);
> > +	  if (inner_decl)
> > +	    {
> > +	      decl = inner_decl;
> > +	      break;
> > +	    }
> > +	}
> > +      if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES
> > (decl)))
> > +	{
> > +	  if (dump_file && (dump_flags & TDF_DETAILS))
> > +	    {
> > +	      fprintf (dump_file,
> > +		       "Setting 'oacc gangprivate' attribute for
> > decl:");
> > +	      print_generic_decl (dump_file, decl, TDF_SLIM);
> > +	      fputc ('\n', dump_file);
> > +	    }
> > +	  DECL_ATTRIBUTES (decl)
> > +	    = tree_cons (get_identifier ("oacc gangprivate"),
> > +			 NULL, DECL_ATTRIBUTES (decl));
> > +	}
> > +    }
> > +}  
> 
> ..., but here we action on the 'maybe_lookup_decl'-translated
> 'inner_decl', if applicable.  In certain cases that one may be
> different from the original 'decl'.  (In particular (only?), when the
> OMP lowering has made 'decl' "late 'TREE_ADDRESSABLE'".)  This
> assymetry I understand to give rise to <https://gcc.gnu.org/PR102330>
> "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932
> since r12-980-g29a2f51806c".
> 
> It makes sense to me that we do the OpenACC privatization on the
> 'lookup_decl' -- but shouldn't we then do that in the analysis phase,
> too?  (This appears to work fine for OpenACC 'private' clauses (...,
> and avoids marking a few as addressable/gang-private), and for those
> in 'gimple_bind_vars' it doesn't seem to make a difference (for the
> current test cases and/or compiler transformations).)

Yes, I think you're right.

> And, second question: what case did you run into or foresee, that you
> here need the 'thisctx' loop and 'maybe_lookup_decl', instead of a
> plain 'lookup_decl (decl, ctx)'?  Per my testing that's sufficient.

I'd probably misunderstood about lookup_decl walking up through parent
contexts itself... oops.

> Unless you think this needs more consideration, I suggest to do these
> two changes.  (I have a WIP patch in testing.)

Sounds good to me.

Thank you,

Julian
diff mbox series

Patch

commit 9637e7ea887e100f35d99b8d12101f9f8a9b94e3
Author: Julian Brown <julian@codesourcery.com>
Date:   Thu Aug 9 20:27:04 2018 -0700

    [OpenACC] Add support for gang local storage allocation in shared memory
    
    2018-08-10  Julian Brown  <julian@codesourcery.com>
    	    Chung-Lin Tang  <cltang@codesourcery.com>
    
    	gcc/
    	* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
    	(gangprivate_shared_size): New global variable.
    	(gangprivate_shared_align): Likewise.
    	(gangprivate_shared_sym): Likewise.
    	(gangprivate_shared_hmap): Likewise.
    	(nvptx_option_override): Initialize gangprivate_shared_sym,
    	gangprivate_shared_align.
    	(nvptx_file_end): Output gangprivate_shared_sym.
    	(nvptx_goacc_expand_accel_var): New function.
    	(nvptx_set_current_function): New function.
    	(TARGET_SET_CURRENT_FUNCTION): Define hook.
    	(TARGET_GOACC_EXPAND_ACCEL): Likewise.
    	* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
    	* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
    	* expr.c (expand_expr_real_1): Remap decls marked with the
    	"oacc gangprivate" atttribute.
    	* omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls
    	fields.
    	(new_omp_context): Initialize oacc_decls in new omp_context.
    	(delete_omp_context): Delete oacc_decls in old omp_context.
    	(lower_oacc_head_tail): Record partitioning-level count in omp context.
    	(oacc_record_private_var_clauses, oacc_record_vars_in_bind)
    	(mark_oacc_gangprivate): New functions.
    	(lower_omp_for): Call oacc_record_private_var_clauses with "for"
    	clauses.  Call mark_oacc_gangprivate for gang-partitioned loops.
    	(lower_omp_target): Call oacc_record_private_var_clauses with "target"
    	clauses.
    	Call mark_oacc_gangprivate for offloaded target regions.
    	(lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions.
    	* target.def (expand_accel_var): New hook.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
    	* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
    	* testsuite/libgomp.oacc-c/pr85465.c: New test.

diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index c0b0a2e..14eb842 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -73,6 +73,7 @@ 
 #include "cfgloop.h"
 #include "fold-const.h"
 #include "intl.h"
+#include "tree-hash-traits.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -137,6 +138,12 @@  static unsigned worker_red_size;
 static unsigned worker_red_align;
 static GTY(()) rtx worker_red_sym;
 
+/* Shared memory block for gang-private variables.  */
+static unsigned gangprivate_shared_size;
+static unsigned gangprivate_shared_align;
+static GTY(()) rtx gangprivate_shared_sym;
+static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap;
+
 /* Global lock variable, needed for 128bit worker & gang reductions.  */
 static GTY(()) tree global_lock_var;
 
@@ -210,6 +217,10 @@  nvptx_option_override (void)
   SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED);
   worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
 
+  gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared");
+  SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED);
+  gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT;
+
   diagnose_openacc_conflict (TARGET_GOMP, "-mgomp");
   diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack");
   diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt");
@@ -4968,6 +4979,10 @@  nvptx_file_end (void)
     write_worker_buffer (asm_out_file, worker_red_sym,
 			 worker_red_align, worker_red_size);
 
+  if (gangprivate_shared_size)
+    write_worker_buffer (asm_out_file, gangprivate_shared_sym,
+			 gangprivate_shared_align, gangprivate_shared_size);
+
   if (need_softstack_decl)
     {
       write_var_marker (asm_out_file, false, true, "__nvptx_stacks");
@@ -5915,6 +5930,47 @@  nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
   return false;
 }
 
+static rtx
+nvptx_goacc_expand_accel_var (tree var)
+{
+  if (TREE_CODE (var) == VAR_DECL
+      && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var)))
+    {
+      unsigned int offset, *poffset;
+      poffset = gangprivate_shared_hmap.get (var);
+      if (poffset)
+	offset = *poffset;
+      else
+	{
+	  unsigned HOST_WIDE_INT align = DECL_ALIGN (var);
+	  gangprivate_shared_size =
+	    (gangprivate_shared_size + align - 1) & ~(align - 1);
+	  if (gangprivate_shared_align < align)
+	    gangprivate_shared_align = align;
+
+	  offset = gangprivate_shared_size;
+	  bool existed = gangprivate_shared_hmap.put (var, offset);
+	  gcc_assert (!existed);
+	  gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var));
+	}
+      rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset);
+      return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr);
+    }
+  return NULL_RTX;
+}
+
+static GTY(()) tree nvptx_previous_fndecl;
+
+static void
+nvptx_set_current_function (tree fndecl)
+{
+  if (!fndecl || fndecl == nvptx_previous_fndecl)
+    return;
+
+  gangprivate_shared_hmap.empty ();
+  nvptx_previous_fndecl = fndecl;
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6051,6 +6107,12 @@  nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t)
 #undef TARGET_HAVE_SPECULATION_SAFE_VALUE
 #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed
 
+#undef TARGET_GOACC_EXPAND_ACCEL_VAR
+#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var
+
+#undef TARGET_SET_CURRENT_FUNCTION
+#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index a40f45a..fb87f67 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6064,6 +6064,14 @@  like @code{cond_add@var{m}}.  The default implementation returns a zero
 constant of type @var{type}.
 @end deftypefn
 
+@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var})
+This hook, if defined, is used by accelerator target back-ends to expand
+specially handled kinds of VAR_DECL expressions.  A particular use is to
+place variables with specific attributes inside special accelarator
+memories.  A return value of NULL indicates that the target does not
+handle this VAR_DECL, and normal RTL expanding is resumed.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 39a214e..beace61 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4151,6 +4151,8 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_PREFERRED_ELSE_VALUE
 
+@hook TARGET_GOACC_EXPAND_ACCEL_VAR
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/expr.c b/gcc/expr.c
index de6709d..2c62bf9 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -9854,8 +9854,19 @@  expand_expr_real_1 (tree exp, rtx target, machine_mode tmode,
       exp = SSA_NAME_VAR (ssa_name);
       goto expand_decl_rtl;
 
-    case PARM_DECL:
     case VAR_DECL:
+      /* Allow accel compiler to handle specific cases of variables,
+	 specifically those tagged with the "oacc gangprivate" attribute,
+	 which may intended to be placed in special memory in GPUs.  */
+      if (flag_openacc && targetm.goacc.expand_accel_var)
+	{
+	  temp = targetm.goacc.expand_accel_var (exp);
+	  if (temp)
+	    return temp;
+	}
+      /* ... fall through ...  */
+
+    case PARM_DECL:
       /* If a static var's type was incomplete when the decl was written,
 	 but the type is complete now, lay out the decl now.  */
       if (DECL_SIZE (exp) == 0
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 843c66f..354e182 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -124,6 +124,12 @@  struct omp_context
 
   /* True if this construct can be cancelled.  */
   bool cancellable;
+
+  /* The number of levels of OpenACC partitioning invoked in this context.  */
+  int oacc_partitioning_levels;
+
+  /* Decls in this context.  */
+  vec<tree> *oacc_decls;
 };
 
 static splay_tree all_contexts;
@@ -850,6 +856,7 @@  new_omp_context (gimple *stmt, omp_context *outer_ctx)
     }
 
   ctx->cb.decl_map = new hash_map<tree, tree>;
+  ctx->oacc_decls = new vec<tree> ();
 
   return ctx;
 }
@@ -925,6 +932,8 @@  delete_omp_context (splay_tree_value value)
   if (is_task_ctx (ctx))
     finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt));
 
+  delete ctx->oacc_decls;
+
   XDELETE (ctx);
 }
 
@@ -5716,6 +5725,9 @@  lower_oacc_head_tail (location_t loc, tree clauses,
   tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN);
 
   gcc_assert (count);
+
+  ctx->oacc_partitioning_levels = count;
+
   for (unsigned done = 1; count; count--, done++)
     {
       gimple_seq fork_seq = NULL;
@@ -6732,6 +6744,66 @@  lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p,
     }
 }
 
+/* Record vars listed in private clauses in CLAUSES in CTX.  This information
+   is used to mark up variables that should be made private per-gang.  */
+
+static void
+oacc_record_private_var_clauses (omp_context *ctx, tree clauses)
+{
+  tree c;
+
+  if (!ctx)
+    return;
+
+  for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c))
+    switch (OMP_CLAUSE_CODE (c))
+      {
+      case OMP_CLAUSE_PRIVATE:
+	{
+	  tree decl = OMP_CLAUSE_DECL (c);
+	  ctx->oacc_decls->safe_push (decl);
+	}
+	break;
+
+      default:
+	/* Empty.  */;
+      }
+}
+
+/* Record vars declared in BINDVARS in CTX.  This information is used to mark
+   up variables that should be made private per-gang.  */
+
+static void
+oacc_record_vars_in_bind (omp_context *ctx, tree bindvars)
+{
+  if (!ctx)
+    return;
+
+  for (tree v = bindvars; v; v = DECL_CHAIN (v))
+    ctx->oacc_decls->safe_push (v);
+}
+
+/* Mark variables which are declared implicitly or explicitly as gang private
+   with a special attribute.  These may need to have their declarations altered
+   later on in compilation (e.g. in execute_oacc_device_lower or the backend,
+   depending on how the OpenACC execution model is implemented on a given
+   target) to ensure that sharing semantics are correct.
+   Only variables which have their address taken need to be considered.  */
+
+static void
+mark_oacc_gangprivate (vec<tree> *decls)
+{
+  int i;
+  tree decl;
+
+  FOR_EACH_VEC_ELT (*decls, i, decl)
+    {
+      if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl))
+	DECL_ATTRIBUTES (decl)
+	  = tree_cons (get_identifier ("oacc gangprivate"),
+		       NULL, DECL_ATTRIBUTES (decl));
+    }
+}
 
 /* Lower code for an OMP loop directive.  */
 
@@ -6748,6 +6820,8 @@  lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   push_gimplify_context ();
 
+  oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt));
+
   lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx);
 
   block = make_node (BLOCK);
@@ -6878,7 +6952,20 @@  lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   /* Add OpenACC partitioning and reduction markers just before the loop.  */
   if (oacc_head)
-    gimple_seq_add_seq (&body, oacc_head);
+    {
+      gimple_seq_add_seq (&body, oacc_head);
+
+      int level_total = 0;
+      omp_context *thisctx;
+
+      for (thisctx = ctx; thisctx; thisctx = thisctx->outer)
+        level_total += thisctx->oacc_partitioning_levels;
+
+      /* If the current context and parent contexts are distributed over a
+	 total of one parallelism level, we have gang partitioning.  */
+      if (level_total == 1)
+        mark_oacc_gangprivate (ctx->oacc_decls);
+    }
 
   lower_omp_for_lastprivate (&fd, &body, &dlist, ctx);
 
@@ -7511,6 +7598,8 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   clauses = gimple_omp_target_clauses (stmt);
 
+  oacc_record_private_var_clauses (ctx, clauses);
+
   gimple_seq dep_ilist = NULL;
   gimple_seq dep_olist = NULL;
   if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND))
@@ -7761,6 +7850,8 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 
   if (offloaded)
     {
+      mark_oacc_gangprivate (ctx->oacc_decls);
+
       /* Declare all the variables created by mapping and the variables
 	 declared in the scope of the target body.  */
       record_vars_into (ctx->block_vars, child_fn);
@@ -8755,6 +8846,7 @@  lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 		 ctx);
       break;
     case GIMPLE_BIND:
+      oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt)));
       lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx);
       maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt));
       break;
diff --git a/gcc/target.def b/gcc/target.def
index c570f38..b3b24b8 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1701,6 +1701,16 @@  for allocating any storage for reductions when necessary.",
 void, (gcall *call),
 default_goacc_reduction)
 
+DEFHOOK
+(expand_accel_var,
+"This hook, if defined, is used by accelerator target back-ends to expand\n\
+specially handled kinds of VAR_DECL expressions.  A particular use is to\n\
+place variables with specific attributes inside special accelarator\n\
+memories.  A return value of NULL indicates that the target does not\n\
+handle this VAR_DECL, and normal RTL expanding is resumed.",
+rtx, (tree var),
+NULL)
+
 HOOK_VECTOR_END (goacc)
 
 /* Functions relating to vectorization.  */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
new file mode 100644
index 0000000..f378346
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c
@@ -0,0 +1,38 @@ 
+#include <assert.h>
+
+int main (void)
+{
+  int ret;
+
+  #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret)
+  {
+    int w = 0;
+
+    #pragma acc loop worker
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	w++;
+      }
+
+    ret = (w == 32);
+  }
+  assert (ret);
+
+  #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret)
+  {
+    int v = 0;
+
+    #pragma acc loop vector
+    for (int i = 0; i < 32; i++)
+      {
+        #pragma acc atomic update
+	v++;
+      }
+
+    ret = (v == 32);
+  }
+  assert (ret);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
new file mode 100644
index 0000000..2fa708a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c
@@ -0,0 +1,106 @@ 
+/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */
+
+#include <stdio.h>
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <gomp-constants.h>
+#include <stdlib.h>
+
+#if 0
+#define DEBUG(DIM, IDX, VAL) \
+  fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL))
+#else
+#define DEBUG(DIM, IDX, VAL)
+#endif
+
+#define N (32*32*32)
+
+int
+check (const char *dim, int *dist, int dimsize)
+{
+  int ix;
+  int exit = 0;
+
+  for (ix = 0; ix < dimsize; ix++)
+    {
+      DEBUG(dim, ix, dist[ix]);
+      if (dist[ix] < (N) / (dimsize + 0.5)
+	  || dist[ix] > (N) / (dimsize - 0.5))
+	{
+	  fprintf (stderr, "did not distribute to %ss (%d not between %d "
+		   "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)),
+		   (int) ((N) / (dimsize - 0.5)));
+	  exit |= 1;
+	}
+    }
+
+  return exit;
+}
+
+int main ()
+{
+  int ary[N];
+  int ix;
+  int exit = 0;
+  int ondev = 0;
+  int gangsize = 0, workersize = 0, vectorsize = 0;
+  int *gangdist, *workerdist, *vectordist;
+
+  for (ix = 0; ix < N;ix++)
+    ary[ix] = -1;
+
+#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev) copyout(gangsize, workersize, vectorsize)
+  {
+#pragma acc loop gang worker vector
+    for (unsigned ix = 0; ix < N; ix++)
+      {
+	if (acc_on_device (acc_device_not_host))
+	  {
+	    int g, w, v;
+
+	    g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG);
+	    w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER);
+	    v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR);
+
+	    ary[ix] = (g << 16) | (w << 8) | v;
+	    ondev = 1;
+	  }
+	else
+	  ary[ix] = ix;
+      }
+
+    gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG);
+    workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER);
+    vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR);
+  }
+
+  gangdist = (int *) alloca (gangsize * sizeof (int));
+  workerdist = (int *) alloca (workersize * sizeof (int));
+  vectordist = (int *) alloca (vectorsize * sizeof (int));
+  memset (gangdist, 0, gangsize * sizeof (int));
+  memset (workerdist, 0, workersize * sizeof (int));
+  memset (vectordist, 0, vectorsize * sizeof (int));
+
+  /* Test that work is shared approximately equally amongst each active
+     gang/worker/vector.  */
+  for (ix = 0; ix < N; ix++)
+    {
+      if (ondev)
+	{
+	  int g = (ary[ix] >> 16) & 255;
+	  int w = (ary[ix] >> 8) & 255;
+	  int v = ary[ix] & 255;
+
+	  gangdist[g]++;
+	  workerdist[w]++;
+	  vectordist[v]++;
+	}
+    }
+
+  exit = check ("gang", gangdist, gangsize);
+  exit |= check ("worker", workerdist, workersize);
+  exit |= check ("vector", vectordist, vectorsize);
+
+  return exit;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
new file mode 100644
index 0000000..329e8a0
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-w" } */
+
+int
+main (void)
+{
+#pragma acc parallel
+  foo ();
+
+  return 0;
+}