diff mbox

Default compute dimensions (runtime)

Message ID 56B21D23.5060209@acm.org
State New
Headers show

Commit Message

Nathan Sidwell Feb. 3, 2016, 3:30 p.m. UTC
Jakub,
this is the runtime side of default compute dimension support.

1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime 
choice.  (0 also indicates that, but I thought best to have an explicit syntax 
as well).

2) New plugin helper 'GOMP_PLUGIN_acc_default_dims' that parses a 
GOMP_OPENACC_DIM environment variable.  The syntax here is the same as that for 
the -fopenacc-dim option -- except '-' isn't permitted.  I have future-proofed 
the interface by including a plugin tag parameter.  This  will permit 
device_type support.

3) the plugin itself lazily calls GOMP_PLUGIN_acc_default_dims when it sees an 
unspecified dimension.  Validates the default dimensions and then plugs them 
into the launch parameters.

The testcase reuses the compile-time testcase by breaking its core to a header 
file and explicitly setting the environment variable before first launch.  The 
original testcase also explitily sets  the environment variable, to make sure 
it's not being considered.

There doesn't seem to be a mechanism warning messages -- only debug ones or 
fatal errors.  I'm not sure what the best approach to handling errors in the env 
var parsing, and ducked to silently ignore problems (and the plugin will then 
provide fallback values).

ok?

nathan

Comments

Alexander Monakov Feb. 3, 2016, 4:10 p.m. UTC | #1
Hello,

On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime
> choice.  (0 also indicates that, but I thought best to have an explicit syntax
> as well).

Does it work when the user specifies one of the dimensions, so that references
to it are subject to constant folding and VRP, but leaves some other dimension
unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the
runtime-specified value of the first dimension is different from what the
compiler saw, invalidating all folding and propagation?


Here:

+	  /* Do some sanity checking.  The CUDA API doesn't appear to
+	     provide queries to determine these limits.  */
+	  if (default_dims[GOMP_DIM_GANG] < 1)
+	    default_dims[GOMP_DIM_GANG] = 32;
+	  if (default_dims[GOMP_DIM_WORKER] < 1
+	      || default_dims[GOMP_DIM_WORKER] > 32)
+	    default_dims[GOMP_DIM_WORKER] = 32;
+	  default_dims[GOMP_DIM_VECTOR] = 32;

I don't see why you say that because cuDeviceGetAttribute provides 
CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and
cuFuncGetAttribute that allows to get a per-function thread limit.  There's a
patch on gomp-nvptx branch that adds querying some of those to the plugin.

Alexander
Nathan Sidwell Feb. 3, 2016, 4:21 p.m. UTC | #2
On 02/03/16 11:10, Alexander Monakov wrote:
> Hello,
>
> On Wed, 3 Feb 2016, Nathan Sidwell wrote:
>> 1) extend the -fopenacc-dim=X:Y:Z syntax to allow '-' indicating a runtime
>> choice.  (0 also indicates that, but I thought best to have an explicit syntax
>> as well).
>
> Does it work when the user specifies one of the dimensions, so that references
> to it are subject to constant folding and VRP, but leaves some other dimension
> unspecified, and when eventually GOMP_OPENACC_DIM is parsed at runtime, the
> runtime-specified value of the first dimension is different from what the
> compiler saw, invalidating all folding and propagation?

You can only override at runtime those dimensions that you said you'd override 
at runtime when you compiled your program.

> I don't see why you say that because cuDeviceGetAttribute provides
> CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case) and
> cuFuncGetAttribute that allows to get a per-function thread limit.  There's a
> patch on gomp-nvptx branch that adds querying some of those to the plugin.

thanks.  There doesn't appear to be one for number of physical CTAs though, right?

nathan
Alexander Monakov Feb. 3, 2016, 4:52 p.m. UTC | #3
On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> You can only override at runtime those dimensions that you said you'd override
> at runtime when you compiled your program.

Ah, I see.  That's not obvious to me, so perhaps added documentation can be
expanded to explain that?  (I now see that the plugin silently drops
user-provided dimensions where a value recorded at compile time is present;
not sure if that'd be worth a runtime diagnostic, could be very noisy)
 
> > I don't see why you say that because cuDeviceGetAttribute provides
> > CU_DEVICE_ATTRIBUTE_WARP_SIZE, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for this case)
> > and cuFuncGetAttribute that allows to get a per-function thread limit.
> > There's a patch on gomp-nvptx branch that adds querying some of those to
> > the plugin.
> 
> thanks.  There doesn't appear to be one for number of physical CTAs though,
> right?

Sorry, I don't understand the question: CTA is a logical entity.  One could
derive limit of possible concurrent CTAs from number of SMs
(CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many CTAs fit on
one multiprocessor.  The latter figure can be taken as a rough worst-case
value, or semi-intelligent per-kernel estimate based on register limits
(there's code on gomp-nvptx branch that does this), or one can use the cuOcc*
API to ask the driver for a precise per-kernel figure.

Alexander
Julian Brown Oct. 5, 2018, 8:03 p.m. UTC | #4
Hi,

Continuing the thread from here:

https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html

On Wed, 3 Feb 2016 19:52:09 +0300
Alexander Monakov <amonakov@ispras.ru> wrote:

> On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> > You can only override at runtime those dimensions that you said
> > you'd override at runtime when you compiled your program.  
> 
> Ah, I see.  That's not obvious to me, so perhaps added documentation
> can be expanded to explain that?  (I now see that the plugin silently
> drops user-provided dimensions where a value recorded at compile time
> is present; not sure if that'd be worth a runtime diagnostic, could
> be very noisy) 

This version of the patch has slightly-expanded documentation.

> > > I don't see why you say that because cuDeviceGetAttribute provides
> > > CU_DEVICE_ATTRIBUTE_WARP_SIZE,
> > > CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
> > > CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X (which is not too useful for
> > > this case) and cuFuncGetAttribute that allows to get a
> > > per-function thread limit. There's a patch on gomp-nvptx branch
> > > that adds querying some of those to the plugin.  
> > 
> > thanks.  There doesn't appear to be one for number of physical CTAs
> > though, right?  
> 
> Sorry, I don't understand the question: CTA is a logical entity.  One
> could derive limit of possible concurrent CTAs from number of SMs
> (CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT) multiplied by how many
> CTAs fit on one multiprocessor.  The latter figure can be taken as a
> rough worst-case value, or semi-intelligent per-kernel estimate based
> on register limits (there's code on gomp-nvptx branch that does
> this), or one can use the cuOcc* API to ask the driver for a precise
> per-kernel figure.

While the runtime part of the patch already appears to have been
committed as part of the following patch:

https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html

The compile-time part of the patch has not made it upstream yet. Thus,
this rebased and retested patch consists of the parsing changes (for
-fopenacc-dim=X:Y:Z, allowing '-') and warning changes (for strange
partitioning choices), plus associated testsuite adjustments.

Tested with offloading to NVPTX and bootstrapped.

OK for trunk?

Thanks,

Julian

20xx-xx-xx  Nathan Sidwell  <nathan@acm.org>
	    Tom de Vries  <tdevries@suse.de>
	    Thomas Schwinge  <thomas@codesourcery.com>
	    Julian Brown  <julian@codesourcery.com>

	gcc/
	* doc/invoke.texi (fopenacc-dim): Update.
	* omp-offload.c (oacc_parse_default_dims): Update.
	(oacc_validate_dims): Emit warnings about strange partitioning choices.

	gcc/testsuite/
	* c-c++-common/goacc/acc-icf.c: Update.
	* c-c++-common/goacc/parallel-dims-1.c: Likewise.
	* c-c++-common/goacc/parallel-reduction.c: Likewise.
	* c-c++-common/goacc/pr70688.c: Likewise.
	* c-c++-common/goacc/routine-1.c: Likewise.
	* c-c++-common/goacc/uninit-dim-clause.c: Likewise.
	* gfortran.dg/goacc/parallel-tree.f95: Likewise.
	* gfortran.dg/goacc/routine-4.f90: Likewise.
	* gfortran.dg/goacc/routine-level-of-parallelism-1.f90: Likewise.
	* gfortran.dg/goacc/uninit-dim-clause.f95: Likewise.

	libgomp/
	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: Add -w.
	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c: New.
	* testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c: Update.
	* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/mode-transitions.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/private-variables.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/routine-g-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/routine-w-1.c: Likewise.
	* testsuite/libgomp.oacc-fortran/par-reduction-2-1.f: Likewise.
	* testsuite/libgomp.oacc-fortran/par-reduction-2-2.f: Likewise.
	* testsuite/libgomp.oacc-fortran/pr84028.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/private-variables.f90: Likewise.
	* testsuite/libgomp.oacc-fortran/routine-7.f90: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c: New.
commit a918a8739ae7652250c978b0ececa181a587b0c0
Author: Julian Brown <julian@codesourcery.com>
Date:   Fri Oct 5 11:11:47 2018 -0700

    OpenACC default compute dimensions
    
    20xx-xx-xx  Nathan Sidwell  <nathan@acm.org>
    	    Tom de Vries  <tdevries@suse.de>
    	    Thomas Schwinge  <thomas@codesourcery.com>
    	    Julian Brown  <julian@codesourcery.com>
    
    	gcc/
    	* doc/invoke.texi (fopenacc-dim): Update.
    	* omp-offload.c (oacc_parse_default_dims): Update.
    	(oacc_validate_dims): Emit warnings about strange partitioning choices.
    
    	gcc/testsuite/
    	* c-c++-common/goacc/acc-icf.c: Update.
    	* c-c++-common/goacc/parallel-dims-1.c: Likewise.
    	* c-c++-common/goacc/parallel-reduction.c: Likewise.
    	* c-c++-common/goacc/pr70688.c: Likewise.
    	* c-c++-common/goacc/routine-1.c: Likewise.
    	* c-c++-common/goacc/uninit-dim-clause.c: Likewise.
    	* gfortran.dg/goacc/parallel-tree.f95: Likewise.
    	* gfortran.dg/goacc/routine-4.f90: Likewise.
    	* gfortran.dg/goacc/routine-level-of-parallelism-1.f90: Likewise.
    	* gfortran.dg/goacc/uninit-dim-clause.f95: Likewise.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/loop-g-1.c: Add -w.
    	* testsuite/libgomp.oacc-c-c++-common/loop-g-2.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c: New.
    	* testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c: Update.
    	* testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-w-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/mode-transitions.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/parallel-dims.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/private-variables.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/reduction-7.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/routine-g-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/routine-w-1.c: Likewise.
    	* testsuite/libgomp.oacc-fortran/par-reduction-2-1.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/par-reduction-2-2.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/pr84028.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/private-variables.f90: Likewise.
    	* testsuite/libgomp.oacc-fortran/routine-7.f90: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c: New.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 167eef5..21ec028 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -2167,8 +2167,12 @@ have support for @option{-pthread}.
 @cindex OpenACC accelerator programming
 Specify default compute dimensions for parallel offload regions that do
 not explicitly specify.  The @var{geom} value is a triple of
-':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
-can be omitted, to use a target-specific default value.
+':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  If a size
+is to be deferred until execution '-' can be used, alternatively a size
+can be omitted to use a target-specific default value.  When deferring
+to runtime, the environment variable @var{GOMP_OPENACC_DIM} can be set.
+It has the same format as the option value, except that '-' is not
+permitted.
 
 @item -fopenmp
 @opindex fopenmp
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 0abf028..48d1a42 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -574,8 +574,9 @@ static int oacc_default_dims[GOMP_DIM_MAX];
 static int oacc_min_dims[GOMP_DIM_MAX];
 
 /* Parse the default dimension parameter.  This is a set of
-   :-separated optional compute dimensions.  Each specified dimension
-   is a positive integer.  When device type support is added, it is
+   :-separated optional compute dimensions.  Each dimension is either
+   a positive integer, or '-' for a dynamic value computed at
+   runtime.  When device type support is added, it is
    planned to be a comma separated list of such compute dimensions,
    with all but the first prefixed by the colon-terminated device
    type.  */
@@ -610,14 +611,20 @@ oacc_parse_default_dims (const char *dims)
 
 	  if (*pos != ':')
 	    {
-	      long val;
-	      const char *eptr;
+	      long val = 0;
 
-	      errno = 0;
-	      val = strtol (pos, CONST_CAST (char **, &eptr), 10);
-	      if (errno || val <= 0 || (int) val != val)
-		goto malformed;
-	      pos = eptr;
+	      if (*pos == '-')
+		pos++;
+	      else
+		{
+		  const char *eptr;
+
+		  errno = 0;
+		  val = strtol (pos, CONST_CAST (char **, &eptr), 10);
+		  if (errno || val <= 0 || (int) val != val)
+		    goto malformed;
+		  pos = eptr;
+		}
 	      oacc_default_dims[ix] = (int) val;
 	    }
 	}
@@ -659,6 +666,34 @@ oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
       pos = TREE_CHAIN (pos);
     }
 
+  bool check = true;
+#ifdef ACCEL_COMPILER
+  check = false;
+#endif
+  if (check
+      && !lookup_attribute ("oacc kernels", DECL_ATTRIBUTES (fn)))
+    {
+      static char const *const axes[] =
+      /* Must be kept in sync with GOMP_DIM enumeration.  */
+	{"gang", "worker", "vector" };
+      for (ix = level >= 0 ? level : 0; ix != GOMP_DIM_MAX; ix++)
+	if (dims[ix] < 0)
+	  ; /* Defaulting axis.  */
+	else if ((used & GOMP_DIM_MASK (ix)) && dims[ix] == 1)
+	  /* There is partitioned execution, but the user requested a
+	     dimension size of 1.  They're probably confused.  */
+	  warning_at (DECL_SOURCE_LOCATION (fn), 0,
+		      "region contains %s partitoned code but"
+		      " is not %s partitioned", axes[ix], axes[ix]);
+	else if (!(used & GOMP_DIM_MASK (ix)) && dims[ix] != 1)
+	  /* The dimension is explicitly partitioned to non-unity, but
+	     no use is made within the region.  */
+	  warning_at (DECL_SOURCE_LOCATION (fn), 0,
+		      "region is %s partitioned but"
+		      " does not contain %s partitioned code",
+		      axes[ix], axes[ix]);
+    }
+
   bool changed = targetm.goacc.validate_dims (fn, dims, level);
 
   /* Default anything left to 1 or a partitioned default.  */
diff --git a/gcc/testsuite/c-c++-common/goacc/acc-icf.c b/gcc/testsuite/c-c++-common/goacc/acc-icf.c
index ecfe3f2..fb2c791 100644
--- a/gcc/testsuite/c-c++-common/goacc/acc-icf.c
+++ b/gcc/testsuite/c-c++-common/goacc/acc-icf.c
@@ -4,7 +4,7 @@
 
 #pragma acc routine gang
 int
-routine1 (int n)
+routine1 (int n) /* { dg-bogus "region is worker partitioned but does not contain worker partitioned code" "" { xfail *-*-* } } */
 {
   int i;
 
@@ -17,7 +17,7 @@ routine1 (int n)
 
 #pragma acc routine gang
 int
-routine2 (int n)
+routine2 (int n) /* { dg-bogus "region is worker partitioned but does not contain worker partitioned code" "" { xfail *-*-* } } */
 {
   int i;
 
diff --git a/gcc/testsuite/c-c++-common/goacc/parallel-dims-1.c b/gcc/testsuite/c-c++-common/goacc/parallel-dims-1.c
index 57f682f..6cdbebe 100644
--- a/gcc/testsuite/c-c++-common/goacc/parallel-dims-1.c
+++ b/gcc/testsuite/c-c++-common/goacc/parallel-dims-1.c
@@ -3,9 +3,11 @@
 
 void f(int i)
 {
-#pragma acc kernels num_gangs(i) num_workers(i) vector_length(i)
+#pragma acc kernels \
+  num_gangs(i) num_workers(i) vector_length(i)
   ;
 
-#pragma acc parallel num_gangs(i) num_workers(i) vector_length(i)
+#pragma acc parallel /* { dg-bogus "region is (gang|worker|vector) partitioned" "" { xfail *-*-* } } */ \
+  num_gangs(i) num_workers(i) vector_length(i)
   ;
 }
diff --git a/gcc/testsuite/c-c++-common/goacc/parallel-reduction.c b/gcc/testsuite/c-c++-common/goacc/parallel-reduction.c
index d7cc947..9a142c4 100644
--- a/gcc/testsuite/c-c++-common/goacc/parallel-reduction.c
+++ b/gcc/testsuite/c-c++-common/goacc/parallel-reduction.c
@@ -6,7 +6,7 @@ main ()
 
 #pragma acc data copy (dummy)
   {
-#pragma acc parallel num_gangs (10) copy (sum) reduction (+:sum)
+#pragma acc parallel num_gangs (10) copy (sum) reduction (+:sum) /* { dg-warning "gang partitioned" } */
     {
       int v = 5;
       sum += 10 + v;
diff --git a/gcc/testsuite/c-c++-common/goacc/pr70688.c b/gcc/testsuite/c-c++-common/goacc/pr70688.c
index 5a23665..3f5584a 100644
--- a/gcc/testsuite/c-c++-common/goacc/pr70688.c
+++ b/gcc/testsuite/c-c++-common/goacc/pr70688.c
@@ -21,7 +21,7 @@ parallel_reduction ()
 
 #pragma acc data copy (dummy)
   {
-#pragma acc parallel num_gangs (10) copy (sum) reduction (+:sum)
+#pragma acc parallel num_gangs (10) copy (sum) reduction (+:sum) /* { dg-warning "region is gang partitioned" } */
     {
       int v = 5;
       sum += 10 + v;
@@ -36,11 +36,11 @@ main ()
 {
   int i, s = 0;
 
-#pragma acc parallel num_gangs (10) copy (s) reduction (+:s)
+#pragma acc parallel num_gangs (10) copy (s) reduction (+:s) /* { dg-warning "region is gang partitioned" } */
   for (i = 0; i < n; i++)
     s += i+1;
 
-#pragma acc parallel num_gangs (10) reduction (+:s) copy (s)
+#pragma acc parallel num_gangs (10) reduction (+:s) copy (s) /* { dg-warning "region is gang partitioned" } */
   for (i = 0; i < n; i++)
     s += i+1;
 
diff --git a/gcc/testsuite/c-c++-common/goacc/routine-1.c b/gcc/testsuite/c-c++-common/goacc/routine-1.c
index a756922..b90e2c1 100644
--- a/gcc/testsuite/c-c++-common/goacc/routine-1.c
+++ b/gcc/testsuite/c-c++-common/goacc/routine-1.c
@@ -1,16 +1,16 @@
 
 #pragma acc routine gang
-void gang (void)
+void gang (void) /* { dg-warning "partitioned" 3 } */
 {
 }
 
 #pragma acc routine worker
-void worker (void)
+void worker (void) /* { dg-warning "partitioned" 2 } */
 {
 }
 
 #pragma acc routine vector
-void vector (void)
+void vector (void) /* { dg-warning "partitioned" 1 } */
 {
 }
 
diff --git a/gcc/testsuite/c-c++-common/goacc/uninit-dim-clause.c b/gcc/testsuite/c-c++-common/goacc/uninit-dim-clause.c
index 9f11196..72aacd7 100644
--- a/gcc/testsuite/c-c++-common/goacc/uninit-dim-clause.c
+++ b/gcc/testsuite/c-c++-common/goacc/uninit-dim-clause.c
@@ -4,14 +4,17 @@ void acc_parallel()
 {
   int i, j, k;
 
-  #pragma acc parallel num_gangs(i) /* { dg-warning "is used uninitialized in this function" } */
-  ;
+  #pragma acc parallel loop gang num_gangs(i) /* { dg-warning "is used uninitialized in this function" } */
+  for (i = 0; i < 1; i++)
+    ;
 
-  #pragma acc parallel num_workers(j) /* { dg-warning "is used uninitialized in this function" } */
-  ;
+  #pragma acc parallel loop worker num_workers(j) /* { dg-warning "is used uninitialized in this function" } */
+  for (j = 0; j < 1; j++)
+    ;
 
-  #pragma acc parallel vector_length(k) /* { dg-warning "is used uninitialized in this function" } */
-  ;
+  #pragma acc parallel loop vector vector_length(k) /* { dg-warning "is used uninitialized in this function" } */
+  for (k = 0; k < 1; k++)
+    ;
 }
 
 void acc_kernels()
diff --git a/gcc/testsuite/gfortran.dg/goacc/parallel-tree.f95 b/gcc/testsuite/gfortran.dg/goacc/parallel-tree.f95
index 2697bb7..aaa1bfd 100644
--- a/gcc/testsuite/gfortran.dg/goacc/parallel-tree.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/parallel-tree.f95
@@ -11,6 +11,9 @@ program test
   !$acc reduction(max:q), copy(i), copyin(j), copyout(k), create(m) &
   !$acc present(o), pcopy(p), pcopyin(r), pcopyout(s), pcreate(t) &
   !$acc deviceptr(u), private(v), firstprivate(w)
+  ! { dg-warning "region is gang partitioned but does not contain gang partitioned code" "" { target *-*-* } .-1 }
+  ! { dg-warning "region is worker partitioned but does not contain worker partitioned code" "" { target *-*-* } .-2 }
+  ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "" { target *-*-* } .-3 }
   !$acc end parallel
 
 end program test
diff --git a/gcc/testsuite/gfortran.dg/goacc/routine-4.f90 b/gcc/testsuite/gfortran.dg/goacc/routine-4.f90
index 6714c7b..3fb60e7 100644
--- a/gcc/testsuite/gfortran.dg/goacc/routine-4.f90
+++ b/gcc/testsuite/gfortran.dg/goacc/routine-4.f90
@@ -123,6 +123,7 @@ contains
     integer, intent (inout) :: a(N)
     integer :: i
 
+    !$acc loop gang worker vector
     do i = 1, N
        a(i) = a(i) - a(i)
     end do
@@ -133,6 +134,7 @@ contains
     integer, intent (inout) :: a(N)
     integer :: i
 
+    !$acc loop worker vector
     do i = 1, N
        a(i) = a(i) - a(i)
     end do
@@ -143,6 +145,7 @@ contains
     integer, intent (inout) :: a(N)
     integer :: i
 
+    !$acc loop vector
     do i = 1, N
        a(i) = a(i) - a(i)
     end do
@@ -153,6 +156,7 @@ contains
     integer, intent (inout) :: a(N)
     integer :: i
 
+    !$acc loop seq
     do i = 1, N
        a(i) = a(i) - a(i)
     end do
diff --git a/gcc/testsuite/gfortran.dg/goacc/routine-level-of-parallelism-1.f90 b/gcc/testsuite/gfortran.dg/goacc/routine-level-of-parallelism-1.f90
index 75dd1b0..1b41a68 100644
--- a/gcc/testsuite/gfortran.dg/goacc/routine-level-of-parallelism-1.f90
+++ b/gcc/testsuite/gfortran.dg/goacc/routine-level-of-parallelism-1.f90
@@ -2,8 +2,10 @@
 ! parallelism with the OpenACC routine directive.  The Fortran counterpart is
 ! c-c++-common/goacc/routine-level-of-parallelism-2.c
 
-subroutine g_1
+subroutine g_1 ! { dg-warning "region is gang partitioned but does not contain gang partitioned code" }
   !$acc routine gang
+! { dg-bogus "region is worker partitioned but does not contain worker partitioned code" "worker partitioned" { xfail *-*-* } .-2 }
+! { dg-bogus "region is vector partitioned but does not contain vector partitioned code" "worker partitioned" { xfail *-*-* } .-3 }
 end subroutine g_1
 
 subroutine s_1_2a
diff --git a/gcc/testsuite/gfortran.dg/goacc/uninit-dim-clause.f95 b/gcc/testsuite/gfortran.dg/goacc/uninit-dim-clause.f95
index 5dea42b..8551140 100644
--- a/gcc/testsuite/gfortran.dg/goacc/uninit-dim-clause.f95
+++ b/gcc/testsuite/gfortran.dg/goacc/uninit-dim-clause.f95
@@ -4,14 +4,20 @@ subroutine acc_parallel
   implicit none
   integer :: i, j, k
 
-  !$acc parallel num_gangs(i) ! { dg-warning "is used uninitialized in this function" }
-  !$acc end parallel
-
-  !$acc parallel num_workers(j) ! { dg-warning "is used uninitialized in this function" }
-  !$acc end parallel
-
-  !$acc parallel vector_length(k) ! { dg-warning "is used uninitialized in this function" }
-  !$acc end parallel
+  !$acc parallel loop gang num_gangs(i) ! { dg-warning "is used uninitialized in this function" }
+  do i = 0, 1
+  end do
+  !$acc end parallel loop
+
+  !$acc parallel loop worker num_workers(j) ! { dg-warning "is used uninitialized in this function" }
+  do j = 0, 1
+  end do
+  !$acc end parallel loop
+
+  !$acc parallel loop vector vector_length(k) ! { dg-warning "is used uninitialized in this function" }
+  do k = 0, 1
+  end do
+  !$acc end parallel loop
 end subroutine acc_parallel
 
 subroutine acc_kernels
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c
index 689a443..14bc3af 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/firstprivate-1.c
@@ -117,6 +117,8 @@ void t4 ()
     arr[i] = 3;
 
 #pragma acc parallel firstprivate(x) copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 119 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 119 } */
   {
 #pragma acc loop gang
     for (i = 0; i < 32; i++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
index 34bc57e..8e2c1c9 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-auto-1.c
@@ -147,7 +147,7 @@ int gang_1 (int *ary, int size)
 {
   clear (ary, size);
   
-#pragma acc parallel num_gangs (32) num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)
+#pragma acc parallel num_gangs (32) num_workers (32) vector_length(32) copy(ary[0:size]) firstprivate (size)/* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } } */
   {
 #pragma acc loop auto
     for (int jx = 0; jx <  size  / 64; jx++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c
new file mode 100644
index 0000000..6c479e4
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c
@@ -0,0 +1,13 @@
+/* { dg-additional-options "-fopenacc-dim=16:16" } */
+/* This code uses nvptx inline assembly guarded with acc_on_device, which is
+   not optimized away at -O0, and then confuses the target assembler.
+   { dg-skip-if "" { *-*-* } { "-O0" } { "" } } */
+/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "8:8" } */
+
+#include "loop-default.h"
+
+int main ()
+{
+  /* Environment should be ignored.  */
+  return test_1 (16, 16, 32);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
index 98f02e9..5831327 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-1.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-w" } */
 #include <stdio.h>
 #include <openacc.h>
 #include <gomp-constants.h>
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
index 4152a4e..82e8aae 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-g-2.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-w" } */
 #include <stdio.h>
 #include <openacc.h>
 #include <gomp-constants.h>
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c
index 7107502..2f3a44f 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-g-1.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-w" } */
 #include <stdio.h>
 #include <openacc.h>
 #include <gomp-constants.h>
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c
index 6bbd04f..a1bb845 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-1.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-w" } */
 #include <stdio.h>
 #include <openacc.h>
 #include <gomp-constants.h>
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c
index c63a5d4..ae43bb4 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-red-w-2.c
@@ -1,3 +1,4 @@
+/* { dg-additional-options "-w" } */
 #include <stdio.h>
 #include <openacc.h>
 #include <gomp-constants.h>
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
index fa6fb91..10b80f1 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-w-1.c
@@ -14,6 +14,7 @@ int main ()
     ary[ix] = -1;
   
 #pragma acc parallel num_workers(32) vector_length(32) copy(ary) copy(ondev)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 16 } */
   {
 #pragma acc loop worker
     for (unsigned ix = 0; ix < N; ix++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c
new file mode 100644
index 0000000..20a022f
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-warn-1.c
@@ -0,0 +1,37 @@
+
+/* Check warnings about suboptimal partitioning choices.  */
+
+int main ()
+{
+  int ary[10];
+
+#pragma acc parallel copy(ary) num_gangs (1) /* { dg-warning "is not gang partitioned" } */
+  {
+    #pragma acc loop gang
+    for (int  i = 0; i < 10; i++)
+      ary[i] = i;
+  }
+
+#pragma acc parallel copy(ary) num_workers (1) /* { dg-warning "is not worker partitioned" } */
+  {
+    #pragma acc loop worker
+    for (int  i = 0; i < 10; i++)
+      ary[i] = i;
+  }
+
+#pragma acc parallel copy(ary) num_gangs (8) /* { dg-warning "is gang partitioned" } */
+  {
+    #pragma acc loop worker
+    for (int  i = 0; i < 10; i++)
+      ary[i] = i;
+  }
+
+#pragma acc parallel copy(ary) num_workers (8) /* { dg-warning "is worker partitioned" } */
+  {
+    #pragma acc loop gang
+    for (int  i = 0; i < 10; i++)
+      ary[i] = i;
+  }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/mode-transitions.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/mode-transitions.c
index 4474c12..f62daf0 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/mode-transitions.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/mode-transitions.c
@@ -287,6 +287,7 @@ void t7()
   int n = 0;
   #pragma acc parallel copy(n) \
 		       num_gangs(1) num_workers(1) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 288 } */
   {
     n++;
   }
@@ -310,6 +311,7 @@ void t8()
 
       #pragma acc parallel copy(arr) \
 			   num_gangs(gangs) num_workers(1) vector_length(32)
+      /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 312 } */
       {
 	int j;
 	#pragma acc loop gang
@@ -339,6 +341,7 @@ void t9()
 
       #pragma acc parallel copy(arr) \
 			   num_gangs(gangs) num_workers(1) vector_length(32)
+      /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 342 } */
       {
 	int j;
 	#pragma acc loop gang
@@ -371,6 +374,7 @@ void t10()
 
       #pragma acc parallel copy(arr) \
 			   num_gangs(gangs) num_workers(1) vector_length(32)
+      /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 375 } */
       {
 	int j;
 	#pragma acc loop gang
@@ -404,6 +408,7 @@ void t11()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(1024) num_workers(1) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 409 } */
   {
     int j;
 
@@ -442,6 +447,7 @@ void t12()
 
   #pragma acc parallel copyout(fizz, buzz, fizzbuzz) \
 		       num_gangs(NUM_GANGS) num_workers(1) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 448 } */
   {
     int j;
     
@@ -488,6 +494,7 @@ void t13()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 495 } */
   {
     int j;
     #pragma acc loop gang
@@ -613,6 +620,7 @@ void t16()
 
   #pragma acc parallel copy(n, arr) \
 		       num_gangs(8) num_workers(16) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 621 } */
   {
     int j;
     #pragma acc loop gang
@@ -665,6 +673,7 @@ void t17()
 
 	#pragma acc parallel copyin(arr_a) copyout(arr_b) \
 			     num_gangs(num_gangs) num_workers(num_workers) vector_length(32)
+	/* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 674 } */
 	{
 	  int j;
 	  #pragma acc loop gang
@@ -882,6 +891,8 @@ void t21()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 892 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 892 } */
   {
     int j;
     #pragma acc loop gang
@@ -905,6 +916,8 @@ void t22()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 917 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 917 } */
   {
     int j;
     #pragma acc loop gang
@@ -931,6 +944,8 @@ void t23()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 945 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 945 } */
   {
     int j;
     #pragma acc loop gang
@@ -957,6 +972,8 @@ void t24()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 973 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 973 } */
   {
     int j;
     #pragma acc loop gang
@@ -988,6 +1005,7 @@ void t25()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 1006 } */
   {
     int j;
     #pragma acc loop gang
@@ -1020,6 +1038,7 @@ void t26()
 
   #pragma acc parallel copy(arr) \
 		       num_gangs(8) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 1039 } */
   {
     int j;
     #pragma acc loop gang
@@ -1070,6 +1089,8 @@ void t27()
 
   #pragma acc parallel copy(n, arr) copyout(ondev) \
 	  num_gangs(ACTUAL_GANGS) num_workers(8) vector_length(32)
+  /* { dg-warning "region is gang partitioned but does not contain gang partitioned code" "gang" { target *-*-* } 1090 } */
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 1090 } */
   {
     int j;
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
index 7781b32..ebcb760 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/parallel-dims.c
@@ -96,7 +96,7 @@ int main ()
     int gangs_min, gangs_max, workers_min, workers_max, vectors_min, vectors_max;
     gangs_min = workers_min = vectors_min = INT_MAX;
     gangs_max = workers_max = vectors_max = INT_MIN;
-#pragma acc parallel copy (gangs_actual) \
+#pragma acc parallel copy (gangs_actual) /* { dg-warning "region contains gang partitoned code but is not gang partitioned" } */ \
   num_gangs (GANGS) /* { dg-warning "'num_gangs' value must be positive" "" { target c++ } } */
     {
       /* We're actually executing with num_gangs (1).  */
@@ -125,7 +125,7 @@ int main ()
     int gangs_min, gangs_max, workers_min, workers_max, vectors_min, vectors_max;
     gangs_min = workers_min = vectors_min = INT_MAX;
     gangs_max = workers_max = vectors_max = INT_MIN;
-#pragma acc parallel copy (workers_actual) \
+#pragma acc parallel copy (workers_actual) /* { dg-warning "region contains worker partitoned code but is not worker partitioned" } */ \
   num_workers (WORKERS) /* { dg-warning "'num_workers' value must be positive" "" { target c++ } } */
     {
       /* We're actually executing with num_workers (1).  */
@@ -154,7 +154,8 @@ int main ()
     int gangs_min, gangs_max, workers_min, workers_max, vectors_min, vectors_max;
     gangs_min = workers_min = vectors_min = INT_MAX;
     gangs_max = workers_max = vectors_max = INT_MIN;
-#pragma acc parallel copy (vectors_actual) /* { dg-warning "using vector_length \\(32\\), ignoring 1" "" { target openacc_nvidia_accel_selected } } */ \
+#pragma acc parallel copy (vectors_actual) /* { dg-warning "region contains vector partitoned code but is not vector partitioned" } */ \
+  /* { dg-warning "using vector_length \\(32\\), ignoring 1" "" { target openacc_nvidia_accel_selected } 157 } */ \
   vector_length (VECTORS) /* { dg-warning "'vector_length' value must be positive" "" { target c++ } } */
     {
       /* We're actually executing with vector_length (1), just the GCC nvptx
@@ -198,7 +199,7 @@ int main ()
     int gangs_min, gangs_max, workers_min, workers_max, vectors_min, vectors_max;
     gangs_min = workers_min = vectors_min = INT_MAX;
     gangs_max = workers_max = vectors_max = INT_MIN;
-#pragma acc parallel copy (gangs_actual) \
+#pragma acc parallel copy (gangs_actual) /* { dg-warning "region is gang partitioned but does not contain gang partitioned code" } */ \
   reduction (min: gangs_min, workers_min, vectors_min) reduction (max: gangs_max, workers_max, vectors_max) \
   num_gangs (gangs)
     {
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/private-variables.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-variables.c
index 53f03d1..f0c3447 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/private-variables.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/private-variables.c
@@ -22,6 +22,8 @@ void local_g_1()
     arr[i] = 3;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 24 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 24 } */
   {
     int x;
 
@@ -295,6 +297,8 @@ void loop_g_1()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 299 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 299 } */
   {
     #pragma acc loop gang private(x)
     for (i = 0; i < 32; i++)
@@ -320,6 +324,7 @@ void loop_g_2()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 326 } */
   {
     #pragma acc loop gang private(x)
     for (i = 0; i < 32; i++)
@@ -348,6 +353,7 @@ void loop_g_3()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 355 } */
   {
     #pragma acc loop gang private(x)
     for (i = 0; i < 32; i++)
@@ -376,6 +382,7 @@ void loop_g_4()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 384 } */
   {
     #pragma acc loop gang private(x)
     for (i = 0; i < 32; i++)
@@ -408,6 +415,7 @@ void loop_g_5()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 417 } */
   {
     #pragma acc loop gang private(x)
     for (i = 0; i < 32; i++)
@@ -438,6 +446,7 @@ void loop_g_6()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 448 } */
   {
     #pragma acc loop gang private(pt)
     for (i = 0; i < 32; i++)
@@ -559,6 +568,7 @@ void loop_w_1()
     arr[i] = i;
 
   #pragma acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 570 } */
   {
     int j;
 
@@ -875,6 +885,8 @@ void parallel_g_1()
     arr[i] = 3;
 
   #pragma acc parallel private(x) copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 887 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 887 } */
   {
     #pragma acc loop gang(static:1)
     for (i = 0; i < 32; i++)
@@ -904,6 +916,7 @@ void parallel_g_2()
     arr[i] = i;
 
   #pragma acc parallel private(x) copy(arr) num_gangs(32) num_workers(2) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 918 } */
   {
     #pragma acc loop gang
     for (i = 0; i < 32; i++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-7.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-7.c
index c4940b8..68ae919 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-7.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/reduction-7.c
@@ -14,6 +14,8 @@ void g_np_1()
     arr[i] = i;
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 16 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 16 } */
   {
     #pragma acc loop gang reduction(+:res)
     for (i = 0; i < 1024; i++)
@@ -28,6 +30,8 @@ void g_np_1()
   res = hres = 1;
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 32 } */
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 32 } */
   {
     #pragma acc loop gang reduction(*:res)
     for (i = 0; i < 12; i++)
@@ -52,6 +56,7 @@ void gv_np_1()
     arr[i] = i;
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 58 } */
   {
     #pragma acc loop gang vector reduction(+:res)
     for (i = 0; i < 1024; i++)
@@ -76,6 +81,7 @@ void gw_np_1()
     arr[i] = i;
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 83 } */
   {
     #pragma acc loop gang worker reduction(+:res)
     for (i = 0; i < 1024; i++)
@@ -239,6 +245,7 @@ void v_p_1()
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
 		       private(res) copyout(out)
+  /* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 246 } */
   {
     #pragma acc loop gang
     for (j = 0; j < 32; j++)
@@ -315,6 +322,7 @@ void w_p_1()
 
   #pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \
 		       private(res) copyout(out)
+  /* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 323 } */
   {
     #pragma acc loop gang
     for (j = 0; j < 32; j++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-g-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-g-1.c
index a164f57..8c3b938 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-g-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-g-1.c
@@ -6,6 +6,8 @@
 
 #pragma acc routine gang
 void __attribute__ ((noinline)) gang (int ary[N])
+/* { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 8 } */
+/* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 8 } */
 {
 #pragma acc loop gang
     for (unsigned ix = 0; ix < N; ix++)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-w-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-w-1.c
index 81f1e03..e14947c 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-w-1.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/routine-w-1.c
@@ -6,6 +6,7 @@
 
 #pragma acc routine worker
 void __attribute__ ((noinline)) worker (int ary[N])
+/* { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 8 } */
 {
 #pragma acc loop worker
   for (unsigned ix = 0; ix < N; ix++)
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-1.f b/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-1.f
index aa1bb63..ff31116 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-1.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-1.f
@@ -14,7 +14,7 @@
       RES2 = 0
 
 !$ACC PARALLEL NUM_GANGS(256) NUM_WORKERS(32) VECTOR_LENGTH(32)
-!$ACC& REDUCTION(+:RES1) COPY(RES1, RES2) ASYNC(1)
+!$ACC& REDUCTION(+:RES1) COPY(RES1, RES2) ASYNC(1) ! { dg-warning "region is (gang|worker|vector) partitioned" }
       res1 = res1 + 5
 
 !$ACC ATOMIC
@@ -36,7 +36,7 @@
       RES2 = 1
 
 !$ACC PARALLEL NUM_GANGS(8) NUM_WORKERS(32) VECTOR_LENGTH(32)
-!$ACC& REDUCTION(*:RES1) COPY(RES1, RES2) ASYNC(1)
+!$ACC& REDUCTION(*:RES1) COPY(RES1, RES2) ASYNC(1) ! { dg-warning "region is (gang|worker|vector) partitioned" }
       res1 = res1 * 5
 
 !$ACC ATOMIC
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-2.f b/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-2.f
index 5694de1..47c5ff3 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-2.f
+++ b/libgomp/testsuite/libgomp.oacc-fortran/par-reduction-2-2.f
@@ -14,7 +14,7 @@
       RES2 = 0
 
 !$ACC PARALLEL NUM_GANGS(256) NUM_WORKERS(32) VECTOR_LENGTH(32)
-!$ACC& REDUCTION(+:RES1) COPY(RES1, RES2) ASYNC(1)
+!$ACC& REDUCTION(+:RES1) COPY(RES1, RES2) ASYNC(1) ! { dg-warning "region is (gang|worker|vector) partitioned" }
       res1 = res1 + 5
 
 !$ACC ATOMIC
@@ -36,7 +36,7 @@
       RES2 = 1
 
 !$ACC PARALLEL NUM_GANGS(8) NUM_WORKERS(32) VECTOR_LENGTH(32)
-!$ACC& REDUCTION(*:RES1) COPY(RES1, RES2) ASYNC(1)
+!$ACC& REDUCTION(*:RES1) COPY(RES1, RES2) ASYNC(1) ! { dg-warning "region is (gang|worker|vector) partitioned" }
       res1 = res1 * 5
 
 !$ACC ATOMIC
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/pr84028.f90 b/libgomp/testsuite/libgomp.oacc-fortran/pr84028.f90
index 2b36122..8cb76a9 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/pr84028.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/pr84028.f90
@@ -5,7 +5,7 @@ program foo
 
   a = 1
 
-  !$acc parallel num_gangs(1) num_workers(2)
+  !$acc parallel num_gangs(1) num_workers(2) ! { dg-warning "region is worker partitioned" }
 
   if (any(a(1:3,1:3,1:3).ne.1)) STOP 1
 
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/private-variables.f90 b/libgomp/testsuite/libgomp.oacc-fortran/private-variables.f90
index 472a6a1..fbff5cc 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/private-variables.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/private-variables.f90
@@ -13,6 +13,8 @@ subroutine t1()
   end do
 
   !$acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  ! { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 15 }
+  ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 15 }
   !$acc loop gang private(x)
   do i = 1, 32
      x = i * 2;
@@ -37,6 +39,7 @@ subroutine t2()
   end do
 
   !$acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 41 }
   !$acc loop gang private(x)
   do i = 0, 31
      x = i * 2;
@@ -65,6 +68,7 @@ subroutine t3()
   end do
 
   !$acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  ! { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 70 }
   !$acc loop gang private(x)
   do i = 0, 31
      x = i * 2;
@@ -98,6 +102,7 @@ subroutine t4()
   end do
 
   !$acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  ! { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 104 }
   !$acc loop gang private(pt)
   do i = 0, 31
      pt%x = i
@@ -208,6 +213,7 @@ subroutine t7()
   end do
 
   !$acc parallel copy(arr) num_gangs(32) num_workers(8) vector_length(32)
+  ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 215 }
   !$acc loop gang private(x)
   do i = 0, 31
      !$acc loop worker private(x)
@@ -507,6 +513,8 @@ subroutine t14()
   end do
 
   !$acc parallel private(x) copy(arr) num_gangs(n) num_workers(8) vector_length(32)
+  ! { dg-warning "region is worker partitioned but does not contain worker partitioned code" "worker" { target *-*-* } 515 }
+  ! { dg-warning "region is vector partitioned but does not contain vector partitioned code" "vector" { target *-*-* } 515 }
     !$acc loop gang(static:1)
     do i = 1, n
       x = i * 2;
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90 b/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90
index f58a95f..a83e92a 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/routine-7.f90
@@ -1,4 +1,3 @@
-
 ! { dg-do run }
 ! { dg-additional-options "-cpp" }
 
@@ -100,7 +99,7 @@ subroutine gang (a)
   integer, intent (inout) :: a(N)
   integer :: i
 
-  !$acc loop gang
+  !$acc loop gang worker vector
   do i = 1, N
     a(i) = a(i) - i 
   end do
Thomas Schwinge April 25, 2019, 1:15 p.m. UTC | #5
Hi!

On Fri, 5 Oct 2018 21:03:36 +0100, Julian Brown <julian@codesourcery.com> wrote:
> Continuing the thread from here:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg00198.html
> 
> On Wed, 3 Feb 2016 19:52:09 +0300
> Alexander Monakov <amonakov@ispras.ru> wrote:
> 
> > On Wed, 3 Feb 2016, Nathan Sidwell wrote:
> > > You can only override at runtime those dimensions that you said
> > > you'd override at runtime when you compiled your program.  
> > 
> > Ah, I see.  That's not obvious to me, so perhaps added documentation
> > can be expanded to explain that?  (I now see that the plugin silently
> > drops user-provided dimensions where a value recorded at compile time
> > is present; not sure if that'd be worth a runtime diagnostic, could
> > be very noisy) 
> 
> This version of the patch has slightly-expanded documentation.

Thanks!

> While the runtime part of the patch already appears to have been
> committed as part of the following patch:
> 
> https://gcc.gnu.org/ml/gcc-patches/2016-02/msg01589.html

(Not really as part of that one, but as part of other commits.)

> The compile-time part of the patch has not made it upstream yet. Thus,
> this rebased and retested patch consists of the parsing changes (for
> -fopenacc-dim=X:Y:Z, allowing '-')

That seems reasonable for next GCC development stage 1, with changes as
per below.  But please make sure that adequate testsuite coverage is
present for this functionality.

> and warning changes (for strange
> partitioning choices), plus associated testsuite adjustments.

These changes I'd like to defer, that's a separate topic.  The general
intention is good, but I've seen cases where I considered these
diagnostics to be too noisy.  See also the several 'dg-bogus' with XFAIL
that you're proposing to add?  We should think about that some more.


So, the following is relevant right now:

> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -2167,8 +2167,12 @@ have support for @option{-pthread}.
>  @cindex OpenACC accelerator programming
>  Specify default compute dimensions for parallel offload regions that do
>  not explicitly specify.  The @var{geom} value is a triple of
> -':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
> -can be omitted, to use a target-specific default value.
> +':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  If a size
> +is to be deferred until execution '-' can be used, alternatively a size
> +can be omitted to use a target-specific default value.  When deferring
> +to runtime, the environment variable @var{GOMP_OPENACC_DIM} can be set.
> +It has the same format as the option value, except that '-' is not
> +permitted.

ACK.

I re-discovered that 'GOMP_OPENACC_DIM' is not currently documented in
the libgomp manual (also referring back to the compile-time flag).  That
can be fixed incrementally, later on; that's PR85129 "[openacc] Document
GOMP_OPENACC_DIM".

> --- a/gcc/omp-offload.c
> +++ b/gcc/omp-offload.c
> @@ -574,8 +574,9 @@ static int oacc_default_dims[GOMP_DIM_MAX];
>  static int oacc_min_dims[GOMP_DIM_MAX];
>  
>  /* Parse the default dimension parameter.  This is a set of
> -   :-separated optional compute dimensions.  Each specified dimension
> -   is a positive integer.  When device type support is added, it is
> +   :-separated optional compute dimensions.  Each dimension is either
> +   a positive integer, or '-' for a dynamic value computed at
> +   runtime.  When device type support is added, it is
>     planned to be a comma separated list of such compute dimensions,
>     with all but the first prefixed by the colon-terminated device
>     type.  */
> @@ -610,14 +611,20 @@ oacc_parse_default_dims (const char *dims)
>  
>  	  if (*pos != ':')
>  	    {
> -	      long val;
> -	      const char *eptr;
> +	      long val = 0;

Don't set "val = 0" here, but instead...

>  
> -	      errno = 0;
> -	      val = strtol (pos, CONST_CAST (char **, &eptr), 10);
> -	      if (errno || val <= 0 || (int) val != val)
> -		goto malformed;
> -	      pos = eptr;
> +	      if (*pos == '-')

... do that here, so that it's clear that this is how the '-' case gets
encoded.

> +		pos++;
> +	      else
> +		{
> +		  const char *eptr;
> +
> +		  errno = 0;
> +		  val = strtol (pos, CONST_CAST (char **, &eptr), 10);
> +		  if (errno || val <= 0 || (int) val != val)
> +		    goto malformed;
> +		  pos = eptr;
> +		}
>  	      oacc_default_dims[ix] = (int) val;
>  	    }
>  	}

> --- /dev/null
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-default-compile.c
> @@ -0,0 +1,13 @@
> +/* { dg-additional-options "-fopenacc-dim=16:16" } */
> +/* This code uses nvptx inline assembly guarded with acc_on_device, which is
> +   not optimized away at -O0, and then confuses the target assembler.
> +   { dg-skip-if "" { *-*-* } { "-O0" } { "" } } */

You no longer need to skip for '-O0'.

> +/* { dg-set-target-env-var "GOMP_OPENACC_DIM" "8:8" } */
> +
> +#include "loop-default.h"
> +
> +int main ()
> +{
> +  /* Environment should be ignored.  */
> +  return test_1 (16, 16, 32);
> +}

Does that single test case constitute sufficient testsuite coverage?
Well, it doesn't even test the '-' case?  If that's a useful test on its
own, then commit it on its own.  (I have not researched its history, and
why it's not in trunk yet.)

Probably some of the 'dg-additional-options' '-fopenacc-dim' changes with
'-' that appear in:

    $ git diff upstream/trunk..upstream-git/openacc-gcc-8-branch -- libgomp/testsuite/libgomp.oacc-c-c++-common/pr85486\* libgomp/testsuite/libgomp.oacc-c-c++-common/vector-length-128-\* libgomp/testsuite/libgomp.oacc-fortran/gemm.f90

... should be part of this patch, to add some basic test coverage?
(Unless, of course, there's a reason that the '-fopenacc-dim' flags in
trunk are different, intentionally, which I have not researched.)


Grüße
 Thomas
diff mbox

Patch

2016-02-03  Nathan Sidwell  <nathan@codesourcery.com>

	gcc/
	* doc/invoke.texi (fopenacc-dim): Document runtime support.
	* omp-low.c(oacc_parse_default_dims): Add runtime support.

	libgomp/
	* libgomp.map (GOMP_PLUGIN_acc_default_dims): New.
	* oacc-parallel.c (GOACC_parallel_keyed): Zero initialize dims.
	* oacc-plugin.c (GOMP_PLUGIN_acc_default_dims): New.
	* oacc-plugin.h (GOMP_PLUGIN_acc_default_dims): Declare.
	* plugin/plugin-nvptx.c (nvptx_exec): Add support for runtime
	defaul dimensions.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c: Breakout
	body to and #include ...
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h: ... this.
	* testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c: New.

Index: gcc/doc/invoke.texi
===================================================================
--- gcc/doc/invoke.texi	(revision 233084)
+++ gcc/doc/invoke.texi	(working copy)
@@ -1969,7 +1969,12 @@  have support for @option{-pthread}.
 Specify default compute dimensions for parallel offload regions that do
 not explicitly specify.  The @var{geom} value is a triple of
 ':'-separated sizes, in order 'gang', 'worker' and, 'vector'.  A size
-can be omitted, to use a target-specific default value.
+can be omitted, to use a target-specific default value. Use '-' to defer
+the size determination until execution.  In that case, the environment
+variable @var{GOMP_OPENACC_DIM} should be set.  It has the same format
+as the option value, except that '-' is not permitted.  If it is unset,
+a target-specific value is chosen. Runtime and compile-time values can
+be freely mixed.
 
 @item -fopenmp
 @opindex fopenmp
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 233084)
+++ gcc/omp-low.c	(working copy)
@@ -20275,9 +20275,14 @@  oacc_parse_default_dims (const char *dim
 	      pos++;
 	    }
 
-	  if (*pos != ':')
+	  long val = -1;
+	  if (*pos == '-')
+	    {
+	      pos++;
+	      val = 0;
+	    }
+	  else if (*pos != ':')
 	    {
-	      long val;
 	      const char *eptr;
 
 	      errno = 0;
@@ -20285,8 +20290,8 @@  oacc_parse_default_dims (const char *dim
 	      if (errno || val <= 0 || (int) val != val)
 		goto malformed;
 	      pos = eptr;
-	      oacc_default_dims[ix] = (int) val;
 	    }
+	  oacc_default_dims[ix] = (int) val;
 	}
       if (*pos)
 	{
Index: libgomp/libgomp.map
===================================================================
--- libgomp/libgomp.map	(revision 233084)
+++ libgomp/libgomp.map	(working copy)
@@ -411,4 +411,5 @@  GOMP_PLUGIN_1.0 {
 GOMP_PLUGIN_1.1 {
   global:
 	GOMP_PLUGIN_target_task_completion;
+	GOMP_PLUGIN_acc_default_dims;
 } GOMP_PLUGIN_1.0;
Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 233084)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -103,6 +103,7 @@  GOACC_parallel_keyed (int device, void (
       return;
     }
 
+  memset (dims, 0, sizeof (dims));
   va_start (ap, kinds);
   /* TODO: This will need amending when device_type is implemented.  */
   while ((tag = va_arg (ap, unsigned)) != 0)
Index: libgomp/oacc-plugin.c
===================================================================
--- libgomp/oacc-plugin.c	(revision 233084)
+++ libgomp/oacc-plugin.c	(working copy)
@@ -29,6 +29,9 @@ 
 #include "libgomp.h"
 #include "oacc-plugin.h"
 #include "oacc-int.h"
+#include "gomp-constants.h"
+#include <stdlib.h>
+#include <errno.h>
 
 void
 GOMP_PLUGIN_async_unmap_vars (void *ptr)
@@ -46,3 +49,41 @@  GOMP_PLUGIN_acc_thread (void)
   struct goacc_thread *thr = goacc_thread ();
   return thr ? thr->target_tls : NULL;
 }
+
+/* Determine runtime default compute dimensions from environment. DIMS
+   must be zero-initialized.  Plugin will do remaining default & range
+   validation itself.  This should be called lazily on first required
+   use.
+
+   The tag will be used for device_type support.  */
+
+void
+GOMP_PLUGIN_acc_default_dims (const char *tag __attribute__((unused)),
+			      int *dims)
+{
+  const char *env_var = getenv ("GOMP_OPENACC_DIM");
+
+  if (env_var)
+    {
+      const char *pos = env_var;
+      int i;
+
+      GOMP_PLUGIN_debug (0, "Using GOMP_OPENACC_DIM=%s\n", env_var);
+      for (i = 0; *pos && i != GOMP_DIM_MAX; i++)
+	{
+	  if (i && *pos++ != ':')
+	    break;
+	  if (*pos != ':')
+	    {
+	      const char *eptr;
+
+	      errno = 0;
+	      long val = strtol (pos, (char **)&eptr, 10);
+	      if (errno || val < 0 || (int) val != val)
+		break;
+	      dims[i] = (int) val;
+	      pos = eptr;
+	    }
+	}
+    }
+}
Index: libgomp/oacc-plugin.h
===================================================================
--- libgomp/oacc-plugin.h	(revision 233084)
+++ libgomp/oacc-plugin.h	(working copy)
@@ -30,4 +30,6 @@ 
 extern void GOMP_PLUGIN_async_unmap_vars (void *);
 extern void *GOMP_PLUGIN_acc_thread (void);
 
+extern void GOMP_PLUGIN_acc_default_dims (const char *tag, int *dims);
+
 #endif
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 233084)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -894,9 +894,44 @@  nvptx_exec (void (*fn), size_t mapnum, v
   /* Initialize the launch dimensions.  Typically this is constant,
      provided by the device compiler, but we must permit runtime
      values.  */
-  for (i = 0; i != 3; i++)
-    if (targ_fn->launch->dim[i])
-      dims[i] = targ_fn->launch->dim[i];
+  int seen_zero = 0;
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    {
+      if (targ_fn->launch->dim[i])
+	dims[i] = targ_fn->launch->dim[i];
+      if (!dims[i])
+	seen_zero = 1;
+    }
+
+  if (seen_zero)
+    {
+      /* See if the user provided GOMP_OPENACC_DIM environment
+	 variable to specify runtime defaults. */
+      static int default_dims[GOMP_DIM_MAX];
+
+      if (!default_dims[0])
+	{
+	  GOMP_PLUGIN_acc_default_dims ("nvidia", default_dims);
+
+	  /* Do some sanity checking.  The CUDA API doesn't appear to
+	     provide queries to determine these limits.  */
+	  if (default_dims[GOMP_DIM_GANG] < 1)
+	    default_dims[GOMP_DIM_GANG] = 32;
+	  if (default_dims[GOMP_DIM_WORKER] < 1
+	      || default_dims[GOMP_DIM_WORKER] > 32)
+	    default_dims[GOMP_DIM_WORKER] = 32;
+	  default_dims[GOMP_DIM_VECTOR] = 32;
+
+	  GOMP_PLUGIN_debug (0, "Default dimensions [%d,%d,%d]\n",
+			     default_dims[GOMP_DIM_GANG],
+			     default_dims[GOMP_DIM_WORKER],
+			     default_dims[GOMP_DIM_VECTOR]);
+	}
+
+      for (i = 0; i != GOMP_DIM_MAX; i++)
+	if (!dims[i])
+	  dims[i] = default_dims[i];
+    }
 
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
@@ -918,7 +953,8 @@  nvptx_exec (void (*fn), size_t mapnum, v
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " gangs=%u, workers=%u, vectors=%u\n",
 		     __FUNCTION__, targ_fn->launch->fn,
-		     dims[0], dims[1], dims[2]);
+		     dims[GOMP_DIM_GANG], dims[GOMP_DIM_WORKER],
+		     dims[GOMP_DIM_VECTOR]);
 
   // OpenACC		CUDA
   //
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default-2.c	(working copy)
@@ -0,0 +1,12 @@ 
+
+/* { dg-additional-options "-O2 -fopenacc-dim=-:-:32" } */
+
+#include "loop-dim-default.h"
+#include <stdlib.h>
+
+int main ()
+{
+  setenv ("GOMP_OPENACC_DIM", "8:16", 1);
+
+  return test_1 (8, 16, 32);
+}
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c	(revision 233084)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.c	(working copy)
@@ -1,133 +1,13 @@ 
 
-/* { dg-additional-options "-O2 -fopenacc-dim=16:16" } */
+/* { dg-additional-options "-O2 -fopenacc-dim=16:16:32" } */
 
-#include <openacc.h>
-#include <alloca.h>
-#include <string.h>
-#include <stdio.h>
-
-#pragma acc routine
-static int __attribute__ ((noinline)) coord ()
-{
-  int res = 0;
-
-  if (acc_on_device (acc_device_nvidia))
-    {
-      int g = 0, w = 0, v = 0;
-
-      __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
-      __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
-      __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
-      res = (1 << 24) | (g << 16) | (w << 8) | v;
-    }
-  return res;
-}
-
-
-int check (const int *ary, int size, int gp, int wp, int vp)
-{
-  int exit = 0;
-  int ix;
-  int *gangs = (int *)alloca (gp * sizeof (int));
-  int *workers = (int *)alloca (wp * sizeof (int));
-  int *vectors = (int *)alloca (vp * sizeof (int));
-  int offloaded = 0;
-  
-  memset (gangs, 0, gp * sizeof (int));
-  memset (workers, 0, wp * sizeof (int));
-  memset (vectors, 0, vp * sizeof (int));
-  
-  for (ix = 0; ix < size; ix++)
-    {
-      int g = (ary[ix] >> 16) & 0xff;
-      int w = (ary[ix] >> 8) & 0xff;
-      int v = (ary[ix] >> 0) & 0xff;
-
-      if (g >= gp || w >= wp || v >= vp)
-	{
-	  printf ("unexpected cpu %#x used\n", ary[ix]);
-	  exit = 1;
-	}
-      else
-	{
-	  vectors[v]++;
-	  workers[w]++;
-	  gangs[g]++;
-	}
-      offloaded += ary[ix] >> 24;
-    }
-
-  if (!offloaded)
-    return 0;
-
-  if (offloaded != size)
-    {
-      printf ("offloaded %d times,  expected %d\n", offloaded, size);
-      return 1;
-    }
-
-  for (ix = 0; ix < gp; ix++)
-    if (gangs[ix] != gangs[0])
-      {
-	printf ("gang %d not used %d times\n", ix, gangs[0]);
-	exit = 1;
-      }
-  
-  for (ix = 0; ix < wp; ix++)
-    if (workers[ix] != workers[0])
-      {
-	printf ("worker %d not used %d times\n", ix, workers[0]);
-	exit = 1;
-      }
-  
-  for (ix = 0; ix < vp; ix++)
-    if (vectors[ix] != vectors[0])
-      {
-	printf ("vector %d not used %d times\n", ix, vectors[0]);
-	exit = 1;
-      }
-  
-  return exit;
-}
-
-#define N (32 *32*32)
-
-int test_1 (int gp, int wp, int vp)
-{
-  int ary[N];
-  int exit = 0;
-  
-#pragma acc parallel copyout (ary)
-  {
-#pragma acc loop gang (static:1)
-    for (int ix = 0; ix < N; ix++)
-      ary[ix] = coord ();
-  }
-
-  exit |= check (ary, N, gp, 1, 1);
-
-#pragma  acc parallel copyout (ary)
-  {
-#pragma acc loop worker
-    for (int ix = 0; ix < N; ix++)
-      ary[ix] = coord ();
-  }
-
-  exit |= check (ary, N, 1, wp, 1);
-
-#pragma  acc parallel copyout (ary)
-  {
-#pragma acc loop vector
-    for (int ix = 0; ix < N; ix++)
-      ary[ix] = coord ();
-  }
-
-  exit |= check (ary, N, 1, 1, vp);
-
-  return exit;
-}
+#include "loop-dim-default.h"
+#include <stdlib.h>
 
 int main ()
 {
+  /* Environment should be (silently) ignored.  */
+  setenv ("GOMP_OPENACC_DIM", "8:8:8",  1);
+
   return test_1 (16, 16, 32);
 }
Index: libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h
===================================================================
--- libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h	(revision 0)
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/loop-dim-default.h	(working copy)
@@ -0,0 +1,125 @@ 
+#include <openacc.h>
+#include <alloca.h>
+#include <string.h>
+#include <stdio.h>
+
+#pragma acc routine
+static int __attribute__ ((noinline)) coord ()
+{
+  int res = 0;
+
+  if (acc_on_device (acc_device_nvidia))
+    {
+      int g = 0, w = 0, v = 0;
+
+      __asm__ volatile ("mov.u32 %0,%%ctaid.x;" : "=r" (g));
+      __asm__ volatile ("mov.u32 %0,%%tid.y;" : "=r" (w));
+      __asm__ volatile ("mov.u32 %0,%%tid.x;" : "=r" (v));
+      res = (1 << 24) | (g << 16) | (w << 8) | v;
+    }
+  return res;
+}
+
+
+int check (const int *ary, int size, int gp, int wp, int vp)
+{
+  int exit = 0;
+  int ix;
+  int *gangs = (int *)alloca (gp * sizeof (int));
+  int *workers = (int *)alloca (wp * sizeof (int));
+  int *vectors = (int *)alloca (vp * sizeof (int));
+  int offloaded = 0;
+  
+  memset (gangs, 0, gp * sizeof (int));
+  memset (workers, 0, wp * sizeof (int));
+  memset (vectors, 0, vp * sizeof (int));
+  
+  for (ix = 0; ix < size; ix++)
+    {
+      int g = (ary[ix] >> 16) & 0xff;
+      int w = (ary[ix] >> 8) & 0xff;
+      int v = (ary[ix] >> 0) & 0xff;
+
+      if (g >= gp || w >= wp || v >= vp)
+	{
+	  printf ("unexpected cpu %#x used\n", ary[ix]);
+	  exit = 1;
+	}
+      else
+	{
+	  vectors[v]++;
+	  workers[w]++;
+	  gangs[g]++;
+	}
+      offloaded += ary[ix] >> 24;
+    }
+
+  if (!offloaded)
+    return 0;
+
+  if (offloaded != size)
+    {
+      printf ("offloaded %d times,  expected %d\n", offloaded, size);
+      return 1;
+    }
+
+  for (ix = 0; ix < gp; ix++)
+    if (gangs[ix] != gangs[0])
+      {
+	printf ("gang %d not used %d times\n", ix, gangs[0]);
+	exit = 1;
+      }
+  
+  for (ix = 0; ix < wp; ix++)
+    if (workers[ix] != workers[0])
+      {
+	printf ("worker %d not used %d times\n", ix, workers[0]);
+	exit = 1;
+      }
+  
+  for (ix = 0; ix < vp; ix++)
+    if (vectors[ix] != vectors[0])
+      {
+	printf ("vector %d not used %d times\n", ix, vectors[0]);
+	exit = 1;
+      }
+  
+  return exit;
+}
+
+#define N (32 *32*32)
+
+int test_1 (int gp, int wp, int vp)
+{
+  int ary[N];
+  int exit = 0;
+  
+#pragma acc parallel copyout (ary)
+  {
+#pragma acc loop gang (static:1)
+    for (int ix = 0; ix < N; ix++)
+      ary[ix] = coord ();
+  }
+
+  exit |= check (ary, N, gp, 1, 1);
+
+#pragma  acc parallel copyout (ary)
+  {
+#pragma acc loop worker
+    for (int ix = 0; ix < N; ix++)
+      ary[ix] = coord ();
+  }
+
+  exit |= check (ary, N, 1, wp, 1);
+
+#pragma  acc parallel copyout (ary)
+  {
+#pragma acc loop vector
+    for (int ix = 0; ix < N; ix++)
+      ary[ix] = coord ();
+  }
+
+  exit |= check (ary, N, 1, 1, vp);
+
+  return exit;
+}