diff mbox

[gomp4] Redesign oacc_parallel launch API

Message ID 55B7B332.1010603@acm.org
State New
Headers show

Commit Message

Nathan Sidwell July 28, 2015, 4:52 p.m. UTC
I've committed this patch to the gomp4 branch to redo the launch API.  I'll post 
a version for trunk once the versioning patch gets approved & committed.

This changes the API in a number of ways, allowing device-specific knowledge to 
be moved into the device compiler and out of the host compiler.

Firstly, we attach a tuple of launch dimensions as an attribute to the offloaded 
function's 'oacc function' attribute.  These are the constant launch dimensions. 
  Dynamic dimensions get a zero for their slot in this list.  Further this list 
can be extended in the future to an alist keyed by device_type.

Dynamic dimensions are computed on the host.  however they are passed via 
varadic args to the GOACC_parallel function (which is renamed).  The varadic 
args are passed using key/value representation, and 3 keys are currently defined:
END -- end of the varadic list
DIM - set of runtime-computed dimensions.  Only the dynamic ones are passed.
ASYNC_WAIT - an async and a set of waits (possibly zero).

I have arranged for the key to have a slot that can later be filled by 
device_type, and hence support multiple device types.

The constant dimensions can be used in expansion of the GOACC_nid function in 
the device compiler.  The device compiler could also process that list to select 
the device_type slot that is appropriate.

For PTX the backend is augmented to emit the launch dimensions into the target 
data, from whence the ptx plugin can pick them up and overwrite with any dynamic 
ones passed in from the launch function.

nathan

Comments

Nathan Sidwell July 28, 2015, 5:03 p.m. UTC | #1
Oh, one more thing.  I placed constants for the 3 launch dimensions into 
gomp-constants.h, as they are needed by both library and compiler.  Working on a 
patch to remove the current set of constants from omp-low.h

nathan
Thomas Schwinge July 29, 2015, 9:22 a.m. UTC | #2
Hi Nathan!

On Tue, 28 Jul 2015 12:52:02 -0400, Nathan Sidwell <nathan@acm.org> wrote:
> I've committed this patch to the gomp4 branch to redo the launch API.  I'll post 
> a version for trunk once the versioning patch gets approved & committed.

Thanks!


(I have not yet looked at the patch in detail.)  There is one regression:

    PASS: libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  (test for excess errors)
    [-PASS:-]{+FAIL:+} libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  execution test

    libgomp: Trying to map into device [0x10f7930..0x10f7a30) object when [0x10f7930..0x10f7a30) is already mapped

Likewise for the other torture testing flags.


Grüße,
 Thomas
Nathan Sidwell July 29, 2015, 12:24 p.m. UTC | #3
On 07/29/15 05:22, Thomas Schwinge wrote:
> Hi Nathan!
>
> On Tue, 28 Jul 2015 12:52:02 -0400, Nathan Sidwell <nathan@acm.org> wrote:
>> I've committed this patch to the gomp4 branch to redo the launch API.  I'll post
>> a version for trunk once the versioning patch gets approved & committed.
>
> Thanks!
>
>
> (I have not yet looked at the patch in detail.)  There is one regression:
>
>      PASS: libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  (test for excess errors)
>      [-PASS:-]{+FAIL:+} libgomp.oacc-fortran/asyncwait-2.f90 -DACC_DEVICE_TYPE_nvidia=1 -DACC_MEM_SHARED=0  -O0  execution test
>
>      libgomp: Trying to map into device [0x10f7930..0x10f7a30) object when [0x10f7930..0x10f7a30) is already mapped
>
> Likewise for the other torture testing flags.


Investigating ...  (I've seen those failures be intermittent)

nathan
Cesar Philippidis Aug. 6, 2015, 4:33 p.m. UTC | #4
On 07/28/2015 09:52 AM, Nathan Sidwell wrote:
> I've committed this patch to the gomp4 branch to redo the launch API. 
> I'll post a version for trunk once the versioning patch gets approved &
> committed.
> 
> This changes the API in a number of ways, allowing device-specific
> knowledge to be moved into the device compiler and out of the host
> compiler.
> 
> Firstly, we attach a tuple of launch dimensions as an attribute to the
> offloaded function's 'oacc function' attribute.  These are the constant
> launch dimensions.  Dynamic dimensions get a zero for their slot in this
> list.  Further this list can be extended in the future to an alist keyed
> by device_type.
> 
> Dynamic dimensions are computed on the host.  however they are passed
> via varadic args to the GOACC_parallel function (which is renamed).  The
> varadic args are passed using key/value representation, and 3 keys are
> currently defined:
> END -- end of the varadic list
> DIM - set of runtime-computed dimensions.  Only the dynamic ones are
> passed.
> ASYNC_WAIT - an async and a set of waits (possibly zero).
> 
> I have arranged for the key to have a slot that can later be filled by
> device_type, and hence support multiple device types.
> 
> The constant dimensions can be used in expansion of the GOACC_nid
> function in the device compiler.  The device compiler could also process
> that list to select the device_type slot that is appropriate.
> 
> For PTX the backend is augmented to emit the launch dimensions into the
> target data, from whence the ptx plugin can pick them up and overwrite
> with any dynamic ones passed in from the launch function.

Looking at set_oacc_fn_attrib, it appears that const values are also
considered dynamic. See the attached test case more more info. Is that
the expected behavior? If not, I could take a look at this after I
finished my reduction patch.

Cesar
Nathan Sidwell Aug. 6, 2015, 7:51 p.m. UTC | #5
On 08/06/15 18:33, Cesar Philippidis wrote:

> Looking at set_oacc_fn_attrib, it appears that const values are also
> considered dynamic. See the attached test case more more info. Is that
> the expected behavior? If not, I could take a look at this after I
> finished my reduction patch.

It's annoying that the offload  call is happening too early for that kind of 
constant propagation.  But I guess it might have been propagated by the time we 
get to oacc_xform.  And hence that could optimize there.

Anyway, a thing to notice but not get distracted by.

nathan
diff mbox

Patch

2015-07-28  Nathan Sidwell  <nathan@codesourcery.com>

	include/
	* gomp-constants.h (GOMP_DIM_GANG, GOMP_DIM_WORKER,
	GOMP_DIM_VECTOR): New.
	(GOMP_DIM_MAX, GOMP_DIM_MASK): New.
	(GOMP_LAUNCH_END, GOMP_LAUNCH_DIM, GOMP_LAUNCH_ASYNC_WAIT): New.
	(GOMP_LAUNCH_CODE_SHIFT, GOMP_LAUCNH_DEVICE_SHIFT,
	GOMP_LAUNCH_OP_SHIFT): New.
	(GOMP_LAUNCH_PACK, GOMP_LAUNCH_CODE, GOMP_LAUNCH_DEVICE,
	GOMP_LAUNCH_OP): New.
	(GOMP_VERSION_NVIDIA_PTX): Increment to 1.

	gcc/
	* tree.h (OMP_CLAUSE_EXPR): New.
	* omp-low.c (creste_omp_child_function): Do not set oacc function
	attribute here.
	(oacc_launch_pack): New.
	(OACC_FN_ATTRIB): New define.
	(set_oacc_fn_attrib): New.
	(get_oacc_fn_attrib): New.
	(expand_omp_target): Reimplement openacc launch parameters.
	* omp-low.h (get_oacc_fn_attrib): Declare.
	* omp-builtins.def (BUILT_IN_GOACC_KERNELS_INTERNAL): Change type.
	(BUILT_IN_GOACC_PARALLEL): Change type and target name.
	* builtin-types.def
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): Replace with ...
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR): ... this.
	* tree-parloops.c (create_parallel_loop): Adjust index of
	shared_size arg.
	* except.c: Include omp-low.h
	(finish_eh_generation): Call get_oacc_fn_attrib.
	* config/nvptx/mkoffload.c (process): Accumulate compute grid
	dimensions and emit them.
	* config/nvptx/nvptx.c: Include gomp-constants.h
	(nvptx_record_offload_symbol): Emit compute grid dimensions.

	libgomp/
	* libgomp.map: Add GOACC_parallel_keyed.
	* libgomp.h (struct acc_dispatch_t): Change exec_func parameters.
	* libgomp_g.h (GOACC_parallel): Replace with ...
	(GOACC_parallel_keyed): ... this.
	* oacc-parallel.c (goacc_wait): Take pointer to va_list.  Adjust
	all callers.
	(GOACC_parallel_keyed): Use varadic keyed interface for optional
	parameters.  Renamed from ...
	(GOACC_parallel): ... here.  Replace with forwarding fn.
	* plugin/plugin-host.c (GOMP_OFFLOAD_openacc_parallel): Adjust
	parameters.
	* plugin/plugin-nvptx.c (struct targ_fn_launch): New structure.
	(targ_fn_descriptor): Point to targ_fn_launch instance.
	(nvptx_exec): Adjust parameters.  Process compute dimensions.
	(struct nvptx_tdata): Adjust type.
	(GOMP_OFFLOAD_load_image_ver): Adjust function handling.
	(GOMP_OFFLOAD_openacc_parallel): Adjust.


	gcc/c-family/
	* c-common.c (DEF_FUNCTION_TYPE_VAR_12): Delete.

	gcc/fortran/
	* f95-lang.c (DEF_FUNCTION_TYPE_VAR_12): Delete.
	* types.def (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): Replace with ...
	(BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR): ... this.

	gcc/lto/
	* lto-lang.c (DEF_FUNCTION_TYPE_VAR_12): Delete.

Index: libgomp/oacc-parallel.c
===================================================================
--- libgomp/oacc-parallel.c	(revision 226312)
+++ libgomp/oacc-parallel.c	(working copy)
@@ -171,15 +171,12 @@  goacc_deallocate_static (acc_device_t d)
   alloc_done = false;
 }
 
-static void goacc_wait (int async, int num_waits, va_list ap);
+static void goacc_wait (int async, int num_waits, va_list *ap);
 
 void
-GOACC_parallel (int device, void (*fn) (void *),
-		size_t mapnum, void **hostaddrs, size_t *sizes,
-		unsigned short *kinds,
-		int num_gangs, int num_workers, int vector_length,
-		size_t shared_size,
-		int async, int num_waits, ...)
+GOACC_parallel_keyed (int device, void (*fn) (void *), size_t mapnum,
+		      void **hostaddrs, size_t *sizes, unsigned short *kinds,
+		      size_t shared_size, ...)
 {
   bool host_fallback = device == GOMP_DEVICE_HOST_FALLBACK;
   va_list ap;
@@ -191,17 +188,21 @@  GOACC_parallel (int device, void (*fn) (
   struct splay_tree_key_s k;
   splay_tree_key tgt_fn_key;
   void (*tgt_fn);
+  int async = GOMP_ASYNC_SYNC;
+  unsigned dims[3];
+  unsigned tag;
 
+  memset (dims, 0, sizeof (dims));
 #ifdef HAVE_INTTYPES_H
   gomp_debug (0, "%s: mapnum=%"PRIu64", hostaddrs=%p, sizes=%p, kinds=%p, "
-	      "shared_size=%"PRIu64", async = %d\n",
+	      "shared_size=%"PRIu64"\n",
 	      __FUNCTION__, (uint64_t) mapnum, hostaddrs, sizes, kinds,
-	      (uint64_t) shared_size, async);
+	      (uint64_t) shared_size);
 #else
   gomp_debug (0, "%s: mapnum=%lu, hostaddrs=%p, sizes=%p, kinds=%p, "
-	      "shared_size=%lu, async=%d\n",
+	      "shared_size=%lu\n",
 	      __FUNCTION__, (unsigned long) mapnum, hostaddrs, sizes, kinds,
-	      (unsigned long) shared_size, async);
+	      (unsigned long) shared_size);
 #endif
 
   alloc_ganglocal_addrs (mapnum, hostaddrs, sizes, kinds);
@@ -249,12 +250,36 @@  GOACC_parallel (int device, void (*fn) (
   if (acc_device_type (acc_dev->type) == acc_device_host_nonshm)
     alloc_host_shared_mem (shared_size);
 
-  if (num_waits)
+  va_start (ap, shared_size);
+  /* TODO: This will need amending when device_type is implemented.  */
+  while (GOMP_LAUNCH_PACK (GOMP_LAUNCH_END, 0, 0)
+	 != (tag = va_arg (ap, unsigned)))
     {
-      va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
-      va_end (ap);
+      assert (!GOMP_LAUNCH_DEVICE (tag));
+      switch (GOMP_LAUNCH_CODE (tag))
+	{
+	case GOMP_LAUNCH_DIM:
+	  {
+	    unsigned mask = GOMP_LAUNCH_OP (tag);
+
+	    for (i = 0; i != 3; i++)
+	      if (mask & (1 << i)) /* FIXME: move to gomp-constants. */
+		dims[i] = va_arg (ap, unsigned);
+	  }
+	  break;
+
+	case GOMP_LAUNCH_ASYNC_WAIT:
+	  {
+	    unsigned num_waits = GOMP_LAUNCH_OP (tag);
+
+	    async = va_arg (ap, unsigned);
+	    if (num_waits)
+	      goacc_wait (async, num_waits, &ap);
+	    break;
+	  }
+	}
     }
+  va_end (ap);
   
   acc_dev->openacc.async_set_async_func (async);
 
@@ -287,9 +312,8 @@  GOACC_parallel (int device, void (*fn) (
 	devaddrs[i] = NULL;
     }
 
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, sizes, kinds,
-			      num_gangs, num_workers, vector_length, async,
-			      shared_size, tgt);
+  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+			      shared_size, async, dims, tgt);
 
   /* If running synchronously, unmap immediately.  */
   if (async < acc_async_noval)
@@ -306,6 +330,39 @@  GOACC_parallel (int device, void (*fn) (
     free_host_shared_mem ();
 }
 
+/* Legacy entry point.   */
+
+void
+GOACC_parallel (int device, void (*fn) (void *), size_t mapnum,
+		void **hostaddrs, size_t *sizes, unsigned short *kinds,
+		int num_gangs, int num_workers, int vector_length,
+		size_t shared_size,
+		int async, int num_waits, ...)
+{
+  int waits[9];
+  unsigned ix;
+  va_list ap;
+
+  if (num_waits > 8)
+    gomp_fatal ("too many waits for legacy interface");
+
+  va_start (ap, num_waits);
+  for (ix = 0; ix != num_waits; ix++)
+    waits[ix] = va_arg (ap, int);
+  va_end (ap);
+  waits[ix] = GOMP_LAUNCH_PACK (GOMP_LAUNCH_END, 0, 0);
+  
+  GOACC_parallel_keyed (device, fn, mapnum, hostaddrs, sizes, kinds,
+			shared_size,
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_DIM, 0,
+					  GOMP_DIM_MASK (GOMP_DIM_MAX) - 1),
+			num_gangs, num_workers, vector_length,
+			GOMP_LAUNCH_PACK (GOMP_LAUNCH_ASYNC_WAIT,
+					  0, num_waits),
+			async, waits[0], waits[1], waits[2], waits[3],
+			waits[4], waits[5], waits[6], waits[7], waits[8]);
+}
+
 void
 GOACC_data_start (int device, size_t mapnum,
 		  void **hostaddrs, size_t *sizes, unsigned short *kinds)
@@ -399,7 +456,7 @@  GOACC_enter_exit_data (int device, size_
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
 
@@ -529,15 +586,15 @@  GOACC_enter_exit_data (int device, size_
 }
 
 static void
-goacc_wait (int async, int num_waits, va_list ap)
+goacc_wait (int async, int num_waits, va_list *ap)
 {
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
   while (num_waits--)
     {
-      int qid = va_arg (ap, int);
-
+      int qid = va_arg (*ap, int);
+      
       if (acc_async_test (qid))
 	continue;
 
@@ -574,7 +631,7 @@  GOACC_update (int device, size_t mapnum,
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
 
@@ -616,7 +673,7 @@  GOACC_wait (int async, int num_waits, ..
       va_list ap;
 
       va_start (ap, num_waits);
-      goacc_wait (async, num_waits, ap);
+      goacc_wait (async, num_waits, &ap);
       va_end (ap);
     }
   else if (async == acc_async_sync)
Index: libgomp/libgomp_g.h
===================================================================
--- libgomp/libgomp_g.h	(revision 226312)
+++ libgomp/libgomp_g.h	(working copy)
@@ -222,9 +222,9 @@  extern void GOACC_data_start (int, size_
 extern void GOACC_data_end (void);
 extern void GOACC_enter_exit_data (int, size_t, void **,
 				   size_t *, unsigned short *, int, int, ...);
-extern void GOACC_parallel (int, void (*) (void *), size_t,
-			    void **, size_t *, unsigned short *, int, int, int,
-			    size_t, int, int, ...);
+extern void GOACC_parallel_keyed (int, void (*) (void *), size_t,
+				  void **, size_t *, unsigned short *,
+				  size_t, ...);
 extern void GOACC_update (int, size_t, void **, size_t *,
 			  unsigned short *, int, int, ...);
 extern void GOACC_wait (int, int, ...);
Index: libgomp/libgomp.map
===================================================================
--- libgomp/libgomp.map	(revision 226312)
+++ libgomp/libgomp.map	(working copy)
@@ -337,6 +337,7 @@  GOACC_2.0.GOMP_4_BRANCH {
   global:
 	GOACC_deviceptr;
 	GOACC_get_ganglocal_ptr;
+	GOACC_parallel_keyed;
 	GOACC_register_static;
 } GOACC_2.0;
 
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h	(revision 226312)
+++ libgomp/libgomp.h	(working copy)
@@ -694,8 +694,8 @@  typedef struct acc_dispatch_t
   struct target_mem_desc *data_environ;
 
   /* Execute.  */
-  void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t *,
-		     unsigned short *, int, int, int, int, size_t, void *);
+  void (*exec_func) (void (*) (void *), size_t, void **, void **, size_t, int,
+		     unsigned *, void *);
 
   /* Async cleanup callback registration.  */
   void (*register_async_cleanup_func) (void *);
Index: libgomp/plugin/plugin-host.c
===================================================================
--- libgomp/plugin/plugin-host.c	(revision 226312)
+++ libgomp/plugin/plugin-host.c	(working copy)
@@ -170,13 +170,9 @@  GOMP_OFFLOAD_openacc_parallel (void (*fn
 			       size_t mapnum __attribute__ ((unused)),
 			       void **hostaddrs __attribute__ ((unused)),
 			       void **devaddrs __attribute__ ((unused)),
-			       size_t *sizes __attribute__ ((unused)),
-			       unsigned short *kinds __attribute__ ((unused)),
-			       int num_gangs __attribute__ ((unused)),
-			       int num_workers __attribute__ ((unused)),
-			       int vector_length __attribute__ ((unused)),
-			       int async __attribute__ ((unused)),
 			       size_t shared_size __attribute__ ((unused)),
+			       int async __attribute__ ((unused)),
+			       unsigned *dims __attribute__ ((unused)),
 			       void *targ_mem_desc __attribute__ ((unused)))
 {
 #ifdef HOST_NONSHM_PLUGIN
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 226312)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -282,12 +282,20 @@  map_push (struct ptx_stream *s, int asyn
   return;
 }
 
+/* Target data function launch information.  */
+
+struct targ_fn_launch
+{
+  const char *fn;
+  unsigned short dim[GOMP_DIM_MAX];
+};
+
 /* Descriptor of a loaded function.  */
 
 struct targ_fn_descriptor
 {
   CUfunction fn;
-  const char *name;
+  const struct targ_fn_launch *launch;
 };
 
 /* A loaded PTX image.  */
@@ -988,9 +996,8 @@  event_add (enum ptx_event_type type, CUe
 
 void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	  size_t *sizes, unsigned short *kinds, int num_gangs, int num_workers,
-	  int vector_length, int async, size_t shared_size,
-	  void *targ_mem_desc)
+	    size_t shared_size, int async, unsigned dims[GOMP_DIM_MAX],
+	    void *targ_mem_desc)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
@@ -1007,6 +1014,13 @@  nvptx_exec (void (*fn), size_t mapnum, v
   dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
   assert (dev_str == nvthd->current_stream);
 
+  /* Initialize the launch dimensions.  Typically this is constant,
+     provided by the device compiler, but we must permit runtime
+     values.  */
+  for (i = 0; i != GOMP_DIM_MAX; i++)
+    if (targ_fn->launch->dim[i])
+      dims[i] = targ_fn->launch->dim[i];
+
   /* This reserves a chunk of a pre-allocated page of memory mapped on both
      the host and the device. HP is a host pointer to the new chunk, and DP is
      the corresponding device pointer.  */
@@ -1026,8 +1040,8 @@  nvptx_exec (void (*fn), size_t mapnum, v
 
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " gangs=%u, workers=%u, vectors=%u, shared=%u\n",
-		     __FUNCTION__, targ_fn->name, num_gangs, num_workers,
-		     vector_length, (unsigned)shared_size);
+		     __FUNCTION__, targ_fn->launch->fn,
+		     dims[0], dims[1], dims[2], (unsigned)shared_size);
 
   // OpenACC		CUDA
   //
@@ -1037,8 +1051,8 @@  nvptx_exec (void (*fn), size_t mapnum, v
 
   kargs[0] = &dp;
   r = cuLaunchKernel (function,
-		      num_gangs, 1, 1,
-		      vector_length, num_workers, 1,
+		      dims[GOMP_DIM_GANG], 1, 1,
+		      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		      shared_size, dev_str->stream, kargs, 0);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
@@ -1084,7 +1098,7 @@  nvptx_exec (void (*fn), size_t mapnum, v
 #endif
 
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
-		     targ_fn->name);
+		     targ_fn->launch->fn);
 
 #ifndef DISABLE_ASYNC
   if (async < acc_async_noval)
@@ -1611,7 +1625,7 @@  typedef struct nvptx_tdata
   const char *const *var_names;
   size_t var_num;
 
-  const char *const *fn_names;
+  const struct targ_fn_launch *fn_descs;
   size_t fn_num;
 } nvptx_tdata_t;
 
@@ -1633,7 +1647,8 @@  GOMP_OFFLOAD_load_image_ver (unsigned ve
 			     struct addr_pair **target_table)
 {
   CUmodule module;
-  const char *const *fn_names, *const *var_names;
+  const char *const *var_names;
+  const struct targ_fn_launch *fn_descs;
   unsigned int fn_entries, var_entries, i, j;
   CUresult r;
   struct targ_fn_descriptor *targ_fns;
@@ -1662,7 +1677,7 @@  GOMP_OFFLOAD_load_image_ver (unsigned ve
   var_entries = img_header->var_num;
   var_names = img_header->var_names;
   fn_entries = img_header->fn_num;
-  fn_names = img_header->fn_names;
+  fn_descs = img_header->fn_descs;
 
   targ_tbl = GOMP_PLUGIN_malloc (sizeof (struct addr_pair)
 				 * (fn_entries + var_entries));
@@ -1685,12 +1700,12 @@  GOMP_OFFLOAD_load_image_ver (unsigned ve
     {
       CUfunction function;
 
-      r = cuModuleGetFunction (&function, module, fn_names[i]);
+      r = cuModuleGetFunction (&function, module, fn_descs[i].fn);
       if (r != CUDA_SUCCESS)
 	GOMP_PLUGIN_fatal ("cuModuleGetFunction error: %s", cuda_error (r));
 
       targ_fns->fn = function;
-      targ_fns->name = (const char *) fn_names[i];
+      targ_fns->launch = &fn_descs[i];
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
@@ -1770,13 +1785,12 @@  void (*device_run) (int n, void *fn_ptr,
 
 void
 GOMP_OFFLOAD_openacc_parallel (void (*fn) (void *), size_t mapnum,
-			       void **hostaddrs, void **devaddrs, size_t *sizes,
-			       unsigned short *kinds, int num_gangs,
-			       int num_workers, int vector_length, int async,
-			       size_t shared_size, void *targ_mem_desc)
+			       void **hostaddrs, void **devaddrs,
+			       size_t shared_size,
+			       int async, unsigned *dims, void *targ_mem_desc)
 {
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, sizes, kinds, num_gangs,
-	      num_workers, vector_length, async, shared_size, targ_mem_desc);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, shared_size,
+	      async, dims, targ_mem_desc);
 }
 
 void
Index: include/gomp-constants.h
===================================================================
--- include/gomp-constants.h	(revision 226312)
+++ include/gomp-constants.h	(working copy)
@@ -122,9 +122,30 @@  enum gomp_map_kind
 #define GOMP_DEVICE_ICV			-1
 #define GOMP_DEVICE_HOST_FALLBACK	-2
 
+#define GOMP_DIM_GANG	0
+#define GOMP_DIM_WORKER	1
+#define GOMP_DIM_VECTOR	2
+#define GOMP_DIM_MAX	3
+#define GOMP_DIM_MASK(X) (1 << (X))
+
+/* Varadic launch arguments.  */
+#define GOMP_LAUNCH_END 	0  /* End of args, no dev or op */
+#define GOMP_LAUNCH_DIM		1  /* Launch dimensions, op = mask */
+#define GOMP_LAUNCH_ASYNC_WAIT	2  /* Async & Waits, op = num waits.  */
+#define GOMP_LAUNCH_CODE_SHIFT	28
+#define GOMP_LAUNCH_DEVICE_SHIFT 16
+#define GOMP_LAUNCH_OP_SHIFT 0
+#define GOMP_LAUNCH_PACK(CODE,DEVICE,OP)  \
+  (((CODE) << GOMP_LAUNCH_CODE_SHIFT)		\
+   | ((DEVICE) << GOMP_LAUNCH_DEVICE_SHIFT)	\
+   | ((OP) << GOMP_LAUNCH_OP_SHIFT))
+#define GOMP_LAUNCH_CODE(X) (((X) >> GOMP_LAUNCH_CODE_SHIFT) & 0xf)
+#define GOMP_LAUNCH_DEVICE(X) (((X) >> GOMP_LAUNCH_DEVICE_SHIFT) & 0xfff)
+#define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
+
 /* Versions of libgomp and device-specific plugins.  */
 #define GOMP_VERSION	0
-#define GOMP_VERSION_NVIDIA_PTX 0
+#define GOMP_VERSION_NVIDIA_PTX 1
 
 #define GOMP_VERSION_PACK(LIB, DEV) (((LIB) << 16) | (DEV))
 #define GOMP_VERSION_LIB(PACK) (((PACK) >> 16) & 0xffff)
Index: gcc/tree-parloops.c
===================================================================
--- gcc/tree-parloops.c	(revision 226312)
+++ gcc/tree-parloops.c	(working copy)
@@ -2084,7 +2084,7 @@  create_parallel_loop (struct loop *loop,
       tree data_arg = gimple_omp_target_data_arg (kernels);
       gimple_omp_target_set_data_arg (stmt, data_arg);
       tree ganglocal_size
-	= gimple_call_arg (goacc_kernels_internal, /* TODO */ 9);
+	= gimple_call_arg (goacc_kernels_internal, /* TODO */ 6);
       gimple_omp_target_set_ganglocal_size (stmt, ganglocal_size);
 
       gimple_set_location (stmt, loc);
Index: gcc/tree.h
===================================================================
--- gcc/tree.h	(revision 226312)
+++ gcc/tree.h	(working copy)
@@ -1360,6 +1360,8 @@  extern void protected_set_expr_location
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_SCHEDULE), 0)
 
 /* OpenACC clause expressions  */
+#define OMP_CLAUSE_EXPR(NODE, CLAUSE) \
+  OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, CLAUSE), 0)
 #define OMP_CLAUSE_GANG_EXPR(NODE) \
   OMP_CLAUSE_OPERAND ( \
     OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_GANG), 0)
Index: gcc/except.c
===================================================================
--- gcc/except.c	(revision 226312)
+++ gcc/except.c	(working copy)
@@ -151,6 +151,7 @@  along with GCC; see the file COPYING3.
 #include "cfgloop.h"
 #include "builtins.h"
 #include "tree-hash-traits.h"
+#include "omp-low.h"
 
 static GTY(()) int call_site_base;
 
@@ -1491,8 +1492,7 @@  finish_eh_generation (void)
 {
   basic_block bb;
 
-  if (lookup_attribute ("oacc function",
-			DECL_ATTRIBUTES (current_function_decl)))
+  if (get_oacc_fn_attrib (current_function_decl))
     return;
 
   /* Construct the landing pads.  */
Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 226312)
+++ gcc/omp-low.c	(working copy)
@@ -2302,12 +2302,6 @@  create_omp_child_function (omp_context *
       = tree_cons (get_identifier ("omp target entrypoint"),
                    NULL_TREE, DECL_ATTRIBUTES (decl));
 
-  if (is_gimple_omp_oacc (ctx->stmt)
-      && !lookup_attribute ("omp function", DECL_ATTRIBUTES (decl)))
-    DECL_ATTRIBUTES (decl)
-      = tree_cons (get_identifier ("oacc function"), NULL_TREE,
-		   DECL_ATTRIBUTES (decl));
-
   t = build_decl (DECL_SOURCE_LOCATION (decl),
 		  RESULT_DECL, NULL_TREE, void_type_node);
   DECL_ARTIFICIAL (t) = 1;
@@ -9277,6 +9271,92 @@  loop_get_oacc_kernels_region_entry (stru
     }
 }
 
+/* Encode an oacc launch argument.  This matches the GOMP_LAUNCH_PACK
+   macro on gomp-constants.h.  We do not check for overflow.  */
+
+static tree
+oacc_launch_pack (unsigned code, tree device, unsigned op)
+{
+  tree res;
+  
+  res = build_int_cst (unsigned_type_node, GOMP_LAUNCH_PACK (code, 0, op));
+  if (device)
+    {
+      device = fold_build2 (LSHIFT_EXPR, unsigned_type_node,
+			    device, build_int_cst (unsigned_type_node,
+						   GOMP_LAUNCH_DEVICE_SHIFT));
+      res = fold_build2 (BIT_IOR_EXPR, unsigned_type_node, res, device);
+    }
+  return res;
+}
+
+/* Look for compute grid dimension clauses and convert to an attribute
+   attached to FN.  This permits the target-side code to (a) massage
+   the dimensions, (b) emit that data and (c) optimize.  Non-constant
+   dimensions are pushed onto ARGS.
+
+   The attribute value is a TREE_LIST.  A set of dimensions is
+   represented as a list of INTEGER_CST.  Those that are runtime
+   expres are represented as an INTEGER_CST of zero.
+
+   TOOO. Normally the attribute will just contain a single such list.  If
+   however it contains a list of lists, this will represent the use of
+   device_type.  Each member of the outer list is an assoc list of
+   dimensions, keyed by the device type.  The first entry will be the
+   default.  Well, that's the plan.  */
+
+#define OACC_FN_ATTRIB "oacc function"
+
+static void
+set_oacc_fn_attrib (tree clauses, tree fn, vec<tree> *args)
+{
+  /* Must match GOMP_DIM ordering.  */
+  static const omp_clause_code ids[] = 
+    {OMP_CLAUSE_NUM_GANGS, OMP_CLAUSE_NUM_WORKERS, OMP_CLAUSE_VECTOR_LENGTH};
+  unsigned ix;
+  tree dims[GOMP_DIM_MAX];
+  tree attr = NULL_TREE;
+  unsigned non_const = 0;
+
+  for (ix = GOMP_DIM_MAX; ix--;)
+    {
+      tree clause = find_omp_clause (clauses, ids[ix]);
+      tree dim;
+
+      if (!clause)
+	dim = integer_one_node;
+      else
+	dim = OMP_CLAUSE_EXPR (clause, ids[ix]);
+      dims[ix] = dim;
+      if (TREE_CODE (dim) != INTEGER_CST)
+	{
+	  dim = integer_zero_node;
+	  non_const |= GOMP_DIM_MASK (ix);
+	}
+      attr = tree_cons (NULL_TREE, dim, attr);
+    }
+
+  /* Add the attributes.  */
+  DECL_ATTRIBUTES (fn) =
+    tree_cons (get_identifier (OACC_FN_ATTRIB), attr, DECL_ATTRIBUTES (fn));
+
+  if (non_const)
+    {
+      /* Push a dynamic argument set.  */
+      args->safe_push (oacc_launch_pack (GOMP_LAUNCH_DIM,
+					 NULL_TREE, non_const));
+      for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	if (non_const & GOMP_DIM_MASK (ix))
+	  args->safe_push (dims[ix]);
+    }
+}
+
+tree
+get_oacc_fn_attrib (tree fn)
+{
+  return lookup_attribute (OACC_FN_ATTRIB, DECL_ATTRIBUTES (fn));
+}
+
 /* Expand the GIMPLE_OMP_TARGET starting at REGION.  */
 
 static void
@@ -9728,6 +9808,7 @@  expand_omp_target (struct omp_region *re
     }
 
   gimple g;
+  bool tagging = false;
   /* The maximum number used by any start_ix, without varargs.  */
   auto_vec<tree, 12> args;
   args.quick_push (device);
@@ -9767,45 +9848,20 @@  expand_omp_target (struct omp_region *re
     case BUILT_IN_GOACC_KERNELS_INTERNAL:
     case BUILT_IN_GOACC_PARALLEL:
       {
-	tree t_num_gangs, t_num_workers, t_vector_length;
-
-	/* Default values for num_gangs, num_workers, and vector_length.  */
-	t_num_gangs = t_num_workers = t_vector_length
-	  = fold_convert_loc (gimple_location (entry_stmt),
-			      integer_type_node, integer_one_node);
-	/* ..., but if present, use the value specified by the respective
-	   clause.  */
-	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_GANGS);
-	if (c)
-	  t_num_gangs = OMP_CLAUSE_NUM_GANGS_EXPR (c);
-	c = find_omp_clause (clauses, OMP_CLAUSE_NUM_WORKERS);
-	if (c)
-	  t_num_workers = OMP_CLAUSE_NUM_WORKERS_EXPR (c);
-	c = find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH);
-	if (c)
-	  t_vector_length = OMP_CLAUSE_VECTOR_LENGTH_EXPR (c);
-	 
-	args.quick_push (t_num_gangs);
-	args.quick_push (t_num_workers);
-	args.quick_push (t_vector_length);
 	args.quick_push (gimple_omp_target_ganglocal_size (entry_stmt));
+	set_oacc_fn_attrib (clauses, child_fn, &args);
+	tagging = true;
       }
       /* FALLTHRU */
     case BUILT_IN_GOACC_ENTER_EXIT_DATA:
     case BUILT_IN_GOACC_UPDATE:
       {
-	tree t_async;
-	int t_wait_idx;
+	tree t_async = NULL_TREE;
 
 	c = find_omp_clause (clauses, OMP_CLAUSE_DEVICE_TYPE);
 	if (c)
 	  sorry ("device_type clause is not supported yet");
 
-	/* Default values for t_async.  */
-	t_async = fold_convert_loc (gimple_location (entry_stmt),
-				    integer_type_node,
-				    build_int_cst (integer_type_node,
-						   GOMP_ASYNC_SYNC));
 	/* ..., but if present, use the value specified by the respective
 	   clause, making sure that is of the correct type.  */
 	c = find_omp_clause (clauses, OMP_CLAUSE_ASYNC);
@@ -9813,42 +9869,59 @@  expand_omp_target (struct omp_region *re
 	  t_async = fold_convert_loc (OMP_CLAUSE_LOCATION (c),
 				      integer_type_node,
 				      OMP_CLAUSE_ASYNC_EXPR (c));
+	else if (!tagging)
+	  /* Default values for t_async.  */
+	  t_async = fold_convert_loc (gimple_location (entry_stmt),
+				      integer_type_node,
+				      build_int_cst (integer_type_node,
+						     GOMP_ASYNC_SYNC));
+	if (t_async && !tagging)
+	  {
+	    args.safe_push (t_async);
+	    t_async = NULL_TREE;
+	  }
 
-	args.quick_push (t_async);
-	/* Save the index, and... */
-	t_wait_idx = args.length ();
-	/* ... push a default value.  */
-	args.quick_push (fold_convert_loc (gimple_location (entry_stmt),
-					   integer_type_node,
-					   integer_zero_node));
+	/* Save the argument index, and... */
+	unsigned t_wait_idx = args.length ();
+	unsigned num_waits = 0;
 	c = find_omp_clause (clauses, OMP_CLAUSE_WAIT);
-	if (c)
-	  {
-	    int n = 0;
+	if (!tagging || c || t_async)
+	  /* ... push a placeholder.  */
+	  args.safe_push (integer_zero_node);
 
-	    for (; c; c = OMP_CLAUSE_CHAIN (c))
-	      {
-		if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
-		  {
-		    args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
-						      integer_type_node,
-						      OMP_CLAUSE_WAIT_EXPR (c)));
-		    n++;
-		  }
-	      }
+	if (tagging && t_async)
+	  args.safe_push (t_async);
+	
+	for (; c; c = OMP_CLAUSE_CHAIN (c))
+	  if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_WAIT)
+	    {
+	      args.safe_push (fold_convert_loc (OMP_CLAUSE_LOCATION (c),
+						integer_type_node,
+						OMP_CLAUSE_WAIT_EXPR (c)));
+	      num_waits++;
+	    }
 
-	    /* Now that we know the number, replace the default value.  */
-	    args.ordered_remove (t_wait_idx);
-	    args.quick_insert (t_wait_idx,
-			       fold_convert_loc (gimple_location (entry_stmt),
-						 integer_type_node,
-						 build_int_cst (integer_type_node, n)));
+	if (!tagging || num_waits || t_async)
+	  {
+	    tree len;
+
+	    /* Now that we know the number, update the placeholder.  */
+	    if (tagging)
+	      len = oacc_launch_pack (GOMP_LAUNCH_ASYNC_WAIT,
+				      NULL_TREE, num_waits);
+	    else
+	      len = build_int_cst (integer_type_node, num_waits);
+	    len = fold_convert_loc (gimple_location (entry_stmt),
+				    unsigned_type_node, len);
+	    args[t_wait_idx] = len;
 	  }
       }
       break;
     default:
       gcc_unreachable ();
     }
+  if (tagging)
+    args.safe_push (oacc_launch_pack (GOMP_LAUNCH_END, NULL_TREE, 0));
 
   g = gimple_build_call_vec (builtin_decl_explicit (start_ix), args);
   gimple_set_location (g, gimple_location (entry_stmt));
Index: gcc/omp-low.h
===================================================================
--- gcc/omp-low.h	(revision 226312)
+++ gcc/omp-low.h	(working copy)
@@ -41,6 +41,7 @@  extern bool make_gimple_omp_edges (basic
 extern void omp_finish_file (void);
 extern bool gimple_stmt_omp_data_i_init_p (gimple);
 extern basic_block loop_get_oacc_kernels_region_entry (struct loop *);
+extern tree get_oacc_fn_attrib (tree);
 
 extern GTY(()) vec<tree, va_gc> *offload_funcs;
 extern GTY(()) vec<tree, va_gc> *offload_vars;
Index: gcc/lto/lto-lang.c
===================================================================
--- gcc/lto/lto-lang.c	(revision 226312)
+++ gcc/lto/lto-lang.c	(working copy)
@@ -164,9 +164,6 @@  enum lto_builtin_type
 				NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "builtin-types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -186,7 +183,6 @@  enum lto_builtin_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   BT_LAST
 };
@@ -674,10 +670,6 @@  lto_define_builtins (tree va_list_ref_ty
 #define DEF_FUNCTION_TYPE_VAR_7(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7)				\
   def_fn_type (ENUM, RETURN, 1, 7, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6, ARG7);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  def_fn_type (ENUM, RETURN, 1, 12, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6,      \
-	       ARG7, ARG8, ARG9, ARG10, ARG11, ARG12);
 #define DEF_POINTER_TYPE(ENUM, TYPE) \
   builtin_types[(int) ENUM] = build_pointer_type (builtin_types[(int) TYPE]);
 
@@ -700,7 +692,6 @@  lto_define_builtins (tree va_list_ref_ty
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/builtin-types.def
===================================================================
--- gcc/builtin-types.def	(revision 226312)
+++ gcc/builtin-types.def	(working copy)
@@ -596,10 +596,9 @@  DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
 
-DEF_FUNCTION_TYPE_VAR_12 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
-			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
-			  BT_PTR, BT_PTR, BT_PTR, BT_INT, BT_INT, BT_INT,
-			  BT_SIZE, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
+			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR, BT_SIZE)
 
 DEF_POINTER_TYPE (BT_PTR_FN_VOID_VAR, BT_FN_VOID_VAR)
 DEF_FUNCTION_TYPE_3 (BT_FN_PTR_PTR_FN_VOID_VAR_PTR_SIZE,
Index: gcc/omp-builtins.def
===================================================================
--- gcc/omp-builtins.def	(revision 226312)
+++ gcc/omp-builtins.def	(working copy)
@@ -45,11 +45,11 @@  DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC
 			  ATTR_NOTHROW_LIST, "...rrr")
 DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC_KERNELS_INTERNAL,
 			  "GOACC_kernels_internal",
-			  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
+			  BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
 			  ATTR_FNSPEC_DOT_DOT_DOT_DOT_r_r_r_NOTHROW_LIST,
 			  ATTR_NOTHROW_LIST, "....rrr")
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel",
-		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
+DEF_GOACC_BUILTIN (BUILT_IN_GOACC_PARALLEL, "GOACC_parallel_keyed",
+		   BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
 		   ATTR_NOTHROW_LIST)
 DEF_GOACC_BUILTIN_FNSPEC (BUILT_IN_GOACC_UPDATE, "GOACC_update",
 			  BT_FN_VOID_INT_SIZE_PTR_PTR_PTR_INT_INT_VAR,
Index: gcc/c-family/c-common.c
===================================================================
--- gcc/c-family/c-common.c	(revision 226312)
+++ gcc/c-family/c-common.c	(working copy)
@@ -5455,9 +5455,6 @@  enum c_builtin_type
 				NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "builtin-types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -5477,7 +5474,6 @@  enum c_builtin_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   BT_LAST
 };
@@ -5573,10 +5569,6 @@  c_define_builtins (tree va_list_ref_type
 #define DEF_FUNCTION_TYPE_VAR_7(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7)				\
   def_fn_type (ENUM, RETURN, 1, 7, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6, ARG7);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  def_fn_type (ENUM, RETURN, 1, 12, ARG1, ARG2, ARG3, ARG4, ARG5, ARG6,      \
-	       ARG7, ARG8, ARG9, ARG10, ARG11, ARG12);
 #define DEF_POINTER_TYPE(ENUM, TYPE) \
   builtin_types[(int) ENUM] = build_pointer_type (builtin_types[(int) TYPE]);
 
@@ -5599,7 +5591,6 @@  c_define_builtins (tree va_list_ref_type
 #undef DEF_FUNCTION_TYPE_VAR_4
 #undef DEF_FUNCTION_TYPE_VAR_5
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/fortran/types.def
===================================================================
--- gcc/fortran/types.def	(revision 226312)
+++ gcc/fortran/types.def	(working copy)
@@ -222,7 +222,6 @@  DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_
 			 BT_VOID, BT_INT, BT_SIZE, BT_PTR, BT_PTR,
 			 BT_PTR, BT_INT, BT_INT)
 
-DEF_FUNCTION_TYPE_VAR_12 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_INT_INT_INT_SIZE_INT_INT_VAR,
-			  BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
-			  BT_PTR, BT_PTR, BT_PTR, BT_INT, BT_INT, BT_INT,
-			  BT_SIZE, BT_INT, BT_INT)
+DEF_FUNCTION_TYPE_VAR_7 (BT_FN_VOID_INT_OMPFN_SIZE_PTR_PTR_PTR_SIZE_VAR,
+			 BT_VOID, BT_INT, BT_PTR_FN_VOID_PTR, BT_SIZE,
+			 BT_PTR, BT_PTR, BT_PTR, BT_SIZE)
Index: gcc/fortran/f95-lang.c
===================================================================
--- gcc/fortran/f95-lang.c	(revision 226312)
+++ gcc/fortran/f95-lang.c	(working copy)
@@ -662,9 +662,6 @@  gfc_init_builtin_functions (void)
 #define DEF_FUNCTION_TYPE_VAR_2(NAME, RETURN, ARG1, ARG2) NAME,
 #define DEF_FUNCTION_TYPE_VAR_7(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
 				ARG6, ARG7) NAME,
-#define DEF_FUNCTION_TYPE_VAR_12(NAME, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-				 NAME,
 #define DEF_POINTER_TYPE(NAME, TYPE) NAME,
 #include "types.def"
 #undef DEF_PRIMITIVE_TYPE
@@ -680,7 +677,6 @@  gfc_init_builtin_functions (void)
 #undef DEF_FUNCTION_TYPE_VAR_0
 #undef DEF_FUNCTION_TYPE_VAR_2
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
     BT_LAST
   };
@@ -1133,23 +1129,6 @@  gfc_init_builtin_functions (void)
 					builtin_types[(int) ARG6],	\
 					builtin_types[(int) ARG7],	\
 					NULL_TREE);
-#define DEF_FUNCTION_TYPE_VAR_12(ENUM, RETURN, ARG1, ARG2, ARG3, ARG4, ARG5, \
-				 ARG6, ARG7, ARG8, ARG9, ARG10, ARG11, ARG12) \
-  builtin_types[(int) ENUM]						\
-    = build_varargs_function_type_list (builtin_types[(int) RETURN],   	\
-					builtin_types[(int) ARG1],     	\
-					builtin_types[(int) ARG2],     	\
-					builtin_types[(int) ARG3],	\
-					builtin_types[(int) ARG4],	\
-					builtin_types[(int) ARG5],	\
-					builtin_types[(int) ARG6],	\
-					builtin_types[(int) ARG7],	\
-					builtin_types[(int) ARG8],	\
-					builtin_types[(int) ARG9],	\
-					builtin_types[(int) ARG10],	\
-					builtin_types[(int) ARG11],	\
-					builtin_types[(int) ARG12],	\
-					NULL_TREE);
 #define DEF_POINTER_TYPE(ENUM, TYPE)			\
   builtin_types[(int) ENUM]				\
     = build_pointer_type (builtin_types[(int) TYPE]);
@@ -1167,7 +1146,6 @@  gfc_init_builtin_functions (void)
 #undef DEF_FUNCTION_TYPE_VAR_0
 #undef DEF_FUNCTION_TYPE_VAR_2
 #undef DEF_FUNCTION_TYPE_VAR_7
-#undef DEF_FUNCTION_TYPE_VAR_12
 #undef DEF_POINTER_TYPE
   builtin_types[(int) BT_LAST] = NULL_TREE;
 
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 226312)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -59,6 +59,7 @@ 
 #include "dominance.h"
 #include "cfg.h"
 #include "omp-low.h"
+#include "gomp-constants.h"
 
 /* This file should be included last.  */
 #include "target-def.h"
@@ -3015,9 +3016,49 @@  nvptx_cannot_copy_insn_p (rtx_insn *insn
 static void
 nvptx_record_offload_symbol (tree decl)
 {
-  fprintf (asm_out_file, "//:%s_MAP %s\n",
-	   TREE_CODE (decl) == VAR_DECL ? "VAR" : "FUNC",
-	   IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+  switch (TREE_CODE (decl))
+    {
+    case VAR_DECL:
+      fprintf (asm_out_file, "//:VAR_MAP \"%s\"\n",
+	       IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+      break;
+
+    case FUNCTION_DECL:
+      {
+	tree attr = get_oacc_fn_attrib (decl);
+	tree dims = NULL_TREE;
+	unsigned ix;
+	
+	if (attr)
+	  dims = TREE_VALUE (attr);
+	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
+		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
+
+	for (ix = 0; ix != GOMP_DIM_MAX; ix++)
+	  {
+	    unsigned HOST_WIDE_INT dim = 0;
+	    if (dims)
+	      {
+		tree cst = TREE_VALUE (dims);
+
+		/* When device_type support is added an ealier pass
+		   should have massaged the attribute to be
+		   ptx-specific.  */
+		gcc_assert (TREE_CODE (cst) == INTEGER_CST);
+
+		dim = TREE_INT_CST_LOW (cst);
+		dims = TREE_CHAIN (dims);
+	      }
+	    fprintf (asm_out_file, ", " HOST_WIDE_INT_PRINT_HEX, dim);
+	  }
+	
+	fprintf (asm_out_file, "\n");
+      }
+      break;
+  
+    default:
+      gcc_unreachable ();
+    }
 }
 
 /* Implement TARGET_ASM_FILE_START.  Write the kinds of things ptxas expects
Index: gcc/config/nvptx/mkoffload.c
===================================================================
--- gcc/config/nvptx/mkoffload.c	(revision 226312)
+++ gcc/config/nvptx/mkoffload.c	(working copy)
@@ -227,6 +227,8 @@  process (FILE *in, FILE *out)
 {
   size_t len;
   const char *input = read_file (in, &len);
+  const char *comma;
+  id_map const *id;
 
   fprintf (out, "static const char ptx_code[] = \n \"");
   for (size_t i = 0; i < len; i++)
@@ -267,14 +269,18 @@  process (FILE *in, FILE *out)
     }
   fprintf (out, "\";\n\n");
 
-  fprintf (out, "static const char *const var_mappings[] = {\n");
-  for (id_map *id = var_ids; id; id = id->next)
-    fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
-  fprintf (out, "};\n\n");
-  fprintf (out, "static const char *const func_mappings[] = {\n");
-  for (id_map *id = func_ids; id; id = id->next)
-    fprintf (out, "\t\"%s\"%s\n", id->ptx_name, id->next ? "," : "");
-  fprintf (out, "};\n\n");
+  fprintf (out, "static const char *const var_mappings[] = {");
+  for (comma = "", id = var_ids; id; comma = ",", id = id->next)
+    fprintf (out, "%s\n\t%s", comma, id->ptx_name);
+  fprintf (out, "\n};\n\n");
+
+  fprintf (out, "static const struct nvptx_fn {\n"
+	   "  const char *name;\n"
+	   "  unsigned short dim[%d];\n"
+	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
+  for (comma = "", id = func_ids; id; comma = ",", id = id->next)
+    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+  fprintf (out, "\n};\n\n");
 
   fprintf (out,
 	   "static const struct nvptx_tdata {\n"
@@ -282,7 +288,7 @@  process (FILE *in, FILE *out)
 	   "  __SIZE_TYPE__ ptx_len;\n"
 	   "  const char *const *var_names;\n"
 	   "  __SIZE_TYPE__ var_num;\n"
-	   "  const char *const *fn_names;\n"
+	   "  const struct nvptx_fn *fn_names;\n"
 	   "  __SIZE_TYPE__ fn_num;\n"
 	   "} target_data = {\n"
 	   "  ptx_code, sizeof (ptx_code),\n"