
[nvptx] Expand OpenACC child function arguments to use CUDA params space

Message ID 4d7dbb50-e8db-209b-63e2-0efaa18eeec1@mentor.com
State New
Series [nvptx] Expand OpenACC child function arguments to use CUDA params space

Commit Message

Chung-Lin Tang Sept. 10, 2019, 11:41 a.m. UTC
Hi Tom,
this is a completely new implementation of an earlier optimization
that Cesar submitted:
https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01202.html

The objective is to transform the original single-record-pointer argument
form (originating from OpenMP/pthreads) into multiple scalar parameters, which
the CUDA runtime will place directly in the .params space for GPU kernels:

#pragma acc parallel copy(a, b) copyin(c)
{
   a += b;
   b -= c;
}

compiles to GIMPLE as:

__attribute__((oacc function (1, 1, 32), omp target entrypoint))
main._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i)
{
   ...
   _3 = .omp_data_i_2(D)->a;
   _4 = *_3;
   _5 = .omp_data_i_2(D)->b;
   _6 = *_5;
   ...

this patch adds a pass to transform this into:

__attribute__((oacc function (1, 1, 32), omp target entrypoint))
main._omp_fn.0 (int * c, int * b, int * a)
{
   ...
   _3 = a;
   _4 = *_3;
   _5 = b;
   _6 = *_5;
   ...

Cesar's original implementation tried to do this in the middle-end,
which required lots of changes throughout the compiler, libgomp interface,
etc. and required a dependency on libffi for the CPU-host fallback child
function (since there is no longer a known, fixed single-pointer argument
interface to all child functions).

This new implementation works by modifying the GIMPLE for child functions
directly at the very start (before, actually) of RTL expansion, and thus
is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is that we inherently
need something different generated for the host fallback vs. the GPU.

The new nvptx_expand_to_rtl_hook modifies the function decl type and
arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
Detection of OpenACC child functions is done through "omp target entrypoint"
and "oacc function" attributes. Because OpenMP target child functions
have a more elaborate wrapper generated for them, this pass only supports
OpenACC right now.

The libgomp nvptx plugin changes are also quite contained, with lots of
now unneeded profiling code deleted (since we no longer first cuAlloc a
buffer for the argument record before cuLaunchKernel).
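
In sketch form, the launch path changes roughly like this (simplified from
the patch below; launch dimensions are placeholders):

/* Before: stage the argument record in device memory, and pass a single
   pointer to it through kernelParams.  */
CUdeviceptr dp;
CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, hp, mapnum * sizeof (void *));
void *kargs[1] = { &dp };
CUDA_CALL_ASSERT (cuLaunchKernel, function, 1, 1, 1, 32, 1, 1,
		  0, stream, kargs, 0);

/* After: one kernelParams slot per argument; the CUDA driver copies each
   pointed-to value directly into the kernel's .param space, so the
   cuMemAlloc/cuMemcpyHtoD staging disappears.  */
void **kernel_args = alloca (mapnum * sizeof (void *));
for (size_t i = 0; i < mapnum; i++)
  kernel_args[i] = devaddrs[i] ? &devaddrs[i] : &hostaddrs[i];
CUDA_CALL_ASSERT (cuLaunchKernel, function, 1, 1, 1, 32, 1, 1,
		  0, stream, kernel_args, 0);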

libgomp has been tested with this patch on x86_64-linux (nvptx-none accelerator)
without regressions (I'm currently running more gcc tests as well).
Is this okay for trunk?

Thanks,
Chung-Lin

	* config/nvptx/nvptx.c (nvptx_expand_to_rtl_hook): New function
	implementing CUDA .params space transformation.
	(TARGET_EXPAND_TO_RTL_HOOK): Implement hook with
	nvptx_expand_to_rtl_hook.

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_exec): Adjust arguments, add
	kernel argument setup code, adjust cuLaunchKernel calling code.
	(GOMP_OFFLOAD_openacc_exec): Adjust nvptx_exec call, delete
	profiling code.
	(GOMP_OFFLOAD_openacc_async_exec): Likewise.
	(cuda_free_argmem): Delete function.

Comments

Thomas Schwinge Sept. 19, 2019, 4:28 p.m. UTC | #1
Hi Chung-Lin!

On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> this is a completely new implementation of an earlier optimization
> that Cesar submitted:
> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01202.html

Thanks for your re-work!

> The objective is to transform the original single-record-pointer argument
> form (originating from OpenMP/pthreads) into multiple scalar parameters, which
> the CUDA runtime will place directly in the .params space for GPU kernels:
>
> #pragma acc parallel copy(a, b) copyin(c)
> {
>    a += b;
>    b -= c;
> }
>
> compiles to GIMPLE as:
>
> __attribute__((oacc function (1, 1, 32), omp target entrypoint))
> main._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i)
> {
>    ...
>    _3 = .omp_data_i_2(D)->a;
>    _4 = *_3;
>    _5 = .omp_data_i_2(D)->b;
>    _6 = *_5;
>    ...
>
> this patch adds a pass to transform this into:
>
> __attribute__((oacc function (1, 1, 32), omp target entrypoint))
> main._omp_fn.0 (int * c, int * b, int * a)
> {
>    ...
>    _3 = a;
>    _4 = *_3;
>    _5 = b;
>    _6 = *_5;
>    ...

ACK.

> Cesar's original implementation tried to do this in the middle-end,
> which required lots of changes throughout the compiler, libgomp interface,
> etc. and required a dependency on libffi for the CPU-host fallback child
> function (since there is no longer a known, fixed single-pointer argument
> interface to all child functions).

Specifically, the major problem -- per my understanding -- is that
Cesar's implementation does this in the early stages of the middle end
('pass_lower_omp'), before the target vs. offload target code paths get
separated, and so the transformation was done for target ("host
fallback") as well as all offload targets, without each of them having
the option to opt in/out.

As can be seen from the new highly localized code changes (nvptx code
only), your re-work clearly fixes that aspect!  :-)

> This new implementation works by modifying the GIMPLE for child functions
> directly at the very start (before, actually) of RTL expansion

That's now near the other end of the pipeline.  ;-) What's the motivation
for putting it there, instead of early in the nvptx offloading
compilation (around 'pass_oacc_device_lower' etc. time, where I would've
assumed this transformation to be done)?  Not asking you to change that
now, but curious for the reason.

> and thus
> is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is that we inherently
> need something different generated for the host fallback vs. the GPU.

(Likewise, different per each offload target.)

> The new nvptx_expand_to_rtl_hook modifies the function decl type and
> arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
> Detection of OpenACC child functions is done through "omp target entrypoint"
> and "oacc function" attributes. Because OpenMP target child functions
> have a more elaborate wrapper generated for them, this pass only supports
> OpenACC right now.

At the Cauldron, the question has indeed been raised (Jakub, Tom) why this is
not enabled for OpenMP, too.  My answer was that this surely can be done, but
the change as presented here already is an improvement over the current
status ("stands on its own", as Jeff Law would call it), so I'm fine with
you handling OpenACC first, and then OpenMP can follow later (at some as
of yet indeterminate point in time, even).


> libgomp has been tested with this patch on x86_64-linux (nvptx-none accelerator)
> without regressions

Can you present performance numbers, too?

> (I'm currently running more gcc tests as well).

As these changes, being confined to nvptx code only, can't possibly have
any effect on other target testing, I assume that's nvptx target testing
you're talking about?  (..., where also I'm not expecting any
disturbance.)


> Is this okay for trunk?

I'm not the one to approve these code changes, but I do have a few
comments/questions:

> --- gcc/config/nvptx/nvptx.c	(revision 275493)
> +++ gcc/config/nvptx/nvptx.c	(working copy)

> +static void
> +nvptx_expand_to_rtl_hook (void)
> +{
> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
> +     the gimple of offloaded child functions, here before RTL expansion,
> +     starting with standard OMP form:
> +      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
> +   
> +     and transform it into a style where the OMP data record fields are
> +     "exploded" into individual scalar arguments:
> +      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
> +
> +     Note that there are implicit assumptions of how OMP lowering (and/or other
> +     intervening passes) behaves contained in this transformation code;
> +     if those passes change in their output, this code may possibly need
> +     updating.  */
> +
> +  if (lookup_attribute ("omp target entrypoint",
> +			DECL_ATTRIBUTES (current_function_decl))
> +      /* The rather indirect manner in which OpenMP target functions are
> +	 launched makes this transformation only valid for OpenACC currently.
> +	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
> +	 needs changes for this to work with OpenMP.  */
> +      && lookup_attribute ("oacc function",
> +			   DECL_ATTRIBUTES (current_function_decl))
> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))

Why the 'void' return conditional?  (Or, should that rather be a
'gcc_checking_assert' at the top of the following block?)

> +    {
> +      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
> +      tree argtype = TREE_TYPE (omp_data_arg);
> +
> +      /* Ensure this function is of the form of a single reference argument
> +	 to the OMP data record, or a single void* argument (when no values
> +	 passed)  */
> +      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
> +	     && ((TREE_CODE (argtype) == REFERENCE_TYPE
> +		  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
> +		 || (TREE_CODE (argtype) == POINTER_TYPE
> +		     && TREE_TYPE (argtype) == void_type_node))))
> +	return;

Again, is that something we should 'gcc_checking_assert', so that we'll
notice when something changes/breaks?

Given your note above, "there are implicit assumptions [on] OMP
lowering", I'd assume that this code here does quite some
'gcc_checking_assert'ions to make sure that we're within the expected
bounds.

> +      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
> +	 scanning and skipping those entries, creating a new local_decls list.
> +	 We assume a very specific MEM_REF tree expression shape.  */
> +      tree decl;
> +      unsigned int i;
> +      vec<tree, va_gc> *new_local_decls = NULL;
> +      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
> +	{
> +	  if (DECL_HAS_VALUE_EXPR_P (decl))
> +	    {
> +	      tree t = DECL_VALUE_EXPR (decl);
> +	      if (TREE_CODE (t) == MEM_REF
> +		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
> +		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
> +		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
> +		      == omp_data_arg))
> +		continue;
> +	    }
> +	  vec_safe_push (new_local_decls, decl);
> +	}
> +      vec_free (cfun->local_decls);
> +      cfun->local_decls = new_local_decls;

Is it worth doing that manually, or can/should some dead code elimination
pass deal with that?

> +      /* Scan function body for assignments from .omp_data_i->FIELD, and using
> +	 the above created fld_to_args hash map, convert them to reads of
> +	 function arguments.  */

> +	    else if (TREE_CODE (val) == MEM_REF
> +		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
> +		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
> +	      {
> +		/* This case may happen in the final tree level optimization
> +		   output, due to SLP:
> +		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
> +
> +		   Therefore here we need a more elaborate search of the field
> +		   list to reverse map to which field the offset is referring
> +		   to.  */

Would this be simpler if the conversion were done earlier?  (As I
mentioned above.)

> +	    /* If we found the corresponding OMP data record field, replace the
> +	       RHS with the new created PARM_DECL.  */
> +	    if (new_val != NULL_TREE)
> +	      {
> +		if (dump_file)
> +		  {
> +		    fprintf (dump_file, "For gimple stmt: ");
> +		    print_gimple_stmt (dump_file, stmt, 0);
> +		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
> +			     print_generic_expr_to_str (val),
> +			     print_generic_expr_to_str (new_val));
> +		  }
> +		/* Write in looked up ARG as new RHS value.  */
> +		*val_ptr = new_val;
> +	      }

If 'new_val == NULL_TREE' that simply means that we've been looking at
something that doesn't need to be handled here, right?

> +      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
> +      tree name;
> +      FOR_EACH_SSA_NAME (i, name, cfun)
> +	if (SSA_NAME_VAR (name) == omp_data_arg)
> +	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;

Again, manual cleanup vs. automated?

> --- libgomp/plugin/plugin-nvptx.c	(revision 275493)
> +++ libgomp/plugin/plugin-nvptx.c	(working copy)

> @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
>  				 unsigned *dims, void *targ_mem_desc,
>  				 struct goacc_asyncqueue *aq)
>  {
> [...]
> -  if (mapnum > 0)
> -    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
>  }

Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'?


Grüße
 Thomas
Chung-Lin Tang Sept. 24, 2019, 10:43 a.m. UTC | #2
Hi Thomas, thanks for the review.

On 2019/9/20 12:28 AM, Thomas Schwinge wrote:
>> This new implementation works by modifying the GIMPLE for child functions
>> directly at the very start (before, actually) of RTL expansion
> That's now near the other end of the pipeline. ;-) What's the motivation
> for putting it there, instead of early in the nvptx offloading
> compilation (around 'pass_oacc_device_lower' etc. time, where I would've
> assumed this transformation to be done)?  Not asking you to change that
> now, but curious for the reason.

I am not sure we have a natural boundary that defines/marks the start of the
offload compiler stages. Maybe if we had an explicit "start_of_offload" pass,
we could embed this processing there, and enable it with a bool-valued target
hook provided by the accelerator backend (possibly only when ACCEL_COMPILER is defined).

Short of that, I think placing it here before RTL expansion seems the
most well defined, even if we have to handle some optimized obscurity.
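
To sketch what I mean (entirely hypothetical; neither this pass nor the
target hook exists today, all names are invented):

const pass_data pass_data_oacc_explode_args =
{
  GIMPLE_PASS,		/* type */
  "oacc_explode_args",	/* name */
  OPTGROUP_NONE,	/* optinfo_flags */
  TV_NONE,		/* tv_id */
  PROP_cfg,		/* properties_required */
  0,			/* properties_provided */
  0,			/* properties_destroyed */
  0,			/* todo_flags_start */
  0,			/* todo_flags_finish */
};

class pass_oacc_explode_args : public gimple_opt_pass
{
public:
  pass_oacc_explode_args (gcc::context *ctxt)
    : gimple_opt_pass (pass_data_oacc_explode_args, ctxt) {}

  virtual bool gate (function *)
  {
    /* Invented bool-valued target hook; only an accelerator backend
       (e.g. nvptx, when ACCEL_COMPILER) would return true.  */
    return targetm.goacc.explode_child_fn_args ();
  }
  virtual unsigned int execute (function *)
  {
    /* The argument-"exploding" transformation would live here instead
       of in nvptx_expand_to_rtl_hook.  */
    return 0;
  }
};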

>> and thus
>> is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is that we inherently
>> need something different generated for the host fallback vs. the GPU.
> (Likewise, different per each offload target.)
> 
>> The new nvptx_expand_to_rtl_hook modifies the function decl type and
>> arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
>> Detection of OpenACC child functions is done through "omp target entrypoint"
>> and "oacc function" attributes. Because OpenMP target child functions
>> have a more elaborate wrapper generated for them, this pass only supports
>> OpenACC right now.
> At the Cauldron, the question has indeed been raised (Jakub, Tom) why this is
> not enabled for OpenMP, too.  My answer was that this surely can be done, but
> the change as presented here already is an improvement over the current
> status ("stands on its own", as Jeff Law would call it), so I'm fine with
> you handling OpenACC first, and then OpenMP can follow later (at some as
> of yet indeterminate point in time, even).

The OpenMP way of wrapping the user-defined GPU kernel with lots of initialization
code does make this much more tedious, I think.

The question should actually be: can OpenMP simply do this kind of initialization
in the host libgomp runtime like OpenACC does, and make the nvptx kernels
proper more similar between the two?

>> libgomp has been tested with this patch on x86_64-linux (nvptx-none accelerator)
>> without regressions
> Can you present performance numbers, too?

Haven't got to that yet.

>> (I'm currently running more gcc tests as well).
> As these changes, being confined to nvptx code only, can't possibly have
> any effect on other target testing, I assume that's nvptx target testing
> you're talking about?  (..., where also I'm not expecting any
> disturbance.)

Yeah, I was talking about nvptx-none compiler testing. Haven't found any changes.

> --- gcc/config/nvptx/nvptx.c	(revision 275493)
>> +++ gcc/config/nvptx/nvptx.c	(working copy)
>> +static void
>> +nvptx_expand_to_rtl_hook (void)
>> +{
>> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
>> +     the gimple of offloaded child functions, here before RTL expansion,
>> +     starting with standard OMP form:
>> +      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
>> +
>> +     and transform it into a style where the OMP data record fields are
>> +     "exploded" into individual scalar arguments:
>> +      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
>> +
>> +     Note that there are implicit assumptions of how OMP lowering (and/or other
>> +     intervening passes) behaves contained in this transformation code;
>> +     if those passes change in their output, this code may possibly need
>> +     updating.  */
>> +
>> +  if (lookup_attribute ("omp target entrypoint",
>> +			DECL_ATTRIBUTES (current_function_decl))
>> +      /* The rather indirect manner in which OpenMP target functions are
>> +	 launched makes this transformation only valid for OpenACC currently.
>> +	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
>> +	 needs changes for this to work with OpenMP.  */
>> +      && lookup_attribute ("oacc function",
>> +			   DECL_ATTRIBUTES (current_function_decl))
>> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
> Why the 'void' return conditional?  (Or, should that rather be a
> 'gcc_checking_assert' at the top of the following block?)

That's the shape of the child functions omp-low generates. Maybe that should be an
assertion, though here I'm just doing sanity checking and ignoring otherwise.

Come to think of it, maybe I should try using the assertion to check if
I'm unintentionally ignoring transforming some cases...

>> +    {
>> +      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
>> +      tree argtype = TREE_TYPE (omp_data_arg);
>> +
>> +      /* Ensure this function is of the form of a single reference argument
>> +	 to the OMP data record, or a single void* argument (when no values
>> +	 passed)  */
>> +      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
>> +	     && ((TREE_CODE (argtype) == REFERENCE_TYPE
>> +		  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
>> +		 || (TREE_CODE (argtype) == POINTER_TYPE
>> +		     && TREE_TYPE (argtype) == void_type_node))))
>> +	return;
> Again, is that something we should 'gcc_checking_assert', so that we'll
> notice when something changes/breaks?

As above.

> Given your note above, "there are implicit assumptions [on] OMP
> lowering", I'd assume that this code here does quite some
> 'gcc_checking_assert'ions to make sure that we're within the expected
> bounds.
> 
>> +      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
>> +	 scanning and skipping those entries, creating a new local_decls list.
>> +	 We assume a very specific MEM_REF tree expression shape.  */
>> +      tree decl;
>> +      unsigned int i;
>> +      vec<tree, va_gc> *new_local_decls = NULL;
>> +      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
>> +	{
>> +	  if (DECL_HAS_VALUE_EXPR_P (decl))
>> +	    {
>> +	      tree t = DECL_VALUE_EXPR (decl);
>> +	      if (TREE_CODE (t) == MEM_REF
>> +		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
>> +		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
>> +		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
>> +		      == omp_data_arg))
>> +		continue;
>> +	    }
>> +	  vec_safe_push (new_local_decls, decl);
>> +	}
>> +      vec_free (cfun->local_decls);
>> +      cfun->local_decls = new_local_decls;
> Is it worth doing that manually, or can/should some dead code elimination
> pass deal with that?

I think I ICE'd somewhere before adding this removal.

>> +      /* Scan function body for assignments from .omp_data_i->FIELD, and using
>> +	 the above created fld_to_args hash map, convert them to reads of
>> +	 function arguments.  */
>> +	    else if (TREE_CODE (val) == MEM_REF
>> +		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
>> +		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
>> +	      {
>> +		/* This case may happen in the final tree level optimization
>> +		   output, due to SLP:
>> +		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
>> +
>> +		   Therefore here we need a more elaborate search of the field
>> +		   list to reverse map to which field the offset is referring
>> +		   to.  */
> Would this be simpler if the conversion were done earlier?  (As I
> mentioned above.)

Yes, it would be much less laborious :P

>> +	    /* If we found the corresponding OMP data record field, replace the
>> +	       RHS with the new created PARM_DECL.  */
>> +	    if (new_val != NULL_TREE)
>> +	      {
>> +		if (dump_file)
>> +		  {
>> +		    fprintf (dump_file, "For gimple stmt: ");
>> +		    print_gimple_stmt (dump_file, stmt, 0);
>> +		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
>> +			     print_generic_expr_to_str (val),
>> +			     print_generic_expr_to_str (new_val));
>> +		  }
>> +		/* Write in looked up ARG as new RHS value.  */
>> +		*val_ptr = new_val;
>> +	      }
> If 'new_val == NULL_TREE' that simply means that we've been looking at
> something that doesn't need to be handled here, right?

Technically, it only means we haven't found something to replace the .omp_data_i->FIELD ref.

>> +      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
>> +      tree name;
>> +      FOR_EACH_SSA_NAME (i, name, cfun)
>> +	if (SSA_NAME_VAR (name) == omp_data_arg)
>> +	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
> Again, manual cleanup vs. automated?

The "automated" one that marked it unused didn't really remove it for some reason,
I forgot why, probably because it was still considered "used" in some way.
Only this manual manipulation worked.

>> --- libgomp/plugin/plugin-nvptx.c	(revision 275493)
>> +++ libgomp/plugin/plugin-nvptx.c	(working copy)
>> @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
>>   				 unsigned *dims, void *targ_mem_desc,
>>   				 struct goacc_asyncqueue *aq)
>>   {
>> [...]
>> -  if (mapnum > 0)
>> -    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
>> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
>>   }
> Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'?

No, gomp_map_vars uses it as a plugin hook from libgomp proper, though
this was the only instance from inside the nvptx plugin.

(another use, from oacc-async.c:goacc_async_free(), appears to be orphaned now,
though I think we should keep that routine for a while, as it may still be
of use)

I'll try changing some of those cases we identified to be 'assertifiable' and
see what happens.

Thanks,
Chung-Lin
Chung-Lin Tang Oct. 1, 2019, 12:45 p.m. UTC | #3
On 2019/9/24 6:43 PM, Chung-Lin Tang wrote:
> 
>> --- gcc/config/nvptx/nvptx.c    (revision 275493)
>>> +++ gcc/config/nvptx/nvptx.c    (working copy)
>>> +static void
>>> +nvptx_expand_to_rtl_hook (void)
>>> +{
>>> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
>>> +     the gimple of offloaded child functions, here before RTL expansion,
>>> +     starting with standard OMP form:
>>> +      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
>>> +
>>> +     and transform it into a style where the OMP data record fields are
>>> +     "exploded" into individual scalar arguments:
>>> +      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
>>> +
>>> +     Note that there are implicit assumptions of how OMP lowering (and/or other
>>> +     intervening passes) behaves contained in this transformation code;
>>> +     if those passes change in their output, this code may possibly need
>>> +     updating.  */
>>> +
>>> +  if (lookup_attribute ("omp target entrypoint",
>>> +            DECL_ATTRIBUTES (current_function_decl))
>>> +      /* The rather indirect manner in which OpenMP target functions are
>>> +     launched makes this transformation only valid for OpenACC currently.
>>> +     TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
>>> +     needs changes for this to work with OpenMP.  */
>>> +      && lookup_attribute ("oacc function",
>>> +               DECL_ATTRIBUTES (current_function_decl))
>>> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
>> Why the 'void' return conditional?  (Or, should that rather be a
>> 'gcc_checking_assert' at the top of the following block?)
> 
> That's the shape of the child functions omp-low generates. Maybe that should be an
> assertion, though here I'm just doing sanity checking and ignoring otherwise.
> 
> Come to think of it, maybe I should try using the assertion to check if
> I'm unintentionally ignoring transforming some cases...

I've updated the patch to use an assertion for those convention checks. I think
it's better to leave a level of checking in place, so gcc_assert() instead of
gcc_checking_assert(). Also tested with no regressions.

Thanks,
Chung-Lin
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 276406)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -68,6 +68,10 @@
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6437,6 +6441,226 @@ nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+   
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+			DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+	 launched makes this transformation only valid for OpenACC currently.
+	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+	 needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+			   DECL_ATTRIBUTES (current_function_decl)))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+	 to the OMP data record, or a single void* argument (when no values
+	 passed)  */
+      gcc_assert (VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl)))
+		  && (DECL_CHAIN (omp_data_arg) == NULL_TREE
+		      && ((TREE_CODE (argtype) == REFERENCE_TYPE
+			   && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+			  || (TREE_CODE (argtype) == POINTER_TYPE
+			      && TREE_TYPE (argtype) == void_type_node))));
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Detected offloaded child function %s, "
+		   "starting parameter conversion\n",
+		   print_generic_expr_to_str (current_function_decl));
+	  fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+		   print_generic_expr_to_str (omp_data_arg),
+		   print_generic_expr_to_str (argtype));
+	  fprintf (dump_file, "Data record fields:\n");
+	}
+      
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+	{
+	  /* For each field in the OMP data record type, create a corresponding
+	     PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+	     Also create the tree chains for creating function type and
+	     DECL_ARGUMENTS below.  */
+	  for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+	    {
+	      tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+				      DECL_NAME (fld), TREE_TYPE (fld));
+	      DECL_ARTIFICIAL (narg) = 1;
+	      DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+	      DECL_CONTEXT (narg) = current_function_decl;
+	      TREE_USED (narg) = 1;
+	      TREE_READONLY (narg) = 1;
+
+	      if (dump_file)
+		fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+			 print_generic_expr_to_str (fld),
+			 print_generic_expr_to_str (TREE_TYPE (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+	      fld_to_args.put (fld, narg);
+
+	      TREE_CHAIN (narg) = arglist;
+	      arglist = narg;
+	      argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+				       argtypelist);
+	    }
+	  arglist = nreverse (arglist);
+	  argtypelist = nreverse (argtypelist);
+	}
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function before OMP data arg replaced:\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+							       argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+	 scanning and skipping those entries, creating a new local_decls list.
+	 We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+	{
+	  if (DECL_HAS_VALUE_EXPR_P (decl))
+	    {
+	      tree t = DECL_VALUE_EXPR (decl);
+	      if (TREE_CODE (t) == MEM_REF
+		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+		      == omp_data_arg))
+		continue;
+	    }
+	  vec_safe_push (new_local_decls, decl);
+	}
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+      
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+	 the above created fld_to_args hash map, convert them to reads of
+	 function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+	for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    tree val, *val_ptr = NULL;
+	    gimple *stmt = gsi_stmt (gsi);
+	    if (is_gimple_assign (stmt)
+		&& gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+	      val_ptr = gimple_assign_rhs1_ptr (stmt);
+	    else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+	      val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+	    if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+	      continue;
+
+	    tree new_val = NULL_TREE, fld = NULL_TREE;
+
+	    if (TREE_CODE (val) == COMPONENT_REF
+		&& TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+		&& (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == SSA_NAME)
+		&& (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == omp_data_arg))
+	      {
+		/* .omp_data->FIELD case.  */
+		fld = TREE_OPERAND (val, 1);
+		new_val = *fld_to_args.get (fld);
+	      }
+	    else if (TREE_CODE (val) == MEM_REF
+		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+	      {
+		/* This case may happen in the final tree level optimization
+		   output, due to SLP:
+		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+		   Therefore here we need a more elaborate search of the field
+		   list to reverse map to which field the offset is referring
+		   to.  */
+		unsigned HOST_WIDE_INT offset
+		  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+		for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+		     i != fld_to_args.end (); ++i)
+		  {
+		    tree cur_fld = (*i).first;
+		    tree cur_arg = (*i).second;
+		    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+		    unsigned HOST_WIDE_INT cur_offset =
+		      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+		       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+			  / BITS_PER_UNIT));
+
+		    if (offset == cur_offset)
+		      {
+			new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+					  cur_arg);
+			break;
+		      }
+		  }
+	      }
+
+	    /* If we found the corresponding OMP data record field, replace the
+	       RHS with the new created PARM_DECL.  */
+	    if (new_val != NULL_TREE)
+	      {
+		if (dump_file)
+		  {
+		    fprintf (dump_file, "For gimple stmt: ");
+		    print_gimple_stmt (dump_file, stmt, 0);
+		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+			     print_generic_expr_to_str (val),
+			     print_generic_expr_to_str (new_val));
+		  }
+		/* Write in looked up ARG as new RHS value.  */
+		*val_ptr = new_val;
+	      }
+	  }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+	if (SSA_NAME_VAR (name) == omp_data_arg)
+	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function after OMP data arg replaced: ");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6576,6 +6800,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 276406)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -695,16 +695,24 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    unsigned *dims, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
+  void **kernel_args = NULL;
 
+  GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+
+  if (mapnum > 0)
+    {
+      kernel_args = alloca (mapnum * sizeof (void *));
+      for (int i = 0; i < mapnum; i++)
+	kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+    }
+  
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -936,11 +944,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
 					    api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, stream, kargs, 0);
+		    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1349,67 +1356,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
 			   void **hostaddrs, void **devaddrs,
 			   unsigned *dims, void *targ_mem_desc)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL);
 
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      hp = alloca (s);
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
-			mapnum * sizeof (void *));
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
-
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
   if (r == CUDA_ERROR_LAUNCH_FAILED)
@@ -1417,20 +1365,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-
-  CUDA_CALL_ASSERT (cuMemFree, dp);
-  if (profiling_p)
-    goacc_profiling_acc_ev_free (thr, (void *) dp);
 }
 
-static void
-cuda_free_argmem (void *ptr)
-{
-  void **block = (void **) ptr;
-  nvptx_free (block[0], (struct ptx_device *) block[1]);
-  free (block);
-}
-
 void
 GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 				 void **hostaddrs, void **devaddrs,
@@ -1437,78 +1373,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
 				 unsigned *dims, void *targ_mem_desc,
 				 struct goacc_asyncqueue *aq)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-  void **block = NULL;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s);
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-
-      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
-			mapnum * sizeof (void *), aq->cuda_stream);
-      block[0] = (void *) dp;
-
-      struct nvptx_thread *nvthd =
-	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
-      block[1] = (void *) nvthd->ptx_dev;
-
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
-
-  if (mapnum > 0)
-    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
 }
 
 void *
Thomas Schwinge Oct. 8, 2019, 2:05 p.m. UTC | #4
Hi Chung-Lin!

While we're all waiting for Tom to comment on this ;-) -- here's another
item I realized:

On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> The libgomp nvptx plugin changes are also quite contained, with lots of
> now unneeded [...] code deleted (since we no longer first cuAlloc a
> buffer for the argument record before cuLaunchKernel)

It would be nice ;-) -- but unless I'm confused, it's not that simple: we
either have to reject (force host-fallback execution) or keep supporting
"old-style" nvptx offloading code: new-libgomp has to continue to work
with nvptx offloading code once generated by old-GCC.  Possibly even a
mixture of old and new nvptx offloading code, if libraries are involved,
huh!

I have not completely thought that through, but I suppose this could be
addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
synthesized by nvptx 'mkoffload'?

Maybe in fact the 'enum id_map_flag' machinery that I once added for
'Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid
offloading"'?  (That's part of og8 commit
2d42fbf7e989e4bb76727b32ef11deb5845d5ab1 -- not present on og9, huh?!)
The 'enum id_map_flag' machinery serves the purpose of transporting
information from the offload compiler to libgomp, which seems what's
needed here?  (But please verify.)

For reference, your proposed changes:

> --- libgomp/plugin/plugin-nvptx.c	(revision 275493)
> +++ libgomp/plugin/plugin-nvptx.c	(working copy)
> @@ -696,16 +696,24 @@ link_ptx (CUmodule *module, const struct targ_ptx_
>  
>  static void
>  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
> -	    unsigned *dims, void *targ_mem_desc,
> -	    CUdeviceptr dp, CUstream stream)
> +	    unsigned *dims, CUstream stream)
>  {
>    struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
>    CUfunction function;
>    int i;
> -  void *kargs[1];
>    struct nvptx_thread *nvthd = nvptx_thread ();
>    int warp_size = nvthd->ptx_dev->warp_size;
> +  void **kernel_args = NULL;
>  
> +  GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
> +
> +  if (mapnum > 0)
> +    {
> +      kernel_args = alloca (mapnum * sizeof (void *));
> +      for (int i = 0; i < mapnum; i++)
> +	kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
> +    }
> +  
>    function = targ_fn->fn;
>  
>    /* Initialize the launch dimensions.  Typically this is constant,
> @@ -937,11 +945,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
>  					    api_info);
>      }
>  
> -  kargs[0] = &dp;
>    CUDA_CALL_ASSERT (cuLaunchKernel, function,
>  		    dims[GOMP_DIM_GANG], 1, 1,
>  		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
> -		    0, stream, kargs, 0);
> +		    0, stream, kernel_args, 0);
>  
>    if (profiling_p)
>      {
> @@ -1350,67 +1357,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
>  			   void **hostaddrs, void **devaddrs,
>  			   unsigned *dims, void *targ_mem_desc)
>  {
> -  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL);
>  
> -  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
> -  acc_prof_info *prof_info = thr->prof_info;
> -  acc_event_info data_event_info;
> -  acc_api_info *api_info = thr->api_info;
> -  bool profiling_p = __builtin_expect (prof_info != NULL, false);
> -
> -  void **hp = NULL;
> -  CUdeviceptr dp = 0;
> -
> -  if (mapnum > 0)
> -    {
> -      size_t s = mapnum * sizeof (void *);
> -      hp = alloca (s);
> -      for (int i = 0; i < mapnum; i++)
> -	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
> -      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
> -      if (profiling_p)
> -	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
> -    }
> -
> -  /* Copy the (device) pointers to arguments to the device (dp and hp might in
> -     fact have the same value on a unified-memory system).  */
> -  if (mapnum > 0)
> -    {
> -      if (profiling_p)
> -	{
> -	  prof_info->event_type = acc_ev_enqueue_upload_start;
> -
> -	  data_event_info.data_event.event_type = prof_info->event_type;
> -	  data_event_info.data_event.valid_bytes
> -	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
> -	  data_event_info.data_event.parent_construct
> -	    = acc_construct_parallel;
> -	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
> -	  data_event_info.data_event.tool_info = NULL;
> -	  data_event_info.data_event.var_name = NULL;
> -	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
> -	  data_event_info.data_event.host_ptr = hp;
> -	  data_event_info.data_event.device_ptr = (const void *) dp;
> -
> -	  api_info->device_api = acc_device_api_cuda;
> -
> -	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
> -						api_info);
> -	}
> -      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
> -			mapnum * sizeof (void *));
> -      if (profiling_p)
> -	{
> -	  prof_info->event_type = acc_ev_enqueue_upload_end;
> -	  data_event_info.data_event.event_type = prof_info->event_type;
> -	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
> -						api_info);
> -	}
> -    }
> -
> -  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
> -	      dp, NULL);
> -
>    CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
>    const char *maybe_abort_msg = "(perhaps abort was called)";
>    if (r == CUDA_ERROR_LAUNCH_FAILED)
> @@ -1418,20 +1366,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
>  		       maybe_abort_msg);
>    else if (r != CUDA_SUCCESS)
>      GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
> -
> -  CUDA_CALL_ASSERT (cuMemFree, dp);
> -  if (profiling_p)
> -    goacc_profiling_acc_ev_free (thr, (void *) dp);
>  }
>  
> -static void
> -cuda_free_argmem (void *ptr)
> -{
> -  void **block = (void **) ptr;
> -  nvptx_free (block[0], (struct ptx_device *) block[1]);
> -  free (block);
> -}
> -
>  void
>  GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
>  				 void **hostaddrs, void **devaddrs,
> @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
>  				 unsigned *dims, void *targ_mem_desc,
>  				 struct goacc_asyncqueue *aq)
>  {
> -  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
> -
> -  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
> -  acc_prof_info *prof_info = thr->prof_info;
> -  acc_event_info data_event_info;
> -  acc_api_info *api_info = thr->api_info;
> -  bool profiling_p = __builtin_expect (prof_info != NULL, false);
> -
> -  void **hp = NULL;
> -  CUdeviceptr dp = 0;
> -  void **block = NULL;
> -
> -  if (mapnum > 0)
> -    {
> -      size_t s = mapnum * sizeof (void *);
> -      block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s);
> -      hp = block + 2;
> -      for (int i = 0; i < mapnum; i++)
> -	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
> -      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
> -      if (profiling_p)
> -	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
> -    }
> -
> -  /* Copy the (device) pointers to arguments to the device (dp and hp might in
> -     fact have the same value on a unified-memory system).  */
> -  if (mapnum > 0)
> -    {
> -      if (profiling_p)
> -	{
> -	  prof_info->event_type = acc_ev_enqueue_upload_start;
> -
> -	  data_event_info.data_event.event_type = prof_info->event_type;
> -	  data_event_info.data_event.valid_bytes
> -	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
> -	  data_event_info.data_event.parent_construct
> -	    = acc_construct_parallel;
> -	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
> -	  data_event_info.data_event.tool_info = NULL;
> -	  data_event_info.data_event.var_name = NULL;
> -	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
> -	  data_event_info.data_event.host_ptr = hp;
> -	  data_event_info.data_event.device_ptr = (const void *) dp;
> -
> -	  api_info->device_api = acc_device_api_cuda;
> -
> -	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
> -						api_info);
> -	}
> -
> -      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
> -			mapnum * sizeof (void *), aq->cuda_stream);
> -      block[0] = (void *) dp;
> -
> -      struct nvptx_thread *nvthd =
> -	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
> -      block[1] = (void *) nvthd->ptx_dev;
> -
> -      if (profiling_p)
> -	{
> -	  prof_info->event_type = acc_ev_enqueue_upload_end;
> -	  data_event_info.data_event.event_type = prof_info->event_type;
> -	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
> -						api_info);
> -	}
> -    }
> -
> -  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
> -	      dp, aq->cuda_stream);
> -
> -  if (mapnum > 0)
> -    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
>  }
>  
>  void *


Grüße
 Thomas
Tom de Vries Oct. 9, 2019, 1:33 p.m. UTC | #5

On 08-10-2019 16:05, Thomas Schwinge wrote:
> Hi Chung-Lin!
>
> While we're all waiting for Tom to comment on this ;-)

Ack, thanks for the ping ...

> -- here's another item I realized:
>
> On 2019-09-10T19:41:59+0800, Chung-Lin Tang
> <chunglin_tang@mentor.com> wrote:
>> The libgomp nvptx plugin changes are also quite contained, with
>> lots of now unneeded [...] code deleted (since we no longer first
>> cuAlloc a buffer for the argument record before cuLaunchKernel)
>
> It would be nice ;-) -- but unless I'm confused, it's not that
> simple: we either have to reject (force host-fallback execution) or
> keep supporting "old-style" nvptx offloading code: new-libgomp has
> to continue to work with nvptx offloading code once generated by
> old-GCC.  Possibly even a mixture of old and new nvptx offloading
> code, if libraries are involved, huh!
>
> I have not completely thought that through, but I suppose this
> could be addressed by adding a flag to the 'struct nvptx_fn' (or
> similar) that's synthesized by nvptx 'mkoffload'?
>
> Maybe in fact the 'enum id_map_flag' machinery that I once added
> for 'Un-parallelized OpenACC kernels constructs with nvptx
> offloading: "avoid offloading"'?  (That's part of og8 commit
> 2d42fbf7e989e4bb76727b32ef11deb5845d5ab1 -- not present on og9,
> huh?!) The 'enum id_map_flag' machinery serves the purpose of
> transporting information from the offload compiler to libgomp,
> which seems what's needed here?  (But please verify.)
>
... and for raising this issue. I think this needs to be addressed.

It would be great if we can avoid it, but ... AFAIU, this means
bumping GOMP_VERSION_NVIDIA_PTX (1 -> 2).

Using a new a.out (registers with GOMP_VERSION_NVIDIA_PTX == 2) with
an old libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 1) will give us an
"Offload data incompatible with PTX plugin" error.

Using an old a.out (registers with GOMP_VERSION_NVIDIA_PTX == 1) with
a new libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 2) will have to be
supported in the way that things are currently handled.

Using a new a.out (registers with GOMP_VERSION_NVIDIA_PTX == 2) with a
new libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 2) will have to be
supported in the way that the patch implements things.

The current approach is that all offload-functions are assumed to be
transformed by the optimization, which implies that failure to
transform should be a compilation error (is that indeed ensured by the
patch?). Which is a bit funny for an 'optimization'. We might want to
switch this on/off at the offload-function level.

That ties in with the fact that if we're going to keep the path alive
for backward compatibility, it would be nice if we could actually test
this in the trunk version by disabling the optimization. That is also
nice to have if we run into issues with the optimization. And once we
allow this to be disabled at the user level, we're going to have to
track this at the offload-function level.

So I'd say for GOMP_VERSION_NVIDIA_PTX == 2 we extend target_data with
a flag such that we can query things on a per offload-function level,
while taking care to represent the common case where the flag is the
same for all offload-functions in an economical way.
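
To illustrate the shape of it (all function and field names below are
invented; this is a sketch, not code from the patch):

/* Invented helpers: the new-style and legacy launch paths, respectively.  */
static void nvptx_exec_param_style (struct targ_fn_descriptor *, size_t,
				    void **, void **, unsigned *, CUstream);
static void nvptx_exec_argrec_style (struct targ_fn_descriptor *, size_t,
				     void **, void **, unsigned *, CUstream);

static void
nvptx_exec_dispatch (struct targ_fn_descriptor *targ_fn, size_t mapnum,
		     void **hostaddrs, void **devaddrs, unsigned *dims,
		     CUstream stream)
{
  if (targ_fn->exploded_args_p)
    /* Function from a GOMP_VERSION_NVIDIA_PTX == 2 image: one .param
       slot per argument, as in the patch.  */
    nvptx_exec_param_style (targ_fn, mapnum, hostaddrs, devaddrs, dims,
			    stream);
  else
    /* Function from an older image: keep the current cuMemAlloc +
       cuMemcpyHtoD + single-kernelParams-slot path.  */
    nvptx_exec_argrec_style (targ_fn, mapnum, hostaddrs, devaddrs, dims,
			     stream);
}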

That leaves the question of how to get that information to mkoffload,
perhaps the patch Thomas mentioned can be of help there.

Thanks,
- Tom
Tom de Vries Oct. 9, 2019, 1:39 p.m. UTC | #6
On 01-10-2019 14:45, Chung-Lin Tang wrote:

> Index: gcc/config/nvptx/nvptx.c
> ===================================================================
> --- gcc/config/nvptx/nvptx.c	(revision 276406)
> +++ gcc/config/nvptx/nvptx.c	(working copy)
> @@ -68,6 +68,10 @@
>  #include "attribs.h"
>  #include "tree-vrp.h"
>  #include "tree-ssa-operands.h"
> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "tree-cfg.h"
> +#include "gimple-ssa.h"
>  #include "tree-ssanames.h"
>  #include "gimplify.h"
>  #include "tree-phinodes.h"
> @@ -6437,6 +6441,226 @@ nvptx_set_current_function (tree fndecl)
>    oacc_bcast_partition = 0;
>  }
>  
> +static void
> +nvptx_expand_to_rtl_hook (void)
> +{
> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
> +     the gimple of offloaded child functions, here before RTL expansion,
> +     starting with standard OMP form:
> +      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
> +   
> +     and transform it into a style where the OMP data record fields are
> +     "exploded" into individual scalar arguments:
> +      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
> +
> +     Note that there are implicit assumptions of how OMP lowering (and/or other
> +     intervening passes) behaves contained in this transformation code;
> +     if those passes change in their output, this code may possibly need
> +     updating.  */
> +
> +  if (lookup_attribute ("omp target entrypoint",
> +			DECL_ATTRIBUTES (current_function_decl))
> +      /* The rather indirect manner in which OpenMP target functions are
> +	 launched makes this transformation only valid for OpenACC currently.
> +	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
> +	 needs changes for this to work with OpenMP.  */
> +      && lookup_attribute ("oacc function",
> +			   DECL_ATTRIBUTES (current_function_decl)))
> +    {

Please do an early-return here.

Otherwise, no comments to the code as such.

Thanks,
- Tom
Chung-Lin Tang Nov. 8, 2019, 12:55 p.m. UTC | #7
On 2019/10/8 10:05 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!
> 
> While we're all waiting for Tom to comment on this;-)  -- here's another
> item I realized:
> 
> On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>> The libgomp nvptx plugin changes are also quite contained, with lots of
>> now unneeded [...] code deleted (since we no longer first cuAlloc a
>> buffer for the argument record before cuLaunchKernel)
> It would be nice;-)  -- but unless I'm confused, it's not that simple: we
> either have to reject (force host-fallback execution) or keep supporting
> "old-style" nvptx offloading code: new-libgomp has to continue to work
> with nvptx offloading code once generated by old-GCC.  Possibly even a
> mixture of old and new nvptx offloading code, if libraries are involved,
> huh!
> 
> I have not completely thought that through, but I suppose this could be
> addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
> synthesized by nvptx 'mkoffload'?

Hi Thomas, Tom,
I've looked at the problem; it is unfortunate that we overlooked the
need for versioning of NVPTX images and did not reserve a field in
'struct nvptx_tdata' for something like this.

But how about something like:

typedef struct nvptx_tdata
{
   const struct targ_ptx_obj *ptx_objs;
   unsigned ptx_num;

   unsigned ptx_version;         /* <==== Add version field here.  */

   const char *const *var_names;
   unsigned var_num;

   const struct targ_fn_launch *fn_descs;
   unsigned fn_num;
} nvptx_tdata_t;

We currently only support x86_64 and powerpc64le hosts, which are both LP64 targets.

Assuming that, the position above where I put the new 'ptx_version' field falls into
an existing 32-bit alignment hole, so it doesn't change the layout of the other fields,
and in the static 'target_data' variable generated by mkoffload that hole should already
be zeroed in currently circulating binaries (unless binutils is not doing the intuitive
thing...)

If these assumptions are safe, then we can treat existing images as if ptx_version == 0
right now, and from now on bump it to 1 for these new nvptx convention changes.

(We can do a similar thing in 'struct targ_fn_launch' if we want to differentiate
at a per-function level.)
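
A self-contained layout check along these lines (stand-in struct tags and
hypothetical v0/v1 type names) would only compile on an LP64 host if the
new field really does land in the existing hole:

#include <stddef.h>

struct targ_ptx_obj;            /* Stand-ins: only pointer layout matters.  */
struct targ_fn_launch;

typedef struct nvptx_tdata_v0   /* Existing layout.  */
{
  const struct targ_ptx_obj *ptx_objs;
  unsigned ptx_num;

  const char *const *var_names;
  unsigned var_num;

  const struct targ_fn_launch *fn_descs;
  unsigned fn_num;
} nvptx_tdata_v0_t;

typedef struct nvptx_tdata_v1   /* Proposed layout, as above.  */
{
  const struct targ_ptx_obj *ptx_objs;
  unsigned ptx_num;

  unsigned ptx_version;         /* Occupies the former alignment hole.  */

  const char *const *var_names;
  unsigned var_num;

  const struct targ_fn_launch *fn_descs;
  unsigned fn_num;
} nvptx_tdata_v1_t;

/* Adding ptx_version must change neither the overall size nor the offset
   of any following field (holds for LP64; an ILP32 host would fail).  */
_Static_assert (sizeof (nvptx_tdata_v0_t) == sizeof (nvptx_tdata_v1_t),
		"size unchanged");
_Static_assert (offsetof (nvptx_tdata_v0_t, var_names)
		== offsetof (nvptx_tdata_v1_t, var_names),
		"var_names offset unchanged");
_Static_assert (offsetof (nvptx_tdata_v0_t, fn_descs)
		== offsetof (nvptx_tdata_v1_t, fn_descs),
		"fn_descs offset unchanged");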

Any considerations?

Thanks,
Chung-Lin
Chung-Lin Tang Nov. 26, 2019, 2:48 p.m. UTC | #8
On 2019/11/8 8:55 PM, Chung-Lin Tang wrote:
> On 2019/10/8 10:05 PM, Thomas Schwinge wrote:
>> Hi Chung-Lin!
>>
>> While we're all waiting for Tom to comment on this;-)  -- here's another
>> item I realized:
>>
>> On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>>> The libgomp nvptx plugin changes are also quite contained, with lots of
>>> now unneeded [...] code deleted (since we no longer first cuAlloc a
>>> buffer for the argument record before cuLaunchKernel)
>> It would be nice;-)  -- but unless I'm confused, it's not that simple: we
>> either have to reject (force host-fallback execution) or keep supporting
>> "old-style" nvptx offloading code: new-libgomp has to continue to work
>> with nvptx offloading code once generated by old-GCC.  Possibly even a
>> mixture of old and new nvptx offloading code, if libraries are involved,
>> huh!
>>
>> I have not completely thought that through, but I suppose this could be
>> addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
>> synthesized by nvptx 'mkoffload'?
> 
> Hi Thomas, Tom,
> I've looked at the problem; it is unfortunate that we overlooked the
> need for versioning of NVPTX images and did not reserve a field in
> 'struct nvptx_tdata' for something like this.
> 
> But how about something like:
> 
> typedef struct nvptx_tdata
> {
>    const struct targ_ptx_obj *ptx_objs;
>    unsigned ptx_num;
> 
>    unsigned ptx_version;         /* <==== Add version field here.  */
> 
>    const char *const *var_names;
>    unsigned var_num;
> 
>    const struct targ_fn_launch *fn_descs;
>    unsigned fn_num;
> } nvptx_tdata_t;
> 
> We currently only support x86_64 and powerpc64le hosts, which are both LP64 targets.
> 
> Assuming that, the position above where I put the new 'ptx_version' field falls into
> an existing 32-bit alignment hole, so it doesn't change the layout of the other fields,
> and in the static 'target_data' variable generated by mkoffload that hole should already
> be zeroed in currently circulating binaries (unless binutils is not doing the intuitive
> thing...)
> 
> If these assumptions are safe, then we can treat existing images as if ptx_version == 0
> right now, and from now on bump it to 1 for these new nvptx convention changes.
> 
> (We can do a similar thing in 'struct targ_fn_launch' if we want to differentiate
> at a per-function level.)
> 
> Any considerations?

Hi Tom, Thomas,
as a concept, here is a version of what I mentioned above: the _exec/_async_exec
plugin hooks now switch between versions of the code based on the image version.
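
For reference on the kernel-argument handling in the nvptx_exec hunk
below: with the CUDA driver API, cuLaunchKernel's kernelParams is an
array with one entry per kernel parameter, each entry pointing at the
value the driver copies into the kernel's .param space; that is why the
plugin can pass the addresses of the devaddrs/hostaddrs slots directly.
A minimal stand-alone illustration (hypothetical helper; a kernel taking
two pointer arguments is assumed):

#include <cuda.h>

/* Launch FN with device pointers A and B as its two .param arguments.
   Each kernel_args[i] points at the i-th argument value; the driver
   copies *kernel_args[i] into .param space at launch time.  */
static CUresult
launch2 (CUfunction fn, CUstream stream, CUdeviceptr a, CUdeviceptr b)
{
  void *kernel_args[] = { &a, &b };
  /* 1 gang, a single 32x1 warp, no dynamic shared memory, no extra.  */
  return cuLaunchKernel (fn, 1, 1, 1, 32, 1, 1, 0, stream,
			 kernel_args, NULL);
}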

Thanks,
Chung-Lin

         gcc/
         * config/nvptx/mkoffload.c (process): Add 'ptx_version' field to
         generated struct nvptx_tdata, and initialize it to '1'.
         * config/nvptx/nvptx.c (nvptx_expand_to_rtl_hook): New function
         implementing CUDA .params space transformation.
         (TARGET_EXPAND_TO_RTL_HOOK): Implement hook with
         nvptx_expand_to_rtl_hook.

         libgomp/
         * plugin/plugin-nvptx.c (struct nvptx_tdata): Add 'ptx_version' field.
         (struct targ_fn_descriptor): Add 'image' field.
         (struct ptx_image_data): Adjust 'target_data' to be proper pointer
         type of 'const nvptx_tdata_t *'.
         (nvptx_exec): Adjust arguments, add kernel argument setup code,
         adjust cuLaunchKernel calling code.
         (GOMP_OFFLOAD_load_image): Remove now unneeded pointer cast for
         target_data, initialize 'image' link for each function descriptor,
         move adding of new_image to dev->images to after everything
         is set up.
         (openacc_exec_v0): Rename from old GOMP_OFFLOAD_openacc_exec.
         (openacc_async_exec_v0): Rename from old GOMP_OFFLOAD_openacc_async_exec.
         (GOMP_OFFLOAD_openacc_exec): Switch between v0/v1 versions of code.
         (GOMP_OFFLOAD_openacc_async_exec): Likewise.
         (openacc_exec_v1): New function.
         (openacc_async_exec_v1): Likewise.
Index: gcc/config/nvptx/mkoffload.c
===================================================================
--- gcc/config/nvptx/mkoffload.c	(revision 278656)
+++ gcc/config/nvptx/mkoffload.c	(working copy)
@@ -310,12 +310,13 @@ process (FILE *in, FILE *out)
 	   "static const struct nvptx_tdata {\n"
 	   "  const struct ptx_obj *ptx_objs;\n"
 	   "  unsigned ptx_num;\n"
+	   "  unsigned char ptx_version;\n"
 	   "  const char *const *var_names;\n"
 	   "  unsigned var_num;\n"
 	   "  const struct nvptx_fn *fn_names;\n"
 	   "  unsigned fn_num;\n"
 	   "} target_data = {\n"
-	   "  ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]),\n"
+	   "  ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]), 1,\n"
 	   "  var_mappings,"
 	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
 	   "  func_mappings,"
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 278656)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -68,6 +68,10 @@
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6463,6 +6467,226 @@ nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+   
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+			DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+	 launched makes this transformation only valid for OpenACC currently.
+	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+	 needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+			   DECL_ATTRIBUTES (current_function_decl)))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+	 to the OMP data record, or a single void* argument (when no values
+	 passed).  */
+      gcc_assert (VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl)))
+		  && (DECL_CHAIN (omp_data_arg) == NULL_TREE
+		      && ((TREE_CODE (argtype) == REFERENCE_TYPE
+			   && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+			  || (TREE_CODE (argtype) == POINTER_TYPE
+			      && TREE_TYPE (argtype) == void_type_node))));
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Detected offloaded child function %s, "
+		   "starting parameter conversion\n",
+		   print_generic_expr_to_str (current_function_decl));
+	  fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+		   print_generic_expr_to_str (omp_data_arg),
+		   print_generic_expr_to_str (argtype));
+	  fprintf (dump_file, "Data record fields:\n");
+	}
+      
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+	{
+	  /* For each field in the OMP data record type, create a corresponding
+	     PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+	     Also create the tree chains for creating function type and
+	     DECL_ARGUMENTS below.  */
+	  for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+	    {
+	      tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+				      DECL_NAME (fld), TREE_TYPE (fld));
+	      DECL_ARTIFICIAL (narg) = 1;
+	      DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+	      DECL_CONTEXT (narg) = current_function_decl;
+	      TREE_USED (narg) = 1;
+	      TREE_READONLY (narg) = 1;
+
+	      if (dump_file)
+		fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+			 print_generic_expr_to_str (fld),
+			 print_generic_expr_to_str (TREE_TYPE (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+	      fld_to_args.put (fld, narg);
+
+	      TREE_CHAIN (narg) = arglist;
+	      arglist = narg;
+	      argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+				       argtypelist);
+	    }
+	  arglist = nreverse (arglist);
+	  argtypelist = nreverse (argtypelist);
+	}
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function before OMP data arg replaced:\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+							       argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+	 scanning and skipping those entries, creating a new local_decls list.
+	 We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+	{
+	  if (DECL_HAS_VALUE_EXPR_P (decl))
+	    {
+	      tree t = DECL_VALUE_EXPR (decl);
+	      if (TREE_CODE (t) == MEM_REF
+		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+		      == omp_data_arg))
+		continue;
+	    }
+	  vec_safe_push (new_local_decls, decl);
+	}
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+      
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+	 the above created fld_to_args hash map, convert them to reads of
+	 function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+	for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    tree val, *val_ptr = NULL;
+	    gimple *stmt = gsi_stmt (gsi);
+	    if (is_gimple_assign (stmt)
+		&& gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+	      val_ptr = gimple_assign_rhs1_ptr (stmt);
+	    else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+	      val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+	    if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+	      continue;
+
+	    tree new_val = NULL_TREE, fld = NULL_TREE;
+
+	    if (TREE_CODE (val) == COMPONENT_REF
+		&& TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+		&& (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == SSA_NAME)
+		&& (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == omp_data_arg))
+	      {
+		/* .omp_data->FIELD case.  */
+		fld = TREE_OPERAND (val, 1);
+		new_val = *fld_to_args.get (fld);
+	      }
+	    else if (TREE_CODE (val) == MEM_REF
+		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+	      {
+		/* This case may happen in the final tree level optimization
+		   output, due to SLP:
+		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+		   Therefore here we need a more elaborate search of the field
+		   list to reverse map to which field the offset is referring
+		   to.  */
+		unsigned HOST_WIDE_INT offset
+		  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+		for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+		     i != fld_to_args.end (); ++i)
+		  {
+		    tree cur_fld = (*i).first;
+		    tree cur_arg = (*i).second;
+		    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+		    unsigned HOST_WIDE_INT cur_offset =
+		      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+		       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+			  / BITS_PER_UNIT));
+
+		    if (offset == cur_offset)
+		      {
+			new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+					  cur_arg);
+			break;
+		      }
+		  }
+	      }
+
+	    /* If we found the corresponding OMP data record field, replace the
+	       RHS with the newly created PARM_DECL.  */
+	    if (new_val != NULL_TREE)
+	      {
+		if (dump_file)
+		  {
+		    fprintf (dump_file, "For gimple stmt: ");
+		    print_gimple_stmt (dump_file, stmt, 0);
+		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+			     print_generic_expr_to_str (val),
+			     print_generic_expr_to_str (new_val));
+		  }
+		/* Write in looked up ARG as new RHS value.  */
+		*val_ptr = new_val;
+	      }
+	  }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+	if (SSA_NAME_VAR (name) == omp_data_arg)
+	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function after OMP data arg replaced: ");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6605,6 +6829,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 278656)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -239,6 +239,8 @@ typedef struct nvptx_tdata
   const struct targ_ptx_obj *ptx_objs;
   unsigned ptx_num;
 
+  unsigned char ptx_version;
+
   const char *const *var_names;
   unsigned var_num;
 
@@ -254,12 +256,13 @@ struct targ_fn_descriptor
   const struct targ_fn_launch *launch;
   int regs_per_thread;
   int max_threads_per_block;
+  struct ptx_image_data *image;
 };
 
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
-  const void *target_data;
+  const nvptx_tdata_t *target_data;
   CUmodule module;
 
   struct targ_fn_descriptor *fns;  /* Array of functions.  */
@@ -695,16 +698,30 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    unsigned *dims, CUdeviceptr dp, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
+  void *kargs[1] = { &dp };
+  void **kernel_args = NULL;
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
 
+  if (__builtin_expect (dp == 0, true))
+    {
+      /* This is the newer "exploded" CUDA parameter case.  */
+      GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+      if (mapnum > 0)
+	{
+	  kernel_args = alloca (mapnum * sizeof (void *));
+	  for (size_t i = 0; i < mapnum; i++)
+	    kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+    }
+  else
+    kernel_args = kargs;
+
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -936,11 +953,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
 					    api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, stream, kargs, 0);
+		    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1232,15 +1248,10 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
   *target_table = targ_tbl;
 
   new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
-  new_image->target_data = target_data;
+  new_image->target_data = (const nvptx_tdata_t *) target_data;
   new_image->module = module;
   new_image->fns = targ_fns;
 
-  pthread_mutex_lock (&dev->image_lock);
-  new_image->next = dev->images;
-  dev->images = new_image;
-  pthread_mutex_unlock (&dev->image_lock);
-
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
@@ -1257,11 +1268,17 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
       targ_fns->launch = &fn_descs[i];
       targ_fns->regs_per_thread = nregs;
       targ_fns->max_threads_per_block = mthrs;
+      targ_fns->image = new_image;
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
     }
 
+  pthread_mutex_lock (&dev->image_lock);
+  new_image->next = dev->images;
+  dev->images = new_image;
+  pthread_mutex_unlock (&dev->image_lock);
+
   for (j = 0; j < var_entries; j++, targ_tbl++)
     {
       CUdeviceptr var;
@@ -1344,10 +1361,9 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-			   void **hostaddrs, void **devaddrs,
-			   unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_v0 (void (*fn) (void *), size_t mapnum,
+		 void **hostaddrs, void **devaddrs, unsigned *dims)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1407,8 +1423,7 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 	}
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, NULL);
 
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1424,6 +1439,43 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 }
 
 static void
+openacc_exec_v1 (void (*fn) (void *), size_t mapnum,
+		 void **hostaddrs, void **devaddrs, unsigned *dims)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, NULL);
+
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+		       maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+			   void **hostaddrs, void **devaddrs,
+			   unsigned *dims, void *targ_mem_desc)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+	openacc_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims);
+	break;
+      default:
+	GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+			   ptx_version);
+      };
+}
+
+static void
 cuda_free_argmem (void *ptr)
 {
   void **block = (void **) ptr;
@@ -1431,11 +1483,10 @@ cuda_free_argmem (void *ptr)
   free (block);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-				 void **hostaddrs, void **devaddrs,
-				 unsigned *dims, void *targ_mem_desc,
-				 struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_v0 (void (*fn) (void *), size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1504,13 +1555,43 @@ cuda_free_argmem (void *ptr)
 	}
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, aq->cuda_stream);
 
   if (mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
+static void
+openacc_async_exec_v1 (void (*fn) (void *), size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, struct goacc_asyncqueue *aq)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_async_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+	openacc_async_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+	break;
+      default:
+	GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+			   ptx_version);
+      };
+}
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {

Patch

Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 275493)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -68,6 +68,10 @@ 
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6437,6 +6441,228 @@  nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+      foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+   
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+      foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+			DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+	 launched makes this transformation only valid for OpenACC currently.
+	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+	 needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+			   DECL_ATTRIBUTES (current_function_decl))
+      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+	 to the OMP data record, or a single void* argument (when no values
+	 passed).  */
+      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
+	     && ((TREE_CODE (argtype) == REFERENCE_TYPE
+		  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+		 || (TREE_CODE (argtype) == POINTER_TYPE
+		     && TREE_TYPE (argtype) == void_type_node))))
+	return;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Detected offloaded child function %s, "
+		   "starting parameter conversion\n",
+		   print_generic_expr_to_str (current_function_decl));
+	  fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+		   print_generic_expr_to_str (omp_data_arg),
+		   print_generic_expr_to_str (argtype));
+	  fprintf (dump_file, "Data record fields:\n");
+	}
+      
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+	{
+	  /* For each field in the OMP data record type, create a corresponding
+	     PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+	     Also create the tree chains for creating function type and
+	     DECL_ARGUMENTS below.  */
+	  for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+	    {
+	      tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+				      DECL_NAME (fld), TREE_TYPE (fld));
+	      DECL_ARTIFICIAL (narg) = 1;
+	      DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+	      DECL_CONTEXT (narg) = current_function_decl;
+	      TREE_USED (narg) = 1;
+	      TREE_READONLY (narg) = 1;
+
+	      if (dump_file)
+		fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+			 print_generic_expr_to_str (fld),
+			 print_generic_expr_to_str (TREE_TYPE (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+	      fld_to_args.put (fld, narg);
+
+	      TREE_CHAIN (narg) = arglist;
+	      arglist = narg;
+	      argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+				       argtypelist);
+	    }
+	  arglist = nreverse (arglist);
+	  argtypelist = nreverse (argtypelist);
+	}
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function before OMP data arg replaced:\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+							       argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+	 scanning and skipping those entries, creating a new local_decls list.
+	 We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+	{
+	  if (DECL_HAS_VALUE_EXPR_P (decl))
+	    {
+	      tree t = DECL_VALUE_EXPR (decl);
+	      if (TREE_CODE (t) == MEM_REF
+		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+		      == omp_data_arg))
+		continue;
+	    }
+	  vec_safe_push (new_local_decls, decl);
+	}
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+      
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+	 the above created fld_to_args hash map, convert them to reads of
+	 function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+	for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    tree val, *val_ptr = NULL;
+	    gimple *stmt = gsi_stmt (gsi);
+	    if (is_gimple_assign (stmt)
+		&& gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+	      val_ptr = gimple_assign_rhs1_ptr (stmt);
+	    else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+	      val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+	    if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+	      continue;
+
+	    tree new_val = NULL_TREE, fld = NULL_TREE;
+
+	    if (TREE_CODE (val) == COMPONENT_REF
+		&& TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+		&& (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == SSA_NAME)
+		&& (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == omp_data_arg))
+	      {
+		/* .omp_data->FIELD case.  */
+		fld = TREE_OPERAND (val, 1);
+		new_val = *fld_to_args.get (fld);
+	      }
+	    else if (TREE_CODE (val) == MEM_REF
+		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+	      {
+		/* This case may happen in the final tree level optimization
+		   output, due to SLP:
+		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+		   Therefore here we need a more elaborate search of the field
+		   list to reverse map to which field the offset is referring
+		   to.  */
+		unsigned HOST_WIDE_INT offset
+		  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+		for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+		     i != fld_to_args.end (); ++i)
+		  {
+		    tree cur_fld = (*i).first;
+		    tree cur_arg = (*i).second;
+		    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+		    unsigned HOST_WIDE_INT cur_offset =
+		      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+		       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+			  / BITS_PER_UNIT));
+
+		    if (offset == cur_offset)
+		      {
+			new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+					  cur_arg);
+			break;
+		      }
+		  }
+	      }
+
+	    /* If we found the corresponding OMP data record field, replace the
+	       RHS with the newly created PARM_DECL.  */
+	    if (new_val != NULL_TREE)
+	      {
+		if (dump_file)
+		  {
+		    fprintf (dump_file, "For gimple stmt: ");
+		    print_gimple_stmt (dump_file, stmt, 0);
+		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+			     print_generic_expr_to_str (val),
+			     print_generic_expr_to_str (new_val));
+		  }
+		/* Write in looked up ARG as new RHS value.  */
+		*val_ptr = new_val;
+	      }
+	  }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+	if (SSA_NAME_VAR (name) == omp_data_arg)
+	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function after OMP data arg replaced: ");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
 
@@ -6576,6 +6802,9 @@  nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 275493)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -696,16 +696,24 @@  link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    unsigned *dims, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
+  void **kernel_args = NULL;
 
+  GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+
+  if (mapnum > 0)
+    {
+      kernel_args = alloca (mapnum * sizeof (void *));
+      for (size_t i = 0; i < mapnum; i++)
+	kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+    }
+  
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -937,11 +945,10 @@  nvptx_exec (void (*fn), size_t mapnum, void **host
 					    api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, stream, kargs, 0);
+		    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1350,67 +1357,8 @@  GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
 			   void **hostaddrs, void **devaddrs,
 			   unsigned *dims, void *targ_mem_desc)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL);
 
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      hp = alloca (s);
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
-			mapnum * sizeof (void *));
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
-
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
   if (r == CUDA_ERROR_LAUNCH_FAILED)
@@ -1418,20 +1366,8 @@  GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
 		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-
-  CUDA_CALL_ASSERT (cuMemFree, dp);
-  if (profiling_p)
-    goacc_profiling_acc_ev_free (thr, (void *) dp);
 }
 
-static void
-cuda_free_argmem (void *ptr)
-{
-  void **block = (void **) ptr;
-  nvptx_free (block[0], (struct ptx_device *) block[1]);
-  free (block);
-}
-
 void
 GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
 				 void **hostaddrs, void **devaddrs,
@@ -1438,78 +1374,7 @@  GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
 				 unsigned *dims, void *targ_mem_desc,
 				 struct goacc_asyncqueue *aq)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-  void **block = NULL;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s);
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-
-      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
-			mapnum * sizeof (void *), aq->cuda_stream);
-      block[0] = (void *) dp;
-
-      struct nvptx_thread *nvthd =
-	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
-      block[1] = (void *) nvthd->ptx_dev;
-
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
-
-  if (mapnum > 0)
-    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
 }
 
 void *