diff mbox

Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"

Message ID 87zivg8rcy.fsf@hertz.schwinge.homeip.net
State New
Headers show

Commit Message

Thomas Schwinge Feb. 4, 2016, 2:47 p.m. UTC
Hi!

On Fri, 22 Jan 2016 14:31:35 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 01/22/2016 02:25 PM, Jakub Jelinek wrote:
> 
> > What about #pragma oacc parallel?  That would never do that?
> 
> It shouldn't, no (IMO).

Correct.


Here is the patch re-worked for trunk.  Instead of passing
-foffload-force in the affected libgomp test cases, I instead chose to
have them expect the warning.  This way, we're testing more in line to
what users will be doing, and we'll notice how the OpenACC kernels
handling improves, when parloops gets able to parallelize more offloaded
code (and the "avoid offloading" handling will no longer trigger).  OK to
commit?

commit acd66946777671486a0f69706b25a3ec5f877306
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Tue Feb 2 20:41:42 2016 +0100

    Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
    
    	gcc/
    	* common.opt: Add -foffload-force.
    	* lto-wrapper.c (merge_and_complain, append_compiler_options):
    	Handle it.
    	* doc/invoke.texi: Document it.
    	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
    	(record_id): Parse, and set it.
    	(process): Use it.
    	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
    	offloading".
    	(nvptx_record_offload_symbol): Use it.
    	(nvptx_goacc_validate_dims): Set it.
    	libgomp/
    	* libgomp.h (gomp_offload_target_available_p): New function
    	declaration.
    	* target.c (gomp_offload_target_available_p): New function
    	definition.
    	(GOMP_offload_register_ver, GOMP_offload_unregister_ver)
    	(gomp_init_device, gomp_unload_device): Handle and document "avoid
    	offloading" flag ("host_table == NULL").
    	(resolve_device): Document "avoid offloading".
    	* oacc-init.c (resolve_device): Likewise.
    	* libgomp.texi (Enabling OpenACC): Likewise.
    	* testsuite/lib/libgomp.exp
    	(check_effective_target_nvptx_offloading_configured): New proc
    	definition.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
    	file.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
    	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Expect warning.
    	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
    	Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
    	Likewise.
    	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
    
    	libgomp/
    	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c: Set
    	"-ftree-parallelize-loops=32".
    	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
    	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
---
 gcc/common.opt                                     |    4 +
 gcc/config/nvptx/mkoffload.c                       |   73 +++++++++++-
 gcc/config/nvptx/nvptx.c                           |   42 ++++++-
 gcc/doc/invoke.texi                                |   12 +-
 gcc/lto-wrapper.c                                  |    2 +
 libgomp/libgomp.h                                  |    1 +
 libgomp/libgomp.texi                               |    8 ++
 libgomp/oacc-init.c                                |   19 ++-
 libgomp/target.c                                   |  122 ++++++++++++++++----
 libgomp/testsuite/lib/libgomp.exp                  |   10 ++
 .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |    4 +-
 .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |    4 +-
 .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |   28 +++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |   38 ++++++
 .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |   29 +++++
 .../combined-directives-1.c                        |    4 +-
 .../libgomp.oacc-c-c++-common/default-1.c          |    4 +-
 .../libgomp.oacc-c-c++-common/deviceptr-1.c        |    4 +-
 .../libgomp.oacc-c-c++-common/host_data-1.c        |    1 +
 .../libgomp.oacc-c-c++-common/kernels-1.c          |   10 +-
 .../kernels-alias-ipa-pta-2.c                      |    4 +-
 .../kernels-alias-ipa-pta-3.c                      |    4 +-
 .../kernels-alias-ipa-pta.c                        |    4 +-
 .../libgomp.oacc-c-c++-common/kernels-empty.c      |    2 +-
 .../kernels-loop-and-seq-2.c                       |    3 +-
 .../kernels-loop-and-seq-3.c                       |    4 +-
 .../kernels-loop-and-seq-4.c                       |    3 +-
 .../kernels-loop-and-seq-5.c                       |    3 +-
 .../kernels-loop-and-seq-6.c                       |    3 +-
 .../kernels-loop-and-seq.c                         |    4 +-
 .../kernels-loop-collapse.c                        |    3 +-
 .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |    2 +-
 .../libgomp.oacc-fortran/avoid-offloading-1.f      |   32 +++++
 .../libgomp.oacc-fortran/avoid-offloading-2.f      |   41 +++++++
 .../libgomp.oacc-fortran/avoid-offloading-3.f      |   31 +++++
 .../libgomp.oacc-fortran/combined-directives-1.f90 |    5 +-
 .../libgomp.oacc-fortran/non-scalar-data.f90       |    5 +-
 37 files changed, 494 insertions(+), 78 deletions(-)



Grüße
 Thomas

Comments

Thomas Schwinge Feb. 10, 2016, 11:49 a.m. UTC | #1
Hi!

Ping.

On Thu, 04 Feb 2016 15:47:25 +0100, I wrote:
> Here is the patch re-worked for trunk.  Instead of passing
> -foffload-force in the affected libgomp test cases, I instead chose to
> have them expect the warning.  This way, we're testing more in line to
> what users will be doing, and we'll notice how the OpenACC kernels
> handling improves, when parloops gets able to parallelize more offloaded
> code (and the "avoid offloading" handling will no longer trigger).  OK to
> commit?
> 
> commit acd66946777671486a0f69706b25a3ec5f877306
> Author: Thomas Schwinge <thomas@codesourcery.com>
> Date:   Tue Feb 2 20:41:42 2016 +0100
> 
>     Un-parallelized OpenACC kernels constructs with nvptx offloading: "avoid offloading"
>     
>     	gcc/
>     	* common.opt: Add -foffload-force.
>     	* lto-wrapper.c (merge_and_complain, append_compiler_options):
>     	Handle it.
>     	* doc/invoke.texi: Document it.
>     	* config/nvptx/mkoffload.c (struct id_map): Add "flags" member.
>     	(record_id): Parse, and set it.
>     	(process): Use it.
>     	* config/nvptx/nvptx.c (nvptx_attribute_table): Add "omp avoid
>     	offloading".
>     	(nvptx_record_offload_symbol): Use it.
>     	(nvptx_goacc_validate_dims): Set it.
>     	libgomp/
>     	* libgomp.h (gomp_offload_target_available_p): New function
>     	declaration.
>     	* target.c (gomp_offload_target_available_p): New function
>     	definition.
>     	(GOMP_offload_register_ver, GOMP_offload_unregister_ver)
>     	(gomp_init_device, gomp_unload_device): Handle and document "avoid
>     	offloading" flag ("host_table == NULL").
>     	(resolve_device): Document "avoid offloading".
>     	* oacc-init.c (resolve_device): Likewise.
>     	* libgomp.texi (Enabling OpenACC): Likewise.
>     	* testsuite/lib/libgomp.exp
>     	(check_effective_target_nvptx_offloading_configured): New proc
>     	definition.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c: New
>     	file.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-1.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-2.f: Likewise.
>     	* testsuite/libgomp.oacc-fortran/avoid-offloading-3.f: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/abort-3.c: Expect warning.
>     	* testsuite/libgomp.oacc-c-c++-common/abort-4.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-empty.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/combined-directives-1.f90:
>     	Likewise.
>     	* testsuite/libgomp.oacc-fortran/non-scalar-data.f90: Likewise.
>     
>     	libgomp/
>     	* testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c: Set
>     	"-ftree-parallelize-loops=32".
>     	* testsuite/libgomp.oacc-c-c++-common/default-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/host_data-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/kernels-1.c: Likewise.
>     	* testsuite/libgomp.oacc-c-c++-common/nested-2.c: Likewise.
> ---
>  gcc/common.opt                                     |    4 +
>  gcc/config/nvptx/mkoffload.c                       |   73 +++++++++++-
>  gcc/config/nvptx/nvptx.c                           |   42 ++++++-
>  gcc/doc/invoke.texi                                |   12 +-
>  gcc/lto-wrapper.c                                  |    2 +
>  libgomp/libgomp.h                                  |    1 +
>  libgomp/libgomp.texi                               |    8 ++
>  libgomp/oacc-init.c                                |   19 ++-
>  libgomp/target.c                                   |  122 ++++++++++++++++----
>  libgomp/testsuite/lib/libgomp.exp                  |   10 ++
>  .../testsuite/libgomp.oacc-c-c++-common/abort-3.c  |    4 +-
>  .../testsuite/libgomp.oacc-c-c++-common/abort-4.c  |    4 +-
>  .../libgomp.oacc-c-c++-common/avoid-offloading-1.c |   28 +++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-2.c |   38 ++++++
>  .../libgomp.oacc-c-c++-common/avoid-offloading-3.c |   29 +++++
>  .../combined-directives-1.c                        |    4 +-
>  .../libgomp.oacc-c-c++-common/default-1.c          |    4 +-
>  .../libgomp.oacc-c-c++-common/deviceptr-1.c        |    4 +-
>  .../libgomp.oacc-c-c++-common/host_data-1.c        |    1 +
>  .../libgomp.oacc-c-c++-common/kernels-1.c          |   10 +-
>  .../kernels-alias-ipa-pta-2.c                      |    4 +-
>  .../kernels-alias-ipa-pta-3.c                      |    4 +-
>  .../kernels-alias-ipa-pta.c                        |    4 +-
>  .../libgomp.oacc-c-c++-common/kernels-empty.c      |    2 +-
>  .../kernels-loop-and-seq-2.c                       |    3 +-
>  .../kernels-loop-and-seq-3.c                       |    4 +-
>  .../kernels-loop-and-seq-4.c                       |    3 +-
>  .../kernels-loop-and-seq-5.c                       |    3 +-
>  .../kernels-loop-and-seq-6.c                       |    3 +-
>  .../kernels-loop-and-seq.c                         |    4 +-
>  .../kernels-loop-collapse.c                        |    3 +-
>  .../testsuite/libgomp.oacc-c-c++-common/nested-2.c |    2 +-
>  .../libgomp.oacc-fortran/avoid-offloading-1.f      |   32 +++++
>  .../libgomp.oacc-fortran/avoid-offloading-2.f      |   41 +++++++
>  .../libgomp.oacc-fortran/avoid-offloading-3.f      |   31 +++++
>  .../libgomp.oacc-fortran/combined-directives-1.f90 |    5 +-
>  .../libgomp.oacc-fortran/non-scalar-data.f90       |    5 +-
>  37 files changed, 494 insertions(+), 78 deletions(-)
> 
> diff --git gcc/common.opt gcc/common.opt
> index 520fa9c..2cf798d 100644
> --- gcc/common.opt
> +++ gcc/common.opt
> @@ -1779,6 +1779,10 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
>  EnumValue
>  Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
>  
> +foffload-force
> +Common Var(flag_offload_force)
> +Force offloading if the compiler wanted to avoid it.
> +
>  fomit-frame-pointer
>  Common Report Var(flag_omit_frame_pointer) Optimization
>  When possible do not generate stack frames.
> diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
> index c8eed45..586ee8b 100644
> --- gcc/config/nvptx/mkoffload.c
> +++ gcc/config/nvptx/mkoffload.c
> @@ -41,9 +41,19 @@ const char tool_name[] = "nvptx mkoffload";
>  
>  #define COMMENT_PREFIX "#"
>  
> +enum id_map_flag
> +  {
> +    /* All clear.  */
> +    ID_MAP_FLAG_NONE = 0,
> +    /* Avoid offloading.  For example, because there is no sufficient
> +       parallelism.  */
> +    ID_MAP_FLAG_AVOID_OFFLOADING = 1
> +  };
> +
>  struct id_map
>  {
>    id_map *next;
> +  int flags;
>    char *ptx_name;
>  };
>  
> @@ -107,6 +117,38 @@ record_id (const char *p1, id_map ***where)
>      fatal_error (input_location, "malformed ptx file");
>  
>    id_map *v = XNEW (id_map);
> +
> +  /* Do we have any flags?  */
> +  v->flags = ID_MAP_FLAG_NONE;
> +  if (p1[0] == '(')
> +    {
> +      /* Current flag.  */
> +      const char *cur = p1 + 1;
> +
> +      /* Seek to the beginning of ") ".  */
> +      p1 = strchr (cur, ')');
> +      if (!p1 || p1 > end || p1[1] != ' ')
> +	fatal_error (input_location, "malformed ptx file: "
> +		     "expected \") \" at \"%s\"", cur);
> +
> +      while (cur < p1)
> +	{
> +	  const char *next = strchr (cur, ',');
> +	  if (!next || next > p1)
> +	    next = p1;
> +
> +	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
> +	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
> +	  else
> +	    fatal_error (input_location, "malformed ptx file: "
> +			 "unknown flag at \"%s\"", cur);
> +
> +	  cur = next;
> +	}
> +
> +      /* Skip past ") ".  */
> +      p1 += 2;
> +    }
>    size_t len = end - p1;
>    v->ptx_name = XNEWVEC (char, len + 1);
>    memcpy (v->ptx_name, p1, len);
> @@ -296,12 +338,17 @@ process (FILE *in, FILE *out)
>    fprintf (out, "\n};\n\n");
>  
>    /* Dump out function idents.  */
> +  bool avoid_offloading_p = false;
>    fprintf (out, "static const struct nvptx_fn {\n"
>  	   "  const char *name;\n"
>  	   "  unsigned short dim[%d];\n"
>  	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
>    for (comma = "", id = func_ids; id; comma = ",", id = id->next)
> -    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    {
> +      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
> +	avoid_offloading_p = true;
> +      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
> +    }
>    fprintf (out, "\n};\n\n");
>  
>    fprintf (out,
> @@ -318,7 +365,11 @@ process (FILE *in, FILE *out)
>  	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
>  	   "  func_mappings,"
>  	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
> -	   "};\n\n");
> +	   "};\n");
> +  if (avoid_offloading_p)
> +    /* Need a unique handle for target_data.  */
> +    fprintf (out, "static int target_data_avoid_offloading;\n");
> +  fprintf (out, "\n");
>  
>    fprintf (out, "#ifdef __cplusplus\n"
>  	   "extern \"C\" {\n"
> @@ -338,18 +389,28 @@ process (FILE *in, FILE *out)
>    fprintf (out, "static __attribute__((constructor)) void init (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  
>    fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
>  	   "{\n"
>  	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
> -	   "%d/*NVIDIA_PTX*/, &target_data);\n"
> -	   "};\n",
> +	   "%d/*NVIDIA_PTX*/, &target_data);\n",
>  	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
>  	   GOMP_DEVICE_NVIDIA_PTX);
> +  if (avoid_offloading_p)
> +    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
> +	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
> +	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
> +	     GOMP_DEVICE_NVIDIA_PTX);
> +  fprintf (out, "};\n");
>  }
>  
>  static void
> diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
> index 78614f8..fe28154 100644
> --- gcc/config/nvptx/nvptx.c
> +++ gcc/config/nvptx/nvptx.c
> @@ -3803,6 +3803,9 @@ static const struct attribute_spec nvptx_attribute_table[] =
>    /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
>         affects_type_identity } */
>    { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
> +  /* Avoid offloading.  For example, because there is no sufficient
> +     parallelism.  */
> +  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
>    { NULL, 0, 0, false, false, false, NULL, false }
>  };
>  
> @@ -3867,7 +3870,10 @@ nvptx_record_offload_symbol (tree decl)
>  	tree dims = TREE_VALUE (attr);
>  	unsigned ix;
>  
> -	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
> +	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
> +		 (lookup_attribute ("omp avoid offloading",
> +				    DECL_ATTRIBUTES (decl))
> +		  ? "(avoid offloading) " : ""),
>  		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
>  
>  	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
> @@ -4124,6 +4130,40 @@ nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
>  static bool
>  nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
>  {
> +  /* Detect if a function is unsuitable for offloading.  */
> +  if (!flag_offload_force && decl)
> +    {
> +      tree oacc_function_attr = get_oacc_fn_attrib (decl);
> +      if (oacc_function_attr
> +	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
> +	{
> +	  bool avoid_offloading_p = true;
> +	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
> +	    {
> +	      if (dims[ix] > 1)
> +		{
> +		  avoid_offloading_p = false;
> +		  break;
> +		}
> +	    }
> +	  if (avoid_offloading_p)
> +	    {
> +	      /* OpenACC kernels constructs will never be parallelized for
> +		 optimization levels smaller than -O2; avoid the diagnostic in
> +		 this case.  */
> +	      if (optimize >= 2)
> +		warning_at (DECL_SOURCE_LOCATION (decl), 0,
> +			    "OpenACC kernels construct will be executed "
> +			    "sequentially; will by default avoid offloading "
> +			    "to prevent data copy penalty");
> +	      DECL_ATTRIBUTES (decl)
> +		= tree_cons (get_identifier ("omp avoid offloading"),
> +			     NULL_TREE, DECL_ATTRIBUTES (decl));
> +
> +	    }
> +	}
> +    }
> +
>    bool changed = false;
>  
>    /* The vector size must be 32, unless this is a SEQ routine.  */
> diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
> index fcc404e..c09fbc5 100644
> --- gcc/doc/invoke.texi
> +++ gcc/doc/invoke.texi
> @@ -180,7 +180,8 @@ in the following sections.
>  @gccoptlist{-ansi  -std=@var{standard}  -fgnu89-inline @gol
>  -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol
>  -fno-asm  -fno-builtin  -fno-builtin-@var{function} @gol
> --fhosted  -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol
> +-fhosted  -ffreestanding @gol
> +-foffload-force -fopenacc -fopenacc-dim=@var{geom} -fopenmp -fopenmp-simd @gol
>  -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness}
>  -fallow-single-precision  -fcond-mismatch -flax-vector-conversions @gol
>  -fsigned-bitfields  -fsigned-char @gol
> @@ -1953,6 +1954,15 @@ This is equivalent to @option{-fno-hosted}.
>  @xref{Standards,,Language Standards Supported by GCC}, for details of
>  freestanding and hosted environments.
>  
> +@item -foffload-force
> +@opindex -foffload-force
> +The option @option{-foffload-force} forces offloading if the compiler
> +wanted to avoid it.  For example, when there isn't sufficient
> +parallelism in certain offloading constructs, the compiler may come to
> +the conclusion that offloading incurs too much overhead (for data
> +transfers, for example), and unless overridden with this flag, it then
> +suggests to the runtime (libgomp) to avoid offloading.
> +
>  @item -fopenacc
>  @opindex fopenacc
>  @cindex OpenACC accelerator programming
> diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
> index ced6f2f..702ae47 100644
> --- gcc/lto-wrapper.c
> +++ gcc/lto-wrapper.c
> @@ -275,6 +275,7 @@ merge_and_complain (struct cl_decoded_option **decoded_options,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fcilkplus:
> @@ -517,6 +518,7 @@ append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
>  	case OPT_fsigned_zeros:
>  	case OPT_ftrapping_math:
>  	case OPT_fwrapv:
> +	case OPT_foffload_force:
>  	case OPT_fopenmp:
>  	case OPT_fopenacc:
>  	case OPT_fopenacc_dim_:
> diff --git libgomp/libgomp.h libgomp/libgomp.h
> index 7108a6d..8747b72 100644
> --- libgomp/libgomp.h
> +++ libgomp/libgomp.h
> @@ -984,6 +984,7 @@ extern void gomp_unmap_vars (struct target_mem_desc *, bool);
>  extern void gomp_init_device (struct gomp_device_descr *);
>  extern void gomp_free_memmap (struct splay_tree_s *);
>  extern void gomp_unload_device (struct gomp_device_descr *);
> +extern bool gomp_offload_target_available_p (int);
>  
>  /* work.c */
>  
> diff --git libgomp/libgomp.texi libgomp/libgomp.texi
> index 987ee5f..5795c00 100644
> --- libgomp/libgomp.texi
> +++ libgomp/libgomp.texi
> @@ -1815,6 +1815,14 @@ flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
>  arranges for automatic linking of the OpenACC runtime library 
>  (@ref{OpenACC Runtime Library Routines}).
>  
> +Offloading is enabled by default.  In some cases, the compiler may
> +come to the conclusion that offloading incurs too much overhead, and
> +suggest to the runtime to avoid it.  To counteract that, you can use
> +the option @option{-foffload-force} to force offloading in such cases.
> +Alternatively, offloading is also enabled if a specific device type is
> +requested, in a call to @code{acc_init} or by setting the
> +@env{ACC_DEVICE_TYPE} environment variable, for example.
> +
>  A complete description of all OpenACC directives accepted may be found in 
>  the @uref{http://www.openacc.org/, OpenACC} Application Programming
>  Interface manual, version 2.0.
> diff --git libgomp/oacc-init.c libgomp/oacc-init.c
> index 42d005d..2f053f3 100644
> --- libgomp/oacc-init.c
> +++ libgomp/oacc-init.c
> @@ -122,7 +122,10 @@ resolve_device (acc_device_t d, bool fail_is_error)
>        {
>  	if (goacc_device_type)
>  	  {
> -	    /* Lookup the named device.  */
> +	    /* Lookup the device that has been explicitly named, so do not pay
> +	       attention to gomp_offload_target_available_p.  (That is,
> +	       enforced usage even with an "avoid offloading" flag set, and
> +	       hard error if not actually available.)  */
>  	    while (++d != _ACC_device_hwm)
>  	      if (dispatchers[d]
>  		  && !strcasecmp (goacc_device_type,
> @@ -148,8 +151,15 @@ resolve_device (acc_device_t d, bool fail_is_error)
>      case acc_device_not_host:
>        /* Find the first available device after acc_device_not_host.  */
>        while (++d != _ACC_device_hwm)
> -	if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0)
> +	if (dispatchers[d]
> +	    && dispatchers[d]->get_num_devices_func () > 0
> +	    /* No device has been explicitly named, so pay attention to
> +	       gomp_offload_target_available_p, to not decide on an offload
> +	       target that we don't have offload data available for, or have an
> +	       "avoid offloading" flag set for.  */
> +	    && gomp_offload_target_available_p (dispatchers[d]->type))
>  	  goto found;
> +      /* No non-host device found.  */
>        if (d_arg == acc_device_default)
>  	{
>  	  d = acc_device_host;
> @@ -168,7 +178,7 @@ resolve_device (acc_device_t d, bool fail_is_error)
>        break;
>  
>      default:
> -      if (d > _ACC_device_hwm)
> +      if (d >= _ACC_device_hwm)
>  	{
>  	  if (fail_is_error)
>  	    goto unsupported_device;
> @@ -181,7 +191,8 @@ resolve_device (acc_device_t d, bool fail_is_error)
>  
>    assert (d != acc_device_none
>  	  && d != acc_device_default
> -	  && d != acc_device_not_host);
> +	  && d != acc_device_not_host
> +	  && d < _ACC_device_hwm);
>  
>    if (dispatchers[d] == NULL && fail_is_error)
>      {
> diff --git libgomp/target.c libgomp/target.c
> index 96fe3d5..afcbedb 100644
> --- libgomp/target.c
> +++ libgomp/target.c
> @@ -1165,12 +1165,19 @@ gomp_unload_image_from_device (struct gomp_device_descr *devicep,
>  
>  /* This function should be called from every offload image while loading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
> +   offloading" flag, and the TARGET_TYPE will not be considered by default
> +   until this image gets unregistered.  */
>  
>  void
>  GOMP_offload_register_ver (unsigned version, const void *host_table,
>  			   int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
> @@ -1179,16 +1186,19 @@ GOMP_offload_register_ver (unsigned version, const void *host_table,
>    
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Load image to all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_load_image_to_device (devicep, version,
> -				   host_table, target_data, true);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Load image to all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_load_image_to_device (devicep, version,
> +				       host_table, target_data, true);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Insert image to array of pending images.  */
> @@ -1214,26 +1224,36 @@ GOMP_offload_register (const void *host_table, int target_type,
>  
>  /* This function should be called from every offload image while unloading.
>     It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
> -   the target, and TARGET_DATA needed by target plugin.  */
> +   the target, and TARGET_DATA needed by target plugin.
> +
> +   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
> +   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
> +   default.  */
>  
>  void
>  GOMP_offload_unregister_ver (unsigned version, const void *host_table,
>  			     int target_type, const void *target_data)
>  {
> +  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
> +	      version, host_table, target_type, target_data);
> +
>    int i;
>  
>    gomp_mutex_lock (&register_lock);
>  
> -  /* Unload image from all initialized devices.  */
> -  for (i = 0; i < num_devices; i++)
> +  if (host_table != NULL)
>      {
> -      struct gomp_device_descr *devicep = &devices[i];
> -      gomp_mutex_lock (&devicep->lock);
> -      if (devicep->type == target_type
> -	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> -	gomp_unload_image_from_device (devicep, version,
> -				       host_table, target_data);
> -      gomp_mutex_unlock (&devicep->lock);
> +      /* Unload image from all initialized devices.  */
> +      for (i = 0; i < num_devices; i++)
> +	{
> +	  struct gomp_device_descr *devicep = &devices[i];
> +	  gomp_mutex_lock (&devicep->lock);
> +	  if (devicep->type == target_type
> +	      && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	    gomp_unload_image_from_device (devicep, version,
> +					   host_table, target_data);
> +	  gomp_mutex_unlock (&devicep->lock);
> +	}
>      }
>  
>    /* Remove image from array of pending images.  */
> @@ -1267,7 +1287,8 @@ gomp_init_device (struct gomp_device_descr *devicep)
>    for (i = 0; i < num_offload_images; i++)
>      {
>        struct offload_image_descr *image = &offload_images[i];
> -      if (image->type == devicep->type)
> +      if (image->type == devicep->type
> +	  && image->host_table != NULL)
>  	gomp_load_image_to_device (devicep, image->version,
>  				   image->host_table, image->target_data,
>  				   false);
> @@ -1287,7 +1308,8 @@ gomp_unload_device (struct gomp_device_descr *devicep)
>        for (i = 0; i < num_offload_images; i++)
>  	{
>  	  struct offload_image_descr *image = &offload_images[i];
> -	  if (image->type == devicep->type)
> +	  if (image->type == devicep->type
> +	      && image->host_table != NULL)
>  	    gomp_unload_image_from_device (devicep, image->version,
>  					   image->host_table,
>  					   image->target_data);
> @@ -1311,6 +1333,62 @@ gomp_free_memmap (struct splay_tree_s *mem_map)
>      }
>  }
>  
> +/* Do we have offload data available for the given offload target type?
> +   Instead of verifying that *all* offload data is available that could
> +   possibly be required, we instead just look for *any*.  If we later find any
> +   offload data missing, that's user error.  If any offload data of this target
> +   type is tagged with an "avoid offloading" flag, do not consider this target
> +   type available unless it has been initialized already.  */
> +
> +attribute_hidden bool
> +gomp_offload_target_available_p (int type)
> +{
> +  bool available = false;
> +
> +  /* Has the offload target type already been initialized?  */
> +  for (int i = 0; !available && i < num_devices; i++)
> +    {
> +      struct gomp_device_descr *devicep = &devices[i];
> +      gomp_mutex_lock (&devicep->lock);
> +      if (devicep->type == type
> +	  && devicep->state == GOMP_DEVICE_INITIALIZED)
> +	available = true;
> +      gomp_mutex_unlock (&devicep->lock);
> +    }
> +
> +  /* If the offload target type has been initialized already, we ignore "avoid
> +     offloading" flags.  This is important, because data/state may be present
> +     on the device, that we must continue to use.  */
> +  if (!available)
> +    {
> +      gomp_mutex_lock (&register_lock);
> +      if (num_offload_images == 0)
> +	/* If there is no offload data available at all, there is no way to
> +	   later fail to find any of it for a specific offload target type.
> +	   This is the case where there are no offloaded code regions in user
> +	   code, but the target type can be initialized successfully, and
> +	   executable directqives be used, or runtime library calls be
> +	   made.  */
> +	available = true;
> +      else
> +	{
> +	  /* Can the offload target be initialized?  */
> +	  for (int i = 0; !available && i < num_offload_images; i++)
> +	    if (offload_images[i].type == type
> +		&& offload_images[i].host_table != NULL)
> +	      available = true;
> +	  /* If yes, is an "avoid offloading" flag set?  */
> +	  for (int i = 0; available && i < num_offload_images; i++)
> +	    if (offload_images[i].type == type
> +		&& offload_images[i].host_table == NULL)
> +	      available = false;
> +	}
> +      gomp_mutex_unlock (&register_lock);
> +    }
> +
> +  return available;
> +}
> +
>  /* Host fallback for GOMP_target{,_ext} routines.  */
>  
>  static void
> diff --git libgomp/testsuite/lib/libgomp.exp libgomp/testsuite/lib/libgomp.exp
> index a4c9d83..8d2be80 100644
> --- libgomp/testsuite/lib/libgomp.exp
> +++ libgomp/testsuite/lib/libgomp.exp
> @@ -344,6 +344,16 @@ proc check_effective_target_offload_device_nonshared_as { } {
>      } ]
>  }
>  
> +# Return 1 if the compiler has been configured for nvptx offloading.
> +
> +proc check_effective_target_nvptx_offloading_configured { } {
> +    # PR libgomp/65099: Currently, we only support offloading in 64-bit
> +    # configurations.
> +    global offload_targets
> +    return [expr [string match "*,nvptx,*" ",$offload_targets,"] \
> +		&& [is-effective-target lp64] ]
> +}
> +
>  # Return 1 if at least one nvidia board is present.
>  
>  proc check_effective_target_openacc_nvidia_accel_present { } {
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> index bca425e..23156d8 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
> @@ -1,5 +1,3 @@
> -/* { dg-do run } */
> -
>  #include <stdio.h>
>  #include <stdlib.h>
>  
> @@ -7,7 +5,7 @@ int
>  main (void)
>  {
>    fprintf (stderr, "CheCKpOInT\n");
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      abort ();
>    }
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> index c29ca3f..f4d6a07 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
> @@ -1,12 +1,10 @@
> -/* { dg-do run } */
> -
>  #include <stdlib.h>
>  
>  int
>  main (int argc, char **argv)
>  {
>  
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (argc != 1)
>        abort ();
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> new file mode 100644
> index 0000000..08745fc
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
> @@ -0,0 +1,28 @@
> +/* Test that the compiler decides to "avoid offloading".  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* The ACC_DEVICE_TYPE environment variable gets set in the testing
> +   framework, and that overrides the "avoid offloading" flag at run time.
> +   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 1)
> +    __builtin_abort();
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> new file mode 100644
> index 0000000..724228a
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
> @@ -0,0 +1,38 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision at run time.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  /* Override the compiler's "avoid offloading" decision.  */
> +  acc_device_t d;
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  d = acc_device_nvidia;
> +#elif defined ACC_DEVICE_TYPE_host
> +  d = acc_device_host;
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +  acc_init (d);
> +
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> new file mode 100644
> index 0000000..2fb5196
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
> @@ -0,0 +1,29 @@
> +/* Test that a user can override the compiler's "avoid offloading"
> +   decision at compile time.  */
> +
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
> +/* Override the compiler's "avoid offloading" decision.
> +   { dg-additional-options "-foffload-force" } */
> +
> +#include <openacc.h>
> +
> +int main(void)
> +{
> +  int x, y;
> +
> +#pragma acc data copyout(x, y)
> +#pragma acc kernels
> +  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
> +
> +  if (x != 33)
> +    __builtin_abort();
> +#if defined ACC_DEVICE_TYPE_nvidia
> +  if (y != 0)
> +    __builtin_abort();
> +#else
> +  if (y != 1)
> +    __builtin_abort();
> +#endif
> +
> +  return 0;
> +}
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> index dad6d13..87ca378 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
> @@ -1,6 +1,6 @@
>  /* This test exercises combined directives.  */
>  
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> @@ -33,7 +33,7 @@ main (int argc, char **argv)
>  	abort ();
>      }
>  
> -#pragma acc kernels loop copy (a[0:N]) copy (b[0:N])
> +#pragma acc kernels loop copy (a[0:N]) copy (b[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    for (i = 0; i < N; i++)
>      {
>        b[i] = 3.0;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> index 1ac0b95..8f0144c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include  <openacc.h>
>  
> @@ -51,7 +51,7 @@ int test_kernels ()
>      ary[i] = ~0;
>  
>    /* val defaults to copy, ary defaults to copy.  */
> -#pragma acc kernels copy(ondev)
> +#pragma acc kernels copy(ondev) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      ondev = acc_on_device (acc_device_not_host);
>  #pragma acc loop 
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> index e271a37..9a5f7b1 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
> @@ -1,5 +1,3 @@
> -/* { dg-do run } */
> -
>  #include <stdlib.h>
>  
>  int main (void)
> @@ -10,7 +8,7 @@ int main (void)
>    a = A;
>  
>  #pragma acc data copyout (a_1, a_2)
> -#pragma acc kernels deviceptr (a)
> +#pragma acc kernels deviceptr (a) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a_1 = a;
>      a_2 = &a;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> index 51745ba..3ef6f9b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
> @@ -1,4 +1,5 @@
>  /* { dg-do run { target openacc_nvidia_accel_selected } } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
>  
>  #include <stdlib.h>
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> index 3acfdf5..614ad33 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> @@ -73,7 +73,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (i != -1 || j != -2)
>        abort ();
> @@ -96,7 +96,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      i = 2;
>      j = 1;
> @@ -110,7 +110,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      if (i != -1 || j != -2)
>        abort ();
> @@ -126,7 +126,7 @@ int main (void)
>    i = -1;
>    j = -2;
>    v = 0;
> -#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j)
> +#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      i = 2;
>      j = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> index 0f323c8..8d5101d 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
>    unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
>  
> -#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
> +#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> index 654e750..3726b0c 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int *b = a;
>    unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
>  
> -#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
> +#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> index 44d4fd2..eea4f76 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
> @@ -1,4 +1,4 @@
> -/* { dg-additional-options "-O2 -fipa-pta" } */
> +/* { dg-additional-options "-fipa-pta" } */
>  
>  #include <stdlib.h>
>  
> @@ -11,7 +11,7 @@ main (void)
>    unsigned int b[N];
>    unsigned int c[N];
>  
> -#pragma acc kernels pcopyout (a, b, c)
> +#pragma acc kernels pcopyout (a, b, c) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 0;
>      b[0] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> index a68a7cd..860b6da 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
> @@ -1,6 +1,6 @@
>  int
>  main (void)
>  {
> -#pragma acc kernels
> +#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    ;
>  }
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> index 2e4100f..5cdc200 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = a[0] + 1;
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> index b3e736b..2e4d4d2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,8 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      for (int i = 0; i < n; i++)
>        a[i] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> index 8b9affa..5bf00db 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      a[0] = 2;
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> index 83d4e7f..d39b667 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -9,7 +8,7 @@ unsigned int
>  foo (int n, unsigned int *a)
>  {
>    int r;
> -#pragma acc kernels copyout(r) copy (a[0:N])
> +#pragma acc kernels copyout(r) copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      r = a[0];
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> index 01d5e5e..bb2e85b 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,7 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      int r = a[0];
>  
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> index 61d1283..e513827 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -8,8 +7,7 @@
>  unsigned int
>  foo (int n, unsigned int *a)
>  {
> -
> -#pragma acc kernels copy (a[0:N])
> +#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>      for (int i = 0; i < n; i++)
>        a[i] = 1;
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> index f7f04cb..c4791a4 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
> @@ -1,4 +1,3 @@
> -/* { dg-do run } */
>  /* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
> @@ -11,7 +10,7 @@ void __attribute__((noinline, noclone))
>  foo (int m, int n)
>  {
>    int i, j;
> -  #pragma acc kernels
> +  #pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
>    {
>  #pragma acc loop collapse(2)
>      for (i = 0; i < m; i++)
> diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> index c164598..94a5ae2 100644
> --- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> +++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
> @@ -1,4 +1,4 @@
> -/* { dg-do run } */
> +/* { dg-additional-options "-ftree-parallelize-loops=32" } */
>  
>  #include <stdlib.h>
>  
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> new file mode 100644
> index 0000000..5f18b94
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
> @@ -0,0 +1,32 @@
> +! Test that the compiler decides to "avoid offloading".
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
> +! The ACC_DEVICE_TYPE environment variable gets set in the testing
> +! framework, and that overrides the "avoid offloading" flag at run time.
> +! { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
> +      IF (.NOT. Y) CALL ABORT
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> new file mode 100644
> index 0000000..51801ad
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
> @@ -0,0 +1,41 @@
> +! Test that a user can override the compiler's "avoid offloading"
> +! decision at run time.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!     Override the compiler's "avoid offloading" decision.
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      D = ACC_DEVICE_NVIDIA
> +#elif defined ACC_DEVICE_TYPE_host
> +      D = ACC_DEVICE_HOST
> +#else
> +# error Not ported to this ACC_DEVICE_TYPE
> +#endif
> +      CALL ACC_INIT (D)
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> new file mode 100644
> index 0000000..bea6ab8
> --- /dev/null
> +++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
> @@ -0,0 +1,31 @@
> +! Test that a user can override the compiler's "avoid offloading"
> +! decision at compile time.
> +
> +! { dg-do run }
> +! { dg-additional-options "-cpp" }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! Override the compiler's "avoid offloading" decision.
> +! { dg-additional-options "-foffload-force" }
> +
> +      IMPLICIT NONE
> +      INCLUDE "openacc_lib.h"
> +
> +      INTEGER :: D
> +      INTEGER, VOLATILE :: X
> +      LOGICAL :: Y
> +
> +!$ACC DATA COPYOUT(X, Y)
> +!$ACC KERNELS
> +      X = 33
> +      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
> +!$ACC END KERNELS
> +!$ACC END DATA
> +
> +      IF (X .NE. 33) CALL ABORT
> +#if defined ACC_DEVICE_TYPE_nvidia
> +      IF (Y) CALL ABORT
> +#else
> +      IF (.NOT. Y) CALL ABORT
> +#endif
> +
> +      END
> diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> index 94100b2..4b52579 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
> @@ -1,6 +1,9 @@
>  ! This test exercises combined directives.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
>  program main
>    integer, parameter :: n = 32
> @@ -27,7 +30,7 @@ program main
>    !$acc kernels loop copy (a(1:n)) copy (b(1:n))
>    do i = 1, n
>      b(i) = 3.0;
> -    a(i) = a(i) + b(i)
> +    a(i) = a(i) + b(i) ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
>    end do
>  
>    do i = 1, n
> diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> index 4afb562..b9298c7 100644
> --- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> +++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
> @@ -2,6 +2,9 @@
>  ! offloaded regions are properly mapped using present_or_copy.
>  
>  ! { dg-do run }
> +! { dg-additional-options "-ftree-parallelize-loops=32" }
> +! The "avoid offloading" warning is only triggered for -O2 and higher.
> +! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
>  
>  program main
>    implicit none
> @@ -30,7 +33,7 @@ subroutine kernels (array, n)
>    integer, dimension (n) :: array
>    integer :: n, i
>  
> -  !$acc kernels
> +  !$acc kernels ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
>    do i = 1, n
>       array(i) = i
>    end do


Grüße
 Thomas
Bernd Schmidt Feb. 10, 2016, 1:25 p.m. UTC | #2
On 02/10/2016 12:49 PM, Thomas Schwinge wrote:
> Hi!
>
> Ping.

I think this has to be considered after gcc-6. In general, what's the 
state of OpenACC these days?

I'm slightly confused by the interface between offloaded code and 
libgomp. It looks like you're collecting avoid-offloading flags 
per-function, but then when things get registered, it seems like a 
per-image flag. Is that right? It seems like too large a hammer.

>> +	  bool avoid_offloading_p = true;
>> +	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
>> +	    {
>> +	      if (dims[ix] > 1)
>> +		{
>> +		  avoid_offloading_p = false;
>> +		  break;
>> +		}
>> +	    }

Avoid unnecessary braces.

>> +	   executable directqives be used, or runtime library calls be

Typo.


Bernd
Thomas Schwinge Feb. 10, 2016, 2:39 p.m. UTC | #3
Hi!

On Wed, 10 Feb 2016 14:25:50 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 12:49 PM, Thomas Schwinge wrote:
> > [...]
> 
> I think this has to be considered after gcc-6.

Hmm, I see.


> In general, what's the 
> state of OpenACC these days?

Much improved compared to GCC 5.  :-) Anything specific you'd like me to
elaborate on?  <https://gcc.gnu.org/wiki/OpenACC> should be fairly
accurate.


> I'm slightly confused by the interface between offloaded code and 
> libgomp. It looks like you're collecting avoid-offloading flags 
> per-function, but then when things get registered, it seems like a 
> per-image flag.

(Per-image flag that affects all offloading for a given offloading type,
even.)

> Is that right? It seems like too large a hammer.

Yes, we need a hammer that big: we have to ensure consistency between
data regions on the device and code offloading to the device, as
otherwise we'll very easily run into inconsistencies, because of the
non-shared memory.  In the general case, it's "all or nothing": you
either have to offload all kernels or none of them.


> >> [...]
> 
> Avoid unnecessary braces.
> 
> >> [...]
> 
> Typo.

Thanks for the review; fixed.


Grüße
 Thomas
Bernd Schmidt Feb. 10, 2016, 3:27 p.m. UTC | #4
On 02/10/2016 03:39 PM, Thomas Schwinge wrote:

> Yes, we need a hammer that big: we have to ensure consistency between
> data regions on the device and code offloading to the device, as
> otherwise we'll very easily run into inconsistencies, because of the
> non-shared memory.  In the general case, it's "all or nothing": you
> either have to offload all kernels or none of them.

That's unfortunately not the impression I got from the earlier 
discussion, and this seems to imply that one unprofitable kernel would 
disable all the others - IMO this is not acceptable. There need to be 
more compiler smarts to figure out whether a kernel is a valid candidate 
for skipping the offloading.


Bernd
Thomas Schwinge Feb. 10, 2016, 4:23 p.m. UTC | #5
Hi!

On Wed, 10 Feb 2016 16:27:40 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 03:39 PM, Thomas Schwinge wrote:
> 
> > Yes, we need a hammer that big: we have to ensure consistency between
> > data regions on the device and code offloading to the device, as
> > otherwise we'll very easily run into inconsistencies, because of the
> > non-shared memory.  In the general case, it's "all or nothing": you
> > either have to offload all kernels or none of them.
> 
> That's unfortunately not the impression I got from the earlier 
> discussion

:-(

> and this seems to imply that one unprofitable kernel would 
> disable all the others

Correct.

> - IMO this is not acceptable.

Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
constructs' code offloaded; the user wants his code to execute as fast as
possible.

If you consider the whole of OpenACC kernels code offloading as a
compiler optimization, then it's fine for GCC to abort this
"optimization" if it's reasonably clear that this transformation (code
offloading) will not be profitable -- just like what GCC does with other
possible code optimizations/transformations.  As I've said before,
profiling the execution times of several real-world codes has shown that
under the assumtion that parloops fails to parallelize one kernel (one
out of possibly many), this one kernel has always been a "hot spot", and
avoiding offloading in this case has always helped prevent performance
degradation below host-fallback performance.

It's of course unfortunate that we have to disable our offloading
machinery for a lot of codes using OpenACC kernels, but given the current
state of OpenACC kernels parallelization analysis (parloops), doing so is
still profitable for a user, compared to regressed performance with
single-threaded offloaded execution.

Of course...

> There need to be 
> more compiler smarts to figure out whether a kernel is a valid candidate 
> for skipping the offloading.

... that would be better, obviously.  But, I suggest we work on that
incrementally, after fixing the performance regression with my "avoid
offloading" patch.

I have difficulties coming up with an algorithm/parametrization to have
the compiler/runtime decide whether offloading will be profitable given
input parameters such as a ratio of parallelized/single-threaded kernels.
So I'm all ears to suggestions in that regard.  Consider: if we encounter
a single-threaded kernel, the compiler (parloops) has just given up
"understanding" the user's code.  And again, implementing such heuristics
to me sounds like incremental follow-up projects, quite possibly in
combination with generally improving OpenACC kernels handling/parloops.


Grüße
 Thomas
Bernd Schmidt Feb. 10, 2016, 4:37 p.m. UTC | #6
On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> constructs' code offloaded; the user wants his code to execute as fast as
> possible.
>
> If you consider the whole of OpenACC kernels code offloading as a
> compiler optimization, then it's fine for GCC to abort this
> "optimization" if it's reasonably clear that this transformation (code
> offloading) will not be profitable -- just like what GCC does with other
> possible code optimizations/transformations.

Yes, but if a single kernel (which might not even get executed at 
run-time) can inhibit offloading for the whole program, then we're not 
making an intelligent decision, and IMO violating user expectations. 
IIUC it's also disabling offloading for parallels rather than just 
kernels, which we previously said shouldn't happen.

> As I've said before,
> profiling the execution times of several real-world codes has shown that
> under the assumtion that parloops fails to parallelize one kernel (one
> out of possibly many), this one kernel has always been a "hot spot", and
> avoiding offloading in this case has always helped prevent performance
> degradation below host-fallback performance.

IMO a warning for the specific kernel that's problematic would be better 
so that users can selectively apply -fopenacc to files where it is 
profitable.

> It's of course unfortunate that we have to disable our offloading
> machinery for a lot of codes using OpenACC kernels, but given the current
> state of OpenACC kernels parallelization analysis (parloops), doing so is
> still profitable for a user, compared to regressed performance with
> single-threaded offloaded execution.

How often does this occur on real-world code? Will we end up supporting 
OpenACC by not doing offloading at all in the usual case? The way you 
describe it, it sounds like we should recommend that -fopenacc not be 
used in gcc-6 and restore the previous invoke.texi langauge that marks 
it as experimental.


Bernd
Thomas Schwinge Feb. 10, 2016, 5:37 p.m. UTC | #7
Hi!

On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 05:23 PM, Thomas Schwinge wrote:
> > Why?  A user of GCC has no intrinsic interest in getting OpenACC kernels
> > constructs' code offloaded; the user wants his code to execute as fast as
> > possible.
> >
> > If you consider the whole of OpenACC kernels code offloading as a
> > compiler optimization, then it's fine for GCC to abort this
> > "optimization" if it's reasonably clear that this transformation (code
> > offloading) will not be profitable -- just like what GCC does with other
> > possible code optimizations/transformations.
> 
> Yes, but if a single kernel (which might not even get executed at 
> run-time) can inhibit offloading for the whole program, then we're not 
> making an intelligent decision, and IMO violating user expectations. 

Sure, I agree it's a pretty "rough-grained" decision.  (Owed to the
non-shared-memory offloading architecture -- shared-memory offloading
indeed can make such decisions case by case.)

> IIUC it's also disabling offloading for parallels rather than just 
> kernels, which we previously said shouldn't happen.

Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
understood the earlier discussion to apply to parallel-only codes, where
the "avoid offloading" flag will never be set.  In mixed parallel/kernels
code with one un-parallelized kernels construct, offloading would also
(have to be) disabled for the parallel constructs (for the same data
consistency reasons explained before).  The majority of codes I've seen
use either parallel or kernels constructs, typically not both.

> > As I've said before,
> > profiling the execution times of several real-world codes has shown that
> > under the assumtion that parloops fails to parallelize one kernel (one
> > out of possibly many), this one kernel has always been a "hot spot", and
> > avoiding offloading in this case has always helped prevent performance
> > degradation below host-fallback performance.
> 
> IMO a warning for the specific kernel that's problematic would be better 

That's something Tom suggested,
<http://news.gmane.org/find-root.php?message_id=%3C569D2059.4010105%40mentor.com%3E>,
and which motivated my patch, in going one step further:

> so that users can selectively apply -fopenacc to files where it is 
> profitable.

This puts it into the hands of the user to selectively mark kernels
constructs as suitable for GCC's current parloops processing (for
example, by disabling OpenACC/offloading on a per-file basis) -- which is
something we wanted to avoid, given the idea that in the future, GCC will
improve, and will be able to handle kernels constructs better, and the
user would then have to re-visit/un-do their earlier changes with each
GCC release, instead of just recompiling their code.

> > It's of course unfortunate that we have to disable our offloading
> > machinery for a lot of codes using OpenACC kernels, but given the current
> > state of OpenACC kernels parallelization analysis (parloops), doing so is
> > still profitable for a user, compared to regressed performance with
> > single-threaded offloaded execution.
> 
> How often does this occur on real-world code?

Quite a lot for code using the kernels construct, as discussed before,
given that parloops fails to handle a lot of constructs in real-world
code.

> Will we end up supporting 
> OpenACC by not doing offloading at all in the usual case?

This whole discussion does not at all apply to the body of OpenACC code
using the parallel instead of the kernels construct, which will be
parallelized/offloaded just fine.

> The way you 
> describe it, it sounds like we should recommend that -fopenacc not be 
> used in gcc-6 and restore the previous invoke.texi langauge that marks 
> it as experimental.

Huh?  Like, at random, discouraging users from using GCC's SIMD
vectorizer just because that one fails to vectorize some code that it
could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
vectorizer is much more mature than the OpenACC kernels/parloops
handling; it's seen many more years of development.)

Certainly we should document that there is still a lot of room for
improvement in OpenACC kernels handling (just like it's the case for a
lot of other generic compiler optimizations) -- and we're doing exactly
that on <https://gcc.gnu.org/wiki/OpenACC>.  I don't follow how that
translates to discouraging use of -fopenacc however?


Grüße
 Thomas
Bernd Schmidt Feb. 10, 2016, 8:07 p.m. UTC | #8
On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
>> IIUC it's also disabling offloading for parallels rather than just
>> kernels, which we previously said shouldn't happen.
>
> Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> understood the earlier discussion to apply to parallel-only codes, where
> the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> code with one un-parallelized kernels construct, offloading would also
> (have to be) disabled for the parallel constructs (for the same data
> consistency reasons explained before).  The majority of codes I've seen
> use either parallel or kernels constructs, typically not both.

That's not something I'd want to hard-code into the compiler however. 
Don't know how Jakub feels but to me this approach is way too 
coarse-grained.

> Huh?  Like, at random, discouraging users from using GCC's SIMD
> vectorizer just because that one fails to vectorize some code that it
> could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> vectorizer is much more mature than the OpenACC kernels/parloops
> handling; it's seen many more years of development.)

Your description sounded like it's not actually not optimizing, but 
actively hurting performance for a large selection of real world codes. 
If I understood that correctly, we need to document this in the manual.


Bernd
Thomas Schwinge Feb. 11, 2016, 10:01 a.m. UTC | #9
Hi!

There are two issues here: 1. "avoid offloading" mechanism, and 2. "avoid
offloading" policy.

On Wed, 10 Feb 2016 21:07:29 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> On 02/10/2016 06:37 PM, Thomas Schwinge wrote:
> > On Wed, 10 Feb 2016 17:37:30 +0100, Bernd Schmidt <bschmidt@redhat.com> wrote:
> >> IIUC it's also disabling offloading for parallels rather than just
> >> kernels, which we previously said shouldn't happen.
> >
> > Ah, you're talking about mixed OpenACC parallel/kernels codes -- I
> > understood the earlier discussion to apply to parallel-only codes, where
> > the "avoid offloading" flag will never be set.  In mixed parallel/kernels
> > code with one un-parallelized kernels construct, offloading would also
> > (have to be) disabled for the parallel constructs (for the same data
> > consistency reasons explained before).

The "avoid offloading" mechanism.  Owed to the non-shared-memory
offloading architecture, if the compiler/runtime decides to "avoid
offloading", then this has to apply to *all* code offloading, for data
consistency reasons.  Do we agree on that?

> > The majority of codes I've seen
> > use either parallel or kernels constructs, typically not both.
> 
> That's not something I'd want to hard-code into the compiler however. 
> Don't know how Jakub feels but to me this approach is way too 
> coarse-grained.

The "avoid offloading" policy.  I'm looking into improving that.


> > Huh?  Like, at random, discouraging users from using GCC's SIMD
> > vectorizer just because that one fails to vectorize some code that it
> > could/should vectorize?  (Of course, I'm well aware that GCC's SIMD
> > vectorizer is much more mature than the OpenACC kernels/parloops
> > handling; it's seen many more years of development.)
> 
> Your description sounded like it's not actually not optimizing, but 
> actively hurting performance for a large selection of real world codes. 

Indeed single-threaded (that is, un-parallelized OpenACC kernels
construct) offloading execution is hurting performance (data copy
overhead; kernel launch overhead; compared to a single CPU core, a single
GPU core has higher memory access latencies and is slower) -- hence the
idea to resort to host-fallback execution in such a situation.

> If I understood that correctly, we need to document this in the manual.

OK; prototyping that on <https://gcc.gnu.org/wiki/OpenACC>.


Grüße
 Thomas
Bernd Schmidt Feb. 11, 2016, 3:58 p.m. UTC | #10
On 02/11/2016 11:01 AM, Thomas Schwinge wrote:
>
> The "avoid offloading" mechanism.  Owed to the non-shared-memory
> offloading architecture, if the compiler/runtime decides to "avoid
> offloading", then this has to apply to *all* code offloading, for data
> consistency reasons.  Do we agree on that?

Not necessarily, I think. It should be possible to determine whether 
some offloaded code blocks are independent from each other. (That 
doesn't mean we currently have any good way of making such decisions. 
libgomp or even the ptx compiler are probably too late and don't have 
the necessary information anymore).


Bernd
diff mbox

Patch

diff --git gcc/common.opt gcc/common.opt
index 520fa9c..2cf798d 100644
--- gcc/common.opt
+++ gcc/common.opt
@@ -1779,6 +1779,10 @@  Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-force
+Common Var(flag_offload_force)
+Force offloading if the compiler wanted to avoid it.
+
 fomit-frame-pointer
 Common Report Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git gcc/config/nvptx/mkoffload.c gcc/config/nvptx/mkoffload.c
index c8eed45..586ee8b 100644
--- gcc/config/nvptx/mkoffload.c
+++ gcc/config/nvptx/mkoffload.c
@@ -41,9 +41,19 @@  const char tool_name[] = "nvptx mkoffload";
 
 #define COMMENT_PREFIX "#"
 
+enum id_map_flag
+  {
+    /* All clear.  */
+    ID_MAP_FLAG_NONE = 0,
+    /* Avoid offloading.  For example, because there is no sufficient
+       parallelism.  */
+    ID_MAP_FLAG_AVOID_OFFLOADING = 1
+  };
+
 struct id_map
 {
   id_map *next;
+  int flags;
   char *ptx_name;
 };
 
@@ -107,6 +117,38 @@  record_id (const char *p1, id_map ***where)
     fatal_error (input_location, "malformed ptx file");
 
   id_map *v = XNEW (id_map);
+
+  /* Do we have any flags?  */
+  v->flags = ID_MAP_FLAG_NONE;
+  if (p1[0] == '(')
+    {
+      /* Current flag.  */
+      const char *cur = p1 + 1;
+
+      /* Seek to the beginning of ") ".  */
+      p1 = strchr (cur, ')');
+      if (!p1 || p1 > end || p1[1] != ' ')
+	fatal_error (input_location, "malformed ptx file: "
+		     "expected \") \" at \"%s\"", cur);
+
+      while (cur < p1)
+	{
+	  const char *next = strchr (cur, ',');
+	  if (!next || next > p1)
+	    next = p1;
+
+	  if (strncmp (cur, "avoid offloading", next - cur - 1) == 0)
+	    v->flags |= ID_MAP_FLAG_AVOID_OFFLOADING;
+	  else
+	    fatal_error (input_location, "malformed ptx file: "
+			 "unknown flag at \"%s\"", cur);
+
+	  cur = next;
+	}
+
+      /* Skip past ") ".  */
+      p1 += 2;
+    }
   size_t len = end - p1;
   v->ptx_name = XNEWVEC (char, len + 1);
   memcpy (v->ptx_name, p1, len);
@@ -296,12 +338,17 @@  process (FILE *in, FILE *out)
   fprintf (out, "\n};\n\n");
 
   /* Dump out function idents.  */
+  bool avoid_offloading_p = false;
   fprintf (out, "static const struct nvptx_fn {\n"
 	   "  const char *name;\n"
 	   "  unsigned short dim[%d];\n"
 	   "} func_mappings[] = {\n", GOMP_DIM_MAX);
   for (comma = "", id = func_ids; id; comma = ",", id = id->next)
-    fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    {
+      if (id->flags & ID_MAP_FLAG_AVOID_OFFLOADING)
+	avoid_offloading_p = true;
+      fprintf (out, "%s\n\t{%s}", comma, id->ptx_name);
+    }
   fprintf (out, "\n};\n\n");
 
   fprintf (out,
@@ -318,7 +365,11 @@  process (FILE *in, FILE *out)
 	   "  sizeof (var_mappings) / sizeof (var_mappings[0]),\n"
 	   "  func_mappings,"
 	   "  sizeof (func_mappings) / sizeof (func_mappings[0])\n"
-	   "};\n\n");
+	   "};\n");
+  if (avoid_offloading_p)
+    /* Need a unique handle for target_data.  */
+    fprintf (out, "static int target_data_avoid_offloading;\n");
+  fprintf (out, "\n");
 
   fprintf (out, "#ifdef __cplusplus\n"
 	   "extern \"C\" {\n"
@@ -338,18 +389,28 @@  process (FILE *in, FILE *out)
   fprintf (out, "static __attribute__((constructor)) void init (void)\n"
 	   "{\n"
 	   "  GOMP_offload_register_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_register_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 
   fprintf (out, "static __attribute__((destructor)) void fini (void)\n"
 	   "{\n"
 	   "  GOMP_offload_unregister_ver (%#x, __OFFLOAD_TABLE__,"
-	   "%d/*NVIDIA_PTX*/, &target_data);\n"
-	   "};\n",
+	   "%d/*NVIDIA_PTX*/, &target_data);\n",
 	   GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
 	   GOMP_DEVICE_NVIDIA_PTX);
+  if (avoid_offloading_p)
+    fprintf (out, "  GOMP_offload_unregister_ver (%#x, (void *) 0,"
+	     "%d/*NVIDIA_PTX*/, &target_data_avoid_offloading);\n",
+	     GOMP_VERSION_PACK (GOMP_VERSION, GOMP_VERSION_NVIDIA_PTX),
+	     GOMP_DEVICE_NVIDIA_PTX);
+  fprintf (out, "};\n");
 }
 
 static void
diff --git gcc/config/nvptx/nvptx.c gcc/config/nvptx/nvptx.c
index 78614f8..fe28154 100644
--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -3803,6 +3803,9 @@  static const struct attribute_spec nvptx_attribute_table[] =
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
        affects_type_identity } */
   { "kernel", 0, 0, true, false,  false, nvptx_handle_kernel_attribute, false },
+  /* Avoid offloading.  For example, because there is no sufficient
+     parallelism.  */
+  { "omp avoid offloading", 0, 0, true, false, false, NULL, false },
   { NULL, 0, 0, false, false, false, NULL, false }
 };
 
@@ -3867,7 +3870,10 @@  nvptx_record_offload_symbol (tree decl)
 	tree dims = TREE_VALUE (attr);
 	unsigned ix;
 
-	fprintf (asm_out_file, "//:FUNC_MAP \"%s\"",
+	fprintf (asm_out_file, "//:FUNC_MAP %s\"%s\"",
+		 (lookup_attribute ("omp avoid offloading",
+				    DECL_ATTRIBUTES (decl))
+		  ? "(avoid offloading) " : ""),
 		 IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl)));
 
 	for (ix = 0; ix != GOMP_DIM_MAX; ix++, dims = TREE_CHAIN (dims))
@@ -4124,6 +4130,40 @@  nvptx_expand_builtin (tree exp, rtx target, rtx ARG_UNUSED (subtarget),
 static bool
 nvptx_goacc_validate_dims (tree decl, int dims[], int fn_level)
 {
+  /* Detect if a function is unsuitable for offloading.  */
+  if (!flag_offload_force && decl)
+    {
+      tree oacc_function_attr = get_oacc_fn_attrib (decl);
+      if (oacc_function_attr
+	  && oacc_fn_attrib_kernels_p (oacc_function_attr))
+	{
+	  bool avoid_offloading_p = true;
+	  for (unsigned ix = 0; ix != GOMP_DIM_MAX; ix++)
+	    {
+	      if (dims[ix] > 1)
+		{
+		  avoid_offloading_p = false;
+		  break;
+		}
+	    }
+	  if (avoid_offloading_p)
+	    {
+	      /* OpenACC kernels constructs will never be parallelized for
+		 optimization levels smaller than -O2; avoid the diagnostic in
+		 this case.  */
+	      if (optimize >= 2)
+		warning_at (DECL_SOURCE_LOCATION (decl), 0,
+			    "OpenACC kernels construct will be executed "
+			    "sequentially; will by default avoid offloading "
+			    "to prevent data copy penalty");
+	      DECL_ATTRIBUTES (decl)
+		= tree_cons (get_identifier ("omp avoid offloading"),
+			     NULL_TREE, DECL_ATTRIBUTES (decl));
+
+	    }
+	}
+    }
+
   bool changed = false;
 
   /* The vector size must be 32, unless this is a SEQ routine.  */
diff --git gcc/doc/invoke.texi gcc/doc/invoke.texi
index fcc404e..c09fbc5 100644
--- gcc/doc/invoke.texi
+++ gcc/doc/invoke.texi
@@ -180,7 +180,8 @@  in the following sections.
 @gccoptlist{-ansi  -std=@var{standard}  -fgnu89-inline @gol
 -aux-info @var{filename} -fallow-parameterless-variadic-functions @gol
 -fno-asm  -fno-builtin  -fno-builtin-@var{function} @gol
--fhosted  -ffreestanding -fopenacc -fopenmp -fopenmp-simd @gol
+-fhosted  -ffreestanding @gol
+-foffload-force -fopenacc -fopenacc-dim=@var{geom} -fopenmp -fopenmp-simd @gol
 -fms-extensions -fplan9-extensions -fsso-struct=@var{endianness}
 -fallow-single-precision  -fcond-mismatch -flax-vector-conversions @gol
 -fsigned-bitfields  -fsigned-char @gol
@@ -1953,6 +1954,15 @@  This is equivalent to @option{-fno-hosted}.
 @xref{Standards,,Language Standards Supported by GCC}, for details of
 freestanding and hosted environments.
 
+@item -foffload-force
+@opindex -foffload-force
+The option @option{-foffload-force} forces offloading if the compiler
+wanted to avoid it.  For example, when there isn't sufficient
+parallelism in certain offloading constructs, the compiler may come to
+the conclusion that offloading incurs too much overhead (for data
+transfers, for example), and unless overridden with this flag, it then
+suggests to the runtime (libgomp) to avoid offloading.
+
 @item -fopenacc
 @opindex fopenacc
 @cindex OpenACC accelerator programming
diff --git gcc/lto-wrapper.c gcc/lto-wrapper.c
index ced6f2f..702ae47 100644
--- gcc/lto-wrapper.c
+++ gcc/lto-wrapper.c
@@ -275,6 +275,7 @@  merge_and_complain (struct cl_decoded_option **decoded_options,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fcilkplus:
@@ -517,6 +518,7 @@  append_compiler_options (obstack *argv_obstack, struct cl_decoded_option *opts,
 	case OPT_fsigned_zeros:
 	case OPT_ftrapping_math:
 	case OPT_fwrapv:
+	case OPT_foffload_force:
 	case OPT_fopenmp:
 	case OPT_fopenacc:
 	case OPT_fopenacc_dim_:
diff --git libgomp/libgomp.h libgomp/libgomp.h
index 7108a6d..8747b72 100644
--- libgomp/libgomp.h
+++ libgomp/libgomp.h
@@ -984,6 +984,7 @@  extern void gomp_unmap_vars (struct target_mem_desc *, bool);
 extern void gomp_init_device (struct gomp_device_descr *);
 extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_unload_device (struct gomp_device_descr *);
+extern bool gomp_offload_target_available_p (int);
 
 /* work.c */
 
diff --git libgomp/libgomp.texi libgomp/libgomp.texi
index 987ee5f..5795c00 100644
--- libgomp/libgomp.texi
+++ libgomp/libgomp.texi
@@ -1815,6 +1815,14 @@  flag @option{-fopenacc} must be specified.  This enables the OpenACC directive
 arranges for automatic linking of the OpenACC runtime library 
 (@ref{OpenACC Runtime Library Routines}).
 
+Offloading is enabled by default.  In some cases, the compiler may
+come to the conclusion that offloading incurs too much overhead, and
+suggest to the runtime to avoid it.  To counteract that, you can use
+the option @option{-foffload-force} to force offloading in such cases.
+Alternatively, offloading is also enabled if a specific device type is
+requested, in a call to @code{acc_init} or by setting the
+@env{ACC_DEVICE_TYPE} environment variable, for example.
+
 A complete description of all OpenACC directives accepted may be found in 
 the @uref{http://www.openacc.org/, OpenACC} Application Programming
 Interface manual, version 2.0.
diff --git libgomp/oacc-init.c libgomp/oacc-init.c
index 42d005d..2f053f3 100644
--- libgomp/oacc-init.c
+++ libgomp/oacc-init.c
@@ -122,7 +122,10 @@  resolve_device (acc_device_t d, bool fail_is_error)
       {
 	if (goacc_device_type)
 	  {
-	    /* Lookup the named device.  */
+	    /* Lookup the device that has been explicitly named, so do not pay
+	       attention to gomp_offload_target_available_p.  (That is,
+	       enforced usage even with an "avoid offloading" flag set, and
+	       hard error if not actually available.)  */
 	    while (++d != _ACC_device_hwm)
 	      if (dispatchers[d]
 		  && !strcasecmp (goacc_device_type,
@@ -148,8 +151,15 @@  resolve_device (acc_device_t d, bool fail_is_error)
     case acc_device_not_host:
       /* Find the first available device after acc_device_not_host.  */
       while (++d != _ACC_device_hwm)
-	if (dispatchers[d] && dispatchers[d]->get_num_devices_func () > 0)
+	if (dispatchers[d]
+	    && dispatchers[d]->get_num_devices_func () > 0
+	    /* No device has been explicitly named, so pay attention to
+	       gomp_offload_target_available_p, to not decide on an offload
+	       target that we don't have offload data available for, or have an
+	       "avoid offloading" flag set for.  */
+	    && gomp_offload_target_available_p (dispatchers[d]->type))
 	  goto found;
+      /* No non-host device found.  */
       if (d_arg == acc_device_default)
 	{
 	  d = acc_device_host;
@@ -168,7 +178,7 @@  resolve_device (acc_device_t d, bool fail_is_error)
       break;
 
     default:
-      if (d > _ACC_device_hwm)
+      if (d >= _ACC_device_hwm)
 	{
 	  if (fail_is_error)
 	    goto unsupported_device;
@@ -181,7 +191,8 @@  resolve_device (acc_device_t d, bool fail_is_error)
 
   assert (d != acc_device_none
 	  && d != acc_device_default
-	  && d != acc_device_not_host);
+	  && d != acc_device_not_host
+	  && d < _ACC_device_hwm);
 
   if (dispatchers[d] == NULL && fail_is_error)
     {
diff --git libgomp/target.c libgomp/target.c
index 96fe3d5..afcbedb 100644
--- libgomp/target.c
+++ libgomp/target.c
@@ -1165,12 +1165,19 @@  gomp_unload_image_from_device (struct gomp_device_descr *devicep,
 
 /* This function should be called from every offload image while loading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, this image (TARGET_DATA) is stored as an "avoid
+   offloading" flag, and the TARGET_TYPE will not be considered by default
+   until this image gets unregistered.  */
 
 void
 GOMP_offload_register_ver (unsigned version, const void *host_table,
 			   int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   if (GOMP_VERSION_LIB (version) > GOMP_VERSION)
@@ -1179,16 +1186,19 @@  GOMP_offload_register_ver (unsigned version, const void *host_table,
   
   gomp_mutex_lock (&register_lock);
 
-  /* Load image to all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_load_image_to_device (devicep, version,
-				   host_table, target_data, true);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Load image to all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_load_image_to_device (devicep, version,
+				       host_table, target_data, true);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Insert image to array of pending images.  */
@@ -1214,26 +1224,36 @@  GOMP_offload_register (const void *host_table, int target_type,
 
 /* This function should be called from every offload image while unloading.
    It gets the descriptor of the host func and var tables HOST_TABLE, TYPE of
-   the target, and TARGET_DATA needed by target plugin.  */
+   the target, and TARGET_DATA needed by target plugin.
+
+   If HOST_TABLE is NULL, the "avoid offloading" flag gets cleared for this
+   image (TARGET_DATA), and this TARGET_TYPE may again be considered by
+   default.  */
 
 void
 GOMP_offload_unregister_ver (unsigned version, const void *host_table,
 			     int target_type, const void *target_data)
 {
+  gomp_debug (0, "%s (%u, %p, %d, %p)\n", __FUNCTION__,
+	      version, host_table, target_type, target_data);
+
   int i;
 
   gomp_mutex_lock (&register_lock);
 
-  /* Unload image from all initialized devices.  */
-  for (i = 0; i < num_devices; i++)
+  if (host_table != NULL)
     {
-      struct gomp_device_descr *devicep = &devices[i];
-      gomp_mutex_lock (&devicep->lock);
-      if (devicep->type == target_type
-	  && devicep->state == GOMP_DEVICE_INITIALIZED)
-	gomp_unload_image_from_device (devicep, version,
-				       host_table, target_data);
-      gomp_mutex_unlock (&devicep->lock);
+      /* Unload image from all initialized devices.  */
+      for (i = 0; i < num_devices; i++)
+	{
+	  struct gomp_device_descr *devicep = &devices[i];
+	  gomp_mutex_lock (&devicep->lock);
+	  if (devicep->type == target_type
+	      && devicep->state == GOMP_DEVICE_INITIALIZED)
+	    gomp_unload_image_from_device (devicep, version,
+					   host_table, target_data);
+	  gomp_mutex_unlock (&devicep->lock);
+	}
     }
 
   /* Remove image from array of pending images.  */
@@ -1267,7 +1287,8 @@  gomp_init_device (struct gomp_device_descr *devicep)
   for (i = 0; i < num_offload_images; i++)
     {
       struct offload_image_descr *image = &offload_images[i];
-      if (image->type == devicep->type)
+      if (image->type == devicep->type
+	  && image->host_table != NULL)
 	gomp_load_image_to_device (devicep, image->version,
 				   image->host_table, image->target_data,
 				   false);
@@ -1287,7 +1308,8 @@  gomp_unload_device (struct gomp_device_descr *devicep)
       for (i = 0; i < num_offload_images; i++)
 	{
 	  struct offload_image_descr *image = &offload_images[i];
-	  if (image->type == devicep->type)
+	  if (image->type == devicep->type
+	      && image->host_table != NULL)
 	    gomp_unload_image_from_device (devicep, image->version,
 					   image->host_table,
 					   image->target_data);
@@ -1311,6 +1333,62 @@  gomp_free_memmap (struct splay_tree_s *mem_map)
     }
 }
 
+/* Do we have offload data available for the given offload target type?
+   Instead of verifying that *all* offload data is available that could
+   possibly be required, we instead just look for *any*.  If we later find any
+   offload data missing, that's user error.  If any offload data of this target
+   type is tagged with an "avoid offloading" flag, do not consider this target
+   type available unless it has been initialized already.  */
+
+attribute_hidden bool
+gomp_offload_target_available_p (int type)
+{
+  bool available = false;
+
+  /* Has the offload target type already been initialized?  */
+  for (int i = 0; !available && i < num_devices; i++)
+    {
+      struct gomp_device_descr *devicep = &devices[i];
+      gomp_mutex_lock (&devicep->lock);
+      if (devicep->type == type
+	  && devicep->state == GOMP_DEVICE_INITIALIZED)
+	available = true;
+      gomp_mutex_unlock (&devicep->lock);
+    }
+
+  /* If the offload target type has been initialized already, we ignore "avoid
+     offloading" flags.  This is important, because data/state may be present
+     on the device, that we must continue to use.  */
+  if (!available)
+    {
+      gomp_mutex_lock (&register_lock);
+      if (num_offload_images == 0)
+	/* If there is no offload data available at all, there is no way to
+	   later fail to find any of it for a specific offload target type.
+	   This is the case where there are no offloaded code regions in user
+	   code, but the target type can be initialized successfully, and
+	   executable directqives be used, or runtime library calls be
+	   made.  */
+	available = true;
+      else
+	{
+	  /* Can the offload target be initialized?  */
+	  for (int i = 0; !available && i < num_offload_images; i++)
+	    if (offload_images[i].type == type
+		&& offload_images[i].host_table != NULL)
+	      available = true;
+	  /* If yes, is an "avoid offloading" flag set?  */
+	  for (int i = 0; available && i < num_offload_images; i++)
+	    if (offload_images[i].type == type
+		&& offload_images[i].host_table == NULL)
+	      available = false;
+	}
+      gomp_mutex_unlock (&register_lock);
+    }
+
+  return available;
+}
+
 /* Host fallback for GOMP_target{,_ext} routines.  */
 
 static void
diff --git libgomp/testsuite/lib/libgomp.exp libgomp/testsuite/lib/libgomp.exp
index a4c9d83..8d2be80 100644
--- libgomp/testsuite/lib/libgomp.exp
+++ libgomp/testsuite/lib/libgomp.exp
@@ -344,6 +344,16 @@  proc check_effective_target_offload_device_nonshared_as { } {
     } ]
 }
 
+# Return 1 if the compiler has been configured for nvptx offloading.
+
+proc check_effective_target_nvptx_offloading_configured { } {
+    # PR libgomp/65099: Currently, we only support offloading in 64-bit
+    # configurations.
+    global offload_targets
+    return [expr [string match "*,nvptx,*" ",$offload_targets,"] \
+		&& [is-effective-target lp64] ]
+}
+
 # Return 1 if at least one nvidia board is present.
 
 proc check_effective_target_openacc_nvidia_accel_present { } {
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
index bca425e..23156d8 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-3.c
@@ -1,5 +1,3 @@ 
-/* { dg-do run } */
-
 #include <stdio.h>
 #include <stdlib.h>
 
@@ -7,7 +5,7 @@  int
 main (void)
 {
   fprintf (stderr, "CheCKpOInT\n");
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     abort ();
   }
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
index c29ca3f..f4d6a07 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/abort-4.c
@@ -1,12 +1,10 @@ 
-/* { dg-do run } */
-
 #include <stdlib.h>
 
 int
 main (int argc, char **argv)
 {
 
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (argc != 1)
       abort ();
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
new file mode 100644
index 0000000..08745fc
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-1.c
@@ -0,0 +1,28 @@ 
+/* Test that the compiler decides to "avoid offloading".  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* The ACC_DEVICE_TYPE environment variable gets set in the testing
+   framework, and that overrides the "avoid offloading" flag at run time.
+   { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+  if (y != 1)
+    __builtin_abort();
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
new file mode 100644
index 0000000..724228a
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-2.c
@@ -0,0 +1,38 @@ 
+/* Test that a user can override the compiler's "avoid offloading"
+   decision at run time.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  /* Override the compiler's "avoid offloading" decision.  */
+  acc_device_t d;
+#if defined ACC_DEVICE_TYPE_nvidia
+  d = acc_device_nvidia;
+#elif defined ACC_DEVICE_TYPE_host
+  d = acc_device_host;
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+  acc_init (d);
+
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
new file mode 100644
index 0000000..2fb5196
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/avoid-offloading-3.c
@@ -0,0 +1,29 @@ 
+/* Test that a user can override the compiler's "avoid offloading"
+   decision at compile time.  */
+
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
+/* Override the compiler's "avoid offloading" decision.
+   { dg-additional-options "-foffload-force" } */
+
+#include <openacc.h>
+
+int main(void)
+{
+  int x, y;
+
+#pragma acc data copyout(x, y)
+#pragma acc kernels
+  *((volatile int *) &x) = 33, y = acc_on_device (acc_device_host);
+
+  if (x != 33)
+    __builtin_abort();
+#if defined ACC_DEVICE_TYPE_nvidia
+  if (y != 0)
+    __builtin_abort();
+#else
+  if (y != 1)
+    __builtin_abort();
+#endif
+
+  return 0;
+}
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
index dad6d13..87ca378 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/combined-directives-1.c
@@ -1,6 +1,6 @@ 
 /* This test exercises combined directives.  */
 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
@@ -33,7 +33,7 @@  main (int argc, char **argv)
 	abort ();
     }
 
-#pragma acc kernels loop copy (a[0:N]) copy (b[0:N])
+#pragma acc kernels loop copy (a[0:N]) copy (b[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   for (i = 0; i < N; i++)
     {
       b[i] = 3.0;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
index 1ac0b95..8f0144c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/default-1.c
@@ -1,4 +1,4 @@ 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include  <openacc.h>
 
@@ -51,7 +51,7 @@  int test_kernels ()
     ary[i] = ~0;
 
   /* val defaults to copy, ary defaults to copy.  */
-#pragma acc kernels copy(ondev)
+#pragma acc kernels copy(ondev) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     ondev = acc_on_device (acc_device_not_host);
 #pragma acc loop 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
index e271a37..9a5f7b1 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/deviceptr-1.c
@@ -1,5 +1,3 @@ 
-/* { dg-do run } */
-
 #include <stdlib.h>
 
 int main (void)
@@ -10,7 +8,7 @@  int main (void)
   a = A;
 
 #pragma acc data copyout (a_1, a_2)
-#pragma acc kernels deviceptr (a)
+#pragma acc kernels deviceptr (a) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a_1 = a;
     a_2 = &a;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
index 51745ba..3ef6f9b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c
@@ -1,4 +1,5 @@ 
 /* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 /* { dg-additional-options "-lcuda -lcublas -lcudart" } */
 
 #include <stdlib.h>
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
index 3acfdf5..614ad33 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-1.c
@@ -1,4 +1,4 @@ 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
@@ -73,7 +73,7 @@  int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyin (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (i != -1 || j != -2)
       abort ();
@@ -96,7 +96,7 @@  int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copyout (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     i = 2;
     j = 1;
@@ -110,7 +110,7 @@  int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_copy (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     if (i != -1 || j != -2)
       abort ();
@@ -126,7 +126,7 @@  int main (void)
   i = -1;
   j = -2;
   v = 0;
-#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j)
+#pragma acc kernels /* copyout */ present_or_copyout (v) present_or_create (i, j) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     i = 2;
     j = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
index 0f323c8..8d5101d 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
@@ -1,4 +1,4 @@ 
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@  main (void)
   unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
   unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
 
-#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
index 654e750..3726b0c 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
@@ -1,4 +1,4 @@ 
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@  main (void)
   unsigned int *b = a;
   unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
 
-#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
index 44d4fd2..eea4f76 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
@@ -1,4 +1,4 @@ 
-/* { dg-additional-options "-O2 -fipa-pta" } */
+/* { dg-additional-options "-fipa-pta" } */
 
 #include <stdlib.h>
 
@@ -11,7 +11,7 @@  main (void)
   unsigned int b[N];
   unsigned int c[N];
 
-#pragma acc kernels pcopyout (a, b, c)
+#pragma acc kernels pcopyout (a, b, c) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 0;
     b[0] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
index a68a7cd..860b6da 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-empty.c
@@ -1,6 +1,6 @@ 
 int
 main (void)
 {
-#pragma acc kernels
+#pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   ;
 }
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
index 2e4100f..5cdc200 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-2.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@ 
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = a[0] + 1;
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
index b3e736b..2e4d4d2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-3.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,8 +7,7 @@ 
 unsigned int
 foo (int n, unsigned int *a)
 {
-
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     for (int i = 0; i < n; i++)
       a[i] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
index 8b9affa..5bf00db 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-4.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@ 
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     a[0] = 2;
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
index 83d4e7f..d39b667 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-5.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -9,7 +8,7 @@  unsigned int
 foo (int n, unsigned int *a)
 {
   int r;
-#pragma acc kernels copyout(r) copy (a[0:N])
+#pragma acc kernels copyout(r) copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     r = a[0];
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
index 01d5e5e..bb2e85b 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq-6.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,7 +7,7 @@ 
 unsigned int
 foo (int n, unsigned int *a)
 {
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     int r = a[0];
 
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
index 61d1283..e513827 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-and-seq.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -8,8 +7,7 @@ 
 unsigned int
 foo (int n, unsigned int *a)
 {
-
-#pragma acc kernels copy (a[0:N])
+#pragma acc kernels copy (a[0:N]) /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
     for (int i = 0; i < n; i++)
       a[i] = 1;
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
index f7f04cb..c4791a4 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-loop-collapse.c
@@ -1,4 +1,3 @@ 
-/* { dg-do run } */
 /* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
@@ -11,7 +10,7 @@  void __attribute__((noinline, noclone))
 foo (int m, int n)
 {
   int i, j;
-  #pragma acc kernels
+  #pragma acc kernels /* { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } } */
   {
 #pragma acc loop collapse(2)
     for (i = 0; i < m; i++)
diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
index c164598..94a5ae2 100644
--- libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
+++ libgomp/testsuite/libgomp.oacc-c-c++-common/nested-2.c
@@ -1,4 +1,4 @@ 
-/* { dg-do run } */
+/* { dg-additional-options "-ftree-parallelize-loops=32" } */
 
 #include <stdlib.h>
 
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
new file mode 100644
index 0000000..5f18b94
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-1.f
@@ -0,0 +1,32 @@ 
+! Test that the compiler decides to "avoid offloading".
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
+! The ACC_DEVICE_TYPE environment variable gets set in the testing
+! framework, and that overrides the "avoid offloading" flag at run time.
+! { dg-xfail-run-if "TODO" { openacc_nvidia_accel_selected } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST);
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_host || defined ACC_DEVICE_TYPE_nvidia
+      IF (.NOT. Y) CALL ABORT
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
new file mode 100644
index 0000000..51801ad
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-2.f
@@ -0,0 +1,41 @@ 
+! Test that a user can override the compiler's "avoid offloading"
+! decision at run time.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!     Override the compiler's "avoid offloading" decision.
+#if defined ACC_DEVICE_TYPE_nvidia
+      D = ACC_DEVICE_NVIDIA
+#elif defined ACC_DEVICE_TYPE_host
+      D = ACC_DEVICE_HOST
+#else
+# error Not ported to this ACC_DEVICE_TYPE
+#endif
+      CALL ACC_INIT (D)
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
new file mode 100644
index 0000000..bea6ab8
--- /dev/null
+++ libgomp/testsuite/libgomp.oacc-fortran/avoid-offloading-3.f
@@ -0,0 +1,31 @@ 
+! Test that a user can override the compiler's "avoid offloading"
+! decision at compile time.
+
+! { dg-do run }
+! { dg-additional-options "-cpp" }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! Override the compiler's "avoid offloading" decision.
+! { dg-additional-options "-foffload-force" }
+
+      IMPLICIT NONE
+      INCLUDE "openacc_lib.h"
+
+      INTEGER :: D
+      INTEGER, VOLATILE :: X
+      LOGICAL :: Y
+
+!$ACC DATA COPYOUT(X, Y)
+!$ACC KERNELS
+      X = 33
+      Y = ACC_ON_DEVICE (ACC_DEVICE_HOST)
+!$ACC END KERNELS
+!$ACC END DATA
+
+      IF (X .NE. 33) CALL ABORT
+#if defined ACC_DEVICE_TYPE_nvidia
+      IF (Y) CALL ABORT
+#else
+      IF (.NOT. Y) CALL ABORT
+#endif
+
+      END
diff --git libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90 libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
index 94100b2..4b52579 100644
--- libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/combined-directives-1.f90
@@ -1,6 +1,9 @@ 
 ! This test exercises combined directives.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
 program main
   integer, parameter :: n = 32
@@ -27,7 +30,7 @@  program main
   !$acc kernels loop copy (a(1:n)) copy (b(1:n))
   do i = 1, n
     b(i) = 3.0;
-    a(i) = a(i) + b(i)
+    a(i) = a(i) + b(i) ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
   end do
 
   do i = 1, n
diff --git libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90 libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
index 4afb562..b9298c7 100644
--- libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/non-scalar-data.f90
@@ -2,6 +2,9 @@ 
 ! offloaded regions are properly mapped using present_or_copy.
 
 ! { dg-do run }
+! { dg-additional-options "-ftree-parallelize-loops=32" }
+! The "avoid offloading" warning is only triggered for -O2 and higher.
+! { dg-xfail-if "n/a" { nvptx_offloading_configured } { "-O0" "-O1" } { "" } }
 
 program main
   implicit none
@@ -30,7 +33,7 @@  subroutine kernels (array, n)
   integer, dimension (n) :: array
   integer :: n, i
 
-  !$acc kernels
+  !$acc kernels ! { dg-warning "OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty" "" { target nvptx_offloading_configured } }
   do i = 1, n
      array(i) = i
   end do