diff mbox series

[og9] Port OpenACC profiling interface to OG9

Message ID b9fdfd2d-fe3e-aaa0-9c96-61d7a0948a48@codesourcery.com
State New
Headers show
Series [og9] Port OpenACC profiling interface to OG9 | expand

Commit Message

Kwok Cheung Yeung June 17, 2019, 1:27 p.m. UTC
Hello

This is a straightforward port of the OpenACC profiling interface from 
OG8 to OG9, with a few tweaks to compensate for patches that have not 
been carried over to OG9. I have also bundled in the profiling hooks and 
documentation updates that have been added since the original patch in OG8.

This has been tested on x86_64 with no regressions on the libgomp testsuite.

Okay to push to openacc-gcc-9-branch?

Thanks

Kwok
From 8af550dcfc6c339e19293d59692cb595a2cb2d69 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Tue, 28 Feb 2017 17:36:03 +0000
Subject: [PATCH] OpenACC 2.6 Profiling Interface

	libgomp/
	* acc_prof.h: New file.
	* oacc-profiling-acc_register_library.c: Likewise.
	* oacc-profiling.c: Likewise.
	* Makefile.am (nodist_libsubinclude_HEADERS, libgomp_la_SOURCES):
	Add these, respectively.
	* Makefile.in: Regenerate.
	* libgomp/config/nvptx/oacc-profiling-acc_register_library.c:
	New empty file.
	* libgomp/config/nvptx/oacc-profiling.c: Likewise.
	* env.c (initialize_env): Call goacc_profiling_initialize.
	* libgomp-plugin.c: New function
	GOMP_PLUGIN_goacc_profiling_dispatch.
	* libgomp-plugin.h: Declare function
	GOMP_PLUGIN_goacc_profiling_dispatch.
	* oacc-plugin.c: New function GOMP_PLUGIN_goacc_thread.
	* oacc-plugin.h: Declare function GOMP_PLUGIN_goacc_thread.
	* libgomp.map (OACC_2.5): Add acc_prof_lookup, acc_prof_register,
	acc_prof_unregister, and acc_register_library.
	Add GOMP_PLUGIN_goacc_profiling_dispatch, and
	GOMP_PLUGIN_goacc_thread with new GOMP_PLUGIN_1.3 symbol version.
	* oacc-int.h (struct goacc_thread): Add "acc_prof_info
	*prof_info", "acc_api_info *api_info", and "bool
	prof_callbacks_enabled" members.
	Declare functions goacc_profiling_initialize,
	goacc_profiling_dispatch_p, and goacc_profiling_dispatch.
	* oacc-init.c (acc_init_1): Add "acc_construct_t", and "int"
	formal parameters.  Adjust all users.
	(goacc_lazy_initialize): Don't call acc_init; copy code here,
	instead.
	(acc_init_1, goacc_attach_host_thread_to_device, acc_init)
	(goacc_lazy_initialize): Update for OpenACC Profiling Interface.
	* oacc-parallel.c (GOACC_parallel_keyed): Likewise.
	* plugin/plugin-nvptx.c (cuda_map_create, cuda_map_destroy)
	(map_init, map_fini, map_pop, map_push): Add "struct goacc_thread
	*" formal parameter.  Adjust all users.
	(select_stream_for_async, event_gc, nvptx_exec, nvptx_host2dev)
	(nvptx_dev2host, nvptx_set_cuda_stream): Call
	GOMP_PLUGIN_goacc_thread instead of nvptx_thread.
	(cuda_map_create, cuda_map_destroy, nvptx_exec, nvptx_alloc)
	(nvptx_free, nvptx_host2dev, nvptx_dev2host): Update for OpenACC
	Profiling Interface.
	* libgomp.texi: New chapter "OpenACC Profiling Interface".
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c: New
	file.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c:
	Likewise.

	libgomp/
	* oacc-async.c (acc_async_test, acc_async_test_all, acc_wait)
	(acc_wait_async, acc_wait_all, acc_wait_all_async): Set up
	profiling.
	* oacc-cuda.c (acc_get_current_cuda_device)
	(acc_get_current_cuda_context, acc_get_cuda_stream)
	(acc_set_cuda_stream): Likewise.
	* oacc-init.c (acc_set_device_type, acc_get_device_type)
	(acc_get_device_num): Likewise.
	* oacc-mem.c (acc_malloc, acc_free, memcpy_tofrom_device)
	(acc_map_data, acc_unmap_data, present_create_copy)
	(delete_copyout, update_dev_host): Likewise.
	* oacc-parallel.c (GOACC_data_start, GOACC_data_end)
	(GOACC_enter_exit_data, GOACC_update, GOACC_wait): Likewise.
	* oacc-profiling.c (goacc_profiling_setup_p): New function.
	(goacc_profiling_dispatch_p): Add a "bool" formal parameter.
	Adjust all users.
	* oacc-int.h (goacc_profiling_setup_p)
	(goacc_profiling_dispatch_p): Update.
	* plugin/plugin-nvptx.c (nvptx_exec, nvptx_wait, nvptx_wait_all):
	Generate more profiling events.
	* libgomp.texi (OpenACC Environment Variables): Mention
	"ACC_PROFLIB".
	(OpenACC Profiling Interface): Update.

	libgomp/
	* oacc-init.c (get_property_any): Add profiling code.

	libgomp/
	* acc_prof.h (_ACC_PROF_INFO_VERSION): Update to 201711.
	* libgomp.texi: Update mentions of OpenACC version to 2.6.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
	(cb_any_event): Update expected profiling info version to 201711.
---
 libgomp/ChangeLog.openacc                          |  96 +++
 libgomp/Makefile.am                                |   3 +-
 libgomp/Makefile.in                                |  14 +-
 libgomp/acc_prof.h                                 | 237 +++++++
 .../nvptx/oacc-profiling-acc_register_library.c    |   0
 libgomp/config/nvptx/oacc-profiling.c              |   0
 libgomp/env.c                                      |   2 +
 libgomp/libgomp-plugin.c                           |   9 +
 libgomp/libgomp-plugin.h                           |   6 +
 libgomp/libgomp.map                                |  11 +
 libgomp/libgomp.texi                               | 308 +++++++++
 libgomp/oacc-async.c                               | 114 +++-
 libgomp/oacc-cuda.c                                |  87 ++-
 libgomp/oacc-init.c                                | 203 +++++-
 libgomp/oacc-int.h                                 |  14 +
 libgomp/oacc-mem.c                                 | 154 ++++-
 libgomp/oacc-parallel.c                            | 475 +++++++++++++-
 libgomp/oacc-plugin.c                              |  13 +
 libgomp/oacc-plugin.h                              |   3 +
 libgomp/oacc-profiling-acc_register_library.c      |  39 ++
 libgomp/oacc-profiling.c                           | 666 ++++++++++++++++++++
 libgomp/plugin/plugin-nvptx.c                      | 178 ++++++
 .../acc_prof-dispatch-1.c                          | 344 ++++++++++
 .../libgomp.oacc-c-c++-common/acc_prof-init-1.c    | 306 +++++++++
 .../libgomp.oacc-c-c++-common/acc_prof-kernels-1.c | 253 ++++++++
 .../acc_prof-parallel-1.c                          | 698 +++++++++++++++++++++
 .../acc_prof-valid_bytes-1.c                       | 172 +++++
 .../libgomp.oacc-c-c++-common/acc_prof-version-1.c |  55 ++
 28 files changed, 4399 insertions(+), 61 deletions(-)
 create mode 100644 libgomp/acc_prof.h
 create mode 100644 libgomp/config/nvptx/oacc-profiling-acc_register_library.c
 create mode 100644 libgomp/config/nvptx/oacc-profiling.c
 create mode 100644 libgomp/oacc-profiling-acc_register_library.c
 create mode 100644 libgomp/oacc-profiling.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c

Comments

Thomas Schwinge June 17, 2019, 5:24 p.m. UTC | #1
Hi Kwok!

On Mon, 17 Jun 2019 14:27:46 +0100, Kwok Cheung Yeung <kcy@codesourcery.com> wrote:
> This is a straightforward port of the OpenACC profiling interface from 
> OG8 to OG9, with a few tweaks

Thanks for looking into this.

> to compensate for patches that have not 
> been carried over to OG9.

Conceptually ACK.  (I have not reviewed what's missing/got dropped.)

> I have also bundled in the profiling hooks and 
> documentation updates that have been added since the original patch in OG8.

ACK.

> Okay to push to openacc-gcc-9-branch?

What you probably didn't know (sorry!) is that trunk r271346,
<http://mid.mail-archive.com/yxfplfz4anys.fsf@hertz.schwinge.homeip.net>,
contains some changes that are not in the og8 code: code and
documentation improvements, TODOs resolved, etc.  It also doesn't contain
the actual 'acc_register_library' implementation, because that one needs
further work, as discussed before.

I think what would be best, is the following approach:

  - First, backport trunk r271346 to og9.

    That might not be completely trivial, because as you know, og9
    contains a number of "feature" commits that need to include changes
    related to the profiling code.  It's the question in which order to
    sequence patches -- whether these changes related to the profiling
    code are part of the individual "feature" commits, or they're part of
    a later "profiling" commit.  As og9 has been published with "feature"
    commits not containing any profiling bits, all these things will have
    to be in a later "profiling" commit.

    However, for upstreaming this into trunk later on, it will be
    beneficial to have the backported "profiling" commit as similar as
    possible to trunk r271346, so we shall live with some short-lived
    inconsistencies, maybe even a few testuite regressions, which then
    get resolved when you...

  - Second, on top of that, add the pieces of functionality (from the
    og8) version that are missing from the trunk r271346 backport.  Of
    course, remove all changes that would worsen the state, compared to
    what trunk r271346 already contains.

So, conceptually: a first commit to backport trunk r271346, then a second
commit containing a merger of a (temporary) revert of the first commit
followed by applying the patch you just posted.  Then, remove from the
second commit all changes that worsen the state, compared to what trunk
r271346 already contains.

Or, in other words: split the patch you just posted into two, where the
first one is as close as possible to a backport of trunk r271346, then
all other changes in a second commit.

Will that work?

I'll be happy to help review these changes, especially whether something
should be part of the second commit, or get dropped.


Grüße
 Thomas
Kwok Cheung Yeung June 24, 2019, 7:37 p.m. UTC | #2
On 17/06/2019 6:24 pm, Thomas Schwinge wrote:
>> Okay to push to openacc-gcc-9-branch?
> 
> What you probably didn't know (sorry!) is that trunk r271346,
> <http://mid.mail-archive.com/yxfplfz4anys.fsf@hertz.schwinge.homeip.net>,
> contains some changes that are not in the og8 code: code and
> documentation improvements, TODOs resolved, etc.  It also doesn't contain
> the actual 'acc_register_library' implementation, because that one needs
> further work, as discussed before.
> 
> I think what would be best, is the following approach:
> 
>    - First, backport trunk r271346 to og9.
> 
>      That might not be completely trivial, because as you know, og9
>      contains a number of "feature" commits that need to include changes
>      related to the profiling code.  It's the question in which order to
>      sequence patches -- whether these changes related to the profiling
>      code are part of the individual "feature" commits, or they're part of
>      a later "profiling" commit.  As og9 has been published with "feature"
>      commits not containing any profiling bits, all these things will have
>      to be in a later "profiling" commit.
> 
>      However, for upstreaming this into trunk later on, it will be
>      beneficial to have the backported "profiling" commit as similar as
>      possible to trunk r271346, so we shall live with some short-lived
>      inconsistencies, maybe even a few testuite regressions, which then
>      get resolved when you...
> 
>    - Second, on top of that, add the pieces of functionality (from the
>      og8) version that are missing from the trunk r271346 backport.  Of
>      course, remove all changes that would worsen the state, compared to
>      what trunk r271346 already contains.
> 
> So, conceptually: a first commit to backport trunk r271346, then a second
> commit containing a merger of a (temporary) revert of the first commit
> followed by applying the patch you just posted.  Then, remove from the
> second commit all changes that worsen the state, compared to what trunk
> r271346 already contains.
> 
> Or, in other words: split the patch you just posted into two, where the
> first one is as close as possible to a backport of trunk r271346, then
> all other changes in a second commit.
> 
> Will that work?
> 

I have now ported over the mainline patch over to OG9, plus an 
additional patch on top of that to bring in the bits from OG8 that did 
not make it upstream.

I have dropped the differences in comments, TODOs, documentation etc. in 
favour of the upstream patch. There are also various places where the 
OG8 patch sets up profiling, then gotos the end of the function to tear 
it down again, whereas the mainline version just aborts early without 
setting up profiling in the first place - these I have also resolved in 
favour of the mainline version.

I have rerun the libgomp testsuite with no regressions noted.

Okay to push to openacc-gcc-9-branch?

Thanks

Kwok
From ba24e40c9c1b8c20ed77ed408bffb07e237e23ba Mon Sep 17 00:00:00 2001
From: tschwinge <tschwinge@138bc75d-0d04-0410-961f-82ee72b054a4>
Date: Fri, 17 May 2019 19:13:36 +0000
Subject: [PATCH 1/2] OpenACC Profiling Interface (incomplete)

	libgomp/
	* acc_prof.h: New file.
	* oacc-profiling.c: Likewise.
	* Makefile.am (nodist_libsubinclude_HEADERS, libgomp_la_SOURCES):
	Add these, respectively.
	* Makefile.in: Regenerate.
	* env.c (initialize_env): Call goacc_profiling_initialize.
	* oacc-plugin.c (GOMP_PLUGIN_goacc_thread)
	(GOMP_PLUGIN_goacc_profiling_dispatch): New functions.
	* oacc-plugin.h (GOMP_PLUGIN_goacc_thread)
	(GOMP_PLUGIN_goacc_profiling_dispatch): Declare.
	* libgomp.map (OACC_2.5.1): Add acc_prof_lookup,
	acc_prof_register, acc_prof_unregister, and acc_register_library.
	(GOMP_PLUGIN_1.3): Add GOMP_PLUGIN_goacc_profiling_dispatch, and
	GOMP_PLUGIN_goacc_thread.
	* oacc-int.h (struct goacc_thread): Add prof_info, api_info,
	prof_callbacks_enabled members.
	(goacc_prof_enabled, goacc_profiling_initialize)
	(_goacc_profiling_dispatch_p, _goacc_profiling_setup_p)
	(goacc_profiling_dispatch): Declare.
	(GOACC_PROF_ENABLED, GOACC_PROFILING_DISPATCH_P)
	(GOACC_PROFILING_SETUP_P): Define.
	* oacc-async.c (acc_async_test, acc_async_test_all, acc_wait)
	(acc_wait_async, acc_wait_all, acc_wait_all_async): Update for
	OpenACC Profiling Interface.
	* oacc-cuda.c (acc_get_current_cuda_device)
	(acc_get_current_cuda_context, acc_get_cuda_stream)
	(acc_set_cuda_stream): Likewise.
	* oacc-init.c (acc_init_1, goacc_attach_host_thread_to_device)
	(acc_init, acc_set_device_type, acc_get_device_type)
	(acc_get_device_num, goacc_lazy_initialize): Likewise.
	* oacc-mem.c (acc_malloc, acc_free, memcpy_tofrom_device)
	(acc_deviceptr, acc_hostptr, acc_is_present, acc_map_data)
	(acc_unmap_data, present_create_copy, delete_copyout)
	(update_dev_host): Likewise.
	* oacc-parallel.c (GOACC_parallel_keyed, GOACC_data_start)
	(GOACC_data_end, GOACC_enter_exit_data, GOACC_update, GOACC_wait):
	Likewise.
	* plugin/plugin-nvptx.c (nvptx_exec, nvptx_alloc, nvptx_free)
	(GOMP_OFFLOAD_openacc_exec, GOMP_OFFLOAD_openacc_async_exec):
	Likewise.
	* libgomp.texi: Update.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c: New
	file.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c:
	Likewise.
	* testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c:
	Likewise.

git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@271346 138bc75d-0d04-0410-961f-82ee72b054a4
---
 libgomp/ChangeLog.openacc                          |  55 ++
 libgomp/Makefile.am                                |   4 +-
 libgomp/Makefile.in                                |   7 +-
 libgomp/acc_prof.h                                 | 252 ++++++++
 libgomp/env.c                                      |   2 +
 libgomp/libgomp.map                                |  16 +-
 libgomp/libgomp.texi                               | 412 ++++++++++++
 libgomp/oacc-async.c                               | 121 +++-
 libgomp/oacc-cuda.c                                |  81 ++-
 libgomp/oacc-init.c                                | 124 +++-
 libgomp/oacc-int.h                                 |  29 +
 libgomp/oacc-mem.c                                 | 113 +++-
 libgomp/oacc-parallel.c                            | 457 ++++++++++++-
 libgomp/oacc-plugin.c                              |  17 +
 libgomp/oacc-plugin.h                              |   7 +
 libgomp/oacc-profiling.c                           | 662 +++++++++++++++++++
 libgomp/plugin/plugin-nvptx.c                      | 194 +++++-
 .../acc_prof-dispatch-1.c                          | 353 ++++++++++
 .../libgomp.oacc-c-c++-common/acc_prof-init-1.c    | 316 +++++++++
 .../libgomp.oacc-c-c++-common/acc_prof-kernels-1.c | 229 +++++++
 .../acc_prof-parallel-1.c                          | 719 +++++++++++++++++++++
 .../acc_prof-valid_bytes-1.c                       | 226 +++++++
 .../libgomp.oacc-c-c++-common/acc_prof-version-1.c |  76 +++
 23 files changed, 4401 insertions(+), 71 deletions(-)
 create mode 100644 libgomp/acc_prof.h
 create mode 100644 libgomp/oacc-profiling.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
 create mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c

diff --git a/libgomp/ChangeLog.openacc b/libgomp/ChangeLog.openacc
index b3bcb31..04c7778 100644
--- a/libgomp/ChangeLog.openacc
+++ b/libgomp/ChangeLog.openacc
@@ -1,3 +1,58 @@
+2019-05-17  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* acc_prof.h: New file.
+	* oacc-profiling.c: Likewise.
+	* Makefile.am (nodist_libsubinclude_HEADERS, libgomp_la_SOURCES):
+	Add these, respectively.
+	* Makefile.in: Regenerate.
+	* env.c (initialize_env): Call goacc_profiling_initialize.
+	* oacc-plugin.c (GOMP_PLUGIN_goacc_thread)
+	(GOMP_PLUGIN_goacc_profiling_dispatch): New functions.
+	* oacc-plugin.h (GOMP_PLUGIN_goacc_thread)
+	(GOMP_PLUGIN_goacc_profiling_dispatch): Declare.
+	* libgomp.map (OACC_2.5.1): Add acc_prof_lookup,
+	acc_prof_register, acc_prof_unregister, and acc_register_library.
+	(GOMP_PLUGIN_1.3): Add GOMP_PLUGIN_goacc_profiling_dispatch, and
+	GOMP_PLUGIN_goacc_thread.
+	* oacc-int.h (struct goacc_thread): Add prof_info, api_info,
+	prof_callbacks_enabled members.
+	(goacc_prof_enabled, goacc_profiling_initialize)
+	(_goacc_profiling_dispatch_p, _goacc_profiling_setup_p)
+	(goacc_profiling_dispatch): Declare.
+	(GOACC_PROF_ENABLED, GOACC_PROFILING_DISPATCH_P)
+	(GOACC_PROFILING_SETUP_P): Define.
+	* oacc-async.c (acc_async_test, acc_async_test_all, acc_wait)
+	(acc_wait_async, acc_wait_all, acc_wait_all_async): Update for
+	OpenACC Profiling Interface.
+	* oacc-cuda.c (acc_get_current_cuda_device)
+	(acc_get_current_cuda_context, acc_get_cuda_stream)
+	(acc_set_cuda_stream): Likewise.
+	* oacc-init.c (acc_init_1, goacc_attach_host_thread_to_device)
+	(acc_init, acc_set_device_type, acc_get_device_type)
+	(acc_get_device_num, goacc_lazy_initialize): Likewise.
+	* oacc-mem.c (acc_malloc, acc_free, memcpy_tofrom_device)
+	(acc_deviceptr, acc_hostptr, acc_is_present, acc_map_data)
+	(acc_unmap_data, present_create_copy, delete_copyout)
+	(update_dev_host): Likewise.
+	* oacc-parallel.c (GOACC_parallel_keyed, GOACC_data_start)
+	(GOACC_data_end, GOACC_enter_exit_data, GOACC_update, GOACC_wait):
+	Likewise.
+	* plugin/plugin-nvptx.c (nvptx_exec, nvptx_alloc, nvptx_free)
+	(GOMP_OFFLOAD_openacc_exec, GOMP_OFFLOAD_openacc_async_exec):
+	Likewise.
+	* libgomp.texi: Update.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c: New
+	file.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c:
+	Likewise.
+
 2019-05-28  Julian Brown  <julian@codesourcery.com>
 
 	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 2db3865..5091870 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -72,7 +72,7 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c error.c \
 	proc.c sem.c bar.c ptrlock.c time.c fortran.c affinity.c target.c \
 	splay-tree.c libgomp-plugin.c oacc-parallel.c oacc-host.c oacc-init.c \
 	oacc-mem.c oacc-async.c oacc-plugin.c oacc-cuda.c priority_queue.c \
-	affinity-fmt.c teams.c
+	affinity-fmt.c teams.c oacc-profiling.c
 
 include $(top_srcdir)/plugin/Makefrag.am
 
@@ -81,7 +81,7 @@ libgomp_la_SOURCES += openacc.f90
 endif
 
 nodist_noinst_HEADERS = libgomp_f.h
-nodist_libsubinclude_HEADERS = omp.h openacc.h
+nodist_libsubinclude_HEADERS = omp.h openacc.h acc_prof.h
 if USE_FORTRAN
 nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \
 	openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 820a576..23efc84 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -216,7 +216,7 @@ am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	target.lo splay-tree.lo libgomp-plugin.lo oacc-parallel.lo \
 	oacc-host.lo oacc-init.lo oacc-mem.lo oacc-async.lo \
 	oacc-plugin.lo oacc-cuda.lo priority_queue.lo affinity-fmt.lo \
-	teams.lo $(am__objects_1)
+	teams.lo oacc-profiling.lo $(am__objects_1)
 libgomp_la_OBJECTS = $(am_libgomp_la_OBJECTS)
 AM_V_P = $(am__v_P_@AM_V@)
 am__v_P_ = $(am__v_P_@AM_DEFAULT_V@)
@@ -553,7 +553,7 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c \
 	affinity.c target.c splay-tree.c libgomp-plugin.c \
 	oacc-parallel.c oacc-host.c oacc-init.c oacc-mem.c \
 	oacc-async.c oacc-plugin.c oacc-cuda.c priority_queue.c \
-	affinity-fmt.c teams.c $(am__append_3)
+	affinity-fmt.c teams.c oacc-profiling.c $(am__append_3)
 
 # Nvidia PTX OpenACC plugin.
 @PLUGIN_NVPTX_TRUE@libgomp_plugin_nvptx_version_info = -version-info $(libtool_VERSION)
@@ -577,7 +577,7 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c \
 @PLUGIN_HSA_TRUE@libgomp_plugin_hsa_la_LIBADD = libgomp.la $(PLUGIN_HSA_LIBS)
 @PLUGIN_HSA_TRUE@libgomp_plugin_hsa_la_LIBTOOLFLAGS = --tag=disable-static
 nodist_noinst_HEADERS = libgomp_f.h
-nodist_libsubinclude_HEADERS = omp.h openacc.h
+nodist_libsubinclude_HEADERS = omp.h openacc.h acc_prof.h
 @USE_FORTRAN_TRUE@nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \
 @USE_FORTRAN_TRUE@	openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod
 
@@ -755,6 +755,7 @@ distclean-compile:
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-mem.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-parallel.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-plugin.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-profiling.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/ordered.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/parallel.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/priority_queue.Plo@am__quote@
diff --git a/libgomp/acc_prof.h b/libgomp/acc_prof.h
new file mode 100644
index 0000000..c7a2197
--- /dev/null
+++ b/libgomp/acc_prof.h
@@ -0,0 +1,252 @@
+/* OpenACC Profiling Interface
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   Contributed by Mentor, a Siemens Business.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _ACC_PROF_H
+#define _ACC_PROF_H 1
+
+
+/* The OpenACC specification doesn't say so explicitly, but as its Profiling
+   Interface explicitly makes use of, for example, <openacc.h>'s
+   'acc_device_t', we supposedly are to '#include' that file here.  */
+
+#include <openacc.h>
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/* Events.  */
+
+typedef enum acc_event_t
+{
+  acc_ev_none = 0,
+  acc_ev_device_init_start,
+  acc_ev_device_init_end,
+  acc_ev_device_shutdown_start,
+  acc_ev_device_shutdown_end,
+  acc_ev_runtime_shutdown,
+  acc_ev_create,
+  acc_ev_delete,
+  acc_ev_alloc,
+  acc_ev_free,
+  acc_ev_enter_data_start,
+  acc_ev_enter_data_end,
+  acc_ev_exit_data_start,
+  acc_ev_exit_data_end,
+  acc_ev_update_start,
+  acc_ev_update_end,
+  acc_ev_compute_construct_start,
+  acc_ev_compute_construct_end,
+  acc_ev_enqueue_launch_start,
+  acc_ev_enqueue_launch_end,
+  acc_ev_enqueue_upload_start,
+  acc_ev_enqueue_upload_end,
+  acc_ev_enqueue_download_start,
+  acc_ev_enqueue_download_end,
+  acc_ev_wait_start,
+  acc_ev_wait_end,
+  acc_ev_last
+} acc_event_t;
+
+
+/* Callbacks Signature.  */
+
+/* "The datatype 'ssize_t' means a signed 32-bit integer for a 32-bit binary
+   and a 64-bit integer for a 64-bit binary".  */
+typedef signed long int _acc_prof_ssize_t;
+/* "The datatype 'size_t' means an unsigned 32-bit integer for a 32-bit binary
+   and a 64-bit integer for a 64-bit binary".  */
+typedef unsigned long int _acc_prof_size_t;
+/* "The datatype 'int' means a 32-bit integer for both 32-bit and 64-bit
+   binaries".  */
+typedef int _acc_prof_int_t;
+
+/* Internal helpers: a struct's 'valid_bytes' may be less than its 'sizeof'.  */
+#define _ACC_PROF_VALID_BYTES_STRUCT(_struct, _lastfield, _valid_bytes_lastfield) \
+  offsetof (_struct, _lastfield) + (_valid_bytes_lastfield)
+#if 0 /* Untested.  */
+#define _ACC_PROF_VALID_BYTES_TYPE_N(_type, _n, _valid_bytes_type) \
+  ((_n - 1) * sizeof (_type) + (_valid_bytes_type))
+#endif
+#define _ACC_PROF_VALID_BYTES_BASICTYPE(_basictype) \
+  (sizeof (_basictype))
+
+typedef struct acc_prof_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  _acc_prof_int_t version;
+  acc_device_t device_type;
+  _acc_prof_int_t device_number;
+  _acc_prof_int_t thread_id;
+  _acc_prof_ssize_t async;
+  _acc_prof_ssize_t async_queue;
+  const char *src_file;
+  const char *func_name;
+  _acc_prof_int_t line_no, end_line_no;
+  _acc_prof_int_t func_line_no, func_end_line_no;
+#define _ACC_PROF_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_prof_info, func_end_line_no, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (_acc_prof_int_t))
+} acc_prof_info;
+
+/* We implement the OpenACC 2.6 Profiling Interface.  */
+
+#define _ACC_PROF_INFO_VERSION 201711
+
+typedef enum acc_construct_t
+{
+  acc_construct_parallel = 0,
+  acc_construct_kernels,
+  acc_construct_loop,
+  acc_construct_data,
+  acc_construct_enter_data,
+  acc_construct_exit_data,
+  acc_construct_host_data,
+  acc_construct_atomic,
+  acc_construct_declare,
+  acc_construct_init,
+  acc_construct_shutdown,
+  acc_construct_set,
+  acc_construct_update,
+  acc_construct_routine,
+  acc_construct_wait,
+  acc_construct_runtime_api,
+  acc_construct_serial
+} acc_construct_t;
+
+typedef struct acc_data_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+  const char *var_name;
+  _acc_prof_size_t bytes;
+  const void *host_ptr;
+  const void *device_ptr;
+#define _ACC_DATA_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_data_event_info, device_ptr, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_data_event_info;
+
+typedef struct acc_launch_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+  const char *kernel_name;
+  _acc_prof_size_t num_gangs, num_workers, vector_length;
+#define _ACC_LAUNCH_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_launch_event_info, vector_length, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (_acc_prof_size_t))
+} acc_launch_event_info;
+
+typedef struct acc_other_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+#define _ACC_OTHER_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_other_event_info, tool_info, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_other_event_info;
+
+typedef union acc_event_info
+{
+  acc_event_t event_type;
+  acc_data_event_info data_event;
+  acc_launch_event_info launch_event;
+  acc_other_event_info other_event;
+} acc_event_info;
+
+typedef enum acc_device_api
+{
+  acc_device_api_none = 0,
+  acc_device_api_cuda,
+  acc_device_api_opencl,
+  acc_device_api_coi,
+  acc_device_api_other
+} acc_device_api;
+
+typedef struct acc_api_info
+{
+  acc_device_api device_api;
+  _acc_prof_int_t valid_bytes;
+  acc_device_t device_type;
+  _acc_prof_int_t vendor;
+  const void *device_handle;
+  const void *context_handle;
+  const void *async_handle;
+#define _ACC_API_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_api_info, async_handle, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_api_info;
+
+/* Don't tag 'acc_prof_callback' as '__GOACC_NOTHROW': these functions are
+   provided by user code, and must be expected to do anything.  */
+typedef void (*acc_prof_callback) (acc_prof_info *, acc_event_info *,
+				   acc_api_info *);
+
+
+/* Loading the Library.  */
+
+typedef enum acc_register_t
+{
+  acc_reg = 0,
+  acc_toggle = 1,
+  acc_toggle_per_thread = 2
+} acc_register_t;
+
+typedef void (*acc_prof_reg) (acc_event_t, acc_prof_callback, acc_register_t);
+extern void acc_prof_register (acc_event_t, acc_prof_callback,
+			       acc_register_t) __GOACC_NOTHROW;
+extern void acc_prof_unregister (acc_event_t, acc_prof_callback,
+				 acc_register_t) __GOACC_NOTHROW;
+typedef void (*acc_query_fn) ();
+typedef acc_query_fn (*acc_prof_lookup_func) (const char *);
+extern acc_query_fn acc_prof_lookup (const char *) __GOACC_NOTHROW;
+/* Don't tag 'acc_register_library' as '__GOACC_NOTHROW': this function can be
+   overridden by user code, and must be expected to do anything.  */
+extern void acc_register_library (acc_prof_reg, acc_prof_reg,
+				  acc_prof_lookup_func);
+
+
+#ifdef __cplusplus
+}
+#endif
+
+
+#endif /* _ACC_PROF_H */
diff --git a/libgomp/env.c b/libgomp/env.c
index 7937fbb..29d9f58 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -1425,5 +1425,7 @@ initialize_env (void)
   parse_gomp_openacc_dim ();
 
   goacc_runtime_initialize ();
+
+  goacc_profiling_initialize ();
 }
 #endif /* LIBGOMP_OFFLOADED_ONLY */
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 717ae66..3be4a0c 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -476,6 +476,14 @@ OACC_2.5 {
 	acc_update_self_async_array_h_;
 } OACC_2.0.1;
 
+OACC_2.5.1 {
+  global:
+	acc_prof_lookup;
+	acc_prof_register;
+	acc_prof_unregister;
+	acc_register_library;
+} OACC_2.5;
+
 OACC_2.6 {
   global:
 	acc_get_property;
@@ -488,7 +496,7 @@ OACC_2.6 {
 	acc_detach_async;
 	acc_detach_finalize;
 	acc_detach_finalize_async;
-} OACC_2.5;
+} OACC_2.5.1;
 
 GOACC_2.0 {
   global:
@@ -536,3 +544,9 @@ GOMP_PLUGIN_1.2 {
   global:
 	GOMP_PLUGIN_acc_default_dim;
 } GOMP_PLUGIN_1.1;
+
+GOMP_PLUGIN_1.3 {
+  global:
+	GOMP_PLUGIN_goacc_profiling_dispatch;
+	GOMP_PLUGIN_goacc_thread;
+} GOMP_PLUGIN_1.2;
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 7538846..2e05272 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -111,6 +111,7 @@ changed to GNU Offloading and Multi Processing Runtime Library.
                                asynchronous operations.
 * OpenACC Library Interoperability:: OpenACC library interoperability with the
                                NVIDIA CUBLAS library.
+* OpenACC Profiling Interface::
 * The libgomp ABI::            Notes on the external ABI presented by libgomp.
 * Reporting Bugs::             How to report bugs in the GNU Offloading and
                                Multi Processing Runtime Library.
@@ -1901,6 +1902,13 @@ API routines for target platforms.
 * acc_get_current_cuda_context::Get CUDA context handle.
 * acc_get_cuda_stream::         Get CUDA stream handle.
 * acc_set_cuda_stream::         Set CUDA stream handle.
+
+API routines for the OpenACC Profiling Interface.
+
+* acc_prof_register::           Register callbacks.
+* acc_prof_unregister::         Unregister callbacks.
+* acc_prof_lookup::             Obtain inquiry functions.
+* acc_register_library::        Library registration.
 @end menu
 
 
@@ -2865,6 +2873,90 @@ A.2.1.4.
 
 
 
+@node acc_prof_register
+@section @code{acc_prof_register} -- Register callbacks.
+@table @asis
+@item @emph{Description}:
+This function registers callbacks.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_prof_register (acc_event_t, acc_prof_callback, acc_register_t);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_prof_unregister
+@section @code{acc_prof_unregister} -- Unregister callbacks.
+@table @asis
+@item @emph{Description}:
+This function unregisters callbacks.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_prof_unregister (acc_event_t, acc_prof_callback, acc_register_t);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_prof_lookup
+@section @code{acc_prof_lookup} -- Obtain inquiry functions.
+@table @asis
+@item @emph{Description}:
+Function to obtain inquiry functions.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{acc_query_fn acc_prof_lookup (const char *);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
+@node acc_register_library
+@section @code{acc_register_library} -- Library registration.
+@table @asis
+@item @emph{Description}:
+Function for library registration.
+
+@item @emph{C/C++}:
+@multitable @columnfractions .20 .80
+@item @emph{Prototype}: @tab @code{void acc_register_library (acc_prof_reg, acc_prof_reg, acc_prof_lookup_func);}
+@end multitable
+
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}, @ref{ACC_PROFLIB}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+5.3.
+@end table
+
+
+
 @c ---------------------------------------------------------------------
 @c OpenACC Environment Variables
 @c ---------------------------------------------------------------------
@@ -2874,11 +2966,14 @@ A.2.1.4.
 
 The variables @env{ACC_DEVICE_TYPE} and @env{ACC_DEVICE_NUM}
 are defined by section 4 of the OpenACC specification in version 2.6.
+The variable @env{ACC_PROFLIB}
+is defined by section 4 of the OpenACC specification in version 2.6.
 The variable @env{GCC_ACC_NOTIFY} is used for diagnostic purposes.
 
 @menu
 * ACC_DEVICE_TYPE::
 * ACC_DEVICE_NUM::
+* ACC_PROFLIB::
 * GCC_ACC_NOTIFY::
 @end menu
 
@@ -2904,6 +2999,19 @@ The variable @env{GCC_ACC_NOTIFY} is used for diagnostic purposes.
 
 
 
+@node ACC_PROFLIB
+@section @code{ACC_PROFLIB}
+@table @asis
+@item @emph{See also}:
+@ref{acc_register_library}, @ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+4.3.
+@end table
+
+
+
 @node GCC_ACC_NOTIFY
 @section @code{GCC_ACC_NOTIFY}
 @table @asis
@@ -3120,6 +3228,310 @@ Application Programming Interface”, Version 2.6.}
 
 
 @c ---------------------------------------------------------------------
+@c OpenACC Profiling Interface
+@c ---------------------------------------------------------------------
+
+@node OpenACC Profiling Interface
+@chapter OpenACC Profiling Interface
+
+@section Implementation Status and Implementation-Defined Behavior
+
+We're implementing the OpenACC Profiling Interface as defined by the
+OpenACC 2.6 specification.  We're clarifying some aspects here as
+@emph{implementation-defined behavior}, while they're still under
+discussion within the OpenACC Technical Committee.
+
+This implementation is tuned to keep the performance impact as low as
+possible for the (very common) case that the Profiling Interface is
+not enabled.  This is relevant, as the Profiling Interface affects all
+the @emph{hot} code paths (in the target code, not in the offloaded
+code).  Users of the OpenACC Profiling Interface can be expected to
+understand that performance will be impacted to some degree once the
+Profiling Interface has gotten enabled: for example, because of the
+@emph{runtime} (libgomp) calling into a third-party @emph{library} for
+every event that has been registered.
+
+We're not yet accounting for the fact that @cite{OpenACC events may
+occur during event processing}.
+
+We're not yet implementing initialization via a
+@code{acc_register_library} function that is either statically linked
+in, or dynamically via @env{LD_PRELOAD}.
+Initialization via @code{acc_register_library} functions dynamically
+loaded via the @env{ACC_PROFLIB} environment variable does work, as
+does directly calling @code{acc_prof_register},
+@code{acc_prof_unregister}, @code{acc_prof_lookup}.
+
+As currently there are no inquiry functions defined, calls to
+@code{acc_prof_lookup} will always return @code{NULL}.
+
+There aren't separate @emph{start}, @emph{stop} events defined for the
+event types @code{acc_ev_create}, @code{acc_ev_delete},
+@code{acc_ev_alloc}, @code{acc_ev_free}.  It's not clear if these
+should be triggered before or after the actual device-specific call is
+made.  We trigger them after.
+
+Remarks about data provided to callbacks:
+
+@table @asis
+
+@item @code{acc_prof_info.event_type}
+It's not clear if for @emph{nested} event callbacks (for example,
+@code{acc_ev_enqueue_launch_start} as part of a parent compute
+construct), this should be set for the nested event
+(@code{acc_ev_enqueue_launch_start}), or if the value of the parent
+construct should remain (@code{acc_ev_compute_construct_start}).  In
+this implementation, the value will generally correspond to the
+innermost nested event type.
+
+@item @code{acc_prof_info.device_type}
+@itemize
+
+@item
+For @code{acc_ev_compute_construct_start}, and in presence of an
+@code{if} clause with @emph{false} argument, this will still refer to
+the offloading device type.
+It's not clear if that's the expected behavior.
+
+@item
+Complementary to the item before, for
+@code{acc_ev_compute_construct_end}, this is set to
+@code{acc_device_host} in presence of an @code{if} clause with
+@emph{false} argument.
+It's not clear if that's the expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.thread_id}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.async}
+@itemize
+
+@item
+Not yet implemented correctly for
+@code{acc_ev_compute_construct_start}.
+
+@item
+In a compute construct, for host-fallback
+execution/@code{acc_device_host} it will always be
+@code{acc_async_sync}.
+It's not clear if that's the expected behavior.
+
+@item
+For @code{acc_ev_device_init_start} and @code{acc_ev_device_init_end},
+it will always be @code{acc_async_sync}.
+It's not clear if that's the expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.async_queue}
+There is no @cite{limited number of asynchronous queues} in libgomp.
+This will always have the same value as @code{acc_prof_info.async}.
+
+@item @code{acc_prof_info.src_file}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.func_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_event_info.event_type}, @code{acc_event_info.*.event_type}
+Relating to @code{acc_prof_info.event_type} discussed above, in this
+implementation, this will always be the same value as
+@code{acc_prof_info.event_type}.
+
+@item @code{acc_event_info.*.parent_construct}
+@itemize
+
+@item
+Will be @code{acc_construct_parallel} for all OpenACC compute
+constructs as well as many OpenACC Runtime API calls; should be the
+one matching the actual construct, or
+@code{acc_construct_runtime_api}, respectively.
+
+@item
+Will be @code{acc_construct_enter_data} or
+@code{acc_construct_exit_data} when processing variable mappings
+specified in OpenACC @emph{declare} directives; should be
+@code{acc_construct_declare}.
+
+@item
+For implicit @code{acc_ev_device_init_start},
+@code{acc_ev_device_init_end}, and explicit as well as implicit
+@code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start}, and
+@code{acc_ev_enqueue_download_end}, will be
+@code{acc_construct_parallel}; should reflect the real parent
+construct.
+
+@end itemize
+
+@item @code{acc_event_info.*.implicit}
+For @code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start}, and
+@code{acc_ev_enqueue_download_end}, this currently will be @code{1}
+also for explicit usage.
+
+@item @code{acc_event_info.data_event.var_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_event_info.data_event.host_ptr}
+For @code{acc_ev_alloc}, and @code{acc_ev_free}, this is always
+@code{NULL}.
+
+@item @code{typedef union acc_api_info}
+@dots{} as printed in @cite{5.2.3. Third Argument: API-Specific
+Information}.  This should obviously be @code{typedef @emph{struct}
+acc_api_info}.
+
+@item @code{acc_api_info.device_api}
+Possibly not yet implemented correctly for
+@code{acc_ev_compute_construct_start},
+@code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}:
+will always be @code{acc_device_api_none} for these event types.
+For @code{acc_ev_enter_data_start}, it will be
+@code{acc_device_api_none} in some cases.
+
+@item @code{acc_api_info.device_type}
+Always the same as @code{acc_prof_info.device_type}.
+
+@item @code{acc_api_info.vendor}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_api_info.device_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.context_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.async_handle}
+Always @code{NULL}; not yet implemented.
+
+@end table
+
+Remarks about certain event types:
+
+@table @asis
+
+@item @code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}
+@itemize
+
+@item
+@c See 'DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT' in
+@c 'libgomp.oacc-c-c++-common/acc_prof-kernels-1.c',
+@c 'libgomp.oacc-c-c++-common/acc_prof-parallel-1.c'.
+Whan a compute construct triggers implicit
+@code{acc_ev_device_init_start} and @code{acc_ev_device_init_end}
+events, they currently aren't @emph{nested within} the corresponding
+@code{acc_ev_compute_construct_start} and
+@code{acc_ev_compute_construct_end}, but they're currently observed
+@emph{before} @code{acc_ev_compute_construct_start}.
+It's not clear what to do: the standard asks us provide a lot of
+details to the @code{acc_ev_compute_construct_start} callback, without
+(implicitly) initializing a device before?
+
+@item
+Callbacks for these event types will not be invoked for calls to the
+@code{acc_set_device_type} and @code{acc_set_device_num} functions.
+It's not clear if they should be.
+
+@end itemize
+
+@item @code{acc_ev_enter_data_start}, @code{acc_ev_enter_data_end}, @code{acc_ev_exit_data_start}, @code{acc_ev_exit_data_end}
+@itemize
+
+@item
+Callbacks for these event types will also be invoked for OpenACC
+@emph{host_data} constructs.
+It's not clear if they should be.
+
+@item
+Callbacks for these event types will also be invoked when processing
+variable mappings specified in OpenACC @emph{declare} directives.
+It's not clear if they should be.
+
+@end itemize
+
+@end table
+
+Callbacks for the following event types will be invoked, but dispatch
+and information provided therein has not yet been thoroughly reviewed:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@item @code{acc_ev_update_start}, @code{acc_ev_update_end}
+@item @code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end}
+@item @code{acc_ev_enqueue_download_start}, @code{acc_ev_enqueue_download_end}
+@end itemize
+
+During device initialization, and finalization, respectively,
+callbacks for the following event types will not yet be invoked:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@end itemize
+
+Callbacks for the following event types have not yet been implemented,
+so currently won't be invoked:
+
+@itemize
+@item @code{acc_ev_device_shutdown_start}, @code{acc_ev_device_shutdown_end}
+@item @code{acc_ev_runtime_shutdown}
+@item @code{acc_ev_create}, @code{acc_ev_delete}
+@item @code{acc_ev_wait_start}, @code{acc_ev_wait_end}
+@end itemize
+
+For the following runtime library functions, not all expected
+callbacks will be invoked (mostly concerning implicit device
+initialization):
+
+@itemize
+@item @code{acc_get_num_devices}
+@item @code{acc_set_device_type}
+@item @code{acc_get_device_type}
+@item @code{acc_set_device_num}
+@item @code{acc_get_device_num}
+@item @code{acc_init}
+@item @code{acc_shutdown}
+@end itemize
+
+Aside from implicit device initialization, for the following runtime
+library functions, no callbacks will be invoked for shared-memory
+offloading devices (it's not clear if they should be):
+
+@itemize
+@item @code{acc_malloc}
+@item @code{acc_free}
+@item @code{acc_copyin}, @code{acc_present_or_copyin}, @code{acc_copyin_async}
+@item @code{acc_create}, @code{acc_present_or_create}, @code{acc_create_async}
+@item @code{acc_copyout}, @code{acc_copyout_async}, @code{acc_copyout_finalize}, @code{acc_copyout_finalize_async}
+@item @code{acc_delete}, @code{acc_delete_async}, @code{acc_delete_finalize}, @code{acc_delete_finalize_async}
+@item @code{acc_update_device}, @code{acc_update_device_async}
+@item @code{acc_update_self}, @code{acc_update_self_async}
+@item @code{acc_map_data}, @code{acc_unmap_data}
+@item @code{acc_memcpy_to_device}, @code{acc_memcpy_to_device_async}
+@item @code{acc_memcpy_from_device}, @code{acc_memcpy_from_device_async}
+@end itemize
+
+
+
+@c ---------------------------------------------------------------------
 @c The libgomp ABI
 @c ---------------------------------------------------------------------
 
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 51bb676..1760e8c 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -43,17 +43,6 @@ get_goacc_thread (void)
   return thr;
 }
 
-static struct gomp_device_descr *
-get_goacc_thread_device (void)
-{
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
-
-  return thr->dev;
-}
-
 static int
 validate_async_val (int async)
 {
@@ -76,7 +65,10 @@ validate_async_val (int async)
 
 /* Return the asyncqueue to be used for OpenACC async-argument ASYNC.  This
    might return NULL if no asyncqueue is to be used.  Otherwise, if CREATE,
-   create the asyncqueue if it doesn't exist yet.  */
+   create the asyncqueue if it doesn't exist yet.
+
+   Unless CREATE, this will not generate any OpenACC Profiling Interface
+   events.  */
 
 attribute_hidden struct goacc_asyncqueue *
 lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
@@ -152,8 +144,25 @@ acc_async_test (int async)
   goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
   if (!aq)
     return 1;
-  else
-    return thr->dev->openacc.async.test_func (aq);
+
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
+  int res = thr->dev->openacc.async.test_func (aq);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
+  return res;
 }
 
 int
@@ -161,6 +170,10 @@ acc_async_test_all (void)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
   int ret = 1;
   gomp_mutex_lock (&thr->dev->openacc.async.lock);
   for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
@@ -170,6 +183,13 @@ acc_async_test_all (void)
 	break;
       }
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   return ret;
 }
 
@@ -179,8 +199,26 @@ acc_wait (int async)
   struct goacc_thread *thr = get_goacc_thread ();
 
   goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
-  if (aq && !thr->dev->openacc.async.synchronize_func (aq))
+  if (!aq)
+    return;
+
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
+  if (!thr->dev->openacc.async.synchronize_func (aq))
     gomp_fatal ("wait on %d failed", async);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -205,10 +243,19 @@ acc_wait_async (int async1, int async2)
   if (!aq1)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async2;
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
   /* An async queue is always synchronized with itself.  */
   if (aq1 == aq2)
-    return;
+    goto out_prof;
 
   if (aq2)
     {
@@ -222,18 +269,35 @@ acc_wait_async (int async1, int async2)
       if (!thr->dev->openacc.async.synchronize_func (aq1))
 	gomp_fatal ("wait on %d failed", async1);
     }
+
+ out_prof:
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 acc_wait_all (void)
 {
-  struct gomp_device_descr *dev = get_goacc_thread_device ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
 
   bool ret = true;
-  gomp_mutex_lock (&dev->openacc.async.lock);
-  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
-    ret &= dev->openacc.async.synchronize_func (l->aq);
-  gomp_mutex_unlock (&dev->openacc.async.lock);
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    ret &= thr->dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 
   if (!ret)
     gomp_fatal ("wait all failed");
@@ -255,6 +319,15 @@ acc_wait_all_async (int async)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
   bool ret = true;
@@ -270,6 +343,12 @@ acc_wait_all_async (int async)
     }
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   if (!ret)
     gomp_fatal ("wait all async(%d) failed", async);
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index 1ca7149..2eda381 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -37,10 +37,23 @@ acc_get_current_cuda_device (void)
 {
   struct goacc_thread *thr = goacc_thread ();
 
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
-    return thr->dev->openacc.cuda.get_current_device_func ();
+    {
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
+      ret = thr->dev->openacc.cuda.get_current_device_func ();
+
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
+    }
 
-  return NULL;
+  return ret;
 }
 
 void *
@@ -48,10 +61,23 @@ acc_get_current_cuda_context (void)
 {
   struct goacc_thread *thr = goacc_thread ();
 
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_current_context_func)
-    return thr->dev->openacc.cuda.get_current_context_func ();
- 
-  return NULL;
+    {
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
+      ret = thr->dev->openacc.cuda.get_current_context_func ();
+
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
+    }
+
+  return ret;
 }
 
 void *
@@ -62,14 +88,32 @@ acc_get_cuda_stream (int async)
   if (!async_valid_p (async))
     return NULL;
 
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
     {
       goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
-      if (aq)
-	return thr->dev->openacc.cuda.get_stream_func (aq);
+      if (!aq)
+	return ret;
+
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+      if (profiling_p)
+	{
+	  prof_info.async = async;
+	  prof_info.async_queue = prof_info.async;
+	}
+
+      ret = thr->dev->openacc.cuda.get_stream_func (aq);
+
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
     }
- 
-  return NULL;
+
+  return ret;
 }
 
 /* As of OpenACC 2.6, the return code of this function appears to be
@@ -90,6 +134,15 @@ acc_set_cuda_stream (int async, void *stream)
   int ret = 0;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
     {
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+      if (profiling_p)
+	{
+	  prof_info.async = async;
+	  prof_info.async_queue = prof_info.async;
+	}
+
       goacc_aq aq = get_goacc_asyncqueue (async);
       /* Due to not using an asyncqueue for "acc_async_sync", this cannot be
 	 used to change the CUDA stream associated with "acc_async_sync".  */
@@ -98,11 +151,19 @@ acc_set_cuda_stream (int async, void *stream)
 	  assert (async == acc_async_sync);
 	  gomp_debug (0, "Refusing request to set CUDA stream associated"
 		      " with \"acc_async_sync\"\n");
-	  return 0;
+	  ret = 0;
+	  goto out_prof;
 	}
       gomp_mutex_lock (&thr->dev->openacc.async.lock);
       ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
       gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
+    out_prof:
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
     }
 
   return ret;
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 0cd673e..f7ea58f 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -228,8 +228,67 @@ acc_dev_num_out_of_range (acc_device_t d, int ord, int ndevs)
    held before calling this function.  */
 
 static struct gomp_device_descr *
-acc_init_1 (acc_device_t d)
+acc_init_1 (acc_device_t d, acc_construct_t parent_construct, int implicit)
 {
+  bool check_not_nested_p;
+  if (implicit)
+    {
+      /* In the implicit case, there should (TODO: must?) already be something
+	 have been set up for an outer construct.  */
+      check_not_nested_p = false;
+    }
+  else
+    {
+      check_not_nested_p = true;
+      /* TODO: should we set 'thr->prof_info' etc. in this case ('acc_init')?
+	 The problem is, that we don't have 'thr' yet?  (So,
+	 'check_not_nested_p = true' also is pointless actually.)  */
+    }
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (check_not_nested_p);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_device_init_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = d;
+      prof_info.device_number = goacc_device_num;
+      prof_info.thread_id = -1;
+      prof_info.async = acc_async_sync;
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info device_init_event_info;
+  if (profiling_p)
+    {
+      device_init_event_info.other_event.event_type = prof_info.event_type;
+      device_init_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      device_init_event_info.other_event.parent_construct = parent_construct;
+      device_init_event_info.other_event.implicit = implicit;
+      device_init_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &device_init_event_info, &api_info);
+
   struct gomp_device_descr *base_dev, *acc_dev;
   int ndevs;
 
@@ -261,6 +320,14 @@ acc_init_1 (acc_device_t d)
   acc_init_state = initialized;
   gomp_mutex_unlock (&acc_init_state_lock);
 
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_device_init_end;
+      device_init_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &device_init_event_info,
+				&api_info);
+    }
+
   return base_dev;
 }
 
@@ -454,7 +521,11 @@ goacc_attach_host_thread_to_device (int ord)
   thr->dev = acc_dev = &base_dev[ord];
   thr->saved_bound_dev = NULL;
   thr->mapped_data = NULL;
-  
+  thr->prof_info = NULL;
+  thr->api_info = NULL;
+  /* Initially, all callbacks for all events are enabled.  */
+  thr->prof_callbacks_enabled = true;
+
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
 }
@@ -468,9 +539,7 @@ acc_init (acc_device_t d)
   gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
-
-  cached_base_dev = acc_init_1 (d);
-
+  cached_base_dev = acc_init_1 (d, acc_construct_runtime_api, 0);
   gomp_mutex_unlock (&acc_device_lock);
   
   goacc_attach_host_thread_to_device (-1);
@@ -529,6 +598,12 @@ acc_set_device_type (acc_device_t d)
   struct gomp_device_descr *base_dev, *acc_dev;
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    prof_info.device_type = d;
+
   gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
@@ -553,6 +628,12 @@ acc_set_device_type (acc_device_t d)
     }
 
   goacc_attach_host_thread_to_device (-1);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 ialias (acc_set_device_type)
@@ -588,12 +669,22 @@ acc_get_device_type (void)
     ;
   else
     {
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
       gomp_init_targets_once ();
 
       gomp_mutex_lock (&acc_device_lock);
       dev = resolve_device (acc_device_default, true);
       gomp_mutex_unlock (&acc_device_lock);
       res = acc_device_type (dev->type);
+
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
     }
 
   assert (res != acc_device_default
@@ -614,12 +705,24 @@ acc_get_device_num (acc_device_t d)
   if (d >= _ACC_device_hwm)
     gomp_fatal ("unknown device type %u", (unsigned) d);
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    prof_info.device_type = d;
+
   gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
   dev = resolve_device (d, true);
   gomp_mutex_unlock (&acc_device_lock);
 
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   if (thr && thr->base_dev == dev && thr->dev)
     return thr->dev->target_id;
 
@@ -813,8 +916,13 @@ goacc_lazy_initialize (void)
   if (thr && thr->dev)
     return;
 
+  gomp_init_targets_once ();
+
+  gomp_mutex_lock (&acc_device_lock);
   if (!cached_base_dev)
-    acc_init (acc_device_default);
-  else
-    goacc_attach_host_thread_to_device (-1);
+    cached_base_dev = acc_init_1 (acc_device_default,
+				  acc_construct_parallel, 1);
+  gomp_mutex_unlock (&acc_device_lock);
+
+  goacc_attach_host_thread_to_device (-1);
 }
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index e4b6ea6..5ca9944 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -40,6 +40,7 @@
 
 #include "openacc.h"
 #include "config.h"
+#include "acc_prof.h"
 #include <stddef.h>
 #include <stdbool.h>
 #include <stdarg.h>
@@ -68,6 +69,12 @@ struct goacc_thread
      strictly push/pop semantics according to lexical scope.  */
   struct target_mem_desc *mapped_data;
 
+  /* Data of the OpenACC Profiling Interface.  */
+  acc_prof_info *prof_info;
+  acc_api_info *api_info;
+  /* Per-thread toggle of OpenACC Profiling Interface callbacks.  */
+  bool prof_callbacks_enabled;
+
   /* These structures form a list: this is the next thread in that list.  */
   struct goacc_thread *next;
 
@@ -128,6 +135,28 @@ async_synchronous_p (int async)
   return async == acc_async_sync;
 }
 
+
+extern bool goacc_prof_enabled;
+/* Tune for the (very common) case that profiling is not enabled.  */
+#define GOACC_PROF_ENABLED \
+  (__builtin_expect (__atomic_load_n (&goacc_prof_enabled, \
+				      MEMMODEL_ACQUIRE) == true, false))
+
+void goacc_profiling_initialize (void);
+bool _goacc_profiling_dispatch_p (bool);
+/* Tune for the (very common) case that profiling is not enabled.  */
+#define GOACC_PROFILING_DISPATCH_P(...) \
+  (GOACC_PROF_ENABLED \
+   && _goacc_profiling_dispatch_p (__VA_ARGS__))
+bool _goacc_profiling_setup_p (struct goacc_thread *,
+			       acc_prof_info *, acc_api_info *);
+/* Tune for the (very common) case that profiling is not enabled.  */
+#define GOACC_PROFILING_SETUP_P(...) \
+  (GOACC_PROFILING_DISPATCH_P (false) \
+   && _goacc_profiling_setup_p (__VA_ARGS__))
+void goacc_profiling_dispatch (acc_prof_info *, acc_event_info *,
+			       acc_api_info *);
+
 #ifdef HAVE_ATTRIBUTE_VISIBILITY
 # pragma GCC visibility pop
 #endif
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index b3b3777..c07a5eb 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -104,7 +104,19 @@ acc_malloc (size_t s)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return malloc (s);
 
-  return thr->dev->alloc_func (thr->dev->target_id, s);
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
+  void *res = thr->dev->alloc_func (thr->dev->target_id, s);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
+  return res;
 }
 
 /* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event
@@ -127,6 +139,10 @@ acc_free (void *d)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return free (d);
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
   gomp_mutex_lock (&acc_dev->lock);
 
   /* We don't have to call lazy open here, as the ptr value must have
@@ -147,6 +163,12 @@ acc_free (void *d)
 
   if (!acc_dev->free_func (acc_dev->target_id, d))
     gomp_fatal ("error in freeing device memory in %s", __FUNCTION__);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 static void
@@ -168,11 +190,26 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
       return;
     }
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq aq = get_goacc_asyncqueue (async);
   if (from)
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
     gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -217,6 +254,9 @@ acc_deviceptr (void *h)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&dev->lock);
 
   n = lookup_host (dev, h, 1);
@@ -254,6 +294,9 @@ acc_hostptr (void *d)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return d;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_dev (&acc_dev->mem_map, d, 1);
@@ -291,6 +334,9 @@ acc_is_present (void *h, size_t s)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h != NULL;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -335,6 +381,10 @@ acc_map_data (void *h, void *d, size_t s)
 	gomp_fatal ("[%p,+%d]->[%p,+%d] is a bad map",
                     (void *)h, (int)s, (void *)d, (int)s);
 
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
       gomp_mutex_lock (&acc_dev->lock);
 
       if (lookup_host (acc_dev, h, s))
@@ -356,6 +406,12 @@ acc_map_data (void *h, void *d, size_t s)
       tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, &devaddrs, &sizes,
 			   &kinds, true, GOMP_MAP_VARS_OPENACC);
       tgt->list[0].key->refcount = REFCOUNT_INFINITY;
+
+      if (profiling_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
     }
 }
 
@@ -372,6 +428,10 @@ acc_unmap_data (void *h)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+
   gomp_mutex_lock (&acc_dev->lock);
 
   cur_node.host_start = (uintptr_t) h;
@@ -405,6 +465,12 @@ acc_unmap_data (void *h)
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 #define FLAG_PRESENT (1 << 0)
@@ -428,6 +494,15 @@ present_create_copy (unsigned f, void *h, size_t s, int async)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -489,6 +564,12 @@ present_create_copy (unsigned f, void *h, size_t s, int async)
       gomp_mutex_unlock (&acc_dev->lock);
     }
 
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   return d;
 }
 
@@ -569,6 +650,15 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -627,6 +717,12 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -698,6 +794,15 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
   if (h == NULL)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -719,6 +824,12 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
     gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index b949599..c798578 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -205,21 +205,75 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   thr = goacc_thread ();
   acc_dev = thr->dev;
 
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_compute_construct_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1;
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info compute_construct_event_info;
+  if (profiling_p)
+    {
+      compute_construct_event_info.other_event.event_type
+	= prof_info.event_type;
+      compute_construct_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      compute_construct_event_info.other_event.parent_construct
+	= acc_construct_parallel;
+      compute_construct_event_info.other_event.implicit = 0;
+      compute_construct_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+			      &api_info);
+
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
   /* Host fallback if "if" clause is false or if the current device is set to
      the host.  */
   if (flags & GOACC_FLAG_HOST_FALLBACK)
     {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
       goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
-      return;
+      goto out_prof;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
       goacc_call_host_fn (fn, mapnum, hostaddrs, params);
-      return;
+      goto out_prof;
     }
 
   /* Default: let the runtime choose.  */
@@ -252,6 +306,13 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
 
 	    if (async == GOMP_LAUNCH_OP_MAX)
 	      async = va_arg (*ap, unsigned);
+
+	    if (profiling_p)
+	      {
+		prof_info.async = async;
+		prof_info.async_queue = prof_info.async;
+	      }
+
 	    break;
 	  }
 
@@ -284,6 +345,22 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   else
     tgt_fn = (void (*)) fn;
 
+  acc_event_info enter_exit_data_event_info;
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_start;
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_exit_data_event_info.other_event.parent_construct
+	= compute_construct_event_info.other_event.parent_construct;
+      enter_exit_data_event_info.other_event.implicit = 1;
+      enter_exit_data_event_info.other_event.tool_info = NULL;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+    }
+
   goacc_aq aq = get_goacc_asyncqueue (async);
 
   tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
@@ -298,6 +375,15 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
 
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_end;
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+    }
+
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     devaddrs[i] = (void *) gomp_map_val (tgt, hostaddrs, i);
@@ -310,8 +396,6 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
       else
 	acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
 				    dims, tgt);
-      /* If running synchronously, unmap immediately.  */
-      gomp_unmap_vars (tgt, true);
     }
   else
     {
@@ -321,7 +405,29 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
       else
 	acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs,
 					  devaddrs, dims, tgt, aq);
-      gomp_unmap_vars_async (tgt, true, aq);
+    }
+
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_exit_data_start;
+      enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
+      enter_exit_data_event_info.other_event.tool_info = NULL;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+    }
+
+  /* If running synchronously, unmap immediately.  */
+  if (aq == NULL)
+    gomp_unmap_vars (tgt, true);
+  else
+    gomp_unmap_vars_async (tgt, true, aq);
+
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_exit_data_end;
+      enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
     }
 
 #ifdef RC_CHECKING
@@ -329,6 +435,19 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+  out_prof:
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_compute_construct_end;
+      compute_construct_event_info.other_event.event_type
+	= prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+				&api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -390,6 +509,71 @@ GOACC_data_start (int flags_m, size_t mapnum,
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_enter_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1;
+      prof_info.async = acc_async_sync; /* Always synchronous.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info enter_data_event_info;
+  if (profiling_p)
+    {
+      enter_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_data_event_info.other_event.parent_construct = acc_construct_data;
+      for (int i = 0; i < mapnum; ++i)
+	if ((kinds[i] & 0xff) == GOMP_MAP_USE_DEVICE_PTR)
+	  {
+	    /* If there is one such data mapping kind, then this is actually an
+	       OpenACC 'host_data' construct.  (GCC maps the OpenACC
+	       'host_data' construct to the OpenACC 'data' construct.)  Apart
+	       from artificial test cases (such as an OpenACC 'host_data'
+	       construct's (implicit) device initialization when there hasn't
+	       been any device data be set up before...), there can't really
+	       any meaningful events be generated from OpenACC 'host_data'
+	       constructs, though.  */
+	    enter_data_event_info.other_event.parent_construct
+	      = acc_construct_host_data;
+	    break;
+	  }
+      enter_data_event_info.other_event.implicit = 0;
+      enter_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &enter_data_event_info, &api_info);
+
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
   enum gomp_map_vars_kind pragma_kind;
@@ -402,11 +586,13 @@ GOACC_data_start (int flags_m, size_t mapnum,
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
     {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
       tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, pragma_kind);
       tgt->prev = thr->mapped_data;
       thr->mapped_data = tgt;
 
-      return;
+      goto out_prof;
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
@@ -421,14 +607,76 @@ GOACC_data_start (int flags_m, size_t mapnum,
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+out_prof:
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_end;
+      enter_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_data_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 GOACC_data_end (void)
 {
   struct goacc_thread *thr = goacc_thread ();
+  struct gomp_device_descr *acc_dev = thr->dev;
   struct target_mem_desc *tgt = thr->mapped_data;
 
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_exit_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1;
+      prof_info.async = acc_async_sync; /* Always synchronous.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info exit_data_event_info;
+  if (profiling_p)
+    {
+      exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      exit_data_event_info.other_event.parent_construct = acc_construct_data;
+      exit_data_event_info.other_event.implicit = 0;
+      exit_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &exit_data_event_info, &api_info);
+
   gomp_debug (0, "  %s: restore mappings\n", __FUNCTION__);
   thr->mapped_data = tgt->prev;
   gomp_unmap_vars (tgt, true);
@@ -439,6 +687,16 @@ GOACC_data_end (void)
   gomp_rc_check (thr->dev, thr->mapped_data);
   gomp_mutex_unlock (&thr->dev->lock);
 #endif
+
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_exit_data_end;
+      exit_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &exit_data_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -458,19 +716,6 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
   thr = goacc_thread ();
   acc_dev = thr->dev;
 
-  if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
-      || (flags & GOACC_FLAG_HOST_FALLBACK))
-    return;
-
-  if (num_waits)
-    {
-      va_list ap;
-
-      va_start (ap, num_waits);
-      goacc_wait (async, num_waits, &ap);
-      va_end (ap);
-    }
-
   /* Determine whether "finalize" semantics apply to all mappings of this
      OpenACC directive.  */
   bool finalize = false;
@@ -521,6 +766,77 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
 		      kind);
     }
 
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type
+	= data_enter ? acc_ev_enter_data_start : acc_ev_exit_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1;
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info enter_exit_data_event_info;
+  if (profiling_p)
+    {
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_exit_data_event_info.other_event.parent_construct
+	= data_enter ? acc_construct_enter_data : acc_construct_exit_data;
+      enter_exit_data_event_info.other_event.implicit = 0;
+      enter_exit_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+			      &api_info);
+
+  if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+      || (flags & GOACC_FLAG_HOST_FALLBACK))
+    {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+
+      goto out_prof;
+    }
+
+  if (num_waits)
+    {
+      va_list ap;
+
+      va_start (ap, num_waits);
+      goacc_wait (async, num_waits, &ap);
+      va_end (ap);
+    }
+
   /* In c, non-pointers and arrays are represented by a single data clause.
      Dynamically allocated arrays and subarrays are represented by a data
      clause followed by an internal GOMP_MAP_POINTER.
@@ -739,6 +1055,19 @@ GOACC_enter_exit_data (int flags_m, size_t mapnum,
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+ out_prof:
+  if (profiling_p)
+    {
+      prof_info.event_type
+	= data_enter ? acc_ev_enter_data_end : acc_ev_exit_data_end;
+      enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 static void
@@ -787,9 +1116,64 @@ GOACC_update (int flags_m, size_t mapnum,
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  bool profiling_p = GOACC_PROFILING_DISPATCH_P (true);
+
+  acc_prof_info prof_info;
+  if (profiling_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_update_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1;
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL;
+      prof_info.func_name = NULL;
+      prof_info.line_no = -1;
+      prof_info.end_line_no = -1;
+      prof_info.func_line_no = -1;
+      prof_info.func_end_line_no = -1;
+    }
+  acc_event_info update_event_info;
+  if (profiling_p)
+    {
+      update_event_info.other_event.event_type
+	= prof_info.event_type;
+      update_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      update_event_info.other_event.parent_construct = acc_construct_update;
+      update_event_info.other_event.implicit = 0;
+      update_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1;
+      api_info.device_handle = NULL;
+      api_info.context_handle = NULL;
+      api_info.async_handle = NULL;
+    }
+
+  if (profiling_p)
+    goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
-    return;
+    {
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+
+      goto out_prof;
+    }
 
   if (num_waits)
     {
@@ -861,11 +1245,38 @@ GOACC_update (int flags_m, size_t mapnum,
 	  break;
 	}
     }
+
+ out_prof:
+  if (profiling_p)
+    {
+      prof_info.event_type = acc_ev_update_end;
+      update_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 GOACC_wait (int async, int num_waits, ...)
 {
+  goacc_lazy_initialize ();
+
+  struct goacc_thread *thr = goacc_thread ();
+
+  /* No nesting.  */
+  assert (thr->prof_info == NULL);
+  assert (thr->api_info == NULL);
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_p = GOACC_PROFILING_SETUP_P (thr, &prof_info, &api_info);
+  if (profiling_p)
+    {
+      prof_info.async = async;
+      prof_info.async_queue = prof_info.async;
+    }
+
   if (num_waits)
     {
       va_list ap;
@@ -878,6 +1289,12 @@ GOACC_wait (int async, int num_waits, ...)
     acc_wait_all ();
   else
     acc_wait_all_async (async);
+
+  if (profiling_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 int
diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c
index 5480c1d..b4f71a1 100644
--- a/libgomp/oacc-plugin.c
+++ b/libgomp/oacc-plugin.c
@@ -29,6 +29,7 @@
 #include "libgomp.h"
 #include "oacc-plugin.h"
 #include "oacc-int.h"
+#include "acc_prof.h"
 
 /* This plugin function is now obsolete.  */
 void
@@ -38,6 +39,14 @@ GOMP_PLUGIN_async_unmap_vars (void *ptr __attribute__((unused)),
   gomp_fatal ("invalid plugin function");
 }
 
+/* Return the TLS data for the current thread.  */
+
+struct goacc_thread *
+GOMP_PLUGIN_goacc_thread (void)
+{
+  return goacc_thread ();
+}
+
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
@@ -57,3 +66,11 @@ GOMP_PLUGIN_acc_default_dim (unsigned int i)
     }
   return goacc_default_dims[i];
 }
+
+void
+GOMP_PLUGIN_goacc_profiling_dispatch (acc_prof_info *prof_info,
+				      acc_event_info *event_info,
+				      acc_api_info *api_info)
+{
+  goacc_profiling_dispatch (prof_info, event_info, api_info);
+}
diff --git a/libgomp/oacc-plugin.h b/libgomp/oacc-plugin.h
index 112d4a59..887c6f6 100644
--- a/libgomp/oacc-plugin.h
+++ b/libgomp/oacc-plugin.h
@@ -27,8 +27,15 @@
 #ifndef OACC_PLUGIN_H
 #define OACC_PLUGIN_H 1
 
+#include "oacc-int.h"
+#include "acc_prof.h"
+
 extern void GOMP_PLUGIN_async_unmap_vars (void *, int);
+extern struct goacc_thread *GOMP_PLUGIN_goacc_thread (void);
 extern void *GOMP_PLUGIN_acc_thread (void);
 extern int GOMP_PLUGIN_acc_default_dim (unsigned int);
+extern void GOMP_PLUGIN_goacc_profiling_dispatch (acc_prof_info *,
+						  acc_event_info *,
+						  acc_api_info *);
 
 #endif
diff --git a/libgomp/oacc-profiling.c b/libgomp/oacc-profiling.c
new file mode 100644
index 0000000..eff2886
--- /dev/null
+++ b/libgomp/oacc-profiling.c
@@ -0,0 +1,662 @@
+/* OpenACC Profiling Interface
+
+   Copyright (C) 2019 Free Software Foundation, Inc.
+
+   Contributed by Mentor, a Siemens Business.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#define _GNU_SOURCE
+#include "libgomp.h"
+#include "oacc-int.h"
+#include "secure_getenv.h"
+#include "acc_prof.h"
+#include <assert.h>
+#ifdef HAVE_STRING_H
+# include <string.h>
+#endif
+#ifdef PLUGIN_SUPPORT
+# include <dlfcn.h>
+#endif
+
+#define STATIC_ASSERT(expr) _Static_assert (expr, "!(" #expr ")")
+
+/* Statically assert that the layout of the common fields in the
+   'acc_event_info' variants matches.  */
+/* 'event_type' */
+STATIC_ASSERT (offsetof (acc_event_info, event_type)
+	       == offsetof (acc_event_info, data_event.event_type));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.event_type)
+	       == offsetof (acc_event_info, launch_event.event_type));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.event_type)
+	       == offsetof (acc_event_info, other_event.event_type));
+/* 'valid_bytes' */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.valid_bytes)
+	       == offsetof (acc_event_info, launch_event.valid_bytes));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.valid_bytes)
+	       == offsetof (acc_event_info, other_event.valid_bytes));
+/* 'parent_construct' */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.parent_construct)
+	       == offsetof (acc_event_info, launch_event.parent_construct));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.parent_construct)
+	       == offsetof (acc_event_info, other_event.parent_construct));
+/* 'implicit' */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.implicit)
+	       == offsetof (acc_event_info, launch_event.implicit));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.implicit)
+	       == offsetof (acc_event_info, other_event.implicit));
+/* 'tool_info' */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.tool_info)
+	       == offsetof (acc_event_info, launch_event.tool_info));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.tool_info)
+	       == offsetof (acc_event_info, other_event.tool_info));
+
+struct goacc_prof_callback_entry
+{
+  acc_prof_callback cb;
+  int ref;
+  bool enabled;
+  struct goacc_prof_callback_entry *next;
+};
+
+/* Use a separate flag to minimize run-time performance impact for the (very
+   common) case that profiling is not enabled.
+
+   Once enabled, we're not going to disable this anymore, anywhere.  We
+   probably could, by adding appropriate logic to 'acc_prof_register',
+   'acc_prof_unregister'.  */
+bool goacc_prof_enabled = false;
+
+/* Global state for registered callbacks.
+   'goacc_prof_callbacks_enabled[acc_ev_none]' acts as a global toggle.  */
+static bool goacc_prof_callbacks_enabled[acc_ev_last];
+static struct goacc_prof_callback_entry *goacc_prof_callback_entries[acc_ev_last];
+/* Lock used to protect access to 'goacc_prof_callbacks_enabled', and
+   'goacc_prof_callback_entries'.  */
+static gomp_mutex_t goacc_prof_lock;
+
+void
+goacc_profiling_initialize (void)
+{
+  gomp_mutex_init (&goacc_prof_lock);
+
+  /* Initially, all callbacks for all events are enabled.  */
+  for (int i = 0; i < acc_ev_last; ++i)
+    goacc_prof_callbacks_enabled[i] = true;
+
+
+#ifdef PLUGIN_SUPPORT
+  char *acc_proflibs = secure_getenv ("ACC_PROFLIB");
+  while (acc_proflibs != NULL && acc_proflibs[0] != '\0')
+    {
+      char *acc_proflibs_sep = strchr (acc_proflibs, ';');
+      char *acc_proflib;
+      if (acc_proflibs_sep == acc_proflibs)
+	{
+	  /* Stray ';' separator: make sure we don't 'dlopen' the main
+	     program.  */
+	  acc_proflib = NULL;
+	}
+      else
+	{
+	  if (acc_proflibs_sep != NULL)
+	    {
+	      /* Single out the first library.  */
+	      acc_proflib = gomp_malloc (acc_proflibs_sep - acc_proflibs + 1);
+	      memcpy (acc_proflib, acc_proflibs,
+		      acc_proflibs_sep - acc_proflibs);
+	      acc_proflib[acc_proflibs_sep - acc_proflibs] = '\0';
+	    }
+	  else
+	    {
+	      /* No ';' separator, so only one library.  */
+	      acc_proflib = acc_proflibs;
+	    }
+
+	  gomp_debug (0, "%s: dlopen (\"%s\")\n", __FUNCTION__, acc_proflib);
+	  void *dl_handle = dlopen (acc_proflib, RTLD_LAZY);
+	  if (dl_handle != NULL)
+	    {
+	      typeof (&acc_register_library) a_r_l
+		= dlsym (dl_handle, "acc_register_library");
+	      if (a_r_l == NULL)
+		goto dl_fail;
+	      gomp_debug (0, "  %s: calling %s:acc_register_library\n",
+			  __FUNCTION__, acc_proflib);
+	      a_r_l (acc_prof_register, acc_prof_unregister,
+		     acc_prof_lookup);
+	    }
+	  else
+	    {
+	    dl_fail:
+	      gomp_error ("while loading ACC_PROFLIB \"%s\": %s",
+			  acc_proflib, dlerror ());
+	      if (dl_handle != NULL)
+		{
+		  int err = dlclose (dl_handle);
+		  dl_handle = NULL;
+		  if (err != 0)
+		    goto dl_fail;
+		}
+	    }
+	}
+
+      if (acc_proflib != acc_proflibs)
+	{
+	  free (acc_proflib);
+
+	  acc_proflibs = acc_proflibs_sep + 1;
+	}
+      else
+	acc_proflibs = NULL;
+    }
+#endif /* PLUGIN_SUPPORT */
+}
+
+void
+acc_prof_register (acc_event_t ev, acc_prof_callback cb, acc_register_t reg)
+{
+  gomp_debug (0, "%s: ev=%d, cb=%p, reg=%d\n",
+	      __FUNCTION__, (int) ev, (void *) cb, (int) reg);
+
+
+  /* For any events to be dispatched, the user first has to register a
+     callback, which makes this here a good place for enabling the whole
+     machinery.  */
+  if (!GOACC_PROF_ENABLED)
+    __atomic_store_n (&goacc_prof_enabled, true, MEMMODEL_RELEASE);
+
+
+  enum
+  {
+    EVENT_KIND_BOGUS,
+    EVENT_KIND_NORMAL,
+    /* As end events invoke callbacks in the reverse order, we register these
+       in the reverse order here.  */
+    EVENT_KIND_END,
+  } event_kind = EVENT_KIND_BOGUS;
+  switch (ev)
+    {
+    case acc_ev_none:
+    case acc_ev_device_init_start:
+    case acc_ev_device_shutdown_start:
+    case acc_ev_runtime_shutdown:
+    case acc_ev_create:
+    case acc_ev_delete:
+    case acc_ev_alloc:
+    case acc_ev_free:
+    case acc_ev_enter_data_start:
+    case acc_ev_exit_data_start:
+    case acc_ev_update_start:
+    case acc_ev_compute_construct_start:
+    case acc_ev_enqueue_launch_start:
+    case acc_ev_enqueue_upload_start:
+    case acc_ev_enqueue_download_start:
+    case acc_ev_wait_start:
+      event_kind = EVENT_KIND_NORMAL;
+      break;
+    case acc_ev_device_init_end:
+    case acc_ev_device_shutdown_end:
+    case acc_ev_enter_data_end:
+    case acc_ev_exit_data_end:
+    case acc_ev_update_end:
+    case acc_ev_compute_construct_end:
+    case acc_ev_enqueue_launch_end:
+    case acc_ev_enqueue_upload_end:
+    case acc_ev_enqueue_download_end:
+    case acc_ev_wait_end:
+      event_kind = EVENT_KIND_END;
+      break;
+    case acc_ev_last:
+      break;
+    }
+  if (event_kind == EVENT_KIND_BOGUS)
+    {
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring request for bogus 'acc_event_t'\n");
+      return;
+    }
+
+  bool bogus = true;
+  switch (reg)
+    {
+    case acc_reg:
+    case acc_toggle:
+    case acc_toggle_per_thread:
+      bogus = false;
+      break;
+    }
+  if (bogus)
+    {
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring request with bogus 'acc_register_t'\n");
+      return;
+    }
+
+  /* Special cases.  */
+  if (reg == acc_toggle)
+    {
+      if (cb == NULL)
+	{
+	  gomp_debug (0, "  globally enabling callbacks\n");
+	  gomp_mutex_lock (&goacc_prof_lock);
+	  /* For 'acc_ev_none', this acts as a global toggle.  */
+	  goacc_prof_callbacks_enabled[ev] = true;
+	  gomp_mutex_unlock (&goacc_prof_lock);
+	  return;
+	}
+      else if (ev == acc_ev_none && cb != NULL)
+	{
+	  gomp_debug (0, "  ignoring request\n");
+	  return;
+	}
+    }
+  else if (reg == acc_toggle_per_thread)
+    {
+      if (ev == acc_ev_none && cb == NULL)
+	{
+	  gomp_debug (0, "  thread: enabling callbacks\n");
+	  goacc_lazy_initialize ();
+	  struct goacc_thread *thr = goacc_thread ();
+	  thr->prof_callbacks_enabled = true;
+	  return;
+	}
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring bogus request\n");
+      return;
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  struct goacc_prof_callback_entry *it, *it_p;
+  it = goacc_prof_callback_entries[ev];
+  it_p = NULL;
+  while (it)
+    {
+      if (it->cb == cb)
+	break;
+      it_p = it;
+      it = it->next;
+    }
+
+  switch (reg)
+    {
+    case acc_reg:
+      /* If we already have this callback registered, just increment its
+	 reference count.  */
+      if (it != NULL)
+	{
+	  it->ref++;
+	  gomp_debug (0, "  already registered;"
+		      " incrementing reference count to: %d\n", it->ref);
+	}
+      else
+	{
+	  struct goacc_prof_callback_entry *e
+	    = gomp_malloc (sizeof (struct goacc_prof_callback_entry));
+	  e->cb = cb;
+	  e->ref = 1;
+	  e->enabled = true;
+	  bool prepend = (event_kind == EVENT_KIND_END);
+	  /* If we don't have any callback registered yet, also use the
+	     'prepend' code path.  */
+	  if (it_p == NULL)
+	    prepend = true;
+	  if (prepend)
+	    {
+	      gomp_debug (0, "  prepending\n");
+	      e->next = goacc_prof_callback_entries[ev];
+	      goacc_prof_callback_entries[ev] = e;
+	    }
+	  else
+	    {
+	      gomp_debug (0, "  appending\n");
+	      e->next = NULL;
+	      it_p->next = e;
+	    }
+	}
+      break;
+
+    case acc_toggle:
+      if (it == NULL)
+	{
+	  gomp_debug (0, "  ignoring request: is not registered\n");
+	  break;
+	}
+      else
+	{
+	  gomp_debug (0, "  enabling\n");
+	  it->enabled = true;
+	}
+      break;
+
+    case acc_toggle_per_thread:
+      __builtin_unreachable ();
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
+
+void
+acc_prof_unregister (acc_event_t ev, acc_prof_callback cb, acc_register_t reg)
+{
+  gomp_debug (0, "%s: ev=%d, cb=%p, reg=%d\n",
+	      __FUNCTION__, (int) ev, (void *) cb, (int) reg);
+
+  /* If profiling is not enabled, there cannot be anything to unregister.  */
+  if (!GOACC_PROF_ENABLED)
+    return;
+
+  if (ev < acc_ev_none
+      || ev >= acc_ev_last)
+    {
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring request for bogus 'acc_event_t'\n");
+      return;
+    }
+
+  bool bogus = true;
+  switch (reg)
+    {
+    case acc_reg:
+    case acc_toggle:
+    case acc_toggle_per_thread:
+      bogus = false;
+      break;
+    }
+  if (bogus)
+    {
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring request with bogus 'acc_register_t'\n");
+      return;
+    }
+
+  /* Special cases.  */
+  if (reg == acc_toggle)
+    {
+      if (cb == NULL)
+	{
+	  gomp_debug (0, "  globally disabling callbacks\n");
+	  gomp_mutex_lock (&goacc_prof_lock);
+	  /* For 'acc_ev_none', this acts as a global toggle.  */
+	  goacc_prof_callbacks_enabled[ev] = false;
+	  gomp_mutex_unlock (&goacc_prof_lock);
+	  return;
+	}
+      else if (ev == acc_ev_none && cb != NULL)
+	{
+	  gomp_debug (0, "  ignoring request\n");
+	  return;
+	}
+    }
+  else if (reg == acc_toggle_per_thread)
+    {
+      if (ev == acc_ev_none && cb == NULL)
+	{
+	  gomp_debug (0, "  thread: disabling callbacks\n");
+	  goacc_lazy_initialize ();
+	  struct goacc_thread *thr = goacc_thread ();
+	  thr->prof_callbacks_enabled = false;
+	  return;
+	}
+      /* Silently ignore.  */
+      gomp_debug (0, "  ignoring bogus request\n");
+      return;
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  struct goacc_prof_callback_entry *it, *it_p;
+  it = goacc_prof_callback_entries[ev];
+  it_p = NULL;
+  while (it)
+    {
+      if (it->cb == cb)
+	break;
+      it_p = it;
+      it = it->next;
+    }
+
+  switch (reg)
+    {
+    case acc_reg:
+      if (it == NULL)
+	{
+	  /* Silently ignore.  */
+	  gomp_debug (0, "  ignoring bogus request: is not registered\n");
+	  break;
+	}
+      it->ref--;
+      gomp_debug (0, "  decrementing reference count to: %d\n", it->ref);
+      if (it->ref == 0)
+	{
+	  if (it_p == NULL)
+	    goacc_prof_callback_entries[ev] = it->next;
+	  else
+	    it_p->next = it->next;
+	  free (it);
+	}
+      break;
+
+    case acc_toggle:
+      if (it == NULL)
+	{
+	  gomp_debug (0, "  ignoring request: is not registered\n");
+	  break;
+	}
+      else
+	{
+	  gomp_debug (0, "  disabling\n");
+	  it->enabled = false;
+	}
+      break;
+
+    case acc_toggle_per_thread:
+      __builtin_unreachable ();
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
+
+acc_query_fn
+acc_prof_lookup (const char *name)
+{
+  gomp_debug (0, "%s (%s)\n",
+	      __FUNCTION__, name ?: "NULL");
+
+  return NULL;
+}
+
+void
+acc_register_library (acc_prof_reg reg, acc_prof_reg unreg,
+		      acc_prof_lookup_func lookup)
+{
+  gomp_fatal ("TODO");
+}
+
+/* Prepare to dispatch events?  */
+
+bool
+_goacc_profiling_dispatch_p (bool check_not_nested_p)
+{
+  gomp_debug (0, "%s\n", __FUNCTION__);
+
+  bool ret;
+
+  struct goacc_thread *thr = goacc_thread ();
+  if (__builtin_expect (thr == NULL, false))
+    {
+      /* If we don't have any per-thread state yet, that means that per-thread
+	 callback dispatch has not been explicitly disabled (which only a call
+	 to 'acc_prof_unregister' with 'acc_toggle_per_thread' would do, and
+	 that would have allocated per-thread state via
+	 'goacc_lazy_initialize'); initially, all callbacks for all events are
+	 enabled.  */
+      gomp_debug (0, "  %s: don't have any per-thread state yet\n", __FUNCTION__);
+    }
+  else
+    {
+      if (check_not_nested_p)
+	{
+	  /* No nesting.  */
+	  assert (thr->prof_info == NULL);
+	  assert (thr->api_info == NULL);
+	}
+
+      if (__builtin_expect (!thr->prof_callbacks_enabled, true))
+	{
+	  gomp_debug (0, "  %s: disabled for this thread\n", __FUNCTION__);
+	  ret = false;
+	  goto out;
+	}
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  /* 'goacc_prof_callbacks_enabled[acc_ev_none]' acts as a global toggle.  */
+  if (__builtin_expect (!goacc_prof_callbacks_enabled[acc_ev_none], true))
+    {
+      gomp_debug (0, "  %s: disabled globally\n", __FUNCTION__);
+      ret = false;
+      goto out_unlock;
+    }
+  else
+    ret = true;
+
+ out_unlock:
+  gomp_mutex_unlock (&goacc_prof_lock);
+
+ out:
+  return ret;
+}
+
+/* Set up to dispatch events?  */
+
+bool
+_goacc_profiling_setup_p (struct goacc_thread *thr,
+			  acc_prof_info *prof_info, acc_api_info *api_info)
+{
+  gomp_debug (0, "%s (%p)\n", __FUNCTION__, thr);
+
+  /* If we don't have any per-thread state yet, we can't register 'prof_info'
+     and 'api_info'.  */
+  if (__builtin_expect (thr == NULL, false))
+    {
+      gomp_debug (0, "Can't dispatch OpenACC Profiling Interface events for"
+		  " the current call, construct, or directive\n");
+      return false;
+    }
+
+  if (thr->prof_info != NULL)
+    {
+      /* Profiling has already been set up for an outer construct.  In this
+	 case, we continue to use the existing information, and thus return
+	 'false' here.
+
+	 This can happen, for example, for an 'enter data' directive, which
+	 sets up profiling, then calls into 'acc_copyin', which should not
+	 again set up profiling, should not overwrite the existing
+	 information.  */
+      return false;
+    }
+
+  thr->prof_info = prof_info;
+  thr->api_info = api_info;
+
+  /* Fill in some defaults.  */
+
+  prof_info->event_type = -1; /* Must be set later.  */
+  prof_info->valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+  prof_info->version = _ACC_PROF_INFO_VERSION;
+  if (thr->dev)
+    {
+      prof_info->device_type = acc_device_type (thr->dev->type);
+      prof_info->device_number = thr->dev->target_id;
+    }
+  else
+    {
+      prof_info->device_type = -1;
+      prof_info->device_number = -1;
+    }
+  prof_info->thread_id = -1;
+  prof_info->async = acc_async_sync;
+  prof_info->async_queue = prof_info->async;
+  prof_info->src_file = NULL;
+  prof_info->func_name = NULL;
+  prof_info->line_no = -1;
+  prof_info->end_line_no = -1;
+  prof_info->func_line_no = -1;
+  prof_info->func_end_line_no = -1;
+
+  api_info->device_api = acc_device_api_none;
+  api_info->valid_bytes = _ACC_API_INFO_VALID_BYTES;
+  api_info->device_type = prof_info->device_type;
+  api_info->vendor = -1;
+  api_info->device_handle = NULL;
+  api_info->context_handle = NULL;
+  api_info->async_handle = NULL;
+
+  return true;
+}
+
+/* Dispatch events.
+
+   This must only be called if 'GOACC_PROFILING_DISPATCH_P' or
+   'GOACC_PROFILING_SETUP_P' returned a true result.  */
+
+void
+goacc_profiling_dispatch (acc_prof_info *prof_info, acc_event_info *event_info,
+			  acc_api_info *apt_info)
+{
+  acc_event_t event_type = event_info->event_type;
+  gomp_debug (0, "%s: event_type=%d\n", __FUNCTION__, (int) event_type);
+  assert (event_type > acc_ev_none
+	  && event_type < acc_ev_last);
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  if (!goacc_prof_callbacks_enabled[event_type])
+    {
+      gomp_debug (0, "  disabled for this event type\n");
+
+      goto out_unlock;
+    }
+
+  for (struct goacc_prof_callback_entry *e
+	 = goacc_prof_callback_entries[event_type];
+       e != NULL;
+       e = e->next)
+    {
+      if (!e->enabled)
+	{
+	  gomp_debug (0, "  disabled for callback %p\n", e->cb);
+	  continue;
+	}
+
+      gomp_debug (0, "  calling callback %p\n", e->cb);
+      e->cb (prof_info, event_info, apt_info);
+    }
+
+ out_unlock:
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index cb18e87..09567ce 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -37,6 +37,7 @@
 #include "libgomp-plugin.h"
 #include "oacc-plugin.h"
 #include "gomp-constants.h"
+#include "oacc-int.h"
 
 #include <pthread.h>
 #include <cuda.h>
@@ -903,26 +904,121 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   // num_gangs		nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
+
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info enqueue_launch_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_launch_start;
+
+      enqueue_launch_event_info.launch_event.event_type
+	= prof_info->event_type;
+      enqueue_launch_event_info.launch_event.valid_bytes
+	= _ACC_LAUNCH_EVENT_INFO_VALID_BYTES;
+      enqueue_launch_event_info.launch_event.parent_construct
+	= acc_construct_parallel;
+      enqueue_launch_event_info.launch_event.implicit = 1;
+      enqueue_launch_event_info.launch_event.tool_info = NULL;
+      enqueue_launch_event_info.launch_event.kernel_name = targ_fn->launch->fn;
+      enqueue_launch_event_info.launch_event.num_gangs
+	= dims[GOMP_DIM_GANG];
+      enqueue_launch_event_info.launch_event.num_workers
+	= dims[GOMP_DIM_WORKER];
+      enqueue_launch_event_info.launch_event.vector_length
+	= dims[GOMP_DIM_VECTOR];
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
+					    api_info);
+    }
+
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		    0, stream, kargs, 0);
 
+  if (profiling_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_launch_end;
+      enqueue_launch_event_info.launch_event.event_type
+	= prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
+					    api_info);
+    }
+
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
 		     targ_fn->launch->fn);
 }
 
 void * openacc_get_current_cuda_context (void);
 
+static void
+goacc_profiling_acc_ev_alloc (struct goacc_thread *thr, void *dp, size_t s)
+{
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+
+  prof_info->event_type = acc_ev_alloc;
+
+  data_event_info.data_event.event_type = prof_info->event_type;
+  data_event_info.data_event.valid_bytes = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+  data_event_info.data_event.parent_construct = acc_construct_parallel;
+  data_event_info.data_event.implicit = 1;
+  data_event_info.data_event.tool_info = NULL;
+  data_event_info.data_event.var_name = NULL;
+  data_event_info.data_event.bytes = s;
+  data_event_info.data_event.host_ptr = NULL;
+  data_event_info.data_event.device_ptr = dp;
+
+  api_info->device_api = acc_device_api_cuda;
+
+  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, api_info);
+}
+
 static void *
 nvptx_alloc (size_t s)
 {
   CUdeviceptr d;
 
   CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  bool profiling_p
+    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
+  if (profiling_p)
+    goacc_profiling_acc_ev_alloc (thr, (void *) d, s);
+
   return (void *) d;
 }
 
+static void
+goacc_profiling_acc_ev_free (struct goacc_thread *thr, void *p)
+{
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+
+  prof_info->event_type = acc_ev_free;
+
+  data_event_info.data_event.event_type = prof_info->event_type;
+  data_event_info.data_event.valid_bytes = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+  data_event_info.data_event.parent_construct = acc_construct_parallel;
+  data_event_info.data_event.implicit = 1;
+  data_event_info.data_event.tool_info = NULL;
+  data_event_info.data_event.var_name = NULL;
+  data_event_info.data_event.bytes = -1;
+  data_event_info.data_event.host_ptr = NULL;
+  data_event_info.data_event.device_ptr = p;
+
+  api_info->device_api = acc_device_api_cuda;
+
+  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, api_info);
+}
+
 static bool
 nvptx_free (void *p, struct ptx_device *ptx_dev)
 {
@@ -950,6 +1046,12 @@ nvptx_free (void *p, struct ptx_device *ptx_dev)
     }
 
   CUDA_CALL (cuMemFree, (CUdeviceptr) p);
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  bool profiling_p
+    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
+  if (profiling_p)
+    goacc_profiling_acc_ev_free (thr, p);
+
   return true;
 }
 
@@ -1335,6 +1437,12 @@ openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_p = __builtin_expect (prof_info != NULL, false);
+
   void **hp = alloca (mapnum * sizeof (void *));
   CUdeviceptr dp = 0;
 
@@ -1347,17 +1455,50 @@ openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 	}
       else
 	{
+	  size_t s = mapnum * sizeof (void *);
 	  for (int i = 0; i < mapnum; i++)
 	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
+	  if (profiling_p)
+	    goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
 	}
     }
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
      fact have the same value on a unified-memory system).  */
   if (!params && mapnum > 0)
-    CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
-		      mapnum * sizeof (void *));
+    {
+      if (profiling_p)
+	{
+	  prof_info->event_type = acc_ev_enqueue_upload_start;
+
+	  data_event_info.data_event.event_type = prof_info->event_type;
+	  data_event_info.data_event.valid_bytes
+	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+	  data_event_info.data_event.parent_construct
+	    = acc_construct_parallel;
+	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
+	  data_event_info.data_event.tool_info = NULL;
+	  data_event_info.data_event.var_name = NULL;
+	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
+	  data_event_info.data_event.host_ptr = hp;
+	  data_event_info.data_event.device_ptr = (const void *) dp;
+
+	  api_info->device_api = acc_device_api_cuda;
+
+	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+						api_info);
+	}
+      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
+			mapnum * sizeof (void *));
+      if (profiling_p)
+	{
+	  prof_info->event_type = acc_ev_enqueue_upload_end;
+	  data_event_info.data_event.event_type = prof_info->event_type;
+	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+						api_info);
+	}
+    }
 
   if (params)
     nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
@@ -1378,7 +1519,11 @@ openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
 
   if (!params)
-    CUDA_CALL_ASSERT (cuMemFree, dp);
+    {
+      CUDA_CALL_ASSERT (cuMemFree, dp);
+      if (profiling_p)
+	goacc_profiling_acc_ev_free (thr, (void *) dp);
+    }
 }
 
 void
@@ -1415,6 +1560,12 @@ openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_p = __builtin_expect (prof_info != NULL, false);
+
   void **hp = NULL;
   CUdeviceptr dp = 0;
   void **block = NULL;
@@ -1429,11 +1580,14 @@ openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 	}
       else
 	{
+	  size_t s = mapnum * sizeof (void *);
 	  block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
 	  hp = block + 2;
 	  for (int i = 0; i < mapnum; i++)
 	    hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+	  CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
+	  if (profiling_p)
+	    goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
 	}
     }
 
@@ -1441,6 +1595,28 @@ openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
      fact have the same value on a unified-memory system).  */
   if (!params && mapnum > 0)
     {
+      if (profiling_p)
+	{
+	  prof_info->event_type = acc_ev_enqueue_upload_start;
+
+	  data_event_info.data_event.event_type = prof_info->event_type;
+	  data_event_info.data_event.valid_bytes
+	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
+	  data_event_info.data_event.parent_construct
+	    = acc_construct_parallel;
+	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
+	  data_event_info.data_event.tool_info = NULL;
+	  data_event_info.data_event.var_name = NULL;
+	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
+	  data_event_info.data_event.host_ptr = hp;
+	  data_event_info.data_event.device_ptr = (const void *) dp;
+
+	  api_info->device_api = acc_device_api_cuda;
+
+	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+						api_info);
+	}
+
       CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
 			mapnum * sizeof (void *), aq->cuda_stream);
       block[0] = (void *) dp;
@@ -1448,6 +1624,14 @@ openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
       struct nvptx_thread *nvthd =
 	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
       block[1] = (void *) nvthd->ptx_dev;
+
+      if (profiling_p)
+	{
+	  prof_info->event_type = acc_ev_enqueue_upload_end;
+	  data_event_info.data_event.event_type = prof_info->event_type;
+	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+						api_info);
+	}
     }
 
   if (params)
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
new file mode 100644
index 0000000..d929bfd
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
@@ -0,0 +1,353 @@
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+
+/* Use explicit 'copyin' clauses, to work around "'firstprivate'
+   optimizations", which will cause the value at the point of call to be used
+   (*before* any potential modifications done in callbacks), as opposed to its
+   address being taken, which then later gets dereferenced (*after* any
+   modifications done in callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+static int state = -1;
+
+#define STATE_OP(state, op) \
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+
+static void cb_compute_construct_start_1 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 0
+	  || state == 10
+	  || state == 30
+	  || state == 41
+	  || state == 51
+	  || state == 91
+	  || state == 101
+	  || state == 151);
+  STATE_OP (state, ++);
+}
+
+static void cb_compute_construct_start_2 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 1
+	  || state == 11
+	  || state == 40
+	  || state == 50
+	  || state == 90
+	  || state == 100
+	  || state == 150);
+  STATE_OP (state, ++);
+}
+
+static void cb_compute_construct_end_1 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 14
+	  || state == 21
+	  || state == 32
+	  || state == 42
+	  || state == 80
+	  || state == 103
+	  || state == 152);
+  STATE_OP (state, ++);
+}
+
+static void cb_compute_construct_end_2 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 13
+	  || state == 43
+	  || state == 102
+	  || state == 154);
+  STATE_OP (state, ++);
+}
+
+static void cb_compute_construct_end_3 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 12
+	  || state == 20
+	  || state == 31
+	  || state == 44
+	  || state == 81
+	  || state == 104
+	  || state == 153);
+  STATE_OP (state, ++);
+}
+
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  STATE_OP (state, = 0);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 2);
+  }
+  assert (state == 2);
+
+  STATE_OP (state, = 10);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 12);
+  }
+  assert (state == 15);
+
+  STATE_OP (state, = 20);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_toggle);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_toggle);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 20);
+  }
+  assert (state == 20);
+
+  STATE_OP (state, = 30);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_toggle);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 31);
+  }
+  assert (state == 33);
+
+  STATE_OP (state, = 40);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 42);
+  }
+  assert (state == 45);
+
+  STATE_OP (state, = 50);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 52);
+  }
+  assert (state == 52);
+
+  STATE_OP (state, = 60);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 60);
+  }
+  assert (state == 60);
+
+  STATE_OP (state, = 70);
+  unreg (acc_ev_compute_construct_start, NULL, acc_toggle);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 70);
+  }
+  assert (state == 70);
+
+  STATE_OP (state, = 80);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 80);
+  }
+  assert (state == 82);
+
+  STATE_OP (state, = 90);
+  reg (acc_ev_compute_construct_start, NULL, acc_toggle);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 92);
+  }
+  assert (state == 92);
+
+  STATE_OP (state, = 100);
+  reg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 102);
+  }
+  assert (state == 105);
+
+  STATE_OP (state, = 110);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 110);
+  }
+  assert (state == 110);
+
+  STATE_OP (state, = 120);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 120);
+  }
+  assert (state == 120);
+
+  STATE_OP (state, = 130);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 130);
+  }
+  assert (state == 130);
+
+  STATE_OP (state, = 140);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 140);
+  }
+  assert (state == 140);
+
+  STATE_OP (state, = 150);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 152);
+  }
+  assert (state == 155);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
new file mode 100644
index 0000000..b356feb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
@@ -0,0 +1,316 @@
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+
+/* Use explicit 'copyin' clauses, to work around "'firstprivate'
+   optimizations", which will cause the value at the point of call to be used
+   (*before* any potential modifications done in callbacks), as opposed to its
+   address being taken, which then later gets dereferenced (*after* any
+   modifications done in callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+static int state = -1;
+
+#define STATE_OP(state, op) \
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int acc_async;
+
+
+struct tool_info
+{
+  acc_event_info event_info;
+  struct tool_info *nested;
+};
+struct tool_info *tool_info;
+
+static void cb_device_init_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 0
+	  || state == 100);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_device_init_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  if (state == 1)
+    assert (prof_info->device_type == acc_device_host);
+  else
+    assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_runtime_api);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+}
+
+static void cb_device_init_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_device_init_start);
+
+  assert (prof_info->event_type == acc_ev_device_init_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  if (state == 2)
+    assert (prof_info->device_type == acc_device_host);
+  else
+    assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_runtime_api);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 10
+	  || state == 110);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_compute_construct_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+}
+
+static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 11
+	  || state == 111);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+
+  assert (prof_info->event_type == acc_ev_compute_construct_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  STATE_OP (state, = 0);
+  reg (acc_ev_device_init_start, cb_device_init_start, acc_reg);
+  reg (acc_ev_device_init_end, cb_device_init_end, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end, acc_reg);
+  assert (state == 0);
+
+  acc_init (acc_device_host);
+  assert (state == 2);
+
+  STATE_OP (state, = 10);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = 12;
+
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 11);
+  }
+  assert (state == 12);
+
+  STATE_OP (state, = 90);
+  acc_shutdown (acc_device_host);
+  assert (state == 90);
+
+
+  STATE_OP (state, = 100);
+  acc_init (acc_device_default);
+  assert (state == 102);
+
+  STATE_OP (state, = 110);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = 12;
+
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 111);
+  }
+  assert (state == 112);
+
+  STATE_OP (state, = 190);
+  acc_shutdown (acc_device_default);
+  assert (state == 190);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
new file mode 100644
index 0000000..7cfc364
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
@@ -0,0 +1,229 @@
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+
+/* Use explicit 'copyin' clauses, to work around "'firstprivate'
+   optimizations", which will cause the value at the point of call to be used
+   (*before* any potential modifications done in callbacks), as opposed to its
+   address being taken, which then later gets dereferenced (*after* any
+   modifications done in callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+
+/* See the 'DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT' reference in
+   'libgomp.texi'.  */
+#define DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT 0
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+volatile // TODO PR90488
+static int state = -1;
+
+#define STATE_OP(state, op) \
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int num_gangs, num_workers, vector_length;
+
+
+static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 0);
+  STATE_OP (state, = 1);
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == NULL);
+  assert (event_info->launch_event.kernel_name != NULL);
+  {
+    const char *s = strstr (event_info->launch_event.kernel_name, "main");
+    assert (s != NULL);
+    s = strstr (s, "omp_fn");
+    assert (s != NULL);
+  }
+  if (num_gangs < 1)
+    assert (event_info->launch_event.num_gangs >= 1);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.num_gangs == num_gangs);
+#else
+      /* No parallelized OpenACC 'kernels' constructs.  Unparallelized OpenACC
+	 'kernels' constructs must get launched as 1 x 1 x 1 GPU kernels.  */
+      assert (event_info->launch_event.num_gangs == 1);
+#endif
+    }
+  if (num_workers < 1)
+    assert (event_info->launch_event.num_workers >= 1);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.num_workers == num_workers);
+#else
+      /* See 'num_gangs' above.  */
+      assert (event_info->launch_event.num_workers == 1);
+#endif
+    }
+  if (vector_length < 1)
+    assert (event_info->launch_event.vector_length >= 1);
+  else if (acc_device_type == acc_device_nvidia) /* ... is special.  */
+    assert (event_info->launch_event.vector_length == 32);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.vector_length == vector_length);
+#else
+      /* See 'num_gangs' above.  */
+      assert (event_info->launch_event.vector_length == 1);
+#endif
+    }
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+}
+
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  STATE_OP (state, = 0);
+  reg (acc_ev_enqueue_launch_start, cb_enqueue_launch_start, acc_reg);
+  assert (state == 0);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  assert (state == 0);
+
+  /* Parallelism dimensions: compiler/runtime decides.  */
+  STATE_OP (state, = 0);
+  num_gangs = num_workers = vector_length = 0;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No 'acc_ev_enqueue_launch_start'.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  /* Parallelism dimensions: literal.  */
+  STATE_OP (state, = 0);
+  num_gangs = 30;
+  num_workers = 3;
+  vector_length = 5;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels \
+  num_gangs (30) num_workers (3) vector_length (5)
+    /* { dg-prune-output "using vector_length \\(32\\), ignoring 5" } */
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No 'acc_ev_enqueue_launch_start'.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  /* Parallelism dimensions: variable.  */
+  STATE_OP (state, = 0);
+  num_gangs = 22;
+  num_workers = 5;
+  vector_length = 7;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels \
+  num_gangs (num_gangs) num_workers (num_workers) vector_length (vector_length)
+    /* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No 'acc_ev_enqueue_launch_start'.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
new file mode 100644
index 0000000..ac6eb48
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -0,0 +1,719 @@
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+
+/* Use explicit 'copyin' clauses, to work around "'firstprivate'
+   optimizations", which will cause the value at the point of call to be used
+   (*before* any potential modifications done in callbacks), as opposed to its
+   address being taken, which then later gets dereferenced (*after* any
+   modifications done in callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+
+/* See the 'DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT' reference in
+   libgomp.texi.  */
+#define DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT 0
+
+
+/* Do we expect to see 'acc_ev_exit_data_start' and 'acc_ev_exit_data_end'
+   after a compute construct with an 'async' clause?  */
+#define ASYNC_EXIT_DATA 1
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+static int state = -1;
+
+#define STATE_OP(state, op) \
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int acc_async;
+
+
+struct tool_info
+{
+  acc_event_info event_info;
+  struct tool_info *nested;
+};
+struct tool_info *tool_info;
+
+static void cb_device_init_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+#else
+  assert (state == 0
+	  || state == 100);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+#endif
+
+  assert (prof_info->event_type == acc_ev_device_init_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+#else
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+#endif
+}
+
+static void cb_device_init_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 2
+	  || state == 102);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_device_init_start);
+#else
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_device_init_start);
+#endif
+
+  assert (prof_info->event_type == acc_ev_device_init_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (event_info->other_event.tool_info == tool_info->nested);
+#else
+  assert (event_info->other_event.tool_info == tool_info);
+#endif
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+#else
+  free (tool_info);
+  tool_info = NULL;
+#endif
+}
+
+static void cb_enter_data_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 3
+	  || state == 103);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_enter_data_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+static void cb_enter_data_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 4
+	  || state == 104);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_enter_data_start);
+
+  assert (prof_info->event_type == acc_ev_enter_data_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == tool_info->nested);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+static void cb_exit_data_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 7
+#if ASYNC_EXIT_DATA
+	  || state == 107
+#endif
+	  );
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_exit_data_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+static void cb_exit_data_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 8
+#if ASYNC_EXIT_DATA
+	  || state == 108
+#endif
+	  );
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_exit_data_start);
+
+  assert (prof_info->event_type == acc_ev_exit_data_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == tool_info->nested);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+static void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 0
+	  || state == 100);
+  if (state == 100)
+    {
+      /* Compensate for the missing 'acc_ev_device_init_start' and
+	 'acc_ev_device_init_end'.  */
+      state += 2;
+    }
+#else
+  if (state == 100)
+    {
+      /* Compensate for the missing 'acc_ev_device_init_start' and
+	 'acc_ev_device_init_end'.  */
+      state += 2;
+    }
+  assert (state == 2
+	  || state == 102);
+#endif
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_compute_construct_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+
+  if (acc_device_type == acc_device_host)
+    {
+      /* Compensate for the missing 'acc_ev_enter_data_start'.  */
+      state += 1;
+    }
+}
+
+static void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  if (acc_device_type == acc_device_host)
+    {
+      /* Compensate for the missing 'acc_ev_enter_data_end'.  */
+      state += 1;
+      /* Compensate for the missing 'acc_ev_enqueue_launch_start' and
+	 'acc_ev_enqueue_launch_end'.  */
+      state += 2;
+      /* Compensate for the missing 'acc_ev_exit_data_start' and
+	 'acc_ev_exit_data_end'.  */
+      state += 2;
+    }
+#if !ASYNC_EXIT_DATA
+  else if (acc_async != acc_async_sync)
+    {
+      /* Compensate for the missing 'acc_ev_exit_data_start' and
+	 'acc_ev_exit_data_end'.  */
+      state += 2;
+    }
+#endif
+  assert (state == 9
+	  || state == 109);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+
+  assert (prof_info->event_type == acc_ev_compute_construct_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+static void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 5
+	  || state == 105);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == NULL);
+  assert (event_info->launch_event.kernel_name != NULL);
+  {
+    const char *s = strstr (event_info->launch_event.kernel_name, "main");
+    assert (s != NULL);
+    s = strstr (s, "omp_fn");
+    assert (s != NULL);
+  }
+  assert (event_info->launch_event.num_gangs >= 1);
+  assert (event_info->launch_event.num_workers >= 1);
+  assert (event_info->launch_event.vector_length >= 1);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.launch_event.event_type = event_info->launch_event.event_type;
+  tool_info->nested->event_info.launch_event.kernel_name = strdup (event_info->launch_event.kernel_name);
+  tool_info->nested->event_info.launch_event.num_gangs = event_info->launch_event.num_gangs;
+  tool_info->nested->event_info.launch_event.num_workers = event_info->launch_event.num_workers;
+  tool_info->nested->event_info.launch_event.vector_length = event_info->launch_event.vector_length;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+static void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 6
+	  || state == 106);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.launch_event.event_type == acc_ev_enqueue_launch_start);
+  assert (tool_info->nested->event_info.launch_event.kernel_name != NULL);
+  assert (tool_info->nested->event_info.launch_event.num_gangs >= 1);
+  assert (tool_info->nested->event_info.launch_event.num_workers >= 1);
+  assert (tool_info->nested->event_info.launch_event.vector_length >= 1);
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == tool_info->nested);
+  assert (event_info->launch_event.kernel_name != NULL);
+  assert (strcmp (event_info->launch_event.kernel_name, tool_info->nested->event_info.launch_event.kernel_name) == 0);
+  assert (event_info->launch_event.num_gangs == tool_info->nested->event_info.launch_event.num_gangs);
+  assert (event_info->launch_event.num_workers == tool_info->nested->event_info.launch_event.num_workers);
+  assert (event_info->launch_event.vector_length == tool_info->nested->event_info.launch_event.vector_length);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free ((void *) tool_info->nested->event_info.launch_event.kernel_name);
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  STATE_OP (state, = 0);
+  reg (acc_ev_device_init_start, cb_device_init_start, acc_reg);
+  reg (acc_ev_device_init_end, cb_device_init_end, acc_reg);
+  reg (acc_ev_enter_data_start, cb_enter_data_start, acc_reg);
+  reg (acc_ev_enter_data_end, cb_enter_data_end, acc_reg);
+  reg (acc_ev_exit_data_start, cb_exit_data_start, acc_reg);
+  reg (acc_ev_exit_data_end, cb_exit_data_end, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end, acc_reg);
+  reg (acc_ev_enqueue_launch_start, cb_enqueue_launch_start, acc_reg);
+  reg (acc_ev_enqueue_launch_end, cb_enqueue_launch_end, acc_reg);
+  assert (state == 0);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = acc_async_sync;
+  assert (state == 0);
+
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      asm volatile ("" : : : "memory"); // TODO PR90488
+
+      state_init = state;
+    }
+    assert (state_init == 4);
+  }
+  assert (state == 10);
+
+  STATE_OP (state, = 100);
+
+  acc_async = 12;
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      asm volatile ("" : : : "memory"); // TODO PR90488
+
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 104);
+  }
+  assert (state == 110);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
new file mode 100644
index 0000000..5b58c51
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
@@ -0,0 +1,226 @@
+/* Test the 'valid_bytes' magic.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+static int ev_count_data;
+
+static void cb_data_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->data_event.valid_bytes == _ACC_DATA_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+
+  ++ev_count_data;
+}
+
+static int ev_count_launch;
+
+static void cb_launch_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+
+  ++ev_count_launch;
+}
+
+static int ev_count_other;
+
+static void cb_other_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+
+  ++ev_count_other;
+}
+
+
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg_ (acc_ev_device_init_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_init_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_runtime_shutdown, cb_other_event, acc_reg);
+  reg_ (acc_ev_create, cb_data_event, acc_reg);
+  reg_ (acc_ev_delete, cb_data_event, acc_reg);
+  reg_ (acc_ev_alloc, cb_data_event, acc_reg);
+  reg_ (acc_ev_free, cb_data_event, acc_reg);
+  reg_ (acc_ev_enter_data_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_enter_data_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_exit_data_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_exit_data_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_update_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_update_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_compute_construct_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_compute_construct_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_start, cb_launch_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_end, cb_launch_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_start, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_end, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_start, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_end, cb_data_event, acc_reg);
+  reg_ (acc_ev_wait_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_wait_end, cb_other_event, acc_reg);
+}
+
+
+/* Basic struct.  */
+typedef struct A
+{
+  int a;
+  int b;
+#define VALID_BYTES_A \
+  _ACC_PROF_VALID_BYTES_STRUCT (A, b, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (int))
+} A;
+
+/* Add a 'char' field.  */
+typedef struct B
+{
+  int a;
+  int b;
+  char c;
+#define VALID_BYTES_B \
+  _ACC_PROF_VALID_BYTES_STRUCT (B, c, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} B;
+
+/* Add another 'char' field.  */
+typedef struct C
+{
+  int a;
+  int b;
+  char c, d;
+#define VALID_BYTES_C \
+  _ACC_PROF_VALID_BYTES_STRUCT (C, d, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} C;
+
+/* Add two 'void *' fields.  */
+typedef struct D
+{
+  int a;
+  int b;
+  char c, d;
+  void *e;
+  void *f;
+#define VALID_BYTES_D \
+  _ACC_PROF_VALID_BYTES_STRUCT (D, f, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} D;
+
+/* Add another three 'char' fields.  */
+typedef struct E
+{
+  int a;
+  int b;
+  char c, d;
+  void *e;
+  void *f;
+  char g, h, i;
+#define VALID_BYTES_E \
+  _ACC_PROF_VALID_BYTES_STRUCT (E, i, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} E;
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  A A1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof A1, VALID_BYTES_A);
+  assert (VALID_BYTES_A <= sizeof A1);
+  DEBUG_printf ("&A1=%p, &A1.b=%p\n", &A1, &A1.b);
+  assert (((char *) &A1) + VALID_BYTES_A == (char *) (&A1.b + 1));
+
+  B B1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof B1, VALID_BYTES_B);
+  assert (VALID_BYTES_B <= sizeof B1);
+  DEBUG_printf ("&B1=%p, &B1.c=%p\n", &B1, &B1.c);
+  assert (((char *) &B1) + VALID_BYTES_B == (char *) (&B1.c + 1));
+
+  assert (VALID_BYTES_B == VALID_BYTES_A + 1 * sizeof (char));
+
+  C C1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof C1, VALID_BYTES_C);
+  assert (VALID_BYTES_C <= sizeof C1);
+  DEBUG_printf ("&C1=%p, &C1.d=%p\n", &C1, &C1.d);
+  assert (((char *) &C1) + VALID_BYTES_C == (char *) (&C1.d + 1));
+
+  assert (VALID_BYTES_C == VALID_BYTES_B + 1 * sizeof (char));
+
+  D D1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof D1, VALID_BYTES_D);
+  assert (VALID_BYTES_D <= sizeof D1);
+  DEBUG_printf ("&D1=%p, &D1.f=%p\n", &D1, &D1.f);
+  assert (((char *) &D1) + VALID_BYTES_D == (char *) (&D1.f + 1));
+
+  assert (VALID_BYTES_D > VALID_BYTES_C);
+
+  E E1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof E1, VALID_BYTES_E);
+  assert (VALID_BYTES_E <= sizeof E1);
+  DEBUG_printf ("&E1=%p, &E1.i=%p\n", &E1, &E1.i);
+  assert (((char *) &E1) + VALID_BYTES_E == (char *) (&E1.i + 1));
+
+  assert (VALID_BYTES_E == VALID_BYTES_D + 3 * sizeof (char));
+
+  ev_count_data = 0;
+  ev_count_launch = 0;
+  ev_count_other = 0;
+
+  /* Trigger tests done in 'cb_*' functions.  */
+  int host;
+#pragma acc parallel copyout (host)
+  {
+    asm volatile ("" : : : "memory"); // TODO PR90488
+
+    host = acc_on_device (acc_device_host);
+  }
+
+  DEBUG_printf ("ev_count_data = %d\n", ev_count_data);
+  if (host)
+    assert (ev_count_data == 0);
+  else
+    {
+      /* We don't know exactly how many data events to expect, but we at least
+	 expect some.  */
+      assert (ev_count_data > 0);
+    }
+
+  DEBUG_printf ("ev_count_launch = %d\n", ev_count_launch);
+  if (host)
+    assert (ev_count_data == 0);
+  else
+    {
+      /* We expect two launch events, 'acc_ev_enqueue_launch_start',
+	 'acc_ev_enqueue_launch_end'.  */
+      assert (ev_count_launch == 2);
+    }
+
+  DEBUG_printf ("ev_count_other = %d\n", ev_count_other);
+  /* We don't know exactly how many other events to expect, but we at least
+     expect 'acc_ev_device_init_start', 'acc_ev_device_init_end',
+     'acc_ev_compute_construct_start', 'acc_ev_compute_construct_end'.  */
+  assert (ev_count_other >= 4);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
new file mode 100644
index 0000000..f537868
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
@@ -0,0 +1,76 @@
+/* Test the 'version' field of 'acc_prof_info'.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+
+static int ev_count;
+
+
+static void cb_any_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->version == 201711);
+
+  ++ev_count;
+}
+
+
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg_ (acc_ev_device_init_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_init_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_runtime_shutdown, cb_any_event, acc_reg);
+  reg_ (acc_ev_create, cb_any_event, acc_reg);
+  reg_ (acc_ev_delete, cb_any_event, acc_reg);
+  reg_ (acc_ev_alloc, cb_any_event, acc_reg);
+  reg_ (acc_ev_free, cb_any_event, acc_reg);
+  reg_ (acc_ev_enter_data_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enter_data_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_exit_data_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_exit_data_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_update_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_update_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_compute_construct_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_compute_construct_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_wait_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_wait_end, cb_any_event, acc_reg);
+}
+
+
+int main()
+{
+  acc_register_library (acc_prof_register, acc_prof_unregister, acc_prof_lookup);
+
+  ev_count = 0;
+
+  /* Trigger tests done in 'cb_*' functions.  */
+#pragma acc parallel
+  {
+    asm volatile ("" : : : "memory"); // TODO PR90488
+  }
+
+  DEBUG_printf ("ev_count = %d\n", ev_count);
+  /* We don't know exactly how many events to expect, but we at least expect
+     'acc_ev_device_init_start', 'acc_ev_device_init_end',
+     'acc_ev_compute_construct_start', 'acc_ev_compute_construct_end'.  */
+  assert (ev_count >= 4);
+
+  return 0;
+}
Thomas Schwinge July 24, 2019, 10:45 a.m. UTC | #3
Hi Kwok!

On 2019-06-24T20:37:12+0100, Kwok Cheung Yeung <kcy@codesourcery.com> wrote:
> On 17/06/2019 6:24 pm, Thomas Schwinge wrote:
>>> Okay to push to openacc-gcc-9-branch?

>> I think what would be best, is the following approach: [...]

> I have now ported over the mainline patch over to OG9, plus an 
> additional patch on top of that to bring in the bits from OG8 that did 
> not make it upstream.
> 
> I have dropped the differences in comments, TODOs, documentation etc. in 
> favour of the upstream patch. There are also various places where the 
> OG8 patch sets up profiling, then gotos the end of the function to tear 
> it down again, whereas the mainline version just aborts early without 
> setting up profiling in the first place - these I have also resolved in 
> favour of the mainline version.

ACK, thanks!

> I have rerun the libgomp testsuite with no regressions noted.

(I didn't actually test, but just reviewed the code changes.)

> Okay to push to openacc-gcc-9-branch?

Yes, please push after re-testing with minor modifications in the second
commit, as follows:

> From 341285e282f5b7ed73daaeb9fd2f820dc1fe91f9 Mon Sep 17 00:00:00 2001
> From: Kwok Cheung Yeung <kcy@codesourcery.com>
> Date: Fri, 21 Jun 2019 10:40:38 -0700
> Subject: [PATCH 2/2] Add changes to profiling interface from OG8 branch
>
> This bundles up the parts of the profiling code from the OG8 branch that were
> not included in the upstream patch.

> +2017-02-28  Thomas Schwinge  <thomas@codesourcery.com>
> +
> +	[...]
> +	* oacc-parallel.c (GOACC_parallel_keyed_internal): Set device_api for
> +	profiling.

> --- a/libgomp/oacc-parallel.c
> +++ b/libgomp/oacc-parallel.c
> @@ -275,6 +275,8 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
>        goacc_call_host_fn (fn, mapnum, hostaddrs, params);
>        goto out_prof;
>      }
> +  else if (profiling_p)
> +    api_info.device_api = acc_device_api_cuda;

That change is not quite right, and I'm pretty sure it wasn't me who
introduced that code ;-P -- but that can be resolved later.

> --- a/libgomp/Makefile.in
> +++ b/libgomp/Makefile.in
> @@ -524,7 +525,7 @@ search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
>  
>  fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
>  libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
> -libgomp_la_LIBADD = $(LIBFFI)
> +@USE_LIBFFI_TRUE@libgomp_la_LIBADD = $(LIBFFI)

That must've been an earlier commit not properly regenerating the file,
but we shall not bother with that now.

> diff --git a/libgomp/config/nvptx/oacc-profiling-acc_register_library.c b/libgomp/config/nvptx/oacc-profiling-acc_register_library.c
> new file mode 100644
> index 0000000..e69de29
> diff --git a/libgomp/config/nvptx/oacc-profiling.c b/libgomp/config/nvptx/oacc-profiling.c
> new file mode 100644
> index 0000000..e69de29

Not sure if these empty files are actually needed, but no harm done, so
that can be resolved later.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
> @@ -41,6 +41,7 @@ static int state = -1;
>  static acc_device_t acc_device_type;
>  static int acc_device_num;
>  static int num_gangs, num_workers, vector_length;
> +static int async;

All these 'async' changes in this file logically belong into the
respective commit of the OpenACC 'kernels' changes.  It's not a problem
to have them included here; we shall just try to remember to include them
in the OpenACC 'kernels' trunk changes.  (I've made a note; no need for
you to re-test/re-post.)

> @@ -165,6 +166,15 @@ int main()
>        for (int i = 0; i < N; ++i)
>  	x[i] = i * i;
>      }
> +#ifdef __OPTIMIZE__
> +    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
> +       "state == 0" still holds.  It's not yet clear what's going on.
> +       Mis-optimization across the GOMP function call boundary?  Per its
> +       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
> +       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
> +       must expect calls back into this compilation unit?  */
> +    asm volatile ("" : : : "memory");
> +#endif

That workaround is no longer needed given the more specific workaround
that I've added, marked with "TODO PR90488".

> @@ -184,12 +196,22 @@ int main()
>        for (int i = 0; i < N; ++i)
>  	x[i] = i * i;
>      }
> +#ifdef __OPTIMIZE__
> +    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
> +       "state == 0" still holds.  It's not yet clear what's going on.
> +       Mis-optimization across the GOMP function call boundary?  Per its
> +       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
> +       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
> +       must expect calls back into this compilation unit?  */
> +    asm volatile ("" : : : "memory");
> +#endif

Likewise.

> @@ -209,12 +233,22 @@ int main()
>        for (int i = 0; i < N; ++i)
>  	x[i] = i * i;
>      }
> +#ifdef __OPTIMIZE__
> +    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
> +       "state == 0" still holds.  It's not yet clear what's going on.
> +       Mis-optimization across the GOMP function call boundary?  Per its
> +       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
> +       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
> +       must expect calls back into this compilation unit?  */
> +    asm volatile ("" : : : "memory");
> +#endif

Likewise.

> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
> @@ -697,6 +695,15 @@ int main()
>      }
>      assert (state_init == 4);
>    }
> +#ifdef __OPTIMIZE__
> +  /* TODO.  With -O2 optimizations enabled, the compiler believes that here
> +     "state == 0" still holds.  It's not yet clear what's going on.
> +     Mis-optimization across the GOMP function call boundary?  Per its
> +     gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
> +     "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
> +     must expect calls back into this compilation unit?  */
> +  asm volatile ("" : : : "memory");
> +#endif

Likewise.

In 'libgomp.oacc-c-c++-common/acc_prof-version-1.c:main', we should also
remove the explicit call to 'acc_register_library'.


Grüße
 Thomas
Kwok Cheung Yeung July 26, 2019, 9:35 p.m. UTC | #4
On 24/07/2019 11:45 am, Thomas Schwinge wrote:
>> +2017-02-28  Thomas Schwinge  <thomas@codesourcery.com>
>> +
>> +	[...]
>> +	* oacc-parallel.c (GOACC_parallel_keyed_internal): Set device_api for
>> +	profiling.
> 
>> --- a/libgomp/oacc-parallel.c
>> +++ b/libgomp/oacc-parallel.c
>> @@ -275,6 +275,8 @@ GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
>>         goacc_call_host_fn (fn, mapnum, hostaddrs, params);
>>         goto out_prof;
>>       }
>> +  else if (profiling_p)
>> +    api_info.device_api = acc_device_api_cuda;
> 
> That change is not quite right, and I'm pretty sure it wasn't me who
> introduced that code ;-P -- but that can be resolved later.
> 

git blame shows that that fragment was introduced in commit 
fef820dc593f3ee87dd79532701ed709e249eca4 (on the OG8 branch) named 'OpenACC 
async re-work'.

>> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
>> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
>> @@ -41,6 +41,7 @@ static int state = -1;
>>   static acc_device_t acc_device_type;
>>   static int acc_device_num;
>>   static int num_gangs, num_workers, vector_length;
>> +static int async;
> 
> All these 'async' changes in this file logically belong into the
> respective commit of the OpenACC 'kernels' changes.  It's not a problem
> to have them included here; we shall just try to remember to include them
> in the OpenACC 'kernels' trunk changes.  (I've made a note; no need for
> you to re-test/re-post.)
> 

This is because in OG8, the kernel changes came after the introduction of the 
profiling interface code, while on OG9 it is the other way around.

>> @@ -165,6 +166,15 @@ int main()
>>         for (int i = 0; i < N; ++i)
>>   	x[i] = i * i;
>>       }
>> +#ifdef __OPTIMIZE__
>> +    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
>> +       "state == 0" still holds.  It's not yet clear what's going on.
>> +       Mis-optimization across the GOMP function call boundary?  Per its
>> +       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
>> +       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
>> +       must expect calls back into this compilation unit?  */
>> +    asm volatile ("" : : : "memory");
>> +#endif
> 
> That workaround is no longer needed given the more specific workaround
> that I've added, marked with "TODO PR90488".
> 

Removed.

> In 'libgomp.oacc-c-c++-common/acc_prof-version-1.c:main', we should also
> remove the explicit call to 'acc_register_library'.
> 

Removed.
The patches have now been pushed to the openacc-gcc-9-branch git-only branch.

Kwok
diff mbox series

Patch

diff --git a/libgomp/ChangeLog.openacc b/libgomp/ChangeLog.openacc
index b3bcb31..bcca494 100644
--- a/libgomp/ChangeLog.openacc
+++ b/libgomp/ChangeLog.openacc
@@ -1,3 +1,99 @@ 
+2019-01-09  Julian Brown  <julian@codesourcery.com>
+
+	* acc_prof.h (_ACC_PROF_INFO_VERSION): Update to 201711.
+	* libgomp.texi: Update mentions of OpenACC version to 2.6.
+	* testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
+	(cb_any_event): Update expected profiling info version to 201711.
+
+2018-12-20  Maciej W. Rozycki  <macro@codesourcery.com>
+
+	* oacc-init.c (get_property_any): Add profiling code.
+
+2017-05-15  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* oacc-async.c (acc_async_test, acc_async_test_all, acc_wait)
+	(acc_wait_async, acc_wait_all, acc_wait_all_async): Set up
+	profiling.
+	* oacc-cuda.c (acc_get_current_cuda_device)
+	(acc_get_current_cuda_context, acc_get_cuda_stream)
+	(acc_set_cuda_stream): Likewise.
+	* oacc-init.c (acc_set_device_type, acc_get_device_type)
+	(acc_get_device_num): Likewise.
+	* oacc-mem.c (acc_malloc, acc_free, memcpy_tofrom_device)
+	(acc_map_data, acc_unmap_data, present_create_copy)
+	(delete_copyout, update_dev_host): Likewise.
+	* oacc-parallel.c (GOACC_data_start, GOACC_data_end)
+	(GOACC_enter_exit_data, GOACC_update, GOACC_wait): Likewise.
+	* oacc-profiling.c (goacc_profiling_setup_p): New function.
+	(goacc_profiling_dispatch_p): Add a "bool" formal parameter.
+	Adjust all users.
+	* oacc-int.h (goacc_profiling_setup_p)
+	(goacc_profiling_dispatch_p): Update.
+	* plugin/plugin-nvptx.c (nvptx_exec, nvptx_wait, nvptx_wait_all):
+	Generate more profiling events.
+	* libgomp.texi (OpenACC Environment Variables): Mention
+	"ACC_PROFLIB".
+	(OpenACC Profiling Interface): Update.
+
+2017-05-14  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-kernels-1.c: New
+	file.
+
+2017-02-28  Thomas Schwinge  <thomas@codesourcery.com>
+
+	* acc_prof.h: New file.
+	* oacc-profiling-acc_register_library.c: Likewise.
+	* oacc-profiling.c: Likewise.
+	* Makefile.am (nodist_libsubinclude_HEADERS, libgomp_la_SOURCES):
+	Add these, respectively.
+	* Makefile.in: Regenerate.
+	* libgomp/config/nvptx/oacc-profiling-acc_register_library.c:
+	New empty file.
+	* libgomp/config/nvptx/oacc-profiling.c: Likewise.
+	* env.c (initialize_env): Call goacc_profiling_initialize.
+	* libgomp-plugin.c: New function
+	GOMP_PLUGIN_goacc_profiling_dispatch.
+	* libgomp-plugin.h: Declare function
+	GOMP_PLUGIN_goacc_profiling_dispatch.
+	* oacc-plugin.c: New function GOMP_PLUGIN_goacc_thread.
+	* oacc-plugin.h: Declare function GOMP_PLUGIN_goacc_thread.
+	* libgomp.map (OACC_2.5): Add acc_prof_lookup, acc_prof_register,
+	acc_prof_unregister, and acc_register_library.
+	Add GOMP_PLUGIN_goacc_profiling_dispatch, and
+	GOMP_PLUGIN_goacc_thread with new GOMP_PLUGIN_1.3 symbol version.
+	* oacc-int.h (struct goacc_thread): Add "acc_prof_info
+	*prof_info", "acc_api_info *api_info", and "bool
+	prof_callbacks_enabled" members.
+	Declare functions goacc_profiling_initialize,
+	goacc_profiling_dispatch_p, and goacc_profiling_dispatch.
+	* oacc-init.c (acc_init_1): Add "acc_construct_t", and "int"
+	formal parameters.  Adjust all users.
+	(goacc_lazy_initialize): Don't call acc_init; copy code here,
+	instead.
+	(acc_init_1, goacc_attach_host_thread_to_device, acc_init)
+	(goacc_lazy_initialize): Update for OpenACC Profiling Interface.
+	* oacc-parallel.c (GOACC_parallel_keyed): Likewise.
+	* plugin/plugin-nvptx.c (cuda_map_create, cuda_map_destroy)
+	(map_init, map_fini, map_pop, map_push): Add "struct goacc_thread
+	*" formal parameter.  Adjust all users.
+	(select_stream_for_async, event_gc, nvptx_exec, nvptx_host2dev)
+	(nvptx_dev2host, nvptx_set_cuda_stream): Call
+	GOMP_PLUGIN_goacc_thread instead of nvptx_thread.
+	(cuda_map_create, cuda_map_destroy, nvptx_exec, nvptx_alloc)
+	(nvptx_free, nvptx_host2dev, nvptx_dev2host): Update for OpenACC
+	Profiling Interface.
+	* libgomp.texi: New chapter "OpenACC Profiling Interface".
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-dispatch-1.c: New
+	file.
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-init-1.c: Likewise.
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-parallel-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-valid_bytes-1.c:
+	Likewise.
+	* testsuite/libgomp.oacc-c-c-common/acc_prof-version-1.c:
+	Likewise.
+
 2019-05-28  Julian Brown  <julian@codesourcery.com>
 
 	* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 2db3865..a38b66ef 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -72,6 +72,7 @@  libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c error.c \
 	proc.c sem.c bar.c ptrlock.c time.c fortran.c affinity.c target.c \
 	splay-tree.c libgomp-plugin.c oacc-parallel.c oacc-host.c oacc-init.c \
 	oacc-mem.c oacc-async.c oacc-plugin.c oacc-cuda.c priority_queue.c \
+	oacc-profiling.c oacc-profiling-acc_register_library.c \
 	affinity-fmt.c teams.c
 
 include $(top_srcdir)/plugin/Makefrag.am
@@ -81,7 +82,7 @@  libgomp_la_SOURCES += openacc.f90
 endif
 
 nodist_noinst_HEADERS = libgomp_f.h
-nodist_libsubinclude_HEADERS = omp.h openacc.h
+nodist_libsubinclude_HEADERS = acc_prof.h omp.h openacc.h
 if USE_FORTRAN
 nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \
 	openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 820a576..d81e3d8 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -16,7 +16,7 @@ 
 
 # Plugins for offload execution, Makefile.am fragment.
 #
-# Copyright (C) 2014-2018 Free Software Foundation, Inc.
+# Copyright (C) 2014-2019 Free Software Foundation, Inc.
 #
 # Contributed by Mentor Embedded.
 #
@@ -215,8 +215,9 @@  am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	sem.lo bar.lo ptrlock.lo time.lo fortran.lo affinity.lo \
 	target.lo splay-tree.lo libgomp-plugin.lo oacc-parallel.lo \
 	oacc-host.lo oacc-init.lo oacc-mem.lo oacc-async.lo \
-	oacc-plugin.lo oacc-cuda.lo priority_queue.lo affinity-fmt.lo \
-	teams.lo $(am__objects_1)
+	oacc-plugin.lo oacc-cuda.lo priority_queue.lo \
+	oacc-profiling.lo oacc-profiling-acc_register_library.lo \
+	affinity-fmt.lo teams.lo $(am__objects_1)
 libgomp_la_OBJECTS = $(am_libgomp_la_OBJECTS)
 AM_V_P = $(am__v_P_@AM_V@)
 am__v_P_ = $(am__v_P_@AM_DEFAULT_V@)
@@ -524,7 +525,7 @@  search_path = $(addprefix $(top_srcdir)/config/, $(config_path)) $(top_srcdir) \
 
 fincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)$(MULTISUBDIR)/finclude
 libsubincludedir = $(libdir)/gcc/$(target_alias)/$(gcc_version)/include
-libgomp_la_LIBADD = $(LIBFFI)
+@USE_LIBFFI_TRUE@libgomp_la_LIBADD = $(LIBFFI)
 AM_CPPFLAGS = $(addprefix -I, $(search_path)) $(LIBFFIINCS)
 AM_CFLAGS = $(XCFLAGS)
 AM_LDFLAGS = $(XLDFLAGS) $(SECTION_LDFLAGS) $(OPT_LDFLAGS)
@@ -553,6 +554,7 @@  libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c \
 	affinity.c target.c splay-tree.c libgomp-plugin.c \
 	oacc-parallel.c oacc-host.c oacc-init.c oacc-mem.c \
 	oacc-async.c oacc-plugin.c oacc-cuda.c priority_queue.c \
+	oacc-profiling.c oacc-profiling-acc_register_library.c \
 	affinity-fmt.c teams.c $(am__append_3)
 
 # Nvidia PTX OpenACC plugin.
@@ -577,7 +579,7 @@  libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c \
 @PLUGIN_HSA_TRUE@libgomp_plugin_hsa_la_LIBADD = libgomp.la $(PLUGIN_HSA_LIBS)
 @PLUGIN_HSA_TRUE@libgomp_plugin_hsa_la_LIBTOOLFLAGS = --tag=disable-static
 nodist_noinst_HEADERS = libgomp_f.h
-nodist_libsubinclude_HEADERS = omp.h openacc.h
+nodist_libsubinclude_HEADERS = acc_prof.h omp.h openacc.h
 @USE_FORTRAN_TRUE@nodist_finclude_HEADERS = omp_lib.h omp_lib.f90 omp_lib.mod omp_lib_kinds.mod \
 @USE_FORTRAN_TRUE@	openacc_lib.h openacc.f90 openacc.mod openacc_kinds.mod
 
@@ -755,6 +757,8 @@  distclean-compile:
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-mem.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-parallel.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-plugin.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-profiling-acc_register_library.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/oacc-profiling.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/ordered.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/parallel.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/priority_queue.Plo@am__quote@
diff --git a/libgomp/acc_prof.h b/libgomp/acc_prof.h
new file mode 100644
index 0000000..fd11d4b
--- /dev/null
+++ b/libgomp/acc_prof.h
@@ -0,0 +1,237 @@ 
+/* OpenACC Runtime Library: Profiling Interface
+
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+   Contributed by Mentor Embedded.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef _ACC_PROF_H
+#define _ACC_PROF_H 1
+
+/* The OpenACC standard doesn't say so explicitly, but as its Profiling
+   Interface makes use of, for example, <openacc.h>'s acc_device_t, we
+   supposedly are to #include that file here.  */
+#include <openacc.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/* OpenACC 2.5, 5. Profiling Interface, 5.1. Events.  */
+
+typedef enum acc_event_t
+{
+  acc_ev_none = 0,
+  acc_ev_device_init_start,
+  acc_ev_device_init_end,
+  acc_ev_device_shutdown_start,
+  acc_ev_device_shutdown_end,
+  acc_ev_runtime_shutdown,
+  acc_ev_create,
+  acc_ev_delete,
+  acc_ev_alloc,
+  acc_ev_free,
+  acc_ev_enter_data_start,
+  acc_ev_enter_data_end,
+  acc_ev_exit_data_start,
+  acc_ev_exit_data_end,
+  acc_ev_update_start,
+  acc_ev_update_end,
+  acc_ev_compute_construct_start,
+  acc_ev_compute_construct_end,
+  acc_ev_enqueue_launch_start,
+  acc_ev_enqueue_launch_end,
+  acc_ev_enqueue_upload_start,
+  acc_ev_enqueue_upload_end,
+  acc_ev_enqueue_download_start,
+  acc_ev_enqueue_download_end,
+  acc_ev_wait_start,
+  acc_ev_wait_end,
+  acc_ev_last
+} acc_event_t;
+
+
+/* OpenACC 2.5, 5. Profiling Interface, 5.2. Callbacks Signature.  */
+
+//TODO
+/* 'In all cases, a datatype of "size_t" means a 32-bit integer for a 32-bit
+   binary and a 64-bit integer for a 64-bit binary, and a datatype "int" means
+   a 32-bit integer for both 32-bit and 64-bit binaries'.  */
+typedef long int _acc_prof_size_t;
+typedef int _acc_prof_int_t;
+
+/* Internal helpers: a struct's "valid_bytes" may be less than its "sizeof".  */
+#define _ACC_PROF_VALID_BYTES_STRUCT(_struct, _lastfield, _valid_bytes_lastfield) \
+  offsetof (_struct, _lastfield) + (_valid_bytes_lastfield)
+#if 0 /* Untested.  */
+#define _ACC_PROF_VALID_BYTES_TYPE_N(_type, _n, _valid_bytes_type) \
+  ((_n - 1) * sizeof (_type) + (_valid_bytes_type))
+#endif
+#define _ACC_PROF_VALID_BYTES_BASICTYPE(_basictype) \
+  (sizeof (_basictype))
+
+typedef struct acc_prof_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  _acc_prof_int_t version;
+  acc_device_t device_type;
+  _acc_prof_int_t device_number;
+  _acc_prof_int_t thread_id;
+  _acc_prof_size_t async;
+  _acc_prof_size_t async_queue;
+  char *src_file;
+  char *func_name;
+  _acc_prof_int_t line_no, end_line_no;
+  _acc_prof_int_t func_line_no, func_end_line_no;
+#define _ACC_PROF_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_prof_info, func_end_line_no, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (_acc_prof_int_t))
+} acc_prof_info;
+
+/* We implement the OpenACC 2.6 Profiling Interface.  */
+#define _ACC_PROF_INFO_VERSION 201711
+
+typedef enum acc_construct_t
+{
+  acc_construct_parallel = 0,
+  acc_construct_kernels,
+  acc_construct_loop,
+  acc_construct_data,
+  acc_construct_enter_data,
+  acc_construct_exit_data,
+  acc_construct_host_data,
+  acc_construct_atomic,
+  acc_construct_declare,
+  acc_construct_init,
+  acc_construct_shutdown,
+  acc_construct_set,
+  acc_construct_update,
+  acc_construct_routine,
+  acc_construct_wait,
+  acc_construct_runtime_api
+} acc_construct_t;
+
+typedef struct acc_data_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+  char *var_name;
+  _acc_prof_size_t bytes;
+  void *host_ptr;
+  void *device_ptr;
+#define _ACC_DATA_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_data_event_info, device_ptr, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_data_event_info;
+
+typedef struct acc_launch_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+  char *kernel_name;
+  _acc_prof_size_t num_gangs, num_workers, vector_length;
+#define _ACC_LAUNCH_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_launch_event_info, vector_length, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (_acc_prof_size_t))
+} acc_launch_event_info;
+
+typedef struct acc_other_event_info
+{
+  acc_event_t event_type;
+  _acc_prof_int_t valid_bytes;
+  acc_construct_t parent_construct;
+  _acc_prof_int_t implicit;
+  void *tool_info;
+#define _ACC_OTHER_EVENT_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_other_event_info, tool_info, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_other_event_info;
+
+typedef union acc_event_info
+{
+  acc_event_t event_type;
+  acc_data_event_info data_event;
+  acc_launch_event_info launch_event;
+  acc_other_event_info other_event;
+} acc_event_info;
+
+//TODO: should these relate to acc_device_t values?
+typedef enum acc_device_api
+{
+  acc_device_api_none = 0,
+  acc_device_api_cuda,
+  acc_device_api_opencl,
+  acc_device_api_coi,
+  acc_device_api_other
+} acc_device_api;
+
+typedef struct acc_api_info
+{
+  acc_device_api device_api;
+  _acc_prof_int_t valid_bytes;
+  acc_device_t device_type;
+  _acc_prof_int_t vendor;
+  void *device_handle;
+  void *context_handle;
+  void *async_handle;
+#define _ACC_API_INFO_VALID_BYTES \
+  _ACC_PROF_VALID_BYTES_STRUCT (acc_api_info, async_handle, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} acc_api_info;
+
+typedef void (*acc_prof_callback) (acc_prof_info *, acc_event_info *,
+				   acc_api_info *);
+
+
+/* OpenACC 2.5, 5. Profiling Interface, 5.3. Loading the Library.  */
+
+typedef enum acc_register_t
+{
+  acc_reg = 0,
+  acc_toggle = 1,
+  acc_toggle_per_thread = 2
+} acc_register_t;
+
+typedef void (*acc_prof_reg) (acc_event_t, acc_prof_callback, acc_register_t);
+extern void acc_prof_register (acc_event_t, acc_prof_callback, acc_register_t) __GOACC_NOTHROW;
+extern void acc_prof_unregister (acc_event_t, acc_prof_callback, acc_register_t) __GOACC_NOTHROW;
+typedef void (*acc_query_fn) ();
+typedef acc_query_fn (*acc_prof_lookup_func) (const char *);
+extern acc_query_fn acc_prof_lookup (const char *) __GOACC_NOTHROW;
+/* Don't tag "acc_register_library" as "__GOACC_NOTHROW": this function can be
+   overridden by the application, and must be expected to do "everything".  */
+extern void acc_register_library (acc_prof_reg, acc_prof_reg, acc_prof_lookup_func);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _ACC_PROF_H */
diff --git a/libgomp/config/nvptx/oacc-profiling-acc_register_library.c b/libgomp/config/nvptx/oacc-profiling-acc_register_library.c
new file mode 100644
index 0000000..e69de29
diff --git a/libgomp/config/nvptx/oacc-profiling.c b/libgomp/config/nvptx/oacc-profiling.c
new file mode 100644
index 0000000..e69de29
diff --git a/libgomp/env.c b/libgomp/env.c
index 7937fbb..29d9f58 100644
--- a/libgomp/env.c
+++ b/libgomp/env.c
@@ -1425,5 +1425,7 @@  initialize_env (void)
   parse_gomp_openacc_dim ();
 
   goacc_runtime_initialize ();
+
+  goacc_profiling_initialize ();
 }
 #endif /* LIBGOMP_OFFLOADED_ONLY */
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 5b77a53..f9f77b1 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -29,6 +29,7 @@ 
 #include <stdlib.h>
 
 #include "libgomp.h"
+#include "oacc-int.h"
 #include "libgomp-plugin.h"
 
 void *
@@ -78,3 +79,11 @@  GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_goacc_profiling_dispatch (acc_prof_info *prof_info,
+				      acc_event_info *event_info,
+				      acc_api_info *api_info)
+{
+  goacc_profiling_dispatch (prof_info, event_info, api_info);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index da8ce26..81819db 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -33,6 +33,8 @@ 
 #include <stddef.h>
 #include <stdint.h>
 
+#include "acc_prof.h"
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -95,6 +97,10 @@  extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
 	__attribute__ ((noreturn, format (printf, 1, 2)));
 
+extern void GOMP_PLUGIN_goacc_profiling_dispatch (acc_prof_info *,
+						  acc_event_info *,
+						  acc_api_info *);
+
 /* Prototypes for functions implemented by libgomp plugins.  */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 717ae66..d07f4fe 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -466,6 +466,10 @@  OACC_2.5 {
 	acc_delete_finalize_async_array_h_;
 	acc_memcpy_from_device_async;
 	acc_memcpy_to_device_async;
+	acc_prof_lookup;
+	acc_prof_register;
+	acc_prof_unregister;
+	acc_register_library;
 	acc_update_device_async;
 	acc_update_device_async_32_h_;
 	acc_update_device_async_64_h_;
@@ -536,3 +540,10 @@  GOMP_PLUGIN_1.2 {
   global:
 	GOMP_PLUGIN_acc_default_dim;
 } GOMP_PLUGIN_1.1;
+
+# TODO
+GOMP_PLUGIN_1.3 {
+  global:
+	GOMP_PLUGIN_goacc_profiling_dispatch;
+	GOMP_PLUGIN_goacc_thread;
+} GOMP_PLUGIN_1.2;
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 7538846..25c17e0 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -111,6 +111,7 @@  changed to GNU Offloading and Multi Processing Runtime Library.
                                asynchronous operations.
 * OpenACC Library Interoperability:: OpenACC library interoperability with the
                                NVIDIA CUBLAS library.
+* OpenACC Profiling Interface::
 * The libgomp ABI::            Notes on the external ABI presented by libgomp.
 * Reporting Bugs::             How to report bugs in the GNU Offloading and
                                Multi Processing Runtime Library.
@@ -2874,11 +2875,14 @@  A.2.1.4.
 
 The variables @env{ACC_DEVICE_TYPE} and @env{ACC_DEVICE_NUM}
 are defined by section 4 of the OpenACC specification in version 2.6.
+The variable @code{ACC_PROFLIB} is defined by section 4 of the OpenACC
+specification in version 2.6.
 The variable @env{GCC_ACC_NOTIFY} is used for diagnostic purposes.
 
 @menu
 * ACC_DEVICE_TYPE::
 * ACC_DEVICE_NUM::
+* ACC_PROFLIB::
 * GCC_ACC_NOTIFY::
 @end menu
 
@@ -2904,6 +2908,19 @@  The variable @env{GCC_ACC_NOTIFY} is used for diagnostic purposes.
 
 
 
+@node ACC_PROFLIB
+@section @code{ACC_PROFLIB}
+@table @asis
+@item @emph{See also}:
+@ref{OpenACC Profiling Interface}
+
+@item @emph{Reference}:
+@uref{https://www.openacc.org, OpenACC specification v2.6}, section
+4.3.
+@end table
+
+
+
 @node GCC_ACC_NOTIFY
 @section @code{GCC_ACC_NOTIFY}
 @table @asis
@@ -3120,6 +3137,297 @@  Application Programming Interface”, Version 2.6.}
 
 
 @c ---------------------------------------------------------------------
+@c OpenACC Profiling Interface
+@c ---------------------------------------------------------------------
+
+@node OpenACC Profiling Interface
+@chapter OpenACC Profiling Interface
+
+@section Implementation Status and Implementation-Defined Behavior
+
+We're implementing most of the Profiling Interface as defined by
+the OpenACC 2.6 specification.  The specification doesn't
+clearly define some aspects of its Profiling Interface, so we're
+clarifying these as @emph{implementation-defined behavior} here.  We
+already have reported to the OpenACC Technical Committee some issues,
+and will report more, later on.
+
+This implementation of the OpenACC Profiling Interface is tuned to
+keep the performance impact as low as possible when it's not in use.
+This is relevant, as the Profiling Interface affects all the
+@emph{hot} code paths (in the target code, not in the offloaded code).
+Users of the OpenACC Profiling Interface can be expected to understand
+that performance will always be impacted to some degree: for example,
+because of the @emph{runtime} (libgomp) calling into a third-party
+@emph{library} for every event that has been registered.
+
+This implementation of the OpenACC Profiling Interface has not yet
+been validated for use in multi-threaded code.  This is a more general
+issue; see CSTS-110 @cite{Make sure all OpenACC entry points in
+libgomp are thread-safe}.
+
+The @code{acc_prof_lookup} interface is not implemented, and
+@code{acc_register_library} will receive @code{NULL} for its
+@code{lookup} parameter.
+
+Remarks about data provided to callbacks:
+
+@table @asis
+
+@item @code{acc_prof_info.event_type}
+It is not clear if for @emph{nested} event callbacks (for example,
+@code{acc_ev_enqueue_launch_start} as part of a parent compute
+construct), this should be set for the nested event
+(@code{acc_ev_enqueue_launch_start}), or if the value of the parent
+construct should remain (@code{acc_ev_compute_construct_start}).  In
+this implementation, the value will generally correspond to the
+innermost nested event type.
+
+@item @code{acc_prof_info.device_type}
+@itemize
+
+@item
+For @code{acc_ev_compute_construct_start}, and in presence of an
+@code{if} clause with @emph{false} argument, this will still refer to
+the offloading device type; unsure whether that's the expected
+behavior.
+
+@item
+Complementary to the item before, for
+@code{acc_ev_compute_construct_end}, this is set to
+@code{acc_device_host} in presence of an @code{if} clause with
+@emph{false} argument, unsure whether that's the expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.thread_id}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.async}
+@itemize
+
+@item
+Not yet implemented correctly for
+@code{acc_ev_compute_construct_start}.
+
+@item
+In a compute construct, for host-fallback
+execution/@code{acc_device_host} it will always be
+@code{acc_async_sync}; unsure if that is the expected behavior.
+
+@item
+For @code{acc_ev_device_init_start} and @code{acc_ev_device_init_end},
+it will always be @code{acc_async_sync}; unsure if that is the
+expected behavior.
+
+@end itemize
+
+@item @code{acc_prof_info.async_queue}
+There is no @cite{limited number of asynchronous queues} in libgomp.
+We define this to always have the same value as
+@code{acc_prof_info.async}.
+
+@item @code{acc_prof_info.file}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.func_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_prof_info.line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_prof_info.func_end_line_no}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_event_info.event_type}, @code{acc_event_info.*.event_type}
+Relating to @code{acc_prof_info.event_type} discussed above, in this
+implementation, this will always be the same value as
+@code{acc_prof_info.event_type}.
+
+@item @code{acc_event_info.*.parent_construct}
+@itemize
+
+@item
+Will be @code{acc_construct_parallel} for OpenACC kernels constructs;
+should be @code{acc_construct_kernels}.
+
+@item
+Will be @code{acc_construct_enter_data} or
+@code{acc_construct_exit_data} when processing variable mappings
+specified in OpenACC declare directives; should be
+@code{acc_construct_declare}.
+
+@item
+For implicit @code{acc_ev_device_init_start},
+@code{acc_ev_device_init_end}, and explicit as well as implicit
+@code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start},
+@code{acc_ev_enqueue_download_end}, @code{acc_ev_wait_start}, and
+@code{acc_ev_wait_end}, will be
+@code{acc_construct_parallel}; should reflect the real parent
+construct.
+
+@end itemize
+
+@item @code{acc_event_info.*.implicit}
+For @code{acc_ev_alloc}, @code{acc_ev_free},
+@code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end},
+@code{acc_ev_enqueue_download_start},
+@code{acc_ev_enqueue_download_end}, @code{acc_ev_wait_start}, and
+@code{acc_ev_wait_end}, this currently will be @code{1}
+also for explicit usage.
+
+@item @code{acc_event_info.data_event.var_name}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_event_info.data_event.host_ptr}
+For @code{acc_ev_alloc}, and @code{acc_ev_free}, this is always
+@code{NULL}.
+
+@item @code{typedef union acc_api_info}
+@dots{} as printed in @cite{5.2.3. Third Argument: API-Specific
+Information}, should obviously be @code{typedef @emph{struct}
+acc_api_info}.
+
+@item @code{acc_api_info.device_api}
+Possibly not yet implemented correctly for
+@code{acc_ev_compute_construct_start},
+@code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}:
+will always be @code{acc_device_api_none} for these event types.
+For @code{acc_ev_enter_data_start}, it will be
+@code{acc_device_api_none} in some cases.
+
+@item @code{acc_api_info.device_type}
+Always the same as @code{acc_prof_info.device_type}.
+
+@item @code{acc_api_info.vendor}
+Always @code{-1}; not yet implemented.
+
+@item @code{acc_api_info.device_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.context_handle}
+Always @code{NULL}; not yet implemented.
+
+@item @code{acc_api_info.async_handle}
+Always @code{NULL}; not yet implemented.
+
+@end table
+
+Remarks about certain event types:
+
+@table @asis
+
+@item @code{acc_ev_device_init_start}, @code{acc_ev_device_init_end}
+@itemize
+
+@item
+@c See DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT in
+@c libgomp.oacc-c-c++-common/acc_prof-parallel-1.c.
+Whan a compute construct triggers implicit
+@code{acc_ev_device_init_start} and @code{acc_ev_device_init_end}
+events, they currently aren't @emph{nested within} the corresponding
+@code{acc_ev_compute_construct_start} and
+@code{acc_ev_compute_construct_end}, but they're currently observed
+@emph{before} @code{acc_ev_compute_construct_start}.  It is not clear
+what to do: the standard asks us provide a lot of details to the
+@code{acc_ev_compute_construct_start} callback, without (implicitly)
+initializing a device before?
+
+@item
+Callbacks for these event types will not be invoked for calls to the
+@code{acc_set_device_type} and @code{acc_set_device_num} functions;
+it's not clear if they should be.
+
+@end itemize
+
+@item @code{acc_ev_enter_data_start}, @code{acc_ev_enter_data_end}, @code{acc_ev_exit_data_start}, @code{acc_ev_exit_data_end}
+@itemize
+
+@item
+Callbacks for these event types will also be invoked for OpenACC
+host_data constructs; it's not clear if they should be.
+
+@item
+Callbacks for these event types will also be invoked when processing
+variable mappings specified in OpenACC declare directives; it's not
+clear if they should be.
+
+@end itemize
+
+@end table
+
+Callbacks for the following event types will be invoked, but dispatch
+and information provided therein has not yet been thoroughly reviewed:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@item @code{acc_ev_update_start}, @code{acc_ev_update_end}
+@item @code{acc_ev_enqueue_upload_start}, @code{acc_ev_enqueue_upload_end}
+@item @code{acc_ev_enqueue_download_start}, @code{acc_ev_enqueue_download_end}
+@item @code{acc_ev_wait_start}, @code{acc_ev_wait_end}
+@end itemize
+
+During device initialization, and finalization, respectively,
+callbacks for the following event types will not yet be invoked:
+
+@itemize
+@item @code{acc_ev_alloc}
+@item @code{acc_ev_free}
+@end itemize
+
+Callbacks for the following event types have not yet been implemented,
+so currently won't be invoked:
+
+@itemize
+@item @code{acc_ev_device_shutdown_start}, @code{acc_ev_device_shutdown_end}
+@item @code{acc_ev_runtime_shutdown}
+@item @code{acc_ev_create}, @code{acc_ev_delete}
+@end itemize
+
+For the following runtime library functions, not all expected
+callbacks will be invoked (mostly concerning implicit device
+initialization):
+
+@itemize
+@item @code{acc_get_num_devices}
+@item @code{acc_set_device_type}
+@item @code{acc_get_device_type}
+@item @code{acc_set_device_num}
+@item @code{acc_get_device_num}
+@item @code{acc_init}
+@item @code{acc_shutdown}
+@end itemize
+
+Aside from implicit device initialization, for the following runtime
+library functions, no callbacks will be invoked for shared-memory
+offloading devices (it's not clear if they should be):
+
+@itemize
+@item @code{acc_malloc}
+@item @code{acc_free}
+@item @code{acc_copyin}, @code{acc_present_or_copyin}, @code{acc_copyin_async}
+@item @code{acc_create}, @code{acc_present_or_create}, @code{acc_create_async}
+@item @code{acc_copyout}, @code{acc_copyout_async}
+@item @code{acc_delete}, @code{acc_delete_async}
+@item @code{acc_update_device}, @code{acc_update_device_async}
+@item @code{acc_update_self}, @code{acc_update_self_async}
+@item @code{acc_map_data}, @code{acc_unmap_data}
+@item @code{acc_memcpy_to_device}, @code{acc_memcpy_to_device_async}
+@item @code{acc_memcpy_from_device}, @code{acc_memcpy_from_device_async}
+@end itemize
+
+
+
+@c ---------------------------------------------------------------------
 @c The libgomp ABI
 @c ---------------------------------------------------------------------
 
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 51bb676..bf56560 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -125,7 +125,7 @@  lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
 
   ret_aq = dev->openacc.async.asyncqueue[async];
 
- end:
+end:
   gomp_mutex_unlock (&dev->openacc.async.lock);
   return ret_aq;
 }
@@ -149,11 +149,30 @@  acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
-  if (!aq)
-    return 1;
-  else
-    return thr->dev->openacc.async.test_func (aq);
+  int res = 1;
+  if (aq)
+    res = thr->dev->openacc.async.test_func (aq);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
+  return res;
 }
 
 int
@@ -161,6 +180,12 @@  acc_async_test_all (void)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   int ret = 1;
   gomp_mutex_lock (&thr->dev->openacc.async.lock);
   for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
@@ -170,6 +195,13 @@  acc_async_test_all (void)
 	break;
       }
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   return ret;
 }
 
@@ -178,9 +210,27 @@  acc_wait (int async)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
   if (aq && !thr->dev->openacc.async.synchronize_func (aq))
     gomp_fatal ("wait on %d failed", async);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -199,16 +249,28 @@  acc_wait_async (int async1, int async2)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async2; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
   /* TODO: Is this also correct for acc_async_sync, assuming that in this case,
      we'll always be synchronous anyways?  */
   if (!aq1)
-    return;
+    goto end;
 
   goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
   /* An async queue is always synchronized with itself.  */
   if (aq1 == aq2)
-    return;
+    goto end;
 
   if (aq2)
     {
@@ -222,12 +284,26 @@  acc_wait_async (int async1, int async2)
       if (!thr->dev->openacc.async.synchronize_func (aq1))
 	gomp_fatal ("wait on %d failed", async1);
     }
+
+end:
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 acc_wait_all (void)
 {
   struct gomp_device_descr *dev = get_goacc_thread_device ();
+  struct goacc_thread *thr = goacc_thread ();
+
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
 
   bool ret = true;
   gomp_mutex_lock (&dev->openacc.async.lock);
@@ -237,6 +313,12 @@  acc_wait_all (void)
 
   if (!ret)
     gomp_fatal ("wait all failed");
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -255,6 +337,18 @@  acc_wait_all_async (int async)
 {
   struct goacc_thread *thr = get_goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
   bool ret = true;
@@ -272,6 +366,12 @@  acc_wait_all_async (int async)
 
   if (!ret)
     gomp_fatal ("wait all async(%d) failed", async);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 attribute_hidden void
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index 1ca7149..4ba0bb3 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -37,10 +37,23 @@  acc_get_current_cuda_device (void)
 {
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_current_device_func)
-    return thr->dev->openacc.cuda.get_current_device_func ();
+    ret = thr->dev->openacc.cuda.get_current_device_func ();
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 
-  return NULL;
+  return ret;
 }
 
 void *
@@ -48,10 +61,23 @@  acc_get_current_cuda_context (void)
 {
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_current_context_func)
-    return thr->dev->openacc.cuda.get_current_context_func ();
- 
-  return NULL;
+    ret = thr->dev->openacc.cuda.get_current_context_func ();
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
+  return ret;
 }
 
 void *
@@ -62,14 +88,33 @@  acc_get_cuda_stream (int async)
   if (!async_valid_p (async))
     return NULL;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
+  void *ret = NULL;
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
     {
       goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
       if (aq)
-	return thr->dev->openacc.cuda.get_stream_func (aq);
+	ret = thr->dev->openacc.cuda.get_stream_func (aq);
     }
  
-  return NULL;
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
+  return ret;
 }
 
 /* As of OpenACC 2.6, the return code of this function appears to be
@@ -87,6 +132,18 @@  acc_set_cuda_stream (int async, void *stream)
 
   thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   int ret = 0;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
     {
@@ -98,11 +155,19 @@  acc_set_cuda_stream (int async, void *stream)
 	  assert (async == acc_async_sync);
 	  gomp_debug (0, "Refusing request to set CUDA stream associated"
 		      " with \"acc_async_sync\"\n");
-	  return 0;
 	}
-      gomp_mutex_lock (&thr->dev->openacc.async.lock);
-      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
-      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+      else
+	{
+	  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+	  ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+	  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+	}
+    }
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
     }
 
   return ret;
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 0cd673e..57b3b87 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -228,8 +228,70 @@  acc_dev_num_out_of_range (acc_device_t d, int ord, int ndevs)
    held before calling this function.  */
 
 static struct gomp_device_descr *
-acc_init_1 (acc_device_t d)
+acc_init_1 (acc_device_t d, acc_construct_t parent_construct, int implicit)
 {
+  bool check_not_nested_p;
+  if (implicit)
+    {
+      /* In the implicit case, there should (TODO: must?) already be something
+	 have been set up for an outer construct.  */
+      check_not_nested_p = false;
+    }
+  else
+    {
+      check_not_nested_p = true;
+      /* TODO: should we set "thr->prof_info" etc. in this case (acc_init)?
+	 The problem is, that we don't have "thr" yet?  (So,
+	 "check_not_nested_p = true" also is pointless actually.)  */
+    }
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (check_not_nested_p),
+			false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_device_init_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = d;
+      prof_info.device_number = goacc_device_num;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = acc_async_sync; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info device_init_event_info;
+  if (profiling_dispatch_p)
+    {
+      device_init_event_info.other_event.event_type = prof_info.event_type;
+      device_init_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      device_init_event_info.other_event.parent_construct = parent_construct;
+      device_init_event_info.other_event.implicit = implicit;
+      device_init_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      api_info.device_api = acc_device_api_none; //TODO
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &device_init_event_info, &api_info);
+
   struct gomp_device_descr *base_dev, *acc_dev;
   int ndevs;
 
@@ -261,6 +323,14 @@  acc_init_1 (acc_device_t d)
   acc_init_state = initialized;
   gomp_mutex_unlock (&acc_init_state_lock);
 
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_device_init_end;
+      device_init_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &device_init_event_info,
+				&api_info);
+    }
+
   return base_dev;
 }
 
@@ -454,7 +524,11 @@  goacc_attach_host_thread_to_device (int ord)
   thr->dev = acc_dev = &base_dev[ord];
   thr->saved_bound_dev = NULL;
   thr->mapped_data = NULL;
-  
+  thr->prof_info = NULL;
+  thr->api_info = NULL;
+  /* Initially, all callbacks for all events are enabled.  */
+  thr->prof_callbacks_enabled = true;
+
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
 }
@@ -468,9 +542,7 @@  acc_init (acc_device_t d)
   gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
-
-  cached_base_dev = acc_init_1 (d);
-
+  cached_base_dev = acc_init_1 (d, acc_construct_runtime_api, 0);
   gomp_mutex_unlock (&acc_device_lock);
   
   goacc_attach_host_thread_to_device (-1);
@@ -495,11 +567,21 @@  ialias (acc_shutdown)
 int
 acc_get_num_devices (acc_device_t d)
 {
+#if 0 //TODO
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    prof_info.device_type = d; //TODO
+#endif
+
   int n = 0;
   struct gomp_device_descr *acc_dev;
 
   if (d == acc_device_none)
-    return 0;
+    goto out;
 
   gomp_init_targets_once ();
 
@@ -508,12 +590,21 @@  acc_get_num_devices (acc_device_t d)
   gomp_mutex_unlock (&acc_device_lock);
 
   if (!acc_dev)
-    return 0;
+    goto out;
 
   n = acc_dev->get_num_devices_func ();
   if (n < 0)
     n = 0;
 
+ out:
+#if 0 //TODO
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+#endif
+
   return n;
 }
 
@@ -529,6 +620,14 @@  acc_set_device_type (acc_device_t d)
   struct gomp_device_descr *base_dev, *acc_dev;
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    prof_info.device_type = d; //TODO
+
   gomp_init_targets_once ();
 
   gomp_mutex_lock (&acc_device_lock);
@@ -553,6 +652,12 @@  acc_set_device_type (acc_device_t d)
     }
 
   goacc_attach_host_thread_to_device (-1);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 ialias (acc_set_device_type)
@@ -588,12 +693,25 @@  acc_get_device_type (void)
     ;
   else
     {
+      acc_prof_info prof_info;
+      acc_api_info api_info;
+      bool profiling_setup_p
+	= __builtin_expect (goacc_profiling_setup_p (thr,
+						     &prof_info, &api_info),
+			    false);
+
       gomp_init_targets_once ();
 
       gomp_mutex_lock (&acc_device_lock);
       dev = resolve_device (acc_device_default, true);
       gomp_mutex_unlock (&acc_device_lock);
       res = acc_device_type (dev->type);
+
+      if (profiling_setup_p)
+	{
+	  thr->prof_info = NULL;
+	  thr->api_info = NULL;
+	}
     }
 
   assert (res != acc_device_default
@@ -611,6 +729,14 @@  acc_get_device_num (acc_device_t d)
   const struct gomp_device_descr *dev;
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    prof_info.device_type = d; //TODO
+
   if (d >= _ACC_device_hwm)
     gomp_fatal ("unknown device type %u", (unsigned) d);
 
@@ -620,6 +746,12 @@  acc_get_device_num (acc_device_t d)
   dev = resolve_device (d, true);
   gomp_mutex_unlock (&acc_device_lock);
 
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   if (thr && thr->base_dev == dev && thr->dev)
     return thr->dev->target_id;
 
@@ -631,6 +763,19 @@  ialias (acc_get_device_num)
 void
 acc_set_device_num (int ord, acc_device_t d)
 {
+#if 0 //TODO
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.device_type = d; //TODO
+      prof_info.device_type = ord; //TODO
+    }
+#endif
+
   struct gomp_device_descr *base_dev, *acc_dev;
   int num_devices;
 
@@ -668,6 +813,14 @@  acc_set_device_num (int ord, acc_device_t d)
     }
   
   goacc_device_num = ord;
+
+#if 0 //TODO
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+#endif
 }
 
 ialias (acc_set_device_num)
@@ -688,14 +841,32 @@  get_property_any (int ord, acc_device_t d, acc_device_property_t prop)
   if (d == acc_device_current && (!thr || !thr->dev))
     return (union gomp_device_property_value) { .val = 0 };
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   if (d == acc_device_current)
     {
+      if (profiling_setup_p)
+	{
+	  prof_info.device_type = acc_device_type (thr->dev->type);
+	  prof_info.device_number = thr->dev->target_id;
+	}
+
       dev = thr->dev;
     }
   else
     {
       int num_devices;
 
+      if (profiling_setup_p)
+	{
+	  prof_info.device_type = d;
+	  prof_info.device_number = ord;
+	}
+
       gomp_mutex_lock (&acc_device_lock);
 
       dev = resolve_device (d, false);
@@ -719,6 +890,12 @@  get_property_any (int ord, acc_device_t d, acc_device_property_t prop)
 
   propval = dev->get_property_func (dev->target_id, prop);
 
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   return propval;
 }
 
@@ -809,12 +986,16 @@  attribute_hidden void
 goacc_lazy_initialize (void)
 {
   struct goacc_thread *thr = goacc_thread ();
-
   if (thr && thr->dev)
     return;
 
+  gomp_init_targets_once ();
+
+  gomp_mutex_lock (&acc_device_lock);
   if (!cached_base_dev)
-    acc_init (acc_device_default);
-  else
-    goacc_attach_host_thread_to_device (-1);
+    cached_base_dev = acc_init_1 (acc_device_default,
+				  /* TODO */ acc_construct_parallel, 1);
+  gomp_mutex_unlock (&acc_device_lock);
+
+  goacc_attach_host_thread_to_device (-1);
 }
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index e4b6ea6..980cf32 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -40,6 +40,7 @@ 
 
 #include "openacc.h"
 #include "config.h"
+#include "acc_prof.h"
 #include <stddef.h>
 #include <stdbool.h>
 #include <stdarg.h>
@@ -68,6 +69,12 @@  struct goacc_thread
      strictly push/pop semantics according to lexical scope.  */
   struct target_mem_desc *mapped_data;
 
+  /* Data of the OpenACC Profiling Interface.  */
+  acc_prof_info *prof_info;
+  acc_api_info *api_info;
+  /* Per-thread toggle of OpenACC Profiling Interface callbacks.  */
+  bool prof_callbacks_enabled;
+
   /* These structures form a list: this is the next thread in that list.  */
   struct goacc_thread *next;
 
@@ -128,6 +135,13 @@  async_synchronous_p (int async)
   return async == acc_async_sync;
 }
 
+void goacc_profiling_initialize (void);
+bool goacc_profiling_setup_p (struct goacc_thread *,
+			      acc_prof_info *, acc_api_info *);
+bool goacc_profiling_dispatch_p (bool);
+void goacc_profiling_dispatch (acc_prof_info *, acc_event_info *,
+			       acc_api_info *);
+
 #ifdef HAVE_ATTRIBUTE_VISIBILITY
 # pragma GCC visibility pop
 #endif
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index b3b3777..1922f38 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -99,12 +99,30 @@  acc_malloc (size_t s)
 
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   assert (thr->dev);
 
+  void *ret;
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
-    return malloc (s);
+    {
+      /* TODO: Should we also generate acc_ev_alloc here?  */
+      ret = malloc (s);
+    }
+  else
+    ret = thr->dev->alloc_func (thr->dev->target_id, s);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 
-  return thr->dev->alloc_func (thr->dev->target_id, s);
+  return ret;
 }
 
 /* OpenACC 2.0a (3.2.16) doesn't specify what to do in the event
@@ -120,12 +138,23 @@  acc_free (void *d)
 
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   assert (thr && thr->dev);
 
   struct gomp_device_descr *acc_dev = thr->dev;
 
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
-    return free (d);
+    {
+      /* TODO: Should we also generate acc_ev_free here?  */
+      free (d);
+
+      goto out;
+    }
 
   gomp_mutex_lock (&acc_dev->lock);
 
@@ -147,6 +176,13 @@  acc_free (void *d)
 
   if (!acc_dev->free_func (acc_dev->target_id, d))
     gomp_fatal ("error in freeing device memory in %s", __FUNCTION__);
+
+ out:
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 static void
@@ -157,15 +193,31 @@  memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
      been obtained from a routine that did that.  */
   struct goacc_thread *thr = goacc_thread ();
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   assert (thr && thr->dev);
 
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     {
+      /* TODO: Should we also generate
+	 acc_ev_enqueue_upload_start/acc_ev_enqueue_upload_end or
+	 acc_ev_enqueue_download_start/acc_ev_enqueue_download_end here?  */
       if (from)
 	memmove (h, d, s);
       else
 	memmove (d, h, s);
-      return;
+
+      goto out;
     }
 
   goacc_aq aq = get_goacc_asyncqueue (async);
@@ -173,6 +225,13 @@  memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
     gomp_copy_dev2host (thr->dev, aq, h, d, s);
   else
     gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
+
+out:
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -217,6 +276,9 @@  acc_deviceptr (void *h)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&dev->lock);
 
   n = lookup_host (dev, h, 1);
@@ -254,6 +316,9 @@  acc_hostptr (void *d)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return d;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_dev (&acc_dev->mem_map, d, 1);
@@ -291,6 +356,9 @@  acc_is_present (void *h, size_t s)
   if (thr->dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h != NULL;
 
+  /* In the following, no OpenACC Profiling Interface events can possibly be
+     generated.  */
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -322,6 +390,12 @@  acc_map_data (void *h, void *d, size_t s)
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     {
       if (d != h)
@@ -357,6 +431,12 @@  acc_map_data (void *h, void *d, size_t s)
 			   &kinds, true, GOMP_MAP_VARS_OPENACC);
       tgt->list[0].key->refcount = REFCOUNT_INFINITY;
     }
+
+   if (profiling_setup_p)
+     {
+       thr->prof_info = NULL;
+       thr->api_info = NULL;
+     }
 }
 
 void
@@ -372,6 +452,12 @@  acc_unmap_data (void *h)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+
   gomp_mutex_lock (&acc_dev->lock);
 
   cur_node.host_start = (uintptr_t) h;
@@ -405,6 +491,12 @@  acc_unmap_data (void *h)
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 #define FLAG_PRESENT (1 << 0)
@@ -428,6 +520,18 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return h;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -489,6 +593,12 @@  present_create_copy (unsigned f, void *h, size_t s, int async)
       gomp_mutex_unlock (&acc_dev->lock);
     }
 
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
+
   return d;
 }
 
@@ -569,6 +679,18 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
   if (acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   gomp_mutex_lock (&acc_dev->lock);
 
   n = lookup_host (acc_dev, h, s);
@@ -627,6 +749,12 @@  delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -700,6 +828,18 @@  update_dev_host (int is_dev, void *h, size_t s, int async)
 
   gomp_mutex_lock (&acc_dev->lock);
 
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   n = lookup_host (acc_dev, h, s);
 
   if (!n)
@@ -719,6 +859,12 @@  update_dev_host (int is_dev, void *h, size_t s, int async)
     gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
   gomp_mutex_unlock (&acc_dev->lock);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index b949599..670c62c 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -205,22 +205,81 @@  GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   thr = goacc_thread ();
   acc_dev = thr->dev;
 
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (true), false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_compute_construct_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = async;
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info compute_construct_event_info;
+  if (profiling_dispatch_p)
+    {
+      compute_construct_event_info.other_event.event_type
+	= prof_info.event_type;
+      compute_construct_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      compute_construct_event_info.other_event.parent_construct
+	= acc_construct_parallel; //TODO: kernels...
+      compute_construct_event_info.other_event.implicit = 0;
+      compute_construct_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+			      &api_info);
+
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
   /* Host fallback if "if" clause is false or if the current device is set to
      the host.  */
   if (flags & GOACC_FLAG_HOST_FALLBACK)
     {
+      //TODO
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
       goacc_save_and_set_bind (acc_device_host);
       goacc_call_host_fn (fn, mapnum, hostaddrs, params);
       goacc_restore_bind ();
-      return;
+      goto out;
     }
   else if (acc_device_type (acc_dev->type) == acc_device_host)
     {
       goacc_call_host_fn (fn, mapnum, hostaddrs, params);
-      return;
+      goto out;
     }
+  else if (profiling_dispatch_p)
+    api_info.device_api = acc_device_api_cuda;
 
   /* Default: let the runtime choose.  */
   for (i = 0; i != GOMP_DIM_MAX; i++)
@@ -252,6 +311,14 @@  GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
 
 	    if (async == GOMP_LAUNCH_OP_MAX)
 	      async = va_arg (*ap, unsigned);
+
+	    if (profiling_dispatch_p)
+	      {
+		prof_info.async = async;
+		/* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+		prof_info.async_queue = prof_info.async;
+	      }
+
 	    break;
 	  }
 
@@ -284,10 +351,34 @@  GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   else
     tgt_fn = (void (*)) fn;
 
+  acc_event_info enter_exit_data_event_info;
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_start;
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_exit_data_event_info.other_event.parent_construct
+	= compute_construct_event_info.other_event.parent_construct;
+      enter_exit_data_event_info.other_event.implicit = 1;
+      enter_exit_data_event_info.other_event.tool_info = NULL;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+    }
+
   goacc_aq aq = get_goacc_asyncqueue (async);
 
   tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
 			     true, GOMP_MAP_VARS_OPENACC);
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_end;
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+    }
 
 #ifdef RC_CHECKING
   gomp_mutex_lock (&acc_dev->lock);
@@ -310,8 +401,25 @@  GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
       else
 	acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
 				    dims, tgt);
+      if (profiling_dispatch_p)
+	{
+	  prof_info.event_type = acc_ev_exit_data_start;
+	  enter_exit_data_event_info.other_event.event_type
+	    = prof_info.event_type;
+	  enter_exit_data_event_info.other_event.tool_info = NULL;
+	  goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				    &api_info);
+	}
       /* If running synchronously, unmap immediately.  */
       gomp_unmap_vars (tgt, true);
+      if (profiling_dispatch_p)
+	{
+	  prof_info.event_type = acc_ev_exit_data_end;
+	  enter_exit_data_event_info.other_event.event_type
+	    = prof_info.event_type;
+	  goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				    &api_info);
+	}
     }
   else
     {
@@ -329,6 +437,19 @@  GOACC_parallel_keyed_internal (int flags_m, int params, void (*fn) (void *),
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+out:
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_compute_construct_end;
+      compute_construct_event_info.other_event.event_type
+	= prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &compute_construct_event_info,
+				&api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -390,6 +511,72 @@  GOACC_data_start (int flags_m, size_t mapnum,
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (true), false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_enter_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = acc_async_sync; /* Always synchronous.  */
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info enter_data_event_info;
+  if (profiling_dispatch_p)
+    {
+      enter_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_data_event_info.other_event.parent_construct = acc_construct_data;
+      for (int i = 0; i < mapnum; ++i)
+	if (kinds[i] == GOMP_MAP_USE_DEVICE_PTR)
+	  {
+	    /* If there is one such data mapping kind, then this is actually an
+	       OpenACC host_data construct.  (GCC maps the OpenACC host_data
+	       construct to the OpenACC data construct.)  Apart from artificial
+	       test cases (such as an OpenACC host_data construct's (implicit)
+	       device initialization when there hasn't been any device data be
+	       set up before...), there can't really any meaningful events be
+	       generated from OpenACC host_data constructs, though.  */
+	    enter_data_event_info.other_event.parent_construct
+	      = acc_construct_host_data;
+	    break;
+	  }
+      enter_data_event_info.other_event.implicit = 0;
+      enter_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &enter_data_event_info, &api_info);
+
   handle_ftn_pointers (mapnum, hostaddrs, sizes, kinds);
 
   enum gomp_map_vars_kind pragma_kind;
@@ -402,11 +589,14 @@  GOACC_data_start (int flags_m, size_t mapnum,
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
     {
+      //TODO
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
       tgt = gomp_map_vars (NULL, 0, NULL, NULL, NULL, NULL, true, pragma_kind);
       tgt->prev = thr->mapped_data;
       thr->mapped_data = tgt;
 
-      return;
+      goto out;
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
@@ -421,14 +611,78 @@  GOACC_data_start (int flags_m, size_t mapnum,
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+out:
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_enter_data_end;
+      enter_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_data_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 GOACC_data_end (void)
 {
   struct goacc_thread *thr = goacc_thread ();
+  struct gomp_device_descr *acc_dev = thr->dev;
   struct target_mem_desc *tgt = thr->mapped_data;
 
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (true), false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_exit_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = acc_async_sync; /* Always synchronous.  */
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info exit_data_event_info;
+  if (profiling_dispatch_p)
+    {
+      exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      exit_data_event_info.other_event.parent_construct = acc_construct_data;
+      exit_data_event_info.other_event.implicit = 0;
+      exit_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &exit_data_event_info, &api_info);
+
   gomp_debug (0, "  %s: restore mappings\n", __FUNCTION__);
   thr->mapped_data = tgt->prev;
   gomp_unmap_vars (tgt, true);
@@ -439,6 +693,16 @@  GOACC_data_end (void)
   gomp_rc_check (thr->dev, thr->mapped_data);
   gomp_mutex_unlock (&thr->dev->lock);
 #endif
+
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_exit_data_end;
+      exit_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &exit_data_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
@@ -453,24 +717,6 @@  GOACC_enter_exit_data (int flags_m, size_t mapnum,
   bool data_enter = false;
   size_t i;
 
-  goacc_lazy_initialize ();
-
-  thr = goacc_thread ();
-  acc_dev = thr->dev;
-
-  if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
-      || (flags & GOACC_FLAG_HOST_FALLBACK))
-    return;
-
-  if (num_waits)
-    {
-      va_list ap;
-
-      va_start (ap, num_waits);
-      goacc_wait (async, num_waits, &ap);
-      va_end (ap);
-    }
-
   /* Determine whether "finalize" semantics apply to all mappings of this
      OpenACC directive.  */
   bool finalize = false;
@@ -521,6 +767,86 @@  GOACC_enter_exit_data (int flags_m, size_t mapnum,
 		      kind);
     }
 
+  goacc_lazy_initialize ();
+
+  thr = goacc_thread ();
+  acc_dev = thr->dev;
+
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (true), false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type
+	= data_enter ? acc_ev_enter_data_start : acc_ev_exit_data_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = async;
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info enter_exit_data_event_info;
+  if (profiling_dispatch_p)
+    {
+      enter_exit_data_event_info.other_event.event_type
+	= prof_info.event_type;
+      enter_exit_data_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      enter_exit_data_event_info.other_event.parent_construct
+	= data_enter ? acc_construct_enter_data : acc_construct_exit_data;
+      enter_exit_data_event_info.other_event.implicit = 0;
+      enter_exit_data_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+			      &api_info);
+
+  if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+      || (flags & GOACC_FLAG_HOST_FALLBACK))
+    {
+      //TODO
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+      goto out;
+    }
+
+  if (num_waits > 0)
+    {
+      va_list ap;
+
+      va_start (ap, num_waits);
+      goacc_wait (async, num_waits, &ap);
+      va_end (ap);
+    }
+  else if (num_waits == acc_async_noval)
+    acc_wait_all_async (async);
+
   /* In c, non-pointers and arrays are represented by a single data clause.
      Dynamically allocated arrays and subarrays are represented by a data
      clause followed by an internal GOMP_MAP_POINTER.
@@ -739,6 +1065,18 @@  GOACC_enter_exit_data (int flags_m, size_t mapnum,
   gomp_rc_check (acc_dev, thr->mapped_data);
   gomp_mutex_unlock (&acc_dev->lock);
 #endif
+
+out:
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = data_enter ? acc_ev_enter_data_end: acc_ev_exit_data_end;
+      enter_exit_data_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &enter_exit_data_event_info,
+				&api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 static void
@@ -787,9 +1125,66 @@  GOACC_update (int flags_m, size_t mapnum,
   struct goacc_thread *thr = goacc_thread ();
   struct gomp_device_descr *acc_dev = thr->dev;
 
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (true), false);
+
+  acc_prof_info prof_info;
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = &prof_info;
+
+      prof_info.event_type = acc_ev_update_start;
+      prof_info.valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info.version = _ACC_PROF_INFO_VERSION;
+      prof_info.device_type = acc_device_type (acc_dev->type);
+      prof_info.device_number = acc_dev->target_id;
+      prof_info.thread_id = -1; //TODO
+      prof_info.async = async;
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+      prof_info.src_file = NULL; //TODO
+      prof_info.func_name = NULL; //TODO
+      prof_info.line_no = -1; //TODO
+      prof_info.end_line_no = -1; //TODO
+      prof_info.func_line_no = -1; //TODO
+      prof_info.func_end_line_no = -1; //TODO
+    }
+  acc_event_info update_event_info;
+  if (profiling_dispatch_p)
+    {
+      update_event_info.other_event.event_type
+	= prof_info.event_type;
+      update_event_info.other_event.valid_bytes
+	= _ACC_OTHER_EVENT_INFO_VALID_BYTES;
+      update_event_info.other_event.parent_construct = acc_construct_update;
+      update_event_info.other_event.implicit = 0;
+      update_event_info.other_event.tool_info = NULL;
+    }
+  acc_api_info api_info;
+  if (profiling_dispatch_p)
+    {
+      thr->api_info = &api_info;
+
+      api_info.device_api = acc_device_api_none;
+      api_info.valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info.device_type = prof_info.device_type;
+      api_info.vendor = -1; //TODO
+      api_info.device_handle = NULL; //TODO
+      api_info.context_handle = NULL; //TODO
+      api_info.async_handle = NULL; //TODO
+    }
+
+  if (profiling_dispatch_p)
+    goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+
   if ((acc_dev->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || (flags & GOACC_FLAG_HOST_FALLBACK))
-    return;
+    {
+      //TODO
+      prof_info.device_type = acc_device_host;
+      api_info.device_type = prof_info.device_type;
+      goto out;
+    }
 
   if (num_waits)
     {
@@ -861,11 +1256,41 @@  GOACC_update (int flags_m, size_t mapnum,
 	  break;
 	}
     }
+
+out:
+  if (profiling_dispatch_p)
+    {
+      prof_info.event_type = acc_ev_update_end;
+      update_event_info.other_event.event_type = prof_info.event_type;
+      goacc_profiling_dispatch (&prof_info, &update_event_info, &api_info);
+
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 void
 GOACC_wait (int async, int num_waits, ...)
 {
+  goacc_lazy_initialize ();
+
+  struct goacc_thread *thr = goacc_thread ();
+
+  /* No nesting.  */
+  assert (thr->prof_info == NULL);
+  assert (thr->api_info == NULL);
+  acc_prof_info prof_info;
+  acc_api_info api_info;
+  bool profiling_setup_p
+    = __builtin_expect (goacc_profiling_setup_p (thr, &prof_info, &api_info),
+			false);
+  if (profiling_setup_p)
+    {
+      prof_info.async = async;
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info.async_queue = prof_info.async;
+    }
+
   if (num_waits)
     {
       va_list ap;
@@ -878,6 +1303,12 @@  GOACC_wait (int async, int num_waits, ...)
     acc_wait_all ();
   else
     acc_wait_all_async (async);
+
+  if (profiling_setup_p)
+    {
+      thr->prof_info = NULL;
+      thr->api_info = NULL;
+    }
 }
 
 int
diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c
index 5480c1d..a04abb0 100644
--- a/libgomp/oacc-plugin.c
+++ b/libgomp/oacc-plugin.c
@@ -47,6 +47,19 @@  GOMP_PLUGIN_acc_thread (void)
   return thr ? thr->target_tls : NULL;
 }
 
+/* Return the TLS data for the current thread.  */
+/* TODO.  Should we be able to directly call (the static inline function)
+   goacc_thread from within plugin code?  I didn't manage to get the
+   "goacc_tls_data" symbol configured correctly: "[...]/ld:
+   .libs/libgomp-plugin-nvptx.so.1.0.0: hidden symbol `goacc_tls_data' isn't
+   defined".  */
+
+struct goacc_thread *
+GOMP_PLUGIN_goacc_thread (void)
+{
+  return goacc_thread ();
+}
+
 int
 GOMP_PLUGIN_acc_default_dim (unsigned int i)
 {
diff --git a/libgomp/oacc-plugin.h b/libgomp/oacc-plugin.h
index 112d4a59..112d3cd 100644
--- a/libgomp/oacc-plugin.h
+++ b/libgomp/oacc-plugin.h
@@ -27,8 +27,11 @@ 
 #ifndef OACC_PLUGIN_H
 #define OACC_PLUGIN_H 1
 
+#include "oacc-int.h"
+
 extern void GOMP_PLUGIN_async_unmap_vars (void *, int);
 extern void *GOMP_PLUGIN_acc_thread (void);
+extern struct goacc_thread *GOMP_PLUGIN_goacc_thread (void);
 extern int GOMP_PLUGIN_acc_default_dim (unsigned int);
 
 #endif
diff --git a/libgomp/oacc-profiling-acc_register_library.c b/libgomp/oacc-profiling-acc_register_library.c
new file mode 100644
index 0000000..f6b482b
--- /dev/null
+++ b/libgomp/oacc-profiling-acc_register_library.c
@@ -0,0 +1,39 @@ 
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+
+   Contributed by Mentor Embedded.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* This file provides an stub acc_register_library function.  It's in a
+   separate file so that this function can easily be overridden when linking
+   statically.  */
+
+#include "libgomp.h"
+#include "acc_prof.h"
+
+void
+acc_register_library (acc_prof_reg reg, acc_prof_reg unreg,
+		      acc_prof_lookup_func lookup)
+{
+  gomp_debug (0, "dummy %s\n", __FUNCTION__);
+}
diff --git a/libgomp/oacc-profiling.c b/libgomp/oacc-profiling.c
new file mode 100644
index 0000000..5542d6a
--- /dev/null
+++ b/libgomp/oacc-profiling.c
@@ -0,0 +1,666 @@ 
+/* Copyright (C) 2017 Free Software Foundation, Inc.
+
+   Contributed by Mentor Embedded.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
+   <http://www.gnu.org/licenses/>.  */
+
+/* OpenACC Profiling Interface.  */
+
+#include "libgomp.h"
+#include "oacc-int.h"
+#include "acc_prof.h"
+#include <assert.h>
+#ifdef HAVE_STRING_H
+# include <string.h>
+#endif
+#ifdef PLUGIN_SUPPORT
+# include <dlfcn.h>
+#endif
+
+#define STATIC_ASSERT(expr) _Static_assert (expr, "!(" #expr ")")
+
+/* Statically assert that the layout of the common fields in the
+   "acc_event_info" variants matches.  */
+/* event_type */
+STATIC_ASSERT (offsetof (acc_event_info, event_type)
+	       == offsetof (acc_event_info, data_event.event_type));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.event_type)
+	       == offsetof (acc_event_info, launch_event.event_type));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.event_type)
+	       == offsetof (acc_event_info, other_event.event_type));
+/* valid_bytes */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.valid_bytes)
+	       == offsetof (acc_event_info, launch_event.valid_bytes));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.valid_bytes)
+	       == offsetof (acc_event_info, other_event.valid_bytes));
+/* parent_construct */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.parent_construct)
+	       == offsetof (acc_event_info, launch_event.parent_construct));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.parent_construct)
+	       == offsetof (acc_event_info, other_event.parent_construct));
+/* implicit */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.implicit)
+	       == offsetof (acc_event_info, launch_event.implicit));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.implicit)
+	       == offsetof (acc_event_info, other_event.implicit));
+/* tool_info */
+STATIC_ASSERT (offsetof (acc_event_info, data_event.tool_info)
+	       == offsetof (acc_event_info, launch_event.tool_info));
+STATIC_ASSERT (offsetof (acc_event_info, data_event.tool_info)
+	       == offsetof (acc_event_info, other_event.tool_info));
+
+struct goacc_prof_callback_entry
+{
+  acc_prof_callback cb;
+  int ref;
+  bool enabled;
+  struct goacc_prof_callback_entry *next;
+};
+
+/* goacc_prof_callbacks_enabled[acc_ev_none] acts as a global toggle.  */
+static bool goacc_prof_callbacks_enabled[acc_ev_last];
+static struct goacc_prof_callback_entry *goacc_prof_callback_entries[acc_ev_last];
+
+/* This lock is used to protect access to goacc_prof_callbacks_enabled, and
+   goacc_prof_callback_entries.  */
+static gomp_mutex_t goacc_prof_lock;
+
+void
+goacc_profiling_initialize (void)
+{
+  gomp_mutex_init (&goacc_prof_lock);
+
+  /* Initially, all callbacks for all events are enabled.  */
+  for (int i = 0; i < acc_ev_last; ++i)
+    goacc_prof_callbacks_enabled[i] = true;
+
+  /* We are to invoke an external acc_register_library routine, defaulting to
+     our stub oacc-profiling-acc_register_library.c:acc_register_library
+     implementation.  */
+  gomp_debug (0, "%s: calling acc_register_library\n", __FUNCTION__);
+  //TODO.
+  acc_register_library (acc_prof_register, acc_prof_unregister, NULL);
+#ifdef PLUGIN_SUPPORT
+  char *acc_proflibs = getenv ("ACC_PROFLIB");
+  while (acc_proflibs != NULL && acc_proflibs[0] != '\0')
+    {
+      char *acc_proflibs_sep = strchr (acc_proflibs, ';');
+      char *acc_proflib;
+      if (acc_proflibs_sep == acc_proflibs)
+	{
+	  /* Stray ";" separator: make sure we don't dlopen the main
+	     program.  */
+	  acc_proflib = NULL;
+	}
+      else
+	{
+	  if (acc_proflibs_sep != NULL)
+	    {
+	      /* Single out the first library.  */
+	      acc_proflib = gomp_malloc (acc_proflibs_sep - acc_proflibs + 1);
+	      memcpy (acc_proflib, acc_proflibs,
+		      acc_proflibs_sep - acc_proflibs);
+	      acc_proflib[acc_proflibs_sep - acc_proflibs] = '\0';
+	    }
+	  else
+	    {
+	      /* No ";" separator, so only one library.  */
+	      acc_proflib = acc_proflibs;
+	    }
+
+	  gomp_debug (0, "%s: dlopen(%s)\n", __FUNCTION__, acc_proflib);
+	  void *dl_handle = dlopen (acc_proflib, RTLD_LAZY);
+	  if (dl_handle != NULL)
+	    {
+	      typeof (&acc_register_library) a_r_l
+		= dlsym (dl_handle, "acc_register_library");
+	      if (a_r_l == NULL)
+		goto dl_fail;
+	      /* Avoid duplicate registration, for example if the same shared
+		 library is specified in LD_PRELOAD and ACC_PROFLIB -- which
+		 TAU 2.26 does when using "tau_exec -openacc".  */
+	      if (a_r_l != acc_register_library)
+		{
+		  gomp_debug (0, "  %s: calling %s:acc_register_library\n",
+			      __FUNCTION__, acc_proflib);
+		  //TODO.
+		  a_r_l (acc_prof_register, acc_prof_unregister, NULL);
+		}
+	      else
+		gomp_debug (0, "  %s: skipping duplicate"
+			    " %s:acc_register_library\n",
+			    __FUNCTION__, acc_proflib);
+	    }
+	  else
+	    {
+	    dl_fail:
+	      gomp_error ("while loading ACC_PROFLIB %s: %s",
+			  acc_proflib, dlerror ());
+	      if (dl_handle != NULL)
+		{
+		  int err = dlclose (dl_handle);
+		  dl_handle = NULL;
+		  if (err != 0)
+		    goto dl_fail;
+		}
+	    }
+	}
+
+      if (acc_proflib != acc_proflibs)
+	{
+	  free (acc_proflib);
+
+	  acc_proflibs = acc_proflibs_sep + 1;
+	}
+      else
+	acc_proflibs = NULL;
+    }
+#endif /* PLUGIN_SUPPORT */
+}
+
+void
+acc_prof_register (acc_event_t ev, acc_prof_callback cb, acc_register_t reg)
+{
+  //TODO
+  gomp_debug (0, "%s: ev=%d, cb=%p, reg=%d\n",
+	      __FUNCTION__, (int) ev, (void *) cb, (int) reg);
+
+  enum
+  {
+    EVENT_KIND_BOGUS,
+    EVENT_KIND_NORMAL,
+    /* As end events invoke callbacks in the reverse order, we register these
+       in the reverse order here.  */
+    EVENT_KIND_END,
+  } event_kind = EVENT_KIND_BOGUS;
+  switch (ev)
+    {
+    case acc_ev_none:
+    case acc_ev_device_init_start:
+    case acc_ev_device_shutdown_start:
+    case acc_ev_runtime_shutdown:
+    case acc_ev_create:
+    case acc_ev_delete:
+    case acc_ev_alloc:
+    case acc_ev_free:
+    case acc_ev_enter_data_start:
+    case acc_ev_exit_data_start:
+    case acc_ev_update_start:
+    case acc_ev_compute_construct_start:
+    case acc_ev_enqueue_launch_start:
+    case acc_ev_enqueue_upload_start:
+    case acc_ev_enqueue_download_start:
+    case acc_ev_wait_start:
+      event_kind = EVENT_KIND_NORMAL;
+      break;
+    case acc_ev_device_init_end:
+    case acc_ev_device_shutdown_end:
+    case acc_ev_enter_data_end:
+    case acc_ev_exit_data_end:
+    case acc_ev_update_end:
+    case acc_ev_compute_construct_end:
+    case acc_ev_enqueue_launch_end:
+    case acc_ev_enqueue_upload_end:
+    case acc_ev_enqueue_download_end:
+    case acc_ev_wait_end:
+      event_kind = EVENT_KIND_END;
+      break;
+    case acc_ev_last:
+      break;
+    }
+  if (event_kind == EVENT_KIND_BOGUS)
+    {
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these, for forward compatibility?
+      gomp_error ("ignoring %s request for TODOinvalid acc_event_t %d",
+		  __FUNCTION__, /* TODO */ (int) ev);
+      return;
+    }
+
+  bool bogus = true;
+  switch (reg)
+    {
+    case acc_reg:
+    case acc_toggle:
+    case acc_toggle_per_thread:
+      bogus = false;
+      break;
+    }
+  if (bogus)
+    {
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these, for forward compatibility?
+      gomp_error ("ignoring %s request with TODOinvalid acc_register_t %d",
+		  __FUNCTION__, /* TODO */ (int) reg);
+      return;
+    }
+
+  /* Special cases.  */
+  if (reg == acc_toggle)
+    {
+      if (cb == NULL)
+	{
+	  gomp_debug (0, "  globally enabling callbacks\n");
+	  gomp_mutex_lock (&goacc_prof_lock);
+	  /* goacc_prof_callbacks_enabled[acc_ev_none] acts as a global
+	     toggle.  */
+	  goacc_prof_callbacks_enabled[ev] = true;
+	  gomp_mutex_unlock (&goacc_prof_lock);
+	  return;
+	}
+      else if (ev == acc_ev_none && cb != NULL)
+	{
+	  gomp_debug (0, "  ignoring request\n");
+	  /* Silently ignore request.  */
+	  return;
+	}
+    }
+  else if (reg == acc_toggle_per_thread)
+    {
+      if (ev == acc_ev_none && cb == NULL)
+	{
+	  gomp_debug (0, "  thread: enabling callbacks\n");
+	  goacc_lazy_initialize ();
+	  struct goacc_thread *thr = goacc_thread ();
+	  thr->prof_callbacks_enabled = true;
+	  return;
+	}
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these?
+      gomp_error ("ignoring %s request for acc_toggle_per_thread with TODO",
+		  __FUNCTION__);
+      return;
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  struct goacc_prof_callback_entry *it, *it_p;
+  it = goacc_prof_callback_entries[ev];
+  it_p = NULL;
+  while (it)
+    {
+      if (it->cb == cb)
+	break;
+      it_p = it;
+      it = it->next;
+    }
+
+  switch (reg)
+    {
+    case acc_reg:
+      /* If we already have this callback registered, just increment its ref
+	 count.  */
+      if (it != NULL)
+	{
+	  it->ref++;
+	  gomp_debug (0, "  already registered;"
+		      " incrementing ref count to: %d\n", it->ref);
+	}
+      else
+	{
+	  struct goacc_prof_callback_entry *e
+	    = gomp_malloc (sizeof (struct goacc_prof_callback_entry));
+	  e->cb = cb;
+	  e->ref = 1;
+	  e->enabled = true;
+	  bool prepend = (event_kind == EVENT_KIND_END);
+	  /* If we don't have any callback registered yet, also use the
+	     "prepend" code path.  */
+	  if (it_p == NULL)
+	    prepend = true;
+	  if (prepend)
+	    {
+	      gomp_debug (0, "  prepending\n");
+	      e->next = goacc_prof_callback_entries[ev];
+	      goacc_prof_callback_entries[ev] = e;
+	    }
+	  else
+	    {
+	      gomp_debug (0, "  appending\n");
+	      e->next = NULL;
+	      it_p->next = e;
+	    }
+	}
+      break;
+
+    case acc_toggle:
+      if (it == NULL)
+	{
+	  /* Silently ignore acc_toggle request if not registered.  */
+	  gomp_debug (0, "  not enabling; not registered\n");
+	}
+      else
+	{
+	  gomp_debug (0, "  enabling\n");
+	  it->enabled = true;
+	}
+      break;
+
+    case acc_toggle_per_thread:
+      __builtin_unreachable ();
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
+
+void
+acc_prof_unregister (acc_event_t ev, acc_prof_callback cb, acc_register_t reg)
+{
+  //TODO
+  gomp_debug (0, "%s: ev=%d, cb=%p, reg=%d\n",
+	      __FUNCTION__, (int) ev, (void *) cb, (int) reg);
+
+  if (ev < acc_ev_none
+      || ev >= acc_ev_last)
+    {
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these, for forward compatibility?
+      gomp_error ("ignoring %s request for TODOinvalid acc_event_t %d",
+		  __FUNCTION__, /* TODO */ (int) ev);
+      return;
+    }
+
+  bool bogus = true;
+  switch (reg)
+    {
+    case acc_reg:
+    case acc_toggle:
+    case acc_toggle_per_thread:
+      bogus = false;
+      break;
+    }
+  if (bogus)
+    {
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these, for forward compatibility?
+      gomp_error ("ignoring %s request with TODOinvalid acc_register_t %d",
+		  __FUNCTION__, /* TODO */ (int) reg);
+      return;
+    }
+
+  /* Special cases.  */
+  if (reg == acc_toggle)
+    {
+      if (cb == NULL)
+	{
+	  gomp_debug (0, "  globally disabling callbacks\n");
+	  gomp_mutex_lock (&goacc_prof_lock);
+	  /* goacc_prof_callbacks_enabled[acc_ev_none] acts as a global
+	     toggle.  */
+	  goacc_prof_callbacks_enabled[ev] = false;
+	  gomp_mutex_unlock (&goacc_prof_lock);
+	  return;
+	}
+      else if (ev == acc_ev_none && cb != NULL)
+	{
+	  gomp_debug (0, "  ignoring request\n");
+	  /* Silently ignore request.  */
+	  return;
+	}
+    }
+  else if (reg == acc_toggle_per_thread)
+    {
+      if (ev == acc_ev_none && cb == NULL)
+	{
+	  gomp_debug (0, "  thread: disabling callbacks\n");
+	  goacc_lazy_initialize ();
+	  struct goacc_thread *thr = goacc_thread ();
+	  thr->prof_callbacks_enabled = false;
+	  return;
+	}
+      //TODO: should this be a fatal error?  Or, should we (silently?) ignore these?
+      gomp_error ("ignoring %s request for acc_toggle_per_thread with TODO",
+		  __FUNCTION__);
+      return;
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  struct goacc_prof_callback_entry *it, *it_p;
+  it = goacc_prof_callback_entries[ev];
+  it_p = NULL;
+  while (it)
+    {
+      if (it->cb == cb)
+	break;
+      it_p = it;
+      it = it->next;
+    }
+
+  switch (reg)
+    {
+    case acc_reg:
+      if (it == NULL)
+	{
+	  //TODO: should this be a fatal error?  Or, should we (silently?) ignore these?
+	  gomp_error ("ignoring %s request for acc_event_t %d: not registered",
+		      __FUNCTION__, /* TODO */ (int) ev);
+	  gomp_mutex_unlock (&goacc_prof_lock);
+	  return;
+	}
+      it->ref--;
+      gomp_debug (0, "  decrementing ref count to: %d\n", it->ref);
+      if (it->ref == 0)
+	{
+	  if (it_p == NULL)
+	    goacc_prof_callback_entries[ev] = it->next;
+	  else
+	    it_p->next = it->next;
+	  free (it);
+	}
+      break;
+
+    case acc_toggle:
+      if (it == NULL)
+	{
+	  /* Silently ignore acc_toggle request if not registered.  */
+	  gomp_debug (0, "  not disabling; not registered\n");
+	}
+      else
+	{
+	  gomp_debug (0, "  disabling\n");
+	  it->enabled = false;
+	}
+      break;
+
+    case acc_toggle_per_thread:
+      __builtin_unreachable ();
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
+
+/* Set up to dispatch events?  */
+
+bool
+goacc_profiling_setup_p (struct goacc_thread *thr,
+			 acc_prof_info *prof_info, acc_api_info *api_info)
+{
+  //TODO
+  gomp_debug (0, "%s (%p)\n", __FUNCTION__, thr);
+
+  /* If we don't have any per-thread state yet, we can't register prof_info and
+     api_info.  */
+  /* TODO: In this case, should we actually call goacc_lazy_initialize here,
+     and return the "thr" from goacc_profiling_setup_p?  */
+  if (__builtin_expect (thr == NULL, false))
+    {
+      //TODO
+      gomp_debug (0, "Can't generate OpenACC Profiling Interface events for"
+		  " the current call, construct, or directive\n");
+      return false;
+    }
+
+  bool profiling_dispatch_p
+    = __builtin_expect (goacc_profiling_dispatch_p (false), false);
+  if (thr->prof_info != NULL)
+    {
+      assert (profiling_dispatch_p); //TODO
+      /* Profiling has already been set up for an outer construct.  In this
+	 case, we continue to use the existing information, and thus return
+	 "false" here.
+
+	 This can happen, for example, for an enter data directive, which sets
+	 up profiling, then calls into acc_copyin, which should not again set
+	 up profiling, should not overwrite the existing information.  */
+      //TODO: Is this all kosher?
+      return false;
+    }
+
+  if (profiling_dispatch_p)
+    {
+      thr->prof_info = prof_info;
+
+      prof_info->event_type = -1; /* Must be set later.  */
+      prof_info->valid_bytes = _ACC_PROF_INFO_VALID_BYTES;
+      prof_info->version = _ACC_PROF_INFO_VERSION;
+      //TODO
+      if (thr->dev)
+	{
+	  prof_info->device_type = acc_device_type (thr->dev->type);
+	  prof_info->device_number = thr->dev->target_id;
+	}
+      else
+	{
+	  prof_info->device_type = -1;
+	  prof_info->device_number = -1;
+	}
+      prof_info->thread_id = -1; //TODO
+      prof_info->async = acc_async_sync; //TODO
+      /* See <https://github.com/OpenACC/openacc-spec/issues/71>.  */
+      prof_info->async_queue = prof_info->async;
+      prof_info->src_file = NULL; //TODO
+      prof_info->func_name = NULL; //TODO
+      prof_info->line_no = -1; //TODO
+      prof_info->end_line_no = -1; //TODO
+      prof_info->func_line_no = -1; //TODO
+      prof_info->func_end_line_no = -1; //TODO
+
+      thr->api_info = api_info;
+
+      api_info->device_api = acc_device_api_none; //TODO
+      api_info->valid_bytes = _ACC_API_INFO_VALID_BYTES;
+      api_info->device_type = prof_info->device_type;
+      api_info->vendor = -1; //TODO
+      api_info->device_handle = NULL; //TODO
+      api_info->context_handle = NULL; //TODO
+      api_info->async_handle = NULL; //TODO
+    }
+
+  return profiling_dispatch_p;
+}
+
+/* Prepare to dispatch events?  */
+
+bool
+goacc_profiling_dispatch_p (bool check_not_nested_p)
+{
+  //TODO
+  gomp_debug (0, "%s\n", __FUNCTION__);
+
+  struct goacc_thread *thr = goacc_thread ();
+  if (__builtin_expect (thr == NULL, false))
+    {
+      /* If we don't have any per-thread state yet, that means that per-thread
+	 callback dispatch has not been explicitly disabled (which only a call
+	 to acc_prof_unregister with acc_toggle_per_thread will do, and that
+	 would have allocated per-thread state via goacc_lazy_initialize);
+	 initially, all callbacks for all events are enabled.  */
+      //TODO
+      gomp_debug (0, "  %s: don't have any per-thread state yet\n", __FUNCTION__);
+    }
+  else
+    {
+      if (check_not_nested_p)
+	{
+	  /* No nesting.  */
+	  assert (thr->prof_info == NULL);
+	  assert (thr->api_info == NULL);
+	}
+
+      if (__builtin_expect (!thr->prof_callbacks_enabled, true))
+	{
+	  //TODO
+	  gomp_debug (0, "  %s: disabled for this thread\n", __FUNCTION__);
+	  return false;
+	}
+    }
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  /* goacc_prof_callbacks_enabled[acc_ev_none] acts as a global toggle.  */
+  if (__builtin_expect (!goacc_prof_callbacks_enabled[acc_ev_none], true))
+    {
+      //TODO
+      gomp_debug (0, "  %s: disabled globally\n", __FUNCTION__);
+      gomp_mutex_unlock (&goacc_prof_lock);
+      return false;
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+
+  return true;
+}
+
+/* Dispatch events.
+
+   TODO.
+   This must only be called if goacc_profiling_dispatch_p returned a true
+   result.  */
+
+void
+goacc_profiling_dispatch (acc_prof_info *prof_info, acc_event_info *event_info,
+			  acc_api_info *apt_info)
+{
+  acc_event_t event_type = event_info->event_type;
+  //TODO
+  gomp_debug (0, "%s: event_type=%d\n", __FUNCTION__, (int) event_type);
+  //TODO
+  assert (event_type > acc_ev_none
+	  && event_type < acc_ev_last);
+
+  gomp_mutex_lock (&goacc_prof_lock);
+
+  if (!goacc_prof_callbacks_enabled[event_type])
+    {
+      //TODO
+      gomp_debug (0, "  %s: disabled for this event type\n", __FUNCTION__);
+      gomp_mutex_unlock (&goacc_prof_lock);
+      return;
+    }
+
+  for (struct goacc_prof_callback_entry *e
+	 = goacc_prof_callback_entries[event_type];
+       e != NULL;
+       e = e->next)
+    {
+      if (!e->enabled)
+	{
+	  //TODO
+	  gomp_debug (0, "  %s: disabled for callback %p\n",
+		      __FUNCTION__, e->cb);
+	  continue;
+	}
+
+      //TODO
+      gomp_debug (0, "  %s: calling callback %p\n", __FUNCTION__, e->cb);
+      e->cb (prof_info, event_info, apt_info);
+    }
+
+  gomp_mutex_unlock (&goacc_prof_lock);
+}
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index cb18e87..1b6f2d4 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -37,6 +37,7 @@ 
 #include "libgomp-plugin.h"
 #include "oacc-plugin.h"
 #include "gomp-constants.h"
+#include "oacc-int.h"
 
 #include <pthread.h>
 #include <cuda.h>
@@ -903,11 +904,54 @@  nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   // num_gangs		nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
+
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info enqueue_launch_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_launch_start;
+
+      enqueue_launch_event_info.launch_event.event_type
+	= prof_info->event_type;
+      enqueue_launch_event_info.launch_event.valid_bytes
+	= _ACC_LAUNCH_EVENT_INFO_VALID_BYTES;
+      enqueue_launch_event_info.launch_event.parent_construct
+	/* TODO = compute_construct_event_info.other_event.parent_construct */
+	= acc_construct_parallel; //TODO: kernels...
+      enqueue_launch_event_info.launch_event.implicit = 1;
+      enqueue_launch_event_info.launch_event.tool_info = NULL;
+      enqueue_launch_event_info.launch_event.kernel_name
+	= /* TODO */ (char *) /* TODO */ targ_fn->launch->fn;
+      enqueue_launch_event_info.launch_event.num_gangs
+	= dims[GOMP_DIM_GANG];
+      enqueue_launch_event_info.launch_event.num_workers
+	= dims[GOMP_DIM_WORKER];
+      enqueue_launch_event_info.launch_event.vector_length
+	= dims[GOMP_DIM_VECTOR];
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
+					    api_info);
+    }
+
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
 		    0, stream, kargs, 0);
 
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_launch_end;
+      enqueue_launch_event_info.launch_event.event_type
+	= prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &enqueue_launch_event_info,
+					    api_info);
+    }
+
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
 		     targ_fn->launch->fn);
 }
@@ -920,6 +964,36 @@  nvptx_alloc (size_t s)
   CUdeviceptr d;
 
   CUDA_CALL_ERET (NULL, cuMemAlloc, &d, s);
+
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  bool profiling_dispatch_p
+    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      acc_prof_info *prof_info = thr->prof_info;
+      acc_event_info data_event_info;
+      acc_api_info *api_info = thr->api_info;
+
+      prof_info->event_type = acc_ev_alloc;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      data_event_info.data_event.implicit = 1; //TODO
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = s;
+      data_event_info.data_event.host_ptr = NULL;
+      data_event_info.data_event.device_ptr = (void *) d;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
   return (void *) d;
 }
 
@@ -950,6 +1024,36 @@  nvptx_free (void *p, struct ptx_device *ptx_dev)
     }
 
   CUDA_CALL (cuMemFree, (CUdeviceptr) p);
+
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  bool profiling_dispatch_p
+    = __builtin_expect (thr != NULL && thr->prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      acc_prof_info *prof_info = thr->prof_info;
+      acc_event_info data_event_info;
+      acc_api_info *api_info = thr->api_info;
+
+      prof_info->event_type = acc_ev_free;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      data_event_info.data_event.implicit = 1; //TODO
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = ps;
+      data_event_info.data_event.host_ptr = NULL;
+      data_event_info.data_event.device_ptr = p;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
   return true;
 }
 
@@ -1355,10 +1459,47 @@  openacc_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
      fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+   }
+
   if (!params && mapnum > 0)
     CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
 		      mapnum * sizeof (void *));
 
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
   if (params)
     nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
 		hp, NULL);
@@ -1439,6 +1580,35 @@  openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
 
   /* Copy the (device) pointers to arguments to the device (dp and hp might in
      fact have the same value on a unified-memory system).  */
+  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
+  acc_prof_info *prof_info = thr->prof_info;
+  acc_event_info data_event_info;
+  acc_api_info *api_info = thr->api_info;
+  bool profiling_dispatch_p = __builtin_expect (prof_info != NULL, false);
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_start;
+
+      data_event_info.data_event.event_type = prof_info->event_type;
+      data_event_info.data_event.valid_bytes
+	= _ACC_DATA_EVENT_INFO_VALID_BYTES;
+      data_event_info.data_event.parent_construct
+	= acc_construct_parallel; //TODO
+      /* Always implicit for "data mapping arguments for cuLaunchKernel".  */
+      data_event_info.data_event.implicit = 1;
+      data_event_info.data_event.tool_info = NULL;
+      data_event_info.data_event.var_name = NULL; //TODO
+      data_event_info.data_event.bytes = mapnum * sizeof (void *);
+      data_event_info.data_event.host_ptr = hp;
+      if (!params)
+	data_event_info.data_event.device_ptr = (void *) dp;
+
+      api_info->device_api = acc_device_api_cuda;
+
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
   if (!params && mapnum > 0)
     {
       CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
@@ -1450,6 +1620,14 @@  openacc_async_exec_internal (void (*fn) (void *), int params, size_t mapnum,
       block[1] = (void *) nvthd->ptx_dev;
     }
 
+  if (profiling_dispatch_p)
+    {
+      prof_info->event_type = acc_ev_enqueue_upload_end;
+      data_event_info.data_event.event_type = prof_info->event_type;
+      GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
+					    api_info);
+    }
+
   if (params)
     nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
 		hp, aq->cuda_stream);
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
new file mode 100644
index 0000000..4c1f2bb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-dispatch-1.c
@@ -0,0 +1,344 @@ 
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+/* Use explicit copyin clauses, to work around firstprivate optimizations,
+   which will cause the value at the point of call to be used (*before* any
+   potential modifications done in callbacks), as opposed to its address being
+   taken, which then later gets dereferenced (*after* any modifications done in
+   callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+static int state = -1;
+#define STATE_OP(state, op)\
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+void cb_compute_construct_start_1 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 0
+	  || state == 10
+	  || state == 30
+	  || state == 41
+	  || state == 51
+	  || state == 91
+	  || state == 101
+	  || state == 151);
+  STATE_OP (state, ++);
+}
+
+void cb_compute_construct_start_2 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 1
+	  || state == 11
+	  || state == 40
+	  || state == 50
+	  || state == 90
+	  || state == 100
+	  || state == 150);
+  STATE_OP (state, ++);
+}
+
+void cb_compute_construct_end_1 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 14
+	  || state == 21
+	  || state == 32
+	  || state == 42
+	  || state == 80
+	  || state == 103
+	  || state == 152);
+  STATE_OP (state, ++);
+}
+
+void cb_compute_construct_end_2 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 13
+	  || state == 43
+	  || state == 102
+	  || state == 154);
+  STATE_OP (state, ++);
+}
+
+void cb_compute_construct_end_3 (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 12
+	  || state == 20
+	  || state == 31
+	  || state == 44
+	  || state == 81
+	  || state == 104
+	  || state == 153);
+  STATE_OP (state, ++);
+}
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+int main()
+{
+  STATE_OP (state, = 0);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 2);
+  }
+  assert (state == 2);
+
+  STATE_OP (state, = 10);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 12);
+  }
+  assert (state == 15);
+
+  STATE_OP (state, = 20);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_toggle);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_toggle);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_toggle);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 20);
+  }
+  assert (state == 20);
+
+  STATE_OP (state, = 30);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_toggle);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 31);
+  }
+  assert (state == 33);
+
+  STATE_OP (state, = 40);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_2, acc_reg);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 42);
+  }
+  assert (state == 45);
+
+  STATE_OP (state, = 50);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 52);
+  }
+  assert (state == 52);
+
+  STATE_OP (state, = 60);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 60);
+  }
+  assert (state == 60);
+
+  STATE_OP (state, = 70);
+  unreg (acc_ev_compute_construct_start, NULL, acc_toggle);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 70);
+  }
+  assert (state == 70);
+
+  STATE_OP (state, = 80);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  reg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 80);
+  }
+  assert (state == 82);
+
+  STATE_OP (state, = 90);
+  reg (acc_ev_compute_construct_start, NULL, acc_toggle);
+  unreg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_2, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 92);
+  }
+  assert (state == 92);
+
+  STATE_OP (state, = 100);
+  reg (acc_ev_compute_construct_end, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 102);
+  }
+  assert (state == 105);
+
+  STATE_OP (state, = 110);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 110);
+  }
+  assert (state == 110);
+
+  STATE_OP (state, = 120);
+  unreg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 120);
+  }
+  assert (state == 120);
+
+  STATE_OP (state, = 130);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_3, acc_reg);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 130);
+  }
+  assert (state == 130);
+
+  STATE_OP (state, = 140);
+  unreg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start_1, acc_reg);
+  unreg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end_1, acc_reg);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 140);
+  }
+  assert (state == 140);
+
+  STATE_OP (state, = 150);
+  reg (/* TODO */ (acc_event_t) 0, NULL, acc_toggle_per_thread);
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 152);
+  }
+  assert (state == 155);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
new file mode 100644
index 0000000..436f436
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-init-1.c
@@ -0,0 +1,306 @@ 
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+/* Use explicit copyin clauses, to work around firstprivate optimizations,
+   which will cause the value at the point of call to be used (*before* any
+   potential modifications done in callbacks), as opposed to its address being
+   taken, which then later gets dereferenced (*after* any modifications done in
+   callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+static int state = -1;
+#define STATE_OP(state, op)\
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int acc_async;
+
+struct tool_info
+{
+  acc_event_info event_info;
+  struct tool_info *nested;
+};
+struct tool_info *tool_info;
+
+void cb_device_init_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 0
+	  || state == 100);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_device_init_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  if (state == 1)
+    assert (prof_info->device_type == acc_device_host);
+  else
+    assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_runtime_api);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+}
+
+void cb_device_init_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_device_init_start);
+
+  assert (prof_info->event_type == acc_ev_device_init_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  if (state == 2)
+    assert (prof_info->device_type == acc_device_host);
+  else
+    assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_runtime_api);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 10
+	  || state == 110);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_compute_construct_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+}
+
+void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 11
+	  || state == 111);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+
+  assert (prof_info->event_type == acc_ev_compute_construct_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+int main()
+{
+  STATE_OP (state, = 0);
+  reg (acc_ev_device_init_start, cb_device_init_start, acc_reg);
+  reg (acc_ev_device_init_end, cb_device_init_end, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end, acc_reg);
+  assert (state == 0);
+
+  acc_init (acc_device_host);
+  assert (state == 2);
+
+  STATE_OP (state, = 10);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = 12;
+
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 11);
+  }
+  assert (state == 12);
+
+  STATE_OP (state, = 90);
+  acc_shutdown (acc_device_host);
+  assert (state == 90);
+
+
+  STATE_OP (state, = 100);
+  acc_init (acc_device_default);
+  assert (state == 102);
+
+  STATE_OP (state, = 110);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = 12;
+
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 111);
+  }
+  assert (state == 112);
+
+  STATE_OP (state, = 190);
+  acc_shutdown (acc_device_default);
+  assert (state == 190);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
new file mode 100644
index 0000000..1a5b5fb
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-kernels-1.c
@@ -0,0 +1,253 @@ 
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+/* Use explicit copyin clauses, to work around firstprivate optimizations,
+   which will cause the value at the point of call to be used (*before* any
+   potential modifications done in callbacks), as opposed to its address being
+   taken, which then later gets dereferenced (*after* any modifications done in
+   callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+/* See the "DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT" reference in
+   libgomp.texi.  */
+#define DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT 0
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+static int state = -1;
+#define STATE_OP(state, op)\
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int num_gangs, num_workers, vector_length;
+static int async;
+
+void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 0);
+  STATE_OP (state, = 1);
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == NULL);
+  assert (event_info->launch_event.kernel_name != NULL);
+  {
+    char *s = strstr (event_info->launch_event.kernel_name, "main");
+    assert (s != NULL);
+    s = strstr (s, "omp_fn");
+    assert (s != NULL);
+  }
+  if (num_gangs < 1)
+    assert (event_info->launch_event.num_gangs >= 1);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.num_gangs == num_gangs);
+#else
+      /* No parallelized OpenACC kernels constructs, and unparallelized OpenACC
+	 kernels constructs must get launched as 1 x 1 x 1 kernels.  */
+      assert (event_info->launch_event.num_gangs == 1);
+#endif
+    }
+  if (num_workers < 1)
+    assert (event_info->launch_event.num_workers >= 1);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.num_workers == num_workers);
+#else
+      /* See num_gangs above.  */
+      assert (event_info->launch_event.num_workers == 1);
+#endif
+    }
+  if (vector_length < 1)
+    assert (event_info->launch_event.vector_length >= 1);
+  else if (acc_device_type == acc_device_nvidia) /* ... is special.  */
+    assert (event_info->launch_event.vector_length == 32);
+  else
+    {
+#ifdef __OPTIMIZE__
+      assert (event_info->launch_event.vector_length == vector_length);
+#else
+      /* See num_gangs above.  */
+      assert (event_info->launch_event.vector_length == 1);
+#endif
+    }
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+}
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+int main()
+{
+  STATE_OP (state, = 0);
+  reg (acc_ev_enqueue_launch_start, cb_enqueue_launch_start, acc_reg);
+  assert (state == 0);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  assert (state == 0);
+
+  STATE_OP (state, = 0);
+  /* Implicit async.  */
+  async = acc_async_noval;
+  /* Parallelism dimensions: compiler/runtime decides.  */
+  num_gangs = num_workers = vector_length = 0;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+#ifdef __OPTIMIZE__
+    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
+       "state == 0" still holds.  It's not yet clear what's going on.
+       Mis-optimization across the GOMP function call boundary?  Per its
+       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
+       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
+       must expect calls back into this compilation unit?  */
+    asm volatile ("" : : : "memory");
+#endif
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No acc_ev_enqueue_launch_start.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  STATE_OP (state, = 0);
+  /* Explicit async: without argument.  */
+  async = acc_async_noval;
+  /* Parallelism dimensions: literal.  */
+  num_gangs = 30;
+  num_workers = 3;
+  vector_length = 5;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels \
+  async \
+  num_gangs (30) num_workers (3) vector_length (5)
+    /* { dg-prune-output "using vector_length \\(32\\), ignoring 5" } */
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+#ifdef __OPTIMIZE__
+    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
+       "state == 0" still holds.  It's not yet clear what's going on.
+       Mis-optimization across the GOMP function call boundary?  Per its
+       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
+       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
+       must expect calls back into this compilation unit?  */
+    asm volatile ("" : : : "memory");
+#endif
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No acc_ev_enqueue_launch_start.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  STATE_OP (state, = 0);
+  /* Explicit async: variable.  */
+  async = 123;
+  /* Parallelism dimensions: variable.  */
+  num_gangs = 22;
+  num_workers = 5;
+  vector_length = 7;
+  {
+#define N 100
+    int x[N];
+#pragma acc kernels \
+  async (async) \
+  num_gangs (num_gangs) num_workers (num_workers) vector_length (vector_length)
+    /* { dg-prune-output "using vector_length \\(32\\), ignoring runtime setting" } */
+    {
+      for (int i = 0; i < N; ++i)
+	x[i] = i * i;
+    }
+#ifdef __OPTIMIZE__
+    /* TODO.  With -O2 optimizations enabled, the compiler believes that here
+       "state == 0" still holds.  It's not yet clear what's going on.
+       Mis-optimization across the GOMP function call boundary?  Per its
+       gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
+       "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
+       must expect calls back into this compilation unit?  */
+    asm volatile ("" : : : "memory");
+#endif
+    if (acc_device_type == acc_device_host)
+      assert (state == 0); /* No acc_ev_enqueue_launch_start.  */
+    else
+      assert (state == 1);
+    for (int i = 0; i < N; ++i)
+      if (x[i] != i * i)
+	__builtin_abort ();
+#undef N
+  }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
new file mode 100644
index 0000000..ef37ae9
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c
@@ -0,0 +1,698 @@ 
+/* Test dispatch of events to callbacks.  */
+
+#undef NDEBUG
+#include <assert.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <acc_prof.h>
+
+/* Use explicit copyin clauses, to work around firstprivate optimizations,
+   which will cause the value at the point of call to be used (*before* any
+   potential modifications done in callbacks), as opposed to its address being
+   taken, which then later gets dereferenced (*after* any modifications done in
+   callbacks).  */
+#define COPYIN(...) copyin(__VA_ARGS__)
+
+/* See the "DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT" reference in
+   libgomp.texi.  */
+#define DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT 0
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+static int state = -1;
+#define STATE_OP(state, op)\
+  do \
+    { \
+      typeof (state) state_o = (state); \
+      (void) state_o; \
+      (state)op; \
+      DEBUG_printf("state: %d -> %d\n", state_o, (state)); \
+    } \
+  while (0)
+
+static acc_device_t acc_device_type;
+static int acc_device_num;
+static int acc_async;
+
+struct tool_info
+{
+  acc_event_info event_info;
+  struct tool_info *nested;
+};
+struct tool_info *tool_info;
+
+void cb_device_init_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+#else
+  assert (state == 0
+	  || state == 100);
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+#endif
+
+  assert (prof_info->event_type == acc_ev_device_init_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+#else
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+#endif
+}
+
+void cb_device_init_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 2
+	  || state == 102);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_device_init_start);
+#else
+  assert (state == 1
+	  || state == 101);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_device_init_start);
+#endif
+
+  assert (prof_info->event_type == acc_ev_device_init_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_default);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (event_info->other_event.tool_info == tool_info->nested);
+#else
+  assert (event_info->other_event.tool_info == tool_info);
+#endif
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+#else
+  free (tool_info);
+  tool_info = NULL;
+#endif
+}
+
+void cb_enter_data_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 3
+	  || state == 103);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_enter_data_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+void cb_enter_data_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 4
+	  || state == 104);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_enter_data_start);
+
+  assert (prof_info->event_type == acc_ev_enter_data_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == tool_info->nested);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+void cb_exit_data_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 7);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_exit_data_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == NULL);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+void cb_exit_data_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (state == 8);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.other_event.event_type == acc_ev_exit_data_start);
+
+  assert (prof_info->event_type == acc_ev_exit_data_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 1);
+  assert (event_info->other_event.tool_info == tool_info->nested);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+void cb_compute_construct_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+#if DEVICE_INIT_INSIDE_COMPUTE_CONSTRUCT
+  assert (state == 0
+	  || state == 100);
+  if (state == 100)
+    {
+      /* Compensate for the missing acc_ev_device_init_start and
+	 acc_ev_device_init_end.  */
+      state += 2;
+    }
+#else
+  if (state == 100)
+    {
+      /* Compensate for the missing acc_ev_device_init_start and
+	 acc_ev_device_init_end.  */
+      state += 2;
+    }
+  assert (state == 2
+	  || state == 102);
+#endif
+  STATE_OP (state, ++);
+
+  assert (tool_info == NULL);
+  tool_info = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info != NULL);
+  tool_info->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_compute_construct_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == /* TODO acc_async */ acc_async_sync);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == NULL);
+
+  assert (api_info->device_api == acc_device_api_none);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->event_info.other_event.event_type = event_info->other_event.event_type;
+  event_info->other_event.tool_info = tool_info;
+
+  if (acc_device_type == acc_device_host)
+    {
+      /* Compensate for the missing acc_ev_enter_data_start.  */
+      state += 1;
+    }
+}
+
+void cb_compute_construct_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  if (acc_device_type == acc_device_host)
+    {
+      /* Compensate for the missing acc_ev_enter_data_end.  */
+      state += 1;
+      /* Compensate for the missing acc_ev_enqueue_launch_start and
+	 acc_ev_enqueue_launch_end.  */
+      state += 2;
+      /* Compensate for the missing acc_ev_exit_data_start and
+	 acc_ev_exit_data_end.  */
+      state += 2;
+    }
+  else if (acc_async != acc_async_sync)
+    {
+      /* Compensate for the missing acc_ev_exit_data_start and
+	 acc_ev_exit_data_end.  */
+      state += 2;
+    }
+  assert (state == 9
+	  || state == 109);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+
+  assert (prof_info->event_type == acc_ev_compute_construct_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  if (acc_device_type == acc_device_host)
+    assert (prof_info->async == acc_async_sync);
+  else
+    assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->other_event.event_type == prof_info->event_type);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (event_info->other_event.parent_construct == acc_construct_parallel);
+  assert (event_info->other_event.implicit == 0);
+  assert (event_info->other_event.tool_info == tool_info);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info);
+  tool_info = NULL;
+}
+
+void cb_enqueue_launch_start (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 5
+	  || state == 105);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested == NULL);
+  tool_info->nested = (struct tool_info *) malloc(sizeof *tool_info);
+  assert (tool_info->nested != NULL);
+  tool_info->nested->nested = NULL;
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_start);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == NULL);
+  assert (event_info->launch_event.kernel_name != NULL);
+  {
+    char *s = strstr (event_info->launch_event.kernel_name, "main");
+    assert (s != NULL);
+    s = strstr (s, "omp_fn");
+    assert (s != NULL);
+  }
+  assert (event_info->launch_event.num_gangs >= 1);
+  assert (event_info->launch_event.num_workers >= 1);
+  assert (event_info->launch_event.vector_length >= 1);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  tool_info->nested->event_info.launch_event.event_type = event_info->launch_event.event_type;
+  tool_info->nested->event_info.launch_event.kernel_name = strdup (event_info->launch_event.kernel_name);
+  tool_info->nested->event_info.launch_event.num_gangs = event_info->launch_event.num_gangs;
+  tool_info->nested->event_info.launch_event.num_workers = event_info->launch_event.num_workers;
+  tool_info->nested->event_info.launch_event.vector_length = event_info->launch_event.vector_length;
+  event_info->other_event.tool_info = tool_info->nested;
+}
+
+void cb_enqueue_launch_end (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  assert (acc_device_type != acc_device_host);
+
+  assert (state == 6
+	  || state == 106);
+  STATE_OP (state, ++);
+
+  assert (tool_info != NULL);
+  assert (tool_info->event_info.other_event.event_type == acc_ev_compute_construct_start);
+  assert (tool_info->nested != NULL);
+  assert (tool_info->nested->event_info.launch_event.event_type == acc_ev_enqueue_launch_start);
+  assert (tool_info->nested->event_info.launch_event.kernel_name != NULL);
+  assert (tool_info->nested->event_info.launch_event.num_gangs >= 1);
+  assert (tool_info->nested->event_info.launch_event.num_workers >= 1);
+  assert (tool_info->nested->event_info.launch_event.vector_length >= 1);
+
+  assert (prof_info->event_type == acc_ev_enqueue_launch_end);
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (prof_info->version == _ACC_PROF_INFO_VERSION);
+  assert (prof_info->device_type == acc_device_type);
+  assert (prof_info->device_number == acc_device_num);
+  assert (prof_info->thread_id == -1);
+  assert (prof_info->async == acc_async);
+  assert (prof_info->async_queue == prof_info->async);
+  assert (prof_info->src_file == NULL);
+  assert (prof_info->func_name == NULL);
+  assert (prof_info->line_no == -1);
+  assert (prof_info->end_line_no == -1);
+  assert (prof_info->func_line_no == -1);
+  assert (prof_info->func_end_line_no == -1);
+
+  assert (event_info->launch_event.event_type == prof_info->event_type);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (event_info->launch_event.parent_construct == acc_construct_parallel);
+  assert (event_info->launch_event.implicit == 1);
+  assert (event_info->launch_event.tool_info == tool_info->nested);
+  assert (event_info->launch_event.kernel_name != NULL);
+  assert (strcmp (event_info->launch_event.kernel_name, tool_info->nested->event_info.launch_event.kernel_name) == 0);
+  assert (event_info->launch_event.num_gangs == tool_info->nested->event_info.launch_event.num_gangs);
+  assert (event_info->launch_event.num_workers == tool_info->nested->event_info.launch_event.num_workers);
+  assert (event_info->launch_event.vector_length == tool_info->nested->event_info.launch_event.vector_length);
+
+  if (acc_device_type == acc_device_host)
+    assert (api_info->device_api == acc_device_api_none);
+  else
+    assert (api_info->device_api == acc_device_api_cuda);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+  assert (api_info->device_type == prof_info->device_type);
+  assert (api_info->vendor == -1);
+  assert (api_info->device_handle == NULL);
+  assert (api_info->context_handle == NULL);
+  assert (api_info->async_handle == NULL);
+
+  free (tool_info->nested->event_info.launch_event.kernel_name);
+  free (tool_info->nested);
+  tool_info->nested = NULL;
+}
+
+static acc_prof_reg reg;
+static acc_prof_reg unreg;
+static acc_prof_lookup_func lookup;
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg = reg_;
+  unreg = unreg_;
+  lookup = lookup_;
+}
+
+int main()
+{
+  STATE_OP (state, = 0);
+  reg (acc_ev_device_init_start, cb_device_init_start, acc_reg);
+  reg (acc_ev_device_init_end, cb_device_init_end, acc_reg);
+  reg (acc_ev_enter_data_start, cb_enter_data_start, acc_reg);
+  reg (acc_ev_enter_data_end, cb_enter_data_end, acc_reg);
+  reg (acc_ev_exit_data_start, cb_exit_data_start, acc_reg);
+  reg (acc_ev_exit_data_end, cb_exit_data_end, acc_reg);
+  reg (acc_ev_compute_construct_start, cb_compute_construct_start, acc_reg);
+  reg (acc_ev_compute_construct_end, cb_compute_construct_end, acc_reg);
+  reg (acc_ev_enqueue_launch_start, cb_enqueue_launch_start, acc_reg);
+  reg (acc_ev_enqueue_launch_end, cb_enqueue_launch_end, acc_reg);
+  assert (state == 0);
+
+  acc_device_type = acc_get_device_type ();
+  acc_device_num = acc_get_device_num (acc_device_type);
+  acc_async = acc_async_sync;
+  assert (state == 0);
+
+  {
+    int state_init;
+#pragma acc parallel COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+    assert (state_init == 4);
+  }
+#ifdef __OPTIMIZE__
+  /* TODO.  With -O2 optimizations enabled, the compiler believes that here
+     "state == 0" still holds.  It's not yet clear what's going on.
+     Mis-optimization across the GOMP function call boundary?  Per its
+     gcc/omp-builtins.def definition, BUILT_IN_GOACC_PARALLEL
+     "GOACC_parallel_keyed" doesn't have a "leaf" attribute, so the compiler
+     must expect calls back into this compilation unit?  */
+  asm volatile ("" : : : "memory");
+#endif
+  assert (state == 10);
+
+  STATE_OP (state, = 100);
+
+  acc_async = 12;
+  {
+    int state_init;
+#pragma acc parallel async(acc_async) COPYIN(state) copyout(state_init)
+    {
+      state_init = state;
+    }
+#pragma acc wait
+    assert (state_init == 104);
+  }
+  assert (state == 110);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
new file mode 100644
index 0000000..a952c7a
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-valid_bytes-1.c
@@ -0,0 +1,172 @@ 
+/* Test the "valid_bytes" magic.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+void cb_data_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->data_event.valid_bytes == _ACC_DATA_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+}
+
+void cb_launch_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->launch_event.valid_bytes == _ACC_LAUNCH_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+}
+
+void cb_other_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->valid_bytes == _ACC_PROF_INFO_VALID_BYTES);
+  assert (event_info->other_event.valid_bytes == _ACC_OTHER_EVENT_INFO_VALID_BYTES);
+  assert (api_info->valid_bytes == _ACC_API_INFO_VALID_BYTES);
+}
+
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg_ (acc_ev_device_init_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_init_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_runtime_shutdown, cb_other_event, acc_reg);
+  reg_ (acc_ev_create, cb_data_event, acc_reg);
+  reg_ (acc_ev_delete, cb_data_event, acc_reg);
+  reg_ (acc_ev_alloc, cb_data_event, acc_reg);
+  reg_ (acc_ev_free, cb_data_event, acc_reg);
+  reg_ (acc_ev_enter_data_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_enter_data_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_exit_data_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_exit_data_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_update_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_update_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_compute_construct_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_compute_construct_end, cb_other_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_start, cb_launch_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_end, cb_launch_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_start, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_end, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_start, cb_data_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_end, cb_data_event, acc_reg);
+  reg_ (acc_ev_wait_start, cb_other_event, acc_reg);
+  reg_ (acc_ev_wait_end, cb_other_event, acc_reg);
+}
+
+/* Basic struct.  */
+typedef struct A
+{
+  int a;
+  int b;
+#define VALID_BYTES_A \
+  _ACC_PROF_VALID_BYTES_STRUCT (A, b, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (int))
+} A;
+
+/* Add a "char" field.  */
+typedef struct B
+{
+  int a;
+  int b;
+  char c;
+#define VALID_BYTES_B \
+  _ACC_PROF_VALID_BYTES_STRUCT (B, c, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} B;
+
+/* Add another "char" field.  */
+typedef struct C
+{
+  int a;
+  int b;
+  char c, d;
+#define VALID_BYTES_C \
+  _ACC_PROF_VALID_BYTES_STRUCT (C, d, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} C;
+
+/* Add two "void *" fields.  */
+typedef struct D
+{
+  int a;
+  int b;
+  char c, d;
+  void *e;
+  void *f;
+#define VALID_BYTES_D \
+  _ACC_PROF_VALID_BYTES_STRUCT (D, f, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (void *))
+} D;
+
+/* Add another three "char" fields.  */
+typedef struct E
+{
+  int a;
+  int b;
+  char c, d;
+  void *e;
+  void *f;
+  char g, h, i;
+#define VALID_BYTES_E \
+  _ACC_PROF_VALID_BYTES_STRUCT (E, i, \
+				_ACC_PROF_VALID_BYTES_BASICTYPE (char))
+} E;
+
+int main()
+{
+  A A1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof A1, VALID_BYTES_A);
+  assert (VALID_BYTES_A <= sizeof A1);
+  DEBUG_printf ("&A1=%p, &A1.b=%p\n", &A1, &A1.b);
+  assert (((char *) &A1) + VALID_BYTES_A == (char *) (&A1.b + 1));
+
+  B B1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof B1, VALID_BYTES_B);
+  assert (VALID_BYTES_B <= sizeof B1);
+  DEBUG_printf ("&B1=%p, &B1.c=%p\n", &B1, &B1.c);
+  assert (((char *) &B1) + VALID_BYTES_B == (char *) (&B1.c + 1));
+
+  assert (VALID_BYTES_B == VALID_BYTES_A + 1 * sizeof (char));
+
+  C C1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof C1, VALID_BYTES_C);
+  assert (VALID_BYTES_C <= sizeof C1);
+  DEBUG_printf ("&C1=%p, &C1.d=%p\n", &C1, &C1.d);
+  assert (((char *) &C1) + VALID_BYTES_C == (char *) (&C1.d + 1));
+
+  assert (VALID_BYTES_C == VALID_BYTES_B + 1 * sizeof (char));
+
+  D D1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof D1, VALID_BYTES_D);
+  assert (VALID_BYTES_D <= sizeof D1);
+  DEBUG_printf ("&D1=%p, &D1.f=%p\n", &D1, &D1.f);
+  assert (((char *) &D1) + VALID_BYTES_D == (char *) (&D1.f + 1));
+
+  assert (VALID_BYTES_D > VALID_BYTES_C);
+
+  E E1;
+  DEBUG_printf ("s=%zd, vb=%zd\n", sizeof E1, VALID_BYTES_E);
+  assert (VALID_BYTES_E <= sizeof E1);
+  DEBUG_printf ("&E1=%p, &E1.i=%p\n", &E1, &E1.i);
+  assert (((char *) &E1) + VALID_BYTES_E == (char *) (&E1.i + 1));
+
+  assert (VALID_BYTES_E == VALID_BYTES_D + 3 * sizeof (char));
+
+#pragma acc parallel
+  {
+  }
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
new file mode 100644
index 0000000..11e3ad7
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/acc_prof-version-1.c
@@ -0,0 +1,55 @@ 
+/* Test "acc_prof_info"'s  "version" field.  */
+
+#undef NDEBUG
+#include <assert.h>
+
+#include <acc_prof.h>
+
+#define DEBUG_printf(...) //__builtin_printf (__VA_ARGS__)
+
+void cb_any_event (acc_prof_info *prof_info, acc_event_info *event_info, acc_api_info *api_info)
+{
+  DEBUG_printf ("%s %d\n", __FUNCTION__, prof_info->event_type);
+
+  assert (prof_info->version == 201711);
+}
+
+void acc_register_library (acc_prof_reg reg_, acc_prof_reg unreg_, acc_prof_lookup_func lookup_)
+{
+  DEBUG_printf ("%s\n", __FUNCTION__);
+
+  reg_ (acc_ev_device_init_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_init_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_device_shutdown_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_runtime_shutdown, cb_any_event, acc_reg);
+  reg_ (acc_ev_create, cb_any_event, acc_reg);
+  reg_ (acc_ev_delete, cb_any_event, acc_reg);
+  reg_ (acc_ev_alloc, cb_any_event, acc_reg);
+  reg_ (acc_ev_free, cb_any_event, acc_reg);
+  reg_ (acc_ev_enter_data_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enter_data_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_exit_data_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_exit_data_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_update_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_update_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_compute_construct_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_compute_construct_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_launch_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_upload_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_enqueue_download_end, cb_any_event, acc_reg);
+  reg_ (acc_ev_wait_start, cb_any_event, acc_reg);
+  reg_ (acc_ev_wait_end, cb_any_event, acc_reg);
+}
+
+int main()
+{
+#pragma acc parallel
+  {
+  }
+
+  return 0;
+}