Message ID | 798d7ee1-2ffa-a591-38cb-a9ad421265d0@codesourcery.com |
---|---|
State | New |
Series | [v5] libgomp/nvptx: Prepare for reverse-offload callback handling |
On Fri, Oct 07, 2022 at 04:26:58PM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
>
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
>
> For host-device consistency guarantee reasons, reverse offload is currently
> limited -march=sm_70 (for libgomp).
>
> gcc/ChangeLog:
>
>       * config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
>       has not been compiled with sm_70 or higher and disable code gen then.
>
> include/ChangeLog:
>
>       * cuda/cuda.h (enum CUdevice_attribute): Add
>       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
>       (CU_MEMHOSTALLOC_DEVICEMAP): Define.
>       (cuMemHostAlloc): Add prototype.
>
> libgomp/ChangeLog:
>
>       * config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
>       'static' for this variable.
>       * config/nvptx/libgomp-nvptx.h: New file.
>       * config/nvptx/target.c: Include it.
>       (GOMP_ADDITIONAL_ICVS): Declare extern var.
>       (GOMP_REV_OFFLOAD_VAR): Declare var.
>       (GOMP_target_ext): Handle reverse offload.
>       * libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
>       * libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
>       * target.c (gomp_target_rev): ... this new stub function.
>       * libgomp.h (gomp_target_rev): Declare.
>       * libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
>       * plugin/cuda-lib.def (cuMemHostAlloc): Add.
>       * plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
>       (struct ptx_device): Add rev_data member.
>       (nvptx_open_device): #if 0 unused check; add
>       unified address assert check.
>       (GOMP_OFFLOAD_get_num_devices): Claim unified address
>       support.
>       (GOMP_OFFLOAD_load_image): Free rev_fn_table if no
>       offload functions exist. Make offload var available
>       on host and device.
>       (rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
>       (GOMP_OFFLOAD_run): Handle reverse offload.

So, does this mean one has to have gcc configured --with-arch=sm_70
or later to make reverse offloading work (and then on the other
side no support for older PTX arches at all)?
If yes, I was kind of hoping we could arrange for it to be more
user-friendly, build libgomp.a normally (sm_35 or what is the default),
build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
and arrange for mkoffload to link in the sm_70 stuff only if the user
wants reverse offload (or has requires reverse_offload?). In that case
ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
in the part that needs sm_70 and make stuff working on sm_35 and later.
Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
mkoffload choose among them.

> +      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
> +         for < sm_70 exists but is disabled here as it is unclear whether there
> +         is the required consistency between host and device.
> +         See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
> +         for details. */
> +      warning_at (input_location, 0,
> +                  "Disabling offload-code generation for this device type: "
> +                  "%<omp requires reverse_offload%> can only be fulfilled "
> +                  "for %<sm_70%> or higher");
> +      inform (UNKNOWN_LOCATION,
> +              "Reverse offload requires that GCC is configured with "
> +              "%<--with-arch=sm_70%> or higher and not overridden by a lower "
> +              "value for %<-foffload-options=nvptx-none=-march=%>");

Diagnostics (sure, Fortran FE is an exception) shouldn't start with capital
letters.

> @@ -519,10 +523,20 @@ nvptx_open_device (int n)
>                    CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
>    ptx_dev->max_threads_per_multiprocessor = pi;
>
> +#if 0
> +  int async_engines;
>    r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
>                           CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
>    if (r != CUDA_SUCCESS)
>      async_engines = 1;
> +#endif

Please avoid #if 0 code.

> +
> +  /* Required below for reverse offload as implemented, but with compute
> +     capability >= 2.0 and 64bit device processes, this should be universally be
> +     the case; hence, an assert. */
> +  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
> +                         CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
> +  assert (r == CUDA_SUCCESS && pi);
>
>    for (int i = 0; i != GOMP_DIM_MAX; i++)
>      ptx_dev->default_dims[i] = 0;
> @@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
>  {
>    int num_devices = nvptx_get_num_devices ();
>    /* Return -1 if no omp_requires_mask cannot be fulfilled but
> -     devices were present. */
> -  if (num_devices > 0 && omp_requires_mask != 0)
> +     devices were present. Unified-shared address: see comment in

2 spaces after . rather than 1.

> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
>      htab_free (refcount_set);
>  }
>
> +/* Handle reverse offload. This is called by the device plugins for a
> +   reverse offload; it is not called if the outer target runs on the host. */

Likewise.

        Jakub
On Tue, 11 Oct 2022, Jakub Jelinek wrote:

> So, does this mean one has to have gcc configured --with-arch=sm_70
> or later to make reverse offloading work (and then on the other
> side no support for older PTX arches at all)?
> If yes, I was kind of hoping we could arrange for it to be more
> user-friendly, build libgomp.a normally (sm_35 or what is the default),
> build the single TU in libgomp that needs the sm_70 stuff with -march=sm_70
> and arrange for mkoffload to link in the sm_70 stuff only if the user
> wants reverse offload (or has requires reverse_offload?). In that case
> ignore sm_60 and older devices, if reverse offload isn't wanted, don't link
> in the part that needs sm_70 and make stuff working on sm_35 and later.
> Or perhaps have 2 versions of target.o, one sm_35 and one sm_70 and let
> mkoffload choose among them.

My understanding is such trickery should not be necessary with
the barrier-based approach, i.e. the sequence of PTX instructions

    st   % plain store
    membar.sys
    st.volatile

should be enough to guarantee that the former store is visible on the host
before the latter, and work all the way back to sm_20.

Alexander
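The three-step sequence described above (plain store, barrier, flag store) can be pictured with host-side C11 atomics. This is only an analogy sketch, not PTX: `struct mailbox`, `publish` and `try_consume` are hypothetical names, and treating a C11 release fence as the counterpart of `membar.sys` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mailbox: plain payload plus a flag that is written last.  */
struct mailbox
{
  uint64_t payload;      /* plain data, stored before the flag */
  _Atomic uint64_t flag; /* nonzero means "payload is ready" */
};

/* Writer side: payload store, fence, then flag store.  */
static void
publish (struct mailbox *mb, uint64_t payload, uint64_t flag)
{
  assert (flag != 0);                         /* zero is reserved for "empty" */
  mb->payload = payload;                      /* plain store */
  atomic_thread_fence (memory_order_release); /* assumed analogue of membar.sys */
  atomic_store_explicit (&mb->flag, flag, memory_order_relaxed);
}

/* Reader side: only touch the payload after the flag has been observed.  */
static bool
try_consume (struct mailbox *mb, uint64_t *payload_out)
{
  if (atomic_load_explicit (&mb->flag, memory_order_relaxed) == 0)
    return false;
  atomic_thread_fence (memory_order_acquire); /* pairs with the release fence */
  *payload_out = mb->payload;
  atomic_store_explicit (&mb->flag, 0, memory_order_release); /* mark empty */
  return true;
}
```

A reader that sees a nonzero flag is then guaranteed, under the C11 model, to also see the payload written before the fence; whether the PTX sequence gives the same guarantee between device and host is exactly what the thread is discussing.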
On 11.10.22 13:12, Alexander Monakov wrote:

> My understanding is such trickery should not be necessary with
> the barrier-based approach, i.e. the sequence of PTX instructions
>
>     st   % plain store
>     membar.sys
>     st.volatile
>
> should be enough to guarantee that the former store is visible on the host
> before the latter, and work all the way back to sm_20.

If I understand it correctly, you mean:

  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;

  __sync_synchronize ();  /* membar.sys */
  asm volatile ("st.volatile.global.u64 [%0], %1;"
                : : "r"(addr_struct_fn), "r" (fn) : "memory");

And then directly followed by the busy wait:

  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
    ;  /* spin */

which GCC expands to:

  /* ld.global.u64 %r64,[__gomp_rev_offload_var];
     ld.u64 %r36,[%r64];
     membar.sys; */

The correspondingly updated patch is attached.

(This change and the removal of the mkoffload.cc part are the only larger
changes; otherwise, the patch only addresses the minor comments by Jakub.
The now-removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef, i.e.
it is a really old leftover!)

Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
sm_35 and sm_70.

Tobias

*With some added code; until GOMP_OFFLOAD_get_num_devices accepts
GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
gets passed a non-NULL for rev_fn_table, the current patch is a no-op.

Planned next are the related GCN patch and the actual change
in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices).

-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company; Managing Directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: München, HRB 106955
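The request/acknowledge handshake in this message (fill in the struct, store 'fn' last, spin until the host resets 'fn' to 0) can be sketched in plain host C using the same `__atomic` builtins. The field layout is copied from libgomp-nvptx.h in the patch; `post_request`, `request_done`, `poll_once`, `rev_handler` and `sum_mapnum_handler` are hypothetical names used only for illustration and are not the functions in plugin-nvptx.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout as in libgomp/config/nvptx/libgomp-nvptx.h from the patch.  */
struct rev_offload
{
  uint64_t fn;
  uint64_t mapnum;
  uint64_t addrs;
  uint64_t sizes;
  uint64_t kinds;
  int32_t dev_num;
};

/* Hypothetical callback type standing in for the host-side handler.  */
typedef void (*rev_handler) (const struct rev_offload *req, void *ctx);

/* Requesting side: fill in the arguments first, publish 'fn' last.  */
static void
post_request (struct rev_offload *slot, uint64_t fn, uint64_t mapnum,
              int32_t dev_num)
{
  assert (fn != 0); /* zero means "no request pending" */
  slot->mapnum = mapnum;
  slot->dev_num = dev_num;
  __atomic_store_n (&slot->fn, fn, __ATOMIC_RELEASE);
}

/* Requesting side: the busy wait would loop until this returns true.  */
static bool
request_done (struct rev_offload *slot)
{
  return __atomic_load_n (&slot->fn, __ATOMIC_ACQUIRE) == 0;
}

/* Serving side: handle one pending request, then reset 'fn' to 0.  */
static bool
poll_once (struct rev_offload *slot, rev_handler handler, void *ctx)
{
  if (__atomic_load_n (&slot->fn, __ATOMIC_ACQUIRE) == 0)
    return false;
  handler (slot, ctx);
  __atomic_store_n (&slot->fn, 0, __ATOMIC_RELEASE);
  return true;
}

/* Example handler (hypothetical): accumulate mapnum into *ctx.  */
static void
sum_mapnum_handler (const struct rev_offload *req, void *ctx)
{
  *(uint64_t *) ctx += req->mapnum;
}
```

In the real setup the two sides run on device and host concurrently; the sketch only shows the order of stores and loads on a single slot.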
On Wed, 12 Oct 2022, Tobias Burnus wrote:

> On 11.10.22 13:12, Alexander Monakov wrote:
> > My understanding is such trickery should not be necessary with
> > the barrier-based approach, i.e. the sequence of PTX instructions
> >
> >     st   % plain store
> >     membar.sys
> >     st.volatile
> >
> > should be enough to guarantee that the former store is visible on the host
> > before the latter, and work all the way back to sm_20.
>
> If I understand it correctly, you mean:
>
>   GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
>
>   __sync_synchronize ();  /* membar.sys */
>   asm volatile ("st.volatile.global.u64 [%0], %1;"
>                 : : "r"(addr_struct_fn), "r" (fn) : "memory");
>
>
> And then directly followed by the busy wait:
>
>   while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
>     ;  /* spin */
>
> which GCC expands to:
>
>   /* ld.global.u64 %r64,[__gomp_rev_offload_var];
>      ld.u64 %r36,[%r64];
>      membar.sys; */
>
> The such updated patch is attached.

I think the topic for which I was Cc'ed (memory space and access method for
the synchronization variable) has been resolved nicely.

I am not satisfied with some other points raised in the conversation, I hope
they are noted.

Alexander

> (This is the only change + removing the mkoffload.cc part is the only
> larger change. Otherwise, it only handles the minor comments by Jakub.
> The now removed CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT was used
> until commit r10-304-g1f4c5b9bb2eb81880e2bc725435d596fcd2bdfef i.e.
> it is a really old left over!)
>
> Otherwise, tested* to work with sm_30 (error by mkoffload, unchanged),
> sm_35 and sm_70.
>
> Tobias
>
> *With some added code; until GOMP_OFFLOAD_get_num_devices accepts
> GOMP_REQUIRES_UNIFIED_SHARED_MEMORY and GOMP_OFFLOAD_load_image
> gets passed a non-NULL for rev_fn_table, the current patch is a no op.
>
> Planned next is the related GCN patch - and the actual change
> in libgomp/target.c (+ accepting USM in GOMP_OFFLOAD_get_num_devices)
On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
> libgomp/nvptx: Prepare for reverse-offload callback handling
>
> This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
> later handle the reverse offload.
> For nvptx, it adds support for forwarding the offload gomp_target_ext call
> to the host by setting values in a struct on the device and querying it on
> the host - invoking gomp_target_rev on the result.
>
> include/ChangeLog:
>
>       * cuda/cuda.h (enum CUdevice_attribute): Add
>       CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
>       (CU_MEMHOSTALLOC_DEVICEMAP): Define.
>       (cuMemHostAlloc): Add prototype.
>
> libgomp/ChangeLog:
>
>       * config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
>       'static' for this variable.
>       * config/nvptx/libgomp-nvptx.h: New file.
>       * config/nvptx/target.c: Include it.
>       (GOMP_ADDITIONAL_ICVS): Declare extern var.
>       (GOMP_REV_OFFLOAD_VAR): Declare var.
>       (GOMP_target_ext): Handle reverse offload.
>       * libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
>       * libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
>       * target.c (gomp_target_rev): ... this new stub function.
>       * libgomp.h (gomp_target_rev): Declare.
>       * libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
>       * plugin/cuda-lib.def (cuMemHostAlloc): Add.
>       * plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
>       (struct ptx_device): Add rev_data member.
>       (nvptx_open_device): Remove async_engines query, last used in
>       r10-304-g1f4c5b9b; add unified-address assert check.
>       (GOMP_OFFLOAD_get_num_devices): Claim unified address
>       support.
>       (GOMP_OFFLOAD_load_image): Free rev_fn_table if no
>       offload functions exist. Make offload var available
>       on host and device.
>       (rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
>       (GOMP_OFFLOAD_run): Handle reverse offload.

Ok, thanks.

        Jakub
Hi Tobias!

On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>> libgomp/nvptx: Prepare for reverse-offload callback handling

> Ok, thanks.

Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
I'm seeing a lot of libgomp execution test regressions. Random
example, 'libgomp.c-c++-common/error-1.c':

    [...]
      GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]

    Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
    0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
    2127              if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
    (gdb) print ptx_dev
    $1 = (struct ptx_device *) 0x6a55a0
    (gdb) print ptx_dev->rev_data
    $2 = (struct rev_offload *) 0xffffffff00000000
    (gdb) print ptx_dev->rev_data->fn
    Cannot access memory at address 0xffffffff00000000

Why is it even taking this 'if (reverse_offload)' code path, which isn't
applicable to this test case (as far as I understand)? (Well, the answer
is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)

Regards
 Thomas
Hi Tobias!

On 2022-10-24T21:05:46+0200, I wrote:
> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>> libgomp/nvptx: Prepare for reverse-offload callback handling
>
>> Ok, thanks.
>
> Per commit r13-3460-g131d18e928a3ea1ab2d3bf61aa92d68a8a254609
> "libgomp/nvptx: Prepare for reverse-offload callback handling",
> I'm seeing a lot of libgomp execution test regressions. Random
> example, 'libgomp.c-c++-common/error-1.c':
>
>     [...]
>       GOMP_OFFLOAD_run: kernel main$_omp_fn$0: launch [(teams: 1), 1, 1] [(lanes: 32), (threads: 8), 1]
>
>     Thread 1 "a.out" received signal SIGSEGV, Segmentation fault.
>     0x00007ffff793b87d in GOMP_OFFLOAD_run (ord=<optimized out>, tgt_fn=<optimized out>, tgt_vars=<optimized out>, args=<optimized out>) at [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:2127
>     2127              if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0)
>     (gdb) print ptx_dev
>     $1 = (struct ptx_device *) 0x6a55a0
>     (gdb) print ptx_dev->rev_data
>     $2 = (struct rev_offload *) 0xffffffff00000000
>     (gdb) print ptx_dev->rev_data->fn
>     Cannot access memory at address 0xffffffff00000000
>
> Why is it even taking this 'if (reverse_offload)' code path, which isn't
> applicable to this test case (as far as I understand)? (Well, the answer
> is 'bool reverse_offload = ptx_dev->rev_data != NULL;', but why is that?)

Well.

    --- a/libgomp/plugin/plugin-nvptx.c
    +++ b/libgomp/plugin/plugin-nvptx.c
    @@ -329,6 +332,7 @@ struct ptx_device
         pthread_mutex_t lock;
       } omp_stacks;

    +  struct rev_offload *rev_data;
       struct ptx_device *next;
     };

... but as far as I can tell, this is never initialized in
'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
Would the following be the correct fix (currently testing)?

    --- libgomp/plugin/plugin-nvptx.c
    +++ libgomp/plugin/plugin-nvptx.c
    @@ -546,6 +546,8 @@ nvptx_open_device (int n)
       ptx_dev->omp_stacks.size = 0;
       pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);

    +  ptx_dev->rev_data = NULL;
    +
       return ptx_dev;
     }

Regards
 Thomas
Hi Tobias!

On 24.10.22 21:11, Thomas Schwinge wrote:
> On 2022-10-24T21:05:46+0200, I wrote:
>> On 2022-10-24T16:07:25+0200, Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>>> On Wed, Oct 12, 2022 at 10:55:26AM +0200, Tobias Burnus wrote:
>>>> libgomp/nvptx: Prepare for reverse-offload callback handling
> Well.
> +  struct rev_offload *rev_data;
> ... but as far as I can tell, this is never initialized in
> 'nvptx_open_device', which does 'ptx_dev = GOMP_PLUGIN_malloc ([...]);'.
> Would the following be the correct fix (currently testing)?
>
> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -546,6 +546,8 @@ nvptx_open_device (int n)
>    ptx_dev->omp_stacks.size = 0;
>    pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
>
> +  ptx_dev->rev_data = NULL;
> +
>    return ptx_dev;
>  }

LGTM, and I think it is obvious, although I am not sure why it did not fail
when I tested it here.

Thanks,

Tobias
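The failure mode discussed above is the usual one for malloc-style allocation: a newly added pointer member holds an indeterminate value until it is assigned, so a later '!= NULL' test can be true by accident. A minimal sketch of the pattern follows, using hypothetical stand-in types (`struct device_stub`, `struct rev_data_stub`) and plain `malloc` rather than the real `struct ptx_device` and `GOMP_PLUGIN_malloc`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the plugin's device and reverse-offload data.  */
struct rev_data_stub
{
  int placeholder;
};

struct device_stub
{
  int ord;
  struct rev_data_stub *rev_data; /* only set once reverse offload is used */
};

/* malloc does not zero memory, so every member is assigned explicitly.  */
static struct device_stub *
open_device_stub (int ord)
{
  struct device_stub *dev = malloc (sizeof *dev);
  if (dev == NULL)
    return NULL;
  dev->ord = ord;
  dev->rev_data = NULL; /* without this line the test below reads garbage */
  return dev;
}

/* Mirrors the 'rev_data != NULL' check quoted in the thread.  */
static bool
uses_reverse_offload (const struct device_stub *dev)
{
  assert (dev != NULL);
  return dev->rev_data != NULL;
}
```

Allocating with `calloc` would be another way to get a null pointer here on common platforms; the explicit assignment matches the fix proposed in the thread.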
libgomp/nvptx: Prepare for reverse-offload callback handling

This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding the offload gomp_target_ext call
to the host by setting values in a struct on the device and querying it on
the host - invoking gomp_target_rev on the result.

For host-device consistency guarantee reasons, reverse offload is currently
limited -march=sm_70 (for libgomp).

gcc/ChangeLog:

        * config/nvptx/mkoffload.cc (process): Warn if the linked-in libgomp.a
        has not been compiled with sm_70 or higher and disable code gen then.

include/ChangeLog:

        * cuda/cuda.h (enum CUdevice_attribute): Add
        CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
        (CU_MEMHOSTALLOC_DEVICEMAP): Define.
        (cuMemHostAlloc): Add prototype.

libgomp/ChangeLog:

        * config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
        'static' for this variable.
        * config/nvptx/libgomp-nvptx.h: New file.
        * config/nvptx/target.c: Include it.
        (GOMP_ADDITIONAL_ICVS): Declare extern var.
        (GOMP_REV_OFFLOAD_VAR): Declare var.
        (GOMP_target_ext): Handle reverse offload.
        * libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
        * libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
        * target.c (gomp_target_rev): ... this new stub function.
        * libgomp.h (gomp_target_rev): Declare.
        * libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
        * plugin/cuda-lib.def (cuMemHostAlloc): Add.
        * plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
        (struct ptx_device): Add rev_data member.
        (nvptx_open_device): #if 0 unused check; add
        unified address assert check.
        (GOMP_OFFLOAD_get_num_devices): Claim unified address
        support.
        (GOMP_OFFLOAD_load_image): Free rev_fn_table if no
        offload functions exist. Make offload var available
        on host and device.
        (rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
        (GOMP_OFFLOAD_run): Handle reverse offload.
 gcc/config/nvptx/mkoffload.cc        |  60 +++++++++++++++-----
 include/cuda/cuda.h                  |   3 +
 libgomp/config/nvptx/icv-device.c    |   2 +-
 libgomp/config/nvptx/libgomp-nvptx.h |  51 +++++++++++++++++
 libgomp/config/nvptx/target.c        |  61 +++++++++++++++++---
 libgomp/libgomp-plugin.c             |  12 ++++
 libgomp/libgomp-plugin.h             |   7 +++
 libgomp/libgomp.h                    |   5 ++
 libgomp/libgomp.map                  |   5 ++
 libgomp/plugin/cuda-lib.def          |   1 +
 libgomp/plugin/plugin-nvptx.c        | 107 +++++++++++++++++++++++++++++++++--
 libgomp/target.c                     |  19 +++++++
 12 files changed, 304 insertions(+), 29 deletions(-)

diff --git a/gcc/config/nvptx/mkoffload.cc b/gcc/config/nvptx/mkoffload.cc
index 854cd72..aa2e042 100644
--- a/gcc/config/nvptx/mkoffload.cc
+++ b/gcc/config/nvptx/mkoffload.cc
@@ -258,6 +258,7 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
   unsigned ix;
   const char *sm_ver = NULL, *version = NULL;
   const char *sm_ver2 = NULL, *version2 = NULL;
+  const char *sm_libgomp = NULL;
   size_t file_cnt = 0;
   size_t *file_idx = XALLOCAVEC (size_t, len);

@@ -268,6 +269,7 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
   for (size_t i = 0; i != len;)
     {
       char c;
+      bool is_libgomp = false;
       bool output_fn_ptr = false;
       file_idx[file_cnt++] = i;

@@ -291,6 +293,13 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
               version = input + i + strlen (".version ");
               continue;
             }
+          if (UNLIKELY (startswith (input + i,
+                                    "// BEGIN GLOBAL FUNCTION "
+                                    "DEF: GOMP_target_ext")))
+            {
+              is_libgomp = true;
+              continue;
+            }
           while (startswith (input + i, "//:"))
             {
               i += 3;
@@ -319,28 +328,49 @@ process (FILE *in, FILE *out, uint32_t omp_requires)
           putc (c, out);
         }
       fprintf (out, "\";\n\n");
+      if (is_libgomp)
+        sm_libgomp = sm_ver;
       if (output_fn_ptr
           && (omp_requires & GOMP_REQUIRES_REVERSE_OFFLOAD) != 0)
         {
-          if (sm_ver && sm_ver[0] == '3' && sm_ver[1] == '0'
-              && sm_ver[2] == '\n')
-            {
-              warning_at (input_location, 0,
-                          "%<omp requires reverse_offload%> requires at "
-                          "least %<sm_35%> for "
-                          "%<-foffload-options=nvptx-none=-march=%> - disabling"
-                          " offload-code generation for this device type");
-              /* As now an empty file is compiled and there is no call to
-                 GOMP_offload_register_ver, this device type is effectively
-                 disabled. */
-              fflush (out);
-              ftruncate (fileno (out), 0);
-              return;
-            }
           sm_ver2 = sm_ver;
           version2 = version;
         }
     }
+  if (sm_ver2 && sm_libgomp
+      && sm_libgomp[0] < '7' && sm_libgomp[1] && sm_libgomp[2] == '\n')
+    {
+      /* The code for nvptx for GOMP_target_ext in libgomp/config/nvptx/target.c
+         for < sm_70 exists but is disabled here as it is unclear whether there
+         is the required consistency between host and device.
+         See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+         for details. */
+      warning_at (input_location, 0,
+                  "Disabling offload-code generation for this device type: "
+                  "%<omp requires reverse_offload%> can only be fulfilled "
+                  "for %<sm_70%> or higher");
+      inform (UNKNOWN_LOCATION,
+              "Reverse offload requires that GCC is configured with "
+              "%<--with-arch=sm_70%> or higher and not overridden by a lower "
+              "value for %<-foffload-options=nvptx-none=-march=%>");
+      /* As now an empty file is compiled and there is no call to
+         GOMP_offload_register_ver, this device type is effectively disabled. */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }
+  if (sm_ver2 && sm_ver2[0] == '3' && sm_ver2[1] == '0' && sm_ver[2] == '\n')
+    {
+      warning_at (input_location, 0,
+                  "%<omp requires reverse_offload%> requires at least %<sm_35%> "
+                  "for %<-foffload-options=nvptx-none=-march=%> - disabling "
+                  "offload-code generation for this device type");
+      /* As now an empty file is compiled and there is no call to
+         GOMP_offload_register_ver, this device type is effectively disabled. */
+      fflush (out);
+      ftruncate (fileno (out), 0);
+      return;
+    }

   /* Create function-pointer array, required for reverse offload
      function-pointer lookup. */
diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 3938d05..e081f04 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -77,6 +77,7 @@ typedef enum {
   CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS = 31,
   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR = 39,
   CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT = 40,
+  CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41,
   CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR = 82
 } CUdevice_attribute;

@@ -113,6 +114,7 @@ enum {
 #define CU_LAUNCH_PARAM_END ((void *) 0)
 #define CU_LAUNCH_PARAM_BUFFER_POINTER ((void *) 1)
 #define CU_LAUNCH_PARAM_BUFFER_SIZE ((void *) 2)
+#define CU_MEMHOSTALLOC_DEVICEMAP 0x02U

 enum {
   CU_STREAM_DEFAULT = 0,
@@ -169,6 +171,7 @@ CUresult cuMemGetInfo (size_t *, size_t *);
 CUresult cuMemAlloc (CUdeviceptr *, size_t);
 #define cuMemAllocHost cuMemAllocHost_v2
 CUresult cuMemAllocHost (void **, size_t);
+CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
diff --git a/libgomp/config/nvptx/icv-device.c b/libgomp/config/nvptx/icv-device.c
index 6f869be..eef151c 100644
--- a/libgomp/config/nvptx/icv-device.c
+++ b/libgomp/config/nvptx/icv-device.c
@@ -30,7 +30,7 @@

 /* This is set to the ICV values of current GPU during device initialization,
    when the offload image containing this libgomp portion is loaded. */
-static volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;

 void
 omp_set_default_device (int device_num __attribute__((unused)))
diff --git a/libgomp/config/nvptx/libgomp-nvptx.h b/libgomp/config/nvptx/libgomp-nvptx.h
new file mode 100644
index 0000000..5da9aae
--- /dev/null
+++ b/libgomp/config/nvptx/libgomp-nvptx.h
@@ -0,0 +1,51 @@
+/* Copyright (C) 2022 Free Software Foundation, Inc.
+   Contributed by Tobias Burnus <tobias@codesourcery.com>.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should have received a copy of the GNU General Public License and
+   a copy of the GCC Runtime Library Exception along with this program;
+   see the files COPYING3 and COPYING.RUNTIME respectively. If not, see
+   <http://www.gnu.org/licenses/>. */
+
+/* This file contains defines and type definitions shared between the
+   nvptx target's libgomp.a and the plugin-nvptx.c, but that is only
+   needef for this target. */
+
+#ifndef LIBGOMP_NVPTX_H
+#define LIBGOMP_NVPTX_H 1
+
+#define GOMP_REV_OFFLOAD_VAR __gomp_rev_offload_var
+
+struct rev_offload {
+  uint64_t fn;
+  uint64_t mapnum;
+  uint64_t addrs;
+  uint64_t sizes;
+  uint64_t kinds;
+  int32_t dev_num;
+};
+
+#if (__SIZEOF_SHORT__ != 2 \
+     || __SIZEOF_SIZE_T__ != 8 \
+     || __SIZEOF_POINTER__ != 8)
+#error "Data-type conversion required for rev_offload"
+#endif
+
+#endif /* LIBGOMP_NVPTX_H */
+
diff --git a/libgomp/config/nvptx/target.c b/libgomp/config/nvptx/target.c
index 11108d2..6470ae8 100644
--- a/libgomp/config/nvptx/target.c
+++ b/libgomp/config/nvptx/target.c
@@ -24,9 +24,12 @@
    <http://www.gnu.org/licenses/>. */

 #include "libgomp.h"
+#include "libgomp-nvptx.h" /* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
 #include <limits.h>

 extern int __gomp_team_num __attribute__((shared));
+extern volatile struct gomp_offload_icvs GOMP_ADDITIONAL_ICVS;
+volatile struct rev_offload *GOMP_REV_OFFLOAD_VAR;

 bool
 GOMP_teams4 (unsigned int num_teams_lower, unsigned int num_teams_upper,
@@ -88,16 +91,60 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
                  void **hostaddrs, size_t *sizes, unsigned short *kinds,
                  unsigned int flags, void **depend, void **args)
 {
-  (void) device;
-  (void) fn;
-  (void) mapnum;
-  (void) hostaddrs;
-  (void) sizes;
-  (void) kinds;
+  static int lock = 0; /* == gomp_mutex_t lock; gomp_mutex_init (&lock); */
   (void) flags;
   (void) depend;
   (void) args;
-  __builtin_unreachable ();
+
+  if (device != GOMP_DEVICE_HOST_FALLBACK
+      || fn == NULL
+      || GOMP_REV_OFFLOAD_VAR == NULL)
+    return;
+
+  gomp_mutex_lock (&lock);
+
+  GOMP_REV_OFFLOAD_VAR->mapnum = mapnum;
+  GOMP_REV_OFFLOAD_VAR->addrs = (uint64_t) hostaddrs;
+  GOMP_REV_OFFLOAD_VAR->sizes = (uint64_t) sizes;
+  GOMP_REV_OFFLOAD_VAR->kinds = (uint64_t) kinds;
+  GOMP_REV_OFFLOAD_VAR->dev_num = GOMP_ADDITIONAL_ICVS.device_num;
+
+  /* 'fn' must be last. */
+#if __PTX_SM__ >= 700
+  uint64_t addr_struct_fn = (uint64_t) &GOMP_REV_OFFLOAD_VAR->fn;
+  asm volatile ("st.global.release.sys.u64 [%0], %1;"
+                : : "r"(addr_struct_fn), "r" (fn) : "memory");
+#else
+/* The following has been effectively disabled via mkoffload as it is unclear
+   whether there is the required consistency between host and device.
+   See https://gcc.gnu.org/pipermail/gcc-patches/2022-October/602715.html
+   Note: Using atomic with scope = .sys is already supported since >= 600.
+   The generated code is:
+   @ %r69 membar.sys;
+   @ %r69 atom.exch.b64 _,[%r27],%r41; */
+  __atomic_store_n (&GOMP_REV_OFFLOAD_VAR->fn, fn, __ATOMIC_RELEASE);
+#endif
+
+  /* Processed on the host - when done, fn is set to NULL. */
+#if __PTX_SM__ >= 700
+  uint64_t fn2;
+  do
+    {
+      asm volatile ("ld.acquire.sys.global.u64 %0, [%1];"
+                    : "=r" (fn2) : "r" (addr_struct_fn) : "memory");
+    }
+  while (fn2 != 0);
+#else
+/* See remark above. The generated memory-access code is
+   ld.global.u64 %r64,[__gomp_rev_offload_var];
+   ld.u64 %r36,[%r64];
+   membar.sys; */
+  __sync_synchronize (); /* membar.sys */
+  while (__atomic_load_n (&GOMP_REV_OFFLOAD_VAR->fn, __ATOMIC_ACQUIRE) != 0)
+    ; /* spin */
+#endif
+
+  gomp_mutex_unlock (&lock);
 }

 void
diff --git a/libgomp/libgomp-plugin.c b/libgomp/libgomp-plugin.c
index 9d4cc62..316de74 100644
--- a/libgomp/libgomp-plugin.c
+++ b/libgomp/libgomp-plugin.c
@@ -78,3 +78,15 @@ GOMP_PLUGIN_fatal (const char *msg, ...)
   gomp_vfatal (msg, ap);
   va_end (ap);
 }
+
+void
+GOMP_PLUGIN_target_rev (uint64_t fn_ptr, uint64_t mapnum, uint64_t devaddrs_ptr,
+                        uint64_t sizes_ptr, uint64_t kinds_ptr, int dev_num,
+                        void (*dev_to_host_cpy) (void *, const void *, size_t,
+                                                 void *),
+                        void (*host_to_dev_cpy) (void *, const void *, size_t,
+                                                 void *), void *token)
+{
+  gomp_target_rev (fn_ptr, mapnum, devaddrs_ptr, sizes_ptr, kinds_ptr, dev_num,
+                   dev_to_host_cpy, host_to_dev_cpy, token);
+}
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 6ab5ac6..875f967 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -121,6 +121,13 @@ extern void GOMP_PLUGIN_error (const char *, ...)
 extern void GOMP_PLUGIN_fatal (const char *, ...)
         __attribute__ ((noreturn, format (printf, 1, 2)));

+extern void GOMP_PLUGIN_target_rev (uint64_t, uint64_t, uint64_t, uint64_t,
+                                    uint64_t, int,
+                                    void (*) (void *, const void *, size_t,
+                                              void *),
+                                    void (*) (void *, const void *, size_t,
+                                              void *), void *);
+
 /* Prototypes for functions implemented by libgomp plugins. */
 extern const char *GOMP_OFFLOAD_get_name (void);
 extern unsigned int GOMP_OFFLOAD_get_caps (void);
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 7519274..5803683 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -1128,6 +1128,11 @@ extern int gomp_pause_host (void);
 extern void gomp_init_targets_once (void);
 extern int gomp_get_num_devices (void);
 extern bool gomp_target_task_fn (void *);
+extern void gomp_target_rev (uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
+                             int,
+                             void (*) (void *, const void *, size_t, void *),
+                             void (*) (void *, const void *, size_t, void *),
+                             void *);

 /* Splay tree definitions. */
 typedef struct splay_tree_node_s *splay_tree_node;
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 46d5f10..12f76f7 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -622,3 +622,8 @@ GOMP_PLUGIN_1.3 {
         GOMP_PLUGIN_goacc_profiling_dispatch;
         GOMP_PLUGIN_goacc_thread;
 } GOMP_PLUGIN_1.2;
+
+GOMP_PLUGIN_1.4 {
+  global:
+        GOMP_PLUGIN_target_rev;
+} GOMP_PLUGIN_1.3;
diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index cd91b39..dff42d6 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -29,6 +29,7 @@ CUDA_ONE_CALL_MAYBE_NULL (cuLinkCreate_v2)
 CUDA_ONE_CALL (cuLinkDestroy)
 CUDA_ONE_CALL (cuMemAlloc)
 CUDA_ONE_CALL (cuMemAllocHost)
+CUDA_ONE_CALL (cuMemHostAlloc)
 CUDA_ONE_CALL (cuMemcpy)
 CUDA_ONE_CALL (cuMemcpyDtoDAsync)
 CUDA_ONE_CALL (cuMemcpyDtoH)
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index ba6b229..de24398 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -40,6 +40,9 @@
 #include "gomp-constants.h"
 #include "oacc-int.h"

+/* For struct rev_offload + GOMP_REV_OFFLOAD_VAR. */
+#include "config/nvptx/libgomp-nvptx.h"
+
 #include <pthread.h>
 #ifndef PLUGIN_NVPTX_INCLUDE_SYSTEM_CUDA_H
 # include "cuda/cuda.h"
@@ -329,6 +332,7 @@ struct ptx_device
     pthread_mutex_t lock;
   } omp_stacks;

+  struct rev_offload *rev_data;
   struct ptx_device *next;
 };

@@ -423,7 +427,7 @@ nvptx_open_device (int n)
   struct ptx_device *ptx_dev;
   CUdevice dev, ctx_dev;
   CUresult r;
-  int async_engines, pi;
+  int pi;

   CUDA_CALL_ERET (NULL, cuDeviceGet, &dev, n);

@@ -519,10 +523,20 @@ nvptx_open_device (int n)
                   CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR, dev);
   ptx_dev->max_threads_per_multiprocessor = pi;

+#if 0
+  int async_engines;
   r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &async_engines,
                          CU_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT, dev);
   if (r != CUDA_SUCCESS)
     async_engines = 1;
+#endif
+
+  /* Required below for reverse offload as implemented, but with compute
+     capability >= 2.0 and 64bit device processes, this should be universally be
+     the case; hence, an assert. */
+  r = CUDA_CALL_NOCHECK (cuDeviceGetAttribute, &pi,
+                         CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, dev);
+  assert (r == CUDA_SUCCESS && pi);

   for (int i = 0; i != GOMP_DIM_MAX; i++)
     ptx_dev->default_dims[i] = 0;
@@ -1179,8 +1193,10 @@ GOMP_OFFLOAD_get_num_devices (unsigned int omp_requires_mask)
 {
   int num_devices = nvptx_get_num_devices ();
   /* Return -1 if no omp_requires_mask cannot be fulfilled but
-     devices were present. */
-  if (num_devices > 0 && omp_requires_mask != 0)
+     devices were present. Unified-shared address: see comment in
+     nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
*/ + if (num_devices > 0 + && (omp_requires_mask & ~GOMP_REQUIRES_UNIFIED_ADDRESS) != 0) return -1; return num_devices; } @@ -1380,7 +1396,7 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data, else if (rev_fn_table) { CUdeviceptr var; - size_t bytes; + size_t bytes, i; r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, &var, &bytes, module, "$offload_func_table"); if (r != CUDA_SUCCESS) @@ -1390,6 +1406,37 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version, const void *target_data, r = CUDA_CALL_NOCHECK (cuMemcpyDtoH, *rev_fn_table, var, bytes); if (r != CUDA_SUCCESS) GOMP_PLUGIN_fatal ("cuMemcpyDtoH error: %s", cuda_error (r)); + /* Free if only NULL entries. */ + for (i = 0; i < fn_entries; ++i) + if ((*rev_fn_table)[i] != 0) + break; + if (i == fn_entries) + { + free (*rev_fn_table); + *rev_fn_table = NULL; + } + } + + if (rev_fn_table && *rev_fn_table && dev->rev_data == NULL) + { + /* cuMemHostAlloc memory is accessible on the device, if unified-shared + address is supported; this is assumed - see comment in + nvptx_open_device for CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING. 
*/ + CUDA_CALL_ASSERT (cuMemHostAlloc, (void **) &dev->rev_data, + sizeof (*dev->rev_data), CU_MEMHOSTALLOC_DEVICEMAP); + CUdeviceptr dp = (CUdeviceptr) dev->rev_data; + CUdeviceptr device_rev_offload_var; + size_t device_rev_offload_size; + CUresult r = CUDA_CALL_NOCHECK (cuModuleGetGlobal, + &device_rev_offload_var, + &device_rev_offload_size, module, + XSTRING (GOMP_REV_OFFLOAD_VAR)); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuModuleGetGlobal error - GOMP_REV_OFFLOAD_VAR: %s", cuda_error (r)); + r = CUDA_CALL_NOCHECK (cuMemcpyHtoD, device_rev_offload_var, &dp, + sizeof (dp)); + if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuMemcpyHtoD error: %s", cuda_error (r)); } nvptx_set_clocktick (module, dev); @@ -2001,6 +2048,23 @@ nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num) return (void *) ptx_dev->omp_stacks.ptr; } + +void +rev_off_dev_to_host_cpy (void *dest, const void *src, size_t size, + CUstream stream) +{ + CUDA_CALL_ASSERT (cuMemcpyDtoHAsync, dest, (CUdeviceptr) src, size, stream); + CUDA_CALL_ASSERT (cuStreamSynchronize, stream); +} + +void +rev_off_host_to_dev_cpy (void *dest, const void *src, size_t size, + CUstream stream) +{ + CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, (CUdeviceptr) dest, src, size, stream); + CUDA_CALL_ASSERT (cuStreamSynchronize, stream); +} + void GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) { @@ -2035,6 +2099,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads); size_t stack_size = nvptx_stacks_size (); + bool reverse_offload = ptx_dev->rev_data != NULL; + CUstream copy_stream = NULL; pthread_mutex_lock (&ptx_dev->omp_stacks.lock); void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads); @@ -2048,12 +2114,41 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) GOMP_PLUGIN_debug (0, " %s: kernel %s: launch" " [(teams: %u), 1, 1] [(lanes: 32), (threads: %u), 1]\n", 
 __FUNCTION__, fn_name, teams, threads); + if (reverse_offload) + CUDA_CALL_ASSERT (cuStreamCreate, &copy_stream, CU_STREAM_NON_BLOCKING); r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1, 32, threads, 1, 0, NULL, NULL, config); if (r != CUDA_SUCCESS) GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r)); - - r = CUDA_CALL_NOCHECK (cuCtxSynchronize, ); + if (reverse_offload) + while (true) + { + r = CUDA_CALL_NOCHECK (cuStreamQuery, NULL); + if (r == CUDA_SUCCESS) + break; + if (r == CUDA_ERROR_LAUNCH_FAILED) + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s %s\n", cuda_error (r), + maybe_abort_msg); + else if (r != CUDA_ERROR_NOT_READY) + GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r)); + + if (__atomic_load_n (&ptx_dev->rev_data->fn, __ATOMIC_ACQUIRE) != 0) + { + struct rev_offload *rev_data = ptx_dev->rev_data; + GOMP_PLUGIN_target_rev (rev_data->fn, rev_data->mapnum, + rev_data->addrs, rev_data->sizes, + rev_data->kinds, rev_data->dev_num, + rev_off_dev_to_host_cpy, + rev_off_host_to_dev_cpy, copy_stream); + CUDA_CALL_ASSERT (cuStreamSynchronize, copy_stream); + __atomic_store_n (&rev_data->fn, 0, __ATOMIC_RELEASE); + } + usleep (1); + } + else + r = CUDA_CALL_NOCHECK (cuCtxSynchronize, ); + if (reverse_offload) + CUDA_CALL_ASSERT (cuStreamDestroy, copy_stream); if (r == CUDA_ERROR_LAUNCH_FAILED) GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r), maybe_abort_msg); diff --git a/libgomp/target.c b/libgomp/target.c index 5763483..71bcb05 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -2925,6 +2925,25 @@ GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum, htab_free (refcount_set); } +/* Handle reverse offload. This is called by the device plugins for a + reverse offload; it is not called if the outer target runs on the host. 
*/ + +void +gomp_target_rev (uint64_t fn_ptr __attribute__ ((unused)), + uint64_t mapnum __attribute__ ((unused)), + uint64_t devaddrs_ptr __attribute__ ((unused)), + uint64_t sizes_ptr __attribute__ ((unused)), + uint64_t kinds_ptr __attribute__ ((unused)), + int dev_num __attribute__ ((unused)), + void (*dev_to_host_cpy) (void *, const void *, size_t, + void *) __attribute__ ((unused)), + void (*host_to_dev_cpy) (void *, const void *, size_t, + void *) __attribute__ ((unused)), + void *token __attribute__ ((unused))) +{ + __builtin_unreachable (); +} + /* Host fallback for GOMP_target_data{,_ext} routines. */ static void