Message ID | 12ebc63d-9f48-a292-ae17-4ac70254c500@codesourcery.com
State | New
Series | libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)
Now committed as r14-2865-g8b9e559fe7ca5715c74115322af99dbf9137a399

Tobias

On 28.07.23 13:51, Tobias Burnus wrote:
> thanks for the proofreading and the suggestions! – Do you have comments
> on the attached patch?
>
> * * *
>
> Crossref: For further optimizations, see also
>
> https://gcc.gnu.org/PR101581 — [OpenMP] omp_target_memcpy – support
> inter-device memcpy
> https://gcc.gnu.org/PR110813 — [OpenMP] omp_target_memcpy_rect (+
> strided 'target update'): Improve GCN performance and contiguous
> subranges
>
> and just added based on Thomas' comment:
>
> https://gcc.gnu.org/PR107424 — [OpenMP] Check whether device locking is
> really needed for bare memcopy to/from devices (omp_target_memcpy...)
>
> * * *
>
> On 27.07.23 23:00, Thomas Schwinge wrote:
>>> +++ b/include/cuda/cuda.h
>> I note that you're not actually using everything you're adding here.
>> (..., but I understand you're simply adding everything that relates to
>> these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.)
>
> Yes. That was on purpose, to make it easier to pick something when needed
> – especially as we might want to use some of those later on.
>
> For symmetry, I now also added cuMemcpyPeer + ...Async, which also
> remain unused. (But could be used as part of the PRs linked above.)
>
>>> +  const void *dstHost;
>> That last one isn't 'const'. ;-)
>
> Fixed - three times.
>
>> A 'cuda.h' that I looked at calls that last one 'reserved0', with
>> comment "Must be NULL".
>
> Seems to be unused in real-world code and in the documentation. But
> let's use this name as it might be exposed in the wild.
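[Editorial note: the 'const' fix discussed above exists because the destination host pointer of a copy descriptor is written through, so it cannot usefully be 'const'. A minimal sketch, using a hypothetical trimmed-down stand-in for the fields involved, not the real CUDA_MEMCPY2D definition:]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, trimmed stand-in for the copy-descriptor fields;
   the real struct lives in include/cuda/cuda.h.  */
typedef struct
{
  const void *srcHost;   /* source: only read, so 'const' is correct */
  void *dstHost;         /* destination: written to, so no 'const' */
  size_t WidthInBytes;
} memcpy2d_desc;

static void
do_host_copy (const memcpy2d_desc *d)
{
  /* With 'const void *dstHost' this memcpy would require a cast that
     discards the qualifier -- the header fix removes that wart.  */
  memcpy (d->dstHost, d->srcHost, d->WidthInBytes);
}
```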
>>> --- a/libgomp/libgomp-plugin.h
>>> +++ b/libgomp/libgomp-plugin.h
>>> +extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t,
>>> +                                  void *, size_t, size_t, size_t,
>>> +                                  const void *, size_t, size_t, size_t);
>>> +extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t,
>>> +                                  void *, size_t, size_t, size_t, size_t,
>>> +                                  size_t, const void *, size_t, size_t,
>>> +                                  size_t, size_t, size_t);
>> Oh, wow. ;-)
>
> Maybe this is not the best ABI. We can consider modifying it before the
> GCC 14 release. (And in principle also afterwards, given that libgomp
> and its plugins should™ be compiled and installed alongside each other.)
>
> I think once we know how to implement GCN, we will see whether it was
> done smartly or whether other arguments should be used or whether the
> two functions should be combined.
>
> [Regarding the reserved0/reserved1 values for cuMemcpy3D and whether they
> should be NULL or not; quoting the usage in plugin-nvptx.c:]
>
>> I note that this doesn't adhere to the two "Must be NULL" remarks from
>> above -- but I'm confused, because, for example, on
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g4b5238975579f002c0199a3800ca44df>
>> "cuMemcpy3D", there also are no such remarks. (... in contrast to:
>> "srcLOD and dstLOD members of the CUDA_MEMCPY3D structure must be set
>> to 0", which you do set accordingly.)
>>
>> Maybe just 'memset' the whole thing to '0', for good measure?
>
> OK - but then we can remove the LOD init.
>
>>> +      else if (src_devicep != NULL
>>> +               && (dst_devicep == NULL
>>> +                   || (dst_devicep->capabilities
>>> +                       & GOMP_OFFLOAD_CAP_SHARED_MEM)))
>> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
>> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears
>> out the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?
>
> I have now undone this change – I did not dig deep enough into the
> function calls.
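[Editorial note: the "memset the whole thing to 0" approach agreed on above can be sketched as follows. The struct here is a hypothetical, trimmed stand-in for CUDA_MEMCPY3D, and init_desc is an illustrative helper, not libgomp code:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of a few CUDA_MEMCPY3D members, for illustration
   only -- the real struct lives in include/cuda/cuda.h.  */
typedef struct
{
  size_t srcXInBytes, srcY, srcZ;
  size_t srcLOD;
  void *reserved0;          /* "Must be NULL" per some cuda.h copies.  */
  size_t dstXInBytes, dstY, dstZ;
  size_t dstLOD;
  void *reserved1;          /* Likewise.  */
  size_t WidthInBytes, Height, Depth;
} memcpy3d_desc;

/* Zero the whole descriptor up front: reserved0/reserved1 become NULL
   and srcLOD/dstLOD become 0, so the explicit per-field LOD init can
   be dropped, as discussed above.  */
static void
init_desc (memcpy3d_desc *data, size_t width, size_t height, size_t depth)
{
  memset (data, 0, sizeof (*data));
  data->WidthInBytes = width;
  data->Height = height;
  data->Depth = depth;
}
```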
>
>
>>> +  else if (dst_devicep == NULL && src_devicep == NULL)
>>> +    {
>>> +      memcpy ((char *) dst + dst_off, (const char *) src + src_off,
>>> +              length);
>>> +      ret = 1;
>>> +    }
>>>    else if (src_devicep == dst_devicep)
>>>      ret = src_devicep->dev2dev_func (src_devicep->target_id,
>>>                                       (char *) dst + dst_off,
>>>                                       (const char *) src + src_off,
>>>                                       length);
>> ..., but also left the intra-device case here -- which should now be
>> dead code here?
>
> Why? Unless I missed something, the old, the current, and the proposed
> (= old) code do still run this code.
>
> I have not added an assert to confirm, but in any case, it is tested for
> in my recently added testcase – thus, we could add a 'printf' to confirm.
>
>>> +  else if (*tmp_size < length)
>>> +    {
>>> +      *tmp_size = length;
>>> +      *tmp = realloc (*tmp, length);
>>> +      if (*tmp == NULL)
>>> +        return ENOMEM;
>> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?
>>
>> Do we really need here the property that if the re-allocation can't
>> be done in-place, 'realloc' copies the original content to the new
>> buffer? In other words, should we just unconditionally 'free' and
>> re-'malloc' here, instead of 'realloc'?
>
> I have now done so – but I am not really sure what's faster on average.
> If the buffer can be enlarged in place, 'realloc' is faster; if it
> cannot, free+malloc is better.
>
>> I haven't looked whether the re-use of 'tmp' for multiple calls to this
>> is then actually useful, or whether we should just always 'malloc', use,
>> 'free' the buffer here?
>
> Well, it can run in a hot loop – assume a C array array[1024][1024][2]
> and copying array[:1024,:1024,0:1] (using OpenMP syntax) – i.e. copying
> every other element, 1048576 times. And therefore I would like to avoid
> repeated malloc/free in such a case. (But in general, inter-device
> copying should be very rare.)
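[Editorial note: the scratch-buffer reuse and the realloc-leak concern discussed above can be sketched as a standalone host-only helper. copy_via_tmp is a hypothetical name, and plain memcpy stands in for the dev2host/host2dev hooks; the point is that a hot loop copying many equally-sized segments allocates only once, and an allocation failure leaves no dangling buffer or stale size behind:]

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Copy 'length' bytes via a caller-cached scratch buffer.  Only grows
   the buffer when it is too small, so repeated equally-sized copies
   reuse one allocation.  On malloc failure, *tmp is NULL and *tmp_size
   is 0, keeping the cached state consistent for a possible retry.  */
static int
copy_via_tmp (char *dst, const char *src, size_t length,
              void **tmp, size_t *tmp_size)
{
  if (*tmp_size < length)
    {
      free (*tmp);
      *tmp = NULL;
      *tmp_size = 0;
      *tmp = malloc (length);
      if (*tmp == NULL)
        return ENOMEM;
      *tmp_size = length;
    }
  /* Stand-in for dev2host followed by host2dev.  */
  memcpy (*tmp, src, length);
  memcpy (dst, *tmp, length);
  return 0;
}
```

A loop performing many one-byte strided copies (the array[:1024,:1024,0:1] scenario in miniature) then triggers exactly one malloc.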
>
> Actually, I think the realloc case is unreachable: for rectangular
> copies, as implied both by 'target update' with strided access and by
> 'omp_target_memcpy_rect', the size should be constant.
>
>> (For later...) Is that what
>> <https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1ge1f5c7771544fee150ada8853c7cbf4a>
>> "cuMemcpyPeer" would be used for?
>
> I concur that we could/should use cuMemcpyPeer and cuMemcpy3DPeer for
> omp_target and ..._rect. I added a comment to the respective PRs (linked
> at the top).
>
>>> +  lock_src = (src_devicep
>>> +              && (!dst_devicep
>>> +                  || src_devicep == dst_devicep
>>> +                  || !(src_devicep->capabilities
>>> +                       & GOMP_OFFLOAD_CAP_SHARED_MEM)));
>> Similar doubt as above re "'GOMP_OFFLOAD_CAP_SHARED_MEM' actually
>> reachable"?
>
> Updated in the attached patch + filed a PR about the need for the lock
> (see link at the top).
>
>>> +  DLSYM (memcpy2d);
>>> +  DLSYM (memcpy3d);
>> With 'DLSYM' used here, won't that fail if these symbols don't actually
>> exist (like for 'libgomp/plugin/plugin-gcn.c')?
>
> Hmm, that should be: DLSYM_OPT – but somehow testing on GCN did not show
> any failures. (Neither here nor in the testsuite; odd.)
>
> I intend to commit the attached follow-up patch very soon.
>
> [It has been tested without offloading support compiled in, with AMD GCN
> offloading (single GPU, no memcpy[23]d available), and on a 2-GPU system
> with nvptx offloading.]
>
> Tobias
-----------------
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201,
80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer:
Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München;
Registergericht München, HRB 106955
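[Editorial note: the DLSYM vs. DLSYM_OPT distinction can be illustrated host-side. Everything below (fake_dlsym, load_optional, the plugin struct) is hypothetical scaffolding standing in for libgomp's real dlsym-based macros in target.c: a missing required symbol makes plugin loading fail, while a missing optional one merely leaves the hook NULL, and callers test for that:]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical plugin descriptor with two optional hooks.  */
struct plugin
{
  int (*memcpy2d) (void);
  int (*memcpy3d) (void);
};

static int impl_memcpy2d (void) { return 2; }

/* Fake symbol table standing in for the plugin's dynamic symbols;
   note there deliberately is no "memcpy3d" entry, mimicking a
   GCN-like plugin that does not provide it.  */
static void *
fake_dlsym (const char *name)
{
  if (strcmp (name, "memcpy2d") == 0)
    return (void *) impl_memcpy2d;
  return NULL;   /* symbol not provided by this plugin */
}

/* DLSYM_OPT-style lookup: store the address if present, else NULL,
   instead of failing the whole plugin load.  */
static void
load_optional (struct plugin *p)
{
  p->memcpy2d = (int (*) (void)) fake_dlsym ("memcpy2d");
  p->memcpy3d = (int (*) (void)) fake_dlsym ("memcpy3d");
}
```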
Hi Tobias!

On 2023-07-28T13:51:41+0200, Tobias Burnus <tobias@codesourcery.com> wrote:
> On 27.07.23 23:00, Thomas Schwinge wrote:
>>> +      else if (src_devicep != NULL
>>> +               && (dst_devicep == NULL
>>> +                   || (dst_devicep->capabilities
>>> +                       & GOMP_OFFLOAD_CAP_SHARED_MEM)))
>> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
>> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears out
>> the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?
>
> I have now undone this change – I did not dig deep enough into the
> function calls.
>
>>> +  else if (dst_devicep == NULL && src_devicep == NULL)
>>> +    {
>>> +      memcpy ((char *) dst + dst_off, (const char *) src + src_off,
>>> +              length);
>>> +      ret = 1;
>>> +    }
>>>    else if (src_devicep == dst_devicep)
>>>      ret = src_devicep->dev2dev_func (src_devicep->target_id,
>>>                                       (char *) dst + dst_off,
>>>                                       (const char *) src + src_off,
>>>                                       length);
>> ..., but also left the intra-device case here -- which should now be
>> dead code here?
>
> Why? Unless I missed something, the old, the current, and the proposed
> (= old) code do still run this code.

It is now again reachable, but wasn't in the "intermediate state" (without
your follow-on change) -- at least per my understanding, which may be
gappy.

> I have not added an assert to confirm, but in any case, it is tested for
> in my recently added testcase – thus, we could add a 'printf' to confirm.

We're now back to the original code, which should be fine.

>>> +  else if (*tmp_size < length)
>>> +    {
>>> +      *tmp_size = length;
>>> +      *tmp = realloc (*tmp, length);
>>> +      if (*tmp == NULL)
>>> +        return ENOMEM;
>> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?
>>
>> Do we really need here the property that if the re-allocation can't
>> be done in-place, 'realloc' copies the original content to the new
>> buffer? In other words, should we just unconditionally 'free' and
>> re-'malloc' here, instead of 'realloc'?
> I have now done so – but I am not really sure what's faster on average.
> If the buffer can be enlarged in place, 'realloc' is faster; if it
> cannot, free+malloc is better.

I have no proof, but would assume that the C library handles 'free' plus
a subsequent 'malloc' as efficiently as 'realloc' handles the case where
there's space available after the original allocation.

>> I haven't looked whether the re-use of 'tmp' for multiple calls to this
>> is then actually useful, or whether we should just always 'malloc', use,
>> 'free' the buffer here?

..., hence that suggestion. But I agree what we've got now is good; just
one more idea: couldn't we now actually unify the (original) 'malloc' and
the (original) 'realloc' cases:

-      if (*tmp_size == 0)
-        {
-          *tmp_size = length;
-          *tmp = malloc (length);
-          if (*tmp == NULL)
-            return ENOMEM;
-        }
-      else if (*tmp_size < length)
+      if (*tmp_size < length)
         {
           *tmp_size = length;
           free (*tmp);
           *tmp = malloc (length);
           if (*tmp == NULL)
             return ENOMEM;
         }

(Untested.)

> Well, it can run in a hot loop – assume a C array array[1024][1024][2]
> and copying array[:1024,:1024,0:1] (using OpenMP syntax) – i.e. copying
> every other element, 1048576 times. And therefore I would like to avoid
> repeated malloc/free in such a case. (But in general, inter-device
> copying should be very rare.)
>
> Actually, I think the realloc case is unreachable: for rectangular
> copies, as implied both by 'target update' with strided access and by
> 'omp_target_memcpy_rect', the size should be constant.

Worth an 'assert'?

Now I "only" don't understand the '__builtin_mul_overflow' computations
and error checks ('return EINVAL;') for the four cases handled in
'libgomp/target.c:omp_target_memcpy_rect_worker': for example, the
generic loop at the end of the function iterates over all 'num_dims',
but the specific earlier 'num_dims == N' cases don't. But I suppose
that's just my own problem not understanding this API/code, and I'm not
currently planning on looking into this any further.
;-)


Grüße
 Thomas
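[Editorial note: the '__builtin_mul_overflow' checks puzzled over above can be seen in isolation for the num_dims == 1 case. compute_1d is a hypothetical helper that only mirrors the shape of the checks in omp_target_memcpy_rect_worker, not the actual function: the byte length and the src/dst start offsets are element counts scaled by the element size, and every multiplication is checked so a huge volume or offset cannot silently wrap:]

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Compute length = element_size * volume[0] and the byte offsets for a
   one-dimensional rectangular copy, returning EINVAL if any of the
   multiplications would overflow size_t (GCC's __builtin_mul_overflow
   reports the wrap and also stores the truncated product).  */
static int
compute_1d (size_t element_size, const size_t *volume,
            const size_t *dst_offsets, const size_t *src_offsets,
            size_t *length, size_t *dst_off, size_t *src_off)
{
  if (__builtin_mul_overflow (element_size, volume[0], length)
      || __builtin_mul_overflow (element_size, dst_offsets[0], dst_off)
      || __builtin_mul_overflow (element_size, src_offsets[0], src_off))
    return EINVAL;
  return 0;
}
```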
libgomp: cuda.h and omp_target_memcpy_rect cleanup

Fixes for commit r14-2792-g25072a477a56a727b369bf9b20f4d18198ff5894
"OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for
omp_target_memcpy_rect", namely:

In that commit, the code was changed to handle shared-memory devices;
however, as pointed out, omp_target_memcpy_check already sets the device
pointer to NULL in that case. Hence, this commit reverts to the prior
version.

In cuda.h, it adds cuMemcpyPeer{,Async} for symmetry with cuMemcpy3DPeer
(all currently unused) and, in three structs, fixes the reserved-member
names and removes a bogus 'const'. And it changes a DLSYM to DLSYM_OPT,
as not all plugins support the new functions yet.

include/ChangeLog:

        * cuda/cuda.h (CUDA_MEMCPY2D, CUDA_MEMCPY3D, CUDA_MEMCPY3D_PEER):
        Remove bogus 'const' from 'const void *dst' and fix the
        reserved-member names in those structs.
        (cuMemcpyPeer, cuMemcpyPeerAsync): Add.

libgomp/ChangeLog:

        * target.c (omp_target_memcpy_rect_worker): Undo dim=1 change for
        GOMP_OFFLOAD_CAP_SHARED_MEM.
        (omp_target_memcpy_rect_copy): Likewise for lock condition.
        (gomp_load_plugin_for_device): Use DLSYM_OPT not DLSYM for
        memcpy3d/memcpy2d.
        * plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
        GOMP_OFFLOAD_memcpy3d): Use memset 0 to nullify reserved and
        unused src/dst fields for that mem type; remove
        '{src,dst}LOD = 0'.
Co-authored-by: Thomas Schwinge <thomas@codesourcery.com>

 include/cuda/cuda.h           | 12 +++++-----
 libgomp/plugin/plugin-nvptx.c |  6 +++--
 libgomp/target.c              | 52 ++++++++++++++-----------------------
 3 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/include/cuda/cuda.h b/include/cuda/cuda.h
index 09c3c2b8dbe..94fc64a488d 100644
--- a/include/cuda/cuda.h
+++ b/include/cuda/cuda.h
@@ -147,7 +147,7 @@ typedef struct {
   size_t dstXInBytes, dstY;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
   size_t dstPitch;
@@ -162,16 +162,16 @@ typedef struct {
   const void *srcHost;
   CUdeviceptr srcDevice;
   CUarray srcArray;
-  void *dummy;
+  void *reserved0;
   size_t srcPitch, srcHeight;
   size_t dstXInBytes, dstY, dstZ;
   size_t dstLOD;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
-  void *dummy2;
+  void *reserved1;
   size_t dstPitch, dstHeight;

   size_t WidthInBytes, Height, Depth;
@@ -190,7 +190,7 @@ typedef struct {
   size_t dstXInBytes, dstY, dstZ;
   size_t dstLOD;
   CUmemorytype dstMemoryType;
-  const void *dstHost;
+  void *dstHost;
   CUdeviceptr dstDevice;
   CUarray dstArray;
   CUcontext dstContext;
@@ -246,6 +246,8 @@ CUresult cuMemAlloc (CUdeviceptr *, size_t);
 CUresult cuMemAllocHost (void **, size_t);
 CUresult cuMemHostAlloc (void **, size_t, unsigned int);
 CUresult cuMemcpy (CUdeviceptr, CUdeviceptr, size_t);
+CUresult cuMemcpyPeer (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t);
+CUresult cuMemcpyPeerAsync (CUdeviceptr, CUcontext, CUdeviceptr, CUcontext, size_t, CUstream);
 #define cuMemcpyDtoDAsync cuMemcpyDtoDAsync_v2
 CUresult cuMemcpyDtoDAsync (CUdeviceptr, CUdeviceptr, size_t, CUstream);
 #define cuMemcpyDtoH cuMemcpyDtoH_v2
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 9cdc55cac6b..00d4241ae02 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -1794,6 +1794,8 @@ GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,

   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
   CUDA_MEMCPY2D data;
+
+  memset (&data, 0, sizeof (data));
   data.WidthInBytes = dim1_size;
   data.Height = dim0_len;
@@ -1855,6 +1857,8 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,

   /* TODO: Consider using CU_MEMORYTYPE_UNIFIED if supported.  */
   CUDA_MEMCPY3D data;
+
+  memset (&data, 0, sizeof (data));
   data.WidthInBytes = dim2_size;
   data.Height = dim1_len;
   data.Depth = dim0_len;
@@ -1874,7 +1878,6 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   data.dstXInBytes = dst_offset2_size;
   data.dstY = dst_offset1_len;
   data.dstZ = dst_offset0_len;
-  data.dstLOD = 0;

   if (src_ord == -1)
     {
@@ -1891,7 +1894,6 @@ GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
   data.srcXInBytes = src_offset2_size;
   data.srcY = src_offset1_len;
   data.srcZ = src_offset0_len;
-  data.srcLOD = 0;

   CUDA_CALL (cuMemcpy3D, &data);
   return true;
diff --git a/libgomp/target.c b/libgomp/target.c
index 5cf2e8dce37..cd4cc1b01ca 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4540,33 +4540,22 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
          || __builtin_mul_overflow (element_size, dst_offsets[0], &dst_off)
          || __builtin_mul_overflow (element_size, src_offsets[0], &src_off))
        return EINVAL;
-      if (src_devicep != NULL && src_devicep == dst_devicep)
-       ret = src_devicep->dev2dev_func (src_devicep->target_id,
-                                        (char *) dst + dst_off,
-                                        (const char *) src + src_off,
-                                        length);
-      else if (src_devicep != NULL
-              && (dst_devicep == NULL
-                  || (dst_devicep->capabilities
-                      & GOMP_OFFLOAD_CAP_SHARED_MEM)))
-       ret = src_devicep->dev2host_func (src_devicep->target_id,
+      if (dst_devicep == NULL && src_devicep == NULL)
+       {
+         memcpy ((char *) dst + dst_off, (const char *) src + src_off,
+                 length);
+         ret = 1;
+       }
+      else if (src_devicep == NULL)
+       ret = dst_devicep->host2dev_func (dst_devicep->target_id,
                                          (char *) dst + dst_off,
                                          (const char *) src + src_off,
                                          length);
-      else if (dst_devicep != NULL
-              && (src_devicep == NULL
-                  || (src_devicep->capabilities
-                      & GOMP_OFFLOAD_CAP_SHARED_MEM)))
-       ret = dst_devicep->host2dev_func (dst_devicep->target_id,
+      else if (dst_devicep == NULL)
+       ret = src_devicep->dev2host_func (src_devicep->target_id,
                                          (char *) dst + dst_off,
                                          (const char *) src + src_off,
                                          length);
-      else if (dst_devicep == NULL && src_devicep == NULL)
-       {
-         memcpy ((char *) dst + dst_off, (const char *) src + src_off,
-                 length);
-         ret = 1;
-       }
       else if (src_devicep == dst_devicep)
        ret = src_devicep->dev2dev_func (src_devicep->target_id,
                                         (char *) dst + dst_off,
@@ -4584,7 +4573,8 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
       else if (*tmp_size < length)
        {
          *tmp_size = length;
-         *tmp = realloc (*tmp, length);
+         free (*tmp);
+         *tmp = malloc (length);
          if (*tmp == NULL)
            return ENOMEM;
        }
@@ -4599,7 +4589,7 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
       return ret ? 0 : EINVAL;
     }

-  /* host->device, device->host and same-device device->device.  */
+  /* host->device, device->host and intra device.  */
   if (num_dims == 2
       && ((src_devicep
           && src_devicep == dst_devicep
@@ -4711,16 +4701,8 @@ omp_target_memcpy_rect_copy (void *dst, const void *src,
   bool lock_src;
   bool lock_dst;

-  lock_src = (src_devicep
-             && (!dst_devicep
-                 || src_devicep == dst_devicep
-                 || !(src_devicep->capabilities
-                      & GOMP_OFFLOAD_CAP_SHARED_MEM)));
-  lock_dst = (dst_devicep
-             && (!lock_src
-                 || (src_devicep != dst_devicep
-                     && !(dst_devicep->capabilities
-                          & GOMP_OFFLOAD_CAP_SHARED_MEM))));
+  lock_src = src_devicep != NULL;
+  lock_dst = dst_devicep != NULL && src_devicep != dst_devicep;
   if (lock_src)
     gomp_mutex_lock (&src_devicep->lock);
   if (lock_dst)
@@ -5076,8 +5058,8 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   DLSYM (free);
   DLSYM (dev2host);
   DLSYM (host2dev);
-  DLSYM (memcpy2d);
-  DLSYM (memcpy3d);
+  DLSYM_OPT (memcpy2d, memcpy2d);
+  DLSYM_OPT (memcpy3d, memcpy3d);
   device->capabilities = device->get_caps_func ();
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
     {