
OpenMP: Support accelerated 2D/3D memory copies for AMD GCN

Message ID 20230920114405.134100-1-julian@codesourcery.com
State New
Series OpenMP: Support accelerated 2D/3D memory copies for AMD GCN

Commit Message

Julian Brown Sept. 20, 2023, 11:44 a.m. UTC
This patch adds support for 2D/3D memory copies for omp_target_memcpy_rect
and "target update", using AMD extensions to the HSA API.  I've just
committed a version of this patch to the og13 branch, but this is the
mainline version.
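
For reference, a minimal host-side example of the kind of call that can now
take the accelerated path (illustrative only -- the shapes, offsets and
device numbers here are arbitrary):

  #include <omp.h>
  #include <stddef.h>

  int
  main (void)
  {
    int dev = omp_get_default_device ();
    int host = omp_get_initial_device ();
    int src[8][10] = { { 0 } };
    /* Device-side buffer with the same 8x10 layout as the host array.  */
    void *dst = omp_target_alloc (sizeof src, dev);
    /* Copy the interior 4x6 rectangle starting at element [2][3]: a 2D
       rectangular transfer of the kind the plugin can accelerate.  */
    size_t volume[] = { 4, 6 };
    size_t offsets[] = { 2, 3 };
    size_t dims[] = { 8, 10 };
    int err = omp_target_memcpy_rect (dst, src, sizeof (int), 2, volume,
                                      offsets, offsets, dims, dims,
                                      dev, host);
    omp_target_free (dst, dev);
    return err;
  }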

Support is also added for 1-dimensional strided accesses: these are
treated as a special case of 2-dimensional transfers, where the innermost
dimension is formed from the stride length (in bytes).
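
For instance, with the strided-update support from the series listed below,
an update such as

  #pragma omp target update from(arr[0:n:4])

is a 1-dimensional strided transfer of n elements spaced 4 elements apart.
Each element becomes one "row" of a 2D rectangle whose pitch, on both source
and destination sides, is the stride in bytes.  A plain-memcpy sketch of
what the accelerated copy performs (the function name is illustrative, not
the actual libgomp internals):

  #include <stddef.h>
  #include <string.h>

  /* Model of a 1D strided transfer done as a 2D rectangular copy: N
     elements of ELEM_SIZE bytes each, STRIDE bytes apart (STRIDE >
     ELEM_SIZE and STRIDE % ELEM_SIZE == 0 are assumed).  */
  static void
  strided_as_2d (char *dst, const char *src, size_t elem_size, size_t n,
                 size_t stride)
  {
    for (size_t row = 0; row < n; row++)  /* Outer dimension: n rows.  */
      memcpy (dst + row * stride,         /* Dest pitch = stride.  */
              src + row * stride,         /* Source pitch = stride.  */
              elem_size);                 /* Inner dimension: one element.  */
  }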

This patch has (somewhat awkwardly from a review perspective) been merged
on top of the following list of in-review series:

"OpenMP/OpenACC: map clause and OMP gimplify rework":
https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627895.html

"OpenMP: lvalue parsing and "declare mapper" support":
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629363.html

"OpenMP: Array-shaping operator and strided/rectangular 'target update'
support":
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629422.html

"OpenMP: Enable 'declare mapper' mappers for 'target update' directives":
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629432.html

That said, the patch only depends directly on parts of that work (regarding
strided/rectangular updates).  A stand-alone version that just works
for the OpenMP API routine omp_target_memcpy_rect could be prepared to
apply separately, if preferable.

This version has been re-tested and bootstrapped.  OK?

2023-09-20  Julian Brown  <julian@codesourcery.com>

libgomp/
	* plugin/plugin-gcn.c (hsa_runtime_fn_info): Add
	hsa_amd_memory_lock_fn, hsa_amd_memory_unlock_fn,
	hsa_amd_memory_async_copy_rect_fn function pointers.
	(init_hsa_runtime_functions): Add above functions, with
	DLSYM_OPT_FN.
	(GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New functions.
	* target.c (omp_target_memcpy_rect_worker): Add 1D strided transfer
	support.
---
 libgomp/plugin/plugin-gcn.c | 359 ++++++++++++++++++++++++++++++++++++
 libgomp/target.c            |  31 ++++
 2 files changed, 390 insertions(+)

Comments

Tobias Burnus Dec. 21, 2023, 4:05 p.m. UTC | #1
Hi Julian,

On 20.09.23 13:44, Julian Brown wrote:

> This patch adds support for 2D/3D memory copies for omp_target_memcpy_rect
> and "target update", using AMD extensions to the HSA API.  I've just
> committed a version of this patch to the og13 branch, but this is the
> mainline version.
>
> Support is also added for 1-dimensional strided accesses: these are
> treated as a special case of 2-dimensional transfers, where the innermost
> dimension is formed from the stride length (in bytes).
>
> This patch has (somewhat awkwardly from a review perspective) been merged
> on top of the following list of in-review series:
>
> "OpenMP/OpenACC: map clause and OMP gimplify rework":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-August/627895.html
[That's now in mainline.]
> "OpenMP: lvalue parsing and "declare mapper" support":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629363.html
[Review in progress for the lvalue patches (C++; then C); mapper part
should not be required for this patch.]
> "OpenMP: Array-shaping operator and strided/rectangular 'target update'
> [support":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629422.html
Here, 2/5 and 3/5 are required (and 1/5 is committed).  [4/5 and 5/5 make
it more useful, but should not be required for this patch.]
> "OpenMP: Enable 'declare mapper' mappers for 'target update' directives":
> https://gcc.gnu.org/pipermail/gcc-patches/2023-September/629432.html

I think that patch is not required for this patch, but it is obviously
still useful.

* * *

I think it makes sense to split this patch into two parts:

* The libgomp/plugin/plugin-gcn.c change – which is independent and would
  already be used by omp_target_memcpy_rect.
* The libgomp/target.c change – which depends on other patches and adds
  1D strided transfer support.

I will ignore the latter change in this review, and add a note to
https://gcc.gnu.org/wiki/openmpPendingPatches to ensure that it will
eventually be reviewed and added.

* * *

> That said, the patch only depends directly on parts of that work (regarding
> strided/rectangular updates).  A stand-alone version that just works
> for the OpenMP API routine omp_target_memcpy_rect could be prepared to
> apply separately, if preferable.
>
> This version has been re-tested and bootstrapped.  OK?
>
> 2023-09-20  Julian Brown  <julian@codesourcery.com>
>
> libgomp/
>       * plugin/plugin-gcn.c (hsa_runtime_fn_info): Add
>       hsa_amd_memory_lock_fn, hsa_amd_memory_unlock_fn,
>       hsa_amd_memory_async_copy_rect_fn function pointers.
>       (init_hsa_runtime_functions): Add above functions, with
>       DLSYM_OPT_FN.
>       (GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New functions.
>       * target.c (omp_target_memcpy_rect_worker): Add 1D strided transfer
>       support.
> ---
>   libgomp/plugin/plugin-gcn.c | 359 ++++++++++++++++++++++++++++++++++++
The plugin change LGTM. Thanks.
>   libgomp/target.c            |  31 ++++

I defer the review of this part until the other required patches are in.

Cf. https://gcc.gnu.org/wiki/openmpPendingPatches

Thanks,

Tobias

> diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
> index ef22d48da79..95c0a57e792 100644
> --- a/libgomp/plugin/plugin-gcn.c
> +++ b/libgomp/plugin/plugin-gcn.c
> @@ -196,6 +196,16 @@ struct hsa_runtime_fn_info
>     hsa_status_t (*hsa_code_object_deserialize_fn)
>       (void *serialized_code_object, size_t serialized_code_object_size,
>        const char *options, hsa_code_object_t *code_object);
> +  hsa_status_t (*hsa_amd_memory_lock_fn)
> +    (void *host_ptr, size_t size, hsa_agent_t *agents, int num_agent,
> +     void **agent_ptr);
> +  hsa_status_t (*hsa_amd_memory_unlock_fn) (void *host_ptr);
> +  hsa_status_t (*hsa_amd_memory_async_copy_rect_fn)
> +    (const hsa_pitched_ptr_t *dst, const hsa_dim3_t *dst_offset,
> +     const hsa_pitched_ptr_t *src, const hsa_dim3_t *src_offset,
> +     const hsa_dim3_t *range, hsa_agent_t copy_agent,
> +     hsa_amd_copy_direction_t dir, uint32_t num_dep_signals,
> +     const hsa_signal_t *dep_signals, hsa_signal_t completion_signal);
>   };
>
>   /* Structure describing the run-time and grid properties of an HSA kernel
> @@ -1398,6 +1408,9 @@ init_hsa_runtime_functions (void)
>     DLSYM_FN (hsa_signal_load_acquire)
>     DLSYM_FN (hsa_queue_destroy)
>     DLSYM_FN (hsa_code_object_deserialize)
> +  DLSYM_OPT_FN (hsa_amd_memory_lock)
> +  DLSYM_OPT_FN (hsa_amd_memory_unlock)
> +  DLSYM_OPT_FN (hsa_amd_memory_async_copy_rect)
>     return true;
>   #undef DLSYM_FN
>   }
> @@ -3790,6 +3803,352 @@ GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
>     return true;
>   }
>
> +/* Here <quantity>_size refers to <quantity> multiplied by size -- i.e.
> +   measured in bytes.  So we have:
> +
> +   dim1_size: number of bytes to copy on innermost dimension ("row")
> +   dim0_len: number of rows to copy
> +   dst: base pointer for destination of copy
> +   dst_offset1_size: innermost row offset (for dest), in bytes
> +   dst_offset0_len: offset, number of rows (for dest)
> +   dst_dim1_size: whole-array dest row length, in bytes (pitch)
> +   src: base pointer for source of copy
> +   src_offset1_size: innermost row offset (for source), in bytes
> +   src_offset0_len: offset, number of rows (for source)
> +   src_dim1_size: whole-array source row length, in bytes (pitch)
> +*/
> +
> +int
> +GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
> +                    size_t dim0_len, void *dst, size_t dst_offset1_size,
> +                    size_t dst_offset0_len, size_t dst_dim1_size,
> +                    const void *src, size_t src_offset1_size,
> +                    size_t src_offset0_len, size_t src_dim1_size)
> +{
> +  if (!hsa_fns.hsa_amd_memory_lock_fn
> +      || !hsa_fns.hsa_amd_memory_unlock_fn
> +      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
> +    return -1;
> +
> +  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
> +     out quietly if we have anything oddly-aligned rather than letting the
> +     driver raise an error.  */
> +  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
> +    return -1;
> +
> +  if ((dst_dim1_size & 3) != 0 || (src_dim1_size & 3) != 0)
> +    return -1;
> +
> +  /* Only handle host to device or device to host transfers here.  */
> +  if ((dst_ord == -1 && src_ord == -1)
> +      || (dst_ord != -1 && src_ord != -1))
> +    return -1;
> +
> +  hsa_amd_copy_direction_t dir
> +    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
> +  hsa_agent_t copy_agent;
> +
> +  /* We need to pin (lock) host memory before we start the transfer.  Try to
> +     lock the minimum size necessary, i.e. using partial first/last rows of the
> +     whole array.  Something like this:
> +
> +      rows -->
> +      ..............
> +     c | ..#######+++++ <- first row apart from {src,dst}_offset1_size
> +     o | ++#######+++++ <- whole row
> +     l | ++#######+++++ <-     "
> +     s v ++#######..... <- last row apart from trailing remainder
> +      ..............
> +
> +     We could split very large transfers into several rectangular copies, but
> +     that is unimplemented for now.  */
> +
> +  size_t bounded_size_host, first_elem_offset_host;
> +  void *host_ptr;
> +  if (dir == hsaHostToDevice)
> +    {
> +      bounded_size_host = src_dim1_size * (dim0_len - 1) + dim1_size;
> +      first_elem_offset_host = src_offset0_len * src_dim1_size
> +                            + src_offset1_size;
> +      host_ptr = (void *) src;
> +      struct agent_info *agent = get_agent_info (dst_ord);
> +      copy_agent = agent->id;
> +    }
> +  else
> +    {
> +      bounded_size_host = dst_dim1_size * (dim0_len - 1) + dim1_size;
> +      first_elem_offset_host = dst_offset0_len * dst_dim1_size
> +                            + dst_offset1_size;
> +      host_ptr = dst;
> +      struct agent_info *agent = get_agent_info (src_ord);
> +      copy_agent = agent->id;
> +    }
> +
> +  void *agent_ptr;
> +
> +  hsa_status_t status
> +    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
> +                                   bounded_size_host, NULL, 0, &agent_ptr);
> +  /* We can't lock the host memory: don't give up though, we might still be
> +     able to use the slow path in our caller.  So, don't make this an
> +     error.  */
> +  if (status != HSA_STATUS_SUCCESS)
> +    return -1;
> +
> +  hsa_pitched_ptr_t dstpp, srcpp;
> +  hsa_dim3_t dst_offsets, src_offsets, ranges;
> +
> +  int retval = 1;
> +
> +  hsa_signal_t completion_signal;
> +  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
> +  if (status != HSA_STATUS_SUCCESS)
> +    {
> +      retval = -1;
> +      goto unlock;
> +    }
> +
> +  if (dir == hsaHostToDevice)
> +    {
> +      srcpp.base = agent_ptr - first_elem_offset_host;
> +      dstpp.base = dst;
> +    }
> +  else
> +    {
> +      srcpp.base = (void *) src;
> +      dstpp.base = agent_ptr - first_elem_offset_host;
> +    }
> +
> +  srcpp.pitch = src_dim1_size;
> +  srcpp.slice = 0;
> +
> +  src_offsets.x = src_offset1_size;
> +  src_offsets.y = src_offset0_len;
> +  src_offsets.z = 0;
> +
> +  dstpp.pitch = dst_dim1_size;
> +  dstpp.slice = 0;
> +
> +  dst_offsets.x = dst_offset1_size;
> +  dst_offsets.y = dst_offset0_len;
> +  dst_offsets.z = 0;
> +
> +  ranges.x = dim1_size;
> +  ranges.y = dim0_len;
> +  ranges.z = 1;
> +
> +  status
> +    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
> +                                              &src_offsets, &ranges,
> +                                              copy_agent, dir, 0, NULL,
> +                                              completion_signal);
> +  /* If the rectangular copy fails, we might still be able to use the slow
> +     path.  We need to unlock the host memory though, so don't return
> +     immediately.  */
> +  if (status != HSA_STATUS_SUCCESS)
> +    retval = -1;
> +  else
> +    hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
> +                                     HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
> +                                     HSA_WAIT_STATE_ACTIVE);
> +
> +  hsa_fns.hsa_signal_destroy_fn (completion_signal);
> +
> +unlock:
> +  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
> +  if (status != HSA_STATUS_SUCCESS)
> +    hsa_fatal ("Could not unlock host memory", status);
> +
> +  return retval;
> +}
> +
> +/* As above, <quantity>_size refers to <quantity> multiplied by size -- i.e.
> +   measured in bytes.  So we have:
> +
> +   dim2_size: number of bytes to copy on innermost dimension ("row")
> +   dim1_len: number of rows per slice to copy
> +   dim0_len: number of slices to copy
> +   dst: base pointer for destination of copy
> +   dst_offset2_size: innermost row offset (for dest), in bytes
> +   dst_offset1_len: offset, number of rows (for dest)
> +   dst_offset0_len: offset, number of slices (for dest)
> +   dst_dim2_size: whole-array dest row length, in bytes (pitch)
> +   dst_dim1_len: whole-array number of rows in slice (for dest)
> +   src: base pointer for source of copy
> +   src_offset2_size: innermost row offset (for source), in bytes
> +   src_offset1_len: offset, number of rows (for source)
> +   src_offset0_len: offset, number of slices (for source)
> +   src_dim2_size: whole-array source row length, in bytes (pitch)
> +   src_dim1_len: whole-array number of rows in slice (for source)
> +*/
> +
> +int
> +GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
> +                    size_t dim1_len, size_t dim0_len, void *dst,
> +                    size_t dst_offset2_size, size_t dst_offset1_len,
> +                    size_t dst_offset0_len, size_t dst_dim2_size,
> +                    size_t dst_dim1_len, const void *src,
> +                    size_t src_offset2_size, size_t src_offset1_len,
> +                    size_t src_offset0_len, size_t src_dim2_size,
> +                    size_t src_dim1_len)
> +{
> +  if (!hsa_fns.hsa_amd_memory_lock_fn
> +      || !hsa_fns.hsa_amd_memory_unlock_fn
> +      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
> +    return -1;
> +
> +  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
> +     out quietly if we have anything oddly-aligned rather than letting the
> +     driver raise an error.  */
> +  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
> +    return -1;
> +
> +  if ((dst_dim2_size & 3) != 0 || (src_dim2_size & 3) != 0)
> +    return -1;
> +
> +  /* Only handle host to device or device to host transfers here.  */
> +  if ((dst_ord == -1 && src_ord == -1)
> +      || (dst_ord != -1 && src_ord != -1))
> +    return -1;
> +
> +  hsa_amd_copy_direction_t dir
> +    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
> +  hsa_agent_t copy_agent;
> +
> +  /* We need to pin (lock) host memory before we start the transfer.  Try to
> +     lock the minimum size necessary, i.e. using partial first/last slices of
> +     the whole 3D array.  Something like this:
> +
> +      slice 0:          slice 1:          slice 2:
> +          __________        __________        __________
> +       ^    /+++++++++/    :  /+++++++++/    :  /         /
> +    column /+++##++++/|    | /+++##++++/|    | /+++##    /   # = subarray
> +     /    /   ##++++/ |    |/+++##++++/ |    |/+++##++++/    + = area to pin
> +      /_________/  :    /_________/  :    /_________/
> +     row --->
> +
> +     We could split very large transfers into several rectangular copies, but
> +     that is unimplemented for now.  */
> +
> +  size_t bounded_size_host, first_elem_offset_host;
> +  void *host_ptr;
> +  if (dir == hsaHostToDevice)
> +    {
> +      size_t slice_bytes = src_dim2_size * src_dim1_len;
> +      bounded_size_host = slice_bytes * (dim0_len - 1)
> +                       + src_dim2_size * (dim1_len - 1)
> +                       + dim2_size;
> +      first_elem_offset_host = src_offset0_len * slice_bytes
> +                            + src_offset1_len * src_dim2_size
> +                            + src_offset2_size;
> +      host_ptr = (void *) src;
> +      struct agent_info *agent = get_agent_info (dst_ord);
> +      copy_agent = agent->id;
> +    }
> +  else
> +    {
> +      size_t slice_bytes = dst_dim2_size * dst_dim1_len;
> +      bounded_size_host = slice_bytes * (dim0_len - 1)
> +                       + dst_dim2_size * (dim1_len - 1)
> +                       + dim2_size;
> +      first_elem_offset_host = dst_offset0_len * slice_bytes
> +                            + dst_offset1_len * dst_dim2_size
> +                            + dst_offset2_size;
> +      host_ptr = dst;
> +      struct agent_info *agent = get_agent_info (src_ord);
> +      copy_agent = agent->id;
> +    }
> +
> +  void *agent_ptr;
> +
> +  hsa_status_t status
> +    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
> +                                   bounded_size_host, NULL, 0, &agent_ptr);
> +  /* We can't lock the host memory: don't give up though, we might still be
> +     able to use the slow path in our caller (maybe even with iterated memcpy2d
> +     calls).  So, don't make this an error.  */
> +  if (status != HSA_STATUS_SUCCESS)
> +    return -1;
> +
> +  hsa_pitched_ptr_t dstpp, srcpp;
> +  hsa_dim3_t dst_offsets, src_offsets, ranges;
> +
> +  int retval = 1;
> +
> +  hsa_signal_t completion_signal;
> +  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
> +  if (status != HSA_STATUS_SUCCESS)
> +    {
> +      retval = -1;
> +      goto unlock;
> +    }
> +
> +  if (dir == hsaHostToDevice)
> +    {
> +      srcpp.base = agent_ptr - first_elem_offset_host;
> +      dstpp.base = dst;
> +    }
> +  else
> +    {
> +      srcpp.base = (void *) src;
> +      dstpp.base = agent_ptr - first_elem_offset_host;
> +    }
> +
> +  /* Pitch is measured in bytes.  */
> +  srcpp.pitch = src_dim2_size;
> +  /* Slice is also measured in bytes (i.e. total per-slice).  */
> +  srcpp.slice = src_dim2_size * src_dim1_len;
> +
> +  src_offsets.x = src_offset2_size;
> +  src_offsets.y = src_offset1_len;
> +  src_offsets.z = src_offset0_len;
> +
> +  /* As above.  */
> +  dstpp.pitch = dst_dim2_size;
> +  dstpp.slice = dst_dim2_size * dst_dim1_len;
> +
> +  dst_offsets.x = dst_offset2_size;
> +  dst_offsets.y = dst_offset1_len;
> +  dst_offsets.z = dst_offset0_len;
> +
> +  ranges.x = dim2_size;
> +  ranges.y = dim1_len;
> +  ranges.z = dim0_len;
> +
> +  status
> +    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
> +                                              &src_offsets, &ranges,
> +                                              copy_agent, dir, 0, NULL,
> +                                              completion_signal);
> +  /* If the rectangular copy fails, we might still be able to use the slow
> +     path.  We need to unlock the host memory though, so don't return
> +     immediately.  */
> +  if (status != HSA_STATUS_SUCCESS)
> +    retval = -1;
> +  else
> +    {
> +      hsa_signal_value_t sv
> +     = hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
> +                                           HSA_SIGNAL_CONDITION_LT, 1,
> +                                           UINT64_MAX,
> +                                           HSA_WAIT_STATE_ACTIVE);
> +      if (sv < 0)
> +     {
> +       GCN_WARNING ("async copy rect failure");
> +       retval = -1;
> +     }
> +    }
> +
> +  hsa_fns.hsa_signal_destroy_fn (completion_signal);
> +
> +unlock:
> +  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
> +  if (status != HSA_STATUS_SUCCESS)
> +    hsa_fatal ("Could not unlock host memory", status);
> +
> +  return retval;
> +}
> +
>   /* }}}  */
>   /* {{{ OpenMP Plugin API  */
>
> diff --git a/libgomp/target.c b/libgomp/target.c
> index 9b9b08db8be..8f0795837ae 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -4726,6 +4726,37 @@ omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
>         if (__builtin_mul_overflow (span, strides[0], &stride))
>       return EINVAL;
>
> +      if (((src_devicep && src_devicep->memcpy2d_func)
> +        || (dst_devicep && dst_devicep->memcpy2d_func))
> +       && (stride % element_size) == 0)
> +     {
> +       /* Try using memcpy2d for a 1-dimensional strided access.  Here we
> +          treat the transfer as a 2-dimensional array, where the inner
> +          dimension is calculated to be (stride in bytes) / element_size.
> +          Indices/offsets are adjusted so the source/destination pointers
> +          point to the first element to be transferred, to make the sums
> +          easier.  (There are some configurations of 2D strided accesses
> +          that memcpy3d could handle similarly, but those are probably rare
> +          and are unimplemented for now.)   */
> +
> +       /* If stride is element size, this is a contiguous transfer and
> +          should have been handled above.  */
> +       assert (stride > element_size);
> +
> +       int dst_id = dst_devicep ? dst_devicep->target_id : -1;
> +       int src_id = src_devicep ? src_devicep->target_id : -1;
> +       void *subarray_src = (char *) src + src_off;
> +       void *subarray_dst = (char *) dst + dst_off;
> +
> +       struct gomp_device_descr *devp = dst_devicep ? dst_devicep
> +                                                    : src_devicep;
> +       ret = devp->memcpy2d_func (dst_id, src_id, element_size, volume[0],
> +                                  subarray_dst, 0, 0, stride, subarray_src,
> +                                  0, 0, stride);
> +       if (ret != -1)
> +         return ret ? 0 : EINVAL;
> +     }
> +
>         for (i = 0, ret = 1; i < volume[0] && ret; i++)
>       {
>         if (src_devicep == NULL)
Julian Brown Jan. 8, 2024, 5:55 p.m. UTC | #2
On Thu, 21 Dec 2023 17:05:18 +0100
Tobias Burnus <tobias@codesourcery.com> wrote:

> I think it makes sense to split this patch into two parts:
> 
> * The libgomp/plugin/plugin-gcn.c change – which is independent and would
> already be used by omp_target_memcpy_rect.

I will commit this version in a moment. I needed to add the
DLSYM_OPT_FN bit from one of Andrew Stubbs's patches elsewhere.
Re-tested with offloading to AMD GCN (...with a couple of patches
applied locally to get working test results, as plain mainline as of a
few days ago wasn't working too well for GCN offloading).

Thanks for the review!

Julian

Patch

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index ef22d48da79..95c0a57e792 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -196,6 +196,16 @@  struct hsa_runtime_fn_info
   hsa_status_t (*hsa_code_object_deserialize_fn)
     (void *serialized_code_object, size_t serialized_code_object_size,
      const char *options, hsa_code_object_t *code_object);
+  hsa_status_t (*hsa_amd_memory_lock_fn)
+    (void *host_ptr, size_t size, hsa_agent_t *agents, int num_agent,
+     void **agent_ptr);
+  hsa_status_t (*hsa_amd_memory_unlock_fn) (void *host_ptr);
+  hsa_status_t (*hsa_amd_memory_async_copy_rect_fn)
+    (const hsa_pitched_ptr_t *dst, const hsa_dim3_t *dst_offset,
+     const hsa_pitched_ptr_t *src, const hsa_dim3_t *src_offset,
+     const hsa_dim3_t *range, hsa_agent_t copy_agent,
+     hsa_amd_copy_direction_t dir, uint32_t num_dep_signals,
+     const hsa_signal_t *dep_signals, hsa_signal_t completion_signal);
 };
 
 /* Structure describing the run-time and grid properties of an HSA kernel
@@ -1398,6 +1408,9 @@  init_hsa_runtime_functions (void)
   DLSYM_FN (hsa_signal_load_acquire)
   DLSYM_FN (hsa_queue_destroy)
   DLSYM_FN (hsa_code_object_deserialize)
+  DLSYM_OPT_FN (hsa_amd_memory_lock)
+  DLSYM_OPT_FN (hsa_amd_memory_unlock)
+  DLSYM_OPT_FN (hsa_amd_memory_async_copy_rect)
   return true;
 #undef DLSYM_FN
 }
@@ -3790,6 +3803,352 @@  GOMP_OFFLOAD_dev2dev (int device, void *dst, const void *src, size_t n)
   return true;
 }
 
+/* Here <quantity>_size refers to <quantity> multiplied by size -- i.e.
+   measured in bytes.  So we have:
+
+   dim1_size: number of bytes to copy on innermost dimension ("row")
+   dim0_len: number of rows to copy
+   dst: base pointer for destination of copy
+   dst_offset1_size: innermost row offset (for dest), in bytes
+   dst_offset0_len: offset, number of rows (for dest)
+   dst_dim1_size: whole-array dest row length, in bytes (pitch)
+   src: base pointer for source of copy
+   src_offset1_size: innermost row offset (for source), in bytes
+   src_offset0_len: offset, number of rows (for source)
+   src_dim1_size: whole-array source row length, in bytes (pitch)
+*/
+
+int
+GOMP_OFFLOAD_memcpy2d (int dst_ord, int src_ord, size_t dim1_size,
+		       size_t dim0_len, void *dst, size_t dst_offset1_size,
+		       size_t dst_offset0_len, size_t dst_dim1_size,
+		       const void *src, size_t src_offset1_size,
+		       size_t src_offset0_len, size_t src_dim1_size)
+{
+  if (!hsa_fns.hsa_amd_memory_lock_fn
+      || !hsa_fns.hsa_amd_memory_unlock_fn
+      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
+    return -1;
+
+  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
+     out quietly if we have anything oddly-aligned rather than letting the
+     driver raise an error.  */
+  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
+    return -1;
+
+  if ((dst_dim1_size & 3) != 0 || (src_dim1_size & 3) != 0)
+    return -1;
+
+  /* Only handle host to device or device to host transfers here.  */
+  if ((dst_ord == -1 && src_ord == -1)
+      || (dst_ord != -1 && src_ord != -1))
+    return -1;
+
+  hsa_amd_copy_direction_t dir
+    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
+  hsa_agent_t copy_agent;
+
+  /* We need to pin (lock) host memory before we start the transfer.  Try to
+     lock the minimum size necessary, i.e. using partial first/last rows of the
+     whole array.  Something like this:
+
+	 rows -->
+	 ..............
+     c | ..#######+++++ <- first row apart from {src,dst}_offset1_size
+     o | ++#######+++++ <- whole row
+     l | ++#######+++++ <-     "
+     s v ++#######..... <- last row apart from trailing remainder
+	 ..............
+
+     We could split very large transfers into several rectangular copies, but
+     that is unimplemented for now.  */
+
+  size_t bounded_size_host, first_elem_offset_host;
+  void *host_ptr;
+  if (dir == hsaHostToDevice)
+    {
+      bounded_size_host = src_dim1_size * (dim0_len - 1) + dim1_size;
+      first_elem_offset_host = src_offset0_len * src_dim1_size
+			       + src_offset1_size;
+      host_ptr = (void *) src;
+      struct agent_info *agent = get_agent_info (dst_ord);
+      copy_agent = agent->id;
+    }
+  else
+    {
+      bounded_size_host = dst_dim1_size * (dim0_len - 1) + dim1_size;
+      first_elem_offset_host = dst_offset0_len * dst_dim1_size
+			       + dst_offset1_size;
+      host_ptr = dst;
+      struct agent_info *agent = get_agent_info (src_ord);
+      copy_agent = agent->id;
+    }
+
+  void *agent_ptr;
+
+  hsa_status_t status
+    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
+				      bounded_size_host, NULL, 0, &agent_ptr);
+  /* We can't lock the host memory: don't give up though, we might still be
+     able to use the slow path in our caller.  So, don't make this an
+     error.  */
+  if (status != HSA_STATUS_SUCCESS)
+    return -1;
+
+  hsa_pitched_ptr_t dstpp, srcpp;
+  hsa_dim3_t dst_offsets, src_offsets, ranges;
+
+  int retval = 1;
+
+  hsa_signal_t completion_signal;
+  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      retval = -1;
+      goto unlock;
+    }
+
+  if (dir == hsaHostToDevice)
+    {
+      srcpp.base = agent_ptr - first_elem_offset_host;
+      dstpp.base = dst;
+    }
+  else
+    {
+      srcpp.base = (void *) src;
+      dstpp.base = agent_ptr - first_elem_offset_host;
+    }
+
+  srcpp.pitch = src_dim1_size;
+  srcpp.slice = 0;
+
+  src_offsets.x = src_offset1_size;
+  src_offsets.y = src_offset0_len;
+  src_offsets.z = 0;
+
+  dstpp.pitch = dst_dim1_size;
+  dstpp.slice = 0;
+
+  dst_offsets.x = dst_offset1_size;
+  dst_offsets.y = dst_offset0_len;
+  dst_offsets.z = 0;
+
+  ranges.x = dim1_size;
+  ranges.y = dim0_len;
+  ranges.z = 1;
+
+  status
+    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
+						 &src_offsets, &ranges,
+						 copy_agent, dir, 0, NULL,
+						 completion_signal);
+  /* If the rectangular copy fails, we might still be able to use the slow
+     path.  We need to unlock the host memory though, so don't return
+     immediately.  */
+  if (status != HSA_STATUS_SUCCESS)
+    retval = -1;
+  else
+    hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
+					HSA_SIGNAL_CONDITION_LT, 1, UINT64_MAX,
+					HSA_WAIT_STATE_ACTIVE);
+
+  hsa_fns.hsa_signal_destroy_fn (completion_signal);
+
+unlock:
+  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not unlock host memory", status);
+
+  return retval;
+}
+
+/* As above, <quantity>_size refers to <quantity> multiplied by size -- i.e.
+   measured in bytes.  So we have:
+
+   dim2_size: number of bytes to copy on innermost dimension ("row")
+   dim1_len: number of rows per slice to copy
+   dim0_len: number of slices to copy
+   dst: base pointer for destination of copy
+   dst_offset2_size: innermost row offset (for dest), in bytes
+   dst_offset1_len: offset, number of rows (for dest)
+   dst_offset0_len: offset, number of slices (for dest)
+   dst_dim2_size: whole-array dest row length, in bytes (pitch)
+   dst_dim1_len: whole-array number of rows in slice (for dest)
+   src: base pointer for source of copy
+   src_offset2_size: innermost row offset (for source), in bytes
+   src_offset1_len: offset, number of rows (for source)
+   src_offset0_len: offset, number of slices (for source)
+   src_dim2_size: whole-array source row length, in bytes (pitch)
+   src_dim1_len: whole-array number of rows in slice (for source)
+*/
+
+int
+GOMP_OFFLOAD_memcpy3d (int dst_ord, int src_ord, size_t dim2_size,
+		       size_t dim1_len, size_t dim0_len, void *dst,
+		       size_t dst_offset2_size, size_t dst_offset1_len,
+		       size_t dst_offset0_len, size_t dst_dim2_size,
+		       size_t dst_dim1_len, const void *src,
+		       size_t src_offset2_size, size_t src_offset1_len,
+		       size_t src_offset0_len, size_t src_dim2_size,
+		       size_t src_dim1_len)
+{
+  if (!hsa_fns.hsa_amd_memory_lock_fn
+      || !hsa_fns.hsa_amd_memory_unlock_fn
+      || !hsa_fns.hsa_amd_memory_async_copy_rect_fn)
+    return -1;
+
+  /* GCN hardware requires 4-byte alignment for base addresses & pitches.  Bail
+     out quietly if we have anything oddly-aligned rather than letting the
+     driver raise an error.  */
+  if ((((uintptr_t) dst) & 3) != 0 || (((uintptr_t) src) & 3) != 0)
+    return -1;
+
+  if ((dst_dim2_size & 3) != 0 || (src_dim2_size & 3) != 0)
+    return -1;
+
+  /* Only handle host to device or device to host transfers here.  */
+  if ((dst_ord == -1 && src_ord == -1)
+      || (dst_ord != -1 && src_ord != -1))
+    return -1;
+
+  hsa_amd_copy_direction_t dir
+    = (src_ord == -1) ? hsaHostToDevice : hsaDeviceToHost;
+  hsa_agent_t copy_agent;
+
+  /* We need to pin (lock) host memory before we start the transfer.  Try to
+     lock the minimum size necessary, i.e. using partial first/last slices of
+     the whole 3D array.  Something like this:
+
+	 slice 0:          slice 1:	     slice 2:
+	     __________        __________        __________
+       ^    /+++++++++/    :  /+++++++++/    :	/         /
+    column /+++##++++/|    | /+++##++++/|    | /+++##    /  # = subarray
+     /	  /   ##++++/ |    |/+++##++++/ |    |/+++##++++/   + = area to pin
+	 /_________/  :    /_________/  :    /_________/
+	row --->
+
+     We could split very large transfers into several rectangular copies, but
+     that is unimplemented for now.  */
+
+  size_t bounded_size_host, first_elem_offset_host;
+  void *host_ptr;
+  if (dir == hsaHostToDevice)
+    {
+      size_t slice_bytes = src_dim2_size * src_dim1_len;
+      bounded_size_host = slice_bytes * (dim0_len - 1)
+			  + src_dim2_size * (dim1_len - 1)
+			  + dim2_size;
+      first_elem_offset_host = src_offset0_len * slice_bytes
+			       + src_offset1_len * src_dim2_size
+			       + src_offset2_size;
+      host_ptr = (void *) src;
+      struct agent_info *agent = get_agent_info (dst_ord);
+      copy_agent = agent->id;
+    }
+  else
+    {
+      size_t slice_bytes = dst_dim2_size * dst_dim1_len;
+      bounded_size_host = slice_bytes * (dim0_len - 1)
+			  + dst_dim2_size * (dim1_len - 1)
+			  + dim2_size;
+      first_elem_offset_host = dst_offset0_len * slice_bytes
+			       + dst_offset1_len * dst_dim2_size
+			       + dst_offset2_size;
+      host_ptr = dst;
+      struct agent_info *agent = get_agent_info (src_ord);
+      copy_agent = agent->id;
+    }
+
+  void *agent_ptr;
+
+  hsa_status_t status
+    = hsa_fns.hsa_amd_memory_lock_fn (host_ptr + first_elem_offset_host,
+				      bounded_size_host, NULL, 0, &agent_ptr);
+  /* We can't lock the host memory: don't give up though, we might still be
+     able to use the slow path in our caller (maybe even with iterated memcpy2d
+     calls).  So, don't make this an error.  */
+  if (status != HSA_STATUS_SUCCESS)
+    return -1;
+
+  hsa_pitched_ptr_t dstpp, srcpp;
+  hsa_dim3_t dst_offsets, src_offsets, ranges;
+
+  int retval = 1;
+
+  hsa_signal_t completion_signal;
+  status = hsa_fns.hsa_signal_create_fn (1, 0, NULL, &completion_signal);
+  if (status != HSA_STATUS_SUCCESS)
+    {
+      retval = -1;
+      goto unlock;
+    }
+
+  if (dir == hsaHostToDevice)
+    {
+      srcpp.base = agent_ptr - first_elem_offset_host;
+      dstpp.base = dst;
+    }
+  else
+    {
+      srcpp.base = (void *) src;
+      dstpp.base = agent_ptr - first_elem_offset_host;
+    }
+
+  /* Pitch is measured in bytes.  */
+  srcpp.pitch = src_dim2_size;
+  /* Slice is also measured in bytes (i.e. total per-slice).  */
+  srcpp.slice = src_dim2_size * src_dim1_len;
+
+  src_offsets.x = src_offset2_size;
+  src_offsets.y = src_offset1_len;
+  src_offsets.z = src_offset0_len;
+
+  /* As above.  */
+  dstpp.pitch = dst_dim2_size;
+  dstpp.slice = dst_dim2_size * dst_dim1_len;
+
+  dst_offsets.x = dst_offset2_size;
+  dst_offsets.y = dst_offset1_len;
+  dst_offsets.z = dst_offset0_len;
+
+  ranges.x = dim2_size;
+  ranges.y = dim1_len;
+  ranges.z = dim0_len;
+
+  status
+    = hsa_fns.hsa_amd_memory_async_copy_rect_fn (&dstpp, &dst_offsets, &srcpp,
+						 &src_offsets, &ranges,
+						 copy_agent, dir, 0, NULL,
+						 completion_signal);
+  /* If the rectangular copy fails, we might still be able to use the slow
+     path.  We need to unlock the host memory though, so don't return
+     immediately.  */
+  if (status != HSA_STATUS_SUCCESS)
+    retval = -1;
+  else
+    {
+      hsa_signal_value_t sv
+	= hsa_fns.hsa_signal_wait_acquire_fn (completion_signal,
+					      HSA_SIGNAL_CONDITION_LT, 1,
+					      UINT64_MAX,
+					      HSA_WAIT_STATE_ACTIVE);
+      if (sv < 0)
+	{
+	  GCN_WARNING ("async copy rect failure");
+	  retval = -1;
+	}
+    }
+
+  hsa_fns.hsa_signal_destroy_fn (completion_signal);
+
+unlock:
+  status = hsa_fns.hsa_amd_memory_unlock_fn (host_ptr + first_elem_offset_host);
+  if (status != HSA_STATUS_SUCCESS)
+    hsa_fatal ("Could not unlock host memory", status);
+
+  return retval;
+}
+
 /* }}}  */
 /* {{{ OpenMP Plugin API  */
 
diff --git a/libgomp/target.c b/libgomp/target.c
index 9b9b08db8be..8f0795837ae 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -4726,6 +4726,37 @@  omp_target_memcpy_rect_worker (void *dst, const void *src, size_t element_size,
       if (__builtin_mul_overflow (span, strides[0], &stride))
 	return EINVAL;
 
+      if (((src_devicep && src_devicep->memcpy2d_func)
+	   || (dst_devicep && dst_devicep->memcpy2d_func))
+	  && (stride % element_size) == 0)
+	{
+	  /* Try using memcpy2d for a 1-dimensional strided access.  Here we
+	     treat the transfer as a 2-dimensional array, where the inner
+	     dimension is calculated to be (stride in bytes) / element_size.
+	     Indices/offsets are adjusted so the source/destination pointers
+	     point to the first element to be transferred, to make the sums
+	     easier.  (There are some configurations of 2D strided accesses
+	     that memcpy3d could handle similarly, but those are probably rare
+	     and are unimplemented for now.)   */
+
+	  /* If stride is element size, this is a contiguous transfer and
+	     should have been handled above.  */
+	  assert (stride > element_size);
+
+	  int dst_id = dst_devicep ? dst_devicep->target_id : -1;
+	  int src_id = src_devicep ? src_devicep->target_id : -1;
+	  void *subarray_src = (char *) src + src_off;
+	  void *subarray_dst = (char *) dst + dst_off;
+
+	  struct gomp_device_descr *devp = dst_devicep ? dst_devicep
+						       : src_devicep;
+	  ret = devp->memcpy2d_func (dst_id, src_id, element_size, volume[0],
+				     subarray_dst, 0, 0, stride, subarray_src,
+				     0, 0, stride);
+	  if (ret != -1)
+	    return ret ? 0 : EINVAL;
+	}
+
       for (i = 0, ret = 1; i < volume[0] && ret; i++)
 	{
 	  if (src_devicep == NULL)