diff mbox

GOMP_target: alignment (was: [gomp4] #pragma omp target* fixes)

Message ID 87zjo6s8e9.fsf@kepler.schwinge.homeip.net
State New
Headers show

Commit Message

Thomas Schwinge Dec. 12, 2013, 9:53 a.m. UTC
Hi!

On Thu, 5 Sep 2013 18:11:05 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> 3) I figured out we need to tell the runtime library not just
> address, size and kind, but also alignment (we won't need that for
> the #pragma omp declare target global vars though), so that the
> runtime library can properly align it.  As TYPE_ALIGN/DECL_ALIGN
> is in bits and is 32 bit wide, when that is in bytes and we only care
> about power of twos, I've decided to encode it in the upper 5 bits
> of the kind (lower 3 bits are used for OMP_CLAUSE_MAP_* kind).

Unfortunately, this scheme breaks down with OpenACC: we need an
additional bit to codify a flag for present_or_* map clauses (meaning:
only map the data (allocate/to/from/tofrom, as for OpenMP) if not already
present on the device).

With five bits available for the OpenMP case, we can describe alignments
up to 2 GiB, and I've empirically found on my development system that the
largest possible alignment is MAX_OFILE_ALIGNMENT, 256 MiB for ELF
systems, so that's fine.  But with only four bits available, we get to
describe alignments up to 1 << ((1 << 4) - 1) = 32 KiB, which is too
small -- even though it'd be fine for "normal" usage of __attribute__
((aligned (x))).

So it seems our options are to use a bigger datatype for the kinds array,
to split off from the kinds array a new alignments array, or to generally
switch to using an array of a struct containing hostaddr, size,
alignment, kind.  The latter would require additional changes in the
child_fn.

As it's an ABI change no matter what, would you like to see this limited
to OpenACC?  Changing it also for OpenMP's GOMP_target would have the
advantage to have them not diverge (especially at the generating side in
omp-low.c's lowering functions), but I'm not sure whether such an ABI
change would easily be possible now, with the OpenMP 4 support merged
into trunk -- though, it is not yet part of a regular GCC release?


> --- gcc/omp-low.c.jj	2013-09-05 09:19:03.000000000 +0200
> +++ gcc/omp-low.c	2013-09-05 17:11:14.693638660 +0200
> @@ -9342,6 +9349,11 @@ lower_omp_target (gimple_stmt_iterator *
|  	    unsigned char tkind = 0;
|  	    switch (OMP_CLAUSE_CODE (c))
|  	      {
|  	      case OMP_CLAUSE_MAP:
|  		tkind = OMP_CLAUSE_MAP_KIND (c);
|  		break;
|  	      case OMP_CLAUSE_TO:
|  		tkind = OMP_CLAUSE_MAP_TO;
|  		break;
|  	      case OMP_CLAUSE_FROM:
|  		tkind = OMP_CLAUSE_MAP_FROM;
|  		break;
>  	      default:
>  		gcc_unreachable ();
>  	      }
> +	    unsigned int talign = TYPE_ALIGN_UNIT (TREE_TYPE (ovar));
> +	    if (DECL_P (ovar) && DECL_ALIGN_UNIT (ovar) > talign)
> +	      talign = DECL_ALIGN_UNIT (ovar);
> +	    talign = ceil_log2 (talign);
> +	    tkind |= talign << 3;
>  	    CONSTRUCTOR_APPEND_ELT (vkind, purpose,
>  				    build_int_cst (unsigned_char_type_node,
>  						   tkind));

The use of OMP_CLAUSE_MAP_* on the generating and integer numerals on the
receiving (libgomp) side is a bit unesthetic, likewise for the hard-coded
3 in the bit shift.  What would be the standard GCC way of sharing a
description of the tkind layout between gcc/omp-low.c and
libgomp/target.c?  Are we allowed to #include (a new header file)
libgomp/target.h from gcc/omp-low.c?


To avoid silent breakage should alignments bigger than 2 GiB be allowed
in a distant future, would a check like the following be appropriate?



Grüße,
 Thomas

Comments

Jakub Jelinek Dec. 12, 2013, 10:02 a.m. UTC | #1
On Thu, Dec 12, 2013 at 10:53:02AM +0100, Thomas Schwinge wrote:
> On Thu, 5 Sep 2013 18:11:05 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> > 3) I figured out we need to tell the runtime library not just
> > address, size and kind, but also alignment (we won't need that for
> > the #pragma omp declare target global vars though), so that the
> > runtime library can properly align it.  As TYPE_ALIGN/DECL_ALIGN
> > is in bits and is 32 bit wide, when that is in bytes and we only care
> > about power of twos, I've decided to encode it in the upper 5 bits
> > of the kind (lower 3 bits are used for OMP_CLAUSE_MAP_* kind).
> 
> Unfortunately, this scheme breaks down with OpenACC: we need an
> additional bit to codify a flag for present_or_* map clauses (meaning:
> only map the data (allocate/to/from/tofrom, as for OpenMP) if not already
> present on the device).

The OpenMP behavior is always only map the data (allocate/to/from/tofrom)
if not already mapped on the device.  So what behavior does OpenACC have
if present_or_* isn't present?

	Jakub
Thomas Schwinge Dec. 12, 2013, 11:06 a.m. UTC | #2
Hi!

On Thu, 12 Dec 2013 11:02:30 +0100, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Dec 12, 2013 at 10:53:02AM +0100, Thomas Schwinge wrote:
> > On Thu, 5 Sep 2013 18:11:05 +0200, Jakub Jelinek <jakub@redhat.com> wrote:
> > > 3) I figured out we need to tell the runtime library not just
> > > address, size and kind, but also alignment (we won't need that for
> > > the #pragma omp declare target global vars though), so that the
> > > runtime library can properly align it.  As TYPE_ALIGN/DECL_ALIGN
> > > is in bits and is 32 bit wide, when that is in bytes and we only care
> > > about power of twos, I've decided to encode it in the upper 5 bits
> > > of the kind (lower 3 bits are used for OMP_CLAUSE_MAP_* kind).
> > 
> > Unfortunately, this scheme breaks down with OpenACC: we need an
> > additional bit to codify a flag for present_or_* map clauses (meaning:
> > only map the data (allocate/to/from/tofrom, as for OpenMP) if not already
> > present on the device).
> 
> The OpenMP behavior is always only map the data (allocate/to/from/tofrom)
> if not already mapped on the device.  So what behavior does OpenACC have
> if present_or_* isn't present?

OpenACC has a concept of (possibly nested) data regions (for reference,
OpenACC 2.0, 2.6.2 Data Regions and Data Lifetimes), and the semantics
are as follows:

    #pragma acc parallel copy(x[0:n])
    for (int i = 0; i < n; ++i)
      x[i] += 1;

This will first allocate the x array on the device, copy the host's x
array to the device's, then execute the structured block, then copy back
the data from the device to the host, then deallocate the copy on the
device.

    #pragma acc parallel present_or_copy(x[0:n])
    for (int i = 0; i < n; ++i)
      x[i] += 1;

If the x array is not present on the device, this will proceed as for the
copy clause just described.  If the data already is present, this will
directly proceed to executing the structured block, then *not* copy back
the data from the device to the host, and *not* deallocate the copy on
the device.

The reason is that often you'd first set up explicit data regions around
several OpenACC pragmas, as data movement is expensive, and the compiler
has a hard time figuring out when it might be avoided.  For example:

    void foo(int n, float *x)
    {
      #pragma acc parallel present_or_copy(x[0:n])
      for (int i = 0; i < n; ++i)
        x[i] += 1;
    }

    void bar(int n, float *x1, float *x2)
    {
      foo(n, x1);

      #pragma acc enter data copyin(x2[0:n])
      foo(n, x2);
      [...]
      foo(n, x2);
      [...]
      foo(n, x2);
      #pragma acc exit data copyout(x2[0:n])
      // Now use x2 on the host.
    }

For x1, when executing foo, the runtime will do: allocate on device,
copyin, execute, copyout, deallocate on device -- that is, the
present_or_copy clause handled as a copy clause.

For x2, the data will first manually be allocated on and copied to the
device, entering a dynamic data region, and when executing foo is already
present (so, the present_or_copy clause basically becomes a no-op), and
then manually be copied out and deallocated, terminating the data region.


Apart from the different semantics of deallocation, while I couldn't
quickly find it in the pragmas' descriptions, the description for the
acc_copyin runtime library function explicitly states that »it is a
runtime error to call this routine if the data is already present on the
device«.


Grüße,
 Thomas
diff mbox

Patch

--- gcc/omp-low.c
+++ gcc/omp-low.c
@@ -10378,6 +10383,11 @@  lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx)
 	    unsigned int talign = TYPE_ALIGN_UNIT (TREE_TYPE (ovar));
 	    if (DECL_P (ovar) && DECL_ALIGN_UNIT (ovar) > talign)
 	      talign = DECL_ALIGN_UNIT (ovar);
+	    const unsigned int talign_max
+	      = 1 << ((1 << (BITS_PER_UNIT - 3)) - 1);
+	    if (talign > talign_max)
+	      sorry ("can't encode alignment of %u bytes, which is bigger than "
+		     "%u bytes", talign, talign_max);
 	    talign = ceil_log2 (talign);
 	    tkind |= talign << 3;
 	    CONSTRUCTOR_APPEND_ELT (vkind, purpose,