diff mbox series

[gomp5] Allow OpenMP atomics inside simd regions

Message ID 20180614100613.GQ7166@tucnak
State New
Headers show
Series [gomp5] Allow OpenMP atomics inside simd regions | expand

Commit Message

Jakub Jelinek June 14, 2018, 10:06 a.m. UTC
Hi!

#pragma omp atomic is now allowed inside of simd regions.
Tested on x86_64-linux, committed to gomp-5_0-branch.

We will actually not vectorize it then though, so some further work will be
needed in the vectorizer to handle it.  Either, if we have hw atomics for both
the size of the scalar accesses and size of the whole vector type, the
accesses are adjacent and known to be aligned, we could replace it with
atomic on the whole vector, or emit as a small loop or unrolled loop doing
the extraction, scalar atomics and if needed insert result back into
vectors.  Richard, thoughts on that?

2018-06-14  Jakub Jelinek  <jakub@redhat.com>

	* omp-low.c (check_omp_nesting_restrictions): Allow OpenMP atomics
	inside of simd regions.

	* c-c++-common/gomp/simd7.c: New test.


	Jakub

Comments

Richard Biener June 14, 2018, 10:34 a.m. UTC | #1
On Thu, 14 Jun 2018, Jakub Jelinek wrote:

> Hi!
> 
> #pragma omp atomic is now allowed inside of simd regions.
> Tested on x86_64-linux, committed to gomp-5_0-branch.
> 
> We will actually not vectorize it then though, so some further work will be
> needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> the size of the scalar accesses and size of the whole vector type, the
> accesses are adjacent and known to be aligned, we could replace it with
> atomic on the whole vector, or emit as a small loop or unrolled loop doing
> the extraction, scalar atomics and if needed insert result back into
> vectors.  Richard, thoughts on that?

What's the semantic of this?  Generally for non-vectorizable stmts
we could implement partial vectorization by doing VF number of copies
of the scalar stmt:

 tem10 = vinput1[0];
 tem20 = vinput2[0];
 ...
 soutput0 = OP (tem10, tem20,...);
 ...
 voutput = { soutput0, ... };

or alternatively specific for the OMP atomic handling, outline the
atomic stmts to a OMP simd function?

Anyway, still need to have an idea of the semantics of an atomic
stmt inside a SIMD loop.

Richard.

> 2018-06-14  Jakub Jelinek  <jakub@redhat.com>
> 
> 	* omp-low.c (check_omp_nesting_restrictions): Allow OpenMP atomics
> 	inside of simd regions.
> 
> 	* c-c++-common/gomp/simd7.c: New test.
> 
> --- gcc/omp-low.c.jj	2018-06-08 12:12:34.377111135 +0200
> +++ gcc/omp-low.c	2018-06-14 11:09:55.439824132 +0200
> @@ -2463,9 +2463,13 @@ check_omp_nesting_restrictions (gimple *
>  		  return true;
>  		}
>  	    }
> +	  else if (gimple_code (stmt) == GIMPLE_OMP_ATOMIC_LOAD
> +		   || gimple_code (stmt) == GIMPLE_OMP_ATOMIC_STORE)
> +	    return true;
>  	  error_at (gimple_location (stmt),
>  		    "OpenMP constructs other than %<#pragma omp ordered simd%>"
> -		    " may not be nested inside %<simd%> region");
> +		    " or %<#pragma omp atomic%> may not be nested inside"
> +		    " %<simd%> region");
>  	  return false;
>  	}
>        else if (gimple_code (ctx->stmt) == GIMPLE_OMP_TEAMS)
> --- gcc/testsuite/c-c++-common/gomp/simd7.c.jj	2018-06-14 11:19:30.466274188 +0200
> +++ gcc/testsuite/c-c++-common/gomp/simd7.c	2018-06-14 10:55:52.520186067 +0200
> @@ -0,0 +1,21 @@
> +int a[64];
> +
> +#pragma omp declare simd linear(x)
> +int
> +bar (int x, int y)
> +{
> +  int v;
> +  #pragma omp atomic capture
> +  v = a[x] += y;
> +  return v;
> +}
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  #pragma omp simd
> +  for (i = 0; i < 64; i++)
> +    #pragma omp atomic
> +    a[i] += 1;
> +}
> 
> 	Jakub
> 
>
Jakub Jelinek June 14, 2018, 10:55 a.m. UTC | #2
On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > #pragma omp atomic is now allowed inside of simd regions.
> > Tested on x86_64-linux, committed to gomp-5_0-branch.
> > 
> > We will actually not vectorize it then though, so some further work will be
> > needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> > the size of the scalar accesses and size of the whole vector type, the
> > accesses are adjacent and known to be aligned, we could replace it with
> > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > the extraction, scalar atomics and if needed insert result back into
> > vectors.  Richard, thoughts on that?
> 
> What's the semantic of this?  Generally for non-vectorizable stmts

OpenMP already has #pragma omp ordered simd which specifies part of the loop
body that should not be vectorized (which we right now just implement as
forcing no vectorization) and I guess the atomics could be handled
similarly.  I.e. say for
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
#pragma omp simd
for (int i = 0; i < 64; ++i)
  {
    int v;
    a[i] = sqrt (b[i]);
    c[i] = a[i];
    #pragma omp atomic capture
    v = d[i] += c[i];
    e[i] = v;
  }
}
vectorize it say with vf of 4 as:
for (i = 0; i < 64; i += 4)
  {
    v4si v;
    *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
    *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
    v4si c_ = *((v4si *)&c[i]);
    for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
      v[i_] = __atomic_add_fetch_4(&d[i + i_], c_[i_], 0);
    // or, if we have hw supported __atomic_compare_exchange_16 and d is known
    // to be aligned to 128-bits, we could do a 128-bit load + vector add +
    // cmpxchg.
    e[i] = v;
  }

The semantics of atomics inside of simd should be the same as of:
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
#pragma omp simd
for (int i = 0; i < 64; ++i)
  {
    int v;
    a[i] = sqrt (b[i]);
    c[i] = a[i];
    #pragma omp ordered simd
    {
      #pragma omp atomic capture
      v = d[i] += c[i];
    }
    e[i] = v;
  }
}

in that it vectorizes (if possible) the loop, except for not vectorizing
the ordered simd part of the loop, but instead iterating from 0 to vf-1
sequentially.

	Jakub
Richard Biener June 14, 2018, 12:37 p.m. UTC | #3
On Thu, 14 Jun 2018, Jakub Jelinek wrote:

> On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > > #pragma omp atomic is now allowed inside of simd regions.
> > > Tested on x86_64-linux, committed to gomp-5_0-branch.
> > > 
> > > We will actually not vectorize it then though, so some further work will be
> > > needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> > > the size of the scalar accesses and size of the whole vector type, the
> > > accesses are adjacent and known to be aligned, we could replace it with
> > > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > > the extraction, scalar atomics and if needed insert result back into
> > > vectors.  Richard, thoughts on that?
> > 
> > What's the semantic of this?  Generally for non-vectorizable stmts
> 
> OpenMP already has #pragma omp ordered simd which specifies part of the loop
> body that should not be vectorized (which we right now just implement as
> forcing no vectorization) and I guess the atomics could be handled
> similarly.  I.e. say for
> float a[64], b[64];
> int c[64], d[64], e[64];
> void foo (void) {
> #pragma omp simd
> for (int i = 0; i < 64; ++i)
>   {
>     int v;
>     a[i] = sqrt (b[i]);
>     c[i] = a[i];
>     #pragma omp atomic capture
>     v = d[i] += c[i];
>     e[i] = v;
>   }
> }
> vectorize it say with vf of 4 as:
> for (i = 0; i < 64; i += 4)
>   {
>     v4si v;
>     *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
>     *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
>     v4si c_ = *((v4si *)&c[i]);
>     for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
>       v[i_] = __atomic_add_fetch_4(&d[i + i_], c_[i_], 0);
>     // or, if we have hw supported __atomic_compare_exchange_16 and d is known
>     // to be aligned to 128-bits, we could do a 128-bit load + vector add +
>     // cmpxchg.
>     e[i] = v;
>   }
> 
> The semantics of atomics inside of simd should be the same as of:
> float a[64], b[64];
> int c[64], d[64], e[64];
> void foo (void) {
> #pragma omp simd
> for (int i = 0; i < 64; ++i)
>   {
>     int v;
>     a[i] = sqrt (b[i]);
>     c[i] = a[i];
>     #pragma omp ordered simd
>     {
>       #pragma omp atomic capture
>       v = d[i] += c[i];
>     }
>     e[i] = v;
>   }
> }
> 
> in that it vectorizes (if possible) the loop, except for not vectorizing
> the ordered simd part of the loop, but instead iterating from 0 to vf-1
> sequentially.

So re-ordering iterations for the non-ordered/atomic part of the
loop is OK, even crossing the ordered/atomic parts - like
above the store to a[1] may happen before the d[0] += c[0] atomic
operation, but the atomic/ordered stmts have to happen in-order
with respect to only themselves?

Then we can indeed vectorize this by copying the scalar stmts N times
with decomposing the input vectors beforehand and building a vector
result afterwards.

I'd like to see us trying this for otherwise not vectorizable
code as well (with appropriate costing of course).  Then the atomics
vectorization would work transparently and we only need to think
about how to mark stmts in ordered regions?

Richard.
diff mbox series

Patch

--- gcc/omp-low.c.jj	2018-06-08 12:12:34.377111135 +0200
+++ gcc/omp-low.c	2018-06-14 11:09:55.439824132 +0200
@@ -2463,9 +2463,13 @@  check_omp_nesting_restrictions (gimple *
 		  return true;
 		}
 	    }
+	  else if (gimple_code (stmt) == GIMPLE_OMP_ATOMIC_LOAD
+		   || gimple_code (stmt) == GIMPLE_OMP_ATOMIC_STORE)
+	    return true;
 	  error_at (gimple_location (stmt),
 		    "OpenMP constructs other than %<#pragma omp ordered simd%>"
-		    " may not be nested inside %<simd%> region");
+		    " or %<#pragma omp atomic%> may not be nested inside"
+		    " %<simd%> region");
 	  return false;
 	}
       else if (gimple_code (ctx->stmt) == GIMPLE_OMP_TEAMS)
--- gcc/testsuite/c-c++-common/gomp/simd7.c.jj	2018-06-14 11:19:30.466274188 +0200
+++ gcc/testsuite/c-c++-common/gomp/simd7.c	2018-06-14 10:55:52.520186067 +0200
@@ -0,0 +1,21 @@ 
+int a[64];
+
+#pragma omp declare simd linear(x)
+int
+bar (int x, int y)
+{
+  int v;
+  #pragma omp atomic capture
+  v = a[x] += y;
+  return v;
+}
+
+void
+foo (void)
+{
+  int i;
+  #pragma omp simd
+  for (i = 0; i < 64; i++)
+    #pragma omp atomic
+    a[i] += 1;
+}