Message ID: 20180614100613.GQ7166@tucnak
State:      New
Series:     [gomp5] Allow OpenMP atomics inside simd regions
On Thu, 14 Jun 2018, Jakub Jelinek wrote:

> Hi!
>
> #pragma omp atomic is now allowed inside of simd regions.
> Tested on x86_64-linux, committed to gomp-5_0-branch.
>
> We will actually not vectorize it then though, so some further work will
> be needed in the vectorizer to handle it.  Either, if we have hw atomics
> for both the size of the scalar accesses and the size of the whole vector
> type, and the accesses are adjacent and known to be aligned, we could
> replace it with an atomic on the whole vector, or emit a small loop or
> unrolled loop doing the extraction, the scalar atomics, and, if needed,
> inserting the results back into vectors.  Richard, thoughts on that?

What's the semantics of this?  Generally, for non-vectorizable stmts we
could implement partial vectorization by doing VF copies of the scalar
stmt:

  tem10 = vinput1[0];
  tem20 = vinput2[0];
  ...
  soutput0 = OP (tem10, tem20, ...);
  ...
  voutput = { soutput0, ... };

or alternatively, specific to the OMP atomic handling, outline the atomic
stmts into an OMP simd function?

Anyway, we still need an idea of the semantics of an atomic stmt inside a
SIMD loop.

Richard.

> 2018-06-14  Jakub Jelinek  <jakub@redhat.com>
>
> 	* omp-low.c (check_omp_nesting_restrictions): Allow OpenMP atomics
> 	inside of simd regions.
>
> 	* c-c++-common/gomp/simd7.c: New test.
>
> --- gcc/omp-low.c.jj	2018-06-08 12:12:34.377111135 +0200
> +++ gcc/omp-low.c	2018-06-14 11:09:55.439824132 +0200
> @@ -2463,9 +2463,13 @@ check_omp_nesting_restrictions (gimple *
> 	      return true;
> 	    }
> 	}
> +      else if (gimple_code (stmt) == GIMPLE_OMP_ATOMIC_LOAD
> +	       || gimple_code (stmt) == GIMPLE_OMP_ATOMIC_STORE)
> +	return true;
>       error_at (gimple_location (stmt),
> 		"OpenMP constructs other than %<#pragma omp ordered simd%>"
> -		" may not be nested inside %<simd%> region");
> +		" or %<#pragma omp atomic%> may not be nested inside"
> +		" %<simd%> region");
>       return false;
>     }
>   else if (gimple_code (ctx->stmt) == GIMPLE_OMP_TEAMS)
> --- gcc/testsuite/c-c++-common/gomp/simd7.c.jj	2018-06-14 11:19:30.466274188 +0200
> +++ gcc/testsuite/c-c++-common/gomp/simd7.c	2018-06-14 10:55:52.520186067 +0200
> @@ -0,0 +1,21 @@
> +int a[64];
> +
> +#pragma omp declare simd linear(x)
> +int
> +bar (int x, int y)
> +{
> +  int v;
> +  #pragma omp atomic capture
> +  v = a[x] += y;
> +  return v;
> +}
> +
> +void
> +foo (void)
> +{
> +  int i;
> +  #pragma omp simd
> +  for (i = 0; i < 64; i++)
> +    #pragma omp atomic
> +    a[i] += 1;
> +}
>
> 	Jakub
On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > #pragma omp atomic is now allowed inside of simd regions.
> > Tested on x86_64-linux, committed to gomp-5_0-branch.
> >
> > We will actually not vectorize it then though, so some further work
> > will be needed in the vectorizer to handle it.  Either, if we have hw
> > atomics for both the size of the scalar accesses and the size of the
> > whole vector type, and the accesses are adjacent and known to be
> > aligned, we could replace it with an atomic on the whole vector, or
> > emit a small loop or unrolled loop doing the extraction, scalar
> > atomics and, if needed, inserting the results back into vectors.
> > Richard, thoughts on that?
>
> What's the semantics of this?  Generally for non-vectorizable stmts

OpenMP already has #pragma omp ordered simd, which specifies a part of the
loop body that should not be vectorized (and which we right now implement
simply by forcing no vectorization), and I guess the atomics could be
handled similarly.  I.e. say for

float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp atomic capture
      v = d[i] += c[i];
      e[i] = v;
    }
}

vectorize it, say with vf of 4, as:

for (i = 0; i < 64; i += 4)
  {
    v4si v;
    *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
    *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
    v4si c_ = *((v4si *)&c[i]);
    for (i_ = 0; i_ < 4; i_++)	// possibly unrolled, in any case scalar
      v[i_] = __atomic_add_fetch_4 (&d[i + i_], c_[i_], 0);
    // or, if we have hw supported __atomic_compare_exchange_16 and d is
    // known to be aligned to 128 bits, we could do a 128-bit load +
    // vector add + cmpxchg.
    *((v4si *)&e[i]) = v;
  }

The semantics of atomics inside of simd should be the same as of:

float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp ordered simd
      {
        #pragma omp atomic capture
        v = d[i] += c[i];
      }
      e[i] = v;
    }
}

in that it vectorizes the loop (if possible), except for not vectorizing
the ordered simd part of the loop, instead iterating over it from 0 to
vf-1 sequentially.

	Jakub
On Thu, 14 Jun 2018, Jakub Jelinek wrote:
> OpenMP already has #pragma omp ordered simd, which specifies a part of
> the loop body that should not be vectorized (and which we right now
> implement simply by forcing no vectorization), and I guess the atomics
> could be handled similarly.
> [examples trimmed]
> in that it vectorizes the loop (if possible), except for not vectorizing
> the ordered simd part of the loop, instead iterating over it from 0 to
> vf-1 sequentially.

So reordering iterations for the non-ordered/atomic part of the loop is
OK, even across the ordered/atomic parts - e.g. in the example above the
store to a[1] may happen before the d[0] += c[0] atomic operation - but
the atomic/ordered stmts have to happen in order with respect only to
themselves?

Then we can indeed vectorize this by copying the scalar stmts N times,
decomposing the input vectors beforehand and building a vector result
afterwards.  I'd like to see us try this for otherwise non-vectorizable
code as well (with appropriate costing, of course).  Then the atomics
vectorization would work transparently, and we would only need to think
about how to mark stmts in ordered regions?

Richard.
--- gcc/omp-low.c.jj	2018-06-08 12:12:34.377111135 +0200
+++ gcc/omp-low.c	2018-06-14 11:09:55.439824132 +0200
@@ -2463,9 +2463,13 @@ check_omp_nesting_restrictions (gimple *
	      return true;
	    }
	}
+      else if (gimple_code (stmt) == GIMPLE_OMP_ATOMIC_LOAD
+	       || gimple_code (stmt) == GIMPLE_OMP_ATOMIC_STORE)
+	return true;
      error_at (gimple_location (stmt),
		"OpenMP constructs other than %<#pragma omp ordered simd%>"
-		" may not be nested inside %<simd%> region");
+		" or %<#pragma omp atomic%> may not be nested inside"
+		" %<simd%> region");
      return false;
    }
  else if (gimple_code (ctx->stmt) == GIMPLE_OMP_TEAMS)
--- gcc/testsuite/c-c++-common/gomp/simd7.c.jj	2018-06-14 11:19:30.466274188 +0200
+++ gcc/testsuite/c-c++-common/gomp/simd7.c	2018-06-14 10:55:52.520186067 +0200
@@ -0,0 +1,21 @@
+int a[64];
+
+#pragma omp declare simd linear(x)
+int
+bar (int x, int y)
+{
+  int v;
+  #pragma omp atomic capture
+  v = a[x] += y;
+  return v;
+}
+
+void
+foo (void)
+{
+  int i;
+  #pragma omp simd
+  for (i = 0; i < 64; i++)
+    #pragma omp atomic
+    a[i] += 1;
+}