[gomp5] Allow OpenMP atomics inside simd regions

Thu Jun 14 12:37:00 GMT 2018

On Thu, 14 Jun 2018, Jakub Jelinek wrote:

> On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > > #pragma omp atomic is now allowed inside of simd regions.
> > > Tested on x86_64-linux, committed to gomp-5_0-branch.
> > > 
> > > We will actually not vectorize it then though, so some further work will be
> > > needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> > > the size of the scalar accesses and size of the whole vector type, the
> > > accesses are adjacent and known to be aligned, we could replace it with
> > > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > > the extraction, scalar atomics and if needed insert result back into
> > > vectors.  Richard, thoughts on that?
> > 
> > What's the semantic of this?  Generally for non-vectorizable stmts
> 
> OpenMP already has #pragma omp ordered simd which specifies part of the loop
> body that should not be vectorized (which we right now just implement as
> forcing no vectorization) and I guess the atomics could be handled
> similarly.  I.e. say for
> float a[64], b[64];
> int c[64], d[64], e[64];
> void foo (void) {
> #pragma omp simd
> for (int i = 0; i < 64; ++i)
>   {
>     int v;
>     a[i] = sqrt (b[i]);
>     c[i] = a[i];
>     #pragma omp atomic capture
>     v = d[i] += c[i];
>     e[i] = v;
>   }
> }
> vectorize it say with vf of 4 as:
> for (i = 0; i < 64; i += 4)
>   {
>     v4si v;
>     *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
>     *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
>     v4si c_ = *((v4si *)&c[i]);
>     for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
>       v[i_] = __atomic_add_fetch_4(&d[i + i_], c_[i_], 0);
>     // or, if we have hw supported __atomic_compare_exchange_16 and d is known
>     // to be aligned to 128-bits, we could do a 128-bit load + vector add +
>     // cmpxchg.
>     e[i] = v;
>   }
> 
> The semantics of atomics inside of simd should be the same as of:
> float a[64], b[64];
> int c[64], d[64], e[64];
> void foo (void) {
> #pragma omp simd
> for (int i = 0; i < 64; ++i)
>   {
>     int v;
>     a[i] = sqrt (b[i]);
>     c[i] = a[i];
>     #pragma omp ordered simd
>     {
>       #pragma omp atomic capture
>       v = d[i] += c[i];
>     }
>     e[i] = v;
>   }
> }
> 
> in that it vectorizes (if possible) the loop, except for not vectorizing
> the ordered simd part of the loop, but instead iterating from 0 to vf-1
> sequentially.

So re-ordering iterations for the non-ordered/atomic part of the
loop is OK, even crossing the ordered/atomic parts - like
above the store to a[1] may happen before the d[0] += c[0] atomic
operation, but the atomic/ordered stmts have to happen in-order
with respect to only themselves?

Then we can indeed vectorize this by copying the scalar stmts N times
with decomposing the input vectors beforehand and building a vector
result afterwards.

I'd like to see us trying this for otherwise not vectorizable
code as well (with appropriate costing of course).  Then the atomics
vectorization would work transparently and we only need to think
about how to mark stmts in ordered regions?

Richard.