[gomp5] Allow OpenMP atomics inside simd regions

Thu Jun 14 10:55:00 GMT 2018

On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > #pragma omp atomic is now allowed inside of simd regions.
> > Tested on x86_64-linux, committed to gomp-5_0-branch.
> > 
> > We will actually not vectorize it then though, so some further work will be
> > needed in the vectorizer to handle it.  Either, if we have hw atomics for both
> > the size of the scalar accesses and size of the whole vector type, the
> > accesses are adjacent and known to be aligned, we could replace it with
> > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > the extraction, scalar atomics and if needed insert result back into
> > vectors.  Richard, thoughts on that?
> 
> What's the semantic of this?  Generally for non-vectorizable stmts

OpenMP already has #pragma omp ordered simd, which marks a part of the loop
body that must not be vectorized (right now we implement it just by
forcing no vectorization), and I guess the atomics could be handled
similarly.  I.e. say for
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
#pragma omp simd
for (int i = 0; i < 64; ++i)
  {
    int v;
    a[i] = sqrt (b[i]);
    c[i] = a[i];
    #pragma omp atomic capture
    v = d[i] += c[i];
    e[i] = v;
  }
}
we could vectorize it, say with a vectorization factor (vf) of 4, as:
for (i = 0; i < 64; i += 4)
  {
    v4si v;
    *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
    *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
    v4si c_ = *((v4si *)&c[i]);
    for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
      v[i_] = __atomic_add_fetch_4(&d[i + i_], c_[i_], 0);
    // or, if we have hw supported __atomic_compare_exchange_16 and d is known
    // to be aligned to 128-bits, we could do a 128-bit load + vector add +
    // cmpxchg.
    *((v4si *)&e[i]) = v;
  }
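
A minimal, self-contained sketch of just the atomic part of that lowering,
assuming GCC's vector extensions and the generic __atomic_add_fetch builtin.
The names atomic_add_capture_v4si and foo_lowered are made up for
illustration (they are not GCC internals), the sqrt/conversion statements are
left out, and __builtin_memcpy stands in for the pointer casts above to
sidestep alignment and aliasing questions:

```c
/* Four 32-bit ints in one 128-bit vector (GCC/Clang vector extension).  */
typedef int v4si __attribute__ ((vector_size (16)));

#define N 64
#define VF 4

static int c[N], d[N], e[N];

/* Hypothetical helper: perform the atomic capture for one vector iteration.
   Each lane is a separate scalar atomic; only the lanes are "vector".  */
static v4si
atomic_add_capture_v4si (int *dp, v4si addend)
{
  v4si v = { 0, 0, 0, 0 };
  for (int lane = 0; lane < VF; lane++)
    /* add_fetch returns the updated value, matching v = d[i] += c[i].  */
    v[lane] = __atomic_add_fetch (&dp[lane], addend[lane], __ATOMIC_RELAXED);
  return v;
}

/* Hypothetical lowered loop body, restricted to the c/d/e statements.  */
static void
foo_lowered (void)
{
  for (int i = 0; i < N; i += VF)
    {
      v4si c_, v;
      __builtin_memcpy (&c_, &c[i], sizeof c_);   /* vector load of c[i..i+3] */
      v = atomic_add_capture_v4si (&d[i], c_);    /* scalar atomics per lane */
      __builtin_memcpy (&e[i], &v, sizeof v);     /* vector store to e[i..i+3] */
    }
}
```

With c[i] = i and d[i] = 100 on entry, calling foo_lowered () should leave
both d[i] and e[i] equal to 100 + i, the same as the scalar loop would.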

The semantics of atomics inside of simd should be the same as those of:
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
#pragma omp simd
for (int i = 0; i < 64; ++i)
  {
    int v;
    a[i] = sqrt (b[i]);
    c[i] = a[i];
    #pragma omp ordered simd
    {
      #pragma omp atomic capture
      v = d[i] += c[i];
    }
    e[i] = v;
  }
}

in that the loop is vectorized (if possible), except that the ordered simd
part of the loop is not vectorized; instead it is executed sequentially for
lanes 0 to vf-1.
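
To make the intended execution order concrete, here is a plain C model of
those semantics, a sketch rather than what the vectorizer emits.  The
"vector" parts are written as ordinary inner loops over the lanes, c[] is
filled with an arbitrary value instead of the sqrt result, and order_log,
order_pos and foo_model are hypothetical names added only to record the
order in which the ordered part runs:

```c
#define N 64
#define VF 4

static int c[N], d[N], e[N];
static int order_log[N];   /* hypothetical: iteration numbers, in run order */
static int order_pos;

static void
foo_model (void)
{
  for (int i = 0; i < N; i += VF)
    {
      int v[VF];

      /* Vectorizable part before the ordered block: conceptually all
         VF lanes at once.  */
      for (int lane = 0; lane < VF; lane++)
        c[i + lane] = 2 * (i + lane);

      /* Ordered simd part: lanes 0 .. VF-1 strictly one after another.  */
      for (int lane = 0; lane < VF; lane++)
        {
          v[lane] = __atomic_add_fetch (&d[i + lane], c[i + lane],
                                        __ATOMIC_RELAXED);
          order_log[order_pos++] = i + lane;
        }

      /* Vectorizable part after the ordered block.  */
      for (int lane = 0; lane < VF; lane++)
        e[i + lane] = v[lane];
    }
}
```

After foo_model () the log holds 0, 1, ..., 63 in that order, i.e. the
ordered part sees the iterations in the same sequence as the scalar loop
even though the statements around it are processed VF lanes at a time.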

	Jakub