[gomp5] Allow OpenMP atomics inside simd regions
Jakub Jelinek
jakub@redhat.com
Thu Jun 14 10:55:00 GMT 2018
On Thu, Jun 14, 2018 at 12:34:11PM +0200, Richard Biener wrote:
> > #pragma omp atomic is now allowed inside of simd regions.
> > Tested on x86_64-linux, committed to gomp-5_0-branch.
> >
> > We will actually not vectorize it then though, so some further work will be
> > needed in the vectorizer to handle it. Either, if we have hw atomics for both
> > the size of the scalar accesses and size of the whole vector type, the
> > accesses are adjacent and known to be aligned, we could replace it with
> > atomic on the whole vector, or emit as a small loop or unrolled loop doing
> > the extraction, scalar atomics and if needed insert result back into
> > vectors. Richard, thoughts on that?
>
> What's the semantic of this? Generally for non-vectorizable stmts
OpenMP already has #pragma omp ordered simd, which marks a part of the loop
body that should not be vectorized (right now we implement that simply by
forcing no vectorization), and I guess the atomics could be handled
similarly. I.e. say for
#include <math.h>
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp atomic capture
      v = d[i] += c[i];
      e[i] = v;
    }
}
we could vectorize it, say with a vf of 4, as:
for (i = 0; i < 64; i += 4)
  {
    v4si v;
    *((v4sf *)&a[i]) = sqrtv4sf (*((v4sf *)&b[i]));
    *((v4si *)&c[i]) = fix_truncv4sfv4si (*((v4sf *)&a[i]));
    v4si c_ = *((v4si *)&c[i]);
    for (i_ = 0; i_ < 4; i_++) // possibly unrolled, in any case scalar
      v[i_] = __atomic_add_fetch_4 (&d[i + i_], c_[i_], 0); // 0 is __ATOMIC_RELAXED
    // or, if we have hw supported __atomic_compare_exchange_16 and d is known
    // to be aligned to 128-bits, we could do a 128-bit load + vector add +
    // cmpxchg.
    *((v4si *)&e[i]) = v; // vector store of all 4 captured values
  }
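To make the extract / scalar atomic / insert step concrete, here is a minimal
sketch using the GCC vector extensions. It covers only the atomic capture and
the store to e, not the sqrt and conversion. The names
atomic_add_capture_v4si and atomic_part are hypothetical, made up for this
illustration, and the generic __atomic_add_fetch builtin stands in for the
size-specific __atomic_add_fetch_4 used above. It is not what the vectorizer
emits today, just one way the generated code could look.

```c
#include <assert.h>
#include <string.h>

/* GCC vector extension type; the name mirrors the pseudocode above.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: performs "v = d[k] += c_[k]" atomically for four
   adjacent lanes, one scalar atomic per lane, in lane order.  */
static v4si
atomic_add_capture_v4si (int *d, v4si c_)
{
  v4si v;
  for (int i_ = 0; i_ < 4; i_++)   /* possibly unrolled, always scalar */
    v[i_] = __atomic_add_fetch (&d[i_], c_[i_], __ATOMIC_RELAXED);
  return v;
}

/* Hypothetical driver: only the atomic part of the loop body plus the
   store of the captured values into e.  Assumes n is a multiple of 4.  */
void
atomic_part (int *d, const int *c, int *e, int n)
{
  assert (n % 4 == 0);
  for (int i = 0; i < n; i += 4)
    {
      v4si c_, v;
      memcpy (&c_, &c[i], sizeof c_);          /* vector load of c[i..i+3] */
      v = atomic_add_capture_v4si (&d[i], c_);
      memcpy (&e[i], &v, sizeof v);            /* vector store to e[i..i+3] */
    }
}
```

memcpy is used for the vector loads and stores so that the sketch does not
rely on the arrays being 16-byte aligned.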
The semantics of atomics inside of simd should be the same as those of:
#include <math.h>
float a[64], b[64];
int c[64], d[64], e[64];
void foo (void) {
  #pragma omp simd
  for (int i = 0; i < 64; ++i)
    {
      int v;
      a[i] = sqrt (b[i]);
      c[i] = a[i];
      #pragma omp ordered simd
      {
        #pragma omp atomic capture
        v = d[i] += c[i];
      }
      e[i] = v;
    }
}
in that the loop is vectorized (if possible), except that the ordered simd
part of the loop is not vectorized; instead that part is executed
sequentially for lanes 0 to vf-1.
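As a rough model of that execution order, the sketch below compares a plain
scalar loop with a chunked loop that handles vf iterations at a time but runs
the ordered part lane by lane. To make the ordering observable it updates a
single shared location instead of d[i]; that change, the value VF = 4, and the
names ref_loop and chunked_loop are assumptions made for this illustration
only.

```c
#include <assert.h>

#define VF 4   /* assumed vectorization factor */

/* Scalar reference: the loop body executed one iteration at a time.  */
void
ref_loop (const int *c, int *sum, int *e, int n)
{
  for (int i = 0; i < n; ++i)
    e[i] = __atomic_add_fetch (sum, c[i], __ATOMIC_RELAXED);
}

/* Model of the vectorized loop: the parts before and after the ordered
   simd block are done for VF iterations at once, the ordered part runs
   sequentially for lanes 0 .. VF-1.  Assumes n is a multiple of VF.  */
void
chunked_loop (const int *c, int *sum, int *e, int n)
{
  assert (n % VF == 0);
  for (int i = 0; i < n; i += VF)
    {
      int c_[VF], v[VF];
      for (int k = 0; k < VF; k++)   /* vectorizable: load c[i .. i+VF-1] */
        c_[k] = c[i + k];
      for (int k = 0; k < VF; k++)   /* ordered part: lanes in order */
        v[k] = __atomic_add_fetch (sum, c_[k], __ATOMIC_RELAXED);
      for (int k = 0; k < VF; k++)   /* vectorizable: store e[i .. i+VF-1] */
        e[i + k] = v[k];
    }
}
```

Because the ordered part visits the lanes in iteration order, the captured
values in e come out the same as in the scalar loop.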
Jakub