Loop Vectorization and OpenMP

Mon Jan 14 16:05:00 GMT 2013

Hi all,

I have a function which I wish to accelerate with auto-vectorization and
OpenMP:

void fn(float *restrict rho_in,     float *restrict E_in,
        float *restrict rhou_in,    float *restrict rhov_in,
        float *restrict f0rho_out,  float *restrict f0E_out,
        float *restrict f0rhou_out, float *restrict f0rhov_out,
        float *restrict f1rho_out,  float *restrict f1E_out,
        float *restrict f1rhou_out, float *restrict f1rhov_out,
        int n)
{
    rho_in  = (float *) __builtin_assume_aligned(rho_in, 32);
    E_in    = (float *) __builtin_assume_aligned(E_in, 32);
    rhou_in = (float *) __builtin_assume_aligned(rhou_in, 32);
    rhov_in = (float *) __builtin_assume_aligned(rhov_in, 32);

    f0rho_out  = (float *) __builtin_assume_aligned(f0rho_out, 32);
    f0E_out    = (float *) __builtin_assume_aligned(f0E_out, 32);
    f0rhou_out = (float *) __builtin_assume_aligned(f0rhou_out, 32);
    f0rhov_out = (float *) __builtin_assume_aligned(f0rhov_out, 32);

    f1rho_out  = (float *) __builtin_assume_aligned(f1rho_out, 32);
    f1E_out    = (float *) __builtin_assume_aligned(f1E_out, 32);
    f1rhou_out = (float *) __builtin_assume_aligned(f1rhou_out, 32);
    f1rhov_out = (float *) __builtin_assume_aligned(f1rhov_out, 32);

    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
    {
        float rho = rho_in[i], E = E_in[i];
        float rhou = rhou_in[i], rhov = rhov_in[i];

        float invrho = 1.0f/rho;
        float u = invrho*rhou, v = invrho*rhov;

        float p = 0.4f*(E - 0.5f*(rhou*u + rhov*v));

        f0rho_out[i]  = rhou;       f1rho_out[i]  = rhov;
        f0rhou_out[i] = rhou*u + p; f1rhou_out[i] = rhov*u;
        f0rhov_out[i] = rhou*v;     f1rhov_out[i] = rhov*v + p;
        f0E_out[i]    = (E + p)*u;  f1E_out[i]    = (E + p)*v;
    }
}

The combination of "restrict" along with the alignment fluff yields some
extremely tight ASM on my AVX-capable system.  However, when OpenMP
enters the mix, the resulting code is no longer vectorized:

  gcc-4.7.2 -std=c99 -Ofast -fopenmp -march=native -S fn.c

as can be seen by a simple inspection of the resulting assembly.  I
believe this is due to Bug 46032 (although some of the comments imply
that it should be fixed).  It appears as if either the "restrict"
property or the alignment is getting clobbered when the OpenMP 'inner'
function is generated.
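
To illustrate what I mean by the 'inner' function: as I understand it,
GCC outlines the body of the parallel region into a separate function
and hands it the shared variables through a compiler-generated record.
The sketch below is hand-written, not compiler output.  The names
"omp_data" and "outlined_body" are made up, only two of the twelve
pointers are shown, and the explicit begin/end arguments are a
simplification of how each thread obtains its share of the iterations.

```c
/* Hypothetical stand-in for the record of shared variables that the
   compiler builds for the parallel region. */
struct omp_data {
    float *rhou_in;
    float *f0rho_out;
    int n;
};

/* Hypothetical outlined loop body.  The pointers are read back out of
   the record, so inside this function nothing says that they are
   "restrict" or 32-byte aligned; that is my guess as to why the
   vectorizer gives up. */
static void outlined_body(struct omp_data *d, int begin, int end)
{
    float *in  = d->rhou_in;
    float *out = d->f0rho_out;

    for (int i = begin; i < end && i < d->n; ++i)
        out[i] = in[i];   /* f0rho_out[i] = rhou, as in fn() */
}
```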

Can anyone suggest any workarounds?  It seems like a common problem and
I really do not want to reinvent the wheel if a simple refactoring of my
code can iron everything out.
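
For concreteness, the kind of refactoring I have in mind is sketched
below: keep "restrict" and the alignment hints in a plain serial kernel
and have each OpenMP thread call it on its own chunk.  This is only a
sketch under assumptions.  The names "kernel" and "fn_chunked" are made
up, the arithmetic is cut down to two inputs and two outputs, and I have
not verified that it actually restores vectorization with 4.7.2.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical serial kernel.  noinline is there so that it stays a
   separate function and is not merged into the OpenMP inner function. */
static void __attribute__((noinline))
kernel(const float *restrict rho_in, const float *restrict rhou_in,
       float *restrict u_out, float *restrict f_out, int n)
{
    rho_in  = (const float *) __builtin_assume_aligned(rho_in, 32);
    rhou_in = (const float *) __builtin_assume_aligned(rhou_in, 32);
    u_out   = (float *) __builtin_assume_aligned(u_out, 32);
    f_out   = (float *) __builtin_assume_aligned(f_out, 32);

    for (int i = 0; i < n; ++i)
    {
        float u = rhou_in[i]/rho_in[i];

        u_out[i] = u;
        f_out[i] = rhou_in[i]*u;
    }
}

/* Hypothetical driver.  All arrays are assumed to be 32-byte aligned. */
void fn_chunked(const float *rho_in, const float *rhou_in,
                float *u_out, float *f_out, int n)
{
    if (n <= 0)
        return;

    #pragma omp parallel
    {
        int nthreads = 1, tid = 0;
#ifdef _OPENMP
        nthreads = omp_get_num_threads();
        tid      = omp_get_thread_num();
#endif
        /* Round the chunk up to a multiple of 8 floats (32 bytes) so
           that every chunk starts on an aligned address. */
        size_t chunk = ((size_t) n + nthreads - 1)/nthreads;
        chunk = (chunk + 7) & ~(size_t) 7;

        size_t begin = (size_t) tid*chunk;
        if (begin < (size_t) n)
        {
            size_t len = (size_t) n - begin;
            if (len > chunk)
                len = chunk;

            kernel(rho_in + begin, rhou_in + begin,
                   u_out + begin, f_out + begin, (int) len);
        }
    }
}
```

The obvious downside is that the work sharing is done by hand, so
schedule clauses and the like are no longer available.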

Regards, Freddie.