omp-simd.c:

void poisson(int Q, const double *restrict gsym, const double *restrict du,
             double *restrict dv) {
#pragma omp simd
  for (int i=0; i<Q; i++) {
    const double g[2][2] = {{gsym[Q*0+i], gsym[Q*2+i]},
                            {gsym[Q*2+i], gsym[Q*1+i]}};
    for (int j=0; j<2; j++)
      dv[Q*j+i] = g[j][0] * du[Q*0+i] + g[j][1] * du[Q*1+i];
  }
}

The above fails to vectorize despite the inner loop being unrolled.

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:6:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:4:38: missed: couldn't vectorize loop
omp-simd.c:4:18: missed: not vectorized: not suitable for scatter store D.4095[_37][0][0] = _4;

If I remove the "#pragma omp simd", it vectorizes:

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:5:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:2:3: optimized: loop vectorized using 32 byte vectors
omp-simd.c:2:3: optimized: loop versioned for vectorization because of possible aliasing
omp-simd.c:2:3: optimized: loop with 2 iterations completely unrolled (header execution count 18709371)

If instead I replace "#pragma omp simd" with "#pragma GCC ivdep", it vectorizes without versioning for possible aliasing:

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:6:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:3:3: optimized: loop vectorized using 32 byte vectors
omp-simd.c:3:3: optimized: loop with 2 iterations completely unrolled (header execution count 24166268)

Aliasing should not be a concern here because all three pointers are declared restrict. Also, if I manually unroll the inner loop (which the compiler is already unrolling for me), the original "omp simd" version vectorizes nicely.

Reproduced on trunk: https://gcc.godbolt.org/z/wKdHg0
The runtime alias test is between the two stores of the unrolled inner loop, i.e. dv[i] vs. dv[i+Q].

Creating dr for *_61
analyze_innermost: success.
	base_address: dv_44(D)
	offset from base address: 0
	constant offset from base address: 0
	step: 8
	base alignment: 8
	base misalignment: 0
	offset alignment: 512
	step alignment: 8
	base_object: *dv_44(D)
	Access function 0: {0B, +, 8}_1
Creating dr for *_79
analyze_innermost: success.
	base_address: (double *) dv_44(D) + (sizetype) ((long unsigned int) Q_35(D) * 8)
	offset from base address: 0
	constant offset from base address: 0
	step: 8
	base alignment: 8
	base misalignment: 0
	offset alignment: 512
	step alignment: 8
	base_object: *(double *) dv_44(D) + (sizetype) ((long unsigned int) Q_35(D) * 8)
	Access function 0: {0B, +, 8}_1

It's probably unfortunate association, since inside the loop we compute

  _11 = Q_35(D) + i_39;
  _12 = (long unsigned int) _11;
  _13 = _12 * 8;

With OpenMP SIMD we fail to analyze the data-refs:

Creating dr for D.4113[_37][1][0]
analyze_innermost: t.c:4:18: missed: failed: evolution of offset is not affine.
	base_address:
	offset from base address:
	constant offset from base address:
	step:
	base alignment: 0
	base misalignment: 0
	offset alignment: 0
	step alignment: 0
	base_object: D.4113
	Access function 0: 0
	Access function 1: 1
	Access function 2: scev_not_known;

where _37 is the SIMD lane.
The routine is obfuscated too much; why not use gsym[Q*2*j+i] instead of g[j][0] and similarly gsym[Q*2-j*Q+i] instead of g[j][1]?

The reason this isn't vectorized is that we need to effectively privatize the g variable, because every SIMD lane needs different values for it, and SRA isn't able to split that apart into scalars indexed by the SIMD lane. So, in the end, this is pretty much a dup of PR91020.
> why not use gsym[Q*2*j+i] instead of g[j][0] and similarly gsym[Q*2-j*Q+i] instead of g[j][1]?

The pattern here is that gsym is packed storage of a symmetric 2x2 matrix, while g unpacks it so that inner loops (intended for unrolling) can be written using index notation. This case (a finite element quadrature routine for 2D anisotropic Poisson) is reduced from more complicated examples (such as 3D nonlinear solid and fluid mechanics) where this technique provides substantial clarity and correspondence to the mathematical notation.

The suggested transformation (eliminating the temporary g[][] in exchange for fancy indexing of gsym) is problematic when representing higher-order tensors (https://en.wikipedia.org/wiki/Voigt_notation#Mnemonic_rule). It's also sometimes desirable to roll up the second loop instead of repeating it, in which case you don't get to have a different indexing rule for g[j][0] and g[j][1]:

  for (int i=0; i<Q; i++) {
    const double g[2][2] = {{gsym[Q*0+i], gsym[Q*2+i]},
                            {gsym[Q*2+i], gsym[Q*1+i]}};
    for (int j=0; j<2; j++) {
      dv[Q*j+i] = 0;
      for (int k=0; k<2; k++)
        dv[Q*j+i] += g[j][k] * du[Q*k+i];
    }
  }

Fortunately, we're generally getting good codegen for more complicated cases when using "GCC ivdep", and hit-or-miss (but good in this case) without any pragmas (restrict becomes important when there are more arrays). I filed this report specifically because adding a semantically correct "omp simd" prevented vectorization.