Bug 91732 - Adding omp simd pragma prevents vectorization
Summary: Adding omp simd pragma prevents vectorization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 10.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, openmp
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2019-09-11 04:21 UTC by Jed Brown
Modified: 2019-09-11 12:32 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-09-11 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Jed Brown 2019-09-11 04:21:26 UTC
omp-simd.c:
void poisson(int Q, const double *restrict gsym, const double *restrict du, double *restrict dv) {
#pragma omp simd
  for (int i=0; i<Q; i++) {
    const double g[2][2] = {{gsym[Q*0+i], gsym[Q*2+i]},
                            {gsym[Q*2+i], gsym[Q*1+i]}};
    for (int j=0; j<2; j++)
      dv[Q*j+i] = g[j][0] * du[Q*0+i] + g[j][1] * du[Q*1+i];
  }
}

The above fails to vectorize despite unrolling the inner loop.

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:6:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:4:38: missed: couldn't vectorize loop
omp-simd.c:4:18: missed: not vectorized: not suitable for scatter store D.4095[_37][0][0] = _4;

If I remove the "#pragma omp simd", it vectorizes:

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:5:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:2:3: optimized: loop vectorized using 32 byte vectors
omp-simd.c:2:3: optimized:  loop versioned for vectorization because of possible aliasing
omp-simd.c:2:3: optimized: loop with 2 iterations completely unrolled (header execution count 18709371)

If instead, I replace "#pragma omp simd" with "#pragma GCC ivdep", it vectorizes without possible aliasing.

$ gcc -Ofast -march=skylake-avx512 -fopenmp -fopt-info -fopt-info-missed -c omp-simd.c
omp-simd.c:6:5: optimized: loop with 2 iterations completely unrolled (header execution count 357878152)
omp-simd.c:3:3: optimized: loop vectorized using 32 byte vectors
omp-simd.c:3:3: optimized: loop with 2 iterations completely unrolled (header execution count 24166268)

I think aliasing should not be a concern due to use of restrict.  Also, if I manually unroll the inner loop (which the compiler is unrolling for me), the original "omp simd" version vectorizes nicely.

Reproduced on trunk: https://gcc.godbolt.org/z/wKdHg0
Comment 1 Richard Biener 2019-09-11 09:23:39 UTC
The runtime alias test is between the two stores of the inner unrolled loop.
That's dv[i] vs. dv[i+Q].

Creating dr for *_61
analyze_innermost: success.
        base_address: dv_44(D)
        offset from base address: 0
        constant offset from base address: 0
        step: 8
        base alignment: 8
        base misalignment: 0
        offset alignment: 512
        step alignment: 8
        base_object: *dv_44(D)
        Access function 0: {0B, +, 8}_1
Creating dr for *_79
analyze_innermost: success.
        base_address: (double *) dv_44(D) + (sizetype) ((long unsigned int) Q_35(D) * 8)
        offset from base address: 0
        constant offset from base address: 0
        step: 8
        base alignment: 8
        base misalignment: 0
        offset alignment: 512
        step alignment: 8
        base_object: *(double *) dv_44(D) + (sizetype) ((long unsigned int) Q_35(D) * 8)
        Access function 0: {0B, +, 8}_1

it's probably unfortunate association since we compute inside the loop

  _11 = Q_35(D) + i_39;
  _12 = (long unsigned int) _11;
  _13 = _12 * 8;


With OpenMP SIMD we fail to analyze the data-refs:

Creating dr for D.4113[_37][1][0]
analyze_innermost: t.c:4:18: missed:  failed: evolution of offset is not affine.
        base_address:
        offset from base address:
        constant offset from base address:
        step:
        base alignment: 0
        base misalignment: 0
        offset alignment: 0
        step alignment: 0
        base_object: D.4113
        Access function 0: 0
        Access function 1: 1
        Access function 2: scev_not_known;

where _37 is the SIMD lane.
Comment 2 Jakub Jelinek 2019-09-11 09:30:15 UTC
The routine is obfuscated too much, why not use gsym[Q*2*j+i] instead of g[j][0] and similarly gsym[Q*2-j*Q+i] instead of g[j][1]?
The reason this isn't vectorized is that we need to effectively privatize the g variable, because every SIMD lane needs different values for it, and SRA isn't able to split that appart into scalars indexed by the simd lane.
So, in the end this is pretty much a dup of PR91020.
Comment 3 Jed Brown 2019-09-11 12:32:27 UTC
> why not use gsym[Q*2*j+i] instead of g[j][0] and similarly gsym[Q*2-j*Q+i] instead of g[j][1]?

The pattern here is that gsym is packed storage of a symmetric 2x2 matrix, while g unpacks it so that inner loops (intended for unrolling) can be written using index notation. This case (a finite element quadrature routine for 2D anisotropic Poisson) is reduced from more complicated examples (such as 3D nonlinear solid and fluid mechanics) where this technique provides substantial clarity and correspondence to mathematical notation. The suggested transformation (eliminating the temporary g[][] in exchange for fancy indexing of g) is problematic when representing higher order tensors (https://en.wikipedia.org/wiki/Voigt_notation#Mnemonic_rule).

It's also sometimes desirable to roll the second loop instead of repeating, in which case you don't get to have a different indexing rule for g[j][0] and g[j][1].

  for (int i=0; i<Q; i++) {
    const double g[2][2] = {{gsym[Q*0+i], gsym[Q*2+i]},
                            {gsym[Q*2+i], gsym[Q*1+i]}};
    for (int j=0; j<2; j++) {
      dv[Q*j+i] = 0;
      for (int k=0; k<2; k++)
        dv[Q*j+i] += g[j][k] * du[Q*k+i];
    }
  }
}

Fortunately, we're generally getting good codegen for more complicated cases when using "GCC ivdep" and hit-or-miss (but good in this case) without any pragmas (restrict becomes important when there are more arrays).  I filed this report specifically because adding a semantically-correct "omp simd" prevented vectorization.