Bug 113358 - OpenMP inhibits vectorization
Summary: OpenMP inhibits vectorization
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 13.2.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2024-01-12 17:23 UTC by Thomas (T.J.) Koopman
Modified: 2024-01-15 08:16 UTC (History)
1 user (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-01-12 00:00:00


Attachments
The three different versions as well as preprocessed output. (14.33 KB, application/x-tar)
2024-01-12 17:23 UTC, Thomas (T.J.) Koopman
Details
testcase that does vectorize with openmp though (415 bytes, text/plain)
2024-01-12 17:41 UTC, Andrew Pinski
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Thomas (T.J.) Koopman 2024-01-12 17:23:31 UTC
Created attachment 57054 [details]
The three different versions as well as preprocessed output.

I attached three programs that compute an array accel of points in R^3 from a 3d array of positions in R^3. accel[i] is sum_j ||positions[i] - positions[j]||^2. 

These are seq.c which is the most basic, omp.c which parallelises the outer loop with OpenMP and block.c which uses the blocking optimisation. The first version vectorizes as expected, but the other two do not. objdump -d omp.o | grep ymm  shows up empty.

They are compiled with

gcc -c seq.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o seq.o -lm

gcc -c omp.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o omp.o -lm -fopenmp

gcc -c block.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o block.o -lm

and

gcc -v gives the following.

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/thomas/.local/libexec/gcc/x86_64-pc-linux-gnu/13.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --prefix=/home/thomas/.local
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (GCC)
Comment 1 Andrew Pinski 2024-01-12 17:41:20 UTC
Created attachment 57055 [details]
testcase that does vectorize with openmp though

So what is happening is loop invariant motion is not happening with the openmp version, I have not looked into why exactly. BUT you can manually it for the loop and get a vectorized version of the loop.
Comment 2 Andrew Pinski 2024-01-12 17:44:13 UTC
One thing I should note is __restrict__ inside a struct is almost definitely not going to help here. See PR 49761 for that.
Comment 3 Richard Biener 2024-01-15 08:16:56 UTC
The issue with block.c is

Analyzing loop at block.c:22
block.c:22:39: note:  === analyze_loop_nest ===
block.c:22:39: note:   === vect_analyze_loop_form ===
block.c:22:39: note:    === get_loop_niters ===
block.c:22:39: missed:   not vectorized: number of iterations cannot be computed.
block.c:22:39: missed:  bad loop form.
block.c:22:39: missed: couldn't vectorize loop

we fail to compute an expression for the number of scalar iterations in the
innermost loop.  That's because we have 'j < J + BLOCK && j < n' as
the terminating condition.  I suspect that the blocking should peel the
case where J + BLOCK > n, basically

      if (J + BLOCK > n || I + BLOCK > n)
        {
          ... blocking nest with < n exit condition
        }
      else
        {
          ... blocking nest with < {J,I} + BLOCK exit condition
        }

the vectorizer (or rather niter analysis) could try to recover in a similar
way with using 'assumptions' - basically we can compute the number of
iterations to BLOCK if we assume that J + BLOCK <= n.  The exit condition
looks like

  _145 = J_86 + 999;
...
  <bb 4> [local count: 958878294]:
  # j_88 = PHI <j_58(18), J_86(7)>
...
  j_58 = j_88 + 1;
  _63 = n_49(D) > j_58;
  _64 = j_58 <= _145;
  _65 = _63 & _64;
  if (_65 != 0)

we could try to pattern-match this NE_EXPR (we need to choose which
condition we use as assumption and which to base the niters on).
Another possibility would be (I think this came up in another bugreport
as well) to use j < MIN (J + BLOCK, n).

The following source modification works:

    for (int i = I; i < I + BLOCK && i < n; i++) {
        int m = J + BLOCK > n ? n : J + BLOCK;
        for (int j = J; j < m; j++) {

whether it's a general profitable transform or should be matched again
only during niter analysis I'm not sure (if the MIN is loop invariant
and this is an exit condition it surely is profitable).