Created attachment 57054 [details] The three different versions as well as preprocessed output. I attached three programs that compute an array accel of points in R^3 from a 3d array of positions in R^3. accel[i] is sum_j ||positions[i] - positions[j]||^2. These are seq.c which is the most basic, omp.c which parallelises the outer loop with OpenMP and block.c which uses the blocking optimisation. The first version vectorizes as expected, but the other two do not. objdump -d omp.o | grep ymm shows up empty. They are compiled with gcc -c seq.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o seq.o -lm gcc -c omp.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o omp.o -lm -fopenmp gcc -c block.c -Ofast -mavx2 -mfma -save-temps -Wall -Wextra -o block.o -lm and gcc -v gives the following. Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/home/thomas/.local/libexec/gcc/x86_64-pc-linux-gnu/13.2.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ./configure --prefix=/home/thomas/.local Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 13.2.0 (GCC)
Created attachment 57055 [details] testcase that does vectorize with openmp though So what is happening is loop invariant motion is not happening with the openmp version, I have not looked into why exactly. BUT you can manually it for the loop and get a vectorized version of the loop.
One thing I should note is __restrict__ inside a struct is almost definitely not going to help here. See PR 49761 for that.
The issue with block.c is Analyzing loop at block.c:22 block.c:22:39: note: === analyze_loop_nest === block.c:22:39: note: === vect_analyze_loop_form === block.c:22:39: note: === get_loop_niters === block.c:22:39: missed: not vectorized: number of iterations cannot be computed. block.c:22:39: missed: bad loop form. block.c:22:39: missed: couldn't vectorize loop we fail to compute an expression for the number of scalar iterations in the innermost loop. That's because we have 'j < J + BLOCK && j < n' as the terminating condition. I suspect that the blocking should peel the case where J + BLOCK > n, basically if (J + BLOCK > n || I + BLOCK > n) { ... blocking nest with < n exit condition } else { ... blocking nest with < {J,I} + BLOCK exit condition } the vectorizer (or rather niter analysis) could try to recover in a similar way with using 'assumptions' - basically we can compute the number of iterations to BLOCK if we assume that J + BLOCK <= n. The exit condition looks like _145 = J_86 + 999; ... <bb 4> [local count: 958878294]: # j_88 = PHI <j_58(18), J_86(7)> ... j_58 = j_88 + 1; _63 = n_49(D) > j_58; _64 = j_58 <= _145; _65 = _63 & _64; if (_65 != 0) we could try to pattern-match this NE_EXPR (we need to choose which condition we use as assumption and which to base the niters on). Another possibility would be (I think this came up in another bugreport as well) to use j < MIN (J + BLOCK, n). The following source modification works: for (int i = I; i < I + BLOCK && i < n; i++) { int m = J + BLOCK > n ? n : J + BLOCK; for (int j = J; j < m; j++) { whether it's a general profitable transform or should be matched again only during niter analysis I'm not sure (if the MIN is loop invariant and this is an exit condition it surely is profitable).