Another pr46032-inspired example.

Consider par-2.c:
...
#define nEvents 1000

int __attribute__((noinline,noclone))
f (int argc, double *__restrict results, double *__restrict data)
{
  double coeff = 12.2;

  for (INDEX_TYPE idx = 0; idx < nEvents; idx++)
    results[idx] = coeff * data[idx];

  return !(results[argc] == 0.0);
}

#if defined (MAIN)
int
main (int argc)
{
  double results[nEvents] = {0};
  double data[nEvents] = {0};

  return f (argc, results, data);
}
#endif
...

And investigate.sh:
...
#!/bin/bash

src=par-2.c

for parloops_factor in 0 2; do
    for index_type in "int" "unsigned int" "long" "unsigned long"; do
        rm -f *.c.*;

        ./lean-c/install/bin/gcc -O2 $src -S \
            -ftree-parallelize-loops=$parloops_factor \
            -ftree-vectorize \
            -fdump-tree-all-all \
            "-DINDEX_TYPE=$index_type"

        vectdump=$src.132t.vect
        pardump=$src.129t.parloops

        vectorized=$(grep -c "LOOP VECTORIZED" $vectdump)
        if [ ! -f $pardump ]; then
            parallelized=0
        else
            parallelized=$(grep -c "parallelizing inner loop" $pardump)
        fi

        echo "parloops_factor: $parloops_factor, index_type: $index_type:"
        echo "  vectorized: $vectorized, parallelized: $parallelized"
    done
done
...

If we're not parallelizing, vectorization succeeds:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...

If we're parallelizing, vectorization succeeds for (unsigned) long:
...
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...

but not for (unsigned) int:
...
parloops_factor: 2, index_type: int:
  vectorized: 0, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 0, parallelized: 1
...

FWIW, this patch puts pass_parallelize_loops after pass_vectorize:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 4690e23..f0629ff 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -243,14 +243,14 @@ along with GCC; see the file COPYING3.  If not see
              NEXT_PASS (pass_dce);
          POP_INSERT_PASSES ()
          NEXT_PASS (pass_iv_canon);
-         NEXT_PASS (pass_parallelize_loops);
-         PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
-             NEXT_PASS (pass_expand_omp_ssa);
-         POP_INSERT_PASSES ()
          NEXT_PASS (pass_if_conversion);
          /* pass_vectorize must immediately follow pass_if_conversion.
             Please do not add any other passes in between.  */
          NEXT_PASS (pass_vectorize);
+         NEXT_PASS (pass_parallelize_loops);
+         PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
+             NEXT_PASS (pass_expand_omp_ssa);
+         POP_INSERT_PASSES ()
          PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
              NEXT_PASS (pass_dce);
          POP_INSERT_PASSES ()
...

And that makes the problem go away (btw, dump file names need adapting in investigate.sh):
...
$ ./investigate.sh
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
parloops_factor: 2, index_type: int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...

Of course, the patch means we're no longer vectorizing parallelized loops, but parallelizing vectorized loops.

Created attachment 35623: par-2.c.129t.parloops (for -DINDEX_TYPE=int)

Created attachment 35624: par-2.c.130t.ompexpssa

Created attachment 35625: par-2.c.131t.ifcvt

Created attachment 35626: par-2.c.132t.vect

I thought that parallelizing vectorized loops is harder (you eventually get extra prologue and epilogue loops, etc).

(In reply to Richard Biener from comment #6)
> I thought that parallelizing vectorized loops is harder (you eventually get
> extra prologue and epilogue loops, etc).

Another example, par-4.c:
...
int __attribute__((noinline,noclone))
f (int argc, double *__restrict results, double *__restrict data,
   INDEX_TYPE n)
{
  double coeff = 12.2;

  for (INDEX_TYPE idx = 0; idx < n; idx++)
    results[idx] = coeff * data[idx];

  return !(results[argc] == 0.0);
}

#define nEvents 1000

#if defined (MAIN)
int
main (int argc)
{
  double results[nEvents] = {0};
  double data[nEvents] = {0};

  return f (argc, results, data, nEvents);
}
#endif
...

When not parallelizing, we vectorize without problems:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...

When parallelizing, we generate both a low iteration count loop and a split-off parallelized loop. The vectorizer vectorizes both loops (each of which contains an epilogue):
...
parloops_factor: 2, index_type: int:
  vectorized: 2, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 2, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 2, parallelized: 1
...

Except in the case of unsigned int, where it only vectorizes the low iteration count loop:
...
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
...

The other loop fails to vectorize in a fashion similar to that described for par-2.c with INDEX_TYPE (unsigned) int.
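FWIW, a hand-written sketch of the loop shape described above for par-4.c: a guard on the iteration count picks either the low iteration count loop or the split-off loop that is divided into per-thread slices. This is not the code GCC emits. The names f_chunk, f_versioned and MIN_ITERS_PER_THREAD are hypothetical, the threshold value is made up, and the threads are simulated by a plain sequential loop.

```c
#include <assert.h>

/* Hypothetical threshold; the real heuristic lives in the parloops pass
   and is not reproduced here.  */
#define MIN_ITERS_PER_THREAD 100

/* Loop body for one slice [lo, hi) of the iteration space.  */
static void
f_chunk (double *restrict results, const double *restrict data,
         unsigned int lo, unsigned int hi)
{
  const double coeff = 12.2;

  for (unsigned int idx = lo; idx < hi; idx++)
    results[idx] = coeff * data[idx];
}

/* Returns the number of slices used; 1 means the sequential copy ran.  */
static unsigned int
f_versioned (double *restrict results, const double *restrict data,
             unsigned int n, unsigned int nthreads)
{
  assert (nthreads > 0);

  if (n < MIN_ITERS_PER_THREAD * nthreads)
    {
      /* Low iteration count loop: stays sequential.  */
      f_chunk (results, data, 0, n);
      return 1;
    }

  /* Split-off loop: each simulated thread handles one contiguous slice.  */
  unsigned int per_thread = (n + nthreads - 1) / nthreads;
  for (unsigned int t = 0; t < nthreads; t++)
    {
      unsigned int lo = t * per_thread;
      unsigned int hi = lo + per_thread < n ? lo + per_thread : n;
      if (lo < hi)
        f_chunk (results, data, lo, hi);
    }
  return nthreads;
}
```

Both copies contain the same loop body, which is why the vectorizer gets two candidate loops when it runs after parloops.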

For example par-4.c, if we use the same patch to interchange the passes, we get:

When not parallelizing, all loops get vectorized:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...

When parallelizing, we parallelize one loop.
...
parloops_factor: 2, index_type: int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...

The loop that is parallelized is the vectorized loop, not the epilogue.

So AFAIU:
- with this patch the epilogue is only performed by the main thread, after
  all the threads are done. Each thread handles one slice of the vectorized
  loop.
- without the patch, the epilogue is potentially executed by each thread.
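To put rough numbers on the two orderings, here is a small counting model, a sketch under simplifying assumptions only: it ignores prologue (alignment peeling) loops, assumes the iteration space is split into equal contiguous slices, and treats the vectorization factor vf as a given. The function names are hypothetical and nothing here is taken from GCC.

```c
#include <assert.h>

/* Vectorize first, then parallelize: the vector loop is sliced across
   threads and a single scalar epilogue handles the leftover iterations.  */
static unsigned int
epilogue_iters_vect_then_par (unsigned int n, unsigned int vf)
{
  assert (vf > 0);
  return n % vf;
}

/* Parallelize first, then vectorize: every thread's slice is vectorized
   on its own, so every slice may need its own scalar epilogue.  */
static unsigned int
epilogue_iters_par_then_vect (unsigned int n, unsigned int vf,
                              unsigned int nthreads)
{
  assert (vf > 0 && nthreads > 0);

  unsigned int per_thread = (n + nthreads - 1) / nthreads;
  unsigned int total = 0;

  for (unsigned int t = 0; t < nthreads; t++)
    {
      unsigned int lo = t * per_thread;
      unsigned int hi = lo + per_thread < n ? lo + per_thread : n;
      if (lo < hi)
        total += (hi - lo) % vf;
    }
  return total;
}
```

For instance, with n = 1000, vf = 4 and 3 threads, the first model gives 0 scalar epilogue iterations, while the second gives slices of 334, 334 and 332 iterations and hence 2 + 2 + 0 = 4 scalar epilogue iterations spread over the threads.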