Bug 66285 - failure to vectorize parallelized loop
Summary: failure to vectorize parallelized loop
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 6.0
Importance: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2015-05-26 07:55 UTC by Tom de Vries
Modified: 2021-08-16 04:46 UTC (History)
0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
par-2.c.129t.parloops (4.49 KB, text/plain)
2015-05-26 08:11 UTC, Tom de Vries
par-2.c.130t.ompexpssa (4.98 KB, text/plain)
2015-05-26 08:12 UTC, Tom de Vries
par-2.c.131t.ifcvt (1.81 KB, text/plain)
2015-05-26 08:12 UTC, Tom de Vries
par-2.c.132t.vect (2.34 KB, text/plain)
2015-05-26 08:13 UTC, Tom de Vries

Description Tom de Vries 2015-05-26 07:55:19 UTC
Another pr46032-inspired example.

Consider par-2.c:
...
#define nEvents 1000

int __attribute__((noinline,noclone))
f (int argc, double *__restrict results, double *__restrict data)
{
  double coeff = 12.2;

  for (INDEX_TYPE idx = 0; idx < nEvents; idx++)
    results[idx] = coeff * data[idx];

  return !(results[argc] == 0.0);
}

#if defined (MAIN)
int
main (int argc)
{
  double results[nEvents] = {0};
  double data[nEvents] = {0};

  return f (argc, results, data);
}
#endif
...

And investigate.sh:
...
#!/bin/bash

src=par-2.c

for parloops_factor in 0 2; do
    for index_type in "int" "unsigned int" "long" "unsigned long"; do
	rm -f *.c.*;

	./lean-c/install/bin/gcc -O2 $src -S \
	    -ftree-parallelize-loops=$parloops_factor \
	    -ftree-vectorize \
	    -fdump-tree-all-all \
	    "-DINDEX_TYPE=$index_type"

	vectdump=$src.132t.vect
	pardump=$src.129t.parloops

	vectorized=$(grep -c "LOOP VECTORIZED" $vectdump)

	if [ ! -f $pardump ]; then 
	    parallelized=0
	else
	    parallelized=$(grep -c "parallelizing inner loop" $pardump)
	fi

	echo "parloops_factor: $parloops_factor, index_type: $index_type:"
	echo "  vectorized: $vectorized, parallelized: $parallelized"
    done
done
...

If we're not parallelizing, vectorization succeeds:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...

If we're parallelizing, vectorization succeeds for (unsigned) long:
...
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...

but not for (unsigned) int:
...
parloops_factor: 2, index_type: int:
  vectorized: 0, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 0, parallelized: 1
...
Comment 1 Tom de Vries 2015-05-26 07:59:28 UTC
FWIW, this patch moves pass_parallelize_loops after pass_vectorize:
...
diff --git a/gcc/passes.def b/gcc/passes.def
index 4690e23..f0629ff 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -243,14 +243,14 @@ along with GCC; see the file COPYING3.  If not see
              NEXT_PASS (pass_dce);
          POP_INSERT_PASSES ()
          NEXT_PASS (pass_iv_canon);
-         NEXT_PASS (pass_parallelize_loops);
-         PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
-             NEXT_PASS (pass_expand_omp_ssa);
-         POP_INSERT_PASSES ()
          NEXT_PASS (pass_if_conversion);
          /* pass_vectorize must immediately follow pass_if_conversion.
             Please do not add any other passes in between.  */
          NEXT_PASS (pass_vectorize);
+         NEXT_PASS (pass_parallelize_loops);
+         PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
+             NEXT_PASS (pass_expand_omp_ssa);
+         POP_INSERT_PASSES ()
           PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
              NEXT_PASS (pass_dce);
           POP_INSERT_PASSES ()
...

And that makes the problem go away (btw, dump file names need adapting in investigate.sh):
...
$ ./investigate.sh 
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
parloops_factor: 2, index_type: int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...

Of course, the patch means we're no longer vectorizing parallelized loops, but parallelizing vectorized loops.
Comment 2 Tom de Vries 2015-05-26 08:11:43 UTC
Created attachment 35623 [details]
par-2.c.129t.parloops

For -DINDEX_TYPE=int, par-2.c.129t.parloops
Comment 3 Tom de Vries 2015-05-26 08:12:25 UTC
Created attachment 35624 [details]
par-2.c.130t.ompexpssa

par-2.c.130t.ompexpssa
Comment 4 Tom de Vries 2015-05-26 08:12:59 UTC
Created attachment 35625 [details]
par-2.c.131t.ifcvt

par-2.c.131t.ifcvt
Comment 5 Tom de Vries 2015-05-26 08:13:49 UTC
Created attachment 35626 [details]
par-2.c.132t.vect

par-2.c.132t.vect
Comment 6 Richard Biener 2015-05-26 10:58:37 UTC
I thought that parallelizing vectorized loops is harder (you eventually get extra prologue and epilogue loops, etc).
Comment 7 Tom de Vries 2015-05-26 12:54:18 UTC
(In reply to Richard Biener from comment #6)
> I thought that parallelizing vectorized loops is harder (you eventually get
> extra prologue and epilogue loops, etc).

Another example, par-4.c:
...
int __attribute__((noinline,noclone))
f (int argc, double *__restrict results, double *__restrict data, INDEX_TYPE n)
{
  double coeff = 12.2;

  for (INDEX_TYPE idx = 0; idx < n; idx++)
    results[idx] = coeff * data[idx];

  return !(results[argc] == 0.0);
}

#define nEvents 1000

#if defined (MAIN)
int
main (int argc)
{
  double results[nEvents] = {0};
  double data[nEvents] = {0};

  return f (argc, results, data, nEvents);
}
#endif
...

When not parallelizing, we vectorize without problems:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...


When parallelizing, we generate both a low-iteration-count loop and a split-off parallelized loop. The vectorizer vectorizes both loops (each of which contains an epilogue):
...
parloops_factor: 2, index_type: int:
  vectorized: 2, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 2, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 2, parallelized: 1
...

Except in the case of unsigned int, in which case it only vectorizes the low iteration count loop:
...
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
...
The other loop fails to vectorize in a fashion similar to that described for par-2.c with INDEX_TYPE (unsigned) int.
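
For readers not familiar with the two loops mentioned above, here is a rough source-level sketch of the shape of the code after parallelization. It is not taken from the dumps: the name f_versioned, the helper slice, the MIN_PER_THREAD threshold value and the slicing formula are assumptions made for illustration, and the threads are simulated by a sequential loop.

```c
#include <assert.h>
#include <stddef.h>

#define N_THREADS 2         /* mirrors -ftree-parallelize-loops=2 */
#define MIN_PER_THREAD 100  /* hypothetical threshold, for this sketch only */

/* Body of the split-off function: one thread handles [lo, hi).  */
static void
slice (double *restrict results, const double *restrict data,
       size_t lo, size_t hi)
{
  for (size_t idx = lo; idx < hi; idx++)
    results[idx] = 12.2 * data[idx];
}

/* Returns 1 if the "parallel" version was chosen, 0 otherwise.  */
int
f_versioned (double *restrict results, const double *restrict data, size_t n)
{
  assert (results != NULL && data != NULL);

  if (n < (size_t) N_THREADS * MIN_PER_THREAD)
    {
      /* Loop 1: the low iteration count loop, run sequentially.  */
      for (size_t idx = 0; idx < n; idx++)
        results[idx] = 12.2 * data[idx];
      return 0;
    }

  /* Loop 2: the split-off loop.  Each (simulated) thread runs slice ()
     on its own part of the iteration space.  */
  for (size_t t = 0; t < N_THREADS; t++)
    slice (results, data, n * t / N_THREADS, n * (t + 1) / N_THREADS);
  return 1;
}
```

Both loops are candidates for the vectorizer, which is why the "vectorized" count can be 2 in the output above.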
Comment 8 Tom de Vries 2015-05-26 13:24:33 UTC
For example par-4.c, if we use the same patch to interchange the passes, we get:

When not parallelizing, all loops get vectorized:
...
parloops_factor: 0, index_type: int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned int:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: long:
  vectorized: 1, parallelized: 0
parloops_factor: 0, index_type: unsigned long:
  vectorized: 1, parallelized: 0
...

When parallelizing, we parallelize one loop.
...
parloops_factor: 2, index_type: int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned int:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: long:
  vectorized: 1, parallelized: 1
parloops_factor: 2, index_type: unsigned long:
  vectorized: 1, parallelized: 1
...
The loop that is parallelized is the vectorized loop, not the epilogue.


So AFAIU:
- with this patch the epilogue is only performed by the main thread, after all
  the threads are done. Each thread handles one slice of the vectorized loop.
- without the patch, the epilogue is potentially executed by each thread.
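
To make the difference in epilogue placement concrete, here is a minimal sketch (not GCC output). It simulates the threads with a sequential loop and assumes a vectorization factor VF of 2. The function names, VF and the slicing formula are all hypothetical and chosen only for illustration. Each function returns the number of scalar epilogue iterations it executed.

```c
#include <assert.h>
#include <stddef.h>

#define VF 2  /* assumed vectorization factor: 2 doubles per vector */

/* Handle [lo, hi): a "vector" body in steps of VF, then a scalar
   epilogue.  Returns the number of epilogue iterations.  */
static size_t
chunk_vect (double *results, const double *data, size_t lo, size_t hi)
{
  size_t i = lo;
  for (; i + VF <= hi; i += VF)
    for (size_t k = 0; k < VF; k++)
      results[i + k] = 12.2 * data[i + k];

  size_t epilogue = hi - i;
  for (; i < hi; i++)
    results[i] = 12.2 * data[i];
  return epilogue;
}

/* Without the patch: the iteration space is sliced per thread first,
   then each slice is vectorized and gets its own epilogue.  */
size_t
parallelize_then_vectorize (double *results, const double *data,
                            size_t n, size_t nthreads)
{
  assert (nthreads > 0);
  size_t total = 0;
  for (size_t t = 0; t < nthreads; t++)
    total += chunk_vect (results, data,
                         n * t / nthreads, n * (t + 1) / nthreads);
  return total;
}

/* With the patch: the vector iterations are sliced across threads,
   and a single epilogue runs afterwards in the main thread.  */
size_t
vectorize_then_parallelize (double *results, const double *data,
                            size_t n, size_t nthreads)
{
  assert (nthreads > 0);
  size_t nvec = n / VF;  /* number of full vector iterations */
  for (size_t t = 0; t < nthreads; t++)
    for (size_t v = nvec * t / nthreads; v < nvec * (t + 1) / nthreads; v++)
      for (size_t k = 0; k < VF; k++)
        results[v * VF + k] = 12.2 * data[v * VF + k];

  size_t epilogue = 0;
  for (size_t i = nvec * VF; i < n; i++, epilogue++)
    results[i] = 12.2 * data[i];
  return epilogue;
}
```

With n = 10 and two threads, the first variant splits into two slices of 5 iterations, each of which leaves one scalar iteration over, while the second variant has 5 full vector iterations and no remainder. Both produce the same results array.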