#define N 10000000 int a[N], b[N], c[N]; main() { int i; #pragma omp parallel num_threads(4) for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } for (i = 0; i < N; i++) { a[i] += b[i] + c[i]; } } compiled with gcc -O3 -fopenmp The first loop gets parallelized by openmp, the second loop gets vectorized. But why does the parallelized loop not get vectorized too?
We have a duplicate (or related) bug somewhere.
Regression, doesn't happen on 4.8
I certainly see the same thing in 4.8, it seems this regressed in r161655. Anyway, I will have a look tomorrow.
I had a typo in the test case (remove += to make the loops identical) #define N 10000000 int a[N], b[N], c[N]; main() { int i; #pragma omp parallel num_threads(4) for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } } The case I saw vectorized on 4.8 (opensuse 13.1 compiler), but not on 5.0, was slightly different, auto parallelized #define N 10000000 int a[N], b[N], c[N]; main() { int i; for (i = 0; i < N; i++) { a[i] = b[i] + c[i]; } } With -O3 -mtree-parallelize-loops=4 I understand this will just internally generate openmp
The issue here is the induction variable is shared? <bb 5>: # _17 = PHI <0(2), _11(4)> _6 = b[_17]; _7 = c[_17]; _8 = _6 + _7; a[_17] = _8; _10 = .omp_data_i_3(D)->i; _11 = _10 + 1; .omp_data_i_3(D)->i = _11; if (_11 <= 9999999) goto <bb 4>; else goto <bb 3>; and we didn't apply store-motion to it: Memory reference 1: b[_17] Memory reference 2: c[_17] Memory reference 3: a[_17] Memory reference 4: .omp_data_i_3(D)->i ... Querying dependency of refs 4 and 1: dependent. Querying dependencies of ref 4 in loop 1: dependent I think the static chain never points to global decls thus we could special-case this. We could also treat members of the frame as non-allocated and thus not worry about trailing arrays and such. Unfortunately the .omp_data_i is a regular function parameter and not a static chain :/
(In reply to Richard Biener from comment #5) > The issue here is the induction variable is shared? Ah right, the testcase is clearly invalid. There is a data race on i, as well as data races on all the a[i] stores (because 4 different threads all do the same thing). Andi, did you mean #pragma omp parallel for num_threads(4) instead?
Ok, I think Tom had a similar patch for this but passing the .omp_data_i by effective "reference" (restrict qualified pointer and DECL_BY_REFERENCE) fixes this. Index: gcc/omp-low.c =================================================================== --- gcc/omp-low.c (revision 217692) +++ gcc/omp-low.c (working copy) @@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct layout_type (type); } - TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type); + TREE_TYPE (ctx->receiver_decl) + = build_qualified_type (build_pointer_type (type), TYPE_QUAL_RESTRICT); } /* Instantiate decls as necessary in CTX to satisfy the data sharing @@ -2006,6 +2007,7 @@ create_omp_child_function (omp_context * DECL_NAMELESS (t) = 1; DECL_ARG_TYPE (t) = ptr_type_node; DECL_CONTEXT (t) = current_function_decl; + DECL_BY_REFERENCE (t) = 1; TREE_USED (t) = 1; if (cilk_for_count) DECL_CHAIN (t) = DECL_ARGUMENTS (decl); there are more similar objects built, so the above may not fully optimize all cases.
(In reply to Richard Biener from comment #7) > Ok, I think Tom had a similar patch for this but passing the .omp_data_i by > effective "reference" (restrict qualified pointer and DECL_BY_REFERENCE) > fixes this. > > Index: gcc/omp-low.c > =================================================================== > --- gcc/omp-low.c (revision 217692) > +++ gcc/omp-low.c (working copy) > @@ -1517,7 +1517,8 @@ fixup_child_record_type (omp_context *ct > layout_type (type); > } > > - TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type); > + TREE_TYPE (ctx->receiver_decl) > + = build_qualified_type (build_pointer_type (type), TYPE_QUAL_RESTRICT); > } > > /* Instantiate decls as necessary in CTX to satisfy the data sharing > @@ -2006,6 +2007,7 @@ create_omp_child_function (omp_context * > DECL_NAMELESS (t) = 1; > DECL_ARG_TYPE (t) = ptr_type_node; > DECL_CONTEXT (t) = current_function_decl; > + DECL_BY_REFERENCE (t) = 1; > TREE_USED (t) = 1; > if (cilk_for_count) > DECL_CHAIN (t) = DECL_ARGUMENTS (decl); > > there are more similar objects built, so the above may not fully optimize > all cases. For the invalid testcase the DECL_BY_REFERENCE setting isn't needed and its positive effect on PTA could also be seen if using build_reference_type instead of build_pointer_type in the first hunk.
Valid testcase (well, assign sth useful to g...) #define N 10000000 int a[N], b[N], c[N]; main() { int i, g; #pragma omp parallel for num_threads(4) for (i = 0; i < N; i++) { a[i] = b[i] + c[i] + g; } } in the OMP fn we fail to hoist the load of .omp_child_i->g out of the loop and thus generate inferior code.
(In reply to Richard Biener from comment #9) > Valid testcase (well, assign sth useful to g...) > > #define N 10000000 > int a[N], b[N], c[N]; > > main() > { > > int i, g; > > #pragma omp parallel for num_threads(4) > for (i = 0; i < N; i++) { > a[i] = b[i] + c[i] + g; > } > } > > in the OMP fn we fail to hoist the load of .omp_child_i->g out of the > loop and thus generate inferior code. And does making receiver_decl restrict and/or reference type help with that? BTW, we probably should add some IPA omp pass that would try to constant propagate across from GOMP_parallel*/GOMP_task* callers to the *.omp_fn.* functions (or teach IPA-SRA to do that?).
On Tue, 18 Nov 2014, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63844 > > --- Comment #10 from Jakub Jelinek <jakub at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #9) > > Valid testcase (well, assign sth useful to g...) > > > > #define N 10000000 > > int a[N], b[N], c[N]; > > > > main() > > { > > > > int i, g; > > > > #pragma omp parallel for num_threads(4) > > for (i = 0; i < N; i++) { > > a[i] = b[i] + c[i] + g; > > } > > } > > > > in the OMP fn we fail to hoist the load of .omp_child_i->g out of the > > loop and thus generate inferior code. > > And does making receiver_decl restrict and/or reference type help with that? Yes. > BTW, we probably should add some IPA omp pass that would try to constant > propagate across from GOMP_parallel*/GOMP_task* callers to the *.omp_fn.* > functions (or teach IPA-SRA to do that?). I think aggregate IPA-CP does that, IPA-SRA cannot as the function has its address taken. Richard.
Yes should have been omp parallel for
>I think aggregate IPA-CP does that, IPA-SRA cannot as the function has >its address taken. Perhaps that case (only passing address to gomp runtime) could be special cased in the escape analysis.
Author: rguenth Date: Wed Nov 19 09:47:19 2014 New Revision: 217757 URL: https://gcc.gnu.org/viewcvs?rev=217757&root=gcc&view=rev Log: 2014-11-19 Richard Biener <rguenther@suse.de> PR tree-optimization/63844 * omp-low.c (fixup_child_record_type): Use a restrict qualified referece type for the receiver parameter. Modified: trunk/gcc/ChangeLog trunk/gcc/omp-low.c
Fixed on trunk(?)
GCC 4.8.4 has been released.
Author: rguenth Date: Mon Feb 23 11:14:25 2015 New Revision: 220912 URL: https://gcc.gnu.org/viewcvs?rev=220912&root=gcc&view=rev Log: 2015-02-23 Richard Biener <rguenther@suse.de> Backport from mainline 2014-11-19 Richard Biener <rguenther@suse.de> PR tree-optimization/63844 * omp-low.c (fixup_child_record_type): Use a restrict qualified referece type for the receiver parameter. 2014-11-27 Richard Biener <rguenther@suse.de> PR tree-optimization/61634 * tree-vect-slp.c: Include gimple-walk.h. (vect_detect_hybrid_slp_stmts): Rewrite to propagate hybrid down the SLP tree for one scalar statement. (vect_detect_hybrid_slp_1): New walker function. (vect_detect_hybrid_slp_2): Likewise. (vect_detect_hybrid_slp): Properly handle pattern statements in a pre-scan over all loop stmts. * gcc.dg/vect/pr61634.c: New testcase. 2015-01-14 Richard Biener <rguenther@suse.de> PR tree-optimization/59354 * tree-vect-slp.c (vect_build_slp_tree_1): Treat loads from groups larger than the slp group size as having gaps. * gcc.dg/vect/pr59354.c: New testcase. 2015-02-10 Richard Biener <rguenther@suse.de> PR tree-optimization/64909 * tree-vect-loop.c (vect_estimate_min_profitable_iters): Properly pass a scalar-stmt count estimate to the cost model. * tree-vect-data-refs.c (vect_peeling_hash_get_lowest_cost): Likewise. * gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c: New testcase. Added: branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr64909.c branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr59354.c branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr61634.c Modified: branches/gcc-4_9-branch/gcc/ChangeLog branches/gcc-4_9-branch/gcc/omp-low.c branches/gcc-4_9-branch/gcc/testsuite/ChangeLog branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c branches/gcc-4_9-branch/gcc/tree-vect-loop.c branches/gcc-4_9-branch/gcc/tree-vect-slp.c
Fixed.
Author: rguenth Date: Tue Feb 24 15:09:00 2015 New Revision: 220941 URL: https://gcc.gnu.org/viewcvs?rev=220941&root=gcc&view=rev Log: 2015-02-24 Richard Biener <rguenther@suse.de> Backport from mainline 2014-11-19 Richard Biener <rguenther@suse.de> PR tree-optimization/63844 * omp-low.c (fixup_child_record_type): Use a restrict qualified referece type for the receiver parameter. Modified: branches/gcc-4_8-branch/gcc/ChangeLog branches/gcc-4_8-branch/gcc/omp-low.c