GCC generates incorrect code for the SIMD loop (#pragma [omp] simd) with function call that have pointer parameters. (test case attached) gcc test.c foo.c -fcilkplus -O3 ./a.out ; echo $? 1 gcc test.c foo.c -fcilkplus -O0 ./a.out ; echo $? 0 the same for -fopenmp -DENABLE_OMP instead of -fcilkplus Moreover g++ crashes on the test: g++ test.c -fcilkplus -O3 -c test.c: In function 'int main()': test.c:10:5: error: dead STMT in EH table int main() ^ # .MEM_11 = VDEF <.MEM_39> foo (_10, &v1, &v2); test.c:10:5: internal compiler error: verify_gimple failed 0xb774b3 verify_gimple_in_cfg(function*) /export/users/mstester/stability/svn/trunk/gcc/tree-cfg.c:4853 0xa9fc52 execute_function_todo /export/users/mstester/stability/svn/trunk/gcc/passes.c:1853 0xaa05a3 execute_todo /export/users/mstester/stability/svn/trunk/gcc/passes.c:1887 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <http://gcc.gnu.org/bugs.html> for instructions.
Created attachment 31981 [details] test case
Confirmed.
Vectorizer dump snippet for main: foo.simdclone.0 (vect__12.7_3, vect_cst_.8_53, vect_cst_.8_53, vect_cst_.9_51, vect_cst_.9_51); GIMPLE_NOP vect_v1.12_37 = MEM[(int *)vectp_v1.10_39]; (1) v1.0_14 = v1; vect_v2.16_60 = MEM[(int *)vectp_v2.14_58]; (2) v2.1_15 = v2; vect__16.18_63 = vect_cst_.13_7 * vect_cst_.17_62; <--- constants instead of _16 = v1.0_14 * v2.1_15; vect_v1.12_37 and MEM[(int *)vectp_a.19_65] = vect__16.18_63; vect_v2.16_60 Then DCE destroys (1) and (2) and later LIM hoists the multiplication away from the loop.
vect details show that v1.0_14 = v1 and v2.1_15 = v2 are treated as invariants: test.c:24:14: note: ------>vectorizing statement: v1.0_14 = v1; test.c:24:14: note: transform statement. test.c:24:14: note: transform load. ncopies = 1 test.c:24:14: note: create vector_type-pointer variable to type: vector(4) int vectorizing a pointer ref: v1 test.c:24:14: note: created vectp_v1.11_40 test.c:24:14: note: add new stmt: vect_v1.12_37 = MEM[(int *)vectp_v1.10_39]; test.c:24:14: note: hoisting out of the vectorized loop: v1.0_14 = v1; test.c:24:14: note: created new init_stmt: vect_cst_.13_7 = {v1.0_8, v1.0_8, v1.0_8, v1.0_8}; test.c:24:14: note: ------>vectorizing statement: v2.1_15 = v2; test.c:24:14: note: transform statement. test.c:24:14: note: transform load. ncopies = 1 test.c:24:14: note: create vector_type-pointer variable to type: vector(4) int vectorizing a pointer ref: v2 test.c:24:14: note: created vectp_v2.15_1 test.c:24:14: note: add new stmt: vect_v2.16_60 = MEM[(int *)vectp_v2.14_58]; test.c:24:14: note: hoisting out of the vectorized loop: v2.1_15 = v2; test.c:24:14: note: created new init_stmt: vect_cst_.17_62 = {v2.1_61, v2.1_61, v2.1_61, v2.1_61}; Step for both loads determined as 0. Seems support for such case should be explicitly added
I guess during gimplification when gimplifying #pragma omp simd body we should add all the local variables (declared in the body) that aren't static and are address taken to private clause on the #pragma omp simd, so that they will be handled like if you in your testcase do instead: int v1, v2; #pragma omp simd private (v1, v2) for(i = 0; i < 1000; i++) { foo(a[i], &v1, &v2); a[i] = v1 * v2; } We don't vectorize that right now, but that is something that is likely fixable.
Created attachment 32077 [details] Untested fix Untested fix. Likely I'll need to do something more, so that we don't emit extra decls for the debug info outside of the #pragma omp simd (flag on the private clause?) and add a testcase. And, the reason why with this patch (or the adjusted testcase with explicit private clause) isn't vectorized is that we'd need to teach vectorizer how to vectorize SSA_NAME = &array[SSA_NAME]; where the second SSA_NAME is GOMP_SIMD_LANE () result and array is an "omp simd array" array, as a vector of addresses (initialized before the loop), containing { &array[0], &array[1], ..., &array[vf-1] }.
Author: jakub Date: Sat Feb 8 09:10:14 2014 New Revision: 207629 URL: http://gcc.gnu.org/viewcvs?rev=207629&root=gcc&view=rev Log: PR c/59984 * gimplify.c (gimplify_bind_expr): In ORT_SIMD region mark local addressable non-static vars as GOVD_PRIVATE instead of GOVD_LOCAL. * omp-low.c (lower_omp_for): Move gimple_bind_vars and BLOCK_VARS of gimple_bind_block to new_stmt rather than copying them. * gcc.dg/vect/pr59984.c: New test. Added: trunk/gcc/testsuite/gcc.dg/vect/pr59984.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimplify.c trunk/gcc/omp-low.c trunk/gcc/testsuite/ChangeLog
The wrong-code issue is now fixed, but we still don't vectorize this. See http://gcc.gnu.org/ml/gcc-patches/2014-02/msg00514.html for details. Note that without making the two pointer parameters linear on the elemental function it actually will not be successfully vectorized unless you have CPU that can do scatter (Skylake, anything else?). Keeping this PR open so that I don't forget to address this.
GCC 4.9.0 has been released
GCC 4.9.1 has been released.
GCC 4.9.2 has been released.
Created attachment 33963 [details] test case where pragma simd disable vectorization The following test case compiled with "-Ofast" vectorize the loop in the GetXsum function. Adding "-fopenmp" leads to failed vectorization due to: simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed D.2329[_7].x = _12; It looks like before the patch in this Bug loop was vectorized with -fopenmp.
(In reply to Stupachenko Evgeny from comment #12) > Created attachment 33963 [details] > test case where pragma simd disable vectorization > > The following test case compiled with "-Ofast" vectorize the loop in the > GetXsum function. > Adding "-fopenmp" leads to failed vectorization due to: > > simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed > D.2329[_7].x = _12; > > It looks like before the patch in this Bug loop was vectorized with -fopenmp. The testcase is invalid, you need reduction(+:sim) clause, otherwise the loop has invalid inter-iteration dependencies. That said, even with that, with C it vectorizes fine, while with C++ it doesn't. In *.einline the C -> C++ difference is (before that I don't see such): - D.1856[_19].x = _24; - _26 = &D.1856[_19]; - _27 = MEM[(const struct XY *)_26].x; + D.2352[_19].x = _24; + _26 = &D.2352[_19]; + _40 = MEM[(float *)_26]; In *.ealias the C -> C++ difference is: - D.1856[_19].x = _24; - _27 = MEM[(const struct XY *)&D.1856][_19].x; + D.2352[_19].x = _24; + _26 = &D.2352[_19]; + _40 = MEM[(float *)_26]; and apparently FRE1 handles the former but not the latter. Richard? As the struct contains float at that offset, I don't see why FRE1 shouldn't optimize that to _40 = _24. Shorter testcase for the FRE1 missed-optimization: struct S { float a, b; }; float foo (int x, float y) { struct S z[1024]; z[x].a = y; struct S *p = &z[x]; float *q = (float *) p; return *q; } (dunno why the inliner handles things differently between C and C++ on the #c12 testcase). Now, as for vectorizing it even if FRE isn't able to optimize it, we currently don't support interleaved accesses to the "omp simd array" attributed arrays, perhaps we could at least some easy cases thereof, and supposedly we should teach SRA about those too (like, if the arrays aren't addressable and aren't accesses as whole, but just individual fields, split it into separate "omp simd array" accesses instead. In this particular case due to the FRE missed optimization it is addressable though. Or perhaps teach fold to gimple folding to fold that: q_5 = &z[x_2(D)]; _6 = *q_5; back into: _6 = z[x_2(D)].x; ?
On Fri, 14 Nov 2014, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59984 > > Jakub Jelinek <jakub at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|ASSIGNED |NEW > CC| |jamborm at gcc dot gnu.org, > | |rguenth at gcc dot gnu.org > Assignee|jakub at gcc dot gnu.org |unassigned at gcc dot gnu.org > > --- Comment #13 from Jakub Jelinek <jakub at gcc dot gnu.org> --- > (In reply to Stupachenko Evgeny from comment #12) > > Created attachment 33963 [details] > > test case where pragma simd disable vectorization > > > > The following test case compiled with "-Ofast" vectorize the loop in the > > GetXsum function. > > Adding "-fopenmp" leads to failed vectorization due to: > > > > simd_issue.cpp:26:18: note: not vectorized: data ref analysis failed > > D.2329[_7].x = _12; > > > > It looks like before the patch in this Bug loop was vectorized with -fopenmp. > > The testcase is invalid, you need reduction(+:sim) clause, otherwise the loop > has invalid inter-iteration dependencies. > > That said, even with that, with C it vectorizes fine, while with C++ it > doesn't. > > In *.einline the C -> C++ difference is (before that I don't see such): > - D.1856[_19].x = _24; > - _26 = &D.1856[_19]; > - _27 = MEM[(const struct XY *)_26].x; > + D.2352[_19].x = _24; > + _26 = &D.2352[_19]; > + _40 = MEM[(float *)_26]; > > In *.ealias the C -> C++ difference is: > - D.1856[_19].x = _24; > - _27 = MEM[(const struct XY *)&D.1856][_19].x; > + D.2352[_19].x = _24; > + _26 = &D.2352[_19]; > + _40 = MEM[(float *)_26]; > > and apparently FRE1 handles the former but not the latter. Richard? > As the struct contains float at that offset, I don't see why FRE1 shouldn't > optimize that to _40 = _24. > > Shorter testcase for the FRE1 missed-optimization: > struct S { float a, b; }; > > float > foo (int x, float y) > { > struct S z[1024]; > z[x].a = y; > struct S *p = &z[x]; > float *q = (float *) p; > return *q; > } I will have a look - it's designed to handle that fine. > (dunno why the inliner handles things differently between C and C++ on the #c12 > testcase). Now, as for vectorizing it even if FRE isn't able to optimize it, > we currently don't support interleaved accesses to the "omp simd array" > attributed arrays, perhaps we could at least some easy cases thereof, and > supposedly we should teach SRA about those too (like, if the arrays aren't > addressable and aren't accesses as whole, but just individual fields, split it > into separate "omp simd array" accesses instead. In this particular case due > to the FRE missed optimization it is addressable though. > Or perhaps teach fold to gimple folding to fold that: > q_5 = &z[x_2(D)]; > _6 = *q_5; > back into: > _6 = z[x_2(D)].x; > ? No, that's generally invalid (forwprop does that if types match closely enough which appearantly they don't?)
Ok, so SCCVN only can forward constant addresses (because variable ones wouldn't fit in the same refs vector without "inserting" ops). See vn_reference_maybe_forwprop_address. It's also required so that we can re-materialize the expression in PRE for example and not wind up with invalid access paths there (similar to the reason why forwprop doesn't do this transform). I think it might be possible to teach SCCVN to do this, but it might be not trivial.
GCC 4.9.3 has been released.
Cilk Plus, deprecated for 7.x, will not be in 8.x.