The use of OpenMP to parallelize a loop inhibits auto-vectorization. This defeats the benefit of parallelization, making the parallel code slower than the sequential one. Is a version of OpenMP foreseen that preserves auto-vectorization?

Example on Linux 2.6.18-194.11.3.el5.cve20103081 #1 SMP Thu Sep 16 15:17:10 CEST 2010 x86_64 x86_64 x86_64 GNU/Linux, using GNU C++ (GCC) version 4.6.0 20100408 (experimental) (x86_64-unknown-linux-gnu), compiled by GNU C version 4.6.0 20100408 (experimental), GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.1.
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096

Compiling this simple example:

cat openmpvector.cpp

int main() {
  const unsigned int nEvents = 1000;
  double results[nEvents] = {0};
  double pData[nEvents] = {0};
  double coeff = 12.2;

  #pragma omp parallel for
  for (int idx = 0; idx < (int)nEvents; idx++) {
    results[idx] = coeff*pData[idx];
  }

  return results[0]; // avoid optimization of "dead" code
}

gives

g++ -O2 -fopenmp -ftree-vectorize -ftree-vectorizer-verbose=7 openmpvector.cpp

openmpvector.cpp:11: note: not vectorized: loop contains function calls or data references that cannot be analyzed
openmpvector.cpp:9: note: vectorized 0 loops in function.
The problem is that local variables are accessed indirectly via the .omp_data_i pointer and alias analysis is unable to hoist the load of .omp_data_i_12(D)->coeff across the store to *pretmp.5_27[idx_1]:

<bb 3>:
  pretmp.5_23 = .omp_data_i_12(D)->pData;
  pretmp.5_27 = .omp_data_i_12(D)->results;

<bb 4>:
  # idx_1 = PHI <idx_8(3), idx_18(5)>
  D.2142_14 = *pretmp.5_23[idx_1];
  D.2143_15 = .omp_data_i_12(D)->coeff;
  D.2144_16 = D.2142_14 * D.2143_15;
  *pretmp.5_27[idx_1] = D.2144_16;
  idx_18 = idx_1 + 1;
  if (D.2139_10 > idx_18)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;

A fix is to make the argument DECL_BY_REFERENCE and the type restrict qualified. This will make alias analysis assume that the pointed-to object is not aliased unless later somebody takes its address.

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c	(revision 165474)
+++ gcc/omp-low.c	(working copy)
@@ -1349,7 +1349,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }
 
-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl) = build_qualified_type (build_pointer_type (type),
+							 TYPE_QUAL_RESTRICT);
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -1584,6 +1585,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
   if (!task_copy)

Not completely enough though, as we consider *.omp_data_i escaped (and thus reachable by NONLOCAL). The following fixes that (with unknown consequences; I think fortran array descriptors are the only other user):

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 165474)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -5575,7 +5575,6 @@ intra_create_variable_infos (void)
 	  var_ann_t ann;
 
 	  heapvar = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (t)), "PARM_NOALIAS");
-	  DECL_EXTERNAL (heapvar) = 1;
 	  heapvar_insert (t, 0, heapvar);
 	  ann = get_var_ann (heapvar);
 	  ann->is_heapvar = 1;
@@ -5590,6 +5589,12 @@ intra_create_variable_infos (void)
 	  rhsc.offset = 0;
 	  process_constraint (new_constraint (lhsc, rhsc));
 	  vi->is_restrict_var = 1;
+	  do
+	    {
+	      make_constraint_from (vi, nonlocal_id);
+	      vi = vi->next;
+	    }
+	  while (vi);
 	  continue;
 	}

This means that stores to *.omp_data_i in the omp fn are considered not escaping to the caller (and thus can be DSEd). With the above patch the loop is vectorized with a runtime alias check, as we can't see that results and pData do not alias. Not even with IPA-PTA, as the OMP function escapes through __builtin_GOMP_parallel_start.
If I hack PTA to make the omp function not escape, IPA-PTA computes

<bb 4>:
  # idx_1 = PHI <idx_11(3), idx_18(6)>
  # PT = { D.2069 }
  D.2112_13 = .omp_data_i_12(D)->pData;
  D.2113_14 = *D.2112_13[idx_1];
  D.2114_15 = .omp_data_i_12(D)->coeff;
  D.2115_16 = D.2113_14 * D.2114_15;
  # PT = { D.2068 }
  D.2116_17 = .omp_data_i_12(D)->results;

thus it knows what the pointers point to and we vectorize without a runtime alias check (we still have no idea about alignment, though, but that's probably correct). So it might be worth annotating some of the OMP builtins with the fnspec attribute.
Created attachment 22053 [details]
fnspec attr test

Like this (ugh). Fixes the thing with -fipa-pta on trunk.
A few things to consider:

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0);
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end ();

For PTA purposes we can ignore that __builtin_GOMP_parallel_start calls main._omp_fn.0, and I suppose the function pointer doesn't escape through it. We can't assume that .omp_data_o.1 does not escape through __builtin_GOMP_parallel_start though, as __builtin_GOMP_parallel_end needs to be a barrier for optimization (and thus needs to be considered reading and writing .omp_data_o.1). As it doesn't take any arguments, the only way to ensure that is by making .omp_data_o.1 escape. We could probably arrange for __builtin_GOMP_parallel_end to get &.omp_data_o.1 as an argument solely for alias-analysis purposes though. In that case we could use ".xw" for __builtin_GOMP_parallel_start and ".w" for __builtin_GOMP_parallel_end.
In case anybody is wondering, it seems fixed in:

Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/user/i/innocent/w2/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/afs/cern.ch/user/i/innocent/w2 --enable-languages=c,c++,fortran -enable-gold=yes --enable-lto --with-build-config=bootstrap-lto --with-gmp-lib=/usr/local/lib64 --with-mpfr-lib=/usr/local/lib64 -with-mpc-lib=/usr/local/lib64 --enable-cloog-backend=isl --with-cloog=/usr/local --with-ppl-lib=/usr/local/lib64 CFLAGS='-O2 -ftree-vectorize -fPIC' CXXFLAGS='-O2 -fPIC -ftree-vectorize -fvisibility-inlines-hidden'
Thread model: posix
gcc version 4.7.0 20110725 (experimental) (GCC)

c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp

openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_32
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_34
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_store_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: cost model: Adding cost of checks for loop versioning aliasing.
openmpvector.cpp:11: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
openmpvector.cpp:11: note: Cost model analysis:
  Vector inside of loop cost: 7
  Vector outside of loop cost: 19
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7
openmpvector.cpp:11: note: Profitability threshold = 6
openmpvector.cpp:11: note: Profitability threshold is 6 loop iterations.
openmpvector.cpp:11: note: create runtime check for data references *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: create runtime check for data references .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: created 2 versioning for alias checks.
openmpvector.cpp:11: note: LOOP VECTORIZED.
openmpvector.cpp:9: note: vectorized 1 loops in function.

Graphite breaks it, though:

c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp -fgraphite -fgraphite-identity -floop-block -floop-flatten -floop-interchange -floop-strip-mine -ftree-loop-linear -floop-parallelize-all

openmpvector.cpp:9: note: not vectorized: data ref analysis failed D.2372_47 = *pretmp.11_32[D.2403_49];
openmpvector.cpp:9: note: vectorized 0 loops in function.
Good. But Graphite breaks it; let's add Sebastian to CC.
Any update on this? I do see loops getting slower even for large nx*ny sometimes after omp on gcc 4.6.2, e.g.,

#pragma omp parallel for
for (int iy = 0; iy < ny; iy++) {
  for (int ix = 0; ix < nx; ix++) {
    dest[(size_t)iy*nx + ix] = src[(size_t)iy*nx + ix] * 2;
  }
}

Sometimes gcc won't vectorize the inner loop; I have to put it into an inline function to force it. The performance is only marginally better after that.

ps: I split the loop into two levels because I noticed previously that omp parallel inhibits auto-vectorization; I forgot which gcc version I used ... Graphite did improve the scalability of openmp programs in my experience, so the fix (with tests) is important ...

(In reply to comment #6)
> Good. But it Graphite breaks it, let's add Sebastian in CC..
See also pr60997.
Created attachment 33348 [details]
patch for 4.6 branch

- patches from comment 1 and 3
- c-common.c patch re-applied to lto/lto-lang.c to fix the LTO build breaker
- testcases added

Patch applies to the 4.6 branch, a non-bootstrap build succeeds, and the added test cases pass.

In the 4.7 branch, this snippet:
...
@@ -5612,6 +5611,12 @@ intra_create_variable_infos (void)
 	  rhsc.offset = 0;
 	  process_constraint (new_constraint (lhsc, rhsc));
 	  vi->is_restrict_var = 1;
+	  do
+	    {
+	      make_constraint_from (vi, nonlocal_id);
+	      vi = vi->next;
+	    }
+	  while (vi);
 	  continue;
 	}
...

conflicts with:
...
	  rhsc.offset = 0;
	  process_constraint (new_constraint (lhsc, rhsc));
	  for (; vi; vi = vi->next)
	    if (vi->may_have_pointers)
	      {
		if (vi->only_restrict_pointers)
		  make_constraint_from_global_restrict (vi, "GLOBAL_RESTRICT");
		else
		  make_copy_constraint (vi, nonlocal_id);
	      }
	  continue;
	}
...
An observation. A patch like this allows vectorization without an alias check:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8290a65..501d631 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1241,7 +1241,12 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
+#if 0
     type = build_pointer_type (type);
+#else
+    type = build_qualified_type (build_reference_type (type),
+				 TYPE_QUAL_RESTRICT);
+#endif
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
...

The problem is that we don't have the information at this point to decide between pointer and restrict reference. If var were a scalar, we could use addr_taken to ensure that the var is not aliased. For arrays that doesn't work.

If the C frontend would distinguish between:
- element read: result[x], and
- alias created: result, &result, &result[x]
and store that in an alias_created property, we could use that property to decide between pointer and restrict reference. That would not fix the problem in general though, since that solution would no longer work if the example were rewritten using pointers.

I wonder if postponing omp_expand till after ealias would give us enough information to update the field reference types with a restrict tag (or not) at that point. [ Though I'm not sure doing that update there would actually have the desired effect. ]
The ipa-pta solution no longer works. In 4.6, we had:
...
  # USE = anything
  # CLB = anything
  GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
  # USE = anything
  # CLB = anything
  main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
  # USE = anything
  # CLB = anything
  GOMP_parallel_endD.1049 ();
...

On trunk, we now have:
...
  # USE = anything
  # CLB = anything
  GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
...

So there's no longer a path in the call graph from main to main._omp_fn. Perhaps a dummy body for GOMP_parallel could fix that.
(In reply to vries from comment #11)
> So there's no longer a path in the call graph from main to main._omp_fn.
> Perhaps a dummy body for GOMP_parallel could fix that.

Hm? The IPA PTA "solution" was to tell IPA PTA that the call to GOMP_parallel doesn't make .omp_data_o escape.

The attached patch doesn't work because it only patches GOMP_parallel_start, not GOMP_parallel.

Of course it would be even better to teach IPA PTA that GOMP_parallel really invokes main._omp_fn.0 with a &.omp_data_o.1 argument.

How many different ways of IL do we get doing this kind of indirect function invocation?
(In reply to Richard Biener from comment #12)
> Of course it would even better to teach IPA PTA that GOMP_parallel
> is really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.
>
> How many different ways of IL do we get doing this kind of indirect
> function invocations?

Other IPA propagators like IPA-CP probably would also like to know this.

I see various builtins taking an OMPFN argument in omp-builtins.def. If we assume the GOMP runtime itself is "transparent", do we know how the builtins end up calling the actual implementation function?
(In reply to Richard Biener from comment #13)
> I see various builtins taking a OMPFN argument in omp-builtins.def. If we
> assume the GOMP runtime itself is "transparent" then do we know how the
> builtins end up calling the actual implementation function?

GOMP_parallel* call the ompfn function (first argument) with the second argument (a pointer to some structure filled before GOMP_parallel* and dead, using a clobber, after the call) as the only argument. The callback function can be called just once or more times (once in each thread).

Then there is GOMP_task*, where there are one or two callback functions. If just one (the other one is NULL), then the first callback function (1st argument) is called either with the second argument as the only argument, or with a pointer to a memory block that was filled with memcpy from the second argument. If the third argument (second callback) is non-NULL, then that callback is called instead of the memcpy, and the pointers can be to two different structures.

GOMP_target is another case, but there is often a cross-device boundary between the two, so it is much harder to model for IPA-PTA etc. purposes.

So, schematically, GOMP_parallel* (fn1, data1, ...) performs:

  if (somecond)
    for (...)
      pthread_create (..., fn1, data1);
  fn1 (data1);
  if (somecond)
    for (...)
      pthread_join (...);

and GOMP_task (fn1, data1, fn2, ...) performs:

  if (somecond)
    {
      if (fn2 == 0)
	fn1 (data1);
      else
	{
	  void *buf = alloca (...); // Takes care also about alignment
	  fn2 (buf, data1);
	  fn1 (buf);
	}
    }
  else
    {
      void *buf = malloc (...); // Takes care also about alignment
      if (fn2 == 0)
	memcpy (buf, data1, ...);
      else
	fn2 (buf, data1);
      // Arrange for fn1 (buf); to be called at some point later (like C++ futures)
    }

The purpose of fn2 is to run copy constructors of the vars, for vars that will be residing within buf.
(In reply to Richard Biener from comment #12)
> Hm? The IPA PTA "solution" was to tell IPA PTA that the call to
> GOMP_parallel

[ GOMP_parallel_start ]

> doesn't make .omp_data_o escape.

Right, for 4.6, adding fnspec ".rw" to GOMP_parallel_start has this effect in ipa-pta:
...
D.1505_14 = { ESCAPED NONLOCAL pData }
D.1509_18 = { ESCAPED NONLOCAL results }
-->
D.1505_14 = { pData }
D.1509_18 = { results }
...

where _14 and _18 are the omp_data_i-relative loads in the split-off function:
...
# VUSE <.MEMD.1514_20>
# PT = nonlocal
D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;

# VUSE <.MEMD.1514_20>
D.1506_15 = *D.1505_14[idxD.1495_1];

...

# VUSE <.MEMD.1514_20>
# PT = nonlocal
D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;

# .MEMD.1514_22 = VDEF <.MEMD.1514_20>
*D.1509_18[idxD.1495_1] = D.1508_17;
...

> The attached patch doesn't work because it only patches GOMP_parallel_start,
> not GOMP_parallel.

[ GOMP_parallel_start is no longer around on trunk. ]

Applying the 4.6 patch on trunk (and dropping the loop in the hunk for intra_create_variable_infos that no longer applies cleanly) and applying fnspec ".rw" to GOMP_parallel gives us in ipa-pta:
...
_17 = { }
_21 = { }
...

where _17 and _21 are the omp_data_i-relative loads in the split-off function:
...
# VUSE <.MEM_4>
# PT = nonlocal escaped
_17 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base 1].pDataD.1719;

# VUSE <.MEM_4>
_18 = *_17[idx_1];

# VUSE <.MEM_4>
# PT = nonlocal escaped
_21 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base 1].resultsD.1721;

# .MEM_22 = VDEF <.MEM_4>
*_21[idx_1] = _20;
...

It is reasonable to assume that we are no longer able to relate these loads in the split-off function back to pData and results in the donor function, because there is no longer a direct call to main._omp_fn in the donor function. On 4.6 that direct call still existed; on trunk it does not.
(In reply to vries from comment #16)
> It is reasonable to assume that we no longer are able to relate back these
> loads in the split-off function to pData and result in the donor function,
> due to the fact that there's no longer a direct function call to
> main._omp_fn in the donor function.
>
> On 4.6, that direct function call to main._omp_fn still existed. On trunk,
> not anymore.

In fact it even looks like wrong IPA PTA results to me (_17 and _21 point to nothing).

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c	(revision 223737)
+++ gcc/tree-ssa-structalias.c	(working copy)
@@ -7372,7 +7372,8 @@ ipa_pta_execute (void)
	 constraints for parameters.  */
       if (node->used_from_other_partition
	  || node->externally_visible
-	  || node->force_output)
+	  || node->force_output
+	  || node->address_taken)
	{
	  intra_create_variable_infos (func);

fixes that. Of course that makes a solution handling the OMP builtins specially not work: if the function has its address taken we don't know whether it is called from anywhere else. The fix is required for correctness though. With it we get:

_17 = { ESCAPED NONLOCAL }
_21 = { ESCAPED NONLOCAL }

_18 = *_17[idx_3];
*_21[idx_3] = _20;

Handling the OMP builtin specially will only get the solution amended to { ESCAPED NONLOCAL &results } and { ESCAPED NONLOCAL &pData } and thus still conflict.
The *.omp_fn.* functions do indeed have to be addressable, because that is how they are passed to the libgomp entry points, but they are never called by anything but the libgomp runtime. For GOMP_parallel*, they are only called before the GOMP_parallel* function exits; for GOMP_task* they can be called at some later point.
On Wed, 27 May 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
>
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> The *.omp_fn.* functions indeed, while they necessarily have to be addressable,
> because that is how they are passed to the libgomp entrypoints, are never
> called by anything but the libgomp runtime. For GOMP_parallel*, they are only
> called before the GOMP_parallel* function exits, for GOMP_task* they could be
> called at some later point.

Ok, so this just means that IPA PTA would need to handle those specially (and thus the OMP functions should be marked specially in the cgraph node). Not that I think IPA PTA is anywhere near production ready (or that I have time to fix it up properly...). Just testing the addressable fix now.
This patch seems to have the desired effect on the original testcase:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 830db75..996756b 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region *region,
       if (collapse_bb == NULL)
	 loop->latch = cont_bb;
       add_loop (loop, body_bb->loop_father);
+      loop->safelen = INT_MAX;
     }
 }
...

AFAIU, adding the omp for to the loop is an assertion that the loop is independent. It seems reasonable to assume that if the original loop was independent, the loop operating on a slice of the original iteration space will be independent as well.
(In reply to vries from comment #20)
> AFAIU, adding the omp for to the loop is an assertion that the loop is
> independent. It seems reasonable to assume that if the original loop was
> independent, the loop operating on a slice of the original iteration space
> will be independent as well.

That is very much wrong. Static scheduling, both nochunk and chunk, doesn't imply in any way that the iterations are independent. The OpenMP standard says how the work is split among the threads: with nochunk, threads get consecutive sets of iterations as one chunk of approximately the same size. Even though it is not exactly specified how the iteration space is divided (for nochunk), if you make the loop iterations independent you would break many observable properties (say, through threadprivate vars, omp_get_thread_num etc.).

Note that loop->safelen == INT_MAX is actually weaker than independent iterations. When loop->safelen == INT_MAX there can be dependencies, but only of certain kinds: it says that it is equivalent to run the loop normally and to run simultaneously (or emulated) the first statements of all the iterations, then the second statements, and so on (so the compiler may vectorize with any vectorization factor it wants).
https://gcc.gnu.org/ml/gcc-patches/2015-11/msg03448.html
Author: vries
Date: Mon Nov 30 16:34:26 2015
New Revision: 231076

URL: https://gcc.gnu.org/viewcvs?rev=231076&root=gcc&view=rev
Log:
Handle BUILT_IN_GOMP_PARALLEL in ipa-pta

2015-11-30  Tom de Vries  <tom@codesourcery.com>

	PR tree-optimization/46032
	* tree-ssa-structalias.c (find_func_aliases_for_call_arg): New
	function, factored out of ...
	(find_func_aliases_for_call): ... here.
	(find_func_aliases_for_builtin_call, find_func_clobbers): Handle
	BUILT_IN_GOMP_PARALLEL.
	(ipa_pta_execute): Same.  Handle node->parallelized_function as a
	local function.

	* gcc.dg/pr46032.c: New test.

	* testsuite/libgomp.c/pr46032.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr46032.c
    trunk/libgomp/testsuite/libgomp.c/pr46032.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
    trunk/libgomp/ChangeLog
Patch with testcase committed; marking resolved-fixed.