Bug 46032 - openmp inhibits loop vectorization
Summary: openmp inhibits loop vectorization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.5.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Tom de Vries
URL:
Keywords: missed-optimization, patch
Depends on:
Blocks:
 
Reported: 2010-10-15 07:23 UTC by vincenzo Innocente
Modified: 2022-03-02 02:14 UTC
CC List: 6 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-10-15 10:08:39


Attachments
fnspec attr test (878 bytes, patch)
2010-10-15 11:51 UTC, Richard Biener
patch for 4.6 branch (2.03 KB, patch)
2014-08-18 13:56 UTC, Tom de Vries

Description vincenzo Innocente 2010-10-15 07:23:20 UTC
The use of OpenMP to parallelize a loop inhibits auto-vectorization.
This defeats the benefits of parallelization, making the parallel code slower than the sequential one.
Is a version of OpenMP foreseen that preserves auto-vectorization?

Example
on
Linux  2.6.18-194.11.3.el5.cve20103081 #1 SMP Thu Sep 16 15:17:10 CEST 2010 x86_64 x86_64 x86_64 GNU/Linux
using
GNU C++ (GCC) version 4.6.0 20100408 (experimental) (x86_64-unknown-linux-gnu)
	compiled by GNU C version 4.6.0 20100408 (experimental), GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.1
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
compiling this simple example
cat openmpvector.cpp
int main()
{

 const unsigned int nEvents = 1000;
 double results[nEvents] = {0};
 double pData[nEvents] = {0};
 double coeff = 12.2;

#pragma omp parallel for
 for (int idx = 0; idx<(int)nEvents; idx++) {
   results[idx] = coeff*pData[idx];
 }

 return results[0]; // avoid optimization of "dead" code

}

gives
g++  -O2 -fopenmp -ftree-vectorize -ftree-vectorizer-verbose=7 openmpvector.cpp

openmpvector.cpp:11: note: not vectorized: loop contains function calls or data references that cannot be analyzed
openmpvector.cpp:9: note: vectorized 0 loops in function.
Comment 1 Richard Biener 2010-10-15 10:08:39 UTC
The problem is that local variables are accessed indirectly via the
.omp_data_i pointer and alias analysis is unable to hoist the load of
.omp_data_i_12(D)->coeff across the store to *pretmp.5_27[idx_1].

A fix is to make the argument DECL_BY_REFERENCE and the
type restrict-qualified.  This will make alias analysis assume that
the pointed-to object is not aliased unless somebody later takes its
address.

<bb 3>:
  pretmp.5_23 = .omp_data_i_12(D)->pData;
  pretmp.5_27 = .omp_data_i_12(D)->results;

<bb 4>:
  # idx_1 = PHI <idx_8(3), idx_18(5)>
  D.2142_14 = *pretmp.5_23[idx_1];
  D.2143_15 = .omp_data_i_12(D)->coeff;
  D.2144_16 = D.2142_14 * D.2143_15;
  *pretmp.5_27[idx_1] = D.2144_16;
  idx_18 = idx_1 + 1;
  if (D.2139_10 > idx_18)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;
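
To illustrate (a minimal C sketch of the outlined function; the struct and
names are illustrative, not the exact compiler-generated ones): without
restrict on the incoming pointer, the store through data->results may alias
data->coeff, so the load of data->coeff cannot be hoisted out of the loop.

struct omp_data { double *results; double *pData; double coeff; };

void
omp_fn (struct omp_data * restrict data, int n)
{
  for (int idx = 0; idx < n; idx++)
    data->results[idx] = data->coeff * data->pData[idx];
}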


Not completely enough though, as we consider *.omp_data_i escaped
(and thus reachable by NONLOCAL).

The following fixes that (with unknown consequences; I think Fortran
array descriptors are the only other user):

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c       (revision 165474)
+++ gcc/omp-low.c       (working copy)
@@ -1349,7 +1349,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }
 
-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl) = build_qualified_type (build_pointer_type (type),
+                                                        TYPE_QUAL_RESTRICT);
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -1584,6 +1585,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
   if (!task_copy)
Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 165474)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -5575,7 +5575,6 @@ intra_create_variable_infos (void)
              var_ann_t ann;
              heapvar = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (t)),
                                            "PARM_NOALIAS");
-             DECL_EXTERNAL (heapvar) = 1;
              heapvar_insert (t, 0, heapvar);
              ann = get_var_ann (heapvar);
              ann->is_heapvar = 1;
@@ -5590,6 +5589,12 @@ intra_create_variable_infos (void)
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          vi->is_restrict_var = 1;
+         do
+           {
+             make_constraint_from (vi, nonlocal_id);
+             vi = vi->next;
+           }
+         while (vi);
          continue;
        }
 

It means that stores to *.omp_data_i in the omp fn are considered not
escaping to the caller (and thus can be DSEd).  With the above patch
the loop is vectorized with a runtime alias check, as we can't
see that results and pData do not alias.  Not even with IPA-PTA, as
the OMP function escapes through __builtin_GOMP_parallel_start.
Comment 2 Richard Biener 2010-10-15 10:30:42 UTC
If I hack PTA to make the omp function not escape IPA-PTA computes

<bb 4>:
  # idx_1 = PHI <idx_11(3), idx_18(6)>
  # PT = { D.2069 }
  D.2112_13 = .omp_data_i_12(D)->pData;
  D.2113_14 = *D.2112_13[idx_1];
  D.2114_15 = .omp_data_i_12(D)->coeff;
  D.2115_16 = D.2113_14 * D.2114_15;
  # PT = { D.2068 }
  D.2116_17 = .omp_data_i_12(D)->results;

thus it knows what the pointers point to, and we vectorize without a
runtime alias check (we still have no idea about alignment, but that's
probably correct).

Thus it might be worth annotating some of the OMP builtins with
the fnspec attribute.
Comment 3 Richard Biener 2010-10-15 11:51:58 UTC
Created attachment 22053 [details]
fnspec attr test

Like this (ugh).  Fixes the thing with -fipa-pta on trunk.
Comment 4 Richard Biener 2010-10-15 12:09:38 UTC
A few things to consider:

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0);
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end ();

for PTA purposes we can ignore that __builtin_GOMP_parallel_start calls
main._omp_fn.0, and I suppose the function pointer doesn't escape through
it.  We can't assume that .omp_data_o.1 does not escape through
__builtin_GOMP_parallel_start though, as __builtin_GOMP_parallel_end needs
to be an optimization barrier for it (and thus needs to be considered as
reading and writing .omp_data_o.1).  As it doesn't take any arguments,
the only way to ensure that is by making .omp_data_o.1 escape.  We could
probably arrange for __builtin_GOMP_parallel_end to get &.omp_data_o.1
as an argument solely for alias-analysis purposes though.  In that case
we could use ".xw" for __builtin_GOMP_parallel_start and ".w" for
__builtin_GOMP_parallel_end.
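
A sketch of that idea (hypothetical lowering; today GOMP_parallel_end takes
no argument, the one below exists solely for alias analysis):

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0); /* fnspec ".xw" */
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end (&.omp_data_o.1);                      /* fnspec ".w" */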
Comment 5 vincenzo Innocente 2011-07-26 13:00:18 UTC
In case anybody is wondering, it seems fixed in:
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/user/i/innocent/w2/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/afs/cern.ch/user/i/innocent/w2 --enable-languages=c,c++,fortran -enable-gold=yes --enable-lto --with-build-config=bootstrap-lto --with-gmp-lib=/usr/local/lib64 --with-mpfr-lib=/usr/local/lib64 -with-mpc-lib=/usr/local/lib64 --enable-cloog-backend=isl --with-cloog=/usr/local --with-ppl-lib=/usr/local/lib64 CFLAGS='-O2 -ftree-vectorize -fPIC' CXXFLAGS='-O2 -fPIC -ftree-vectorize -fvisibility-inlines-hidden'
Thread model: posix
gcc version 4.7.0 20110725 (experimental) (GCC) 


c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp                
openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_32
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_34
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_store_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: cost model: Adding cost of checks for loop versioning aliasing.

openmpvector.cpp:11: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
openmpvector.cpp:11: note: Cost model analysis: 
  Vector inside of loop cost: 7
  Vector outside of loop cost: 19
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

openmpvector.cpp:11: note:   Profitability threshold = 6

openmpvector.cpp:11: note: Profitability threshold is 6 loop iterations.
openmpvector.cpp:11: note: create runtime check for data references *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: create runtime check for data references .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: created 2 versioning for alias checks.

openmpvector.cpp:11: note: LOOP VECTORIZED.
openmpvector.cpp:9: note: vectorized 1 loops in function.



Graphite breaks it:
c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp -fgraphite -fgraphite-identity -floop-block -floop-flatten -floop-interchange -floop-strip-mine -ftree-loop-linear -floop-parallelize-all

openmpvector.cpp:9: note: not vectorized: data ref analysis failed D.2372_47 = *pretmp.11_32[D.2403_49];

openmpvector.cpp:9: note: vectorized 0 loops in function.
Comment 6 Paolo Carlini 2011-07-26 13:47:35 UTC
Good. But if Graphite breaks it, let's add Sebastian in CC.
Comment 7 Feng Chen 2012-07-06 16:17:28 UTC
Any update on this? I still sometimes see loops getting slower after adding OpenMP on GCC 4.6.2, even for large nx*ny, e.g.:

#pragma omp parallel for
for(int iy=0; iy<ny; iy++) {
  for(int ix=0; ix<nx; ix++) {
    dest[(size_t)iy*nx + ix] = src[(size_t)iy*nx + ix] * 2;
  }
}

Sometimes GCC won't vectorize the inner loop; I have to put it into an inline function to force it (a sketch of that workaround is below), and the performance is only marginally better after that.
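
A minimal sketch of that workaround (hypothetical helper name):

static inline void
row_scale (double *dest, const double *src, int nx)
{
  for (int ix = 0; ix < nx; ix++)
    dest[ix] = src[ix] * 2;
}

#pragma omp parallel for
for (int iy = 0; iy < ny; iy++)
  row_scale (&dest[(size_t)iy*nx], &src[(size_t)iy*nx], nx);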
P.S.: I split the loop because I noticed previously that omp parallel inhibits auto-vectorization; I forget which GCC version I used ...

In my experience, Graphite did improve the scalability of OpenMP programs, so the fix (with tests) is important ...

(In reply to comment #6)
> Good. But it Graphite breaks it, let's add Sebastian in CC..
Comment 8 Dominique d'Humieres 2014-04-29 13:39:35 UTC
See also pr60997.
Comment 9 Tom de Vries 2014-08-18 13:56:07 UTC
Created attachment 33348 [details]
patch for 4.6 branch

- patches from comments 1 and 3
- c-common.c patch re-applied to lto/lto-lang.c to fix an LTO build breaker
- testcases added.

The patch applies to the 4.6 branch, a non-bootstrap build succeeds, and the added test cases pass.

In the 4.7 branch, this snippet:
...
@@ -5612,6 +5611,12 @@ intra_create_variable_infos (void)
	  rhsc.offset = 0;
	  process_constraint (new_constraint (lhsc, rhsc));
	  vi->is_restrict_var = 1;
+         do
+           {
+             make_constraint_from (vi, nonlocal_id);
+             vi = vi->next;
+           }
+         while (vi);
	  continue;
	}
... 

conflicts with: 
...
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          for (; vi; vi = vi->next)
            if (vi->may_have_pointers)
              {
                if (vi->only_restrict_pointers)
                  make_constraint_from_global_restrict (vi, "GLOBAL_RESTRICT");
                else
                  make_copy_constraint (vi, nonlocal_id);
              }
          continue;
        }
...
Comment 10 Tom de Vries 2015-05-24 09:33:10 UTC
An observation: a patch like this allows vectorization without an alias check:
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 8290a65..501d631 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -1241,7 +1241,12 @@ install_var_field (tree var, bool by_ref, int mask, omp_context *ctx)
       type = build_pointer_type (build_pointer_type (type));
     }
   else if (by_ref)
+#if 0
     type = build_pointer_type (type);
+#else
+    type = build_qualified_type (build_reference_type (type),
+                                TYPE_QUAL_RESTRICT);
+#endif
   else if ((mask & 3) == 1 && is_reference (var))
     type = TREE_TYPE (type);
 
...

The problem is that at this point we don't have the information to decide between pointer and restrict reference. If var were a scalar, we could use addr_taken to ensure that the var is not aliased; for arrays that doesn't work. If the C frontend distinguished between:
- element read: result[x], and
- alias created: result, &result, &result[x]
and stored that in an alias_created property, we could use that property to decide between pointer and restrict reference.

That would not fix the problem in general though, since that solution would no longer work if the example were rewritten using pointers, as in the sketch below.
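
A sketch (reusing the names from the original example): once the arrays are
only reachable through plain pointers, the frontend can no longer prove that
no alias was created:
...
double *r = results, *p = pData;
#pragma omp parallel for
for (int idx = 0; idx < (int)nEvents; idx++)
  r[idx] = coeff * p[idx];
...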

I wonder if postponing omp_expand till after ealias would give us enough information to update the field reference types with a restrict tag (or not) at that point. [ Though I'm not sure if doing that update there would actually have the desired effect. ]
Comment 11 Tom de Vries 2015-05-24 13:56:51 UTC
The ipa-pta solution no longer works. In 4.6, we had:
...
  # USE = anything
  # CLB = anything
  GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
  # USE = anything
  # CLB = anything
  main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
  # USE = anything
  # CLB = anything
  GOMP_parallel_endD.1049 ();
...

On trunk, we have now:
...
  # USE = anything
  # CLB = anything
  GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
...

So there's no longer a path in the call graph from main to main._omp_fn. Perhaps a dummy body for GOMP_parallel could fix that.
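
A sketch of that idea (hypothetical; the signature follows libgomp's
GOMP_parallel, and the body exists only so IPA sees the indirect call):
...
void
GOMP_parallel (void (*fn) (void *), void *data, unsigned num_threads,
               unsigned flags)
{
  fn (data);  /* dummy body making the fn (data) invocation visible */
}
...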
Comment 12 Richard Biener 2015-05-26 12:25:08 UTC
(In reply to vries from comment #11)
> The ipa-pta solution no longer works. In 4.6, we had:
> ...
>   # USE = anything
>   # CLB = anything
>   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
>   # USE = anything
>   # CLB = anything
>   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
>   # USE = anything
>   # CLB = anything
>   GOMP_parallel_endD.1049 ();
> ...
> 
> On trunk, we have now:
> ...
>   # USE = anything
>   # CLB = anything
>   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> ...
> 
> So there's no longer a path in the call graph from main to main._omp_fn.
> Perhaps a dummy body for GOMP_parallel could fix that.

Hm?  The IPA PTA "solution" was to tell IPA PTA that the call to GOMP_parallel
doesn't make .omp_data_o escape.

The attached patch doesn't work because it only patches GOMP_parallel_start,
not GOMP_parallel.

Of course it would be even better to teach IPA PTA that GOMP_parallel
is really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.

How many different forms of IL do we get for this kind of indirect
function invocation?
Comment 13 Richard Biener 2015-05-26 12:28:35 UTC
(In reply to Richard Biener from comment #12)
> (In reply to vries from comment #11)
> > The ipa-pta solution no longer works. In 4.6, we had:
> > ...
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
> >   # USE = anything
> >   # CLB = anything
> >   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallel_endD.1049 ();
> > ...
> > 
> > On trunk, we have now:
> > ...
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> > ...
> > 
> > So there's no longer a path in the call graph from main to main._omp_fn.
> > Perhaps a dummy body for GOMP_parallel could fix that.
> 
> Hm?  The IPA PTA "solution" was to tell IPA PTA that the call to
> GOMP_parallel
> doesn't make .omp_data_o escape.
> 
> The attached patch doesn't work because it only patches GOMP_parallel_start,
> not GOMP_parallel.
> 
> Of course it would be even better to teach IPA PTA that GOMP_parallel
> is really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.
> 
> How many different forms of IL do we get for this kind of indirect
> function invocation?

Other IPA propagators like IPA-CP would probably also like to know this.

I see various builtins taking an OMPFN argument in omp-builtins.def.  If we
assume the GOMP runtime itself is "transparent", do we know how the
builtins end up calling the actual implementation function?
Comment 14 Jakub Jelinek 2015-05-26 13:54:41 UTC
(In reply to Richard Biener from comment #13)
> > > So there's no longer a path in the call graph from main to main._omp_fn.
> > > Perhaps a dummy body for GOMP_parallel could fix that.
> > 
> > Hm?  The IPA PTA "solution" was to tell IPA PTA that the call to
> > GOMP_parallel
> > doesn't make .omp_data_o escape.
> > 
> > The attached patch doesn't work because it only patches GOMP_parallel_start,
> > not GOMP_parallel.
> > 
> > Of course it would even better to teach IPA PTA that GOMP_parallel
> > is really invoking main._omp_fn.0 with a &.omp_data_o.1 argument.
> > 
> > How many different ways of IL do we get doing this kind of indirect
> > function invocations?
> 
> Other IPA propagators like IPA-CP probably also would like to know this.
> 
> I see various builtins taking a OMPFN argument in omp-builtins.def.  If we
> assume the GOMP runtime itself is "transparent" then do we know how the
> builtins end up calling the actual implementation function?

GOMP_parallel* calls the ompfn function (the first argument) with the second argument (a pointer to some structure, filled in before GOMP_parallel* and dead (via a clobber) after the call) as its only argument.  The callback function can be called just once or multiple times (once in each thread).
Then there is GOMP_task*, where there are one or two callback functions;
if just one (the other one is NULL), then the first callback function (1st argument) is called either with the second argument as its only argument, or
with a pointer to a memory block that was filled with memcpy from the second argument.  If the third argument (the second callback) is non-NULL, then that callback is called instead of the memcpy, and the pointers can be to two different structures.
GOMP_target is another case, but there is often a cross-device boundary between the two, so it is much harder to model for IPA-PTA etc. purposes.
So, schematically, GOMP_parallel* (fn1, data1, ...) performs:
if (somecond)
  for (...)
    pthread_create (..., fn1, data1);
fn1 (data1);
if (somecond)
  for (...)
    pthread_join (...);
and GOMP_task (fn1, data1, fn2, ...) performs:
if (fn2 == 0 && somecond1)
  fn1 (data1);
else
  {
    char *buf = malloc (...); // or alloca/vla
    if (fn2 == 0)

  }
Comment 15 Jakub Jelinek 2015-05-26 13:59:46 UTC
and GOMP_task (fn1, data1, fn2, ...) performs:
if (somecond)
  {
    if (fn2 == 0)
      fn1 (data1);
    else
      {
        void *buf = alloca (...); // Also takes care of alignment
        fn2 (buf, data1);
        fn1 (buf);
      }
  }
else
  {
    void *buf = malloc (...); // Also takes care of alignment
    if (fn2 == 0)
      memcpy (buf, data1, ...);
    else
      fn2 (buf, data1);
    // Arrange for fn1 (buf); to be called at some point later (like C++ futures)
  }
The purpose of fn2 is to run the copy constructors of the vars that will reside within buf.
Comment 16 Tom de Vries 2015-05-26 16:38:58 UTC
(In reply to Richard Biener from comment #12)
> (In reply to vries from comment #11)
> > The ipa-pta solution no longer works. In 4.6, we had:
> > ...
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
> >   # USE = anything
> >   # CLB = anything
> >   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallel_endD.1049 ();
> > ...
> > 
> > On trunk, we have now:
> > ...
> >   # USE = anything
> >   # CLB = anything
> >   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> > ...
> > 
> > So there's no longer a path in the call graph from main to main._omp_fn.
> > Perhaps a dummy body for GOMP_parallel could fix that.
> 
> Hm?  The IPA PTA "solution" was to tell IPA PTA that the call to
> GOMP_parallel

[ GOMP_parallel_start ]

> doesn't make .omp_data_o escape.
> 

Right, for 4.6, adding fnspec ".rw" to GOMP_parallel_start has this effect in ipa-pta:
...
D.1505_14 = { ESCAPED NONLOCAL pData }
D.1509_18 = { ESCAPED NONLOCAL results }
-->
D.1505_14 = { pData }
D.1509_18 = { results }
...

where _14 and _18 are the omp_data_i relative loads in the split-off function:
...
  # VUSE <.MEMD.1514_20>
  # PT = nonlocal
  D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;

  # VUSE <.MEMD.1514_20>
  D.1506_15 = *D.1505_14[idxD.1495_1];

  ...

  # VUSE <.MEMD.1514_20>
  # PT = nonlocal
  D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;

  # .MEMD.1514_22 = VDEF <.MEMD.1514_20>
  *D.1509_18[idxD.1495_1] = D.1508_17;
...


> The attached patch doesn't work because it only patches GOMP_parallel_start,
> not GOMP_parallel.
> 

[ GOMP_parallel_start is no longer around on trunk. ] Applying the 4.6 patch on trunk (and dropping the loop in the hunk for intra_create_variable_infos that no longer applies cleanly), and applying fnspec ".rw" to GOMP_parallel, gives us in ipa-pta:
...
_17 = { }
_21 = { }
...

where _17 and _21 are the omp_data_i relative loads in the split-off function:
...
  # VUSE <.MEM_4>
  # PT = nonlocal escaped
  _17 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base 1].pDataD.1719;

  # VUSE <.MEM_4>
  _18 = *_17[idx_1];

  # VUSE <.MEM_4>
  # PT = nonlocal escaped
  _21 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base 1].resultsD.1721;

  # .MEM_22 = VDEF <.MEM_4>
  *_21[idx_1] = _20;
...

It is reasonable to assume that we are no longer able to relate these loads in the split-off function back to pData and results in the donor function, because there is no longer a direct function call to main._omp_fn in the donor function.

On 4.6, that direct call to main._omp_fn still existed; on trunk it does not anymore.
Comment 17 Richard Biener 2015-05-27 08:11:52 UTC
(In reply to vries from comment #16)
> (In reply to Richard Biener from comment #12)
> > (In reply to vries from comment #11)
> > > The ipa-pta solution no longer works. In 4.6, we had:
> > > ...
> > >   # USE = anything
> > >   # CLB = anything
> > >   GOMP_parallel_startD.1048 (main._omp_fn.0D.1472, &.omp_data_o.1D.1484, 0);
> > >   # USE = anything
> > >   # CLB = anything
> > >   main._omp_fn.0D.1472 (&.omp_data_o.1D.1484);
> > >   # USE = anything
> > >   # CLB = anything
> > >   GOMP_parallel_endD.1049 ();
> > > ...
> > > 
> > > On trunk, we have now:
> > > ...
> > >   # USE = anything
> > >   # CLB = anything
> > >   GOMP_parallelD.1345 (main._omp_fn.0D.1844, &.omp_data_o.1D.1856, 0, 0);
> > > ...
> > > 
> > > So there's no longer a path in the call graph from main to main._omp_fn.
> > > Perhaps a dummy body for GOMP_parallel could fix that.
> > 
> > Hm?  The IPA PTA "solution" was to tell IPA PTA that the call to
> > GOMP_parallel
> 
> [ GOMP_parallel_start ]
> 
> > doesn't make .omp_data_o escape.
> > 
> 
> Right, for 4.6, adding fnspec ".rw" to GOMP_parallel_start has this effect
> in ipa-pta:
> ...
> D.1505_14 = { ESCAPED NONLOCAL pData }
> D.1509_18 = { ESCAPED NONLOCAL results }
> -->
> D.1505_14 = { pData }
> D.1509_18 = { results }
> ...
> 
> where _14 and _18 are the omp_data_i relative loads in the split-off
> function:
> ...
>   # VUSE <.MEMD.1514_20>
>   # PT = nonlocal
>   D.1505_14 = .omp_data_iD.1474_13(D)->pDataD.1477;
> 
>   # VUSE <.MEMD.1514_20>
>   D.1506_15 = *D.1505_14[idxD.1495_1];
> 
>   ...
> 
>   # VUSE <.MEMD.1514_20>
>   # PT = nonlocal
>   D.1509_18 = .omp_data_iD.1474_13(D)->resultsD.1479;
> 
>   # .MEMD.1514_22 = VDEF <.MEMD.1514_20>
>   *D.1509_18[idxD.1495_1] = D.1508_17;
> ...
> 
> 
> > The attached patch doesn't work because it only patches GOMP_parallel_start,
> > not GOMP_parallel.
> > 
> 
> [ GOMP_parallel_start is no longer around on trunk. ] Applying the 4.6 patch
> on trunk (and dropping the loop in the hunk for intra_create_variable_infos
> that does not apply cleanly anymore) and applying fnspec ".rw" on
> GOMP_parallel, gives us in ipa-pta:
> ...
> _17 = { }
> _21 = { }
> ...
> 
> where _17 and _21 are the omp_data_i relative loads in the split-off
> function:
> ...
>   # VUSE <.MEM_4>
>   # PT = nonlocal escaped
>   _17 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base
> 1].pDataD.1719;
> 
>   # VUSE <.MEM_4>
>   _18 = *_17[idx_1];
> 
>   # VUSE <.MEM_4>
>   # PT = nonlocal escaped
>   _21 = MEM[(struct .omp_data_s.0D.1713 &).omp_data_i_16(D) clique 1 base
> 1].resultsD.1721;
> 
>   # .MEM_22 = VDEF <.MEM_4>
>   *_21[idx_1] = _20;
> ...
> 
> It is reasonable to assume that we are no longer able to relate these
> loads in the split-off function back to pData and results in the donor
> function, because there is no longer a direct function call to
> main._omp_fn in the donor function.
> 
> On 4.6, that direct call to main._omp_fn still existed; on trunk it does
> not anymore.

In fact it even looks like wrong IPA PTA results to me (_17 and _21 point to
nothing).

Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 223737)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -7372,7 +7372,8 @@ ipa_pta_execute (void)
         constraints for parameters.  */
       if (node->used_from_other_partition
          || node->externally_visible
-         || node->force_output)
+         || node->force_output
+         || node->address_taken)
        {
          intra_create_variable_infos (func);
 
fixes that.  Of course that means a solution handling the OMP builtins
specially is not going to work: if the function has its address taken,
we don't know whether it is called from anywhere else.  The fix is
required for correctness though.

_17 = { ESCAPED NONLOCAL }
_21 = { ESCAPED NONLOCAL }

  _18 = *_17[idx_3];
  *_21[idx_3] = _20;

handling the OMP builtin specially will only get the solution amended
to { ESCAPED NONLOCAL &results } and { ESCAPED NONLOCAL &pData } and
thus still conflict.
Comment 18 Jakub Jelinek 2015-05-27 08:20:08 UTC
Indeed, while the *.omp_fn.* functions necessarily have to be addressable (that is how they are passed to the libgomp entry points), they are never called by anything but the libgomp runtime.  For GOMP_parallel* they are only called before the GOMP_parallel* function exits; for GOMP_task* they could be called at some later point.
Comment 19 rguenther@suse.de 2015-05-27 08:43:08 UTC
On Wed, 27 May 2015, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
> 
> --- Comment #18 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Indeed, while the *.omp_fn.* functions necessarily have to be addressable
> (that is how they are passed to the libgomp entry points), they are never
> called by anything but the libgomp runtime.  For GOMP_parallel* they are
> only called before the GOMP_parallel* function exits; for GOMP_task* they
> could be called at some later point.

Ok, so this just means that IPA PTA would need to handle those specially
(and thus the OMP functions should be marked specially in the cgraph
node).  Not that I think IPA PTA is anywhere near production ready
(or that I have time to fix it up properly...).  Just testing the
addressable fix now.
Comment 20 Tom de Vries 2015-11-19 16:58:01 UTC
This patch seems to have the desired effect on the original testcase: 
...
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 830db75..996756b 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region *region,
       if (collapse_bb == NULL)
        loop->latch = cont_bb;
       add_loop (loop, body_bb->loop_father);
+      loop->safelen = INT_MAX;
     }
 }
...

AFAIU, putting the omp for on the loop is an assertion that the loop iterations are independent. It seems reasonable to assume that if the original loop was independent, a loop operating on a slice of the original iteration space will be independent as well.
Comment 21 Jakub Jelinek 2015-11-19 17:14:59 UTC
(In reply to vries from comment #20)
> This patch seems to have the desired effect on the original testcase: 
> ...
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index 830db75..996756b 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -9361,6 +9361,7 @@ expand_omp_for_static_nochunk (struct omp_region
> *region,
>        if (collapse_bb == NULL)
>         loop->latch = cont_bb;
>        add_loop (loop, body_bb->loop_father);
> +      loop->safelen = INT_MAX;
>      }
>  }
> ...
> 
> AFAIU, putting the omp for on the loop is an assertion that the loop
> iterations are independent. It seems reasonable to assume that if the
> original loop was independent, a loop operating on a slice of the original
> iteration space will be independent as well.

That is very much wrong.  Static scheduling, both nochunk and chunk, doesn't imply in any way that the iterations are independent.  The OpenMP standard says how the work is split among the threads: with nochunk, threads get consecutive sets of iterations, one chunk each, of approximately the same size.  But even though it is not exactly specified how the iteration space is divided (for nochunk), if you made the loop iterations independent you would break many observable properties (say through threadprivate vars, omp_get_thread_num, etc.).
Note that loop->safelen == INT_MAX is actually weaker than independent iterations.  When loop->safelen == INT_MAX there can be dependencies, but only of certain kinds: it says the loop is equivalent whether you run it normally or run simultaneously (or emulated) the first statements of all the iterations, then the second statements, and so on (so the compiler may vectorize with any vectorization factor it wants).
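
A sketch of the kind of dependence that safelen == INT_MAX still allows
(assumed example): the loop below has a cross-iteration anti-dependence,
yet running all loads "first" and then all stores gives the same result as
sequential execution, so any vectorization factor is fine:

for (int i = 0; i < n - 1; i++)
  {
    double t = b[i + 1]; /* under that model, all loads happen first */
    b[i] = t * 2.0;      /* ... then all stores */
  }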
Comment 23 Tom de Vries 2015-11-30 16:34:58 UTC
Author: vries
Date: Mon Nov 30 16:34:26 2015
New Revision: 231076

URL: https://gcc.gnu.org/viewcvs?rev=231076&root=gcc&view=rev
Log:
Handle BUILT_IN_GOMP_PARALLEL in ipa-pta

2015-11-30  Tom de Vries  <tom@codesourcery.com>

	PR tree-optimization/46032
	* tree-ssa-structalias.c (find_func_aliases_for_call_arg): New function,
	factored out of ...
	(find_func_aliases_for_call): ... here.
	(find_func_aliases_for_builtin_call, find_func_clobbers): Handle
	BUILT_IN_GOMP_PARALLEL.
	(ipa_pta_execute): Same.  Handle node->parallelized_function as a local
	function.

	* gcc.dg/pr46032.c: New test.

	* testsuite/libgomp.c/pr46032.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr46032.c
    trunk/libgomp/testsuite/libgomp.c/pr46032.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-structalias.c
    trunk/libgomp/ChangeLog
Comment 24 Tom de Vries 2015-11-30 16:43:00 UTC
Patch with testcase committed; marking resolved-fixed.