Bug 46032 - openmp inhibits loop vectorization
Status: NEW
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.5.1
Importance: P3 major
Target Milestone: ---
Assigned To: Not yet assigned to anyone
URL:
Depends on:
Blocks:
 
Reported: 2010-10-15 07:23 UTC by vincenzo Innocente
Modified: 2012-07-06 16:17 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-10-15 10:08:39


Attachments
fnspec attr test (3.53 KB, patch)
2010-10-15 11:51 UTC, Richard Biener
Details | Diff

Description vincenzo Innocente 2010-10-15 07:23:20 UTC
Using OpenMP to parallelize a loop inhibits auto-vectorization.
This defeats the benefit of parallelization, making the parallel code slower than the sequential one.
Is a version of OpenMP foreseen that preserves auto-vectorization?

Example
on
Linux  2.6.18-194.11.3.el5.cve20103081 #1 SMP Thu Sep 16 15:17:10 CEST 2010 x86_64 x86_64 x86_64 GNU/Linux
using
GNU C++ (GCC) version 4.6.0 20100408 (experimental) (x86_64-unknown-linux-gnu)
	compiled by GNU C version 4.6.0 20100408 (experimental), GMP version 4.3.2, MPFR version 2.4.2, MPC version 0.8.1
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
compiling this simple example
cat openmpvector.cpp
int main()
{

 const unsigned int nEvents = 1000;
 double results[nEvents] = {0};
 double pData[nEvents] = {0};
 double coeff = 12.2;

#pragma omp parallel for
 for (int idx = 0; idx<(int)nEvents; idx++) {
   results[idx] = coeff*pData[idx];
 }

 return results[0]; // avoid optimization of "dead" code

}

gives
g++  -O2 -fopenmp -ftree-vectorize -ftree-vectorizer-verbose=7 openmpvector.cpp

openmpvector.cpp:11: note: not vectorized: loop contains function calls or data references that cannot be analyzed
openmpvector.cpp:9: note: vectorized 0 loops in function.
Comment 1 Richard Biener 2010-10-15 10:08:39 UTC
The problem is that local variables are accessed indirectly via the
.omp_data_i pointer and alias analysis is unable to hoist the load of
.omp_data_i_12(D)->coeff across the store to *pretmp.5_27[idx_1].

A fix is to make the argument DECL_BY_REFERENCE and its type
restrict-qualified.  This will make alias analysis assume that
the pointed-to object is not aliased unless somebody later takes its
address.

<bb 3>:
  pretmp.5_23 = .omp_data_i_12(D)->pData;
  pretmp.5_27 = .omp_data_i_12(D)->results;

<bb 4>:
  # idx_1 = PHI <idx_8(3), idx_18(5)>
  D.2142_14 = *pretmp.5_23[idx_1];
  D.2143_15 = .omp_data_i_12(D)->coeff;
  D.2144_16 = D.2142_14 * D.2143_15;
  *pretmp.5_27[idx_1] = D.2144_16;
  idx_18 = idx_1 + 1;
  if (D.2139_10 > idx_18)
    goto <bb 5>;
  else
    goto <bb 6>;

<bb 5>:
  goto <bb 4>;


Not completely enough though, as we consider *.omp_data_i escaped
(and thus reachable by NONLOCAL).

The following fixes that (with unknown consequences, I think fortran
array descriptors are the only other user):

Index: gcc/omp-low.c
===================================================================
--- gcc/omp-low.c       (revision 165474)
+++ gcc/omp-low.c       (working copy)
@@ -1349,7 +1349,8 @@ fixup_child_record_type (omp_context *ct
       layout_type (type);
     }
 
-  TREE_TYPE (ctx->receiver_decl) = build_pointer_type (type);
+  TREE_TYPE (ctx->receiver_decl) = build_qualified_type (build_pointer_type (type),
+                                                        TYPE_QUAL_RESTRICT);
 }
 
 /* Instantiate decls as necessary in CTX to satisfy the data sharing
@@ -1584,6 +1585,7 @@ create_omp_child_function (omp_context *
   DECL_NAMELESS (t) = 1;
   DECL_ARG_TYPE (t) = ptr_type_node;
   DECL_CONTEXT (t) = current_function_decl;
+  DECL_BY_REFERENCE (t) = 1;
   TREE_USED (t) = 1;
   DECL_ARGUMENTS (decl) = t;
   if (!task_copy)
Index: gcc/tree-ssa-structalias.c
===================================================================
--- gcc/tree-ssa-structalias.c  (revision 165474)
+++ gcc/tree-ssa-structalias.c  (working copy)
@@ -5575,7 +5575,6 @@ intra_create_variable_infos (void)
              var_ann_t ann;
              heapvar = create_tmp_var_raw (TREE_TYPE (TREE_TYPE (t)),
                                            "PARM_NOALIAS");
-             DECL_EXTERNAL (heapvar) = 1;
              heapvar_insert (t, 0, heapvar);
              ann = get_var_ann (heapvar);
              ann->is_heapvar = 1;
@@ -5590,6 +5589,12 @@ intra_create_variable_infos (void)
          rhsc.offset = 0;
          process_constraint (new_constraint (lhsc, rhsc));
          vi->is_restrict_var = 1;
+         do
+           {
+             make_constraint_from (vi, nonlocal_id);
+             vi = vi->next;
+           }
+         while (vi);
          continue;
        }
 

it means that stores to *.omp_data_i in the omp fn are considered not
escaping to the caller (and thus can be DSEd).  With the above patch
the loop is vectorized with a runtime alias check, as we can't
see that results and pData do not alias.  Not even with IPA-PTA as
the OMP function escapes through __builtin_GOMP_parallel_start.
Comment 2 Richard Biener 2010-10-15 10:30:42 UTC
If I hack PTA to make the omp function not escape IPA-PTA computes

<bb 4>:
  # idx_1 = PHI <idx_11(3), idx_18(6)>
  # PT = { D.2069 }
  D.2112_13 = .omp_data_i_12(D)->pData;
  D.2113_14 = *D.2112_13[idx_1];
  D.2114_15 = .omp_data_i_12(D)->coeff;
  D.2115_16 = D.2113_14 * D.2114_15;
  # PT = { D.2068 }
  D.2116_17 = .omp_data_i_12(D)->results;

thus knows what the pointers point to and we vectorize w/o a runtime
alias check (we still have no idea about alignment though, but that's
probably correct).

Thus it might be worth annotating some of the OMP builtins with
the fnspec attribute.
Comment 3 Richard Biener 2010-10-15 11:51:58 UTC
Created attachment 22053 [details]
fnspec attr test

Like this (ugh).  Fixes the thing with -fipa-pta on trunk.
Comment 4 Richard Biener 2010-10-15 12:09:38 UTC
A few things to consider:

  __builtin_GOMP_parallel_start (main._omp_fn.0, &.omp_data_o.1, 0);
  main._omp_fn.0 (&.omp_data_o.1);
  __builtin_GOMP_parallel_end ();

for PTA purposes we can ignore that __builtin_GOMP_parallel_start calls
main._omp_fn.0 and I suppose the function pointer doesn't escape through
it.  We can't assume that .omp_data_o.1 does not escape through
__builtin_GOMP_parallel_start though, as __builtin_GOMP_parallel_end needs
to be a barrier for optimization for it (and thus needs to be considered
reading and writing .omp_data_o.1).  As it doesn't take any arguments
the only way to ensure that is by making .omp_data_o.1 escape.  We could
probably arrange for __builtin_GOMP_parallel_end to get &.omp_data_o.1
as argument solely for alias-analysis purposes though.  In that case
we could use ".xw" for __builtin_GOMP_parallel_start and ".w" for
__builtin_GOMP_parallel_end.
Comment 5 vincenzo Innocente 2011-07-26 13:00:18 UTC
In case anybody is wondering, it seems fixed in:
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/user/i/innocent/w2/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/afs/cern.ch/user/i/innocent/w2 --enable-languages=c,c++,fortran -enable-gold=yes --enable-lto --with-build-config=bootstrap-lto --with-gmp-lib=/usr/local/lib64 --with-mpfr-lib=/usr/local/lib64 -with-mpc-lib=/usr/local/lib64 --enable-cloog-backend=isl --with-cloog=/usr/local --with-ppl-lib=/usr/local/lib64 CFLAGS='-O2 -ftree-vectorize -fPIC' CXXFLAGS='-O2 -fPIC -ftree-vectorize -fvisibility-inlines-hidden'
Thread model: posix
gcc version 4.7.0 20110725 (experimental) (GCC) 


c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp                
openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: versioning for alias required: can't determine dependence between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: mark for run-time aliasing test between .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_32
openmpvector.cpp:11: note: Unknown alignment for access: *pretmp.11_34
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: Vectorizing an unaligned access.
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_load_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
openmpvector.cpp:11: note: vect_model_store_cost: unaligned supported by hardware.
openmpvector.cpp:11: note: vect_model_store_cost: inside_cost = 2, outside_cost = 0 .
openmpvector.cpp:11: note: cost model: Adding cost of checks for loop versioning aliasing.

openmpvector.cpp:11: note: cost model: epilogue peel iters set to vf/2 because loop iterations are unknown .
openmpvector.cpp:11: note: Cost model analysis: 
  Vector inside of loop cost: 7
  Vector outside of loop cost: 19
  Scalar iteration cost: 4
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

openmpvector.cpp:11: note:   Profitability threshold = 6

openmpvector.cpp:11: note: Profitability threshold is 6 loop iterations.
openmpvector.cpp:11: note: create runtime check for data references *pretmp.11_32[idx_3] and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: create runtime check for data references .omp_data_i_14(D)->coeff and *pretmp.11_34[idx_3]
openmpvector.cpp:11: note: created 2 versioning for alias checks.

openmpvector.cpp:11: note: LOOP VECTORIZED.
openmpvector.cpp:9: note: vectorized 1 loops in function.



Graphite breaks it, though:
c++ -std=gnu++0x -DNDEBUG -Wall -Ofast -mavx openmpvector.cpp -ftree-vectorizer-verbose=7 -fopenmp -fgraphite -fgraphite-identity -floop-block -floop-flatten -floop-interchange -floop-strip-mine -ftree-loop-linear -floop-parallelize-all

openmpvector.cpp:9: note: not vectorized: data ref analysis failed D.2372_47 = *pretmp.11_32[D.2403_49];

openmpvector.cpp:9: note: vectorized 0 loops in function.
Comment 6 Paolo Carlini 2011-07-26 13:47:35 UTC
Good. But if Graphite breaks it, let's add Sebastian in CC.
Comment 7 Feng Chen 2012-07-06 16:17:28 UTC
Any update on this? I do see loops getting slower after adding omp on gcc 4.6.2, sometimes even for large nx*ny, e.g.,

#pragma omp parallel for
for(int iy=0; iy<ny; iy++) {
  for(int ix=0; ix<nx; ix++) {
    dest[(size_t)iy*nx + ix] = src[(size_t)iy*nx + ix] * 2;
  }
}

Sometimes gcc won't vectorize the inner loop; I have to put it into an inline function to force it.  The performance is only marginally better after that.
PS: I split the loop in two because I noticed previously that omp parallel inhibits auto-vectorization; I forget which gcc version I used ...

Graphite did improve the scalability of OpenMP programs in my experience, so the fix (with tests) is important ...

(In reply to comment #6)
> Good. But it Graphite breaks it, let's add Sebastian in CC..