[Bug rtl-optimization/68128] A huge regression in Parboil v2.5 OpenMP CUTCP test (2.5 times lower performance)

jakub at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Fri Nov 20 15:19:00 GMT 2015


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68128

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org,
                   |                            |rth at gcc dot gnu.org
   Target Milestone|---                         |6.0

--- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Though, comparing the performance with ICC still shows a huge difference, even
with -D__INTEL_COMPILER defined.
Comparing which loops are vectorized with -Ofast -fno-openmp and which are
vectorized with -Ofast -fopenmp shows only one difference, apparently the
hottest loop (and the only important one) in the benchmark.
I can get performance comparable to ICC by adding a
firstprivate (gridspacing)
clause to the #pragma omp parallel, which is what the benchmark authors should
have used, because that variable is never modified in the parallel.
Or, alternatively, changing the
  float gridspacing = lattice->dim.h;
line to
  const float gridspacing = lattice->dim.h;
(then it is firstprivate implicitly).

But that suggests we should either improve something on the aliasing side, or
try to optimize it at omp lowering or expansion time.

Reduced testcase:

void
foo (float *pgstart, const float dxstart, float gridspacing,
     const float inv_a2, const float dydz2, const float a2, const float q)
{
  int i, j, ia, ib;
  float dx, *pg, r2, s, e;
#pragma omp parallel for private (i, j, ia, ib, dx, pg, r2, s, e)
  for (j = 0; j < 1024; j++)
    {
      ia = j * 64;
      ib = j * 64 + 63;
      dx = dxstart + j * gridspacing;
      pg = pgstart + j * 64;
      for (i = ia; i <= ib; i++, pg++, dx += gridspacing)
        {
          r2 = dx * dx + dydz2;
          s = (1.f - r2 * inv_a2) * (1.f - r2 * inv_a2);
          e = q * (1 / __builtin_sqrtf (r2)) * s;
          *pg += (r2 < a2 ? e : 0);
        }
    }
}

With firstprivate (gridspacing) or const float gridspacing, the above is
vectorized with -Ofast -fopenmp; otherwise it is vectorized only with
-Ofast -fno-openmp.
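
For completeness, the variant that does get vectorized with -Ofast -fopenmp is
just the same function with the explicit clause added (renamed to foo_fixed
here purely for illustration):

void
foo_fixed (float *pgstart, const float dxstart, float gridspacing,
           const float inv_a2, const float dydz2, const float a2, const float q)
{
  int i, j, ia, ib;
  float dx, *pg, r2, s, e;
  /* The only change: gridspacing is explicitly firstprivate, so every thread
     works on its own copy and the inner loop vectorizes again.  */
#pragma omp parallel for private (i, j, ia, ib, dx, pg, r2, s, e) \
                         firstprivate (gridspacing)
  for (j = 0; j < 1024; j++)
    {
      ia = j * 64;
      ib = j * 64 + 63;
      dx = dxstart + j * gridspacing;
      pg = pgstart + j * 64;
      for (i = ia; i <= ib; i++, pg++, dx += gridspacing)
        {
          r2 = dx * dx + dydz2;
          s = (1.f - r2 * inv_a2) * (1.f - r2 * inv_a2);
          e = q * (1 / __builtin_sqrtf (r2)) * s;
          *pg += (r2 < a2 ? e : 0);
        }
    }
}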

If a shared variable is or might be modified in the parallel, then the aliasing
analysis for it is complicated, because other threads could be modifying the
variable asynchronously.  But perhaps at least the case where a shared variable
is not addressable during omp lowering is something we should try to optimize -
analyze the body, and if we are sure it is not modified, turn it (at least for
gimple reg types) into firstprivate.  Of course such analysis pre-SSA is a tiny
bit harder, but if the variable is not addressable, perhaps not that much.
For the non-addressable scalars that are copy-in/out optimized in the parallel,
another possibility is to remember that (say by adding a special omp attribute
on the field decl); then the .omp_data_i->field is the scalar itself (sketched
below), and we could check whether it is ever written to in the outlined
function body (or might be in anything it calls), and if not, load it into a
private var.  Richard, any thoughts on this?
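
For reference, a very rough sketch of the outlined body for the testcase above
when gridspacing stays shared.  The names are made up for readability - the
real lowering uses .omp_data_s / foo._omp_fn.0, GIMPLE temporaries and the
libgomp loop routines - so treat it only as a sketch of where the field
accesses end up:

struct omp_data_s            /* simplified stand-in for GCC's .omp_data_s */
{
  float *pgstart;
  float dxstart, gridspacing, inv_a2, dydz2, a2, q;
};

static void
foo_omp_fn (struct omp_data_s *omp_data_i)
{
  /* The const (implicitly firstprivate) arguments get private copies up
     front, so they are loop invariant for the vectorizer.  */
  const float dxstart = omp_data_i->dxstart;
  const float inv_a2 = omp_data_i->inv_a2;
  const float dydz2 = omp_data_i->dydz2;
  const float a2 = omp_data_i->a2;
  const float q = omp_data_i->q;
  int jstart = 0, jend = 1024;  /* really the per-thread chunk from libgomp */

  for (int j = jstart; j < jend; j++)
    {
      float dx = dxstart + j * omp_data_i->gridspacing;
      float *pg = omp_data_i->pgstart + j * 64;
      for (int i = j * 64; i <= j * 64 + 63; i++, pg++)
        {
          float r2 = dx * dx + dydz2;
          float s = (1.f - r2 * inv_a2) * (1.f - r2 * inv_a2);
          float e = q * (1 / __builtin_sqrtf (r2)) * s;
          *pg += (r2 < a2 ? e : 0);
          /* gridspacing stays shared, so it is re-read through the data
             block every iteration, and the float stores through pg
             presumably cannot be disambiguated from that field, so it is
             not loop invariant and the inner loop is not vectorized.  The
             proposal above would verify the field is never written in this
             function (or its callees) and load it into a private temporary
             once - effectively what the explicit firstprivate clause does.  */
          dx += omp_data_i->gridspacing;
        }
    }
}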

