[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

Mon Jan 29 12:05:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
So strided stores are costed as

  /* Costs of the stores.  */
  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_GATHER_SCATTER)
    {
      /* N scalar stores plus extracting the elements.  */
      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
      inside_cost += record_stmt_cost (body_cost_vec,
                                       ncopies * assumed_nunits,
                                       scalar_store, stmt_info, 0, vect_body);
    }
...
  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_STRIDED_SLP)
    {
      /* N scalar stores plus extracting the elements.  */
      unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
      inside_cost += record_stmt_cost (body_cost_vec,
                                       ncopies * assumed_nunits,
                                       vec_to_scalar, stmt_info, 0, vect_body);
    }

There's the issue of "overloading" vec_to_scalar with element extraction.
It's costed as a generic sse_op, which IMHO is reasonable here (vextract*).
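
To make the arithmetic concrete, here is a minimal stand-alone sketch of what the two quoted branches add up to for a VMAT_ELEMENTWISE store. It is a toy model, not GCC code: the ToyCosts struct, the function name and any per-operation numbers fed into it are hypothetical, and the real values come from the target cost tables.

```cpp
#include <cassert>

// Hypothetical per-operation costs. These are placeholders for illustration,
// not values taken from any GCC target cost table.
struct ToyCosts {
  unsigned scalar_store;   // cost of one scalar store
  unsigned vec_to_scalar;  // cost of extracting one element from a vector
};

// Toy model of the two quoted branches for an elementwise store:
// ncopies * nunits scalar stores plus ncopies * nunits element extractions.
unsigned elementwise_store_cost(unsigned ncopies, unsigned nunits,
                                const ToyCosts &c) {
  assert(ncopies > 0 && nunits > 0);
  const unsigned count = ncopies * nunits;
  return count * c.scalar_store + count * c.vec_to_scalar;
}
```

With these assumptions, a single copy of a 4-element vector pays for four stores and four extractions, so the extraction cost chosen for vec_to_scalar scales directly with the number of lanes.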

The scalar cost is 12 for each of the following stmts

  _66 = *_150[_65];
  d1.76_67 = d1;
  _160 = d1.76_67 * _73;
  _74 = _66 * _160;
  *_150[_65] = _74;

The vector variant adds the construction/extraction cost compared
to the scalar variant, and wins because the two multiplications are costed
once instead of four times.  We don't actually factor in the "win" from
hoisting the vectorized load of 'd1', which happens only in the vector case.
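
The comparison can be sketched roughly as follows. This is a simplified model under stated assumptions, not the vectorizer's actual profitability computation: a vectorization factor of 4 is read off "four times" above, the per-element overhead for construction/extraction is a hypothetical number, and the function names are made up for illustration. It also ignores the load of 'd1' entirely.

```cpp
#include <cassert>

// Cost of vf scalar iterations: every scalar statement is paid vf times.
unsigned scalar_cost_for(unsigned vf, unsigned n_stmts, unsigned stmt_cost) {
  assert(vf > 0);
  return vf * n_stmts * stmt_cost;
}

// Cost of one vector iteration covering the same vf elements: the arithmetic
// statements are paid once, while the strided accesses pay a per-element
// overhead (vector construction, extraction, scalar stores).
// per_elt_overhead is a hypothetical placeholder value.
unsigned vector_cost_for(unsigned vf, unsigned n_arith, unsigned vec_stmt_cost,
                         unsigned per_elt_overhead) {
  assert(vf > 0);
  return n_arith * vec_stmt_cost + vf * per_elt_overhead;
}
```

For example, five scalar statements at cost 12 over four iterations give 240, so in this model the vector variant wins whenever two vector multiplications plus four times the per-element overhead stay below that.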

With AVX2 the vectorized variant is costed as even "cheaper".  And of course
we peel the epilogue completely.

Ideally we'd interchange this specific loop, but interchange doesn't do
anything here because we get niters that might be zero.  Later dependences
would probably wreck things, but this is also a missed optimization here.
We have two paths running into the loop, each loading ng1 and properly
checking it against zero, but the PHI result doesn't have this range info
merged (well, VRP sets the info, but it needs LIM / PRE to see the
opportunity, so it's only set by late VRP).
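
For readers unfamiliar with the transformation, here is a generic illustration of what interchange buys for a strided access. It is not the polyhedron loop and does not model the niters analysis of the interchange pass; the function names and the row-major layout are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major n x m matrix. The inner loop walks down a column, so consecutive
// iterations touch elements that are m doubles apart (a strided access).
void scale_strided(std::vector<double> &a, std::size_t n, std::size_t m,
                   double d) {
  assert(a.size() == n * m);
  for (std::size_t j = 0; j < m; ++j)
    for (std::size_t i = 0; i < n; ++i)
      a[i * m + j] *= d;
}

// Same computation with the loops interchanged: the inner loop now walks
// along a row, so the access is contiguous and needs no element-wise
// construction or extraction when vectorized.
void scale_interchanged(std::vector<double> &a, std::size_t n, std::size_t m,
                        double d) {
  assert(a.size() == n * m);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < m; ++j)
      a[i * m + j] *= d;
}
```

Both functions compute the same result because the iterations are independent; the difference is only in the access pattern the vectorizer gets to see.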