[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3

Wed Nov 18 09:15:07 GMT 2020

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Or

double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
    }
}

which early folding breaks unless we add -fno-associative-math.  We then
end up with

  a[_1] = (((b[_1]) - (a[_1])) + 1.0e+0);
  a[_6] = (((a[_6]) - (b[_6])) + 1.0e+0);

where SLP operator swapping cannot bring the grouped loads into the
same lanes: the subtraction takes (b, a) in the first lane but (a, b)
in the second.

So the idea is to look at single-use chains of plus/minus operations and
handle those as wide associated SLP nodes with flags denoting which lanes
need negation.  For the example above we'd have three children (1.0, a[]
and b[]) and each child has a per-lane spec whether to add or subtract.
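
To make the per-lane spec concrete, here is a rough sketch in plain C of
how such a node could be modelled for the two lanes of foo() above.  This
is only an illustration, not a GCC data structure: struct chain_node,
foo_chain and eval_chain are hypothetical names, and the child order
(1.0, a[], b[]) is an assumption chosen so that the loads from a[] and
b[] each sit in a single child across both lanes.

```c
#include <stdbool.h>

#define LANES    2   /* a[2*i] and a[2*i+1] */
#define CHILDREN 3   /* 1.0, a[], b[] (assumed order) */

/* Hypothetical model of a "wide associated" plus/minus node:
   for every child, one flag per lane saying whether that lane
   subtracts (true) or adds (false) the child's value.  */
struct chain_node {
  bool negate[CHILDREN][LANES];
};

/* Spec for the example loop:
     lane 0:  1.0 - a + b
     lane 1:  1.0 + a - b  */
static const struct chain_node foo_chain = {
  .negate = {
    { false, false },  /* child 0: 1.0  ->  +, + */
    { true,  false },  /* child 1: a[]  ->  -, + */
    { false, true  },  /* child 2: b[]  ->  +, - */
  }
};

/* Evaluate the node lane by lane.  ops[c][l] is the value of
   child c in lane l; the result for each lane goes to out[l].  */
static void
eval_chain (const struct chain_node *n,
            const double ops[CHILDREN][LANES], double out[LANES])
{
  for (int l = 0; l < LANES; ++l)
    {
      double sum = 0.0;
      for (int c = 0; c < CHILDREN; ++c)
        sum += n->negate[c][l] ? -ops[c][l] : ops[c][l];
      out[l] = sum;
    }
}
```

With the operands grouped this way every child is uniform across lanes
(one constant, one load group from a[], one from b[]) and only the sign
flags differ, which is what operand swapping alone could not achieve.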