[Bug tree-optimization/97832] AoSoA complex caxpy-like loops: AVX2+FMA -Ofast 7 times slower than -O3
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Wed Nov 18 09:15:07 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Or
double a[1024], b[1024], c[1024];

void foo()
{
  for (int i = 0; i < 256; ++i)
    {
      a[2*i] = 1. - a[2*i] + b[2*i];
      a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
    }
}
which early folding breaks unless we add -fno-associative-math. We then
end up with
a[_1] = (((b[_1]) - (a[_1])) + 1.0e+0);
a[_6] = (((a[_6]) - (b[_6])) + 1.0e+0);
where SLP operator swapping cannot bring the grouped loads into the
same lanes.
So the idea is to look at single-use chains of plus/minus operations and
handle those as wide associated SLP nodes with flags denoting which lanes
need negation. We'd have three children and each child has a per-lane
spec whether to add or subtract.
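
To make the idea concrete, here is a minimal scalar sketch in C of what
"three children with a per-lane add/subtract spec" could look like for
the loop above. It is only an illustration under stated assumptions, not
how the vectorizer represents SLP nodes: the names lane_sign,
foo_scalar, foo_signed_lanes and lanes_match are hypothetical, and the
sign is applied by multiplying with +1.0 or -1.0 rather than by a real
negate or blend.

```c
#include <stddef.h>

#define N_OPERANDS 3   /* children of the hypothetical wide plus/minus node */
#define N_LANES    2   /* lanes in the SLP group (even and odd elements) */

/* lane_sign[lane][child]: +1.0 adds the child, -1.0 subtracts it.
   Children are, in order: the constant 1.0, the load from a, the load from b. */
static const double lane_sign[N_LANES][N_OPERANDS] = {
    { +1.0, -1.0, +1.0 },   /* lane 0: 1 - a + b */
    { +1.0, +1.0, -1.0 },   /* lane 1: 1 + a - b */
};

/* Reference: the loop from the testcase, on caller-provided arrays. */
void foo_scalar(double *a, const double *b, size_t pairs)
{
    for (size_t i = 0; i < pairs; ++i) {
        a[2*i]   = 1. - a[2*i] + b[2*i];
        a[2*i+1] = 1 + a[2*i+1] - b[2*i+1];
    }
}

/* Same computation, but every lane uses the same children in the same
   order; only the per-lane sign table differs between the lanes. */
void foo_signed_lanes(double *a, const double *b, size_t pairs)
{
    for (size_t i = 0; i < pairs; ++i) {
        for (size_t lane = 0; lane < N_LANES; ++lane) {
            size_t k = 2*i + lane;
            const double child[N_OPERANDS] = { 1.0, a[k], b[k] };
            double acc = 0.0;
            for (size_t c = 0; c < N_OPERANDS; ++c)
                acc += lane_sign[lane][c] * child[c];
            a[k] = acc;
        }
    }
}

/* Returns 1 if both variants agree on a small integer-valued input. */
int lanes_match(void)
{
    enum { PAIRS = 8 };
    double a1[2*PAIRS], a2[2*PAIRS], b[2*PAIRS];
    for (size_t k = 0; k < 2*PAIRS; ++k) {
        a1[k] = a2[k] = (double)k;
        b[k] = (double)(3*k + 1);
    }
    foo_scalar(a1, b, PAIRS);
    foo_signed_lanes(a2, b, PAIRS);
    for (size_t k = 0; k < 2*PAIRS; ++k)
        if (a1[k] != a2[k])
            return 0;
    return 1;
}
```

The point of the sketch is that both lanes now load a[] and b[] in the
same child positions, so the grouped loads line up; the add/subtract
difference lives entirely in the sign table.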