[Bug tree-optimization/88760] GCC unrolling is suboptimal
wilco at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu Jan 24 13:27:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #18)
> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
>
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)
Many CPUs execute LDP/STP as a single load/store; e.g. Cortex-A57 executes a
128-bit LDP in a single cycle (see the Cortex-A57 Software Optimization Guide).
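As a rough illustration of point 1, here is a C sketch of a 64-bit copy loop unrolled by 2 by hand. Each iteration loads and stores two adjacent 8-byte elements. On AArch64, such adjacent pairs are the kind of access that can be combined into one 128-bit LDP and one STP. The function name is hypothetical. Whether a particular GCC version actually forms the pairs depends on the target, tuning and flags, so treat this as an assumption-laden sketch rather than a description of what GCC emits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: copy n 64-bit values, unrolled by 2.
   src[i] and src[i + 1] are adjacent in memory, so on AArch64 the two
   loads are a candidate for a single LDP and the two stores for a single
   STP (whether that happens depends on the compiler and flags). */
void copy_u64_unroll2(uint64_t *restrict dst, const uint64_t *restrict src,
                      size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t a = src[i];      /* adjacent loads: LDP candidate  */
        uint64_t b = src[i + 1];
        dst[i] = a;               /* adjacent stores: STP candidate */
        dst[i + 1] = b;
    }
    /* Epilogue for odd n: the peeling that unrolling by 2 requires. */
    if (i < n)
        dst[i] = src[i];
}
```

The scalar epilogue is the "peeling associated with" unrolling mentioned in the thread. It handles the leftover element when n is odd.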
> > 2) Unrolling and breaking accumulator dependencies.
>
> IIRC RTL unrolling can do this (as side-effect, not as main
> cost motivation) guarded with an extra switch.
>
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
>
> I think both of these can be "implemented" on the RTL unroller
> side.
You would still need dependence analysis, alias information and IVOPTs to run
again after unrolling. The goal is to remove the per-iteration index increment,
use efficient addressing modes (base+imm) and allow scheduling to move
instructions between iterations. I don't believe the RTL unroller supports any
of this today.
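The goals above, together with point 2 (breaking accumulator dependencies), can be sketched as a hand-unrolled C loop. The function name is hypothetical. Integer data is used so that reassociating the sum does not change the result. The sketch shows one pointer increment per unrolled iteration instead of one index increment per element, loads at constant offsets from a single base (the shape that maps onto base+imm addressing such as `[x0, #8]` on AArch64), and four independent accumulators. The "extra switch" mentioned for the RTL unroller is presumably `-fvariable-expansion-in-unroller`, which creates multiple copies of such variables when unrolling. This sketch does not rely on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n 64-bit values, unrolled by 4.
   - One pointer bump (p += 4) per unrolled iteration, no per-element index.
   - Loads use constant offsets p[0]..p[3] from one base pointer.
   - Four independent accumulators, so each add does not wait on the
     previous one and the scheduler can overlap them. */
uint64_t sum_u64_unroll4(const uint64_t *p, size_t n)
{
    assert(p != NULL);
    uint64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    const uint64_t *end4 = p + (n & ~(size_t)3); /* end of unrolled part */
    const uint64_t *end = p + n;

    for (; p != end4; p += 4) {
        acc0 += p[0];
        acc1 += p[1];
        acc2 += p[2];
        acc3 += p[3];
    }

    /* Combine the partial sums, then handle the 0..3 leftover elements. */
    uint64_t acc = (acc0 + acc1) + (acc2 + acc3);
    for (; p != end; p++)
        acc += *p;
    return acc;
}
```

With floating-point data the same transformation changes rounding. A compiler may therefore only do it when reassociation is explicitly permitted.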