[Bug tree-optimization/88760] GCC unrolling is suboptimal

Thu Jan 24 13:41:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #20 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #19 from Wilco <wilco at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #18)
> 
> > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > suggestion)
> > 
> > If that helps, sure (I'd have guessed uarchs are going to split
> > load-multiple into separate loads, but eventually it avoids
> > load-port contention?)
> 
> Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> 128-bit LDP in a single cycle (see Optimization Guide).
> 
> > > 2) Unrolling and breaking accumulator dependencies.
> > 
> > IIRC RTL unrolling can do this (as side-effect, not as main
> > cost motivation) guarded with an extra switch.
> > 
> > > I think more general unrolling and the peeling associated with it can be
> > > discussed independently of 1) and 2) once we collect more data on more
> > > microarchitectures.
> > 
> > I think both of these can be "implemented" on the RTL unroller
> > side.
> 
> You still need dependence analysis, alias info, ivopt to run again. The goal is
> to remove the increment of the index, use efficient addressing modes (base+imm)
> and allow scheduling to move instructions between iterations. I don't believe
> the RTL unroller supports any of this today.

There's no way to encode load-multiple on GIMPLE that wouldn't be
awkward for other GIMPLE optimizers.

Yes, the RTL unroller supports scheduling (sched runs after unrolling)
and scheduling can do dependence analysis.  Yes, the RTL unroller
does _not_ do dependence analysis at the moment, so if you want to
know beforehand whether you can interleave iterations you have to
actually perform dependence analysis.  Of course you can do that
on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
at the moment).
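
To make the goal quoted above concrete, here is a hand-written C
sketch of a loop unrolled by 4 with a single index increment and
constant offsets.  The function names and the unroll factor are made
up for illustration, and this is not the output of any GCC pass.  The
restrict qualifiers stand in for the alias/dependence information an
unroller would have to establish before iterations can be interleaved.

```c
#include <stddef.h>

/* Reference loop: one index increment per element. */
void scale_simple(int *restrict dst, const int *restrict src,
                  size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hand-unrolled by 4 (illustrative factor).  The index is bumped once
   per four elements and each access is base + constant offset.  All
   loads are issued before the stores, which is only valid because
   restrict promises that dst and src do not overlap. */
void scale_unrolled4(int *restrict dst, const int *restrict src,
                     size_t n, int k)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int a = src[i + 0];
        int b = src[i + 1];
        int c = src[i + 2];
        int d = src[i + 3];
        dst[i + 0] = a * k;
        dst[i + 1] = b * k;
        dst[i + 2] = c * k;
        dst[i + 3] = d * k;
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        dst[i] = src[i] * k;
}
```

Without the no-overlap guarantee, hoisting the four loads above the
stores would not be a legal transformation, which is the part that
needs dependence analysis.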

Note I'm not opposed to having IVOPTs on GIMPLE itself perform
unrolling (I know Bin was against this given IVOPTs is already
so complex) and do accumulator breakup.  But I don't see how
to do the load-multiple thing (yes, you could represent it
as a vector load plus N element extracts on GIMPLE and it
would be easy to teach SLP vectorization to perform this
transform on its own if it is really profitable - which I
doubt you can reasonably argue before RA, let alone on GIMPLE).
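
As a rough source-level analogue of "a vector load plus N element
extracts", combined with accumulator breakup, consider the sketch
below.  It uses the GCC vector extension (vector_size and element
subscripting); the type name v2di_t and the function sum_pairs are
hypothetical, and whether the compiler turns the memcpy into a single
128-bit load or a load pair depends on target and options.  Integer
elements are used so that splitting the sum into two accumulators
cannot change the result; with floating point that does not hold in
general.

```c
#include <stddef.h>
#include <string.h>

/* Two 64-bit lanes in one 16-byte vector (GCC vector extension). */
typedef long long v2di_t __attribute__((vector_size(16)));

long long sum_pairs(const long long *p, size_t n)
{
    long long acc0 = 0, acc1 = 0;   /* two independent accumulators */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        v2di_t v;
        memcpy(&v, p + i, sizeof v);  /* one 16-byte load, no alignment assumed */
        acc0 += v[0];                 /* element extract, lane 0 */
        acc1 += v[1];                 /* element extract, lane 1 */
    }
    if (i < n)                        /* odd trailing element */
        acc0 += p[i];

    return acc0 + acc1;
}
```

The two additions in the loop body do not depend on each other, so the
loop-carried dependency chain is half as long as with a single
accumulator.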