This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [RFC] Tree loop unroller pass
Hi,
On Fri, 16 Feb 2018, Wilco Dijkstra wrote:
> > How so?
>
> I thought it had been well-known for many years that the rtl unroller
> doesn't work properly. In practically all cases where LLVM beats GCC,
> it is due to unrolling small loops.
I thought it was because of vectorizing at -O2, not due to unrolling
itself.
> > To generate more ILP for modern out-of-order processors you need to be
> > able to do followup transforms that remove dependences. So rather than
> > inventing magic params we should look at those transforms and key
> > unrolling on them. Like we do in predictive commoning or other passes
> > that end up performing unrolling as part of their transform.
>
> This is why unrolling needs to be done at the tree level. Alias info is
> correct, addressing modes end up more optimal and the scheduler can now
> interleave the iterations (often not possible after the rtl-unroller due
> to bad alias info).
The better alias info at the gimple level is offset by worse information
about addressing modes, register pressure and precise machine
instructions. The tradeoff between the two approaches is far from
obvious.
> > Our measurements on x86 concluded that unrolling isn't worth it, in
> > fact it very often hurts. That was of course with saner params than
> > the defaults of the RTL unroller.
> >
> > Often you even have to fight with followup passes doing stuff that
> > ends up increasing register pressure too much so we end up spilling.
>
> Yes that's why I mentioned we should only unroll small loops where there
> is always a benefit from reduced loop counter increments and branching.
With modern OOO processors the loop counter increments and branching cost
close to nothing on average, thanks to speculation and bypassing. In fact
a loop cache might even prefer small loops.
> > So _please_ first get testcases we know unrolling will be beneficial
> > on and _also_ have a thorough description _why_.
>
> I'm sure we can find good examples. The why will be obvious just from
> instruction count.
The instruction count is only one part. If the unrolling really enables
followup transformations like cross-iteration redundancy removal, then
it's worthwhile. But for that we already have passes (e.g. predictive
commoning). If the only thing unrolling does is get rid of the N-1 loop
counter test-and-branches, it's not beneficial on OOO processors
(assuming the branch predictor is worth anything).
The reason we tried some unrolling measurements last year was to
establish whether unrolling is worth it on a normal OOO x86-64
processor (I basically wanted to have a case for implementing a generic
unroller on gimple, and replacing the RTL one (!)). The results were so
underwhelming that I dropped the plan.
> > With Ooo CPUs speculatively executing the next iterations I very much
> > doubt that.
>
> OoO execution is like really dumb loop unrolling,
Actually I think of it as a rather clever way of unrolling. The processor
has exact knowledge of (non-)dependencies, even those the compiler
can't prove not to exist.
And if a dependency exists for real (such that an OOO CPU can't ignore
it), how do you propose the compiler magically remove it to get rid of
the involved instructions? That can usually only be done by duplicating
threads of computation, which increases, not decreases, the instruction
count (or leaves it the same). For instance the RTL unroller replaces IV
adjustments like:
i = i + 1; use (i)
i = i + 1; use (i)
with
i1 = i + 1; use (i1)
i2 = i + 2; use (i2)
So the compiler was able to get rid of one dependency chain, but at the
expense of using a new constant and register. A cost analysis needs to
happen to be sure that this is worthwhile.
> you still have all the dependencies between iterations, all the
> branches, all the pointer increments etc. Optimizing those reduces
> instruction counts like vectorization. Fewer instructions implies faster
> code.
Also not true in this generality. We recently had a case where
vectorization hurt very much but it wasn't obvious why. Our current
theory (and we had many) is that because before the vector ALU actually
could start working there had to be eight (strided) loads done per vector,
for multiple streams, and all those loads eventually clogged the
pipeline. The scalar loop (on the OOO CPU) nicely intermingled ALU and
memory ops. In the end Richi got the vector loop to be only 2-3% slower
than the scalar loop (and that was a fairly major effort), even though
from looking at the instructions it executed only 70-80% of the
original scalar dynamic instruction count.
In any case, I'm with Richi on this one about having some dependable
testcases that speed up with a tree unroller, reasons of why this is so,
and some cost model that reflects these reasons at least roughly.
All the above is of course only true for OOO CPUs. For in-order CPUs
unrolling is helpful. But to get that it would be better to work towards
replacing the RTL unroller with a gimple one, if need be, so that we
don't end up with two of them.
Ciao,
Michael.
P.S.: For a very long time I also thought "hmm, a gimple unroller, that
would be nice to have; we could get rid of the RTL one". Until these
experiments last year. Since then I'm basically 180 degrees around ;-)