[PATCH] rs6000: Refine RTL unroll adjust hook

Segher Boessenkool segher@kernel.crashing.org
Wed Jul 8 23:13:18 GMT 2020


On Wed, Jul 08, 2020 at 11:39:56AM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > I am not happy about what is considered "a complex loop" here.
> For early exit, which may cause and *next* unrolled iterations may be
> not executed, then unroll may be not benifit.

Yes, and it can well result in worse branch prediction.

> For too many branches (this patch say 20% of insns), may cause more
> branch-misses even for unrolled loop.
> 
> >From my intial intuition, I once think each condition may *define* loop
> as complex. :-)
> 
> But for each single condition, loop unrolling may still be helpful.
> While, if these conditions are all occur in a loop, it would be more
> possible to get negative impacts after unrolled.

Yes, but how many loops have *all* these conditions?  That is my problem
with it: it is only tested with one specific loop, and only benefits
that loop.

> > (A PARAM would be nice, but too many of those isn't actually useful
> > either...  Next time, add one as soon as writing the code, at least it
> > is useful at that point in time, when you still need to experiment with
> > it :-) )
> Yes, with PARAM, we can just change param to experiment:
> "if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located
> n one cache line, it may not need to unroll. But, after some tunning, I
> chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM.  2 PARAMs
> (one for 6, one for 10) may be ok.  Any comments?

RTL insns are not one-to-one with machine insns.  One important reason
to unroll is for small loops, where we need to hide the latency of a
fetch redirect (say, three cycles); this is complicated by most insns
having a 2 cycle latency (or more for FP and such), and by different
SMT modes changing stuff here, oh and different CPUs of course.

So it is very important to unroll small loops enough, and it can be
beneficial for larger loops too, but it also hurts (in general, and
more for bigger loops, or loops with calls, or jumps).

It's not a science, more an art.  You'll just have to find something
that works well in practice.  But not something that looks at very
special cases only, preferably.

> > So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
> > reduce the unrolling, and only in very specific cases.
> >
> > Is there no middle road possible?  Say, don't unroll to more than 25
> > insns total (which is what the "only small loops" does, sort of -- it
> > also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
> > calls, and not to more than 4 branches (I'm making up those numbers, of
> > course, and PARAMS would be helpful).  Some of this already does exist,
> > and might need retuning for us?
> It may make sense. There are param_max_unrolled_insns,
> param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add
> calls number and branches number checking for rtl unroller.

Hrm yes, that may be generally useful even.

> While, actually, here we would need condition to define *complex* loop,
> where contains call exist (may just 1), branch exist(may 2) and early
> exit(may 1) at the same time, but each number is not large.
> Any sugguestions? Thanks.

How many loops have you seen where all those conditions are true, but
the generic code still decides to unroll things?


Segher


More information about the Gcc-patches mailing list