[PATCH, GCC, AArch64] Fix PR88398 for AArch64

Mon Nov 18 08:36:00 GMT 2019

On Fri, Nov 15, 2019 at 6:13 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Richard,
>
> > So what do we actually do unpatched with -funroll-loops here?
>
> Yes so it does the insane "fully unrolled trailing loop before the unrolled
> loop" thing. One always does the trailing loop last (and typically as an
> actual loop of course) and then the code ends up much faster, close to
> the ideal version shown in the PR.

Well, you can't do the unrolled loop first unless you keep all exit tests.
Not keeping them is the whole point of unrolling!

> > When I force "stupid" unrolling the unrolled
> > part is clearly worse and I see no benefit from unrolling here
> > (eventually over the histogram of the number of iterations the
> > repeated exits are now better predicted because they are duplicated).
>
> For these kinds of loops, stupid unrolling is clearly better than the
> default unrolling, both in size and in performance. For the example
> we only ever execute part of the "trailing" loop, and never enter the
> unrolled main loop!

Well, then you don't want unrolling you want peeling.  You'd be
actually happy with four peeled iterations and then the regular,
not unrolled loop at the tail.

> > I wonder if structuring  it more like the "stupid" case would be
> > better, thus
>
> The key thing is to avoid emitting an unrolled trailing loop before
> the actual unrolled loop. The stupid strategy can already do that,
> hence it often faster than unrolling badly.

The stupid strategy is what it says - stupid.

We've removed the peeling code from RTL a few releases back
but I do agree that the emitted code for the peeled copies for
the runtime algorithm is "bad" - it doesn't match what the comment
indicates (a switch, implying a jump table) nor does it do what
one would trivially expect ("stupid" peeling).

> > that would cater for your case, does not require -funroll-all-loops
> > or a new target hook.  It might be that all that is needed is
> > reordering of basic blocks with the current scheme to decrease
> > branch density.  Alternatively limit the unroll factor or pad
> > the prologue jumps to make the branch predictor happy.
>
> We shouldn't treat symptoms but fix the underlying problems.

Sure, which is why I suggest to change how we emit the
prologue here.  We can select the variant of the prologue
with a target hook based on preference for example, between
doing it peeling-like (which you prefer), using a scheme
like current (preferably in some optimized form).

As said, it might be layouting the prologue in a better way is
all that is needed.

Richard.

> Wilco