[PR91598] Improve autoprefetcher heuristic in haifa-sched.c

Thu Aug 29 18:18:00 GMT 2019

On Thu, 29 Aug 2019, Maxim Kuvyrkov wrote:

> >> r1 = [rb + 0]
> >> <math with r1>
> >> r2 = [rb + 8]
> >> <math with r2>
> >> r3 = [rb + 16]
> >> <math with r3>
> >> 
> >> which, apparently, cortex-a53 autoprefetcher doesn't recognize.  This
> >> schedule happens because r2= load gets lower priority than the
> >> "irrelevant" <math with r1> due to the above patch.
> >> 
> >> If we think about it, the fact that "r1 = [rb + 0]" can be scheduled
> >> means that true dependencies of all similar base+offset loads are
> >> resolved.  Therefore, for autoprefetcher-friendly schedule we should
> >> prioritize memory reads before "irrelevant" instructions.
> > 
> > But isn't there also max number of load issues in a fetch window to consider? 
> > So interleaving arithmetic with loads might be profitable. 
> 
> It appears that cores with autoprefetcher hardware prefer loads and stores
> bundled together, not interspersed with other instructions to occupy the rest
> of CPU units.

Let me point out that the motivating example has a bigger effect in play:

(1) r1 = [rb + 0]
(2) <math with r1>
(3) r2 = [rb + 8]
(4) <math with r2>
(5) r3 = [rb + 16]
(6) <math with r3>

here Cortex-A53, being an in-order core, cannot issue the load at (3) until
after the load at (1) has completed, because the use at (2) depends on it.
The good schedule allows the three loads to issue in a pipelined fashion.

So essentially the main issue is not a hardware peculiarity, but rather the
bad schedule being totally wrong (it could only make sense if loads had 1-cycle
latency, which they do not).

I think this highlights how implementing this autoprefetch heuristic via the
dfa_lookahead_guard interface looks questionable in the first place, but the
patch itself makes sense to me.

Alexander