[PATCH 2/2] Aarch64: Add branch diluter pass

Fri Jul 24 11:53:47 GMT 2020

Hi!

On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> Correct, it's a sliding window only because the real load address is not
> >> known to the compiler and the algorithm is conservative.  I believe we
> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (at
> >> least) the granule size, then we should be able to insert 'nop aligned
> >> labels' precisely.
> >
> > Yeah, we have similar issues on Power...  Our "granule" (fetch group
> > size, in our terminology) is 32 typically, but we align functions to
> > just 16.  This is causing some problems, but aligning to bigger
> > boundaries isn't a very happy alternative either.  WIP...
> 
> Interesting, I was expecting other CPUs to have a similar mechanism.

On old CPUs (like the 970) there were at most two branch predictions per
cycle.  Nowadays, all branches are predicted; I am not sure when this
changed, but it was a long time ago already.
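
To make the sliding-window point quoted above concrete, here is a rough
sketch (not code from the patch; the function names and the constant
GRANULE_INSNS are made up for illustration, with 8 instructions per
granule assumed).  If the granule offset of the first instruction is
unknown, the worst case over every possible alignment is what a window
sliding one instruction at a time computes:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed granule size in instructions; illustrative only.  */
enum { GRANULE_INSNS = 8 };

/* Highest branch count seen in any run of GRANULE_INSNS consecutive
   instructions.  This is the conservative answer when the load address
   (and so the granule boundary) is unknown.  */
int max_branches_sliding(const bool *is_branch, size_t n)
{
  int in_window = 0, worst = 0;
  for (size_t i = 0; i < n; i++) {
    in_window += is_branch[i];
    if (i >= GRANULE_INSNS)
      in_window -= is_branch[i - GRANULE_INSNS];  /* drop the insn leaving the window */
    if (in_window > worst)
      worst = in_window;
  }
  return worst;
}

/* Highest branch count in any aligned granule, assuming instruction 0
   sits `start` instructions past a granule boundary.  */
int max_branches_aligned(const bool *is_branch, size_t n, size_t start)
{
  int in_granule = 0, worst = 0;
  for (size_t i = 0; i < n; i++) {
    if ((start + i) % GRANULE_INSNS == 0)
      in_granule = 0;                             /* a new granule begins here */
    in_granule += is_branch[i];
    if (in_granule > worst)
      worst = in_granule;
  }
  return worst;
}
```

With four branches at instructions 6..9, the aligned count is 2 when the
function starts on a granule boundary but 4 when it starts two
instructions past one; the sliding count is 4 either way.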

> > (We don't have this exact same problem, because our non-ancient cores
> > can just predict *all* branches in the same cycle).
> >
> >> My main fear is that, given new cores tend to have big granules, code
> >> size would blow up.  One advantage of the implemented algorithm is that
> >> even if slightly conservative it's impacting code size only where a high
> >> branch density shows up.
> >
> > What is "big granules" for you?
> 
> N1 is 8 instructions, so 32 bytes as well.  I guess this may grow further
> (my speculation).

It has to sooner rather than later, yeah.  Or the mechanism has to change
more radically.  Interesting times ahead, I guess :-)
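
For a feel of the code size cost of aligning to the granule, a small
sketch of the padding arithmetic (pad_to_align is a made-up helper name,
not something in GCC; it assumes `align` is a power of two):

```c
#include <stddef.h>

/* Bytes of nop padding needed to move `addr` up to the next multiple of
   `align`.  Assumes `align` is a power of two.  */
size_t pad_to_align(size_t addr, size_t align)
{
  return (align - (addr & (align - 1))) & (align - 1);
}
```

With 4-byte instructions the worst case is align - 4 bytes per aligned
label: 12 bytes for 16-byte alignment, 28 bytes for 32-byte alignment.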

About your patch itself.  The basic idea seems fine (I didn't look too
closely), but do you really need a new RTX class for this?  That is not
very appetising...

Segher