znver3 tuning part 1

Richard Biener richard.guenther@gmail.com
Mon Mar 22 12:18:38 GMT 2021


On Mon, Mar 22, 2021 at 12:02 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > > Hi,
> > > I plan to commit some retuning of znver3 codegen that is based on real
> > > hardware benchmarks.  It turns out that there are not too many changes
> > > necessary since Zen3 is quite a smooth upgrade from Zen2.  In summary:
> > >
> > >  - some instructions (like idiv) have shorter latencies.  Adjusting
> > >    costs reduces code size a bit but seems within noise in benchmarks
> > >    (since our cost calculation is quite off anyway, because it does not
> > >    account for register pressure and parallelism, which make a huge
> > >    difference here)
> > >  - gather instructions are still microcoded but a lot faster than in
> > >    znver1/znver2, and it turns out they are now beneficial for a few TSVC
> > >    benchmarks, so I plan to enable them.
> >
> > Can we get a copy of this benchmark to try?
> > We also need to check on bigger benchmarks like SPEC.
>
> Yes, I am also running SPEC.  However, for basic instruction selection
> tuning, smaller benchmarks do quite well.  In general, if there are
> relatively natural loops where gather helps, I think we should enable it
> and try to fix possible regressions (I did not see any in SPEC runs, but
> I plan to do more benchmarking this week).
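
As a concrete illustration (just a sketch; the flags and the expected
codegen are assumptions, not measurements from this thread), the kind of
loop where a gather can pay off is an indexed load:

  /* b[idx[i]] is an indexed (gather) load; with gathers enabled in the
     cost tables the vectorizer can use VGATHER* for it, e.g. with
     something like: gcc -O3 -mavx2 -mtune=znver3  */
  void
  gather_kernel (float *restrict a, const float *restrict b,
                 const int *restrict idx, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] = b[idx[i]];
  }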
>
> I did some work on TSVC mostly because Zen3 is a very smooth update to
> Zen2 for instruction selection (which is already happy with almost
> everything, especially for scalar code) and vectorizer costs seem to be
> the place where we have the most room for improvement.
>
> I briefly analyzed all TSVC kernels where we regress compared to clang,
> AOCC and ICC.  You can search for TSVC in bugzilla.  Richard also wrote
> some observations there.  These are related to missing features rather
> than the cost model, however.
>
> One problem with TSVC is that it is FP-only.  I hacked it for integer,
> but it would be nice to have something else as well.
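
For illustration only - a made-up analogue, not a kernel from the suite
itself - the integer variant of a typical TSVC-style kernel is just the
same loop over int arrays:

  #define LEN 32000
  int a[LEN], b[LEN], c[LEN];

  /* Integer counterpart of a simple TSVC-style kernel; the real suite
     operates on float arrays.  */
  void
  s_int (void)
  {
    for (int i = 0; i < LEN; i++)
      a[i] = b[i] + c[i];
  }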
> >
> > >
> > >    It seems we missed revisiting this for znver2 tuning.
> > >    I think even for znver2 it may make sense to re-enable them, so I
> > >    will benchmark this as well.
> > >  - memcpy/memset expansion seems to work the same way as for znver2,
> > >    so I am keeping the same changes.
> > >  - the instruction scheduler is already modified in trunk to some
> > >    degree to reflect the new units.  The problem with instruction
> > >    scheduling is that it treats Zen as an in-order CPU and is unlikely
> > >    to fill all execution resources this way.
> > >    We may want to try to model the out-of-order nature the same way
> > >    LLVM does, but on the other hand the current scheduling logic seems
> > >    to do mostly fine (i.e. not worse than LLVM's).  What matters is
> > >    to schedule for long latencies and just after branch boundaries,
> > >    where the simplified model seems to do just fine.
> >
> > So we can keep the existing model for znver3 for GCC 11?
>
> I think so - I experimented with making the model a bit more precise and
> it does not seem to add any performance improvement, while it makes the
> automaton a lot bigger.  The existing model already handles the updated
> Zen3 latencies...
>
> I think the only possible improvement here would be to start explicitly
> modelling the out-of-order nature, but even then I am not sure how much
> benefit that can bring (given that we are limited to relatively small
> basic blocks and do not have a lot of the information needed to model
> the execution precisely).  Do you have some opinions on this?

I think it makes sense to model instruction fetch quite precisely
(including, or rather either/or, fetch from the uop cache) up to where
OOO starts.  From there on backwards, only very long latency insns
and of course insn dependences should be a factor, to maximise
issue width per fetch block.  I am not sure whether it makes sense to
model the uop cache at all, or whether we should switch between the
L1-fetch and uop-cache assumptions based on loop depth.

That said, for loops scheduling is somewhat moot, but for cold
(in terms of the OOO window size) serial code it makes sense to
optimize for uop issue.  I also note that this seems to work out
quite well with the existing automata - if only as a side effect.
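
To make the long-latency point concrete, here is a small made-up example
(not taken from any benchmark): the divide has a long latency, so what
matters is getting the independent work issued between the divide and its
use, which the simplified in-order model already handles:

  /* The idiv result is needed only at the end; scheduling the
     independent adds between the divide and its use hides part of the
     divide latency even without modelling the OOO core.  */
  int
  fill_latency (int x, int y, const int *p)
  {
    int q = x / y;          /* long-latency divide */
    int s = p[0] + p[1];    /* independent work */
    int t = p[2] + p[3];
    return q + s + t;
  }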

Richard.

> Honza

