This is the mail archive of the
mailing list for the GCC project.
RE: [AArch64] A question about Cortex-A57 pipeline description
- From: Evandro Menezes <e dot menezes at samsung dot com>
- To: 'Nikolai Bozhenov' <n dot bozhenov at samsung dot com>, 'James Greenhalgh' <james dot greenhalgh at arm dot com>
- Cc: gcc at gcc dot gnu dot org
- Date: Tue, 15 Sep 2015 10:47:56 -0500
- Subject: RE: [AArch64] A question about Cortex-A57 pipeline description
- Authentication-results: sourceware.org; auth=none
- References: <55F2F3D9 dot 9060100 at samsung dot com> <20150911162118 dot GA5279 at arm dot com> <55F676F2 dot 5030105 at samsung dot com>
Indeed, we observed some problems with scheduling which we believe have more
to do with the scheduling algorithm than with the model DFA, as we said in
Evandro Menezes Austin, TX
> -----Original Message-----
> From: email@example.com [mailto:firstname.lastname@example.org] On Behalf Of
> Nikolai Bozhenov
> Sent: Monday, September 14, 2015 2:28
> To: James Greenhalgh
> Cc: email@example.com
> Subject: Re: [AArch64] A question about Cortex-A57 pipeline description
> Thanks for the reply! I see your point. Indeed, I've also seen cases where
> the load pipeline was overused at the beginning of a basic block, whereas at
> the end the code got stuck with a bunch of stores and no other instructions
> to run in parallel. And indeed, relaxing the restrictions makes things even
> worse in some cases. Anyway, I don't believe it's the best we can do. I'm
> going to have a closer look at the scheduler and see what can be done to
> improve the situation.
> On 09/11/2015 07:21 PM, James Greenhalgh wrote:
> > On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:
> >> Hi!
> >> Recently I got somewhat confused by the Cortex-A57 pipeline description
> >> in GCC and I would be grateful if you could help me understand a few
> >> unclear points.
> > Sure,
> >> Particularly I am interested in how memory operations (loads/stores)
> >> are scheduled. It seems that according to the cortex-a57.md file,
> >> firstly, two memory operations may never be scheduled in the same
> >> cycle and, secondly, two loads may never be scheduled in two
> >> consecutive cycles:
> >> ;; 5. Two pipelines for load and store operations: LS1, LS2. The most
> >> ;;    valuable thing we can do is force a structural hazard to split
> >> ;;    up loads/stores.
> >> (define_cpu_unit "ca57_ls_issue" "cortex_a57")
> >> (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
> >> (define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
> >> (define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
> >> However, the Cortex-A57 Software Optimization Guide states that the
> >> core is able to execute one load operation and one store operation
> >> every cycle. And that agrees with my experiments. Indeed, a loop
> >> consisting of 10 loads, 10 stores and several arithmetic operations
> >> takes on average about 10 cycles per iteration, provided that the
> >> instructions are intermixed properly.
> >> So, what is the purpose of the additional restrictions imposed on the
> >> scheduler in the cortex-a57.md file? It doesn't look like an error.
> >> Rather, it looks like a deliberate decision.
> > When designing the model for the Cortex-A57 processor, I was primarily
> > trying to build a model which would increase the blend of utilized
> > pipelines on each cycle across a range of benchmarks, rather than to
> > accurately reflect the constraints listed in the Cortex-A57 Software
> > Optimisation Guide [1].
> > My reasoning here is that the Cortex-A57 is a high-performance
> > processor, and an accurate model would be infeasible to build. Because
> > of this, it is unlikely that the model in GCC will be representative
> > of the true state of the processor, and consequently GCC may make
> > decisions which result in an instruction stream which would bias
> > towards one execution pipeline. In particular, given a less
> > restrictive model, GCC will try to hoist more loads to be earlier in
> > the basic block, which can result in poorer utilization of the other
> > execution pipelines.
> > In my experiments, I found this model to be more beneficial across a
> > range of benchmarks than a model with the additional restrictions I
> > I'd be happy to consider counter-examples where this modeling produces
> > suboptimal results - and where the changes you suggest are sufficient
> > to resolve the issue.
> > Thanks,
> > James
> > ---
> > [1]: Cortex-A57 Software Optimisation Guide
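
To make the effect of the two reservations quoted above concrete, here is a
small Python sketch. It is not GCC's DFA code. It assumes a greedy in-order
issue with no issue-width limit, and it only understands the "," (next cycle)
and "*N" (repeat for N cycles) parts of the reservation syntax. The names
parse_reservation, schedule, GCC_MODEL and RELAXED_MODEL are made up for this
illustration, and RELAXED_MODEL is a hypothetical model that only mirrors the
"one load and one store per cycle" statement; it is not a proposed patch.

```python
from collections import defaultdict


def parse_reservation(regexp):
    """Expand e.g. "a,b*2" into [{"a"}, {"b"}, {"b"}], one set per cycle.

    Only "," and "*N" are handled; this is a simplified subset of the
    machine-description reservation syntax.
    """
    cycles = []
    for part in regexp.split(","):
        unit, _, repeat = part.strip().partition("*")
        cycles.extend({unit.strip()} for _ in range(int(repeat or 1)))
    return cycles


def schedule(insns, models):
    """Greedy in-order issue: return the issue cycle of each instruction."""
    busy = defaultdict(set)  # cycle -> units already reserved in that cycle
    issue_cycles = []
    cycle = 0  # never moves backwards, so issue stays in program order
    for insn in insns:
        reservation = models[insn]
        # Delay the instruction until none of its units is already taken.
        while any(units & busy[cycle + i] for i, units in enumerate(reservation)):
            cycle += 1
        for i, units in enumerate(reservation):
            busy[cycle + i] |= units
        issue_cycles.append(cycle)
    return issue_cycles


# The two reservations as quoted from cortex-a57.md in this thread.
GCC_MODEL = {
    "load": parse_reservation("ca57_ls_issue,ca57_ldr*2"),
    "store": parse_reservation("ca57_ls_issue,ca57_str"),
}

# Hypothetical relaxed model (an assumption for comparison only).
RELAXED_MODEL = {
    "load": parse_reservation("ca57_ldr"),
    "store": parse_reservation("ca57_str"),
}

if __name__ == "__main__":
    print(schedule(["load", "load"], GCC_MODEL))   # [0, 2]
    print(schedule(["load", "store"], GCC_MODEL))  # [0, 1]
    mixed = ["load", "store"] * 10
    print(schedule(mixed, GCC_MODEL)[-1] + 1)      # 20 issue cycles
    print(schedule(mixed, RELAXED_MODEL)[-1] + 1)  # 10 issue cycles
```

Under these assumptions, the shared ca57_ls_issue unit keeps any two memory
operations out of the same cycle, and ca57_ldr*2 keeps two loads at least two
cycles apart. Ten interleaved loads and stores therefore need 20 issue cycles
in the quoted model, against 10 in the relaxed one. This only shows what the
DFA permits; it says nothing about which model gives better code in practice.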