This is the mail archive of the
mailing list for the GCC project.
Re: [AArch64] A question about Cortex-A57 pipeline description
- From: James Greenhalgh <james dot greenhalgh at arm dot com>
- To: Nikolai Bozhenov <n dot bozhenov at samsung dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Fri, 11 Sep 2015 17:21:18 +0100
- References: <55F2F3D9 dot 9060100 at samsung dot com>
On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:
> Recently I got somewhat confused by Cortex-A57 pipeline description in
> GCC and I would be grateful if you could help me understand a few unclear
> points.
> Particularly I am interested in how memory operations (loads/stores) are
> scheduled. It seems that according to the cortex-a57.md file, firstly, two
> memory operations may never be scheduled at the same cycle and,
> secondly, two loads may never be scheduled at two consecutive cycles:
> ;; 5. Two pipelines for load and store operations: LS1, LS2. The most
> ;; valuable thing we can do is force a structural hazard to split
> ;; up loads/stores.
> (define_cpu_unit "ca57_ls_issue" "cortex_a57")
> (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
> (define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
> (define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
> However, the Cortex-A57 Software Optimization Guide states that the core is
> able to execute one load operation and one store operation every cycle. And
> that agrees with my experiments. Indeed, a loop consisting of 10 loads, 10
> stores and several arithmetic operations takes on average about 10 cycles per
> iteration, provided that the instructions are intermixed properly.
> So, what is the purpose of additional restrictions imposed on the scheduler
> in cortex-a57.md file? It doesn't look like an error. Rather, it looks like a
> deliberate decision.
When designing the model for the Cortex-A57 processor, I was primarily
trying to build a model which would increase the blend of utilized
pipelines on each cycle across a range of benchmarks, rather than to
accurately reflect the constraints listed in the Cortex-A57 Software
Optimisation Guide [1].
My reasoning here is that the Cortex-A57 is a high-performance processor,
and an accurate model would be infeasible to build. Because of this, it is
unlikely that the model in GCC will be representative of the true state of the
processor, and consequently GCC may make decisions which result in an
instruction stream which would bias towards one execution pipeline. In
particular, given a less restrictive model, GCC will try to hoist more
loads to be earlier in the basic block, which can result in poorer
utilization of the other execution pipelines.
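
To make the effect of the quoted reservations concrete, here is a minimal
sketch in Python. It is not GCC's automaton code and it is only an
illustration under stated assumptions: it understands just the "," and "*N"
parts of the reservation syntax, it issues instructions strictly in order,
and it ignores latencies and data dependencies. The RELAXED table is a
hypothetical model (one load unit, one store unit, no shared issue unit)
written only for comparison; it is not taken from cortex-a57.md.

```python
def expand(reservation):
    """Expand e.g. "ca57_ls_issue,ca57_ldr*2" into a list of unit sets.

    Element i holds the units reserved i cycles after issue. Only the ","
    (next cycle) and "*N" (repeat) operators are handled in this sketch.
    """
    cycles = []
    for step in reservation.split(","):
        unit, _, repeat = step.strip().partition("*")
        cycles.extend([{unit.strip()}] * (int(repeat) if repeat else 1))
    return cycles


# Reservations as quoted from cortex-a57.md.
MODEL = {
    "load": expand("ca57_ls_issue,ca57_ldr*2"),
    "store": expand("ca57_ls_issue,ca57_str"),
}

# Hypothetical relaxed model, for comparison only (an assumption).
RELAXED = {
    "load": expand("ca57_ldr"),
    "store": expand("ca57_str"),
}


def conflicts(first, second, distance):
    """True if `second`, issued `distance` cycles after `first`, needs a
    unit that `first` still has reserved."""
    return any(
        first[distance + i] & units
        for i, units in enumerate(second)
        if distance + i < len(first)
    )


def min_distance(model, a, b):
    """Smallest number of cycles between issuing `a` and then `b`."""
    distance = 0
    while conflicts(model[a], model[b], distance):
        distance += 1
    return distance


def issue_cycles(model, kinds):
    """Greedy in-order issue: each instruction takes the earliest cycle, no
    earlier than its predecessor's, at which all of its units are free."""
    busy = {}  # cycle -> set of reserved units
    cycle = 0
    issued = []
    for kind in kinds:
        table = model[kind]
        while any(busy.get(cycle + i, set()) & units
                  for i, units in enumerate(table)):
            cycle += 1
        for i, units in enumerate(table):
            busy.setdefault(cycle + i, set()).update(units)
        issued.append(cycle)
    return issued


if __name__ == "__main__":
    print(min_distance(MODEL, "load", "load"))     # 2
    print(min_distance(MODEL, "load", "store"))    # 1
    print(min_distance(RELAXED, "load", "store"))  # 0
    loop = ["load", "store"] * 10
    print(issue_cycles(MODEL, loop)[-1] + 1)       # 20
    print(issue_cycles(RELAXED, loop)[-1] + 1)     # 10
```

Under these assumptions the quoted reservations keep any two memory
operations at least one cycle apart and any two loads at least two cycles
apart, so the scheduler sees ten interleaved load/store pairs as occupying
20 issue cycles, whereas the hypothetical relaxed table sees 10.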
In my experiments, I found this model to be more beneficial across a range
of benchmarks than a model in which the additional restrictions I imposed
are relaxed.
I'd be happy to consider counter-examples where this modeling produces
suboptimal results - and where the changes you suggest are sufficient to
resolve the issue.
[1]: Cortex-A57 Software Optimisation Guide