This is the mail archive of the
mailing list for the GCC project.
Re: [AArch64] A question about Cortex-A57 pipeline description
- From: Nikolai Bozhenov <n dot bozhenov at samsung dot com>
- To: James Greenhalgh <james dot greenhalgh at arm dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Mon, 14 Sep 2015 10:27:46 +0300
- Subject: Re: [AArch64] A question about Cortex-A57 pipeline description
- Authentication-results: sourceware.org; auth=none
- References: <55F2F3D9 dot 9060100 at samsung dot com> <20150911162118 dot GA5279 at arm dot com>
Thanks for the reply! I see your point. Indeed, I've also seen cases where the
load pipeline was overused at the beginning of a basic block, whereas at the end
the code got stuck with a bunch of stores and no other instructions to run in
parallel. And indeed, relaxing the restrictions makes things even worse in some
cases. Anyway, I don't believe it's the best we can do. I'm going to have a
closer look at the scheduler and see what can be done to improve the situation.
On 09/11/2015 07:21 PM, James Greenhalgh wrote:
On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:
Recently I got somewhat confused by the Cortex-A57 pipeline description in GCC.
I would be grateful if you could help me understand a few unclear points.
Particularly I am interested in how memory operations (loads/stores) are
scheduled. It seems that according to the cortex-a57.md file, firstly, two
memory operations may never be scheduled in the same cycle and,
secondly, two loads may never be scheduled in two consecutive cycles:
;; 5. Two pipelines for load and store operations: LS1, LS2. The most
;; valuable thing we can do is force a structural hazard to split
;; up loads/stores.
(define_cpu_unit "ca57_ls_issue" "cortex_a57")
(define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
(define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
(define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
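To make the two observations concrete, here is a minimal Python sketch of how these reservation strings can be read, assuming the usual machine-description semantics where "," advances one cycle and "*N" repeats a unit for N consecutive cycles. The helper names (expand, earliest_issue, schedule) are hypothetical, and the sketch models only the three units quoted here, not GCC's real scheduler automaton.

```python
# Only the unit names and reservation strings come from cortex-a57.md;
# every function name below is a hypothetical helper for illustration.

def expand(reservation):
    """Turn 'a,b*2' into a per-cycle unit list: ['a', 'b', 'b']."""
    cycles = []
    for element in reservation.split(","):
        unit, _, repeat = element.strip().partition("*")
        cycles.extend([unit] * (int(repeat) if repeat else 1))
    return cycles

MODELS = {
    "load": expand("ca57_ls_issue,ca57_ldr*2"),
    "store": expand("ca57_ls_issue,ca57_str"),
}

def earliest_issue(kind, busy, start):
    """First cycle >= start where `kind` fits without a unit conflict."""
    cycle = start
    while any((cycle + offset, unit) in busy
              for offset, unit in enumerate(MODELS[kind])):
        cycle += 1
    return cycle

def schedule(kinds):
    """Issue instructions in program order; return each issue cycle."""
    busy, issued, cycle = set(), [], 0
    for kind in kinds:
        cycle = earliest_issue(kind, busy, cycle)
        busy.update((cycle + offset, unit)
                    for offset, unit in enumerate(MODELS[kind]))
        issued.append(cycle)
    return issued

print(schedule(["load", "store"]))  # both need ca57_ls_issue in cycle 0
print(schedule(["load", "load"]))   # second load collides on ca57_ldr
```

Under these assumptions a load followed by a store issues in cycles 0 and 1, because both reserve ca57_ls_issue in their first cycle, and two loads issue in cycles 0 and 2, because the second load would otherwise need ca57_ldr while the first still holds it.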
However, the Cortex-A57 Software Optimization Guide states that the core is
able to execute one load operation and one store operation every cycle. And
that agrees with my experiments. Indeed, a loop consisting of 10 loads, 10
stores and several arithmetic operations takes on average about 10 cycles per
iteration, provided that the instructions are intermixed properly.
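The gap between the two views can be put into numbers with a simple resource bound. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name min_cycles and its parameters are hypothetical, it ignores the arithmetic instructions and all latencies, and the second call reflects only what the scheduler model assumes, not what the hardware does.

```python
import math

def min_cycles(loads, stores, loads_per_cycle, stores_per_cycle,
               mem_ops_per_cycle):
    """Lower bound on issue cycles imposed by the load/store limits alone."""
    return max(
        math.ceil(loads / loads_per_cycle),
        math.ceil(stores / stores_per_cycle),
        math.ceil((loads + stores) / mem_ops_per_cycle),
    )

# Hardware view as described in this message:
# one load and one store can execute every cycle.
hardware = min_cycles(10, 10, loads_per_cycle=1, stores_per_cycle=1,
                      mem_ops_per_cycle=2)

# Scheduler-model view as read from cortex-a57.md:
# one memory operation per cycle, one load every two cycles.
gcc_model = min_cycles(10, 10, loads_per_cycle=0.5, stores_per_cycle=1,
                       mem_ops_per_cycle=1)

print(hardware, gcc_model)
```

With these inputs the hardware bound is 10 cycles per iteration, matching the measurement above, while the scheduler model believes the same loop body needs at least 20 issue cycles.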
So, what is the purpose of the additional restrictions imposed on the scheduler
in the cortex-a57.md file? It doesn't look like an error. Rather, it looks like a
When designing the model for the Cortex-A57 processor, I was primarily
trying to build a model which would increase the blend of utilized
pipelines on each cycle across a range of benchmarks, rather than to
accurately reflect the constraints listed in the Cortex-A57 Software
Optimisation Guide [1].
My reasoning here is that the Cortex-A57 is a high-performance processor,
and an accurate model would be infeasible to build. Because of this, it is
unlikely that the model in GCC will be representative of the true state of the
processor, and consequently GCC may make decisions which result in an
instruction stream which would bias towards one execution pipeline. In
particular, given a less restrictive model, GCC will try to hoist more
loads towards the start of the basic block, which can result in poorer
utilization of the other execution pipelines.
In my experiments, I found this model to be more beneficial across a range
of benchmarks than a model in which the additional restrictions I imposed are
relaxed.
I'd be happy to consider counter-examples where this modeling produces
suboptimal results - and where the changes you suggest are sufficient to
resolve the issue.
[1]: Cortex-A57 Software Optimisation Guide