This is the mail archive of the mailing list for the GCC project.


Re: [AArch64] A question about Cortex-A57 pipeline description

Thanks for the reply! I see your point. Indeed, I've also seen cases where the
load pipeline was overused at the beginning of a basic block, while at the end
the code got stuck with a bunch of stores and no other instructions to run in
parallel. And indeed, relaxing the restrictions makes things even worse in some
cases. Still, I don't believe this is the best we can do; I'm going to have a
closer look at the scheduler and see what can be done to improve the situation.


On 09/11/2015 07:21 PM, James Greenhalgh wrote:
On Fri, Sep 11, 2015 at 04:31:37PM +0100, Nikolai Bozhenov wrote:

Recently I got somewhat confused by the Cortex-A57 pipeline description in
GCC, and I would be grateful if you could help me understand a few unclear
points.

In particular, I am interested in how memory operations (loads/stores) are
scheduled. It seems that, according to the file, firstly, two
memory operations may never be scheduled in the same cycle and,
secondly, two loads may never be scheduled on two consecutive cycles:

      ;; 5.  Two pipelines for load and store operations: LS1, LS2. The most
      ;;     valuable thing we can do is force a structural hazard to split
      ;;     up loads/stores.

      (define_cpu_unit "ca57_ls_issue" "cortex_a57")
      (define_cpu_unit "ca57_ldr, ca57_str" "cortex_a57")
      (define_reservation "ca57_load_model" "ca57_ls_issue,ca57_ldr*2")
      (define_reservation "ca57_store_model" "ca57_ls_issue,ca57_str")
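As I read these reservations, a load holds ca57_ls_issue for one cycle and
ca57_ldr for two, while a store holds ca57_ls_issue and ca57_str for one
cycle each. A toy greedy simulation of the resulting structural hazards (my
own sketch, not GCC's actual DFA automaton) illustrates the two restrictions:

```python
def schedule(ops):
    """Greedily issue ops in order; return the cycle each op issues on.

    Models the quoted reservations: a load reserves ls_issue for one cycle
    and ldr for two cycles (ca57_ldr*2); a store reserves ls_issue and str
    for one cycle each.
    """
    busy = {"ls_issue": set(), "ldr": set(), "str": set()}  # reserved cycles
    issue_cycles = []
    cycle = 0
    for op in ops:
        while True:
            if op == "load":
                need = {"ls_issue": [cycle], "ldr": [cycle, cycle + 1]}
            else:  # store
                need = {"ls_issue": [cycle], "str": [cycle]}
            if all(c not in busy[u] for u, cs in need.items() for c in cs):
                for u, cs in need.items():
                    busy[u].update(cs)
                issue_cycles.append(cycle)
                break
            cycle += 1
    return issue_cycles

# Two back-to-back loads: the second waits two cycles (the ldr*2 hazard).
print(schedule(["load", "load"]))   # -> [0, 2]
# A load then a store: both need ls_issue, so never in the same cycle.
print(schedule(["load", "store"]))  # -> [0, 1]
```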

However, the Cortex-A57 Software Optimization Guide states that the core is
able to execute one load operation and one store operation every cycle. And
that agrees with my experiments. Indeed, a loop consisting of 10 loads, 10
stores and several arithmetic operations takes on average about 10 cycles per
iteration, provided that the instructions are intermixed properly.
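For what it's worth, the back-of-the-envelope bound agrees with that
measurement: with one load pipeline and one store pipeline, each sustaining
one operation per cycle, the memory operations alone need at least this many
cycles per iteration:

```python
# Lower bound on cycles per iteration from the memory pipelines alone,
# assuming one load pipe and one store pipe, one operation per cycle each.
loads, stores = 10, 10
load_ports, store_ports = 1, 1
min_cycles = max(loads // load_ports, stores // store_ports)
print(min_cycles)  # -> 10
```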

So, what is the purpose of the additional restrictions imposed on the
scheduler in the file? It doesn't look like an error. Rather, it looks like a
deliberate decision.
When designing the model for the Cortex-A57 processor, I was primarily
trying to build a model which would increase the blend of utilized
pipelines on each cycle across a range of benchmarks, rather than to
accurately reflect the constraints listed in the Cortex-A57 Software
Optimisation Guide [1].

My reasoning here is that the Cortex-A57 is a high-performance processor,
and an accurate model would be infeasible to build. Because of this, it is
unlikely that the model in GCC will be representative of the true state of the
processor, and consequently GCC may make decisions which result in an
instruction stream which would bias towards one execution pipeline. In
particular, given a less restrictive model, GCC will try to hoist more
loads to be earlier in the basic block, which can result in poorer
utilization of the other execution pipelines.

In my experiments, I found this model to be more beneficial across a range
of benchmarks than a model in which those additional restrictions were relaxed.
I'd be happy to consider counter-examples where this modeling produces
suboptimal results - and where the changes you suggest are sufficient to
resolve the issue.


[1]: Cortex-A57 Software Optimisation Guide
