This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: [RFC] alpha/ev6: model 1-cycle cross-cluster delay
- From: Richard Henderson <rth at redhat dot com>
- To: Matt Turner <mattst88 at gmail dot com>
- Cc: gcc at gcc dot gnu dot org, Richard Henderson <rth at twiddle dot net>, Michael Cree <mcree at orcon dot net dot nz>, Uros Bizjak <ubizjak at gmail dot com>
- Date: Thu, 26 May 2011 10:12:24 -0700
- Subject: Re: [RFC] alpha/ev6: model 1-cycle cross-cluster delay
- References: <20110525035240.GA29629@localhost.mattst88>
On 05/24/2011 08:52 PM, Matt Turner wrote:
> Alpha EV6 and newer can execute four instructions per cycle if correctly
> scheduled. The architecture has two clusters {0, 1}, each with its own
> register file. In each cluster, there are two slots {upper, lower}. Some
> instructions only execute from either upper or lower slots.
>
> Register values produced in one cluster take 1 cycle to appear in the
> other cluster, so improperly scheduled instructions may incur a cross-
> cluster delay.
Given the lack of control over how insns are dispatched to clusters, this
is essentially an intractable problem. One can manage clusters only in
extremely rare situations in hand-tuned assembly. Namely:
(1) One has to start with an empty re-order queue, such as on transition
to/from PALcode, at the beginning of an align 16 block of code.
(2) One has to pad with lots of nearly-nops in order to keep the dispatch
to the various pipelines aligned with the programmer's idea of how
dispatch is occurring.
> - The CWG lists the latency of unconditional branches and jsr/call
> instructions as 3, whereas we have 1. I guess this latency value is
> only meaningful if the instruction produces a value? I'm a bit
> confused by this value in the CWG since it lists the latency of
> conditional branches as N/A, while these other types of branches as
> 3, although none produce a register value.
They produce a value: the return address. In most unconditional branches
it is written to $31 (the zero register), but it's still there.
> - I also see that fadd/fcmov/fmul instructions take an extra two cycles
> when the consumer is fst/ftoi, so something similar should be added
> for them. Can a (define_bypass ...) function specify a latency value
> greater than the default latency?
Yes.
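A minimal sketch of such a bypass, to illustrate the point. The reservation
names (ev6_fadd, ev6_fmul, ev6_fst, ev6_ftoi) and the default latency of 4
are assumptions made for the example, not taken from this thread; check them
against the actual define_insn_reservation entries in ev6.md.

```
;; Assumed: "ev6_fadd" and "ev6_fmul" are reservations declared with a
;; default latency of 4 (hypothetical names and value for this sketch).
;; The first operand of define_bypass is the latency seen by the listed
;; consumers; here it is 6, i.e. two cycles more than the assumed default,
;; when the result feeds a floating-point store or an ftoi transfer.
(define_bypass 6 "ev6_fadd,ev6_fmul" "ev6_fst,ev6_ftoi")
```

The general form is (define_bypass number out_insn_names in_insn_names
[guard]), where both name lists are comma-separated reservation names.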
r~