This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: [RFC] alpha/ev6: model 1-cycle cross-cluster delay

From: Richard Henderson <rth at redhat dot com>
To: Matt Turner <mattst88 at gmail dot com>
Cc: gcc at gcc dot gnu dot org, Richard Henderson <rth at twiddle dot net>, Michael Cree <mcree at orcon dot net dot nz>, Uros Bizjak <ubizjak at gmail dot com>
Date: Thu, 26 May 2011 10:12:24 -0700
Subject: Re: [RFC] alpha/ev6: model 1-cycle cross-cluster delay
References: <20110525035240.GA29629@localhost.mattst88>

On 05/24/2011 08:52 PM, Matt Turner wrote:
> Alpha EV6 and newer can execute four instructions per cycle if correctly
> scheduled. The architecture has two clusters {0, 1}, each with its own
> register file. In each cluster, there are two slots {upper, lower}. Some
> instructions only execute from either upper or lower slots.
> 
> Register values produced in one cluster take 1 cycle to appear in the
> other cluster, so improperly scheduled instructions may incur a cross-
> cluster delay.

Given the lack of control of how insns are dispatched to clusters, this
is essentially an intractable problem.  One can manage clusters only in
extremely rare situations in hand-tuned assembly.  Namely:

(1) One has to start with an empty re-order queue, such as on transition
    to/from PALcode, at the beginning of a 16-byte-aligned block of code.
(2) One has to pad with lots of nearly-nops in order to keep the dispatch
    to the various pipelines aligned with the programmer's idea of how
    dispatch is occurring.

>  - The CWG lists the latency of unconditional branches and jsr/call
>    instructions as 3, whereas we have 1. I guess this latency value is
>    only meaningful if the instruction produces a value? I'm a bit
>    confused by this value in the CWG since it lists the latency of
>    conditional branches as N/A, while these other types of branches as
>    3, although none produce a register value.

They produce a value -- the return address.  It's written to $31 (and so
discarded) in most unconditional branches, but it's still there.

>  - I also see that fadd/fcmov/fmul instructions take an extra two cycles
>    when the consumer is fst/ftoi, so something similar should be added
>    for them. Can a (define_bypass ...) function specify a latency value
>    greater than the default latency?

Yes.
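
A minimal sketch of what such a bypass could look like.  The reservation
names ("ev6_fadd", "ev6_fst", "ev6_ftoi"), the unit name ("ev6_fa") and
the default latency of 4 are assumptions for illustration and should be
checked against the actual ev6.md; only the define_bypass form itself
(latency, producer reservations, consumer reservations) is standard.

```
;; Hypothetical producer reservation with an assumed default latency of 4.
(define_insn_reservation "ev6_fadd" 4
  (and (eq_attr "tune" "ev6")
       (eq_attr "type" "fadd"))
  "ev6_fa")

;; When the consumer is an fst or ftoi reservation (names assumed), the
;; result is ready two cycles later than the default: 4 + 2 = 6.
(define_bypass 6 "ev6_fadd" "ev6_fst,ev6_ftoi")
```

The same pattern would be repeated for the fmul and fcmov reservations,
each with its own default latency plus two.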


r~

