This is the mail archive of the
mailing list for the GCC project.
DFA for PPro, P2, P3
- From: law at redhat dot com
- To: gcc at gcc dot gnu dot org
- Date: Thu, 02 May 2002 16:55:46 -0600
- Subject: DFA for PPro, P2, P3
- Reply-to: law at redhat dot com
This is mainly an informative message for those who may still be struggling
to understand situations where the DFA pipeline description is superior
to the current pipeline description model.
I've been poking at building a DFA description for the PPro, P2 and P3
pipelines. Like the PA descriptions, I'm starting by first trying to build
a DFA description which produces identical schedules to the old description.
This ensures that we can build a DFA description which does not regress
performance-wise, but which can ultimately take advantage of the software
pipeliner in the future. Of course, the hope is that we can go beyond just
modeling the old pipeline model, but being able to guarantee no regression
is a very very nice property to have.
Anyway, I ran into something which I thought was worth sharing with others.
Basically there's an aspect of the PPro/P2/P3 pipelines which we do not
currently model correctly and which trivially modeled using the DFA
First, an over-simplified view of the PPro/P2/P3 pipeline:
Instructions are broken down into one or more uops by the instruction
decoders. We then have a set of execution units to execute the various
uops; we can execute multiple uops in the same cycle by issuing them
to different execution units. For the purposes of this message, we're
only going to worry about the P0 and P1 units (basically ALUs).
Let's consider two instructions in the ready queue. Call them I0 and I1.
Assume that I0 generates one uop which can issue to P0 or P1.
Assume that I1 generates two uops, one of which must go to P0, the other
can go to P0 or P1.
The current description models two functional units. The first is
called P0 and is used for things which must go into P0. The second
is P01, which we claim we have two of -- its used for things which can
go into P0 or P1. We do not model the fact that P0 is actually part of
the P01 unit.
Anyway, assume we fire I0. This uses one of our P01 instances. This leaves
one instance of P01 along with P0 available.
Now we have I1, which uses one instance of P01 and P0. Those are precisely
the resources we have available, so we go ahead and fire I1. [ Note this
is based on haifa's notion of cycles and issue rates, meaning the 4:1:1
uop decoder template is not modeled. ]
Clearly this is not good as we actually over-subscribed the P0 unit with
two uops in a single cycle. We over-subscribed the P0 unit because we
have not accurately described the pipeline. And yes, this actually happens.
With the DFA model this is handled trivially.
First, we model two cpu units. P0 and P1.
Second, instructions which use P0 are marked as reserving P0.
Third, instructions which may use P0 or P1 are marked as reserving P0 or P1.
Viola! We now have a more accurate model of the pipeline and as a result
we know that I2 really shouldn't fire in the same cycle as I1 as it will
over-subscribe the P0 unit.
When I say this is trivial to describe with the DFA scheduler I'm not kidding:
<attribute test to select the insn type for i1>
<attribute test to select the insn type for I2>
Anyway, I'm getting excellent results with my DFA desription for PPro/P2/P3.
As of right now there are only 3 cases where the DFA scheduler generates
different code from the old scheduler.
1. There's a slight difference in how we handle ASMs. Basically the DFA
scheduler treats them as starting a new cycle, whereas the old scheduler
would actually issue them to "no-unit" without starting a new cycle.
I believe the DFA's handling of this case is better, so I won't be
contributing my hack to haifa-sched to make the two schedulers handle
ASMs in an identical manner.
2. There's a slight difference in how we handle USE/CLOBBER insns when
one or more insns have already been issued in the current cycle.
Basically if the DFA appears to be deadlocked, then we we start a
new cycle -- even if the next instruction to fire consumes no
resources. I think the DFA behavior in this case is reasonable,
so like #1 above, I won't be contributing my hack to make the two
schedulers behave identically.
3. There is a performance problem in building some of the tables in
genautomata/genattrtab that is preventing me from modeling
fdiv/fsqrt like the old description. Vlad is working on fixing
the performance issue with an algorithmic change in how the
particular table is built.