[PATCH]: GCC Scheduler support for R10000 on MIPS
Kumba
kumba@gentoo.org
Tue Aug 5 02:48:00 GMT 2008
Richard Sandiford wrote:
>
> In general, the thing to do is compare the post-patch results with the
> pre-patch results.
Gotcha, I'll keep this in mind for when I roll the final patch.
> Sure. Just create issue and completion cpu_units and add them to
> the insn resservations. If you're interested in doing this, have a look
> at other scheduler descriptions for inspiration. (Not just MIPS ones.)
Hmm, the catch is in how I think, I've largely modeled this R10k pipeline
descriptor spatially in my head, according to the block diagram the manual gives
and factoring in some of the errata notes. Hence why I initially created three
automata for the three pipelines (int queue, fp queue, and address queue), and
then defined the five cpu units as the five main blocks (alu1, alu2 (fed by int
queue); fp-multiply, fp-add (fed by fp queue); and load/store (fed by address
queue)).
The multiplier gets tricky because it's in its own right subdivided into
issue/completion, and fp-divide, fp-sqrt, and fp-multiply components, the latter
three of which run parallel to each it other. It's when you define all these
cpu units, that all tie to the fp queue, that the state count launches itself to
the moon.
So might it be more reasonable to create the fp multiplier as a separate
automata from the fp-queue (and fp adder), and define cpu_units to match the
issue and completion stages, multiply, divide, and square root stages? Then we
can multiply like this:
(define_insn_reservation "r10k_fp_miscmul" 2
(and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
(eq_attr "type" "fmul,fmove"))
"r10k_mpy_issue + r10k_fpmpy + r10k_mpr_compl")
And fdiv and fsqrts like this:
(define_insn_reservation "r10k_fdiv_single" 12
(and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
(and (eq_attr "type" "fdiv,frdiv")
(eq_attr "mode" "SF")))
"(r10k_mpy_issue + r10k_fpdiv + r10k_mpy_compl) * 14")
(define_insn_reservation "r10k_fsqrt_single" 18
(and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
(and (eq_attr "type" "fsqrt")
(eq_attr "mode" "SF")))
"(r10k_mpy_issue + r10k_fpsqrt + r10k_mpy_compl) * 20")
This should setup the "sharing" of the multiplier's issue and completion units,
right?
> You need to add the names of the target insn reservations too.
> Probably the ones for mfhilo.
>
> In fact, if you split the mfhilo reservation into two, one for mfhi and
> one for mflo, you could avoid the predicate. (And yes, you'd be using
> lo_operand to do the split; see sb1.md for how. But it's the bypass
> that's the key.)
Okay, this might be easier than the predicates. How does something like this look?:
(define_insn_reservation "r10k_mfhi" 1
(and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
(and (eq_attr "type" "mfhilo")
(not (match_operand 1 "lo_operand"))))
"r10k_alu1 | r10k_alu2")
(define_insn_reservation "r10k_mflo" 1
(and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
(and (eq_attr "type" "mfhilo")
(match_operand 1 "lo_operand")))
"r10k_alu1 | r10k_alu2")
And then for the bypasses:
(define_bypass 6 "r10k_imul_single" "r10k_mfhi")
(define_bypass 10 "r10k_imul_double" "r10k_mfhi")
(define_bypass 35 "r10k_idiv_single" "r10k_mfhi")
(define_bypass 67 "r10k_idiv_double" "r10k_mfhi")
We'll set the default latency for the insn_reservations to the LO latency, and
use bypasses for the HI latency. Sound kosher?
I also assume we need not worry about the mthilo insn, right?
> MULTU and DMULTU are classed as IMUL. If you want to split it into
> signed and unsigned -- a reasonable thing to do -- you need to add
> a new value to the "type" attribute. You'd then need to update all
> uses of IMUL in config/mips to check for both the signed and unsigned
> versions.
>
> If you do this, please do the split as a separate patch. It's easier
> to review that way.
Reasonable, but that would probably take a while for me to research on. I'm
still fairly new to gcc internals, and still don't quite know my way around
things. Defining entirely new types is not something I'll figure out very
easily. Probably worth looking at after getting this patch in.
So I'll ask this then. The R10K Manual states that MULT latency is 5/6 (Lo/Hi)
while MULTU is 6/7 (Lo/Hi). If the "imul" type combines both MULT and MULTU,
then should we compromise on the latency and use 6 instead of 5 or 7? Later on,
if I work out a way to add that type (or someone else does), then we can
re-tweak 10000.md to relfect the change.
On a different thought, though, does a predicate exist like lo_operand that can
determine signed versus unsigned? Seems like it could be used to create
insn_reservations of say, r10k_imul_single_signed and r10k_imul_single_unsigned,
set their default latencies to the LO register, then use more define_bypasses on
the mfhi bit to set the HI latencies.
> Doesn't have to be commerical, and most benchmarks aren't hugely
> system-specific. As always with these things: pick something
> (a program or a benchmark) that matters to you. Maybe an
> example of your typical workload, or whatever.
The typical workload on this box is compiling, and pretty much anything that
speeds up gcc is a blessing. I'm still hunting for one of those elusive dual
R14K modules for this machine (at a reasonable price, which $2,000 is not).
> The R10K doesn't have integer multiply-accumulate instructions.
> I.e. no IMADD. MADD == MADD.fmt, i.e. FMADD.
Okay, I moved imadd down to the reservation that handles unknown and multi as a
precaution. (i.e., Murphy's law)
> Not sure whether the patch you attached was the latest one.
> It didn't mention MOVE, FPDIV1, etc. And like I say, "cpu"
> must exactly mirror "enum processor_type", so you need to
> remove "r12000", "r14000" and "r16000" from "cpu" too.
Hmm, it might've not been. I'll double check before I send the next one.
And by removing r12000-r16000, you mean in mips.md, from the "cpu" field, right?
I thought that specifies the list of valid params to -march. Or, do you mean
from the "cpu" attr in 10000.md where we do all our checking in each insn? Does
gcc somehow translate r12000 passed to -march to requal r10000 via the mapping
to PROCESSOR_R10000?
I wanted to keep r12000 through r16000 around as valid values, even if they map
back to r10000 in the end. Might it make more sense still to use r1x000 to
reference the entire family based on current knowledge?
> Are cpu_units and automata allowed to have the same name?
> If so, I'd prefer to call them r10k_fpdiv and f10k_fpsqrt.
That's a very good question. I wasn't sure, so I spelled them out due to a lack
of creativity in the naming.
Maybe I should called the automata r10k_a_int, r10k_a_fp, r10k_a_fpdiv,
r10k_a_fpsqrt, and r10k_a_addr, then map cpu units to those?
Cheers!,
--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
"The past tempts us, the present confuses us, the future frightens us. And our
lives slip away, moment by moment, lost in that vast, terrible in-between."
--Emperor Turhan, Centauri Republic
More information about the Gcc-patches
mailing list