[PATCH]: GCC Scheduler support for R10000 on MIPS
Richard Sandiford
rdsandiford@googlemail.com
Tue Aug 5 18:29:00 GMT 2008
Kumba <kumba@gentoo.org> writes:
>> Sure. Just create issue and completion cpu_units and add them to
>> the insn resservations. If you're interested in doing this, have a look
>> at other scheduler descriptions for inspiration. (Not just MIPS ones.)
>
> Hmm, the catch is in how I think, I've largely modeled this R10k
> pipeline descriptor spatially in my head, according to the block
> diagram the manual gives and factoring in some of the errata notes.
> Hence why I initially created three automata for the three pipelines
> (int queue, fp queue, and address queue), and then defined the five
> cpu units as the five main blocks (alu1, alu2 (fed by int queue);
> fp-multiply, fp-add (fed by fp queue); and load/store (fed by address
> queue)).
>
> The multiplier gets tricky because it's in its own right subdivided
> into issue/completion, and fp-divide, fp-sqrt, and fp-multiply
> components, the latter three of which run parallel to each it other.
> It's when you define all these cpu units, that all tie to the fp
> queue, that the state count launches itself to the moon.
>
> So might it be more reasonable to create the fp multiplier as a
> separate automata from the fp-queue (and fp adder), and define
> cpu_units to match the issue and completion stages, multiply, divide,
> and square root stages? Then we can multiply like this:
>
> (define_insn_reservation "r10k_fp_miscmul" 2
> (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
> (eq_attr "type" "fmul,fmove"))
> "r10k_mpy_issue + r10k_fpmpy + r10k_mpr_compl")
>
> And fdiv and fsqrts like this:
>
> (define_insn_reservation "r10k_fdiv_single" 12
> (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
> (and (eq_attr "type" "fdiv,frdiv")
> (eq_attr "mode" "SF")))
> "(r10k_mpy_issue + r10k_fpdiv + r10k_mpy_compl) * 14")
>
> (define_insn_reservation "r10k_fsqrt_single" 18
> (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
> (and (eq_attr "type" "fsqrt")
> (eq_attr "mode" "SF")))
> "(r10k_mpy_issue + r10k_fpsqrt + r10k_mpy_compl) * 20")
I think you want "foo, bar" (foo one cycle, then bar the next)
rather than "foo + bar" (foo and bar simultaneously). And you don't
want to tie up the issue and completion units for more than one cycle.
>> You need to add the names of the target insn reservations too.
>> Probably the ones for mfhilo.
>>
>> In fact, if you split the mfhilo reservation into two, one for mfhi and
>> one for mflo, you could avoid the predicate. (And yes, you'd be using
>> lo_operand to do the split; see sb1.md for how. But it's the bypass
>> that's the key.)
>
> Okay, this might be easier than the predicates. How does something like this look?:
>
> (define_insn_reservation "r10k_mfhi" 1
> (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
> (and (eq_attr "type" "mfhilo")
> (not (match_operand 1 "lo_operand"))))
> "r10k_alu1 | r10k_alu2")
>
> (define_insn_reservation "r10k_mflo" 1
> (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
> (and (eq_attr "type" "mfhilo")
> (match_operand 1 "lo_operand")))
> "r10k_alu1 | r10k_alu2")
>
> And then for the bypasses:
>
> (define_bypass 6 "r10k_imul_single" "r10k_mfhi")
> (define_bypass 10 "r10k_imul_double" "r10k_mfhi")
> (define_bypass 35 "r10k_idiv_single" "r10k_mfhi")
> (define_bypass 67 "r10k_idiv_double" "r10k_mfhi")
>
> We'll set the default latency for the insn_reservations to the LO
> latency, and use bypasses for the HI latency. Sound kosher?
Looks good.
> I also assume we need not worry about the mthilo insn, right?
Not as targets of these bypasses, no.
>> MULTU and DMULTU are classed as IMUL. If you want to split it into
>> signed and unsigned -- a reasonable thing to do -- you need to add
>> a new value to the "type" attribute. You'd then need to update all
>> uses of IMUL in config/mips to check for both the signed and unsigned
>> versions.
>>
>> If you do this, please do the split as a separate patch. It's easier
>> to review that way.
>
> Reasonable, but that would probably take a while for me to research on. I'm
> still fairly new to gcc internals, and still don't quite know my way around
> things. Defining entirely new types is not something I'll figure out very
> easily. Probably worth looking at after getting this patch in.
>
> So I'll ask this then. The R10K Manual states that MULT latency is
> 5/6 (Lo/Hi) while MULTU is 6/7 (Lo/Hi). If the "imul" type combines
> both MULT and MULTU, then should we compromise on the latency and use
> 6 instead of 5 or 7? Later on, if I work out a way to add that type
> (or someone else does), then we can re-tweak 10000.md to relfect the
> change.
Again, I'm afraid it's really a case of trying and seeing what gives
the best performance. ;)
> On a different thought, though, does a predicate exist like lo_operand
> that can determine signed versus unsigned? Seems like it could be
> used to create insn_reservations of say, r10k_imul_single_signed and
> r10k_imul_single_unsigned, set their default latencies to the LO
> register, then use more define_bypasses on the mfhi bit to set the HI
> latencies.
This isn't the preferred approach. Splitting mfhilo is on my TODO list,
so that we don't need to use match_operand in the schedulers. But using
lo_operand for mfhilo is better than using a predicate for signedness vs.
unsignedness, for two reasons:
- Predicates are really for defining insns. lo_operand is useful
for that, so we'd have it regardless of whether the schedulers
needed it.
- Operand 1 of an mfhilo is guaranteed to be the source, so a simple
predicate check is enough. On the other hand, there's no defined
mapping between the multiplication operator (MULT, MULTU) and the
operands. So you couldn't really have:
(match_operand N "signed_multiplication")
because there's no known value for N.
>> Doesn't have to be commerical, and most benchmarks aren't hugely
>> system-specific. As always with these things: pick something
>> (a program or a benchmark) that matters to you. Maybe an
>> example of your typical workload, or whatever.
>
> The typical workload on this box is compiling, and pretty much anything that
> speeds up gcc is a blessing. I'm still hunting for one of those elusive dual
> R14K modules for this machine (at a reasonable price, which $2,000 is not).
OK. The time taken to compile something is certainly a valid benchmark.
(It forms part of SPECINT, of course.)
>> The R10K doesn't have integer multiply-accumulate instructions.
>> I.e. no IMADD. MADD == MADD.fmt, i.e. FMADD.
>
> Okay, I moved imadd down to the reservation that handles unknown and
> multi as a precaution. (i.e., Murphy's law)
Nothing goes wrong if you fail to handle an instruction. The compiler
won't crash or anything.
At this stage we're dealing with things like "-march=r4130 -mtune=r10000".
I don't think any particular handling of imadd is better than any other
in that case. So my personal perference would be to leave out the
unnecessary insns (IMADD, SIGNEXT, FRDIV1, etc.).
>> Not sure whether the patch you attached was the latest one.
>> It didn't mention MOVE, FPDIV1, etc. And like I say, "cpu"
>> must exactly mirror "enum processor_type", so you need to
>> remove "r12000", "r14000" and "r16000" from "cpu" too.
>
> Hmm, it might've not been. I'll double check before I send the next one.
>
> And by removing r12000-r16000, you mean in mips.md, from the "cpu"
> field, right?
Right.
> I thought that specifies the list of valid params to
> -march.
Nope, that's mips.c:mips_cpu_table (which you're already handling
correctly). "cpu" is just an .md copy of enum processor_type.
>> Are cpu_units and automata allowed to have the same name?
>> If so, I'd prefer to call them r10k_fpdiv and f10k_fpsqrt.
>
> That's a very good question. I wasn't sure, so I spelled them out due to a lack
> of creativity in the naming.
>
> Maybe I should called the automata r10k_a_int, r10k_a_fp, r10k_a_fpdiv,
> r10k_a_fpsqrt, and r10k_a_addr, then map cpu units to those?
That'd be OK with me.
Richard
More information about the Gcc-patches
mailing list