This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH]: GCC Scheduler support for R10000 on MIPS


Kumba <kumba@gentoo.org> writes:
>> Sure.  Just create issue and completion cpu_units and add them to
>> the insn resservations.  If you're interested in doing this, have a look
>> at other scheduler descriptions for inspiration.  (Not just MIPS ones.)
>
> Hmm, the catch is in how I think, I've largely modeled this R10k
> pipeline descriptor spatially in my head, according to the block
> diagram the manual gives and factoring in some of the errata notes.
> Hence why I initially created three automata for the three pipelines
> (int queue, fp queue, and address queue), and then defined the five
> cpu units as the five main blocks (alu1, alu2 (fed by int queue);
> fp-multiply, fp-add (fed by fp queue); and load/store (fed by address
> queue)).
>
> The multiplier gets tricky because it's in its own right subdivided
> into issue/completion, and fp-divide, fp-sqrt, and fp-multiply
> components, the latter three of which run parallel to each it other.
> It's when you define all these cpu units, that all tie to the fp
> queue, that the state count launches itself to the moon.
>
> So might it be more reasonable to create the fp multiplier as a
> separate automata from the fp-queue (and fp adder), and define
> cpu_units to match the issue and completion stages, multiply, divide,
> and square root stages?  Then we can multiply like this:
>
> (define_insn_reservation "r10k_fp_miscmul" 2
>    (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
>         (eq_attr "type" "fmul,fmove"))
>    "r10k_mpy_issue + r10k_fpmpy + r10k_mpr_compl")
>
> And fdiv and fsqrts like this:
>
> (define_insn_reservation "r10k_fdiv_single" 12
>    (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
>         (and (eq_attr "type" "fdiv,frdiv")
>              (eq_attr "mode" "SF")))
>    "(r10k_mpy_issue + r10k_fpdiv + r10k_mpy_compl) * 14")
>
> (define_insn_reservation "r10k_fsqrt_single" 18
>    (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
>         (and (eq_attr "type" "fsqrt")
>              (eq_attr "mode" "SF")))
>    "(r10k_mpy_issue + r10k_fpsqrt  + r10k_mpy_compl) * 20")

I think you want "foo, bar" (foo one cycle, then bar the next)
rather than "foo + bar" (foo and bar simultaneously).  And you don't
want to tie up the issue and completion units for more than one cycle.

>> You need to add the names of the target insn reservations too.
>> Probably the ones for mfhilo.
>> 
>> In fact, if you split the mfhilo reservation into two, one for mfhi and
>> one for mflo, you could avoid the predicate.  (And yes, you'd be using
>> lo_operand to do the split; see sb1.md for how.  But it's the bypass
>> that's the key.)
>
> Okay, this might be easier than the predicates.  How does something like this look?:
>
> (define_insn_reservation "r10k_mfhi" 1
>    (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
>         (and (eq_attr "type" "mfhilo")
>              (not (match_operand 1 "lo_operand"))))
>    "r10k_alu1 | r10k_alu2")
>
> (define_insn_reservation "r10k_mflo" 1
>    (and (eq_attr "cpu" "r10000,r12000,r14000,r16000")
>         (and (eq_attr "type" "mfhilo")
>              (match_operand 1 "lo_operand")))
>    "r10k_alu1 | r10k_alu2")
>
> And then for the bypasses:
>
> (define_bypass 6 "r10k_imul_single" "r10k_mfhi")
> (define_bypass 10 "r10k_imul_double" "r10k_mfhi")
> (define_bypass 35 "r10k_idiv_single" "r10k_mfhi")
> (define_bypass 67 "r10k_idiv_double" "r10k_mfhi")
>
> We'll set the default latency for the insn_reservations to the LO
> latency, and use bypasses for the HI latency.  Sound kosher?

Looks good.

> I also assume we need not worry about the mthilo insn, right?

Not as targets of these bypasses, no.

>> MULTU and DMULTU are classed as IMUL.  If you want to split it into
>> signed and unsigned -- a reasonable thing to do -- you need to add
>> a new value to the "type" attribute.  You'd then need to update all
>> uses of IMUL in config/mips to check for both the signed and unsigned
>> versions.
>> 
>> If you do this, please do the split as a separate patch.  It's easier
>> to review that way.
>
> Reasonable, but that would probably take a while for me to research on.  I'm 
> still fairly new to gcc internals, and still don't quite know my way around 
> things.  Defining entirely new types is not something I'll figure out very 
> easily.  Probably worth looking at after getting this patch in.
>
> So I'll ask this then.  The R10K Manual states that MULT latency is
> 5/6 (Lo/Hi) while MULTU is 6/7 (Lo/Hi).  If the "imul" type combines
> both MULT and MULTU, then should we compromise on the latency and use
> 6 instead of 5 or 7?  Later on, if I work out a way to add that type
> (or someone else does), then we can re-tweak 10000.md to relfect the
> change.

Again, I'm afraid it's really a case of trying and seeing what gives
the best performance. ;)

> On a different thought, though, does a predicate exist like lo_operand
> that can determine signed versus unsigned?  Seems like it could be
> used to create insn_reservations of say, r10k_imul_single_signed and
> r10k_imul_single_unsigned, set their default latencies to the LO
> register, then use more define_bypasses on the mfhi bit to set the HI
> latencies.

This isn't the preferred approach.  Splitting mfhilo is on my TODO list,
so that we don't need to use match_operand in the schedulers.  But using
lo_operand for mfhilo is better than using a predicate for signedness vs.
unsignedness, for two reasons:

  - Predicates are really for defining insns.  lo_operand is useful
    for that, so we'd have it regardless of whether the schedulers
    needed it.

  - Operand 1 of an mfhilo is guaranteed to be the source, so a simple
    predicate check is enough.  On the other hand, there's no defined
    mapping between the multiplication operator (MULT, MULTU) and the
    operands.  So you couldn't really have:

        (match_operand N "signed_multiplication")

    because there's no known value for N.

>> Doesn't have to be commerical, and most benchmarks aren't hugely
>> system-specific.  As always with these things: pick something
>> (a program or a benchmark) that matters to you.  Maybe an
>> example of your typical workload, or whatever.
>
> The typical workload on this box is compiling, and pretty much anything that 
> speeds up gcc is a blessing.  I'm still hunting for one of those elusive dual 
> R14K modules for this machine (at a reasonable price, which $2,000 is not).

OK.  The time taken to compile something is certainly a valid benchmark.
(It forms part of SPECINT, of course.)

>> The R10K doesn't have integer multiply-accumulate instructions.
>> I.e. no IMADD.  MADD == MADD.fmt, i.e. FMADD.
>
> Okay, I moved imadd down to the reservation that handles unknown and
> multi as a precaution.  (i.e., Murphy's law)

Nothing goes wrong if you fail to handle an instruction.  The compiler
won't crash or anything.

At this stage we're dealing with things like "-march=r4130 -mtune=r10000".
I don't think any particular handling of imadd is better than any other
in that case.  So my personal perference would be to leave out the
unnecessary insns (IMADD, SIGNEXT, FRDIV1, etc.).

>> Not sure whether the patch you attached was the latest one.
>> It didn't mention MOVE, FPDIV1, etc.  And like I say, "cpu"
>> must exactly mirror "enum processor_type", so you need to
>> remove "r12000", "r14000" and "r16000" from "cpu" too.
>
> Hmm, it might've not been.  I'll double check before I send the next one.
>
> And by removing r12000-r16000, you mean in mips.md, from the "cpu"
> field, right?

Right.

> I thought that specifies the list of valid params to
> -march.

Nope, that's mips.c:mips_cpu_table (which you're already handling
correctly).  "cpu" is just an .md copy of enum processor_type.

>> Are cpu_units and automata allowed to have the same name?
>> If so, I'd prefer to call them r10k_fpdiv and f10k_fpsqrt.
>
> That's a very good question.  I wasn't sure, so I spelled them out due to a lack 
> of creativity in the naming.
>
> Maybe I should called the automata r10k_a_int, r10k_a_fp, r10k_a_fpdiv, 
> r10k_a_fpsqrt, and r10k_a_addr, then map cpu units to those?

That'd be OK with me.

Richard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]