[PATCH]: GCC Scheduler support for R10000 on MIPS

Richard Sandiford rdsandiford@googlemail.com
Mon Aug 4 19:23:00 GMT 2008


Kumba <kumba@gentoo.org> writes:
> Richard Sandiford wrote:
>> 
>> I see Ralf's already answered this.
>
> Yup, and I ran it earlier.  Took quite a bit to finish, but I figured
> it wasn't going to be a quick endeavor.  There's a couple of failures,
> but I'm guessing some failures are expected.  Not sure what counts and
> what doesn't, so I've attached it.

In general, the thing to do is compare the post-patch results with the
pre-patch results.

I certainly wouldn't have expected so many libgomp failures,
and it's a bad sign if the libstdc++ testsuite can't run.  But these
are probably system-specific issues rather than problems with the patch.
The results for the "gcc" subtests look OK.

> It was run from a fully-compiled gcc, not bootstrap, so I'm unsure if
> that affects the output any.

Shouldn't matter.

>> There doesn't seem to be anything in the description linking the
>> FP multiplier cpu_unit with the division and sqare root cpu_units,
>> so I'm pretty sure it isn't modelled.  Which is fine.  I just think
>> you should add something like "We don't model this at present."
>> to the end of the comment.
>> 
>> (There's no shame in that.  It's common to omit some details from the
>> DFA description, and only mention them in the comments.  The aim after
>> all is to get good code, not to describe the pipeline with complete accuracy.
>> Sometimes omitting details gives better code.)
>
> Ah!  That's probably because I wasn't sure how to link the division
> and square-root units to the multiplier.  I knew that they had to be
> linked, because as the R10K manual stated, they're separate/parallel
> units, but their issue & completion logic is shared by the multiplier.
> So I know that if the multiplier is busy in either of these two
> stages, it'll cause a delay for these other two units, right?.
>
> That I think is why I had only three automata, and was funneling
> squareroot and division into the r10k_fp automata.  I figured this
> represented "linking" the multiplier and these two units.  I suppose
> that wasn't accurate, though?

No.

> Can we even model the issue and completion stages of a cpu unit?

Sure.  Just create issue and completion cpu_units and add them to
the insn resservations.  If you're interested in doing this, have a look
at other scheduler descriptions for inspiration.  (Not just MIPS ones.)

>> Add:
>> 
>>     (automata_option "v")
>> 
>> to one of the .md files and do "make insn-automata.c".  This will
>> create a file called "mips.dfa" in the build directory.  At the end
>> of that file is a summary of the automata.  The interesting thing is
>> the number of DFA states and DFA arcs in the r10k_* automata.
>> 
>> (Hadn't realised it was so hard to get at this information these days.
>> It used to be printed on stderr.  There's also support for adding "-v"
>> to the genautomata command line, but it seems to have bitrotted and
>> no longer works.)
>
> Thanks, this worked great.  I captured the entire build output looking
> for this verbosity, but didn't see it; I guess it was hidden at some
> point.
>
> Breaking out those two units into their own automata changes things
> quite a bit.  The resulting mips.dfa file is only about 300,000 lines
> long, and the r10k_fp automaton now has only 8 states (division has 22
> and square root has 36 states.  Originally, the r10k_fp automaton had
> 6336 states (and the mips.dfa file was 500,000+ lines long).  So this
> seems to bring the state numbers down to look more like the other mips
> cpu automatons.  I can pass those along if you're interested.

FWIW, the size of mips.dfa doesn't really count for much.
The reduction in states is very big though, so yeah, this is
something we should keep.

> And yeah, I looked at the option parsing bit in genautomata.c.  It
> looks like it should work, but that if-then-else construct it's got
> going just seems to fail somehow.

The loop isn't even reached in the problem cases.  init_md_reader
expects to understand all arguments.

The right fix would be to call init_md_reader_args_cb instead, with a
custom callback.

>> TBH, the only way to know is to try it and measure the result.
>> 
>> And like I say, there's absolutely no need to try it.  I was just trying
>> to say that the comment should mention bypasses instead of lo_operand.
>
> Curiosity demands I at least look it up :)
>
> I wrote that comment based on what I thought was the way to check -- I
> wasn't aware that multiplications and divisions clobbered both HI and
> LO, so I can see why bypasses are the way to go.
>
> Speaking of predicates, I get what to do now.  Define a custom predicate in 
> mips.c (I guess, "mips_check_insn_hi_p" ?).  Here's what I think so far by 
> looking at those two predicates that you mentioned:
>
> mips_check_insn_hi_p(rtx insn)
> {
>    return IS_INSN_HI(insn)
> }
>
> Then in 10000.md, something like:
>
> (define_bypass 6 "r10k_imul_single" "mips_check_insn_hi_p")

You need to add the names of the target insn reservations too.
Probably the ones for mfhilo.

In fact, if you split the mfhilo reservation into two, one for mfhi and
one for mflo, you could avoid the predicate.  (And yes, you'd be using
lo_operand to do the split; see sb1.md for how.  But it's the bypass
that's the key.)

> And do you know what the attr type is for MULTU or DMULTU?  imul, imul3, and 
> imadd don't seem to fit (I am assuming MULTU/DMULTU and friends are for 
> unsigned?).  The R10K manual has different latencies for those, and it looks 
> like I don't have insn reservations defined for those.  Or is this another 
> define_bypass + custom predicate to check for signed/unsigned?

MULTU and DMULTU are classed as IMUL.  If you want to split it into
signed and unsigned -- a reasonable thing to do -- you need to add
a new value to the "type" attribute.  You'd then need to update all
uses of IMUL in config/mips to check for both the signed and unsigned
versions.

If you do this, please do the split as a separate patch.  It's easier
to review that way.

>> Experimentation, basically.  Costs are used to choose between
>> two equivalent implementations of an operation.  E.g. multiplication
>> by a constant can be done using a single multiplication insn or by
>> a sequence of shifts and adds.
>> 
>> The target-independent code calculates the cost of a sequence of
>> insns simply by adding them up.  It doesn't take into account how
>> the pipeline might issue them, or what the repeat rates are.
>> 
>> So COSTS_N_INSNS (latency) is a good start, but is often too high on
>> superscalar pipelines, where breaking a monolithic operation into
>> smaller operations can exploit the parallelism better.  For example,
>> if multiplication takes 5 cycles on a dual-issue target, a multiplication
>> is often (but not always!) more expensive than 5 single-cycle insns.
>> 
>> The costs are just heuristics, and you have to accept that any given
>> choice of values is going to make some things better and some things
>> worse.  When I've done scheduling work in the past, I simply tried
>> various values and run the result through (commercial) benchmarks.
>
> Well, I know of no commercial benchmarking tools for Linux/Mips on SGI
> systems, and since it sounds like it's mostly guesswork to begin with,
> I guess using the same values as the latencies should be kosher.

Doesn't have to be commerical, and most benchmarks aren't hugely
system-specific.  As always with these things: pick something
(a program or a benchmark) that matters to you.  Maybe an
example of your typical workload, or whatever.

Again, I'm not saying you should run any performance tests.
But it is certainly possible to run them on this system.

> I don't suppose there's any rule of thumb involving superscalar
> pipelines out there that might say, slice a couple digits off these
> default latencies?

No, it's too target-specific.  Like I say, it really is a case of
trying it and seeing.

> Attached is round three.  Other changes not mentioned above include adding 
> frdiv1/2 and frsqrt1/2 insns to the existing reservations.  No idea if R10K 
> supports these, but better safe than sorry.

The R10K doesn't support them, but I guess it's OK to add them anyway.

> I also added the 'move' insn, even though the manual makes no explicit
> mention of an unconditional integer register move operand (only
> integer condmove).

There is no unconditional register move instruction (except in MIPS16).
As the comment says, it's just a special form of addition:

;; move		integer register move ({,D}ADD{,U} with rt = 0)

So yes, adding this is the right thing to do.

> Also, the R10K manual doesn't seem to differentiate betweem fmadd and
> imadd.  In the latency table, it simply states "MADD" -- might I infer
> this to assume that R10K itself doesn't distingush between imadd or
> fmadd, treats them the same, and so I need to follow suit? (I've got
> imadd set to run on ALU2, whereas fmadd runs on the fp multiplier).

The R10K doesn't have integer multiply-accumulate instructions.
I.e. no IMADD.  MADD == MADD.fmt, i.e. FMADD.

> And happen to know what kind of insns LWC1/LDC1/LWXC1/LDXC1 match?
> fpload/fpidxload by chance?  Referenced in the manual, they look like
> loads, but they have a different latency (which is how I coded them,
> but wanted to double check).

Yes, LWC1 and LDC1 are FPLOAD, and LWXC1 and LDXC1 are FPIDXLOAD.

Not sure whether the patch you attached was the latest one.
It didn't mention MOVE, FPDIV1, etc.  And like I say, "cpu"
must exactly mirror "enum processor_type", so you need to
remove "r12000", "r14000" and "r16000" from "cpu" too.

> +;; R10000 has int queue, fp queue, address queue.
> +;; We split the fp queue into standard fp, fp division, and
> +;; fp square root to further optimize the automata, though.

"reduce the size of the automata" might be clearer.

> +(define_automaton "r10k_int, r10k_fp, r10k_fpdivision,
> +                   r10k_fpsqroot, r10k_addr")

Are cpu_units and automata allowed to have the same name?
If so, I'd prefer to call them r10k_fpdiv and f10k_fpsqrt.

Richard



More information about the Gcc-patches mailing list