This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH]: GCC Scheduler support for R10000 on MIPS


Kumba <kumba@gentoo.org> writes:
>> Have you run the patch through the gcc testsuites?
>
> Not yet.  Are there any quick guides on doing do?  Is that as simple and 
> completing a compile, then running 'make test'?

I see Ralf's already answered this.

>> Minor nit, but you've sometimes used many blank lines to separate things.
>> Seems a bit excessive: two should be enough.
>> 
>> Coding convention nit: all comments should start with a capital letter
>> and end with ".", even if they aren't real sentences.
>
> Cleaned these up -- how does it look now?

A lot better, thanks.

>> Do you actually model the relationship with the multiplier?  If not,
>> it might be worth a comment saying so.
>
> Not real sure.  It's been awhile since I looked at the details, but I
> know the R10K had a rather complex multiplier, and I wasn't real sure
> how to properly model it.  It was also my first stab at DFA and
> pipeline descriptors in general, so there's no telling how far off I
> was.

There doesn't seem to be anything in the description linking the
FP multiplier cpu_unit with the division and sqare root cpu_units,
so I'm pretty sure it isn't modelled.  Which is fine.  I just think
you should add something like "We don't model this at present."
to the end of the comment.

(There's no shame in that.  It's common to omit some details from the
DFA description, and only mention them in the comments.  The aim after
all is to get good code, not to describe the pipeline with complete accuracy.
Sometimes omitting details gives better code.)

>> Given that both r10k_fpdiv and r10k_fpsqrt are long-latency insns,
>> and that they can be used independently, we might get smaller
>> automata if we split r10k_fp into three: r10k_fp, r10k_fpdiv and
>> r10k_fpsqrt.  Could you try this and see how it affects the total
>> number of states?
> [...]
> I'll try adding those two additional automata, and tweaking the
> cpu_units for them.

Thanks.

> As far as the number of states, is there a quick
> way to have gcc calc the number of states per automata versus actually
> running the build?

Add:

    (automata_option "v")

to one of the .md files and do "make insn-automata.c".  This will
create a file called "mips.dfa" in the build directory.  At the end
of that file is a summary of the automata.  The interesting thing is
the number of DFA states and DFA arcs in the r10k_* automata.

(Hadn't realised it was so hard to get at this information these days.
It used to be printed on stderr.  There's also support for adding "-v"
to the genautomata command line, but it seems to have bitrotted and
no longer works.)

>> "logical" seems to be missing from the .md file.  Should it be in
>> this reservation?
>
> I haven't kept track of the newer insns, so I went on ahead and added
> it here.  How about signext?  I see that one in the 7000.md file.
> Good fit here too?  I matched the insns that I could recognize.
> 'logical', does this refer to conditional statements like if clauses
> and such?

"logical" means things like AND, OR, XOR and NOR.  These insns used
to be lumped into "arith", but were split out for the benefit of a
pipeline that doesn't issue all old-"arith" insns in the same way.

(That's the general model.  We split "type" attributes up on an
as-needed basis, rather than trying to predict in advance what
would be the finest useful granularity.)

"signext" is the new MIPS32r2/MIPS64r2 SEB and SEH instructions,
which the R10K doesn't have.

>> The way to implement that sort of thing is define_bypass.  E.g.
>> make the default latency work for LO, then add a define_bypass that is
>> conditional on the int insn using HI.  (I assume that way round would be
>> better because using LO is the common case.)  You'd have to use a custom
>> predicate, along the lines of mips_store_data_bypass_p.
>> 
>> No need to do that if you don't want, but I think the last sentence is a
>> little misleading as it stands.
>
> Hmm, sounds like something to try.  I admit, I know nothing about
> defining custom predicates, though.  Any tips on that?

Look at mips_store_data_bypass_p and mips_linked_madd_p for examples.

> And does it add much optimization to be worth implementing?

TBH, the only way to know is to try it and measure the result.

And like I say, there's absolutely no need to try it.  I was just trying
to say that the comment should mention bypasses instead of lo_operand.

> The way I was thinking of checking was to use lo_operand in one of the
> define_insn_reservation lines as one of the tests, generating one
> latency if the insn was LO and the other latency if the insn was HI in
> a different define_insn_reservation.  But I wasn't able to find docs
> that explained lo_operand() and what its return values are.  I don't
> even know if it's still around anymore.

It is still around.  But the point is that the latency of a
define_insn_reservation is determined solely by the insn it applies to.
And that insn is the definition side of the dependency, not the use.
In other words, the insns matched by these insn_reservations are the
multiplications and divisions themselves.  Those insns clobber both
HI _and_ LO.

Thus you can't use define_insn_reservation tests to tell between a
use of HI and a ues of LO when setting the latency of a multiplication
or division.  That's what bypasses are for.

As to documentation, lo_operand is an operand predicate.
See the Predicates section of the internals documentation
for more info about those.  (Just if you're curious; like I say,
it won't really help here.)

>> I'd prefer to map r12000, r14000 and r16000 to PROCESSOR_R10000,
>> rather than have four PROCESSOR_* values that do the same thing.
>> I take your point about leaving processor-specific tuning as
>> future work, but I think the split should be introduced as part
>> of that work rather than here.
>
> I changed these all to R10000, and removed the R12000-R16000 macros
> from mips.h.

Thanks.

> Do I need to remove the separate insn costs from each as well (Except
> for R10000)?

Yes.  The costs array is indexed by "enum processor_type".

You also need to remove the r12000, r14000 and r16000 "cpu" attributes,
because "cpu" must be a carbon copy of "enum processor_type".
(It's a nasty wart of the infrastructure that we need to define both.)

> Also, How does one calculate insn costs?  I never found much detail on
> that, so I think my defaults are copied from something else (been
> awhile).

Experimentation, basically.  Costs are used to choose between
two equivalent implementations of an operation.  E.g. multiplication
by a constant can be done using a single multiplication insn or by
a sequence of shifts and adds.

The target-independent code calculates the cost of a sequence of
insns simply by adding them up.  It doesn't take into account how
the pipeline might issue them, or what the repeat rates are.

So COSTS_N_INSNS (latency) is a good start, but is often too high on
superscalar pipelines, where breaking a monolithic operation into
smaller operations can exploit the parallelism better.  For example,
if multiplication takes 5 cycles on a dual-issue target, a multiplication
is often (but not always!) more expensive than 5 single-cycle insns.

The costs are just heuristics, and you have to accept that any given
choice of values is going to make some things better and some things
worse.  When I've done scheduling work in the past, I simply tried
various values and run the result through (commercial) benchmarks.

>> I generally try to avoid having TARGET_* and TUNE_* macros that
>> aren't used.  (Again, it's a case of "add it when you need it".)
>> There are some old macros that are unused, but still.
>
> I added these mostly so those with an R12000 or R14000 processor can just use 
> -march=r12000 and get the same code as an R10K without having to look up the 
> info that they need to use -march=r10000.
>
> Should I instead define -march=r1x000?  And later on, when someone can
> find errata for R12K/R14K/R16K, add the processor-specific bits.
> [...]

I think you've misunderstood what I meant.  I was simply saying
that you shouldn't define those new TARGET_* and TUNE_* macros.
They're not used anywhere in your patch, so they're just dead code.

TARGET_FOO should only be defined if some code tests TARGET_FOO.
Likewise TUNE_FOO.

I certainly wasn't talking about changing the -march options.
Please keep them all, but map them to PROCESSOR_R10000, which is
exactly what your revised patch did.

You need to add the new options to doc/invoke.texi.

With those issues fixed, the patch should be ready to go in.

Richard


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]