This is the mail archive of the
mailing list for the GCC project.
Re: The new scheduler and x86 CPUs
- To: jh at suse dot cz
- Subject: Re: The new scheduler and x86 CPUs
- From: Vladimir Makarov <vmakarov at toke dot toronto dot redhat dot com>
- Date: Tue, 28 Aug 2001 16:41:18 -0400
- CC: gcc at gcc dot gnu dot org
> I am looking at your work and trying to figure out what benefits it
> can bring to convert the current i386.md descriptions to the new
> syntax. First of all I need to thank you for such a huge effort for
> implementing all this. At the moment I don't understand much the
> automaton implementation, so my questions are probably somewhat naive.
> Hope it will change soon :)
I'd recommed the best article about this
Efficient Instruction Scheduling Using Finite State Automata: V. Bala
and N. Rubin, Proceedings of MICRO-28.
> The i386 CPUs differ from RISC/VLIW your patch is targeted for by
> reorder buffer. It would be nice to be able to model this explicitly.
> I think it can be possible to do by placing "variable sized"
> repetetions to your syntax. So for instance typical operation can look
> "decode, none*x, execute, none*x, retire"
> I am not aware with the automaton design, but is there any chance to
> make such a think possible? Is there some alternative?
Unfortunately, a more powerful model should be used to model
out-of-order processors. It is possible to describe it by DFA
(reorder buffer, register renaming), as possible to describe context
free language (with constraints on level of nested constructions) by
regular expressions. But the result automaton will be huge and will
be built forever. As for your example, if the number repetition is
fixed (say 10) you could describe it as
decode, (none | none * 2 | none * 3 | ... | none * 10), execute, ...
But even in this case, the result automaton will be big and several
automata should be used.
Usually fetch and decode stages of the insns execution are not
described. For classical RISC architecture they are the same and can
be ignored. It results in a smaller automaton too. As for
out-of-order speculative processor, on my opinion they can be ignored
too, because it is not the bottleneck in the processor (designers make
the insn buffer long enough).
Also I don't know the researches about usage of more powerful models in
insn schedulers to describe out-of-order processors).
Although Bob Morgan in his book "Building an optimizing compiler"
wrote "The compiler should schedule the insns as if the processor were
not an out-of-order execution processor. The more effective this
schedule is, the larger the size of the effective insns buffer.", the
insn scheduling gives small improvement for out-of-order speculative
processor. The best what I've seen is less 2% for a SGI MIPS10000
M. Srinivas and S. Jain and J. Dehnert, "A New Framework for
Integrated Global Local Scheduling, " International Conference on
Parallel Architectures and Compilation Techniques (PACT), 1998.
For comparison, IA64 scheduler gives 30%:
Wavefront Scheduling: Path Based Data Representation and Scheduling of
Subgraphs. Jay Bharadwaj, Kishore Menezes and Chris McKinsey. The
Journal of Instruction-Level Parallelism. Vol. 2. May, 2000.
So with my point of view, very accurate descriptions is not worth of
efforts to write it. You could spent weeks on very accurate
descriptions of insn reservations and get 0.1% in code improvements.
This is especially true for out-of-order speculative processors.
I'd simply describe the pipeline hazards which can be easily
described by DFA and may be often bottleneck of the processor.
My feeling that the DFA model can give improving code quality on
several percent fractions for the most processors. Now processors
have less irregularity (more symmetric functional units) in insn
reservation than earlier. On my opinion it should be used as
infrastructure to implement sophisticated and practical insn
scheduling which will try many instruction sequences to choose the
Actually I don't like out-of-order/speculative processors. It is
solution for pure people which can not afford to write a good compiler
requiring huge investments. It is a dead end approach. The more
registers and issue rate, the more percent of logic is needed for
control of out-of-order, speculative execution. Intel understood
this. They have a decent compiler and gcc is far behind it. We
should worry about this.
> Another problem of i386.md is huge amount of variants of various
> instructions. The instruction may or may not load memory, do
> execution and write memory so it can be wonderfull to make possible to
> write (define_reservation "name" "str" condition), where condtion will
> need to match in order this particular reservation to be used.
> For instance then I can generate "operand_fetch" like
> (define_reservation "operand_fetch" "address*2, load*2" (eq_attr "memory" "load,both")
> (define_reservation "operand_fetch" "" (eq_attr "memory" "store, none)
> Then when the define_unit is used, actually all variants are
> generated. This can hide the complexity at least in the .md file.
I'll think about this. Probably I'll implement your proposal if I
have time for this. Now I am going to
1. Speed up pipeline hazard recognizer when many automata are used.
Now the speed of the recognizer is proportional to the number of
automata even if big part of the automata are not used. It is
neccessary for targets with many subtargets (e.g. MIPS) where
different automata are used for different subtargets.
2. Implement generation of unions of automata states which is
necessary to more accurate insn scheduling and software pipelining
in joint CFG points.
> Last problem are the latencies. On Athlon, the usual operation has
> latency of 1, the load latency of 3. In order for instruction to
> start executing, the operands don't need to be ready. It can be
> decoded anytime and then for instance memory loads can start as soon
> as address is ready, so the other argument needed for execution can
> still be in computation.
> So in order to define this, I probably need to have one
> define_insn_reservation for each possible latency (in case there is no
> load/nor store, there is load and there is load and store) and I also
> need the define_bypass for each instruction to avoid the load latency,
> or is there better way to define?
I'd use one define reservation and several bypasses for different
latencies with all the rest insns. The bypasses should have guards
(functions which decides that the bypass is valid for the insn pairs).
I feel that we need more flexible mechanism for this case. I'll
think about it.
Thanks for your proposals and feedback.