This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: The new scheduler and x86 CPUs
- To: Vladimir Makarov <vmakarov at toke dot toronto dot redhat dot com>
- Subject: Re: The new scheduler and x86 CPUs
- From: Jan Hubicka <jh at suse dot cz>
- Date: Wed, 29 Aug 2001 14:22:30 +0200
- Cc: jh at suse dot cz, gcc at gcc dot gnu dot org
- References: <200108282041.QAA00552@toke.toronto.redhat.com>
> Unfortunately, a more powerful model would be needed to model
> out-of-order processors. It is possible to describe them by a DFA
> (reorder buffer, register renaming), just as it is possible to
> describe a context-free language (with a bound on the nesting depth)
> by regular expressions. But the resulting automaton would be huge
> and would take forever to build. As for your example, if the number
> of repetitions is fixed (say 10) you could describe it as
> decode, (none | none * 2 | none * 3 | ... | none * 10), execute, ...
> But even in this case, the resulting automaton would be big and
> several automata would have to be used.
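For illustration, the fixed-repetition idea could be written in the DFA description language roughly like this (a sketch only; the automaton, unit, and attribute names are hypothetical, and the middle alternatives are elided as in the text above):

```lisp
;; Hypothetical sketch: a bounded number of idle cycles between
;; decode and execute, written as an alternative of repetitions.
;; "nothing" is the reservation meaning "no unit used this cycle".
(define_automaton "sketch")
(define_cpu_unit "decode,execute" "sketch")
(define_insn_reservation "slow_op" 12 (eq_attr "type" "slow")
  "decode,(nothing|nothing*2|nothing*3|nothing*10),execute")
```

Each alternative in the parenthesized group adds states to the generated automaton, which is why a large repetition bound blows up its size.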
Hmm, crazy solution :)
> Usually the fetch and decode stages of insn execution are not
> described. For a classical RISC architecture they are the same for
> all insns and can be ignored. This results in a smaller automaton
> too. As for out-of-order speculative processors, in my opinion they
> can be ignored too, because decoding is not the bottleneck in the
> processor (designers make the insn buffer long enough).
Actually most current x86 CPUs do have a bottleneck in insn decoding. The
decoders behave in such a way that when simple instructions are paired (or
tripled) they decode faster than when they are interleaved with complex
instructions. The Athlon, for instance, can decode up to 3 simple
instructions per cycle, but only one vector (complex) instruction.
We do use MD_SCHED_REORDER hacks for PentiumPro to get instructions ordered
well. In the past I implemented the same hacks for K6, where the benefits were
noticeable (as it was really bottlenecked in this area), but the change
never got in.
For Athlon I don't do that and simply model the conflict between vector-
and fast-decoded instructions, but that is not an ideal solution either
(just easier to write :)
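That decoder conflict can be sketched in the DFA description language along these lines (a simplification for illustration; the unit names are hypothetical, not the actual Athlon description):

```lisp
;; Sketch: three decoders; a fast-decoded insn takes any one of
;; them, while a vector insn occupies all three for the cycle, so
;; nothing else can decode in parallel with it.
(define_automaton "athlon_decoder")
(define_cpu_unit "adec0,adec1,adec2" "athlon_decoder")
(define_reservation "athlon-direct" "(adec0|adec1|adec2)")
(define_reservation "athlon-vector" "(adec0+adec1+adec2)")
```

Insn reservations then start with either `athlon-direct` or `athlon-vector`, and the automaton enforces the decode bandwidth limit without modeling fetch at all.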
> Although Bob Morgan in his book "Building an Optimizing Compiler"
> wrote "The compiler should schedule the insns as if the processor were
> not an out-of-order execution processor. The more effective this
> schedule is, the larger the size of the effective insns buffer.",
> insn scheduling gives only a small improvement for out-of-order
> speculative processors. The best I've seen is less than 2% for an
> SGI MIPS R10000.
Oh yes, the benefits are relatively small, but for x86 CPUs it is still
important to first get good decoder throughput and second to schedule for
latencies. As the latencies are getting large, and the reorder buffer is often
almost empty, the CPU is not always able to mask them.
In fact most x86 CPUs count on the code being bad - when the code does lots
of long-latency operations (such as memory loads/stores), the reordering
works well. For inner loops that fit in registers the picture appears to be
different. For instance in the ATLAS benchmark, gcc 3.0 schedules the loads
incorrectly, causing 50% slowdowns on Athlon. Just slightly more exact
scheduling (telling gcc that a load followed by load+execute instructions
can start immediately when the first load does not produce the address)
avoids this.
> So from my point of view, a very accurate description is not worth
> the effort of writing it. You could spend weeks on very accurate
> descriptions of insn reservations and get 0.1% in code improvements.
> This is especially true for out-of-order speculative processors.
> Actually I don't like out-of-order/speculative processors. They are
> a solution for poor people who cannot afford the huge investment of
> writing a good compiler. It is a dead-end approach. The higher the
> register count and issue rate, the larger the percentage of logic
> needed to control out-of-order, speculative execution. Intel
> understood this. They have a decent compiler and gcc is far behind
> it. We should worry about this.
Hope that this will change some time in the future.
> I'll think about this. Probably I'll implement your proposal if I
> have time for this. Now I am going to
Thanks a lot!
> 1. Speed up the pipeline hazard recognizer when many automata are
> used. Currently the speed of the recognizer is proportional to the
> number of automata even if a large part of the automata is not used.
> This is necessary for targets with many subtargets (e.g. MIPS) where
> different automata are used for different subtargets.
> 2. Implement generation of unions of automata states, which is
> necessary for more accurate insn scheduling and software pipelining
> at CFG join points.
OK, I see you do have more important things to work on first ;)
> I'd use one define_insn_reservation and several bypasses for
> different latencies with all the remaining insns. The bypasses
> should have guards (functions which decide whether the bypass is
> valid for the insn pair). I feel that we need a more flexible
> mechanism for this case. I'll think about it.
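A bypass with a guard might look roughly like this (a sketch only; the insn names and the guard function are hypothetical, and the guard itself would be a C function in the target back end):

```lisp
;; Sketch: a load has a default latency of 3 cycles, but a dependent
;; load can start after only 1 cycle when it does not use the loaded
;; value as its address; the hypothetical guard addr_independent_p
;; decides this for each producer/consumer insn pair.
(define_insn_reservation "my_load" 3 (eq_attr "type" "load")
  "decode,execute")
(define_bypass 1 "my_load" "my_load" "addr_independent_p")
```

This is exactly the kind of per-pair distinction the ATLAS load-scheduling case above needs, which is why the guard mechanism matters.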
> Thanks for your proposals and feedback.
You are welcome. Thanks for considering my ideas!