This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



K6 and decoding bottleneck


Hi
I've started some experiments with scheduling for the AMD K6. It seems to
be quite an interesting problem, because the K6 appears to have two bottlenecks.
It translates the i386 opcodes into RISC instructions, schedules them
and executes them. The execution bottleneck doesn't seem to be so bad,
because the micro instructions are scheduled, have low latencies and
there are two ALU units for the most common operations, so scheduling for
the K6 doesn't seem to help much in my testcases (sometimes it is even
a loss compared to -m386, so overall it doesn't bring much speedup).

A much more interesting topic is scheduling for the decoder. The K6 has two
parallel decoders able to decode common instructions with register operands
in one cycle.
Those are usually translated into a single RISC instruction and executed
in one cycle, so with scheduled code working on registers, the K6's internal
scheduler is almost empty.
Some instructions take longer to execute (division and read-modify operations)
and others take longer to decode (most read-modify-write operations).

For the decoder there are three types of instructions: short decodable, which
can be decoded in one cycle in either decoder (common instructions that need
at most two RISC instructions - register operations, read-modify and fp insns);
long decodable, which is decoded in one cycle but the second decoder must stay
idle (common instructions that need at most four RISC operations - read-modify-
write instructions); and vector decoded, which needs two cycles to decode
(multiply, divide and most of the "complex" instructions).
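
Just to make the three classes concrete, here is a minimal sketch in C; the
enum and the cost values are only my reading of the above, not code from the
real backend:

    /* Hypothetical decode classes for the K6 front end, following the
       description above.  Illustrative only.  */
    enum k6_decode_class
    {
      K6_DECODE_SHORT,   /* at most two RISC ops; two such insns per cycle  */
      K6_DECODE_LONG,    /* e.g. read-modify-write; one cycle, second
                            decoder idle                                    */
      K6_DECODE_VECTOR   /* multiply, divide, "complex" insns; two cycles   */
    };

    /* Decode cycles spent on a single instruction of the given class
       when it is not paired with anything.  */
    static int
    k6_decode_cost (enum k6_decode_class cls)
    {
      return cls == K6_DECODE_VECTOR ? 2 : 1;
    }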

So it is good to start the code with short decodable instructions to
fill the scheduler a bit, and only then issue the long and vector decodable
instructions. This avoids stalls caused by waiting for the scheduler and
improves the K6's internal scheduling, because it has more instructions to
choose from.
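
To illustrate why the ordering matters to the decoders at all, here is a
small model that counts decode cycles for a sequence, building on the
hypothetical helpers above (again just a sketch under my simplified reading:
two short decodable insns share a cycle, everything else decodes alone):

    /* Count decode cycles for a sequence of instruction classes:
       two SHORT insns pair up in one cycle, a LONG insn takes a cycle
       by itself, a VECTOR insn takes two.  Illustrative only.  */
    static int
    k6_decode_cycles (const enum k6_decode_class *seq, int n)
    {
      int i = 0, cycles = 0;

      while (i < n)
        {
          if (seq[i] == K6_DECODE_SHORT
              && i + 1 < n
              && seq[i + 1] == K6_DECODE_SHORT)
            {
              cycles += 1;      /* both decoders busy */
              i += 2;
            }
          else
            {
              cycles += k6_decode_cost (seq[i]);
              i += 1;
            }
        }

      return cycles;
    }

    /* Example: SHORT, LONG, SHORT, VECTOR needs 1+1+1+2 = 5 decode cycles,
       while SHORT, SHORT, LONG, VECTOR needs 1+1+2 = 4, so grouping the
       short decodable insns up front saves a cycle and gives the internal
       scheduler something to work on sooner.  */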

I've made simple MD_SCHED macros that count an estimated number of cycles
needed to decode and accept only those instructions that can be decoded in
time (decoder_cycles - cpu_clocks supplied by HAIFA). They also implement
some other strategies, such as pairing short decodable instructions to take
advantage of the parallel decoding.
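
Roughly, the selection logic looks like this (only a sketch of the strategy,
reusing the hypothetical helpers above; the real MD_SCHED macros work on the
HAIFA ready list of rtx insns and their exact arguments differ):

    struct ready_insn
    {
      enum k6_decode_class decode_class;
      int priority;             /* HAIFA's own priority estimate */
    };

    /* Pick the index of the next ready instruction to issue, or -1 to
       stall and let the decoders catch up.  decoder_cycles is the running
       estimate of cycles the decoders have consumed so far, cpu_clock the
       cycle HAIFA is currently scheduling for.  */
    static int
    k6_pick_insn (const struct ready_insn *ready, int n_ready,
                  int decoder_cycles, int cpu_clock)
    {
      int i, best = -1;

      for (i = 0; i < n_ready; i++)
        {
          /* Reject insns whose decode cannot finish in time; issuing
             them now would only stretch the decode stream further.  */
          if (decoder_cycles + k6_decode_cost (ready[i].decode_class)
              > cpu_clock + 1)
            continue;

          /* Prefer short decodable insns (lower class value) so they can
             pair up, and fall back to HAIFA's priority to break ties.  */
          if (best < 0
              || ready[i].decode_class < ready[best].decode_class
              || (ready[i].decode_class == ready[best].decode_class
                  && ready[i].priority > ready[best].priority))
            best = i;
        }

      return best;
    }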

The results are pretty interesting. It is not so unusual to get a 20 or 25%
speedup (the maximal speedup reached by the "normal" scheduler taking
execution into account was 7% in my tests).
The problem is that my current implementation also causes large slowdowns in
some cases, because it tends to push complex instructions later (sometimes
too late), resulting in much worse code.

On the other hand it shows that decoder optimization strategies seem to
be significant for the K6 and maybe for the PentiumPro too. Decoding seems
to be the main bottleneck of the K6 (I don't know about the PPro, because it
can decode three instructions per cycle but execute only two, as far as I
can remember).

So I am thinking about a better implementation. I want to ask the scheduling
gurus for ideas to try. I really don't know how to make the scheduler handle
this optimally, so I hope someone will come up with an idea.

Honza
-- 
                       OK. Lets make a signature file.
+-------------------------------------------------------------------------+
|        Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka@freesoft.cz         |
|         Czech free software foundation: http://www.freesoft.cz          |
|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
|  homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast  |
|  fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.  | 
+-------------------------------------------------------------------------+

