This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Multiplications on Pentium 4
- To: Jan Hubicka <jh at suse dot cz>
- Subject: Multiplications on Pentium 4
- From: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
- Date: Fri, 7 Sep 2001 20:04:03 +0200
- >Received: (from pfk@localhost)by fuchs.offl.uni-jena.de (8.9.3/8.9.3/SuSE Linux 8.9.3-0.1) id UAA05315;Fri, 7 Sep 2001 20:04:04 +0200
- Cc: gcc at gcc dot gnu dot org
- References: <20010827121624.D8568@atrey.karlin.mff.cuni.cz> <20010827143032.C636@fuchs.offl.uni-jena.de> <20010827173025.F11402@atrey.karlin.mff.cuni.cz> <20010901202854.A7713@fuchs.offl.uni-jena.de> <20010902000000.C27182@atrey.karlin.mff.cuni.cz> <20010902024104.F7713@fuchs.offl.uni-jena.de> <20010903171717.E13574@atrey.karlin.mff.cuni.cz> <20010904215156.C438@fuchs.offl.uni-jena.de> <007001c1358e$6f53b6f0$7edd18ac@amr.corp.intel.com> <20010905134405.G15564@atrey.karlin.mff.cuni.cz>
The Pentium 4 is so different from all other CPUs that I must write a special
Code Choice Generator. Some examples:
imul:               14      Clocks Latency
shl:                 4      Clocks Latency
lea (,,1):           0.5    Clocks Latency
lea (,,2):           4      Clocks Latency
lea (,,4):           4      Clocks Latency
lea (,,8):           4      Clocks Latency
add, sub, neg:       0.5    Clocks Latency
mov:                 0...0.5 Clocks Latency
This results in completely different code than for i386...Pentium-III or
K5...Athlon.
Optimizing code for size is easy. It's the same as for other CPUs.
Optimizing for speed normally bloats the code. Cascades of adds
and lea(,,1) are nearly always the fastest solution, even for
huge multipliers. Code size can grow up to 50 bytes for ONE
multiplication (2 register solutions).
Only a few multipliers are a _little_ bit faster using the imul
instruction. So the optimization is more of a speed <=> code size
tradeoff.
So a proposal generator should be written which generates
the shortest-path method for a given multiplier.
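A minimal sketch of such a generator, written in Python for brevity. It is not the planned GCC implementation, only an illustration under a simplified model: register r holds the running multiple, a second register x is assumed to keep the original value, the copy into x is assumed free, and the whole chain is serial so the latencies simply add up. The names best_sequence and successors and the search bound of 2*m are made up for this sketch. The latencies are the ones listed above.

```python
import heapq

# Latencies in clocks, taken from the table in this message.
LAT_FAST = 0.5   # add, sub, lea (,,1)
LAT_SLOW = 4.0   # shl, lea (,,2), lea (,,4), lea (,,8)
LAT_IMUL = 14.0  # imul

def successors(n):
    """Yield (new_multiple, latency, text) for one more instruction.

    n is the multiple of the original value currently held in r.
    x is a hypothetical second register holding the original value.
    """
    yield 2 * n, LAT_FAST, "add r,r"
    yield n + 1, LAT_FAST, "add x,r"
    if n > 1:
        yield n - 1, LAT_FAST, "sub x,r"
    for scale in (2, 4, 8):
        yield n * (scale + 1), LAT_SLOW, "lea (r,r,%d),r" % scale
    for k in range(2, 32):
        yield n << k, LAT_SLOW, "shl $%d,r" % k

def best_sequence(m):
    """Dijkstra search for the lowest-latency chain computing m*r.

    Returns (latency, list_of_instructions). Falls back to imul when
    no chain in this model is strictly faster.
    """
    if m < 1:
        raise ValueError("multiplier must be >= 1")
    limit = 2 * m  # assumption: never overshoot the target by more than 2x
    best = {1: 0.0}
    heap = [(0.0, 1, ())]
    while heap:
        cost, n, path = heapq.heappop(heap)
        if cost >= LAT_IMUL:
            break  # every remaining candidate is no faster than imul
        if cost > best.get(n, float("inf")):
            continue  # stale heap entry
        if n == m:
            return cost, list(path)
        for nxt, lat, text in successors(n):
            if nxt > limit:
                continue
            new_cost = cost + lat
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + (text,)))
    return LAT_IMUL, ["imul $%d,r" % m]

if __name__ == "__main__":
    latency, code = best_sequence(12)
    print(latency, "; ".join(code))
```

For the multiplier 12 this model finds a chain of four 0.5-clock instructions, i.e. 2 clocks. The search is exhaustive up to the imul latency, so for large multipliers it can visit many states; a real generator would need pruning and a code size limit, since speed and size pull in opposite directions here.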
Examples: *12:
lea (r,r,1),t; add t,r; add r,r; add r,r     2 Clocks (1)
lea (r,r,2),r; shl $2,r                      8 Clocks (2)
imul $12,r                                  14 Clocks (4.667)
These are latencies! Throughput is higher (figures in parentheses).
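The latency figures for the three *12 variants can be reproduced by adding up the per-instruction latencies from the table, assuming every instruction depends on the result of the previous one. A small Python sketch of that arithmetic; the dictionary keys and the name chain_latency are invented for this illustration:

```python
# Per-instruction latencies in clocks, from the table in this message.
LATENCY = {
    "imul": 14.0,
    "shl": 4.0,
    "lea1": 0.5,        # lea with scale 1
    "lea_scaled": 4.0,  # lea with scale 2, 4 or 8
    "add": 0.5,
}

def chain_latency(ops):
    """Latency of a fully serial dependency chain (assumption)."""
    return sum(LATENCY[op] for op in ops)

# The three ways to multiply by 12 shown above.
SEQ_ADDS = ["lea1", "add", "add", "add"]  # 2r, 3r, 6r, 12r
SEQ_LEA_SHL = ["lea_scaled", "shl"]       # 3r, 12r
SEQ_IMUL = ["imul"]

if __name__ == "__main__":
    for name, seq in (("adds", SEQ_ADDS),
                      ("lea+shl", SEQ_LEA_SHL),
                      ("imul", SEQ_IMUL)):
        print(name, chain_latency(seq))
```

This only models latency. The throughput figures in parentheses depend on how the instructions overlap in the pipeline and are not derived here.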
--
Frank Klemm