This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Multiplications on Pentium 4


The Pentium 4 is so different from all other CPUs that I must write a
special code choice generator. Some examples:


	imul		14 clocks latency
	shl		 4 clocks latency
	lea (,,1)	 0.5 clocks latency
	lea (,,2)	 4 clocks latency
	lea (,,4)	 4 clocks latency
	lea (,,8)	 4 clocks latency
	add, sub, neg	 0.5 clocks latency
	mov		 0...0.5 clocks latency

These latencies lead to completely different code compared with
i386...Pentium-III and K5...Athlon.

Optimizing code for size is easy. It's the same as for other CPUs.

Optimizing for speed normally bloats the code. Cascades of adds and
lea (,,1) are nearly always the fastest solution, even for huge
multipliers. Code size can grow up to 50 bytes for ONE
multiplication (2-register solutions).

Only a few multipliers are a _little_ bit faster using the imul
instruction. So the optimization is more of a speed <=> code size
tradeoff.

So a proposal generator should be written which generates
the shortest-path method for a given multiplier.
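
One way such a proposal generator could be sketched is as a shortest-path
search over the multiples of x reachable in r, with the latencies from the
table above as edge weights (counted in half-clocks). The sketch below is in
Python purely as a prototype. The function name cheapest_sequence, the limit
parameter, and the simplified machine model (one serial dependency chain in
r, the original x kept in a second register, the mov that saves x treated as
free, no parallel issue, no code size accounting) are all assumptions for
illustration, not the generator proposed here.

```python
import heapq

# Latencies in half-clocks, taken from the table in the post.
COST_FAST = 1    # add, sub, lea (,,1): 0.5 clocks
COST_SLOW = 8    # shl, lea (,,2|4|8):  4 clocks
COST_IMUL = 28   # imul:               14 clocks
INF = float("inf")


def cheapest_sequence(m, limit=None):
    """Return (half_clocks, steps) for turning r = x into r = m * x.

    Assumed model (not from the post): a single serial chain in r, with
    the original x available in a second register called x.
    """
    if m < 1:
        raise ValueError("only positive multipliers are modeled")
    limit = limit or 2 * m + 2      # hypothetical bound on intermediates
    best = {1: 0}                   # cheapest known cost per multiple
    prev = {}                       # multiple -> (predecessor, instruction)
    heap = [(0, 1)]
    while heap:
        cost, v = heapq.heappop(heap)
        if cost > best.get(v, INF):
            continue                # stale queue entry
        if v == m or cost >= COST_IMUL:
            break                   # done, or imul is at least as fast
        moves = [
            (2 * v, COST_FAST, "add r,r"),
            (v + 1, COST_FAST, "add x,r"),
            (v - 1, COST_FAST, "sub x,r"),
            (3 * v, COST_SLOW, "lea (r,r,2),r"),
            (5 * v, COST_SLOW, "lea (r,r,4),r"),
            (9 * v, COST_SLOW, "lea (r,r,8),r"),
        ]
        # shl $1 is omitted: add r,r does the same job at lower latency.
        moves += [(v << k, COST_SLOW, f"shl ${k},r") for k in range(2, 32)]
        for to, step_cost, text in moves:
            if 0 < to <= limit and cost + step_cost < best.get(to, INF):
                best[to] = cost + step_cost
                prev[to] = (v, text)
                heapq.heappush(heap, (best[to], to))
    if best.get(m, INF) >= COST_IMUL:
        return COST_IMUL, [f"imul ${m},r"]
    steps = []
    v = m
    while v != 1:                   # walk back from m to the start value
        v, text = prev[v]
        steps.append(text)
    return best[m], steps[::-1]
```

Under this model cheapest_sequence(12) returns 4 half-clocks (2 clocks) with
the steps add r,r; add x,r; add r,r; add r,r. Because the model is strictly
serial, it is pessimistic for large multipliers and is only meant to show the
shape of the search.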

Examples: *12:

	lea (r,r,1),t; add t,r; add r,r; add r,r		 2 Clocks (1)
	lea (r,r,2),r; shl $2,r					 8 Clocks (2)
	imul $12,r						14 Clocks (4.667)

These clock counts are latencies. Throughput is higher (figures in parentheses).
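
The latency figures for the three *12 sequences can be checked by replaying
each one and summing the per-instruction latencies from the table. The
following is a minimal sketch in Python. The helper names run and set_reg are
hypothetical, the second sequence's lea is read as (r,r,2), i.e. r + 2*r, and
each sequence is assumed to execute as one serial dependency chain (32-bit
wraparound is ignored).

```python
# Latencies in clocks, copied from the table earlier in the post.
LATENCY = {"lea1": 0.5, "add": 0.5, "lea_scaled": 4.0, "shl": 4.0, "imul": 14.0}


def set_reg(name, expr):
    """Build a step that stores expr(regs) into register `name`."""
    def update(regs):
        regs[name] = expr(regs)
    return update


def run(sequence, x):
    """Replay (latency_key, update) steps; return (final r, summed latency)."""
    regs = {"r": x, "t": 0}
    clocks = 0.0
    for key, update in sequence:
        update(regs)
        clocks += LATENCY[key]
    return regs["r"], clocks


ADD_CASCADE = [
    ("lea1", set_reg("t", lambda g: g["r"] + g["r"] * 1)),  # lea (r,r,1),t
    ("add", set_reg("r", lambda g: g["r"] + g["t"])),       # add t,r
    ("add", set_reg("r", lambda g: g["r"] + g["r"])),       # add r,r
    ("add", set_reg("r", lambda g: g["r"] + g["r"])),       # add r,r
]
LEA_SHL = [
    ("lea_scaled", set_reg("r", lambda g: g["r"] + g["r"] * 2)),  # lea (r,r,2),r
    ("shl", set_reg("r", lambda g: g["r"] << 2)),                 # shl $2,r
]
IMUL = [
    ("imul", set_reg("r", lambda g: g["r"] * 12)),                # imul $12,r
]

for name, seq in [("add cascade", ADD_CASCADE), ("lea+shl", LEA_SHL), ("imul", IMUL)]:
    result, clocks = run(seq, 7)
    print(f"{name}: 7 * 12 = {result}, latency {clocks} clocks")
```

All three sequences yield 12 times the input, with summed latencies of 2, 8
and 14 clocks respectively.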

-- 
Frank Klemm
