This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Multiplications on Pentium 4

To: Jan Hubicka <jh at suse dot cz>
Subject: Multiplications on Pentium 4
From: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
Date: Fri, 7 Sep 2001 20:04:03 +0200
Cc: gcc at gcc dot gnu dot org
References: <20010827121624.D8568@atrey.karlin.mff.cuni.cz> <20010827143032.C636@fuchs.offl.uni-jena.de> <20010827173025.F11402@atrey.karlin.mff.cuni.cz> <20010901202854.A7713@fuchs.offl.uni-jena.de> <20010902000000.C27182@atrey.karlin.mff.cuni.cz> <20010902024104.F7713@fuchs.offl.uni-jena.de> <20010903171717.E13574@atrey.karlin.mff.cuni.cz> <20010904215156.C438@fuchs.offl.uni-jena.de> <007001c1358e$6f53b6f0$7edd18ac@amr.corp.intel.com> <20010905134405.G15564@atrey.karlin.mff.cuni.cz>

The Pentium 4 is so different from all other CPUs that I must write a
special code choice generator. Some examples:


	imul:            14       clocks latency
	shl:              4       clocks latency
	lea (,,1):        0.5     clocks latency
	lea (,,2):        4       clocks latency
	lea (,,4):        4       clocks latency
	lea (,,8):        4       clocks latency
	add, sub, neg:    0.5     clocks latency
	mov:              0...0.5 clocks latency

This leads to completely different code than for the i386...Pentium-III
and K5...Athlon.

Optimizing code for size is easy. It's the same as for other CPUs.

Optimizing for speed normally bloats the code. Cascades of adds and
lea (,,1) are nearly always the fastest solution, even for huge
multipliers. Code size can grow to as much as 50 bytes for ONE
multiplication (two-register solutions).

Only a few multipliers are a _little_ bit faster using the imul
instruction. So the optimization is more of a speed <=> code size
tradeoff.

So a proposal generator should be written that produces the
shortest-path method for a given multiplier.
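A minimal sketch of what such a proposal generator could look like, as a
shortest-path (Dijkstra) search over multiples of the input. The latencies
come from the table in this message; everything else is an assumption made
for illustration: the names propose() and successors() are hypothetical, the
original value is assumed to stay available in a second register x at no
cost, only a strictly serial chain in one result register r is modelled, and
the search is bounded to intermediate values up to 2*n+1.

```python
import heapq

# Latencies in clocks, taken from the table in the message.
LAT_FAST = 0.5   # add, sub, lea (,,1)
LAT_SLOW = 4.0   # shl, lea with scale 2, 4 or 8
LAT_IMUL = 14.0  # imul

def successors(v):
    """Yield (new_multiple, latency, text) for one instruction applied to r = v*x.

    Assumption: x (the unmodified input) stays available in a second register.
    """
    yield 2 * v, LAT_FAST, "add r,r"
    yield v + 1, LAT_FAST, "add x,r"
    yield v - 1, LAT_FAST, "sub x,r"
    for s in (2, 4, 8):
        yield v * (s + 1), LAT_SLOW, f"lea (r,r,{s}),r"
        yield v * s + 1, LAT_SLOW, f"lea (x,r,{s}),r"
    for k in range(2, 32):
        yield v << k, LAT_SLOW, f"shl ${k},r"

def propose(n):
    """Return (latency, steps) of the lowest-latency chain found for r = n*x."""
    if n < 1:
        raise ValueError("sketch handles positive multipliers only")
    bound = 2 * n + 1              # assumed limit on intermediate multiples
    best = {1: 0.0}                # r starts out holding 1*x
    heap = [(0.0, 1, ())]
    while heap:
        cost, v, steps = heapq.heappop(heap)
        if cost >= LAT_IMUL:       # nothing cheaper than imul is left
            break
        if v == n:
            return cost, list(steps)
        if cost > best.get(v, float("inf")):
            continue               # stale heap entry
        for w, lat, text in successors(v):
            new_cost = cost + lat
            if 0 < w <= bound and new_cost < best.get(w, float("inf")):
                best[w] = new_cost
                heapq.heappush(heap, (new_cost, w, steps + (text,)))
    return LAT_IMUL, [f"imul ${n},r"]

if __name__ == "__main__":
    latency, steps = propose(12)
    print(latency, "; ".join(steps))
```

This model only minimizes latency. A real generator would also have to weigh
code size, register pressure and throughput, which is the speed <=> code size
tradeoff described above.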

Examples: *12:

	lea (r,r,1),t; add t,r; add r,r; add r,r		 2 Clocks (1)
	lea (r,r,2),r; shl $2,r					 8 Clocks (2)
	imul $12,r						14 Clocks (4.667)

These clock counts are latencies! Throughput is better; the corresponding
figures are given in parentheses.
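The three *12 sequences can be checked with a small simulation. This is only
an illustrative sketch: the function names are hypothetical, registers are
modelled as 32-bit values, and the second sequence is read as
lea (r,r,2),r followed by shl $2,r (an assumption about the intended
operands). The latency sums use the per-instruction figures from the table
in this message.

```python
MASK = 0xFFFFFFFF  # model 32-bit registers

def mul12_adds(r):
    t = (r + r) & MASK      # lea (r,r,1),t   t = 2r
    r = (r + t) & MASK      # add t,r         r = 3r
    r = (r + r) & MASK      # add r,r         r = 6r
    r = (r + r) & MASK      # add r,r         r = 12r
    return r

def mul12_lea_shl(r):
    r = (r + 2 * r) & MASK  # lea (r,r,2),r   r = 3r (assumed reading)
    r = (r << 2) & MASK     # shl $2,r        r = 12r
    return r

def mul12_imul(r):
    return (12 * r) & MASK  # imul $12,r

# Latency of each serial chain, summed from the per-instruction latencies.
LATENCY = {
    "adds": 4 * 0.5,        # four 0.5-clock instructions
    "lea_shl": 4.0 + 4.0,   # scaled lea plus shl
    "imul": 14.0,
}

if __name__ == "__main__":
    for x in (0, 1, 7, 12345, 0xFFFFFFFF):
        assert mul12_adds(x) == mul12_lea_shl(x) == mul12_imul(x)
    print(LATENCY)
```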

-- 
Frank Klemm

Follow-Ups:
- Re: Multiplications on Pentium 4
  - From: Michael Meissner
- Re: Multiplications on Pentium 4
  - From: Jan Hubicka
- Re: Multiplications on Pentium 4
  - From: Torbjorn Granlund

References:
- mul + div with 64 bit signed ints on IA32
  - From: Frank Klemm
- Re: mul + div with 64 bit signed ints on IA32
  - From: Tim Prince
- Re: mul + div with 64 bit signed ints on IA32
  - From: Jan Hubicka
