This is the mail archive of the mailing list for the GCC project.


Re: Multiplications on Pentium 4

> The Pentium 4 is so different from all other CPUs that I must write a special
> Code Choice Generator. Some examples:
> 	imul:		14 Clocks Latency
> 	shl:		 4 Clocks Latency
> 	lea (,,1)	 0.5 Clocks  Latency
> 	lea (,,2)	 4 Clocks  Latency
> 	lea (,,4)	 4 Clocks  Latency
> 	lea (,,8)	 4 Clocks  Latency
Actually, lea with scale 2 (,,2) can be rewritten as an lea doing an
addition, which is faster. The rule is that a shift has 4 cycles of latency,
while an add has 0.5. Lea is broken into trivial operations, so for your
measurements you can probably ignore its existence.
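
To make the shift-versus-add rule concrete, here is a minimal sketch in C of
the two ways to double a value. The function names and the latency macros are
made up for illustration; the clock figures are simply the ones quoted in this
thread, not measurements of my own. A compiler is free to turn either form
into whatever instruction it prefers, so this shows the choice the code
generator faces rather than a way to force one instruction.

```c
#include <stdint.h>

/* Latencies in clocks, copied from the figures quoted in this thread. */
#define P4_LAT_ADD   0.5
#define P4_LAT_SHIFT 4.0

/* x * 2 expressed as a shift: one shl, 4 clocks by the quoted figures. */
static inline uint32_t times2_shift(uint32_t x)
{
    return x << 1;
}

/* x * 2 expressed as an addition: one add, 0.5 clocks by the quoted figures. */
static inline uint32_t times2_add(uint32_t x)
{
    return x + x;
}
```

Both functions return the same value for every 32-bit input (including
wraparound), so the only difference is which instruction they suggest.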
> 	add, sub, neg:	 0.5 Clocks Latency
> 	mov		 0...0.5 Clocks Latency
> This generates fully different Code compared with i386...Pentium-III,
> K5...Athlon.
Agreed. That's the problem.
The other problem is that the extreme latency of imul and shifts means
we can benefit from replacing them with relatively many adds, but the P4 is
limited by its trace cache. More adds mean less cache space, so this tradeoff
needs to be controlled mainly by the program's profile, to find hot spots,
and additionally by the scheduler, to shorten only the critical paths through
a BB.  This is _extremely_ difficult to integrate into the existing gcc model.
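
As a rough illustration of the latency side of this tradeoff, the sketch
below multiplies by a constant using only additions (naive binary
double-and-add) and counts how many adds that takes. This is not how gcc
synthesizes multiplications; the function names are hypothetical, the latency
constants are the figures quoted in this thread, and the model assumes the
worst case where every add depends on the previous one. It deliberately says
nothing about trace cache pressure, which is the part that needs profile data.

```c
#include <stdint.h>

/* Latencies in clocks, copied from the figures quoted in this thread. */
#define P4_LAT_ADD  0.5
#define P4_LAT_IMUL 14.0

/* Compute x * c (mod 2^32) using only additions, scanning the bits of c
   from the most significant end. The number of adds used is stored in
   *adds. The first set bit only copies x, so it is not counted as an add. */
static uint32_t mul_by_adds(uint32_t x, uint32_t c, unsigned *adds)
{
    uint32_t acc = 0;
    unsigned n = 0;
    int started = 0;

    for (int bit = 31; bit >= 0; --bit) {
        if (started) {
            acc = acc + acc;        /* doubling, done as an add */
            ++n;
        }
        if ((c >> bit) & 1u) {
            if (started) {
                acc = acc + x;      /* add in x for this set bit */
                ++n;
            } else {
                acc = x;            /* first set bit: a plain copy */
                started = 1;
            }
        }
    }
    *adds = n;
    return acc;
}

/* Worst-case latency of a fully dependent chain of adds. */
static double add_chain_latency(unsigned adds)
{
    return adds * P4_LAT_ADD;
}

/* Nonzero if the add chain has lower latency than a single imul. */
static int adds_beat_imul(unsigned adds)
{
    return add_chain_latency(adds) < P4_LAT_IMUL;
}
```

For example, multiplying by 10 takes 4 adds in this scheme, a 2 clock chain
against 14 for imul, while multiplying by 0xFFFFFFFF takes 62 adds and loses.
Every extra add also costs trace cache space, so winning on latency alone does
not settle the choice.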

I hope that Intel will realize that and fund some gcc development,
as good Pentium 4 support will be tricky.

> Optimizing code for size is easy. It's the same as for other CPUs.
