Time PPro tuning patch

Thu Mar 9 14:55:00 GMT 2000

> On Mon, Feb 28, 2000 at 03:09:54PM +0100, Jan Hubicka wrote:
> > i386.c uses a multiply cost of "1" on PPro. A multiply takes 4 cycles,
> > so I suggest using cost 4.
> 
> Where did you get 4 cycles?  Uli and I measured 1 cycle and
> also have documentation to that effect for imul.
I read:
Integer multiplication takes 4 clocks, floating point multiplication 5,
and MMX multiplication 3 clocks. Integer and MMX multiplication is
pipelined so that it can receive a new instruction every clock cycle.
Floating point multiplication is partially pipelined: The execution unit
can receive a new FMUL instruction two clocks after the preceding one, so
that the maximum throughput is one FMUL per two clock cycles. The holes
between the FMUL's cannot be filled by integer multiplications because
they use the same circuitry.

The function unit overview also mentions that the multiply unit is
attached to the execution ports and is pipelined, with 4 stages I believe.
So the throughput is 1 multiply per cycle, but the latency is 4 cycles.
I believe that latency is what matters for the costs.

Note that I am also getting better code with this replacement.
Honza
> 
> > Mon Feb 28 15:07:05 MET 2000  Jan Hubicka  <jh@suse.cz>
> > 	* i386.md (movhi_1): Promote movw imm, reg to movl imm, reg and
> > 	movw reg, reg to movzwl reg, reg on PARTIAL_REGISTER_STALL machines.
> > 	* i386.c (pentiumpro_cost): Set mul cost to 4.
> > 	(x86_use_movx): Set for PPro.
> 
> The rest of the patch is fine.
> 
> 
> r~