*To*: Jan Hubicka <jh at suse dot cz>
*Subject*: Re: How to avoid de-optimization
*From*: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
*Date*: Sun, 26 Aug 2001 20:29:24 +0200
*Received*: (from pfk@localhost) by fuchs.offl.uni-jena.de (8.9.3/8.9.3/SuSE Linux 8.9.3-0.1) id UAA03837; Sun, 26 Aug 2001 20:29:24 +0200
*Cc*: gcc at gcc dot gnu dot org
*References*: <20010826101321.B8344@atrey.karlin.mff.cuni.cz> <20010826133123.A326@fuchs.offl.uni-jena.de> <20010826171732.H17801@atrey.karlin.mff.cuni.cz>

On Sun, Aug 26, 2001 at 05:17:32PM +0200, Jan Hubicka wrote:
> > On Sun, Aug 26, 2001 at 10:13:21AM +0200, Jan Hubicka wrote:
> > > Hi,
> > > Actually the MUL->arithmetic conversion is controlled by costs information
> > > near the beginning of i386.c file and is CPU model specific.
> > > For instance K6 cost is 3, while cost of simple operation is 1. This means
> > > that gcc will replace mul by one, or two simple operations.
> > >
> > > In Athlon case it is set to 5, pentiumII 4 and Pentium4 30. Always representing
> > > the relative latency of simple arithmetic compared to imul instruction.
> > >
> > > In what CPU are you experiencing slowdown?
> > >
> > Athlon.
> >
> > IMUL takes 2 clocks, shift operations/adds something around 0.6...0.7
> > clocks.
> I've just cross checked the Athlon Optimization Manual:
>
> Use Alternative Code When Multiplying by a Constant
>
> A 32-bit integer multiply by a constant has a latency
> of five cycles. Therefore, use alternative code when multiplying by certain
> constants. In addition, because there is just one multiply unit, the
> replacement code may provide better throughput. The following code samples are
> designed such that the original source also receives the final result. Other
> sequences are possible if the result is in a different register. Adds have been
> favored over shifts to keep code size small. Generally, there is a fast
> replacement if the constant has very few 1 bits in binary. More constants are
> found in the file multiply_by_constants.txt located in the "opt_utilities"
> directory of the documentation CDROM.
>
> So the latency is 5 and the gcc optimization is one of directly recommended
> by the optimization manual.
>

instruction          throughput        latency
imul0x03        :    2.17011 clocks    5.00795 clocks
imul0x7F        :    2.17012 clocks    5.00795 clocks
imul0x7FFFFFFF  :    2.17011 clocks    5.00794 clocks
imulvar         :    2.17010 clocks    4.00633 clocks
imul64          :    6.00947 clocks    6.00947 clocks
fast2           :    0.50079 clocks    1.00158 clocks
fast8           :    0.66772 clocks    1.00157 clocks
fast0x80000000  :    0.83464 clocks    1.00158 clocks
fast5           :    0.66771 clocks    2.00319 clocks
fast10          :    1.14406 clocks    3.00475 clocks
fast80          :    1.16851 clocks    3.00476 clocks
fast25          :    1.16853 clocks    4.00634 clocks
fast50          :    1.39109 clocks    5.00844 clocks
fast100         :    1.58585 clocks    5.00794 clocks
fast125         :    1.58585 clocks    6.00957 clocks
fast250         :    2.00319 clocks    7.01115 clocks
fast1000        :    2.00801 clocks    7.01115 clocks
fast10000       :    2.59737 clocks    9.01427 clocks

(CPU: Athlon 700)

See code at the appendix.
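
[Editorial illustration, not part of the original message and not the benchmark code from the appendix.] The replacement discussed in this thread, turning a multiply by a constant into shifts and adds when a per-CPU cost says it is cheaper, can be sketched in C roughly as follows. All function names here are hypothetical, and the cost estimate is a deliberately naive one (one shift per set bit above bit 0, one add per extra set bit). It is an assumption made for illustration, not the algorithm gcc uses; the multiply cost of 5 in the usage note is the Athlon figure quoted above.

```c
#include <stdint.h>

/* x * 10 written as ((x << 2) + x) << 1: shift, add, shift. */
static uint32_t mul10_shift_add(uint32_t x)
{
    uint32_t t = (x << 2) + x;   /* t = 5 * x  */
    return t << 1;               /* 10 * x     */
}

/* Multiply x by the constant c using only shifts and adds:
 * add (x << k) for every bit k that is set in c.
 * The result wraps modulo 2^32, like a 32-bit multiply. */
static uint32_t mul_naive_shift_add(uint32_t x, uint32_t c)
{
    uint32_t sum = 0;
    for (unsigned k = 0; k < 32; k++) {
        if (c & (UINT32_C(1) << k))
            sum += x << k;
    }
    return sum;
}

/* Hypothetical cost estimate for the naive sequence above, counting
 * every shift and every add as one unit. Real compilers find shorter
 * sequences (subtraction, lea, factoring), so this is an upper bound
 * for illustration only. */
static unsigned naive_shift_add_cost(uint32_t c)
{
    unsigned bits = 0, shifts = 0;
    for (unsigned k = 0; k < 32; k++) {
        if (c & (UINT32_C(1) << k)) {
            bits++;
            if (k > 0)
                shifts++;        /* bit 0 needs no shift */
        }
    }
    return bits ? shifts + (bits - 1) : 0;
}

/* Replace the multiply only if the estimated sequence is cheaper than
 * the assumed multiply cost for the target CPU. */
static int worth_replacing(uint32_t c, unsigned mul_cost)
{
    return naive_shift_add_cost(c) < mul_cost;
}
```

With a multiply cost of 5, `worth_replacing(10, 5)` is true under this estimate (three simple operations), while `worth_replacing(1000, 5)` is false (eleven). Note that this compares operation counts only; the table above suggests that throughput and latency can point in different directions, which a single cost number does not capture.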

