This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: How to avoid de-optimization
- To: Jan Hubicka <jh at suse dot cz>
- Subject: Re: How to avoid de-optimization
- From: Frank Klemm <pfk at fuchs dot offl dot uni-jena dot de>
- Date: Sun, 26 Aug 2001 20:29:24 +0200
- Cc: gcc at gcc dot gnu dot org
- References: <20010826101321.B8344@atrey.karlin.mff.cuni.cz> <20010826133123.A326@fuchs.offl.uni-jena.de> <20010826171732.H17801@atrey.karlin.mff.cuni.cz>
On Sun, Aug 26, 2001 at 05:17:32PM +0200, Jan Hubicka wrote:
> > On Sun, Aug 26, 2001 at 10:13:21AM +0200, Jan Hubicka wrote:
> > > Hi,
> > > Actually the MUL->arithmetic conversion is controlled by the cost information
> > > near the beginning of the i386.c file and is CPU model specific.
> > > For instance, the K6 cost is 3, while the cost of a simple operation is 1. This means
> > > that gcc will replace a mul by one or two simple operations.
> > >
> > > In the Athlon case it is set to 5, for Pentium II to 4, and for Pentium 4 to 30, always
> > > representing the latency of the imul instruction relative to simple arithmetic.
> > >
> > > In what CPU are you experiencing slowdown?
> > >
> > Athlon.
> >
> > IMUL takes 2 clocks, shift operations/adds something around 0.6...0.7
> > clocks.
> I've just cross-checked the Athlon Optimization Manual:
>
> Use Alternative Code When Multiplying by a Constant
>
> A 32-bit integer multiply by a constant has a latency
> of five cycles. Therefore, use alternative code when multiplying by certain
> constants. In addition, because there is just one multiply unit, the
> replacement code may provide better throughput. The following code samples are
> designed such that the original source also receives the final result. Other
> sequences are possible if the result is in a different register. Adds have been
> favored over shifts to keep code size small. Generally, there is a fast
> replacement if the constant has very few 1 bits in binary. More constants are
> found in the file multiply_by_constants.txt located in the "opt_utilities"
> directory of the documentation CDROM.
>
> So the latency is 5, and the gcc optimization is one of those directly recommended
> by the optimization manual.
>
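For illustration, the kind of replacement the manual passage describes can be sketched in C. This is only a sketch: the function names (mul5, mul10, mul25, mul100, check_replacements) are made up for this example and are not taken from gcc or from the AMD manual, and the decompositions are just one possible choice for each constant. Unsigned 32-bit arithmetic is used so that overflow wraps the same way in both forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* x * 5 = x + 4*x: one shift and one add (a single scaled LEA on x86). */
uint32_t mul5(uint32_t x)
{
    return (uint32_t)(x + (x << 2));
}

/* x * 10 = (x * 5) * 2: the doubling is an add rather than a shift. */
uint32_t mul10(uint32_t x)
{
    uint32_t t = mul5(x);
    return (uint32_t)(t + t);
}

/* x * 25 = (x * 5) * 5: the same shift-and-add step applied twice. */
uint32_t mul25(uint32_t x)
{
    return mul5(mul5(x));
}

/* x * 100 = (x * 25) * 4: two dependent steps followed by a shift. */
uint32_t mul100(uint32_t x)
{
    return (uint32_t)(mul25(x) << 2);
}

/* Returns 1 if every replacement agrees with a plain multiply
 * on a handful of sample values, including ones that overflow. */
int check_replacements(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 7u, 12345u, 0x7FFFFFFFu, 0xFFFFFFFFu
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        if (mul5(x)   != (uint32_t)(x * 5u))   return 0;
        if (mul10(x)  != (uint32_t)(x * 10u))  return 0;
        if (mul25(x)  != (uint32_t)(x * 25u))  return 0;
        if (mul100(x) != (uint32_t)(x * 100u)) return 0;
    }
    return 1;
}
```

Each step in these chains depends on the previous one, so the latency of a replacement grows with the number of steps even when its throughput stays well below that of imul.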
instruction         throughput         latency
imul0x03          : 2.17011 clocks     5.00795 clocks
imul0x7F          : 2.17012 clocks     5.00795 clocks
imul0x7FFFFFFF    : 2.17011 clocks     5.00794 clocks
imulvar           : 2.17010 clocks     4.00633 clocks
imul64            : 6.00947 clocks     6.00947 clocks
fast2             : 0.50079 clocks     1.00158 clocks
fast8             : 0.66772 clocks     1.00157 clocks
fast0x80000000    : 0.83464 clocks     1.00158 clocks
fast5             : 0.66771 clocks     2.00319 clocks
fast10            : 1.14406 clocks     3.00475 clocks
fast80            : 1.16851 clocks     3.00476 clocks
fast25            : 1.16853 clocks     4.00634 clocks
fast50            : 1.39109 clocks     5.00844 clocks
fast100           : 1.58585 clocks     5.00794 clocks
fast125           : 1.58585 clocks     6.00957 clocks
fast250           : 2.00319 clocks     7.01115 clocks
fast1000          : 2.00801 clocks     7.01115 clocks
fast10000         : 2.59737 clocks     9.01427 clocks
(CPU: Athlon 700)
See the code in the appendix.
example2.zip
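
To make the comparison in the table above explicit, here is a small C sketch that counts how many of the "fast" sequences beat a constant imul on latency and on throughput. The numbers are copied from the table; the struct and function names and the 0.05 clock margin (used so that measurement noise does not count as a win) are assumptions made for this example only.

```c
#include <stddef.h>

struct timing {
    const char *name;
    double throughput;  /* clocks */
    double latency;     /* clocks */
};

/* Reference: imul by a constant, as measured in the table (imul0x03). */
static const struct timing imul_const = { "imul0x03", 2.17011, 5.00795 };

/* The shift/add replacement sequences from the table. */
static const struct timing fast[] = {
    { "fast2",          0.50079, 1.00158 },
    { "fast8",          0.66772, 1.00157 },
    { "fast0x80000000", 0.83464, 1.00158 },
    { "fast5",          0.66771, 2.00319 },
    { "fast10",         1.14406, 3.00475 },
    { "fast80",         1.16851, 3.00476 },
    { "fast25",         1.16853, 4.00634 },
    { "fast50",         1.39109, 5.00844 },
    { "fast100",        1.58585, 5.00794 },
    { "fast125",        1.58585, 6.00957 },
    { "fast250",        2.00319, 7.01115 },
    { "fast1000",       2.00801, 7.01115 },
    { "fast10000",      2.59737, 9.01427 },
};

#define N_FAST (sizeof fast / sizeof fast[0])
#define MARGIN 0.05  /* assumed noise margin in clocks */

/* Number of sequences whose latency is clearly below that of imul. */
size_t count_latency_wins(void)
{
    size_t wins = 0;
    for (size_t i = 0; i < N_FAST; i++)
        if (fast[i].latency < imul_const.latency - MARGIN)
            wins++;
    return wins;
}

/* Number of sequences whose throughput is clearly better than imul. */
size_t count_throughput_wins(void)
{
    size_t wins = 0;
    for (size_t i = 0; i < N_FAST; i++)
        if (fast[i].throughput < imul_const.throughput - MARGIN)
            wins++;
    return wins;
}
```

With the figures above, 7 of the 13 sequences win on latency and 12 of the 13 win on throughput, so which variant is better depends on whether the surrounding code is bound by latency or by throughput.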