This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: SSE2 benchmarks

To: Jan Hubicka <jh at suse dot cz>
Subject: Re: SSE2 benchmarks
From: Paolo Carlini <pcarlini at unitus dot it>
Date: Mon, 02 Jul 2001 23:32:05 +0200
CC: gcc at gcc dot gnu dot org
Organization: Universita' della Tuscia
References: <3B3CA9C2.42FB2E9B@unitus.it> <004b01c101a6$dce6b2b0$9865fea9@timayum4srqln4> <20010702113847.C3390@atrey.karlin.mff.cuni.cz> <3B40471C.E522E6C8@unitus.it> <20010702230436.A30762@atrey.karlin.mff.cuni.cz>
Reply-To: pcarlini at unitus dot it

Hi Jan...

Jan Hubicka wrote:

> Well, it depends on the loop. But if it is reasonably simple, I can take
> a look at it and see what I can do :)
> Don't forget to try -funroll-all-loops / -ffast-math and similar tricks.
> Also, possibly your stack is misaligned at main level.
> Honza

Thanks for your reply. IMHO -funroll-all-loops is not a switch to be recommended
for general use. But please correct me if my impressions are wrong.

Here are my preliminary results for a VC6 Release build vs GCC3.0 -O2 -funroll-loops.
The MFLOPS(1) index is the most important, MFLOPS(2) is not relevant without
vectorization, and MFLOPS(3) and MFLOPS(4) are characterized by a reduced number of
FDIVs with respect to MFLOPS(1).

-----------
VC6-Windoze
-----------
   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1    -7.6739e-013      0.0888    157.6355
     2    -5.7021e-013      0.0681    102.7582
     3    -2.4314e-014      0.0985    172.5615
     4     6.8723e-014      0.0936    160.2337
     5    -1.6320e-014      0.2120    136.8227
     6     1.3961e-013      0.1533    189.2188
     7    -3.6209e-011      0.2355     50.9579
     8     9.0483e-015      0.1801    166.5763

   Iterations      =  256000000
   NullTime (usec) =     0.0000
   MFLOPS(1)       =   118.4184
   MFLOPS(2)       =    95.7113
   MFLOPS(3)       =   137.5098
   MFLOPS(4)       =   173.1723

------------------
GCC_3.0-Linux2.4.4
------------------
   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -5.4307e-13      0.1986     70.4957
     2      8.5760e-16      0.1109     63.1431
     3      3.4601e-14      0.2152     78.9837
     4      3.6952e-13      0.1368    109.6516
     5     -5.1441e-15      0.4043     71.7295
     6      2.3913e-14      0.3234     89.6618
     7     -1.6393e-10      0.3486     34.4240
     8      1.4633e-13      0.2305    130.1254

   Iterations      =  128000000
   NullTime (usec) =     0.0012
   MFLOPS(1)       =    67.5736
   MFLOPS(2)       =    56.8706
   MFLOPS(3)       =    78.6003
   MFLOPS(4)       =   100.4398


As you can see, on my PII400, Linux2.4.4 or Windoze98, the difference is quite
startling...

You will find attached to this message the original flops.c file together with a
short description from its author. Please let me know if there is anything else
I can do that may help you in your work.

Also, I would like to show that GCC3.0 is doing much better :) than 2.95.2 in this
test (still -O2 -funroll-loops)

---------------------
GCC_2.95.2-Linux2.4.4
---------------------
   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -5.4193e-13      0.1859     75.3258
     2      8.5760e-16      0.2486     28.1584
     3      3.4567e-14      0.2496     68.1064
     4      3.6970e-13      0.1214    123.5521
     5     -5.1910e-15      0.4051     71.5911
     6      2.3930e-14      0.3736     77.6244
     7     -1.6393e-10      0.3191     37.6010
     8      1.4631e-13      0.2522    118.9591

   Iterations      =  128000000
   NullTime (usec) =     0.0012
   MFLOPS(1)       =    34.8390
   MFLOPS(2)       =    58.1905
   MFLOPS(3)       =    76.5651
   MFLOPS(4)       =    91.2924


Thanks again for your interest,
Paolo Carlini.

flops.c.gz

-------
   I have finally revised the flops.c program to version 2.0 which
   addresses the concerns brought out over the last year or so (version
   1.2c and earlier versions). Below is a discussion of the new flops.c
   program (flops20.c) and some results for the HP 9000/730 and IBM
   RS/6000 Model 550 systems.

   Flops.c is a 'c' program which attempts to estimate your system's
   floating-point 'MFLOPS' rating for the FADD, FSUB, FMUL, and FDIV
   operations based on specific 'instruction mixes' (discussed below).
   The program provides an estimate of PEAK MFLOPS performance by making
   maximal use of register variables with minimal interaction with main
   memory. The execution loops are all small so that they will fit in
   any cache. Flops.c can be used along with Linpack and the Livermore
   kernels (which exercise memory much more extensively) to gain further
   insight into the limits of system performance. The flops.c execution
   modules include various percent weightings of FDIV's (from 0% to 25%
   FDIV's) so that the range of performance can be obtained when using
   FDIV's. FDIV's, being computationally more intensive than FADD's or
   FMUL's, can impact performance considerably on some systems.
   
   Flops.c consists of 8 independent 'modules' which, except for module
   2, conduct numerical integration of various functions. Some of the
   functions (sin(x) and cos(x)) are approximated using a power series
   expansion accurate to 1.0e-14 over the integration interval. Module 2
   estimates the value of pi based upon the Maclaurin series expansion of
   atan(1). MFLOPS ratings are provided for each module, but the program's
   overall results are summarized by the MFLOPS(1), MFLOPS(2), MFLOPS(3),
   and MFLOPS(4) outputs.

   The MFLOPS(1) result is identical to the result provided by all
   previous versions of flops.c (flops12c.c and earlier versions). It is
   based only upon the results from modules 2 and 3. Actually, on faster
   machines, MFLOPS(1) from flops.c V2.0 is expected to provide more
   accurate results since the number of iterations conducted (which is
   reflected in the timing accuracy) is more tightly controlled than in
   previous versions of flops.c.
   
   Two problems surfaced in using MFLOPS(1). First, it was difficult to
   completely 'vectorize' the result due to the recurrence of the 's'
   variable in module 2. This problem is addressed in the MFLOPS(2) result
   which does not use module 2, but maintains nearly the same weighting of
   FDIV's (9.2%) as in MFLOPS(1) (9.6%). For scalar machines the MFLOPS(2)
   results 'should' be similar to the MFLOPS(1) results. However, for
   vector machines the MFLOPS(1) and MFLOPS(2) results may differ
   considerably since the MFLOPS(2) result is expected to be more
   completely vectorizable. The second problem with MFLOPS(1) centers
   around the percentage of FDIV's (9.6%) which was viewed as too high for
   an important class of problems. This concern is addressed in the
   MFLOPS(3) result which does only 3.4% FDIV's, and the MFLOPS(4) result
   where NO FDIV's are conducted at all.
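
   To make the recurrence on 's' concrete, here is a minimal sketch of a
   loop of the same general shape. It is not the flops.c source: the
   function name estimate_pi and the exact statements are assumptions made
   for illustration. It sums the Maclaurin series for atan(1), and both the
   sign 's' and the running sum are computed from their values in the
   previous pass, which is the kind of loop-carried dependence described
   above.

```c
#include <stddef.h>

/* Hypothetical sketch, not taken from flops.c: estimate pi from the
 * Maclaurin series atan(1) = 1 - 1/3 + 1/5 - 1/7 + ...
 * The sign 's' is derived from its own previous value on every pass. */
double estimate_pi(size_t terms)
{
    double s = -1.0;   /* alternating sign, flipped at the top of each pass */
    double u = -1.0;   /* odd denominator: becomes 1, 3, 5, ...             */
    double w = 0.0;    /* running sum, converges slowly to pi/4             */

    for (size_t i = 0; i < terms; i++) {
        s = -s;          /* recurrence: needs 's' from the previous pass */
        u = u + 2.0;     /* one FADD                                     */
        w = w + s / u;   /* one FDIV and one FADD                        */
    }
    return 4.0 * w;
}
```

   Calling estimate_pi(1000000) gives a value that agrees with pi to about
   five decimal places; the series converges slowly, so the loop is useful
   as a timing kernel rather than as a way to compute pi.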
   
   The number of floating-point instructions per iteration (loop) is
   given below for each module executed.

   MODULE   FADD   FSUB   FMUL   FDIV   TOTAL  Comment
     1        7      0      6      1      14   7.1%  FDIV's
     2        3      2      1      1       7   difficult to vectorize.
     3        6      2      9      0      17   0.0%  FDIV's
     4        7      0      8      0      15   0.0%  FDIV's
     5       13      0     15      1      29   3.4%  FDIV's
     6       13      0     16      0      29   0.0%  FDIV's
     7        3      3      3      3      12   25.0% FDIV's
     8       13      0     17      0      30   0.0%  FDIV's
   
   A*2+3     21     12     14      5      52   A=5, MFLOPS(1), Same as
            40.4%  23.1%  26.9%  9.6%          previous versions of the
                                               flops.c program. Includes
                                               only Modules 2 and 3.
   
   1+3+4     58     14     66     14     152   A=4, MFLOPS(2), New output
   +5+6+    38.2%  9.2%   43.4%  9.2%          does not include Module 2,
   A*7                                         but does 9.2% FDIV's.
   
   1+3+4     62      5     74      5     146   A=0, MFLOPS(3), New output
   +5+6+    42.5%  3.4%   50.7%  3.4%          does not include Module 2,
   7+8                                         but does 3.4% FDIV's.

   3+4+6     39      2     50      0      91   A=0, MFLOPS(4), New output
   +8       42.9%  2.2%   54.9%  0.0%          does not include Module 2,
                                               and does NO FDIV's.
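
   The composite rows in the table can be rebuilt from the per-module rows.
   The sketch below does that arithmetic in C. The counts are copied from
   the table; the identifiers (module_ops, mix1_weight, mix_count and so
   on) are hypothetical names chosen for this example and do not come from
   flops.c.

```c
/* Floating-point instructions per loop iteration for modules 1..8,
 * copied from the table above. Columns: FADD, FSUB, FMUL, FDIV. */
enum { FADD, FSUB, FMUL, FDIV, NUM_OPS };
enum { NUM_MODULES = 8 };

const int module_ops[NUM_MODULES][NUM_OPS] = {
    { 7, 0,  6, 1},   /* module 1 */
    { 3, 2,  1, 1},   /* module 2 */
    { 6, 2,  9, 0},   /* module 3 */
    { 7, 0,  8, 0},   /* module 4 */
    {13, 0, 15, 1},   /* module 5 */
    {13, 0, 16, 0},   /* module 6 */
    { 3, 3,  3, 3},   /* module 7 */
    {13, 0, 17, 0},   /* module 8 */
};

/* How many times each module is counted in each composite mix. */
const int mix1_weight[NUM_MODULES] = {0, 5, 1, 0, 0, 0, 0, 0}; /* 5*2 + 3       */
const int mix2_weight[NUM_MODULES] = {1, 0, 1, 1, 1, 1, 4, 0}; /* 1+3+4+5+6+4*7 */
const int mix3_weight[NUM_MODULES] = {1, 0, 1, 1, 1, 1, 1, 1}; /* 1+3+4+5+6+7+8 */
const int mix4_weight[NUM_MODULES] = {0, 0, 1, 1, 0, 1, 0, 1}; /* 3+4+6+8       */

/* Weighted count of one instruction type (FADD, FSUB, FMUL or FDIV). */
int mix_count(const int weight[NUM_MODULES], int op)
{
    int count = 0;
    for (int m = 0; m < NUM_MODULES; m++)
        count += weight[m] * module_ops[m][op];
    return count;
}

/* Weighted count of all floating-point instructions in the mix. */
int mix_total(const int weight[NUM_MODULES])
{
    int total = 0;
    for (int op = 0; op < NUM_OPS; op++)
        total += mix_count(weight, op);
    return total;
}

/* Share of one instruction type in the mix, in percent. */
double mix_percent(const int weight[NUM_MODULES], int op)
{
    return 100.0 * mix_count(weight, op) / mix_total(weight);
}
```

   For example, mix_total(mix1_weight) is 52 and mix_count(mix1_weight,
   FDIV) is 5, which matches the 9.6% FDIV share listed for MFLOPS(1).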

   I hope that flops.c V2.0 (flops20.c) proves more useful than earlier
   versions.


(1) HP 9000/730 flops.c V2.0 Results, cc +OS +O3 -W1-a,archive   

   Below are the HP 9000/730 results (provided by Bo Thide'). The minimum
   MFLOPS rating is 15.1 MFLOPS for module 7, which does 25% FDIV's. The
   maximum MFLOPS rating is 37.1 MFLOPS for module 6, which does 0.0%
   FDIV's. FDIV appears to be reasonably efficient on the HP 9000/730,
   as indicated by the overall MFLOPS(n) outputs. 

   The 'Runtime' output is the time in microseconds (usec) for one
   iteration (loop) through the module. The MFLOPS rating is obtained by
   dividing the number of floating-point instructions in the loop by the
   Runtime (in microseconds). For example for module 1 below:
   MFLOPS = 14.0 / 0.5978 = 23.42.

   The Runtime output has already been adjusted for an estimate of the
   time in microseconds to conduct one empty 'for' loop (NullTime). If
   NullTime is not calculated (that is, NullTime = 0.0), due to compiler
   optimization, it can produce a 3% to 5% lower MFLOPS rating than would
   otherwise be obtained.
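
   The arithmetic in the two paragraphs above can be sketched as a pair of
   small C helpers. The function names are hypothetical and this is not
   the timing code from flops.c; it only illustrates the NullTime
   adjustment and the division that produces a module's MFLOPS rating.

```c
/* Hypothetical helpers, not taken from flops.c. */

/* Time for one pass through a module, in microseconds, after
 * subtracting the estimated cost of one empty 'for' loop (NullTime). */
double adjusted_runtime(double raw_usec_per_iter, double nulltime_usec)
{
    return raw_usec_per_iter - nulltime_usec;
}

/* MFLOPS rating: floating-point instructions per iteration divided by
 * the runtime of one iteration in microseconds. */
double module_mflops(double flops_per_iter, double runtime_usec)
{
    return flops_per_iter / runtime_usec;
}
```

   With the HP 9000/730 figures for module 1, module_mflops(14.0, 0.5978)
   reproduces the rating of about 23.42. If the reported NullTime of 0.0306
   usec were left in the runtime instead (0.5978 + 0.0306 usec), the same
   call would give a rating roughly 5% lower, in line with the range quoted
   above.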


   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -4.6896e-13      0.5978     23.4187
     2      2.2160e-13      0.2447     28.6079
     3     -6.9944e-15      0.7412     22.9342
     4     -9.7256e-14      0.6906     21.7195
     5     -1.6542e-14      0.9200     31.5217
     6      4.3632e-14      0.7822     37.0755
     7     -4.9454e-11      0.7972     15.0529
     8      7.2164e-14      0.8275     36.2538

   Iterations      =   32000000
   NullTime (usec) =     0.0306
   MFLOPS(1)       =    26.4673  [same as flops12c.c, 9.6% FDIV's]
   MFLOPS(2)       =    21.9633  [9.2% FDIV's]
   MFLOPS(3)       =    27.2566  [3.4% FDIV's]
   MFLOPS(4)       =    29.9188  [0.0% FDIV's]
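
   The four summary figures can be reproduced from the per-module RunTime
   column. The sketch below assumes that each MFLOPS(n) is the weighted
   instruction count of its mix divided by the equally weighted sum of the
   module runtimes. That formula is an inference, not a quotation from
   flops.c, but it matches the figures printed above to within rounding.
   All identifiers are hypothetical.

```c
enum { NUM_MODULES = 8 };

/* Total floating-point instructions per iteration for modules 1..8,
 * from the TOTAL column of the instruction-count table. */
const int module_flops[NUM_MODULES] = {14, 7, 17, 15, 29, 29, 12, 30};

/* RunTime (usec) for modules 1..8 from the HP 9000/730 results above. */
const double hp730_runtime[NUM_MODULES] = {
    0.5978, 0.2447, 0.7412, 0.6906, 0.9200, 0.7822, 0.7972, 0.8275
};

/* Assumed formula: weighted instruction count / weighted runtime.
 * weight[m] is how many times module m+1 is counted in the mix. */
double composite_mflops(const double runtime_usec[NUM_MODULES],
                        const int weight[NUM_MODULES])
{
    double total_flops = 0.0;
    double total_usec = 0.0;

    for (int m = 0; m < NUM_MODULES; m++) {
        total_flops += weight[m] * module_flops[m];
        total_usec  += weight[m] * runtime_usec[m];
    }
    return total_flops / total_usec;
}
```

   For instance, with the weights {0, 5, 1, 0, 0, 0, 0, 0} (five times
   module 2 plus module 3) the function returns about 26.47 for the HP
   runtimes, which agrees with the printed MFLOPS(1) value.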


(2) IBM RS/6000 Model 550 flops.c V2.0 results, cc -DUNIX -O -Q

   The IBM RS/6000 Model 550 flops20.c results are shown below. Here,
   the minimum MFLOPS rating is 7.3 MFLOPS also for module 7 which does
   25.0% FDIV's. The maximum MFLOPS rating is 56.9 MFLOPS (!) also for
   module 6 which does 0.0% FDIV's. While the Model 550 works wonders
   with FADD's and FMUL's, its performance falls off rapidly with FDIV's.


   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -4.6896e-13      0.7028     19.9200
     2      2.2160e-13      0.5806     12.0560
     3     -7.0499e-15      0.4372     38.8849
     4     -9.7145e-14      0.4359     34.4086
     5     -1.6542e-14      0.9903     29.2837
     6      4.3632e-14      0.5100     56.8627
     7     -4.9454e-11      1.6456      7.2921
     8      7.2164e-14      0.5572     53.8418

   Iterations      =   32000000
   NullTime (usec) =     0.0484
   MFLOPS(1)       =    15.5674  [same as flops12c.c, 9.6% FDIV's]
   MFLOPS(2)       =    15.7370  [9.2% FDIV's]
   MFLOPS(3)       =    27.6568  [3.4% FDIV's]
   MFLOPS(4)       =    46.8997  [0.0% FDIV's]

Al Aburto
aburto@nosc.mil

-------

References:
- Re: SSE2 benchmarks
  - From: Jan Hubicka
- Re: SSE2 benchmarks
  - From: Paolo Carlini
- Re: SSE2 benchmarks
  - From: Jan Hubicka
