This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: SSE2 benchmarks
Hi Jan...
Jan Hubicka wrote:
> Well, it depends on the loop. But if it is reasonably simple I can take
> a look at it and see what I can do :)
> Don't forget to try -funroll-all-loops/ -ffast-math and similar tricks.
> Also possibly your stack is misaligned at main level.
> Honza
Thanks for your reply. IMHO -funroll-all-loops is not a switch to be recommended
for general use. But please correct me if my impression is wrong.
Here are my preliminary results for a VC6 Release build vs GCC3.0 -O2
-funroll-loops. The MFLOPS(1) index is the most important, MFLOPS(2) is not
relevant without vectorization, and MFLOPS(3) and MFLOPS(4) are characterized by
a reduced number of FDIVs with respect to MFLOPS(1).
-----------
VC6-Windoze
-----------
FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module        Error         RunTime      MFLOPS
                             (usec)
     1     -7.6739e-013      0.0888    157.6355
     2     -5.7021e-013      0.0681    102.7582
     3     -2.4314e-014      0.0985    172.5615
     4      6.8723e-014      0.0936    160.2337
     5     -1.6320e-014      0.2120    136.8227
     6      1.3961e-013      0.1533    189.2188
     7     -3.6209e-011      0.2355     50.9579
     8      9.0483e-015      0.1801    166.5763

Iterations      = 256000000
NullTime (usec) = 0.0000
MFLOPS(1)       = 118.4184
MFLOPS(2)       = 95.7113
MFLOPS(3)       = 137.5098
MFLOPS(4)       = 173.1723
------------------
GCC_3.0-Linux2.4.4
------------------
FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module        Error         RunTime      MFLOPS
                             (usec)
     1      -5.4307e-13      0.1986     70.4957
     2       8.5760e-16      0.1109     63.1431
     3       3.4601e-14      0.2152     78.9837
     4       3.6952e-13      0.1368    109.6516
     5      -5.1441e-15      0.4043     71.7295
     6       2.3913e-14      0.3234     89.6618
     7      -1.6393e-10      0.3486     34.4240
     8       1.4633e-13      0.2305    130.1254

Iterations      = 128000000
NullTime (usec) = 0.0012
MFLOPS(1)       = 67.5736
MFLOPS(2)       = 56.8706
MFLOPS(3)       = 78.6003
MFLOPS(4)       = 100.4398
As you can see, on my PII400, Linux2.4.4 or Windoze98, the difference is quite
startling...
You will find attached to this message the original flops.c file together with a
short description from its author. Please let me know if there is anything else
I can do that may help you in your work.
Also, I would like to show that GCC3.0 is doing much better :) than 2.95.2 in this
test (still -O2 -funroll-loops):
---------------------
GCC_2.95.2-Linux2.4.4
---------------------
FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module        Error         RunTime      MFLOPS
                             (usec)
     1      -5.4193e-13      0.1859     75.3258
     2       8.5760e-16      0.2486     28.1584
     3       3.4567e-14      0.2496     68.1064
     4       3.6970e-13      0.1214    123.5521
     5      -5.1910e-15      0.4051     71.5911
     6       2.3930e-14      0.3736     77.6244
     7      -1.6393e-10      0.3191     37.6010
     8       1.4631e-13      0.2522    118.9591

Iterations      = 128000000
NullTime (usec) = 0.0012
MFLOPS(1)       = 34.8390
MFLOPS(2)       = 58.1905
MFLOPS(3)       = 76.5651
MFLOPS(4)       = 91.2924
Thanks again for your interest,
Paolo Carlini.
flops.c.gz
-------
I have finally revised the flops.c program to version 2.0, which
addresses the concerns raised over the last year or so (version
1.2c and earlier versions). Below is a discussion of the new flops.c
program (flops20.c) and some results for the HP 9000/730 and IBM
RS/6000 Model 550 systems.
Flops.c is a 'c' program which attempts to estimate your system's
floating-point 'MFLOPS' rating for the FADD, FSUB, FMUL, and FDIV
operations based on specific 'instruction mixes' (discussed below).
The program provides an estimate of PEAK MFLOPS performance by making
maximal use of register variables with minimal interaction with main
memory. The execution loops are all small so that they will fit in
any cache. Flops.c can be used along with Linpack and the Livermore
kernels (which exercise memory much more extensively) to gain further
insight into the limits of system performance. The flops.c execution
modules include various percent weightings of FDIV's (from 0% to 25%
FDIV's) so that the range of performance can be obtained when using
FDIV's. FDIV's, being computationally more intensive than FADD's or
FMUL's, can impact performance considerably on some systems.
Flops.c consists of 8 independent 'modules' which, except for module
2, conduct numerical integration of various functions. Some of the
functions (sin(x) and cos(x)) are approximated using a power series
expansion accurate to 1.0e-14 over the integration interval. Module 2
estimates the value of pi based upon the Maclaurin series expansion of
atan(1). MFLOPS ratings are provided for each module, but the program's
overall results are summarized by the MFLOPS(1), MFLOPS(2), MFLOPS(3),
and MFLOPS(4) outputs.
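
To make the module 2 idea concrete, here is a minimal sketch of a pi
estimate from the series atan(1) = 1 - 1/3 + 1/5 - 1/7 + ... It is not the
code from flops.c (the real loop does more work per iteration); the function
name pi_from_atan_series is made up for illustration. The sign variable s
depends on its own previous value, which is the sort of loop-carried
recurrence that gets in the way of vectorization.

```c
#include <assert.h>

/* Hypothetical helper, NOT taken from flops.c: approximates pi with the
 * series atan(1) = 1 - 1/3 + 1/5 - 1/7 + ... */
double pi_from_atan_series(long terms)
{
    double s = 1.0;    /* alternating sign, flipped every iteration */
    double u = 1.0;    /* odd denominator: 1, 3, 5, ... */
    double sum = 0.0;
    long i;

    assert(terms > 0);
    for (i = 0; i < terms; i++) {
        sum += s / u;  /* one divide and one add per term */
        s = -s;        /* recurrence: needs the previous value of s */
        u += 2.0;
    }
    return 4.0 * sum;  /* atan(1) = pi/4 */
}
```

The series converges slowly (the error shrinks roughly like 1/terms), so a
large iteration count is needed for a good estimate of pi.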
The MFLOPS(1) result is identical to the result provided by all
previous versions of flops.c (flops12c.c and earlier versions). It is
based only upon the results from modules 2 and 3. Actually, on faster
machines, MFLOPS(1) from flops.c V2.0 is expected to provide more
accurate results since the number of iterations conducted (which is
reflected in the timing accuracy) is more tightly controlled than in
previous versions of flops.c.
Two problems surfaced in using MFLOPS(1). First, it was difficult to
completely 'vectorize' the result due to the recurrence of the 's'
variable in module 2. This problem is addressed in the MFLOPS(2) result
which does not use module 2, but maintains nearly the same weighting of
FDIV's (9.2%) as in MFLOPS(1) (9.6%). For scalar machines the MFLOPS(2)
results 'should' be similar to the MFLOPS(1) results. However, for
vector machines the MFLOPS(1) and MFLOPS(2) results may differ
considerably since the MFLOPS(2) result is expected to be more
completely vectorizable. The second problem with MFLOPS(1) centers
around the percentage of FDIV's (9.6%) which was viewed as too high for
an important class of problems. This concern is addressed in the
MFLOPS(3) result which does only 3.4% FDIV's, and the MFLOPS(4) result
where NO FDIV's are conducted at all.
The number of floating-point instructions per iteration (loop) is
given below for each module executed.
MODULE     FADD   FSUB   FMUL   FDIV  TOTAL   Comment
  1           7      0      6      1     14   7.1% FDIV's
  2           3      2      1      1      7   difficult to vectorize.
  3           6      2      9      0     17   0.0% FDIV's
  4           7      0      8      0     15   0.0% FDIV's
  5          13      0     15      1     29   3.4% FDIV's
  6          13      0     16      0     29   0.0% FDIV's
  7           3      3      3      3     12   25.0% FDIV's
  8          13      0     17      0     30   0.0% FDIV's

A*2+3        21     12     14      5     52   A=5, MFLOPS(1), Same as
          40.4%  23.1%  26.9%   9.6%          previous versions of the
                                              flops.c program. Includes
                                              only Modules 2 and 3.

1+3+4        58     14     66     14    152   A=4, MFLOPS(2), New output
+5+6+     38.2%   9.2%  43.4%   9.2%          does not include Module 2,
A*7                                           but does 9.2% FDIV's.

1+3+4        62      5     74      5    146   A=0, MFLOPS(3), New output
+5+6+     42.5%   3.4%  50.7%   3.4%          does not include Module 2,
7+8                                           but does 3.4% FDIV's.

3+4+6        39      2     50      0     91   A=0, MFLOPS(4), New output
+8        42.9%   2.2%  54.9%   0.0%          does not include Module 2,
                                              and does NO FDIV's.
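
As an illustration of how the combined ratings follow from the table, the
sketch below divides each mix's total operation count by the correspondingly
weighted sum of per-module runtimes. This weighting is an assumption on my
part, chosen because it reproduces the HP 9000/730 MFLOPS(1) to MFLOPS(4)
figures reported in this note; it is not copied from flops.c. The names
composite_mflops and hp730_runtime are made up for the example.

```c
#include <assert.h>

/* Per-module runtimes in usec from the HP 9000/730 run; index 0 is unused. */
static const double hp730_runtime[9] = {
    0.0, 0.5978, 0.2447, 0.7412, 0.6906, 0.9200, 0.7822, 0.7972, 0.8275
};

/* Sketch only: total operations of a mix divided by the weighted runtime
 * of the modules in that mix. t[1]..t[8] are per-iteration runtimes (usec). */
double composite_mflops(int which, const double t[9])
{
    double flops, usec;

    assert(which >= 1 && which <= 4);
    switch (which) {
    case 1:   /* 5 x module 2 + module 3: 52 operations */
        flops = 52.0;
        usec = 5.0 * t[2] + t[3];
        break;
    case 2:   /* modules 1,3,4,5,6 + 4 x module 7: 152 operations */
        flops = 152.0;
        usec = t[1] + t[3] + t[4] + t[5] + t[6] + 4.0 * t[7];
        break;
    case 3:   /* modules 1,3,4,5,6,7,8: 146 operations */
        flops = 146.0;
        usec = t[1] + t[3] + t[4] + t[5] + t[6] + t[7] + t[8];
        break;
    default:  /* modules 3,4,6,8: 91 operations, no FDIV's */
        flops = 91.0;
        usec = t[3] + t[4] + t[6] + t[8];
        break;
    }
    return flops / usec;   /* operations per usec == MFLOPS */
}
```

With the HP runtimes, composite_mflops(1, hp730_runtime) comes out at about
26.47, in line with the reported MFLOPS(1) = 26.4673.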
I hope that flops.c V2.0 (flops20.c) proves more useful than earlier
versions.
(1) HP 9000/730 flops.c V2.0 Results, cc +OS +O3 -W1-a,archive
Below are the HP 9000/730 results (provided by Bo Thide'). The minimum
MFLOPS rating is 15.1 MFLOPS for module 7, which does 25% FDIV's. The
maximum MFLOPS rating is 37.1 MFLOPS for module 6, which does 0.0%
FDIV's. FDIV appears to be reasonably efficient on the HP 9000/730,
as indicated by the overall MFLOPS(n) outputs.
The 'Runtime' output is the time in microseconds (usec) for one
iteration (loop) through the module. The MFLOPS rating is obtained by
dividing the number of floating-point instructions in the loop by the
Runtime (in microseconds). For example for module 1 below:
MFLOPS = 14.0 / 0.5978 = 23.42.
The Runtime output has already been adjusted for an estimate of the
time in microseconds to conduct one empty 'for' loop (NullTime). If
NullTime is not calculated (that is, NullTime = 0.0), due to compiler
optimization, it can produce a 3% to 5% lower MFLOPS rating than would
otherwise be obtained.
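
The per-module arithmetic can be sketched as follows. The function names
are hypothetical and the code is only an illustration of the rule stated
above, not an excerpt from flops.c. The second function shows what the
rating would be if the empty-loop time were left in the runtime, assuming
the unadjusted time is simply the adjusted Runtime plus NullTime.

```c
#include <assert.h>

/* flops: floating-point operations per loop iteration.
 * runtime_usec: time per iteration, already adjusted for NullTime. */
double module_mflops(double flops, double runtime_usec)
{
    assert(runtime_usec > 0.0);
    return flops / runtime_usec;   /* operations per usec == MFLOPS */
}

/* Rating obtained when the empty-loop overhead is NOT subtracted
 * (assumption: unadjusted time = adjusted runtime + NullTime). */
double module_mflops_unadjusted(double flops, double runtime_usec,
                                double nulltime_usec)
{
    return module_mflops(flops, runtime_usec + nulltime_usec);
}
```

For HP module 1 (14 operations, Runtime 0.5978, NullTime 0.0306) the
adjusted rating is about 23.42 and the unadjusted one about 22.28, a drop
of a little under 5%.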
FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module        Error         RunTime      MFLOPS
                             (usec)
     1      -4.6896e-13      0.5978     23.4187
     2       2.2160e-13      0.2447     28.6079
     3      -6.9944e-15      0.7412     22.9342
     4      -9.7256e-14      0.6906     21.7195
     5      -1.6542e-14      0.9200     31.5217
     6       4.3632e-14      0.7822     37.0755
     7      -4.9454e-11      0.7972     15.0529
     8       7.2164e-14      0.8275     36.2538

Iterations      = 32000000
NullTime (usec) = 0.0306
MFLOPS(1)       = 26.4673   [same as flops12c.c, 9.6% FDIV's]
MFLOPS(2)       = 21.9633   [9.2% FDIV's]
MFLOPS(3)       = 27.2566   [3.4% FDIV's]
MFLOPS(4)       = 29.9188   [0.0% FDIV's]
(2) IBM RS/6000 Model 550 flops.c V2.0 results, cc -DUNIX -O -Q
The IBM RS/6000 Model 550 flops20.c results are shown below. Here,
the minimum MFLOPS rating is 7.3 MFLOPS also for module 7 which does
25.0% FDIV's. The maximum MFLOPS rating is 56.9 MFLOPS (!) also for
module 6 which does 0.0% FDIV's. While the Model 550 works wonders
with FADD's and FMUL's, its performance falls off rapidly with FDIV's.
FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module        Error         RunTime      MFLOPS
                             (usec)
     1      -4.6896e-13      0.7028     19.9200
     2       2.2160e-13      0.5806     12.0560
     3      -7.0499e-15      0.4372     38.8849
     4      -9.7145e-14      0.4359     34.4086
     5      -1.6542e-14      0.9903     29.2837
     6       4.3632e-14      0.5100     56.8627
     7      -4.9454e-11      1.6456      7.2921
     8       7.2164e-14      0.5572     53.8418

Iterations      = 32000000
NullTime (usec) = 0.0484
MFLOPS(1)       = 15.5674   [same as flops12c.c, 9.6% FDIV's]
MFLOPS(2)       = 15.7370   [9.2% FDIV's]
MFLOPS(3)       = 27.6568   [3.4% FDIV's]
MFLOPS(4)       = 46.8997   [0.0% FDIV's]
Al Aburto
aburto@nosc.mil
-------