This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
- From: Scott Robert Ladd <coyote at coyotegulch dot com>
- To: gcc mailing list <gcc at gcc dot gnu dot org>
- Date: Sun, 15 Aug 2004 10:55:07 -0400
- Subject: GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004
Good day,
Using a custom benchmark suite of my own design, I have compared the
performance of code generated by recent and pending versions of GCC, for
AMD Opteron and Intel Pentium 4 processors.
Raw Numbers
===========
System Corwin (x86_64)
Dual Opteron 240, 1.4GHz
Tyan K8W 2885 motherboard
120GB Maxtor 7200 RPM ATA-133 HD
2GB PC2700 DRAM (1GB per processor, NUMA)
Radeon 9200 Pro, 128MB, HP f1903 LCD
Linux 2.6.7 #2 SMP Sat Jun 19 20:16:20 EDT 2004
GNU C Library 20040808 release version 2.3.4
GNU assembler 2.15.90.0.1.1 20040303
ln (coreutils) 5.2.1
                  3.2.3   3.3.3   3.4.2   3.5.0
                  -----   -----   -----   -----
alma time:         43.1    43.4    42.3    28.1
arco time:         24.8    25.4    24.7    24.8
evo time:          47.0    65.9    25.0    24.8
fft time:          27.7    27.8    27.4    28.1
huff time:         28.4    28.3    23.6    22.4
lin time:          30.1    30.1    29.8    29.3
mat1 time:         28.5    28.3    28.7    29.7
mole time:         10.7    12.8    12.2    28.8
tree time:         41.8    40.9    37.7    30.4
--------------    -----   -----   -----   -----
total run time:   282.0   302.8   251.5   246.3
System Tycho (i686)
2.8GHz Pentium 4, HT enabled in BIOS and OS
Intel D850EMV2 motherboard
80GB Maxtor 6L080J4, 7200RPM ATA-100 HD
80GB Maxtor 6Y080P0, 7200RPM ATA-100 HD
512MB PC800 RDRAM
Radeon 9200 Pro, NEC FE990 monitor
Linux 2.6.7 #1 SMP Sat Jun 26 12:39:11 EDT 2004
GNU C Library 20040808 release version 2.3.4
GNU assembler 2.14.90.0.8 20040114
ln (coreutils) 5.2.1
                  3.2.3   3.3.3   3.4.2   3.5.0   icc 8
                  -----   -----   -----   -----   -----
alma time:         39.5    39.6    39.0    22.3    13.3
arco time:         27.8    26.9    25.1    27.3    27.7
evo time:          43.1    42.9    42.4    42.1    30.1
fft time:          27.4    27.4    27.0    27.3    30.2
huff time:         23.1    23.6    18.0    13.1    16.3
lin time:          19.1    19.1    18.9    19.5    19.1
mat1 time:          7.4     7.5     7.5     7.5     7.4
mole time:         31.6    30.5    30.9    31.3     5.1
tree time:         30.9    32.3    28.3    25.6    28.8
----------        -----   -----   -----   -----   -----
total time:       249.9   249.7   237.1   215.8   178.0
General Thoughts
================
Overall, GCC 3.5 provides a minor improvement in generated code
performance when compared to GCC 3.4. The historical comparison with
earlier GCCs shows that code performance *is* improving with subsequent
releases.
At this time, GCC 3.5 and 3.4 often produce comparable code -- but on a
few benchmarks, they differ greatly. For the Opteron, GCC 3.5 generates
significantly faster code for the alma and tree benchmarks -- but it
suffers a massive regression on the mole test. For the Pentium 4, GCC
3.5 is superior for the alma, huff, and tree tests, but loses a bit of
ground against 3.4 on others.
Intel C is still amazingly effective. HOWEVER, I do not have a more
recent version of Intel C because my current commercial license has
expired, and compiler updates won't install any more. In terms of
intellectual and practical freedom, GCC wins hands down.
The Usual Explanations and Caveats
==================================
All compilers were built on the host systems, from official, unpatched
archives (3.2 and 3.3) or CVS checkouts (3.4 and 3.5), acquired on the
morning of 14 August 2004. The compiler configuration command was:
    ../gcc/configure --prefix=/opt/gcc-3.? \
        --enable-shared \
        --enable-threads=posix \
        --enable-__cxa_atexit \
        --disable-checking \
        --disable-multilib \
        --enable-languages=c,c++,f77
(with f95 in place of f77 for GCC 3.5)
The compilers were built with make -j2 bootstrap.
Since we're interested in generated code speed, all compiles were
performed with the option set used by typical users:
    -O3 -ffast-math -march=pentium4     (Pentium 4)
    -O3 -ffast-math -march=athlon-mp    (Opteron, for GCCs 3.2 and 3.3)
    -O3 -ffast-math -march=opteron      (Opteron, for GCCs 3.4 and 3.5)
On the Pentium 4, I also compiled the code with Intel's ICC compiler,
version 8.0.055 (build 20031211Z), using the options:
-xN -O3 -ipo
As my Acovea program has shown, a selection of individual optimization
flags often produces code that runs faster than what is generated by
the generic -O? options. However, most programmers don't have the time
or expertise required for finding optimal optimizations (!) -- and as
such, they tend to use the most "powerful" composite options (e.g., -O3).
Some folk may object to my use of -ffast-math -- however, in numerous
accuracy tests, -ffast-math produces code that is both faster *and* more
accurate than code generated without it. Yes, -ffast-math has other
aspects that make for interesting debate; however, such discussions
belong in another article.
This article is *NOT* a comparison of the Pentium 4 and Opteron
processors; my two test systems are far too different for any such
comparison to have meaning. Please do not ask me to test on systems I
don't own, unless you're willing to send me hardware. Assuming I find
some paying work this month, I'll be making some system upgrades in the
near future; for now, what I've got is what I've got.
About the Benchmarks
====================
alma -- Calculates the daily ephemeris (at noon) for the years
2000-2099; tests array handling, floating-point math, and mathematical
functions such as sin() and cos().
evo -- A simple genetic algorithm that maximizes a two-dimensional
function; tests 64-bit math, loop generation, and floating-point math.
fft -- Uses a Fast Fourier Transform to multiply two very (very) large
polynomials; tests the C99 _Complex type and basic floating-point math.
huff -- Compresses a large block of data using the Huffman algorithm;
tests string manipulation, bit twiddling, and the use of large memory
blocks.
lin -- Solves a large linear equation via LUP decomposition; tests basic
floating-point math, two-dimensional array performance, and loop
optimization.
mat1 -- Multiplies two very large matrices using the brute-force
algorithm; tests loop optimization.
mole -- A molecular dynamics simulation, with performance predicated on
matrix operations, loop efficiency, and sin() and cos(). I recently
added this test, which exhibits very different characteristics from alma
(even if they appear similar).
tree -- Creates and modifies a large B-tree in memory; tests integer
looping and dynamic memory management.
My benchmark suite is still in development, and isn't packaged as nicely
as I'd like for general distribution. If you'd like the benchmark source
code, or have any questions about these tests, please e-mail me.
Thank you!
--
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing