This is the mail archive of the mailing list for the GCC project.

GCC Benchmarks (coybench), AMD64 and i686, 14 August 2004

Good day,

Using a custom benchmark suite of my own design, I have compared the performance of code generated by recent and pending versions of GCC, for AMD Opteron and Intel Pentium 4 processors.

Raw Numbers

System Corwin (x86_64)
  Dual Opteron 240, 1.4GHz
  Tyan K8W 2885 motherboard
  120GB Maxtor 7200 RPM ATA-133 HD
  2GB PC2700 DRAM (1GB per processor, NUMA)
  Radeon 9200 Pro, 128MB, HP f1903 LCD

  Linux 2.6.7 #2 SMP Sat Jun 19 20:16:20 EDT 2004
  GNU C Library 20040808 release version 2.3.4
  GNU assembler 20040303
  ln (coreutils) 5.2.1

                 3.2.3  3.3.3  3.4.2  3.5.0
                 -----  -----  -----  -----
     alma time:   43.1   43.4   42.3   28.1
     arco time:   24.8   25.4   24.7   24.8
      evo time:   47.0   65.9   25.0   24.8
      fft time:   27.7   27.8   27.4   28.1
     huff time:   28.4   28.3   23.6   22.4
      lin time:   30.1   30.1   29.8   29.3
     mat1 time:   28.5   28.3   28.7   29.7
     mole time:   10.7   12.8   12.2   28.8
     tree time:   41.8   40.9   37.7   30.4
--------------   -----  -----  -----  -----
total run time:  282.0  302.8  251.5  246.3

System Tycho (i686)
  2.8GHz Pentium 4, HT enabled in BIOS and OS
  Intel D850EMV2 motherboard
  80GB Maxtor 6L080J4, 7200RPM ATA-100 HD
  80GB Maxtor 6Y080P0, 7200RPM ATA-100 HD
  512MB PC800 RDRAM
  Radeon 9200 Pro, NEC FE990 monitor

  Linux 2.6.7 #1 SMP Sat Jun 26 12:39:11 EDT 2004
  GNU C Library 20040808 release version 2.3.4
  GNU assembler 20040114
  ln (coreutils) 5.2.1

                 3.2.3  3.3.3  3.4.2  3.5.0  icc 8
                 -----  -----  -----  -----  -----
     alma time:   39.5   39.6   39.0   22.3   13.3
     arco time:   27.8   26.9   25.1   27.3   27.7
      evo time:   43.1   42.9   42.4   42.1   30.1
      fft time:   27.4   27.4   27.0   27.3   30.2
     huff time:   23.1   23.6   18.0   13.1   16.3
      lin time:   19.1   19.1   18.9   19.5   19.1
     mat1 time:    7.4    7.5    7.5    7.5    7.4
     mole time:   31.6   30.5   30.9   31.3    5.1
     tree time:   30.9   32.3   28.3   25.6   28.8
    ----------   -----  -----  -----  -----  -----
    total time:  249.9  249.7  237.1  215.8  178.0

General Thoughts
================

Overall, GCC 3.5 provides a minor improvement in generated code performance when compared to GCC 3.4. The historical comparison with earlier GCCs shows that code performance *is* improving with subsequent releases.

At this time, GCC 3.5 and 3.4 often produce comparable code -- but on a few benchmarks, they differ greatly. For the Opteron, GCC 3.5 generates significantly faster code for the alma and tree benchmarks -- but it suffers a massive regression on the mole test. For the Pentium 4, GCC 3.5 is superior for the alma, huff, and tree tests, but loses a bit of ground against 3.4 on others.
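
To put numbers on "differ greatly", the relative change between two compilers can be computed directly from the tables. The following is only a minimal sketch: the function and array names are made up for illustration, and the values are the Opteron 3.4.2 and 3.5.0 columns copied from the table above.

```c
#include <assert.h>
#include <stdio.h>

/* Percent change in run time going from `before` to `after`.
   A negative result means the newer compiler's code ran faster. */
double percent_change(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}

/* Opteron times for GCC 3.4.2 and 3.5.0, copied from the table above. */
const char *const bench_names[] = {
    "alma", "arco", "evo", "fft", "huff", "lin", "mat1", "mole", "tree"
};
const double gcc_342[] = { 42.3, 24.7, 25.0, 27.4, 23.6, 29.8, 28.7, 12.2, 37.7 };
const double gcc_350[] = { 28.1, 24.8, 24.8, 28.1, 22.4, 29.3, 29.7, 28.8, 30.4 };

/* Print one line per benchmark; returns the number of lines printed. */
int print_changes(void)
{
    int n = (int)(sizeof bench_names / sizeof bench_names[0]);
    for (int i = 0; i < n; ++i)
        printf("%-5s %+7.1f%%\n", bench_names[i],
               percent_change(gcc_342[i], gcc_350[i]));
    return n;
}
```

Calling print_changes() from main() lists each benchmark with its signed percent change. Swapping in the Pentium 4 columns gives the same view for that system.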

Intel C is still amazingly effective. HOWEVER, I do not have a more recent version of Intel C because my current commercial license has expired, and compiler updates won't install any more. In terms of intellectual and practical freedom, GCC wins hands down.

The Usual Explanations and Caveats
==================================

All compilers were built on the host systems, from official, unpatched archives (3.2 and 3.3) or CVS checkouts (3.4 and 3.5), acquired on the morning of 14 August 2004. The compiler configuration command was:

  ../gcc/configure --prefix=/opt/gcc-3.? \
      --enable-languages=c,c++,f77

(For GCC 3.5, f95 takes the place of f77 in the language list.)

The compilers were built with make -j2 bootstrap.

Since we're interested in generated code speed, all compiles were performed with the option set used by typical users:

  -O3 -ffast-math -march=pentium4  (Pentium 4, all GCCs)
  -O3 -ffast-math -march=athlon-mp (Opteron, for GCCs 3.2 and 3.3)
  -O3 -ffast-math -march=opteron   (Opteron, for GCCs 3.4 and 3.5)

On the Pentium 4, I also compiled the code with Intel's ICC compiler, version 8.0.055 (build 20031211Z), using the options:

-xN -O3 -ipo

As my Acovea program has shown, a selection of individual optimization flags often produces code that runs faster than what is generated by the generic (-O?) options. However, most programmers don't have the time or expertise required for finding optimal optimizations (!) -- and as such, they tend to use the most "powerful" composite options (e.g., -O3).

Some folk may object to my use of -ffast-math -- however, in numerous accuracy tests, -ffast-math produces code that is both faster *and* more accurate than code generated without it. Yes, -ffast-math has other aspects that make for interesting debate; however, such discussions belong in another article.
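
One of the aspects usually raised in that debate is that -ffast-math lets the compiler reorder floating-point arithmetic, and floating-point addition is not associative. The sketch below is not taken from the benchmark suite; the function names and the three values are chosen purely for illustration. It shows two groupings of the same sum giving different answers under ordinary IEEE 754 double arithmetic, compiled without -ffast-math.

```c
#include <stdio.h>

/* Sum three doubles left to right: (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Sum the same three doubles with the other grouping: a + (b + c). */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* Print both groupings for a case where the small term is lost in one
   of them; returns 1 if the two results differ, 0 otherwise. */
int show_grouping_difference(void)
{
    double a = 1e20, b = -1e20, c = 1.0;
    double left = sum_left(a, b, c);    /* a and b cancel first, c survives */
    double right = sum_right(a, b, c);  /* c is absorbed into b, then cancelled */
    printf("(a + b) + c = %g\n", left);
    printf("a + (b + c) = %g\n", right);
    return left != right;
}
```

Whether a given reordering helps or hurts accuracy depends on the data and the expression, which is part of why the topic is debated.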

This article is *NOT* a comparison of the Pentium 4 and Opteron processors; my two test systems are far too different for any such comparison to have meaning. Please do not ask me to test on systems I don't own, unless you're willing to send me hardware. Assuming I find some paying work this month, I'll be making some system upgrades in the near future; for now, what I've got is what I've got.

About the Benchmarks

alma -- Calculates the daily ephemeris (at noon) for the years 2000-2099; tests array handling, floating-point math, and mathematical functions such as sin() and cos().

evo -- A simple genetic algorithm that maximizes a two-dimensional function; tests 64-bit math, loop generation, and floating-point math.

fft -- Uses a Fast Fourier Transform to multiply two very (very) large polynomials; tests the C99 _Complex type and basic floating-point math.

huff -- Compresses a large block of data using the Huffman algorithm; tests string manipulation, bit twiddling, and the use of large memory blocks.

lin -- Solves a large system of linear equations via LUP decomposition; tests basic floating-point math, two-dimensional array performance, and loop optimization.

mat1 -- Multiplies two very large matrices using the brute-force algorithm; tests loop optimization.

mole -- A molecular dynamics simulation, with performance predicated on matrix operations, loop efficiency, and sin() and cos(). I recently added this test, which exhibits very different characteristics from alma (even if they appear similar).

tree -- Creates and modifies a large B-tree in memory; tests integer looping and dynamic memory management.
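
For readers unfamiliar with the "brute-force algorithm" mentioned for mat1, the kernel is the textbook triple loop. The code below is a minimal sketch and not the benchmark's actual source: the function name, the row-major layout, and the use of double elements are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two n-by-n matrices stored in row-major order: c = a * b.
   Textbook O(n^3) triple loop; c must not overlap a or b. */
void matmul_brute_force(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            /* Dot product of row i of a with column j of b. */
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The inner loop walks b with a stride of n elements, so a kernel of this shape mostly exercises how well a compiler handles loop nests.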

My benchmark suite is still in development, and isn't packaged as nicely as I'd like for general distribution. If you'd like the benchmark source code, or have any questions about these tests, please e-mail me.

Thank you!

Scott Robert Ladd
Coyote Gulch Productions (
Software Invention for High-Performance Computing
