C Optimization Tests, 1 May 2004, tree-ssa/3.5/3.4/icc

Sat May 1 22:40:00 GMT 2004

Hello,

I've compared the speed of code generated by recent versions of GCC, 
using a beta version of a C benchmark suite that I'm developing. The 
benchmark suite currently comprises eight separate tests, each of 
which times its inner loop and reports the result to a driver program.
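
To illustrate what "times its inner loop" means, here is a minimal sketch. It is not code from the suite; sample_inner_loop, time_loop, and report are hypothetical names, and clock() is assumed to be an acceptable timer (it measures processor time, not wall-clock time).

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a benchmark's inner loop: sums 1..n. */
double sample_inner_loop(long n)
{
    double sum = 0.0;
    for (long i = 1; i <= n; ++i)
        sum += (double)i;
    return sum;
}

/* Runs fn(n) once and returns the processor time it took, in seconds.
   The loop's result is stored in *result so the work stays observable. */
double time_loop(double (*fn)(long), long n, double *result)
{
    clock_t start = clock();
    *result = fn(n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Prints "name seconds" on stdout, a format a driver program could parse. */
void report(const char *name, double seconds)
{
    printf("%s %.3f\n", name, seconds);
}
```

A test program would call time_loop() around its real kernel and pass the returned value to report(); only the kernel, not setup or output, falls inside the timed region.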

In the following tables, times are in seconds, as measured for the inner 
loop being tested. An asterisk marks the fastest time for each benchmark 
and for the overall total. The benchmarks are described at the end of 
this message.

AMD64/Opteron 240 (1.4GHz)
Gentoo AMD64 64-bit Linux 2.6.5
GCC options: -O3 -ffast-math -march=opteron

test  tree-ssa  mainline    3.4.1
----  --------  --------  --------
alma    23.2*     53.5      70.1
  evo    32.9      31.1*     32.0
  fft    31.7      30.8      30.2*
huff    30.2      24.5      23.8*
  lin    29.5*     29.6      30.2
mat1    30.0*     30.5      30.2
mole    30.0      26.3      12.2*
tree    30.0*     37.4      38.0
       --------  --------  --------
total  237.7*    263.7     266.7

I am only testing 64-bit code generation on the AMD64.

Intel ia32/Pentium 4 Northwood 2.8GHz
Debian 32-bit Linux 2.6.5
GCC options: -O3 -ffast-math -march=pentium4
ICC options: -O3 -xN -ipo

test  tree-ssa  mainline    3.4.1    ICC 8.0
----  --------  --------  --------  --------
alma    64.6      64.7      66.0      22.3*
  evo    53.5      52.8      52.5      37.7*
  fft    28.2      27.2*     27.2*     30.6
huff    21.8      16.1*     18.2      16.4
  lin    19.5      19.2*     19.2*     19.4
mat1     7.6       7.6       7.6       7.5*
mole    31.4      32.8      32.8       5.0*
tree    23.9*     24.7      24.7      27.2
       --------  --------  --------  --------
total  250.6     244.8     246.8     166.1*

I include the Intel C compiler (ICC) for comparison purposes, and 
because so many people ask about it.

ANALYSIS

While tree-ssa is the clear overall winner on the Opteron, it shows (as 
does mainline) a big regression from 3.4.1 on the huff and mole 
benchmarks. On the Pentium 4, mainline is the winner on huff, while 
tree-ssa still shows the regression; on the other hand, tree-ssa is the 
fastest of the GCC versions for mole on the Pentium 4.

On the Opteron, tree-ssa is strong on alma, similar to Intel's dominance 
on alma for the Pentium 4.

Intel's compiler turns in some incredibly fast times on alma and mole -- 
and the times appear legitimate. For example, I enabled verification 
output for mole, and Intel is producing correct output six times faster 
than any of the GCC versions.

I'll be adding more verification routines to future versions of the 
benchmarks, to catch compilers that might be optimizing away parts of 
the programs. At this point, however, I haven't seen that happen.
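
One common way to build such a check, sketched here under my own assumptions rather than taken from the suite, is to reduce a test's output to a checksum and compare it with a known-good value. Because the checksum depends on every element, the compiler cannot discard the computation that produced them. The names fill_squares, checksum, and verify are hypothetical.

```c
#include <stddef.h>

/* Hypothetical kernel standing in for a benchmark: out[i] = i squared. */
void fill_squares(double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (double)i * (double)i;
}

/* Sums every element so the whole result feeds one observable value. */
double checksum(const double *data, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Returns 1 if the checksum is within rel_tol (relative) of expected,
   0 otherwise. A tolerance is used because options such as -ffast-math
   may change floating-point rounding. */
int verify(const double *data, size_t n, double expected, double rel_tol)
{
    double diff = checksum(data, n) - expected;
    double mag = expected < 0.0 ? -expected : expected;
    if (diff < 0.0)
        diff = -diff;
    return diff <= rel_tol * mag;
}
```

The expected value would come from a trusted reference run, for example an unoptimized build of the same test.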

USUAL DISCLAIMER AND EXPLANATION STUFF:

The benchmark suite is *not* complete; I will be adding at least two 
more tests, along with better automated reporting facilities. If you 
would like a copy of the benchmark suite, please request it from me by 
e-mail, as it's not ready for general distribution.

I am *not* testing compilation speed.

Please do not ask about other architectures; I don't have them, so I 
can't test them. Well, I do have SPARC, but it's old, slow, and no one 
asks me about SPARC anyway.

Most users will compile with the highest optimization level possible 
under the assumption that it will produce the fastest code. Additional 
options (e.g. -funroll-loops) may improve generated code speed; in fact, 
it is almost always possible to find a "-O1 and other options" set that 
produces faster code than -O3 (see my Acovea articles). HOWEVER, in this 
comparison, I'm looking at how general users are going to use the tool 
at hand.

All GNU compilers were taken from anonymous CVS on 2004-05-01, and built 
using:

     --enable-shared
     --enable-threads=posix
     --enable-__cxa_atexit
     --disable-checking
     --disable-multilib (Opteron only)

BENCHMARKS

A short description of the benchmarks:

alma -- Calculates the daily ephemeris (at noon) for the years 
2000-2099; tests array handling, floating-point math, and mathematical 
functions such as sin() and cos().

evo -- A simple genetic algorithm that maximizes a two-dimensional 
function; tests 64-bit math, loop generation, and floating-point math.

fft -- Uses a Fast Fourier Transform to multiply two very (very) large 
polynomials; tests the C99 _Complex type and basic floating-point math.

huff -- Compresses a large block of data using the Huffman algorithm; 
tests string manipulation, bit twiddling, and the use of large memory 
blocks.

lin -- Solves a large system of linear equations via LUP decomposition; 
tests basic floating-point math, two-dimensional array performance, and 
loop optimization.

mat1 -- Multiplies two very large matrices using the brute-force 
algorithm; tests loop optimization.
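
For readers unfamiliar with the term, the brute-force algorithm is the textbook triple loop. The sketch below is illustrative only and is not the code used in mat1; the function name matmul_naive and the flat row-major layout are my assumptions.

```c
#include <stddef.h>

/* Computes c = a * b for n-by-n matrices stored row-major in flat
   arrays, using the O(n^3) triple loop with no blocking or tiling. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            /* Dot product of row i of a with column j of b. */
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The innermost loop walks b with a stride of n elements, so a loop nest of this shape gives the optimizer room to help through interchange, unrolling, or vectorization.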

mole -- A molecular dynamics simulation, with performance predicated on 
matrix operations, loop efficiency, and sin() and cos(). I recently 
added this test, which exhibits very different characteristics from alma 
(even if they appear similar).

tree -- Creates and modifies a large B-tree in memory; tests integer 
looping and dynamic memory management.

That's all for now, folks.

..Scott

-- 
Scott Robert Ladd
Coyote Gulch Productions (http://www.coyotegulch.com)
Software Invention for High-Performance Computing