This is the mail archive of the mailing list for the GCC project.


C Optimization, Opteron, 4 May 2004, tree-ssa/3.5/3.4


New this time around:
    Use of -D__NO_MATH_INLINES and -D__NO_STRING_INLINES switches
    New arithmetic coding benchmark
    Separation of Pentium 4 and Opteron results
    Durations of some tests have been adjusted

I've compared the speed of code generated by recent versions of GCC,
using a beta version of a C benchmark suite that I'm developing. The
benchmark suite currently comprises nine separate tests, each of
which times its inner loop and reports the result to a driver program.

In the following tables, times are in seconds, as computed for the inner
loop being tested. The benchmarks are described at the end of this message.
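
For readers unfamiliar with this style of measurement, here is a minimal sketch of timing only an inner loop and handing the result to a driver. It is not code from the suite: the names time_kernel, sum_kernel and report_time are hypothetical, the workload is a stand-in, and the use of POSIX clock_gettime() is an assumption, since the message does not say which timer the tests use.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() under strict -std= modes */
#include <stdio.h>
#include <time.h>

/* Hypothetical signature for the inner loop of one benchmark. */
typedef void (*kernel_fn)(void *ctx);

/* Returns the elapsed wall-clock time, in seconds, of one call to kernel(ctx).
   Setup and teardown are expected to happen outside this function so that
   only the inner loop is measured. */
static double time_kernel(kernel_fn kernel, void *ctx)
{
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    kernel(ctx);
    clock_gettime(CLOCK_MONOTONIC, &stop);

    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}

/* Stand-in workload (an assumption, not one of the real tests):
   sums the integers below *limit. The volatile accumulator keeps the
   compiler from deleting the loop. */
static void sum_kernel(void *ctx)
{
    const unsigned long *limit = ctx;
    volatile unsigned long total = 0;

    for (unsigned long i = 0; i < *limit; ++i)
        total += i;
}

/* Prints "name seconds" on stdout, where a driver program could read it
   through a pipe. */
static void report_time(const char *name, double seconds)
{
    printf("%s %.6f\n", name, seconds);
}
```

A test program would call time_kernel(sum_kernel, &limit) once its data is prepared and pass the returned value to report_time().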

AMD64/Opteron 240 (1.4GHz)
Gentoo AMD64 64-bit Linux 2.6.5
GCC options: -O3 -ffast-math -march=opteron

test    3.4.1   mainline  tree-ssa
----  --------  --------  --------
alma    70.3      53.5      23.3
arco    24.8      24.4      26.1
 evo    31.9      31.3      32.9
 fft    29.9      31.2      31.6
huff    23.6      24.7      30.2
 lin    29.9      29.5      29.6
mat1    30.5      30.4      29.9
mole    29.5      63.2      71.5
tree    38.6      37.7      30.0
      --------  --------  --------
total  309.2     325.9     305.1

Because people have asked: When compiled with -O2, the total run times are { 312.7, 326.2, 313.0 } -- in other words, -O3 offers little or no advantage over -O2.

As for the oft-requested -Os, the corresponding overall run times were { 495.9, 439.5, 403.3 }. In no case did -Os produce faster code than did -O2 or -O3.

Using -O1 and options evolved by Acovea, I was able to reduce run times substantially; however, that is a topic that needs to be treated in depth elsewhere.

I am only testing 64-bit code generation on the AMD64.


While these benchmarks may appear similar, their characteristics vary widely based on the compiler in use.

While tree-ssa is clearly superior on "alma", it shows a severe pessimization when it comes to "mole". Tree-ssa also shows slight pessimizations on the new "arco" benchmark, "evo" and "huff". This suggests that something is amiss in tree-ssa's compilation of bit manipulation.

Much as I would like to say otherwise, tree-ssa's only real win is on the infamous "alma" benchmark.
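
The per-test comparison can be made explicit by dividing each tree-ssa time by the corresponding 3.4.1 time. The sketch below does that with the -O3 numbers from the table above. The struct and function names are made up for illustration, and the choice of 3.4.1 as the baseline is an assumption made for this example.

```c
#include <stddef.h>

/* One row of the -O3 table: times in seconds. */
struct result {
    const char *test;
    double gcc341;
    double mainline;
    double treessa;
};

/* Values copied from the Opteron table in this message. */
static const struct result results[] = {
    { "alma", 70.3, 53.5, 23.3 },
    { "arco", 24.8, 24.4, 26.1 },
    { "evo",  31.9, 31.3, 32.9 },
    { "fft",  29.9, 31.2, 31.6 },
    { "huff", 23.6, 24.7, 30.2 },
    { "lin",  29.9, 29.5, 29.6 },
    { "mat1", 30.5, 30.4, 29.9 },
    { "mole", 29.5, 63.2, 71.5 },
    { "tree", 38.6, 37.7, 30.0 },
};

static const size_t result_count = sizeof results / sizeof results[0];

/* tree-ssa time relative to 3.4.1 for row i:
   below 1.0 means tree-ssa is faster, above 1.0 means it is slower. */
static double treessa_ratio(size_t i)
{
    return results[i].treessa / results[i].gcc341;
}

/* Counts the tests whose tree-ssa/3.4.1 ratio exceeds the threshold,
   e.g. 1.02 for "more than 2% slower". */
static size_t count_slower(double threshold)
{
    size_t n = 0;

    for (size_t i = 0; i < result_count; ++i)
        if (treessa_ratio(i) > threshold)
            ++n;
    return n;
}
```

A ratio is easier to compare across tests than a raw difference in seconds, because the individual tests do not all run for the same length of time.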


The benchmark suite is *not* complete; I will be adding at least one
more test, along with better automated reporting facilities. If you
would like a copy of the benchmark suite, please request it from me by
e-mail, as it's not ready for general distribution.

I am *not* testing compilation speed.

Most users will compile with the highest optimization level possible
under the assumption that doing so will produce the fastest code. Additional options (e.g. -funroll-loops) may improve generated code speed; in fact, it is almost always possible to find a "-O1 and other options" set that produces faster code than -O3 (see my Acovea articles). HOWEVER, in this comparison, I'm looking at how general users are going to use the tool at hand.

All GNU compilers were taken from anonymous CVS on 2004-05-03, and built with:

    --disable-multilib (Opteron only)


A short description of the benchmarks:

alma -- Calculates the daily ephemeris (at noon) for the years
2000-2099; tests array handling, floating-point math, and mathematical
functions such as sin() and cos().

arco -- Implements simple arithmetic encoding and decoding, testing bit manipulation, loop optimization, and integer math.

evo -- A simple genetic algorithm that maximizes a two-dimensional
function; tests 64-bit math, loop generation, and floating-point math.

fft -- Uses a Fast Fourier Transform to multiply two very (very) large
polynomials; tests the C99 _Complex type and basic floating-point math.

huff -- Compresses a large block of data using the Huffman algorithm;
tests string manipulation, bit twiddling, and the use of large memory

lin -- Solves a large linear equation via LUP decomposition; tests basic
floating-point math, two-dimensional array performance, and loop

mat1 -- Multiplies two very large matrices using the brute-force
algorithm; tests loop optimization.
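
As an illustration of what "brute-force" means here, the following is a minimal sketch of the textbook triple-loop multiplication for square, row-major matrices. It is not the mat1 source; the function name matmul_naive and the flat-array layout are assumptions.

```c
#include <stddef.h>

/* Computes c = a * b for n-by-n matrices stored in row-major order,
   so element (i, j) lives at index i * n + j. The output array c must
   not overlap a or b. */
static void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;

            /* Dot product of row i of a with column j of b. */
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The innermost loop walks b with a stride of n elements, which is one reason a kernel like this is sensitive to how well the compiler handles loops.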

mole -- A molecular dynamics simulation, with performance predicated on
matrix operations, loop efficiency, and sin() and cos(). I recently
added this test, which exhibits very different characteristics from alma
(even if they appear similar).

tree -- Creates and modifies a large B-tree in memory; tests integer
looping and dynamic memory management.

That's all for now, folks.


Scott Robert Ladd
Coyote Gulch Productions (
Software Invention for High-Performance Computing
