This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.

Re: spec2k comparison of gcc 4.1 and 4.2 on AMD K8

From: "Vladimir N. Makarov" <vmakarov at redhat dot com>
To: Serge Belyshev <belyshev at depni dot sinp dot msu dot ru>
Cc: gcc at gcc dot gnu dot org, Mark Mitchell <mark at codesourcery dot com>
Date: Sun, 25 Feb 2007 16:20:44 -0500
Subject: Re: spec2k comparison of gcc 4.1 and 4.2 on AMD K8
References: <87wt27tr0h.fsf@depni.sinp.msu.ru>

Serge Belyshev wrote:

I have compared the 4.1.2 release (r121943) with three revisions of 4.2 on spec2k
on a 2GHz AMD Athlon64 box (in 64-bit mode); detailed results are below.

In short, current 4.2 performs just as well as 4.1 on this target,
with the exception of a huge 80% win on 178.galgel. All other differences
lie almost in the noise.

results:

the first number in each column is the runtime difference in %
between the corresponding 4.2 revision and 4.1.2 (+ is better, - is worse).

second number is a +- confidence interval, i.e. according to my results,
current 4.2 does (82.0+-1.7)% better than 4.1.2 on 178.galgel.

(note some results are clearly noisy, but I've tried hard to avoid this --
I did three runs on a completely idle machine, wasting 14 hours of machine time in total).

I run SPEC2000 several times per week and always look at 3 runs (to be sure that nothing went wrong), but I have never seen such big "confidence" intervals (as I understand it, that is the difference between the max and min of 3 runs divided by the score). I should acknowledge, though, that I have never run SPEC2000 on AMD machines, and some processors generate smaller "confidence intervals" than others. There are tests, like art, for which the difference between min and max can be big, but the geometric mean makes the effect of such differences smaller in the overall score. If the machine has only 512 MB of memory (even though they write that this is enough for SPEC2000), the scores for some benchmark programs may be unstable. Also, if the middle score (of 3 runs) for base or peak is bigger on one program even though the best (max) scores for peak and base are the same, usually the opposite happens on another benchmark program, so this also makes the overall score smoother.
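To make the two ideas above concrete, here is a minimal Python sketch. It assumes the spread is computed as (max - min) of three runs divided by the middle score, which is only my reading of the interval and not necessarily how the numbers below were produced. The scores are made-up values for illustration, and the names `relative_spread` and `overall_shift` are hypothetical.

```python
from statistics import geometric_mean, median


def relative_spread(scores):
    """Spread of repeated runs in percent: (max - min) / median score."""
    return 100.0 * (max(scores) - min(scores)) / median(scores)


def overall_shift(per_benchmark_ratios):
    """Geometric mean of per-benchmark ratios (1.0 means no change)."""
    return geometric_mean(per_benchmark_ratios)


# Made-up scores for three runs of two imaginary benchmarks.
steady = [1000.0, 1002.0, 998.0]
noisy = [1000.0, 1100.0, 950.0]

# One benchmark out of 12 (the size of CINT2000) is off by 10%,
# the other 11 are unchanged.
ratios = [1.10] + [1.0] * 11

if __name__ == "__main__":
    print(f"steady spread: {relative_spread(steady):.1f}%")
    print(f"noisy spread:  {relative_spread(noisy):.1f}%")
    print(f"overall shift: {100.0 * (overall_shift(ratios) - 1.0):.2f}%")
```

A 10% swing on one of 12 programs moves the geometric mean by a factor of 1.10 ** (1/12), i.e. by less than 1%, which is why a noisy individual test disturbs the overall score much less than its own spread suggests.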

So I trust the overall SPEC2000 score, and by my evaluation the measurement error of the overall score on Core2 Duo (which I usually use for SPEC2000) is +-0.3%. It would be better if you posted the overall scores (but probably something went wrong on your machine during the run).

Still, I must say you did a really big job; thank you.

r117890 -- 4.2 just before DannyB's aliasing fixes
r117891 -- 4.2 with aliasing fixes.
r122236 -- 4.2 current.

CINT2000        r117890         r117891         r122236

164.gzip        -4.2 1.7        -4.2 1.2        -4.0 1.3
175.vpr          1.7 2.6         1.4 2.3         1.1 2.5
176.gcc         -0.5 0.8        -0.8 1.1        -1.2 4.0
181.mcf         -0.4 2.0        -0.1 2.1        -0.6 2.7
186.crafty      -0.4 6.4        -1.3 7.0         0.8 4.4
197.parser       0.7 1.3         0.8 1.5        -0.3 1.6
252.eon          8.8 3.7        10.6 9.4         6.9 4.7
253.perlbmk      2.7 1.0         3.4 1.4         3.0 1.9
254.gap         -0.6 0.5        -0.5 0.4        -0.4 0.6
255.vortex       1.3 0.9         1.2 1.2         1.4 1.1
256.bzip2        0.6 1.6         0.9 1.6         0.4 1.7
300.twolf        0.1 4.5         0.8 1.4        -0.6 2.0

CFP2000         r117890         r117891         r122236

168.wupwise      0.2 22.0        0.1 22.1        2.2 13.6
171.swim        -0.1 0.7        -0.3 0.1        -0.3 0.2
172.mgrid       -6.3 0.4        -6.1 0.4        -6.6 0.3
173.applu       -0.1 0.8         0.1 0.9        -0.4 0.1
177.mesa         6.9 15.1        7.2 15.1        3.9 5.3
178.galgel      80.8 1.7        80.9 2.0        82.0 1.7
179.art          0.8 8.9        -1.6 8.1        -0.3 5.1
183.equake      -0.9 1.0        -0.8 0.9        -0.9 0.9
187.facerec      2.7 0.7         2.9 0.8         3.0 0.6
188.ammp        -0.4 0.5        -0.1 1.0        -0.5 0.7
189.lucas       -0.8 0.5        -0.7 0.6        -0.4 0.6
191.fma3d        1.1 2.1        -0.9 2.3        -1.0 2.2
200.sixtrack    -0.7 0.4        -0.7 0.5        -1.3 0.4
301.apsi        -3.0 1.4        -2.7 1.1        -3.1 0.3

remarks:

1. big jump on 178.galgel can be seen here too:
  http://www.suse.de/~aj/SPEC/amd64/CFP/sandbox-britten/178_galgel_big.png

2. even though I did three runs, most of the difference is noise,
  which means that one should treat single-run spec results with a *big* grain of salt.

3. on this AMD K8 machine the difference between 4.2 with aliasing fixes and 4.2 w/o aliasing fixes lies completely in the noise (modulo a small 2% 191.fma3d regression).
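
Remark 2 can be checked mechanically against the r122236 column of the tables above. This is a minimal Python sketch, assuming a difference counts as outside the noise only when its absolute value is strictly greater than the reported +- interval. That threshold rule and the helper name `outside_noise` are illustrative assumptions, not part of the original analysis; the numbers are copied from the tables.

```python
# (difference in %, +- interval) for r122236 vs 4.1.2, copied from the tables.
R122236 = {
    "164.gzip": (-4.0, 1.3), "175.vpr": (1.1, 2.5), "176.gcc": (-1.2, 4.0),
    "181.mcf": (-0.6, 2.7), "186.crafty": (0.8, 4.4), "197.parser": (-0.3, 1.6),
    "252.eon": (6.9, 4.7), "253.perlbmk": (3.0, 1.9), "254.gap": (-0.4, 0.6),
    "255.vortex": (1.4, 1.1), "256.bzip2": (0.4, 1.7), "300.twolf": (-0.6, 2.0),
    "168.wupwise": (2.2, 13.6), "171.swim": (-0.3, 0.2), "172.mgrid": (-6.6, 0.3),
    "173.applu": (-0.4, 0.1), "177.mesa": (3.9, 5.3), "178.galgel": (82.0, 1.7),
    "179.art": (-0.3, 5.1), "183.equake": (-0.9, 0.9), "187.facerec": (3.0, 0.6),
    "188.ammp": (-0.5, 0.7), "189.lucas": (-0.4, 0.6), "191.fma3d": (-1.0, 2.2),
    "200.sixtrack": (-1.3, 0.4), "301.apsi": (-3.1, 0.3),
}


def outside_noise(results):
    """Return benchmarks whose |difference| exceeds the reported interval."""
    return [name for name, (diff, interval) in results.items()
            if abs(diff) > interval]


if __name__ == "__main__":
    for name in outside_noise(R122236):
        diff, interval = R122236[name]
        print(f"{name:14s} {diff:+6.1f} +- {interval}")
```

Under this rule, 11 of the 26 benchmarks in the r122236 column fall outside their interval; the remaining 15 cannot be distinguished from noise.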

Follow-Ups:
- Re: spec2k comparison of gcc 4.1 and 4.2 on AMD K8
  - From: Serge Belyshev

References:
- spec2k comparison of gcc 4.1 and 4.2 on AMD K8
  - From: Serge Belyshev
