This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Benchmarking theory
- To: dek_ml at konerding dot com, toon at moene dot indiv dot nluug dot nl
- Subject: Re: Benchmarking theory
- From: dewar at gnat dot com
- Date: Sat, 26 May 2001 23:11:37 -0400 (EDT)
- Cc: gcc at gcc dot gnu dot org, jsm28 at cam dot ac dot uk
<<I think Joe's point is that people aren't doing real statistics on the results.
For example, with just 2 data points
(time on one compiler and the other) you don't have what is known as "power" to
distinguish whether the difference is actually significant or not. I certainly
wouldn't trust the hardware vendors to do real statistics either -- they'd just
run it 100 times, and give you back their best result, even if they only got
that once. I'd like, at the very least, to see the median, mode, and standard
deviation of scores. Is your distribution of scores normal?>>
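The summary statistics the quoted poster asks for can be sketched in a few lines. This is my illustration, not something from the mail; the timing values are invented, and only the Python standard library is used.

```python
import statistics

# Hypothetical wall-clock times (seconds) for repeated runs of one benchmark.
runs = [12.1, 11.9, 12.3, 12.0, 14.8, 12.2, 11.8, 12.1]

mean = statistics.mean(runs)
median = statistics.median(runs)
mode = statistics.mode(runs)      # most common value
stdev = statistics.stdev(runs)    # sample standard deviation

print(f"mean={mean:.2f}s median={median:.2f}s "
      f"mode={mode:.2f}s stdev={stdev:.2f}s")
```

Note how the single slow outlier (14.8s) pulls the mean above the median; comparing the two is a quick check on whether the distribution is roughly symmetric.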
First of all, I think the casual indictment of hardware vendors here is
uncalled for. I doubt that the obviously invalid practice you cite is
in fact common. What is more normal is to run a few times, throw out
outliers, and average the remaining runs.
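The "throw out outliers, average the rest" procedure described above can be sketched as a trimmed mean: drop the fastest and slowest runs, then average what remains. This is my own illustration of the idea, not code from the mail, and the timings are invented.

```python
def trimmed_mean(times, trim=1):
    """Drop the `trim` fastest and `trim` slowest runs, average the rest."""
    s = sorted(times)
    kept = s[trim:len(s) - trim]
    return sum(kept) / len(kept)

# Hypothetical timings; the 14.8s outlier is discarded along with the best run.
runs = [12.1, 11.9, 12.3, 12.0, 14.8, 12.2, 11.8, 12.1]
print(trimmed_mean(runs))
```

With `trim=1` the outlier no longer distorts the average, which is the practical point being made: a handful of runs plus simple trimming is usually robust enough for compiler comparisons.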
One thing is that it is not worth getting TOO fanatic statistically, because
the benchmarks themselves only have approximate validity; we benchmark
a given application X on two compilers to get an idea of the quality of
the two compilers, not on X specifically, but in general. So analyzing the
specific performance on X ferociously may not be that useful.
I do think it would be useful if people would confirm, when they publish
figures, that they have performed the basic step of running several times
and averaging the results; in practice that's probably good enough.