This is the mail archive of the mailing list for the GCC project.
Re: Benchmarking theory
Toon Moene wrote:
> "Joseph S. Myers" wrote:
> > Benchmark results seem to get posted to the gcc list as single figures for
> > a test and old and new compilers, with assertions that results seem
> > significant or are consistent between runs. Why are benchmarks done on
> > this basis rather than using actual statistical significance tests?
> Perhaps because we haven't included specific benchmarking tests into our
> release criteria ?
> > Could someone point me to appropriate references on the theory of
> > benchmarking that explain this?
> Tsk. My theory of benchmarking is:
> 1. Take your own application.
> 2. Construct a sample self-contained application out of it.
> 3. Ship it to prospective hardware sellers.
> 4. Rank results.
> 5. Buy.
> OK - simplistic, but it works.
I think Joe's point is that people aren't doing real statistics on the results.
For example, with just two data points
(the time under one compiler and the time under the other) you don't have what is
known as statistical "power" to distinguish whether the difference is real or
just noise. I certainly
wouldn't trust the hardware vendors to do real statistics either-- they'd just
run it 100 times and give you back their best result, even if they only got
that once. I'd like, at the very least, to see the median, mode, and standard
deviation of scores. Is your distribution of scores normal?