This is the mail archive of the mailing list for the GCC project.

Re: Benchmarking theory

<<I think Joe's point is that people aren't doing real statistics on the results.
For example, with just 2 data points
(time on one compiler and the other) you don't have what is known as "power" to
distinguish whether the difference is actually significant or not.  I certainly
wouldn't trust the hardware vendors to do real statistics either-- they'd just
run it 100 times and give you back their best result, even if they only got
that once.  I'd like, at the very least, to see the median, mode, and standard
deviation of scores.  Is your distribution of scores normal?>>

First of all, I think the casual indictment of hardware vendors here is
uncalled for. I doubt that the obviously invalid practice you cite is
in fact common. What is more normal is to run a few times, throw out
outliers, and average the remaining runs.
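
As a rough illustration of that procedure, here is a minimal Python sketch.
It assumes "throwing out outliers" means discarding a fixed number of the
fastest and slowest runs, which is only one of several reasonable rules. The
function name trimmed_average and the sample timings are hypothetical and
are not taken from any real benchmark.

```python
import statistics


def trimmed_average(times, drop=1):
    """Average run times after discarding the `drop` fastest and `drop` slowest."""
    if len(times) <= 2 * drop:
        raise ValueError("need more runs than the number discarded")
    # Sort so that the extremes sit at both ends, then slice them off.
    kept = sorted(times)[drop:len(times) - drop]
    return statistics.mean(kept)


# Hypothetical wall-clock times in seconds for five runs of one benchmark.
runs = [12.4, 12.1, 12.3, 19.8, 12.2]
print(trimmed_average(runs))
```

With these made-up numbers the 19.8 second run is dropped along with the
fastest run, so one disturbed run no longer dominates the average.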

One thing is that it is not worth getting TOO fanatical statistically, because
the benchmarks themselves have only approximate validity. We benchmark a given
application X on two compilers to get an idea of the quality of the two
compilers in general, not on X specifically. So analyzing the specific
performance on X ferociously may not be that useful.

I do think it would be useful if people would confirm, when they publish
figures, that they have performed the basic step of running several times
and averaging the results. In practice that's probably good enough.
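
A sketch of what such a confirmation could look like, again in Python. The
helper names time_runs and report are hypothetical, and the workload is a
stand-in for whatever is actually being benchmarked. The point is only that
the published figure carries the number of runs and the spread along with
the mean.

```python
import statistics
import time


def time_runs(work, repeats=5):
    """Call `work` `repeats` times and return the elapsed seconds of each call."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        work()
        samples.append(time.perf_counter() - start)
    return samples


def report(samples):
    """Format the run count, mean and sample standard deviation (needs >= 2 runs)."""
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples)
    return f"{len(samples)} runs, mean {mean:.4f}s, stdev {spread:.4f}s"


# Stand-in workload; a real benchmark would invoke the compiled program here.
samples = time_runs(lambda: sum(range(100_000)))
print(report(samples))
```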
