This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
March gcc 3.0 and 3.1 Bootstraps Fail 34% of Time
- To: gcc at gcc dot gnu dot org
- Subject: March gcc 3.0 and 3.1 Bootstraps Fail 34% of Time
- From: Jeffrey Oldham <oldham at codesourcery dot com>
- Date: Fri, 30 Mar 2001 10:32:13 -0800
- cc: oldham at codesourcery dot com
- Reply-to: oldham at codesourcery dot com
INTRODUCTION:
Both CodeSourcery, LLC, and I perform nightly gcc builds, downloading
the most recent gcc 3.0 and 3.1 code about 08:00 GMT daily and
bootstrapping. The two charts below record each failure to bootstrap
and then run the tests; they give no indication of the number of
regression test failures. (It is rare for bootstrapping to succeed but
testing to fail.)
Builds of the prerelease gcc 3.0 are listed before builds of the
development gcc 3.1. Three configurations are listed for gcc 3.0, but
only two are listed for gcc 3.1. A blank entry indicates that
bootstrapping and running the regression tests finished, giving no
indication of how many tests succeeded or failed. At the bottom of
each chart, the approximate percentage of bootstrapping successes is
listed.
GCC 3.0 i686-pc-linux-gnu i386-pc-linux-gnu mips-sgi-irix6.5
Mar01
Mar02
Mar03
Mar04
Mar05
Mar06
Mar07
Mar08
Mar09
Mar10 failure failure
Mar11
Mar12
Mar13
Mar14 failure
Mar15 failure
Mar16
Mar17
Mar18 failure
Mar19 failure
Mar20 failure
Mar21 failure
Mar22 failure failure
Mar23 unknown failure
Mar24 failure
Mar25
Mar26
Mar27 failure
Mar28 failure failure? failure
Mar29 anoncvs.cygnus down anoncvs.cygnus down anoncvs.cygnus down
Mar30 failure failure? failure
-------------------------------------------------------------------
87% success 77% success 63% success
GCC 3.1 i686-pc-linux-gnu mips-sgi-irix6.5
Mar01
Mar02 failure failure
Mar03 failure
Mar04 failure
Mar05 failure
Mar06 failure
Mar07 failure
Mar08 failure
Mar09 failure failure
Mar10 failure failure
Mar11 failure
Mar12 failure
Mar13
Mar14
Mar15
Mar16
Mar17
Mar18 failure failure
Mar19 failure
Mar20 failure
Mar21 failure
Mar22 failure
Mar23 failure
Mar24 failure
Mar25
Mar26
Mar27 failure
Mar28 failure failure
Mar29 anoncvs.cygnus down anoncvs.cygnus down
Mar30 failure failure
-------------------------------------------------------------------
43% success 60% success
INTERPRETATION:
One way to interpret the success percentages is "If I, as a gcc user
or gcc developer, download gcc at some random time in March, what is
the probability that it bootstraps." Although it is possible that the
gcc tree is more likely to be broken (or fixed) about 08:00 GMT, I
believe that implausible.
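As a rough cross-check, the 34% in the subject line is consistent with an
unweighted average of the five success percentages listed under the two
charts. The sketch below assumes that is how the figure was obtained; the
post does not say so explicitly, and the variable names are illustrative.

```python
# Success percentages as listed at the bottom of the two charts.
# Assumption: the subject line's 34% is the unweighted mean failure rate.
success_percent = {
    ("3.0", "i686-pc-linux-gnu"): 87,
    ("3.0", "i386-pc-linux-gnu"): 77,
    ("3.0", "mips-sgi-irix6.5"): 63,
    ("3.1", "i686-pc-linux-gnu"): 43,
    ("3.1", "mips-sgi-irix6.5"): 60,
}

mean_success = sum(success_percent.values()) / len(success_percent)
overall_failure = 100 - mean_success

print(f"mean success {mean_success:.0f}%, mean failure {overall_failure:.0f}%")
```

Read as a probability, this says that a download at a random time in March,
for a randomly chosen one of these five configurations, fails to bootstrap
roughly one time in three.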
Since most gcc developers use i686-pc-linux-gnu, I conjecture its
probability of success represents an upper bound on other platforms'
success. The gcc 3.1 data does not reflect this because the sequence
of i686-pc-linux-gnu failures during the early part of the month
reflects the inclusion of Java in bootstrapping and testing for the
first time. This was not turned on for mips-sgi-irix6.5 for some
period of time. The gcc 3.0 data does reflect this conjecture.
Interestingly, i386-pc-linux-gnu builds fail more frequently than
i686-pc-linux-gnu builds for an unknown reason.
The failures can be grouped into 7 one-day failures and 10 multi-day
failures. Of the 51 days of failures, 43 days were caused by 10
failures that were allowed to persist for more than one day. (This
interpretation assumes that a failure causing a multi-day failure
remains until the failures end. Even if the failure originating a
multi-day failure is fixed but replaced by another failure which
extends the sequence, it can be argued that the subsequent failure
might not have been introduced if the original failure had not masked
it.)
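The grouping into one-day and multi-day failures can be sketched as
follows. This is a minimal illustration, not the procedure actually used
(the data were collected by hand): the function name and the sample column
are hypothetical, and a maximal run of consecutive failing days is treated
as a single failure.

```python
from itertools import groupby

def failure_runs(days):
    """Return the length of each maximal run of consecutive failing days.

    `days` is a sequence of booleans, True meaning the build failed that day.
    """
    return [sum(1 for _ in group) for failed, group in groupby(days) if failed]

# Hypothetical ten-day column, NOT taken from the charts above.
column = [False, True, True, True, False, True, False, False, True, True]

runs = failure_runs(column)
one_day = [n for n in runs if n == 1]
multi_day = [n for n in runs if n > 1]

print(f"{len(one_day)} one-day failures, {len(multi_day)} multi-day failures")
print(f"{sum(multi_day)} of {sum(runs)} failure days came from multi-day failures")
```

Applying the same grouping to each column of the two charts would give the
kind of counts quoted above.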
MY CONCLUSIONS:
Prerelease gcc 3.0 is supposed to be stable with minor changes. Thus,
bootstrapping downloads from random times should succeed with 99%
probability. Changes to this code are supposed to represent small
monotonic improvements that are bootstrapped and tested. Improving
the code depends on successfully bootstrapping and testing so each
failure delays further improvements. An 87% success rate means a 13%
failure rate. One calculation indicates these failures delayed
release by 100%/87% - 100% = 15%, i.e., almost one workweek of delays.
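One way to read this calculation: if each day's work can proceed only when
the tree bootstraps, and a bootstrap succeeds with probability p, then on
average 1/p days are needed per productive day, for an overhead of 1/p - 1.
The sketch below assumes that simple model, with failures independent from
day to day; the function name is illustrative.

```python
def delay_overhead(success_rate):
    """Fractional schedule overhead under the assumed model: 1/p - 1."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate - 1

overhead = delay_overhead(0.87)
print(f"overhead at 87% success: {overhead:.1%}")
```

Under the same assumed model, the gcc 3.1 success rates of 43% and 60%
would correspond to much larger overheads, which can be checked by calling
delay_overhead(0.43) and delay_overhead(0.60).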
The high failure rates of 57% and 40% for gcc 3.1 indicate that either
1) bootstrapping and checking of code changes is not being performed or
2) patches that break code are not being removed quickly enough.
The costs of these failures include
a) introduction of other errors that are masked by the initial failures,
b) slowing of development because no bootstrapping can occur,
c) wasting of time searching for these errors, and
d) alienation of GCC customers by broken code.
Notice of code breakages is being lost among other messages in
gcc-bugs@gcc.gnu.org postings. Also, tracking these breakages is
difficult. Failure to bootstrap or finish testing because of an
unknown cause could be more effectively tracked by a separate WWW
site. The site would contain postings of failures sorted according to
gcc 3.[01] x configurations. It would be important to note when the
failures cease. Using this information, a developer could easily
discern if her failure is the same as that already found. When it is
discovered that a patch causes a failure for some configuration, it
would be easy to point the patch submitter to that configuration's
failure.
The GCC Steering Committee should adopt a desired rate of successful
bootstrapping and testing to facilitate code correctness and
development. If the GCC community agrees with this decision,
processes to ensure the rate is met will evolve and then be adopted by
the Steering Committee as policy.
CAVEATS:
1) Although these tests are automated, humans still interact
with them, occasionally causing problems.
2) I collected the data by hand, further increasing the probability of
errors.
3) These comments reflect my own views, not CodeSourcery's views, and
have not been reviewed by or discussed with anyone else at
CodeSourcery.
SUMMARY:
The GCC community needs to work harder to develop a product that works
first time and every time. We are a long way from achieving even
one 9 of reliability, much less five 9's.
Hoping for improvement,
Jeffrey D. Oldham
oldham@codesourcery.com