GCC 3.2.1 -> GCC 3.3 compile speed regression

Fri Jan 31 07:49:00 GMT 2003

I will try to make this as succinct as possible.  For testing
I used the gcc_3_2_1_release and the gcc-3_3-branch; for the
test case, a preprocessed version of a single source file
found in the open source version of the Qt toolkit.  You will
find the two test cases (created under RedHat Linux 8.0) here:

   http://homepage.mac.com/zlaski/FileSharing1.html

I've instrumented my local copy of 3.3 to produce -fmem-report
stats comparable to 3.2.1 (productizing this for FSF use is another
matter that I will need to coordinate with Geoff).  Also, I've augmented
-fmem-report in both compilers to also tell us how many times 
gcc_collect()
is called, and how many times it actually decides to go through with its
mark-n-sweep thang.

Both compilers are using the default 4MB heap size.  I re-ran both 
command lines several times to warm up the various caches, and then 
reported the last 3 timings for each.  The -fmem-report and -Q output 
has been trimmed to only show the juicy cuts:

/home/zlaski/fsf/obj/gcc/gcc_3_2_1_release/gcc/g++ -B 
/home/zlaski/fsf/obj/gcc/gcc_3_2_1_release/gcc/ -c 
structureparser.gcc_3_2_1_release.ii -fmem-report -Q

{GC 5327k -> 3237k}
{GC 5328k -> 4109k}
{GC 5343k -> 4661k}
{GC 6065k -> 5406k}
{GC 7029k -> 6218k}
{GC 8120k -> 7203k}
{GC 9386k -> 8260k}
{GC 10741k -> 9470k}
{GC 12395k -> 10397k}
{GC 13584k -> 10727k}
{GC 14039k -> 11681k}
{GC 13922k -> 12090k}

Tree                 Number            Bytes    % Total
Total                216513              10M
RTX                  Number            Bytes    % Total
Total                  1165              12k

Collector calls:        7644
Collections performed:  12

Size   Allocated        Used    Overhead
Total         13M         11M        130k

real    0m2.692s
real    0m2.682s
real    0m2.686s

/home/zlaski/fsf/obj/gcc/gcc-3_3-branch/gcc/g++ -B 
/home/zlaski/fsf/obj/gcc/gcc-3_3-branch/gcc/ -c 
structureparser.gcc-3_3-branch.ii -fmem-report -Q

{GC 5326k -> 2058k}
{GC 5325k -> 3023k}
{GC 5411k -> 3871k}
{GC 5420k -> 4278k}
{GC 5564k -> 4414k}
{GC 5740k -> 4772k}
{GC 6209k -> 5170k}
{GC 6745k -> 5728k}
{GC 7447k -> 6243k}
{GC 8119k -> 6722k}
{GC 8740k -> 7364k}
{GC 9584k -> 8121k}
{GC 10596k -> 8662k}
{GC 11279k -> 9516k}
{GC 12400k -> 10220k}
{GC 13352k -> 10597k}
{GC 13844k -> 10762k}
{GC 14069k -> 10903k}
{GC 14183k -> 11670k}
{GC 15455k -> 12306k}
{GC 4194303k -> 12411k}  /* ignore the 4Gb - needed to trigger the last 
collection :-) */

Tree                 Number            Bytes    % Total
Total                227136              11M
RTX                  Number            Bytes    % Total
Total                  1322              12k

Collector calls:        7805
Collections performed:  21

Size   Allocated        Used    Overhead
Total         14M         12M        133k

real    0m3.649s
real    0m3.648s
real    0m3.641s

Immediately, a culprit comes to mind:  The 3.3 compiler performs almost 
TWICE AS MANY COLLECTIONS as the 3.2.1 compiler, with a roughly 
comparable # of gcc_collect() calls.  This can't be cheap.

What we see from the GC numbers is that the 3.3 collections are able to 
reclaim a greater percentage of allocated memory at each turn.  This 
could be a combination of two factors:
   - Data in the 3.3 compiler have better temporal locality (!!)
   - The 3.2.1 collector was buggy/incomplete and could not reliably 
reclaim some stuff.

Anyway, this better reclamation done by the 3.3 ggc also explains why 
the collections occur more frequently there.

So, I throttled the heap size all the way to 128 MB.  For the 3.2.1 
compiler, this involved rebuilding it after changing ggc-page.c; the 
3.3 compiler is more civilized and accepts a --param:

/home/zlaski/fsf/obj/gcc/gcc_3_2_1_release/gcc/g++ -B 
/home/zlaski/fsf/obj/gcc/gcc_3_2_1_release/gcc/ -c 
structureparser.gcc_3_2_1_release.ii -fmem-report

Execution times (seconds)
  garbage collection    :   0.11 ( 4%) usr   0.00 ( 0%) sys   0.11 ( 4%) 
wall
  cfg construction      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  life analysis         :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  preprocessing         :   0.19 ( 8%) usr   0.13 (27%) sys   0.28 ( 9%) 
wall
  lexical analysis      :   0.28 (11%) usr   0.21 (43%) sys   0.44 (14%) 
wall
  parser                :   1.73 (71%) usr   0.15 (31%) sys   1.98 (65%) 
wall
  expand                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 2%) 
wall
  varconst              :   0.03 ( 1%) usr   0.00 ( 0%) sys   0.05 ( 2%) 
wall
  integration           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  flow analysis         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 1%) 
wall
  mode switching        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 1%) 
wall
  local alloc           :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 1%) 
wall
  global alloc          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 1%) 
wall
  shorten branches      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  rest of compilation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 1%) 
wall
  TOTAL                 :   2.45             0.49             3.05

Tree                 Number            Bytes    % Total
Total                216390              10M
RTX                  Number            Bytes    % Total
Total                  1165              12k

Collector calls:        7644
Collections performed:  1

Size   Allocated        Used    Overhead
Total         23M         11M        242k

real    0m1.997s
real    0m1.999s
real    0m1.994s

/home/zlaski/fsf/obj/gcc/gcc-3_3-branch/gcc/g++ -B 
/home/zlaski/fsf/obj/gcc/gcc-3_3-branch/gcc/ -c 
structureparser.gcc-3_3-branch.ii -fmem-report --param 
ggc-min-heapsize=131072

Execution times (seconds)
  garbage collection    :   0.10 ( 3%) usr   0.00 ( 0%) sys   0.09 ( 3%) 
wall
  life info update      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  preprocessing         :   0.27 ( 9%) usr   0.13 (22%) sys   0.27 ( 7%) 
wall
  lexical analysis      :   0.29 (10%) usr   0.15 (26%) sys   0.72 (20%) 
wall
  parser                :   2.13 (72%) usr   0.28 (48%) sys   2.41 (66%) 
wall
  expand                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  varconst              :   0.06 ( 2%) usr   0.00 ( 0%) sys   0.06 ( 2%) 
wall
  jump                  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  global alloc          :   0.02 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  reg stack             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) 
wall
  final                 :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 1%) 
wall
  symout                :   0.00 ( 0%) usr   0.01 ( 2%) sys   0.02 ( 0%) 
wall
  rest of compilation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) 
wall
  TOTAL                 :   2.94             0.58             3.67

Tree                 Number            Bytes    % Total
Total                227014              11M
RTX                  Number            Bytes    % Total
Total                  1322              12k

Collector calls:        7805
Collections performed:  1

Size   Allocated        Used    Overhead
Total         27M         12M        274k

real    0m2.580s
real    0m2.585s
real    0m2.580s

The overall _amount_ of extra memory being allocated clearly does not 
have an adverse effect on the time it takes the garbage collector to 
traverse it, since the 3.3 collector runs at least as fast as the 3.2.1 
collector in spite of about 10% more data under its wing.

However, the access _patterns_ for the allocated memory clearly do have 
an effect on the _number_ of traversals the garbage collector will 
perform on it.  What you see from the numbers is that more heap data 
becomes orphaned sooner in 3.3 than in 3.2.1, and is therefore 
reclaimed.  The less remaining (i.e., marked) data means a lower 
threshold for firing off the next collection, and ergo we see more of 
them.
But, this is not all; even though a lot more stuff is successfully 
reclaimed, the live memory size grows back quickly, only to be 
reclaimed with equal vigor.  What this suggests (correct me if I'm 
going astray) is a "sawtooth" time series of memory allocations.  Each 
tooth tends to group together data with a high degree of temporal 
locality, most of it destroyed at the end of some activity (my guess: 
compiling a body of a function).  Then, as the compilation of the next 
function begins, the compiler again allocates a bunch of data that it 
will use for that function and that function only.

This leads me to the following half-baked hypothesis: Some kinds of 
data/information that used to persist across the entire compilation are 
now created and destroyed repeatedly in some local temporal scope, like 
the compilation of a single function.  This hypothesis would explain 
not only the increased number of gc passes, but also the overall 
performance degradation we see (even with gc turned off).

So, the question is, has anyone touched code in the C++ front-end that 
deals with compilation of functions? Classes?  Perhaps someone has 
tried to reduce the memory footprint of the running compiler by 
destroying all data unless it is absolutely needed?  The law of 
unintended consequences applies in engineering as well as in economic 
policy. :-)

--Zem
--------------------------------------------------------------
Ziemowit Laski                 1 Infinite Loop, MS 301-2K
Mac OS X Compiler Group        Cupertino, CA USA  95014-2083
Apple Computer, Inc.           +1.408.974.6229  Fax .5477