This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Tremendous performance regression in 1.1.2 -> mainline

To: gcc at gcc dot gnu dot org
Subject: Tremendous performance regression in 1.1.2 -> mainline
From: Brad Lucier <lucier at math dot purdue dot edu>
Date: Thu, 6 Apr 2000 11:56:22 -0500 (EST)
Cc: lucier at math dot purdue dot edu, feeley at iro dot umontreal dot ca, hosking at cs dot purdue dot edu

To follow up on my e-mail at

http://gcc.gnu.org/ml/gcc/2000-03/msg00860.html

I decided to compare the performance of egcs-1.1.2 with the mainline
compiler on that code.

With egcs-1.1.2 on alpha-redhat-linux:

popov-75% time /usr/lib/gcc-lib/alpha-redhat-linux/egcs-2.91.66/cc1 -O1 _std.i 
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std ___init_proc ____20___std
time in parse: 1.827072
time in jump: 1.662128
time in cse: 1.168272
time in loop: 0.006832
time in flow: 13.635696
time in combine: 0.992592
time in local-alloc: 0.440176
time in global-alloc: 1.127280
time in shorten-branch: 0.068320
time in final: 0.349408
20.894u 1.003s 0:21.96 99.6%    0+0k 0+0io 326pf+0w

With gcc version 2.96 20000331:

popov-76% time /export/u10/egcs-test/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O1 _std.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std {GC 27190k -> 8575k in 0.085} {GC 12863k -> 8778k in 0.091} {GC 12534k -> 9230k in 0.096} ___init_proc {GC 18065k -> 1705k in 0.019} ____20___std
time in parse: 2.065216 (0%)
time in jump: 1.479616 (0%)
time in cse: 0.730048 (0%)
time in loop: 0.006832 (0%)
time in flow: 1466.980704 (99%)
time in combine: 0.996496 (0%)
time in local-alloc: 0.452864 (0%)
time in global-alloc: 1.095072 (0%)
time in flow2: 5.055680 (0%)
time in shorten-branch: 0.056608 (0%)
time in final: 0.897920 (0%)
time in varconst: 0.007808 (0%)
time in gc: 0.289872 (0%)
1482.946u 0.711s 24:48.62 99.6% 0+0k 0+0io 621pf+0w

So the current version is roughly 70 times slower than egcs-1.1.2
(about 1483s vs. 21s of user time); almost all the time is spent in
compute_flow_dominators.

Lately I've been compiling code like this with 2.95.1 with -O2 since
Marc Feeley was kind enough to change the Gambit-C code generator for
floating-point arithmetic to get around the problem in gcc's register
allocator for IEEE floating-point on the 21264.  (BTW, I've been getting
tremendous code on my 21264; the bottleneck now is memory access time, not
the actual FP operations.)  So I was somewhat worried that there might be
several places in the current gcc with this kind of performance hit.
But with -O2, compile time is not significantly worse than with -O1 (which
is bad enough):

popov-78% time /export/u10/egcs-profile/lib/gcc-lib/alphaev6-unknown-linux-gnu/2.96/cc1 -O2 _std.i
 __copysignf copysignf __copysign copysign __fabsf fabsf __fabs fabs __floorf __floor floorf floor __fdimf fdimf __fdim fdim ___H__20___std {GC 27822k -> 8578k in 0.188} {GC 11787k -> 8943k in 0.208} {GC 14947k -> 9413k in 0.227} {GC 15181k -> 9818k in 0.239} ___init_proc {GC 21431k -> 1711k in 0.035} {GC 5696k -> 1810k in 0.052} ____20___std
time in parse: 4.316848 (0%)
time in integration: 0.000976 (0%)
time in jump: 14.361840 (1%)
time in cse: 3.089040 (0%)
time in gcse: 2.694736 (0%)
time in loop: 0.223504 (0%)
time in cse2: 2.815760 (0%)
time in flow: 1501.634560 (95%)
time in combine: 2.682048 (0%)
time in regmove: 0.750544 (0%)
time in sched: 8.382864 (1%)
time in local-alloc: 2.263344 (0%)
time in global-alloc: 4.171424 (0%)
time in flow2: 7.216544 (0%)
time in peephole2: 0.036112 (0%)
time in sched2: 6.834928 (0%)
time in shorten-branch: 0.120048 (0%)
time in final: 1.255136 (0%)
time in varconst: 0.009760 (0%)
time in gc: 0.949648 (0%)
1574.513u 2.770s 26:21.30 99.7% 0+0k 0+0io 661pf+0w

The profile information for -O2 tells almost nothing new; here are all
the procedures that take longer than parse:

Flat profile:

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 85.21    268.42   268.42   738118     0.36     0.36  sbitmap_intersection_of_preds
  1.86    274.28     5.86    12070     0.49     0.55  compute_block_backward_dependences
  1.08    277.67     3.39   738121     0.00     0.00  sbitmap_a_and_b
  1.03    280.90     3.23       36    89.74    89.74  mark_critical_edges
  0.72    283.18     2.28                             htab_traverse
  0.61    285.11     1.93        3   642.25 91266.58  compute_flow_dominators
  0.55    286.85     1.74  7901172     0.00     0.00  bitmap_operation
  0.47    288.32     1.47       18    81.43    82.04  delete_unreachable_blocks
  0.44    289.72     1.40        3   466.47 91953.93  flow_loops_find
  0.43    291.06     1.34        6   223.63   538.83  calculate_global_regs_live
  0.41    292.34     1.28  1879299     0.00     0.00  rtx_renumbered_equal_p
  0.41    293.62     1.28    24143     0.05     0.05  count_or_remove_death_notes
  0.27    294.47     0.85  6712608     0.00     0.00  make_label_edge
  0.23    295.20     0.73   947453     0.00     0.00  find_cross_jump
  0.19    295.80     0.60   430642     0.00     0.00  constrain_operands
  0.18    296.37     0.57  6734272     0.00     0.00  make_edge
  0.17    296.91     0.55     5850     0.09     0.09  clear_table
  0.14    297.36     0.45  8637288     0.00     0.00  find_reg_note
  0.11    297.69     0.33  1425216     0.00     0.00  ggc_alloc_obj
  0.10    298.02     0.33        1   328.12   328.13  flow_depth_first_order_compute
  0.10    298.33     0.31       18    17.36   103.93  make_edges
  0.10    298.64     0.31      724     0.43     0.43  flow_loop_exits_find
  0.10    298.94     0.30        1   301.76 311841.42  yyparse
...
-----------------------------------------------
                1.93  271.87       3/3           flow_loops_find [7]
[8]     86.9    1.93  271.87       3         compute_flow_dominators [8]
              268.42    0.04  738118/738118      sbitmap_intersection_of_preds [9]
                3.39    0.00  738121/738121      sbitmap_a_and_b [15]
                0.02    0.00       3/14          sbitmap_vector_alloc [206]
                0.00    0.00       3/3           sbitmap_vector_ones [732]
                0.00    0.00       3/11          sbitmap_vector_zero [761]
                0.00    0.00       3/72575       sbitmap_zero [700]
-----------------------------------------------
              268.42    0.04  738118/738118      compute_flow_dominators [8]
[9]     85.2  268.42    0.04  738118         sbitmap_intersection_of_preds [9]
                0.04    0.00  738118/738118      sbitmap_copy [274]
-----------------------------------------------
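
For readers who want to see why this pattern is expensive, here is a
minimal sketch of an iterative bit-vector dominator computation.  It is
not gcc's code; it only illustrates the shape suggested by the call graph
above, where 3 calls to compute_flow_dominators make 738118 calls to
sbitmap_intersection_of_preds.  All names in it (bitset, bs_set, bs_test,
compute_dominators, MAX_BLOCKS, the edge matrix) are hypothetical and
chosen for this example.  Each block's dominator set is recomputed as the
intersection of its predecessors' sets until nothing changes, so the work
grows with (passes) x (blocks) x (set size).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 128                      /* hypothetical limit for this sketch */
#define WORDS ((MAX_BLOCKS + 63) / 64)

typedef struct { uint64_t w[WORDS]; } bitset;

void bs_set(bitset *s, int i) { s->w[i / 64] |= (uint64_t)1 << (i % 64); }
int bs_test(const bitset *s, int i) { return (int)((s->w[i / 64] >> (i % 64)) & 1); }

/* edge[p][b] != 0 means the CFG has an edge p -> b.  Block 0 is the entry.
   On return, bit d of dom[b] is set iff block d dominates block b.
   Returns how many "intersection of predecessors" steps were performed. */
long compute_dominators(int n, unsigned char edge[MAX_BLOCKS][MAX_BLOCKS],
                        bitset dom[MAX_BLOCKS])
{
    bitset all;
    long intersections = 0;
    int changed, b, p, i;

    assert(n > 0 && n <= MAX_BLOCKS);
    memset(&all, 0, sizeof all);
    for (i = 0; i < n; i++)
        bs_set(&all, i);

    /* The entry is dominated only by itself; every other block starts
       out "dominated by everything" and is narrowed down from there. */
    memset(&dom[0], 0, sizeof dom[0]);
    bs_set(&dom[0], 0);
    for (b = 1; b < n; b++)
        dom[b] = all;

    do {
        changed = 0;
        for (b = 1; b < n; b++) {
            bitset tmp = all;
            /* Intersect the dominator sets of all predecessors of b. */
            for (p = 0; p < n; p++) {
                if (!edge[p][b])
                    continue;
                for (i = 0; i < WORDS; i++)
                    tmp.w[i] &= dom[p].w[i];
            }
            intersections++;
            bs_set(&tmp, b);                /* every block dominates itself */
            if (memcmp(&tmp, &dom[b], sizeof tmp) != 0) {
                dom[b] = tmp;
                changed = 1;
            }
        }
    } while (changed);

    return intersections;
}
```

On a diamond-shaped graph (0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3) this reaches a
fixed point after two passes and leaves block 3 dominated only by blocks
0 and 3.  With many thousands of basic blocks, each set is thousands of
bits wide and every pass touches every block, which is consistent with
the 0.36 ms per call reported for sbitmap_intersection_of_preds.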

Brad Lucier

Follow-Ups:
- Re: Tremendous performance regression in 1.1.2 -> mainline
  - From: David Edelsohn
