This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: optimization/6007: cfg cleanup tremendous performance hog with -O1


> Honza and I have exchanged some off-list e-mail about this problem.  Real
> life has intervened, and I doubt that he will have time before the scheduled
> release of April 15 to work on this, so I'll attempt to summarize what's
> been happening.  I think I need some help.
> 
> All this is with the compile options -O1 -fschedule-insns2.
> 
> At first, this was the profile on my all.i test:
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  37.31   1106.44  1106.44 2123667007     0.00     0.00  try_crossjump_to_edge
>  11.27   1440.70   334.26                             internal_mcount
>   6.85   1643.67   202.97   395788     0.00     0.00  cselib_invalidate_regno
>   6.53   1837.39   193.72                             htab_traverse
>   4.67   1975.95   138.56     4987     0.03     0.03  propagate_freq
>   2.87   2061.08    85.13       29     2.94     2.94  find_unreachable_blocks
>   2.50   2135.09    74.01       15     4.93     4.94  calc_idoms
>   2.48   2208.53    73.44   468802     0.00     0.00  try_forward_edges
>   2.46   2281.48    72.95 173160573     0.00     0.00  cached_make_edge
>   2.41   2353.01    71.53 175996207     0.00     0.00  bitmap_operation
> ...
> 
> cleanup cfg took > 98% of the CPU time of 18 hours on my UltraSPARC.
> 
> Honza sent me a patch on March 28 that disabled try_crossjump_bb if there
> are more than 100 outgoing edges.  This changed the profile on a slightly
> smaller example (denoise3.i) to
> 
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls   s/call   s/call  name
>  63.29   1219.40  1219.40    33525     0.04     0.04  try_crossjump_to_edge
>   8.19   1377.13   157.73                             internal_mcount
>   5.08   1474.99    97.86                             htab_traverse
>   2.52   1523.56    48.57   206655     0.00     0.00  try_forward_edges
>   2.40   1569.81    46.25       31     1.49     1.49  find_unreachable_blocks
>   2.16   1611.38    41.57     2905     0.01     0.01  propagate_freq
>   1.40   1638.30    26.91        9     2.99     5.48  calculate_global_regs_live
>   1.33   1663.98    25.68       15     1.71     1.71  calc_idoms
>   1.20   1687.02    23.04       15     1.54     1.54  calc_dfs_tree_nonrec
> 
> Basically, try_crossjump_bb was no longer a problem, but try_crossjump_to_edge
> is still a problem; cleanup cfg still took > 87% of the CPU time.
> 
> Honza suggested (several times) that one way to deal with this is to disable
> try_crossjump_to_edge and try_crossjump_bb unless we use -O2 or higher.
> These algorithms are O(N^3) in the number of edges, which are quadratic in

This is not the case.  The inner loop is O(N^2) in the number of edges leaving
a basic block, so the overall cost is at most O(N^3) in the number of basic
blocks for a complete graph.

With the work-limiting check they are already O(N^2) for a complete graph.
This is a little difficult to decrease further, as crossjumping needs to
compare all sources and all destinations.  This is not a problem for normal
control flow graphs, as they are sparse (the number of edges is linear in the
number of blocks).

Old crossjumping didn't run into this problem only because it didn't handle
computed jumps.  I guess one way to proceed is to simply bypass abnormal
edges when expensive optimizations are disabled (-O1).

Does this sound sane?  If so, I have enough time to prepare a patch
tomorrow night.
> the size of my programs, since I use a lot of computed gotos and label 
> addresses.
> 
> I've looked at cfgcleanup.c and don't really know how to proceed.  Can
> someone suggest a reasonable way to fix this?

There is already a patch in mainline for -fcrossjump-optimize.  I guess all
we need is to put it into the branch and set the default to 0 for the -O1
level.  Overall, -O1 can be made about 7% faster if crossjumping and jump
threading are disabled (that is also true of the old implementation).  I am
not sure the optimizations are worth the CPU time at the -O1 level.

Honza
> 
> Brad

