This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: optimization/6007: cfg cleanup tremendous performance hog with -O1
- From: Jan Hubicka <jh at suse dot cz>
- To: Brad Lucier <lucier at math dot purdue dot edu>
- Cc: Jan Hubicka <jh at suse dot cz>, David Edelsohn <dje at watson dot ibm dot com>,gcc at gcc dot gnu dot org, mark at codesourcery dot com, feeley at iro dot umontreal dot ca
- Date: Thu, 4 Apr 2002 18:22:03 +0200
- Subject: Re: optimization/6007: cfg cleanup tremendous performance hog with -O1
- References: <20020329162904.GK2886@atrey.karlin.mff.cuni.cz> <200204041456.g34Eunh29854@banach.math.purdue.edu>
> Honza and I have exchanged some off-list e-mail about this problem. Real
> life has intervened, and I doubt that he will have time before the scheduled
> release of April 15 to work on this, so I'll attempt to summarize what's
> been happening. I think I need some help.
>
> All this is with the compile options -O1 -fschedule-insns2.
>
> At first, this was the profile on my all.i test:
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 37.31 1106.44 1106.44 2123667007 0.00 0.00 try_crossjump_to_edge
> 11.27 1440.70 334.26 internal_mcount
> 6.85 1643.67 202.97 395788 0.00 0.00 cselib_invalidate_regno
> 6.53 1837.39 193.72 htab_traverse
> 4.67 1975.95 138.56 4987 0.03 0.03 propagate_freq
> 2.87 2061.08 85.13 29 2.94 2.94 find_unreachable_blocks
> 2.50 2135.09 74.01 15 4.93 4.94 calc_idoms
> 2.48 2208.53 73.44 468802 0.00 0.00 try_forward_edges
> 2.46 2281.48 72.95 173160573 0.00 0.00 cached_make_edge
> 2.41 2353.01 71.53 175996207 0.00 0.00 bitmap_operation
> ...
>
> cleanup cfg took > 98% of the CPU time of 18 hours on my UltraSPARC.
>
> Honza sent me a patch on March 28 that disabled try_crossjump_bb if there
> are more than 100 outgoint edges. This changed the profile on a slightly
> smaller example (denoise3.i) to
>
> Each sample counts as 0.01 seconds.
> % cumulative self self total
> time seconds seconds calls s/call s/call name
> 63.29 1219.40 1219.40 33525 0.04 0.04 try_crossjump_to_edge
> 8.19 1377.13 157.73 internal_mcount
> 5.08 1474.99 97.86 htab_traverse
> 2.52 1523.56 48.57 206655 0.00 0.00 try_forward_edges
> 2.40 1569.81 46.25 31 1.49 1.49 find_unreachable_blocks
> 2.16 1611.38 41.57 2905 0.01 0.01 propagate_freq
> 1.40 1638.30 26.91 9 2.99 5.48 calculate_global_regs_live
> 1.33 1663.98 25.68 15 1.71 1.71 calc_idoms
> 1.20 1687.02 23.04 15 1.54 1.54 calc_dfs_tree_nonrec
>
> Basically, try_crossjump_bb was no longer a problem, but try_crossjump_to_edge
> is still a problem; cleanup cfg still took > 87% of the CPU time.
>
> Honza suggested (several times) that one way to deal with this is to disable
> try_crossjump_to_edge and try_crossjump_bb with unless we use -O2 or higher.
> These algorithms are O(N^3) in the number of edges, which are quadratic in
This is not the case. Tiny loop is O(N^2) in number of edges leaving basic
block, overall maximally O(N^3) in number of basic block for complette graph.
With work limiting check they are already O(N^2). For complette graph.
This is little bit dificult to decrease futher, as crossjumping needs to
compare all sources and all destinations. This is not problem for normal
control flow graphs, as they are sparse (number of edges is linear to nmuber
of blocks).
Old crossjumping didn't run into this problem just because it didn't handled
computed jump. I guess one way to proceed is to simply bypass abnormal
edges when expensive optimization is disabled. (-O1)
Does this sound sane? If so, I have enought time to prepare path
tomorrow night.
> the size of my programs, since I use a lot of computed gotos and label
> addresses.
>
> I've looked at cfgcleanup.c and don't really know how to proceed. Can
> someone suggest a reasonable way to fix this?
There is already patch in mainline for -fcrossjump-optimize. I guess all
we need is to put it into branch and set default to 0 for -O1 level.
Overall -O1 can be made about 7% faster if crossjumping and jump threading
is disabled (that is also true for old implementation). I am not sure
if the optimizations woth the CPU time for -O1 level.
Honza
>
> Brad