This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: gcc compile-time performance
- From: Richard Earnshaw <rearnsha at arm dot com>
- To: Jan Hubicka <jh at suse dot cz>
- Cc: Richard dot Earnshaw at arm dot com, gcc at gcc dot gnu dot org
- Date: Mon, 20 May 2002 16:02:55 +0100
- Subject: Re: gcc compile-time performance
- Organization: ARM Ltd.
- Reply-to: Richard dot Earnshaw at arm dot com
> > Here's another oddity.
> >
> > Why is predict.c using the target floating-point emulation routines to do
> > its branch probability calculations? There must be a faster way of doing
> > this that is good enough for the level of estimation needed here -- the
> > probabilities are at best approximate.
> >
> > When profiling a compilation of combine.c (a function with no floating
> > point code), I was amazed to find that we spend 2.5% of the total
> > compilation time in earith() and its children.
>
> This is curious; I was benchmarking it on a similar testcase before sending
> the patch, and it was about 0.5% of total compilation time spent in the
> branch probability pass...
OK, I've re-run the same code but this time with a compiler built with
-O2; it makes a little difference, but not much:
                0.01    0.06   22284/63991      estimate_bb_frequencies [46]
                0.03    0.11   41707/63991      propagate_freq [64]
[68]     1.8    0.04    0.17   63991            earith [68]
                0.01    0.11   28608/28608      ediv [107]
                0.01    0.02   14613/32590      emul [185]
                0.00    0.01   20770/38747      eadd1 [332]
                0.01    0.00  127982/779628     eisnan [244]
                0.00    0.00   14642/32619      eadd [635]
                0.00    0.00    6128/6128       esub [812]
That 1.8 means 1.8% of the total run time, with the only callers being the
basic-block frequency code. I.e., roughly 2% of the entire compilation time
is spent just estimating probabilities.
> >
> > Surely either native floating-point code, or even some simple fixed-point
>
> Native floating-point code is a problem, unfortunately, since on i386 you get
> different results in optimized and non-optimized builds, breaking bootstrap.
> Fixed-point code is a problem, as we are interested in comparisons relative
> to the highest frequency in the program. This may be the entry block for a
> tree-structured function, but it may be the inner loop, and at high loop
> nests the difference between these two can be more than the 2^30 we can
> afford in integer arithmetic.
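To make that 2^30 figure concrete, here is a small sketch; the ten iterations
per loop level is only an assumed average for illustration, not anything GCC
uses:

/* If every loop is expected to iterate ~10 times (an assumption, not a
   GCC constant), the innermost block of an N-deep nest runs 10^N times
   per entry to the function.  Once that ratio passes 2^30, a fixed-point
   representation scaled to the hottest block has no precision left for
   the entry block.  */
#include <stdio.h>

int
main (void)
{
  const double iterations_per_level = 10.0;   /* assumed average */
  double ratio = 1.0;

  for (int depth = 1; depth <= 12; depth++)
    {
      ratio *= iterations_per_level;
      printf ("nest depth %2d: inner/entry frequency ratio = %.0f%s\n",
              depth, ratio,
              ratio > (double) (1 << 30) ? "  (> 2^30)" : "");
    }
  return 0;
}

With those assumed iteration counts the ratio crosses 2^30 at around ten
levels of nesting.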
I'm a little surprised that we really need to preserve such huge levels of
ordering. Would it really matter (in terms of final code execution
performance) if those smaller numbers underflowed to zero?
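One way to read that suggestion, purely as a sketch (BB_FREQ_MAX and
REG_BR_PROB_BASE are used here only as illustrative scaling constants; this
is not a description of what GCC actually does):

/* Sketch of the underflow-to-zero idea, not actual GCC code: keep
   frequencies as plain integers scaled so the hottest block gets
   BB_FREQ_MAX, and let anything that scales below one unit quietly
   become zero.  */
#include <stdio.h>

#define BB_FREQ_MAX       10000   /* hottest block gets this value */
#define REG_BR_PROB_BASE  10000   /* edge probability denominator  */

/* Scale a predecessor frequency by an edge probability in integer
   arithmetic; small results silently truncate to zero.  */
static int
scale_freq (int freq, int probability)
{
  long long f = (long long) freq * probability;
  return (int) (f / REG_BR_PROB_BASE);          /* truncates toward zero */
}

int
main (void)
{
  int hot  = BB_FREQ_MAX;           /* innermost loop block             */
  int cold = scale_freq (hot, 3);   /* a 0.03% probability successor    */
  int tiny = scale_freq (cold, 3);  /* ...and an equally cold successor */

  printf ("hot = %d, cold = %d, tiny = %d (underflowed to zero)\n",
          hot, cold, tiny);
  return 0;
}

The coldest blocks all collapse to zero and lose their relative ordering, but
for optimization purposes they are all "almost never executed" anyway.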
R.