This is the mail archive of the
mailing list for the GCC project.
Re: gcc compile-time performance
> > > Here's another oddity.
> > >
> > > Why is predict.c using the target floating-point emulation routines to do
> > > its branch probability calculations? There must be a faster way of doing
> > > this that is good enough for the level of estimation needed here -- the
> > > probabilities are at best approximate.
> > >
> > > When profiling a compilation of combine.c (a function with no floating
> > > point code), I was amazed to find that we spend 2.5% of the total
> > > compilation time in earith() and its children.
> > This is curious. I benchmarked it on a similar testcase before sending
> > the patch, and about 0.5% of total compilation time was spent in the
> > branch probability pass...
> OK, I've re-run the same code but this time with a compiler built with
> -O2; it makes a little difference, but not much:
> % time    self  children    called          name
>           0.01    0.06   22284/63991          estimate_bb_frequencies
>           0.03    0.11   41707/63991          propagate_freq
>    1.8    0.04    0.17   63991            earith
>           0.01    0.11   28608/28608          ediv
>           0.01    0.02   14613/32590          emul
>           0.00    0.01   20770/38747          eadd1
>           0.01    0.00  127982/779628         eisnan
>           0.00    0.00   14642/32619          eadd
>           0.00    0.00    6128/6128           esub
> That 1.8 means 1.8% of total run time, the only callers being the
> basic-block code. I.e., ~2% of the entire compilation time is spent just
> estimating probabilities.
I am just looking at it. What I don't understand yet is why this does not show
up in the -Q output. There is a BRANCH_PROB timevar, and the timevar is
pushed/popped, yet it is not displayed...
> > >
> > > Surely either native floating-point code, or even some simple fixed-point
> > Native floating-point code is a problem, unfortunately, since on i386 you get
> > different results in optimized and non-optimized builds, breaking bootstrap.
> > Fixed-point code is a problem too, as we are interested in comparisons relative
> > to the highest frequency in the program. This may be the entry block for a
> > tree-structured function, but it may also be an inner loop, and at high loop
> > nests the difference between the two exceeds the 2^30 we can afford
> > in integer arithmetic.
> I'm a little surprised that we really need to preserve such huge levels of
> ordering. Would it really matter (in terms of final code execution
> performance) if those smaller numbers underflowed to zero?
We need to get the big numbers high. Unfortunately, when we propagate them,
we often build them from small numbers, so the roundoff error propagates.
Of course my first version used 32-bit integers and it overflowed all the time,
unfortunately. My second version was floating point, and I was asked to rewrite
it into emulated FP because of the bootstrap miscomparisons that happened from
time to time with the broken FP on i386....
I am open to suggestions; I am just mentioning what has already been tried.
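
To put rough numbers on the fixed-point range problem mentioned above, here is
a small C sketch. It is not code from predict.c: the 30-bit budget matches the
2^30 figure quoted above, but the per-level iteration count and both function
names are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Dynamic range, in bits, assumed for the integer frequency scheme
   (illustrative; chosen to match the 2^30 figure in the discussion).  */
#define FREQ_RANGE_BITS 30

/* Hypothetical helper: return the deepest loop nest whose innermost/entry
   frequency ratio still fits in FREQ_RANGE_BITS bits, assuming every loop
   level multiplies the frequency by ITER.  */
int
max_representable_depth (uint64_t iter)
{
  const uint64_t limit = (uint64_t) 1 << FREQ_RANGE_BITS;
  uint64_t ratio = 1;
  int depth = 0;

  assert (iter >= 2);

  /* ratio <= limit / iter guarantees ratio * iter <= limit.  */
  while (ratio <= limit / iter)
    {
      ratio *= iter;
      depth++;
    }
  return depth;
}

/* Hypothetical helper: scale the hottest block to 2^FREQ_RANGE_BITS and
   return the integer frequency left for the entry block of a nest of the
   given DEPTH.  A result of 0 means the entry block underflowed.  */
uint32_t
scaled_entry_freq (uint64_t iter, int depth)
{
  uint64_t freq = (uint64_t) 1 << FREQ_RANGE_BITS;

  assert (iter >= 2);

  for (int i = 0; i < depth; i++)
    freq /= iter;   /* truncating division, as plain integers would do */
  return (uint32_t) freq;
}
```

With an assumed 100 iterations per loop level, max_representable_depth (100)
is 4, and scaled_entry_freq (100, 5) is 0: at a nest depth of five the entry
block's frequency rounds away entirely, which is the kind of loss described
above.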