This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: GCC gprof statistics


Andrew Pinski wrote:

> Using Shark from the CHUD tools, sampling for 30 seconds, here are the
> top 3 functions which call expr_equiv_p:
>         8.2%      ldst_entry    cc1
>         3.1%      expr_equiv_p  cc1             <--- itself (already lowered)
>         2.1%      trim_ld_motion_mems   cc1
> 
> Note this version of gcc which I tested includes a patch which optimize
> the common case of "ee" for the rtl/rtx:
> 3.6%    1906      if ( GET_RTX_LENGTH (code) == 2 && fmt[0] == 'e' && fmt[1] == 'e')
>         1907        {
> 0.2%    1908          if (! expr_equiv_p (XEXP (x, 1), XEXP (y, 1)))
>         1909            return 0;       !Invariant
>         1910          return expr_equiv_p (XEXP (x, 0), XEXP (y, 0));
>         1911        }
> 
> which causes expr_equiv_p to be able changed into a loop for that case.
> This patch has also been applied (for me) to for_each_rtx.

Thank you for the info.
I will see if I can find out more about "Shark from the CHUD tools".
My knowledge of profiling tools is somewhat outdated.
(Micro-analysis using tcov is just that. However, it can certainly
give us information such as the average number of loop iterations.
For example, I checked expr_equiv_p and found that the above code
was taken out of a loop. The loop itself uses a generic
loop count, but as Andrew found, there are common
cases that can be handled specially in order to speed up
execution. I used to use a tcov-like tool to carry out
micro-analysis, but better tools make the job easier.)
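To make the idea of pulling the two-operand case out of the generic loop concrete, here is a minimal sketch in C. It is not the actual GCC patch: struct expr, MAX_KIDS and expr_same are hypothetical stand-ins for rtx, the RTX format string and expr_equiv_p. The common two-operand case recurses on operand 1 only and then continues with operand 0 inside the loop, so the final call becomes an iteration instead of a recursion.

```c
#include <stddef.h>

#define MAX_KIDS 3              /* hypothetical limit for this toy tree */

/* Toy stand-in for an RTL expression; this is NOT GCC's rtx. */
struct expr {
    int code;                   /* operation code */
    int nkids;                  /* number of operands in use */
    struct expr *kid[MAX_KIDS]; /* operand subtrees */
};

/* Returns 1 if the two trees are structurally equal, 0 otherwise. */
static int expr_same(const struct expr *x, const struct expr *y)
{
    int i;

    for (;;) {
        if (x == y)
            return 1;
        if (x == NULL || y == NULL)
            return 0;
        if (x->code != y->code || x->nkids != y->nkids)
            return 0;

        if (x->nkids == 2) {
            /* Common case: recurse on operand 1 only ... */
            if (!expr_same(x->kid[1], y->kid[1]))
                return 0;
            /* ... and handle operand 0 by looping instead of calling. */
            x = x->kid[0];
            y = y->kid[0];
            continue;
        }

        /* General case: compare every operand recursively. */
        for (i = 0; i < x->nkids; i++)
            if (!expr_same(x->kid[i], y->kid[i]))
                return 0;
        return 1;
    }
}
```

For a long chain of two-operand nodes, the recursion depth now follows only the operand-1 side, while the operand-0 side is walked iteratively.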

> I also have a patch which causes the number of calls to expr_equiv_p to
> be lowered by switching around
> (in find_rtx_in_ldst) the left hand with the right in the "and".

This also makes sense. I used to
rewrite complex conditions such as
 if (a && b && c && d)
to
 if ( a
     && b
     && c
     && d)
just so that tcov would give us output that is easier to
decipher, showing which conditions failed or succeeded.
If we know that certain conditions (otherwise on equal
standing) fail more often, then we should check those
conditions first in the above example. (Of course, there
are cases where the conditions depend implicitly on their
order, and then we can't rearrange them.)
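As an illustration of why the order of operands in "&&" matters, here is a minimal sketch in C. The predicates cheap_test and costly_test and the counter are hypothetical and have nothing to do with the real find_rtx_in_ldst; the counter merely stands in for what a profiler would report. Because "&&" short-circuits, putting the test that fails most often first reduces the number of calls to the other one.

```c
/* Hypothetical counter, standing in for a profiler's call count. */
static long costly_calls;

/* Hypothetical cheap test that fails nine times out of ten. */
static int cheap_test(int x)
{
    return x % 10 == 0;
}

/* Hypothetical expensive test that almost always succeeds. */
static int costly_test(int x)
{
    costly_calls++;
    return x >= 0;
}

/* Original order: the expensive test runs on every call. */
static int match_costly_first(int x)
{
    return costly_test(x) && cheap_test(x);
}

/* Swapped order: the expensive test runs only when the cheap one passes. */
static int match_cheap_first(int x)
{
    return cheap_test(x) && costly_test(x);
}

/* Runs MATCH over 0..n-1 and returns how often costly_test was called. */
static long count_costly_calls(int (*match)(int), int n)
{
    int i;

    costly_calls = 0;
    for (i = 0; i < n; i++)
        (void) match(i);
    return costly_calls;
}
```

Both variants return the same result for every input; only the number of calls to the expensive test changes. The swap is safe here only because neither test has side effects that the other depends on.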

After downloading the CVS tarball and creating a patch
for the command-option work
(for which I originally downloaded cvs 3.3),
I will begin experimenting a little more.

With gcc 3.3, I have already created specialized versions
for three uses of the generic for_each_rtx.
This reduced the compilation time by
0.2 seconds, from 9.3 sec to 9.1 sec:
about a 2 percent speedup.

The three specialized functions are:
for_each_rtx_approx_reg_cost
for_each_rtx_check_dependence
for_each_rtx_check_for_label_ref
In view of Andrew's suggestion that for_each_rtx could
be improved, I should probably create a generic macro
to define such specialized functions.
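One possible shape for such a macro is sketched below in C. This is an assumption for illustration, not code from GCC: struct node, for_each_node, DEFINE_FOR_EACH_NODE and sum_values are all hypothetical, and the callback protocol is simplified (nonzero just stops the walk) compared with the real for_each_rtx. The point is that the generated walker calls its callback directly rather than through a function pointer, which lets the compiler inline it.

```c
#include <stddef.h>

/* Toy stand-in for an RTL expression; this is NOT GCC's rtx. */
struct node {
    int value;
    struct node *kid[2];        /* NULL when an operand is absent */
};

typedef int (*node_fn)(struct node *, void *);

/* Generic walker: one indirect call per node; stops on a nonzero result. */
static int for_each_node(struct node *n, node_fn f, void *data)
{
    int i, r;

    if (n == NULL)
        return 0;
    if ((r = f(n, data)) != 0)
        return r;
    for (i = 0; i < 2; i++)
        if ((r = for_each_node(n->kid[i], f, data)) != 0)
            return r;
    return 0;
}

/* Defines a walker NAME specialized for callback FN.  The call to FN
   is direct, so the compiler is free to inline it.  */
#define DEFINE_FOR_EACH_NODE(NAME, FN)                          \
static int NAME(struct node *n, void *data)                     \
{                                                               \
    int i, r;                                                   \
                                                                \
    if (n == NULL)                                              \
        return 0;                                               \
    if ((r = FN(n, data)) != 0)                                 \
        return r;                                               \
    for (i = 0; i < 2; i++)                                     \
        if ((r = NAME(n->kid[i], data)) != 0)                   \
            return r;                                           \
    return 0;                                                   \
}

/* Hypothetical callback: adds each node's value to *(int *) data. */
static int sum_values(struct node *n, void *data)
{
    *(int *) data += n->value;
    return 0;
}

/* Generates for_each_node_sum_values, the specialized walker. */
DEFINE_FOR_EACH_NODE(for_each_node_sum_values, sum_values)
```

A caller would replace for_each_node (&root, sum_values, &total) with for_each_node_sum_values (&root, &total); both visit the same nodes in the same order.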

But what is more important here is that the new profile
clearly shows the newer (previously hidden) bottlenecks.
(From the flat profile in the gprof output.)
 
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  4.13      0.21     0.21  4154256     0.00     0.00  canon_rtx
  3.35      0.38     0.17   224919     0.00     0.00  cse_insn
  2.56      0.51     0.13   634982     0.00     0.00  rtx_cost
  2.36      0.63     0.12  1229455     0.00     0.00  mark_set_1
  2.17      0.74     0.11   565900     0.00     0.00  for_each_rtx_approx_reg_cost
  2.17      0.85     0.11   425486     0.00     0.00  fold_rtx
  1.57      0.93     0.08   820005     0.00     0.00  canon_hash
  1.57      1.01     0.08   298804     0.00     0.00  insert
  1.57      1.09     0.08   157512     0.00     0.00  reg_scan_mark_refs
  1.38      1.16     0.07  2314167     0.00     0.00  get_cse_reg_info
  1.38      1.23     0.07   976769     0.00     0.00  ggc_alloc
  1.38      1.30     0.07   768888     0.00     0.00  rtx_equal_for_memref_p
  1.38      1.37     0.07   521578     0.00     0.00  canon_reg
  1.38      1.44     0.07   248600     0.00     0.00  for_each_rtx_check_dependence
  1.38      1.51     0.07   200171     0.00     0.00  constrain_operands
  1.38      1.58     0.07   161391     0.00     0.00  copy_rtx
 
    ...

Looking at canon_rtx, I was under the impression that we might have to do
something radical to the RTX data structure, but as the post from Andrew
suggests, micro-analysis and various small optimizations
can probably achieve more speedup without major surgery.
canon_rtx can probably be tuned.

I will see what I can do.

(A long time ago, I used these optimization techniques
to speed up an in-house text formatter by 40%.
The major breakthrough came from using a better algorithm (Knuth
would be proud), but about 15% came from the various
optimization techniques. On a slow Sun-3 machine using NFS, the
improvement was very welcome. Back then I used GCC 1.4x, if I
recall correctly. I am happy to contribute back to GCC now.)

Happy Hacking,

Ishikawa, Chiaki 

-- 
int main(void){int j=2003;/*(c)2003 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="g>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */

