This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: GCC gprof statistics
- From: Ishikawa <ishikawa at yk dot rim dot or dot jp>
- To: Andrew Pinski <pinskia at physics dot uc dot edu>
- Cc: gcc at gcc dot gnu dot org
- Date: Mon, 30 Jun 2003 05:32:06 +0900
- Subject: Re: GCC gprof statistics
- References: <C1A8D88B-AA3B-11D7-B3D8-000393A6D2F2@physics.uc.edu>
Andrew Pinski wrote:
> Using Shark from the CHUD tools, sampling for 30 seconds, here are the
> top 3 functions which call expr_equiv_p:
> 8.2% ldst_entry cc1
> 3.1% expr_equiv_p cc1 <--- itself (already lowered)
> 2.1% trim_ld_motion_mems cc1
>
> Note that the version of gcc which I tested includes a patch which
> optimizes the common case of "ee" for the rtl/rtx:
> 3.6%  1906  if (GET_RTX_LENGTH (code) == 2 && fmt[0] == 'e' && fmt[1] == 'e')
>       1907    {
> 0.2%  1908      if (! expr_equiv_p (XEXP (x, 1), XEXP (y, 1)))
>       1909        return 0;   !Invariant
>       1910      return expr_equiv_p (XEXP (x, 0), XEXP (y, 0));
>       1911    }
>
> which allows expr_equiv_p to be changed into a loop for that case.
> This patch has also been applied (for me) to for_each_rtx.
Thank you for the info.
I will see if I can find more info on "Shark from the CHUD tools".
My knowledge of profiling tools is somewhat outdated.
(Micro-analysis using tcov is just that. However, it can certainly
give us information about the average number of loop iterations: for
example, I checked expr_equiv_p and found that the code above was
taken out of a loop. The loop itself uses a generic
loop count, but as Andrew found, there are common
cases that can be handled specially in order to speed up
execution. I used to use a tcov-like tool to carry out
micro-analysis, but better tools make the job easier.)
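A minimal sketch of the kind of special-casing discussed above, under loud assumptions: `struct node`, `MAX_OPS`, and `node_equiv_p` are hypothetical stand-ins and are not GCC's rtx, its format strings, or the real expr_equiv_p. The generic path loops over every operand with a recursive call; the fast path for the common two-operand case recurses only on the second operand and handles the first by going around the loop again.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPS 3

/* Hypothetical expression node, NOT GCC's rtx.  */
struct node {
    int code;                   /* kind of expression */
    int nops;                   /* number of operands in use */
    struct node *op[MAX_OPS];   /* operands; unused slots are NULL */
};

/* Return 1 if X and Y are structurally equal, 0 otherwise.  */
int node_equiv_p(const struct node *x, const struct node *y)
{
    for (;;) {
        int i;

        if (x == y)
            return 1;
        if (x == NULL || y == NULL)
            return 0;
        if (x->code != y->code || x->nops != y->nops)
            return 0;
        assert(x->nops >= 0 && x->nops <= MAX_OPS);

        if (x->nops == 2) {
            /* Fast path for the common two-operand case: recurse on
               operand 1, then loop on operand 0 instead of making a
               second recursive call.  */
            if (!node_equiv_p(x->op[1], y->op[1]))
                return 0;
            x = x->op[0];
            y = y->op[0];
            continue;
        }

        /* Generic path: one recursive call per operand.  */
        for (i = 0; i < x->nops; i++)
            if (!node_equiv_p(x->op[i], y->op[i]))
                return 0;
        return 1;
    }
}
```

Whether such a fast path pays off depends on how often the two-operand case really occurs, which is exactly what a profile or a tcov-style count can tell us.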
> I also have a patch which causes the number of calls to expr_equiv_p to
> be lowered by switching around
> (in find_rtx_in_ldst) the left hand with the right in the "and".
This also figures. I used to
rewrite complex conditions such as
    if (a && b && c && d)
to
    if (   a
        && b
        && c
        && d)
just so that tcov could give us better (easier to
decipher) output to figure out which
conditions failed or succeeded, etc.
If certain conditions (otherwise on equal standing)
are known to fail more often, then
we should check those conditions first in the
above example. (Of course, there are cases
where an implicit ordering is in place, and then
we can't rearrange them.)
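A small sketch of the reordering idea, with everything in it assumed for illustration: the two predicates, their failure rates, and the evaluation counter are hypothetical and have nothing to do with the actual conditions in find_rtx_in_ldst. Both checks compute the same result, but the one that tests the frequently failing condition first evaluates fewer predicates because `&&` short-circuits.

```c
#include <assert.h>
#include <stddef.h>

/* Counts every predicate evaluation so the two orderings can be
   compared.  */
static unsigned long evals;

/* Hypothetical predicates.  */
static int rarely_fails(int v)  { evals++; return v % 100 != 0; } /* false for 1 in 100 */
static int usually_fails(int v) { evals++; return v % 10 == 0; }  /* false for 9 in 10 */

/* The condition that almost always holds is tested first, so the
   second predicate runs for nearly every input.  */
int check_slow(int v)
{
    return rarely_fails(v)
           && usually_fails(v);
}

/* Same result, but the condition that usually fails is tested first,
   so the second predicate is skipped most of the time.  */
int check_fast(int v)
{
    return usually_fails(v)
           && rarely_fails(v);
}

/* Run CHECK on 0 .. n-1 and return how many predicates were evaluated.  */
unsigned long count_evals(int (*check)(int), int n)
{
    int v;

    assert(check != NULL);
    evals = 0;
    for (v = 0; v < n; v++)
        (void) check(v);
    return evals;
}
```

For n = 1000 this counts 1990 predicate evaluations for check_slow and 1100 for check_fast. The reordering is only safe here because neither predicate depends on the other having passed first.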
After I download the CVS tarball and create a patch
for the command-option work
for which I originally downloaded the 3.3 CVS sources,
I will begin experimenting a little more.
With gcc 3.3, I have already specialized three
functions that call the generic for_each_rtx.
This reduced the compilation time by
0.2 seconds, from 9.3 sec to 9.1 sec:
about a 2 percent speedup.
The three specialized functions are:
for_each_rtx_approx_reg_cost
for_each_rtx_check_dependence
for_each_rtx_check_for_label_ref
In view of Andrew's suggestion that for_each_rtx could
be improved, I probably should create a generic macro
to define such specialized functions.
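One possible shape for such a macro, as a rough sketch only. The node type, the macro name `DEFINE_FOR_EACH_NODE`, and the two callbacks are hypothetical; the real for_each_rtx has a richer callback protocol than the simplified "nonzero return stops the walk" convention assumed here. The point is that each generated walker calls its callback by name, so the compiler is free to inline it instead of calling through a function pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPS 3

/* Hypothetical tree node standing in for an rtx.  */
struct node {
    int code;
    int nops;
    struct node *op[MAX_OPS];
};

/* Define NAME as a pre-order walker that calls FN directly.  The walk
   stops at the first nonzero value returned by FN and returns that
   value; if FN never returns nonzero, the walker returns 0.  */
#define DEFINE_FOR_EACH_NODE(NAME, FN)                      \
    int NAME(struct node *n, void *data)                    \
    {                                                       \
        int i, r;                                           \
        if (n == NULL)                                      \
            return 0;                                       \
        assert(n->nops >= 0 && n->nops <= MAX_OPS);         \
        r = FN(n, data);                                    \
        if (r != 0)                                         \
            return r;                                       \
        for (i = 0; i < n->nops; i++) {                     \
            r = NAME(n->op[i], data);                       \
            if (r != 0)                                     \
                return r;                                   \
        }                                                   \
        return 0;                                           \
    }

/* Hypothetical callback: count every node visited.  */
static int count_node(struct node *n, void *data)
{
    (void) n;
    ++*(int *) data;
    return 0;
}

/* Hypothetical callback: stop at the first node with a given code.  */
static int find_code(struct node *n, void *data)
{
    return n->code == *(int *) data;
}

DEFINE_FOR_EACH_NODE(for_each_node_count, count_node)
DEFINE_FOR_EACH_NODE(for_each_node_find_code, find_code)
```

Each specialized walker is then called like the generic one, minus the function-pointer argument, for example for_each_node_count (root, &count).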
But more importantly, the new profile
clearly shows the newer (previously hidden) bottlenecks.
(From the flat profile of the gprof output.)
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 4.13      0.21     0.21  4154256     0.00     0.00  canon_rtx
 3.35      0.38     0.17   224919     0.00     0.00  cse_insn
 2.56      0.51     0.13   634982     0.00     0.00  rtx_cost
 2.36      0.63     0.12  1229455     0.00     0.00  mark_set_1
 2.17      0.74     0.11   565900     0.00     0.00  for_each_rtx_approx_reg_cost
 2.17      0.85     0.11   425486     0.00     0.00  fold_rtx
 1.57      0.93     0.08   820005     0.00     0.00  canon_hash
 1.57      1.01     0.08   298804     0.00     0.00  insert
 1.57      1.09     0.08   157512     0.00     0.00  reg_scan_mark_refs
 1.38      1.16     0.07  2314167     0.00     0.00  get_cse_reg_info
 1.38      1.23     0.07   976769     0.00     0.00  ggc_alloc
 1.38      1.30     0.07   768888     0.00     0.00  rtx_equal_for_memref_p
 1.38      1.37     0.07   521578     0.00     0.00  canon_reg
 1.38      1.44     0.07   248600     0.00     0.00  for_each_rtx_check_dependence
 1.38      1.51     0.07   200171     0.00     0.00  constrain_operands
 1.38      1.58     0.07   161391     0.00     0.00  copy_rtx
...
Looking at canon_rtx, I was under the impression that maybe we must do
something radical to the RTX data structure, but as the post from Andrew
suggests, micro-analysis and various small optimizations
can probably achieve more speedup without major surgery.
canon_rtx probably can be tuned.
I will see what I can do.
(A long time ago, I used these optimization techniques
to speed up an in-house text formatter by 40%.
The major breakthrough was using a better algorithm (Knuth
would be proud), but about 15% came from the various
optimization techniques. On a slow Sun-3 machine using NFS, the
improvement was very welcome. Back then I used GCC 1.4x, if I
recall correctly. I am happy to contribute back to GCC now.)
Happy Hacking,
Ishikawa, Chiaki
--
int main(void){int j=2003;/*(c)2003 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="g>qtCIuqivb,gCwe\np@.ietCIuqi\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */