This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] var-tracking.c speedups
- From: Richard Guenther <richard dot guenther at gmail dot com>
- To: Jakub Jelinek <jakub at redhat dot com>
- Cc: gcc-patches at gcc dot gnu dot org, Alexandre Oliva <aoliva at redhat dot com>
- Date: Tue, 23 Jun 2009 15:30:52 +0200
- Subject: Re: [PATCH] var-tracking.c speedups
- References: <20090623125102.GC4462@tyan-ft48-01.lab.bos.redhat.com>
On Tue, Jun 23, 2009 at 2:51 PM, Jakub Jelinek <firstname.lastname@example.org> wrote:
> Still on the http://bugzilla.redhat.com/503816 testcase I've run callgrind
> (valgrind --tool=callgrind --collect-atstart=no --toggle-collect=variable_tracking_main ./cc1plus ...
> callgrind_annotate --auto=yes --inclusive=yes callgrind*.out*) and the
> following patch fixes some low hanging fruit discovered by it.
> Overall the patch (still on 4.4 branch, x86_64-linux -> s390x-linux cross)
> results in (real/user/sys):
> cc1plus.vanilla -g -O2 -fno-var-tracking        3m30.618s       3m29.638s       0m0.947s
> cc1plus.vanilla -g -O2                          6m49.722s       6m47.254s       0m2.374s
> cc1plus.patched -g -O2                          5m50.255s       5m48.398s       0m1.787s
> (middle from 3 runs).  cc1plus.vanilla here is branches/gcc-4_4-branch with
> r148760 backported to it, and cc1plus.patched is the same with the following
> patch on top of it.  So this patch saves approx. 15% of runtime, or 30% of
> vartrack pass runtime.
> The fixes are:
> 1) variable_union has been totally unnecessarily calling xcalloc, qsort
>    and free 178999434 times.  This is in the loc_chain sorting code.  I've
>    instrumented trunk bootstrap (x86_64-linux and i686-linux) to track how
>    frequent different values of dst_l and src_l are; (x,y) stands for
>    dst_l == x, src_l == y, sum stands for dst_l + src_l.  The first
>    line is gathered from the trunk bootstraps, the second one from the
>    above mentioned s390x testcase:
> (1,1)     (2,2)     sum==3  sum<=4  sum<=8  sum<=16  sum<=32  sum>32
> 10072567  2698962   706337  27977   382251  2492     0        0
> 89673364  89268534  1400    89      2397    53650    0        0
>    Turns out that the most frequent combinations can be handled without
>    qsort, fairly cheaply.  For dst_l == 1 we can just emit the single
>    dst entry followed by the src entries in order, except that any
>    src entry equal to the single dst entry is skipped.
>    dst_l == 2 can be handled easily as well.  For dst_l == 1 we
>    can avoid any kind of temporary buffer; for other combinations
>    we can just use a fixed-size static buffer, with a dynamically
>    realloced buffer as a fallback (freed only at the end of the pass).
> 2) I doubt anybody uses -fvar-tracking-uninit; it certainly uses a very
>    weird non-DWARFy DW_GNU_uninit extension which isn't clearly defined
>    and GDB only barely supports it, just in a few places.  Still, this
>    code was causing a quadratic loop in variable_union which was showing up
>    very high in the profile.  Fixed by making sure ->init is always
>    *_INITIALIZED unless -fvar-tracking-uninit and avoiding that loop.
> 3) dataflow_set_different_2 was just wasting cycles: we've already checked
>    at the beginning that both hashtables have the same number of elements,
>    so if traversal of one of the hashtables found all decls from that table
>    in the other table (and their variable objects were considered equal),
>    then the same is necessarily true for the other hash table.
> 4) in the sel-sched merge, rtx_equal_p was turned into the very
>    infrequently used rtx_equal_p_cb with an added callback, and rtx_equal_p
>    was reimplemented on top of that new function.  Given that rtx_equal_p
>    was showing very high in the callgrind profile, is very frequently used
>    throughout gcc rtl passes (400+ different callers), isn't too long to be
>    a maintenance burden, and is recursive (so we can't get rid of the
>    overhead of passing cb around and checking it at each level), I'm
>    proposing returning to the old rtx_equal_p function and keeping
>    rtx_equal_p_cb as a completely separate function.
> Patch has been bootstrapped/regtested on the trunk on x86_64-linux and
> i686-linux and on the 4.4 branch, and the above mentioned testcase has been
> verified to generate bitwise identical assembly.
> Attached is the patch, then the callgrind_annotate dump before this
> patch and after it, for anyone interested in suggesting further low hanging
> fruit.
> Ok for trunk?
The rtl.c parts are ok if you add a warning comment to both
rtx_equal_p and rtx_equal_p_cb that they should be kept in sync.
Is it worth having the vui static buffer _and_ an allocated variant?
Why not just unconditionally use the allocated variant?
If using an allocated variant always simplifies the code structure
(it looks quite complicated now) then please consider doing that.
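[The unconditionally-allocated alternative suggested here could look roughly like this. A hypothetical sketch only, not the actual var-tracking.c code; the names get_scratch and free_scratch are invented. One realloc'd scratch buffer is grown on demand, reused across calls, and released once at the end of the pass.]

```c
#include <stdlib.h>

/* Scratch buffer shared by all union operations in the pass.  */
static void *scratch_buf;
static size_t scratch_size;

/* Return a buffer of at least SIZE bytes, growing geometrically so
   that repeated calls rarely need to reallocate.  */
static void *
get_scratch (size_t size)
{
  if (size > scratch_size)
    {
      scratch_size = size * 2;
      void *p = realloc (scratch_buf, scratch_size);
      if (!p)
        abort ();               /* out of memory; error path simplified */
      scratch_buf = p;
    }
  return scratch_buf;
}

/* Call once at the end of the pass.  */
static void
free_scratch (void)
{
  free (scratch_buf);
  scratch_buf = NULL;
  scratch_size = 0;
}
```

This trades one rare realloc for the removal of the static-vs-allocated branching, which is presumably the simplification being asked for.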
Otherwise the patch is ok (the parts 2) and 3) anyway).