This is the mail archive of the
mailing list for the GCC project.
Re: gcc 3.5 integration branch proposal
Nick Burrett <firstname.lastname@example.org> writes:
> There's no harm in that. I have a port of GCC 3.3.3 running on a
> 200MHz StrongARM that takes over 6 minutes to compile the following:
> #include <iostream>
> int main (void)
> {
>   std::cout << "Hello World" << std::endl;
>   return 0;
> }
> GCC 2.95.4 compiled the same application on the same hardware in
> around 20-30 seconds.
I compiled GCC 3.4-to-be with profiling instrumentation and ran it
against this test case. The test takes 2.7 seconds on my roughly
two-year-old Athlon, which I think is far too slow; it should be
<0.01s on this hardware. Without benefit of PCH or other such
Profiling results are interesting. First, the times are nearly
identical at -O2 and -fsyntax-only. This is natural, since there really
isn't anything here to optimize, but it's worth noting. Almost all
the time is in the C++ front end. Here's the top of the flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  4.40      0.67     0.67      661     0.00     0.00  store_bindings
  2.95      1.12     0.45   372916     0.00     0.00  ggc_alloc
  2.95      1.57     0.45   129400     0.00     0.00  make_node
  2.50      1.95     0.38   193075     0.00     0.00  _cpp_lex_direct
  2.36      2.31     0.36    14689     0.00     0.00  grokdeclarator
  2.30      2.66     0.35   103752     0.00     0.00  memset
  2.23      3.00     0.34    96832     0.00     0.00  ht_lookup
  1.90      3.29     0.29   365671     0.00     0.00  htab_find_slot_with_hash
  1.90      3.58     0.29    89353     0.00     0.00  walk_tree
  1.90      3.87     0.29    81277     0.00     0.00  _int_malloc
store_bindings potentially does a tremendous amount of work: it (in
conjunction with its sole caller, maybe_push_to_top_level) temporarily
unwinds the current scope stack, which entails modifying the data
structure for every identifier declared in the program. Since the
program declares some 8,000 identifiers, you can see why a function
that's called only 661 times ends up at the top of the profile.
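
To put a rough number on that, here is a toy model of the save-and-clear
pattern. It is only a sketch, not GCC's actual code: Identifier,
SavedBinding, save_and_clear, restore and count_updates are all
hypothetical names, and it assumes the worst case in which every call
touches every identifier.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified model; not GCC's real data structures.
struct Identifier {
    int binding = 0;  // stand-in for the identifier's current binding
};

struct SavedBinding {
    Identifier* id;
    int binding;
};

// Save and clear every binding, standing in for a temporary unwind of
// the scope stack. `touched` counts how many identifiers were modified.
std::vector<SavedBinding> save_and_clear(std::vector<Identifier>& ids,
                                         std::size_t& touched) {
    std::vector<SavedBinding> saved;
    saved.reserve(ids.size());
    for (Identifier& id : ids) {
        saved.push_back({&id, id.binding});
        id.binding = 0;
        ++touched;
    }
    return saved;
}

// Put the saved bindings back once the top-level work is done.
void restore(const std::vector<SavedBinding>& saved) {
    for (const SavedBinding& s : saved) s.id->binding = s.binding;
}

// Total identifier updates for `calls` unwinds over `n_ids` identifiers.
std::size_t count_updates(std::size_t calls, std::size_t n_ids) {
    std::vector<Identifier> ids(n_ids);
    for (std::size_t i = 0; i < n_ids; ++i)
        ids[i].binding = static_cast<int>(i) + 1;
    std::size_t touched = 0;
    for (std::size_t c = 0; c < calls; ++c) {
        std::vector<SavedBinding> saved = save_and_clear(ids, touched);
        restore(saved);
    }
    return touched;
}
```

Under that worst-case assumption, count_updates(661, 8000) comes to
5,288,000 identifier updates, which is the same order of magnitude as
the total number of tokens lexed or nodes allocated in the profile.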
I am not sure why ggc_alloc comes in second; checking is disabled so
it isn't doing tons and tons of memset() operations or anything. The
time spent in make_node is, I suspect, largely due to the inlined
memset in there. Memset itself is being called mostly by [x]calloc,
and most of *those* calls trace back to walk_tree_without_duplicates
and/or for_each_template_parm, via htab_create_alloc. Those hash
tables are a kludge to prevent exponential time consumption in certain
algorithms; it would be nice if they weren't necessary.
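
For anyone wondering why a tree walk would be exponential without the
hash table, here is a minimal sketch. It assumes a hypothetical Node
type with shared subtrees, and it uses std::unordered_set as a stand-in
for the libiberty hash table; none of this is the real walk_tree code.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical node type; GCC's real tree nodes are far richer.
struct Node {
    Node* left = nullptr;
    Node* right = nullptr;
};

// Naive walk: a shared subtree is visited once per path that reaches it.
std::size_t walk_naive(const Node* n) {
    if (!n) return 0;
    return 1 + walk_naive(n->left) + walk_naive(n->right);
}

// Walk with a visited set: each distinct node is visited once, at the
// price of allocating a hash table for the walk.
std::size_t walk_once(const Node* n, std::unordered_set<const Node*>& seen) {
    if (!n || !seen.insert(n).second) return 0;
    return 1 + walk_once(n->left, seen) + walk_once(n->right, seen);
}

// Build a chain of `depth` nodes in which both children of each node
// point at the same next node, so every level is shared.
std::vector<Node> make_shared_chain(std::size_t depth) {
    std::vector<Node> nodes(depth);
    for (std::size_t i = 0; i + 1 < depth; ++i)
        nodes[i].left = nodes[i].right = &nodes[i + 1];
    return nodes;
}

std::size_t visits_naive(std::size_t depth) {
    std::vector<Node> nodes = make_shared_chain(depth);
    return walk_naive(nodes.empty() ? nullptr : &nodes[0]);
}

std::size_t visits_once(std::size_t depth) {
    std::vector<Node> nodes = make_shared_chain(depth);
    std::unordered_set<const Node*> seen;
    return walk_once(nodes.empty() ? nullptr : &nodes[0], seen);
}
```

With 20 distinct nodes, visits_naive(20) makes 2^20 - 1 = 1,048,575
visits while visits_once(20) makes 20. The catch is the one visible in
the profile: every walk pays to create and clear its own table, even
when the tree has no sharing at all.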
I think there's room for some easy speedups here, and I think they
could get into 3.4 if developed promptly. Let's see some patches.