This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
More on compile performance of Linux kernels in mainline gcc
- From: Andi Kleen <ak at suse dot de>
- To: gcc at gcc dot gnu dot org
- Date: Wed, 3 Nov 2004 05:52:52 +0100
- Subject: More on compile performance of Linux kernels in mainline gcc
This is an addendum for the numbers for linux kernel compiling
on x86-64 I posted some days ago. gcc tested is the same (041029)
on the same machine with the same kernel tree/configuration.
I tracked down why the 4.0 compiled kernels didn't boot. One issue
was a missing -fno-strict-aliasing for one file (now fixed),
the other is a miscompilation of a loop in a function in the Linux
radix tree library (PR18241). The miscompilation can be worked around
by compiling the affected file with -O0.
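For readers wondering why a missing -fno-strict-aliasing can break a kernel: ISO C lets the optimizer assume that an object is not accessed through a pointer of an unrelated type, and the kernel contains code that does exactly that. The sketch below is not the kernel file in question; the function names are made up for illustration, and it assumes a 32-bit IEEE 754 float. The type-punned variant is only compiled when the hypothetical macro SHOW_ALIASING_VIOLATION is defined.

```c
#include <stdint.h>
#include <string.h>

#ifdef SHOW_ALIASING_VIOLATION
/* Hypothetical example, not kernel code: reads a float through a
 * uint32_t pointer. This breaks the ISO C aliasing rules, so it is
 * only reliable when built with -fno-strict-aliasing. */
uint32_t float_bits_punned(float f)
{
    return *(uint32_t *)&f;
}
#endif

/* Portable alternative: copy the bytes instead of casting the pointer.
 * Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits_memcpy(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Code written like the second function does not depend on the flag at all.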
There are a lot of new warnings. Especially
pointer targets in passing argument 2 of `foo' differ in signedness
is extremely common.
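A minimal sketch of the kind of call that triggers this warning, assuming a callee that takes an unsigned byte buffer. The names count_nonzero, count_in_string and the macro SHOW_WARNING are hypothetical and not from the kernel tree:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callee: expects an unsigned byte buffer as argument 2. */
size_t count_nonzero(size_t len, const unsigned char *buf)
{
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (buf[i] != 0)
            n++;
    return n;
}

size_t count_in_string(const char *s)
{
#ifdef SHOW_WARNING
    /* char * passed where unsigned char * is expected: the pointer
     * targets differ only in signedness, which draws the warning. */
    return count_nonzero(strlen(s), s);
#else
    /* An explicit cast states the intent and silences the warning. */
    return count_nonzero(strlen(s), (const unsigned char *)s);
#endif
}
```

The two variants behave the same at run time; only the diagnostics differ.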
I was asked to retry with a mainline gcc built with
make profiledbootstrap.
This improves the 4.0 numbers somewhat.
gcc 3.3-hammer (profiledbootstrap)
210.32user 31.62system 3:57.66elapsed
4.0 snapshot with normal bootstrap:
262.71user 30.50system 4:48.46elapsed
4.0 snapshot with profiledbootstrap:
248.01user 30.25system 4:33.66elapsed
Still considerably slower than 3.3-hammer though: about 18% more
user time even with profiledbootstrap.
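To put the timings above in relative terms, here is a small sketch of the arithmetic. The helper name slowdown_percent is made up for illustration; the inputs are the user times in seconds quoted above.

```c
/* Percentage by which time `t` exceeds the baseline time `base`.
 * Both arguments are in the same unit, e.g. seconds of user time. */
double slowdown_percent(double base, double t)
{
    return (t / base - 1.0) * 100.0;
}
```

With the 3.3-hammer user time (210.32s) as the baseline, this gives roughly 24.9% for the normally bootstrapped 4.0 snapshot (262.71s) and roughly 17.9% for the profiledbootstrap one (248.01s).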
Also Jan asked for oprofile output. Here are all symbols over 0.3%
for a full kernel compile done with the profiledbootstrap compiler.
Looks like the likely/unlikely split is not very effective,
there are a lot of hot unlikely hits.
Some hash table lookup(s?) seem to be very hot, perhaps it needs
a better hash function or a larger table?
1.7% memset is somewhat worrying, that's a lot of clearing.
1.5% garbage collector accounting looks like a bug if that function
isn't misnamed.
Standard GLOBAL_POWER_EVENTS:
samples % app name symbol name
695020 4.0626 cc1 yyparse.unlikely_section
438612 2.5638 cc1 ht_lookup_with_hash
298462 1.7446 libc.so.6 memset
288277 1.6851 cc1 _cpp_lex_direct
265789 1.5536 cc1 ggc_alloc_stat.unlikely_section
264069 1.5436 vmlinux-26-quilt clear_page
246603 1.4415 libc.so.6 _int_malloc
195981 1.1456 vmlinux-26-quilt page_fault
180141 1.0530 cc1 walk_tree.unlikely_section
178419 1.0429 cc1 _cpp_clean_line
172287 1.0071 cc1 cpp_get_token
171537 1.0027 cc1 htab_find_slot_with_hash
164198 0.9598 cc1 for_each_rtx.unlikely_section
160974 0.9409 cc1 cse_insn
145394 0.8499 oprofiled odb_insert
139131 0.8133 cc1 constrain_operands.unlikely_section
129540 0.7572 libc.so.6 strlen
118542 0.6929 cc1 fold.unlikely_section
109847 0.6421 cc1 synth_mult
109798 0.6418 as (no symbols)
108898 0.6365 cc1 gimplify_expr.unlikely_section
96304 0.5629 cc1 record_reg_classes
95929 0.5607 cc1 lex_identifier
94718 0.5537 cc1 fold_rtx
91228 0.5333 cc1 reg_scan_mark_refs
88666 0.5183 cc1 make_node_stat.unlikely_section
88218 0.5157 cc1 yylex
87855 0.5135 cc1 mark_set_1
87435 0.5111 libc.so.6 memcpy
85209 0.4981 cc1 c_lex_with_flags.unlikely_section
82712 0.4835 cc1 build_int_cst_wide.unlikely_section
79861 0.4668 vmlinux-26-quilt do_no_page
79044 0.4620 cc1 canon_reg
78793 0.4606 cc1 extract_insn.unlikely_section
74090 0.4331 cc1 grokdeclarator
73247 0.4282 libc.so.6 __cfree
71708 0.4192 cc1 pop_scope.unlikely_section
71695 0.4191 cc1 init_alias_analysis.unlikely_section
71092 0.4156 cc1 iterative_hash_expr.unlikely_section
69631 0.4070 cc1 note_stores.unlikely_section
68900 0.4027 libc.so.6 __GI___libc_malloc
67924 0.3970 libc.so.6 __calloc
65767 0.3844 cc1 hash_rtx.unlikely_section
65629 0.3836 cc1 et_splay
65296 0.3817 cc1 mark_used_regs
62908 0.3677 cc1 find_reloads.unlikely_section
62825 0.3672 libc.so.6 _int_free
62063 0.3628 cc1 ix86_rtx_costs
61419 0.3590 cc1 count_reg_usage
61287 0.3582 cc1 rtx_cost.unlikely_section
60488 0.3536 cc1 get_stmt_operands.unlikely_section
60202 0.3519 cc1 htab_delete
60164 0.3517 cc1 force_fit_type.unlikely_section
58552 0.3423 libc.so.6 strcmp
57048 0.3335 cc1 bitmap_set_bit.unlikely_section
56734 0.3316 cc1 tree_code_size.unlikely_section
55753 0.3259 libc.so.6 malloc_consolidate
54495 0.3185 cc1 get_cse_reg_info
54318 0.3175 cc1 invalidate
53892 0.3150 cc1 int_const_binop.unlikely_section
52898 0.3092 cc1 _cpp_lex_token
52622 0.3076 cc1 propagate_one_insn.unlikely_section
52358 0.3061 vmlinux-26-quilt copy_user_generic
51972 0.3038 cc1 ix86_decompose_address.unlikely_section
L2 cache misses (with two 1MB caches):
The hot clear_page is the kernel's function to clear pages
allocated by page faults.
Looks like that hot hash lookup is just a bad cache pig. Maybe
the hash table is not too small, but too big?
yyparse is surprisingly cache intensive.
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x300 (multiple flags) count 10000
samples % app name symbol name
22564 23.1238 vmlinux-26-quilt clear_page
4785 4.9037 cc1 ht_lookup_with_hash
1859 1.9051 vmlinux-26-quilt copy_page
1487 1.5239 vmlinux-26-quilt copy_user_generic
1364 1.3978 cc1 pop_scope.unlikely_section
1077 1.1037 libc.so.6 memset
886 0.9080 cc1 yyparse.unlikely_section
727 0.7450 cc1 ggc_alloc_stat.unlikely_section
690 0.7071 libc.so.6 _int_malloc
686 0.7030 cc1 htab_find_slot_with_hash
684 0.7010 cc1 fold.unlikely_section
684 0.7010 cc1 get_stmt_operands.unlikely_section
675 0.6917 vmlinux-26-quilt do_no_page
539 0.5524 libc.so.6 memcpy
464 0.4755 cc1 pool_alloc.unlikely_section
463 0.4745 cc1 wrapup_global_declarations.unlikely_section
458 0.4694 vmlinux-26-quilt unmap_vmas
447 0.4581 vmlinux-26-quilt __rmqueue
423 0.4335 cc1 cpp_get_token
418 0.4284 vmlinux-26-quilt scheduler_tick
414 0.4243 ld-2.3.3.so _dl_relocate_object
405 0.4150 libc.so.6 _IO_vfprintf_internal
383 0.3925 ld-2.3.3.so do_lookup_x
380 0.3894 cc1 rewrite_stmt
372 0.3812 vmlinux-26-quilt buffered_rmqueue
370 0.3792 cc1 check_global_declarations.unlikely_section
349 0.3577 libc.so.6 malloc_consolidate
331 0.3392 cc1 tree_ssa_dominator_optimize
322 0.3300 cc1 list_length.unlikely_section
321 0.3290 cc1 type_hash_eq
301 0.3085 cc1 recog.unlikely_section
296 0.3033 cc1 gimplify_expr.unlikely_section
293 0.3003 cc1 c_write_global_declarations_1
Random notes (same as last time):
4.0:
Configured with: configure --disable-checking --enable-languages=c,c++ --prefix=/pkg/gcc-4.0-041029
Thread model: posix
gcc version 4.0.0 20041030 (experimental)
vs 3.3-hammer compiler from suse 9.1:
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local
+--infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada
+--disable-checking --libdir=/usr/lib64 --enable-libgcj --with-gxx-include-dir=/usr/include/g++
+--with-slibdir=/lib64 --with-system-zlib --enable-shared --enable-__cxa_atexit x86_64-suse-linux
Thread model: posix
gcc version 3.3.3 (SuSE Linux)
A Linux kernel is a complex C program compiled with -O2.
The files are relatively short, but it has a lot of includes to
process. The standard compile options are:
-Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -O2 -fomit-frame-pointer
-mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare
-fno-asynchronous-unwind-tables -funit-at-a-time
-g is not used.
The machine has enough memory that everything was cached.
All compilations were run with -j8, which normally gives the best result on this machine.
-Andi