This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


More on compile performance of Linux kernels in mainline gcc


This is an addendum to the numbers for Linux kernel compilation
on x86-64 that I posted some days ago. The gcc tested is the same (041029),
on the same machine with the same kernel tree and configuration.

I tracked down why the 4.0 compiled kernels didn't boot. One issue
was a missing -fno-strict-aliasing for one file (now fixed);
the other is a miscompilation of a loop in a function in the Linux
radix tree library (PR18241). The miscompilation can be worked around
by compiling the affected file with -O0.
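
As an illustration of why a missing -fno-strict-aliasing can matter, here is a minimal sketch of the kind of type punning that the flag keeps safe, next to a memcpy-based version that is valid with or without it. The function names are hypothetical and are not taken from the kernel file in question, and the example assumes a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical example, not kernel code.
 * Reads the bits of a float through a uint32_t pointer. Under the
 * strict aliasing rules (which gcc applies at -O2) this access is
 * undefined, so the optimizer may reorder or drop it.
 * -fno-strict-aliasing tells gcc to assume such pointers may alias. */
uint32_t float_bits_punned(const float *f)
{
    return *(const uint32_t *)f;
}

/* Same intent without violating the aliasing rules: copy the bytes. */
uint32_t float_bits_memcpy(const float *f)
{
    uint32_t bits;
    memcpy(&bits, f, sizeof bits);
    return bits;
}
```

Code written in the first style relies on the flag; code written in the second style does not care either way.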

There are a lot of new warnings. In particular,
pointer targets in passing argument 2 of `foo' differ in signedness
is extremely common.
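
For reference, a small sketch of the pattern behind this warning, with made-up function names (nothing here is from the kernel tree). Passing an unsigned char pointer where a plain char pointer is expected is what gets flagged; an explicit cast is one way to state the intent.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper taking a plain char pointer. */
size_t name_len(const char *name)
{
    return strlen(name);
}

/* Calling name_len(buf) directly is the pattern that produces
 * "pointer targets in passing argument 1 of 'name_len' differ in
 * signedness", because unsigned char * and char * are distinct types.
 * The explicit cast states the intent and avoids the warning. */
size_t buf_name_len(const unsigned char *buf)
{
    return name_len((const char *)buf);
}
```

Whether to cast at each call site or change the parameter type is a judgment call for the code in question.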

I was asked to retry with a mainline gcc built with
make profiledbootstrap.

This improves the 4.0 numbers somewhat.

gcc 3.3-hammer (profiledbootstrap) 
210.32user 31.62system 3:57.66elapsed

4.0 snapshot with normal bootstrap:
262.71user 30.50system 4:48.46elapsed

4.0 snapshot with profiledbootstrap:
248.01user 30.25system 4:33.66elapsed 

Still considerably slower than 3.3-hammer though.
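
To put a figure on "considerably", the relative slowdown can be computed from the times above. A minimal sketch, with hypothetical helper names:

```c
/* Relative slowdown in percent of a measured run against a baseline,
 * both given in seconds. */
double slowdown_percent(double baseline, double measured)
{
    return (measured / baseline - 1.0) * 100.0;
}

/* Convert a time(1) style "minutes:seconds" elapsed value to seconds. */
double elapsed_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}
```

With the user times listed, slowdown_percent(210.32, 262.71) comes out at roughly 24.9 for the normal bootstrap and slowdown_percent(210.32, 248.01) at roughly 17.9 for the profiledbootstrap compiler. On elapsed time the figures are roughly 21.4 and 15.1.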

Also Jan asked for oprofile output. Here are all symbols over 0.3%
for a full kernel compile done with the profiledbootstrap compiler.

Looks like the likely/unlikely split is not very effective,
there are a lot of hot unlikely hits. 
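
For readers who have not met the likely/unlikely idea: the sketch below shows it at source level using gcc's __builtin_expect. This is only an illustration. In a profiledbootstrap compiler the split would presumably be driven by profile data rather than by annotations like these, and the function is made up.

```c
/* Branch hints as commonly wrapped in C code built with gcc. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical example: sum an array, treating negative entries as a
 * rare error. The unlikely() branch is the part a compiler is free to
 * move away from the hot path. */
long sum_nonnegative(const int *v, int n, int *errors)
{
    long sum = 0;
    *errors = 0;
    for (int i = 0; i < n; i++) {
        if (unlikely(v[i] < 0)) {
            (*errors)++;
            continue;
        }
        sum += v[i];
    }
    return sum;
}
```

If code that was predicted cold ends up with many profile hits, as in the listing, the prediction that drove the split did not match this workload.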

Some hash table lookup(s?) seem to be very hot; perhaps they need
a better hash function or a larger table?

1.7% memset is somewhat worrying, that's a lot of clearing.
1.5% garbage collector accounting looks like a bug if that function
     isn't misnamed.

Standard GLOBAL_POWER_EVENTS:
samples   %       app name                 symbol name
695020    4.0626  cc1                      yyparse.unlikely_section
438612    2.5638  cc1                      ht_lookup_with_hash
298462    1.7446  libc.so.6                memset
288277    1.6851  cc1                      _cpp_lex_direct
265789    1.5536  cc1                      ggc_alloc_stat.unlikely_section
264069    1.5436  vmlinux-26-quilt         clear_page
246603    1.4415  libc.so.6                _int_malloc
195981    1.1456  vmlinux-26-quilt         page_fault
180141    1.0530  cc1                      walk_tree.unlikely_section
178419    1.0429  cc1                      _cpp_clean_line
172287    1.0071  cc1                      cpp_get_token
171537    1.0027  cc1                      htab_find_slot_with_hash
164198    0.9598  cc1                      for_each_rtx.unlikely_section
160974    0.9409  cc1                      cse_insn
145394    0.8499  oprofiled                odb_insert
139131    0.8133  cc1                      constrain_operands.unlikely_section
129540    0.7572  libc.so.6                strlen
118542    0.6929  cc1                      fold.unlikely_section
109847    0.6421  cc1                      synth_mult
109798    0.6418  as                       (no symbols)
108898    0.6365  cc1                      gimplify_expr.unlikely_section
96304     0.5629  cc1                      record_reg_classes
95929     0.5607  cc1                      lex_identifier
94718     0.5537  cc1                      fold_rtx
91228     0.5333  cc1                      reg_scan_mark_refs
88666     0.5183  cc1                      make_node_stat.unlikely_section
88218     0.5157  cc1                      yylex
87855     0.5135  cc1                      mark_set_1
87435     0.5111  libc.so.6                memcpy
85209     0.4981  cc1                      c_lex_with_flags.unlikely_section
82712     0.4835  cc1                      build_int_cst_wide.unlikely_section
79861     0.4668  vmlinux-26-quilt         do_no_page
79044     0.4620  cc1                      canon_reg
78793     0.4606  cc1                      extract_insn.unlikely_section
74090     0.4331  cc1                      grokdeclarator
73247     0.4282  libc.so.6                __cfree
71708     0.4192  cc1                      pop_scope.unlikely_section
71695     0.4191  cc1                      init_alias_analysis.unlikely_section
71092     0.4156  cc1                      iterative_hash_expr.unlikely_section
69631     0.4070  cc1                      note_stores.unlikely_section
68900     0.4027  libc.so.6                __GI___libc_malloc
67924     0.3970  libc.so.6                __calloc
65767     0.3844  cc1                      hash_rtx.unlikely_section
65629     0.3836  cc1                      et_splay
65296     0.3817  cc1                      mark_used_regs
62908     0.3677  cc1                      find_reloads.unlikely_section
62825     0.3672  libc.so.6                _int_free
62063     0.3628  cc1                      ix86_rtx_costs
61419     0.3590  cc1                      count_reg_usage
61287     0.3582  cc1                      rtx_cost.unlikely_section
60488     0.3536  cc1                      get_stmt_operands.unlikely_section
60202     0.3519  cc1                      htab_delete
60164     0.3517  cc1                      force_fit_type.unlikely_section
58552     0.3423  libc.so.6                strcmp
57048     0.3335  cc1                      bitmap_set_bit.unlikely_section
56734     0.3316  cc1                      tree_code_size.unlikely_section
55753     0.3259  libc.so.6                malloc_consolidate
54495     0.3185  cc1                      get_cse_reg_info
54318     0.3175  cc1                      invalidate
53892     0.3150  cc1                      int_const_binop.unlikely_section
52898     0.3092  cc1                      _cpp_lex_token
52622     0.3076  cc1                      propagate_one_insn.unlikely_section
52358     0.3061  vmlinux-26-quilt         copy_user_generic
51972     0.3038  cc1                      ix86_decompose_address.unlikely_section

L2 cache misses (with two 1MB caches): 

The hot clear_page is the kernel's function to clear pages
allocated by page faults.

Looks like that hot hash lookup is just a bad cache pig. Maybe
the hash table is not too small, but too big? 
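
A rough way to reason about "too big" is to compare the footprint of the table's slot array with the cache size. The sketch below assumes a table of pointer-sized slots. The helper names are hypothetical, and the actual layout and size of the table gcc uses have not been checked here.

```c
#include <stddef.h>

/* Bytes occupied by the slot array of a hash table with nslots
 * pointer-sized entries (the objects they point to come on top). */
size_t table_bytes(size_t nslots)
{
    return nslots * sizeof(void *);
}

/* Nonzero if the slot array alone fits in a cache of cache_bytes. */
int fits_in_cache(size_t nslots, size_t cache_bytes)
{
    return table_bytes(nslots) <= cache_bytes;
}
```

With 8-byte pointers, 131072 slots already fill a 1MB cache by themselves, so a sparsely filled table larger than that would tend to miss on most lookups.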

yyparse is surprisingly cache intensive.

Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x300 (multiple flags) count 10000
samples  %        app name                 symbol name
22564    23.1238  vmlinux-26-quilt         clear_page
4785      4.9037  cc1                      ht_lookup_with_hash
1859      1.9051  vmlinux-26-quilt         copy_page
1487      1.5239  vmlinux-26-quilt         copy_user_generic
1364      1.3978  cc1                      pop_scope.unlikely_section
1077      1.1037  libc.so.6                memset
886       0.9080  cc1                      yyparse.unlikely_section
727       0.7450  cc1                      ggc_alloc_stat.unlikely_section
690       0.7071  libc.so.6                _int_malloc
686       0.7030  cc1                      htab_find_slot_with_hash
684       0.7010  cc1                      fold.unlikely_section
684       0.7010  cc1                      get_stmt_operands.unlikely_section
675       0.6917  vmlinux-26-quilt         do_no_page
539       0.5524  libc.so.6                memcpy
464       0.4755  cc1                      pool_alloc.unlikely_section
463       0.4745  cc1                      wrapup_global_declarations.unlikely_section
458       0.4694  vmlinux-26-quilt         unmap_vmas
447       0.4581  vmlinux-26-quilt         __rmqueue
423       0.4335  cc1                      cpp_get_token
418       0.4284  vmlinux-26-quilt         scheduler_tick
414       0.4243  ld-2.3.3.so              _dl_relocate_object
405       0.4150  libc.so.6                _IO_vfprintf_internal
383       0.3925  ld-2.3.3.so              do_lookup_x
380       0.3894  cc1                      rewrite_stmt
372       0.3812  vmlinux-26-quilt         buffered_rmqueue
370       0.3792  cc1                      check_global_declarations.unlikely_section
349       0.3577  libc.so.6                malloc_consolidate
331       0.3392  cc1                      tree_ssa_dominator_optimize
322       0.3300  cc1                      list_length.unlikely_section
321       0.3290  cc1                      type_hash_eq
301       0.3085  cc1                      recog.unlikely_section
296       0.3033  cc1                      gimplify_expr.unlikely_section
293       0.3003  cc1                      c_write_global_declarations_1

Random notes (same as last time):

4.0:
Configured with: configure --disable-checking --enable-languages=c,c++ --prefix=/pkg/gcc-4.0-041029
Thread model: posix
gcc version 4.0.0 20041030 (experimental)

vs. the 3.3-hammer compiler from SuSE 9.1:
Configured with: ../configure --enable-threads=posix --prefix=/usr --with-local-prefix=/usr/local
+--infodir=/usr/share/info --mandir=/usr/share/man --enable-languages=c,c++,f77,objc,java,ada
+--disable-checking --libdir=/usr/lib64 --enable-libgcj --with-gxx-include-dir=/usr/include/g++
+--with-slibdir=/lib64 --with-system-zlib --enable-shared --enable-__cxa_atexit x86_64-suse-linux
Thread model: posix
gcc version 3.3.3 (SuSE Linux)

A Linux kernel is a complex C program compiled with -O2.
The files are relatively short, but each has a lot of includes to
process. The standard compile options are:
-Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -O2 -fomit-frame-pointer
-mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare
-fno-asynchronous-unwind-tables -funit-at-a-time

-g is not used.

The machine has enough memory that everything was cached.

All compilations were run with -j8, which normally gives the best results on this machine.

-Andi

