Optimising GCC
This project, part of Google Summer of Code 2011, is about optimising GCC in both CPU and memory utilisation. Instead of targeting only a specific part, the project also includes profiling the whole codebase to find the hotspots, which may be spread across many parts of the compiler or hidden inside data structures used in many places. Even though GCC's runtime is already almost evenly split across passes, gathering small gains from many parts will hopefully add up to a noticeable improvement.
Emphasis, at least in the beginning, will be given to improving non-optimising builds (-O0 -g3), which are the most frequent in a developer's workflow. Later on, the time spent in optimising passes (-O2) will be measured, and g++-specific optimisations will be considered near the end of the project.
Interesting links:
Final report with links to all patches
Work done to speedup/cleanup the Dataflow Analysis part: (1) (2)
Avoid fprintf() calls when outputting assembly. There was a discussion about whether this makes the code less maintainable and whether such optimisations should instead be left to the compiler.
Thread discussing whether the mem_attrs/reg_attrs hash tables should be removed, with cool measurements from matz
var_tracking is currently too complex; should we simplify it?
Thread with various small improvements. Each one is insignificant by itself, but they add up to a small but noticeable speedup.
This page holds notes regarding the ongoing GCC optimisation project.
How to find hotspots
- Valgrind's callgrind gives good information. It displays both cumulative and non-cumulative time spent in each function. It also records the full call graph, which is necessary to spot the most common path into the hottest functions. Plus, it can annotate the source with the total number of instructions spent on each line.
- On the minus side, the "instruction count" metric it offers is arguably not enough, given all the levels of processor cache in modern CPUs. There are options, however, to emulate the caches and report L1 and L2 misses.
- Another useful trick is to compile with function inlining disabled, so that all functions show on the profiler. Note that this might produce unrealistic results.
CFLAGS="-O2 -g3 -fno-inline -fno-inline-small-functions -fno-indirect-inlining -fno-inline-functions"
- TODO: rewrite various huge macros as inline functions; bitmaps and obstacks come to mind. This should help measure the effect of particular data structures (see the sketch below).
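As a sketch of what that rewrite could look like (BITMAP_SET and bitmap_set are hypothetical names here, not GCC's actual bitmap API): a macro charges its instructions to every caller, while the equivalent inline function, built with -fno-inline, shows up as its own entry in callgrind.

/* Hypothetical macro: its instructions are attributed to each caller,
   so the profiler cannot show the cost of the data structure itself.  */
#define BITMAP_SET(map, bit) \
  ((map)[(bit) / 8] |= (unsigned char) (1 << ((bit) % 8)))

/* The same operation as an inline function: identical code at -O2,
   but with -fno-inline it becomes a separate entry in the profile.  */
static inline void
bitmap_set (unsigned char *map, unsigned int bit)
{
  map[bit / 8] |= (unsigned char) (1 << (bit % 8));
}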
Optimise What?
- Dataflow Analysis
- libc's qsort() called too often. Callstack:
df_scan_blocks() -> df_bb_refs_record() -> df_{insn(MOSTLY),bb,entry_block,exit_block}_refs_collect() -> df_canonize_collection_rec() -> df_sort_and_compress_refs(&collection_rec->def_vec) -> qsort()
- df_scan_blocks() has 2 major callers: rest_of_handle_ira() (247x) and rest_of_handle_df_initialize() (247x)
- Is it by luck that they are called exactly the same number of times? Maybe they sort the same or a similar dataset?
- NO ...
- qsort() in my testcase is called thousands of times, but only for vectors of length 3, 4, 5, 6, 46 or 47. Perhaps there are common elements?
- there is an absolute maximum vector length for each architecture. TODO: allocate such a vector in the beginning and then use quick_push() (see the sketch after this list).
- MAX_RECOG_OPERANDS * MAX_REGS_PER_ADDRESS + FIRST_PSEUDO_REGISTER
- TODO test with custom asm statements with many operands on x86
- comparator to qsort() is mostly df_ref_compare(). What's the ordering that this function imposes? Can it be simplified?
- Since qsort() is called so often for small vectors, maybe implement a custom, lighter algorithm and only call qsort() for the large cases (see the sketch after this list).
- Radix sort on class/regno/type; a simple (insertion or shell) sort for the rest of the fields.
- Another possibility is to compute a 64-bit sort_value based on a polynomial and sort just by comparing this value.
- The value would need to be computed at each df_ref creation; not possible if the df_ref changes later (the value would need updating too: too costly, too complex?)
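A minimal sketch combining the two TODOs above: a fixed-capacity vector with an unchecked quick_push, plus a sort that handles the typical tiny cases with insertion sort and falls back to qsort() only for the large ones. df_ref and the MAX_* macros are GCC's; the vector type, the function names and the threshold are assumptions that would need tuning on real data.

#include <stdlib.h>

/* MAX_RECOG_OPERANDS, MAX_REGS_PER_ADDRESS and FIRST_PSEUDO_REGISTER
   are GCC's per-target macros, giving the bound mentioned above.  */
#define MAX_REFS (MAX_RECOG_OPERANDS * MAX_REGS_PER_ADDRESS \
                  + FIRST_PSEUDO_REGISTER)
#define SMALL_SORT_MAX 16        /* a guess; needs tuning */

typedef struct df_ref_d *df_ref; /* stand-in for GCC's df_ref */

struct ref_vec {
  unsigned num;
  df_ref refs[MAX_REFS];         /* fixed storage, never reallocated */
};

/* quick_push: no capacity check needed, MAX_REFS is an absolute bound.  */
static inline void
ref_vec_quick_push (struct ref_vec *v, df_ref r)
{
  v->refs[v->num++] = r;
}

static void
ref_vec_sort (struct ref_vec *v, int (*cmp) (const void *, const void *))
{
  if (v->num <= SMALL_SORT_MAX)
    {
      /* Insertion sort beats the qsort() call overhead for tiny N.  */
      for (unsigned i = 1; i < v->num; i++)
        {
          df_ref key = v->refs[i];
          unsigned j = i;
          while (j > 0 && cmp (&v->refs[j - 1], &key) > 0)
            {
              v->refs[j] = v->refs[j - 1];
              j--;
            }
          v->refs[j] = key;
        }
    }
  else
    qsort (v->refs, v->num, sizeof (df_ref), cmp);
}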
- Hash tables are used throughout GCC and accessed millions of times, mostly hashtab but also symtab. Various remarks:
- symtab is used in cpp, with a default size of 16384 entries. When it is nearly full (for very large sources), too many collisions occur. We could change the hash function or grow the array sooner.
- Use hash per word (not per byte) algorithms (htab_hash_string(), HASH_NEXT)
- separate function for lookup instead of NO_INSERT (htab_find_slot_with_hash())
- power-of-2 array sizes so no modulo is needed (htab_mod_1() is too costly); see the sketch after this list
- symtab's strings are ggc alloc'ed, compare with obstacks.
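A sketch of the power-of-2 idea (the names below are hypothetical, not libiberty's API): with the table size a power of two, the index computation that htab_mod_1() currently does with a costly division becomes a single AND, both for the initial probe and for the wraparound.

#include <stddef.h>
#include <string.h>

struct strtab {
  const char **slots;   /* NULL means an empty slot */
  size_t size;          /* always a power of 2 */
};

/* Open-addressing lookup with linear probing.  Assumes the table is
   grown before it ever becomes completely full.  */
static const char **
strtab_lookup (struct strtab *tab, const char *key, size_t hash)
{
  size_t mask = tab->size - 1;
  size_t i = hash & mask;              /* instead of hash % prime_size */

  while (tab->slots[i] != NULL && strcmp (tab->slots[i], key) != 0)
    i = (i + 1) & mask;                /* wraparound is also just an AND */

  return &tab->slots[i];               /* matching or empty slot */
}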
- fprintf("%s %#x") called too often, mostly from dw2_asm_output_data(), replace it with fputs() or even fwrite_unlocked()?
- fwrite_unlocked() is the fastest, it should be used whenever we know string length. putc_unlocked() for one or two bytes.
Hmmm, I can't believe I actually looked at the mess in glibc's vfprintf(), but it seems the conversion to hex is non-optimal. I could gain something by converting through a static byte-to-hex-representation array and avoid the vfprintf() calls altogether (see the puthex() sketch below). Priority HIGH, will look into it more after finishing with the hash tables.
- Why are we ever printing decimal offsets? It's assembly output, it should all be hex which is generated/parsed much faster!
- Also look into how GAS parses hex numbers. Always reading numbers in %08x form should make it a bit faster.
- Also ASM_GENERATE_INTERNAL_LABEL() which calls sprintf.
- In dwarf2out_source_line(), if DWARF2_ASM_LINE_DEBUG_INFO is set then ".loc %d %d" is printed; do we really need decimal offsets?
- dw2_asm_output_data_uleb128(), dw2_asm_output_data_sleb128()
- assemble_string()
- In default_internal_label(): ASM_GENERATE_INTERNAL_LABEL()
Where are the memory offsets printed? For example movl %eax, -4(%ebp). Convert them to hex too?
dw2_asm_output_offset() -> dw2_assemble_integer() -> output_addr_const() -> { case CONST_INT, assemble_name() }
output_indirect_string -> { ASM_OUTPUT_LABEL()->assemble_name(), assemble_string() }
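A minimal sketch of the hex idea above (puthex() here is a hypothetical name; Jakub's actual helper is the puthexl() mentioned in the measurements below): convert through a static digit table into a local buffer, then emit with a single fwrite_unlocked() since the length is known, bypassing vfprintf() entirely.

#define _GNU_SOURCE   /* for fwrite_unlocked() */
#include <stdio.h>

static void
puthex (FILE *f, unsigned long val)
{
  static const char hexdig[16] = "0123456789abcdef";
  char buf[2 * sizeof (val)];          /* worst case: 2 chars per byte */
  char *p = buf + sizeof (buf);

  /* Emit nibbles from the least significant end, via table lookup.  */
  do
    {
      *--p = hexdig[val & 0xf];
      val >>= 4;
    }
  while (val != 0);

  fwrite_unlocked (p, 1, (size_t) (buf + sizeof (buf) - p), f);
}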
- bitmaps
- (dis)advantages over sbitmaps and ebitmaps
- How much are they grown? ratio of set_bit to find_bit?
- c_parser
- ggc_internal_{,cleared_}alloc_stat()
- ira_build(), especially process_bb_node_lives() and record_reg_classes()
- also dw2_asm_output_nstring(), which calls the ASM_OUTPUT_ASCII macro, which is different for each assembler
- an overly complex macro: it reads bytes and outputs the proper assembly for each assembler
- merge gimple with gimple_seq_node
too much time spent in qsort(), called too often from df_sort_and_compress_refs(def_vec) <- df_canonize_collection_rec() <- df_insn_refs_collect()
- *df_* operations, for example df_ref_compare() as called from qsort(), or VEC_df_* operations
- madvise() (needs mmap() first) or async I/O to read ahead all the .h files that will be needed?
- LTO type merging on larger projects, when lto1 reads in the various files and unifies the types
- tree.c:iterative_hash_expr() is too hot! Is there a practical way of caching the hash values of already hashed trees? (see the sketch below)
- Major callers: emit-rtl.c:mem_attrs_htab_hash, tree-ssa-dom.c:avail_expr_hash, tree-ssa-sccvn.c:vn_nary_op_compute_hash, tree-ssa-sccvn.c:vn_reference_compute_hash
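One possible caching scheme, as a sketch (the struct layout and the names are made up, not the real tree node): stash the hash in the node the first time it is computed, reserving 0 as the "not computed yet" marker. This only works if a node is never mutated after being hashed, which is exactly the open question.

struct expr_node {
  unsigned int cached_hash;   /* 0 means "not computed yet" */
  int code;                   /* stand-in for the node's real operands */
};

/* Stand-in for the expensive recursive walk of iterative_hash_expr().  */
static unsigned int
compute_hash_slowly (const struct expr_node *e)
{
  return (unsigned int) e->code * 2654435761u;
}

static unsigned int
expr_hash (struct expr_node *e)
{
  if (e->cached_hash == 0)
    {
      unsigned int h = compute_hash_slowly (e);
      e->cached_hash = h ? h : 1;   /* never store the 0 marker itself */
    }
  return e->cached_hash;
}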
various TODO
- Does garbage collection hurt CPU cache efficiency?
- Must compare both garbage-collected and custom-allocator versions of algorithm
- EDGE_COUNT is hot, use quick_length()
- change VEC interface to not allow NULLs
As it is now, it is very hard to change. For example, the bb->{preds,succs} vectors are used all over the place, and it is quite common for them to be NULL when of zero length! If, wherever we now set them to NULL, we had to GC-allocate them instead, it would probably cost a lot. Currently they are initialised to NULL in alloc_block() by the call to ggc_alloc_cleared_basic_block_def() for the whole basic_block struct.
- reginfo.c: change char global_regs[FIRST_PSEUDO_REGISTER] to an sbitmap (see the sketch below)
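A sketch of that change (helper names hypothetical; FIRST_PSEUDO_REGISTER is GCC's per-target macro): one bit per register instead of one byte, so the whole set fits in a handful of words and membership is a shift and a mask.

#include <limits.h>

#define BITS_PER_WORD_ (CHAR_BIT * sizeof (unsigned long))
#define GLOBAL_REGS_WORDS \
  ((FIRST_PSEUDO_REGISTER + BITS_PER_WORD_ - 1) / BITS_PER_WORD_)

static unsigned long global_regs_bits[GLOBAL_REGS_WORDS];

static inline int
global_reg_p (unsigned int regno)
{
  return (global_regs_bits[regno / BITS_PER_WORD_]
          >> (regno % BITS_PER_WORD_)) & 1;
}

static inline void
set_global_reg (unsigned int regno)
{
  global_regs_bits[regno / BITS_PER_WORD_]
    |= 1ul << (regno % BITS_PER_WORD_);
}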
Benchmarking Methodology
- To enable gcc_assert() pass --enable-checking=assert to configure
- Build static builds of 4.6.0, always using the same 4.5.3 compiler. How to build:
~/dist/src/gcc-4.6.0-mine/configure --enable-languages=c --disable-bootstrap --disable-plugins --disable-multilib \
    CFLAGS="-O2 -g3 -march=i686 -mtune=generic -fno-inline -fno-inline-small-functions -fno-indirect-inlining -fno-inline-functions" \
    BOOT_LDFLAGS="-static -L/home/jimis/dist/opt/tgt-glibc-2.13/lib" \
    LDFLAGS="-static -L/home/jimis/dist/opt/tgt-glibc-2.13/lib" \
    CC=/home/jimis/dist/opt/tgt-gcc-4.5.3/bin/gcc
make -j4 BOOT_LDFLAGS="-static -L/home/jimis/dist/opt/tgt-glibc-2.13/lib" LDFLAGS="-static -L/home/jimis/dist/opt/tgt-glibc-2.13/lib"
make prefix=/home/jimis/dist/opt/tgt5e-gcc-4.6.0-noinline-static/ install
- Use "valgrind --tool=callgrind CMD" to profile
Profile output is in CPU instructions; it includes neither time spent in system calls nor time spent in I/O wait.
- Use "time CMD" on old 500MHz PC to measure time, repeatedly for hot caches.
- System call time shown as "sys"
- I/O wait is real-(user+sys)
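- For example, in the baseline run below: 6.823 - (6.524 + 0.297) = 0.002s of I/O wait, i.e. with hot caches there is practically none.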
Profiler output (compiling the kernel's tcp_ipv4.c with -O0 -g3)
- Base case, vanilla gcc 4.6.0:
4.6.0-baseline_prof.txt: instructions spent in each function separately.
4.6.0-baseline_prof-tree.txt: full call graph, marking the callers and callees of each function.
4.6.0-baseline_prof-inclusive-annotated.txt: instructions spent in each function including callees; also contains annotated source.
real    0m6.823s
user    0m6.524s
sys     0m0.297s
- After replacing the hottest fprintf() callers with fwrite(), Jakub's puthexl(), reducing hash table collisions, and other minor changes:
real    0m6.712s
user    0m6.400s
sys     0m0.309s
- Made the following changes that helped profiling:
- removed inline keywords from vec.[ch]
- created function wrappers for obstack.[ch], plus wrote obstack_finish() and obstack_blank() as normal C functions instead of preprocessor macros
- Rewrote lookup_attribute() as a static inline function in tree.h, calling strlen(), which is usually optimised away for compile-time-constant strings (see the sketch at the end of this page)
- The most important changes were in the Dataflow Analysis part:
- Small gain by using vec_quick_length(), vec_quick_push() for stack allocated vectors in df_bb_refs_record()
- Changed order of REFs pushed into collection_rec in df_insn_refs_collect() and df_get_call_refs().
- Most calls to qsort() were mitigated that way. Thanks to Paolo Bonzini for the invaluable idea and help.
- In df_get_call_refs(), changed the defs_generated bitmap to a HARD_REG_SET, and used the regs_invalidated_by_call HARD_REG_SET instead of a bitmap
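For reference, a simplified sketch of the lookup_attribute() rewrite mentioned above (the types and the worker name are stand-ins, not the real tree.h code): since callers almost always pass a string literal, the compiler folds the strlen() call to a constant, and the worker can reject most candidates with a cheap length comparison.

#include <stddef.h>
#include <string.h>

struct attr {                  /* stand-in for GCC's attribute list */
  const char *name;
  struct attr *next;
};

/* Out-of-line worker doing the actual list walk.  */
static struct attr *
lookup_attribute_1 (const char *name, size_t len, struct attr *list)
{
  for (; list; list = list->next)
    if (strlen (list->name) == len && memcmp (list->name, name, len) == 0)
      return list;
  return NULL;
}

/* Inline wrapper: for lookup_attribute ("const", list) the compiler
   folds strlen ("const") to 5 at compile time.  */
static inline struct attr *
lookup_attribute (const char *name, struct attr *list)
{
  return lookup_attribute_1 (name, strlen (name), list);
}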