- Some general thoughts on memory usage
- Generic data structure changes
- Removal of duplicate effort.
- I/O reduction.
- Better algorithms
- Garbage collection / general memory allocation issues.
- C++ improvements
- Lexer, cpplib
- Improvements to PCH
- Code generation improvements which would help performance
- Pie in the sky
- Integrated changes
GCC is generally agreed to be too slow. There are a lot of places where it could be made faster without sacrificing functionality. What follows is a long list of brainstormed ideas. Some are easy, some are hard; some are worth doing, others are not. We didn't consider optimizer performance much; if you have ideas for improving performance with the optimizers on, please add them. Also, please add hyperlinks to this document.
GCC has a few known compile time bottlenecks that are caused by a very poor choice of algorithms. As a result, even at -O0 we manage to trigger some essentially quadratic algorithms, and with optimization enabled we have some really ugly offenders. The best known ones are CSE and regclass. Needless to say, such algorithms make the compiler slow.
Basic rule #1:
DO NOT ADD ALGORITHMS WITH QUADRATIC OR WORSE BEHAVIOR, EVER.
People submitting algorithms with known nonlinear behavior often claim that "it should never be a problem in real code". Don't believe that kind of talk, there is always someone who will find code that shows terrible compile time behavior with such algorithms, and even for normal, real-world code, such algorithms *do* slow down the compiler.
Some general thoughts on memory usage
Does reducing the memory footprint of GCC actually help performance? We think so, but we haven't any evidence. Small (1%) memory savings are easy, but seem to translate to <0.5% time improvements in practice. Time characteristics are very different for hot-cache (repeated run on the same file) than cold (large project build). There might be a plateau effect, where there is no dramatic improvement until the entire working set fits inside (for instance) the amount of memory that can be simultaneously addressed without causing TLB refills. Kernel-inclusive profiling on Darwin indicates a lot of time (>10% of total wall time) spent in VM subsystem. We don't know what these routines do. Would be interesting to talk to Darwin kernel people and find out. Would also be interesting to compare with other operating systems (Linux/PPC e.g.) (Andrew Pinski thinks most of this time is zeroing out pages as they get allocated in).
Avoiding allocations with ggc_alloc is almost always a win. Recent examples of that are in tree-into-ssa.c which used to allocate objects that were only live in the into-SSA pass in GC memory. Moving to allocation on the heap resulted in a 2% bootstrap speedup! Even larger speedups resulted from allocating VALUE_RTX on the heap in cselib.c. So it is almost certainly a win to allocate things that have a known live time (ie. only within one pass) on the heap instead of in GC space.
Generic data structure changes
These changes should be investigated, but they are not suitable for Google's Summer of Code. If you are looking for a Summer of Code project, please do something else.
- Our splay trees are actually very slow compared to most splay tree implementations, making them pretty much unusable for things we want to do. It would be nice if they could be sped up. Frank Eigler has been making some speed related changes for use in libmudflap that might be helpful here. Splay trees are used to represent alias sets in RTL, among other things.
Bitmaps, also called sparse bit sets, are implemented using a linked list with a cache. This is probably not the most time-efficient representation, and it is not unusual for bitmap functions to show up high on the execution profile. Bitmaps are used for many things, such as for live register sets at the entry and exit of basic blocks in RTL, or for a great number of data flow problems. See bitmap.c (and sbitmap.c for GCC's simple bitmap implementation). Several alternative sparse bit set implementations have been proposed, such as so-called ebitmaps (ebitmap.c, but removed from trunk because nobody used them or even just experimented with them), and more recently a hybrid linked-list/splay-tree representation (patch for bitmap.c).
- Hash tables are used in many time-critical paths in GCC, so speeding up the generic hash table implementation (see hash-table.h) will almost always pay off in compile time. One possible improvement would be to replace calls to traverse with use of the element iterator.
Types: about 10% of memory usage. One obvious problem is that we have a different type for anything that's even slightly different. In addition, they're all chained to each other, and we have to do a search through the chain. Typedefs for int all get their own version, all chained together. But duplicating nodes isn't useful for ints, only for classes. Finally, in addition to duplicating nodes for typedefs, we duplicate nodes for function types. We shouldn't have a separate copy of "returns bool and takes no arguments" for every member function in a class. One reason we have this problem is that default arguments are represented in the type, which isn't where it belongs. More thoughts about the TypeSystem.
Expressions, ie. *_EXPR are the majority of trees. We used to just not care about their size, because they were generated on an obstack and then thrown away, back when GCC expanded one statement at a time to RTL. But now that we have ggc, unit-at-a-time, and tree-ssa, that's no longer true. EXPR nodes are actually pretty large: a large tree_common most of which is irrelevant. We shouldn't need a chain field (ie. TREE_CHAIN , since it's just a vector. We shouldn't need a type node, since that information is almost always in the arguments. We shouldn't need all those flags. TREE_CHAIN is hard because it affects all tree nodes via tree_common
(Practical issue: getting rid of anything in tree_common is very hard because the accessor macros don't restrict access, so getting rid of even a single bit means look at how it's used in all of the front ends and all of the back ends.)
One problem is that the various front end parsers do these things differently. How to make changes in the representation of function_decl that the C and C++ parsers can both deal with? One possible answer: new kinds of DECL nodes for C++ only.
Indeed, one school of thought is that everything like types, decls, expressions and statments should be front-end specific, and that each front end should hand down a common, simpler and genuinely language-independent representation to the middle end. A reason for many of the problems outlined above is that all front ends are trying to re-use common data structures to express language-specific constructs.
TREE_LIST must die. Or, at, least, TREE_CHAIN Most nodes do not need it. In particular, EXPR nodes do not need it.
Temporary trees. Don't put them in GC memory, or, better, don't create them at all. Horrible example is parsing a static array initialization with lots of elements in it. (Largely because of figuring out where the curly braces go.)
NON_LVALUE_EXPR must die, it is really only used because the C set of front-ends cannot keep track of lvalueness themselves currently. It will also fix a couple other issues and keep several from showing up.
Split up TYPE nodes - one structure per subtype (makes many of them a lot smaller, DECL nodes are already done this way)
Reduce information carried by each DECL subtype to the bare minimum
- Compact representation of function parameters
Compact representation of structure/array initializers. Not with a linked list of TREE_LIST like we have now (this also effectively makes optimizing loads from static constant initializers a job with potential quadratic behavior right now)
- Lightweight representation of GIMPLE temporaries, these should be cheap because we create thousands of them
Kill TREE_LIST and TREE_VECTOR
- Eliminate cv and typedef copies of types.
- Improve attribute handling
- Get rid of unused fields in common tree nodes (hard)
- Avoid recursive tree walks
Make sure DECL_RTL and DECL_ASSEMBLER_NAME are as lazy as possible.
DECL_ASSEMBLER_NAME shouldn't be a full identifier.
Replace fold (buildN (...)) with fold_buildN (...) which does not create a scratch tree node (much like simplify_gen_binary . Better yet, see if you can use fold_binary and its friends instead of fold_buildN
Remove artificial labels during the life time of tree-level cfg. Bonus points for coming up with a way to handle SWITCH_EXPR without labels.
If we are running CCP, copy-prop, FRE, and DCE (in this order), we should call tree-cfg.c:cleanup_control_flow after CCP and copy-prop instead of the full-fledged cleanup_tree_cfg This is because CCP and copy-prop visit (possibly a subset of) originally reachable blocks. We may want to use flag EDGE_REACHABLE to go ahead and remove unreachable blocks after propagation and before substitution. This way, we can prevent tree-ssa-ccp.c:substitute_and_fold and tree-ssa-pre.c:eliminate from scanning dead blocks. 1. tree-into-ssa.c:find_idf can have a worklist of fixed size, namely 2 * n_basic_blocks We should probably keep a single worklist allocated during the whole PHI insertion.
Split simulate_stmt into simulate_stmt and simulate_phi because the caller of simulate_stmt knows whether it's got a statement or a PHI node.
The primary cost of many passes isn't so much what they do, but the fact that they have to do it to every instruction. Looking at lots of memory, walk into every instruction, etc. Proposal to fix: right now RTL is a lisp-like representation to represent details of every instruction. Wasteful: "add two things and set the condition flag" requires lots of RTL, and we have to say the whole thing again, verbosely, for every add. Geoff suggestion: compressed RTL. Instead of a list form, represent it as an instruction number followed by operands. Much more compact. Can autogenerate routines that answer questions like "given this instruction which registers might it clobber." If necessary can convert it to real RTL later.
RTL uses about 20% of memory, so that's the maximum space saving we could get. Also could save some time from these recursive walks over RTL. Geoff guess: 10-20% performance gain.
Other ideas for RTL:
Get rid of insn notes (not to be confused with register notes). For example, NOTE_INSN_BLOCK_BEG and NOTE_INSN_BLOCK_END are only emitted and used by final, it should be quite easy to get rid of them. The same is true for NOTE_INSN_EH_REGION_BEG and NOTE_INSN_EH_REGION_END though these two are probably a bit harder.
- Use vector of insns, not doubly-linked list. (Would work better if we had a separate vector for each basic block.)
Get rid of sparse lookup arrays. RTL optimizers frequently have arrays of pass-specific data indexed by insn or reg number, which are sparse. Or, to avoid that, arrays mapping insn uids to monotonically increasing ids (grep for LUID and CUID , which index the real data. We should compress insns and insns UIDs (and maybe even reg numbers) between passes, or during each RTL CFG cleanup. There should be some reverse mapping for the sake of debug dumps, but perhaps just dumping the insn pointer is good enough for that.
- Don't generate RTL for things that aren't used for code generation, e.g. parm decls for functions that aren't defined shouldn't get DECL_RTL, and neither should unused static variables local to a translation unit.
Reduce size of insn-recog.c and insn-attrtab.c One way to do this is to play with eliminating the "optimizations" that the gen* files apply to the RTL for the machine descriptions. A 80% reduction on the size of insn-attrtab.c has been reported using this technique, at almost no compile time cost. Figure out what optimizations make sense and which ones should go away.
insn-attrtab.c is especially huge for backends with many subtargets, mostly because of the DFA scheduler descriptions. Right now each backend has a single, massive DFA representing all scheduler descriptions for all subtargets. Yes, if you have separate Pentium, K6, PentiumPro and AMD64 scheduling descriptions, [gccsource:genattrtab.c] will smash them all together into one big fat automaton. Splitting them up in per-CPU automata might not make insn-attrtab.c smaller per se, but it will improve locality within the various automata, perhaps giving a speedup to the scheduler.
Have back ends use the mode macros facility which is new for GCC 4.0. While the generated back end code is exactly the same, it does make the backend source files significantly smaller and easier to read, this improving maintainability and perhaps allowing further simplifications and improvements. Many missing patterns have already been discovered simply by rewriting existing similar patterns for n different modes as a single pattern with a mode macro for all those modes.
Replace for_each_rtx and friends with iterators (hard).
Get rid of EXPR_LIST and INSN_LIST
- Replace the forward propagation and instruction combining/selecting work done in [gccsource:combine.c] and [gccsource:cse.c] (and elsewhere, these are just the most important ones) with something using on "real" dataflow based forward propagation, perhaps even based on RTL SSA (hard, if at all possible).
Removal of duplicate effort.
Avoid destroying and recreating "on the side" data structures, such as alias information. Alias analysis information for RTL is recomputed from almost-scratch by many passes with init_alias_analysis() in [gccsource:alias.c]. Perhaps it is possible to maintain the tables computed there over passes.
Avoid translating one representation to another with no semantic difference. For example, on RTL we have two modes of the CFG: Normal RTL-CFG mode, where the CFG is just a web on top of the instruction stream, and cfglayout_mode, where the basic blocks are containers whose content can be moved around more freely. Semantically these representations are mostly equivalent. Going into and out of cfglayout mode is therefore really a waste of effort.
- Remove redundant conditionals. It is surprising how often we check for the same condition twice or more. On modern CPUs, conditionals can be quite expensive, so it really does pay off to avoid them. This usually is a trade-off between abstraction and speed, but also a matter of proper interfaces (with function comments describing preconditions and so on).
- More assembly pseudo-ops instead of repetitive sequences of data. (Requires cooperation from binutils.)
- Use unlocked stdio primitives wherever we aren't already
Known instances of quadratic algorithms
- C++ field layout, enumeration parsing
- C++ final overrider determination
- C++ method adding
One major source of cache misses is cgraph, especially cgraph_remove. Not clear that we should be doing this at all, let alone that we should be doing it this way (with a doubly linked list). We can iterate through the list without removing anything from it: have a small bitmap off on the side.
cgraph was recently updated to use doubly linked lists so removing should not be that critical. Marking nodes dead might result in slowdown as inliner introduces a lot of clone nodes in a progress on C++ modules.
Garbage collection / general memory allocation issues.
Two tracks can be taken here: avoid use of the garbage collector, or improve the garbage collector's algorithms.
It's always a win to avoid doing "scratch" memory allocation. For instance, if one has to do a bit of arithmetic on INTEGER_CST nodes, it is much better to use the' op '_double= functions directly, than to construct an EXPR node describing the calcuation and feed it to fold().
Well-defined projects for this:
- GC bucket for every multiple of a pointer size
ggc_push_context() and ggc_pop_context() are no more. In ggc-page, their machinery is still relied upon by PCH machinery, but maybe it can be special-cased
- Finish ggc-zone, turn it on by default
Have special zones for objects known to be allocated in groups, or known to have a certain life span. For example, SSA_NAME and PHI_NODE trees never live beyond tree-ssa, so putting them in a special SSA zone might help. Similarly, allocating edges and basic blocks in special zones prevents us from allocating them on pages where we also have IL objects (trees, RTXen, etc.).
- Augment ggc-zone into a generational collector, or at least a copying collector. This is hard because C is just not a language that lends itself to garbage collection. A copying collector may require restrictions on the kind of data structures we can use in the compiler for objects that must live in GC memory
Investigate use of Boehm GC. This has been done before, with inconclusive results. See BoehmGCForGCC.
Partition data structures by frequency of use. There is a branch in SVN, the struct-reorg-branch that implements this kind of optimizations in a generic way using profile feedback.
[which of these are being worked on???]
- Avoid unnecessary backtracking in common cases in the parser - can frequently be replaced with an extra token of lookahead, or by reordering cases.
- Speed up template instantiation (how?)
Revise finish_file (Avoid walking through lists to decide what to emit; use triggers instead.)
- Make C++ name lookup faster (how?)
cxx_binding should be 16 bytes, not 20.
- Eliminate binfos for POD structs and other no-inheritance classes
Investigate other improvements to binfos (e.g. suspicion of information duplication between them and the RECORD_TYPE .
- Duplication of work in field layout: once in parsing, once in layout.
- Unify representation of tokens in preprocessor and front end
- Shared buffer between preprocessor and front end
- Get rid of fields in data structures that are only used in the error recovery path.
Don't parse inline functions (just save the tokens till we know we need them). For C++, notice that the infrastructure for this is partly done because we already defer parsing the body of functions defined within classes till the end of class definition. This is done by saving the tokens and parsing them later (see cp_parser_late_parsing_for_member . We should try and be even lazier and wait until the first real use. But -- how do we reject invalid code if we go this way?
C++ front end does code-generation work twice for destruction of static data - PR17422
- Vtable, VTT construction and final overrider searching is overly complicated, redoes work and contains some quadratic behaviour. We can hold class-specific chunks of vtable with each class, and then reuse that information when deriving. similarly we can build up lists of overridden functions with each overrider when we do the covariant return type semantic checking. All of this change is probably not suitable for stage 3, but perhaps small parts are.
We're allocating a writable buffer because of rare cases (non-ASCII character sets, trigraphs, continued lines). Why not mmap it in read-only for the vast majority of cases? Need to be careful here, mmap has its own overhead, OS may only do readahead for read()
Cost of those rare cases could be reduced. Don't need to run the entire buffer through iconv() when the character set allows us to find line boundaries reliably before conversion to UTF-8. (Most do. The necessary property is that bytes 0x0a and 0x0d must never constitute part of a multi-byte character; always single byte characters standing for LF, CR respectively.)
- Could maybe handle comment stripping with \-newline processing instead of tokenization. This would make the tokenizer's inner loop tighter.
Have cpplib use readdir() to populate its tables of headers that exist.
- Put all system headers in a jar file, teach cpplib to map the entire thing; avoids lots of system calls.
- Avoid putting things in the identifier hash table which are not user-visible identifiers (GIMPLE temps, C++ mangled names, etc)
- Replace identifier hash table with a better data structure (have already tried a ternary tree, it's not faster; could try to code it even cleverer than it already was; B* trees might be worth looking into)
Improvements to PCH
- Put all of cpplib's data structures in garbage collected memory so they are reloaded from PCH by a direct mmap.
- When PCH is in use, don't initialize any data structures that will be overwritten by the PCH load (e.g. don't create builtin function decls)
- Mangle names, instantiate templates, and generate debug information at pch generation time
- Avoid writing to memory loaded from PCH (incurs copy-on-write faults).
Code generation improvements which would help performance
- It's often said that doing some limited amount of optimization at -O0 would be a net win: unreachable code elimination, possibly constant propagation. Need hard numbers though, and must not hinder debuggability.
- At -O0, reload has to do tons of extra work because the register allocator basically dumps everything to the stack. This also bloats the generated assembly.
- Some micro-optimizations would help gcc's own code a lot
- Use frepo-like idea when compiling multiple .c files on one line.
- Enable IMA for C++.
Avoid use of 64-bit integers on machines with 32-bit registers - i.e. use of HOST_WIDE_INT / HOST_WIDEST_INT where what is appropriate is HOST_WIDEST_FAST_INT ([gccsource:bitmap.c] already fixed?)
- Figure out what the kernel bottlenecks are.
Avoid clearing pages unnecessarily (e.g. fresh MAP_ANON pages are already zeroed).
- Turn the GCC driver into a shared library that make or an IDE can load.
- Make specs processing more efficient; possibly, generate C from specs at compiler build time. (Might not be much of a win.)
- Avoid generic data structures and function pointers. (turn hashtab.c into fake templates like Nathan's [gccsource:vec.h]?)
- Why (asks jimis)? Just to avoid function call overhead? Is it worth the extra CPP mess?
- Find all structures with internal padding and reorder fields to remove it. (See "dwarves" pahole tool for an easy way to try this.)
Pie in the sky
- Integrate assembler into the compiler
- Put object code in a database (avoids e.g. repeated emission of template instantiations, debug info).
- Multithreaded linker
Per's compile server concept (see IncrementalCompiler)
- Auto-PCH (Create PCH headers for the user automatically, instead of forcing users to do the project restructuring necessary to do the work on their own.)
- Provide static binaries for beta testing so more people participate and more regressions are resolved before release
Here is a list of stuff from this page that was implemented:
- Finish Matt's do-all-the-lexing-up-front patch. (1-5% on -O0)
- Use operator precedence parsing (a substantial win in some special cases, and always better at least by a small amount).
- Add contains-repeated-base and is diamond-shaped flags to classes. These can avoid repeated binfo walks, much like the dynamic_cast implementation is optimized.
ggc_push_context() and ggc_pop_context() are not used anymore.
- The CFG now survives the whole compilation.
- Mapped locations are now the default.
- Push keyword lookup into preprocessor.
- (For cpp) Use word-sized memory fetches somehow. Not clear how much this gains; not clear how hard it is.
- (For cpp) Use vector instructions where available and useful (probably would be worth it for comment skipping at least).
Put the string at the end of the IDENTIFIER_NODE using the trailing array hack
- Delete the identifier hash after parsing (or maybe make it a weak hash at that time). This would reduce the amount of memory scanned by GC.