Improve_instruction_selection

Instruction selection in compilers can be loosely described as the process of picking the optimal (under some criteria) set of instructions to represent the semantics of the intermediate representation (IR) of some function.

In GCC there is no instruction selection pass in the classical sense. Instead, GCC translates one GENERIC tree statement at a time to RTL in the expand pass. GENERIC-to-RTL expansion is currently done by invoking standard named patterns to generate RTL. Each named pattern is a relatively low level operation; more complex, machine specific higher level instructions can never be generated in the expand pass. So the generated RTL very much resemble that what one would get by macro expansion: Very Poor Code. This is one of the reasons (if not *the* most important reason) why the RTL Optimizers are still so necessary despite the presence of Tree Optimizers.

The RTL optimizers basically do three things right now:

The poor initial RTL has to be optimized to remove redundant computations that are
- not visible at the GIMPLE level. There is very little one can do about this, passes to do this will always be necessary. But they can be cheap.
Register allocation and scheduling, which is highly machine specific, hence another
- set of passes that the Tree Optimizers can never subsume.
Instruction combining, such as done in [gccsource:cse.c] and [gccsource:combine.c].
- These passes combine the simple initially generated RTL into more complex patterns.

It is the last of the above list that really is a form of instruction selection. While reload makes the final choice about which alternative of a pattern to emit as an assembly instruction, the choice of which more complex patterns are valid RTL for the target is with the passes that combine the initial RTL into new patterns. These passes try to substitute definitions for registers and such with their set source, and then try to validate that substitution using recog() from [gccsource:recog.c].

Not only is this kind of "poor man's instruction selection" inefficient, it also appears to be main reason why we still need the full complexity of CSE and combine, why disabling a CSE pass still has such adverse effects on the generated code, and why CSE's fcse-skip-blocks which makes it so hard to simplify CSE, is still so useful (it's the only mechanism GCC has to combine simple instructions into more complex ones over extended basic block boundaries).

So why does this matter?

CSE often takes about 5% of the total compile time. There are three passes of CSE when compiling at -O2 CSE1, CSE2, and a local CSE pass which is run after GCSE. Even being able to eliminate one of these passes would give a significant compile time speedup; so would making all of these local. The combine pass is not clever either: it tries sticking together any possible set of instructions, with no guidance as to what is likely to succeed. That makes it inefficient.
So, the only way to make complex instruction is either in CSE, that can see through at most PARAM_MAX_CSE_PATH_LENGTH basic blocks but is expensive and not so powerful, or in combine, which works one basic block (i.e. it is alocalpass) and will only try to combine up to three instructions at a time. Thus a single instruction which can be decomposed into more than three basic instructions is hard for combine to recognize. For example, if sign extension is done using two shift instructions, as is often the case, it can be hard for combine to recognize a multiply instruction which effectively sign extends the operands to a wider mode: it is very hard to remove the restriction of looking at only up to three instructions without rewriting it, in practice. This difficulty is partially mitigated by REG_EQUAL notes which mark implicit sign extensions done by the compiler, but those only help in the straightforward cases.
GCC often "optimizes" some expressions on the GIMPLE level that would have had a cheap RTL equivalent. Luckily this happens mostly for addressing modes nowadays (for example [gccbug:18463]) and this is a special case that could be fixed without severely crippling the Tree Optimizers.

What GCC needs is a smarter instruction selection pass. Ways to do this:

Simple CSE-like passes that can propagate values over basic block boundaries. An example is http://gcc.gnu.org/ml/gcc-patches/2005-07/msg01407.html, forward propagation, that can do most (but not all) of what CSE path following achieves. It also provides a real reason for moving transformation from combine to [gccsource:simplify-rtx.c].
Similarly, you could think of a combine-like pass that can combine instructions over basic block boundaries. This pass could use [gccsource:df.c] to compute available expressions and DEF-USE chains on RTL, and try to combine instruction based on that instead of LOG_LINKS It would still be necessary to improve the combining strategy.
A smarter expand pass that is capable of selecting and constructing more complex instructions already while expanding trees to RTL. The pass would do machine dependent tree to RTL expansion; this might be similar to peepholes, but the input would be a GIMPLE tree and the output would be RTL.
A smarter combine pass that is capable of doing optimal instruction selection in a single basic block. Tiling is hard to apply to GCC, because of PARALLEL patterns for example, but at least the simplest cases could be handled. It could require major surgery of recog and genrecog to make this efficient.

More thoughts on this? Please add them to this Wiki page.

None: Improve_instruction_selection (last edited 2008-01-10 19:38:37 by localhost)