Autovect-branch optimizations

Reviewer: Diego Novillo.

1. Vectorizer features that have already been reviewed and approved (originally targeted for 4.2):

(1.1) Vectorization of loops with multiple data types and type conversions

(1.2) Vectorization of strided accesses

2. Misc. features that are already implemented in autovect-branch:

(2.1) Vectorization of induction: A small enhancement that adds support for vectorizing induction variables (e.g. for (i...) { a[i] = i; }).

(2.2) Versioning for aliasing: It is often difficult or impossible to prove that two data references in a loop do not overlap (e.g. when they are accessed through pointers). Such loops can still be vectorized using runtime dependence checks, much like the runtime alignment checks we already generate in mainline. I.e., use loop versioning and guard the vectorized version with a runtime aliasing test. This is done within the vectorizer: all pairs of pointers that cannot be proven not to alias are compared against one another in a guard that determines at run time whether the scalar or the vector version will be executed.

(2.3) tree-ifcvt enhancements: A batch of small improvements to tree-level if-conversion: (1) speculative load motion: enable if-conversion in some cases where a load is present in the if-clauses; (2) store sinking: enable if-conversion in some cases where stores are present in the if-clauses; (3) support for multiple basic blocks: enable if-conversion for loops with multiple if-then-else clauses.

(2.4) Call-site versioning: In some cases it is profitable to duplicate a call site, for instance in the following case:

switch (cond) {
  case 1: a = C1; b = C2; break;
  case 2: a = C3; b = C4; break;
}
foo (a, b);

If the call site is duplicated into the switch cases, as shown below, the parameters become constants. In this way we in fact create specialized versions of function foo for the cases where the parameters are constant. This may enable further optimizations, such as constant propagation, scalarization, unrolling, and inlining.

switch (cond) {
  case 1: foo (C1, C2); break;
  case 2: foo (C3, C4); break;
}
The current implementation is based on simple pattern matching, but it could be rewritten in a more general setting. The implementation has two phases: one decides the feasibility and profitability of the optimization, and the other performs the transformation. This optimization should be invoked before inlining. We have a simple, preliminary implementation of this optimization in the file cv.c. It is enabled by the -fcv command-line switch and is disabled by default.

3. Features that have not been implemented yet, but that we hope to address during the 4.3 cycle:

(3.1) Outer-loop vectorization: Vectorize certain forms of doubly-nested loops. Filters, for example, which are very common in multimedia computations, could be vectorized with this optimization.

(3.2) Straight-line code vectorization: Exploit data parallelism in straight-line code rather than across loop iterations. One option is to look for opportunities in all basic blocks, like SLP ("Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Samuel Larsen and Saman Amarasinghe, PLDI 2000). Another option is to continue to focus on loops, exploiting parallelism within a single iteration. This would let us vectorize (partially) unrolled loops and accesses to consecutive structure fields.

(3.3) Cost model: We are currently vectorizing whenever we can. This can hurt performance, for example when the loop is very short, because of the overheads involved in vectorization (e.g. alignment handling, loop peeling, and epilog code for reductions). We also need the cost model to decide how to vectorize: for example, there are different ways to handle alignment (versioning, peeling, misaligned vector accesses). We want to estimate the costs involved in vectorization and, based on that, decide whether to vectorize, and how.

AutovectBranchOptimizations (last edited 2008-01-10 19:38:45 by localhost)