Vectorization related tasks
This page is a TODO list for tasks related to the GCC vectorizer. For each task I list the person who proposed the task (and who should be contacted in case you want to start working on it), and a short description of the project.
Here is the summary of the Loop-Optimizations BOF that took place at the 2007 GCC summit.
In progress:
- Working towards enabling the vectorizer by default. This includes tuning the cost model, enabling the cost model by default, and solving any bootstrap failures with vectorization enabled.
- Runtime aliasing tests (or in other words: Loop versioning to eliminate data dependence using run-time dependence tests) (Victor Kaplansky): It is often difficult/impossible to prove that two data-references in the loop don't overlap (e.g. when they are accessed using pointers). It is still possible to vectorize such loops using runtime dependence checks, much like the runtime alignment checks that we already generate in mainline. I.e., use loop versioning and guard the vectorized version with a runtime aliasing test. This is done within the vectorizer - all pairs of pointers that can't be anti-aliased are compared against one another in a guard that determines at run-time whether the scalar or vector version will be executed.
- Straight-Line code vectorization (Ira Rosen): Exploit data parallelism in straight-line code rather than across loop iterations. One option is to look for opportunities in all basic-blocks, like SLP ("Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Samuel Larsen and Saman Amarasinghe, PLDI 2000). Another option is to continue to focus on loops, exploiting parallelism within an iteration. This would let us vectorize (partially) unrolled loops, and accesses to consecutive structure fields. Initial implementation in autovect-branch. To be submitted to 4.3.
- Outer-loop vectorization (Dorit Nuzman): Vectorize certain forms of doubly-nested loops. Filters for example, which are very common in multimedia computations, could be vectorized with this optimization. Initial implementation in autovect-branch. To be submitted to 4.3.
Cost model (Harsha Jagasia): Build a cost model to decide whether it's profitable to vectorize, and how. We are currently vectorizing whenever we can. This can often hurt performance, for example, if the loop is very short, because of the overheads involved in vectorization (e.g. alignment handling, loop peeling, and epilog code for reduction). We also need the cost model to decide how to vectorize - for example - there are different ways we can handle alignment (versioning, peeling, misaligned vector accesses). We want to try to estimate the costs involved in vectorization, and make a decision based on that whether to vectorized or not, and how. RFC: http://gcc.gnu.org/ml/gcc/2007-02/msg00377.html.
Todo:
Misaligned stores support (Dorit Nuzman): We currently don't handle misaligned stores. Instead we peel the loop to force the alignment of the store. This works only if there's a single misaligned store in the loop; if there's more than one, and we can't prove that all the stores in the loop have the same misalignment, we can't vectorize the loop. We want to add the capability to vectorize misaligned stores. Initial Patch: http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00604.html. Related PR: PR32380.
- Interleaved stores with gaps (Ira Rosen): Support interleaved stores to non contiguous memory locations (i.e. with gaps). Related PRs: PR18438, PR19049.
- Interleaving improvements (Ira Rosen): Extend interleaving support to more forms of strided accesses (e.g. non power-of-2 strides).
- Model missing vec_extract_even/odd (needed for interleaving loads) for ia64. See details in PR30211.
- Model missing vec_pack/unpack idioms for ia64 (needed for vectorization of multiple-datatypes). Details in PR29778.
- Model missing int-float conversions idioms for Altivec (Tehila Meyzels).
- Integral type conversions improvements (Dorit Nuzman): Extend our support for type conversions of integral types to conversions between types whose size is more than twice as big/small than one another (e.g. conversions between char and int). Currently we only support conversions that require a single level of pack/unpack instructions, such as needed for conversions between ints and shorts, or shorts and chars. A multi-level (recursive) packing/unpacking is required to generalize this.
Support uses-after-loop (Dorit Nuzman): Extend the vectorizer to vectorize loops even if they contain definitions that are used after the loop. Related PR: PR30038 (See http://gcc.gnu.org/bugzilla//show_bug.cgi?id=30038#c16).
- Generalize realignment support (Dorit Nuzman): See details in PR29268.
Support attribute aligned for pointers (Dorit Nuzman): This is an infrastructural enhancement out side the vectorizer. Related thread: http://gcc.gnu.org/ml/gcc/2006-11/msg00106.html.
Generalize the reduction support (Dorit Nuzman): Pick up Daniel Berlin's patch (http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) to detect more general reduction forms than what we are currently handling (using SCC detection). PR25621 is an example where this is needed.
#pragma support to guide autovectorizer (Devang Patel): See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.
- Support certain operations on data-types that are not directly supported by a target, but yet vectorization is possible. For example, support data movements and bitwise operations on 64-bit data types for altivec). (Dorit Nuzman). (TODO: check if this is still needed).
- Support multiple vector sizes (Dorit Nuzman).
- Vectorize instructions that operate on a sequence of bytes in memory, which means that they implement semantics that corresponds to code containing a loop in C (such as those available in S390) (Dorit Nuzman).
Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html). (TODO: check if this is still needed).
Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html). (Dorit Nuzman).
- Data Dependence enhancements:
- PR32378 ("siv not implemented"?) (Sebastian Pop)
- ignore forward dependences (Ira Rosen)
- interchange stmts to reverse backward dependences (PR32806) (Tomas Bily).
- Loop-number-of-iterations enhancements:
- PR32377 (Sebastian Pop)
- make gimplifier create COND_EXPR (Zdenek has an initial patch).
- Preserve and pass information: Preserve data-dependence information on top of Zdenek's preserve-loop-info project. Have the vectorizer mark that outer-loop nests are parallelizable. Pass information on the maximum loop count of peel loops (in edge/BB probability/frequency and in the loop-structure) (Dorit Nuzman).
- Move predictive commoing to after vectorization (Dorit Nuzman). Also make predcom work on vectorized code (along with any required modifications inside the vectorizer).
- Teach the vectorizer to overcome dependences created by PRE. (Dorit Nuzman).
- look into vectorizing fortran COMMOM block arrays better.
- look into altivec specific problems (PR31334, PR32107). (Andrew Pinski).
- Loop-aware SLP (Ira Rosen):
- Data dependence relaxation: SLP can be used to vectorize within an iteration in situations where there are cross-iteration dependencies that prevent loop-based implementation. The current implementation applies the SLP analysis however is applied only after full dependence testing has passed. This restriction should be relaxed.
- Data permutation support: when the order of loaded scalar elements does not fit the order of stores the data should be reorganized.
- Reduction: the current implementation terminates when it reaches a PHI-node.
- Strided accesses with gaps.
- Partial SLP.
- MIMD (Multiple Instructions Multiple Data) support: for example, the “subadd” computation as well as other forms of mixtures of non-isomorphic defs that could still be combined into a vector. There are targets that can directly support certain MIMD operations (SSE3, for example, supports a subadd vector operation), but it can be vectorized also on targets that don’t directly support it (by multiplying the relevant vector operand by a vector of 1s and –1s).
- Non-isomorphic computations: the current implementation does not address the case in which the GS is greater than VS and not all the elements of the group are defined by isomorphic computations, but there exists a subgroup of VS elements that are defined by isomorphic computations. Now we attempt to construct the SLP-tree from the entire group, and will therefore fail and terminate. However, the analysis can continue if the implementation is extended to explore subgroups of size VS of the SLP group under consideration.
Smart permutation schemes: decide when to rearrange the data — either immediately when loading (resembling the eager shift heuristic in http://portal.acm.org/citation.cfm?id=1048985), or at a later stage of the computation not originally associated with loads or stores (resembling the lazy shift heuristic). Ultimately, we may want to reorder in anticipation of the permutation needed by the stores, if internal operations are isomorphic, similar to the eager shift scheme. A simple case to optimize occurs when the rearrangement at the stores is the inverse of that at the loads (e.g., interleave low/high and extract odd/even), thereby canceling each other.
- Allow shifts with different scalar arguments, when the statements that are grouped into the same vector statement have the same argument.