Vectorization related tasks
This page is a TODO list for tasks related to the GCC vectorizer. For each task I list the person who proposed the task (and who should be contacted in case you want to start working on it), and a short description of the project.
Here is the summary of the Loop-Optimizations BOF that took place at the 2007 GCC summit.
In progress:
- Enabling the cost model by default (currently enabled only on x86).
Misaligned stores support (Revital Eres): We currently don't handle misaligned stores. Instead we peel the loop to force the alignment of the store. This works only if there's a single misaligned store in the loop; if there's more than one, and we can't prove that all the stores in the loop have the same misalignment, we can't vectorize the loop. We want to add the capability to vectorize misaligned stores. Initial Patch: http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00604.html. Related PR: PR32380.
Todo:
- Interleaved stores with gaps (Ira Rosen): Support interleaved stores to non contiguous memory locations (i.e. with gaps). Related PRs: PR18438, PR19049.
- Interleaving improvements (Ira Rosen): Extend interleaving support to more forms of strided accesses (e.g. non power-of-2 strides).
Model missing vec_extract_even/odd (needed for interleaving loads) for ia64. See details in PR30211. Were implemented for wide types x86_64 and i686 in 4.4 by Richard Guenther (http://gcc.gnu.org/viewcvs?view=rev&revision=138553).
- Model missing vec_pack/unpack idioms for ia64 (needed for vectorization of multiple-datatypes). Details in PR29778.
Support uses-after-loop (Dorit Nuzman): Extend the vectorizer to vectorize loops even if they contain definitions that are used after the loop. Related PR: PR30038 (See http://gcc.gnu.org/bugzilla//show_bug.cgi?id=30038#c16).
- Generalize realignment support (Dorit Nuzman): See details in PR29268.
Support attribute aligned for pointers (Dorit Nuzman): This is an infrastructural enhancement out side the vectorizer. Related thread: http://gcc.gnu.org/ml/gcc/2006-11/msg00106.html.
Generalize the reduction support (Dorit Nuzman): Pick up Daniel Berlin's patch (http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) to detect more general reduction forms than what we are currently handling (using SCC detection). PR25621 is an example where this is needed.
#pragma support to guide autovectorizer (Devang Patel): See http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.
- Support certain operations on data-types that are not directly supported by a target, but yet vectorization is possible. For example, support data movements and bitwise operations on 64-bit data types for altivec). (Dorit Nuzman). (TODO: check if this is still needed).
- Support multiple vector sizes (Dorit Nuzman).
- Vectorize instructions that operate on a sequence of bytes in memory, which means that they implement semantics that corresponds to code containing a loop in C (such as those available in S390) (Dorit Nuzman).
Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html). (TODO: check if this is still needed).
Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html). (Dorit Nuzman).
- Data Dependence enhancements:
- PR32378 ("siv not implemented"?) (Sebastian Pop)
- ignore forward dependences (Ira Rosen)
- interchange stmts to reverse backward dependences (PR32806) (Tomas Bily).
- Loop-number-of-iterations enhancements:
- make gimplifier create COND_EXPR (Zdenek has an initial patch).
- Preserve and pass information: Preserve data-dependence information on top of Zdenek's preserve-loop-info project. Have the vectorizer mark that outer-loop nests are parallelizable. Pass information on the maximum loop count of peel loops (in edge/BB probability/frequency and in the loop-structure) (Dorit Nuzman).
- Move predictive commoing to after vectorization (Dorit Nuzman). Also make predcom work on vectorized code (along with any required modifications inside the vectorizer).
- Teach the vectorizer to overcome dependences created by PRE. (Dorit Nuzman).
- look into vectorizing fortran COMMOM block arrays better.
- look into altivec specific problems (PR32107). (Andrew Pinski).
- Loop-aware SLP (Ira Rosen):
- Data dependence relaxation: SLP can be used to vectorize within an iteration in situations where there are cross-iteration dependencies that prevent loop-based implementation. The current implementation applies the SLP analysis however is applied only after full dependence testing has passed. This restriction should be relaxed.
- Data permutation support: when the order of loaded scalar elements does not fit the order of stores the data should be reorganized.
- Reduction: the current implementation terminates when it reaches a PHI-node.
- Strided accesses with gaps.
- Partial SLP.
- MIMD (Multiple Instructions Multiple Data) support: for example, the “subadd” computation as well as other forms of mixtures of non-isomorphic defs that could still be combined into a vector. There are targets that can directly support certain MIMD operations (SSE3, for example, supports a subadd vector operation), but it can be vectorized also on targets that don’t directly support it (by multiplying the relevant vector operand by a vector of 1s and –1s).
- Non-isomorphic computations: the current implementation does not address the case in which the GS is greater than VS and not all the elements of the group are defined by isomorphic computations, but there exists a subgroup of VS elements that are defined by isomorphic computations. Now we attempt to construct the SLP-tree from the entire group, and will therefore fail and terminate. However, the analysis can continue if the implementation is extended to explore subgroups of size VS of the SLP group under consideration.
Smart permutation schemes: decide when to rearrange the data — either immediately when loading (resembling the eager shift heuristic in http://portal.acm.org/citation.cfm?id=1048985), or at a later stage of the computation not originally associated with loads or stores (resembling the lazy shift heuristic). Ultimately, we may want to reorder in anticipation of the permutation needed by the stores, if internal operations are isomorphic, similar to the eager shift scheme. A simple case to optimize occurs when the rearrangement at the stores is the inverse of that at the loads (e.g., interleave low/high and extract odd/even), thereby canceling each other.
- Allow shifts with different scalar arguments, when the statements that are grouped into the same vector statement have the same argument.