Vectorization related tasks
This page is a TODO list for tasks related to the GCC vectorizer.
Here is the summary of the Loop-Optimizations BOF that took place at the 2007 GCC Summit.
Todo:
- SLP group size relaxation: vectorize only a subset of the interleaved stores, or split large groups into subgroups where necessary (PR 49955).
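A minimal, hypothetical sketch of what group splitting could enable (this loop is made up for illustration and is not taken from PR 49955):

    /* Hypothetical example: the four stores form one interleaved group of
       size 4.  If the target only handles groups of size 2, splitting the
       group into the subgroups {4*i, 4*i+1} and {4*i+2, 4*i+3} would still
       allow each subgroup to be vectorized.  */
    void
    foo (int *a, int *b, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[4 * i]     = b[i] + 1;
          a[4 * i + 1] = b[i] + 2;
          a[4 * i + 2] = b[i] + 3;
          a[4 * i + 3] = b[i] + 4;
        }
    }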
- Enable the cost model by default (it is currently enabled only on x86).
- Interleaved stores with gaps: support interleaved stores to non-contiguous memory locations (i.e. with gaps). Related PRs: PR18438, PR19049.
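A hypothetical example of such a group (an illustration, not the testcase from the PRs above):

    /* Only two of every four consecutive elements of a are written, so the
       interleaved store group has gaps at positions 4*i+1 and 4*i+3.  */
    void
    foo (int *a, int *b, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[4 * i]     = b[i];
          a[4 * i + 2] = -b[i];
        }
    }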
- Interleaving improvements: extend interleaving support to more forms of strided accesses (e.g. non power-of-2 strides).
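For instance (a hypothetical sketch), a stride of 3 is not a power of 2:

    /* The loads from b have stride 3, which is not a power of 2.  */
    void
    foo (int *a, int *b, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] = b[3 * i] + b[3 * i + 1] + b[3 * i + 2];
    }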
- Generalize the reduction support: pick up Daniel Berlin's patch (http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00172.html) to detect more general reduction forms than those we currently handle (using SCC detection). PR25621 is an example where this is needed.
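A hedged illustration of a more general reduction form (this loop is made up for this page and is not the testcase from PR25621):

    /* The running sum alternates between s and t, so the reduction cycle
       consists of two statements rather than a single self-referencing
       one; recognizing it requires looking at the whole strongly-connected
       component of the use-def graph.  */
    float
    foo (float *a, int n)
    {
      float s = 0.0f, t = 0.0f;
      for (int i = 0; i + 1 < n; i += 2)
        {
          t = s + a[i];
          s = t + a[i + 1];
        }
      return s;
    }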
- Support certain operations on data types that are not directly supported by the target but for which vectorization is still possible; for example, data movement and bitwise operations on 64-bit data types for AltiVec. (TODO: check if this is still needed.)
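For example (a hypothetical sketch), a bitwise operation on 64-bit elements needs only vector data movement and vector logical instructions, even on a target without 64-bit vector arithmetic:

    /* Bitwise AND on 64-bit elements; it can be carried out with the
       target's existing vector bitwise instructions even when 64-bit
       vector arithmetic is not supported.  */
    void
    foo (unsigned long long *a, unsigned long long *b, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] &= b[i];
    }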
- Vectorize instructions that operate on a sequence of bytes in memory, i.e. instructions whose semantics correspond to code containing a loop in C (such as those available on s390).
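A hedged sketch of the kind of loop meant here:

    /* A byte-wise copy whose semantics correspond to a single instruction
       that operates on a sequence of bytes in memory on some targets
       (e.g. s390).  */
    void
    copy_bytes (unsigned char *dst, const unsigned char *src, int n)
    {
      for (int i = 0; i < n; i++)
        dst[i] = src[i];
    }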
- Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html). (TODO: check if this is still needed.)
- Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html).
- Data Dependence enhancements:
- Loop-number-of-iterations enhancements:
- Make the gimplifier create COND_EXPR (Zdenek has an initial patch).
- Preserve and pass information: Preserve data-dependence information on top of Zdenek's preserve-loop-info project. Have the vectorizer mark that outer-loop nests are parallelizable. Pass information on the maximum loop count of peel loops (in edge/BB probability/frequency and in the loop-structure).
- Make predcom work on vectorized code (along with any required modifications inside the vectorizer).
- Teach the vectorizer to overcome dependences created by PRE.
- Look into vectorizing Fortran COMMON block arrays better.
- Look into AltiVec-specific problems (PR32107).
- Loop-aware SLP:
- Data permutation support: when the order of the loaded scalar elements does not match the order of the stores, the data has to be reorganized (partially supported); see the permutation example after this list.
- Strided accesses with gaps.
- Partial SLP.
- MIMD (Multiple Instruction Multiple Data) support: for example, the "subadd" computation, as well as other mixtures of non-isomorphic defs that could still be combined into a vector. Some targets support certain MIMD operations directly (SSE3, for example, has a subadd vector operation), but such code can also be vectorized on targets without direct support, by multiplying the relevant vector operand by a vector of 1s and -1s (see the subadd example after this list).
- Non-isomorphic computations: the current implementation does not handle the case in which the group size (GS) is greater than the vector size (VS) and not all the elements of the group are defined by isomorphic computations, but there exists a subgroup of VS elements that are. At present we attempt to construct the SLP tree from the entire group, and therefore fail and terminate; the analysis could instead continue by exploring subgroups of size VS within the SLP group under consideration (see the subgroup example after this list).
- Smart permutation schemes: decide when to rearrange the data, either immediately when loading (resembling the eager-shift heuristic in http://portal.acm.org/citation.cfm?id=1048985) or at a later stage of the computation not originally associated with loads or stores (resembling the lazy-shift heuristic). Ultimately, we may want to reorder in anticipation of the permutation needed by the stores, if the internal operations are isomorphic, similar to the eager-shift scheme. A simple case to optimize occurs when the rearrangement at the stores is the inverse of that at the loads (e.g. interleave low/high and extract odd/even), so the two cancel each other.
- Allow shifts with different scalar shift arguments, as long as the statements that are grouped into the same vector statement have the same argument (see the shift example after this list).
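The data-permutation item above can be illustrated with the following hypothetical loop, where the loaded elements have to be swapped before being stored:

    /* The load order (b[2*i+1], b[2*i]) does not match the store order
       (a[2*i], a[2*i+1]), so the loaded vector must be permuted before
       the store.  */
    void
    foo (int *a, int *b, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[2 * i]     = b[2 * i + 1];
          a[2 * i + 1] = b[2 * i];
        }
    }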
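A hedged sketch of the subadd case from the MIMD item: the two statements are not isomorphic (one subtracts, one adds), yet SSE3 can combine them into a single subadd operation, and other targets can emulate it by multiplying c by a vector of alternating -1s and 1s before a plain vector addition:

    /* Even elements are subtracted, odd elements are added.  */
    void
    foo (float *a, float *b, float *c, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[2 * i]     = b[2 * i]     - c[2 * i];
          a[2 * i + 1] = b[2 * i + 1] + c[2 * i + 1];
        }
    }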
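A hypothetical example of the non-isomorphic subgroup situation, assuming a vector size (VS) of 2:

    /* Group size 4, vector size 2.  The whole group is not isomorphic,
       but elements {0, 1} (additions) and {2, 3} (multiplications) each
       form an isomorphic subgroup of size VS.  */
    void
    foo (int *a, int *b, int *c, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[4 * i]     = b[4 * i]     + c[4 * i];
          a[4 * i + 1] = b[4 * i + 1] + c[4 * i + 1];
          a[4 * i + 2] = b[4 * i + 2] * c[4 * i + 2];
          a[4 * i + 3] = b[4 * i + 3] * c[4 * i + 3];
        }
    }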
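Finally, a hedged sketch of the shift case: the shift amount k is a loop-invariant scalar rather than a constant, but both statements that would be grouped into the same vector statement shift by the same k:

    /* Both grouped statements shift by the same scalar amount k, so a
       single vector shift with a scalar shift operand can implement
       them.  */
    void
    foo (int *a, int *b, int k, int n)
    {
      for (int i = 0; i < n; i++)
        {
          a[2 * i]     = b[2 * i] << k;
          a[2 * i + 1] = b[2 * i + 1] << k;
        }
    }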