Vectorization related tasks
This page is a TODO list for tasks related to the GCC vectorizer.
- Remove non-SLP loop vectorization paths
- Use single-lane SLP for formerly non-SLP parts
- Keep load and store groups as analyzed, add merge/split nodes
- Replace greedy loop SLP discovery with one based on merging nodes starting from single-lane SLP graph matching the SSA graph
- Delay vector type assignment to SLP node analysis (vectorizable_*), compute set of vector types and decide on the vector size by evaluating different sets of working combinations
- Make the vectorization factor support fractional poly-ints to implement re-rolling of loops
- Remove if-conversion, replacing it with masking or on-the-fly if-conversion
- Generate code directly from SLP instead of copying the scalar loop and replacing stmts
- Move pattern detection from stmts to SLP
- Make patterns cancelable
- Make x86 gather and scatter use the internal function instead of the builtins representation
- Split vectorizable_* into analysis and code generation, store analysis data instead of recomputing it
- Code generate unvectorizable (single-lane) SLP instances by duplicating the scalar code, which implements partial loop vectorization. With no vectorization at all this amounts to unrolling + interleaving (plus costing)
Old content below
Here is the summary of the Loop-Optimizations BOF that took place at the 2007 GCC summit.
Todo:
- SLP group size relaxation: vectorize only a subset of interleaved stores, or split large groups into subgroups if necessary (PR 49955).
- Enabling the cost model by default (currently enabled only on x86).
- Interleaved stores with gaps: support interleaved stores to non-contiguous memory locations (i.e. with gaps). Related PRs: PR18438, PR19049.
- Interleaving improvements: extend interleaving support to more forms of strided accesses (e.g. non power-of-2 strides).
- Support certain operations on data types that are not directly supported by a target but for which vectorization is still possible, for example data movements and bitwise operations on 64-bit data types for altivec. (TODO: check if this is still needed).
- Vectorize instructions that operate on a sequence of bytes in memory, which means that they implement semantics that correspond to code containing a loop in C (such as those available on S390).
- Improve debug information (mostly line-number information) for code created by the vectorizer (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00197.html). (TODO: check if this is still needed).
- Reuse generic loop peeling utilities in the vectorizer where possible (see http://gcc.gnu.org/ml/gcc-patches/2005-02/msg00165.html).
- Data Dependence enhancements:
- Loop-number-of-iterations enhancements:
- make gimplifier create COND_EXPR (Zdenek has an initial patch).
- look into vectorizing Fortran COMMON block arrays better.
- look into altivec-specific problems (PR32107).
- Loop-aware SLP:
- Non-isomorphic computations: the current implementation does not address the case in which the group size (GS) is greater than the vector size (VS) and not all the elements of the group are defined by isomorphic computations, but there exists a subgroup of VS elements that are defined by isomorphic computations. Currently we attempt to construct the SLP tree from the entire group, and therefore fail and terminate. However, the analysis could continue if the implementation were extended to explore subgroups of size VS of the SLP group under consideration.
- Allow shifts with different scalar arguments, when the statements that are grouped into the same vector statement have the same argument.
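To make the two loop-aware SLP items above more concrete, here is a sketch of the kind of source loops they appear to describe. This is one reading of those items and not taken from the GCC testsuite. The function names, the group size of 8 and the assumed vector size of 4 elements are illustrative choices only.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Group of 8 stores, assumed vector size of 4 elements.
   Lanes 0..3 share shift amount s0 and lanes 4..7 share s1, so each
   group of 4 statements that would form one vector statement uses a
   single scalar shift argument. */
void shift_groups(unsigned *out, const unsigned *in, size_t n,
                  unsigned s0, unsigned s1)
{
    /* Shifting by the type width or more is undefined behaviour in C. */
    assert(s0 < sizeof(unsigned) * CHAR_BIT);
    assert(s1 < sizeof(unsigned) * CHAR_BIT);
    for (size_t i = 0; i < n; i++) {
        out[8 * i + 0] = in[8 * i + 0] << s0;
        out[8 * i + 1] = in[8 * i + 1] << s0;
        out[8 * i + 2] = in[8 * i + 2] << s0;
        out[8 * i + 3] = in[8 * i + 3] << s0;
        out[8 * i + 4] = in[8 * i + 4] << s1;
        out[8 * i + 5] = in[8 * i + 5] << s1;
        out[8 * i + 6] = in[8 * i + 6] << s1;
        out[8 * i + 7] = in[8 * i + 7] << s1;
    }
}

/* Group of 8 stores where only the first 4 are isomorphic (all adds).
   Building an SLP tree from the whole group fails on lanes 4..7, but
   the subgroup formed by lanes 0..3 could still be handled. */
void mixed_group(int *out, const int *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[8 * i + 0] = in[8 * i + 0] + 1;   /* isomorphic subgroup */
        out[8 * i + 1] = in[8 * i + 1] + 2;
        out[8 * i + 2] = in[8 * i + 2] + 3;
        out[8 * i + 3] = in[8 * i + 3] + 4;
        out[8 * i + 4] = in[8 * i + 4] * 2;   /* non-isomorphic rest */
        out[8 * i + 5] = in[8 * i + 5] - 1;
        out[8 * i + 6] = in[8 * i + 6] / 3;
        out[8 * i + 7] = in[8 * i + 7] ^ 5;
    }
}
```

Whether a given GCC version vectorizes either loop depends on the target and on the state of the items listed above, so these functions are meant as discussion material and not as regression tests.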