Summary of the Loop-Optimizations BOF (2007 GCC Summit)
Some general notes:
- auto-parallelization is on its way to mainline (depends on review).
- people want to see the vectorizer turned on by default (now!). We may consider a one-month trial period during stage 2 to assess whether the vectorizer is ready for that. We may also want to disable by default the new vectorization features that are yet to be introduced (outer-loop and SLP vectorization). This depends on first fixing a bootstrap failure with vectorization enabled on powerpc. It also depends on people committing to work on any PRs that will be opened (in the vectorizer and related passes - tree-if-conversion, scalar-evolution, etc.).
(1) Interaction between the auto-vectorizer and the auto-parallelizer/GOMP
- Components that can be shared between these two passes, and utilities that need to be developed for both passes:
- reduction detection: Can and should be implemented as a general analysis utility that could be used by both parallelizer and vectorizer (Razya@IBM to be working on this).
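To make the shared analysis concrete, below is a minimal sketch (a hypothetical illustration, not GCC analysis code) of the kind of loop such a utility would recognize: a reduction collapses an array into a single value through an associative operation, carried across iterations in one variable.

```c
#include <assert.h>

/* Classic sum reduction: the loop-carried dependence on `total` is the
   pattern a generic reduction-detection utility would recognize, so that
   the vectorizer can use partial sums and the parallelizer can give each
   thread a private accumulator. */
int sum_reduce(const int *a, int n)
{
    int total = 0;            /* reduction variable */
    for (int i = 0; i < n; i++)
        total += a[i];        /* reduction statement: total = total + a[i] */
    return total;
}
```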
- loop peeling/versioning/unrolling for nested loops: Zdenek estimates about a day to make these utilities work on nested loops. Zdenek also pointed out that the vectorizer specific peeling utilities would have to be replaced with the generic utilities.
- cost-model: We have a preliminary cost-model for the vectorizer that was tuned for the Cell SPU (a special case, because it is a vector machine where you always want to vectorize), and it is in the process of being tuned for x86_64 (Harsha@AMD) and powerpc (Dorit@IBM). Preliminary tuning for powerpc seems to indicate that low-level considerations, such as ILP, affect the profitability of vectorization, and we don't expect to expose this kind of information to the tree level. Also, introducing the cost-model run-time test itself (to compare the loop-bound against the computed threshold) may have bad effects on branch prediction, which is hard to take into account. Richard Guenther suggested outlining the different versions of the loop (vectorized and unvectorized) into separate functions, and then later in the compilation (say, just before/after register allocation) deciding which version to use. The associated function-call overhead may be a concern; however, it would be useful infrastructure to have for other optimizations as well. For now we'll try to make the decision at the tree level. It was also proposed to add a tree-cost function (for which each target may want to run through expand to get a cost evaluation, although this may be problematic and also not very accurate, considering changes that later passes such as combine make). It was also proposed to take register pressure into account at the tree level.
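The run-time test mentioned above amounts to versioning the loop and guarding the vectorized copy with a comparison of the trip count against a cost-model threshold. A minimal sketch of the resulting code shape (the threshold value and all function names here are hypothetical, and the "vectorized" version is simulated with a 4-way unrolled loop):

```c
#include <assert.h>

#define VECT_THRESHOLD 16   /* hypothetical cost-model threshold */

/* Scalar fallback version of the loop. */
static int sum_scalar(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Stand-in for the vectorized version: 4 elements per iteration,
   plus a scalar epilogue for the leftover elements. */
static int sum_vectorized(const int *a, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0, i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];     s1 += a[i + 1];
        s2 += a[i + 2]; s3 += a[i + 3];
    }
    int s = s0 + s1 + s2 + s3;
    for (; i < n; i++)      /* epilogue */
        s += a[i];
    return s;
}

/* The versioned loop: this extra run-time branch is the one that may
   perturb branch prediction, as noted above. */
int sum(const int *a, int n)
{
    return (n >= VECT_THRESHOLD) ? sum_vectorized(a, n)
                                 : sum_scalar(a, n);
}
```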
- share/pass information between the vectorizer/parallelizer: We want to preserve data dependence information across passes. Both at the level of the loop as a whole (mark that a loop has no dependences), and at the detailed level of the specific data-ref pairs (preserve the DDRs). This also applies to information that we have from the GOMP directives, which is lost. The project of preserving loop-structures through the tree-level (by Zdenek) would facilitate this effort. This is also related to the project to pass dependence information from the tree-level to the rtl-level (for the sake of the scheduler).
- Ordering of the vectorizer/parallelizer passes: The parallelizer is invoked after the vectorizer, because it outlines loops into functions, where all datarefs become based on pointers that are passed to that function as arguments, which makes all aliasing and alignment information unknown to the vectorizer. However, we need to check how the parallelizer works on vectorized loops - can the data-dependence analysis work on vectorized references, or do we need to extend it? We also want to have the vectorizer mark outer loop nests as parallelizable.
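The outlining the parallelizer performs can be pictured as follows (a hand-written sketch with illustrative names, not the pass's actual output): the loop body moves into a separate function whose data references arrive as plain pointer arguments, so a vectorizer running on the callee would no longer see the original arrays' aliasing and alignment properties.

```c
#include <assert.h>

/* Before outlining the compiler sees the array declarations, so it
   knows a and b don't alias and knows their alignment. */
int a[64], b[64];

/* After outlining, the worker receives only raw pointers: the alias
   and alignment facts about the underlying arrays are lost, which is
   why vectorizing after parallelization is hard. */
void loop_body(int *dst, const int *src, int lo, int hi)
{
    for (int i = lo; i < hi; i++)
        dst[i] = src[i] * 2;
}

void run(void)
{
    /* The parallelizer would hand sub-ranges of [0,64) to threads;
       a single sequential call is shown here for illustration. */
    loop_body(a, b, 0, 64);
}
```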
(2) Loop Transformations to Increase Vectorizability of Loops
- interchanging stmts (PR32806): Requires analysis similar to what loop distribution does (looks for SCCs, builds a dependence graph, sorts it in topological order). Sebastian will provide an API that builds a combined dependence graph that includes scalar and memory dependences. The vectorizer will use that to get rid of backward dependences by interchanging stmts, when possible.
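For illustration, a hypothetical loop (not the PR32806 test case) where swapping the two statements turns a backward dependence into a forward one: in the first ordering, `c[i] = a[i + 1]` reads a location that the other statement writes on the *next* iteration, so the dependence edge points backward in statement order; putting the read first reverses it without changing the results.

```c
#include <assert.h>
#include <string.h>

/* Backward dependence: a[i+1] is read here but written by the first
   statement one iteration later, so the dependence edge goes from the
   second statement back to the first. */
void original(int *a, int *c, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = b[i];
        c[i] = a[i + 1];
    }
}

/* After interchanging the statements, the same dependence becomes a
   forward one, which the vectorizer can handle.  The read of a[i+1]
   still sees the pre-loop value in both orderings, so the transform
   preserves semantics. */
void interchanged(int *a, int *c, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i + 1];
        a[i] = b[i];
    }
}
```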
- how to plug in vectorizer-oriented considerations to other loop optimizations (or in other words: how would vectorizer/parallelizer select/drive loop transformations?). Right now loop interchange tries to increase locality. Loop distribution (in the works, by Sebastian Pop) doesn't yet have a heuristic to decide when to apply it. Harsha is working on scalarization, to assist loop distribution. Maybe in the future we'll look into how to incorporate parallelism considerations into loop interchange/distribution.
- tree-level if-conversion enhancements for the benefit of the vectorizer. This includes taking stores and loads out of if-then-else constructs to assist if-conversion.
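The store case can be sketched as follows (illustrative code, not the pass's output): a conditional store leaves a branch in the loop body that blocks if-conversion, but rewriting it as an unconditional load, a select, and an unconditional store makes the body straight-line code. This is legal only when the location is known to be readable and writable on every iteration.

```c
#include <assert.h>
#include <string.h>

/* Before: the conditional store keeps a branch inside the loop body,
   which prevents if-conversion and hence vectorization. */
void cond_store(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        if (b[i] > 0)
            a[i] = b[i];
}

/* After: hoist the load, select, and sink the store.  Valid only when
   a[i] may be accessed on every iteration. */
void if_converted(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++) {
        int old = a[i];                  /* hoisted load        */
        a[i] = (b[i] > 0) ? b[i] : old;  /* select + sunk store */
    }
}
```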
- make predictive commoning work on vectorized code. We need to check whether predcom can work with the align-ref (rather than a regular indirect-ref). This may also require some adjustments in the vectorizer, such as bumping the address between two loads by VS (Vector Size) rather than VS-1 (or teaching predcom to deal with it). As a first step we'll move predcom to after the vectorizer (and check that we don't get any degradations), and then look at making it work on vectorized loops.
- autovec and PRE: teach the vectorizer to ignore (or rather eliminate) dependences that PRE created. Actually, for predcom Zdenek also had to work around transformations that PRE had made. Can we make PRE avoid it?
- potential graphite-vectorizer collaboration? - too early; graphite will take about 3-4 years to be ready.
(3) Potential benchmarks to test the vectorizer (and tune the cost model)
- SPEC CPU2006
- netlib
- CSIBE benchmark?
- polyhedron
- blas/lapack
- maybe Toon Moene has more benchmarks
(4) Vectorizer related Todo
- vectorizer verbosity option was requested (Vladimir Makarov). It is available, but can be improved.
- enhancements to the data dependence checks. This includes ignoring forward dependences (patch is ready), interchanging stmts to reverse backward dependences, turning on the Omega test by default, and PRs like PR32378 ("siv not implemented"). (Note: in single-basic-block loops we can rely on the fact that in a DDR the first stmt appears in the code before the second stmt. However, in multi-bb loops, if the two stmts are not in the same BB, it is only guaranteed that the first stmt is in a BB that dominates the BB of the second stmt.)
- enhancements to loop-niters analysis. This includes creating a COND_EXPR when we can't prove that the loop-count is non-zero, but this requires teaching the gimplifier to generate a COND_EXPR on the rhs. PR32377 is another case where loop-niters analysis can be improved.
- how to handle virtual vectors - to be discussed.
- look into vectorizing Fortran COMMON block arrays better.
- look into altivec specific problems (PR31334, PR32107)
- a lot more in [VectorizationTasks]
Summary of Action Items
Zdenek:
- powerpc bootstrap failure with vectorization enabled
- loop peeling/versioning/unrolling for nested loops (done: http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01584.html)
- update the loop-optimizations wiki (done)
- ongoing: preserving the loop-info structure through the tree-level passes
- ongoing: merge the auto-par branch
Sebastian:
- PR32378 ("siv not implemented"?)
- PR32377 (can't compute loop-number-of-iterations).
- provide an API that builds a combined dependence graph that includes scalar and memory dependences. (done: http://gcc.gnu.org/ml/gcc-patches/2007-07/msg01593.html)
- turn on Omega test by default. Put the dependence-check under enabled checking.
- ongoing: loop distribution
Dorit:
- tune the cost model for powerpc.
- update the vectorizer wiki.
- work towards enabling the vectorizer by default
- get rid of vectorizer's loop peeling utilities.
- ongoing: outer-loop vectorization.
Ira:
- patch to ignore forward dependences
- ongoing: SLP vectorization.
Harsha:
- tune the cost model for x86_64.
- ongoing: scalarization for the benefit of loop distribution.
Jan Sjodin:
- ongoing: loop skewing for the benefit of vectorization.
Other (to be assigned...):
- generalized reduction detection for parallelizer/vectorizer (Razya@IBM?)
- preserve data-dependence information on top of Zdenek's preserve-loop-info project
- have the vectorizer mark that outer-loop nests are parallelizable
- interchange stmts to get rid of backward dependences
- move predcom to after vectorization
- make predcom work on vectorized code (along with any required modifications inside the vectorizer).
- make gimplifier create COND_EXPR (Zdenek has initial patch)
- don't have PRE create problematic dependence (or teach the vectorizer to deal with it).
- look into vectorizing Fortran COMMON block arrays better.
- look into altivec specific problems (PR31334, PR32107)