Auto-vectorization in GCC

The goal of this project is to develop a loop vectorizer in GCC, based on the tree-ssa framework. This work is taking place in the lno-branch.

Table of Contents

Using the Vectorizer

Vectorization is enabled by the flag -ftree-vectorize. To allow vectorization on powerpc platforms also use -maltivec; on i?86 and x86_64 platforms use -msse/-msse2. The vectorizer test cases demonstrate the current vectorization capabilities; these can be found under gcc/gcc/testsuite/gcc.dg/tree-ssa-vect/. Information on which loops were or were not vectorized, and why, can be obtained using the flag -fdump-tree-vect-stats. For example, this is the output that the vectorizer generated for test case tree-ssa-vect-1.c:

loop at tree-ssa-vect-1.c:35: not vectorized: unsupported scalar cycle.
loop at tree-ssa-vect-1.c:45: not vectorized: multi-dimensional array.
loop at tree-ssa-vect-1.c:53: LOOP VECTORIZED.
loop at tree-ssa-vect-1.c:60: LOOP VECTORIZED.
loop at tree-ssa-vect-1.c:67: not vectorized: complicated access pattern.
loop at tree-ssa-vect-1.c:75: LOOP VECTORIZED.
loop at tree-ssa-vect-1.c:86: not vectorized: mixed data-types
loop at tree-ssa-vect-1.c:95: not vectorized: dependence between refs to same array: a
vectorized 3 loops in function.

Vectorizable Loops

Examples of loops that can currently be vectorized

example1:
int a[256], b[256], c[256];
foo () {
  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
} 

example2:
int a[256], b[256], c[256];
foo (int n, int x) {

   /* new feature: support for unknown loop bound  */
   /* new feature: support for loop invariants  */
   for (i=0; i<n; i++)
      b[i] = x;

   /* new feature: general loop exit condition  */
   /* new feature: support for bitwise operations  */
   while (n--){
      a[i] = b[i]&c[i]; i++;
   }
}

example3:
typedef int aint __attribute__ ((__aligned__(16)));
foo (int n, aint * __restricted__ p, aint * __restricted__ q) {

   /* new feature: support for pointer accesses.  */
   while (n--){
      *p++ = *q++;
   }
}

example4 (on apple-ppc-branch only):
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];
foo (int n, aint * __restricted__ p, aint * __restricted__ q) {

   /* features: support for pointer accesses and constants  */
   while (n--){
      *p++ = *q++ + 5;
   }

   /* new feature: support for read accesses with a compile time known misalignment  */
   for (i=0; i<n; i++){
      a[i] = b[i+1] + c[i+3];
   }

   /* new feature: support for if-conversion  */
   for (i=0; i<n; i++){
      j = a[i];
      b[i] = (j > MAX ? MAX : 0);
   }
}

example5:
for (i = 0; i < N; i++)
  {
    /* new feature: struct access  */
    s.ca[i] = 5;
  }

example6 (gfortran):
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END

example7 (on apple-ppc-branch only):
int a[256], b[256];
foo (int x) {

   /* new feature: support for read accesses with an unknown misalignment  */
   for (i=0; i<N; i++){
      a[i] = b[i+x]; 
   }
}

example8 (on apple-ppc-branch only):
typedef unsigned char auchar __attribute__ ((__aligned__(16)));
foo (auchar * __restricted__ srcA,
     auchar * __restricted__ srcB,
     auchar * __restricted__ dest, unsigned char alpha){
  unsigned char not_alpha = 255 - alpha;

  /* new feature: pattern recognition  */
  for (i = 0; i < N; i++){
     unsigned short a1 = srcA[i] * alpha;
     unsigned short a2 = srcB[i] * not_alpha;
     unsigned char b1 = a1 >> 8;
     unsigned char b2 = a2 >> 8;
     dest[i] = b1 + b2;
  }
}

Unvectorizable Loops

Examples of loops that currently cannot be vectorized:

example1: uncountable loop:
while (*p != NULL) {
  *q++ = *p++;
}
example2: reduction, induction:
diff = 0;
for (i = 0; i < N; i++) {
  diff += (a[i] - b[i]);
}

for (i = 0; i < N; i++) {
  a[i] = i;
}
example3: strided access (interleaved data):
for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}
example4: multiple data-types, type casting:
short *sa, *sb, *sc;
int *ia, *ib, *ic;
for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}

Status

The main development branch is the lno-branch. In some cases, a certain functionality is first implemented in apple-ppc-branch, as a preliminary step before committing it to lno-branch in a more generic form.
2004-07-20, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. New: Supported memory accesses are one-dimensional arrays whose base can be a struct field and whose alignment can be forced (not extern arrays), and aligned pointers that are annotated as __restricted__.
  5. All memory accesses are consecutive (stride=1) and aligned.
  6. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support in the target platform.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
  10. New: first gfortran program vectorized.

Examples of newly vectorizable loops: loop, loop

New test-cases:

2004-06-25, apple-ppc-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays, and pointers. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__.
  5. Store (memory-write) accesses have to be aligned.
  6. New: Loads (memory-reads) can be unaligned by an unknown amount (e.g. access a[i+x], where the value of x is unknown). Misalignment support for loads was also made more efficient.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support in the target platform.
  9. All operations operate on data types of the same size.
  10. Some forms of if-then-else patterns can be vectorized.
  11. New: Infrastructure for idiom recognition has been added. The first computation idiom that is recognized and vectorized is a multiplication of unsigned chars whose (unsigned short) product is converted back to unsigned char (a similar computation comes up in pixel blending; we support a simplified version that does not require operating on different data types).
  12. No reductions (sum += a[i]) or inductions (a[i] = i).
  13. Constants and invariants are supported (a[i] = 5, a[i] = x). New: Support for invariants was made more efficient.

Examples of newly vectorizable loops: loop, loop

2004-06-23, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__. The pointers have to point to an aligned address.
  6. All memory accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support in the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

New test-cases:

2004-06-17, apple-ppc-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Memory accesses are one-dimensional arrays whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restricted__. (new experimental feature)
  6. New: Loads (memory reads) can be unaligned by a known amount (e.g. access a[i+1], where array 'a' is aligned and i starts from 0). Stores (memory writes) still have to be aligned.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support in the target platform.
  9. All operations operate on data types of the same size.
  10. New: Some forms of if-then-else patterns can be vectorized (new experimental feature).
  11. No reductions (sum += a[i]) or inductions (a[i] = i).
  12. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

2004-06-04, lno-branch (and apple-ppc-branch)
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. New: No restrictions on the form of the loop index and the loop exit condition.
  4. New: The loop bound (number of iterations) can be unknown. Currently known loop-bounds still need to be divisible by the vectorization factor.
  5. All memory accesses are one-dimensional arrays whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive and aligned; i.e., all the array references are of the form 'a[i]', where 'i' is updated from 0 to N in steps of 1.
  7. New: Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support in the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. New: Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

New test-cases:

2004-01-01, lno-branch
Description of vectorizable loops:
  1. Innermost loops that consist of a single basic block (i.e., straight-line code, no if-then-else).
  2. The loop has to be countable, i.e., the number of iterations can be evaluated before the loop starts to execute.
  3. Known (constant) loop bound, divisible by the vectorization factor.
  4. The loop index 'i' is updated from 0 to N in steps of 1, and the loop exit condition is of the form "i<N".
  5. All memory accesses are one-dimensional arrays whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, according to available vector support in the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. No Constants or invariants in the loop (a[i] = 5).

Examples of vectorizable loops: loop

New test-cases:

Related Links

Related discussion threads: http://gcc.gnu.org/ml/gcc/2003-07/msg01309.html.

GCC summit paper: http://people.redhat.com/lockhart/.gcc04/MasterGCC-2side.pdf

This web page is managed by Dorit Naishlos <dorit at il dot ibm dot com>.

In The Works

Current contributors to this project include Olga Golovanevsky, Devang Patel, Ayal Zaks and Dorit Naishlos. Currently in the works:

  1. Merge the vectorizer from lno-branch to mainline
  2. Port new functionality from apple-ppc-branch to lno-branch
  3. Loop peeling to force alignment of accesses in the loop.
  4. Loop versioning, to create a version of the loop in which all accesses are aligned.
  5. Support for reduction and induction

High-Level Plan of Implementation

The table below outlines the high-level vectorization scheme along with a proposal for an implementation scheme. For each utility of the vectorization driver, a description is followed by the restrictions under which the basic vectorizer operates; relaxing these restrictions is the subject of the basic vectorizer enhancements.
analyze_loop_CFG(loop)

Checks the control flow properties of the loop (number of basic-blocks it consists of, nesting, single entry/exit, etc.), in order to determine whether the control flow of the loop falls within the range of loop forms that are supported by this vectorizer.

  • inner-most (single nest) loops.
  • single basic block loops; i.e., no if-then-else constructs, etc. (in practice this means loops that consist of exactly two basic blocks - header+latch).
  • other restrictions (single successor/predecessor, a pre-header block).

analyze_loop_index_and_bound(loop)

Analyzes the loop termination condition to determine the loop bound and properties of the loop index (its bounds and step). The functionality of this utility should be largely provided by the information computed by the Induction Variable Analyzer.

  • handle simple normalized loops (loop index is a trivial IV with step 1, etc.), with a simple termination condition.
  • loop-bound known at compile time.
  • loop-bound >= vector_size and loop-bound % vector_size = 0.

analyze_loop_stmts(loop-stmts)

Scan the loop statements and check whether there are any statements that prohibit vectorization (function calls, statements that don't have a mapping to a built-in vector function, etc.)

  • simple operations for which there's a 1-1 mapping between the scalar and vector operations.
  • no support for scalar expansion, induction variables, reduction operations...
  • no mixture of data types (all statements operate on the same data types).

analyze_access_pattern(loop-mem-refs)

Analyze the memory references in the loop, and classify them according to the access pattern that they exhibit.

  • support only memory accesses which are array references (no pointers...).
  • support only consecutive (unit stride) access pattern.

analyze_alignment(loop-mem-refs)

Analyze the alignment of the memory references in the loop. For each memory reference, record its misalignment amount, if it can be resolved at compile time.

  • misalignment amount for all memory references is known at compile time.
  • misalignment is zero for all references (all references are aligned).

analyze_loop_carried_dependences(loop)

Build the loop dependence graph (for scalar and array references); Detect Strongly Connected Components (SCCs) in the graph (statements that are involved in a dependence cycle); Perform a topological sort on the reduced graph (in which each SCC is represented by a single node); Only singleton nodes w/o self dependencies can be vectorized. If other (compound) nodes (which represent SCCs) are present, loop transformations are required.

  • handle only loops that do not contain any SCCs (i.e., no dependence cycles).
  • the only scalar loop-carried dependencies allowed are those of IVs that are used in array references or for the loop index (i.e., reduction is not supported).
  • use simplest dependence tests.

estimate_vectorization_profitability(loop)

At this point, it has been determined that the loop is vectorizable. It remains to decide whether it is indeed profitable to vectorize it.

  • vectorize all loops with loop_bound >= MIN_LIMIT (?).

vectorize_loop(loop)

Replace the scalar statements with the corresponding vector statements (which could be calls to builtin functions), and change the loop bound accordingly.

List of Vectorization Related Tasks (Todo List)

The following is a list of independent directions by which the basic vectorizer can be enhanced. It should be possible for different people to work on different items on this list. Some of these items are already under development, or (partially) supported.

  1. Loop detection and loop CFG analysis

    Detect loops, and record some basic control flow information about them (contained basic blocks, loop pre-header, exit and entry, etc.).

    Status: Loop detection and control flow analysis is already supported (cfgloop.c, cfgloopanal.c).

  2. Modeling the target machine's vector capabilities at the tree level.

    Expose the required target specific information to the tree level. This includes providing a mapping from scalar operations to the corresponding vector support, which will answer the following questions:

    1. Does vector support for this operation exist?
    2. At what cost?
    3. How to express the vector operation at the tree-level?

    The general SIMD support in GCC already provides some initial support. For simple operations that can be expressed using existing (scalar) tree-codes (PLUS_EXPR, MULT_EXPR, etc.), the existing infrastructure can answer questions 1 and 3 above; however, the tree level currently has no notion of the cost that this transformation actually entails. A first basic implementation will support only simple operations that fall into this category.

    As the capabilities of the vectorizer are extended, the vectorizer will need to be informed of the advanced capabilities available in the architecture (for example, support for operations on complex numbers, reduction, etc.). Such operations cannot be expressed using existing tree-codes. Possible solutions: introduce new tree-codes (and corresponding optabs); introduce new builtins that are exposed to the compiler; or use target hooks to handle these cases (the hook could return a call to a machine-specific builtin function).

    Another related design question is how much information to expose to the tree level: is it sufficient to indicate that conditional vector addition is supported, or do we want the vectorizer to actually generate the required masking/predication/select operations depending on the target? Similarly for alignment, multiplication of integers, etc.

    Status: Open for discussion. Contact: Dorit Naishlos.
    Related discussion: http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html

  3. Enhance the Builtins Support

    Currently the tree optimizers do not know the semantics of target-specific builtin functions, so they do not attempt to optimize them (or to SSA the variables passed as arguments to these functions). Since the vectorizer will probably end up generating calls to target-specific builtin functions, this situation needs to be improved; i.e., the semantics of these builtins needs to somehow be exposed to the compiler.

    Status: Open for discussion.

  4. Cost Model

    There is an overhead associated with vectorization -- moving data into/out of vector registers before/after the vectorized loop, aligning data accesses, etc. A cost model needs to be incorporated into the machine description to allow the vectorizer to evaluate whether it is worthwhile to vectorize a given loop. One can also consider using run-time tests to decide which version of the loop to execute (scalar or vectorized).

    Status: Open for discussion. Contact: Dorit Naishlos.
    Related discussion: http://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html

  5. Induction Variable Analysis

    Used by the vectorizer to detect the loop bound, analyze access patterns, and analyze data dependences between array references. One option is that the dependence tests would be designed to deal with array references that are expressed in terms of a linear function of the iteration counter (in this case, vectorization would also benefit from optimizations like induction variable substitution and replacement of auxiliary IVs with linear functions of the loop index). A dependence tester based on IVs represented in this form would analyze each subscript of each array reference and apply the appropriate dependence test (SIV, ZIV, MIV, etc.; see dependence testing). Alternatively, an induction variable evolution analyzer could provide a different implementation of the dependence tester. This is the solution currently used, based on the induction variable evolution analysis developed by Sebastian Pop.

    Status: Using the IV evolution analyzer developed by Sebastian Pop.

  6. Dependence Testing

    Following the classic dependence-based approach for vectorization as described in [1], apply dependence tests to pairs of array references in each loop nest, and analyze the resulting dependence graph. We will start from a dependence analyzer that relies on the array references being expressed in terms of a linear function of the loop index, apply the simplest dependence tests to all pairs of memory read/write and write/write, and gradually extend its capabilities. The scheme below follows the algorithm described in [2]:

    Status: Some preliminary trivial tests are implemented to prove independence.

  7. Access Pattern Analysis

    The memory architecture usually allows only restricted accesses to data in memory; one of the restrictions is that the accessed data must be consecutive in memory. Any other access (strided, for example) requires performing a special permutation of the data in order to pack the data elements in the right order into a vector register. Support for different access patterns consists of the following stages:

    In order to facilitate a scheme that trivially replaces each operation with its equivalent SIMD instruction, a preprocessing pass may need to take place in order to convert the scalar operations into a proper (scalar) form. For example, the following code sequence is generated in order to represent the expression 'a[2*i+1] = b[i]':

    1. T.1 = i * 2;
    2. T.2 = T.1 + 1;
    3. a[T.2] = b[i];

    These operations (1 and 2 in the example) determine the data permutation that will be performed; however, they do not need to be directly vectorized. It may be required to mark such code to indicate that no vector code needs to be generated for it; instead, we may want to consider collapsing this code into the relevant array reference, similarly to forward substitution (but this would probably mean breaking the GIMPLE conventions?).

    Status: Currently, the access function is determined using the evolution analyzer. Statements that are only used for address computations will be detected and handled accordingly. Contact: Dorit Naishlos.

  8. Extend the range of supportable operations

    At first, the only computations that will be vectorized are those for which the vectorization process consists of trivially replacing each scalar operation in the loop with its vector counterpart. This includes simple loads, stores and arithmetic operations that operate on the same data type. Some computations require extra code to be generated in order to vectorize. These include:

    Status: Depending on the design decisions with respect to the machine modeling, some of these issues may be hidden in the machine description, relieving the vectorizer of the need to generate the special extra code. Contact: Dorit Naishlos.

  9. Alignment

    The memory architecture usually allows only restricted accesses to data in memory. One of the restrictions is that data accesses need to be properly aligned on a certain boundary. Even if the architecture supports unaligned accesses, these are usually much more costly than aligned accesses. The work on alignment consists of several stages:

    Status: Preliminary implementation of some of this functionality is in apple-ppc-branch. Peeling and versioning for alignment are in the works. Contact: Dorit Naishlos, Olga Golovanevsky.

  10. Idiom Recognition

    It is often the case that complicated computations can be reduced into a simpler, straight-line sequence of operations. These operations may not be directly supported in a scalar form, but are supported by the target in a vector form. Such cases include:

    Status: Contact: Dorit Naishlos.

  11. Conditional Execution

    The general principle we are trying to follow is to keep the actual code transformation part of the vectorizer as simple as possible: a simple scan of straight-line code, and a one-to-one replacement of each scalar operation with the equivalent vector operation. To support this scheme in the presence of conditional execution, we will need to flatten the loop body by collapsing if-then-else into a conditional (scalar) operation (something like transforming 'if (x) {c = PLUS (a,b)}' into 'PLUS_COND(a,b,x)'). These will later be replaced with a conditional vector operation using whatever support is available in the target (masking, predication or a select operation). Flattening the loop body this way will greatly simplify the vectorizer. Some of the issues to think about here: (a) how to represent these conditional operations, (b) to what extent the tree vectorizer needs to be aware of the specific target support that is available for conditional vector execution (mask/predicate/select), and (c) how to allow a simple way to reverse this transformation if the loop does not end up getting vectorized.

    Status: Contact: Devang Patel.

  12. Handle Advanced Loop Forms

    1. Support a general loop bound (unknown, or not divisible by the vector size). [done; contact: Olga Golovanevsky]
    2. Support more complex forms of loop termination condition and loop index update. [done]
    3. Support outer-loop vectorization (unroll and jam).
    4. Relax other restrictions on the loop form.

    Status: Future work.

  13. Handle Pointer Aliasing

    1. Develop anti-aliasing analysis.
    2. Generate run-time tests for cases where memory anti-aliasing cannot be resolved at compile time.
    3. Support user hints?

    Status: Some aliasing infrastructure is available. Andersen points-to analysis also available.

  14. Aliasing and Virtual def-use Chains

    Address the item from the tree-ssa todo list - "SSA information for arrays : The existing implementation treats arrays as an opaque object. A definition to an array location is treated as a definition for the whole array"

    Status: Open for discussion.

  15. Array addressing Represented as Pointer Arithmetic

    Address the issue mentioned in http://gcc.gnu.org/ml/gcc/2003-07/msg02013.html, which turns out to be a front end issue.

    Status: Open for discussion.

  16. Loop versioning

    Provide utilities that allow performing the following transformation: given a condition and a loop, create 'if (condition) { loop_copy1 } else { loop_copy2 }', where loop_copy1 is the loop transformed in one way, and loop_copy2 is the loop transformed in another way (or unchanged). 'condition' may be a run-time test for things that were not resolved by static analysis (overlapping ranges (anti-aliasing), alignment, etc.).

    Status: Contact: Devang Patel.

  17. Loop Transformations to Increase Vectorizability of Loops

    These include:

    Status: Linear loop transformations are implemented by Daniel Berlin.

  18. Other Optimizations

    1. Exploit data reuse (a la "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures" by Shin, Chame and Hall). Mostly relies on unroll & jam having been applied.
    2. Vectorize loops that can't be vectorized using the classic vectorizer (until the proper loop transformations are developed) by applying SLP vectorization (a la "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" by Amarasinghe and Larsen). This scheme can potentially more easily vectorize partially vectorizable loops, or loops that are already unrolled in the source code. It is possible to implement SLP vectorization either in the tree level or at the RTL level as a complementary approach to classic loop vectorization.

    Status: Future work.

  19. Other Issues to consider

    1. Using user hints for different purposes (aliasing, alignment, profitability of vectorizing a loop, etc.).
    2. Decide which of the items listed above should take place at the GIMPLE tree level, and which at the RTL level. For example, anything that could affect the decision whether to vectorize a loop should take place at the tree level (to avoid having to undo transformations); also, many of the things that can take place at the RTL level may be simpler to implement at the tree level. However, this would mean that a lot of machine-dependent work will have to be done at the tree level.

    Status: Open for discussion.

(Incomplete) Reference list

  1. "Optimizing Compilers for Modern Architectures - A dependence based approach", Randy Allen & Ken Kennedy, Morgan Kaufmann Publishers, San Francisco, San Diego, New York (2001).
  2. "Practical Dependence Testing", Gina Goff, Ken Kennedy, Chau-Wen Tseng, CRPC-TR90103, November 1990.