Auto-vectorization in GCC

The goal of this project was to develop a loop and basic block vectorizer in GCC, based on the tree-ssa framework. It has been completed and the functionality has been part of GCC for years.

Table of Contents

Latest News

2011-10-23
  1. Vectorization of reduction in loop SLP. Both multiple reduction cycles and reduction chains are supported.
  2. Various basic block vectorization (SLP) improvements, such as better data dependence analysis, support of misaligned accesses and multiple types, cost model.
  3. Detection of vector size: https://gcc.gnu.org/ml/gcc-patches/2010-10/msg00441.html.
  4. Vectorization of loads with negative step.
  5. Improved realignment scheme: https://gcc.gnu.org/ml/gcc-patches/2010-06/msg02301.html.
  6. A new built-in, __builtin_assume_aligned, has been added, through which the compiler can be hinted about pointer alignment.
  7. Support of strided accesses using memory instructions that have the interleaving "built in", such as NEON's vldN and vstN.
  8. The vectorizer now attempts to reduce over-promotion of operands in some vector operations: https://gcc.gnu.org/ml/gcc-patches/2011-07/msg01472.html.
  9. Widening shifts are now detected and vectorized if supported by the target.
  10. Vectorization of conditions with mixed types.
  11. Support of loops with bool.
2009-12-03
The following new vectorization features were committed to mainline:
  1. vectorization of conditions in nested loop (2009-07-20)
  2. vectorization of double reduction (reductions carried by two nested loops) (2009-07-12)
  3. vectorization of nested cycles (dependence cycles other than reduction cycles in nested loops) (2009-06-16)
  4. support of misaligned stores for platforms that allow misaligned access (2009-06-05)
  5. basic block SLP (vectorization of straight line code) (2009-05-25)
  6. avoid versioning of vectorized loop if possible (2009-04-02) (https://gcc.gnu.org/ml/gcc-patches/2009-03/msg01784.html).
  7. support load permutations in loop-aware SLP (2008-08-25)
  8. support multiple types in loop-aware SLP (2008-08-19)
2008-08-11
  1. Vectorization supports integer type conversions also when one type is more than two times bigger than the other (e.g. char to int) (2008-08-11).
  2. UNITS_PER_SIMD_WORD can be different for different scalar types (2008-05-22).
  3. Vector shifts by a vector shift amount differentiated from vector shifts with scalar shift amount (2008-05-14).
  4. Complete unrolling enabled before vectorization, relying on intra-iteration vectorization (aka SLP) to vectorize unrolled loops (2008-04-27).
  5. Further refinements to the cost model (2007-12-06).
  6. -ftree-vectorize is turned on under -O3 (2007-09-18).

Contributing

This project was started by Dorit (Naishlos) Nuzman. Current contributors to this project include Revital Eres, Richard Guenther, Jakub Jelinek, Michael Matz, Richard Sandiford, and Ira Rosen. This web page is maintained by Ira Rosen <irar@il.ibm.com>. For a list of missing features and possible enhancements see https://gcc.gnu.org/wiki/VectorizationTasks.

Using the Vectorizer

Vectorization is enabled by the flag -ftree-vectorize and by default at -O3. To allow vectorization on powerpc* platforms also use -maltivec. On i?86 and x86_64 platforms use -msse/-msse2. To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.

The vectorizer test cases demonstrate the current vectorization capabilities; these can be found under gcc/gcc/testsuite/gcc.dg/vect/. Information on which loops were or were not vectorized and why, can be obtained using the flag -ftree-vectorizer-verbose. For details see https://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html. Example output using -ftree-vectorizer-verbose=2:

vect-1.c:82: note: not vectorized, possible dependence between data-refs a[i_124] and a[i_83]
vect-1.c:72: note: LOOP VECTORIZED.
vect-1.c:64: note: LOOP VECTORIZED.
vect-1.c:56: note: LOOP VECTORIZED.
vect-1.c:49: note: LOOP VECTORIZED.
vect-1.c:41: note: not vectorized: unsupported use in stmt.
vect-1.c:31: note: not vectorized: unsupported use in stmt.
vect-1.c:13: note: vectorized 4 loops in function.

Basic block vectorization, aka SLP, is enabled by the flag -ftree-slp-vectorize, and requires the same platform dependent flags as loop vectorization. Basic block SLP is enabled by default at -O3 and when -ftree-vectorize is enabled.

Vectorizable Loops

"feature" indicates the vectorization capabilities demonstrated by the example.

Example 1:
int a[256], b[256], c[256];
foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}
Example 2:
int a[256], b[256], c[256];
foo (int n, int x) {
   int i;

   /* feature: support for unknown loop bound  */
   /* feature: support for loop invariants  */
   for (i=0; i<n; i++)
      b[i] = x;
   }

   /* feature: general loop exit condition  */
   /* feature: support for bitwise operations  */
   while (n--){
      a[i] = b[i]&c[i]; i++;
   }
}
Example 3:
typedef int aint __attribute__ ((__aligned__(16)));
foo (int n, aint * __restrict__ p, aint * __restrict q) {

   /* feature: support for (aligned) pointer accesses.  */
   while (n--){
      *p++ = *q++;
   }
}
Example 4:
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];
foo (int n, aint * __restrict__ p, aint * __restrict__ q) {
   int i;

   /* feature: support for (aligned) pointer accesses  */
   /* feature: support for constants  */
   while (n--){
      *p++ = *q++ + 5;
   }

   /* feature: support for read accesses with a compile time known misalignment  */
   for (i=0; i<n; i++){
      a[i] = b[i+1] + c[i+3];
   }

   /* feature: support for if-conversion  */
   for (i=0; i<n; i++){
      j = a[i];
      b[i] = (j > MAX ? MAX : 0);
   }
}
Example 5:
struct a {
  int ca[N];
} s;
for (i = 0; i < N; i++)
  {
    /* feature: support for alignable struct access  */
    s.ca[i] = 5;
  }
Example 6: gfortran:
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END
Example 7:
int a[256], b[256];
foo (int x) {
   int i;

   /* feature: support for read accesses with an unknown misalignment  */
   for (i=0; i<N; i++){
      a[i] = b[i+x];
   }
}
Example 8:
int a[M][N];
foo (int x) {
   int i,j;

   /* feature: support for multidimensional arrays  */
   for (i=0; i<M; i++) {
     for (j=0; j<N; j++) {
       a[i][j] = x;
     }
   }
}
Example 9:
unsigned int ub[N], uc[N];
foo () {
  int i;

  /* feature: support summation reduction.
     note: in case of floats use -funsafe-math-optimizations  */
  unsigned int diff = 0;
  for (i = 0; i < N; i++) {
    udiff += (ub[i] - uc[i]);
  }
Example 10:
/* feature: support data-types of different sizes.
   Currently only a single vector-size per target is supported; 
   it can accommodate n elements such that n = vector-size/element-size 
   (e.g, 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes). 
   A combination of data-types of different sizes in the same loop 
   requires special handling. This support is now present in mainline,
   and also includes support for type conversions.  */

short *sa, *sb, *sc;
int *ia, *ib, *ic;
for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}
Example 11:
/* feature: support strided accesses - the data elements
   that are to be operated upon in parallel are not consecutive - they
   are accessed with a stride > 1 (in the example, the stride is 2):  */

for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}
Example 12: Induction:
for (i = 0; i < N; i++) {
  a[i] = i;
}
Example 13: Outer-loop:
  for (i = 0; i < M; i++) {
    diff = 0;
    for (j = 0; j < N; j+=8) {
      diff += (a[i][j] - b[i][j]);
    }
    out[i] = diff;
  }
}
Example 14: Double reduction:
  for (k = 0; k < K; k++) {
    sum = 0;
    for (j = 0; j < M; j++)
      for (i = 0; i < N; i++)
          sum += in[i+k][j] * coeff[i][j];

    out[k] = sum;
  }
Example 15: Condition in nested loop:
  for (j = 0; j < M; j++)
    {
      x = x_in[j];
      curr_a = a[0];

      for (i = 0; i < N; i++)
        {
          next_a = a[i+1];
          curr_a = x > c[i] ? curr_a : next_a;
        }

      x_out[j] = curr_a;
    }
Example 16: Load permutation in loop-aware SLP:
  for (i = 0; i < N; i++)
    {
       a = *pInput++;
       b = *pInput++;
       c = *pInput++;

       *pOutput++ = M00 * a + M01 * b + M02 * c;
       *pOutput++ = M10 * a + M11 * b + M12 * c;
       *pOutput++ = M20 * a + M21 * b + M22 * c;
    }
Example 17: Basic block SLP:
void foo ()
{
  unsigned int *pin = &in[0];
  unsigned int *pout = &out[0];

  *pout++ = *pin++;
  *pout++ = *pin++;
  *pout++ = *pin++;
  *pout++ = *pin++;
}
Example 18: Simple reduction in SLP:
int sum1;
int sum2;
int a[128];
void foo (void)
{
  int i;

  for (i = 0; i < 64; i++)
    {
      sum1 += a[2*i];
      sum2 += a[2*i+1];
    }
}
Example 19: Reduction chain in SLP:
int sum;
int a[128];
void foo (void)
{
  int i;

  for (i = 0; i < 64; i++)
    {
      sum += a[2*i];
      sum += a[2*i+1];
    }
}
Example 20: Basic block SLP with multiple types, loads with different offsets, misaligned load, and not-affine accesses:
void foo (int * __restrict__ dst, short * __restrict__ src,
          int h, int stride, short A, short B)
{
  int i;
  for (i = 0; i < h; i++)
    {
      dst[0] += A*src[0] + B*src[1];
      dst[1] += A*src[1] + B*src[2];
      dst[2] += A*src[2] + B*src[3];
      dst[3] += A*src[3] + B*src[4];
      dst[4] += A*src[4] + B*src[5];
      dst[5] += A*src[5] + B*src[6];
      dst[6] += A*src[6] + B*src[7];
      dst[7] += A*src[7] + B*src[8];
      dst += stride;
      src += stride;
    }
}
Example 21: Backward access:
int foo (int *b, int n)
{
  int i, a = 0;

  for (i = n-1; i ≥ 0; i--)
    a += b[i];

  return a;
}
Example 22: Alignment hints:
void foo (int *out1, int *in1, int *in2, int n)
{
  int i;

  out1 = __builtin_assume_aligned (out1, 32, 16);
  in1 = __builtin_assume_aligned (in1, 32, 16);
  in2 = __builtin_assume_aligned (in2, 32, 0);

  for (i = 0; i < n; i++)
    out1[i] = in1[i] * in2[i];
}
Example 23: Widening shift:
void foo (unsigned short *src, unsigned int *dst)
{
  int i;

  for (i = 0; i < 256; i++)
    *dst++ = *src++ << 7;
}
Example 24: Condition with mixed types:
#define N 1024
float a[N], b[N];
int c[N];

void foo (short x, short y)
{
  int i;
  for (i = 0; i < N; i++)
    c[i] = a[i] < b[i] ? x : y;
}
Example 25: Loop with bool:
#define N 1024
float a[N], b[N], c[N], d[N];
int j[N];

void foo (void)
{
  int i;
  _Bool x, y;
  for (i = 0; i < N; i++)
    {
      x = (a[i] < b[i]);
      y = (c[i] < d[i]);
      j[i] = x & y;
    }
}

Unvectorizable Loops

Examples of loops that currently cannot be vectorized:

Example 1: Uncountable loop:
while (*p != NULL) {
  *q++ = *p++;
}

Also see https://gcc.gnu.org/wiki/VectorizationTasks and a list of vectorizer missed-optimization PRs in the GCC bug tracker.

Previous News and Status

2007-09-17
  1. -ftree-vectorize is going to be turned on under -O3.
  2. Cost-model tweaks and x86_64 specific costs committed to mainline (2007-09-10).
  3. Vectorization that exploits intra-iteration parallelism (ala SLP) was committed to mainline (2007-09-09).
  4. -fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats (2007-09-04).
  5. Initial support for vectorization of outer-loops (doubly nested loops) was committed to mainline (2007-08-19).
  6. Run-time dependence testing using loop-versioning was committed to mainline (2007-08-16).
2007-07-25
The new vectorization feature that exploits intra-iteration parallelism (to be submitted to GCC 4.3) was presented at the GCC summit last week ("Loop-aware SLP"). Also at the summit, we held a BOF on vectorization and other loop optimizations. The summary of the BOF can be found here: https://gcc.gnu.org/wiki/LoopOptimizationsBOF. Following the BOF the vectorizer's todo list was also updated ( https://gcc.gnu.org/wiki/VectorizationTasks).
Mainline updates:
  1. SPU specific costs for the cost model committed to mainline (2007-07-12). Tuning for other platforms (PPC, x86) ongoing.
  2. Initial cost model implementation committed to mainline (2007-06-08).
  3. Vectorization of fp/integer conversions of different sizes (e.g. float/short) committed to mainline (2007-05-17).
  4. Data-refs analysis was rewritten and improved (2007-05-13).
Autovect-branch updates:
  1. Outer-loop vectorization was enhanced to support aligned and unaligned memory references in the inner-loop, using the optimized realignment scheme when possible.
  2. Vectorization that exploits intra-iteration parallelism (ala SLP) was added to the vectorizer (that so far exploited only inter-iteration parallelism).
  3. The vectorizer cost model was extended to support the above two new vectorization features (outer-loop and "SLP").
2007-05-09
  1. Vectorization of fp/integer conversions of different sizes (e.g. float/short) is soon to be committed to mainline.
  2. Initial cost model implementation is soon to be committed to mainline.
2007-04-22
  1. Vectorization of float/double conversions was added to mainline.
  2. Initial vectorization support for certain forms of outer-loops (doubly nested loops) was added to autovect-branch.
2007-02-21
  1. Vectorization of float/int conversions added to mainline.
  2. Vectorization of inductions added to mainline.
  3. Vectorization of function-calls added to mainline.
  4. Improvements to vectorization of strided-accesses - added to autovect-branch.
  5. Discussion on building a cost-model for the vectorizer - started on the mailing list.
2007-01-14
  1. A new flag to limit vectorization to loops with large enough loop-bound was added: --param min-vect-loop-bound=X prevents vectorization of loops whose vectorized loop-bound is equal or less than X.
  2. Autovect branch has been reopened: ( https://gcc.gnu.org/ml/gcc/2007-01/msg00117.html).
2006-11-27
  1. Vectorization of loops that operate on multiple data-types, including type conversions: incorporated into GCC 4.3.
  2. Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides: incoporated into GCC 4.3.
  3. Additional vectorizer related projects planned for GCC 4.3 can be found here: ( https://gcc.gnu.org/wiki/AutovectBranchOptimizations).
2006-02-19
  1. Vectorization of loops that operate on multiple data-types, including type conversions: submitted for incorporation into GCC 4.2.
  2. Detection and vectorization of special idioms, such as dot-product and widening-summation: Incorporated into GCC 4.2.
  3. Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides: Incoporated into autovect-branch. To be submitted to GCC 4.2.
2005-10-23
Autovect-branch has been enhanced to support the following features:
  1. Vectorization of loops that operate on multiple data-types, including type promotion (conversion to a wider type) and type demotion (conversion to a narrower type). Type promotion is supported using the new VEC_UNPACK_HI and VEC_UNPACK_LO tree-codes (and the new vec_unpacks_hi/lo and vec_unpacku_hi/lo optabs). Type demotion is supported using the new VEC_PACK_MOD tree-code (and the new vec_pack_mod optab).
  2. Vectorization of idioms that involve type conversion. This allows more efficient vectorization (if specialized target support is available) that avoids the data packing/unpacking that is otherwise required to handle multiple data-types. These idioms include: widening-summation (WIDEN_SUM), dot-product (DOT_PROD), widening-multiplication (WIDEN_MULT, VEC_WIDEN_MULT_HI/LO), multiply-highpart (MULT_HI) and sum-of-absolute-differences (SAD).
2005-08-11
The following enhancements have been incorpoated into GCC 4.1:
  1. Vectorization of reduction has been introduced and currently supports summation, and minimum/maximum computation.
  2. Vectorization of conditional code has been introduced.
  3. The mechanism of peeling a loop to force the alignment of data accesses has been improved. We now generate better code when the misalignment of an access is known at compile time, or when different accesses are known to have the same misalignment, even if the misalignment amount itself is unknown.
  4. Dependence distance is considered when checking dependences between data references.
  5. Loop-closed-ssa-form is incrementally updated during vectorization.
  6. Generic parts of the data references analysis were cleaned up and externalized to make this analysis available to other passes.
2005-04-05
Vectorization of reduction on autovect-branch was enhanced to support maximum/minimum computations, and special reduction idioms such as widening summation as a step towards supporting patterns like dot-product, sum of absolute differences and more: ( https://gcc.gnu.org/ml/gcc-patches/2005-04/msg00532.html).
2005-03-01
Vectorization projects for GCC 4.1: See https://gcc.gnu.org/wiki/Autovectorization_Enhancements.
Vectorization capabilities in GCC 4.0: See 2005-03-01 mainline status.
2005-02-25
New features:
  1. Summation is the first reduction idiom that is vectorized (autovect-branch only).
  2. Verbosity levels for vectorization reports.
  3. Improved line number information.
  4. Revised data-references analysis.
2005-03-01, autovect-branch
Description of vectorizable loops:
  1. Vectorization is restricted to inner most countable loops, in which all operations operate on data types of the same size, and all memory accesses are consecutive.
  2. Certain forms of conditional code.
  3. Unaligned memory accesses are handled using loop peeling, or loop versioning, or direct misalignment support.
  4. Summation reduction is supported (sum += a[i]).
  5. Constants and invariants are supported (a[i] = 5, a[i] = x).
  6. Vectorization of the subtract-and-saturate idiom is supported.
2005-03-01, mainline (final 4.0 status)
Description of vectorizable loops:
  1. Vectorization is restricted to inner most countable loops, in which all operations operate on data types of the same size, and all memory accesses are consecutive.
  2. Unaligned memory write accesses are handled using loop peeling. This allows vectorization when there's only a single unaligned memory-write (or all memory-writes in the loop have the same misalignment). Unaligned memory read accesses are handled using direct misalignment support. This support is currently available for Alpha ev6, mmx, and altivec.
  3. Constants and invariants are supported (a[i] = 5, a[i] = x).
  4. No reductions (sum += a[i]) or inductions (a[i] = i).
  5. -ftree-vectorizer-verbose=[n] controls vectorization reports, with n ranging from 0 (no information reported) to 6 (all information reported).
2005-02-02
New features that are only in autovect-branch:
  1. Incrementally preserve loop closed SSA form during vectorization.
  2. Loop versioning guarded by a runtime alignment test.
  3. Idiom recognition, and vectorization of the subtract-and-saturate idiom.
  4. Incrementally preserve SSA form during vectorization.
  5. Improved handling of misalignment in case it is known at compile time, or in case multiple accesses are known to have the same misalignment.
  6. Vectorization of conditional code.
  7. Consider dependence distance.
2004-10-27, mainline
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. New: No restrictions on the loop bound. An epilog loop is generated to handle unknown loop bounds and loop bounds that do not divide by the vectorization factor.
  3. Supported memory accesses are multidimensional arrays, and pointers that are annotated as __restrict__.
  4. All memory accesses are consecutive (stride=1).
  5. New: Loops with a single unaligned write to memory (store). This is supported by peeling the loop to force the alignment of the store.
  6. Reads from memory (loads) can be unaligned. This support is currently available for Alpha ev6, mmx, and altivec.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-11-10
New branch for vectorization development opened: autovect-branch. lno-branch was retired. ( https://gcc.gnu.org/ml/gcc-patches/2004-11/msg00852.html).
2004-10-27
Mainline merge in progress. Last pending vectorization patches for mainline:
Support for vectorization of conditional code ( https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01768.html).
Consider dependence distance ( https://gcc.gnu.org/ml/gcc-patches/2004-10/msg01046.html).
2004-10-14
Support for unknown loop bound ( https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01104.html, vectorizer merge part #2) committed to mainline.
Peeling for alignment ( https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01894.html, vectorizer merge part #5) committed to mainline.
2004-09-27, mainline
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop bound (number of iterations) is known and divides by the vectorization factor.
  3. New: Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restrict__.
  4. New: Additional forms of accesses are supported: Restrictions on the the initial value of the access function of pointers and array indexes have been relaxed. These can now take a form like p=&a[16]-4B (pointer), and a[i+off] (arrays).
  5. All memory accesses are consecutive (stride=1).
  6. Writes to memory (stores) are aligned.
  7. New: Reads from memory (loads) can be unaligned.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-09-23
Support for unaligned loads ( https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01522.html, vectorizer merge part #4) committed to mainline.
2004-09-19
Support for additional forms of data references ( https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01521.html, vectorizer merge part #3) committed to mainline.
2004-09-13, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restrict__.
  4. New: Additional forms of accesses are supported: Restrictions on the the initial value of the access function of pointers and array indexes have been relaxed. These can now take a form like p=&a[16]-4B (pointer), and a[i+off] (arrays).
  5. All memory accesses are consecutive (stride=1) and aligned.
  6. Writes to memory (stores) are aligned.
  7. New: Reads from memory (loads) can be unaligned.
  8. Loop versioning for alignment is applied to unaligned stores.
  9. All operations operate on data types of the same size.
  10. No reductions (sum += a[i]) or inductions (a[i] = i).
  11. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-09-02, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. New: The loop bound (number of iterations) can be unknown at compile time, or can be known but not divisible by the vectorization factor.
  4. New: Supported memory accesses are multi-dimensional arrays, and pointers that are annotated as __restrict__
  5. All memory accesses are consecutive (stride=1).
  6. New: Loop versioning for alignment: in the presence of unaligned accesses create two versions of the loop controlled by a runtime alignment check. In one version all the accesses are guaranteed to be aligned, and it can therefore be vectorized. In the second version there may be unaligned accesses, and it remains unvectorized.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
2004-08-17, mainline
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop bound (number of iterations) is known and divides by the vectorization factor.
  3. Supported memory accesses are one-dimensional arrays, whose alignment can be forced (not extern, and stack boundary of target platform allows it), and aligned pointers that are annotated as __restrict__.
  4. All memory accesses are consecutive (stride=1) and aligned.
  5. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  6. All operations operate on data types of the same size.
  7. No reductions (sum += a[i]) or inductions (a[i] = i).
  8. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of vectorizable loops: loop, loop

2004-08-17, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. New: Supported memory accesses are multidimensional arrays, whose alignment can be forced (not extern), and aligned pointers that are annotated as __restrict__.
  5. All memory accesses are consecutive (stride=1) and aligned.
  6. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

2004-08-17
Initial vectorization functionality committed to mainline ( https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01219.html, https://gcc.gnu.org/ml/gcc-patches/2004-07/msg02127.html, vectorizer merge part #1).
2004-07-20, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. New: Supported memory accesses are one-dimensional arrays, whose base can be a struct field, and whose alignment can be forced (not extern arrays), and aligned pointers that are annotated as __restrict__.
  5. All memory accesses are consecutive (stride=1) and aligned. Arrays are alignable
  6. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  7. All operations operate on data types of the same size.
  8. No reductions (sum += a[i]) or inductions (a[i] = i).
  9. Constants and invariants are supported (a[i] = 5, a[i] = x).
  10. New: first gfortran program vectorized.

Examples of newly vectorizable loops: loop, loop

2004-06-25, apple-ppc-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays, and pointers. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restrict__.
  5. Store (memory-write) accesses have to be aligned.
  6. New: Loads (memory-reads) can be unaligned by an unknown amount (e.g. access a[i+x], where the value of x is unknown). Misalignment support for loads was also made more efficient.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  9. All operations operate on data types of the same size.
  10. Some forms of if-then-else patterns can be vectorized.
  11. New: Infrastructure for idiom recognition has been added. The first computation idiom that is recognized and vectorized is a multiplication of unsigned chars, whose (unsigned short) product is converted back to unsigned char (a similar computation comes up in pixel blending; we support a simplified version that does not require operating on different data types).
  12. No reductions (sum += a[i]) or inductions (a[i] = i).
  13. Constants and invariants are supported (a[i] = 5, a[i] = x). New: Support for invariants was made more efficient.

Examples of newly vectorizable loops: loop

2004-06-23, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Supported memory accesses are one-dimensional arrays, whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restrict__. The pointers have to point to an aligned address.
  6. All memory accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

2004-06-17, apple-ppc-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. The loop bound (number of iterations) can be unknown.
  4. Memory accesses are one dimensional-arrays, whose alignment can be forced (not extern arrays).
  5. New: Memory accesses can also be pointer based. If more than one memory access is present in the loop, any pointers that are used in the loop have to be annotated as __restrict__. (new experimental feature)
  6. New: Loads (memory reads) can be unaligned by a known amount (e.g. access a[i+1], where array a is aligned and i starts from 0). Stores (memory writes) still have to be aligned.
  7. All memory accesses are consecutive (stride=1).
  8. Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  9. All operations operate on data types of the same size.
  10. New: Some forms of if-then-else patterns can be vectorized. (new experimental feature).
  11. No reductions (sum += a[i]) or inductions (a[i] = i).
  12. Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

2004-06-04, lno-branch (and apple-ppc-branch)
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. New: No restrictions on the form of the loop index and the loop exit condition.
  4. New: The loop bound (number of iterations) can be unknown. Currently known loop-bounds still need to be divisible by the vectorization factor.
  5. All memory accesses are one dimensional-arrays, whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive and aligned; i,e. all the array references are of the form a[i], where i is updated from 0 to N in steps of 1.
  7. New: Supportable operations include plus/minus/mult, as well as bitwise operations - and/or/xor/1's-complement, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. New: Constants and invariants are supported (a[i] = 5, a[i] = x).

Examples of newly vectorizable loops: loop

2004-01-01, lno-branch
Description of vectorizable loops:
  1. Inner most loops that consist of a single basic block (i.e, straight-line code, no if-then-else).
  2. The loop has to be countable - i.e, the number of iterations can be evaluated before the loop starts to execute.
  3. Known (constant) loop bound, divisible by the vectorization factor.
  4. The loop index i is updated from 0 to N in steps of 1, and the loop exit condition is of the form i<N.
  5. All memory accesses are one dimensional-arrays, whose alignment can be forced (not extern arrays).
  6. All array accesses are consecutive (stride=1) and aligned.
  7. Supportable operations include plus/minus/mult, according to available vector support on the target platform.
  8. All operations operate on data types of the same size.
  9. No reductions (sum += a[i]) or inductions (a[i] = i).
  10. No Constants or invariants in the loop (a[i] = 5).

Examples of vectorizable loops: loop

References/Documentation

  1. "Vapor SIMD: Auto-vectorize once, run everywhere", Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams, David Yuste, Albert Cohen, Ayal Zaks, CGO 2011: 151-160
  2. "Polyhedral-Model Guided Loop-Nest Auto-Vectorization", Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks and Ira Rosen, PACT 2009
  3. "Outer-loop vectorization: revisited for short SIMD architectures", Dorit Nuzman and Ayal Zaks, PACT 2008
  4. "Loop-Aware SLP in GCC", Ira Rosen, Dorit Nuzman and Ayal Zaks, GCC summit, July 2007. GCC Summit 2007 Proceedings
  5. "Autovectorization in GCC - two years later", Dorit Nuzman and Ayal Zaks, GCC summit, June 2006. GCC Summit 2006 Proceedings
  6. "Auto-Vectorization of Interleaved Data for SIMD", Dorit Nuzman, Ira Rosen and Ayal Zaks. ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation (PLDI), Ottawa, Canada, June 10-16, 2006.
  7. "Multi-platform Auto-vectorization", Dorit Nuzman and Richard Henderson, CGO-4 (The 4th Annual International Symposium on Code Generation and Optimization), March 26-29, 2006, Manhattan, New York.
  8. "Autovectorization in GCC", Dorit Naishlos, GCC summit, June 2004. https://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf
  9. "The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance.", Aart Bik, Intel Press, June 2004.
  10. "Vectorization for SIMD Architectures with Alignment Constraints", Alexandre E. Eichenberger, Peng Wu, Kevin O'brien, PLDI'04, June 9-11 2004.
  11. "Optimizing Compilers for Modern Architectures - A dependence based approach", Randy Allen & Ken Kennedy, Morgan Kaufmann Publishers, San Francisco, San Diego, New York (2001).
  12. "Exploiting Superword Level Parallelism with Multimedia Instruction Sets", Samuel Larsen and Saman Amarasinghe, PLDI 2000.

High-Level Plan of Implementation (2003-2005)

The table below outlines the high level vectorization scheme along with a proposal for an implementation scheme, as follows:

vectorization driver basic vectorizer enhancements
analyze_loop_CFG(loop)

Checks the control flow properties of the loop (number of basic-blocks it consists of, nesting, single entry/exit, etc.), in order to determine whether the control flow of the loop falls within the range of loop forms that are supported by this vectorizer.

  • inner-most (single nest) loops.
  • single basic block loops; i.e., no if-then-else constructs, etc. (in practice this means loops that consist of exactly two basic blocks - header+latch).
  • other restrictions (single successor/predecessor, a pre-header block).

analyze_loop_index_and_bound(loop)

Analyzes the loop termination condition to determine the loop bound and properties of the loop index (its bounds and step). The functionality of this utility should be largely provided by the information computed by the Induction Variable Analyzer.

  • handle simple normalized loops (loop index is a trivial IV with step 1, etc.), with a simple termination condition.
  • loop-bound known at compile time.
  • loop-bound >= vector_size and loop-bound % vector_size = 0.

analyze_loop_stmts(loop-stmts)

Scan the loop statements and check whether there are any statements that prohibit vectorization (function calls, statements that don't have a mapping to a built-in vector function, etc.)

  • simple operations for which there's a 1-1 mapping between the scalar and vector operations.
  • no support for scalar expansion, induction variables, reduction operations...
  • no mixture of data types (all statements operate on the same data types).

analyze_access_pattern(loop-mem-refs)

Analyze the memory references in the loop, and classify them according to the access pattern that they exhibit.

  • support only memory accesses which are array references (no pointers...).
  • support only consecutive (unit stride) access pattern.

analyze_alignment(loop-mem-refs)

Analyze the alignment of the memory references in the loop. For each memory reference, record its misalignment amount, if it can be resolved at compile time.

  • misalignment amount for all memory references is known at compile time.
  • misalignment is zero for all references (all references are aligned).

analyze_loop_carried_dependences(loop)

Build the loop dependence graph (for scalar and array references); Detect Strongly Connected Components (SCCs) in the graph (statements that are involved in a dependence cycle); Perform a topological sort on the reduced graph (in which each SCC is represented by a single node); Only singleton nodes w/o self dependencies can be vectorized. If other (compound) nodes (which represent SCCs) are present, loop transformations are required.

  • handle only loops that do not contain any SCCs (i.e., no dependence cycles).
  • the only scalar loop-carried dependencies allowed are of IVs which are used in array references or for the loop index (i.e., reduction is not supported).
  • use simplest dependence tests.

estimate_vectorization_profitability(loop)

At this point, it has been determined that the loop is vectorizable. It remains to decide whether it is indeed profitable to vectorize it.

  • vectorize all loops with loop_bound >= MIN_LIMIT (?).

vectorize_loop(loop)

Replace the scalar statements with the corresponding vector statements (which could be calls to builtin functions); Also change the loop bound accordingly.

The following is a list of independent directions by which the basic vectorizer can be enhanced. It should be possible for different people to work on different items on this list. Some of these items are already under development, or (partially) supported.

  1. >Loop detection and loop CFG analysis

    Detect loops, and record some basic control flow information about them (contained basic blocks, loop pre-header, exit and entry, etc.).

    Status: Loop detection and control flow analysis is already supported (cfgloop.c, cfgloopanal.c).

  2. Modeling the target machine vector capabilities to the tree-level.

    Expose the required target specific information to the tree level. This includes providing a mapping from scalar operations to the corresponding vector support, which will answer the following questions:

    1. Does vector support for this operation exist?
    2. At what cost?
    3. How to express the vector operation at the tree-level?

    The general SIMD support in GCC already provides some initial support; For simple operations which can be expressed using existing (scalar) tree-codes (PLUS_EXPR, MULT_EXPR, etc.) the existing infra-structure can provide answers for questions 1 and 2 above, however, the tree-level currently does not have an idea about the cost that this transformation actually entails. A first basic implementation will support only simple operations that fall into the above category. As the capabilities of the vectorizer are extended, it will be required to inform the vectorizer of the advanced capabilities available in the architecture (for example, support for operations on complex numbers, reduction, etc.). Such operations cannot be expressed using existing tree-codes. Possible solutions: introduce new tree-codes (and corresponding optabs); introduce new builtins that are exposed to the compiler; use target hooks to handle these cases (the hook could return a call to a machine specific builtin function). Another related design question that needs to be addressed here is how much information to expose to the tree-level (is it sufficient to indicate that conditional vector addition is supported, or do we want the vectorizer to actually generate the required masking/predication/select operations depending on the target? similarly for alignment, multiplication of integers, etc.).

    Status: Open for discussion.
    Related discussion: https://gcc.gnu.org/ml/gcc/2004-08/msg00317.html

  3. Enhance the Builtins Support

    Currently the tree optimizers do not know the semantics of target specific builtin functions, so they do not attempt to optimize them (or to SSA the variables passed as arguments to these functions). Since the vectorizer will probably end up generating calls to target specific builtin functions, this situation needs to be improved, i.e. - the semantics of these builtins needs to somehow be exposed to the compiler.

    Status: Open for discussion.

  4. Cost Model

    There is an overhead associated with vectorization -- moving data in to/out of vector registers before/after the vectorized loop, aligning of data accesses, etc. It is required to incorporate a cost model into the machine description in order to allow the vectorizer to evaluate whether it is worth while to vectorize a given loop. One can also consider using run time tests to decide which version of the loop to execute (scalar or vectorized).

    Status: Open for discussion.
    Related discussion: https://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html

  5. Induction Variable Analysis

    Used by the vectorizer to detect loop bound, analyze access patterns and analyze data dependencies between array references. One option is that the dependence tests would be designed deal with array references that are expressed in terms of a linear function of the iteration counter (in this case, vectorization will also benefit from optimizations like induction variable substitution and replacement of auxiliary IVs with linear functions of the loop index). A dependence tester that is based on IVs represented in this form would analyze each subscript of each array reference, and apply the appropriate dependence test (SIV, ZIV, MIV etc., see dependence testing). Alternatively, an induction variable evolution analyzer could provide a different implementation to the dependence tester. This is the solution that is currently used, based on the Induction Variable evolution analysis developed by Sebastian Pop.

    Status: Using the IV evolution analyzer developed by Sebastian Pop.

  6. Dependence Testing

    Following the classic dependence-based approach for vectorization as described in [1], apply dependence tests to pairs of array references in each loop nest, and analyze the resulting dependence graph. We will start from a dependence analyzer that relies on the array references being expressed in terms of a linear function of the loop index, apply the simplest dependence tests to all pairs of memory read/write and write/write, and gradually extend its capabilities. The scheme below follows the algorithm described in [2]:

    Status: Many of the tests above are implemented. Omega test is in the works. We don't have a DDG graph built based on the dependence tests.

  7. Access Pattern Analysis

    The memory architecture usually allows only restricted accesses to data in memory; one of the restrictions is that the accessed data must be consecutive in memory. Any other accesses (strided for example) require to perform special permutation of the data in order to pack the data elements in the right order into a vector register. Support for different access patterns consists of the following stages:

    Status: Partial support in place.

  8. Extend the range of supportable operations

    At first, the only computations that will be vectorized are those for which the vectorization process consists of trivially replacing each scalar operation in the loop with its vector counterpart. This includes simple loads, stores and arithmetic operations that operate on the same data type. Some computations require extra code to be generated in order to vectorize. These include:

    Status: Partial support in place.

  9. Alignment

    The memory architecture usually allows only restricted accesses to data in memory. One of the restrictions is that data accesses need to be properly aligned on a certain boundary. Even if the architecture supports unaligned accesses, these are usually much more costly than aligned accesses. The work on alignment consists of several stages:

    Status: Currently the way we handle unaligned stores is by peeling the loop to force the alignment of the store. This is not always applicable. Vectorizing unaligned stores is in the works.

  10. Idiom Recognition

    It is often the case that complicated computations can be reduced into a simpler, straight-line sequence of operations. These operations may not be directly supported in a scalar form, but are supported by the target in a vector form. Such cases include:

    Status: Partial support in place.

  11. Conditional Execution

    The general principle we are trying to follow is to keep the actual code transformation part of the vectorizer as simple as possible: a simple scan of straight-line code, and a one-to-one replacement of each scalar operation with the equivalent vector operation. To support this scheme in the presence of conditional execution, we'll need to flatten the loop body by collapsing if-then-else into a conditional (scalar) operation (something like transforming - 'if (x) {c = PLUS (a,b)}' into 'PLUS_COND(a,b,x)'. These will later be replaced with a conditional vector operation using whatever support is available on the target (masking, predication or select operation). Flattening the loop body this way will greatly simplify the vectorizer. Some of the issues to think about here: (a) how to represent these conditional operations, (b) to what extent does the tree vectorizer need to be aware of the specific target support that is available for conditional vector execution (mask/predicate/select), and (c) how to allow a simple way to reverse this transformation if the loop doesn't end up getting vectorized.

    Status: Done. Contact: Devang Patel.

  12. Handle Advanced Loop Forms

    1. Support general loop bound (unknown, or doesn't divide by the vector size). [done. Contact: Olga Golovanevsky]
    2. Support more complex forms of loop termination condition and loop index update. [done]
    3. Support outer-loop vectorization (unroll and jam). [planned]
    4. Relax other restrictions on the loop form.

    Status: Partial support in place.

  13. Handle Pointer Aliasing

    1. Improve aliasing analysis. [various GCC projects deal with improvements to alias analysis].
    2. Generate run-time tests for cases where memory anti-aliasing cannot be resolved at compile time. [Planned. Contact: Keith Besaw]
    3. Support user hints. [in the works. See https://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html. Contact: Devang Patel]

    Status: In the works.

  14. Aliasing and Virtual def-use Chains

    Address the item from the tree-ssa todo list - "SSA information for arrays : The existing implementation treats arrays as an opaque object. A definition to an array location is treated as a definition for the whole array"

    Status: Open for discussion.

  15. Array addressing Represented as Pointer Arithmetic

    Address the issue mentioned in https://gcc.gnu.org/ml/gcc/2003-07/msg02013.html, which turns out to be a front end issue.

    Status: In the works. See https://gcc.gnu.org/wiki/Array_references_on_pointers and https://gcc.gnu.org/ml/gcc-patches/2005-05/msg01577.html.

  16. Loop versioning

    Provide utilities that allow performing the following transformation: Given a condition and a loop, create -'if (condition) { loop_copy1 } else { loop_copy2 }', where loop_copy1 is the loop transformed in one way, and loop_copy2 is the loop transformed in another way (or unchanged). 'condition' may be a run time test for things that were not resolved by static analysis (overlapping ranges (anti-aliasing), alignment, etc.).

    Status: Done. Contact: Devang Patel.

  17. Loop Transformations to Increase Vectorizability of Loops

    These include:

    Status:Linear loop transformations are implemented by Daniel Berlin.

  18. Other Optimizations

    1. Exploit data reuse (a la "Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures" by Shin, Chame and Hall). Mostly relies on unroll & jam having been applied.
    2. Vectorize loops that can't be vectorized using the classic vectorizer (until the proper loop transformations are developed) by applying SLP vectorization (a la "Exploiting Superword Level Parallelism with Multimedia Instruction Sets" by Amarasinghe and Larsen). This scheme can potentially more easily vectorize partially vectorizable loops, or loops that are already unrolled in the source code. It is possible to implement SLP vectorization either in the tree level or at the RTL level as a complementary approach to classic loop vectorization.

    Status: Future work.

  19. User Hints

    Using user hints for different purposes (aliasing, alignment, profitability of vectorizing a loop, etc.).

    Status: In the works. See https://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.