6.3.1.7 Vectorization of loops

The GCC and LLVM back ends have an auto-vectorizer that’s enabled by default at some optimization levels. For the GCC back end, it’s enabled by default at -O3 and you can request it at other levels with -ftree-vectorize. For the LLVM back end, it’s enabled by default at lower levels, but you can explicitly enable or disable it with the -fno-vectorize, -fvectorize, -fno-slp-vectorize, and -fslp-vectorize switches.

To get auto-vectorization, you also need to make sure that the target architecture features a supported SIMD instruction set. For example, for the x86 architecture, you should at least specify -msse2 to get significant vectorization (but you don’t need to specify it for x86-64 as it is part of the base 64-bit architecture). Similarly, for the PowerPC architecture, you should specify -maltivec.

The preferred loop form for vectorization is the for iteration scheme. Loops with a while iteration scheme can also be vectorized if they are very simple, but the vectorizer will quickly give up otherwise. With either iteration scheme, the flow of control must be straight; in particular, no exit statement may appear in the loop body. The loop may, however, contain a single nested loop if that nested loop can be vectorized when considered alone:

A : array (1 .. 4, 1 .. 4) of Long_Float;
S : array (1 .. 4) of Long_Float;

procedure Sum is
begin
   for I in A'Range(1) loop
      for J in A'Range(2) loop
         S (I) := S (I) + A (I, J);
      end loop;
   end loop;
end Sum;
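
As a rough illustration of the loop-form rules above, the following sketch contrasts a loop with straight control flow against a search loop that contains an exit statement. The names Loop_Forms, Vec, Scale and First_Negative are hypothetical, and whether the first loop is actually vectorized depends on the target, the switches and the compiler version.

```ada
procedure Loop_Forms is

   type Vec is array (1 .. 1024) of Long_Float;

   --  Candidate for vectorization: "for" iteration scheme,
   --  straight control flow, no exit statement.
   procedure Scale (V : in out Vec; Factor : Long_Float) is
   begin
      for I in V'Range loop
         V (I) := V (I) * Factor;
      end loop;
   end Scale;

   --  Not a candidate according to the rule stated above: the exit
   --  statement breaks the straight flow of control.
   function First_Negative (V : Vec) return Natural is
      Result : Natural := 0;
   begin
      for I in V'Range loop
         if V (I) < 0.0 then
            Result := I;
            exit;
         end if;
      end loop;
      return Result;
   end First_Negative;

   Data : Vec := (others => 1.5);

begin
   Scale (Data, 2.0);
   Data (10) := -1.0;

   --  Sanity checks, only evaluated when assertions are enabled (-gnata)
   pragma Assert (Data (1) = 3.0);
   pragma Assert (First_Negative (Data) = 10);
end Loop_Forms;
```

Assuming the GCC back end, compiling this with -O3 -fopt-info-vec should report which loops were vectorized.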

The vectorizable operations depend on the targeted SIMD instruction set, but addition and some multiplication operators are generally supported, as well as the logical operators for modular types. Run-time checks can also thwart vectorization; compiling with -gnatp, which suppresses all checks, may reveal such cases.
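
The sketch below shows the kind of loop this paragraph has in mind: logical operators applied to components of a modular type. Modular arithmetic wraps around and therefore needs no overflow check, and the index is statically within bounds. The names Modular_Ops, Word, Word_Array and Mask_And_Flip are hypothetical and not taken from GNAT.

```ada
procedure Modular_Ops is

   type Word is mod 2 ** 32;
   type Word_Array is array (1 .. 256) of Word;

   --  Element-wise "and" followed by "xor" on a modular type
   procedure Mask_And_Flip
     (Dst  : out Word_Array;
      Src  : Word_Array;
      Mask : Word)
   is
   begin
      for I in Dst'Range loop
         Dst (I) := (Src (I) and Mask) xor 16#FFFF_0000#;
      end loop;
   end Mask_And_Flip;

   A : constant Word_Array := (others => 16#1234_5678#);
   B : Word_Array;

begin
   Mask_And_Flip (B, A, 16#0000_FFFF#);

   --  (16#1234_5678# and 16#0000_FFFF#) xor 16#FFFF_0000#
   pragma Assert (B (1) = 16#FFFF_5678#);
end Modular_Ops;
```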

Type conversions may also prevent vectorization if they involve semantics that are not directly supported by the code generator or the SIMD instruction set. A typical example is direct conversion from floating-point to integer types. The solution in this case is to use the following idiom:

Integer (S'Truncation (F))

where S is the subtype of the floating-point object F.
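
A minimal sketch of the idiom inside a loop follows, with hypothetical names (Truncation_Idiom, Float_Array, Int_Array, To_Integers). Note that the idiom also changes the result: a plain Ada conversion to an integer type rounds to nearest, whereas going through 'Truncation truncates toward zero, so it only applies where truncation is acceptable.

```ada
procedure Truncation_Idiom is

   type Float_Array is array (1 .. 8) of Float;
   type Int_Array   is array (1 .. 8) of Integer;

   procedure To_Integers (Src : Float_Array; Dst : out Int_Array) is
   begin
      for I in Src'Range loop
         --  Integer (Src (I)) alone would round to nearest; the
         --  'Truncation attribute makes the conversion truncate.
         Dst (I) := Integer (Float'Truncation (Src (I)));
      end loop;
   end To_Integers;

   F : constant Float_Array := (1.75, -1.75, 2.5, -2.5, others => 0.0);
   N : Int_Array;

begin
   To_Integers (F, N);

   --  Truncation toward zero, not rounding
   pragma Assert (N (1) = 1 and N (2) = -1);
   pragma Assert (N (3) = 2 and N (4) = -2);
end Truncation_Idiom;
```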

In most cases, the vectorizable loops are loops that iterate over arrays. All kinds of array types are supported, i.e. constrained array types with static bounds:

type Array_Type is array (1 .. 4) of Long_Float;

constrained array types with dynamic bounds:

type Array_Type is array (1 .. Q.N) of Long_Float;

type Array_Type is array (Q.K .. 4) of Long_Float;

type Array_Type is array (Q.K .. Q.N) of Long_Float;

or unconstrained array types:

type Array_Type is array (Positive range <>) of Long_Float;

The quality of the generated code decreases as the bounds of the array type become more dynamic, with the worst code being generated for unconstrained array types. This is because the less information the compiler has about the bounds of the array, the more fallback code it needs to generate in order to fix things up at run time.
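
To make the two extremes concrete, the following sketch applies the same loop to a constrained array type with static bounds and to an unconstrained array type. All names (Array_Kinds, Vec_4, Any_Vec, Double_Static, Double_Dynamic) are hypothetical; the sketch only shows the source forms being compared and makes no claim about the exact code generated for either.

```ada
procedure Array_Kinds is

   --  Constrained with static bounds: the length is known at compile time
   type Vec_4 is array (1 .. 4) of Long_Float;

   --  Unconstrained: the bounds are only known at run time
   type Any_Vec is array (Positive range <>) of Long_Float;

   procedure Double_Static (V : in out Vec_4) is
   begin
      for I in V'Range loop
         V (I) := V (I) + V (I);
      end loop;
   end Double_Static;

   procedure Double_Dynamic (V : in out Any_Vec) is
   begin
      for I in V'Range loop
         V (I) := V (I) + V (I);
      end loop;
   end Double_Dynamic;

   Small : Vec_4 := (others => 1.0);
   Big   : Any_Vec (1 .. 100) := (others => 1.0);

begin
   Double_Static (Small);
   Double_Dynamic (Big);

   pragma Assert (Small (4) = 2.0);
   pragma Assert (Big (100) = 2.0);
end Array_Kinds;
```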

You can specify that a given loop should be vectorized in preference to other optimizations by means of pragma Loop_Optimize:

pragma Loop_Optimize (Vector);

placed immediately within the loop will convey the appropriate hint to the compiler for this loop. This is currently only supported for the GCC back end.
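
For instance, the hint could be placed as the first item in the loop body, as in this sketch. The names Vector_Hint, Vec and Axpy are hypothetical, and the pragma remains only a hint that the compiler is free to ignore.

```ada
procedure Vector_Hint is

   type Vec is array (1 .. 1024) of Long_Float;

   --  Y := A * X + Y, component by component
   procedure Axpy (A : Long_Float; X : Vec; Y : in out Vec) is
   begin
      for I in Y'Range loop
         pragma Loop_Optimize (Vector);
         Y (I) := A * X (I) + Y (I);
      end loop;
   end Axpy;

   X : constant Vec := (others => 2.0);
   Y : Vec := (others => 1.0);

begin
   Axpy (3.0, X, Y);

   --  3.0 * 2.0 + 1.0
   pragma Assert (Y (1) = 7.0 and Y (1024) = 7.0);
end Vector_Hint;
```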

You can also help the compiler generate better vectorized code for a given loop by asserting that there are no loop-carried dependencies in the loop. Consider for example the procedure:

type Arr is array (1 .. 4) of Long_Float;

procedure Add (X, Y : not null access Arr; R : not null access Arr) is
begin
   for I in Arr'Range loop
      R (I) := X (I) + Y (I);
   end loop;
end Add;

By default, the compiler cannot unconditionally vectorize the loop because assigning to a component of the array designated by R in one iteration could change the value read from a component of the array designated by X or Y in a later iteration. As a result, the compiler generates two versions of the loop in the object code, one vectorized and one not, together with a test that selects the appropriate version at run time. You can avoid this overhead with another hint:

pragma Loop_Optimize (Ivdep);

placed immediately within the loop will tell the compiler that it can safely omit the non-vectorized version of the loop as well as the run-time test. This is also currently only supported by the GCC back end.
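
Applied to the Add procedure above, the hint might look like the following sketch. The enclosing procedure Ivdep_Hint and the objects A, B and C are hypothetical test scaffolding. The pragma is a promise made by the programmer: the compiler does not verify that the loop is really free of loop-carried dependencies.

```ada
procedure Ivdep_Hint is

   type Arr is array (1 .. 4) of Long_Float;

   procedure Add (X, Y : not null access Arr; R : not null access Arr) is
   begin
      for I in Arr'Range loop
         --  Assert that no iteration depends on an earlier one
         pragma Loop_Optimize (Ivdep);
         R (I) := X (I) + Y (I);
      end loop;
   end Add;

   A : aliased Arr := (1.0, 2.0, 3.0, 4.0);
   B : aliased Arr := (10.0, 20.0, 30.0, 40.0);
   C : aliased Arr := (others => 0.0);

begin
   --  Three distinct objects, so the promise made by Ivdep holds here
   Add (A'Access, B'Access, C'Access);

   pragma Assert (C = Arr'(11.0, 22.0, 33.0, 44.0));
end Ivdep_Hint;
```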