Vectorization of loops - GNAT User's Guide for Native Platforms

8.3.1.7 Vectorization of loops

You can take advantage of the auto-vectorizer present in the `gcc' back end to vectorize loops with GNAT. The corresponding command line switch is `-ftree-vectorize'. However, because `-O3' enables it by default, along with other aggressive optimizations that help vectorization, using `-O3' directly is recommended.

You also need to make sure that the target architecture features a supported SIMD instruction set. For example, for the x86 architecture, you should at least specify `-msse2' to get significant vectorization (but you don't need to specify it for x86-64 as it is part of the base 64-bit architecture). Similarly, for the PowerPC architecture, you should specify `-maltivec'.

The preferred loop form for vectorization is the for iteration scheme. Loops with a while iteration scheme can also be vectorized if they are very simple, but the vectorizer will quickly give up otherwise. With either iteration scheme, the flow of control must be straight; in particular, no exit statement may appear in the loop body. The loop may, however, contain a single nested loop, if it can be vectorized when considered alone:

    A : array (1 .. 4, 1 .. 4) of Long_Float;
    S : array (1 .. 4) of Long_Float;
    
    procedure Sum is
    begin
       for I in A'Range(1) loop
          for J in A'Range(2) loop
             S (I) := S (I) + A (I, J);
          end loop;
       end loop;
    end Sum;
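
As a minimal sketch of the control-flow rule, the following complete procedure contrasts a straight loop with one that contains an exit statement. The names Loop_Forms, Vec, Scale and Scale_Until_Zero are hypothetical and chosen only for illustration; whether either loop is actually vectorized depends on the compiler version and the target.

    procedure Loop_Forms is
       type Vec is array (1 .. 1024) of Long_Float;

       Data : Vec := (others => 1.5);

       --  Straight flow of control: a candidate for vectorization
       procedure Scale (V : in out Vec; K : Long_Float) is
       begin
          for I in V'Range loop
             V (I) := V (I) * K;
          end loop;
       end Scale;

       --  The exit statement breaks the straight flow of control, so
       --  under the rule stated above this loop is not expected to be
       --  vectorized
       procedure Scale_Until_Zero (V : in out Vec; K : Long_Float) is
       begin
          for I in V'Range loop
             exit when V (I) = 0.0;
             V (I) := V (I) * K;
          end loop;
       end Scale_Until_Zero;

    begin
       Scale (Data, 2.0);
       Scale_Until_Zero (Data, 2.0);
    end Loop_Forms;

Assuming the unit is stored in loop_forms.adb, compiling it with `gcc -c -O3 -fopt-info-vec loop_forms.adb' should make the back end report the loops it vectorized, which is one way to check the effect of each loop form on a given target.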

The vectorizable operations depend on the targeted SIMD instruction set, but the adding and some of the multiplying operators are generally supported, as well as the logical operators for modular types. Run-time checks can thwart vectorization; compiling with `-gnatp', which suppresses them, might well reveal such cases.
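
To illustrate the logical operators on modular types, here is a small sketch; the names Mask_Demo, Word, Word_Vec and Apply_Mask are assumptions made for this example. Modular operations wrap around and therefore carry no overflow checks, which is one reason such loops tend to be good candidates.

    procedure Mask_Demo is
       type Word is mod 2 ** 32;
       type Word_Vec is array (1 .. 1024) of Word;

       Data : Word_Vec := (others => 16#1234_5678#);

       procedure Apply_Mask (V : in out Word_Vec; M : Word) is
       begin
          for I in V'Range loop
             --  Only logical operators on a modular type are used here
             V (I) := (V (I) and M) xor 16#FFFF#;
          end loop;
       end Apply_Mask;

    begin
       Apply_Mask (Data, 16#FFFF_0000#);
       --  (16#1234_5678# and 16#FFFF_0000#) xor 16#FFFF# = 16#1234_FFFF#
       pragma Assert (Data (1) = 16#1234_FFFF#);
    end Mask_Demo;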

Type conversions may also prevent vectorization if they involve semantics that are not directly supported by the code generator or the SIMD instruction set. A typical example is direct conversion from floating-point to integer types. The solution in this case is to use the following idiom:

    Integer (S'Truncation (F))

where S is the subtype of the floating-point object F.
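
The idiom can be sketched inside a loop as follows. The names Convert_Demo, Float_Vec, Int_Vec and To_Int are hypothetical; the floating-point subtype is Float in this example.

    procedure Convert_Demo is
       type Float_Vec is array (1 .. 1024) of Float;
       type Int_Vec   is array (1 .. 1024) of Integer;

       Src : constant Float_Vec := (others => 2.75);
       Dst : Int_Vec;

       procedure To_Int (F : Float_Vec; R : out Int_Vec) is
       begin
          for I in F'Range loop
             --  Truncate toward zero first, then convert the result
             R (I) := Integer (Float'Truncation (F (I)));
          end loop;
       end To_Int;

    begin
       To_Int (Src, Dst);
       --  Truncation of 2.75 yields 2.0, so the conversion gives 2
       pragma Assert (Dst (1) = 2);
    end Convert_Demo;

Note that the idiom changes the result: a direct Ada conversion rounds to the nearest integer (Integer (2.75) is 3), whereas the idiom truncates toward zero. It is therefore only appropriate where truncation is acceptable.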

In most cases, the vectorizable loops are loops that iterate over arrays. All kinds of array types are supported, i.e. constrained array types with static bounds:

    type Array_Type is array (1 .. 4) of Long_Float;

constrained array types with dynamic bounds:

    type Array_Type is array (1 .. Q.N) of Long_Float;
    
    type Array_Type is array (Q.K .. 4) of Long_Float;
    
    type Array_Type is array (Q.K .. Q.N) of Long_Float;

or unconstrained array types:

    type Array_Type is array (Positive range <>) of Long_Float;

The quality of the generated code decreases as the dynamic aspect of the array type increases, the worst code being generated for unconstrained array types. The reason is that the less information the compiler has about the bounds of the array, the more fallback code it needs to generate in order to fix things up at run time.
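
The following sketch places the same loop body over a constrained array type with static bounds and over an unconstrained array type, so that the generated code can be compared. The names Bounds_Demo, Static_Vec, Open_Vec, Double_Static and Double_Open are assumptions made for this example.

    procedure Bounds_Demo is
       type Static_Vec is array (1 .. 1024) of Long_Float;
       type Open_Vec   is array (Positive range <>) of Long_Float;

       A : Static_Vec := (others => 1.0);
       B : Open_Vec (1 .. 1000) := (others => 1.0);

       --  Bounds are known at compile time
       procedure Double_Static (V : in out Static_Vec) is
       begin
          for I in V'Range loop
             V (I) := V (I) * 2.0;
          end loop;
       end Double_Static;

       --  Bounds are known only at run time, through the parameter
       procedure Double_Open (V : in out Open_Vec) is
       begin
          for I in V'Range loop
             V (I) := V (I) * 2.0;
          end loop;
       end Double_Open;

    begin
       Double_Static (A);
       Double_Open (B);
       pragma Assert (A (1) = 2.0 and then B (1) = 2.0);
    end Bounds_Demo;

If the compiler inlines Double_Open at the call site, it may recover the bounds of B, so the difference is most visible when the two subprograms are compiled as separate, non-inlined units.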

It is possible to specify that a given loop should be subject to vectorization in preference to other optimizations by means of pragma Loop_Optimize:

    pragma Loop_Optimize (Vector);

When placed immediately within the loop, this pragma conveys the appropriate hint to the compiler for that loop.
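
A minimal sketch of the placement, with hypothetical names Vector_Hint_Demo, Vec and Axpy: the pragma appears as the first item inside the loop it applies to.

    procedure Vector_Hint_Demo is
       type Vec is array (1 .. 1024) of Long_Float;

       X : constant Vec := (others => 1.0);
       Y : Vec := (others => 2.0);

       procedure Axpy (A : Long_Float; X : Vec; Y : in out Vec) is
       begin
          for I in Y'Range loop
             --  Hint: prefer vectorization for this loop
             pragma Loop_Optimize (Vector);
             Y (I) := A * X (I) + Y (I);
          end loop;
       end Axpy;

    begin
       Axpy (3.0, X, Y);
       --  3.0 * 1.0 + 2.0 = 5.0
       pragma Assert (Y (1) = 5.0);
    end Vector_Hint_Demo;

The pragma is only a hint; the compiler may still decline to vectorize the loop if it cannot do so safely or profitably.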

It is also possible to help the compiler generate better vectorized code for a given loop by asserting that there are no loop-carried dependencies in the loop. Consider for example the procedure:

    type Arr is array (1 .. 4) of Long_Float;
    
    procedure Add (X, Y : not null access Arr; R : not null access Arr) is
    begin
       for I in Arr'Range loop
          R (I) := X (I) + Y (I);
       end loop;
    end Add;

By default, the compiler cannot unconditionally vectorize the loop because assigning to a component of the array designated by R in one iteration could change the value read from the components of the array designated by X or Y in a later iteration. As a result, the compiler will generate two versions of the loop in the object code, one vectorized and the other not vectorized, as well as a test to select the appropriate version at run time. This can be overcome by another hint:

    pragma Loop_Optimize (Ivdep);

When placed immediately within the loop, this pragma tells the compiler that it can safely omit the non-vectorized version of the loop as well as the run-time test.
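
Applied to the Add procedure above, the hint could be placed as in the following sketch. The enclosing procedure Ivdep_Demo and the objects A, B and C are assumptions added to make the example complete.

    procedure Ivdep_Demo is
       type Arr is array (1 .. 4) of Long_Float;

       A : aliased Arr := (others => 1.0);
       B : aliased Arr := (others => 2.0);
       C : aliased Arr := (others => 0.0);

       procedure Add (X, Y : not null access Arr; R : not null access Arr) is
       begin
          for I in Arr'Range loop
             --  Assert that there are no loop-carried dependencies
             pragma Loop_Optimize (Ivdep);
             R (I) := X (I) + Y (I);
          end loop;
       end Add;

    begin
       --  The three designated arrays are distinct objects here
       Add (A'Access, B'Access, C'Access);
       pragma Assert (C (1) = 3.0);
    end Ivdep_Demo;

The compiler does not verify the assertion made by Ivdep; it is the programmer's responsibility to ensure that the loop really has no loop-carried dependencies for every call.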