Improvement of General Vector Extensions

This page describes plans and progress for the GSoC 2010 project Improvement of General Vector Extensions. The project is done by Artem Shinkarov, the advisor is Richard Günther.

Abstract

The project addresses the portability of vector operations in C considering efficiency and correctness. The starting point of the project is GCC general vector extension. I show that sometimes GCC ends-up with a wrong code because memory-alignment issues are not considered. I introduce a set of desirable interfaces for vector operations which are not currently supported. Finally I introduce several test-cases and discuss future work.

Plan

The project will consist of the following parts.

Memory alignment: The main problem here is to understand why there is no alignment checking. If this is just a simple bug then it would be an easy fix, though it would be much harder if we don't have information about the memory address at the point when we need to decide which instruction do we want to use. The other thing to consider is that memory could be accessed like "array[i]" where "i" could be an external variable. In that case I am going to use an unaligned memory access instructions. Some architectures may not provide any support for unaligned memory access, in that case this must be emulated.
Indexing: The first task to do is to check if the existing patch is still compatible with the current version. After that we need to be sure that the patch works correctly and efficiently. The optional step here is to introduce the set of optimizations for intra-register arithmetic. For example: "a[2] = a[3]". Possibly shifting and masking would give better performance than the memory transfer.
Shifting and permutation: The first thing to do is to understand the mechanism of adding a new instruction. There are several stages: support it on the syntactical level, support it on the level of intermediate language, generate the correct hardware instruction. After the mechanism becomes clear and instruction is added, we need to ensure that it works correctly. Depending on the complexity of this mechanism reduction and comparison problems can be implemented as well.
Comparison: It is a natural thing to compare two vectors. The result of comparison is an signed integer vector where each element is a comparison of the according elements in the source operands. True is represented with -1 and false with 0. For example: {1, 3} > {2, 1} is evaluated to {0, -1} .
Testing: All the stages must be tested very carefully. It is considerably simple to test an individual operation, however, I would like to implement a couple of test-cases to show that the system works. First example is part of the Bzip2 algorithm: vectorized version of reversed MTF (Move To Front), the second test-case is a vectorized version of MD5 hashing algorithm.

For more information please look at the original proposal.

Progress

Currently we have committed into the trunk the following patches:

Vector indexing patch, allowing to write the code like vec[i], where vec is of type vector.
Vector shifting patch, allowing to write the code like vec >> vec, where vec is of type vector.
Vector shifting optimisation patch, optimising the code like vec >> {x,x,x,x} to vec >> x in case the underlying hardware supports that.
Vector scalar patch, allowing to write the code like vec + 2 which means vec + {2,2,...} , where vec is of type vector

Patches that are ready but not committed:

Vector comparison and condition patch (pending for approve), now we can do two things:
- vec = vec0 > vec1 which is evaluated into the signed integer vector mask, for example {1, 2} > {3, 1} would be evaluated to {0, -1} .
- vec = vec0 > vec1 ? vec2 : vec3 in that case the result of the condition is used to select elements either from vec2} or vec3. For example, {-1, 0} ? {2, 3} : {4, 5} would be evaluated to {2, 5} .
Vector shuffling patch (some adjustments required), allows you to write the code like __builtin_shuffle ({1,2,3,4}, {1,0,0,2}) which would be evaluated to {2, 1, 1, 3} . It is also possible to use mask on two vectors: __builtin_shuffle ({1,2,3,4}, {5,6,7,8}, {0,4,7,2} would be evaluated to {1, 5, 8, 3} .
Memory misalignment patch. This is a *very* non-trivial question, which requires a lot of effort. The current patch just tries to warn a user whenever memory alignment is violated.

Patches to be created:

Whole vector shifting, in case we consider a vector as a big integer and make standard shift. In case of SSE 128-bit vectors we can use __int128 type as a workaround. In case of longer vectors we have a problem.
Alignment check. We need to create a mechanism to check if a given function preserves an alignment. (This is a hard task.)
Warn when the vector operation is expanded piece-wise. It would be very useful for the performance analysis.
Vector constant folding. Currently constant vectors remain untouched by the constant folder, which of course leads to the inefficient code.

Details

Indexing

This is quite a simple thing to implement. The last version is submitted for comments. See this link for more details. An example:
```
   1 #define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
   2 
   3 int main ()
   4 {
   5   vector (4, int) v = {1,2,3,4};
   6   return v[0] + v[1]; /* results to 3.  */
   7 }
```
The question I currently have is the C++ parts. At the moment there are the same code in typechecker and in the cp/decl2.c file (part of frontend). The code looks like this:

Shifting & shuffling

The shifting patch has been submitted. See this link for more details.

   1 #define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
   2 
   3 int main (int argc, char *argv[])
   4 {
   5   vector (4, int) v0 = {argc,2,3,4};
   6   vector (4, int) v1 = {2,2,2,2};
   7   vector (4, int) r1, r2, r3, r4;
   8 
   9   r1 = v0 >> v1;      /* results to {argc >> 2, 2 >> 2, 3 >> 2, 4 >> 2}  */
  10   r2 = v0 << v1;      /* results to {argc << 2, 2 << 2, 3 << 2, 4 << 2}  */
  11 
  12   r1 = v0 >> 1;       /* results to {argc >> 1, 2 >> 1, 3 >> 1, 4 >> 1}  */
  13   r1 = v0 << 1;       /* results to {argc << 1, 2 << 1, 3 << 1, 4 << 1}  */
  14 
  15   return 0;
  16 }

While writing an example I encountered a minor bug which I hope can be fixed soon. I is fixed now.
Whole-vector shifting is required as well. Basically this could be done with __int128, if one is available on the system. But still there is an open question about the AVX 256-bit vectors.

Vector shuffling -- the patch has been submitted: here.

   1 typedef int v4si __attribute__ ((vector_size (16)));
   2 
   3 int main ()
   4 {
   5   v4si a = {1,2,3,4};
   6   v4si b = {5,6,7,8};
   7   v4si mask1 = {0,1,1,3};
   8   v4si mask2 = {0,4,2,5};
   9   v4si res;
  10 
  11   res = __builtin_shuffle (a, mask1);       /* res is {1,2,2,4}  */
  12   res = __builtin_shuffle2 (a, b, mask2);   /* res is {1,5,3,6}  */
  13 
  14   return 0;
  15 }

Optimizations

OpenCL standard allows to specify vector operation in a form "vector <op> scalar" or "scalar <op> vector". In this case the scalar is converted to the vector where each element is this scalar (2 ==> {2,2,2,2}). This happens only if type of the scalar is less or equal then the base type of the vector. GCC middle-end has a primitive for operation of "vector >> scalar", which often has a hardware support. In case we have a "vector1 >> vector2" and all the elements of vector2 are the same, we could convert this vector to the scalar. This optimization is implemented and is submitted. For more information see this link.
In case we have a vector-vector operation, it is quite harmful for the performance if this operation is decomposed into a set of scalar-scalar operations. The plan is to introduce a new warning (-Wscalar-vector) for that purpose. Not implemented yet.

Comparison

The vector comparison patch has been submitted

   1 typedef int v4si __attribute__ ((vector_size (16)));
   2 
   3 int main ()
   4 {
   5   v4si a = {1,2,3,4};
   6   v4si b = {3,2,1,4};
   7   v4si c = {2,3,4,5};
   8   v4si d = {6,7,8,9};
   9   v4si res;
  10 
  11   res = a >  b;        /* The result would be {0, 0,-1, 0}  */
  12   res = a == b;        /* The result would be {0,-1, 0,-1}  */
  13   res = a >= b ? c : d;  /* res would contain {6, 3, 4, 9}  */
  14   return 0;
  15 }

Memory alignment

The patch is ready but is not submitted. The problem of misaligned data in project can be solved by using __attribute__((vector_size(16), aligned(4))) which will always produce unaligned moves. However, if you use misaligned vector assignment through pointers, compiler does not provide any warnings. This situation is handled in the patch.

Testing

There are several implementations using a combination of vector extensions and inline assembly which must be reworked later.

Other stuff

Following the OpenCL standard we want to support operations in form "vector <op> scalar" and "scalar <op> vector", converting the scalar to the vector. The patch was submitted.

Comments

If any.