Improvement of General Vector Extensions

This page describes plans and progress for the GSoC 2010 project Improvement of General Vector Extensions. The project is done by Artem Shinkarov, the advisor is Richard Günther.

Abstract

The project addresses the portability of vector operations in C considering efficiency and correctness. The starting point of the project is GCC general vector extension. I show that sometimes GCC ends-up with a wrong code because memory-alignment issues are not considered. I introduce a set of desirable interfaces for vector operations which are not currently supported. Finally I introduce several test-cases and discuss future work.

Plan

The project will consist of the following parts.

Memory alignment
The main problem here is to understand why there is no alignment checking. If this is just a simple bug then it would be an easy fix, though it would be much harder if we don't have information about the memory address at the point when we need to decide which instruction do we want to use. The other thing to consider is that memory could be accessed like "array[i]" where "i" could be an external variable. In that case I am going to use an unaligned memory access instructions. Some architectures may not provide any support for unaligned memory access, in that case this must be emulated.
Indexing
The first task to do is to check if the existing patch is still compatible with the current version. After that we need to be sure that the patch works correctly and efficiently. The optional step here is to introduce the set of optimizations for intra-register arithmetic. For example: "a[2] = a[3]". Possibly shifting and masking would give better performance than the memory transfer.
Shifting and permutation
The first thing to do is to understand the mechanism of adding a new instruction. There are several stages: support it on the syntactical level, support it on the level of intermediate language, generate the correct hardware instruction. After the mechanism becomes clear and instruction is added, we need to ensure that it works correctly. Depending on the complexity of this mechanism reduction and comparison problems can be implemented as well.
Comparison

It is a natural thing to compare two vectors. The result of comparison is an signed integer vector where each element is a comparison of the according elements in the source operands. True is represented with -1 and false with 0. For example: {1, 3} > {2, 1}  is evaluated to {0, -1} .

Testing
All the stages must be tested very carefully. It is considerably simple to test an individual operation, however, I would like to implement a couple of test-cases to show that the system works. First example is part of the Bzip2 algorithm: vectorized version of reversed MTF (Move To Front), the second test-case is a vectorized version of MD5 hashing algorithm.

For more information please look at the original proposal.

Progress

Currently we have committed into the trunk the following patches:

Patches that are ready but not committed:

Patches to be created:

Details

Indexing
  • This is quite a simple thing to implement. The last version is submitted for comments. See this link for more details. An example:

       1 #define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
       2 
       3 int main ()
       4 {
       5   vector (4, int) v = {1,2,3,4};
       6   return v[0] + v[1]; /* results to 3.  */
       7 }
    
  • The question I currently have is the C++ parts. At the moment there are the same code in typechecker and in the cp/decl2.c file (part of frontend). The code looks like this:
Shifting & shuffling
  • The shifting patch has been submitted. See this link for more details.

       1 #define vector(elcount, type)  __attribute__((vector_size((elcount)*sizeof(type)))) type
       2 
       3 int main (int argc, char *argv[])
       4 {
       5   vector (4, int) v0 = {argc,2,3,4};
       6   vector (4, int) v1 = {2,2,2,2};
       7   vector (4, int) r1, r2, r3, r4;
       8 
       9   r1 = v0 >> v1;      /* results to {argc >> 2, 2 >> 2, 3 >> 2, 4 >> 2}  */
      10   r2 = v0 << v1;      /* results to {argc << 2, 2 << 2, 3 << 2, 4 << 2}  */
      11 
      12   r1 = v0 >> 1;       /* results to {argc >> 1, 2 >> 1, 3 >> 1, 4 >> 1}  */
      13   r1 = v0 << 1;       /* results to {argc << 1, 2 << 1, 3 << 1, 4 << 1}  */
      14 
      15   return 0;
      16 }
    
  • While writing an example I encountered a minor bug which I hope can be fixed soon. I is fixed now.

  • Whole-vector shifting is required as well. Basically this could be done with __int128, if one is available on the system. But still there is an open question about the AVX 256-bit vectors.

  • Vector shuffling -- the patch has been submitted: here.

       1 typedef int v4si __attribute__ ((vector_size (16)));
       2 
       3 int main ()
       4 {
       5   v4si a = {1,2,3,4};
       6   v4si b = {5,6,7,8};
       7   v4si mask1 = {0,1,1,3};
       8   v4si mask2 = {0,4,2,5};
       9   v4si res;
      10 
      11   res = __builtin_shuffle (a, mask1);       /* res is {1,2,2,4}  */
      12   res = __builtin_shuffle2 (a, b, mask2);   /* res is {1,5,3,6}  */
      13 
      14   return 0;
      15 }
    
  • Optimizations
    • OpenCL standard allows to specify vector operation in a form "vector <op> scalar" or "scalar <op> vector". In this case the scalar is converted to the vector where each element is this scalar (2 ==> {2,2,2,2}). This happens only if type of the scalar is less or equal then the base type of the vector. GCC middle-end has a primitive for operation of "vector >> scalar", which often has a hardware support. In case we have a "vector1 >> vector2" and all the elements of vector2 are the same, we could convert this vector to the scalar. This optimization is implemented and is submitted. For more information see this link.

    • In case we have a vector-vector operation, it is quite harmful for the performance if this operation is decomposed into a set of scalar-scalar operations. The plan is to introduce a new warning (-Wscalar-vector) for that purpose. Not implemented yet.
Comparison
  • The vector comparison patch has been submitted

       1 typedef int v4si __attribute__ ((vector_size (16)));
       2 
       3 int main ()
       4 {
       5   v4si a = {1,2,3,4};
       6   v4si b = {3,2,1,4};
       7   v4si c = {2,3,4,5};
       8   v4si d = {6,7,8,9};
       9   v4si res;
      10 
      11   res = a >  b;        /* The result would be {0, 0,-1, 0}  */
      12   res = a == b;        /* The result would be {0,-1, 0,-1}  */
      13   res = a >= b ? c : d;  /* res would contain {6, 3, 4, 9}  */
      14   return 0;
      15 }
    
Memory alignment
  • The patch is ready but is not submitted. The problem of misaligned data in project can be solved by using __attribute__((vector_size(16), aligned(4))) which will always produce unaligned moves. However, if you use misaligned vector assignment through pointers, compiler does not provide any warnings. This situation is handled in the patch.

Testing
  • There are several implementations using a combination of vector extensions and inline assembly which must be reworked later.
Other stuff
  • Following the OpenCL standard we want to support operations in form "vector <op> scalar" and "scalar <op> vector", converting the scalar to the vector. The patch was submitted.

Comments

If any.

None: VectorExtensionsImprovement (last edited 2011-08-12 15:53:57 by 147)