OpenACC kernels Construct Optimization Tutorial
This tutorial explains how to optimize computations using OpenACC kernels constructs for offloading parts of a program to an accelerator device like a GPU.
Example Application
In this tutorial we will optimize a simple matrix multiplication. The original function looks like this (N is expected to be defined as a macro):
void matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  for (int j = 0; j < N; j++)
    {
      for (int i = 0; i < N; i++)
        {
          float sum = 0;
          for (int k = 0; k < N; k++)
            sum += a[i][k] * b[k][j];
          r[i][j] = sum;
        }
    }
}
This is a normal C function that can be compiled with GCC and executed as usual. For N = 2000, it runs in about 94.26 seconds on a test machine, executed on the CPU in a single thread.
To offload and optimize this function using OpenACC, the code itself will not change. Only some #pragmas will be needed to tell the compiler what data and computations to move to the accelerator, and how to parallelize the loops.
Introducing the OpenACC kernels Construct
The OpenACC kernels construct indicates a program region to offload to the accelerator. Data clauses on the kernels region list the data to be copied to and from the accelerator device.
void matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
#pragma acc kernels \
  copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
    for (int j = 0; j < N; j++)
      {
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
            for (int k = 0; k < N; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}
As written, GCC (built from the OpenACC development branch) will warn that it is not able to parallelize the kernels region automatically:
matmul.c: In function 'matrix_multiply._omp_fn.0':
matmul.c:4:11: warning: OpenACC kernels construct will be executed sequentially; will by default avoid offloading to prevent data copy penalty
 #pragma acc kernels \
           ^
This means that the region will be executed on the host CPU instead of the accelerator device. This choice by the compiler can be overridden using the -foffload-force flag during compilation or by setting the ACC_DEVICE_TYPE environment variable before execution. In this example, forcing offloading would result in a large slowdown: sequential execution on GPUs is much slower than on modern CPUs. The advantage of offloading to the accelerator comes from parallelization, which is addressed in the next step.
Parallelizing Loops
To fully benefit from GCC's OpenACC support, each for loop in the kernels region (or an OpenACC parallel construct) needs a #pragma acc loop annotation. Parallelism annotations on the pragma tell the compiler how to parallelize the loop:
independent means that all iterations of the loop are independent of each other, i.e., there are no dependences between iterations. Independent iterations may be executed in any order and freely parallelized. The compiler trusts the programmer's independent annotations; it is therefore your responsibility to ensure that you only add them where they are indeed correct.
seq means that the iterations of the loop are to be executed sequentially, in order.
auto (which is also implicit in kernels regions if no other parallelism clause is specified) means that the compiler is supposed to determine automatically whether the loop's iterations are independent. In practice, GCC's auto-parallelizer very often does not understand data dependences in multi-level loop nests, and it will almost always choose sequential execution.
In the matrix multiplication example, the two outer loops are independent. The innermost loop has data dependences on the sum variable, as the value from one iteration is needed in the next one. We will improve this in the next step, but for now the innermost loop can be marked seq:
void matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
#pragma acc kernels \
  copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
  {
#pragma acc loop independent
    for (int j = 0; j < N; j++)
      {
#pragma acc loop independent
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
#pragma acc loop seq
            for (int k = 0; k < N; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}
With these simple annotations, the execution time for N = 2000 on the test machine is 2.05 seconds, a 46-fold speedup over CPU-only execution.
Reductions
The data dependence on the sum variable in the innermost loop is special: The values from different loop iterations are combined using an associative arithmetic operator. The loop computes the value of sum equal to the following expression:
(((0 + a[i][0] * b[0][j]) + a[i][1] * b[1][j]) + a[i][2] * b[2][j]) + ...
Mathematically, since addition is associative, parentheses may be moved around freely, for example:
(0 + a[i][0] * b[0][j] + a[i][1] * b[1][j]) + (a[i][2] * b[2][j] + ...)
This means that the values of a[i][k] * b[k][j] may be computed independently and finally combined using the + operator using any grouping that may be convenient. Such calculations are called reductions, and there is special support for expressing them in OpenACC. The innermost loop can change to the following:
float sum = 0;
#pragma acc loop independent reduction(+: sum)
for (int k = 0; k < N; k++)
  sum += a[i][k] * b[k][j];
r[i][j] = sum;
The compiler and OpenACC runtime take care of organizing the independent execution of the loop iterations and summing up the intermediate results.
Using the independent and reduction clauses like this brings execution time to 1.75 seconds, a further 15% improvement.
Note that while addition is associative on real numbers, it is not in fact associative on floating-point numbers. Similarly to the optimizations enabled by command-line flags like -ffast-math, using reductions can introduce numerical errors compared to sequential execution. It is the programmer's responsibility to ensure that the reduction is numerically correct for a given application.
Further Optimizations
For certain problems and problem sizes, some additional changes can improve execution time even further:
The a and b arrays are only read inside the kernels region, while r is only written. It is not necessary to copy all of these arrays to and from the accelerator. Instead it suffices to only copy a and b to the accelerator, while space for r can be created on the accelerator, written to by the kernel, and copied out to the host. This can be achieved by changing the copy clause to two clauses: copyout(r[0:N][0:N]) copyin(a[0:N][0:N], b[0:N][0:N]). (In the matrix multiplication example, data movement overhead is dominated by computation time, and this does not result in a measurable improvement on large matrices.)
Some of the loops can be annotated with gang, worker, or vector clauses to fine-tune the allocation of the loops to the different kinds of parallelism. Otherwise, GCC computes an automatic allocation.
Adding num_gangs, num_workers, or vector_length clauses can further fine-tune the loop schedule.
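Putting these suggestions together, the fully annotated function might look like the following sketch. The copyout/copyin split follows the text above; the gang and vector placements are hypothetical tuning choices whose effect must be measured on a given device, and the small default for N here is only so the sketch is self-contained (the tutorial's measurements use N = 2000):

```c
#ifndef N
#define N 64  /* illustration only; the tutorial's measurements use N = 2000 */
#endif

void matrix_multiply (float r[N][N], const float a[N][N], const float b[N][N])
{
  /* r is only written, a and b are only read, so a full copy of all
     three arrays in both directions is unnecessary.  */
#pragma acc kernels \
  copyout(r[0:N][0:N]) copyin(a[0:N][0:N], b[0:N][0:N])
  {
    /* gang/vector placement here is a hypothetical starting point;
       without these clauses GCC picks an allocation automatically.  */
#pragma acc loop independent gang
    for (int j = 0; j < N; j++)
      {
#pragma acc loop independent vector
        for (int i = 0; i < N; i++)
          {
            float sum = 0;
#pragma acc loop independent reduction(+: sum)
            for (int k = 0; k < N; k++)
              sum += a[i][k] * b[k][j];
            r[i][j] = sum;
          }
      }
  }
}
```

When compiled without OpenACC support, the pragmas are ignored and the function still computes the same result sequentially.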