This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: OpenACC support in 4.9


Another interesting use-case for OpenACC and OpenMP is mixing both
standard annotations for the same loop:
  // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
This means that OpenACC pragmas should be parsed before the OpenMP pass
runs (in case both standards are enabled), i.e. before the OpenMP pass
changes the annotated GIMPLE statements irrecoverably. In my view this
use-case could be handled, for example, in the following way:
we could add a temporary variable, say
"expand_gimple_with_openmp", and change the example above to something
like this just before the OpenMP pass:


if (expand_gimple_with_openmp) {
#pragma omp parallel for default(none) shared(A,B,C,size)
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
} else {
#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
  pcopyout(C[0:size][0:size])
  for (int i = 0; i < size; ++i) {
    for (int j = 0; j < size; ++j) {
      float tmp = 0.;
      for (int k = 0; k < size; ++k) {
        tmp += A[i][k] * B[k][j];
      }
      C[i][j] = tmp;
    }
  }
}
Later, at the Graphite pass, we could recognize that our statement is a
SCoP, produce a kernel for it, set the expand_gimple_with_openmp
heuristic to false, and eliminate the OpenMP version of the loop (or
vice versa). But we have to make sure that the optimization passes do
not transform the OpenACC GIMPLE in a way that makes it
unparallelizable.
                               thanks, Dinar.

On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus <burnus@net-b.de> wrote:
> Jakub Jelinek wrote:
> [Fallback generation of CPU code]
>>
>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the required
>> behavior, if the code is for whatever reason not possible to run on the
>> accelerator, it should be executed on host [...]
>
> (I haven't checked, but is this a compile time or run-time requirement?)
>
>
>> Otherwise, the OpenMP runtime as well as the pragmas have a way to choose
>> which accelerator you want to run something on, as device id (integer), so
>> the OpenMP runtime library should maintain the list of supported
>> accelerators (say if you have two Intel MIC cards, and two AMD GPGPU
>> devices), and probably we'll need a compiler switch to say for which kinds
>> of accelerators we want to generate code for, plus the runtime could have
>> dlopened plugins for each of the accelerator kinds.
>
>
> At least two OpenACC implementations I know fail hard when the GPU is not
> available (nonexisting or if the /dev/... has not the right permissions).
> And three of them fail at compile time with an error message if an
> expression within a device section is not possible (e.g. calling some
> nondevice/noninlinable function).
>
> While it is convenient to have a CPU fallback, it would be nice to know
> whether some code actually uses the accelerator - both at compile time and
> at run time. Otherwise, one thinks the GPU is used - without realizing
> that it isn't because, e.g., the device permissions are wrong - or one
> forgot to declare a certain function as a target function.
>
> Besides having a flag which tells the compiler for which accelerator the
> code should be generated, also additional flags should be handled, e.g. for
> different versions of the accelerator. For instance, one accelerator model
> of the same series might support double-precision variables while another
> might not. - I assume that falling back to the CPU if the accelerator
> doesn't support a certain feature won't work and one will get an error in
> this case.
>
>
> Is there actually the need to handle multiple accelerators simultaneously?
> My impression is that both OpenACC and OpenMP 4 assume that there is only
> one kind of accelerator available besides the host. If I missed some fine
> print or something else requires that there are multiple different
> accelerators, it will get more complicated - especially for those code
> section where the user didn't explicitly specify which one should be used.
>
>
> Finally, one should think about debugging. It is not really clear (to me)
> how to handle this best, but as the compiler generates quite some additional
> code (e.g. for copying the data around) and as printf debugging doesn't work
> on GPUs, it is not that easy. I wonder whether there should be an optional
> library like libgomp_debug which adds additional sanity checks (e.g. related
> to copying data to/from the GPU) and which allows one to print diagnostic
> output when one sets an environment variable.
>
> Tobias

