This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: OpenACC support in 4.9


Dinar Temirbulatov <dtemirbulatov@gmail.com> wrote:

>Another interesting use-case for OpenACC and OpenMP is mixing
>annotations from both standards on the same loop:
> // Compute matrix multiplication.
>#pragma omp parallel for default(none) shared(A,B,C,size)
>#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
>  pcopyout(C[0:size][0:size])
>
>  for (int i = 0; i < size; ++i) {
>    for (int j = 0; j < size; ++j) {
>      float tmp = 0.;
>      for (int k = 0; k < size; ++k) {
>        tmp += A[i][k] * B[k][j];
>      }
>      C[i][j] = tmp;
>    }
>  }
>This means that OpenACC pragmas should be parsed before the OpenMP
>pass (in case both standards are enabled), i.e. before the OpenMP pass
>changes the annotated GIMPLE statements irrecoverably. In my view this
>use-case could be handled, for example, in this way:
>we could add some temporary variable, for example
>"expand_gimple_with_openmp", and change the example above to something
>like this just before the OpenMP pass:
>
>
>if (expand_gimple_with_openmp) {
>#pragma omp parallel for default(none) shared(A,B,C,size)
>  for (int i = 0; i < size; ++i) {
>    for (int j = 0; j < size; ++j) {
>      float tmp = 0.;
>      for (int k = 0; k < size; ++k) {
>        tmp += A[i][k] * B[k][j];
>      }
>      C[i][j] = tmp;
>    }
>  }
>} else {
>#pragma acc kernels pcopyin(A[0:size][0:size],B[0:size][0:size]) \
>  pcopyout(C[0:size][0:size])
>  for (int i = 0; i < size; ++i) {
>    for (int j = 0; j < size; ++j) {
>      float tmp = 0.;
>      for (int k = 0; k < size; ++k) {
>        tmp += A[i][k] * B[k][j];
>      }
>      C[i][j] = tmp;
>    }
>  }
>}
>and later, at the Graphite pass, we could recognize that our statement
>is a SCoP (static control part), produce a kernel for it, and then set
>the expand_gimple_with_openmp heuristic to false so that the OpenMP
>version of the loop can be eliminated, or vice versa. But we have to
>make sure that optimization passes do not change our OpenACC GIMPLE in
>a way that makes it no longer parallelizable.

No, the point is that we want a middle-end annotation that covers both
at the same time. Otherwise, factoring in other standards will quickly
become unmanageable.

Richard.
>                               thanks, Dinar.
>
>
>
>On Fri, May 10, 2013 at 2:06 PM, Tobias Burnus <burnus@net-b.de> wrote:
>
>> Jakub Jelinek wrote:
>> [Fallback generation of CPU code]
>>
>>> If one uses the OpenMP 4.0 accelerator pragmas, then that is the
>>> required behavior: if the code is for whatever reason not possible
>>> to run on the accelerator, it should be executed on the host [...]
>>>
>> (I haven't checked, but is this a compile-time or run-time
>> requirement?)
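>>
>> For illustration, a minimal sketch of the fallback behavior Jakub
>> describes - assuming OpenMP 4.0's target construct and the
>> omp_is_initial_device() API routine (both from the OpenMP 4.0 spec,
>> not from this thread):
>>
>> #include <omp.h>
>> #include <stdio.h>
>>
>> int
>> main (void)
>> {
>>   int on_host = 1;
>>
>>   /* The target region runs on the default accelerator when one is
>>      available; otherwise the runtime falls back to the host.  */
>> #pragma omp target map(tofrom: on_host)
>>   on_host = omp_is_initial_device ();
>>
>>   printf ("executed on %s\n",
>>           on_host ? "host (fallback)" : "accelerator");
>>   return 0;
>> }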
>>
>>
>>  Otherwise, the OpenMP runtime as well as the pragmas have a way to
>>> choose which accelerator you want to run something on, as a device
>>> id (integer), so the OpenMP runtime library should maintain the
>>> list of supported accelerators (say, if you have two Intel MIC
>>> cards and two AMD GPGPU devices), and we'll probably need a
>>> compiler switch to say which kinds of accelerators we want to
>>> generate code for; plus, the runtime could have dlopened plugins
>>> for each of the accelerator kinds.
>>>
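>>
>> As a concrete sketch of that device-id mechanism - using only OpenMP
>> 4.0 API routines and clauses (omp_get_num_devices() and the device()
>> clause); the loop body is made up for illustration:
>>
>> #include <omp.h>
>>
>> void
>> scale_on_device (float *a, int n, int dev)
>> {
>>   /* Device ids 0 .. omp_get_num_devices () - 1 name the available
>>      accelerators; out-of-range ids are handled on the host here.  */
>>   if (dev >= 0 && dev < omp_get_num_devices ())
>>     {
>> #pragma omp target device(dev) map(tofrom: a[0:n])
>> #pragma omp parallel for
>>       for (int i = 0; i < n; ++i)
>>         a[i] *= 2.0f;
>>     }
>>   else
>>     for (int i = 0; i < n; ++i)  /* host fallback */
>>       a[i] *= 2.0f;
>> }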
>>
>> At least two OpenACC implementations I know of fail hard when the
>> GPU is not available (nonexistent, or the /dev/... node does not
>> have the right permissions). And three of them fail at compile time
>> with an error message if an expression within a device section is
>> not supported (e.g. calling some non-device, non-inlinable function).
>>
>> While it is convenient to have a CPU fallback, it would be nice to
>> know whether some code actually uses the accelerator - both at
>> compile time and at run time. Otherwise, one thinks the GPU is used -
>> without realizing that it isn't because, e.g., the device permissions
>> are wrong or one forgot to declare a certain function as a target
>> function.
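>>
>> A run-time check of this kind can be done with the OpenACC runtime
>> API - a minimal sketch (acc_get_num_devices() and acc_device_nvidia
>> are from the OpenACC spec; the messages are made up):
>>
>> #include <openacc.h>
>> #include <stdio.h>
>>
>> int
>> main (void)
>> {
>>   /* Ask the runtime how many devices of a given type are actually
>>      usable, instead of silently falling back to the CPU.  */
>>   int n = acc_get_num_devices (acc_device_nvidia);
>>   if (n == 0)
>>     fprintf (stderr, "warning: no usable GPU, running on the host\n");
>>   else
>>     printf ("found %d GPU device(s)\n", n);
>>   return 0;
>> }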
>>
>> Besides a flag which tells the compiler for which accelerator the
>> code should be generated, additional flags should be handled as
>> well, e.g. for different versions of the accelerator. For instance,
>> one accelerator model of a series might support double-precision
>> variables while another might not. I assume that falling back to the
>> CPU if the accelerator doesn't support a certain feature won't work,
>> and one will get an error in this case.
>>
>>
>> Is there actually a need to handle multiple accelerators
>> simultaneously? My impression is that both OpenACC and OpenMP 4
>> assume that there is only one kind of accelerator available besides
>> the host. If I missed some fine print, or something else requires
>> that there are multiple different accelerators, it will get more
>> complicated - especially for those code sections where the user
>> didn't explicitly specify which one should be used.
>>
>>
>> Finally, one should think about debugging. It is not really clear
>> (to me) how to handle this best, but as the compiler generates quite
>> a bit of additional code (e.g. for copying data around), and as
>> printf debugging doesn't work on GPUs, it is not that easy. I wonder
>> whether there should be an optional library like libgomp_debug which
>> adds additional sanity checks (e.g. related to copying data to/from
>> the GPU) and which allows one to print diagnostic output when an
>> environment variable is set.
>>
>> Tobias
>>


