[OpenACC] Performance issues on simple example program

Christopher Guckes chris@guckes-webstyle.com
Tue Jun 21 17:27:00 GMT 2016

I can compile and run the PI example from
http://scelementary.com/2015/04/25/openacc-in-gcc.html with the current
GCC 6.1.0. However, the GPU version of the code is much slower than the
CPU version, and I can't figure out why. I didn't have this problem with
GCC 5.3.0.

The code looks as follows:

#include <stdio.h>
#include <stdlib.h>

#define N 200000000

int main(void) {
  double pi = 0.0f;
  long long i;

  #pragma acc data copyout(pi)
    #pragma acc parallel loop reduction (+:pi) present (pi)
    for (i=0; i<N; i++) {
      double t= (double)((i+0.5)/N);
      pi +=4.0/(1.0+t*t);
    }

  return 0;
}

The GPU version takes about four times as long as the CPU version of the
code. I used the NVIDIA Visual Profiler to make sure it wasn't a copy
operation that tanked the runtime: copying was measured at 0.1%, while
the kernel itself runs for about six seconds on a GTX 970. The profiler
reports an occupancy of 1.6% and names the grid size as the limiting
factor. I'm quite new to GPU code, so I'm not sure what to do about
that. The original sample code used a vector length of 1024, whereas the
default seems to be 32 in GCC 6.1.0. When I try to set the vector length
to 1024 manually, the compiler warns that it will ignore the setting.
What else can I try to get this to run faster?

Thanks in advance
