This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: OpenACC-Library-Interoperability


Sorry, I realized I have to send it as plain text.

I finally found some time to repeat my previous attempts, but with GCC
6.1. I'll try to be a little detailed so that people don't have to
waste time repeating what I did.

I used the same script as before

https://github.com/olcf/OLCFHack15/blob/master/GCC5OffloadTest/auto-gcc5-offload-openacc-build-install.sh

And modified it with the paths for GCC 6.1. For some reason the

x86_64-pc-linux-gnu-accel-nvptx-none-gcc

binary that ends up in install/bin

does not search the directories in $LD_LIBRARY_PATH. I found this out
using the -print-search-dirs option:

x86_64-pc-linux-gnu-accel-nvptx-none-gcc -print-search-dirs

So I got a few "library not found" errors for libraries that were in
fact present in install/lib64. I simply copied those libraries to
install/nvptx-none/lib/.

With that, I can now use GCC 6.

On to the next task. I made a simple test case, which I'll copy-paste
in full here. Let's call it test.f90.


program example_dgemm

  use iso_c_binding
  implicit none

  integer                     :: N = 8
  real(c_double), allocatable :: A(:,:), B(:, :), C(:, :)
  integer                     :: size_of_real, i, j
  integer*8 :: devPtrA, devPtrB, devPtrC

  size_of_real = 8   ! a real(c_double) is 8 bytes

  allocate(A(N, N))
  allocate(B(N, N))
  allocate(C(N, N))
  C = 0.0_c_double   ! initialize C so the upload below copies defined data

  !$ACC PARALLEL LOOP COPY(A)
  do i = 1, N
      do j = 1, N
          A(i, j) = i + j
      end do
  end do
  !$ACC END PARALLEL LOOP
  !$ACC PARALLEL LOOP COPY(B)
  do i = 1, N
      do j = 1, N
          B(i, j) = j
      end do
  end do
  !$ACC END PARALLEL LOOP

  call cublas_Alloc(N*N, size_of_real, devPtrA)
  call cublas_Alloc(N*N, size_of_real, devPtrB)
  call cublas_Alloc(N*N, size_of_real, devPtrC)

  ! Copy Fixed Data to the GPU
  call cublas_Set_Matrix(N, N, size_of_real, A, N, devPtrA, N)
  call cublas_Set_Matrix(N, N, size_of_real, C, N, devPtrC, N)

  ! Copy data to the GPU
  call cublas_Set_Matrix(N, N, size_of_real, B, N, devPtrB, N)

  ! Do DGEMM on the GPU
  call cublas_DGEMM('N', 'N', N, N, N, &
       1.0_c_double, devPtrA, N, devPtrB, N, 0.0_c_double, devPtrC, N)

  ! Copy data from the GPU
  call cublas_Get_matrix(N, N, size_of_real, devPtrC, N, C, N)

  call cublas_Free(devPtrA)
  call cublas_Free(devPtrB)
  call cublas_Free(devPtrC)

  deallocate(A)
  deallocate(B)
  deallocate(C)

end program example_dgemm
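One thing the test case doesn't do is check the result. Since
A(i,j) = i + j and B(k,j) = j, the exact product is
C(i,j) = j * (N*i + N*(N+1)/2), so a quick check like the following
(my addition, not something I actually ran) could go right before the
deallocations:

  ! Hypothetical correctness check: with A(i,j) = i + j and B(k,j) = j,
  ! DGEMM gives C(i,j) = j * (N*i + N*(N+1)/2) exactly.
  do i = 1, N
     do j = 1, N
        if (abs(C(i, j) - j*(N*i + N*(N+1)/2)) > 1.0e-10_c_double) then
           print *, 'mismatch at', i, j, C(i, j)
        end if
     end do
  end do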

Basically, I can test both cuBLAS and OpenACC using this simple code.
But using cuBLAS needs some extra pieces: I have to copy fortran.h,
fortran_common.h and fortran.c from /usr/local/cuda/src.
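For anyone who hasn't seen them: fortran.c gives you the non-thunking
interface, where you manage device memory yourself through integer*8
handles (fortran_thunking.c is the variant that does the copies for
you). Roughly, the wrappers the test case calls look like this from
the Fortran side (my paraphrase of the declarations in fortran.h, so
verify against your CUDA version; the real wrappers take void* for the
host arrays, I have narrowed them to c_double for this example):

  interface
     subroutine cublas_alloc (n, elem_size, dev_ptr)
       integer   :: n, elem_size
       integer*8 :: dev_ptr              ! receives the device address
     end subroutine cublas_alloc
     subroutine cublas_set_matrix (rows, cols, elem_size, a, lda, dev_b, ldb)
       use iso_c_binding
       integer        :: rows, cols, elem_size, lda, ldb
       real(c_double) :: a(lda, *)       ! host source
       integer*8      :: dev_b           ! device destination handle
     end subroutine cublas_set_matrix
     subroutine cublas_get_matrix (rows, cols, elem_size, dev_a, lda, b, ldb)
       use iso_c_binding
       integer        :: rows, cols, elem_size, lda, ldb
       integer*8      :: dev_a           ! device source handle
       real(c_double) :: b(ldb, *)       ! host destination
     end subroutine cublas_get_matrix
     subroutine cublas_free (dev_ptr)
       integer*8 :: dev_ptr
     end subroutine cublas_free
  end interface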

Now, I do

 ./rungcc6.sh gcc -Wall -g -I/usr/local/cuda/include \
     -I/usr/local/cuda/src -DCUBLAS_GFORTRAN -c fortran.c

Finally, I do

./rungcc6.sh gfortran -Wall -g test.f90 fortran.o -fopenacc \
    -foffload=nvptx-none -foffload=-O3 -O3 -o gpu.x \
    -L/usr/local/cuda/lib64 -lcublas -lcudart

And then nvprof ./gpu.x gives this output:

==9155== Profiling application: ./gpu.x
==9155== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.60%  13.952us         1  13.952us  13.952us  13.952us  MAIN__$_omp_fn$0
 23.55%  13.920us         1  13.920us  13.920us  13.920us  MAIN__$_omp_fn$1
 18.57%  10.976us        16     686ns     576ns  1.0880us  [CUDA memcpy HtoD]
 13.72%  8.1080us         2  4.0540us  2.0430us  6.0650us  [CUDA memcpy HtoH]
 12.07%  7.1360us         1  7.1360us  7.1360us  7.1360us  void gemm_kernel2x2_core<double, bool=0, bool=0, bool=0, bool=0, bool=0>(double*, double const *, double const *, int, int, int, int, int, int, double*, double*, double, double, int)
  8.50%  5.0240us         3  1.6740us  1.5680us  1.8880us  [CUDA memcpy DtoH]

Clearly, the OpenACC loops run on the GPU, and the DGEMM kernel runs
on the GPU as well.

So the only thing left now is to use the host_data construct, so that
the whole cublas_Alloc / cublas_Set_Matrix / cublas_Get_Matrix dance
is no longer needed.
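For the record, here is roughly what I expect the host_data version to
look like once the patch is in (untested, obviously; the interface
block is my own hand-written binding to the legacy cuBLAS C entry
point cublasDgemm, so treat the exact declarations as my assumption):

  ! Hypothetical host_data variant: OpenACC manages the device copies
  ! and hands cuBLAS the device addresses, so no cublas_Alloc /
  ! cublas_Set_Matrix / cublas_Get_Matrix.
  interface
     subroutine cublasdgemm (transa, transb, m, n, k, alpha, a, lda, &
                             b, ldb, beta, c, ldc) bind(c, name="cublasDgemm")
       use iso_c_binding
       character(kind=c_char), value :: transa, transb
       integer(kind=c_int), value    :: m, n, k, lda, ldb, ldc
       real(kind=c_double), value    :: alpha, beta
       type(*), dimension(*)         :: a, b, c   ! device addresses here
     end subroutine cublasdgemm
  end interface

  !$acc data copyin(A, B) copy(C)
  !$acc host_data use_device(A, B, C)
  call cublasdgemm('N', 'N', N, N, N, &
       1.0_c_double, A, N, B, N, 0.0_c_double, C, N)
  !$acc end host_data
  !$acc end data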

If you can tell me how to apply the patch, I can test it with this
simple example.

Regards,
Vikram

On Mon, Aug 1, 2016 at 6:58 PM, Vikram Singh <vikramsingh001@gmail.com> wrote:
> On Thu, May 12, 2016 at 6:34 PM, Thomas Schwinge <thomas@codesourcery.com>
> wrote:
>>
>> Hi!
>>
>> On Fri, 15 Apr 2016 13:59:53 +0300, Vikram Singh
>> <vikramsingh001@gmail.com> wrote:
>> > I checked libgomp.oacc-c-c++-common/context-*.c, and they seem to be
>> > exactly what I needed to start with.
>> >
>> > If I understand it correctly, I'll anyway be trying to implement
>> > something similar with PGI compilers, so I'll have something as a
>> > starting point.
>>
>> Any progress on that already?  I had a very quick look myself, but it's
>> not as easy as I thought...  A "courageous" use of "use cublas" in
>> Fortran code compiled with gfortran results in: "Fatal Error: Can't open
>> module file 'cublas.mod' for reading".  The problem is: Fortran
>> interfacing to C libraries (which cuBLAS is).
>>
>> <http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings>
>> has some instructions how to do it.  Anyone got that to work already?
>>
>> > But again, it will need the OpenACC host_data construct to be setup
>> > for gfortran to test.
>>
>> In
>>
>> <http://news.gmane.org/find-root.php?message_id=%3C2b4f59d5-be38-2814-27bb-73aa7ffb4b8f%40codesourcery.com%3E>,
>> Chung-Lin has now posted a patch (pending review) that should make the
>> OpenACC host_data construct usable in GCC Fortran.  (Problem discussed in
>>
>> <http://news.gmane.org/find-root.php?message_id=%3C878u0o6wwj.fsf%40kepler.schwinge.homeip.net%3E>
>> before.)
>>
>> For reference:
>>
>> > On Fri, Apr 15, 2016 at 11:57 AM, Thomas Schwinge
>> > <thomas@codesourcery.com> wrote:
>> > > On Fri, 15 Apr 2016 11:35:06 +0300, Vikram Singh
>> > > <vikramsingh001@gmail.com> wrote:
>> > >> Yes, I came to the conclusion that host_data
>> > >> would be the only way to do it in fortran.
>> > >>
>> > >> On the other hand, I though there were no plans to implement it in
>> > >> gfortran 6 either
>> > >
>> > > I still hope we'll get this (that is, <https://gcc.gnu.org/PR70598>)
>> > > fixed in time for the GCC 6.1 release.  I'll keep you posted.
>> > >
>> > >
>> > >> > I'm copying Jim, who is the author of this chapter in the
>> > >> > documentation
>> > >> > as well as the
>> > >> > libgomp/testsuite/libgomp.oacc-c-c++-common/context-*.c
>> > >> > test cases, and much of the relevant libgomp code, too, and who
>> > >> > should
>> > >> > please correct me if I'm wrong.  I'll make a note for later, that
>> > >> > we
>> > >> > should translate the libgomp.oacc-c-c++-common/context-*.c test
>> > >> > cases to
>> > >> > Fortran, and also replicate them using the OpenACC host_data
>> > >> > construct
>> > >> > (like in
>> > >> > libgomp/testsuite/libgomp.oacc-c-c++-common/host_data-1.c), and
>> > >> > the same for the documentation you referenced.  (Vikram, please
>> > >> > tell if
>> > >> > you're interested to work on these items.)
>> > >>
>> > >> I am not sure I understand what you want me to work on exactly. I am
>> > >> not really much of a C programmer, so I wouldn't be good at it. I
>> > >> would still like to help.
>> > >
>> > > Sorry for being unclear.  My idea/question has been whether you're
>> > > interested in helping by translating the documentation as well as the
>> > > libgomp.oacc-c-c++-common/context-*.c test cases from C to Fortran
>> > > (using
>> > > the OpenACC host_data construct instead of the acc_* functions).  If
>> > > yes,
>> > > then that's great, if not, then one of us will do it at some point.
>>
>>
>> Regards
>>  Thomas

