Bug 80859

Summary: Performance Problems with OpenMP 4.5 support
Product: gcc Reporter: Thorsten Kurth <thorstenkurth>
Component: libgompAssignee: Jakub Jelinek <jakub>
Status: ASSIGNED ---    
Severity: normal CC: jakub, webrown.cpp
Priority: P3 Keywords: missed-optimization, openmp
Version: 6.3.0   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2017-05-24 00:00:00
Attachments: OpenMP 4.5 Testcase
Testcase

Description Thorsten Kurth 2017-05-22 20:33:50 UTC
Dear Sir/Madam,

I am working on the Cori HPC system, a Cray XC-40 with intel Xeon Phi 7250. I probably found a performance "bug" when using the OpenMP 4.5 target directives. It seems to me that the GNU compiler generates unnecessary move and push functions when a 

#pragma omp target region is present but no offloading is used.

I have attached a test case to illustrate that problem. Please compile the nested_test_omp_4dot5.x in the directory (don't be confused by the name, I am not using nested OpenMP here). Then go into the corresponding .cpp file and comment out the target-related directives (target teams and distribute), compile again and then compare the assembly code. The code with the target directives has more pushes and moves than the one without. I think I also place the output of that process in the directory already, the files ending in .as.

The performance overhead is marginal here but I am currently working on a Department of Energy performance portability project and I am exploring the usefulness of OpenMP 4.5. The code we retargeting is a Geometric Multigrid in the BoxLiv/AMReX framework and there the overhead is significant. I could observe as much as 10x slowdown accumulated throughout the app. This code is bigger and thus I do not want to demonstrate that here but I could send you an invitation to the github repo if requested. In my opinion, if no offloading is used, the compiler should just ignore the target region statements and just default to plain OpenMP. 

Please let me know what you think.

Best Regards
Thorsten Kurth
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Comment 1 Jakub Jelinek 2017-05-24 07:36:59 UTC
I don't see any attachments.  The target directives can't be ignored, some clauses have significant role even when doing host fallback (data sharing).
Plus the question is what do you mean by no offloading.  Do you mean compiler configured without any offloading support, or compiler configured with offloading support, but with offloading not selected, or by compiler with offloading support and offloading generated, but at runtime deciding it has to use host fallback?
E.g. the second case, the decision whether offloading will be supported or not is not done at compile time, but at link time, where one can choose whether to emit offloading or not and to which subset of the configured offloading targets.
Comment 2 Jakub Jelinek 2017-05-24 08:19:03 UTC
Also, even for host fallback there is a separate set of ICVs and many other properties, the target region can't be just ignored for many reasons even if there is no data sharing.
Of course, if you provide small testcases then we can discuss on what can and what can't be optimized in detail.
Comment 3 Thorsten Kurth 2017-05-24 15:05:58 UTC
Created attachment 41414 [details]
OpenMP 4.5 Testcase

This is the source code
Comment 4 Thorsten Kurth 2017-05-24 15:08:03 UTC
Created attachment 41415 [details]
Testcase

This is the test case. The files ending on .as contain the assembly code with and without target region
Comment 5 Thorsten Kurth 2017-05-24 15:11:38 UTC
To clarify the problem:
I think that the additional movq, pushq and other instructions generated when using the target directive can cause a big hit on the performance. I understand that these instructions are necessary when offloading is used but in case when I compile for native architecture those should not be there. So maybe I am just missing a GNU compiler flag which disables offloading and lets the compiler ignore the target, teams and distribute directives at compile time but still honoring all the other OpenMP constructs. 
Is there a way to do that right now and if not, is there a way to add that flag that supports this.
Comment 6 Jakub Jelinek 2017-05-24 15:25:50 UTC
movq/pushq etc. aren't that expensive, if it affects performance it must be something in the inner loops.  A compiler switch that ignores omp target, teams and distribute would basically create a new OpenMP version if it would ignore the requirements on those constructs, you can achive it yourself by using
those in _Pragma in some macro and defining it conditionally based on whether you want offloading or not, then the "you can ignore all side effects" is decided by you.  For OpenMP 5.0, there is some work on prescriptive vs. descriptive clauses/constructs where in your case you could just use a describe that the loop could be parallelized, simdized and/or offloaded and keep that up to the implementation what it does with that.

What we perhaps could do is when not offloading try to simplify omp distribute (if we know omp_get_num_teams () will be always 1), either just by folding the library calls in that case to 1 or 0, or perhaps doing some more.

#pragma omp target teams
	{
		num_teams=omp_get_num_teams();
	}
	
#pragma omp parallel
	{
		num_threads=omp_get_num_threads();
	}
in your testcase is just wrong, the target would be ok in OpenMP 4.0, but it is not in 4.5, num_teams, being a scalar variable, is firstprivate, so you won't get the value back.
The parallel is racy, to avoid races you'd need #pragma omp single or #pragma omp master.

Why are you using separate distribute and parallel for constructs and prescribing what they handle, instead of just using
#pragma omp distribute parallel for
  for (int i = 0; i < N; ++i) D[i] += B[i] * C[i];
?  Do you expect or see any gains from that?
Comment 7 Thorsten Kurth 2017-05-24 15:42:12 UTC
Hello Jakub,

thanks for your comment but I think the parallel for is not racey. Every thread is working a block of i-indices so that is fine. The dotprod kernel is actually a kernel from the OpenMP standard documentation and I am sure that this is not racey. 

The example with the regions you mentioned I do not see a problem with that either. By default, everything is shared so the variable is updated by all the threads/teams with the same value. 

The issue is that num_teams=1 is only true for CPU, for GPU it is OS, driver, architecture and whatever dependent. 

Concerning splitting distribute and parallel: I tried both combinations and found that they behave the same. But in the end I split it so that I could comment out the distribute section to see if that makes a performance difference (and it does).

I believe that the overhead instructions are responsible for the bad performance because that is the only thing which distinguishes the target annotated code from the plain openmp code. I used vtune to look at thread utilization and they look similar, L1, L2 hit rates are very close (100% vs 99% and 92% vs 89%) for the plain openmp and for the target annotated code. BUT the performance of the target annotated code can be up to 10x worse. So I think there might be register spilling due to copying a large amount of variables. If you like I can point you to the github repo code (BoxLib) which clearly exhibits this issue. This small test case only shows minor overhead of OpenMP 4.5 vs, say, OpenMP 3 but it clearly generates some additional overhead.
Comment 8 Thorsten Kurth 2017-05-24 15:45:04 UTC
Here is the output of the get_num_threads section:

[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.

and:

[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 12 threads.

I think the code is OK.
Comment 9 Thorsten Kurth 2017-05-24 15:46:22 UTC
Sorry, in the second run I set the number of threads to 12. I think the code works as expected.
Comment 10 Jakub Jelinek 2017-05-24 15:58:16 UTC
(In reply to Thorsten Kurth from comment #7)
> Hello Jakub,
> 
> thanks for your comment but I think the parallel for is not racey. Every
> thread is working a block of i-indices so that is fine. The dotprod kernel
> is actually a kernel from the OpenMP standard documentation and I am sure
> that this is not racey. 

I was not talking about the parallel for, but about the parallel I've cited.
Even if you write the same value from all threads, at least pedantically it is racy, even when you might get away with that.  Which is why you should assign it just once, e.g. through #pragma omp master or single.
 
> The example with the regions you mentioned I do not see a problem with that
> either. By default, everything is shared so the variable is updated by all
> the threads/teams with the same value. 

The omp target I've cited above is by default handled in OpenMP 4.0 as
#pragma omp target teams map(tofrom:num_teams)
and will work that way, although it is again pedantically racy, multiple teams
write the same value.
In OpenMP 4.5 it is
#pragma omp target teams firstprivate(num_teams)
and you will always end up with 1, even if there is accelerator that has say 1024 teams by default.  So you really need explicit map(from:num_teams) or similar to get the value back.  And to be pedantically correct also
assign it only once, e.g. by doing the assignment only if (omp_get_team_num () == 0).

> Concerning splitting distribute and parallel: I tried both combinations and
> found that they behave the same. But in the end I split it so that I could
> comment out the distribute section to see if that makes a performance
> difference (and it does).

I was just asking why you are doing it, I haven't yet analyzed the code if there is something that could be easily improved.
Comment 11 Thorsten Kurth 2017-05-24 16:06:26 UTC
Hello Jakub,

yes, you are right. I thought that map(tofrom:XXXX) is the default mapping but I might be wrong. In any case, teams is always 1. So this code is basically just data streaming  so there is no need for a detailed performance analysis. When I timed the code (not profiling it) the OpenMP 4.5 code had a tiny bit more overhead, but not significant. 
However, we might nevertheless learn from that. 

Best
Thorsten
Comment 12 Jakub Jelinek 2017-05-24 16:16:30 UTC
(In reply to Thorsten Kurth from comment #11)
> yes, you are right. I thought that map(tofrom:XXXX) is the default mapping
> but I might be wrong. In any case, teams is always 1. So this code is

Variables that aren't pointers nor scalars are still implicitly map(tofrom:XXXX),
scalars are implicitly firstprivate(XXXX), pointers are map(alloc:ptr[0:0]).

> basically just data streaming  so there is no need for a detailed
> performance analysis. When I timed the code (not profiling it) the OpenMP
> 4.5 code had a tiny bit more overhead, but not significant. 
> However, we might nevertheless learn from that. 

What kind of compiler options you use?  -O2 -fopenmp, -O3 -fopenmp, -Ofast -fopenmp, something different?  What ISA choice? -march=native, -mavx2, ...?
The 10x slowdown could most likely be explained by the inner loop being vectorized in one case and not the other.  You aren't using #pragma omp parallel for simd that you'd explicitly ask for vectorization e.g. even at -O2 -fopenmp.
Comment 13 Thorsten Kurth 2017-05-24 16:29:02 UTC
Hello Jakub,

the compiler options are just -fopenmp. I am sure it does not have to do anything with vectorization as I compare the code runtime with and without the target directives and thus vectorization should be the same between them. The remaining OpenMP sections are the same. In our work we have not seen 10x because of insufficient vectorization, it is usually because of cache locality but that is the same for OMP 4.5 and OMP 3 because the loops are not touched.
I do not specify an ISA choice, but I will try specifying KNL now and will tell you what the compiler is going to do.

Best
Thorsten
Comment 14 Jakub Jelinek 2017-05-24 16:33:40 UTC
(In reply to Thorsten Kurth from comment #13)
> the compiler options are just -fopenmp. I am sure it does not have to do
> anything with vectorization as I compare the code runtime with and without
> the target directives and thus vectorization should be the same between
> them. The remaining OpenMP sections are the same. In our work we have not
> seen 10x because of insufficient vectorization, it is usually because of
> cache locality but that is the same for OMP 4.5 and OMP 3 because the loops
> are not touched.
> I do not specify an ISA choice, but I will try specifying KNL now and will
> tell you what the compiler is going to do.

The compiler doesn't optimize by default (i.e. default is -O0), so if you are measuring -O0 -fopenmp performance or code size, that is something that is completely uninteresting.  For -O0 the most important is compilation speed, not quality of generated code.  For runtime performance of generated code only -O2, -O3 or -Ofast are optimization levels that make sense.
Comment 15 Thorsten Kurth 2017-05-24 16:35:34 UTC
The code I care about definitely has optimization enabled. For the fortran stuff it does (for example): 

ftn  -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore -Jo/3d.gnu.MPI.OMP.EXE -I o/3d.gnu.MPI.OMP.EXE -fimplicit-none  -fopenmp -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/LinearSolvers/F_MG/itsol.f90 -o o/3d.gnu.MPI.OMP.EXE/itsol.o
Compiling cc_mg_tower_smoother.f90 ...

and for the C++ stuff it does

CC  -g -O3 -std=c++14  -fopenmp -g -DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DDEVID=0 -DNUM_TEAMS=1 -DNUM_THREADS_PER_BOX=1 -march=knl  -DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='6.3.0' -DBL_GCC_MAJOR_VERSION=6 -DBL_GCC_MINOR_VERSION=3 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/C_BaseLib/FPC.cpp -o o/3d.gnu.MPI.OMP.EXE/FPC.o
Compiling Box.cpp ...

But the kernels I care about are in C++.
Comment 16 Thorsten Kurth 2017-05-24 16:44:21 UTC
FYI, the code is:

https://github.com/zronaghi/BoxLib.git

in branch

cpp_kernels_openmp4dot5

and then in Src/LinearSolvers/C_CellMG

the file ABecLaplacian.cpp. For example, lines 542 and 543 can be commented out and commented in and when the test case in run you get significant slowdown when the code is compiled with that stuff commented in. I did not map all the scalar stuff so it might be that this is a problem. But in any case, it should not create copies of that stuff at all in my opinion.

Please don't look at that code right now because it is a bit convoluted I just wanted to show that this issue appears. So when I have the target section I mentioned above commented in I get by running:

#!/bin/bash

export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1

execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`

#execute
${exec} inputs

the following:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit) 
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
----------------------------------------
Solving with BoxLib C++ solver 
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs                = 135.516568492921
MultiGrid: Initial residual           = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration   4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration   5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
, Solve time: 5.84898591041565, CG time: 0.162226438522339
   Converged res < eps_rel*max(bnorm,res_norm)
   Run time      : 5.98936820030212
----------------------------------------
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
[TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
[TOP]::hypre.skip_relax(nvals = 1)  :: [1]
[TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

When I comment it out, recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit) 
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
----------------------------------------
Solving with BoxLib C++ solver 
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs                = 135.516568492921
MultiGrid: Initial residual           = 135.516568492921
MultiGrid: Iteration   1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration   2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration   3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration   4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration   5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration   6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration   7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration   8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration   9 resid/bnorm = 8.55424251440489e-11
, Solve time: 0.759385108947754, CG time: 0.14183521270752
   Converged res < eps_rel*max(bnorm,res_norm)
   Run time      : 0.879786014556885
----------------------------------------
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
[TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
[TOP]::hypre.skip_relax(nvals = 1)  :: [1]
[TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

it is like 7.3x slowdown. The smoothing kernel (gauss-seidel red-black) is the most expensive kernel in the Multi-Grid code, so I see the biggest effect here. But the other kernels (prolongation, restriction, dot products etc) have slowdowns as well amounting to a total of more than 10x for the whole app.
Comment 17 Thorsten Kurth 2017-05-24 16:45:58 UTC
the result though is correct, I verified that both codes generate the correct output.
Comment 18 Jakub Jelinek 2017-05-24 16:54:42 UTC
Ok, I'll grab your git code and will have a look tomorrow what's going on.
Comment 19 Thorsten Kurth 2017-05-24 16:57:55 UTC
Thanks you very much. I am sorry that I do not have a simpler test case. The kernel which is executed is in the same directory as ABecLaplacian and called MG_3D_cpp.cpp.

We have seen similar problems with the fortran kernels (they are scattered across multiple files) but the fortran kernels and our C++ ports give the same performance with the original OpenMP parallelization. In any case, I wonder why the compiler honors the target region even if -march=knl is specified. However, please let me know if you have further questions. I can guide you through that code. The code is big but the relevant files are technically 2 or 3 and the relevant lines of code also not very many.
Comment 20 Thorsten Kurth 2017-05-25 15:30:21 UTC
To compile the code, edit the GNUmakefile to suit your needs (feel free to ask any questions) and in order to run it run the generated executable, called something like

main3d.XXX.XXXX..

and the XXX tell you if you compiled with MPI, OpenMP, etc. There is an inputs file you just pass to it:

./main3d...... inputs

That's it. Tell me if you need more info.
Comment 21 Jakub Jelinek 2017-05-25 17:05:42 UTC
It doesn't compile for me.
cmake -DENABLE_MPI=0 -DENABLE_OpenMP=1 ..
make -j16

I don't have ittnotify.h, I've tried to comment that out as well as the _itt*
calls, but then run into:
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:83:36: error: ‘NUM_TEAMS’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:93:82: error: ‘DEVID’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:98:74: error: ‘NUM_THREADS_PER_BOX’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:532:36: error: ‘NUM_TEAMS’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:542:82: error: ‘DEVID’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:547:78: error: ‘NUM_THREADS_PER_BOX’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:820:47: error: ‘NUM_TEAMS’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/ABecLaplacian.cpp:820:120: error: ‘DEVID’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:786:36: error: ‘NUM_TEAMS’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:796:82: error: ‘DEVID’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:801:74: error: ‘NUM_THREADS_PER_BOX’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:857:36: error: ‘NUM_TEAMS’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:867:82: error: ‘DEVID’ was not declared in this scope
/usr/src/BoxLib/Src/LinearSolvers/C_CellMG/MultiGrid.cpp:872:74: error: ‘NUM_THREADS_PER_BOX’ was not declared in this scope

Is that something that ittnotify.h defines?  At least in the ittnotify.h I've googled for it doesn't, so it is unclear from where these macros would be satisfied.
Comment 22 Thorsten Kurth 2017-05-25 17:11:19 UTC
Hello Jakub,

that is stuff for Intel vTune. I have commented it out and added the NUM_TEAMS defines in the GNUmakefile. Please pull the latest changes.

Best and thanks
Thorsten
Comment 23 Jakub Jelinek 2017-05-26 07:17:57 UTC
So, the time difference with OMP_NUM_THREADS=1, which means this isn't about code generation (or at least not something that is significant).
What matters is that there is a parallel inside of the target region and the target region is invoked many times, and while libgomp uses a thread pool to cache a set of threads from one (non-nested) host parallel region to another one (this is effectively required by the OpenMP spec, as threadprivate vars should survive between parallel regions if the number of threads doesn't change), target fallback creates a different contention group, so the threads created there are unrelated to the other host threads, don't have guarantees on locking with the real host threads etc.
The performance problem on your testcase as well as e.g.
int
main (int argc, const char **argv)
{
  int i, j;
  int a[16] = {};
  if (argc > 1)
    for (i = 0; i < 128; i++)
      {
      #pragma omp target teams distribute parallel for simd
	for (j = 0; j < 16; j++)
	  a[j]++;
      }
  else
    for (i = 0; i < 128; i++)
      {
      #pragma omp parallel for simd
	for (j = 0; j < 16; j++)
	  a[j]++;
      }
  return 0;
}

is the lack of caching of threads (i.e. preserving a thread pool) across different target host fallback regions, at the end of each target host fallback the thread pool created for it when encountering parallel construct in it is destroyed, which means the POSIX threads are let terminate, and the next time parallel construct in another target host fallback region is encountered, pthread_create is called again.
I'll try to implement caching of one? thread pool between target host fallback regions (there can be more of them concurrently, so just probably atomically exchange with a thread pool sitting in some static variable).  Might take a few days to implement properly though.

Back to your code,
        //determine number of threads and teams
#pragma omp parallel
        {
                nthreads = omp_get_num_threads();
        }
#pragma omp target teams num_teams(NUM_TEAMS)
        {
                nteams = omp_get_num_teams();
        }
#endif
is wrong not just for the reasons I said earlier (missing map(from:nteams) clause on target teams and (pedantically) data races, but also that it grabs completely unrelated number of threads (how many host threads would be created) and uses that to request that many threads in the target region.  Plus completely unnecessarily spawns all the host threads that will not be used then at all (but the runtime can't know that and can't reuse them for anything else, because if you do another #pragma omp parallel on the host, it would need to be the same threads).  Consider that the host is say a 1s/8c/16t CPU and accelerator is PTX, with 16 teams, 32 threads, and warp size 32 (i.e. 32 "SIMD" lanes).  The above if fixed will tell you to request 16 teams, but limit threads to 16 because that is what the host has, while you could use 32 threads per team.  And you are not using parallel for simd but just parallel for, so it wouldn't efficiently use all the accelerator HW, but just 1/32th of it (because of the thread limitation actually 1/64th).
So, if you really want to query the number of teams and threads upfront (still don't understand why, if you don't split the distribute vs. parallel for the way you do and just do target teams distribute parallel for simd as one construct then it is up to the implementation to split the work in between "SIMD" lanes (on PTX SIMT threads in a warp), threads and teams and you don't have to prescribe anything), you should use
#pragma omp target teams map(from:nteams, nthreads)
  {
    #pragma omp parallel
    #pragma omp master
    if (omp_get_team_num () == 0)
      {
        nteams = omp_get_num_teams ();
        nthreads = omp_get_thread_num ();
      }
  }
or so, that way you query how many threads are there by default inside of the teams construct.
Comment 24 Thorsten Kurth 2017-05-26 15:14:45 UTC
Hello Jakub,

I know that the section you mean is racey and gets the wrong number of threads is not right but I put this in in order to see if I get the correct numbers on a CPU (I am not working on a GPU yet, that will be next). Most of the defines for setting number of teams and threads in outer loop are for playing around with that and see what works best, in the end this will be removed. This code is not finished by any means, it is a moving target and under active development. Only the OpenMP3 version is considered done and works well.

You said that SIMD pragmas is missing, and this is for a reason. First of all, the code is memory bandwidth bound, so it has a rather low AI so that vectorization does not help a lot. Of course vectorization helps in the sense that the loads and stores are vectorized and the prefetcher works more efficiently. But we made sure that the (Intel) compiler vectorizes the inner loops automatically nicely. Putting in explicit SIMD pragmas made the code performance worse, because in that case the (Intel) compiler generates worse code in some cases (according to some Intel compiler guys, this is because if the compiler sees a SIMD statement it will not try to partially unroll loops etc. and might generate more masks than necessary). So auto vectorization works fine here so we have not revisited this issue. The GNU compiler might be different and I did not look at what the auto-vectorizer did.

The more important questions I have are the following:
1) as you see the codes has two levels of parallelism. On the CPU, it is most efficient to tile the boxes (this is the loop with the target distribute) and then let one thread work on a box. I added another level of parallelism inside that box, because on the GPU you have more thread and might want to exploit more parallelism. Talking to folks from IBM at an OpemMP 4.5 hackathon at least this is what they suggested. 
So my question is: when you have a target teams distribute, will be one team equal to a CUDA WARP or will it be something bigger? In that case, I would like to have one WARP working on a box and not letting different ptx threads working on individual boxes. To summarize: on the CPU the OpenMP threading should be such that one threads gets a box and the vectorization works on the inner loop (which is fine, that works), and in the CUDA case one team/WARP should work on a box and then SIMT parallelize the work on the box.

2) related to this: how does ptx behave when it sees a SIMD statement in a target region? Is that ignored or somehow interpreted? In any case, how does OpenMP do the mapping between CUDA WARP <-> OpenMP CPU thread, because this is the closest equivalence I would say. I would guess it ignores SIMD pragmas and just acts on thread level, where in the CUDA world one thread more or less acts like a SIMD lane on the CPU.

3) this device mapping business is extremely verbose for C++ classes. For example the MFIter class amfi, comfy, solnLmfi whatever are not correctly mapped yet and would cause trouble on the GPU (the intel compiler complains that the stuff is not bitwise copyable, GNU complies it though). These are classes containing other class pointers. So in order to map that properly I would technically need to map the dereferenced data member of the member class of the first class, correct? As an example, you have a class with std::vector<XXX> * vector data member. You technically need to map the vector.data() member to the device, right? That however tells you that you need to be able to access that guy, i.e. it should not be a protected class member. So what happens when you have a class which you cannot change but need to map private/protected members of it? The example at hand is the MFIter class which has this:

protected:

    const FabArrayBase& fabArray;

    IntVect tile_size;

    unsigned char flags;
    int           currentIndex;
    int           beginIndex;
    int           endIndex;
    IndexType     typ;

    const Array<int>* index_map;
    const Array<int>* local_index_map;
    const Array<Box>* tile_array;

    void Initialize ();

It has these array pointers. So technically this is (to my knowledge, I do not know the code fully) an array of indices which determines which global indices the iterator is in fact iterating over. This stuff can be shared among the threads and it is only read and never written. Nevertheless, it needs to know the indices on the device so the index_map etc. needs to be mapped. Now, Array is just a class with a public member of std::vector. But in order to map the index_map class member I would need to have access to it, so that I can map the underlying std::vector data member. Do you know what I mean? How is this done in the most elegant way in OpenMP?
Comment 25 Jakub Jelinek 2017-05-26 15:30:58 UTC
In the GCC implementation of offloading to PTX, all HW threads in a warp (i.e. 32 of them) are a single OpenMP thread, and one needs to use a simd region (effectively SIMT) to get useful work done by all all the threads of a warp rather than just one.
Right now GCC doesn't do auto-SIMTization (but does auto-vectorization on the host or XeonPhi accelerator etc., but only with -O3 or -O2 -ftree-vectorize; while with simd constructs you get it even with just -O2 -fopenmp for those regions), so simd construct is important to get the right performance.  Threads within a team are the warp groups of threads within a PTX CTA, and different teams are the CTAs in a CTA grid (in PTX terms).
Comment 26 Thorsten Kurth 2017-05-26 15:38:30 UTC
Hello Jakub,

thanks for the clarification. So a team maps to a CTA which is somewhat equivalent to a block in CUDA language, correct? And it is good to have some categorical equivalency between GPU and CPU code (SIMD units <> WARPS) instead of mapping SIMT threads to OpenMP threads, that makes it easier for making it portable. 

About my mapping "problem" is there an elegant way for doing this or does only brute force work, i.e. by writing additional member functions returning pointers etc.?
In general, the OpenMP mapping business is very verbose (not your fault, I know), it makes the code very annoying to read.

Best
Thorsten
Comment 27 Thorsten Kurth 2017-08-08 15:41:43 UTC
Hello Jakub,

I wanted to follow up on this. Is there any progress on this issue?

Best Regards
Thorsten Kurth
Comment 28 Thorsten Kurth 2017-10-20 16:00:09 UTC
Hello,

can someone please give me an update on this bug?

Best Regards
Thorsten Kurth