This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug c++/80859] Performance Problems with OpenMP 4.5 support


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |ASSIGNED

--- Comment #23 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, the time difference is there even with OMP_NUM_THREADS=1, which means this
isn't about code generation (or at least nothing significant there).
What matters is that there is a parallel region inside the target region and
that the target region is invoked many times.  libgomp uses a thread pool to
cache a set of threads from one (non-nested) host parallel region to the next
(this is effectively required by the OpenMP spec, since threadprivate variables
must survive between parallel regions if the number of threads doesn't change).
Target fallback, however, creates a separate contention group, so the threads
created there are unrelated to the other host threads, have no locking
guarantees with the real host threads, etc.
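To see why the spec forces that host-side caching, here is a minimal example
(mine, not from the report) relying on threadprivate persistence:

#include <omp.h>
#include <stdio.h>

static int counter;
#pragma omp threadprivate(counter)

int
main (void)
{
  /* First parallel region: each thread stores its own id.  */
  #pragma omp parallel num_threads(4)
  counter = omp_get_thread_num ();

  /* Second region with the same number of threads: the spec guarantees
     each thread sees its own value again, which only works if the same
     OS threads are reused, i.e. the pool is kept alive in between.  */
  #pragma omp parallel num_threads(4)
  printf ("thread %d: counter = %d\n", omp_get_thread_num (), counter);

  return 0;
}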
The performance problem on your testcase as well as e.g.
int
main (int argc, const char **argv)
{
  int i, j;
  int a[16] = {};
  if (argc > 1)
    for (i = 0; i < 128; i++)
      {
      #pragma omp target teams distribute parallel for simd
        for (j = 0; j < 16; j++)
          a[j]++;
      }
  else
    for (i = 0; i < 128; i++)
      {
      #pragma omp parallel for simd
        for (j = 0; j < 16; j++)
          a[j]++;
      }
  return 0;
}

is the lack of caching of threads (i.e. preserving a thread pool) across
different target host fallback regions: at the end of each target host fallback
region, the thread pool created for it when a parallel construct was
encountered inside it is destroyed, which means the POSIX threads are allowed
to terminate, and the next time a parallel construct is encountered in another
target host fallback region, pthread_create has to be called all over again.
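One way to make the overhead visible (just a sketch using omp_get_wtime; the
sizes mirror the testcase above, and with real offloading configured the first
loop would of course measure device execution instead of host fallback):

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  int i, j;
  int a[16] = {};
  double t;

  t = omp_get_wtime ();
  for (i = 0; i < 128; i++)
    {
    #pragma omp target teams distribute parallel for simd
      for (j = 0; j < 16; j++)
        a[j]++;
    }
  printf ("target (fallback): %f s\n", omp_get_wtime () - t);

  t = omp_get_wtime ();
  for (i = 0; i < 128; i++)
    {
    #pragma omp parallel for simd
      for (j = 0; j < 16; j++)
        a[j]++;
    }
  printf ("host parallel:     %f s\n", omp_get_wtime () - t);
  return 0;
}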
I'll try to implement caching of one (?) thread pool between target host
fallback regions (there can be more of them running concurrently, so probably
just atomically exchange it with a thread pool sitting in some static
variable).  Might take a few days to implement properly though.
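To illustrate the idea (hypothetical names only, not libgomp's actual data
structures; using the GCC __atomic builtins):

/* One pool cached in a static variable; concurrent fallback regions
   that find it already taken just build their own.  */
struct thread_pool;                      /* opaque set of POSIX threads */
extern struct thread_pool *pool_create (void);
extern void pool_destroy (struct thread_pool *);

static struct thread_pool *cached_pool;

struct thread_pool *
fallback_pool_acquire (void)
{
  /* Atomically take the cached pool, or get NULL if another fallback
     region is already using it.  */
  struct thread_pool *p
    = __atomic_exchange_n (&cached_pool, NULL, __ATOMIC_ACQ_REL);
  return p ? p : pool_create ();
}

void
fallback_pool_release (struct thread_pool *p)
{
  struct thread_pool *expected = NULL;
  /* Stash the pool for the next fallback region; if another pool got
     cached first, throw this one away.  */
  if (!__atomic_compare_exchange_n (&cached_pool, &expected, p, false,
                                    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
    pool_destroy (p);
}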

Back to your code,
        //determine number of threads and teams
#pragma omp parallel
        {
                nthreads = omp_get_num_threads();
        }
#pragma omp target teams num_teams(NUM_TEAMS)
        {
                nteams = omp_get_num_teams();
        }
#endif
is wrong not just for the reasons I said earlier (the missing map(from:nteams)
clause on the target teams construct and, pedantically, data races), but also
because it grabs a completely unrelated number of threads (how many host
threads would be created) and uses that to request that many threads in the
target region.  Plus it completely unnecessarily spawns all the host threads,
which will then not be used at all (but the runtime can't know that and can't
reuse them for anything else, because if you do another #pragma omp parallel on
the host, it would need to be the same threads).  Consider that the host is,
say, a 1s/8c/16t CPU and the accelerator is PTX with 16 teams, 32 threads, and
warp size 32 (i.e. 32 "SIMD" lanes).  The above, even if fixed, will tell you
to request 16 teams but limit the threads to 16 because that is what the host
has, while you could use 32 threads per team.  And you are not using parallel
for simd but just parallel for, so it wouldn't efficiently use all the
accelerator HW, just 1/32nd of it (actually 1/64th, because of the thread
limitation).
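Spelling the arithmetic out under those assumptions: the device offers
16 teams * 32 threads * 32 lanes = 16384 "SIMD" lanes.  parallel for without
simd uses one lane per thread, so at best 16 * 32 = 512 lanes (1/32nd of the
hardware); capping the threads at the host's 16 leaves 16 * 16 = 256 lanes
(1/64th).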
So, if you really want to query the number of teams and threads upfront (I
still don't understand why; if you don't split the distribute from the parallel
for the way you do, and instead use target teams distribute parallel for simd
as one combined construct, then it is up to the implementation to split the
work between "SIMD" lanes (SIMT threads in a warp on PTX), threads, and teams,
and you don't have to prescribe anything), you should use
#pragma omp target teams map(from:nteams, nthreads)
  {
    #pragma omp parallel
    #pragma omp master
    if (omp_get_team_num () == 0)
      {
        nteams = omp_get_num_teams ();
        nthreads = omp_get_num_threads ();
      }
  }
or so; that way you query how many threads there are by default inside of the
teams construct.
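Spelled out as a complete program (a sketch; the printed values are just
whatever defaults your implementation picks):

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  int nteams = 0, nthreads = 0;

  #pragma omp target teams map(from:nteams, nthreads)
  {
    #pragma omp parallel
    #pragma omp master
    if (omp_get_team_num () == 0)
      {
        nteams = omp_get_num_teams ();
        nthreads = omp_get_num_threads ();
      }
  }

  printf ("defaults: %d teams, %d threads per team\n", nteams, nthreads);
  return 0;
}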
