This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PATCH][libgomp/gomp-3_0-branch] Support for OpenMP in user threads, and pooling for nested threads


Jakub Jelinek wrote:
On Fri, Apr 25, 2008 at 01:25:27PM +0200, Johannes Singler wrote:
Why has the barrier been changed into per-thread mutex_lock/unlock? I don't
see how that can scale well. Consider say 128 threads: with gomp_barrier_wait
that means all threads sleep on the same futex and are awoken at once,
while with the mutex_lock/mutex_unlock the thread that wants to
wake them up needs to call futex_wake 128 times. If you are worried that
the master thread is wasting cycles unnecessarily or sleeping (depending on
busy waiting) when it doesn't reach the do_release: gomp_barrier_wait as the
last one, then most likely it could be changed to just
bstate = gomp_barrier_wait_start (dock);
if (gomp_barrier_last_thread (dock, bstate))
  gomp_barrier_wait_end (dock, bstate);
unless nthreads < old_threads_used (if we are decreasing the number of
threads, we should delay the changes until all threads are started; at
least I think so ATM).
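
(To make the scaling point concrete, a minimal sketch using the raw futex
syscall rather than libgomp's wrappers; dock_word and the helper names are
made up:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <limits.h>

int dock_word;  /* shared word all docked threads sleep on */

/* Barrier-style release: a single syscall wakes every thread blocked
   on the shared word.  */
void
wake_all_shared (void)
{
  syscall (SYS_futex, &dock_word, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

/* Per-thread mutex style release: one syscall per thread, i.e. 128
   kernel entries for 128 threads.  */
void
wake_all_per_thread (int *words, int nthreads)
{
  int i;
  for (i = 0; i < nthreads; i++)
    syscall (SYS_futex, &words[i], FUTEX_WAKE, 1, NULL, NULL, 0);
}
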
The reason for having multiple gomp_mutexes was the possibility of releasing specific threads independently. However, we can emulate that with the barrier by letting superfluous threads return to the pool.
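
Roughly along these lines, as a minimal sketch with a plain pthread barrier
standing in for gomp_barrier_t (the names are made up, this is not libgomp
code):

#include <pthread.h>

struct pool
{
  pthread_barrier_t dock;   /* initialized for all pool threads + master */
  int threads_needed;       /* set by the master before each release     */
};

void
pool_thread_loop (struct pool *pool, int my_id, void (*team_work) (int))
{
  for (;;)
    {
      /* Everybody sleeps on the same barrier until the master releases
         the pool for the next team.  */
      pthread_barrier_wait (&pool->dock);

      /* Superfluous threads just go back to the dock...  */
      if (my_id >= pool->threads_needed)
        continue;

      /* ...while the others run the team's work; re-arriving at the
         dock afterwards doubles as the join for this region.  */
      team_work (my_id);
    }
}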

This raises a more general question. So far, the threads in the pool that are not needed for the team quit, i.e. the pool shrinks to the number of threads used in the last parallel region (nthreads).

I think this very much depends on what we want to optimize for. With the addition of tasking I believe most of the reasons to use nested parallel regions are gone; it is better to use tasks. E.g. for the parallel sorting in libstdc++, tasks are IMHO much better than nested parallels.

That's true; for the parallel sorters, we can move to tasks. But in that case, we would like to nest parallel regions inside tasks. (Or is that forbidden?)

So, IMHO there are rather more than fewer reasons for efficient nesting.
What about having a parallel region nested in a user parallel task (e.g.
by using the libstdc++ parallel mode)? This is a very natural combination
of task- and data-parallelism, isn't it?
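
For example, something along these lines (process_item and the chunk sizes
are just placeholders):

static void process_item (int item) { (void) item; }  /* placeholder work */

void
run (int nchunks)
{
  #pragma omp parallel
  {
    #pragma omp single
    for (int c = 0; c < nchunks; c++)
      {
        #pragma omp task firstprivate (c)
        {
          /* Data parallelism nested inside task parallelism.  */
          #pragma omp parallel for num_threads (4)
          for (int i = 0; i < 1000; i++)
            process_item (c * 1000 + i);
        }
      }
  }
}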

Another example:
The user could decide to split his program up into two coarse-grained
tasks using parallel tasks or sections, and then let each of them use
four cores of the two quad-core CPUs in his system dynamically, i.e.
parallelism is forked and joined frequently. Just by adding the wrapper
of tasks around it, he would lose all the efficient pooling at once.
Telling him to use pthreads would firstly be inconsistent and secondly
not portable.
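
Concretely, something like the following sketch (the do_step functions are
placeholder work):

#include <omp.h>

static void do_step_a (int step) { (void) step; }  /* placeholder work */
static void do_step_b (int step) { (void) step; }  /* placeholder work */

static void
coarse_part_a (void)
{
  /* Inner parallelism is forked and joined frequently.  */
  for (int step = 0; step < 100000; step++)
    {
      #pragma omp parallel num_threads (4)
      do_step_a (step);
    }
}

static void
coarse_part_b (void)
{
  for (int step = 0; step < 100000; step++)
    {
      #pragma omp parallel num_threads (4)
      do_step_b (step);
    }
}

int
main (void)
{
  omp_set_nested (1);  /* let the inner regions get their own threads */

  #pragma omp parallel sections num_threads (2)
  {
    #pragma omp section
    coarse_part_a ();
    #pragma omp section
    coarse_part_b ();
  }
  return 0;
}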

By definition, only the first level of parallelism is non-nested, so
"almost everything" else *is* nested. We should have a good solution for
"almost everything".

I'd say most OpenMP programs will just use non-nested parallels,

I'm pretty sure that will change with even more cores and the task construct being used more widely.

always with the same number of threads (coincidentally that's where the
standard requires the threads to be kept), and that's what we should
optimize for. The rest is just something that can be optimized if it
doesn't slow down the most likely case.

For nested parallels, the question is if they are common enough to bother
optimizing them, and if they are, whether it is common that consecutive
nested parallels will always have the same number of threads.  I'm not
sure about the answer of how commonly nested parallels will be used, but
I'd say if they are used then they will likely be using the same number
of threads.  So, if we optimize nested parallels at all, it might make
sense to have just a pool with the right number of threads waiting in a
dock, ready to be used quickly.  In the unlikely case that more or fewer
threads are needed, either we can let the extra threads die, or, if there
are no more than the number of CPUs, perhaps we could instead redock them
on a different dock for threads that aren't part of any usable pool.

What do you mean? One global pool, or one per user thread?
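
(As a rough data-structure sketch of how I read the proposal; the names are
hypothetical, not existing libgomp structures:)

#include <pthread.h>

/* One pool of ready threads docked on a barrier, sized for the usual
   team, plus a second dock for extra threads that currently belong to
   no usable pool (bounded by the number of CPUs).  Whether there is
   one such structure globally or one per user thread is exactly the
   open question.  */
struct thread_dock
{
  pthread_barrier_t barrier;  /* docked threads sleep here */
  unsigned nthreads;          /* how many threads are docked */
};

struct thread_pool
{
  struct thread_dock ready;   /* released quickly for the next team */
  struct thread_dock spare;   /* redocked extra threads */
};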


Also, slowing down parallel so much worries me a lot, because I'm afraid
that for tasking we need another barrier where the tasks would actually be
handled, and so some slowdown is ahead of us anyway.
Would the additional barrier be required only if there actually are task constructs, or would this be a general overhead?

I'm afraid all or nearly all (of course I'm open to ideas). The problem is
that there is no clause on parallel that would say whether tasks may be
created or not. Perhaps we can avoid it if the compiler can analyze the
parallel region and see no task constructs nor calls to any functions,
except well-known builtins which are known not to create any tasks, or,
with interprocedural optimizations, even by analyzing all the functions
that are called from the parallel region, transitively, as long as gcc
sees them all and they can't be overridden.

Yes, that's probably unclear at compile-time.


But if you have
#pragma omp parallel num_threads (16)
  {
    #pragma omp master
    foo ();
  }

then foo could contain
  for (int i = 0; i < 64; i++)
    #pragma omp task
      do_some_work (i);

but the compiler doesn't know at compile time that some tasks are created,
nor can the runtime find out early enough.  All but the master thread
could already be waiting in the final barrier when the tasks start to be
created, and those tasks should be parallelized.  And the tasks in the
threads must obviously be finished before the team is destroyed, so we
need a barrier afterwards.  I'm ATM unsure whether such a barrier must
come from an explicit call in the parallel body (that would be the worse
variant), or just happen after the callback fn returns, see
http://www.openmp.org/forum/viewtopic.php?f=5&t=106#p399

We'll think about that...


-- Johannes

