This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][libgomp/gomp-3_0-branch] Support for OpenMP in user threads, and pooling for nested threads


On Fri, Apr 25, 2008 at 01:25:27PM +0200, Johannes Singler wrote:
> >Why has the barrier been changed into per-thread mutex_lock/unlock?  I
> >don't see how that can scale well; consider say 128 threads: with
> >gomp_barrier_wait all threads sleep on the same futex and are woken at
> >once, while with mutex_lock/mutex_unlock the thread that wants to wake
> >them up needs to call futex_wake 128 times.  If you are worried that the
> >master thread is wasting cycles unnecessarily or sleeping (depending on
> >busy waiting) when it doesn't reach the do_release: gomp_barrier_wait as
> >the last one, then most likely it could be changed to just
> >bstate = gomp_barrier_wait_start (dock);
> >if (gomp_barrier_last_thread (dock, bstate))
> >  gomp_barrier_wait_end (dock, bstate);
> >unless nthreads < old_threads_used (if we are decreasing the number of
> >threads, we should delay the changes until all threads are started, at
> >least I think so ATM).
> 
> The reason for having multiple gomp_mutexes was the possibility of
> releasing specific threads independently.  However, we can emulate that
> with the barrier by letting superfluous threads return to the pool.
> 
> This raises a more general question.  So far, the threads in the pool
> that are not needed for the team quit, i.e. the pool shrinks to the
> number of threads used in the last parallel region (nthreads).

I think this very much depends on what we want to optimize for.
With the addition of tasking I believe most of the reasons to use nested
parallel regions are gone; it is better to use tasks.  E.g. for the
parallel sorting in libstdc++, tasks are IMHO much better than nested
parallels.

I'd say most OpenMP programs will just use non-nested parallels, always
with the same number of threads (coincidentally, that's where the standard
requires the threads to be kept), and that's what we should optimize for.
The rest is just something that can be optimized if it doesn't slow down
the most likely case.

For nested parallels, the question is whether they are common enough to be
worth optimizing, and if they are, whether consecutive nested parallels
will usually have the same number of threads.  I'm not sure how commonly
nested parallels will be used, but I'd say that if they are used, they will
likely use the same number of threads.  So, if we optimize nested parallels
at all, it might make sense to just have a pool with the right number of
threads waiting in a dock, ready to be used quickly.  In the unlikely case
that more or fewer threads are needed, either we can let the extra threads
die, or, if there are not more than the number of CPUs, perhaps we could
instead redock them on a different dock for threads that aren't part of
any usable pool.
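To sketch what I mean by a dock (all names below are invented, and it uses
a condition variable rather than the futex code libgomp actually has):
idle pooled threads park on the dock, and the master releases the whole
set at once by bumping a generation counter.

#include <pthread.h>

/* Hypothetical dock: idle pooled threads sleep here until released
   as a group.  */
struct dock
{
  pthread_mutex_t lock;
  pthread_cond_t cond;
  unsigned generation;          /* bumped once per release */
};

static struct dock pool_dock
  = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 };

/* Called by a pooled thread that has no work at the moment.  */
static void
dock_wait (struct dock *d)
{
  unsigned gen;

  pthread_mutex_lock (&d->lock);
  gen = d->generation;
  while (gen == d->generation)
    pthread_cond_wait (&d->cond, &d->lock);
  pthread_mutex_unlock (&d->lock);
}

/* Called by the master to release every docked thread at once.  */
static void
dock_release_all (struct dock *d)
{
  pthread_mutex_lock (&d->lock);
  d->generation++;
  pthread_cond_broadcast (&d->cond);
  pthread_mutex_unlock (&d->lock);
}

Threads that don't fit the current team would then be parked on a second
such dock instead of exiting.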

> >Also, slowing down parallel so much worries me a lot, because I'm afraid
> >that for tasking we need another barrier where the tasks would actually
> >be handled, and so some slowdown is ahead of us anyway.
> 
> Would the additional barrier be required only if there actually are task 
> constructs, or would this be a general overhead?

I'm afraid it would be needed in all or nearly all cases (of course I'm
open to ideas).  The problem is that there is no clause on parallel that
would say whether tasks may be created or not.  Perhaps we can avoid it if
the compiler can analyze the parallel region and see no task constructs
nor calls to any functions except well known builtins which are known not
to create any tasks, or, with interprocedural optimizations, even by
analyzing all the functions called from the parallel region, transitively,
as long as gcc sees them all and they can't be overridden.  But if you have
#pragma omp parallel num_threads (16)
  {
    #pragma omp master
    foo ();
  }

then foo could contain
  for (int i = 0; i < 64; i++)
    #pragma omp task
      do_some_work (i);

but the compiler doesn't know at compile time that some tasks are created,
nor can the runtime find out early enough.  All but the master thread could
already be waiting in the final barrier when the tasks start to be created,
and those tasks should still be parallelized.  And the tasks obviously must
be finished before the team is destroyed, so we need a barrier afterwards.
I'm ATM unsure whether such a barrier must come from an explicit call in
the parallel body (that would be the worse variant), or just after the
callback fn returns, see
http://www.openmp.org/forum/viewtopic.php?f=5&t=106#p399
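Roughly, the extra barrier would have to behave like the sketch below
(every identifier here is made up for illustration; it is not the libgomp
code, and the plain int reads gloss over the memory ordering a real
implementation needs): a thread arriving at the team barrier keeps
draining the task queue instead of sleeping, and only leaves once all
threads have arrived and no task is left queued or running.

#include <sched.h>

/* Hypothetical task queue helper; returns nonzero and fills in the
   callback if a queued task was popped.  */
extern int task_queue_pop (void (**fn) (void *), void **data);

struct task_barrier
{
  int nthreads;                 /* threads in the team */
  int arrived;                  /* threads that reached the barrier */
  int tasks_in_flight;          /* queued plus currently running tasks,
                                   incremented when a task is pushed */
};

static void
task_barrier_wait (struct task_barrier *b)
{
  void (*fn) (void *);
  void *data;

  __sync_fetch_and_add (&b->arrived, 1);

  /* Keep executing tasks while waiting; only leave once every thread
     has arrived and no task remains queued or in flight.  */
  while (b->arrived < b->nthreads || b->tasks_in_flight > 0)
    {
      if (task_queue_pop (&fn, &data))
        {
          fn (data);
          __sync_fetch_and_sub (&b->tasks_in_flight, 1);
        }
      else
        sched_yield ();         /* real code would futex_wait instead */
    }
}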

> >And, lastly, we should talk to the ARB or at least the various other
> >OpenMP vendors about what's preferable to do for the non-nested ICVs: are
> >they supposed to modify and query a global state, or is each
> >pthread_create created thread supposed to have its own ICVs?  This surely
> >goes beyond the standard, but it would be good if all vendors that are
> >willing to handle pthread_create vs. OpenMP mixing do the same.  Having
> >per-thread ICVs makes more sense; after all, we probably shouldn't
> >consider different threads outside of parallel regions to all be running
> >the same task (for locking, etc.).
> 
> Looks like the Intel compiler (10.1.015) handles it thread-locally (each
> pthread-created thread has its own ICVs).

Yeah, just verified that too, that's certainly the better alternative.
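A small test in that spirit (just to illustrate the behaviour we are
agreeing on, not code from the patch): two pthread-created threads set
different nthreads-var values, and with per-thread ICVs each reads back
only its own.

#include <pthread.h>
#include <stdio.h>
#include <omp.h>

/* With per-thread ICVs the two settings below must not interfere;
   each pthread-created thread keeps its own nthreads-var.  */
static void *
worker (void *arg)
{
  int n = (int) (long) arg;

  omp_set_num_threads (n);
  printf ("thread requesting %d sees omp_get_max_threads () == %d\n",
          n, omp_get_max_threads ());
  return NULL;
}

int
main (void)
{
  pthread_t t1, t2;

  pthread_create (&t1, NULL, worker, (void *) 2L);
  pthread_create (&t2, NULL, worker, (void *) 4L);
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  return 0;
}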

	Jakub

