This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


[gomp3] libgomp performance improvements


Hi!

I've committed the attached patch to improve libgomp performance.
It adds support for optional busy waiting rather than going to sleep
immediately in case of contention (though this is incomplete so far:
only GOMP_BLOCKTIME=0 and GOMP_BLOCKTIME=infinity make a difference,
because I still need to write code to estimate the speed of the
static inline void do_wait (int *addr, int val)
{
  unsigned long long i, count = gomp_spin_count_var;

  /* Spin for up to gomp_spin_count_var iterations waiting for *addr
     to change.  */
  for (i = 0; i < count; i++)
    if (__builtin_expect (*addr != val, 0))
      return;
    else
      cpu_relax ();

  /* Spinning didn't help; go to sleep in the kernel.  */
  futex_wait (addr, val);
}

loop to be able to translate milliseconds into gomp_spin_count_var,
probably using a short benchmark at libgomp startup, though I don't want
to make it too expensive; on some targets which don't have rep;nop
or hint @pause, perhaps just reading /proc/cpuinfo to find the MHz and
using a hardcoded number of ticks would work too.  Any suggestions are
appreciated).
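
For illustration only, here is a rough sketch of the kind of calibration
I mean; the clock_gettime probe, the probe size and the cpu_relax fallback
below are just placeholders, not necessarily what will end up in the tree:

#include <time.h>

#ifndef cpu_relax
/* Portable fallback; on x86 this would be rep; nop.  */
# define cpu_relax() __asm__ __volatile__ ("" : : : "memory")
#endif

/* Time PROBE dummy spin iterations at startup and scale GOMP_BLOCKTIME
   (in milliseconds) into a spin count.  */
static unsigned long long
calibrate_spin_count (unsigned long long blocktime_ms)
{
  const unsigned long long probe = 1000000;
  struct timespec t0, t1;
  volatile int dummy = 0;
  unsigned long long i, ns;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < probe; i++)
    if (__builtin_expect (dummy != 0, 0))  /* mirrors the *addr != val test */
      break;
    else
      cpu_relax ();
  clock_gettime (CLOCK_MONOTONIC, &t1);

  ns = (t1.tv_sec - t0.tv_sec) * 1000000000ULL + (t1.tv_nsec - t0.tv_nsec);
  if (ns == 0)
    ns = 1;

  /* Iterations per millisecond times the requested block time.  */
  return probe * 1000000ULL / ns * blocktime_ms;
}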

The patch also rewrites the libgomp barriers so that each thread needs
just one atomic operation, rather than at least 3 or even more as before
(which was really bad in case of heavy contention).
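
The committed barrier code is in the attached patch; the following is only
a simplified sketch of the idea (a counter decremented with a single
fetch-and-add per arriving thread, plus a generation word the others wait
on), not the actual implementation:

typedef struct
{
  unsigned total;               /* threads in the team */
  volatile unsigned awaited;    /* starts at total, counts down as threads arrive */
  volatile unsigned generation; /* bumped by the last arriver to release all */
} sketch_barrier_t;

static void
sketch_barrier_wait (sketch_barrier_t *bar)
{
  unsigned gen = bar->generation;

  /* The only atomic operation on the arrival path.  */
  if (__sync_add_and_fetch (&bar->awaited, -1) == 0)
    {
      /* Last thread in: rearm the counter, then open the barrier.  */
      bar->awaited = bar->total;
      __sync_synchronize ();
      bar->generation = gen + 1;
      /* The real code would futex_wake any sleepers here.  */
    }
  else
    {
      /* The others just spin (or futex_wait after the block time
         expires) until the generation changes; no further atomics.  */
      while (bar->generation == gen)
        cpu_relax ();
    }
}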

On the attached microbenchmark this gives nice speedups on a quad-core CPU
(similar speedups can be seen on real-world benchmarks):

GOMP_BLOCKTIME=infinity ./micro-gcc-after-patch		# busy waiting, new barriers
barrier bench 0.990822 seconds
parallel bench 2.21924 seconds
static bench 0.114701 seconds
dynamic bench 0.539615 seconds
./micro-gcc-after-patch					# sleeping, new barriers
barrier bench 15.7526 seconds
parallel bench 7.05841 seconds
static bench 0.357082 seconds
dynamic bench 0.53934 seconds
./micro-gcc-before-patch				# sleeping, old barriers
barrier bench 47.8483 seconds
parallel bench 8.22674 seconds
static bench 0.412263 seconds
dynamic bench 0.536502 seconds
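
For those without the attachment: the code below is not micro.c itself,
just an illustration of the kind of loops behind the barrier and parallel
numbers above (a tight barrier loop inside one region, and many small
parallel regions):

#include <stdio.h>
#include <omp.h>

#define N 100000

int
main (void)
{
  int i;
  double t;

  /* Barrier-bench style loop: many barriers inside a single parallel
     region, so mostly barrier overhead is measured.  */
  t = omp_get_wtime ();
#pragma omp parallel private (i)
  for (i = 0; i < N; i++)
    {
#pragma omp barrier
    }
  printf ("barrier bench %g seconds\n", omp_get_wtime () - t);

  /* Parallel-bench style loop: many small parallel regions, measuring
     how quickly a docked team can be woken up and parked again.  */
  t = omp_get_wtime ();
  for (i = 0; i < N; i++)
    {
#pragma omp parallel
      {
        /* empty body on purpose */
      }
    }
  printf ("parallel bench %g seconds\n", omp_get_wtime () - t);

  return 0;
}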

While the barrier bench numbers look very good, I'm still not satisfied
with the time it takes to execute #pragma omp parallel when no
pthread_create is needed because the threads are already docked; I hope we
can improve that some more.  This is the most important factor in the
parallel bench and the static bench.  The dynamic bench is mostly about the
speed of gomp_loop_dynamic_next and gomp_iter_dynamic_next; I wonder
whether the amount of work done between reading ws->next and updating it
with __sync_val_compare_and_swap isn't too big, so that too many threads
have to restart the loop under high contention.  Guided is even more
expensive.
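
To make the contention concern concrete, the dynamic-next path boils down
to a compare-and-swap loop roughly of this shape (a simplified sketch
assuming a positive increment, not the exact libgomp code, which is in the
attached patch); everything computed between loading ws->next and the
__sync_val_compare_and_swap is wasted whenever another thread wins the
race:

struct sketch_ws
{
  long next, end, incr, chunk_size;
};

static int
sketch_dynamic_next (struct sketch_ws *ws, long *pstart, long *pend)
{
  long start = ws->next, nend;

  while (1)
    {
      long left = ws->end - start;
      long chunk = ws->chunk_size * ws->incr;

      if (left == 0)
        return 0;               /* no iterations left */
      if (chunk > left)
        chunk = left;
      nend = start + chunk;

      /* If another thread advanced ws->next in the meantime, redo all
         of the arithmetic above with the value it left behind.  */
      long tmp = __sync_val_compare_and_swap (&ws->next, start, nend);
      if (__builtin_expect (tmp == start, 1))
        break;
      start = tmp;
    }

  *pstart = start;
  *pend = nend;
  return 1;
}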

	Jakub

Attachment: P7
Description: Text document

Attachment: micro.c
Description: Text document

