This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



[PATCH, libgomp] tune guided schedule load balancing


Hi,

The OpenMP 3.0 specification (section 2.5.1, page 45) mentions two example
implementations of guided schedules. Assuming n remaining iterations and p
threads, each assignment can be either ceiling(n / p) or ceiling(n / (2 * p))
iterations. Currently libgomp's guided schedule uses the first choice (the
coarsest assignment). This very simple patch adds a tunable (environment
variable GOMP_GUIDED_DIVIDE) that controls the grain of guided iteration
assignments. Iterations are now assigned to threads in groups of
n / (d * p) iterations, where d is the value of GOMP_GUIDED_DIVIDE. The default
value is GOMP_GUIDED_DIVIDE=1, i.e. identical to the old behaviour. The higher
the value of the environment variable, the finer-grained the guided
iteration assignments are.
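
To make the arithmetic concrete, here is a minimal sketch of the chunk-size
rule. guided_chunk and print_chunks are hypothetical helpers written for this
illustration, not libgomp functions; only the integer expression is taken from
the patch, and the clamp to the remaining iteration count is an assumption
about how the caller uses the result. Note that for divide > 1 the integer
expression rounds down, so it is not exactly ceiling(n / (d * p)).

```c
#include <stdio.h>

/* Hypothetical helper, not part of libgomp: size of the next guided chunk
   when n iterations remain, using the same integer expression as the patch.  */
unsigned long
guided_chunk (unsigned long n, unsigned long nthreads,
              unsigned long divide, unsigned long chunk_size)
{
  unsigned long q = (n + nthreads - 1) / (divide * nthreads);

  if (q < chunk_size)	/* Never go below the requested minimum chunk.  */
    q = chunk_size;
  if (q > n)		/* Assumption: never hand out more than remains.  */
    q = n;
  return q;
}

/* Print the sequence of chunk sizes handed out for a loop of n iterations,
   with a minimum chunk size of 1.  */
void
print_chunks (unsigned long n, unsigned long nthreads, unsigned long divide)
{
  printf ("divide=%lu:", divide);
  while (n > 0)
    {
      unsigned long q = guided_chunk (n, nthreads, divide, 1);
      printf (" %lu", q);
      n -= q;
    }
  printf ("\n");
}
```

With 1000 remaining iterations and 8 threads, the first chunk is 125
iterations for divide=1, 62 for divide=2 and 31 for divide=4. Calling
print_chunks (1000, 8, 2) prints the whole sequence of chunk sizes.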

I tested a few SPEC OMP2001 benchmarks with static scheduling versus guided
scheduling with GOMP_GUIDED_DIVIDE=1, 2, 4 (guided1, guided2 and guided4
below). Guided2 / guided4 improve load balancing for one application. All
percentages below are performance relative to static scheduling (numbers
> 100% are a speedup). Tested on an 8-core system.
           
                guided1   guided2   guided4

310.wupwise_m      91%       98%       92%
312.swim_m         99%      102%       93%
316.applu_m       101%       91%       84%
320.equake_m      103%      101%      103%
324.apsi_m        100%       99%       99%
328.fma3d_m       100%      100%       99%
330.art_m          99%      100%      100%
332.ammp_m        106%      123%      122%

(guided1 performance is identical to trunk default guided)

For 332.ammp, guided2 or guided4 improves execution time by 23% against static
and by 17% against the default guided implementation. Note that the ammp source
code uses schedule(guided) explicitly for the main loop; the static schedule
was tested by removing the schedule(guided) clause.
For 310.wupwise, guided2 improves significantly upon the default guided
schedule and is almost as good as static.
For the rest of the applications, changing the initial guided assignment to
finer-grained assignments makes almost no difference (324, 328, 330, 312) or
has a negative effect (316, 320).
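
For readers unfamiliar with the clause, this is roughly what an explicit
schedule(guided) loop looks like. The function uneven_sum is a made-up
illustration, not code from ammp; it only shows where the clause goes on a
loop whose iterations do different amounts of work.

```c
/* Hypothetical example: iteration i performs i units of work, so later
   iterations are more expensive than earlier ones.  */
long
uneven_sum (int n)
{
  long total = 0;
  int i;

  /* Deleting "schedule(guided)" here is the kind of change described above
     for obtaining the static-schedule runs.  */
#pragma omp parallel for schedule(guided) reduction(+:total)
  for (i = 0; i < n; i++)
    {
      int j;

      for (j = 0; j < i; j++)
	total += 1;
    }
  return total;
}
```

Built with -fopenmp the loop runs in parallel and, with the patch applied,
GOMP_GUIDED_DIVIDE would change the chunk sizes. Without -fopenmp the pragma
is ignored and the loop runs serially with the same result.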

In 332.ammp, most of the time is spent in one big OpenMP loop (rectmm.c,
lines 596-1307), and the time per iteration of this loop varies greatly.
Static scheduling or the trunk version of guided scheduling results in bad
load balancing across cores/threads: some cores remain idle while others keep
working. For an example run, core utilizations in the default guided run range
from 77-90%, and in the new guided run from 94-99%.
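
The effect can be reproduced with a toy simulation. The sketch below is not a
model of ammp: the cost function (iteration i costs n - i time units, so the
expensive iterations come first) and the names iteration_cost and
guided_makespan are assumptions made for illustration. Chunks are sized with
the expression from the patch and each chunk goes to the thread that becomes
idle first.

```c
#define MAX_THREADS 64

/* Hypothetical cost model: the most expensive iterations come first.  */
unsigned long
iteration_cost (unsigned long i, unsigned long n)
{
  return n - i;
}

/* Simulate a guided schedule with a minimum chunk size of 1 and return the
   finishing time of the slowest thread.  Returns 0 on invalid arguments.  */
unsigned long
guided_makespan (unsigned long n, unsigned long nthreads,
		 unsigned long divide)
{
  unsigned long busy_until[MAX_THREADS] = { 0 };
  unsigned long start = 0, makespan = 0, t, i;

  if (nthreads == 0 || nthreads > MAX_THREADS || divide == 0)
    return 0;

  while (start < n)
    {
      unsigned long remaining = n - start;
      unsigned long q = (remaining + nthreads - 1) / (divide * nthreads);
      unsigned long first_free = 0;

      if (q < 1)
	q = 1;
      if (q > remaining)
	q = remaining;

      /* The thread that becomes idle first grabs the next chunk.  */
      for (t = 1; t < nthreads; t++)
	if (busy_until[t] < busy_until[first_free])
	  first_free = t;

      for (i = start; i < start + q; i++)
	busy_until[first_free] += iteration_cost (i, n);
      start += q;
    }

  for (t = 0; t < nthreads; t++)
    if (busy_until[t] > makespan)
      makespan = busy_until[t];
  return makespan;
}
```

For 64 iterations on 4 threads the total work is 2080 units, so a perfectly
balanced run would take 520. With divide=1 the first chunk alone costs 904
units and the other threads sit idle waiting for it; with divide=2 the
simulated finishing time comes out close to the ideal.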

It would be interesting to test applications with load imbalance issues or to
investigate more sophisticated heuristics for the schedule(guided or auto)
clause. I am not aware of other apps that would benefit from this and I am not
sure it's worth pushing this to trunk. Comments or applications for further
testing are welcome.

These experiments used a gcc trunk revision from 2010-01-18 (sorry, I don't
recall the exact revision number) and the flags "-fopenmp -O3 -funroll-loops".
Bootstrapped on x86_64.

Thanks,

- Vasilis

Index: libgomp/iter.c
===================================================================
--- libgomp/iter.c	(revision 155863)
+++ libgomp/iter.c	(working copy)
@@ -268,7 +268,7 @@
 
   start = ws->next;
   n = (ws->end - start) / ws->incr;
-  q = (n + nthreads - 1) / nthreads;
+  q = (n + nthreads - 1) / (gomp_guided_divide * nthreads);
 
   if (q < ws->chunk_size)
     q = ws->chunk_size;
@@ -311,7 +311,7 @@
 	return false;
 
       n = (end - start) / incr;
-      q = (n + nthreads - 1) / nthreads;
+      q = (n + nthreads - 1) / (gomp_guided_divide * nthreads);
 
       if (q < chunk_size)
 	q = chunk_size;
Index: libgomp/env.c
===================================================================
--- libgomp/env.c	(revision 155863)
+++ libgomp/env.c	(working copy)
@@ -66,6 +66,7 @@
 #endif
 unsigned long gomp_available_cpus = 1, gomp_managed_threads = 1;
 unsigned long long gomp_spin_count_var, gomp_throttled_spin_count_var;
+unsigned long gomp_guided_divide = 1;
 
 /* Parse the OMP_SCHEDULE environment variable.  */
 
@@ -543,6 +544,7 @@
       if (err != 0)
 	gomp_error ("Stack size change failed: %s", strerror (err));
     }
+  parse_unsigned_long ("GOMP_GUIDED_DIVIDE", &gomp_guided_divide);
 }
 
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h	(revision 155863)
+++ libgomp/libgomp.h	(working copy)
@@ -226,6 +226,7 @@
 extern unsigned long gomp_max_active_levels_var;
 extern unsigned long long gomp_spin_count_var, gomp_throttled_spin_count_var;
 extern unsigned long gomp_available_cpus, gomp_managed_threads;
+extern unsigned long gomp_guided_divide;
 
 enum gomp_task_kind
 {


