This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[PATCH, libgomp] tune guided schedule load balancing
- From: Vasilis Liaskovitis <vliaskov at gmail dot com>
- To: gcc-patches at gcc dot gnu dot org
- Cc: jakub at redhat dot com
- Date: Fri, 29 Jan 2010 20:29:02 -0600
- Subject: [PATCH, libgomp] tune guided schedule load balancing
Hi,
The OpenMP 3.0 specification (section 2.5.1, page 45) mentions two example
implementations of guided schedules. Assuming n remaining iterations and p
threads, each assignment can be either ceiling(n / p) or ceiling(n / (2 * p))
iterations. Currently libgomp's guided schedule uses the first choice (the
coarsest assignment). This very simple patch adds a tunable (environment
variable GOMP_GUIDED_DIVIDE) that controls the grain of guided iteration
assignments. Iterations are now assigned to threads in groups of
n / (d * p), where d is the value of GOMP_GUIDED_DIVIDE. The default value is
GOMP_GUIDED_DIVIDE=1, i.e. identical to the old behaviour. The higher the value
of the environment variable, the finer-grained the guided iteration
assignments.
I tested a few SPEC OMP2001 benchmarks with static scheduling versus guided
scheduling with GOMP_GUIDED_DIVIDE=1, 2, 4 (guided1, guided2, guided4 below).
Guided2 / guided4 improve load balancing for one application. All percentages
below are performance relative to static scheduling (numbers > 100% are a
speedup). Tested on an 8-core system.
                 guided1  guided2  guided4
310.wupwise_m      91%      98%      92%
312.swim_m         99%     102%      93%
316.applu_m       101%      91%      84%
320.equake_m      103%     101%     103%
324.apsi_m        100%      99%      99%
328.fma3d_m       100%     100%      99%
330.art_m          99%     100%     100%
332.ammp_m        106%     123%     122%
(guided1 performance is identical to trunk default guided)
For 332.ammp, guided2 or guided4 improves execution time by 23% against static
and by 17% against the default-guided implementation. Note that the ammp source
code uses schedule(guided) explicitly for the main loop; the static schedule
was tested by removing the schedule(guided) clause.
For 310.wupwise, guided2 improves significantly upon the default guided
schedule (98% versus 91%) and is almost as good as static.
For the rest of the applications, changing the initial guided assignment to
finer-grained assignments makes almost no difference (324, 328, 330, 312) or
has a negative effect (316, 320).
In 332.ammp, most of the time is spent in one large OpenMP loop (rectmm.c,
lines 596-1307), and the time spent in each iteration of this loop varies
greatly from iteration to iteration. Static scheduling or the trunk version of
guided scheduling results in bad load balancing across cores/threads: some
cores remain idle while others keep working. For an example run, core
utilizations range from 77% to 90% in the default-guided run and from 94% to
99% in the new-guided run.
It would be interesting to test applications with load imbalance issues or
investigate more sophisticated heuristics for the schedule(guided or auto)
clause. I am not aware of other applications that would benefit from this, and
I am not sure it is worth pushing this into trunk. Comments or applications for
further testing are welcome.
These experiments used a gcc trunk revision from 2010-01-18 (sorry, I don't
recall the exact revision number) with flags "-fopenmp -O3 -funroll-loops".
Bootstrapped on x86_64.
thanks,
- Vasilis
Index: libgomp/iter.c
===================================================================
--- libgomp/iter.c (revision 155863)
+++ libgomp/iter.c (working copy)
@@ -268,7 +268,7 @@
start = ws->next;
n = (ws->end - start) / ws->incr;
- q = (n + nthreads - 1) / nthreads;
+ q = (n + nthreads - 1) / (gomp_guided_divide * nthreads);
if (q < ws->chunk_size)
q = ws->chunk_size;
@@ -311,7 +311,7 @@
return false;
n = (end - start) / incr;
- q = (n + nthreads - 1) / nthreads;
+ q = (n + nthreads - 1) / (gomp_guided_divide * nthreads);
if (q < chunk_size)
q = chunk_size;
Index: libgomp/env.c
===================================================================
--- libgomp/env.c (revision 155863)
+++ libgomp/env.c (working copy)
@@ -66,6 +66,7 @@
#endif
unsigned long gomp_available_cpus = 1, gomp_managed_threads = 1;
unsigned long long gomp_spin_count_var, gomp_throttled_spin_count_var;
+unsigned long gomp_guided_divide = 1;
/* Parse the OMP_SCHEDULE environment variable. */
@@ -543,6 +544,7 @@
if (err != 0)
gomp_error ("Stack size change failed: %s", strerror (err));
}
+ parse_unsigned_long ("GOMP_GUIDED_DIVIDE", &gomp_guided_divide);
}
Index: libgomp/libgomp.h
===================================================================
--- libgomp/libgomp.h (revision 155863)
+++ libgomp/libgomp.h (working copy)
@@ -226,6 +226,7 @@
extern unsigned long gomp_max_active_levels_var;
extern unsigned long long gomp_spin_count_var, gomp_throttled_spin_count_var;
extern unsigned long gomp_available_cpus, gomp_managed_threads;
+extern unsigned long gomp_guided_divide;
enum gomp_task_kind
{