[Bug libgomp/93591] New: Bad number of threads and place management on Power-9 (with OpenBLAS)
jeromerichard111 at msn dot com
gcc-bugzilla@gcc.gnu.org
Wed Feb 5 09:06:00 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93591
Bug ID: 93591
Summary: Bad number of threads and place management on Power-9
(with OpenBLAS)
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: jeromerichard111 at msn dot com
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Created attachment 47781
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47781&action=edit
Code used to reproduce the bug
Hello,
I benchmarked the following simple dgemm call using OpenBLAS (commit 8d2a796)
with 4096x4096 matrices (thus n=4096, and a, b and c are matrices) on an IBM
LC922 machine with 2 POWER9 processors (each with 22 cores and 88 hardware
threads) with GCC 8.3.0:
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b,
n, 1.0, c, n);
Performance is very poor in some cases: with GCC 8.3.0, only one thread is
ever created unless OMP_PLACES is set to "cores(...)". This is not the case
with Clang 9.0, where the correct number of threads is created.
The problem can also be reproduced with GCC 9.2.1.
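As a rough way to check how many threads the OpenMP runtime really hands out, independently of OpenBLAS, something like the sketch below could be used. The function name observed_team_size is hypothetical and not part of the attached reproducer, and a plain parallel region may not trigger the same behaviour as the code path inside OpenBLAS. When built without -fopenmp it simply reports 1.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the team size seen inside a parallel region.
// Falls back to 1 when the code is compiled without OpenMP support.
int observed_team_size() {
    int team = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Only one thread writes the shared variable.
        #pragma omp single
        team = omp_get_num_threads();
    }
#endif
    assert(team >= 1);
    return team;
}
```

Printing the returned value while varying OMP_NUM_THREADS, OMP_PLACES and OMP_PROC_BIND shows whether the requested team size is honoured.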
Looking at the OMP_DISPLAY_ENV output when OMP_PLACES="cores(8)" (a
configuration that creates multiple threads and not just one), we can see that:
OMP_PLACES = '{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}'
This configuration gives good performance, while the following does not,
because only one thread is created:
OMP_PLACES = '{0},{4},{8},{12},{16},{20},{24},{28}'
And surprisingly this one is fine (multiple threads are created):
OMP_PLACES = '{0:2},{4},{8},{12},{16},{20},{24},{28}'
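For readers unfamiliar with the interval notation: in OpenMP, a place written as {lower:length:stride} lists length hardware thread numbers starting at lower, with length and stride defaulting to 1. The sketch below expands a single place of that form. The helper name expand_place is hypothetical, and it handles only this one form, not comma-separated numbers inside the braces or exclusions.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: expands one place written as "{lower}",
// "{lower:length}" or "{lower:length:stride}" into hardware thread numbers.
std::vector<int> expand_place(const std::string& place) {
    assert(place.size() >= 3 && place.front() == '{' && place.back() == '}');
    int lower = 0, length = 1, stride = 1;  // length and stride default to 1
    int fields = std::sscanf(place.c_str(), "{%d:%d:%d", &lower, &length, &stride);
    assert(fields >= 1);
    (void)fields;
    std::vector<int> cpus;
    for (int i = 0; i < length; ++i) {
        cpus.push_back(lower + i * stride);
    }
    return cpus;
}
```

With this reading, {0:4} covers hardware threads 0 to 3 (one whole core on this 4-way SMT machine, assuming consecutive numbering), {0} covers only hardware thread 0, and {0:2} covers hardware threads 0 and 1. The three configurations above therefore differ in the first place only by how many hardware threads it contains.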
Thus, the place of the first thread matters to libgomp and, strangely,
determines whether only one thread is created. I think this is most likely
an issue in libgomp and not in GCC itself.
All tests were run on an Ubuntu 18.04.1 system.
Here is the command used to compile the basic example code:
g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS \
    -lopenblas -fopenmp
Here is an example of results (with only 8 threads put on 8 different cores):
$ OMP_NUM_THREADS=8 OMP_PLACES="{0:2},{4},{8},{12},{16},{20},{24},{28}" \
    OMP_PROC_BIND=TRUE ./a.out
167.602 Gflops (time: 0.820032 s)
$ OMP_NUM_THREADS=8 OMP_PLACES="{0},{4},{8},{12},{16},{20},{24},{28}" \
    OMP_PROC_BIND=TRUE ./a.out
22.4853 Gflops (time: 6.11239 s)
Without the issue, performance should reach 550 to 600 Gflops on this
machine. When the issue occurs, only about 23 Gflops is obtained.
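The Gflops figures above are consistent with the conventional count of 2*n^3 floating-point operations for an n x n x n dgemm. The following sketch reproduces that arithmetic. The function name dgemm_gflops is hypothetical, and that the attached reproducer computes its figure this way is an assumption.

```cpp
#include <cassert>

// Hypothetical helper: Gflop/s for an n x n x n dgemm, assuming the
// conventional operation count of 2*n^3 floating-point operations.
double dgemm_gflops(int n, double seconds) {
    assert(n > 0 && seconds > 0.0);
    const double flops = 2.0 * n * n * n;  // one multiply and one add per term
    return flops / seconds / 1e9;
}
```

For n=4096, the reported times of 0.820032 s and 6.11239 s give about 167.6 and 22.49 Gflops respectively, matching the output shown.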
More details can be seen here: https://github.com/xianyi/OpenBLAS/issues/2380 .