[Bug libgomp/93591] New: Bad number of threads and place management on Power-9 (with OpenBLAS)

jeromerichard111 at msn dot com gcc-bugzilla@gcc.gnu.org
Wed Feb 5 09:06:00 GMT 2020


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93591

            Bug ID: 93591
           Summary: Bad number of threads and place management on Power-9
                    (with OpenBLAS)
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jeromerichard111 at msn dot com
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Created attachment 47781
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47781&action=edit
Code used to reproduce the bug

Hello,

I benchmarked the following simple dgemm call using OpenBLAS (commit 8d2a796)
with 4096x4096 matrices (thus n=4096, and a, b and c are matrices) on an IBM
LC922 machine with 2 POWER-9 processors (each with 22 cores and 88 hardware
threads) with GCC-8.3.0:

cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b,
n, 1.0, c, n);

Performance is very bad in some cases: the number of threads actually
created is always one with GCC-8.3.0 when OMP_PLACES is not set to "cores(...)".
This is not the case with Clang-9.0, where the number of threads created is
correct.
This can also be reproduced using GCC-9.2.1.
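
As a side check independent of OpenBLAS, the size of the thread team that
the OpenMP runtime creates for a plain parallel region can be queried
directly. The sketch below is not part of the attached reproducer; the
function name team_size is hypothetical, and the code only shows what
libgomp does for an ordinary region under the current environment, not what
OpenBLAS does internally.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns the number of threads in the team of a plain parallel region.
// When compiled without -fopenmp there is no runtime, so this returns 1.
int team_size() {
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        // Exactly one thread of the team records the team size.
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    return n;
}
```

Printing team_size() from main() under the OMP_NUM_THREADS, OMP_PLACES and
OMP_PROC_BIND settings discussed in this report, after compiling with
-fopenmp, would show whether the runtime itself honours the requested
thread count.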

Looking at the OMP_DISPLAY_ENV output when OMP_PLACES="cores(8)" (a
configuration that creates multiple threads and not just one), we can see that:

OMP_PLACES = '{0:4},{4:4},{8:4},{12:4},{16:4},{20:4},{24:4},{28:4}'

This configuration gives good performance, while the following does not, as
only one thread is created:

OMP_PLACES = '{0},{4},{8},{12},{16},{20},{24},{28}'

And surprisingly this one is fine (multiple threads are created):

OMP_PLACES = '{0:2},{4},{8},{12},{16},{20},{24},{28}'

Thus, the place of the first thread matters in libgomp and, strangely,
determines whether only one thread is created. I think this is most probably
an issue in libgomp and not in GCC itself.
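
For readers unfamiliar with the interval notation in the OMP_PLACES values
above: in the OpenMP syntax, {lower:length} denotes length consecutive
resource numbers starting at lower, so {0:4} is CPUs 0,1,2,3 and {0} is CPU
0 alone. The sketch below expands such a list into explicit CPU numbers. It
is an illustration only, not libgomp's parser; the function name
expand_places is hypothetical, and abstract names such as cores(8) and the
place-level "{...}:count:stride" replication are deliberately not handled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Expands an explicit OMP_PLACES list such as "{0:4},{4:4}" into CPU numbers.
// Supported items inside braces: "a", "a:len", "a:len:stride", comma-separated.
std::vector<std::vector<int>> expand_places(const std::string& spec) {
    const std::size_t npos = std::string::npos;
    std::vector<std::vector<int>> places;
    std::size_t pos = 0;
    while (pos < spec.size()) {
        if (spec[pos] == ',') { ++pos; continue; }  // separator between places
        if (spec[pos] != '{') throw std::invalid_argument("expected '{'");
        const std::size_t close = spec.find('}', pos);
        if (close == npos) throw std::invalid_argument("missing '}'");
        const std::string body = spec.substr(pos + 1, close - pos - 1);

        std::vector<int> cpus;
        std::size_t start = 0;
        while (start <= body.size()) {
            std::size_t comma = body.find(',', start);
            if (comma == npos) comma = body.size();
            const std::string item = body.substr(start, comma - start);

            // Split the item into lower[:len[:stride]].
            int len = 1, stride = 1;
            const std::size_t c1 = item.find(':');
            const int lower = std::stoi(item.substr(0, c1));
            if (c1 != npos) {
                const std::size_t c2 = item.find(':', c1 + 1);
                len = std::stoi(item.substr(c1 + 1, c2 == npos ? npos : c2 - c1 - 1));
                if (c2 != npos) stride = std::stoi(item.substr(c2 + 1));
            }
            for (int i = 0; i < len; ++i) cpus.push_back(lower + i * stride);
            start = comma + 1;
        }
        places.push_back(cpus);
        pos = close + 1;
    }
    return places;
}
```

Applied to "{0:2},{4},{8},{12},{16},{20},{24},{28}", this yields eight places
in which only the first contains two CPUs (0 and 1). Judging from the
cores(8) output above, CPUs 0 to 3 appear to be the four hardware threads of
the first core, so the working and failing configurations differ only in
whether the first place holds one hardware thread or more.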

All tests were run on an Ubuntu 18.04.1 system.

Here is the command used to compile the basic example code:

g++ -O3 -mcpu=native -ffast-math main.cpp -I./OpenBLAS -L./OpenBLAS -lopenblas
-fopenmp

Here is an example of results (with only 8 threads put on 8 different cores):

$ OMP_NUM_THREADS=8 OMP_PLACES="{0:2},{4},{8},{12},{16},{20},{24},{28}"
OMP_PROC_BIND=TRUE ./a.out
167.602 Gflops (time: 0.820032 s)
$ OMP_NUM_THREADS=8 OMP_PLACES="{0},{4},{8},{12},{16},{20},{24},{28}"
OMP_PROC_BIND=TRUE ./a.out
22.4853 Gflops (time: 6.11239 s)

Without the issue, performance should reach 550~600 Gflops on this machine.
When the issue occurs, only about 23 Gflops is obtained.

More details can be seen here: https://github.com/xianyi/OpenBLAS/issues/2380 .

