This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug fortran/68600] Inlined MATMUL is too slow.
- From: "tkoenig at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 29 Nov 2015 23:46:20 +0000
- Subject: [Bug fortran/68600] Inlined MATMUL is too slow.
- Auto-submitted: auto-generated
- References: <bug-68600-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68600
--- Comment #5 from Thomas Koenig <tkoenig at gcc dot gnu.org> ---
Another interesting data point. I deleted the DGEMM implementation from
the file and linked against the serial version of OpenBLAS. OK,
OpenBLAS is based on GotoBLAS, so we have to expect a hit
for large matrices.
Figures:
ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
Size   Loops    Matmul     dgemm    Matmul    Matmul
                 fixed             assumed  variable
              explicit                      explicit
====================================================
   2  200000    11.944     0.035     0.136     0.412
   4  200000     1.712     0.257     0.458     0.738
   8  200000     2.080     1.162     0.824     1.077
  16  200000     1.697     3.104     0.939     0.995
  32  200000     1.450     4.814     1.388     1.426
  64   30757     1.485     5.978     1.351     1.371
 128    3829     1.557     6.857     1.534     1.522
 256     477     1.568     7.017     1.589     1.537
So far so good. Looks as if the crossover point for the inline and the dgemm
version is between 8 and 16, so let us try this:
ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops -finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out
Size   Loops    Matmul     dgemm    Matmul    Matmul
                 fixed             assumed  variable
              explicit                      explicit
====================================================
   2  200000    11.948     0.039     0.156     0.464
   4  200000     1.999     0.305     0.542     0.859
   8  200000     2.435     1.359     0.962     1.255
  16  200000     0.802     3.102     0.798     0.799
  32  200000     4.878     4.990     4.906     4.906
  64   30757     6.045     6.062     5.977     5.968
So, if the user really wants us to call an external BLAS, we had better
do so directly and not through our library routines.
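
For anyone who wants to re-check the crossover estimate, here is a minimal sketch in Python (not part of bench-3.f90) that reads it off the first table. It uses the "Matmul fixed explicit" and "dgemm" columns and assumes that larger figures are better, which is an interpretation of the table rather than something the benchmark output states. The helper name crossover is made up for this illustration.

```python
# Figures copied from the first table in this message
# (columns "Matmul fixed explicit" and "dgemm").
sizes = [2, 4, 8, 16, 32, 64, 128, 256]
matmul_fixed = [11.944, 1.712, 2.080, 1.697, 1.450, 1.485, 1.557, 1.568]
dgemm = [0.035, 0.257, 1.162, 3.104, 4.814, 5.978, 6.857, 7.017]


def crossover(sizes, inline, blas):
    """Return the (lower, upper) sizes between which blas first overtakes inline.

    Hypothetical helper: assumes larger figures are better and that
    the three lists have the same length. Returns None if blas never
    overtakes inline in the measured range.
    """
    for i in range(1, len(sizes)):
        inline_ahead_before = inline[i - 1] >= blas[i - 1]
        blas_ahead_now = inline[i] < blas[i]
        if inline_ahead_before and blas_ahead_now:
            return sizes[i - 1], sizes[i]
    return None


bracket = crossover(sizes, matmul_fixed, dgemm)
print(bracket)  # (8, 16)
```

The bracket only says where the measured curves cross; a limit such as 12 is one choice inside that bracket, and finer-grained sizes would be needed to pin it down further.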