[fortran,patch] BLAS-enabled matmul
FX Coudert
fxcoudert@gmail.com
Mon Apr 3 21:53:00 GMT 2006
:ADDPATCH fortran:
Hi all,
This rather long mail is divided in 3 parts : the first describe the
usage of the new options I have implemented, the second describes the
inner working of the patch for reviewers, and the third asks a few
questions to any GCC or gfortran developers interested to help (about
linking order, static and dynamic linking, GCC drivers, const and
restrict keywords and vectorisation).
And, by the way, the patch includes a fix for PR libfortran/26985
(matmul gives wrong result), which I found by staring at this code so
long trying to understand how it could work out (the answer is: it doesn't).
*** What this patch enables you to do ***
$ cat b.f90
integer,parameter :: n = 700
real(8) :: a(n,n), b(n,n), c(n,n)
integer :: i, j
do i = 1, n ; do j = 1, n
a(i,j) = sin(dble(i*j)) ; b(i,j) = cos(dble(i*j))
end do ; end do
c = matmul(a,b)
print *, sum(c)
end
[Using the standard no-option compile line]
$ gfortran b.f90 && time ./a.out
88.7353056829913
real 0m6.366s
user 0m6.316s
sys 0m0.048s
[Using a external BLAS library, here a non-optimized static one]
$ gfortran b.f90 -fexternal-blas=./libblas_nonoptimized.a && time ./a.out
88.7353056829913
real 0m7.791s
user 0m7.739s
sys 0m0.044s
[Using a complex argument to -fexternal-blas, to link in an optimized
ATLAS version of BLAS]
$ ./bin/gfortran b.f90 -fexternal-blas="-llapack -lcblas -lf77blas
-latlas -L$HOME/ATLAS/lib/Linux_PIIISSE1" && time ./a.out
88.7353056829957
real 0m0.979s
user 0m0.923s
sys 0m0.054s
And that "0.979s" doesn't sound too bad!
*** How does it work? ***
On the library side: we create a new library called libgfortran_blas,
which is static only, and which contains a minimal BLAS implentation.
The intrinsics/libgfortran_blas.c file contains this minimal
implementation of sgemm/dgemm/cgemm/zgemm, based on the code we
currently use for matmul. Then, change the matmul_??.c code to generate
BLAS calls under certain conditions (no stride except for transposed
arguments, size above a given size that can be specified at
compile-time) and otherwise use the current code. So, when we link, the
driver either links with libgfortran_blas or with the optimized blas
library provided.
As for how the matmul_??.c code is changed, I modified the m4 to:
1. know which numeric types have a BLAS routine associated, and what
is the letter associated (this is in iparm.m4)
2. protect the BLAS call with an "#if 0" in the case there is
actually no BLAS routine for that type, that is, for all but real(4 or
8) and complex(4 or 8); that might not be a very nice solution (maybe
the if-clause could be done in m4 rather than preprocessing), so I find
a smarter way to do this, I'll tell you
3. I added some debug code, commented out with "#if 0", that prints
out the differents strides and counts of all arrays; i'd like to let
that in, since it was wonderfully useful during my debugging of this
patch (and to write the testcase for PR libfortran/26985)
Now there's one thing slightly subtle about all this: suppose a user
compiles some code that calls DGEMM, and forgets to link in his favorite
BLAS library. Well, the program will compile and link fine, because we
provided a symbol called dgemm_ in our libgfortran_blas library. But
since this later is only a minimalist implementation, it does not handle
all possibles cases (for example, it asserts that alpha=1 and beta=0),
so this is very bad. To avoid that, I created a symbol
libgfortran_called_blas in libgfortran, that is set to 1 before every
BLAS call initiated by our library, and is set back to 0 afterwards.
Similarly, the ?gemm routines in libgfortran_blas first to check if
libgfortran_called_blas has value 1, else issue a runtime error message
explaining the situation. I'm not happy with this situation, but
couldn't think of a better way to handle this. If you have any idea,
please share it!
On the front-end side: we look for the new options -fexternal-blas and
-fblas-matmul-limit. If the second one is used for compilation of the
main program, a call to set the library
compile_options.blas_matmul_limit value accordingly. As for the first one:
1. if it's not specified, the driver adds "-lgfortran_blas -lgfortran"
to the options, making the full final libraries order "-lgfortranbegin
-lgfortran -lgfortran_blas -lgfortran -lm"; this is needed because
libgfortran_blas references a symbol from libgfortran, namely
libgfortran_called_blas (see above about this one)
2. if -fexternal-blas is specified, its argument is added instead of
-lgfortran_blas, but everything else from point 1 is still valid;
append_blas() breaks the space-separated words in
-fexternal-blas argument before appending them to the arguments. As for
the order of linking, -lgfortran still needs to be added after the
external blas library, because if the external library was compiled with
gfortran, it might have some libgfortran symbols referred in it :)
*** How you can help this patch ***
1. I'd like some careful examination of my changes to
gcc/fortran/gfortranspec.c since it's the first time I actually change
the gfortran driver. Comments, both on the driver changes and on the
global linking order scheme, are welcome!
2. The following show that the current matmul code is faster than its
rewrite in libgfortran_blas.c:
$ ./bin/gfortran b.f90 && time ./a.out
88.7353056829913
real 0m5.589s
user 0m5.549s
sys 0m0.038s
$ ./bin/gfortran b.f90 -fblas-matmul-limit=800 && time ./a.out
88.7353056829913
real 0m3.826s
user 0m3.789s
sys 0m0.036s
Almost two seconds difference is too much to be explained by the
additional system calls. I believe that, although libgfortran_blas.c is
compiled with -ftree-vectorize -funroll-loops and I thought I had kept
all the restrict/const qualifiers from the original code, something
escaped my understanding. Could someone qualified in that (Janne maybe,
since the original vectorization was your work) have a look at that code?
3. If you're one of the gfortran Happy Few, review that patch, or part
of it (front-end, library, m4...) ;-)
So, I'll proceed with the magic ritual:
Bootstrapped and regtested on i686-linux. OK for 4.2?
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blas_matmul.ChangeLog
URL: <http://gcc.gnu.org/pipermail/fortran/attachments/20060403/d82f4522/attachment.ksh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: blas_matmul.diff
Type: text/x-patch
Size: 21153 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/fortran/attachments/20060403/d82f4522/attachment.bin>
More information about the Fortran
mailing list