[fortran,patch] BLAS-enabled matmul

FX Coudert fxcoudert@gmail.com
Mon Apr 3 21:53:00 GMT 2006


:ADDPATCH fortran:

Hi all,

This rather long mail is divided in 3 parts : the first describe the 
usage of the new options I have implemented, the second describes the 
inner working of the patch for reviewers, and the third asks a few 
questions to any GCC or gfortran developers interested to help (about 
linking order, static and dynamic linking, GCC drivers, const and 
restrict keywords and vectorisation).

And, by the way, the patch includes a fix for PR libfortran/26985 
(matmul gives wrong result), which I found by staring at this code so 
long trying to understand how it could work out (the answer is: it doesn't).



*** What this patch enables you to do ***

$ cat b.f90
   integer,parameter :: n = 700
   real(8) :: a(n,n), b(n,n), c(n,n)
   integer :: i, j

   do i = 1, n ; do j = 1, n
     a(i,j) = sin(dble(i*j)) ; b(i,j) = cos(dble(i*j))
   end do ; end do
   c = matmul(a,b)
   print *, sum(c)
   end

[Using the standard no-option compile line]

$ gfortran b.f90 && time ./a.out
    88.7353056829913

real    0m6.366s
user    0m6.316s
sys     0m0.048s

[Using a external BLAS library, here a non-optimized static one]

$ gfortran b.f90 -fexternal-blas=./libblas_nonoptimized.a && time ./a.out
    88.7353056829913

real    0m7.791s
user    0m7.739s
sys     0m0.044s

[Using a complex argument to -fexternal-blas, to link in an optimized 
ATLAS version of BLAS]

$ ./bin/gfortran b.f90 -fexternal-blas="-llapack -lcblas -lf77blas 
-latlas -L$HOME/ATLAS/lib/Linux_PIIISSE1" && time ./a.out
    88.7353056829957

real    0m0.979s
user    0m0.923s
sys     0m0.054s

And that "0.979s" doesn't sound too bad!




*** How does it work? ***

On the library side: we create a new library called libgfortran_blas, 
which is static only, and which contains a minimal BLAS implentation.
The intrinsics/libgfortran_blas.c file contains this minimal 
implementation of sgemm/dgemm/cgemm/zgemm, based on the code we 
currently use for matmul. Then, change the matmul_??.c code to generate 
BLAS calls under certain conditions (no stride except for transposed 
arguments, size above a given size that can be specified at 
compile-time) and otherwise use the current code. So, when we link, the 
driver either links with libgfortran_blas or with the optimized blas 
library provided.

As for how the matmul_??.c code is changed, I modified the m4 to:
   1. know which numeric types have a BLAS routine associated, and what 
is the letter associated (this is in iparm.m4)
   2. protect the BLAS call with an "#if 0" in the case there is 
actually no BLAS routine for that type, that is, for all but real(4 or 
8) and complex(4 or 8); that might not be a very nice solution (maybe 
the if-clause could be done in m4 rather than preprocessing), so I find 
a smarter way to do this, I'll tell you
   3. I added some debug code, commented out with "#if 0", that prints 
out the differents strides and counts of all arrays; i'd like to let 
that in, since it was wonderfully useful during my debugging of this 
patch (and to write the testcase for PR libfortran/26985)


Now there's one thing slightly subtle about all this: suppose a user 
compiles some code that calls DGEMM, and forgets to link in his favorite 
BLAS library. Well, the program will compile and link fine, because we 
provided a symbol called dgemm_ in our libgfortran_blas library. But 
since this later is only a minimalist implementation, it does not handle 
all possibles cases (for example, it asserts that alpha=1 and beta=0), 
so this is very bad. To avoid that, I created a symbol 
libgfortran_called_blas in libgfortran, that is set to 1 before every 
BLAS call initiated by our library, and is set back to 0 afterwards. 
Similarly, the ?gemm routines in libgfortran_blas first to check if 
libgfortran_called_blas has value 1, else issue a runtime error message 
explaining the situation. I'm not happy with this situation, but 
couldn't think of a better way to handle this. If you have any idea, 
please share it!



On the front-end side: we look for the new options -fexternal-blas and 
-fblas-matmul-limit. If the second one is used for compilation of the 
main program, a call to set the library 
compile_options.blas_matmul_limit value accordingly. As for the first one:

1. if it's not specified, the driver adds "-lgfortran_blas -lgfortran" 
to the options, making the full final libraries order "-lgfortranbegin 
-lgfortran -lgfortran_blas -lgfortran -lm"; this is needed because 
libgfortran_blas references a symbol from libgfortran, namely 
libgfortran_called_blas (see above about this one)

2. if -fexternal-blas is specified, its argument is added instead of 
-lgfortran_blas, but everything else from point 1 is still valid; 
append_blas() breaks the space-separated words in
-fexternal-blas argument before appending them to the arguments. As for 
the order of linking, -lgfortran still needs to be added after the 
external blas library, because if the external library was compiled with 
gfortran, it might have some libgfortran symbols referred in it :)



*** How you can help this patch ***

1. I'd like some careful examination of my changes to 
gcc/fortran/gfortranspec.c since it's the first time I actually change 
the gfortran driver. Comments, both on the driver changes and on the 
global linking order scheme, are welcome!


2. The following show that the current matmul code is faster than its 
rewrite in libgfortran_blas.c:

$ ./bin/gfortran b.f90 && time ./a.out
    88.7353056829913

real    0m5.589s
user    0m5.549s
sys     0m0.038s
$ ./bin/gfortran b.f90 -fblas-matmul-limit=800 && time ./a.out
    88.7353056829913

real    0m3.826s
user    0m3.789s
sys     0m0.036s

Almost two seconds difference is too much to be explained by the 
additional system calls. I believe that, although libgfortran_blas.c is 
compiled with -ftree-vectorize -funroll-loops and I thought I had kept 
all the restrict/const qualifiers from the original code, something 
escaped my understanding. Could someone qualified in that (Janne maybe, 
since the original vectorization was your work) have a look at that code?


3. If you're one of the gfortran Happy Few, review that patch, or part 
of it (front-end, library, m4...) ;-)

So, I'll proceed with the magic ritual:
Bootstrapped and regtested on i686-linux. OK for 4.2?
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: blas_matmul.ChangeLog
URL: <http://gcc.gnu.org/pipermail/fortran/attachments/20060403/d82f4522/attachment.ksh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: blas_matmul.diff
Type: text/x-patch
Size: 21153 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/fortran/attachments/20060403/d82f4522/attachment.bin>


More information about the Fortran mailing list