Bug 51119 - MATMUL slow for large matrices
Summary: MATMUL slow for large matrices
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libfortran
Version: unknown
Importance: P3 enhancement
Target Milestone: ---
Assignee: Jerry DeLisle
URL:
Keywords:
Depends on: 14741 37131 66189 68600
Blocks:
Reported: 2011-11-14 06:48 UTC by Janne Blomqvist
Modified: 2017-02-26 13:23 UTC
CC List: 5 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-03-29 00:00:00


Attachments
comparison in performance for small matrix multiplies (libsmm vs mkl) (66.00 KB, image/png)
2011-11-15 12:31 UTC, Joost VandeVondele
Proposed patch to get testing going (6.68 KB, patch)
2016-11-07 21:41 UTC, Jerry DeLisle
A test program (588 bytes, text/plain)
2016-11-08 00:55 UTC, Jerry DeLisle
A test program I am using for timings (3.84 KB, text/plain)
2016-11-14 18:26 UTC, Jerry DeLisle

Description Janne Blomqvist 2011-11-14 06:48:35 UTC
Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with n > 256 is around a factor of 8 slower. 

While I don't think it's worth spending the time on target-specific parameters and/or an asm-coded inner kernel, as high-performance BLAS implementations do, I suspect that a little effort towards cache blocking could improve things.
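
For illustration only (a sketch, not the libgfortran implementation), the kind of cache blocking meant here is a tiled triple loop; the block size nb below is an arbitrary assumption and would need tuning per target:

subroutine blocked_matmul(a, b, c, n)
  ! Sketch of a cache-blocked multiply for square real(8) matrices.
  implicit none
  integer, intent(in) :: n
  real(8), intent(in)  :: a(n,n), b(n,n)
  real(8), intent(out) :: c(n,n)
  integer, parameter :: nb = 64     ! assumed block size; tune to the cache
  integer :: ii, jj, kk, i, j, k
  c = 0
  do jj = 1, n, nb
     do kk = 1, n, nb
        do ii = 1, n, nb
           do j = jj, min(jj + nb - 1, n)
              do k = kk, min(kk + nb - 1, n)
                 ! innermost loop walks down a column: unit stride
                 do i = ii, min(ii + nb - 1, n)
                    c(i,j) = c(i,j) + a(i,k) * b(k,j)
                 end do
              end do
           end do
        end do
     end do
  end do
end subroutine blocked_matmul

Each nb x nb tile of a, b and c is then reused many times while it sits in cache, which is where the gain over the plain triple loop comes from.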
Comment 1 Janne Blomqvist 2011-11-14 06:49:11 UTC
Assigning to myself.

I have a cunning plan.
Comment 2 Tobias Burnus 2011-11-14 13:08:49 UTC
(In reply to comment #0)
> Compared to ATLAS BLAS on an AMD 10h processor, MATMUL on square matrices with
> n > 256 is around a factor of 8 slower. 

Side note: You can use -fexternal-blas -fblas-matmul-limit=<...> and link ATLAS BLAS.

> Assigning to myself.
> I have a cunning plan.

I am looking forward to cunning ideas - at least if they are not too convoluted, work on all targets and are middle-end friendly.
Comment 3 Joost VandeVondele 2011-11-15 12:19:59 UTC
(In reply to comment #1)
> I have a cunning plan.

It is doable to come within a factor of 2 of highly efficient implementations using a cache-oblivious matrix multiply, which is relatively easy to code. I'm not sure this is worth the effort.

I believe it would be more important to have truly efficient (inlined) implementations for very small matrices. These would outperform general libraries by a large factor. For CP2K I have written a specialized small-matrix-multiply library generator which generates code that outperforms e.g. MKL by a large factor for small matrices (<<32x32). The generation time and library size do not make it a general-purpose tool. It also contains some sort of recursive multiply implementation (see http://cvs.berlios.de/cgi-bin/viewvc.cgi/cp2k/cp2k/tools/build_libsmm/)
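
For the record, a cache-oblivious multiply is essentially a recursive halving of the largest dimension until the sub-problem fits in cache; a minimal sketch (the cutoff of 32 is an arbitrary assumption, and this is not the build_libsmm code):

recursive subroutine rec_mm(m, n, k, a, lda, b, ldb, c, ldc)
  ! Adds A(1:m,1:k) * B(1:k,1:n) onto C(1:m,1:n), column-major.
  implicit none
  integer, intent(in) :: m, n, k, lda, ldb, ldc
  real(8), intent(in)    :: a(lda,*), b(ldb,*)
  real(8), intent(inout) :: c(ldc,*)
  integer, parameter :: cutoff = 32
  integer :: i, j, l, h
  if (max(m, n, k) <= cutoff) then
     do j = 1, n
        do l = 1, k
           do i = 1, m
              c(i,j) = c(i,j) + a(i,l) * b(l,j)
           end do
        end do
     end do
  else if (m >= n .and. m >= k) then   ! split the rows of A and C
     h = m / 2
     call rec_mm(h,     n, k, a,        lda, b, ldb, c,        ldc)
     call rec_mm(m - h, n, k, a(h+1,1), lda, b, ldb, c(h+1,1), ldc)
  else if (n >= k) then                ! split the columns of B and C
     h = n / 2
     call rec_mm(m, h,     k, a, lda, b,        ldb, c,        ldc)
     call rec_mm(m, n - h, k, a, lda, b(1,h+1), ldb, c(1,h+1), ldc)
  else                                 ! split the inner dimension
     h = k / 2
     call rec_mm(m, n, h,     a,        lda, b,        ldb, c, ldc)
     call rec_mm(m, n, k - h, a(1,h+1), lda, b(h+1,1), ldb, c, ldc)
  end if
end subroutine rec_mm

No block-size parameter is needed; the recursion finds cache-friendly sub-problems on its own, which is the appeal of the approach.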
Comment 4 Joost VandeVondele 2011-11-15 12:31:10 UTC
Created attachment 25826 [details]
comparison in performance for small matrix multiplies (libsmm vs mkl)

Added some data showing the speedup of specialized matrix multiply code (small matrices, known bounds, in cache) against a general dgemm (MKL).
Comment 5 Janne Blomqvist 2011-11-15 15:47:54 UTC
(In reply to comment #3)
> I believe it would be more important to have actually highly efficient
> (inlined) implementations for very small matrices.

There's already PR 37131 for that.
Comment 6 Joost VandeVondele 2012-06-28 11:58:20 UTC
Janne, have you had a chance to look at this? For larger matrices MATMUL is really slow. Anything that includes even the most basic blocking scheme should be faster. I think this would be a valuable improvement.
Comment 7 Janne Blomqvist 2012-06-28 12:15:05 UTC
(In reply to comment #6)
> Janne, have you had a chance to look at this? For larger matrices MATMUL is
> really slow. Anything that includes even the most basic blocking scheme should
> be faster. I think this would be a valuable improvement.

I implemented a block-panel multiplication algorithm similar to GOTO BLAS and Eigen, but I got side-tracked by other things and never found the time to fix the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak flops which was a bit disappointing.
Comment 8 Joost VandeVondele 2012-06-29 07:19:03 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Janne, have you had a chance to look at this? For larger matrices MATMUL is
> > really slow. Anything that includes even the most basic blocking scheme should
> > be faster. I think this would be a valuable improvement.
> 
> I implemented a block-panel multiplication algorithm similar to GOTO BLAS and
> Eigen, but I got side-tracked by other things and never found the time to fix
> the corner-case bugs and tune performance. IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

I think 30% of peak is a good improvement over the current version, which reaches 7% of peak (vs. 92% for MKL) for a double-precision 8000x8000 matrix multiplication on a Sandy Bridge.

In addition to blocking, is the Fortran runtime being compiled with a set of compile options that enable vectorization? In an ideal world, gcc would recognize the loop pattern in the runtime library code and do blocking, vectorization etc. automagically.
Comment 9 Steven Bosscher 2012-06-29 10:55:48 UTC
(In reply to comment #7)
> IIRC I reached about 30-40 % of peak
> flops which was a bit disappointing.

This sounds quite impressive to me, actually.

It would be interesting to investigate using the IFUNC mechanism to provide optimized (e.g. vectorized) versions of some of the library functions.
Comment 10 Joost VandeVondele 2013-03-29 08:47:39 UTC
What about compiling the Fortran runtime library with vectorization, and all the fancy options that come with graphite (loop blocking in particular)? If they don't work for a matrix multiplication pattern... what's their use? Further naivety would be to provide an LTO'ed runtime, allowing matrix multiplication to be inlined for known small bounds... kind of the ultimate dogfooding?
Comment 11 Thomas Koenig 2013-04-01 15:58:52 UTC
A bit like PR 37131 (but I don't want to lose either audit trail).
Comment 12 Dominique d'Humieres 2015-10-31 14:15:46 UTC
Some new numbers for a four-core Core i7 at 2.8 GHz (turbo boost 3.8 GHz) with 1.6 GHz DDR3, on x86_64-apple-darwin14.5, for the following test

program t2 
implicit none 
REAL time_begin, time_end 
integer, parameter :: n=2000; 
integer(8) :: ts, te, rate8, cmax8
real(8) :: elapsed
REAL(8) :: a(n,n), b(n,n), c(n,n) 
integer, parameter :: m = 100 
integer :: i 
call RANDOM_NUMBER(a) 
call RANDOM_NUMBER(b) 
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    c = MATMUL(a,b) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
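! The second timed loop below measures the same product via an external
! dgemm; the printed label 'Time, MATMUL' is simply reused from above.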
call cpu_time(time_begin) 
call SYSTEM_CLOCK (ts, rate8, cmax8)
do i = 1,m 
    a(1,1) = a(1,1) + 0.1 
    call dgemm('n','n',n, n, n, dble(1.0), a, n, b, n, dble(0.0), c, n) 
enddo 
call SYSTEM_CLOCK (te, rate8, cmax8)
call cpu_time(time_end) 
elapsed = real(te-ts, kind=8)/real(rate8, kind=8)
PRINT *, 'Time, MATMUL: ',time_end-time_begin, elapsed , 2*m*real(n, kind=8)**3/(10**9*elapsed)
end program 

borrowed from http://groups.google.com/group/comp.lang.fortran/browse_thread/thread/1cba8e6ce5080197

[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -fno-frontend-optimize
[Book15] f90/bug% time a.out
 Time, MATMUL:    374.027161       374.02889900000002        4.2777443247774283     
 Time, MATMUL:    172.823853       23.073034000000000        69.345019818373260     
546.427u 0.542s 6:37.24 137.6%	0+0k 1+0io 41pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate 
[Book15] f90/bug% time a.out
 Time, MATMUL:    391.495880       391.49403500000000        4.0869077353886123     
 Time, MATMUL:    169.313202       22.781099000000001        70.233661685944114     
560.384u 0.544s 6:54.39 135.3%	0+0k 0+0io 0pf+0w
[Book15] f90/bug% gfc -Ofast timing/matmul_tst_sys.f90 -framework Accelerate -march=native
[Book15] f90/bug% time a.out
 Time, MATMUL:    367.570374       367.56880500000000        4.3529265221514102     
 Time, MATMUL:    170.150818       22.837544000000001        70.060073009602078     
537.306u 0.534s 6:30.53 137.7%	0+0k 0+0io 0pf+0w

where the last column is the speed in Gflops. These numbers show that the library MATMUL is slightly faster than the inline version unless -march=native is used (AVX should be twice as fast unless limited by memory bandwidth).

[Book15] f90/bug% gfc -Ofast -fexternal-blas timing/matmul_tst_sys.f90 -framework Accelerate
[Book15] f90/bug% time a.out
 Time, MATMUL:    159.000992       21.450851000000000        74.589115368896088     
 Time, MATMUL:    172.616943       23.029487000000000        69.476145951492541     
331.281u 0.453s 0:44.60 743.7%	0+0k 0+0io 3pf+0w
... repeated several times in order to heat the CPU
[Book15] f90/bug% time a.out
 Time, MATMUL:    179.624268       23.935708999999999        66.845732457726655     
 Time, MATMUL:    178.685364       23.898668000000001        66.949337929628541     
357.978u 0.447s 0:47.95 747.4%	0+0k 0+0io 0pf+0w

Thus the BLAS provided by darwin gets ~67 GFlops out of the ~90 GFlops peak (AVX * 4 cores), while the inlined MATMUL gets ~4 GFlops out of the ~15 GFlops peak (no AVX, one core with turbo boost), with little gain when using AVX (~30 GFlops peak).

I suppose most modern OS provide such optimized BLAS and, if not, one can install libraries such as atlas. So I wonder if it would not be more effective to be able to configure with something such as --with-blas="magic incantation" and use -fexternal-blas as the default rather than reinventing the wheel.

More than three years ago Janne Blomqvist (comment 7) wrote
> IIRC I reached about 30-40 % of peak flops which was a bit disappointing.

Would it be possible to have the patch to play with?
Comment 13 Thomas Koenig 2015-10-31 16:36:54 UTC
(In reply to Dominique d'Humieres from comment #12)

> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

If -fexternal-blas is supplied, the current implementation defaults to
-fblas-matmul-limit=30, which in turn sets -finline-matmul-limit=30
(which is fairly reasonable for the point where an
external, optimized BLAS and inlining are equally fast).

I would be interested to see where threading moves this intersection.
Comment 14 Janne Blomqvist 2015-10-31 21:28:56 UTC
(In reply to Dominique d'Humieres from comment #12)
> I suppose most modern OS provide such optimized BLAS and, if not, one can
> install libraries such as atlas. So I wonder if it would not be more
> effective to be able to configure with something such as --with-blas="magic
> incantation" and use -fexternal-blas as the default rather than reinventing
> the wheel.

This matches my current thinking on this subject. 

To get good performance one really needs arch-specific parameters (block sizes to fit into cache etc.), as well as arch-specific code to make maximum use of the vector ISA. Add in threading, which is useful for larger matrices, and there's a lot more work than what the current GFortran development team is able to commit to.

So my idea of what ought to be done:

- Check for the presence of BLAS at compile time. Alternatively, use weak references so we can always use BLAS if it's available, without the user having to specify -fexternal-blas (which I guess most users don't).

  - A problem here is what if the system has multiple BLAS libraries, which one do we choose? And different systems have different ways of linking to BLAS (e.g. -framework Accelerate on OSX).

  - And what about BLAS64, i.e. BLAS compiled with 64-bit integers? It seems these libraries have the same API as the "normal" BLAS, so how do we figure out at build time which kind of BLAS library we are using?

- Currently with -fexternal-blas we only use BLAS for stride-1 arrays, falling back to the current code for stride /= 1. It's probably more efficient to pack stride /= 1 arrays and then call BLAS. Heck, high performance BLAS libraries repack blocks to get better cache behavior anyways.
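
A minimal sketch of that packing step (assuming a reference-BLAS dgemm is linked in; not the current runtime code):

subroutine matmul_packed(a, b, c)
  ! Pack possibly non-contiguous array sections into contiguous
  ! temporaries, then hand the whole product to dgemm.
  implicit none
  real(8), intent(in)  :: a(:,:), b(:,:)
  real(8), intent(out) :: c(:,:)
  real(8), allocatable :: ap(:,:), bp(:,:)
  integer :: m, n, k
  external :: dgemm
  m = size(a, 1);  k = size(a, 2);  n = size(b, 2)
  ap = a   ! assignment to an allocatable gives contiguous copies
  bp = b
  call dgemm('N', 'N', m, n, k, 1.0_8, ap, m, bp, k, 0.0_8, c, m)
end subroutine matmul_packed

The extra copies cost O(n**2) against the O(n**3) multiply, so for anything but tiny matrices the packing pays for itself.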


> 
> More than three years ago Janne Blomqvist (comment 7) wrote
> > IIRC I reached about 30-40 % of peak flops which was a bit disappointing.
> 
> Would it be possible to have the patch to play with?

My GCC dev box where I think this stuff might reside is packed down in a box as I have recently moved. But I'll keep this in mind, and see if I can find the patch once I get around to unpacking..

As an aside, contrary to when I implemented my patch based on reading the papers by Goto et al., nowadays there's a nice step-by-step description at

http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html
Comment 15 Thomas Koenig 2015-11-01 10:31:42 UTC
Another issue:  What should we do if the user supplies an external subroutine DGEMM which does something unrelated?

I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.
Comment 16 Jerry DeLisle 2015-11-08 23:33:14 UTC
For what it's worth:

$ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
$ ./a.out 
 Time, MATMUL:    21.2483196       21.254449646000001     1.5055670945599979     
 Time, dgemm:    33.2441711       33.243087289000002      .96260614189671445     

This is on a laptop not taking any advantage of a tuned BLAS.  If I replace -Ofast with -O2 I get:

$ ./a.out 
 Time, MATMUL:    43.6199570       43.625358022999997    0.73351833543988521     
 Time, dgemm:    33.2262650       33.226961453000001     0.96307331759072967 

-O3 brings performance back to match with -Ofast. It seems odd to me that -O2 does not do well.

Regardless, the internal MATMUL is doing better than BLAS on this platform, but 1.5 gflops is pretty lame either way.
Comment 17 Jerry DeLisle 2015-11-22 20:34:31 UTC
I have done some experimenting. Since gcc supports OMP and, I think, to some extent ACC, why not come up with a MATMUL that exploits these if present? On the darwin platform discussed in comment #12, the performance is excellent. Does the implementation darwin provides exploit OpenCL? What is it using? Why not enable that on other platforms if present?

I am going to explore OpenCL and clBLAS to see if I can get it to work.  If I am successful, I would like to hide it behind MATMUL if possible.  Any other opinions?
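
To make the OMP side concrete, the simplest sketch just distributes columns of the result over threads; how many threads to use by default (and whether to thread at all) is exactly the open question:

subroutine matmul_omp(a, b, c, n)
  ! Sketch: each thread computes a contiguous set of columns of C.
  implicit none
  integer, intent(in) :: n
  real(8), intent(in)  :: a(n,n), b(n,n)
  real(8), intent(out) :: c(n,n)
  integer :: i, j, k
  !$omp parallel do private(i, k) schedule(static)
  do j = 1, n
     c(:,j) = 0
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do
  !$omp end parallel do
end subroutine matmul_omp

This only needs -fopenmp when building the library and degrades gracefully to the serial loop when OpenMP is disabled.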
Comment 18 Joost VandeVondele 2015-11-22 20:48:54 UTC
(In reply to Jerry DeLisle from comment #17)
> I have done some experimenting. Since gcc supports OMP and, I think, to some
> extent ACC, why not come up with a MATMUL that exploits these if present? On
> the darwin platform discussed in comment #12, the performance is excellent.
> Does the implementation darwin provides exploit OpenCL? What is it using? Why
> not enable that on other platforms if present?
> 
> I am going to explore OpenCL and clBLAS to see if I can get it to work.  If
> I am successful, I would like to hide it behind MATMUL if possible.  Any
> other opinions?

Yes, this is tricky. In multithreaded code executing matmul, what is the strategy (nested parallelism, serial, ...)? We usually link in a serial BLAS because threading in the library is usually not good for performance of the code overall, i.e. nested parallelism tends to perform badly. Also, how many threads would you use by default (depending on matrix size, machine load)? Users on an N-core machine might run N jobs in parallel, and not expect those to start several threads each.

Maybe this could be part of the auto-parallelize (or similar) option that gcc has?
Comment 19 Jerry DeLisle 2015-11-22 22:53:10 UTC
If I can get something working, I am thinking of something like -fexternal-blas-n; if -n is not given, then default to the current libblas behaviour. This way users have some control. With GPUs, it is not unusual to have hundreds of cores. We can also, at run time, see if OpenCL is already initialized, which may mean it is used elsewhere, so don't mess with it.
Comment 20 Joost VandeVondele 2015-11-23 06:43:43 UTC
(In reply to Jerry DeLisle from comment #19)
> If I can get something working, I am thinking of something like
> -fexternal-blas-n; if -n is not given, then default to the current libblas
> behaviour. This way users have some control. With GPUs, it is not unusual to
> have hundreds of cores. We can also, at run time, see if OpenCL is
> already initialized, which may mean it is used elsewhere, so don't mess with it.

Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs seems even a tad more tricky. We have a paper on GPU (small) matrix multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf . BTW, another interesting project is the libxsmm library more aimed at small (<128) matrices see : https://github.com/hfp/libxsmm . Not sure if this info is useful in this context, but it might provide inspiration.
Comment 21 Thomas Koenig 2015-11-23 09:20:57 UTC
> Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> seems even a tad more tricky. We have a paper on GPU (small) matrix
> multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

Quite interesting what can be done with GPUs...

> . BTW, another interesting project is the libxsmm library more aimed at
> small (<128) matrices see : https://github.com/hfp/libxsmm . Not sure if
> this info is useful in this context, but it might provide inspiration.

I assume that for  small matrices bordering on the silly
(say, a matrix multiplication with dimensions of (1,2) and (2,1))
the inline code will be faster if the code is compiled with the
right options, due to function call overhead.  I also assume that
libxsmm will become faster quite soon for bigger sizes.

Do you have an idea where the crossover is?
Comment 22 Joost VandeVondele 2015-11-23 09:32:25 UTC
(In reply to Thomas Koenig from comment #21)
> I assume that for  small matrices bordering on the silly
> (say, a matrix multiplication with dimensions of (1,2) and (2,1))
> the inline code will be faster if the code is compiled with the
> right options, due to function call overhead.  I also assume that
> libxsmm will become faster quite soon for bigger sizes.
> 
> Do you have an idea where the crossover is?

I agree that inline should be faster, if the compiler is reasonably smart, if the matrix dimensions are known at compile time (i.e. should be able to generate the same kernel). I haven't checked yet.
Comment 23 Jerry DeLisle 2015-11-23 17:58:43 UTC
(In reply to Thomas Koenig from comment #21)
> > Hidden behind a -fexternal-blas-n switch might be an option. Including GPUs
> > seems even a tad more tricky. We have a paper on GPU (small) matrix
> > multiplication, http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf
> 
> Quite interesting what can be done with GPUs...
> 

Run-of-the-mill graphics processing units have many floating-point compute cores; 128 cores is not unusual, and often a lot more. These cores perform basic operations like a + b * c on scalars, plus other useful functions. Software like OpenCL will compile compute kernels which run efficiently in parallel on these GPU architectures. clBLAS is a runtime library which encapsulates this capability with a BLAS-compatible API. Conceptually, you initialize for particular matrices and hand off the work to the GPU.

As an example, my low-end laptop (300-dollar variety) runs an n-body 3D model with several thousand masses without even stressing the CPU. MATMUL should be doable.

The main GPU competitors are Nvidia, AMD, and Intel. OpenCL is supported on all three.
Comment 24 Jerry DeLisle 2015-11-23 20:18:54 UTC
(In reply to Jerry DeLisle from comment #16)
> For what it's worth:
> 
> $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
> $ ./a.out 
>  Time, MATMUL:    21.2483196       21.254449646000001     1.5055670945599979
> 
>  Time, dgemm:    33.2441711       33.243087289000002      .96260614189671445
> 

Running a sample matrix multiply program on this same platform using the default OpenCL (Mesa on Fedora 22) the machine is achieving:

64 x 64      2.76 Gflops
1000 x 1000  14.10
2000 x 2000  24.4
Comment 25 Janne Blomqvist 2015-11-24 12:22:46 UTC
(In reply to Jerry DeLisle from comment #24)
> (In reply to Jerry DeLisle from comment #16)
> > For what it's worth:
> > 
> > $ gfc pr51119.f90 -lblas -fno-external-blas -Ofast -march=native 
> > $ ./a.out 
> >  Time, MATMUL:    21.2483196       21.254449646000001     1.5055670945599979
> > 
> >  Time, dgemm:    33.2441711       33.243087289000002      .96260614189671445
> > 
> 
> Running a sample matrix multiply program on this same platform using the
> default OpenCL (Mesa on Fedora 22) the machine is achieving:
> 
> 64 x 64      2.76 Gflops
> 1000 x 1000  14.10
> 2000 x 2000  24.4

But, that is not particularly impressive, is it? I don't know about current low end graphics adapters, but at least the high end GPU cards (Tesla) are capable of several Tflops. Of course, there is a non-trivial threshold size to amortize the data movement to/from the GPU.

With the test program from #12, with OpenBLAS (which BTW should be available in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to a single core with the OMP_NUM_THREADS=1 environment variable. This on a machine with 20 2.8 GHz Ivy bridge cores.

I'm not per se against using GPU's, but I think there's a lot of low hanging fruit to be had just by making it easier for users to use a high performance BLAS implementation.
Comment 26 Janne Blomqvist 2015-11-24 12:30:42 UTC
(In reply to Thomas Koenig from comment #15)
> Another issue:  What should we do if the user supplies an external
> subroutine DGEMM which does something unrelated?
> 
> I suppose we should then make DGEMM (and SGEMM) an intrinsic subroutine.

Indeed, this is a potential problem. 

Another related problem is that, apparently, framework Accelerate on OSX uses the f2c ABI. This at least can be worked around by instead using the cblas API (and hence the C ABI), which most BLAS implementations provide anyway.
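
For reference, a sketch of going through the cblas entry point from Fortran with ISO_C_BINDING (this assumes the standard cblas_dgemm prototype; the constants 102 and 111 are the usual CblasColMajor / CblasNoTrans values from cblas.h):

module cblas_iface
  use iso_c_binding, only: c_int, c_double
  implicit none
  integer(c_int), parameter :: CblasColMajor = 102, CblasNoTrans = 111
  interface
     subroutine cblas_dgemm(order, transa, transb, m, n, k, alpha, a, lda, &
                            b, ldb, beta, c, ldc) bind(c, name="cblas_dgemm")
       import :: c_int, c_double
       integer(c_int), value :: order, transa, transb, m, n, k, lda, ldb, ldc
       real(c_double), value :: alpha, beta
       real(c_double) :: a(lda,*), b(ldb,*), c(ldc,*)
     end subroutine cblas_dgemm
  end interface
end module cblas_iface

A call such as
  call cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n, &
                   1.0_c_double, a, n, b, n, 0.0_c_double, c, n)
then works independently of whether the library was built against the gfortran or the f2c convention for the Fortran-level dgemm.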
Comment 27 Thomas Koenig 2015-11-24 14:49:37 UTC
(In reply to Joost VandeVondele from comment #22)

> I agree that inline should be faster, if the compiler is reasonably smart,
> if the matrix dimensions are known at compile time (i.e. should be able to
> generate the same kernel). I haven't checked yet.

If the compiler turns out not to be reasonably smart, file a bug report :-)
Comment 28 Jerry DeLisle 2015-11-24 17:21:54 UTC
(In reply to Janne Blomqvist from comment #25)

> 
> But, that is not particularly impressive, is it? I don't know about current
> low end graphics adapters, but at least the high end GPU cards (Tesla) are
> capable of several Tflops. Of course, there is a non-trivial threshold size
> to amortize the data movement to/from the GPU.

Not even a graphics card, just the GPU on the system chip of a low-end laptop. Not trying to impress, just pointing out that hardware acceleration is fairly ubiquitous these days, so why not just use it? Maybe not important for serious computing, where users already have things like your 20-core machine.
> 
> With the test program from #12, with OpenBLAS (which BTW should be available
> in Fedora 22 as well) I get 337 Gflops/s, or 25 Gflops/s if I restrict it to
> a single core with the OMP_NUM_THREADS=1 environment variable. This on a
> machine with 20 2.8 GHz Ivy bridge cores.
> 
> I'm not per se against using GPU's, but I think there's a lot of low hanging
> fruit to be had just by making it easier for users to use a high performance
> BLAS implementation.

I agree; if an available external BLAS does what is needed, very good. What I am exploring is one of those external BLAS libraries that uses the GPU. Maybe the answer to this PR is "use an external BLAS" and close this PR.
Comment 29 Joost VandeVondele 2015-11-24 17:45:25 UTC
(In reply to Thomas Koenig from comment #27)
> (In reply to Joost VandeVondele from comment #22)
> If the compiler turns out not to be reasonably smart, file a bug report :-)

What is needed for large matrices (in my opinion) is some simple loop tiling, as can, in principle, be achieved with graphite: this is my PR14741.

Good vectorization, which gcc already does well, just requires the proper compiler options for the matmul implementation, i.e. '-O3 -march=native -ffast-math'. However, this would require the Fortran runtime to be compiled with such options, or at least a way to provide specialized (AVX2 etc.) routines.

There is, however, the related PR (inner loop of matmul): PR25621, where an unusual flag combo helps (-fvariable-expansion-in-unroller -funroll-loops).

I think an external BLAS and inlining of small matmuls are good things, but I would expect the default library implementation to reach at least 50% of peak (for e.g. a 4000x4000 matrix), which is not all that hard. Actually, it would be worth an experiment, but a Fortran loop nest which implements a matmul, compiled with ifort, would presumably reach that or higher :-).

These slides show how to reach 90% of peak:
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
the code actually is not too ugly, and I think there is no need for the explicit vector intrinsics with gcc.

I believe I once had a bug report open for small matrices, but this might have been at least partly fixed in the meantime.
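
For concreteness, the central trick in those notes, minus the explicit intrinsics, is a register-blocked micro-kernel: accumulate a small (here 4x4) block of C in a fixed-size local block so the compiler can keep it in registers and vectorize the k loop. A sketch only, not tuned:

subroutine micro_kernel_4x4(k, a, lda, b, ldb, c, ldc)
  ! Adds A(1:4,1:k) * B(1:k,1:4) onto C(1:4,1:4); callers loop over
  ! 4x4 tiles of the full matrices (edge cases omitted here).
  implicit none
  integer, intent(in) :: k, lda, ldb, ldc
  real(8), intent(in)    :: a(lda,*), b(ldb,*)
  real(8), intent(inout) :: c(ldc,*)
  real(8) :: acc(4,4)
  integer :: i, j, p
  acc = 0
  do p = 1, k
     do j = 1, 4
        do i = 1, 4
           acc(i,j) = acc(i,j) + a(i,p) * b(p,j)
        end do
     end do
  end do
  c(1:4,1:4) = c(1:4,1:4) + acc
end subroutine micro_kernel_4x4

With the fixed 4x4 extents the inner loops fully unroll and the sixteen accumulators can stay in registers, which is what the explicit SSE code in the slides does by hand.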
Comment 30 Jerry DeLisle 2015-11-25 00:48:17 UTC
(In reply to Joost VandeVondele from comment #29)

> These slides show how to reach 90% of peak:
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/
> the code actually is not too ugly, and I think there is no need for the
> explicit vector intrinsics with gcc.

The 90% of peak is achieved using SSE registers. I went ahead and built the example, and on my laptop (the slow machine) I get about 4.8 Gflops with a single core. So we could use this example and back off from the SSE optimizations to get an internal MATMUL that is not architecture dependent, and perhaps leave the rest to an external optimized BLAS.
Comment 31 Dominique d'Humieres 2015-11-28 18:18:43 UTC
From comment 27

> > I agree that inline should be faster, if the compiler is reasonably smart,
> > if the matrix dimensions are known at compile time (i.e. should be able to
> > generate the same kernel). I haven't checked yet.
>
> If the compiler turns out not to be reasonably smart, file a bug report :-)

PR68600.
Comment 32 Jerry DeLisle 2016-11-07 21:41:25 UTC
Created attachment 39985 [details]
Proposed patch to get testing going

This patch works pretty well for me. My results are as follows:

gfortran version 6:

$ gfc6 -static -O2 -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.086      0.054      0.060      0.098
    4  2000      0.288      0.302      0.256      0.315
    8  2000      0.799      0.830      2.094      2.246
   16  2000      4.045      2.539      4.198      4.266
   32  2000      5.358      2.301      5.340      5.335
   64  2000      5.411      2.207      5.391      5.395
  128  2000      5.918      2.416      5.919      5.915
  256   477      5.871      2.393      5.870      5.869
  512    59      2.927      1.891      2.927      2.928
 1024     7      1.668      1.182      1.667      1.668
 2048     1      1.763      1.526      1.763      1.763

gfortran version 7:

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.053      0.052      0.043      0.054
    4  2000      0.310      0.304      0.277      0.377
    8  2000      0.704      0.858      1.711      1.758
   16  2000      2.805      2.529      2.798      2.780
   32  2000      4.693      2.210      4.700      4.821
   64  2000      6.768      2.038      6.732      6.782
  128  2000      8.550      2.419      8.647      8.595
  256   477      9.442      2.378      9.425      9.446
  512    59      8.565      1.960      8.641      8.568
 1024     7      8.537      1.178      8.610      8.530
 2048     1      8.576      1.512      8.652      8.582

A portion of the speedup is from using:

#pragma GCC optimize ( "-Ofast" )

which I just discovered. I am thinking addition and subtraction are fairly safe with this option; however, I do not know if it is acceptable for release since it may conflict with something on some platform or even with gcc policy. But hey, it works for me.

Much testing is needed. There is a nice sweet spot at 256. This is on a single thread on a 3.8 GHz core.
Comment 33 Jerry DeLisle 2016-11-07 22:29:21 UTC
With #pragma GCC optimize ( "-O3" )

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.055      0.051      0.038      0.056
    4  2000      0.408      0.274      0.318      0.408
    8  2000      0.644      0.711      1.287      1.831
   16  2000      2.507      2.591      2.521      2.579
   32  2000      3.573      2.300      3.506      3.573
   64  2000      4.628      2.196      4.462      4.629
  128  2000      5.030      2.393      5.304      5.054
  256   477      4.802      2.367      5.573      4.854
  512    59      3.907      1.856      5.234      4.035
 1024     7      3.891      1.178      5.222      4.022
 2048     1      3.901      1.500      5.238      4.033

and with no #pragma it is better than the -O3 version

$ gfc -static -O2 -finline-matmul-limit=0 compare.f90 
$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.054      0.052      0.043      0.057
    4  2000      0.397      0.281      0.316      0.414
    8  2000      0.691      0.773      1.831      1.995
   16  2000      2.493      2.691      2.521      2.512
   32  2000      3.629      2.301      3.623      3.572
   64  2000      4.557      2.072      4.568      4.468
  128  2000      5.282      2.387      5.291      5.284
  256   477      5.629      2.369      5.620      5.605
  512    59      5.215      1.874      5.240      5.216
 1024     7      5.212      1.174      5.217      5.217
 2048     1      5.230      1.499      5.234      5.229

Still a good improvement over gfortran6 on the larger matrices.
Comment 34 Jerry DeLisle 2016-11-08 00:55:25 UTC
Created attachment 39987 [details]
A test program

Just ran some tests comparing reference results and results using -Ofast.

-Ofast does reorder execution.. 

Using kind=8

With DELTA = (reference - result), looking at the ranges of minval and maxval over several random matrices, one sees (4 cases):

Using some smaller matrices; see the attachment for the test program.

 delta minval              delta maxval

-2.2204460492503131E-016   2.2204460492503131E-016

-4.4408920985006262E-016   4.4408920985006262E-016

-2.2204460492503131E-016   2.2204460492503131E-016

-4.4408920985006262E-016   2.2204460492503131E-016

So the wiggling one gets in the least significant bits may matter to some.

With larger arrays (error amplification):

1000 x 1000    -1.0231815394945443E-012   8.8107299234252423E-013

2000 x 2000    -2.7284841053187847E-012   2.6716406864579767E-012

Compiling the test program with -Ofast, so that the reference method has the same optimization as MATMUL, one gets:

2000 x 2000     0.0000000000000000        0.0000000000000000

Opinions welcome.
Comment 35 Thomas Koenig 2016-11-08 08:56:32 UTC
(In reply to Jerry DeLisle from comment #34)

> -Ofast does reorder execution.. 

So does a block algorithm.

> Opinions welcome.

I'd say go for -Ofast, or at least its subset that enables
reordering of expressions and thus vectorization.

It might be interesting to check the GOTO BLAS for accuracy, as well.

We could also ask on c.l.f. what people's expectations are, but I
shudder to think about the lectures on PL/I that would ensue :-)
Comment 36 Joost VandeVondele 2016-11-08 08:57:49 UTC
(In reply to Jerry DeLisle from comment #34)
> -Ofast does reorder execution.. 
> Opinions welcome.

That is absolutely OK for a matmul, and all techniques to get near peak performance require that (e.g. use of fma, blocking, etc.). 

I didn't realize that one can easily put pragmas for single routines, so you could experiment with something like 

#pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller -funroll-loops" )
Comment 37 Joost VandeVondele 2016-11-08 10:03:42 UTC
(In reply to Joost VandeVondele from comment #36)
> #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> -funroll-loops" )

and really beneficial for larger matrices would be 

-floop-nest-optimize

in particular the blocking (it would be an additional motivation for PR14741 and work on graphite in general), don't know if one can give the parameter for the blocking. In principle the loop-nest-optimization, together with the -Ofast (and ideally -march=native, which we can't have in libgfortran, I assume) would yield near peak performance.
Comment 38 Thomas Koenig 2016-11-08 10:32:15 UTC
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
> 
> and really beneficial for larger matrices would be 
> 
> -floop-nest-optimize
> 
> in particular the blocking (it would be an additional motivation for PR14741
> and work on graphite in general), don't know if one can give the parameter
> for the blocking. In principle the loop-nest-optimization, together with the
> -Ofast (and ideally -march=native, which we can't have in libgfortran, I
> assume) would yield near peak performance.

The algorithm that Jerry implemented already uses a very nice unrolling/
blocking scheme.  I doubt that the gcc algorithms can add much to that.

Regarding -march=native, that could really be an improvement,
especially with -mavx.  I wonder if it is possible to have
architecture-specific versions of library functions?  We could
select the right routine depending on the -march flag.  Worth
a question on the gcc list, probably (but definitely _not_ a
prerequisite for this going into gcc 7).

Of course, we _could_ also try to bring blocking to the inline
version (PR 66189), risking insanity for the implementer :-)

Jerry, what Netlib code were you basing your code on?
Comment 39 Jerry DeLisle 2016-11-08 14:55:40 UTC
(In reply to Thomas Koenig from comment #38)
> 
> Jerry, what Netlib code were you basing your code on?

http://www.netlib.org/blas/index.html#_level_3_blas_tuned_for_single_processors_with_caches

I used the dgemm version from the download zip as a starting point, stripped it of unneeded code, converted it to C with f2c, and made further edits to clean it up.
Comment 40 Jerry DeLisle 2016-11-08 15:38:29 UTC
(In reply to Joost VandeVondele from comment #37)
> (In reply to Joost VandeVondele from comment #36)
> > #pragma GCC optimize ( "-Ofast -fvariable-expansion-in-unroller
> > -funroll-loops" )
>
Using the following (I found it necessary to split the options onto separate lines):
#pragma GCC optimize ( "-Ofast" )
#pragma GCC optimize ( "-funroll-loops" )
#pragma GCC optimize ( "-fvariable-expansion-in-unroller" )

$ gfc -static -Ofast -finline-matmul-limit=0 compare.f90 
[jerry@quasar pr51119]$ ./a.out 
 =========================================================
 ================            MEASURED GIGAFLOPS          =
 =========================================================
                 Matmul                           Matmul
                 fixed                 Matmul     variable
 Size  Loops     explicit   refMatmul  assumed    explicit
 =========================================================
    2  2000      0.055      0.048      0.042      0.055
    4  2000      0.366      0.236      0.299      0.368
    8  2000      0.628      0.673      1.610      1.833
   16  2000      2.876      2.765      2.821      2.930
   32  2000      4.681      3.382      4.812      4.763
   64  2000      6.742      2.817      6.760      6.764
  128  2000      8.532      3.194      7.852      8.539
  256   477      9.420      3.319      9.053      9.420
  512    59      8.435      2.358      8.319      8.390
 1024     7      8.493      1.368      8.379      8.444
 2048     1      8.499      1.666      8.385      8.448
Comment 41 Jerry DeLisle 2016-11-14 18:26:24 UTC
Created attachment 40039 [details]
A test program I am using for timings

Timing program used for earlier posts.
Comment 42 Jerry DeLisle 2016-11-15 23:03:32 UTC
Author: jvdelisle
Date: Tue Nov 15 23:03:00 2016
New Revision: 242462

URL: https://gcc.gnu.org/viewcvs?rev=242462&root=gcc&view=rev
Log:
2016-11-15  Jerry DeLisle  <jvdelisle@gcc.gnu.org>
	    Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR libgfortran/51119
	* Makefile.am: Add new optimization flags matmul.
	* Makefile.in: Regenerate.
	* m4/matmul.m4: For the case of all strides = 1, implement a
	fast blocked matrix multiply. Fix some whitespace.
	* generated/matmul_c10.c: Regenerate.
	* generated/matmul_c16.c: Regenerate.
	* generated/matmul_c4.c: Regenerate.
	* generated/matmul_c8.c: Regenerate.
	* generated/matmul_i1.c: Regenerate.
	* generated/matmul_i16.c: Regenerate.
	* generated/matmul_i2.c: Regenerate.
	* generated/matmul_i4.c: Regenerate.
	* generated/matmul_i8.c: Regenerate.
	* generated/matmul_r10.c: Regenerate.
	* generated/matmul_r16.c: Regenerate.
	* generated/matmul_r4.c: Regenerate.
	* generated/matmul_r8.c: Regenerate.

2016-11-15  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR libgfortran/51119
	* gfortran.dg/matmul_12.f90: New test case.

Added:
    trunk/gcc/testsuite/gfortran.dg/matmul_12.f90
Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
    trunk/libgfortran/generated/matmul_c10.c
    trunk/libgfortran/generated/matmul_c16.c
    trunk/libgfortran/generated/matmul_c4.c
    trunk/libgfortran/generated/matmul_c8.c
    trunk/libgfortran/generated/matmul_i1.c
    trunk/libgfortran/generated/matmul_i16.c
    trunk/libgfortran/generated/matmul_i2.c
    trunk/libgfortran/generated/matmul_i4.c
    trunk/libgfortran/generated/matmul_i8.c
    trunk/libgfortran/generated/matmul_r10.c
    trunk/libgfortran/generated/matmul_r16.c
    trunk/libgfortran/generated/matmul_r4.c
    trunk/libgfortran/generated/matmul_r8.c
    trunk/libgfortran/m4/matmul.m4
Comment 43 Janne Blomqvist 2016-11-16 12:35:31 UTC
Compile warnings caused by this patch:

cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran but not for C
cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but not for C
Comment 44 Jerry DeLisle 2016-11-16 16:31:12 UTC
(In reply to Janne Blomqvist from comment #43)
> Compile warnings caused by this patch:
> 
> cc1: warning: command line option ‘-fno-protect-parens’ is valid for Fortran
> but not for C
> cc1: warning: command line option ‘-fstack-arrays’ is valid for Fortran but
> not for C

Yes I am aware of these. I was willing to live with them, but if it is a problem, we can remove those options easy enough.
Comment 45 Dominique d'Humieres 2016-11-16 16:37:29 UTC
I have some tests coming from pr37131 which now fail due to too stringent comparisons between REALs. This is illustrated by the following test:

program main
  implicit none
  integer, parameter :: factor=100
  integer, parameter :: n = 2*factor, m = 3*factor, count = 4*factor
  real :: a(m, count), b(count,n)
  real, dimension(m,n) :: c1, c2, c3, c4, c5
  real :: at(count,m), bt(n,count)
  real :: c_t(n)
  integer :: i,j,k
  call random_number(a)
  call random_number(b)
  at = transpose(a)
  bt = transpose(b)

  c1 = matmul(a,b)

  c3 = 0
  do i=1,m
     do j=1,n
        do k=1, count
           c3(i,j) = c3(i,j) + at(k,i)*b(k,j)
        end do
     end do
  end do
  print *, maxval(abs(c3/c1-1)/epsilon(c1))
  if (any(abs(c3/c1-1) > sqrt(real(n))*epsilon(c1))) call abort

end program main

Here maxval(abs(c3/c1-1)/epsilon(c1)) fluctuates around 11 with -ffrontend-optimize and around 3 with -fno-frontend-optimize. The original test was abs(c3-c1) > 1e-5 with c1 ~ 100.

Although I don't think there is anything wrong with these results, it may be worth some further investigations.
Comment 46 Thomas Koenig 2016-11-16 17:51:18 UTC
(In reply to Jerry DeLisle from comment #44)

> Yes I am aware of these. I was willing to live with them, but if it is a
> problem, we can remove those options easy enough.

I think it is no big deal, but on the whole I would prefer
not to have the warnings.

So, please go ahead and remove these options. The patch to do so
is either pre-approved or obvious and simple; it is your choice :-)
Comment 47 Jerry DeLisle 2016-11-16 21:54:58 UTC
Author: jvdelisle
Date: Wed Nov 16 21:54:25 2016
New Revision: 242518

URL: https://gcc.gnu.org/viewcvs?rev=242518&root=gcc&view=rev
Log:
2016-11-16  Jerry DeLisle  <jvdelisle@gcc.gnu.org>

	PR libgfortran/51119
	* Makefile.am: Remove -fno-protect-parens -fstack-arrays.
	* Makefile.in: Regenerate.

Modified:
    trunk/libgfortran/ChangeLog
    trunk/libgfortran/Makefile.am
    trunk/libgfortran/Makefile.in
Comment 48 Jerry DeLisle 2016-12-04 06:23:30 UTC
I think we can close this now.
Comment 49 Thomas Koenig 2017-02-26 13:23:17 UTC
Author: tkoenig
Date: Sun Feb 26 13:22:43 2017
New Revision: 245745

URL: https://gcc.gnu.org/viewcvs?rev=245745&root=gcc&view=rev
Log:
2017-02-26  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/51119
	* options.c (gfc_post_options): Set default limit for matmul
	inlining to 30.
	* invoke.texi: Document change.

2017-02-26  Thomas Koenig  <tkoenig@gcc.gnu.org>

	PR fortran/51119
	* gfortran.dg/inline_matmul_1.f90: Scan optimized dump instead
	of original.
	* gfortran.dg/inline_matmul_11.f90: Likewise.
	* gfortran.dg/inline_matmul_9.f90: Likewise.
	* gfortran.dg/matmul_13.f90: New test.
	* gfortran.dg/matmul_14.f90: New test.


Added:
    trunk/gcc/testsuite/gfortran.dg/matmul_13.f90
    trunk/gcc/testsuite/gfortran.dg/matmul_14.f90
Modified:
    trunk/gcc/fortran/ChangeLog
    trunk/gcc/fortran/invoke.texi
    trunk/gcc/fortran/options.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_1.f90
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_11.f90
    trunk/gcc/testsuite/gfortran.dg/inline_matmul_9.f90