Expected results: (1) to be at least as fast as the MATMUL from the library when compiled with the same options (-O2 -ftree-vectorize -funroll-loops); (2) to be at least as fast as dgemm from LAPACK when compiled with the same options.

Options tested:
(a) -O2 -ftree-vectorize -funroll-loops -fno-frontend-optimize, i.e., MATMUL from the library;
(b) -O2 -ftree-vectorize -funroll-loops, i.e., inlined MATMUL;
(c) -Ofast -march=native -funroll-loops;
(d) -O2 -ftree-vectorize.

Timings in Gflops/s on a Core i7 at 2.8 GHz (turbo 3.8 GHz) show that neither (1) nor (2) is true (comparing columns 4 and 6 gives an idea of the timing accuracy).

                    (a)              (b)
 Size   Loops   Matmul   dgemm   Matmul   dgemm
===================================================
    2  200000    0.360   0.218    0.723   0.221
    4  200000    1.246   0.959    1.379   0.969
    8  200000    2.098   2.396    2.186   2.385
   16  200000    3.748   3.648    2.920   3.645
   32  200000    5.386   5.406    3.096   5.418
   64   30757    6.364   6.385    3.220   6.494
  128    3829    6.362   6.760    3.256   6.702
  256     477    6.515   6.527    3.164   6.444
  512      59    6.313   6.634    3.189   6.675
 1024       7    4.796   4.842    2.935   4.853
 2048       1    4.026   4.032    2.824   3.996
 4096       1    3.355   3.467    2.652   3.475

                    (c)              (d)
 Size   Loops   Matmul   dgemm   Matmul   dgemm
========================================================
    2  200000    0.403   0.172    0.919   0.204
    4  200000    0.956   0.799    1.668   1.104
    8  200000    1.796   2.089    2.060   2.310
   16  200000    2.948   4.297    2.253   3.475
   32  200000    4.119   6.219    2.049   4.229
   64   30757    5.174   7.652    2.268   4.464
  128    3829    5.042   6.985    2.371   4.353
  256     477    5.052   6.492    2.423   4.696
  512      59    5.136   6.738    2.421   4.704
 1024       7    3.978   5.075    2.361   4.012
 2048       1    3.476   4.304    2.372   3.543
 4096       1    2.966   3.307    2.370   3.333
Created attachment 36864 [details]
Code used for the timings
Created attachment 36867 [details]
Modified benchmark
Created attachment 36868 [details]
Modified benchmark (really this time)

Hi Dominique,

I think you are seeing the effects of the inefficiencies of assumed-shape
arrays. If you want to use matmul on very small matrix sizes, it is best
to use fixed-size explicit arrays.

Below are the results of the modified benchmark (with some changes to keep
the optimizer honest, such as a call to a dummy subroutine) on my rather
dated home box:

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.948          0.072     0.142     0.411
    4  200000      1.711          0.417     0.534     0.861
    8  200000      2.314          0.953     0.858     1.076
   16  200000      1.745          1.276     0.918     1.000
   32  200000      1.459          1.456     1.371     1.436
   64   30757      1.501          1.440     1.360     1.393
  128    3829      1.586          1.544     1.557     1.529
  256     477      1.531          1.519     1.544     1.507
  512      59      1.315          1.290     1.263     1.231
 1024       7      1.110          1.081     1.069     1.053
 2048       1      1.095          1.086     1.081     1.058
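Roughly speaking, the difference is in how the dummy argument arrives.
Here is a simplified C sketch of why assumed shape costs more (a
hypothetical layout for illustration, not the real libgfortran
descriptor):

#include <stddef.h>

/* assumed-shape arguments are passed via a descriptor carrying
   per-dimension strides (simplified, hypothetical layout) */
typedef struct
{
  double *base;
  struct { ptrdiff_t stride, lbound, ubound; } dim[2];
} desc2;

/* the callee must honor the strides, so unit stride cannot be
   assumed and the loops vectorize poorly */
double trace_assumed (const desc2 *a, ptrdiff_t n)
{
  double t = 0.0;
  for (ptrdiff_t i = 0; i < n; i++)
    t += a->base[i * a->dim[0].stride + i * a->dim[1].stride];
  return t;
}

/* an explicit-shape (or fixed-size) argument is just a pointer to
   contiguous column-major data with unit stride */
double trace_explicit (const double *a, ptrdiff_t n)
{
  double t = 0.0;
  for (ptrdiff_t i = 0; i < n; i++)
    t += a[i * n + i];
  return t;
}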
Created attachment 36869 [details]
Thomas's program with a modified dgemm

The dgemm in this example is a stripped-down version of an "optimized for
cache" version from netlib.org. I stripped out a lot of the unused code.
The results show better performance for larger arrays. Maybe we could
model the library routines after this and invoke them for larger arrays.

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
==============================================================================
    2  200000      1.752          0.042     0.124     0.295
    4  200000      2.172          0.314     0.434     0.704
    8  200000      2.293          1.071     0.721     1.127
   16  200000      2.826          1.533     0.972     1.468
   32  200000      2.707          1.666     1.184     2.154
   64   30757      2.726          1.853     1.192     2.299
  128    3829      2.641          1.965     1.379     2.542
  256     477      2.661          2.001     1.384     2.594
  512      59      1.740          2.011     1.147     1.746
 1024       7      1.344          2.024     1.070     1.355
 2048       1      1.305          2.026     1.088     1.312
Another interesting data point. I deleted the DGEMM implementation from
the file and linked against the serial version of OpenBLAS. OK, OpenBLAS
is based on GotoBLAS, so we have to expect a hit for large matrices.
Figures:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.944          0.035     0.136     0.412
    4  200000      1.712          0.257     0.458     0.738
    8  200000      2.080          1.162     0.824     1.077
   16  200000      1.697          3.104     0.939     0.995
   32  200000      1.450          4.814     1.388     1.426
   64   30757      1.485          5.978     1.351     1.371
  128    3829      1.557          6.857     1.534     1.522
  256     477      1.568          7.017     1.589     1.537

So far so good. It looks as if the crossover point between the inline and
the dgemm versions is between 8 and 16, so let us try this:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops -finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.948          0.039     0.156     0.464
    4  200000      1.999          0.305     0.542     0.859
    8  200000      2.435          1.359     0.962     1.255
   16  200000      0.802          3.102     0.798     0.799
   32  200000      4.878          4.990     4.906     4.906
   64   30757      6.045          6.062     5.977     5.968

So, if the user really wants us to call an external BLAS, we had better do
so directly and not through our library routines.
> I think you are seeing the effects of the inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Well, the problem is that MATMUL inlining is the default. IMO it should be
restricted to fixed-size explicit arrays (and small matrices?), at least
for the 6.1 version.

> Created attachment 36869 [details]
> Thomas's program with a modified dgemm
>
> The dgemm in this example is a stripped-down version of an "optimized for
> cache" version from netlib.org. I stripped out a lot of the unused code.

It is probably too late for 6.1, but the results are quite impressive
(~30 Gflops/s peak):

[Book15] f90/bug% gfc -Ofast timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000      0.969          0.104     0.360     0.368
    4  200000      5.821          0.774     1.381     1.049
    8  200000      5.415          2.970     2.316     2.342
   16  200000      6.455          4.917     2.738     3.225
   32  200000      7.332          5.964     2.893     4.117
   64   30757      5.565          7.277     2.785     3.830
  128    3829      4.790          7.982     2.981     4.384
  256     477      4.674          8.375     3.077     4.675
  512      59      4.797          8.200     3.156     4.786
 1024       7      3.967          8.370     2.896     4.050
 2048       1      3.693          8.414     2.804     3.650

[Book15] f90/bug% gfc -Ofast -mavx timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000      0.956          0.106     0.372     0.469
    4  200000      7.805          0.715     1.334     1.462
    8  200000      7.520          3.222     2.292     3.482
   16  200000      3.001          6.406     2.671     4.917
   32  200000      8.886          8.530     2.900     6.136
   64   30757     10.203         10.998     2.677     6.770
  128    3829      6.742         13.367     2.831     6.774
  256     477      6.435         13.979     2.906     6.049
  512      59      6.592         15.041     2.991     6.273
 1024       7      5.247         14.639     2.775     4.922
 2048       1      4.309         13.976     2.739     4.176

Note a problem when 16x16 matrices are inlined with -mavx (I'll
investigate and file a PR for it).
(In reply to Dominique d'Humieres from comment #6)
> Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
> and file a PR for it).

That's a good find!

I ran locally on Haswell and got these numbers, including openblas and
libxsmm:

./a.out
 Size   Loops   Matmul           newmatmul   dgemm-like   dgemm
                fixed explicit   internal    libxsmm      openblas
=====================================================================================
    2  200000      1.562           0.107       0.104       0.139
    4  200000      6.781           0.779       1.012       0.887
    8  200000      7.424           3.360       6.150       4.732
   16  200000      2.954           7.290      14.421      11.527
   32  200000     10.401          10.251      24.396      18.071
   64   30757     12.696          14.196      27.385      24.547
  128    3829      8.646          17.684      31.460      31.530
  256     477      7.834          19.123      37.457      37.471
  512      59      8.064          19.473      40.738      40.755
 1024       7      8.334          19.475      40.931      41.112
 2048       1      3.042          19.157      41.225      41.279

So the 'newmatmul' code gets about 50% of peak. Inlined matmul is good up
to size 8/16, from 16 to 64 libxsmm wins, and above 64 openblas is better.
For the small sizes this is mostly due to the eliminated call overhead, I
think.
Created attachment 36887 [details]
A faster version

I took the example code found at
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/, where the
register-based vector computations are done explicitly via the SSE
registers, and converted it to use the built-in gcc vector extensions. I
had to experiment a little to find equivalents for some of the operations
in the original code.

With only -O2 and -march=native I am getting good results. I still need to
roll this into the other test program to confirm that the Gflops are being
computed correctly. The Diff value is a comparison against the naive
reference results, to check that the computation is correct.

MY_MMult = [
Size:   40, Gflops: 1.828571e+00, Diff: 2.664535e-15
Size:   80, Gflops: 3.696751e+00, Diff: 7.105427e-15
Size:  120, Gflops: 4.051583e+00, Diff: 1.065814e-14
Size:  160, Gflops: 4.015686e+00, Diff: 1.421085e-14
Size:  200, Gflops: 4.029212e+00, Diff: 2.131628e-14
Size:  240, Gflops: 3.972414e+00, Diff: 2.486900e-14
Size:  280, Gflops: 3.881188e+00, Diff: 2.842171e-14
Size:  320, Gflops: 3.872371e+00, Diff: 3.552714e-14
Size:  360, Gflops: 3.887676e+00, Diff: 4.973799e-14
Size:  400, Gflops: 3.862052e+00, Diff: 4.973799e-14
Size:  440, Gflops: 3.886575e+00, Diff: 4.973799e-14
Size:  480, Gflops: 3.910124e+00, Diff: 6.039613e-14
Size:  520, Gflops: 3.863706e+00, Diff: 6.394885e-14
Size:  560, Gflops: 3.976947e+00, Diff: 6.750156e-14
Size:  600, Gflops: 4.002631e+00, Diff: 7.460699e-14
Size:  640, Gflops: 3.992507e+00, Diff: 8.171241e-14
Size:  680, Gflops: 3.964570e+00, Diff: 9.237056e-14
Size:  720, Gflops: 3.973661e+00, Diff: 1.101341e-13
Size:  760, Gflops: 3.982346e+00, Diff: 1.065814e-13
Size:  800, Gflops: 3.869291e+00, Diff: 8.881784e-14
Size:  840, Gflops: 3.936271e+00, Diff: 1.065814e-13
Size:  880, Gflops: 3.931259e+00, Diff: 1.030287e-13
Size:  920, Gflops: 3.912907e+00, Diff: 1.207923e-13
Size:  960, Gflops: 3.938391e+00, Diff: 1.278977e-13
Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13
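The shape of the conversion is roughly this — a minimal sketch of a 4x4
micro-kernel using the gcc vector extensions, not the attached code
itself; it assumes column-major storage with 16-byte-aligned columns
(even leading dimensions), and the function name and signature are only
illustrative:

#include <stddef.h>

/* two doubles, maps onto one SSE register */
typedef double v2df __attribute__ ((vector_size (16)));

/* C(0:3,0:3) += A(0:3,0:k-1) * B(0:k-1,0:3), column-major, with
   leading dimensions lda/ldb/ldc; data assumed 16-byte aligned */
static void
kernel_4x4 (size_t k, const double *a, size_t lda,
            const double *b, size_t ldb, double *c, size_t ldc)
{
  v2df c0l = {0}, c0h = {0}, c1l = {0}, c1h = {0};
  v2df c2l = {0}, c2h = {0}, c3l = {0}, c3h = {0};

  for (size_t p = 0; p < k; p++)
    {
      /* four elements of column p of A */
      v2df al = *(const v2df *) (a + p * lda);
      v2df ah = *(const v2df *) (a + p * lda + 2);

      /* broadcast B(p,j) for the four columns of the C block */
      v2df b0 = { b[p],           b[p]           };
      v2df b1 = { b[p + ldb],     b[p + ldb]     };
      v2df b2 = { b[p + 2 * ldb], b[p + 2 * ldb] };
      v2df b3 = { b[p + 3 * ldb], b[p + 3 * ldb] };

      /* rank-one update of the 4x4 block; gcc lowers these to
         mulpd/addpd (or FMAs with a suitable -march) */
      c0l += al * b0;  c0h += ah * b0;
      c1l += al * b1;  c1h += ah * b1;
      c2l += al * b2;  c2h += ah * b2;
      c3l += al * b3;  c3h += ah * b3;
    }

  /* accumulate the block back into C */
  *(v2df *) (c + 0 * ldc) += c0l;  *(v2df *) (c + 0 * ldc + 2) += c0h;
  *(v2df *) (c + 1 * ldc) += c1l;  *(v2df *) (c + 1 * ldc + 2) += c1h;
  *(v2df *) (c + 2 * ldc) += c2l;  *(v2df *) (c + 2 * ldc + 2) += c2h;
  *(v2df *) (c + 3 * ldc) += c3l;  *(v2df *) (c + 3 * ldc + 2) += c3h;
}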
> I took the example code found at
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/, where the
> register-based vector computations are done explicitly via the SSE
> registers, and converted it to use the built-in gcc vector extensions. I
> had to experiment a little to find equivalents for some of the operations
> in the original code.

Nice one, I think we can use this as a basis for a new library function in
7.1. It will probably be necessary to add some zero-padding in places
where the matrix dimensions are not divisible by four.
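For the edges, one possibility (just a sketch of the idea, not tested
code) is to copy the leftover rows and columns into a zero-padded buffer
so the same full-width kernel can be reused:

/* copy an m x n edge block (m, n <= 4) of column-major A, leading
   dimension lda, into a zero-padded 4x4 buffer */
static void
pad_block (double buf[16], const double *a, size_t lda,
           size_t m, size_t n)
{
  for (size_t j = 0; j < 4; j++)
    for (size_t i = 0; i < 4; i++)
      buf[j * 4 + i] = (i < m && j < n) ? a[j * lda + i] : 0.0;
}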
> I think you are seeing the effects of the inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Then IMO the matmul inlining should be restricted to fixed-size explicit
arrays. Could this be done before the release of gcc-6?
(In reply to Jerry DeLisle from comment #8)
> Created attachment 36887 [details]
> A faster version
--- snip ---

(In reply to Dominique d'Humieres from comment #10)
> Then IMO the matmul inlining should be restricted to fixed-size explicit
> arrays. Could this be done before the release of gcc-6?

I was experimenting some more here a few days ago. I really think that
inlining should be disabled above some threshold. On larger arrays the
runtime library outperforms the inline version, and right now by default
the runtime routines are never used unless you provide
-fno-frontend-optimize, which is counter-intuitive for the larger arrays.

If one compiles with -march=native -mavx -Ofast etc., the inline version
can do fairly well on the larger arrays; however, when we update the
runtime routines to something like what is shown in comment #8, it will
make even more sense not to inline all the time. (Unless of course we
further improve the front-end optimization to do better.)
(In reply to Jerry DeLisle from comment #11)
> I was experimenting some more here a few days ago. I really think that
> inlining should be disabled above some threshold. On larger arrays the
> runtime library outperforms the inline version, and right now by default
> the runtime routines are never used unless you provide
> -fno-frontend-optimize, which is counter-intuitive for the larger arrays.

May I suggest reading the docs? ;-)

`-finline-matmul-limit=N'
     When front-end optimization is active, some calls to the `MATMUL'
     intrinsic function will be inlined.  This may result in code size
     increase if the size of the matrix cannot be determined at compile
     time, as code for both cases is generated.  Setting
     `-finline-matmul-limit=0' will disable inlining in all cases.
     Setting this option with a value of N will produce inline code for
     matrices with size up to N.  If the matrices involved are not
     square, the size comparison is performed using the geometric mean
     of the dimensions of the argument and result matrices.

> If one compiles with -march=native -mavx -Ofast etc., the inline version
> can do fairly well on the larger arrays; however, when we update the
> runtime routines to something like what is shown in comment #8, it will
> make even more sense not to inline all the time. (Unless of course we
> further improve the front-end optimization to do better.)

We can give this option a reasonable default value. The current status is:

     The default value for N is the value specified for
     `-fblas-matmul-limit' if this option is specified, or unlimited
     otherwise.
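So, for example, something like this (the file name is only illustrative)
will inline matmul up to size 16 and call the library routines above that:

gfortran -O2 -ftree-vectorize -funroll-loops -finline-matmul-limit=16 bench.f90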
(In reply to Thomas Koenig from comment #12)
> (In reply to Jerry DeLisle from comment #11)
--- snip ---
> May I suggest reading the docs? ;-)
--- snip ---
> The default value for N is the value specified for
> `-fblas-matmul-limit' if this option is specified, or unlimited
> otherwise.

Oh gosh! Sorry about that, Thomas. I just did not see it, and I was even
looking for it because I thought it was there!

This is excellent, because I am working on a modification to the run-time
libraries. This will give us something like:

====================================
                 Matmul
                 fixed
 Size   Loops    explicit  NewMatmul
====================================
   16    2000      1.496      1.719
   32    2000      2.427      1.784
   64    2000      1.343      1.967
  128    2000      1.657      2.113
  256     477      2.660      2.185
  512      59      2.027      2.195
 1024       7      1.530      2.208
 2048       1      1.516      2.210

On this particular machine, the inlining at high optimization levels has
some sweet spots, at a size of 32x32 for example, so allowing this tuning
is essential, depending on the user's application.
Question: would it make sense to add an option so that only matrices whose
size is known at compile time are inlined? Something like
-finline-matmul-size-var=0 (to disable inlining for variable sizes) and
-finline-matmul-size-fixed=5 (to inline fixed sizes up to 5 only).
I think that with the current status, where we have -finline-matmul-limit=30 by default, we can close this bug. Agreed?
(In reply to Thomas Koenig from comment #15)
> I think that with the current status, where
> we have -finline-matmul-limit=30 by default, we
> can close this bug.
>
> Agreed?

Yes, this can be closed.