Expected results: (1) to be at least as fast as the MATMUL from the library when compiled with the same options (-O2 -ftree-vectorize -funroll-loops); (2) to be at least as fast as dgemm from LAPACK when compiled with the same options.

Options tested:
(a) -O2 -ftree-vectorize -funroll-loops -fno-frontend-optimize, i.e., MATMUL from the library;
(b) -O2 -ftree-vectorize -funroll-loops, i.e., inlined MATMUL;
(c) -Ofast -march=native -funroll-loops;
(d) -O2 -ftree-vectorize.

Timings in Gflops/s on a Core i7 at 2.8 GHz (turbo 3.8 GHz) show that neither (1) nor (2) is true (comparing columns 4 and 6 gives an idea of the timing accuracy).

                    (a)              (b)
 Size   Loops   Matmul   dgemm   Matmul   dgemm
===================================================
    2  200000    0.360   0.218    0.723   0.221
    4  200000    1.246   0.959    1.379   0.969
    8  200000    2.098   2.396    2.186   2.385
   16  200000    3.748   3.648    2.920   3.645
   32  200000    5.386   5.406    3.096   5.418
   64   30757    6.364   6.385    3.220   6.494
  128    3829    6.362   6.760    3.256   6.702
  256     477    6.515   6.527    3.164   6.444
  512      59    6.313   6.634    3.189   6.675
 1024       7    4.796   4.842    2.935   4.853
 2048       1    4.026   4.032    2.824   3.996
 4096       1    3.355   3.467    2.652   3.475

                    (c)              (d)
 Size   Loops   Matmul   dgemm   Matmul   dgemm
========================================================
    2  200000    0.403   0.172    0.919   0.204
    4  200000    0.956   0.799    1.668   1.104
    8  200000    1.796   2.089    2.060   2.310
   16  200000    2.948   4.297    2.253   3.475
   32  200000    4.119   6.219    2.049   4.229
   64   30757    5.174   7.652    2.268   4.464
  128    3829    5.042   6.985    2.371   4.353
  256     477    5.052   6.492    2.423   4.696
  512      59    5.136   6.738    2.421   4.704
 1024       7    3.978   5.075    2.361   4.012
 2048       1    3.476   4.304    2.372   3.543
 4096       1    2.966   3.307    2.370   3.333
Created attachment 36864 [details]
Code used for the timings
Created attachment 36867 [details]
Modified benchmark
Created attachment 36868 [details]
Modified benchmark (really this time)

Hi Dominique,

I think you are seeing the effects of the inefficiencies of assumed-shape
arrays. If you want to use matmul on very small matrix sizes, it is best
to use fixed-size explicit arrays.

Below are the results of the modified benchmark (with some changes to keep
the optimizer honest, such as a call to a dummy subroutine) on my rather
dated home box:

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.948          0.072     0.142     0.411
    4  200000      1.711          0.417     0.534     0.861
    8  200000      2.314          0.953     0.858     1.076
   16  200000      1.745          1.276     0.918     1.000
   32  200000      1.459          1.456     1.371     1.436
   64   30757      1.501          1.440     1.360     1.393
  128    3829      1.586          1.544     1.557     1.529
  256     477      1.531          1.519     1.544     1.507
  512      59      1.315          1.290     1.263     1.231
 1024       7      1.110          1.081     1.069     1.053
 2048       1      1.095          1.086     1.081     1.058
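Roughly speaking, the difference is in how the dummy argument arrives.
Here is a simplified C sketch of why assumed shape costs more (a
hypothetical layout for illustration, not the real libgfortran
descriptor):

#include <stddef.h>

/* assumed-shape arguments are passed via a descriptor carrying
   per-dimension strides (simplified, hypothetical layout) */
typedef struct
{
  double *base;
  struct { ptrdiff_t stride, lbound, ubound; } dim[2];
} desc2;

/* the callee must honor the strides, so unit stride cannot be
   assumed and the loops vectorize poorly */
double trace_assumed (const desc2 *a, ptrdiff_t n)
{
  double t = 0.0;
  for (ptrdiff_t i = 0; i < n; i++)
    t += a->base[i * a->dim[0].stride + i * a->dim[1].stride];
  return t;
}

/* an explicit-shape (or fixed-size) argument is just a pointer to
   contiguous column-major data with unit stride */
double trace_explicit (const double *a, ptrdiff_t n)
{
  double t = 0.0;
  for (ptrdiff_t i = 0; i < n; i++)
    t += a[i * n + i];
  return t;
}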
Created attachment 36869 [details]
Thomas's program with a modified dgemm

The dgemm in this example is a stripped-down version of an "optimized for
cache" version from netlib.org. I stripped out a lot of the unused code.
The results show better performance for larger arrays. Maybe we could
model the library routines after this and invoke them for larger arrays.

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
==============================================================================
    2  200000      1.752          0.042     0.124     0.295
    4  200000      2.172          0.314     0.434     0.704
    8  200000      2.293          1.071     0.721     1.127
   16  200000      2.826          1.533     0.972     1.468
   32  200000      2.707          1.666     1.184     2.154
   64   30757      2.726          1.853     1.192     2.299
  128    3829      2.641          1.965     1.379     2.542
  256     477      2.661          2.001     1.384     2.594
  512      59      1.740          2.011     1.147     1.746
 1024       7      1.344          2.024     1.070     1.355
 2048       1      1.305          2.026     1.088     1.312
Another interesting data point. I deleted the DGEMM implementation from
the file and linked against the serial version of OpenBLAS. OK, OpenBLAS
is based on GotoBLAS, so we have to expect a hit for large matrices.
Figures:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.944          0.035     0.136     0.412
    4  200000      1.712          0.257     0.458     0.738
    8  200000      2.080          1.162     0.824     1.077
   16  200000      1.697          3.104     0.939     0.995
   32  200000      1.450          4.814     1.388     1.426
   64   30757      1.485          5.978     1.351     1.371
  128    3829      1.557          6.857     1.534     1.522
  256     477      1.568          7.017     1.589     1.537

So far so good. It looks as if the crossover point between the inline and
the dgemm versions is between 8 and 16, so let us try this:

ig25@linux-fd1f:~/Krempel/Bench> gfortran -O2 -funroll-loops -finline-matmul-limit=12 -fexternal-blas bench-3.f90 -lopenblas_serial
ig25@linux-fd1f:~/Krempel/Bench> ./a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000     11.948          0.039     0.156     0.464
    4  200000      1.999          0.305     0.542     0.859
    8  200000      2.435          1.359     0.962     1.255
   16  200000      0.802          3.102     0.798     0.799
   32  200000      4.878          4.990     4.906     4.906
   64   30757      6.045          6.062     5.977     5.968

So, if the user really wants us to call an external BLAS, we had better do
so directly and not through our library routines.
> I think you are seeing the effects of the inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Well, the problem is that MATMUL inlining is the default. IMO it should be
restricted to fixed-size explicit arrays (and small matrices?), at least
for the 6.1 version.

> Created attachment 36869 [details]
> Thomas's program with a modified dgemm
>
> The dgemm in this example is a stripped-down version of an "optimized for
> cache" version from netlib.org. I stripped out a lot of the unused code.

It is probably too late for 6.1, but the results are quite impressive
(~30 Gflops/s peak):

[Book15] f90/bug% gfc -Ofast timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000      0.969          0.104     0.360     0.368
    4  200000      5.821          0.774     1.381     1.049
    8  200000      5.415          2.970     2.316     2.342
   16  200000      6.455          4.917     2.738     3.225
   32  200000      7.332          5.964     2.893     4.117
   64   30757      5.565          7.277     2.785     3.830
  128    3829      4.790          7.982     2.981     4.384
  256     477      4.674          8.375     3.077     4.675
  512      59      4.797          8.200     3.156     4.786
 1024       7      3.967          8.370     2.896     4.050
 2048       1      3.693          8.414     2.804     3.650

[Book15] f90/bug% gfc -Ofast -mavx timing/matmul_sys_8jd.f90
[Book15] f90/bug% a.out

 Size   Loops   Matmul           dgemm    Matmul    Matmul
                fixed explicit            assumed   variable explicit
=====================================================================================
    2  200000      0.956          0.106     0.372     0.469
    4  200000      7.805          0.715     1.334     1.462
    8  200000      7.520          3.222     2.292     3.482
   16  200000      3.001          6.406     2.671     4.917
   32  200000      8.886          8.530     2.900     6.136
   64   30757     10.203         10.998     2.677     6.770
  128    3829      6.742         13.367     2.831     6.774
  256     477      6.435         13.979     2.906     6.049
  512      59      6.592         15.041     2.991     6.273
 1024       7      5.247         14.639     2.775     4.922
 2048       1      4.309         13.976     2.739     4.176

Note a problem when 16x16 matrices are inlined with -mavx (I'll
investigate and file a PR for it).
(In reply to Dominique d'Humieres from comment #6)
> Note a problem when 16x16 matrices are inlined with -mavx (I'll investigate
> and file a PR for it).

That's a good find!

I ran locally on Haswell and got these numbers, including openblas and
libxsmm:

./a.out
 Size   Loops   Matmul           newmatmul   dgemm-like   dgemm
                fixed explicit   internal    libxsmm      openblas
=====================================================================================
    2  200000      1.562           0.107       0.104       0.139
    4  200000      6.781           0.779       1.012       0.887
    8  200000      7.424           3.360       6.150       4.732
   16  200000      2.954           7.290      14.421      11.527
   32  200000     10.401          10.251      24.396      18.071
   64   30757     12.696          14.196      27.385      24.547
  128    3829      8.646          17.684      31.460      31.530
  256     477      7.834          19.123      37.457      37.471
  512      59      8.064          19.473      40.738      40.755
 1024       7      8.334          19.475      40.931      41.112
 2048       1      3.042          19.157      41.225      41.279

So the 'newmatmul' code gets about 50% of peak. Inlined matmul is good up
to size 8/16, from 16 to 64 libxsmm wins, and above 64 openblas is better.
For the small sizes this is mostly due to the eliminated call overhead, I
think.
Created attachment 36887 [details]
A faster version

I took the example code found at
http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/, where the
register-based vector computations are done explicitly via the SSE
registers, and converted it to use the built-in gcc vector extensions. I
had to experiment a little to find equivalents for some of the operations
in the original code.

With only -O2 and -march=native I am getting good results. I still need to
roll this into the other test program to confirm that the Gflops are being
computed correctly. The Diff value is a comparison against the naive
reference results, to check that the computation is correct.

MY_MMult = [
Size:   40, Gflops: 1.828571e+00, Diff: 2.664535e-15
Size:   80, Gflops: 3.696751e+00, Diff: 7.105427e-15
Size:  120, Gflops: 4.051583e+00, Diff: 1.065814e-14
Size:  160, Gflops: 4.015686e+00, Diff: 1.421085e-14
Size:  200, Gflops: 4.029212e+00, Diff: 2.131628e-14
Size:  240, Gflops: 3.972414e+00, Diff: 2.486900e-14
Size:  280, Gflops: 3.881188e+00, Diff: 2.842171e-14
Size:  320, Gflops: 3.872371e+00, Diff: 3.552714e-14
Size:  360, Gflops: 3.887676e+00, Diff: 4.973799e-14
Size:  400, Gflops: 3.862052e+00, Diff: 4.973799e-14
Size:  440, Gflops: 3.886575e+00, Diff: 4.973799e-14
Size:  480, Gflops: 3.910124e+00, Diff: 6.039613e-14
Size:  520, Gflops: 3.863706e+00, Diff: 6.394885e-14
Size:  560, Gflops: 3.976947e+00, Diff: 6.750156e-14
Size:  600, Gflops: 4.002631e+00, Diff: 7.460699e-14
Size:  640, Gflops: 3.992507e+00, Diff: 8.171241e-14
Size:  680, Gflops: 3.964570e+00, Diff: 9.237056e-14
Size:  720, Gflops: 3.973661e+00, Diff: 1.101341e-13
Size:  760, Gflops: 3.982346e+00, Diff: 1.065814e-13
Size:  800, Gflops: 3.869291e+00, Diff: 8.881784e-14
Size:  840, Gflops: 3.936271e+00, Diff: 1.065814e-13
Size:  880, Gflops: 3.931259e+00, Diff: 1.030287e-13
Size:  920, Gflops: 3.912907e+00, Diff: 1.207923e-13
Size:  960, Gflops: 3.938391e+00, Diff: 1.278977e-13
Size: 1000, Gflops: 3.945754e+00, Diff: 1.421085e-13
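The shape of the conversion is roughly this — a minimal sketch of a 4x4
micro-kernel using the gcc vector extensions, not the attached code
itself; it assumes column-major storage with 16-byte-aligned columns
(even leading dimensions), and the function name and signature are only
illustrative:

#include <stddef.h>

/* two doubles, maps onto one SSE register */
typedef double v2df __attribute__ ((vector_size (16)));

/* C(0:3,0:3) += A(0:3,0:k-1) * B(0:k-1,0:3), column-major, with
   leading dimensions lda/ldb/ldc; data assumed 16-byte aligned */
static void
kernel_4x4 (size_t k, const double *a, size_t lda,
            const double *b, size_t ldb, double *c, size_t ldc)
{
  v2df c0l = {0}, c0h = {0}, c1l = {0}, c1h = {0};
  v2df c2l = {0}, c2h = {0}, c3l = {0}, c3h = {0};

  for (size_t p = 0; p < k; p++)
    {
      /* four elements of column p of A */
      v2df al = *(const v2df *) (a + p * lda);
      v2df ah = *(const v2df *) (a + p * lda + 2);

      /* broadcast B(p,j) for the four columns of the C block */
      v2df b0 = { b[p],           b[p]           };
      v2df b1 = { b[p + ldb],     b[p + ldb]     };
      v2df b2 = { b[p + 2 * ldb], b[p + 2 * ldb] };
      v2df b3 = { b[p + 3 * ldb], b[p + 3 * ldb] };

      /* rank-one update of the 4x4 block; gcc lowers these to
         mulpd/addpd (or FMAs with a suitable -march) */
      c0l += al * b0;  c0h += ah * b0;
      c1l += al * b1;  c1h += ah * b1;
      c2l += al * b2;  c2h += ah * b2;
      c3l += al * b3;  c3h += ah * b3;
    }

  /* accumulate the block back into C */
  *(v2df *) (c + 0 * ldc) += c0l;  *(v2df *) (c + 0 * ldc + 2) += c0h;
  *(v2df *) (c + 1 * ldc) += c1l;  *(v2df *) (c + 1 * ldc + 2) += c1h;
  *(v2df *) (c + 2 * ldc) += c2l;  *(v2df *) (c + 2 * ldc + 2) += c2h;
  *(v2df *) (c + 3 * ldc) += c3l;  *(v2df *) (c + 3 * ldc + 2) += c3h;
}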
> I took the example code found at
> http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm/, where the
> register-based vector computations are done explicitly via the SSE
> registers, and converted it to use the built-in gcc vector extensions. I
> had to experiment a little to find equivalents for some of the operations
> in the original code.

Nice one, I think we can use this as a basis for a new library function in
7.1. It will probably be necessary to add some zero-padding in places
where the matrix dimensions are not divisible by four.
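For the edges, one possibility (just a sketch of the idea, not tested
code) is to copy the leftover rows and columns into a zero-padded buffer
so the same full-width kernel can be reused:

/* copy an m x n edge block (m, n <= 4) of column-major A, leading
   dimension lda, into a zero-padded 4x4 buffer */
static void
pad_block (double buf[16], const double *a, size_t lda,
           size_t m, size_t n)
{
  for (size_t j = 0; j < 4; j++)
    for (size_t i = 0; i < 4; i++)
      buf[j * 4 + i] = (i < m && j < n) ? a[j * lda + i] : 0.0;
}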
> I think you are seeing the effects of the inefficiencies of assumed-shape
> arrays.
>
> If you want to use matmul on very small matrix sizes, it is best to
> use fixed-size explicit arrays.

Then IMO the matmul inlining should be restricted to fixed-size explicit
arrays. Could this be done before the release of gcc-6?
(In reply to Jerry DeLisle from comment #8)
> Created attachment 36887 [details]
> A faster version
--- snip ---

(In reply to Dominique d'Humieres from comment #10)
> Then IMO the matmul inlining should be restricted to fixed-size explicit
> arrays. Could this be done before the release of gcc-6?

I was experimenting some more here a few days ago. I really think that
inlining should be disabled above some threshold. On larger arrays the
runtime library outperforms the inline version, and right now by default
the runtime routines are never used unless you provide
-fno-frontend-optimize, which is counter-intuitive for the larger arrays.

If one compiles with -march=native -mavx -Ofast etc., the inline version
can do fairly well on the larger arrays; however, when we update the
runtime routines to something like what is shown in comment #8, it will
make even more sense not to inline all the time. (Unless of course we
further improve the front-end optimization to do better.)
(In reply to Jerry DeLisle from comment #11)
> I was experimenting some more here a few days ago. I really think that
> inlining should be disabled above some threshold. On larger arrays the
> runtime library outperforms the inline version, and right now by default
> the runtime routines are never used unless you provide
> -fno-frontend-optimize, which is counter-intuitive for the larger arrays.

May I suggest reading the docs? ;-)

`-finline-matmul-limit=N'
     When front-end optimization is active, some calls to the `MATMUL'
     intrinsic function will be inlined.  This may result in code size
     increase if the size of the matrix cannot be determined at compile
     time, as code for both cases is generated.  Setting
     `-finline-matmul-limit=0' will disable inlining in all cases.
     Setting this option with a value of N will produce inline code for
     matrices with size up to N.  If the matrices involved are not
     square, the size comparison is performed using the geometric mean
     of the dimensions of the argument and result matrices.

> If one compiles with -march=native -mavx -Ofast etc., the inline version
> can do fairly well on the larger arrays; however, when we update the
> runtime routines to something like what is shown in comment #8, it will
> make even more sense not to inline all the time. (Unless of course we
> further improve the front-end optimization to do better.)

We can give this option a reasonable default value. The current status is:

     The default value for N is the value specified for
     `-fblas-matmul-limit' if this option is specified, or unlimited
     otherwise.
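So, for example, something like this (the file name is only illustrative)
will inline matmul up to size 16 and call the library routines above that:

gfortran -O2 -ftree-vectorize -funroll-loops -finline-matmul-limit=16 bench.f90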
(In reply to Thomas Koenig from comment #12)
> (In reply to Jerry DeLisle from comment #11)
--- snip ---
> May I suggest reading the docs? ;-)
--- snip ---
> The default value for N is the value specified for
> `-fblas-matmul-limit' if this option is specified, or unlimited
> otherwise.

Oh gosh! Sorry about that, Thomas. I just did not see it, and I was even
looking for it because I thought it was there!

This is excellent, because I am working on a modification to the run-time
libraries. This will give us something like:

====================================
                 Matmul
                 fixed
 Size   Loops    explicit  NewMatmul
====================================
   16    2000      1.496      1.719
   32    2000      2.427      1.784
   64    2000      1.343      1.967
  128    2000      1.657      2.113
  256     477      2.660      2.185
  512      59      2.027      2.195
 1024       7      1.530      2.208
 2048       1      1.516      2.210

On this particular machine, the inlining at high optimization levels has
some sweet spots, at a size of 32x32 for example, so allowing this tuning
is essential, depending on the user's application.
Question: would it make sense to add an option so that only matrices whose
size is known at compile time are inlined? Something like
-finline-matmul-size-var=0 (to disable inlining for variable sizes) and
-finline-matmul-size-fixed=5 (to inline fixed sizes up to 5 only).
I think that with the current status, where we have -finline-matmul-limit=30 by default, we can close this bug. Agreed?
(In reply to Thomas Koenig from comment #15)
> I think that with the current status, where
> we have -finline-matmul-limit=30 by default, we
> can close this bug.
>
> Agreed?

Yes, this can be closed.