This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Hi,
not so long ago my proposal to enable loop unrolling for the matmul
intrinsic was turned down due to fears about a negative performance
impact on register-starved architectures.
However, I now believe there is sufficient benchmark data to dispel
these fears. I made a slightly improved version of my benchmark
program that also tests double precision and logicals. Here are the
results (for logicals the correct unit should be gops/s rather than
gflops/s, and the ops count is probably wrong anyway, but the numbers
should still give some clue about the relative speeds with and without
loop unrolling):
trunk on 1.8 GHz A64, i686-pc-linux-gnu:
Single precision matrix multiplication test
Matrix side size   Matmul (Gflops/s)   sgemm (Gflops/s)    Loops
====================================================================
               2               0.092              0.024   100000
               4               0.361              0.162   100000
               8               0.733              0.495   100000
              16               0.687              0.768   100000
              32               0.874              1.316    15500
              64               0.988              1.397     1922
             128               1.041              2.929      239
             256               0.848              4.809       29
             512               0.794              5.093        3
            1024               0.800              5.173        1
            2048               0.807              5.368        1
Double precision matrix multiplication test
Matrix side size   Matmul (Gflops/s)   dgemm (Gflops/s)    Loops
====================================================================
               2               0.086              0.020   100000
               4               0.373              0.158   100000
               8               0.762              0.522   100000
              16               0.703              0.949   100000
              32               0.889              1.306    15500
              64               0.973              1.391     1922
             128               0.777              1.643      239
             256               0.698              2.259       29
             512               0.521              2.673        3
            1024               0.522              2.749        1
            2048               0.527              2.759        1
Default kind logical matrix multiplication test
Matrix side size   Matmul (Gflops/s)    Loops
==============================================
               2               0.109   100000
               4               0.311   100000
               8               0.466   100000
              16               0.839   100000
              32               1.493    15500
              64               2.659     1922
             128               5.642      239
             256              10.914       29
             512              22.990        3
            1024              43.812        1
            2048              75.675        1
matmul and matmull compiled with -funroll-loops:
Single precision matrix multiplication test
Matrix side size   Matmul (Gflops/s)   sgemm (Gflops/s)    Loops
====================================================================
               2               0.057              0.012   100000
               4               0.215              0.091   100000
               8               0.447              0.267   100000
              16               0.683              0.750   100000
              32               1.242              1.342    15500
              64               1.316              1.458     1922
             128               1.227              2.981      239
             256               1.047              4.761       29
             512               0.951              5.093        3
            1024               0.947              5.173        1
            2048               0.958              5.265        1
Double precision matrix multiplication test
Matrix side size   Matmul (Gflops/s)   dgemm (Gflops/s)    Loops
====================================================================
               2               0.100              0.021   100000
               4               0.400              0.158   100000
               8               0.781              0.513   100000
              16               1.038              0.942   100000
              32               1.224              1.326    15500
              64               1.282              1.408     1922
             128               0.957              1.653      239
             256               0.738              2.285       29
             512               0.547              2.664        3
            1024               0.545              2.731        1
            2048               0.557              2.784        1
Default kind logical matrix multiplication test
Matrix side size   Matmul (Gflops/s)    Loops
==============================================
               2               0.120   100000
               4               0.361   100000
               8               0.619   100000
              16               0.990   100000
              32               1.852    15500
              64               3.135     1922
             128               6.614      239
             256              12.781       29
             512              26.822        3
            1024              48.789        1
            2048              80.649        1
As can be seen, with -funroll-loops performance is about 20-30%
better.
There are also some results posted for ppc, where loop unrolling
improved performance by 15-20%. See
http://gcc.gnu.org/ml/fortran/2005-11/msg00644.html
And once again, thanks to rth for showing the correct make
incantation.
--
Janne Blomqvist
Attachment:
ChangeLog
Description: Text document
Attachment:
matmul-unroll-loops.diff
Description: Text document
Attachment:
matmul-bench4.f90
Description: Text document
Attachment:
pgp00000.pgp
Description: PGP signature