This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Hi,

not so long ago my proposal to enable loop unrolling for the matmul intrinsic was turned down due to fears about a negative performance impact on register-starved architectures. However, I now believe there is sufficient benchmark data to dispel these fears.

I made a slightly improved version of my benchmark program that also tests double precision and logicals. Here are the results (for logicals the correct unit should be gops/s, not gflops/s, and the ops count is probably wrong anyway, but it should provide some clue about the relative speeds with and without loop unrolling):

trunk on 1.8 GHz A64, i686-pc-linux-gnu:

Single precision matrix multiplication test

Matrix side size  Matmul (Gflops/s)  sgemm (Gflops/s)  Loops
====================================================================
2                 0.092              0.024             100000
4                 0.361              0.162             100000
8                 0.733              0.495             100000
16                0.687              0.768             100000
32                0.874              1.316             15500
64                0.988              1.397             1922
128               1.041              2.929             239
256               0.848              4.809             29
512               0.794              5.093             3
1024              0.800              5.173             1
2048              0.807              5.368             1

Double precision matrix multiplication test

Matrix side size  Matmul (Gflops/s)  dgemm (Gflops/s)  Loops
====================================================================
2                 0.086              0.020             100000
4                 0.373              0.158             100000
8                 0.762              0.522             100000
16                0.703              0.949             100000
32                0.889              1.306             15500
64                0.973              1.391             1922
128               0.777              1.643             239
256               0.698              2.259             29
512               0.521              2.673             3
1024              0.522              2.749             1
2048              0.527              2.759             1

Default kind logical matrix multiplication test

Matrix side size  Matmul (Gflops/s)  Loops
==============================================
2                 0.109              100000
4                 0.311              100000
8                 0.466              100000
16                0.839              100000
32                1.493              15500
64                2.659              1922
128               5.642              239
256               10.914             29
512               22.990             3
1024              43.812             1
2048              75.675             1

matmul and matmull compiled with -funroll-loops:

Single precision matrix multiplication test

Matrix side size  Matmul (Gflops/s)  sgemm (Gflops/s)  Loops
====================================================================
2                 0.057              0.012             100000
4                 0.215              0.091             100000
8                 0.447              0.267             100000
16                0.683              0.750             100000
32                1.242              1.342             15500
64                1.316              1.458             1922
128               1.227              2.981             239
256               1.047              4.761             29
512               0.951              5.093             3
1024              0.947              5.173             1
2048              0.958              5.265             1

Double precision matrix multiplication test

Matrix side size  Matmul (Gflops/s)  dgemm (Gflops/s)  Loops
====================================================================
2                 0.100              0.021             100000
4                 0.400              0.158             100000
8                 0.781              0.513             100000
16                1.038              0.942             100000
32                1.224              1.326             15500
64                1.282              1.408             1922
128               0.957              1.653             239
256               0.738              2.285             29
512               0.547              2.664             3
1024              0.545              2.731             1
2048              0.557              2.784             1

Default kind logical matrix multiplication test

Matrix side size  Matmul (Gflops/s)  Loops
==============================================
2                 0.120              100000
4                 0.361              100000
8                 0.619              100000
16                0.990              100000
32                1.852              15500
64                3.135              1922
128               6.614              239
256               12.781             29
512               26.822             3
1024              48.789             1
2048              80.649             1

As can be seen, with -funroll-loops performance is about 20-30% better. There are also some results posted for ppc, where loop unrolling improved performance by 15-20%. See http://gcc.gnu.org/ml/fortran/2005-11/msg00644.html

And once again, thanks to rth for showing the correct make incantation.

-- 
Janne Blomqvist
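
To read off the relative change at any particular matrix size, the Matmul columns above can be compared row by row. The following is a minimal sketch in Python and is not part of the original post or its attachments. The names `SIZES`, `TRUNK`, `UNROLLED`, `percent_change` and `changes` are hypothetical helpers; the numbers are copied from the single and double precision tables.

```python
# Matmul throughput (Gflops/s) copied from the tables in the post.
# All names here are illustrative; none come from the benchmark program.
SIZES = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

TRUNK = {
    "single": [0.092, 0.361, 0.733, 0.687, 0.874, 0.988,
               1.041, 0.848, 0.794, 0.800, 0.807],
    "double": [0.086, 0.373, 0.762, 0.703, 0.889, 0.973,
               0.777, 0.698, 0.521, 0.522, 0.527],
}

UNROLLED = {
    "single": [0.057, 0.215, 0.447, 0.683, 1.242, 1.316,
               1.227, 1.047, 0.951, 0.947, 0.958],
    "double": [0.100, 0.400, 0.781, 1.038, 1.224, 1.282,
               0.957, 0.738, 0.547, 0.545, 0.557],
}


def percent_change(base, new):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def changes(kind):
    """Map each matrix side size to the percent change for one precision."""
    return {
        n: percent_change(b, u)
        for n, b, u in zip(SIZES, TRUNK[kind], UNROLLED[kind])
    }


if __name__ == "__main__":
    for kind in ("single", "double"):
        print(f"{kind} precision, -funroll-loops vs. trunk:")
        for n, pct in changes(kind).items():
            print(f"  {n:5d}  {pct:+7.1f}%")
```

Running the script prints one line per matrix size for each precision, with a positive value meaning the -funroll-loops build was faster in that row.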
Attachments:
ChangeLog (Text document)
matmul-unroll-loops.diff (Text document)
matmul-bench4.f90 (Text document)
pgp00000.pgp (PGP signature)