This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



[Patch, gfortran]: Enable loop unrolling for the matmul intrinsic.


Hi,

Not so long ago my proposal to enable loop unrolling for the matmul
intrinsic was turned down due to fears about a negative performance
impact on register-starved architectures.
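
For readers unfamiliar with the transformation, here is a rough sketch
of what unrolling does to a multiply-add inner loop. It is only an
illustration: the function names and the unroll factor of four are
made up for this example, and it is not the libgfortran matmul code
nor the exact output of -funroll-loops.

```c
#include <stddef.h>

/* Plain kernel: one multiply-add and one loop test per element. */
double dot_plain(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-unrolled by four (hypothetical factor). The additions happen
   in the same order as in dot_plain, so the result is identical;
   only the number of loop tests and branches goes down. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The larger loop body gives the scheduler more independent loads to
overlap, but it can also keep more values live at once, which is where
the concern about register-starved targets comes from.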

However, I now believe there is sufficient benchmark data to dispel
these fears. I made a slightly improved version of my benchmark
program that also tests double precision and logicals. Here are the
results (for logicals the correct unit would be Gops/s rather than
Gflops/s, and the ops count is probably wrong anyway, but the numbers
should still give some idea of the relative speeds with and without
loop unrolling):

trunk on 1.8 GHz A64, i686-pc-linux-gnu:

 Single precision matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    sgemm (Gflops/s)      Loops
 ====================================================================
    2                0.092                0.024                100000
    4                0.361                0.162                100000
    8                0.733                0.495                100000
   16                0.687                0.768                100000
   32                0.874                1.316                 15500
   64                0.988                1.397                  1922
  128                1.041                2.929                   239
  256                0.848                4.809                    29
  512                0.794                5.093                     3
 1024                0.800                5.173                     1
 2048                0.807                5.368                     1
 Double precision matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    dgemm (Gflops/s)      Loops
 ====================================================================
    2                0.086                0.020                100000
    4                0.373                0.158                100000
    8                0.762                0.522                100000
   16                0.703                0.949                100000
   32                0.889                1.306                 15500
   64                0.973                1.391                  1922
  128                0.777                1.643                   239
  256                0.698                2.259                    29
  512                0.521                2.673                     3
 1024                0.522                2.749                     1
 2048                0.527                2.759                     1
 Default kind logical matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    Loops
 ==============================================
    2                0.109                100000
    4                0.311                100000
    8                0.466                100000
   16                0.839                100000
   32                1.493                 15500
   64                2.659                  1922
  128                5.642                   239
  256               10.914                    29
  512               22.990                     3
 1024               43.812                     1
 2048               75.675                     1

matmul and matmull compiled with -funroll-loops:

 Single precision matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    sgemm (Gflops/s)      Loops
 ====================================================================
    2                0.057                0.012                100000
    4                0.215                0.091                100000
    8                0.447                0.267                100000
   16                0.683                0.750                100000
   32                1.242                1.342                 15500
   64                1.316                1.458                  1922
  128                1.227                2.981                   239
  256                1.047                4.761                    29
  512                0.951                5.093                     3
 1024                0.947                5.173                     1
 2048                0.958                5.265                     1
 Double precision matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    dgemm (Gflops/s)      Loops
 ====================================================================
    2                0.100                0.021                100000
    4                0.400                0.158                100000
    8                0.781                0.513                100000
   16                1.038                0.942                100000
   32                1.224                1.326                 15500
   64                1.282                1.408                  1922
  128                0.957                1.653                   239
  256                0.738                2.285                    29
  512                0.547                2.664                     3
 1024                0.545                2.731                     1
 2048                0.557                2.784                     1
 Default kind logical matrix multiplication test
 Matrix side size    Matmul (Gflops/s)    Loops
 ==============================================
    2                0.120                100000
    4                0.361                100000
    8                0.619                100000
   16                0.990                100000
   32                1.852                 15500
   64                3.135                  1922
  128                6.614                   239
  256               12.781                    29
  512               26.822                     3
 1024               48.789                     1
 2048               80.649                     1
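
As a reading aid for the rate columns above, this is one conventional
way such a figure can be derived from a timing. It is a sketch under
an assumption: that one n x n multiplication is counted as 2*n^3
operations. The attached benchmark may count differently, and the
function name is made up for this example.

```c
/* Rate in Gflops/s for `loops` repetitions of an n x n matmul that
   took `seconds` in total.
   Assumption: 2*n^3 operations per multiplication (n multiplies and
   n adds for each of the n*n result elements). */
double matmul_gflops(int n, long loops, double seconds)
{
    double ops = 2.0 * (double)n * (double)n * (double)n * (double)loops;
    return ops / seconds / 1e9;
}
```

With this convention, a single 1000 x 1000 multiplication that takes
two seconds comes out at 1 Gflops/s.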

As can be seen, with -funroll-loops performance is about 20-30%
better.
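
The relative gain for any single row can be checked with a one-line
calculation. The helper below is only a sketch, with a made-up name,
and takes the two Matmul rates for the same matrix size as input.

```c
/* Percentage gain of the unrolled rate over the baseline rate. */
double speedup_percent(double base, double unrolled)
{
    return (unrolled / base - 1.0) * 100.0;
}
```

For example, the single precision row for size 512 (0.794 vs. 0.951)
gives roughly 20%, and the double precision row for size 64 (0.973 vs.
1.282) gives roughly 32%.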

There are also some results posted for ppc, where loop unrolling
improved performance by 15-20%. See 

http://gcc.gnu.org/ml/fortran/2005-11/msg00644.html

And once again, thanks to rth for showing the correct make
incantation.

-- 
Janne Blomqvist

Attachment: ChangeLog
Description: Text document

Attachment: matmul-unroll-loops.diff
Description: Text document

Attachment: matmul-bench4.f90
Description: Text document

Attachment: pgp00000.pgp
Description: PGP signature

