This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: [Patch, gfortran]: Enable loop unrolling for the matmul intrinsic.


On 11/29/05, Janne Blomqvist <jblomqvi@cc.hut.fi> wrote:
> Hi,
>
> Not so long ago my proposal to enable loop unrolling for the matmul
> intrinsic was turned down due to fears about a negative performance
> impact on register-starved architectures.
>
> However, I now believe there is sufficient benchmark data to dispel
> these fears. I made a slightly improved version of my benchmark
> program that also tests double precision and logicals. Here are the
> results (for logicals the correct unit should be gops/s, not
> gflops/s, and the ops count is probably wrong anyway, but it should
> give some idea of the relative speeds with and without loop
> unrolling):
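
For readers unfamiliar with how such a rate is derived, here is a minimal sketch of the arithmetic. It is not the benchmark program from the post. It assumes the usual count of 2*n^3 operations per n x n multiplication (one multiply and one add per inner iteration), which the post itself does not state, and the names matmul_naive, gflops and bench are hypothetical.

```python
import time

def matmul_naive(a, b, n):
    """Multiply two n x n matrices stored as lists of rows."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

def gflops(n, loops, seconds):
    # Assumed operation count: 2 * n^3 per multiplication
    # (one multiply and one add per inner-loop iteration).
    return 2.0 * n**3 * loops / seconds / 1e9

def bench(n, loops):
    """Time `loops` multiplications of n x n matrices and return the rate."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    for _ in range(loops):
        matmul_naive(a, b, n)
    elapsed = time.perf_counter() - start
    return gflops(n, loops, elapsed)

if __name__ == "__main__":
    for n in (2, 4, 8, 16, 32):
        print(f"{n:5d}  {bench(n, 20):.6f}")
```

Rates from interpreted Python will be far below the figures in the tables; the sketch only shows how side size, loop count and elapsed time combine into a rate.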

I will try the benchmark on a libgfortran compiled with FDO
(feedback-directed optimization), which also enables loop unrolling.
If it shows comparable performance, it may be the more general approach.

Richard.

>
> trunk on 1.8 GHz A64, i686-pc-linux-gnu:
>
>  Single precision matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    sgemm (Gflops/s)      Loops
>  ====================================================================
>     2                0.092                0.024                100000
>     4                0.361                0.162                100000
>     8                0.733                0.495                100000
>    16                0.687                0.768                100000
>    32                0.874                1.316                 15500
>    64                0.988                1.397                  1922
>   128                1.041                2.929                   239
>   256                0.848                4.809                    29
>   512                0.794                5.093                     3
>  1024                0.800                5.173                     1
>  2048                0.807                5.368                     1
>  Double precision matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    dgemm (Gflops/s)      Loops
>  ====================================================================
>     2                0.086                0.020                100000
>     4                0.373                0.158                100000
>     8                0.762                0.522                100000
>    16                0.703                0.949                100000
>    32                0.889                1.306                 15500
>    64                0.973                1.391                  1922
>   128                0.777                1.643                   239
>   256                0.698                2.259                    29
>   512                0.521                2.673                     3
>  1024                0.522                2.749                     1
>  2048                0.527                2.759                     1
>  Default kind logical matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    Loops
>  ==============================================
>     2                0.109                100000
>     4                0.311                100000
>     8                0.466                100000
>    16                0.839                100000
>    32                1.493                 15500
>    64                2.659                  1922
>   128                5.642                   239
>   256               10.914                    29
>   512               22.990                     3
>  1024               43.812                     1
>  2048               75.675                     1
>
> matmul and matmull compiled with -funroll-loops:
>
>  Single precision matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    sgemm (Gflops/s)      Loops
>  ====================================================================
>     2                0.057                0.012                100000
>     4                0.215                0.091                100000
>     8                0.447                0.267                100000
>    16                0.683                0.750                100000
>    32                1.242                1.342                 15500
>    64                1.316                1.458                  1922
>   128                1.227                2.981                   239
>   256                1.047                4.761                    29
>   512                0.951                5.093                     3
>  1024                0.947                5.173                     1
>  2048                0.958                5.265                     1
>  Double precision matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    dgemm (Gflops/s)      Loops
>  ====================================================================
>     2                0.100                0.021                100000
>     4                0.400                0.158                100000
>     8                0.781                0.513                100000
>    16                1.038                0.942                100000
>    32                1.224                1.326                 15500
>    64                1.282                1.408                  1922
>   128                0.957                1.653                   239
>   256                0.738                2.285                    29
>   512                0.547                2.664                     3
>  1024                0.545                2.731                     1
>  2048                0.557                2.784                     1
>  Default kind logical matrix multiplication test
>  Matrix side size    Matmul (Gflops/s)    Loops
>  ==============================================
>     2                0.120                100000
>     4                0.361                100000
>     8                0.619                100000
>    16                0.990                100000
>    32                1.852                 15500
>    64                3.135                  1922
>   128                6.614                   239
>   256               12.781                    29
>   512               26.822                     3
>  1024               48.789                     1
>  2048               80.649                     1
>
> As can be seen, with -funroll-loops performance is about 20-30%
> better.
>
> There are also some results posted for ppc, where loop unrolling
> improved performance by 15-20%. See
>
> http://gcc.gnu.org/ml/fortran/2005-11/msg00644.html
>
> And once again, thanks to rth for showing the correct make
> incantation.
>
> --
> Janne Blomqvist
>
>
>
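
To compare the two runs size by size, the relative change can be computed straight from the tables. This is a small sketch, not part of the original post: the numbers are the Matmul column of the two single precision tables quoted above, and the helper name speedup_percent is hypothetical.

```python
# Matmul column (Gflops/s) copied from the two single precision tables above.
sizes = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
baseline = [0.092, 0.361, 0.733, 0.687, 0.874, 0.988,
            1.041, 0.848, 0.794, 0.800, 0.807]
unrolled = [0.057, 0.215, 0.447, 0.683, 1.242, 1.316,
            1.227, 1.047, 0.951, 0.947, 0.958]

def speedup_percent(base, new):
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0

for n, b, u in zip(sizes, baseline, unrolled):
    print(f"{n:5d}  {speedup_percent(b, u):+6.1f}%")
```

In these single precision figures the unrolled build is slower for sizes 2 to 8 and faster from size 32 upward. The double precision and logical columns can be checked by substituting their values in the same way.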

