This is the mail archive of the
fortran@gcc.gnu.org
mailing list for the GNU Fortran project.
Re: [Patch, gfortran]: Enable loop unrolling for the matmul intrinsic.
- From: Richard Guenther <richard dot guenther at gmail dot com>
- To: GNU GFortran <fortran at gcc dot gnu dot org>, GCC patches <gcc-patches at gcc dot gnu dot org>
- Date: Tue, 29 Nov 2005 23:32:41 +0100
- Subject: Re: [Patch, gfortran]: Enable loop unrolling for the matmul intrinsic.
- References: <20051129200114.GD16405@vipunen.hut.fi>
On 11/29/05, Janne Blomqvist <jblomqvi@cc.hut.fi> wrote:
> Hi,
>
> Not so long ago, my proposal to enable loop unrolling for the matmul
> intrinsic was turned down due to fears about a negative performance
> impact on register-starved architectures.
>
> However, I now believe there is sufficient benchmark data to dispel
> these fears. I made a slightly improved version of my benchmark
> program that also tests double precision and logicals. Here are the
> results (for logicals the correct unit would be Gops/s rather than
> Gflops/s, and the ops count is probably wrong anyway, but the numbers
> should give some idea of the relative speeds with and without loop
> unrolling):
I will try the benchmark on a libgfortran compiled with FDO
(feedback-directed optimization), which also enables loop unrolling.
That may be a more general approach, if it shows comparable performance.
Richard.
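
For reading the tables below, here is a minimal sketch of how a Gflops/s figure for a square matrix product is conventionally derived. It assumes the usual count of 2*n**3 floating-point operations per n x n product; the benchmark program's actual operation count and timing method are not shown in this thread, and the function names are hypothetical.

```python
def matmul_flops(n: int) -> int:
    """Conventional flop count for one n x n by n x n product.

    Assumption: n**3 multiplies plus roughly n**3 adds, i.e. 2*n**3.
    The benchmark in this thread may count operations differently.
    """
    return 2 * n**3


def gflops_rate(n: int, loops: int, seconds: float) -> float:
    """Rate in units of 1e9 floating-point operations per second.

    `loops` is the number of repeated products that were timed and
    `seconds` is the total wall-clock time for all of them.
    """
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    return matmul_flops(n) * loops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: one 1024 x 1024 product taking 2.0 seconds.
    print(f"{gflops_rate(1024, 1, 2.0):.3f}")
```

The "Loops" column in the tables plays the role of `loops` here: small matrices are multiplied many times so that the total time is long enough to measure.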
>
> trunk on 1.8 GHz A64, i686-pc-linux-gnu:
>
> Single precision matrix multiplication test
> Matrix side size   Matmul (Gflops/s)   sgemm (Gflops/s)    Loops
> ====================================================================
>                2               0.092              0.024   100000
>                4               0.361              0.162   100000
>                8               0.733              0.495   100000
>               16               0.687              0.768   100000
>               32               0.874              1.316    15500
>               64               0.988              1.397     1922
>              128               1.041              2.929      239
>              256               0.848              4.809       29
>              512               0.794              5.093        3
>             1024               0.800              5.173        1
>             2048               0.807              5.368        1
>
> Double precision matrix multiplication test
> Matrix side size   Matmul (Gflops/s)   dgemm (Gflops/s)    Loops
> ====================================================================
>                2               0.086              0.020   100000
>                4               0.373              0.158   100000
>                8               0.762              0.522   100000
>               16               0.703              0.949   100000
>               32               0.889              1.306    15500
>               64               0.973              1.391     1922
>              128               0.777              1.643      239
>              256               0.698              2.259       29
>              512               0.521              2.673        3
>             1024               0.522              2.749        1
>             2048               0.527              2.759        1
>
> Default kind logical matrix multiplication test
> Matrix side size   Matmul (Gflops/s)    Loops
> ==============================================
>                2               0.109   100000
>                4               0.311   100000
>                8               0.466   100000
>               16               0.839   100000
>               32               1.493    15500
>               64               2.659     1922
>              128               5.642      239
>              256              10.914       29
>              512              22.990        3
>             1024              43.812        1
>             2048              75.675        1
>
> matmul and matmull compiled with -funroll-loops:
>
> Single precision matrix multiplication test
> Matrix side size   Matmul (Gflops/s)   sgemm (Gflops/s)    Loops
> ====================================================================
>                2               0.057              0.012   100000
>                4               0.215              0.091   100000
>                8               0.447              0.267   100000
>               16               0.683              0.750   100000
>               32               1.242              1.342    15500
>               64               1.316              1.458     1922
>              128               1.227              2.981      239
>              256               1.047              4.761       29
>              512               0.951              5.093        3
>             1024               0.947              5.173        1
>             2048               0.958              5.265        1
>
> Double precision matrix multiplication test
> Matrix side size   Matmul (Gflops/s)   dgemm (Gflops/s)    Loops
> ====================================================================
>                2               0.100              0.021   100000
>                4               0.400              0.158   100000
>                8               0.781              0.513   100000
>               16               1.038              0.942   100000
>               32               1.224              1.326    15500
>               64               1.282              1.408     1922
>              128               0.957              1.653      239
>              256               0.738              2.285       29
>              512               0.547              2.664        3
>             1024               0.545              2.731        1
>             2048               0.557              2.784        1
>
> Default kind logical matrix multiplication test
> Matrix side size   Matmul (Gflops/s)    Loops
> ==============================================
>                2               0.120   100000
>                4               0.361   100000
>                8               0.619   100000
>               16               0.990   100000
>               32               1.852    15500
>               64               3.135     1922
>              128               6.614      239
>              256              12.781       29
>              512              26.822        3
>             1024              48.789        1
>             2048              80.649        1
>
> As can be seen, with -funroll-loops performance is about 20-30%
> better.
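
The per-size change behind a summary figure like this can be checked directly from the two sets of tables. The following is a minimal sketch that copies the Matmul columns from the tables above and computes the relative change with -funroll-loops for each matrix size. The names are hypothetical, and the sgemm/dgemm columns are left out since the BLAS routines were not the code being recompiled.

```python
SIZES = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

# Matmul column only, as (trunk, -funroll-loops), copied from the tables above.
RESULTS = {
    "single": (
        [0.092, 0.361, 0.733, 0.687, 0.874, 0.988, 1.041, 0.848, 0.794, 0.800, 0.807],
        [0.057, 0.215, 0.447, 0.683, 1.242, 1.316, 1.227, 1.047, 0.951, 0.947, 0.958],
    ),
    "double": (
        [0.086, 0.373, 0.762, 0.703, 0.889, 0.973, 0.777, 0.698, 0.521, 0.522, 0.527],
        [0.100, 0.400, 0.781, 1.038, 1.224, 1.282, 0.957, 0.738, 0.547, 0.545, 0.557],
    ),
    "logical": (
        [0.109, 0.311, 0.466, 0.839, 1.493, 2.659, 5.642, 10.914, 22.990, 43.812, 75.675],
        [0.120, 0.361, 0.619, 0.990, 1.852, 3.135, 6.614, 12.781, 26.822, 48.789, 80.649],
    ),
}


def speedup_percent(base: float, unrolled: float) -> float:
    """Relative change in throughput, in percent (positive means faster)."""
    return (unrolled / base - 1.0) * 100.0


def speedup_table(results=RESULTS, sizes=SIZES):
    """Map each test kind to {matrix size: percent change with unrolling}."""
    return {
        kind: {n: speedup_percent(b, u) for n, b, u in zip(sizes, base, unrolled)}
        for kind, (base, unrolled) in results.items()
    }


if __name__ == "__main__":
    for kind, row in speedup_table().items():
        print(kind)
        for n, pct in row.items():
            print(f"  {n:5d}  {pct:+6.1f}%")
```

Printing the change per size rather than a single average makes it easier to see where unrolling helps most and whether any sizes regress.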
>
> There are also some results posted for ppc, where loop unrolling
> improved performance by 15-20%. See
>
> http://gcc.gnu.org/ml/fortran/2005-11/msg00644.html
>
> And once again, thanks to rth for showing the correct make
> incantation.
>
> --
> Janne Blomqvist
>
>
>