This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Re: matmul, dotprod, transpose performance patch proposal
- From: Tim Prince <tprince at myrealbox dot com>
- To: tprince at computer dot org
- Cc: fortran at gcc dot gnu dot org
- Date: Wed, 24 Aug 2005 19:16:46 -0700
- Subject: Re: matmul, dotprod, transpose performance patch proposal
- References: <430C0DF7.9070509@myrealbox.com>
- Reply-to: tprince at computer dot org
Tim Prince wrote:
I have been examining the performance improvements that can be obtained
by first-step optimizations of these intrinsics. I'm sure it's no
secret that a performance increase can be obtained by the usual methods:
matmul_r[48]: unroll and jam the stride-1 loops, combining 2 outer
loop iterations in one inner loop. On Pentium-M, but not on Xeon,
Opteron, or Itanium, a much bigger boost is obtained by using 2
parallel dot products. Change the general-stride case to a dot product,
which spells out by hand the strength reduction that gfortran misses.
Performance gain: 20% to 100%.
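A minimal C sketch of the unroll-and-jam idea for the stride-1 case, for illustration only; it is not the proposed patch. It assumes column-major storage with unit stride down each column, and it assumes that the "outer loop" being unrolled is the loop over the shared dimension k. The names matmul_ref and matmul_jam2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop nest: C += A*B, all arrays column-major.
   A is m x k, B is k x n, C is m x n. The inner i loop is stride 1. */
static void matmul_ref(size_t m, size_t n, size_t k,
                       const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; j++)
        for (size_t l = 0; l < k; l++)
            for (size_t i = 0; i < m; i++)
                c[i + j * m] += a[i + l * m] * b[l + j * k];
}

/* Unroll and jam (assumed form): two iterations of the l loop are
   fused into a single inner i loop, so each element of C is loaded
   and stored once per pair of l values instead of once per l. */
static void matmul_jam2(size_t m, size_t n, size_t k,
                        const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; j++) {
        double *cj = c + j * m;          /* column j of C */
        size_t l = 0;
        for (; l + 1 < k; l += 2) {
            const double *a0 = a + l * m;    /* column l of A */
            const double *a1 = a0 + m;       /* column l+1 of A */
            double b0 = b[l + j * k];
            double b1 = b[l + 1 + j * k];
            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0 + a1[i] * b1;
        }
        if (l < k) {                     /* leftover column when k is odd */
            const double *a0 = a + l * m;
            double b0 = b[l + j * k];
            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0;
        }
    }
}
```

The fused loop adds the two products to each other before adding them to C, so for general floating-point data the result can differ from the reference in the last bits.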
dotprod_r[48]: unroll, adding the products in pairs first, before
adding each pair to the running sum, which effectively halves the
performance limitation due to the latency of addition. Typical
performance gain: 30%.
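The pairwise addition can be sketched in C as below. This is an illustration under assumptions, not the proposed patch: both vectors are taken to have unit stride, and the names dot_ref and dot_pairs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: every addition depends on the previous one, so the loop
   runs no faster than one add latency per element. */
static double dot_ref(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Unrolled by two: the two products are added to each other first,
   independently of sum, and only that pair total joins the dependent
   chain. The chain through sum is half as long. */
static double dot_pairs(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    double sum = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        sum += (x[i] * y[i] + x[i + 1] * y[i + 1]);
    if (i < n)                      /* leftover element when n is odd */
        sum += x[i] * y[i];
    return sum;
}
```

Because the order of the additions changes, the result may differ from the simple loop in the last bits for general floating-point data.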
transpose: swap inner and outer loops, so the small stride (normally
1) occurs on the destination rather than the source side, taking
advantage of Write Combine buffering, where present. This should be
faster, as long as the operand size is not large enough for cache
eviction to become important.
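The loop swap can be sketched in C as below, again as an illustration under assumptions rather than the proposed patch. Both arrays are assumed to be contiguous and column-major, and the names transpose_src_contig and transpose_dst_contig are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* src is m x n, dst is n x m, both column-major: dst(j,i) = src(i,j). */

/* Inner loop walks src with stride 1 and dst with stride n. */
static void transpose_src_contig(size_t m, size_t n,
                                 const double *src, double *dst)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < m; i++)
            dst[j + i * n] = src[i + j * m];
}

/* Loops swapped: inner loop walks dst with stride 1 and src with
   stride m, so consecutive stores land in adjacent addresses. */
static void transpose_dst_contig(size_t m, size_t n,
                                 const double *src, double *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            dst[j + i * n] = src[i + j * m];
}
```

Both versions produce the same result; only the order of memory accesses differs, so any speed difference depends on the hardware and the operand size.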
Should such a patch be offered, and if so, under what circumstances?
As this did not appear within 20 hours, I am sending it again. My
apologies if it is waiting for moderator approval.