This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: matmul, dotprod, transpose performance patch proposal


Tim Prince wrote:

I have been examining the performance improvements that can be obtained from first-step optimizations of these intrinsics. I'm sure it's no secret that a performance increase can be obtained by the usual methods:

matmul_r[48]: unroll and jam the stride-1 loops, combining two outer-loop iterations into one inner loop. On Pentium-M, but not on Xeon, Opteron, or Itanium, a much bigger boost is obtained by using two parallel dot products. Change the general-stride case to a dot product, writing out explicitly the strength reduction that gfortran misses. Performance gain: 20% to 100%.
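
To make the unroll-and-jam step concrete, here is a minimal sketch in C. It is not the library routine: the name matmul_unroll_jam, the argument list, and the assumption of contiguous column-major double arrays are all chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the libgfortran routine.
   Computes c = a * b for contiguous column-major arrays:
   a is m x count, b is count x n, c is m x n. */
void matmul_unroll_jam(size_t m, size_t n, size_t count,
                       const double *a, const double *b, double *c)
{
    assert(c != a && c != b);   /* the result must not alias an input */

    for (size_t j = 0; j < n; j++) {
        double *cj = c + j * m;             /* column j of c */
        const double *bj = b + j * count;   /* column j of b */
        size_t k = 0;

        for (size_t i = 0; i < m; i++)
            cj[i] = 0.0;

        /* Two iterations of the k loop jammed into one stride-1 loop
           over i, so each element of cj is loaded and stored once per
           pair of k values instead of once per k value. */
        for (; k + 1 < count; k += 2) {
            const double *a0 = a + k * m;   /* column k of a */
            const double *a1 = a0 + m;      /* column k + 1 of a */
            const double b0 = bj[k];
            const double b1 = bj[k + 1];
            for (size_t i = 0; i < m; i++)
                cj[i] = cj[i] + a0[i] * b0 + a1[i] * b1;
        }

        /* Leftover iteration when count is odd. */
        if (k < count) {
            const double *a0 = a + k * m;
            const double b0 = bj[k];
            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0;
        }
    }
}
```

The jammed statement adds the two products to cj[i] in the same left-to-right order as the plain loop would, so the sketch keeps the original summation order. Whether it is actually faster depends on the target, as noted above.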

dotprod_r[48]: unroll, adding the products in pairs before adding them to the running sum, which effectively halves the performance limit imposed by the latency of addition. Typical performance gain: 30%.
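
A minimal sketch of the pairwise scheme in C follows. The name dot_pairwise and its signature are hypothetical, and unit stride is assumed for both operands.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the libgfortran routine.
   Dot product of two contiguous vectors of length n. */
double dot_pairwise(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    size_t i = 0;

    assert(n == 0 || (x != NULL && y != NULL));

    /* Unrolled by two: the two products are added to each other first,
       so only one addition per pair of elements sits on the dependency
       chain that runs through sum. */
    for (; i + 1 < n; i += 2) {
        const double pair = x[i] * y[i] + x[i + 1] * y[i + 1];
        sum += pair;
    }

    /* Leftover element when n is odd. */
    if (i < n)
        sum += x[i] * y[i];

    return sum;
}
```

Because this reassociates the additions, the rounded result can differ slightly from a strictly sequential sum. That is also why a compiler will not normally make this transformation on its own unless it is given options that permit reassociation.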

transpose: swap the inner and outer loops so that the small stride (normally 1) occurs on the destination side rather than the source side, taking advantage of write-combining buffers where present. This should be faster as long as the operands are not large enough for cache eviction to become important.
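
The loop order described for transpose can be sketched in C as below. The name transpose_dest_contig and the contiguous column-major layout are assumptions made for the example; the library routine handles general strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the libgfortran routine.
   src is m x n and dst is n x m, both contiguous and column-major,
   so dst(j, i) = src(i, j). */
void transpose_dest_contig(size_t m, size_t n,
                           const double *src, double *dst)
{
    assert(src != dst);   /* out-of-place transpose only */

    for (size_t i = 0; i < m; i++) {        /* column i of dst */
        for (size_t j = 0; j < n; j++) {
            /* Inner loop: consecutive stores to dst (stride 1),
               strided loads from src (stride m). */
            dst[j + i * n] = src[i + j * m];
        }
    }
}
```

With the loops the other way round, the loads would be stride 1 and the stores would have stride n. Which order wins on a given machine is something to measure rather than assume.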

Could such a patch be offered, and if so, under what circumstances?

As this did not appear within 20 hours, I am sending it again. My apologies if it is waiting for moderator approval.

