This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: matmul, dotprod, transpose performance patch proposal


Janne Blomqvist wrote:

On Tue, Aug 23, 2005 at 11:04:39PM -0700, Tim Prince wrote:


I have been examining performance improvements which can be obtained by first-step optimizations of these intrinsics. I'm sure it's no secret that a performance increase can be obtained by the usual methods:

matmul_r[48]: unroll and jam the stride 1 loops, combining 2 outer loop iterations in one inner loop. For Pentium-M, but not for Xeon, Opteron, or Itanium, a much bigger boost is obtained by using 2 parallel dot products. Change the general stride case to dot product, dictating the strength reduction which gfortran misses. 20% to 100% gain in performance.
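A minimal C sketch of the unroll-and-jam idea for the stride 1 case may make this concrete. It is not the libgfortran code: the function name `matmul_unroll_jam`, the argument layout, and the choice to jam pairs of iterations of the summation loop are all assumptions made for illustration. Arrays are column-major, as Fortran stores them.

```c
#include <stddef.h>

/* Hypothetical helper, not the libgfortran routine.
   C (m x n) = A (m x k) * B (k x n), all column-major with unit stride
   down each column. */
void matmul_unroll_jam(size_t m, size_t n, size_t k,
                       const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; j++) {
        double *cj = c + j * m;              /* column j of C */
        for (size_t i = 0; i < m; i++)
            cj[i] = 0.0;

        size_t l = 0;
        /* Two iterations of the l loop are jammed into one inner loop,
           so each element of column j of C is loaded and stored once
           per pair of columns of A instead of once per column. */
        for (; l + 1 < k; l += 2) {
            const double *a0 = a + l * m;    /* column l of A */
            const double *a1 = a0 + m;       /* column l + 1 of A */
            double b0 = b[l + j * k];
            double b1 = b[l + 1 + j * k];
            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0 + a1[i] * b1;
        }
        /* Remainder iteration when k is odd. */
        if (l < k) {
            const double *a0 = a + l * m;
            double b0 = b[l + j * k];
            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0;
        }
    }
}
```

Note that adding the two products before accumulating changes the order of floating-point additions, so results can differ in the last bits from a plain triple loop.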

dotprod_r[48]: unroll, performing addition in pairs first, before adding to sum accumulation, effectively cutting the performance limitation due to latency of addition in half. Typical 30% performance gain.
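The pairwise addition described above can be sketched in C roughly as follows. The name `dot_pairwise` and the unit-stride signature are assumptions for illustration, not the existing dotprod_r8 code.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of two unit-stride vectors,
   unrolled by two. */
double dot_pairwise(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* The two products are added to each other first, so only one
           addition per pair of elements lies on the dependency chain
           through sum. */
        sum += x[i] * y[i] + x[i + 1] * y[i + 1];
    }
    /* Remainder element when n is odd. */
    if (i < n)
        sum += x[i] * y[i];

    return sum;
}
```

As with any reassociation of floating-point sums, the result may differ slightly in rounding from a strictly sequential accumulation.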



There was some discussion a while ago (late last year, IIRC) where it was proposed to call BLAS instead of spending a lot of work optimizing the current implementations. That would still have to be optional, since not everyone has BLAS, and I think the current implementation will still be needed for integers, which IIRC the standard requires but BLAS doesn't support.



transpose: swap inner and outer loops, so the small stride (normally 1) occurs on the destination rather than the source side, taking advantage of Write Combine buffering, where present. This should be faster, as long as the operand size is not large enough for cache eviction to become important.
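The loop order proposed for transpose might look like the following C sketch. The name `transpose_dst_stride1` and the contiguous column-major layout are assumptions for illustration; the real intrinsic works on array descriptors with arbitrary strides.

```c
#include <stddef.h>

/* Hypothetical helper: dst (n x m) = transpose of src (m x n),
   both column-major and contiguous. */
void transpose_dst_stride1(size_t m, size_t n,
                           const double *src, double *dst)
{
    for (size_t i = 0; i < m; i++) {        /* column i of dst */
        for (size_t j = 0; j < n; j++) {
            /* The inner loop writes consecutive elements of dst
               (stride 1) and reads src with stride m. */
            dst[j + i * n] = src[i + j * m];
        }
    }
}
```

Swapping the two loops gives the opposite arrangement, with stride 1 reads and strided writes.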



Another thing which came up in the BLAS discussion was that code like


MATMUL (TRANSPOSE (A), TRANSPOSE (B))

could be done in a single BLAS call, without creating temporaries for
the transposed arguments.
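The BLAS GEMM routines take a transpose flag for each operand (TRANSA and TRANSB in the Fortran interface), which is what makes a single call possible. To show what "no temporaries" means without depending on a BLAS library, here is a plain C sketch that reads both operands through swapped indices. The name `matmul_tt` and the contiguous column-major layout are assumptions for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: C (m x n) = transpose(A) * transpose(B),
   where A is k x m and B is n x k, all column-major and contiguous.
   Neither transpose is materialized; the indices are swapped on
   access instead. */
void matmul_tt(size_t m, size_t n, size_t k,
               const double *a, const double *b, double *c)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;
            for (size_t l = 0; l < k; l++) {
                /* transpose(A)(i,l) is A(l,i);
                   transpose(B)(l,j) is B(j,l). */
                sum += a[l + i * k] * b[j + l * n];
            }
            c[i + j * m] = sum;
        }
    }
}
```

This is only a reference loop; a tuned BLAS would block and reorder the accesses, and the strided reads of B here are not cache friendly.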



Would such a patch be offered, under what circumstance?



I'm sure that patches that improve performance (without making the code hugely more complicated, IMHO) are very welcome indeed, provided that the necessary paperwork has been done.



My original message and my reply to myself arrived together this evening. I must be more patient with this list, or, as another mentioned, find a more reliable scheme for viewing replies, as apparently I have missed several.
I've seen the related discussions which you mention. If it is realistic to suppose that gfortran could determine when to invoke BLAS, taking into account transposed arguments, of course that opens up interesting opportunities. Then, it may become more important to optimize also for sizes which are too small to perform well in BLAS.
I suppose my paperwork for g77 has become stale, so I would be glad to undertake the update, if that is indicated.


