On Tue, Aug 23, 2005 at 11:04:39PM -0700, Tim Prince wrote:
I have been examining performance improvements which can be obtained by
1st step optimizations of these intrinsics. I'm sure it's no secret,
that a performance increase can be obtained by usual methods:
matmul_r[48]: unroll and jam the stride 1 loops, combining 2 outer loop
iterations in one inner loop. For Pentium-M, but not for Xeon, Opteron,
or Itanium, a much bigger boost is obtained by using 2 parallel dot
products. Change the general stride case to dot product, dictating the
strength reduction which gfortran misses. 20% to 100% gain in performance.
dotprod_r[48]: unroll, performing addition in pairs first, before
adding to sum accumulation, effectively cutting the performance
limitation due to latency of addition in half. Typical 30% performance gain.
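For readers following along, here is a rough C sketch of the two ideas described above. It is not the libgfortran code: the function names (dot_pairs, matmul_jam), the single-precision type, and the contiguous column-major layout are all assumptions made for illustration. Note that adding in pairs changes the floating-point summation order, so results can differ from the plain loop in the last bits.

```c
#include <stddef.h>

/* Dot product unrolled by 2 (hypothetical name). The two products are
   added to each other first, so the dependent chain through `sum` has
   one addition per two elements instead of one per element. */
float dot_pairs(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    for (; i + 2 <= n; i += 2)
        sum += (a[i] * b[i] + a[i + 1] * b[i + 1]);

    if (i < n)                      /* odd length: one element left over */
        sum += a[i] * b[i];

    return sum;
}

/* c(m,n) = a(m,k) * b(k,n), all column-major and contiguous
   (hypothetical name). Two iterations of the outer l loop are jammed
   into a single stride-1 inner loop over i, which halves the number of
   loads and stores of the c column. */
void matmul_jam(float *c, const float *a, const float *b,
                size_t m, size_t n, size_t k)
{
    for (size_t j = 0; j < n; j++) {
        float *cj = c + j * m;
        size_t l = 0;

        for (size_t i = 0; i < m; i++)
            cj[i] = 0.0f;

        for (; l + 2 <= k; l += 2) {
            const float *a0 = a + l * m;        /* column l of a     */
            const float *a1 = a0 + m;           /* column l + 1 of a */
            float b0 = b[l + j * k];
            float b1 = b[l + 1 + j * k];

            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0 + a1[i] * b1;
        }

        if (l < k) {                            /* odd k: last column */
            const float *a0 = a + l * m;
            float b0 = b[l + j * k];

            for (size_t i = 0; i < m; i++)
                cj[i] += a0[i] * b0;
        }
    }
}
```

Whether either form actually pays off depends on the target and the compiler, as the per-CPU differences quoted above suggest, so this would need measuring rather than assuming.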
There was some discussion a while ago (late last year IIRC) where it
was proposed to call BLAS instead of spending a lot of work optimizing
the current implementations. That would still have to be optional,
since not everyone has BLAS, and I think that the current
implementation will still be needed for integers, which IIRC the
standard requires but BLAS doesn't support.
transpose: swap inner and outer loops, so the small stride (normally 1)
occurs on the destination rather than the source side, taking advantage
of Write Combine buffering, where present. This should be faster, as
long as the operand size is not large enough for cache eviction to
become important.
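A minimal sketch of the loop order described above, assuming a contiguous column-major source of shape m x n. The function name transpose_dst_unit is made up for this example and is not the libgfortran routine.

```c
#include <stddef.h>

/* dst (n x m, column-major) = transpose of src (m x n, column-major).
   The inner loop runs over j, so consecutive iterations write
   consecutive elements of dst (stride 1) and read src with stride m. */
void transpose_dst_unit(float *dst, const float *src, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        float *dcol = dst + i * n;              /* column i of dst */

        for (size_t j = 0; j < n; j++)
            dcol[j] = src[i + j * m];
    }
}
```

Swapping the two loops gives the other variant, with stride 1 reads and strided writes, which makes it easy to time both on a given machine.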
Another thing which came up in the BLAS discussion was that code like
MATMUL (TRANSPOSE (A), TRANSPOSE (B))
could be done in a single BLAS call, without creating temporaries for
the transposed arguments.
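To illustrate why no temporaries are needed, here is a small self-contained C sketch that computes the product by swapping the index order when reading the operands. It does not call BLAS: matmul_tt is a hypothetical name and the column-major layout is an assumption. In BLAS itself the same effect comes from passing 'T' for both the TRANSA and TRANSB arguments of xGEMM.

```c
#include <stddef.h>

/* c (m x n) = transpose(a) * transpose(b), where a is stored as k x m
   and b as n x k, all column-major (hypothetical name). Neither
   transposed operand is materialized: element (i,l) of transpose(a) is
   read as a(l,i), and element (l,j) of transpose(b) as b(j,l). */
void matmul_tt(double *c, const double *a, const double *b,
               size_t m, size_t n, size_t k)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            double sum = 0.0;

            for (size_t l = 0; l < k; l++)
                sum += a[l + i * k] * b[j + l * n];

            c[i + j * m] = sum;
        }
    }
}
```

This is only the naive triple loop. The point is the indexing, not the performance.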
Would such a patch be offered, under what circumstance?
I'm sure that patches that improve performance (without making the
code hugely more complicated, IMHO) are very welcome indeed, provided
that the necessary paperwork has been done.