This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.
Re: RFC: optimizing matmul-transpose combinations
On Tuesday 16 November 2004 13:03, Victor Leikehman wrote:
> Paul Brook wrote:
> > I don't understand. AFAICS the generic matmul implementation only does
> > x*y stores. Could you post your implementation of matmul_transpose (or
> > point me at the message if you already have).
>
> Yes, but it traverses the first argument row-wise, causing cache misses.
> My patch http://gcc.gnu.org/ml/fortran/2004-11/msg00097.html trades
> extra stores for fewer cache misses. This is faster than the generic
> implementation, but matmul_transpose is still faster because it BOTH
> traverses the matrices column-wise (saving cache misses) AND does the
> minimal number of stores.
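
The trade-off described in the quoted text can be sketched in C. This is only an illustration under stated assumptions, not the libgfortran code or the patch itself: the function names, the `AT` macro, and the contiguous column-major layout are all hypothetical. The first routine does one store per result element but reads the first argument along rows; the second reads everything column-wise but stores into the result on every inner iteration; the third computes transpose(a)*b, where both inputs are read down columns and each result element is stored once.

```c
#include <stddef.h>

/* Element (r, c) of a contiguous column-major matrix m with ld rows.
   Hypothetical helper, not taken from libgfortran. */
#define AT(m, ld, r, c) ((m)[(size_t)(c) * (ld) + (r)])

/* c = a * b, with a x-by-n and b n-by-y.  One store per element of c
   (x*y stores), but the inner loop walks a row of a, i.e. stride x. */
void matmul_rowwise(double *c, const double *a, const double *b,
                    size_t x, size_t n, size_t y)
{
  for (size_t j = 0; j < y; j++)
    for (size_t i = 0; i < x; i++)
      {
        double sum = 0.0;
        for (size_t k = 0; k < n; k++)
          sum += AT(a, x, i, k) * AT(b, n, k, j);
        AT(c, x, i, j) = sum;
      }
}

/* Same product, but every array is walked down its columns.  The price
   is roughly x*y*n stores into c instead of x*y. */
void matmul_colwise(double *c, const double *a, const double *b,
                    size_t x, size_t n, size_t y)
{
  for (size_t j = 0; j < y; j++)
    {
      for (size_t i = 0; i < x; i++)
        AT(c, x, i, j) = 0.0;
      for (size_t k = 0; k < n; k++)
        {
          double bkj = AT(b, n, k, j);
          for (size_t i = 0; i < x; i++)
            AT(c, x, i, j) += AT(a, x, i, k) * bkj;
        }
    }
}

/* c = transpose(a) * b, with a n-by-x and b n-by-y.  The inner loop
   reads a column of a and a column of b, and c gets x*y stores. */
void matmul_transpose_sketch(double *c, const double *a, const double *b,
                             size_t x, size_t n, size_t y)
{
  for (size_t j = 0; j < y; j++)
    for (size_t i = 0; i < x; i++)
      {
        double sum = 0.0;
        for (size_t k = 0; k < n; k++)
          sum += AT(a, n, k, i) * AT(b, n, k, j);
        AT(c, x, i, j) = sum;
      }
}
```

All three produce the same values when the third is handed the transposed copy of the first argument; they differ only in memory access pattern and store count.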
Ah, I see. In that case you seem to have shot yourself in the foot with your
patch :-)
Your matmul_transpose looks like the same algorithm as is used by the "old"
generic transpose routine. Have you tried swapped strides with the old
transpose routine? You've already created a duplicate set of loops in the
generic transpose routine. Why not just add a third one that handles the
swapped-strides case?
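
To make the swapped-strides suggestion concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: the `view2d` descriptor, `transpose_view`, and `matmul_strided` are hypothetical names, not the libgfortran array descriptor or its matmul routine, and strides are taken to be non-negative element counts. The point is only that a transposed argument is the same data with its two strides exchanged, so a stride-aware routine can recognise that case and run a loop whose inner reads are unit stride.

```c
#include <stddef.h>

/* Hypothetical 2-D view: element (r, c) lives at
   data[r * rstride + c * cstride]. */
typedef struct
{
  const double *data;
  size_t rows, cols;
  size_t rstride, cstride;
} view2d;

/* Transposing a view copies nothing: swap the extents and the strides. */
view2d transpose_view(view2d v)
{
  view2d t = { v.data, v.cols, v.rows, v.cstride, v.rstride };
  return t;
}

/* dest = a * b, with dest a contiguous column-major x-by-y array. */
void matmul_strided(double *dest, view2d a, view2d b)
{
  size_t x = a.rows, n = a.cols, y = b.cols;

  if (a.cstride == 1 && b.rstride == 1)
    {
      /* Swapped-strides case: a is the transpose of a column-major
         array, so both inner reads below are unit stride. */
      for (size_t j = 0; j < y; j++)
        for (size_t i = 0; i < x; i++)
          {
            const double *ap = a.data + i * a.rstride;
            const double *bp = b.data + j * b.cstride;
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
              sum += ap[k] * bp[k];
            dest[i + j * x] = sum;
          }
    }
  else
    {
      /* Generic loop: arbitrary strides on both arguments. */
      for (size_t j = 0; j < y; j++)
        for (size_t i = 0; i < x; i++)
          {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
              sum += a.data[i * a.rstride + k * a.cstride]
                     * b.data[k * b.rstride + j * b.cstride];
            dest[i + j * x] = sum;
          }
    }
}
```

With this shape, matmul(transpose(m), b) becomes `matmul_strided(dest, transpose_view(m), b)` and needs no separate entry point. Whether the dedicated branch actually beats the generic one is the kind of thing that would have to be measured.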
On a related note, do the special-case unit-stride loops give a measurable
speedup compared to the generic loop?
Paul