This is the mail archive of the fortran@gcc.gnu.org mailing list for the GNU Fortran project.



Re: RFC: optimizing matmul-transpose combinations


On Tuesday 16 November 2004 13:03, Victor Leikehman wrote:
> Paul Brook wrote:
> > I don't understand.  AFAICS the generic matmul implementation only does
> > x*y stores.  Could you post your implementation of matmul_transpose (or
> > point me at the message if you already have).
>
> Yes, but it traverses the first argument row-wise, causing cache misses.
> My patch  http://gcc.gnu.org/ml/fortran/2004-11/msg00097.html  accepts
> extra stores in exchange for fewer cache misses.  This is faster than the
> generic implementation, but matmul_transpose is still faster because it
> BOTH traverses the matrices column-wise (saving cache misses) AND does the
> minimal number of stores.
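
The trade-off described in the quoted text can be illustrated with a minimal sketch. It assumes contiguous column-major `double` arrays, and the names `matmul_dot` and `matmul_axpy` are hypothetical; neither function is taken from libgfortran or from the patch linked above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: c (m x n) = a (m x k) * b (k x n), column-major.
   Dot-product order: one store per element of c (m*n in total), but the
   inner loop steps through a with stride m, i.e. along a row.  */
static void
matmul_dot (double *c, const double *a, const double *b,
            size_t m, size_t n, size_t k)
{
  assert (c != NULL && a != NULL && b != NULL);
  for (size_t j = 0; j < n; j++)
    for (size_t i = 0; i < m; i++)
      {
        double sum = 0.0;
        for (size_t l = 0; l < k; l++)
          sum += a[i + l * m] * b[l + j * k];
        c[i + j * m] = sum;
      }
}

/* Column-update order: the inner loop walks a and c with unit stride,
   but every inner iteration stores to c (about m*n*k stores).  */
static void
matmul_axpy (double *c, const double *a, const double *b,
             size_t m, size_t n, size_t k)
{
  assert (c != NULL && a != NULL && b != NULL);
  for (size_t j = 0; j < n; j++)
    {
      for (size_t i = 0; i < m; i++)
        c[i + j * m] = 0.0;
      for (size_t l = 0; l < k; l++)
        {
          const double blj = b[l + j * k];
          for (size_t i = 0; i < m; i++)
            c[i + j * m] += a[i + l * m] * blj;
        }
    }
}
```

Both functions compute the same product; they differ only in which array is walked with unit stride and in how often `c` is written.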

Ah, I see. In that case you seem to have shot yourself in the foot with your
patch :-)

Your matmul_transpose looks like the same algorithm as is used by the "old"
generic transpose routine. Have you tried swapped strides with the old
transpose routine? You've already created a duplicate set of loops in the
generic transpose routine. Why not just add a third one that handles the
swapped-strides case?
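
To make the swapped-strides idea concrete, here is a rough sketch under stated assumptions: column-major `double` data, a contiguous result, and a hypothetical function `matmul_strided` that takes explicit element strides. It is not the libgfortran routine or its signature, only an illustration of how a transposed operand can be expressed by exchanging its two strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: c (m x n, contiguous column-major) = a * b, where
   a is m x k and b is k x n as seen through their strides.
   a_s0/a_s1 are the element strides of a along its first and second
   dimension; likewise b_s0/b_s1 for b.  */
static void
matmul_strided (double *c, const double *a, const double *b,
                size_t m, size_t n, size_t k,
                size_t a_s0, size_t a_s1,
                size_t b_s0, size_t b_s1)
{
  assert (c != NULL && a != NULL && b != NULL);
  for (size_t j = 0; j < n; j++)
    for (size_t i = 0; i < m; i++)
      {
        double sum = 0.0;
        for (size_t l = 0; l < k; l++)
          sum += a[i * a_s0 + l * a_s1] * b[l * b_s0 + j * b_s1];
        c[i + j * m] = sum;   /* one store per result element */
      }
}
```

For a plain product the caller would pass `a_s0 = 1, a_s1 = m`. For `transpose(A) * B`, with `A` stored as a k x m column-major array, the caller passes `a_s0 = k, a_s1 = 1` instead. The inner loop then walks both operands with unit stride and still stores each result element once, so no separate copy of the transposed matrix is needed.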

On a related note, do the special-case unit-stride loops give a measurable
speedup compared to the generic loop?

Paul

