This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug fortran/79930] Potentially Missed Optimisation for MATMUL / DOT_PRODUCT


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79930

--- Comment #8 from Adam Hirst <adam at aphirst dot karoo.co.uk> ---
Ah, it seems that Jerry was tinkering with tp_array.f90 (the intrinsic-array
version of the Vector type), while I was working with tp_xyz.f90 (explicit
separate elements). I was going to remark on how he didn't need -flto to get
any of the matmul paths performing better than the DO/SUM paths.

I'm curious whether he can reproduce my results on his system, but first I'll
reproduce his.

1) When I use his modified TP_LEFT and compile under -O2 alone, I see, as he
does, that the matmul path is faster than the DO/SUM path. The margin is
smaller for me, but I expect that varies from system to system.

2) I notice that he moved the matmul() calls out of the dot_product() calls,
but didn't move the D%vec accesses out of matmul(). If I do the same in
tp_xyz.f90 and recompile with just -O2, I get the same kind of performance
boost as Jerry does.

What do you think the reason could be that:

    Dx = D%x
    Dy = D%y
    Dz = D%z
    NUDx = matmul(NU, Dx)
    NUDy = matmul(NU, Dy)
    NUDz = matmul(NU, Dz)
    tensorproduct%x = ...

performs so much worse with -O2 than

    NUDx = matmul(NU, D%x)
    NUDy = matmul(NU, D%y)
    NUDz = matmul(NU, D%z)
    tensorproduct%x = ...

that the former needs -flto to be able to compete?

---

It's probably important that we stay clear about which version of the Vector
type we're running the tests against, since (as someone, probably Jerry,
commented to me earlier) array-stride shenanigans are bound to play some role.
