[Bug rtl-optimization/42612] post-increment addressing not used
pinskia at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Jul 12 05:39:59 GMT 2022
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42612
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
(In reply to Dmitry Baksheev from comment #6)
> Please consider fixing this issue. Here is another example where not using
> post-increment for loops produces suboptimal code on AArch64. The code is 4x
> slower than LLVM-generated code for dot-product function:
>
> double dotprod(std::size_t n,
>                const double* __restrict__ a,
>                const double* __restrict__ b)
> {
>     double ans = 0;
> #if __clang__
> #pragma clang loop vectorize(assume_safety)
> #else
> #pragma GCC ivdep
> #endif
>     for (std::size_t i = 0; i < n; ++i) {
>         ans += a[i] * b[i];
>     }
>     return ans;
> }
>
>
> Compile with: $(CXX) -march=armv8.2-a -O3 dp.cpp
>
> GCC-generated loop does not have post-increment loads:
> .L3:
>         ldr     d2, [x1, x3, lsl 3]
>         ldr     d1, [x2, x3, lsl 3]
>         add     x3, x3, 1
>         fmadd   d0, d2, d1, d0
>         cmp     x0, x3
>         bne     .L3
>
> Clang emits this:
> .LBB0_4:
>         ldr     d1, [x10], #8
>         ldr     d2, [x8], #8
>         subs    x9, x9, #1
>         fmadd   d0, d1, d2, d0
>         b.ne    .LBB0_4
I suspect that is a different issue, and I suspect it is a target cost issue that really depends on the core: on some cores the separate add is better.