[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instructions
wilco at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Jan 29 13:18:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
Wilco <wilco at gcc dot gnu.org> changed:
           What            |Removed                 |Added
----------------------------------------------------------------------------
                         CC|                        |wilco at gcc dot gnu.org
--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
> float32x4_t __c, const int __lane)
> {
> return __builtin_aarch64_fmav4sf (__b,
> __aarch64_vdupq_laneq_f32 (__c, __lane),
> __a);
> }
>
> thus leading to unoptimized code as:
>
> ldr q1, [x2, 16]
> dup v28.4s, v1.s[0]
> dup v27.4s, v1.s[1]
> dup v26.4s, v1.s[2]
> dup v1.4s, v1.s[3]
> fmla v22.4s, v25.4s, v28.4s
> fmla v3.4s, v25.4s, v27.4s
> fmla v6.4s, v25.4s, v26.4s
> fmla v17.4s, v25.4s, v1.4s
>
> instead of:
>
> ldr q1, [x2, 16]
> fmla v22.4s, v25.4s, v1.s[0]
> fmla v3.4s, v25.4s, v1.s[1]
> fmla v6.4s, v25.4s, v1.s[2]
> fmla v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.
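For readers unfamiliar with the intrinsic, vfmaq_laneq_f32 (a, b, c, lane) performs a fused multiply-add of b against the single lane c[lane], broadcast across all four vector elements. A portable scalar sketch of the semantics (plain C, no NEON; the model function name is made up for illustration):

```c
#include <assert.h>

/* Scalar model of the NEON intrinsic vfmaq_laneq_f32 (a, b, c, lane):
   r[i] = a[i] + b[i] * c[lane] for each of the four 32-bit float lanes.
   Illustrative only -- the real intrinsic operates on float32x4_t
   vectors and requires a compile-time-constant lane index, which is
   what allows the compiler to fold the broadcast into the fmla's
   by-element operand instead of emitting a separate dup. */
static void
fmaq_laneq_f32_model (float r[4], const float a[4], const float b[4],
                      const float c[4], int lane)
{
  for (int i = 0; i < 4; i++)
    r[i] = a[i] + b[i] * c[lane];
}
```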
Which compiler version did you use? I tried this on GCC 6, 7, 8 and 9 with -O2:
#include "arm_neon.h"

float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}
fmla v0.4s, v1.4s, v2.4s[0]
fmla v0.4s, v1.4s, v2.4s[1]
ret
In all cases the optimizer is able to merge the dups as expected.
If it still fails for you, could you provide a compilable example like above
that shows the issue?
> For the record, I managed to partly workaround this issue by writing my own
> version as:
>
>   if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
I'd strongly advise against using inline assembler: most people make mistakes
writing it, and GCC cannot optimize or schedule code that contains it.