vfmaq_laneq_f32 is currently implemented as:

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
                 float32x4_t __c, const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
                                    __aarch64_vdupq_laneq_f32 (__c, __lane),
                                    __a);
}

thus leading to unoptimized code such as:

        ldr     q1, [x2, 16]
        dup     v28.4s, v1.s[0]
        dup     v27.4s, v1.s[1]
        dup     v26.4s, v1.s[2]
        dup     v1.4s, v1.s[3]
        fmla    v22.4s, v25.4s, v28.4s
        fmla    v3.4s, v25.4s, v27.4s
        fmla    v6.4s, v25.4s, v26.4s
        fmla    v17.4s, v25.4s, v1.4s

instead of:

        ldr     q1, [x2, 16]
        fmla    v22.4s, v25.4s, v1.s[0]
        fmla    v3.4s, v25.4s, v1.s[1]
        fmla    v6.4s, v25.4s, v1.s[2]
        fmla    v17.4s, v25.4s, v1.s[3]

I suspect several other *lane* intrinsics exhibit the same shortcoming.

For the record, I managed to partly work around this issue by writing my own version as:

  if      (LaneID==0) asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
  else if (LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
  else if (LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
  else if (LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );

but that is of course not ideal. This change yields a 32% speedup in Eigen's matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
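For illustration, the workaround above can be packaged as a small helper whose lane index is a compile-time constant (e.g. a template parameter), so only the matching branch survives. This is a minimal sketch under that assumption; the name fmla_laneq and its shape are made up here, it is not Eigen's actual code nor part of arm_neon.h:

#include <arm_neon.h>

// Sketch of a lane-indexed FMLA helper built on the inline-asm workaround
// above.  LaneID must be a compile-time constant (here a template
// parameter); the dead branches fold away, so each instantiation is a
// single fmla-by-element with no separate dup.
template <int LaneID>
static inline float32x4_t fmla_laneq (float32x4_t c, float32x4_t a, float32x4_t b)
{
  static_assert (LaneID >= 0 && LaneID <= 3, "lane index must be in 0..3");
  if      (LaneID == 0) asm("fmla %0.4s, %1.4s, %2.s[0]" : "+w" (c) : "w" (a), "w" (b));
  else if (LaneID == 1) asm("fmla %0.4s, %1.4s, %2.s[1]" : "+w" (c) : "w" (a), "w" (b));
  else if (LaneID == 2) asm("fmla %0.4s, %1.4s, %2.s[2]" : "+w" (c) : "w" (a), "w" (b));
  else                  asm("fmla %0.4s, %1.4s, %2.s[3]" : "+w" (c) : "w" (a), "w" (b));
  return c;  // c += a * b[LaneID], per 32-bit lane
}

Usage would then look like c0 = fmla_laneq<0>(c0, a0, b0); in place of the intrinsic call.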
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
>                  float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
>                                     __aarch64_vdupq_laneq_f32 (__c, __lane),
>                                     __a);
> }
>
> thus leading to unoptimized code such as:
>
>         ldr     q1, [x2, 16]
>         dup     v28.4s, v1.s[0]
>         dup     v27.4s, v1.s[1]
>         dup     v26.4s, v1.s[2]
>         dup     v1.4s, v1.s[3]
>         fmla    v22.4s, v25.4s, v28.4s
>         fmla    v3.4s, v25.4s, v27.4s
>         fmla    v6.4s, v25.4s, v26.4s
>         fmla    v17.4s, v25.4s, v1.4s
>
> instead of:
>
>         ldr     q1, [x2, 16]
>         fmla    v22.4s, v25.4s, v1.s[0]
>         fmla    v3.4s, v25.4s, v1.s[1]
>         fmla    v6.4s, v25.4s, v1.s[2]
>         fmla    v17.4s, v25.4s, v1.s[3]
>
> I suspect several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:

#include "arm_neon.h"

float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

        fmla    v0.4s, v1.4s, v2.4s[0]
        fmla    v0.4s, v1.4s, v2.4s[1]
        ret

In all cases the optimizer is able to merge the dups as expected. If it still fails for you, could you provide a compilable example like the one above that shows the issue?

> For the record, I managed to partly work around this issue by writing my own
> version as:
>
>   if      (LaneID==0) asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if (LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if (LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
>   else if (LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that is of course not ideal. This change yields a 32% speedup in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against inline assembler: most people make mistakes writing it, and GCC cannot optimize across it.
Indeed, it fails to remove the dup only if the coefficient is used multiple times, as in the following reduced example (https://godbolt.org/z/hmSaE0):

#include <arm_neon.h>

void foo(const float* a, const float* b, float* c, int n) {
  float32x4_t c0, c1, c2, c3;
  c0 = vld1q_f32(c+0*4);
  c1 = vld1q_f32(c+1*4);
  for(int k=0; k<n; k++)
  {
    float32x4_t a0 = vld1q_f32(a+0*4+k*4);
    float32x4_t b0 = vld1q_f32(b+k*4);
    c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
    c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
  }
  vst1q_f32(c+0*4, c0);
  vst1q_f32(c+1*4, c1);
}

I tested with GCC 7 and 8.
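As a point of comparison, the same loop written with a hand-rolled lane-0 FMLA (the inline-asm approach from comment #0) sidesteps the missed combine on GCC 7/8, since the lane index is encoded directly in the instruction. This is only a sketch; fmla_lane0 and foo_workaround are made-up names, not arm_neon.h intrinsics:

#include <arm_neon.h>

// Lane-0-only variant of the inline-asm workaround from comment #0:
// the lane is hard-coded in the fmla, so no dup is emitted no matter
// how many accumulators reuse the same coefficient.
static inline float32x4_t fmla_lane0 (float32x4_t c, float32x4_t a, float32x4_t b)
{
  asm("fmla %0.4s, %1.4s, %2.s[0]" : "+w" (c) : "w" (a), "w" (b));
  return c;
}

void foo_workaround(const float* a, const float* b, float* c, int n) {
  float32x4_t c0 = vld1q_f32(c + 0*4);
  float32x4_t c1 = vld1q_f32(c + 1*4);
  for (int k = 0; k < n; k++)
  {
    float32x4_t a0 = vld1q_f32(a + k*4);
    float32x4_t b0 = vld1q_f32(b + k*4);
    c0 = fmla_lane0(c0, a0, b0);
    c1 = fmla_lane0(c1, a0, b0);
  }
  vst1q_f32(c + 0*4, c0);
  vst1q_f32(c + 1*4, c1);
}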
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times, as in the following reduced example (https://godbolt.org/z/hmSaE0):
>
> #include <arm_neon.h>
>
> void foo(const float* a, const float* b, float* c, int n) {
>   float32x4_t c0, c1, c2, c3;
>   c0 = vld1q_f32(c+0*4);
>   c1 = vld1q_f32(c+1*4);
>   for(int k=0; k<n; k++)
>   {
>     float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>     float32x4_t b0 = vld1q_f32(b+k*4);
>     c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>     c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>   }
>   vst1q_f32(c+0*4, c0);
>   vst1q_f32(c+1*4, c1);
> }
>
> I tested with GCC 7 and 8.

Confirmed for GCC 8; fixed on trunk. I tried the above example with up to 4 uses, and it always generates the expected code on trunk. So this is fixed for GCC 9; however, it seems unlikely that the fix (multi-use support in combine) can be backported.
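For reference, the "up to 4 uses" check mentioned above was presumably along these lines; this is a reconstruction for illustration (hypothetical function name foo4), not the exact test used. On a fixed compiler each fmla in the loop should reference the lane directly, with no dup:

#include <arm_neon.h>

// Sketch of a 4-use variant of the reduced example: the same lane of b0
// feeds four independent accumulators, so a non-folded dup would have
// four uses.
void foo4(const float* a, const float* b, float* c, int n) {
  float32x4_t c0 = vld1q_f32(c + 0*4);
  float32x4_t c1 = vld1q_f32(c + 1*4);
  float32x4_t c2 = vld1q_f32(c + 2*4);
  float32x4_t c3 = vld1q_f32(c + 3*4);
  for (int k = 0; k < n; k++)
  {
    float32x4_t a0 = vld1q_f32(a + k*4);
    float32x4_t b0 = vld1q_f32(b + k*4);
    c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
    c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
    c2 = vfmaq_laneq_f32(c2, a0, b0, 0);
    c3 = vfmaq_laneq_f32(c3, a0, b0, 0);
  }
  vst1q_f32(c + 0*4, c0);
  vst1q_f32(c + 1*4, c1);
  vst1q_f32(c + 2*4, c2);
  vst1q_f32(c + 3*4, c3);
}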
Good to know this is fixed in trunk! Thank you, and sorry for the false alarm then.
Fixed on trunk (aka gcc-9). Not a regression, so no backport.