[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instructions

wilco at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jan 29 13:18:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
> 
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
> 	         float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
> 				    __aarch64_vdupq_laneq_f32 (__c, __lane),
> 				    __a);
> }
> 
> thus leading to unoptimized code as:
> 
>         ldr	q1, [x2, 16]
> 	dup	v28.4s, v1.s[0]
> 	dup	v27.4s, v1.s[1]
> 	dup	v26.4s, v1.s[2]
> 	dup	v1.4s, v1.s[3]
> 	fmla	v22.4s, v25.4s, v28.4s
> 	fmla	v3.4s, v25.4s, v27.4s
> 	fmla	v6.4s, v25.4s, v26.4s
> 	fmla	v17.4s, v25.4s, v1.4s
> 
> instead of:
> 
>         ldr	q1, [x2, 16]
> 	fmla	v22.4s, v25.4s, v1.s[0]
> 	fmla	v3.4s, v25.4s, v1.s[1]
> 	fmla	v6.4s, v25.4s, v1.s[2]
> 	fmla	v17.4s, v25.4s, v1.s[3]
> 
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:

#include "arm_neon.h"
float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

        fmla    v0.4s, v1.4s, v2.4s[0]
        fmla    v0.4s, v1.4s, v2.4s[1]
        ret

In all cases the optimizer is able to merge the dups as expected.

If it still fails for you, could you provide a compilable example like above
that shows the issue?

> For the record, I managed to partly workaround this issue by writing my own
> version as:
> 
>     if(LaneID==0)       asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==1)  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==2)  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==3)  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
> 
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler: most people make mistakes
writing it, and GCC cannot optimize code that uses it.
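For comparison, here is a minimal intrinsics-only sketch (the helper name and
accumulator layout are illustrative, not taken from Eigen): as long as the lane
index is a compile-time constant, the plain intrinsic should already give the
by-element fmla form, so no inline assembler is needed:

#include "arm_neon.h"

/* Illustrative helper: accumulate one broadcast column into four
   accumulators using compile-time lane indices.  With -O2 each call
   should map to a single by-element fmla instruction. */
static inline void
fma_by_lane (float32x4_t b, float32x4_t c, float32x4_t acc[4])
{
  acc[0] = vfmaq_laneq_f32 (acc[0], b, c, 0);
  acc[1] = vfmaq_laneq_f32 (acc[1], b, c, 1);
  acc[2] = vfmaq_laneq_f32 (acc[2], b, c, 2);
  acc[3] = vfmaq_laneq_f32 (acc[3], b, c, 3);
}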

