[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instructions

wilco at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jan 29 13:18:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
> 
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
> 	         float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
> 				    __aarch64_vdupq_laneq_f32 (__c, __lane),
> 				    __a);
> }
> 
> thus leading to unoptimized code as:
> 
>         ldr	q1, [x2, 16]
> 	dup	v28.4s, v1.s[0]
> 	dup	v27.4s, v1.s[1]
> 	dup	v26.4s, v1.s[2]
> 	dup	v1.4s, v1.s[3]
> 	fmla	v22.4s, v25.4s, v28.4s
> 	fmla	v3.4s, v25.4s, v27.4s
> 	fmla	v6.4s, v25.4s, v26.4s
> 	fmla	v17.4s, v25.4s, v1.4s
> 
> instead of:
> 
>         ldr	q1, [x2, 16]
> 	fmla	v22.4s, v25.4s, v1.s[0]
> 	fmla	v3.4s, v25.4s, v1.s[1]
> 	fmla	v6.4s, v25.4s, v1.s[2]
> 	fmla	v17.4s, v25.4s, v1.s[3]
> 
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:

#include "arm_neon.h"
float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

        fmla    v0.4s, v1.4s, v2.4s[0]
        fmla    v0.4s, v1.4s, v2.4s[1]
        ret

In all cases the optimizer is able to merge the dups as expected.

If it still fails for you, could you provide a compilable example like above
that shows the issue?

> For the record, I managed to partly workaround this issue by writing my own
> version as:
> 
>     if(LaneID==0)       asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==1)  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==2)  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
>     else if(LaneID==3)  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
> 
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler: most people make mistakes
writing it, and GCC cannot optimize code that uses it.
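For comparison, here is a minimal intrinsics-only sketch (the helper name and
accumulator layout are illustrative, not taken from Eigen): as long as the lane
index is a compile-time constant, the plain intrinsic should already give the
by-element fmla form, so no inline assembler is needed:

#include "arm_neon.h"

/* Illustrative helper: accumulate one broadcast column into four
   accumulators using compile-time lane indices.  With -O2 each call
   should map to a single by-element fmla instruction. */
static inline void
fma_by_lane (float32x4_t b, float32x4_t c, float32x4_t acc[4])
{
  acc[0] = vfmaq_laneq_f32 (acc[0], b, c, 0);
  acc[1] = vfmaq_laneq_f32 (acc[1], b, c, 1);
  acc[2] = vfmaq_laneq_f32 (acc[2], b, c, 2);
  acc[3] = vfmaq_laneq_f32 (acc[3], b, c, 3);
}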

