Bug 89101 - [AArch64] vfmaq_laneq_f32 generates unnecessary dup instructions
Summary: [AArch64] vfmaq_laneq_f32 generates unnecessary dup instructions
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.2.0
Importance: P3 normal
Target Milestone: 9.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-29 12:56 UTC by Gael Guennebaud
Modified: 2019-02-25 15:51 UTC

See Also:
Host:
Target:
Build:
Known to work: 9.0
Known to fail: 8.2.0
Last reconfirmed: 2019-01-29 00:00:00


Description Gael Guennebaud 2019-01-29 12:56:41 UTC
vfmaq_laneq_f32 is currently implemented as:

__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
	         float32x4_t __c, const int __lane)
{
  return __builtin_aarch64_fmav4sf (__b,
				    __aarch64_vdupq_laneq_f32 (__c, __lane),
				    __a);
}
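For reference, a scalar model of what vfmaq_laneq_f32 computes (plain C, no NEON required, so it runs anywhere; the function name is hypothetical and this only illustrates the semantics, not GCC's implementation):

```c
#include <assert.h>

/* Scalar model of vfmaq_laneq_f32: r[i] = a[i] + b[i] * c[lane].
 * The real intrinsic does this as a single fmla on a 128-bit vector
 * register; this loop is for illustration only. */
static void fmaq_laneq_f32_model(float r[4], const float a[4],
                                 const float b[4], const float c[4],
                                 int lane)
{
    for (int i = 0; i < 4; i++)
        r[i] = a[i] + b[i] * c[lane];
}
```

The issue in this report is purely about how the lane selection is encoded in the emitted assembly (a separate dup versus fmla-by-element); the computed result is the same either way.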

thus leading to unoptimized code such as:

        ldr	q1, [x2, 16]
	dup	v28.4s, v1.s[0]
	dup	v27.4s, v1.s[1]
	dup	v26.4s, v1.s[2]
	dup	v1.4s, v1.s[3]
	fmla	v22.4s, v25.4s, v28.4s
	fmla	v3.4s, v25.4s, v27.4s
	fmla	v6.4s, v25.4s, v26.4s
	fmla	v17.4s, v25.4s, v1.4s

instead of:

        ldr	q1, [x2, 16]
	fmla	v22.4s, v25.4s, v1.s[0]
	fmla	v3.4s, v25.4s, v1.s[1]
	fmla	v6.4s, v25.4s, v1.s[2]
	fmla	v17.4s, v25.4s, v1.s[3]

I suspect several other *lane* intrinsics exhibit the same shortcoming.

For the record, I managed to partially work around this issue by writing my own version:

         if(LaneID==0)  asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==1)  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==2)  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) :  );
    else if(LaneID==3)  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) :  );

but that's of course not ideal. This change yields a 32% speed up in Eigen's matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
Comment 1 Wilco 2019-01-29 13:18:23 UTC
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
> 
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
> 	         float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
> 				    __aarch64_vdupq_laneq_f32 (__c, __lane),
> 				    __a);
> }
> 
> thus leading to unoptimized code as:
> 
>         ldr	q1, [x2, 16]
> 	dup	v28.4s, v1.s[0]
> 	dup	v27.4s, v1.s[1]
> 	dup	v26.4s, v1.s[2]
> 	dup	v1.4s, v1.s[3]
> 	fmla	v22.4s, v25.4s, v28.4s
> 	fmla	v3.4s, v25.4s, v27.4s
> 	fmla	v6.4s, v25.4s, v26.4s
> 	fmla	v17.4s, v25.4s, v1.4s
> 
> instead of:
> 
>         ldr	q1, [x2, 16]
> 	fmla	v22.4s, v25.4s, v1.s[0]
> 	fmla	v3.4s, v25.4s, v1.s[1]
> 	fmla	v6.4s, v25.4s, v1.s[2]
> 	fmla	v17.4s, v25.4s, v1.s[3]
> 
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC6, 7, 8, and 9 with -O2:

#include "arm_neon.h"
float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

	fmla	v0.4s, v1.4s, v2.4s[0]
	fmla	v0.4s, v1.4s, v2.4s[1]
	ret

In all cases the optimizer is able to merge the dups as expected.

If it still fails for you, could you provide a compilable example like above that shows the issue?

> For the record, I managed to partly workaround this issue by writing my own
> version as:
> 
>          if(LaneID==0)  asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w"
> (a), "w" (b) :  );
>     else if(LaneID==1)  asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w"
> (a), "w" (b) :  );
>     else if(LaneID==2)  asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w"
> (a), "w" (b) :  );
>     else if(LaneID==3)  asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w"
> (a), "w" (b) :  );
> 
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler: most people make mistakes writing it, and GCC cannot optimize across inline assembler blocks.
Comment 2 Gael Guennebaud 2019-01-29 14:05:32 UTC
Indeed, it fails to remove the dup only if the coefficient is used multiple times, as in the following reduced example: (https://godbolt.org/z/hmSaE0)


#include <arm_neon.h>

void foo(const float* a, const float * b, float * c, int n) {
    float32x4_t c0, c1, c2, c3;
    c0 = vld1q_f32(c+0*4);
    c1 = vld1q_f32(c+1*4);
    for(int k=0; k<n; k++)
    {
        float32x4_t a0 = vld1q_f32(a+0*4+k*4);
        float32x4_t b0 = vld1q_f32(b+k*4);
        c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
        c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
    }
    vst1q_f32(c+0*4, c0);
    vst1q_f32(c+1*4, c1);
}


I tested with gcc 7 and 8.
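As a sanity check, here is a scalar equivalent of the reduced example above (plain C, hypothetical name foo_scalar; it mirrors the example's arithmetic, accumulating lane 0 of each b block into both c0 and c1):

```c
/* Scalar equivalent of foo() above: both 4-wide accumulators receive
 * the k-th block of a scaled by lane 0 of the k-th block of b,
 * matching the two vfmaq_laneq_f32 calls in the reduced example. */
static void foo_scalar(const float *a, const float *b, float *c, int n)
{
    for (int k = 0; k < n; k++)
        for (int i = 0; i < 4; i++) {
            float p = a[k * 4 + i] * b[k * 4 + 0];
            c[0 * 4 + i] += p;  /* c0 */
            c[1 * 4 + i] += p;  /* c1 */
        }
}
```

Whether the compiler emits dup + fmla or fmla-by-element, the stored results must match this reference; only the instruction count differs.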
Comment 3 Wilco 2019-01-29 14:22:25 UTC
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times as in the following reduced exemple: (https://godbolt.org/z/hmSaE0)
> 
> 
> #include <arm_neon.h>
> 
> void foo(const float* a, const float * b, float * c, int n) {
>     float32x4_t c0, c1, c2, c3;
>     c0 = vld1q_f32(c+0*4);
>     c1 = vld1q_f32(c+1*4);
>     for(int k=0; k<n; k++)
>     {
>         float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>         float32x4_t b0 = vld1q_f32(b+k*4);
>         c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>         c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>     }
>     vst1q_f32(c+0*4, c0);
>     vst1q_f32(c+1*4, c1);
> }
> 
> 
> I tested with gcc 7 and 8.

Confirmed for GCC8; fixed on trunk. I tried the above example with up to 4 uses and it always generates the expected code on trunk. So this is fixed for GCC9; however, it seems unlikely that the fix (multi-use support in the combine pass) can be backported.
Comment 4 Gael Guennebaud 2019-01-29 14:26:57 UTC
Good to know this is fixed in trunk! Thank you, and sorry for the false alarm then.
Comment 5 Richard Earnshaw 2019-02-25 15:51:06 UTC
Fixed on trunk (aka gcc-9).  Not a regression, so no backport.