[Bug target/89101] [Aarch64] vfmaq_laneq_f32 generates unnecessary dup instructions

wilco at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Jan 29 14:22:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101

Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|WAITING                     |NEW
      Known to work|                            |9.0
            Version|unknown                     |8.2.0
   Target Milestone|---                         |9.0
      Known to fail|                            |8.2.0

--- Comment #3 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #2)
> Indeed, it fails to remove the dup only if the coefficient is used multiple
> times as in the following reduced example: (https://godbolt.org/z/hmSaE0)
> 
> 
> #include <arm_neon.h>
> 
> void foo(const float* a, const float * b, float * c, int n) {
>     float32x4_t c0, c1, c2, c3;
>     c0 = vld1q_f32(c+0*4);
>     c1 = vld1q_f32(c+1*4);
>     for(int k=0; k<n; k++)
>     {
>         float32x4_t a0 = vld1q_f32(a+0*4+k*4);
>         float32x4_t b0 = vld1q_f32(b+k*4);
>         c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
>         c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
>     }
>     vst1q_f32(c+0*4, c0);
>     vst1q_f32(c+1*4, c1);
> }
> 
> 
> I tested with gcc 7 and 8.

Confirmed for GCC8, fixed on trunk. I tried the above example with up to 4 uses
of the coefficient and it always generates the expected code (no dup) on trunk.
So this is fixed for GCC9; however, it seems unlikely that the fix (multi-use
support in Combine) could be backported.
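
For reference, a rough scalar sketch of what the 4-use variant computes. This
is not the NEON code and not the exact test that was run: f32x4_model,
fma_laneq_model, load4, store4 and foo4_model are hypothetical names made up
for illustration. It only models the semantics of vfmaq_laneq_f32
(acc[i] + a[i] * b[lane]), which is why a by-element fmla (e.g.
fmla v0.4s, v1.4s, v2.s[0]) is sufficient and a separate dup of b0[0] is
redundant.

```c
/* Hypothetical scalar stand-in for float32x4_t; NOT the arm_neon.h type. */
typedef struct { float v[4]; } f32x4_model;

/* Scalar model of vfmaq_laneq_f32(acc, a, b, lane):
   result[i] = acc[i] + a[i] * b[lane].
   The real instruction is fused (single rounding); this model may round
   twice, so results can differ in the last bit for inexact inputs. */
static f32x4_model fma_laneq_model(f32x4_model acc, f32x4_model a,
                                   f32x4_model b, int lane)
{
    for (int i = 0; i < 4; i++)
        acc.v[i] = acc.v[i] + a.v[i] * b.v[lane];
    return acc;
}

/* Models of vld1q_f32 / vst1q_f32: load or store four consecutive floats. */
static f32x4_model load4(const float *p)
{
    f32x4_model x;
    for (int i = 0; i < 4; i++)
        x.v[i] = p[i];
    return x;
}

static void store4(float *p, f32x4_model x)
{
    for (int i = 0; i < 4; i++)
        p[i] = x.v[i];
}

/* Assumed shape of a 4-use variant of foo(): four accumulators that all
   use lane 0 of the same b0 vector in every iteration. */
void foo4_model(const float *a, const float *b, float *c, int n)
{
    f32x4_model acc[4];
    for (int j = 0; j < 4; j++)
        acc[j] = load4(c + j * 4);

    for (int k = 0; k < n; k++) {
        f32x4_model a0 = load4(a + k * 4);
        f32x4_model b0 = load4(b + k * 4);
        for (int j = 0; j < 4; j++)
            acc[j] = fma_laneq_model(acc[j], a0, b0, 0);  /* lane 0, 4 uses */
    }

    for (int j = 0; j < 4; j++)
        store4(c + j * 4, acc[j]);
}
```

Since every accumulator reads the same lane of b0, the multiple uses are what
GCC8 fails to fold into the fmla, per the comment above.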

