Bug 115709 - missed optimisation: vperms not reordered to eliminate
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.1.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
 
Reported: 2024-06-29 15:12 UTC by mjr19
Modified: 2024-07-02 09:13 UTC

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Demo of effect of vperm rearrangement (1.34 KB, application/x-gtar-compressed)
2024-07-02 09:13 UTC, mjr19

Description mjr19 2024-06-29 15:12:31 UTC
#include <complex.h>
void foo(double complex *a, double *b, int n){
  int i;

  for(i=0;i<n;i++)
    b[i]=creal(a[i])*creal(a[i])+cimag(a[i])*cimag(a[i]);
}

with "gcc-14 -mavx2 -mfma -Ofast" produces a loop which ends

        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

However, if the two identical vperms were delayed until after the vmul and
the vfmadd, then only one vperm, on ymm1, would be needed. I believe that

        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

is equivalent, given that the contents of ymm0 are not used again.

subroutine foo(a,b,n)
  complex(kind(1d0))::a(*)
  real(kind(1d0))::b(*)
  integer::i,n

  do i=1,n
     b(i)=real(a(i))*real(a(i))+aimag(a(i))*aimag(a(i))
  end do
end subroutine foo

has the same issue. The speed increase from eliminating one vperm is quite measurable.
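
For what it's worth, here is a minimal self-checking C sketch of the equivalence
being claimed. The perm4 helper and the 4-element test values are purely
illustrative and stand in for vpermpd $216 acting on a single ymm register; this
is not generated code.

#include <assert.h>
#include <string.h>

/* vpermpd $216 selects source elements {0, 2, 1, 3}. */
static void perm4(double *dst, const double *src){
  static const int idx[4] = {0, 2, 1, 3};
  for (int i = 0; i < 4; i++)
    dst[i] = src[idx[i]];
}

int main(void){
  double re[4] = {1, 2, 3, 4};   /* plays the role of ymm0 before the perms */
  double im[4] = {5, 6, 7, 8};   /* plays the role of ymm1 before the perms */
  double cur[4], tmp[4], sum[4], prop[4];

  /* current code: permute both inputs, then square and accumulate */
  perm4(cur, re);
  perm4(tmp, im);
  for (int i = 0; i < 4; i++)
    cur[i] = cur[i] * cur[i] + tmp[i] * tmp[i];

  /* proposed code: square and accumulate first, one permute on the result */
  for (int i = 0; i < 4; i++)
    sum[i] = re[i] * re[i] + im[i] * im[i];
  perm4(prop, sum);

  assert(memcmp(cur, prop, sizeof cur) == 0);
  return 0;
}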
Comment 1 Andrew Pinski 2024-06-29 19:14:35 UTC
aarch64 is fine since it has load_lanes:
.L4:
        ld2     {v30.2d - v31.2d}, [x4], 32
        fmul    v31.2d, v31.2d, v31.2d
        fmla    v31.2d, v30.2d, v30.2d
        str     q31, [x3], 16
        cmp     x3, x5
        bne     .L4
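
(For illustration, a rough NEON-intrinsics rendering of that inner loop; this is
a hand-written sketch, not compiler output, and it assumes n is even and treats
a as interleaved re/im doubles.)

#include <arm_neon.h>

void foo_neon(const double *a, double *b, int n){
  for (int i = 0; i < n; i += 2) {
    float64x2x2_t v = vld2q_f64(a + 2 * i);           /* ld2: val[0] = reals, val[1] = imags */
    float64x2_t acc = vmulq_f64(v.val[1], v.val[1]);  /* im*im */
    acc = vfmaq_f64(acc, v.val[0], v.val[0]);         /* + re*re */
    vst1q_f64(b + i, acc);
  }
}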
Comment 2 Richard Biener 2024-06-30 10:11:21 UTC
I don't think this works; in the end we have to add even and odd elements
to compute b[i] (real and imag parts).  Yes, the multiplies could happen
on unpermuted data.  But your example assembly accumulates in the wrong way.

GCC produces

  vect__4.10_77 = MEM <vector(4) double> [(double *)a_15(D) + ivtmp.33_113 * 2];
  vect__4.11_79 = MEM <vector(4) double> [(double *)a_15(D) + 32B + ivtmp.33_113 * 2];
  vect_perm_even_80 = VEC_PERM_EXPR <vect__4.10_77, vect__4.11_79, { 0, 2, 4, 6 }>;
  vect_perm_odd_81 = VEC_PERM_EXPR <vect__4.10_77, vect__4.11_79, { 1, 3, 5, 7 }>;
  vect_powmult_7.13_83 = vect_perm_odd_81 * vect_perm_odd_81;
  vect__10.14_84 = .FMA (vect_perm_even_80, vect_perm_even_80, vect_powmult_7.13_83);
  MEM <vector(4) double> [(double *)b_16(D) + ivtmp.33_113 * 1] = vect__10.14_84;
Comment 3 mjr19 2024-07-02 09:13:10 UTC
Created attachment 58558
Demo of effect of vperm rearrangement

I still believe that my code is correct. To make what I propose clearer, I attach a runnable demo, which checks itself.

Whether the optimisation is easy enough to be worthwhile, or whether it would generalise to other cases, is another matter. On a Kaby Lake the optimised version is about 20% faster, but on a Haswell it is only about 7% faster.
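
In intrinsics terms, the proposed rearrangement would look roughly like the
sketch below (this is not the attached demo, just an illustration; it assumes
n is a multiple of 4 and treats a as interleaved re/im doubles). The even/odd
split is done with in-lane unpacks, and a single cross-lane vpermpd on the
result restores element order, replacing the two vpermpds in the current code.

#include <immintrin.h>

void foo_avx2(const double *a, double *b, int n){
  for (int i = 0; i < n; i += 4) {
    __m256d v0 = _mm256_loadu_pd(a + 2 * i);        /* re0 im0 re1 im1 */
    __m256d v1 = _mm256_loadu_pd(a + 2 * i + 4);    /* re2 im2 re3 im3 */
    __m256d re = _mm256_unpacklo_pd(v0, v1);        /* re0 re2 re1 re3 */
    __m256d im = _mm256_unpackhi_pd(v0, v1);        /* im0 im2 im1 im3 */
    __m256d s  = _mm256_fmadd_pd(im, im, _mm256_mul_pd(re, re));
    /* one vpermpd $216 on the result instead of one on each input */
    _mm256_storeu_pd(b + i, _mm256_permute4x64_pd(s, 0xD8));
  }
}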