#include <complex.h>
void foo(double complex *a, double *b, int n)
{
  int i;
  for (i = 0; i < n; i++)
    b[i] = creal(a[i])*creal(a[i]) + cimag(a[i])*cimag(a[i]);
}

compiled with "gcc-14 -mavx2 -mfma -Ofast" produces a loop which ends

        vpermpd $216, %ymm0, %ymm0
        vpermpd $216, %ymm1, %ymm1
        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

However, if the two identical vperms were delayed until after the vmul and the vfmadd, a single vperm on ymm1 would suffice. I believe that

        vmulpd  %ymm0, %ymm0, %ymm0
        vfmadd132pd     %ymm1, %ymm0, %ymm1
        vpermpd $216, %ymm1, %ymm1
        vmovupd %ymm1, (%rsi,%rax)

is equivalent, given that the contents of ymm0 are not used again: vpermpd only reorders lanes, and vmulpd and vfmadd132pd operate lane-wise, so applying the same permutation once to the result yields the same bits as applying it to both inputs first.

The Fortran equivalent

subroutine foo(a,b,n)
  complex(kind(1d0))::a(*)
  real(kind(1d0))::b(*)
  integer::i,n
  do i=1,n
    b(i)=real(a(i))*real(a(i))+aimag(a(i))*aimag(a(i))
  end do
end subroutine foo

has the same issue. The speed increase from eliminating one vperm is quite measurable.
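To make the equivalence concrete, here is a minimal self-checking sketch written with AVX2/FMA intrinsics that mirror the two instruction sequences (this is only my illustration of the claim, not GCC's output and not the attachment below; the lane data is arbitrary):

/* Check that permuting both inputs before the lane-wise mul/FMA
   gives bit-identical results to one permute of the final result.
   Compile with: gcc -mavx2 -mfma check.c */
#include <immintrin.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double in0[4] = { 1.0, 2.0, 3.0, 4.0 };
    double in1[4] = { 5.0, 6.0, 7.0, 8.0 };
    __m256d x = _mm256_loadu_pd(in0);
    __m256d y = _mm256_loadu_pd(in1);

    /* What GCC emits: vpermpd $216 on both registers, then mul/FMA. */
    __m256d xp = _mm256_permute4x64_pd(x, 216);
    __m256d yp = _mm256_permute4x64_pd(y, 216);
    __m256d r1 = _mm256_fmadd_pd(yp, yp, _mm256_mul_pd(xp, xp));

    /* Proposed: mul/FMA on unpermuted data, then a single vpermpd. */
    __m256d r2 = _mm256_permute4x64_pd(
        _mm256_fmadd_pd(y, y, _mm256_mul_pd(x, x)), 216);

    double a[4], b[4];
    _mm256_storeu_pd(a, r1);
    _mm256_storeu_pd(b, r2);
    puts(memcmp(a, b, sizeof a) == 0 ? "identical" : "DIFFER");
    return 0;
}

The argument generalises: any permutation commutes with any purely lane-wise operation applied uniformly to all of its operands.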
aarch64 is fine since it has load_lanes:

.L4:
        ld2     {v30.2d - v31.2d}, [x4], 32
        fmul    v31.2d, v31.2d, v31.2d
        fmla    v31.2d, v30.2d, v30.2d
        str     q31, [x3], 16
        cmp     x3, x5
        bne     .L4
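Written by hand with ACLE NEON intrinsics, the load_lanes strategy looks roughly like this; a sketch assuming AArch64 with arm_neon.h, n even, and the complex array viewed as plain doubles (foo_neon is just an illustrative name):

#include <arm_neon.h>

/* vld2q_f64 deinterleaves real/imaginary parts during the load,
   so no separate permute instruction is ever needed. */
void foo_neon(const double *a, double *b, int n)
{
    for (int i = 0; i < n; i += 2) {
        float64x2x2_t v = vld2q_f64(a + 2*i);   /* val[0]=re, val[1]=im */
        float64x2_t im2 = vmulq_f64(v.val[1], v.val[1]);
        float64x2_t r   = vfmaq_f64(im2, v.val[0], v.val[0]); /* im^2 + re*re */
        vst1q_f64(b + i, r);
    }
}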
I don't think this works: in the end we have to add the even and odd elements (the real and imaginary parts) to compute b[i]. Yes, the multiplies could happen on unpermuted data, but your example assembly accumulates in the wrong way. GCC produces

  vect__4.10_77 = MEM <vector(4) double> [(double *)a_15(D) + ivtmp.33_113 * 2];
  vect__4.11_79 = MEM <vector(4) double> [(double *)a_15(D) + 32B + ivtmp.33_113 * 2];
  vect_perm_even_80 = VEC_PERM_EXPR <vect__4.10_77, vect__4.11_79, { 0, 2, 4, 6 }>;
  vect_perm_odd_81 = VEC_PERM_EXPR <vect__4.10_77, vect__4.11_79, { 1, 3, 5, 7 }>;
  vect_powmult_7.13_83 = vect_perm_odd_81 * vect_perm_odd_81;
  vect__10.14_84 = .FMA (vect_perm_even_80, vect_perm_even_80, vect_powmult_7.13_83);
  MEM <vector(4) double> [(double *)b_16(D) + ivtmp.33_113 * 1] = vect__10.14_84;
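For readers less fluent in GIMPLE, those statements correspond roughly to the following C, using GCC's generic vector extensions (my paraphrase, not a compiler dump; __builtin_shufflevector needs GCC 12 or later, and -Ofast is what contracts the multiply-add into the .FMA above):

#include <string.h>

typedef double v4df __attribute__((vector_size(32)));

/* One vectorised iteration: load 4 complex doubles, split the lanes
   into even (real) and odd (imaginary) halves, then store
   b[i..i+3] = re^2 + im^2. */
static void iteration(const double *a, double *b, long i)
{
    v4df lo, hi;
    memcpy(&lo, a + 2*i,     sizeof lo);  /* r0 i0 r1 i1 */
    memcpy(&hi, a + 2*i + 4, sizeof hi);  /* r2 i2 r3 i3 */
    v4df even = __builtin_shufflevector(lo, hi, 0, 2, 4, 6);
    v4df odd  = __builtin_shufflevector(lo, hi, 1, 3, 5, 7);
    v4df res  = even*even + odd*odd;
    memcpy(b + i, &res, sizeof res);
}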
Created attachment 58558
Demo of effect of vperm rearrangement

I still believe that my code is correct. To make what I propose clearer, I have attached a runnable demo which checks itself. Whether the optimisation is easy enough to implement to be worthwhile, and whether it would generalise to other cases, is another matter. On a Kaby Lake the optimised version is about 20% faster; on a Haswell it is only about 7% faster.