[Bug tree-optimization/93588] New: Vectorized load followed by FMA pessimizes on Haswell from version 8.1
alex.reinking at gmail dot com
gcc-bugzilla@gcc.gnu.org
Wed Feb 5 07:29:00 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93588
            Bug ID: 93588
           Summary: Vectorized load followed by FMA pessimizes on Haswell
                    from version 8.1
           Product: gcc
           Version: 8.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: alex.reinking at gmail dot com
  Target Milestone: ---
Compiling the following loop, which uses vector intrinsics (via immintrin.h), with GCC 7.3 at -O3 -march=haswell produces the inner-loop assembly shown after it:
---
for (int k = 0; k < n; ++k) {
    ymm12 = _mm256_broadcast_sd(&b[k]);
    ymm13 = _mm256_broadcast_sd(&b[k + ldb]);
    ymm14 = _mm256_broadcast_sd(&b[k + 2 * ldb]);
    ymm15 = _mm256_loadu_pd(&a[k * lda]);
    ymm0 = _mm256_fmadd_pd(ymm15, ymm12, ymm0);
    ymm4 = _mm256_fmadd_pd(ymm15, ymm13, ymm4);
    ymm8 = _mm256_fmadd_pd(ymm15, ymm14, ymm8);
    ymm15 = _mm256_loadu_pd(&a[4 + k * lda]);
    ymm1 = _mm256_fmadd_pd(ymm15, ymm12, ymm1);
    ymm5 = _mm256_fmadd_pd(ymm15, ymm13, ymm5);
    ymm9 = _mm256_fmadd_pd(ymm15, ymm14, ymm9);
    ymm15 = _mm256_loadu_pd(&a[8 + k * lda]);
    ymm2 = _mm256_fmadd_pd(ymm15, ymm12, ymm2);
    ymm6 = _mm256_fmadd_pd(ymm15, ymm13, ymm6);
    ymm10 = _mm256_fmadd_pd(ymm15, ymm14, ymm10);
    ymm15 = _mm256_loadu_pd(&a[12 + k * lda]);
    ymm3 = _mm256_fmadd_pd(ymm15, ymm12, ymm3);
    ymm7 = _mm256_fmadd_pd(ymm15, ymm13, ymm7);
    ymm11 = _mm256_fmadd_pd(ymm15, ymm14, ymm11);
}
---
.L3:
        lea             rax, [r8+rcx]
        vbroadcastsd    ymm2, QWORD PTR [rcx]
        vmovupd         ymm3, YMMWORD PTR [rsi]
        add             rcx, 8
        vbroadcastsd    ymm1, QWORD PTR [rax]
        vbroadcastsd    ymm0, QWORD PTR [rax+r8]
        vfmadd231pd     ymm15, ymm3, ymm2
        vfmadd231pd     ymm11, ymm3, ymm1
        vfmadd231pd     ymm7, ymm3, ymm0
        vmovupd         ymm3, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm14, ymm3, ymm2
        vfmadd231pd     ymm10, ymm3, ymm1
        vfmadd231pd     ymm6, ymm3, ymm0
        vmovupd         ymm3, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm13, ymm3, ymm2
        vfmadd231pd     ymm9, ymm3, ymm1
        vfmadd231pd     ymm5, ymm3, ymm0
        vmovupd         ymm3, YMMWORD PTR [rsi+96]
        add             rsi, rdx
        vfmadd231pd     ymm12, ymm3, ymm2
        vfmadd231pd     ymm8, ymm3, ymm1
        vfmadd231pd     ymm4, ymm3, ymm0
        cmp             rdi, rcx
        jne             .L3
---
This reuses the register holding each loaded block of a (ymm3) across three FMAs, and in fact uses all 16 ymm registers. However, when compiling with GCC 8.1 or newer, we get:
---
.L3:
        vbroadcastsd    ymm2, QWORD PTR [rcx]
        lea             rax, [r8+rcx]
        add             rcx, 8
        vbroadcastsd    ymm1, QWORD PTR [rax]
        vbroadcastsd    ymm0, QWORD PTR [rax+r8]
        vfmadd231pd     ymm14, ymm2, YMMWORD PTR [rsi]
        vfmadd231pd     ymm10, ymm1, YMMWORD PTR [rsi]
        vfmadd231pd     ymm6, ymm0, YMMWORD PTR [rsi]
        vfmadd231pd     ymm13, ymm2, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm9, ymm1, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm5, ymm0, YMMWORD PTR [rsi+32]
        vfmadd231pd     ymm12, ymm2, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm8, ymm1, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm4, ymm0, YMMWORD PTR [rsi+64]
        vfmadd231pd     ymm11, ymm2, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm7, ymm1, YMMWORD PTR [rsi+96]
        vfmadd231pd     ymm3, ymm0, YMMWORD PTR [rsi+96]
        add             rsi, rdx
        cmp             rdi, rcx
        jne             .L3
---
This code has half the throughput on both my i9-7900X and NERSC's Xeon E5-2698
v3. Enabling -mtune=skylake "fixes" the problem, but it isn't clear why it does
or how this code could be written to be more robust to compiler changes. The
intrinsics are supposed to map to the corresponding assembly instructions, no?
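For what it's worth, one source-level workaround I can think of (a sketch only; I have not benchmarked it or checked it against every GCC version) is to pass each loaded vector through an empty inline asm with a "+x" constraint. The asm emits no instructions, but it forces the value to stay in a ymm register, so the compiler can no longer fold the load back into the FMAs as a memory operand:
---
#include <immintrin.h>

/* Hypothetical helper: the empty asm produces no code, but its "+x"
   (read/write vector register) operand makes the value opaque to the
   optimizer, so it has to live in a register and later FMAs cannot
   re-fold the original load into a memory operand. */
static inline __m256d keep_in_register(__m256d v)
{
    __asm__("" : "+x"(v));
    return v;
}

/* Inside the k-loop each load would then become, e.g.:
       ymm15 = keep_in_register(_mm256_loadu_pd(&a[k * lda]));
   with the three FMAs that follow it left unchanged. */
---
Whether this actually restores the GCC 7.3 code generation here is something I have not verified; it only shows one way to pin a value to a register at the source level.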
Here are some Compiler Explorer links to show the behavior:
[GCC 7.3] https://gcc.godbolt.org/z/nLHD47
[GCC 8.1] https://gcc.godbolt.org/z/6EEt2N
[GCC 8.1 -mtune=skylake] https://gcc.godbolt.org/z/XGZKtX
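For anyone who wants to reproduce this locally instead of through the Compiler Explorer links, here is a self-contained version of the kernel. Only the k-loop is copied from the fragment above; the function name, the zero-initialization of the accumulators, and the final stores are scaffolding I have added so that the loop compiles on its own and is not optimized away:
---
#include <immintrin.h>

/* Build with: gcc -O3 -march=haswell -S kernel.c
   Everything outside the k-loop is assumed scaffolding, not the original
   surrounding code. */
void kernel(const double *a, const double *b, double *c,
            int n, int lda, int ldb)
{
    __m256d ymm0  = _mm256_setzero_pd(), ymm1  = _mm256_setzero_pd();
    __m256d ymm2  = _mm256_setzero_pd(), ymm3  = _mm256_setzero_pd();
    __m256d ymm4  = _mm256_setzero_pd(), ymm5  = _mm256_setzero_pd();
    __m256d ymm6  = _mm256_setzero_pd(), ymm7  = _mm256_setzero_pd();
    __m256d ymm8  = _mm256_setzero_pd(), ymm9  = _mm256_setzero_pd();
    __m256d ymm10 = _mm256_setzero_pd(), ymm11 = _mm256_setzero_pd();
    __m256d ymm12, ymm13, ymm14, ymm15;

    for (int k = 0; k < n; ++k) {
        ymm12 = _mm256_broadcast_sd(&b[k]);
        ymm13 = _mm256_broadcast_sd(&b[k + ldb]);
        ymm14 = _mm256_broadcast_sd(&b[k + 2 * ldb]);
        ymm15 = _mm256_loadu_pd(&a[k * lda]);
        ymm0 = _mm256_fmadd_pd(ymm15, ymm12, ymm0);
        ymm4 = _mm256_fmadd_pd(ymm15, ymm13, ymm4);
        ymm8 = _mm256_fmadd_pd(ymm15, ymm14, ymm8);
        ymm15 = _mm256_loadu_pd(&a[4 + k * lda]);
        ymm1 = _mm256_fmadd_pd(ymm15, ymm12, ymm1);
        ymm5 = _mm256_fmadd_pd(ymm15, ymm13, ymm5);
        ymm9 = _mm256_fmadd_pd(ymm15, ymm14, ymm9);
        ymm15 = _mm256_loadu_pd(&a[8 + k * lda]);
        ymm2 = _mm256_fmadd_pd(ymm15, ymm12, ymm2);
        ymm6 = _mm256_fmadd_pd(ymm15, ymm13, ymm6);
        ymm10 = _mm256_fmadd_pd(ymm15, ymm14, ymm10);
        ymm15 = _mm256_loadu_pd(&a[12 + k * lda]);
        ymm3 = _mm256_fmadd_pd(ymm15, ymm12, ymm3);
        ymm7 = _mm256_fmadd_pd(ymm15, ymm13, ymm7);
        ymm11 = _mm256_fmadd_pd(ymm15, ymm14, ymm11);
    }

    /* Store the accumulators so the loop stays live; the layout of c is an
       assumption made for this sketch. */
    _mm256_storeu_pd(&c[0],  ymm0);  _mm256_storeu_pd(&c[4],  ymm1);
    _mm256_storeu_pd(&c[8],  ymm2);  _mm256_storeu_pd(&c[12], ymm3);
    _mm256_storeu_pd(&c[16], ymm4);  _mm256_storeu_pd(&c[20], ymm5);
    _mm256_storeu_pd(&c[24], ymm6);  _mm256_storeu_pd(&c[28], ymm7);
    _mm256_storeu_pd(&c[32], ymm8);  _mm256_storeu_pd(&c[36], ymm9);
    _mm256_storeu_pd(&c[40], ymm10); _mm256_storeu_pd(&c[44], ymm11);
}
---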