Created attachment 52933 [details] testcase Hit by core-math team at https://gcc.gnu.org/pipermail/gcc-help/2022-May/141480.html Compile the attached testcase with -O2 -march=haswell (other AVX-capable Intel families except Alderlake are affected too) and observe that the big basic block begins with .L6: vcvtss2sd xmm1, xmm1, DWORD PTR [rsp-4] This creates a false dependency on the previous assignment into xmm1, resulting in wildly varying (and suboptimal) throughput figures depending on how long the CPU stalls waiting for the previous assignment to complete. GCC has code to emit such instructions in a manner that avoids false dependencies (see e.g. PR89071), but here it doesn't seem to work. Also there's a potentially related issue that GCC copies the initial xmm0 value to eax via stack in the beginning of the function: cr_exp10f: vmovss DWORD PTR [rsp-4], xmm0 mov eax, DWORD PTR [rsp-4] This seems wrong since xmm-reg moves on Haswell are 1 cycle afaict.
Pass remove_partial_avx_dependency is before RA, which we have (insn 128 127 129 22 (set (reg/v:DF 99 [ z ]) (float_extend:DF (reg/v:SF 117 [ x ]))) "test.c":43:10 163 {*extendsfdf2} and attr avx_partial_xmm_update is false, since we don't need to handle vcvtss2sd %xmm, %xmm, %xmm But after RA, 117 is allocate as mem which causes partial xmm dependency.
After set remove_partial_avx_dependency to true for register alternative, we get vxorps %xmm3, %xmm3, %xmm3 vmovsd .LC16(%rip), %xmm6 vmovsd .LC14(%rip), %xmm5 vcvtss2sd %xmm0, %xmm3, %xmm0 vmulsd .LC12(%rip), %xmm0, %xmm1 vroundsd $9, %xmm1, %xmm3, %xmm2 vcvttsd2siq %xmm2, %rax vsubsd %xmm2, %xmm1, %xmm1 vfmadd231sd .LC13(%rip), %xmm0, %xmm1 vfmadd213sd .LC17(%rip), %xmm1, %xmm6 vmovsd .LC18(%rip), %xmm0 vfmadd213sd .LC19(%rip), %xmm1, %xmm0 vfmadd213sd .LC15(%rip), %xmm1, %xmm5 movq %rax, %rdx sarq $4, %rax vmulsd %xmm1, %xmm1, %xmm4 addq $1023, %rax andl $15, %edx salq $52, %rax vmovq %rax, %xmm7 vmulsd tb.1(,%rdx,8), %xmm7, %xmm2 vfmadd132sd %xmm4, %xmm6, %xmm0 vmulsd %xmm2, %xmm1, %xmm1 vfmadd132sd %xmm4, %xmm5, %xmm0 vfmadd132sd %xmm1, %xmm2, %xmm0 vcvtsd2ss %xmm0, %xmm3, %xmm0 Also >Also there's a potentially related issue that GCC copies the initial xmm0 value >to eax via stack in the beginning of the function: This issue disappears(should still be latent after fix).
The overhead for setting attr remove_partial_avx_dependency to true for register alternative is: For simple double foo (float a) { return a; } Now we generates: foo: .LFB0: .cfi_startproc vmovaps %xmm0, %xmm1 vxorps %xmm0, %xmm0, %xmm0 vcvtss2sd %xmm1, %xmm0, %xmm0
Another possible solution is add a little bit dislike for "m" alternative(like ?m) to avoid potential spill.
The strange xmm0 spill issue may affect more code, so I reported an isolated testcase: PR 105513 (regression vs. gcc-8, the complete testcase in this PR also does not spill with gcc-8).
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 Author: liuhongt <hongtao.liu@intel.com> Date: Mon May 30 15:30:51 2022 +0800 Disparages SSE_REGS alternatives sligntly with ?v instead of *v in *mov{si,di}_internal. So alternative v won't be igored in record_reg_classess. Similar for *r alternatives in some vector patterns. It helps testcase in the PR, also RA now makes better decisions for gcc.target/i386/extract-insert-combining.c movd %esi, %xmm0 movd %edi, %xmm1 - movl %esi, -12(%rsp) paddd %xmm0, %xmm1 pinsrd $0, %esi, %xmm0 paddd %xmm1, %xmm0 The patch has no big impact on SPEC2017 for both O2 and Ofast march=native run. And I noticed there's some changes in SPEC2017 from code like mov mem, %eax vmovd %eax, %xmm0 .. mov %eax, 64(%rsp) to vmovd mem, %xmm0 .. vmovd %xmm0, 64(%rsp) Which should be exactly what we want? gcc/ChangeLog: PR target/105513 PR target/105504 * config/i386/i386.md (*movsi_internal): Change alternative from *v to ?v. (*movdi_internal): Ditto. * config/i386/sse.md (vec_set<mode>_0): Change alternative *r to ?r. (*vec_extractv4sf_mem): Ditto. (*vec_extracthf): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr105513-1.c: New test. * gcc.target/i386/extract-insert-combining.c: Add new scan-assembler-not for spill.
(In reply to CVS Commits from comment #6) > The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: > > https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > Author: liuhongt <hongtao.liu@intel.com> > Date: Mon May 30 15:30:51 2022 +0800 > > Disparages SSE_REGS alternatives sligntly with ?v instead of *v in > *mov{si,di}_internal. > > So alternative v won't be igored in record_reg_classess. > > Similar for *r alternatives in some vector patterns. > > It helps testcase in the PR, also RA now makes better decisions for > gcc.target/i386/extract-insert-combining.c > > movd %esi, %xmm0 > movd %edi, %xmm1 > - movl %esi, -12(%rsp) > paddd %xmm0, %xmm1 > pinsrd $0, %esi, %xmm0 > paddd %xmm1, %xmm0 > > The patch has no big impact on SPEC2017 for both O2 and Ofast > march=native run. > > And I noticed there's some changes in SPEC2017 from code like > > mov mem, %eax > vmovd %eax, %xmm0 > .. > mov %eax, 64(%rsp) > > to > > vmovd mem, %xmm0 > .. > vmovd %xmm0, 64(%rsp) > > Which should be exactly what we want? > > gcc/ChangeLog: > > PR target/105513 > PR target/105504 > * config/i386/i386.md (*movsi_internal): Change alternative > from *v to ?v. > (*movdi_internal): Ditto. > * config/i386/sse.md (vec_set<mode>_0): Change alternative *r > to ?r. > (*vec_extractv4sf_mem): Ditto. > (*vec_extracthf): Ditto. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/pr105513-1.c: New test. > * gcc.target/i386/extract-insert-combining.c: Add new > scan-assembler-not for spill. Did this fix it?
(In reply to Eric Gallager from comment #7) > (In reply to CVS Commits from comment #6) > > The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>: > > > > https://gcc.gnu.org/g:5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > > > commit r13-1009-g5e005393d4ff0a428c5f55b9ba7f65d6078a7cf5 > > Author: liuhongt <hongtao.liu@intel.com> > > Date: Mon May 30 15:30:51 2022 +0800 > > > > Disparages SSE_REGS alternatives sligntly with ?v instead of *v in > > *mov{si,di}_internal. > > > > So alternative v won't be igored in record_reg_classess. > > > > Similar for *r alternatives in some vector patterns. > > > > It helps testcase in the PR, also RA now makes better decisions for > > gcc.target/i386/extract-insert-combining.c > > > > movd %esi, %xmm0 > > movd %edi, %xmm1 > > - movl %esi, -12(%rsp) > > paddd %xmm0, %xmm1 > > pinsrd $0, %esi, %xmm0 > > paddd %xmm1, %xmm0 > > > > The patch has no big impact on SPEC2017 for both O2 and Ofast > > march=native run. > > > > And I noticed there's some changes in SPEC2017 from code like > > > > mov mem, %eax > > vmovd %eax, %xmm0 > > .. > > mov %eax, 64(%rsp) > > > > to > > > > vmovd mem, %xmm0 > > .. > > vmovd %xmm0, 64(%rsp) > > > > Which should be exactly what we want? > > > > gcc/ChangeLog: > > > > PR target/105513 > > PR target/105504 > > * config/i386/i386.md (*movsi_internal): Change alternative > > from *v to ?v. > > (*movdi_internal): Ditto. > > * config/i386/sse.md (vec_set<mode>_0): Change alternative *r > > to ?r. > > (*vec_extractv4sf_mem): Ditto. > > (*vec_extracthf): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/pr105513-1.c: New test. > > * gcc.target/i386/extract-insert-combining.c: Add new > > scan-assembler-not for spill. > > Did this fix it? Yes.
(In reply to Hongtao.liu from comment #8) > (In reply to Eric Gallager from comment #7) > > > > Did this fix it? > > Yes. OK, closing, then