| Summary: | [14/15 Regression] floating point vector regression, x86, between gcc-14 and gcc-13 using -O3 and target clones on skylake platforms | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Colin Ian King <colin.king> |
| Component: | target | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW | | |
| Severity: | normal | CC: | amonakov, crazylht, haochen.jiang, hjl.tools, liuhongt, sjames |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 14.0 | | |
| Target Milestone: | 14.2 | | |
| See Also: | https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115002, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107916 | | |
| Host: | | Target: | x86_64-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2024-05-10 00:00:00 |
| Attachments: | reproducer.c source code; gcc-13 disassembly; gcc-14 disassembly | | |
Description
Colin Ian King
2024-05-08 14:44:31 UTC
Created attachment 58127 [details]
gcc-13 disassembly
Created attachment 58128 [details]
gcc-14 disassembly
perf report from gcc-13 of stress_vecfp_float_add_16.avx of the compute loop:

```
57.93 │200:  vaddps  0xc0(%rsp),%ymm3,%ymm5
11.11 │      vaddps  0xe0(%rsp),%ymm2,%ymm6
 0.02 │      vmovaps %ymm5,0x60(%rsp)
 2.92 │      mov     0x60(%rsp),%rax
      │      mov     0x68(%rsp),%rdx
 0.37 │      vmovaps %ymm6,0x40(%rsp)
      │      vmovaps %ymm5,0x80(%rsp)
 6.30 │      vmovq   %rax,%xmm1
 4.11 │      mov     0x40(%rsp),%rax
      │      vmovdqa 0x90(%rsp),%xmm5
      │      vmovaps %ymm6,0xa0(%rsp)
 3.27 │      vpinsrq $0x1,%rdx,%xmm1,%xmm1
      │      mov     0x48(%rsp),%rdx
      │      vmovdqa 0xb0(%rsp),%xmm6
 3.22 │      vmovdqa %xmm1,0xc0(%rsp)
 0.42 │      vmovq   %rax,%xmm0
      │      vmovdqa %xmm5,0xd0(%rsp)
 6.80 │      vpinsrq $0x1,%rdx,%xmm0,%xmm0
 3.52 │      vmovdqa %xmm0,0xe0(%rsp)
      │      vmovdqa %xmm6,0xf0(%rsp)
      │      sub     $0x1,%ecx
      │    ↑ jne     200
```

perf report from gcc-14 of stress_vecfp_float_add_16.avx of the compute loop:

```
65.79 │200:  vaddps  0xc0(%rsp),%ymm3,%ymm5
 3.26 │      vaddps  0xe0(%rsp),%ymm2,%ymm6
 0.00 │      vmovaps %ymm5,0x60(%rsp)
 9.25 │      mov     0x60(%rsp),%rax
 0.00 │      mov     0x68(%rsp),%rdx
      │      vmovaps %ymm6,0x40(%rsp)
      │      vmovaps %ymm5,0x80(%rsp)
 6.49 │      vmovq   %rax,%xmm1
 0.00 │      mov     0x40(%rsp),%rax
 0.00 │      vmovaps %ymm6,0xa0(%rsp)
 3.02 │      vpinsrq $0x1,%rdx,%xmm1,%xmm1
      │      mov     0x48(%rsp),%rdx
 0.35 │      vmovdqa %xmm1,0xc0(%rsp)
 0.68 │      vmovq   %rax,%xmm0
 0.00 │      vmovdqa 0x90(%rsp),%xmm1
 5.18 │      vpinsrq $0x1,%rdx,%xmm0,%xmm0
 3.00 │      vmovdqa %xmm0,0xe0(%rsp)
      │      vmovdqa 0xb0(%rsp),%xmm0
      │      vmovdqa %xmm1,0xd0(%rsp)
      │      vmovdqa %xmm0,0xf0(%rsp)
      │      sub     $0x1,%ecx
 2.94 │    ↑ jne     200
```

I can't reproduce a slowdown on a Zen2 CPU. The difference seems to be merely instruction scheduling.

I do note we're not doing a good job in handling

```c
for (i = 0; i < LOOPS_PER_CALL; i++) {
        r.v = r.v + add.v;
}
```

where r.v and add.v are AVX512-sized vectors when emulating them with AVX vectors.
We end up with

```
r_v_lsm.48_48 = r.v;
_11 = add.v;

<bb 3> [local count: 1063004408]:
# r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
# ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
_16 = BIT_FIELD_REF <_11, 256, 0>;
_37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
_29 = _16 + _37;
_387 = BIT_FIELD_REF <_11, 256, 256>;
_375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
_363 = _387 + _375;
_12 = {_29, _363};
ivtmp_55 = ivtmp_56 - 1;
if (ivtmp_55 != 0)
  goto <bb 3>; [98.99%]
else
  goto <bb 4>; [1.01%]

<bb 4> [local count: 10737416]:
```

after lowering from 512-bit to 256-bit vectors, and there's no pass that would demote the 512-bit reduction value to two 256-bit ones. There are also weird things going on in the target/on RTL.

A smaller testcase illustrating the code generation issue is

```c
typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}
```

So confirmed for the non-optimal code, but I don't see how it's a regression.

What I have found is that the binaries built with GCC 13 and GCC 14 show the regression on Cascade Lake and Skylake, but when I copied the same binaries to Ice Lake, they don't. It seems Ice Lake might fix this with micro-tuning.

I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)", rebuilt the binary, and that recovers about half of the regression.

> I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)",
> rebuilt the binary, and that recovers about half of the regression.

```
57.93 │200:  vaddps  0xc0(%rsp),%ymm3,%ymm5
11.11 │      vaddps  0xe0(%rsp),%ymm2,%ymm6
      ...
 3.22 │      vmovdqa %xmm1,0xc0(%rsp)
      │      vmovdqa %xmm5,0xd0(%rsp)
 3.52 │      vmovdqa %xmm0,0xe0(%rsp)
      │      vmovdqa %xmm6,0xf0(%rsp)
```

I guess there are specific patterns for store-to-load forwarding (STLF) in the SKX microarchitecture; the main difference is the instruction order of those xmm stores. From the compiler side, the worthwhile thing to do is PR107916.
Furthermore, when I build with GCC 11, the codegen is much better:

```
vaddps  0xc0(%rsp),%ymm5,%ymm2
vaddps  0xe0(%rsp),%ymm4,%ymm1
vmovaps %ymm2,0x80(%rsp)
vmovdqa 0x90(%rsp),%xmm6
vmovaps %ymm1,0xa0(%rsp)
vmovdqa 0xb0(%rsp),%xmm7
vmovdqa %xmm2,0xc0(%rsp)
vmovdqa %xmm6,0xd0(%rsp)
vmovdqa %xmm1,0xe0(%rsp)
vmovdqa %xmm7,0xf0(%rsp)
sub     $0x1,%eax
jne     401e00 <stress_vecfp_float_add_16.avx.1+0x1e0>
```

It seems we may have two separate issues behind this regression.