Created attachment 58126 [details]
reproducer.c source code

I'm seeing a ~10% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

cking@skylake:~$ CFLAGS="" gcc-13 reproducer.c; ./a.out
4.92 secs duration, 2130.379 Mfp-ops/sec
cking@skylake:~$ CFLAGS="" gcc-14 reproducer.c; ./a.out
5.46 secs duration, 1921.799 Mfp-ops/sec

The original issue appeared when regression testing the stress-ng vecfp stressor [1] using the floating point vector 16 add stressor method. I've managed to extract the attached reproducer (reproducer.c) from the original code.

Salient points to focus on:

1. The issue is dependent on the OPTIMIZE3 macro in the reproducer being defined as __attribute__((optimize("-O3"))).
2. The issue is also dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("mmx,avx,default"))) - the avx target clone appears to be necessary to reproduce the problem.

Attached are the reproducer.c C source and the disassembled object code. The stress_vecfp_float_add_16.avx code from gcc-13 is significantly different from the gcc-14 code.

References:
[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-vecfp.c
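For readers without the attachment, the hot function is assumed to have roughly the following shape (a sketch only, reconstructed from the description above and from stress-ng; the attached reproducer.c is authoritative, and the LOOPS_PER_CALL value here is illustrative):

#define OPTIMIZE3      __attribute__((optimize("-O3")))
#define TARGET_CLONES  __attribute__((target_clones("mmx,avx,default")))
#define LOOPS_PER_CALL 65536   /* illustrative value, not taken from reproducer.c */

typedef struct {
        float v __attribute__((vector_size(sizeof(float) * 16)));
} stress_vecfp_float_16_t;

float OPTIMIZE3 TARGET_CLONES
stress_vecfp_float_add_16(stress_vecfp_float_16_t r, stress_vecfp_float_16_t add)
{
        int i;

        for (i = 0; i < LOOPS_PER_CALL; i++)
                r.v = r.v + add.v;

        return r.v[0];   /* keep the result live */
}

The 16-float vector is wider than the 256-bit registers available to the avx clone, so the compiler has to split it; the regression is in how that split accumulator is kept across loop iterations.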
Created attachment 58127 [details]
gcc-13 disassembly
Created attachment 58128 [details]
gcc-14 disassembly
perf report from gcc-13 of the stress_vecfp_float_add_16.avx compute loop:

 57.93 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
  0.02 │       vmovaps %ymm5,0x60(%rsp)
  2.92 │       mov     0x60(%rsp),%rax
       │       mov     0x68(%rsp),%rdx
  0.37 │       vmovaps %ymm6,0x40(%rsp)
       │       vmovaps %ymm5,0x80(%rsp)
  6.30 │       vmovq   %rax,%xmm1
  4.11 │       mov     0x40(%rsp),%rax
       │       vmovdqa 0x90(%rsp),%xmm5
       │       vmovaps %ymm6,0xa0(%rsp)
  3.27 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
       │       mov     0x48(%rsp),%rdx
       │       vmovdqa 0xb0(%rsp),%xmm6
  3.22 │       vmovdqa %xmm1,0xc0(%rsp)
  0.42 │       vmovq   %rax,%xmm0
       │       vmovdqa %xmm5,0xd0(%rsp)
  6.80 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
  3.52 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa %xmm6,0xf0(%rsp)
       │       sub     $0x1,%ecx
       │     ↑ jne     200

perf report from gcc-14 of the stress_vecfp_float_add_16.avx compute loop:

 65.79 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
  3.26 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
  0.00 │       vmovaps %ymm5,0x60(%rsp)
  9.25 │       mov     0x60(%rsp),%rax
  0.00 │       mov     0x68(%rsp),%rdx
       │       vmovaps %ymm6,0x40(%rsp)
       │       vmovaps %ymm5,0x80(%rsp)
  6.49 │       vmovq   %rax,%xmm1
  0.00 │       mov     0x40(%rsp),%rax
  0.00 │       vmovaps %ymm6,0xa0(%rsp)
  3.02 │       vpinsrq $0x1,%rdx,%xmm1,%xmm1
       │       mov     0x48(%rsp),%rdx
  0.35 │       vmovdqa %xmm1,0xc0(%rsp)
  0.68 │       vmovq   %rax,%xmm0
  0.00 │       vmovdqa 0x90(%rsp),%xmm1
  5.18 │       vpinsrq $0x1,%rdx,%xmm0,%xmm0
  3.00 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa 0xb0(%rsp),%xmm0
       │       vmovdqa %xmm1,0xd0(%rsp)
       │       vmovdqa %xmm0,0xf0(%rsp)
       │       sub     $0x1,%ecx
  2.94 │     ↑ jne     200
I can't reproduce a slowdown on a Zen2 CPU. The difference seems to be merely instruction scheduling.

I do note we're not doing a good job handling

  for (i = 0; i < LOOPS_PER_CALL; i++) {
          r.v = r.v + add.v;
  }

where r.v and add.v are AVX512-sized vectors, when emulating them with AVX vectors. We end up with

  r_v_lsm.48_48 = r.v;
  _11 = add.v;

  <bb 3> [local count: 1063004408]:
  # r_v_lsm.48_50 = PHI <_12(3), r_v_lsm.48_48(2)>
  # ivtmp_56 = PHI <ivtmp_55(3), 65536(2)>
  _16 = BIT_FIELD_REF <_11, 256, 0>;
  _37 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 0>;
  _29 = _16 + _37;
  _387 = BIT_FIELD_REF <_11, 256, 256>;
  _375 = BIT_FIELD_REF <r_v_lsm.48_50, 256, 256>;
  _363 = _387 + _375;
  _12 = {_29, _363};
  ivtmp_55 = ivtmp_56 - 1;
  if (ivtmp_55 != 0)
    goto <bb 3>; [98.99%]
  else
    goto <bb 4>; [1.01%]

  <bb 4> [local count: 10737416]:

after lowering from 512-bit to 256-bit vectors, and there's no pass that would demote the 512-bit reduction value to two 256-bit ones. There are also weird things going on in the target/on RTL.

A smaller testcase illustrating the code generation issue is

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));

void foo (v16sf * __restrict r, v16sf *a, int n)
{
  for (int i = 0; i < n; ++i)
    *r = *r + *a;
}

So confirmed for non-optimal code, but I don't see how it's a regression.
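To illustrate what such a demotion would do at the source level, here is a hand-demoted variant of the testcase above (a sketch only; the v8sf typedef, the foo_demoted name and the use of __builtin_shufflevector are mine, not anything GCC produces): the accumulator is kept as two native 256-bit halves, so the loop body is just two ymm adds with no per-iteration reconstruction of the 512-bit value.

typedef float v16sf __attribute__((vector_size(sizeof(float)*16)));
typedef float v8sf  __attribute__((vector_size(sizeof(float)*8)));

void foo_demoted (v16sf * __restrict r, v16sf *a, int n)
{
  /* Split the 512-bit accumulator and addend into two 256-bit halves
     once, before the loop.  */
  v8sf r_lo = __builtin_shufflevector(*r, *r, 0, 1, 2, 3, 4, 5, 6, 7);
  v8sf r_hi = __builtin_shufflevector(*r, *r, 8, 9, 10, 11, 12, 13, 14, 15);
  v8sf a_lo = __builtin_shufflevector(*a, *a, 0, 1, 2, 3, 4, 5, 6, 7);
  v8sf a_hi = __builtin_shufflevector(*a, *a, 8, 9, 10, 11, 12, 13, 14, 15);

  for (int i = 0; i < n; ++i)
    {
      r_lo = r_lo + a_lo;   /* each half stays in one ymm register */
      r_hi = r_hi + a_hi;
    }

  /* Recombine into the 512-bit value only after the loop.  */
  *r = __builtin_shufflevector(r_lo, r_hi,
                               0, 1, 2, 3, 4, 5, 6, 7,
                               8, 9, 10, 11, 12, 13, 14, 15);
}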
What I have found is that the regression between the GCC 13 and GCC 14 binaries reproduces on Cascade Lake and Skylake, but when I copy the same binaries to Icelake it does not. It seems Icelake may avoid this at the microarchitectural level.

I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)", rebuilt the binary, and that recovers about half of the regression.
> I tried moving "vmovdqa %xmm1,0xd0(%rsp)" before "vmovdqa %xmm0,0xe0(%rsp)",
> rebuilt the binary, and that recovers about half of the regression.

 57.93 │200:   vaddps  0xc0(%rsp),%ymm3,%ymm5
 11.11 │       vaddps  0xe0(%rsp),%ymm2,%ymm6
        ...
  3.22 │       vmovdqa %xmm1,0xc0(%rsp)
       │       vmovdqa %xmm5,0xd0(%rsp)
  3.52 │       vmovdqa %xmm0,0xe0(%rsp)
       │       vmovdqa %xmm6,0xf0(%rsp)

I guess there are specific patterns in the SKX microarchitecture for store-to-load forwarding (STLF); the main difference is the order of those xmm stores. From the compiler side, the worthwhile thing to do is PR107916.
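The generic pattern at play here, for reference, is two narrow stores later reloaded by one wide load of the same bytes: in the loop above, the xmm stores to 0xc0(%rsp)/0xd0(%rsp) feed the ymm vaddps load from 0xc0(%rsp) in the next iteration. A minimal illustration of that pattern (names and function are made up for this sketch, not taken from the reproducer):

#include <immintrin.h>

__m256 stlf_blocked_reload(__m128 lo, __m128 hi)
{
        _Alignas(32) float buf[8];

        _mm_store_ps(buf, lo);        /* 16-byte store to buf[0..3] */
        _mm_store_ps(buf + 4, hi);    /* 16-byte store to buf[4..7] */
        return _mm256_load_ps(buf);   /* 32-byte load spanning both stores:
                                         store-to-load forwarding fails and
                                         the load stalls until the stores
                                         reach the cache */
}

How much that stall costs, and how sensitive it is to the exact store order, is microarchitecture-dependent, which would fit the Skylake/Cascade Lake vs. Icelake observation above.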
Furthermore, when I build with GCC 11, the codegen is much better:

        vaddps  0xc0(%rsp),%ymm5,%ymm2
        vaddps  0xe0(%rsp),%ymm4,%ymm1
        vmovaps %ymm2,0x80(%rsp)
        vmovdqa 0x90(%rsp),%xmm6
        vmovaps %ymm1,0xa0(%rsp)
        vmovdqa 0xb0(%rsp),%xmm7
        vmovdqa %xmm2,0xc0(%rsp)
        vmovdqa %xmm6,0xd0(%rsp)
        vmovdqa %xmm1,0xe0(%rsp)
        vmovdqa %xmm7,0xf0(%rsp)
        sub     $0x1,%eax
        jne     401e00 <stress_vecfp_float_add_16.avx.1+0x1e0>

It seems we might have two separate issues behind this regression.