Created attachment 58158 [details]
reproducer source code for __int128_t division regression

I'm seeing a 5% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-div128.c
cking@skylake:~$ ./a.out
1650.83 div128 ops per sec
cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-div128.c
cking@skylake:~$ ./a.out
1567.48 div128 ops per sec

The original issue appeared when regression testing the stress-ng cpu div128 stressor [1]. I've managed to extract the attached reproducer from the original code.

Salient point to focus on:
1. The issue is also dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("avx,default"))); the avx target clone appears to be necessary to reproduce the problem.

Attached are the reproducer C source and disassembled object code.

References:
[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c
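
For reference, a minimal sketch of the kind of loop such a reproducer exercises. This is not the attached reproducer-div128.c itself; the function name, constants and timing scaffolding below are illustrative, but the TARGET_CLONES definition matches the one described above, and the 128-bit divisions go through libgcc's __udivti3:

/* div128_sketch.c -- illustrative only, not the attached reproducer */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define TARGET_CLONES __attribute__((target_clones("avx,default")))

typedef __uint128_t u128;

/* repeatedly divide a large 128-bit value by a varying 64-bit divisor so
   the work is dominated by calls to libgcc's __udivti3 */
TARGET_CLONES u128 stress_div128(u128 x, uint64_t step, int loops)
{
	u128 sum = 0;
	for (int i = 0; i < loops; i++)
		sum += x / (u128)(step + (uint64_t)i);
	return sum;
}

int main(void)
{
	const int loops = 1000000;
	const u128 x = ((u128)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	u128 r = stress_div128(x, 0x10001ULL, loops);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* keep the result live so the divisions cannot be optimized away */
	volatile uint64_t sink = (uint64_t)r;
	(void)sink;

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f div128 ops per sec\n", loops / secs);
	return 0;
}
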
Created attachment 58159 [details]
gcc-13 disassembly
Created attachment 58160 [details]
gcc-14 disassembly
Created attachment 58161 [details] perf output for gcc-13 compiled code
Created attachment 58162 [details] perf output for gcc-14 compiled code
From my test, trunk only shows a <1% regression, if I calculated correctly.

[haochenj@shgcc101 ~]$ ./13.exe
1240.97 div128 ops per sec
[haochenj@shgcc101 ~]$ ./13.exe
1235.78 div128 ops per sec
[haochenj@shgcc101 ~]$ ./13.exe
1236.95 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1228.43 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1227.11 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1225.42 div128 ops per sec
I have got hold of a machine that reproduces the regression. It looks like a DSB miss from my data, but I don't know why yet. This needs more investigation.
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
I haven't benchmarked anything, but just looking at the assembly differences, the largest changes start with r14-2386-gbdf2737cda53a83332db1a1a021653447b05a7e7.
Created attachment 60680 [details]
Standalone reduction of libgcc's __udivti3

The bugzilla title implies that the issue is with 128-bit division, which in this testcase is performed by libgcc's __udivti3. Indeed, in Colin's attachments we appear to be doing worse at argument passing/shuffling (as observed by Jakub). However, this appears to be fixed (or better) for me on mainline and on godbolt's gcc 14 (see attached code).

Confusingly, __udivti3 wouldn't be impacted by the caller's use of -mavx, and indeed none of the attached code (caller and callee) actually uses AVX/SSE instructions or registers, so perhaps Haochen's analysis is right that this is some strange DSB scheduling issue?

I've not yet managed to reproduce the problem, so if someone could check linking the gcc-13 stress-cpu against the gcc-14 __udivti3, and likewise the gcc-14 stress-cpu against the gcc-13 __udivti3, we could narrow down which combination actually triggered the regression. Thanks in advance.
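
For anyone wanting to try that cross-linking experiment, the shape of such a standalone reduction is roughly as below. To be clear, this is only an illustrative shift-and-subtract sketch, not the code from attachment 60680 and not the algorithm libgcc actually uses, so the real attachment should be preferred for any measurements. The idea is that compiling the standalone file with one compiler and the benchmark with the other, e.g. "gcc-14 -O2 -c udivti3.c" followed by "gcc-13 -O2 reproducer-div128.c udivti3.o", should make the caller and the division routine come from different compilers, since an object given explicitly on the link line is normally resolved before the libgcc.a archive member.

/* udivti3.c -- illustrative only: a naive restoring (shift-and-subtract)
   TImode division using the same symbol name and signature as libgcc's
   __udivti3 (unsigned __int128 divided by unsigned __int128).  NOT the
   code from attachment 60680 and NOT libgcc's algorithm; it merely shows
   what a standalone, separately-compiled replacement looks like. */

typedef unsigned __int128 UTItype;

UTItype __udivti3(UTItype n, UTItype d)
{
	UTItype q = 0, r = 0;

	if (d == 0)	/* libgcc would trap here; keep the sketch simple */
		return 0;

	/* one quotient bit per iteration, no 128-bit division needed,
	   so this cannot recurse into itself */
	for (int i = 127; i >= 0; i--) {
		r = (r << 1) | ((n >> i) & 1);
		if (r >= d) {
			r -= d;
			q |= (UTItype)1 << i;
		}
	}
	return q;
}
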
I could not reproduce that from scratch for now either; if I recall correctly, I only reproduced it on one specific Skylake machine, not on all of them. BTW, the DSB issue is sometimes just "bad luck" caused by code layout, especially on older Intel platforms. Let me try to find where my regression binary is so I can sort that out.