Created attachment 58158 [details]
reproducer source code for __int128_t division regression

I'm seeing a 5% performance regression in gcc-14 compared to gcc-13, using gcc on Ubuntu 24.04.

Versions:
gcc version 13.2.0 (Ubuntu 13.2.0-23ubuntu4)
gcc version 14.0.1 20240412 (experimental) [master r14-9935-g67e1433a94f] (Ubuntu 14-20240412-0ubuntu1)

cking@skylake:~$ CFLAGS="" gcc-13 -O2 reproducer-div128.c
cking@skylake:~$ ./a.out
1650.83 div128 ops per sec
cking@skylake:~$ CFLAGS="" gcc-14 -O2 reproducer-div128.c
cking@skylake:~$ ./a.out
1567.48 div128 ops per sec

The original issue appeared when regression testing the stress-ng cpu div128 stressor [1]. I've managed to extract the attached reproducer from the original code.

Salient point to focus on:
1. The issue is also dependent on the TARGET_CLONES macro being defined as __attribute__((target_clones("avx,default"))); the avx target clone appears to be necessary to reproduce the problem.

Attached are the reproducer C source and disassembled object code.

References:
[1] https://github.com/ColinIanKing/stress-ng/blob/master/stress-cpu.c
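
For reference, a minimal sketch of the kind of loop such a reproducer exercises. This is not the attached reproducer-div128.c itself; the function name, constants and timing scaffolding below are illustrative, but the TARGET_CLONES definition matches the one described above, and the 128-bit divisions go through libgcc's __udivti3:

/* div128_sketch.c -- illustrative only, not the attached reproducer */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define TARGET_CLONES __attribute__((target_clones("avx,default")))

typedef __uint128_t u128;

/* repeatedly divide a large 128-bit value by a varying 64-bit divisor so
   the work is dominated by calls to libgcc's __udivti3 */
TARGET_CLONES u128 stress_div128(u128 x, uint64_t step, int loops)
{
	u128 sum = 0;
	for (int i = 0; i < loops; i++)
		sum += x / (u128)(step + (uint64_t)i);
	return sum;
}

int main(void)
{
	const int loops = 1000000;
	const u128 x = ((u128)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	u128 r = stress_div128(x, 0x10001ULL, loops);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* keep the result live so the divisions cannot be optimized away */
	volatile uint64_t sink = (uint64_t)r;
	(void)sink;

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.2f div128 ops per sec\n", loops / secs);
	return 0;
}
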
Created attachment 58159 [details]
gcc-13 disassembly
Created attachment 58160 [details]
gcc-14 disassembly
Created attachment 58161 [details] perf output for gcc-13 compiled code
Created attachment 58162 [details] perf output for gcc-14 compiled code
From my test, trunk only shows a <1% regression, if I calculated correctly.

[haochenj@shgcc101 ~]$ ./13.exe
1240.97 div128 ops per sec
[haochenj@shgcc101 ~]$ ./13.exe
1235.78 div128 ops per sec
[haochenj@shgcc101 ~]$ ./13.exe
1236.95 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1228.43 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1227.11 div128 ops per sec
[haochenj@shgcc101 ~]$ ./trunk.exe
1225.42 div128 ops per sec
I have got hold of a machine that reproduces the regression. It looks like a DSB miss from my data, but I don't know why yet. This needs more investigation.
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
I haven't benchmarked anything, but just looking at the assembly differences, the largest changes start with r14-2386-gbdf2737cda53a83332db1a1a021653447b05a7e7.
Created attachment 60680 [details]
Standalone reduction of libgcc's __udivti3

The bugzilla title implies that the issue is with 128-bit division, which in this testcase is performed by libgcc's __udivti3. Indeed, in Colin's attachments we appear to be doing worse at argument passing/shuffling (as observed by Jakub). However, this appears to be fixed (or better) for me on mainline and on godbolt's gcc 14 (see attached code).

Confusingly, __udivti3 wouldn't be impacted by the caller's use of -mavx, and indeed none of the attached code (caller and callee) actually uses AVX/SSE instructions or registers, so perhaps Haochen's analysis is right that this is some strange DSB scheduling issue?

I've not yet managed to reproduce the problem, so if someone could check linking the gcc-13 stress-cpu against the gcc-14 __udivti3, and likewise the gcc-14 stress-cpu against the gcc-13 __udivti3, we could narrow down which combination actually triggered the regression. Thanks in advance.
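
For anyone wanting to try that cross-linking experiment, the shape of such a standalone reduction is roughly as below. To be clear, this is only an illustrative shift-and-subtract sketch, not the code from attachment 60680 and not the algorithm libgcc actually uses, so the real attachment should be preferred for any measurements. The idea is that compiling the standalone file with one compiler and the benchmark with the other, e.g. "gcc-14 -O2 -c udivti3.c" followed by "gcc-13 -O2 reproducer-div128.c udivti3.o", should make the caller and the division routine come from different compilers, since an object given explicitly on the link line is normally resolved before the libgcc.a archive member.

/* udivti3.c -- illustrative only: a naive restoring (shift-and-subtract)
   TImode division using the same symbol name and signature as libgcc's
   __udivti3 (unsigned __int128 divided by unsigned __int128).  NOT the
   code from attachment 60680 and NOT libgcc's algorithm; it merely shows
   what a standalone, separately-compiled replacement looks like. */

typedef unsigned __int128 UTItype;

UTItype __udivti3(UTItype n, UTItype d)
{
	UTItype q = 0, r = 0;

	if (d == 0)	/* libgcc would trap here; keep the sketch simple */
		return 0;

	/* one quotient bit per iteration, no 128-bit division needed,
	   so this cannot recurse into itself */
	for (int i = 127; i >= 0; i--) {
		r = (r << 1) | ((n >> i) & 1);
		if (r >= d) {
			r -= d;
			q |= (UTItype)1 << i;
		}
	}
	return q;
}
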
I could not reproduce that from scratch for now either; if I recall correctly, I only reproduced it on one specific Skylake machine, not on all of them. BTW, the DSB issue is sometimes just "bad luck" caused by code layout, especially on older Intel platforms. Let me try to find where my regression binary is so I can sort that out.