I have detected a 7% regression of 525.x264_r from SPEC INTrate 2017 at -O2 and generic march/tuning on AMD EPYC (znver1) CPUs (I have not seen it on an Intel CPU), compared to the gcc-8-branch. I have bisected it to r264897.

With revision 264896 I get:

perf stat:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     495413.105450      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             80572      page-faults:u              #    0.163 K/sec
     1573525941814      cycles:u                   #    3.176 GHz                      (83.33%)
       56730573392      stalled-cycles-frontend:u  #    3.61% frontend cycles idle     (83.33%)
      397644125819      stalled-cycles-backend:u   #   25.27% backend cycles idle      (83.33%)
     5157395976259      instructions:u             #    3.28  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.33%)
      421019689027      branches:u                 #  849.836 M/sec                    (83.33%)
       10705813341      branch-misses:u            #    2.54% of all branches          (83.33%)

     495.869208013 seconds time elapsed

perf report -n --percent-limit 2:

# Event count (approx.): 1576108148398
#
# Overhead   Samples  Command      Shared Object   Symbol
# ........  ........  ...........  ..............  ............................
#
    14.20%    282290  x264_r_base  x264_r_base.mi  [.] x264_pixel_satd_8x4
    11.19%    222403  x264_r_base  x264_r_base.mi  [.] get_ref
    10.82%    215061  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x4_16x16
     7.00%    139082  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_16x16
     6.11%    121470  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x3_16x16
     5.89%    116939  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x4_8x8
     5.09%    101266  x264_r_base  x264_r_base.mi  [.] quant_4x4
     4.10%     81471  x264_r_base  x264_r_base.mi  [.] mc_chroma
     2.47%     49122  x264_r_base  x264_r_base.mi  [.] x264_pixel_sad_x3_8x8
     2.21%     43928  x264_r_base  x264_r_base.mi  [.] sub4x4_dct
     2.14%     42598  x264_r_base  x264_r_base.mi  [.] pixel_hadamard_ac

With revision 264897 I get:

perf stat:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     495413.105450      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             80572      page-faults:u              #    0.163 K/sec
     1573525941814      cycles:u                   #    3.176 GHz                      (83.33%)
       56730573392      stalled-cycles-frontend:u  #    3.61% frontend cycles idle     (83.33%)
      397644125819      stalled-cycles-backend:u   #   25.27% backend cycles idle      (83.33%)
     5157395976259      instructions:u             #    3.28  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.33%)
      421019689027      branches:u                 #  849.836 M/sec                    (83.33%)
       10705813341      branch-misses:u            #    2.54% of all branches          (83.33%)

     495.869208013 seconds time elapsed

perf report -n --percent-limit 2:

# Event count (approx.): 1576108148398
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  .................................................
#
    14.20%    282290  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.19%    222403  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
    10.82%    215061  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
     7.00%    139082  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     6.11%    121470  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     5.89%    116939  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     5.09%    101266  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     4.10%     81471  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.47%     49122  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.21%     43928  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.14%     42598  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac
Cut and paste error? The two data sets look the same to me...or am I missing something?
Doh, yes, copy-paste error, sorry. The data should have been:

FAST:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     495413.105450      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             80572      page-faults:u              #    0.163 K/sec
     1573525941814      cycles:u                   #    3.176 GHz                      (83.33%)
       56730573392      stalled-cycles-frontend:u  #    3.61% frontend cycles idle     (83.33%)
      397644125819      stalled-cycles-backend:u   #   25.27% backend cycles idle      (83.33%)
     5157395976259      instructions:u             #    3.28  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.33%)
      421019689027      branches:u                 #  849.836 M/sec                    (83.33%)
       10705813341      branch-misses:u            #    2.54% of all branches          (83.33%)

     495.869208013 seconds time elapsed

# Event count (approx.): 1576108148398
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  .................................................
#
    14.20%    282290  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.19%    222403  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
    10.82%    215061  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
     7.00%    139082  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     6.11%    121470  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     5.89%    116939  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     5.09%    101266  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     4.10%     81471  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.47%     49122  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.21%     43928  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.14%     42598  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac

SLOW:

 Performance counter stats for 'numactl -C 0 -l specinvoke':

     526858.531112      task-clock:u (msec)        #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
             81064      page-faults:u              #    0.154 K/sec
     1673634535742      cycles:u                   #    3.177 GHz                      (83.33%)
       64458929239      stalled-cycles-frontend:u  #    3.85% frontend cycles idle     (83.33%)
      397586117982      stalled-cycles-backend:u   #   23.76% backend cycles idle      (83.33%)
     5157346862311      instructions:u             #    3.08  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.33%)
      421082988475      branches:u                 #  799.234 M/sec                    (83.33%)
       14226205709      branch-misses:u            #    3.38% of all branches          (83.33%)

     527.353829377 seconds time elapsed

# Samples: 2M of event 'cycles'
# Event count (approx.): 1675655436335
#
# Overhead   Samples  Command          Shared Object                 Symbol
# ........  ........  ...............  ............................  .................................................
#
    14.13%    298519  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_16x16
    13.43%    283793  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_satd_8x4
    11.56%    244196  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] get_ref
     7.17%    151589  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_16x16
     6.29%    132936  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_16x16
     5.28%    111517  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x4_8x8
     4.84%    102317  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] quant_4x4
     3.86%     81563  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] mc_chroma
     2.57%     54233  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] x264_pixel_sad_x3_8x8
     2.08%     43964  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] sub4x4_dct
     2.01%     42520  x264_r_base.min  x264_r_base.mine-gen-std-m64  [.] pixel_hadamard_ac
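For what it's worth, the magnitude of the slowdown can be read straight off the two task-clock figures quoted above; the awk one-liner below is just that arithmetic, nothing more:

```shell
# Slowdown implied by the FAST vs. SLOW task-clock numbers from the
# perf stat output above (525.x264_r, -O2, generic tuning, znver1).
awk 'BEGIN {
    fast = 495413.105450   # r264896 task-clock (msec)
    slow = 526858.531112   # r264897 task-clock (msec)
    printf "slowdown: %.1f%%\n", (slow / fast - 1) * 100
}'
```

In the same runs the branch misprediction rate also rose from 2.54% to 3.38% of all branches, while the instruction count stayed essentially flat, which fits a code-placement or predictor effect rather than worse code.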
I don't have access to that type of machine and honestly don't know the ISA well enough to tell what runs well and what doesn't just by looking at the code. Can you point out some code/function where the assembler code is worse? The patch you bisected to only removes unneeded conflicts in the conflict graph, which gives the allocators more freedom, which in general is a good thing. That said, since these are all heuristics built on top of heuristics, it's not impossible that giving more freedom could lead to worse code. My guess, though, is that we're probably tickling an AMD-specific hardware pipeline feature, since you said you don't see the same thing on Intel.
Just for the record, the periodic tester on my Ryzen machine probably improved due to the revision:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0

As seen there, it's now about 5% faster than the GCC 8 branch.
(In reply to Martin Liška from comment #4)
> Just for the record, my Ryzen machine periodic tester probably improved due
> to the revision:
> https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=158.377.0&plot.1=41.377.0&plot.2=70.377.0&plot.3=31.377.0
>
> As seen, it's now about 5% faster than GCC8 branch.

Very interesting, thanks for that! Since the two of you both used -O2 and generic tuning (i.e., the same code), that would tend to agree with my speculation that this is an AMD EPYC-specific pipeline issue/hazard/... we're unluckily hitting. Agreed? If so, I'm not sure we can really blame my patch, but if someone could narrow down the exact issue that is causing the slowdown, maybe we can mitigate it somehow.
Hi, the assembly of the most affected function does not change at all, just its offset (it is 0x10 bytes bigger). Aligning the loops in the function a bit more avoids most of the slowdown, but not quite all of it. In any event, this is a microarchitectural problem that we probably cannot do anything about. Sorry for the noise; I will check for this next time before I report a problem.
(In reply to Martin Jambor from comment #6)
> Hi, the assembly of the most affected function does not change at all, just
> its offset (is 0x10 bytes bigger). Aligning the loops in the function a bit
> more avoids most of the slowdown but not quite all of it. In any event,
> this is a microarchitectural problem that we probably cannot do anything
> about. Sorry for the noise, I will check for this the next time before I
> report a problem.

We've seen similar issues on POWER, where a particular revision causes a slight size change in one function, which shifts the offsets of later functions and leads to a performance change. Unfortunately, just increasing function alignment to eliminate that has other unintended performance consequences. Thanks for isolating the issue.