Compiling the SPEC CPU2006 benchmark 435.gromacs on an AMD Zen4 CPU with -Ofast -march=native and PGO using current trunk GCC and running the benchmark results in the following miscomparison:

0002: 3.07684e+02 3.11606e+02
      ^

According to https://www.spec.org/cpu2006/Docs/435.gromacs.html

> The gromacs.out results shouldn't differ by more than 1.25% from the reference values

and this difference is ~1.27%.
The oldest commit with this miscomparison I found so far is g:eb619490b01baa2f. The most recent commit without the miscomparison I found so far is g:405096f908e1ceb0.
Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.
If it is an FP rounding issue then my guess would be the extra reduc_scal_* patterns in the x86 backend that enable more BB reduction vectorization. I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros -fno-trapping-math -ffinite-math-only -ffp-contract=fast?
I've bisected this to g:57f611604e8bab67af6c0bcfe6ea88c001408412.

(In reply to Richard Biener from comment #3)
> I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros
> -fno-trapping-math -ffinite-math-only -ffp-contract=fast?

I'll take a look.

(In reply to Filip Kastl from comment #2)
> Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.

Disregard this. g:eb619490b01baa2f *does* miscompare. I had some trouble with my bisect script.
I can confirm that the benchmark doesn't miscompare with -O3 -march=native and PGO, nor with -O3 -march=native, PGO and -fno-signed-zeros -fno-trapping-math -ffinite-math-only -ffp-contract=fast. Btw, I just noticed that the bisection led to the same commit as the bisection of PR113833. I suppose both PRs relate to the same problem.
I just observed the same miscomparison on a Zen2 machine. Exactly the same values:

0002: 3.07684e+02 3.11606e+02

So I assume it's the same problem.
I guess that bisecting with -fno-vect-cost-model might reveal the actual offender (if it doesn't turn up other problems, of course).
Confirmed on Zen2 btw. I'm going to have a closer look here.
Interestingly r14-7272-g57f611604e8bab causes quite some BB vectorization cases to be rejected - I would have expected it to only get us more vectorization?

-innerf.f:277:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:281:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:323:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:327:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:330:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:630:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:634:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:637:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:679:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:683:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:686:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1283:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1290:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1358:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1361:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2004:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2008:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2011:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2082:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2086:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2089:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2365:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2369:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2372:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3347:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3351:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3390:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3397:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3434:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3438:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3441:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4477:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4481:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4484:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4520:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4524:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4567:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4571:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5702:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5706:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5709:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5745:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5749:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5752:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5815:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5819:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5822:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6114:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6118:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:6127:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7050:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7054:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7057:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7093:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7097:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7100:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7170:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7174:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7177:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7479:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7483:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:7492:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8280:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9306:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9310:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9313:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9350:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9357:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9398:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9401:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10483:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10487:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10490:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10531:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10534:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10578:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10581:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11751:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11755:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11758:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11795:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11799:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11802:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11865:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11869:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11872:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12173:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12177:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:12186:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13136:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13140:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13143:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13180:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13184:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13187:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13257:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13261:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13264:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13574:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13578:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:13587:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14631:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14635:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14638:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15896:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15900:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15903:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15953:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15957:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15960:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15998:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16002:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16005:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17325:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17329:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17332:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17382:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17386:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17389:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17429:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17433:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17436:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19038:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19042:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19045:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19095:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19099:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19102:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19164:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19168:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19171:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20844:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20848:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20851:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20901:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20905:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20908:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20977:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20981:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20984:72: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-clincs.c:65:12: optimized: basic block part vectorized using 8 byte vectors
-clincs.c:151:12: optimized: basic block part vectorized using 8 byte vectors
-vec.h:537:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:453:15: optimized: basic block part vectorized using 32 byte vectors
-coupling.c:188:10: optimized: basic block part vectorized using 8 byte vectors
-vec.h:426:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-update.c:266:16: optimized: basic block part vectorized using 8 byte vectors
So the reference output is

  -3.22397e+05  3.07684e+02  1.06621e+10

and before the change we get

  -3.22205e+05  3.05161e+02  1.06660e+10

while after it we get

  -3.22401e+05  3.11606e+02  1.06579e+10

Vectorization differences show up in innerf.o, bondfree.o, clincs.o, coupling.o, disre.o and update.o, while all but innerf.o show only less vectorization. Using only the "bad" version of innerf.o gets us

  -3.23378e+05  3.08348e+02  1.06697e+10

which should still PASS. Replacing all of the above TUs with the bad objects reproduces the bad output. Replacing update.o, disre.o, coupling.o or clincs.o with the GOOD version doesn't change the output, so only innerf.o and bondfree.o make a difference. Using only the BAD bondfree.o gives

  -3.22265e+05  3.07882e+02  1.06644e+10

which would have been OK as well. The bondfree.o change is small:

@@ -2720,9 +2588,6 @@
 vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
 bondfree.c:806:16: optimized: basic block part vectorized using 8 byte vectors
 vec.h:239:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:265:8: optimized: basic block part vectorized using 8 byte vectors

while the innerf.o changes are many (but possibly similar). I'll try to understand the bondfree change first.
That's the following change in the function idihs:

 vec.h:380:9: note: Cost model analysis for part in loop 1:
- Vector cost: 624
- Scalar cost: 700
-vec.h:380:9: note: Basic block will be vectorized using SLP
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: note: Vectorizing SLP tree:
-vec.h:380:9: note: node 0x345f188 (max_nunits=2, refcnt=1) vector(2) float
+ Vector cost: 640
+ Scalar cost: 532
+vec.h:380:9: missed: not vectorized: vectorization is not profitable.

where it basically changes what nodes we think are live. Note this is a larger graph with multiple instances, so we might suffer from what I noted in PR114413. The IL has all but the call to do_dih_fup inlined into idihs.
Created attachment 57816 [details]
manually reduced preprocessed source

This is the TU reduced to idihs where I put flatten on; with just -Ofast -march=znver2 one can reproduce the vectorization difference when reverting the offending revision on top of trunk. I'll have a closer look tomorrow.
OK, so I think the change is that we get to "correctly" notice

-vec.h:380:9: note: node (external) 0x6a2e9d8 (max_nunits=2, refcnt=1) vector(2) float
-vec.h:380:9: note: stmt 0 _164 = MEM[(const real *)_27 + 8B];
-vec.h:380:9: note: stmt 1 _158 = MEM[(const real *)_27];
+vec.h:380:9: note: node (external) 0x5a823a8 (max_nunits=2, refcnt=1) vector(2) float
+vec.h:380:9: note: [l] stmt 0 _164 = MEM[(const real *)_27 + 8B];
+vec.h:380:9: note: [l] stmt 1 _158 = MEM[(const real *)_27];

for the loads we do not handle because of gaps and that get promoted external. That leads to extra costs. But also

+vec.h:380:9: note: node 0x5a81770 (max_nunits=2, refcnt=2) vector(2) float
 vec.h:380:9: note: op template: x_160 = _158 - _159;
 vec.h:380:9: note: stmt 0 x_160 = _158 - _159;
-vec.h:380:9: note: [l] stmt 1 y_163 = _161 - _162;
+vec.h:380:9: note: stmt 1 y_163 = _161 - _162;

so y_163 isn't considered live for some reason. We find

  _123 = _117 * y_163;

is vectorized as part of a reduction. On the costing side we then see

-_161 - _162 1 times scalar_stmt costs 12 in body
-MEM[(const real *)_27 + 4B] 1 times scalar_load costs 12 in body
-MEM[(const real *)_24 + 4B] 1 times scalar_load costs 12 in body

which are the live (and dependent) stmts no longer costed on the scalar side, but also

+MEM[(const real *)_27 + 8B] 1 times vec_to_scalar costs 4 in epilogue
+MEM[(const real *)_24 + 8B] 1 times vec_to_scalar costs 4 in epilogue

costed in the vector epilogue. This is because we're conservative, as we don't really know whether we'll be able to code-generate the live operation. The costing side here is also not in sync, as can be seen from the removed _161 - _162 op. I should also note that the setting of PURE_SLP is done a bit too early, before we analyze operations and eventually throw away instances or prune them by promoting ops external. For reductions we also falsely claim all root stmts are vectorized - we do have remain ops.
Fixing this restores the LIVE marking on them and thereby restores vectorization. I'm going to test this as the fix for now.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:0b02da5b99e89347f5f8bf875ec8318f84adff18

commit r14-9687-g0b02da5b99e89347f5f8bf875ec8318f84adff18
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Mar 27 11:37:16 2024 +0100

    tree-optimization/114057 - handle BB reduction remain defs as LIVE

    The following makes sure to record the scalars we add to the BB
    reduction vectorization result as scalar uses for the purpose of
    computing live lanes.  This restores vectorization in the bondfree.c
    TU of 435.gromacs.

            PR tree-optimization/114057
            * tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Mark
            BB reduction remain defs as scalar uses.
I wasn't able to reproduce the miscompare on a Zen4 machine (though with -march=znver2). But the original vectorization of bondfree.c should be restored and thus the miscompare gone. I'll verify tomorrow when I'm back at the machine where I was able to reproduce the issue.
Confirmed fixed.