Bug 114057 - [14 Regression] 435.gromacs fails verification with -Ofast -march={znver2,znver4} and PGO after r14-7272-g57f611604e8bab
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.0
Importance: P1 normal
Target Milestone: 14.0
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on:
Blocks: spec
Reported: 2024-02-22 16:19 UTC by Filip Kastl
Modified: 2024-03-28 07:04 UTC
CC: 3 users

See Also:
Host: x86_64-linux
Target: x86_64-linux
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-03-26 00:00:00


Attachments
manually reduced preprocessed source (31.92 KB, text/plain)
2024-03-26 15:34 UTC, Richard Biener

Description Filip Kastl 2024-02-22 16:19:14 UTC
Compiling the 2006 CPU benchmark 435.gromacs on an AMD Zen4 CPU with -Ofast -march=native and PGO using the current trunk GCC and running the benchmark results in the following miscomparison:

 0002:  3.07684e+02
        3.11606e+02
                  ^

According to https://www.spec.org/cpu2006/Docs/435.gromacs.html
> The gromacs.out results shouldn't differ by more than 1.25% from the reference values
And this difference is ~1.27%.
Comment 1 Filip Kastl 2024-02-22 17:50:14 UTC
The oldest commit with this miscomparison I found so far is g:eb619490b01baa2f. The most recent commit without the miscomparison I found so far is g:405096f908e1ceb0.
Comment 2 Filip Kastl 2024-02-22 18:08:00 UTC
Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.
Comment 3 Richard Biener 2024-02-23 06:54:47 UTC
If it is a FP rounding issue then my guess would be the extra reduc_scal_* patterns in the x86 backend to do more BB reduction vectorization.

I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros -fno-trapping-math -ffinite-math-only -ffp-contract=fast?
Comment 4 Filip Kastl 2024-02-23 09:44:20 UTC
I've bisected this to g:57f611604e8bab67af6c0bcfe6ea88c001408412.


(In reply to Richard Biener from comment #3)
> I guess it doesn't miscompare with -O3 or -O3 -fno-signed-zeros
> -fno-trapping-math -ffinite-math-only -ffp-contract=fast?

I'll take a look.


(In reply to Filip Kastl from comment #2)
> Hm, seems like g:eb619490b01baa2f actually doesn't miscompare. My bad.

Disregard this. g:eb619490b01baa2f *does* miscompare. I had some troubles with my bisect script.
Comment 5 Filip Kastl 2024-02-23 10:26:27 UTC
I can confirm that the benchmark doesn't miscompare with -O3 -march=native PGO, nor with -O3 -march=native PGO -fno-signed-zeros -fno-trapping-math -ffinite-math-only -ffp-contract=fast.

Btw I just noticed that the bisection led to the same commit as the bisection of PR113833. I suppose that both PRs relate to the same problem.
Comment 6 Filip Kastl 2024-02-25 11:11:26 UTC
I just observed the same miscomparison on a Zen2 machine. Exactly the same values:

0002:  3.07684e+02
       3.11606e+02

So I assume it's the same problem.
Comment 7 Richard Biener 2024-03-26 10:59:41 UTC
I guess that bisecting with -fno-vect-cost-model might reveal the actual offender.  (if it doesn't turn up other problems, of course)
Comment 8 Richard Biener 2024-03-26 11:27:00 UTC
Confirmed on Zen2 btw.  I'm going to have a closer look here.
Comment 9 Richard Biener 2024-03-26 12:02:55 UTC
Interestingly r14-7272-g57f611604e8bab causes quite some BB vectorization
cases to be rejected - I would have expected it to only get us more
vectorization?

-innerf.f:277:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:281:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:323:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:327:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:330:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:630:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:634:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:637:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:679:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:683:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:686:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1283:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1290:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1358:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:1361:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2004:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2008:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2011:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2082:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2086:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2089:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2365:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2369:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:2372:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3347:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3351:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3390:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3397:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3434:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3438:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:3441:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4477:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4481:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4484:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4520:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4524:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4567:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4571:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:4574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5702:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5706:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5709:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5745:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5749:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5752:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5815:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5819:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:5822:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6114:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:6118:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:6127:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7050:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7054:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7057:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7093:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7097:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7100:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7170:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7174:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:7177:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7479:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:7483:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:7492:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8280:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8284:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:8287:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9306:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9310:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9313:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9350:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9354:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9357:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9394:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9398:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:9401:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10483:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10487:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10490:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10527:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10531:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10534:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10574:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10578:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:10581:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11751:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11755:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11758:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11795:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11799:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11802:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11865:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11869:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:11872:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12173:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:12177:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:12186:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13136:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13140:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13143:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13180:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13184:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13187:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13257:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13261:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:13264:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13574:72: optimized: basic block part vectorized using 8 byte vectors
+innerf.f:13578:72: optimized: basic block part vectorized using 32 byte vectors
+innerf.f:13587:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14631:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14635:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:14638:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15896:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15900:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15903:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15953:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15957:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15960:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:15998:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16002:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:16005:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17325:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17329:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17332:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17382:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17386:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17389:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17429:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17433:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:17436:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19038:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19042:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19045:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19095:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19099:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19102:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19164:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19168:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:19171:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20844:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20848:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20851:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20901:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20905:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20908:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20977:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20981:72: optimized: basic block part vectorized using 8 byte vectors
-innerf.f:20984:72: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-clincs.c:65:12: optimized: basic block part vectorized using 8 byte vectors
-clincs.c:151:12: optimized: basic block part vectorized using 8 byte vectors
-vec.h:537:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:453:15: optimized: basic block part vectorized using 32 byte vectors
-coupling.c:188:10: optimized: basic block part vectorized using 8 byte vectors
-vec.h:426:15: optimized: basic block part vectorized using 32 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-update.c:266:16: optimized: basic block part vectorized using 8 byte vectors
Comment 10 Richard Biener 2024-03-26 15:32:18 UTC
So the ref output is

-3.22397e+05
3.07684e+02
1.06621e+10

and before the change we get

-3.22205e+05
3.05161e+02
1.06660e+10

while after it is

-3.22401e+05
3.11606e+02
1.06579e+10

vectorization differences show in innerf.o, bondfree.o, clincs.o, coupling.o,
disre.o and update.o while all but innerf.o show only less vectorization.
Only using the "bad" version of innerf.o gets us

-3.23378e+05
3.08348e+02
1.06697e+10

which should still PASS.  Replacing all above TUs with the bad objects
reproduces the bad output.

Replacing update.o, disre.o, coupling.o or clincs.o with the GOOD version doesn't change the output, so it's only innerf.o and bondfree.o making
a difference.  Using only BAD bondfree.o gives

-3.22265e+05
3.07882e+02
1.06644e+10

That would have been OK as well.

The bondfree.o change is small:

@@ -2720,9 +2588,6 @@
 vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
 bondfree.c:806:16: optimized: basic block part vectorized using 8 byte vectors
 vec.h:239:8: optimized: basic block part vectorized using 8 byte vectors
 vec.h:265:8: optimized: basic block part vectorized using 8 byte vectors

while the innerf.o changes are many (but possibly similar).

I'll try to understand the bondfree change first.  That's the following
change in the function idihs:

 vec.h:380:9: note: Cost model analysis for part in loop 1:
-  Vector cost: 624
-  Scalar cost: 700
-vec.h:380:9: note: Basic block will be vectorized using SLP
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:252:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:419:8: optimized: basic block part vectorized using 8 byte vectors
-vec.h:378:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:379:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: optimized: basic block part vectorized using 16 byte vectors
-vec.h:380:9: note: Vectorizing SLP tree:
-vec.h:380:9: note: node 0x345f188 (max_nunits=2, refcnt=1) vector(2) float
+  Vector cost: 640
+  Scalar cost: 532
+vec.h:380:9: missed: not vectorized: vectorization is not profitable.

where it basically changes what nodes we think are live.  Note this is
a larger graph with multiple instances so we might suffer from
what I noted in PR114413.

The IL has all but the call to do_dih_fup inlined into idihs.
Comment 11 Richard Biener 2024-03-26 15:34:51 UTC
Created attachment 57816 [details]
manually reduced preprocessed source

This is the TU reduced to idihs where I put flatten on; with just -Ofast -march=znver2 one can reproduce the vectorization difference when reverting
the offending revision on top of trunk.

I'll have a closer look tomorrow.
Comment 12 Richard Biener 2024-03-27 10:36:58 UTC
OK, so I think the change is that we get to "correctly" notice

-vec.h:380:9: note: node (external) 0x6a2e9d8 (max_nunits=2, refcnt=1) vector(2) float
-vec.h:380:9: note:     stmt 0 _164 = MEM[(const real *)_27 + 8B];
-vec.h:380:9: note:     stmt 1 _158 = MEM[(const real *)_27];
+vec.h:380:9: note: node (external) 0x5a823a8 (max_nunits=2, refcnt=1) vector(2) float
+vec.h:380:9: note:     [l] stmt 0 _164 = MEM[(const real *)_27 + 8B];
+vec.h:380:9: note:     [l] stmt 1 _158 = MEM[(const real *)_27];

for the loads we do not handle because of gaps and promoted external.  That
leads to extra costs.

But also

+vec.h:380:9: note: node 0x5a81770 (max_nunits=2, refcnt=2) vector(2) float
 vec.h:380:9: note: op template: x_160 = _158 - _159;
 vec.h:380:9: note:     stmt 0 x_160 = _158 - _159;
-vec.h:380:9: note:     [l] stmt 1 y_163 = _161 - _162;
+vec.h:380:9: note:     stmt 1 y_163 = _161 - _162;

so y_163 isn't considered live for some reason.  We find

_123 = _117 * y_163;

is vectorized as part of a reduction.  On the costing side we then see

-_161 - _162 1 times scalar_stmt costs 12 in body
-MEM[(const real *)_27 + 4B] 1 times scalar_load costs 12 in body
-MEM[(const real *)_24 + 4B] 1 times scalar_load costs 12 in body

which is the live (and dependent) stmts no longer costed on the scalar
side but also

+MEM[(const real *)_27 + 8B] 1 times vec_to_scalar costs 4 in epilogue
+MEM[(const real *)_24 + 8B] 1 times vec_to_scalar costs 4 in epilogue

costed in the vector epilog.  This is because we're conservative as we
don't really know whether we'll be able to code-generate the live
operation.  The costing side here is also not in sync as can be seen
from the _161 - _162 op removed.

I should also note that the setting of PURE_SLP is done a bit too early,
before we analyze operations and eventually throw away instances or
prune it by promoting ops external.

For reductions we also falsely claim all root stmts are vectorized - we
do have remain ops.  Fixing this restores the LIVE on them and in some
way restores vectorization.

I'm going to test this as fix for now.
Comment 13 GCC Commits 2024-03-27 11:50:14 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:0b02da5b99e89347f5f8bf875ec8318f84adff18

commit r14-9687-g0b02da5b99e89347f5f8bf875ec8318f84adff18
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Mar 27 11:37:16 2024 +0100

    tree-optimization/114057 - handle BB reduction remain defs as LIVE
    
    The following makes sure to record the scalars we add to the BB
    reduction vectorization result as scalar uses for the purpose of
    computing live lanes.  This restores vectorization in the
    bondfree.c TU of 435.gromacs.
    
            PR tree-optimization/114057
            * tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Mark
            BB reduction remain defs as scalar uses.
Comment 14 Richard Biener 2024-03-27 11:51:29 UTC
I wasn't able to reproduce the miscompare on a Zen4 machine (but with -march=znver2).  But the original vectorization of bondfree.c should be
restored and thus the miscompare gone.  I'll verify tomorrow when I'm back
at the machine where I was able to reproduce the issue.
Comment 15 Richard Biener 2024-03-28 07:04:14 UTC
Confirmed fixed.