Since the revision --size=train miscompares, I was able to track that to 2 LTO object files that are needed for it to happen: ecpder.fppized.o ecplib.fppized.o
One can reproduce it in build dir: $ cp /cpu2006/benchspec/CPU2006/416.gamess/data/train/input/* . $ timeout 10 ./gamess < h2ocu2+.energy.config ... 3 2 -270.604580685 0.000920175 0.040597819 0.071110441 68854147 9774 -270.6 is a good value. while 3 2 -276.344499030 -0.114881130 0.489865944 0.157922540 69406341 3920 is a bad value.
Created attachment 51794 [details] Bad optimized dump
Created attachment 51795 [details] Good optimized dump
As seen the change triggers one new ISRA clone: diff -u gamess.ltrans0.ltrans.250t.optimized.good gamess.ltrans0.ltrans.250t.optimized.bad | head --- gamess.ltrans0.ltrans.250t.optimized.good 2021-11-15 11:32:03.388858701 +0100 +++ gamess.ltrans0.ltrans.250t.optimized.bad 2021-11-15 11:33:06.468521265 +0100 @@ -1,5 +1,5 @@ -;; Function dco (dco_, funcdef_no=0, decl_uid=4939, cgraph_uid=8, symbol_order=140) +;; Function dco.isra (dco_.isra.0, funcdef_no=1, decl_uid=5003, cgraph_uid=28, symbol_order=288) Removing basic block 9 Removing basic block 10 @@ -7,168 +7,159 @@
Looking into optimized dump, I wonder why IPA-SRA is rpelacing some of parameters passed by reference but not all of them: -__attribute__((fn spec (". w w w w w w w w w w w w w "))) -double dco (int & restrict l, int & restrict m, int & restrict kx, int & restrict ky, int & restrict kz, int & restrict lp, int & restrict mp, double[15625] * restrict fpqr, double[0:] * restrict zlm, int[0:] * restrict lmf, int[0:] * restrict lmx, int[0:] * restrict lmy, int[0:] * restrict lmz) +double dco.isra (int ISRA.1, int ISRA.1, int & restrict kx, int & restrict ky, int & restrict kz, int ISRA.2, int ISRA.3, double[15625] * restrict fpqr, double[0:] * restrict zlm, int[0:] * restrict lmf, int[0:] * restrict lmx, int[0:] * restrict lmy, int[0:] * restrict lmz) I would expect kx,ky and kz to be also replaced. -void formdr.constprop (double[0:] * restrict g, double[0:] * restrict g2, double[0:] * restrict xin, double[0:] * restrict yin, double[0:] * restrict zin, int & restrict mini, int & restrict maxi, int & restrict minj, int & restrict maxj, logical(kind=4) & restrict norm) +void formdr.constprop (double[0:] * restrict g, double[0:] * restrict g2, double[0:] * restrict xin, double[0:] * restrict yin, double[0:] * restrict zin, int & restrict minj, int & restrict maxj) similarly here we drop some parameters by we do not seem to do the replacement for rest of them.
Fun fact, for me the benchmark passes (test) verification when built with: -O2 -g -flto=16 -fno-ipa-sra but fails when built with: -O2 -g -flto=16 -fno-ipa-sra -fdump-ipa-all Consistently so. On revision r12-5309-g0002a8a1997c7b. I thought I was going mad when attempting to somehow isolate what was going on. But I do not really know what to do with this piece of knowledge.
I have to leave the office for today. The problematic option seems to be -fdump-ipa-inline. With it, 28 out of 124 partitions are different in their ccp2 dumps but so far I only found UID differences.
Yes, it's really as you described. Started with r11-5061-g85ebbabd85e03bdc.
(In reply to Martin Liška from comment #8) > Yes, it's really as you described. Started with r11-5061-g85ebbabd85e03bdc. There are quite some -Wlto-type-mismatch warnings, so it may be related to that?
I isolated small reproducer: $ cat 1.f SUBROUTINE DENDD1(B,L2) IF(CITYPIPSI.EQ.1) CALL VADD(A1DB1DA,1,L2) END $ cat 2.f LOGICAL BSFDD L2 = 2 CALL DENDD1(X0X0,L2) IF(BSFDD) THEN CALL VCLR(X01NXYZ2*L2) CALL HFD(BSFDD) END IF END $ cat cmd FLAGS="-O2 -flto=16 -flto-partition=1to1 -fPIC" gfortran-11 *.f $FLAGS -c gfortran-11 *.o $FLAGS -o gamess1 -shared $OBJ gfortran-11 *.o $FLAGS -fdump-ipa-inline -o gamess2 -shared $OBJ objdump -d gamess1 > gamess1.txt objdump -d gamess2 > gamess2.txt diff -u gamess1 gamess2 if test $? = 1; then exit 0 else exit 1 fi $ ./cmd && diff gamess1.txt gamess2.txt ... @@ -76,91 +76,88 @@ 10d8: 31 c0 xor %eax,%eax 10da: 48 83 c4 18 add $0x18,%rsp 10de: c3 ret - 10df: 66 0f ef c0 pxor %xmm0,%xmm0 - 10e3: 48 8d 7c 24 0c lea 0xc(%rsp),%rdi - 10e8: f3 0f 2a 44 24 04 cvtsi2ssl 0x4(%rsp),%xmm0 - 10ee: f3 0f 59 05 26 0f 00 mulss 0xf26(%rip),%xmm0 # 201c <options.0.0+0x1c> - 10f5: 00 - 10f6: f3 0f 11 44 24 0c movss %xmm0,0xc(%rsp) - 10fc: e8 2f ff ff ff call 1030 <vclr_@plt> - 1101: 48 89 e7 mov %rsp,%rdi ... but this got fixed on master with r12-5223-gecdf414bd89e6ba2. I'll try if it happens with the current master. Hope it helps.
Thanks for reducing the testcase. I found the reason for build difference. There is misplaced code clearing to_info_lto. hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build4/gcc$ cat ~/fix diff --git a/gcc/ipa-modref.c b/gcc/ipa-modref.c index a70575bc807..1241e567266 100644 --- a/gcc/ipa-modref.c +++ b/gcc/ipa-modref.c @@ -5123,6 +5150,7 @@ ipa_merge_modref_summary_after_inlining (cgraph_edge *edge) fprintf (dump_file, "Removed mod-ref summary for %s\n", to->dump_name ()); summaries_lto->remove (to); + to_info_lto = NULL; } else if (to_info_lto && dump_file) { @@ -5130,7 +5158,6 @@ ipa_merge_modref_summary_after_inlining (cgraph_edge *edge) fprintf (dump_file, "Updated mod-ref summary for %s\n", to->dump_name ()); to_info_lto->dump (dump_file); - to_info_lto = NULL; } if (callee_info_lto) summaries_lto->remove (edge->callee);
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>: https://gcc.gnu.org/g:8c693978dd64b16637577ebf50c760053d7d2165 commit r12-5328-g8c693978dd64b16637577ebf50c760053d7d2165 Author: Jan Hubicka <jh@suse.cz> Date: Wed Nov 17 01:43:57 2021 +0100 Fix clearing of to_info_lto in ipa_merge_modref_summary_after_inlining This patch fixes bug that caused some optimizations to be dropped with -fdump-ipa-inline. gcc/ChangeLog: 2021-11-17 Jan Hubicka <hubicka@ucw.cz> PR ipa/103246 * ipa-modref.c (ipa_merge_modref_summary_after_inlining): Fix clearing of to_info_lto
(In reply to Jan Hubicka from comment #11) > Thanks for reducing the testcase. > I found the reason for build difference. There is misplaced code clearing > to_info_lto. Thanks! Great you found it so quickly.
> Thanks! Great you found it so quickly. It is bit stupid code since everything is duplicated twice (for LTO and non-LTO). I have to refactor it: we could have common base of the two summaries and handle most of things in unified way. The problem is that gengtype does not like it, so it will probably need manual GGC annotations. It stil leaves us with a wrong code, right?
Yep, I still see difference in the output with and without SRA
Building with LTO only ecpder.fppized.o and ecplib.fppized.o I get the failure going away with --dbg-cnt=ipa_cp_values:10 and :9 works. Different ipa-cp decision is +***dbgcnt: upper limit 10 reached for ipa_cp_values.*** + - Creating a specialized node of formdr/4 for all known contexts. + replacing param #5 int & restrict with const &ecpidx.iamin + replacing param #6 int & restrict with const &ecpidx.iamax + Removed a reference from formii.constprop/284 to ecpidx/11. + ...and replaced it with LOAD one. + Removed a reference from ecp2d/1 to ecpidx/11. + Removed a reference from ecp1d/0 to ecpidx/11. + replacing param #9 logical(kind=4) & restrict with const &ecpidx.norm - forcing load reference In modref the propagation goes same way and we update signature as follows: +Updating summary for formdr.constprop/285 from: + loads: + Limits: 32 bases, 16 refs + Every base + stores: + Limits: 32 bases, 16 refs + Every base + Side effects + Nondeterministic + parm 0 flags: no_direct_clobber no_direct_escape no_indirect_escape + parm 1 flags: no_direct_clobber no_direct_escape no_indirect_escape + parm 2 flags: no_direct_escape no_indirect_escape + parm 3 flags: no_direct_escape no_indirect_escape + parm 4 flags: no_direct_escape no_indirect_escape + parm 5 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read + parm 6 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read + parm 7 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read + parm 8 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read + parm 9 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read +to: + loads: + Limits: 32 bases, 16 refs + Every base + stores: + Limits: 32 bases, 16 refs + Every base + Side effects + Nondeterministic + parm 0 flags: no_direct_clobber no_direct_escape no_indirect_escape + parm 1 flags: no_direct_clobber no_direct_escape no_indirect_escape + parm 2 flags: no_direct_escape no_indirect_escape + parm 3 flags: no_direct_escape no_indirect_escape + parm 4 flags: no_direct_escape no_indirect_escape + parm 5 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read + parm 6 flags: no_direct_clobber no_indirect_clobber no_direct_escape no_indirect_escape no_indirect_read this changes partitioning decisions which makes quite a lot of fuzz in latr dump files.
and fun fact is that it really depends how program is paritioned. -flto-partition=max keeps problem reproducing while -flto-partition=one makes it go away and -flto-partition=one --disable-tree-modref2 brings it back.
This seems to fix it for this particular partitioning at least ;) The problem was in summary update producing ill formed access range (which doesn't know base but knows offset from it) and that confused streamer. I will test it with additional sanity check that accesses are sane. diff --git a/gcc/ipa-modref-tree.h b/gcc/ipa-modref-tree.h index 1bf2aa8460e..c29dda56fc1 100644 --- a/gcc/ipa-modref-tree.h +++ b/gcc/ipa-modref-tree.h @@ -677,6 +677,8 @@ struct GTY((user)) modref_tree access_node->parm_index = (*map)[access_node->parm_index]; else access_node->parm_index = MODREF_UNKNOWN_PARM; + if (access_node->parm_index == MODREF_UNKNOWN_PARM) + access_node->parm_offset_known = false; } } }
I was too optimistic - the patch does not help. The problem is related to modref. With lto-partition=max the problem triggers while lto-partition=one hides it becuase late tree modref recomputes summaries and after that things works as expected. It may be wrong summary produced by the ipa-modref and streamed to the ltrans or some TBAA related bug. Disabling TBAA checks from the summary use makes the testcase to pass and there are some TBAA violation warnings when linking. I am isolating the problematic disambiguation.
linking with --dbg-cnt=ipa_cp_values:10 -flto-partition=one --disable-tree-modref2 --dbg-cnt=ipa_mod_ref:836 -fdump-tree-all-details -fdump-ipa-all-details is first that produces different results. Diff to 835 is: diff -u 835/gamess.ltrans0.ltrans.113t.dse2 836/gamess.ltrans0.ltrans.113t.dse2 --- 835/gamess.ltrans0.ltrans.113t.dse2 2021-11-17 20:11:34.894715053 +0100 +++ 836/gamess.ltrans0.ltrans.113t.dse2 2021-11-17 20:12:40.387968675 +0100 @@ -14780,7 +14780,6 @@ ipa-modref: call to dco.isra/287 does not use ref: kz alias sets: 4->4 ipa-modref: call stmt dc_1560 = dco.isra (l_1973, m_1975, &kx, &ky, &kz, la_1974, mu_1976, _513, _510, _507, _504, _501, _498); ipa-modref: call to dco.isra/287 does not use ref: jndx alias sets: 4->4 -***dbgcnt: upper limit 835 reached for ipa_mod_ref.*** ipa-modref: call stmt dc_1560 = dco.isra (l_1973, m_1975, &kx, &ky, &kz, la_1974, mu_1976, _513, _510, _507, _504, _501, _498); ipa-modref: call to dco.isra/287 does not use ref: kz alias sets: 4->4 Deleted trivially dead stmt: _1548 = (long int) _1547; @@ -14799,6 +14798,13 @@ Deleted trivially dead stmt: _1770 = indx_1668 + 1; +***dbgcnt: upper limit 836 reached for ipa_mod_ref.*** +ipa-modref: call stmt dc_1710 = dco.isra (0, 0, &D.8524, &D.8523, &D.8522, la_1977, mu_1978, _513, _510, _507, _504, _501, _498); +ipa-modref: call to dco.isra/287 does not use ref: D.8522 alias sets: 4->4 + Deleted dead store: D.8522 = _1709; + + Deleted trivially dead stmt: _1709 = kz_1696 + kzp_1699; + At WPA time we first compute summary for original function and then update for new signature: Updating summary for dco.isra/287 from: loads: Limits: 32 bases, 16 refs Base 0:int (alias set 4) Ref 0:int (alias set 4) access: Parm 0 param offset:0 offset:0 size:32 max_size:32 access: Parm 1 param offset:0 offset:0 size:32 max_size:32 access: Parm 5 param offset:0 offset:0 size:32 max_size:32 access: Parm 6 param offset:0 offset:0 size:32 max_size:32 access: Parm 2 param offset:0 offset:0 size:32 max_size:32 access: Parm 3 param offset:0 offset:0 size:32 max_size:32 access: Parm 4 param offset:0 offset:0 size:32 max_size:32 Base 1:int[0:] (alias set 4) Ref 0:int (alias set 4) access: Parm 9 param offset:0 offset:0 size:32 max_size:-1 access: Parm 10 param offset:0 offset:0 size:32 max_size:-1 access: Parm 11 param offset:0 offset:0 size:32 max_size:-1 access: Parm 12 param offset:0 offset:0 size:32 max_size:-1 Base 2:double[0:] (alias set 5) Ref 0:double (alias set 5) access: Parm 8 param offset:0 offset:0 size:64 max_size:-1 Base 3:double[15625] (alias set 5) Ref 0:double (alias set 5) access: Parm 7 param offset:0 offset:0 size:64 max_size:-1 stores: Limits: 32 bases, 16 refs parm 0 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 1 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 2 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 3 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 4 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 5 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 6 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 7 flags: not_returned_directly no_indirect_read parm 8 flags: not_returned_directly no_indirect_read parm 9 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 10 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 11 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 12 flags: not_returned_directly not_returned_indirectly no_indirect_read to: loads: Limits: 32 bases, 16 refs Base 0:int (alias set 4) Ref 0:int (alias set 4) access: access: access: access: access: Parm 2 param offset:0 offset:0 size:32 max_size:32 access: Parm 3 param offset:0 offset:0 size:32 max_size:32 access: Parm 4 param offset:0 offset:0 size:32 max_size:32 Base 1:int[0:] (alias set 4) Ref 0:int (alias set 4) access: Parm 9 param offset:0 offset:0 size:32 max_size:-1 access: Parm 10 param offset:0 offset:0 size:32 max_size:-1 access: Parm 11 param offset:0 offset:0 size:32 max_size:-1 access: Parm 12 param offset:0 offset:0 size:32 max_size:-1 Base 2:double[0:] (alias set 5) Ref 0:double (alias set 5) access: Parm 8 param offset:0 offset:0 size:64 max_size:-1 Base 3:double[15625] (alias set 5) Ref 0:double (alias set 5) access: Parm 7 param offset:0 offset:0 size:64 max_size:-1 stores: Limits: 32 bases, 16 refs parm 2 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 3 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 4 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 7 flags: not_returned_directly no_indirect_read parm 8 flags: not_returned_directly no_indirect_read parm 9 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 10 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 11 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 12 flags: not_returned_directly not_returned_indirectly no_indirect_read This seems to correclty represent that some params was scalarised. Only one oddity is that the list is not optimized and kept in redundant form but it should be harmless. At stream in time we see: Read modref for dco.isra/287 loads: Limits: 32 bases, 16 refs Base 0: alias set 4 Ref 0: alias set 4 access: Parm 9 param offset:0 offset:0 size:32 max_size:-1 access: Parm 10 param offset:0 offset:0 size:32 max_size:-1 access: Parm 11 param offset:0 offset:0 size:32 max_size:-1 access: Parm 12 param offset:0 offset:0 size:32 max_size:-1 Base 1: alias set 3 Ref 0: alias set 3 access: Parm 8 param offset:0 offset:0 size:64 max_size:-1 access: Parm 7 param offset:0 offset:0 size:64 max_size:-1 stores: Limits: 32 bases, 16 refs Try dse parm 2 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 3 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 4 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 7 flags: not_returned_directly no_indirect_read parm 8 flags: not_returned_directly no_indirect_read parm 9 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 10 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 11 flags: not_returned_directly not_returned_indirectly no_indirect_read parm 12 flags: not_returned_directly not_returned_indirectly no_indirect_read So types are translated to alias sets however the stream in is wrong since there was accesses without known params and they are gone. So it is streaming bug (and was present already in the initial modref version)
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>: https://gcc.gnu.org/g:425369bf3068a9f840d1c2f04a4d4c38e924d4dc commit r12-5351-g425369bf3068a9f840d1c2f04a4d4c38e924d4dc Author: Jan Hubicka <jh@suse.cz> Date: Wed Nov 17 22:04:26 2021 +0100 Fix modref summary streaming Fixes bug in streaming in modref access tree that now cause a failure of gamess benchmark. The bug is quite old (present in GCC11 release) but it needs quite interesting series of events to manifest. In particular 1) At lto time ISRA turns some parameters passed by reference to scalar 2) At lto time modref computes summaries for old parameters and then updates them but does so quite stupidly believing that the load from parameters are now unkonwn loads (rather than optimized out). This renders summary not very useful since it thinks every memory aliasing int is now accssed (as opposed as parameter dereference) 3) At stream in we notice too early that summary is useless, set every_access flag and drop the list. However while reading rest of the summary we overwrite the flag back to 0 which makes us to lose part of summary. 4) right selection of partitions needs to be done to avoid late modref from recalculating and thus fixing the summary. This patch fixes the stream in bug, however we also should fix updating of summaries. gcc/ChangeLog: 2021-11-17 Jan Hubicka <hubicka@ucw.cz> PR ipa/103246 * ipa-modref.c (read_modref_records): Fix streaminig in of every_access flag.
Seems fixed on LNT now too https://lnt.opensuse.org/db_default/v4/SPEC/21494
Meant to close the bug too.. Fixed now.
(In reply to CVS Commits from comment #12) > The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>: > > https://gcc.gnu.org/g:8c693978dd64b16637577ebf50c760053d7d2165 > > commit r12-5328-g8c693978dd64b16637577ebf50c760053d7d2165 > Author: Jan Hubicka <jh@suse.cz> > Date: Wed Nov 17 01:43:57 2021 +0100 > > Fix clearing of to_info_lto in ipa_merge_modref_summary_after_inlining > > This patch fixes bug that caused some optimizations to be dropped with > -fdump-ipa-inline. > > gcc/ChangeLog: > > 2021-11-17 Jan Hubicka <hubicka@ucw.cz> > > PR ipa/103246 > * ipa-modref.c (ipa_merge_modref_summary_after_inlining): Fix > clearing > of to_info_lto Can you please backport this to GCC 11?
The releases/gcc-11 branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>: https://gcc.gnu.org/g:9d3f1435a348bece9e11787df982bd465db74ed8 commit r11-9248-g9d3f1435a348bece9e11787df982bd465db74ed8 Author: Jan Hubicka <jh@suse.cz> Date: Wed Nov 17 22:04:26 2021 +0100 Fix modref summary streaming Fixes bug in streaming in modref access tree that now cause a failure of gamess benchmark. The bug is quite old (present in GCC11 release) but it needs quite interesting series of events to manifest. In particular 1) At lto time ISRA turns some parameters passed by reference to scalar 2) At lto time modref computes summaries for old parameters and then updates them but does so quite stupidly believing that the load from parameters are now unkonwn loads (rather than optimized out). This renders summary not very useful since it thinks every memory aliasing int is now accssed (as opposed as parameter dereference) 3) At stream in we notice too early that summary is useless, set every_access flag and drop the list. However while reading rest of the summary we overwrite the flag back to 0 which makes us to lose part of summary. 4) right selection of partitions needs to be done to avoid late modref from recalculating and thus fixing the summary. This patch fixes the stream in bug, however we also should fix updating of summaries. gcc/ChangeLog: 2021-11-17 Jan Hubicka <hubicka@ucw.cz> PR ipa/103246 * ipa-modref.c (read_modref_records): Fix streaminig in of every_access flag. (cherry picked from commit 425369bf3068a9f840d1c2f04a4d4c38e924d4dc)