This is seen here: https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=21414&plot.0=475.407.0 There is also a large binary growth, so it seems to not be a noise. https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=21414&plot.0=475.401.4 Quite possibly related to this weekend modref and/or ipa-sra changes enabling more transforms.
So code size is due to ipa-icf: hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ /aux/hubicka/trunk-install-fortran/bin/gfortran -Ofast -march=native -flto 548.exchange2_r/src/exchange2.F90 size a.out hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ size a.out text data bss dec hex filename 245442 864 6368 252674 3db02 a.out hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ /aux/hubicka/trunk-install-fortran/bin/gfortran -Ofast -march=native -flto 548.exchange2_r/src/exchange2.F90 -fno-ipa-icf -fno-ipa-modref hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ size a.out text data bss dec hex filename 178982 864 6336 186182 2d746 a.out hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ /aux/hubicka/trunk-install-fortran/bin/gfortran -Ofast -march=native -flto 548.exchange2_r/src/exchange2.F90 -fno-ipa-icf hubicka@lomikamen:/aux/hubicka/trunk/build6/gcc$ size a.out text data bss dec hex filename 245442 864 6368 252674 3db02 a.out We end up with isra clones of digits_2: bb/a.ltrans0.ltrans.250t.optimized:;; Function digits_2.constprop.isra (__brute_force_MOD_digits_2.constprop.6.isra.0, funcdef_no=29, decl_uid=4805, cgraph_uid=32, symbol_order=157) bb/a.ltrans1.ltrans.250t.optimized:;; Function digits_2.constprop.isra (__brute_force_MOD_digits_2.constprop.4.isra.0, funcdef_no=2, decl_uid=4730, cgraph_uid=3, symbol_order=159) bb/a.ltrans1.ltrans.250t.optimized:;; Function digits_2.constprop.isra (__brute_force_MOD_digits_2.constprop.2.isra.0, funcdef_no=6, decl_uid=4737, cgraph_uid=7, symbol_order=161) bb/a.ltrans1.ltrans.250t.optimized:;; Function digits_2.constprop.isra (__brute_force_MOD_digits_2.constprop.0.isra.0, funcdef_no=10, decl_uid=4741, cgraph_uid=11, symbol_order=163) bb/a.ltrans1.ltrans.250t.optimized:;; Function rearrange.isra (__brute_force_MOD_rearrange.isra.0, funcdef_no=14, decl_uid=4744, cgraph_uid=15, symbol_order=165) Curious is that original has 3 clones a.ltrans0.ltrans.250t.optimized:;; Function digits_2.constprop (__brute_force_MOD_digits_2.constprop.3, funcdef_no=2, decl_uid=4799, cgraph_uid=27, symbol_order=143) a.ltrans0.ltrans.250t.optimized:;; Function digits_2.constprop (__brute_force_MOD_digits_2.constprop.0, funcdef_no=5, decl_uid=4796, cgraph_uid=24, symbol_order=140) a.ltrans1.ltrans.250t.optimized:;; Function digits_2.constprop (__brute_force_MOD_digits_2.constprop.6, funcdef_no=5, decl_uid=4760, cgraph_uid=2, symbol_order=146) which likely explain the code size difference. The ipa-cp decisions are same so it looks like isra is affecting inliner which instead of producing 3 function for clones 0,3,6 produces 4 functions for clones 0,2,4,6
There is difference in inlier decision. Since all clones are of same size it depends on the order inliner picks them and combines together before hitting large-function-growth. It seems that with isra ordering inliner simply less lucky. Instead of inline stack: IPA function summary for digits_2.constprop/143 inlinable global time: 22960.500916 self size: 1277 global size: 2534 min size: 513 self stack: 261 global stack: 783 estimated growth:-488 size:513.000000, time:6690.410500 size:3.000000, time:2.000001, executed if:(not inlined) size:0.500000, time:0.500000, executed if:(not inlined), nonconst if:(op0[ref offset: 0] changed) && (not inlined) size:138.500000, time:217.532556, nonconst if:(op0[ref offset: 0] changed) size:36.000000, time:34.793911, executed if:(op0[ref offset: 0],(# % 3) == 2), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0],(# % 3) == 2) size:198.000000, time:574.099545, executed if:(op0[ref offset: 0],(# % 3) == 2) size:36.000000, time:34.793911, executed if:(op0[ref offset: 0],(# % 3) == 1), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0],(# % 3) == 1) size:270.000000, time:1357.103458, executed if:(op0[ref offset: 0],(# % 3) == 1) size:21.000000, time:375.971570, executed if:(op0[ref offset: 0] == 5) size:1263.000000, time:12359.502960, executed if:(op0[ref offset: 0] != 8) size:1.000000, time:0.900000, executed if:(op0[ref offset: 0] != 8), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0] != 8) size:48.000000, time:1300.920311, executed if:(op0[ref offset: 0] == 8) loop iterations: 0.68 for (op0[ref offset: 0] changed) 0.76 for (op0[ref offset: 0] changed) 0.88 for (op0[ref offset: 0] changed) 1.08 for (op0[ref offset: 0] changed) 1.40 for (op0[ref offset: 0] changed) 1.93 for (op0[ref offset: 0] changed) 2.80 for (op0[ref offset: 0] changed) 4.23 for (op0[ref offset: 0] changed) 11.88 for (op0[ref offset: 0] changed) 4.59 for (op0[ref offset: 0] changed) 3.16 for (op0[ref offset: 0] changed) 2.29 for (op0[ref offset: 0] changed) 1.76 for (op0[ref offset: 0] changed) 1.44 for (op0[ref offset: 0] changed) 1.24 for (op0[ref offset: 0] changed) 1.12 for (op0[ref offset: 0] changed) calls: covered.constprop/148 --param max-inline-insns-auto limit reached freq:0.30 loop depth: 9 size: 4 time: 13 callee size:262 stack:1472 predicate: (op0[ref offset: 0] == 8) op0 is compile time invariant op0 points to local or readonly memory op1 is compile time invariant op1 points to local or readonly memory digits_2.constprop/144 inlined freq:0.90 Stack frame offset 261, callee self size 261 __builtin_unreachable/156 unreachable freq:0.00 cross module loop depth:18 size: 0 time: 0 predicate: (false) op0 is compile time invariant op0 points to local or readonly memory op1 is compile time invariant op1 points to local or readonly memory digits_2.constprop/145 inlined freq:0.81 Stack frame offset 522, callee self size 261 __builtin_unreachable/156 unreachable freq:0.00 cross module loop depth:27 size: 0 time: 0 predicate: (false) op0 points to local or readonly memory op1 is compile time invariant op1 points to local or readonly memory digits_2.constprop/146 --param large-function-growth limit reached freq:0.73 loop depth:27 size: 2 time: 11 callee size:1019 stack:522 predicate: (op0[ref offset: 0] != 8) op0 is compile time invariant op0 points to local or readonly memory where inlining fails only at recursion depth 4 we get: IPA function summary for digits_2.constprop.isra/163 inlinable global time: 17184.704285 self size: 1277 global size: 1994 min size: 513 self stack: 261 global stack: 522 estimated growth:301 size:513.000000, time:6690.410500 size:3.000000, time:2.000001, executed if:(not inlined) size:0.500000, time:0.500000, executed if:(not inlined), nonconst if:(op0[ref offset: 0] changed) && (not inlined) size:138.500000, time:217.532556, nonconst if:(op0[ref offset: 0] changed) size:36.000000, time:34.793911, executed if:(op0[ref offset: 0],(# % 3) == 2), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0],(# % 3) == 2) size:198.000000, time:574.099545, executed if:(op0[ref offset: 0],(# % 3) == 2) size:36.000000, time:34.793911, executed if:(op0[ref offset: 0],(# % 3) == 1), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0],(# % 3) == 1) size:270.000000, time:1357.103458, executed if:(op0[ref offset: 0],(# % 3) == 1) size:21.000000, time:375.971570, executed if:(op0[ref offset: 0] == 5) size:723.000000, time:6582.815331, executed if:(op0[ref offset: 0] != 8) size:1.000000, time:0.900000, executed if:(op0[ref offset: 0] != 8), nonconst if:(op0[ref offset: 0] changed) && (op0[ref offset: 0] != 8) size:48.000000, time:1300.920311, executed if:(op0[ref offset: 0] == 8) loop iterations: 0.68 for (op0[ref offset: 0] changed) 0.76 for (op0[ref offset: 0] changed) 0.88 for (op0[ref offset: 0] changed) 1.08 for (op0[ref offset: 0] changed) 1.40 for (op0[ref offset: 0] changed) 1.93 for (op0[ref offset: 0] changed) 2.80 for (op0[ref offset: 0] changed) 4.23 for (op0[ref offset: 0] changed) 11.88 for (op0[ref offset: 0] changed) 4.59 for (op0[ref offset: 0] changed) 3.16 for (op0[ref offset: 0] changed) 2.29 for (op0[ref offset: 0] changed) 1.76 for (op0[ref offset: 0] changed) 1.44 for (op0[ref offset: 0] changed) 1.24 for (op0[ref offset: 0] changed) 1.12 for (op0[ref offset: 0] changed) calls: digits_2.constprop.isra/162 inlined freq:0.90 Stack frame offset 261, callee self size 261 digits_2.constprop.isra/161 --param large-function-growth limit reached freq:0.81 loop depth:18 size: 2 time: 11 callee size:1033 stack:522 predicate: (op0[ref offset: 0] != 8) op0 is compile time invariant op0 points to local or readonly memory __builtin_unreachable/168 unreachable freq:0.00 cross module loop depth:18 size: 0 time: 0 predicate: (false) op0 is compile time invariant op0 points to local or readonly memory op1 is compile time invariant op1 points to local or readonly memory covered.constprop/148 --param max-inline-insns-auto limit reached freq:0.30 loop depth: 9 size: 4 time: 13 callee size:262 stack:1472 predicate: (op0[ref offset: 0] == 8) op0 is compile time invariant op0 points to local or readonly memory op1 is compile time invariant op1 points to local or readonly memory where we fail at depth2
The heuristics here is quite simplistic - it all happens after small function inlining while trying to inline function called once. Each of clones is called once and it really depends on the order we walk the chain.
Still, the interaction between IPA-CP and IPA-SRA is bad here. Just looking at the optimized dump, one of the "specialized functions" starts with: <bb 2> [local count: 62767467]: # DEBUG D#203 s=> row # DEBUG row => D#203 _2 = (long int) ISRA.10821_938(D); where the ISRA param contains the constant we wanted to specialize for... making the clones worse than useless. From the IPA-CP ltrans dumps it is clear that the transformation phase of IPA-CP considers the first parameter dead and so does not perform the substitutions even though the parameter is replaced only by a "subsequent" pass. The infrastructure invokes the transform function on node digits_2.constprop.isra/157 (note the isra) which has already been modified by the subsequent pass (when it was cloned). I like the idea of transformation phases better than putting everything into tree-inline (and by extension ipa-param-manipulation) but perhaps we have to do aggregate constant replacements there too?
> I like the idea of transformation phases better than putting > everything into tree-inline (and by extension ipa-param-manipulation) > but perhaps we have to do aggregate constant replacements there too? So the situation is that we inline call A->B (where both A and B are clones of the main function) and while we place uses of the constant parmater in A we miss replacement in B because transform is not run on it. I think proper solution here (discussed also few years ago) is to keep the optimization summaries and teach value numbering to look up the constant from the summary. We also have other situations where the existing transform pass fails to pattern match and this lets us to feed other info, like value ranges to the optimizer. We have open PR somwhere for this problem, right?
struct a{int a,b;}; int bar (struct a *a) { if (!a->a) __builtin_abort (); } static __attribute__ ((noinline)) int foo (struct a a) { struct a b = a; bar (&b); return b.a+b.b; } int test() { struct a a={1,2}; return foo (a); } Is an example where we also miss transformation with -fno-early-inlining -O2
(In reply to hubicka from comment #5) > > I like the idea of transformation phases better than putting > > everything into tree-inline (and by extension ipa-param-manipulation) > > but perhaps we have to do aggregate constant replacements there too? > > So the situation is that we inline call A->B (where both A and B are > clones of the main function) and while we place uses of the constant > parmater in A we miss replacement in B because transform is not run on > it. No, we miss it everywhere, even in A (see that the code above is from BB 2) or probably also without any cloning whatsoever. This happens when IPA-SRA does its thing on the same parameter on which IPA-CP decided to propagate aggregate constants. In the IPA analysis stage (which creates the virtual clones), IPA-CP runs before IPA-SRA. But in the transformation phase, it is apparently the other way round - well, not exactly, IPA-SRA does not formally have a transformation phase, it happens as part of tree_function_versioning, but the effect is the same. > > I think proper solution here (discussed also few years ago) is to keep > the optimization summaries and teach value numbering to look up the > constant from the summary. > Yes, but this is another (but different) problem that we probably also should try to solve now.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 > > --- Comment #7 from Martin Jambor <jamborm at gcc dot gnu.org> --- > (In reply to hubicka from comment #5) > > > I like the idea of transformation phases better than putting > > > everything into tree-inline (and by extension ipa-param-manipulation) > > > but perhaps we have to do aggregate constant replacements there too? > > > > So the situation is that we inline call A->B (where both A and B are > > clones of the main function) and while we place uses of the constant > > parmater in A we miss replacement in B because transform is not run on > > it. > > No, we miss it everywhere, even in A (see that the code above is from > BB 2) or probably also without any cloning whatsoever. This happens > when IPA-SRA does its thing on the same parameter on which IPA-CP > decided to propagate aggregate constants. > > In the IPA analysis stage (which creates the virtual clones), IPA-CP > runs before IPA-SRA. But in the transformation phase, it is > apparently the other way round - well, not exactly, IPA-SRA does not > formally have a transformation phase, it happens as part of > tree_function_versioning, but the effect is the same. OK, so the problem is that we don't do ipa-sra changes "in place" as a well behaved transform pass but it i merged into versioning code while ipa-cp is the other way around. So one fix would be alo to make ipa-cp understand that changes to signature happened and update its summary just like we do for modref? We will need it also for ... > > > > > I think proper solution here (discussed also few years ago) is to keep > > the optimization summaries and teach value numbering to look up the > > constant from the summary. > > > > Yes, but this is another (but different) problem that we probably also > should try to solve now. ... fixing this problem properly. I just loked into thi again and we already have code that preserves propagates bits on pointer parmeters (since these do not have value ranges). Same way we need to preserve the known partial aggregates and hook it up into sccvn's vn_reference_lookup_2 and _3. Honza
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103227 > ... fixing this problem properly. > I just loked into thi again and we already have code that preserves > propagates bits on pointer parmeters (since these do not have value > ranges). Same way we need to preserve the known partial aggregates and > hook it up into sccvn's vn_reference_lookup_2 and _3. Also note that with this implemented I think we should be able to remove the ipa-cp transformation code :)
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>: https://gcc.gnu.org/g:0f5afb626381d19bfced30bc19cf3b03867fa6f5 commit r12-5439-g0f5afb626381d19bfced30bc19cf3b03867fa6f5 Author: Jan Hubicka <jh@suse.cz> Date: Sun Nov 21 16:15:41 2021 +0100 Improve base tracking in ipa-modref on exchange2 benchamrk we miss some useful propagation because modref gives up very early on analyzing accesses through pointers. For example in int test (int *a) { int i; for (i=0; a[i];i++); return i+a[i]; } We are not able to determine that a[i] accesses are relative to a. This is because get_access requires the SSA name that is in MEM_REF to be PARM_DECL while on other places we use ipa-prop helper to work out the proper base pointers. This patch commonizes the code in get_access and parm_map_for_arg so both use the check properly and extends it to also figure out that newly allocated memory is not a side effect to caller. gcc/ChangeLog: 2021-11-21 Jan Hubicka <hubicka@ucw.cz> PR ipa/103227 * ipa-modref.c (parm_map_for_arg): Rename to ... (parm_map_for_ptr): .. this one; handle static chain and calls to malloc functions. (modref_access_analysis::get_access): Use parm_map_for_ptr. (modref_access_analysis::process_fnspec): Update. (modref_access_analysis::analyze_load): Update. (modref_access_analysis::analyze_store): Update. gcc/testsuite/ChangeLog: 2021-11-21 Jan Hubicka <hubicka@ucw.cz> PR ipa/103227 * gcc.dg/tree-ssa/modref-15.c: New test.
Created attachment 51863 [details] Untested fix I am testing the attached patch. I would like to file a new bug for the testcase in comment #6 as it is a different issue.
Some testing is still underway, but I have proposed the patch (with one minor testsuite change) on the mailing list: https://gcc.gnu.org/pipermail/gcc-patches/2021-November/585337.html
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>: https://gcc.gnu.org/g:5bc4cb04127a4805b6228b0a6cbfebdbd61314d2 commit r12-5527-g5bc4cb04127a4805b6228b0a6cbfebdbd61314d2 Author: Martin Jambor <mjambor@suse.cz> Date: Thu Nov 25 17:58:12 2021 +0100 ipa: Teach IPA-CP transformation about IPA-SRA modifications (PR 103227) PR 103227 exposed an issue with ordering of transformations of IPA passes. IPA-CP can create clones for constants passed by reference and at the same time IPA-SRA can also decide that the parameter does not need to be a pointer (or an aggregate) and plan to convert it into (a) simple scalar(s). Because no intermediate clone is created just for the purpose of ordering the transformations and because IPA-SRA transformation is implemented as part of clone materialization, the IPA-CP transformation happens only afterwards, reversing the order of the transformations compared to the ordering of analyses. IPA-CP transformation looks at planned substitutions for values passed by reference or in aggregates but finds that all the relevant parameters no longer exist. Currently it subsequently simply gives up, leading to clones created for no good purpose (and huge regression of 548.exchange_r. This patch teaches it recognize the situation, look up the new scalarized parameter and perform value substitution on it. On my desktop this has recovered the lost exchange2 run-time (and some more). I have disabled IPA-SRA in a Fortran testcase so that the dumping from the transformation phase can still be matched in order to verify that IPA-CP understands the IL after verifying that it does the right thing also with IPA-SRA. gcc/ChangeLog: 2021-11-23 Martin Jambor <mjambor@suse.cz> PR ipa/103227 * ipa-prop.h (ipa_get_param): New overload. Move bits of the existing one to the new one. * ipa-param-manipulation.h (ipa_param_adjustments): New member function get_updated_index_or_split. * ipa-param-manipulation.c (ipa_param_adjustments::get_updated_index_or_split): New function. * ipa-prop.c (adjust_agg_replacement_values): Reimplement, add capability to identify scalarized parameters and perform substitution on them. (ipcp_transform_function): Create descriptors earlier, handle new return values of adjust_agg_replacement_values. gcc/testsuite/ChangeLog: 2021-11-23 Martin Jambor <mjambor@suse.cz> PR ipa/103227 * gcc.dg/ipa/pr103227-1.c: New test. * gcc.dg/ipa/pr103227-3.c: Likewise. * gcc.dg/ipa/pr103227-2.c: Likewise. * gfortran.dg/pr53787.f90: Disable IPA-SRA.
Seems the performance is now better than before https://lnt.opensuse.org/db_default/v4/SPEC/graph?highlight_run=21683&plot.0=286.407.0 Still I think I should implement the logic to stabilize the order of nodes and edges in callgraph and lets discuss if we can make inline heuristics more robust here.
Fixed.
The master branch has been updated by Martin Jambor <jamborm@gcc.gnu.org>: https://gcc.gnu.org/g:4834e9360f7bf42fbeabaa20de5619e67c9fee4e commit r13-4685-g4834e9360f7bf42fbeabaa20de5619e67c9fee4e Author: Martin Jambor <mjambor@suse.cz> Date: Wed Dec 14 00:33:05 2022 +0100 ipa: Better way of applying both IPA-CP and IPA-SRA (PR 103227) This is basically a better fix for PR 103227. The one currently in use, rushed in late at stage3, which means that IPA-CP transformation simply does a replacement of default-definition of IPA-SRA-created scalar parameters with a constant, meant that IPA-SRA actually often led to creation of a bunch of unused parameters, which was rather ironic and sub-optimal. This patch rips that old way out and makes sure the clash is resolved at clone-materialization time. What happens is that: 1) IPA-SRA IPA analysis (decision) stage recognizes the clash and does not create a param adjustment entry for such a scalar component. 2) Clone materialization code checks the IPA-CP transformation summary and when it realizes that it is removing a parameter that is a base for a discovered IPA-CP aggregate constant, and: a) the value is passed by reference, it internally records that any load of the value is replaced directly with the known constant value. IPA-SRA will not attempt to split values passed by reference when there is a write to it so we know such a load won't be on a a LHS. b) the value is passed by value, there can be stores to the corresponding bit of the aggregate and so all accesses are replaced with a new decl and an assignment of the constant to this decl is generated at the beginning of the function. The new testcase contains an xfail as the patch does not fix PR 107640 but it is one that ICEs when one is not careful about remapping indices of parameters, so I'd like to have it in testsuite/gcc.gd/ipa/ even now. I don't think that PR 107640 should be attempted through ipa-param-manipulation replacements because the information is not really there any more and we'd either need to do the replacements earlier or dig deep into the clone parent info. Instead, we should record somewhere that at the beginning of the function the bits of the global decl have known values and use that in the value numbering. That way we could one day encode also known constants in globals that do not come through parameters. gcc/ChangeLog: 2022-11-11 Martin Jambor <mjambor@suse.cz> PR ipa/103227 * ipa-param-manipulation.h (class ipa_param_adjustments): Removed member function get_updated_index_or_split. (class ipa_param_body_adjustments): New overload of register_replacement, new member function append_init_stmts, new member m_split_agg_csts_inits. * ipa-param-manipulation.cc: Include ipa-prop.h. (ipa_param_adjustments::get_updated_index_or_split): Removed. (ipa_param_body_adjustments::register_replacement): New overload, use it from the older one. (ipa_param_body_adjustments::common_initialization): Added the capability to create replacements for conflicting IPA-CP discovered constants. (ipa_param_body_adjustments::ipa_param_body_adjustments): Construct the new member. (ipa_param_body_adjustments::append_init_stmts): New function. * ipa-sra.cc: Include ipa-prop.h. (push_param_adjustments_for_index): Require IPA-CP transformation summary as a parameter, do not create replacements which are known to have constant values. (process_isra_node_results): Find and pass to the above function the IPA-CP transformation summary. * ipa-prop.cc (adjust_agg_replacement_values): Remove the functionality replacing IPA-SRA created scalar parameters with constants. Simplify, do not require parameter descriptors, do not return anything. (ipcp_transform_function): Simplify now that adjust_agg_replacement_values does not change cfg. Move definition and initialization of descriptors lower. * tree-inline.cc (tree_function_versioning): Call append_init_stmts of param_body_adjs, if there are any. gcc/testsuite/ChangeLog: 2022-11-11 Martin Jambor <mjambor@suse.cz> PR ipa/103227 PR ipa/107640 * gcc.dg/ipa/pr107640-2.c: New test.