Created attachment 32646 [details] reduced testcase (gcc.dg/torture/pr30665-1.c) Output: $ gcc -O2 -flto -fipa-pta testcase.c $ ./a.out Aborted Relevant objdump output: 0000000000400530 <main>: 400530: 48 83 ec 28 sub rsp,0x28 400534: 48 8d 74 24 10 lea rsi,[rsp+0x10] 400539: 48 89 e7 mov rdi,rsp 40053c: c7 04 24 ad 3a 00 00 mov DWORD PTR [rsp],0x3aad 400543: c7 44 24 04 d1 5a 00 mov DWORD PTR [rsp+0x4],0x5ad1 40054a: 00 40054b: c7 44 24 10 19 45 00 mov DWORD PTR [rsp+0x10],0x4519 400552: 00 400553: c7 44 24 14 a0 5b 00 mov DWORD PTR [rsp+0x14],0x5ba0 40055a: 00 40055b: e8 10 01 00 00 call 400670 <f.constprop.0> 400560: e8 9b ff ff ff call 400500 <abort@plt> 400565: 0f 1f 00 nop DWORD PTR [rax] abort() is called unconditionally
I'll have a look.
The issue is that IPA-PTA sees main () still as calling f but the cgraph has it calling f.constprop.0. For some reason the IPA-CP transform has not been applied to main () (yet?). In .simdclone I see a call to f.constprop.0. Honza? Martin? Late IPA passes need all regular IPA transforms applied and clones materialized.
(In reply to Richard Biener from comment #2) > Late IPA passes need all regular IPA transforms applied and clones > materialized. I agree that would make sense but it is not what we do. Just look at compile() in cgraphunit.c. When late IPA passes run, clones are already materialized but transformation phases of regular IPA passes have not been run yet. Apart from the fact that late-IPA passes do not see the bodies transformed and inlining performed it also means that call statements have not been updated to reflect call redirections performed at analysis stage which is the problem here and which is (for reasons I have never quite grasped) done as a part of inlining transformation phase. OTOH, I am all for moving the call to execute_all_ipa_transforms() from expand_function() to compile(), at least in trunk.
On Wed, 23 Apr 2014, jamborm at gcc dot gnu.org wrote: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60911 > > --- Comment #3 from Martin Jambor <jamborm at gcc dot gnu.org> --- > (In reply to Richard Biener from comment #2) > > Late IPA passes need all regular IPA transforms applied and clones > > materialized. > > I agree that would make sense but it is not what we do. Just look at > compile() in cgraphunit.c. When late IPA passes run, clones are > already materialized but transformation phases of regular IPA passes > have not been run yet. Apart from the fact that late-IPA passes do > not see the bodies transformed and inlining performed it also means > that call statements have not been updated to reflect call > redirections performed at analysis stage which is the problem here and > which is (for reasons I have never quite grasped) done as a part of > inlining transformation phase. Hmm, ok. But that makes "simple" IPA passes work on a limbo state. I think we delay applying IPA transforms as "optimization" with respect to compile-time/memory-use as in theory we only require a single function body in memory during LTRANS. I wonder why it works without LTO ... because there I _do_ see the IPA transform applied to main (). > OTOH, I am all for moving the call to execute_all_ipa_transforms() > from expand_function() to compile(), at least in trunk. We also have, in execute_one_pass, /* SIPLE IPA passes do not handle callgraphs with IPA transforms in it. Apply all trnasforms first. */ if (pass->type == SIMPLE_IPA_PASS) { bool applied = false; do_per_function (apply_ipa_transforms, (void *)&applied); if (applied) symtab_remove_unreachable_nodes (true, dump_file); /* Restore current_pass. */ current_pass = pass; but that doesn't seem to work ... at least with LTO? Which is because for main() FOR_EACH_DEFINED_FUNCTION (node) if (node->analyzed && gimple_has_body_p (node->decl) && (!node->clone_of || node->decl != node->clone_of->decl)) { this doesn't trigger (for f.constprop it does). gimple_has_body_p (node->decl) is false. main/1 (main) @0x7ffff6daf148 Type: function definition analyzed Visibility: externally_visible prevailing_def public References: Referring: Read from file: /tmp/ccED1aQr.ltrans0.o Availability: available First run: 0 Function flags: only_called_at_startup Called by: Calls: abort/2 (0.00 per call) f.constprop.0/3 (1.00 per call) it's probably not yet streamed in?
Ah, and tree-ssa-structalias.c does /* Build the constraints. */ FOR_EACH_DEFINED_FUNCTION (node) { varinfo_t vi; /* Nodes without a body are not interesting. Especially do not visit clones at this point for now - we get duplicate decls there for inline clones at least. */ if (!cgraph_function_with_gimple_body_p (node) || node->clone_of) continue; cgraph_get_body (node); but doesn't apply IPA transforms. I think the kludge in the pass manager should do that instead. I wonder how the above doesn't trigger the in_lto_p assert in cgraph_get_body though ... (maybe the odd DECL_RESULT check makes sure we don't trigger that path). Ick. That code path triggers also with 2171 do_per_function (clear_last_verified, NULL); doing the push/pop_cfun game on each fn just to clear last_verified ... in WPA stage ... same for 2179 do_per_function (update_properties_after_pass, pass); and more. Trying to make Index: gcc/passes.c =================================================================== --- gcc/passes.c (revision 209677) +++ gcc/passes.c (working copy) @@ -1506,9 +1506,11 @@ do_per_function (void (*callback) (void { struct cgraph_node *node; FOR_EACH_DEFINED_FUNCTION (node) - if (node->analyzed && gimple_has_body_p (node->decl) + if (node->analyzed + && cgraph_function_with_gimple_body_p (node) && (!node->clone_of || node->decl != node->clone_of->decl)) { + cgraph_get_body (node); push_cfun (DECL_STRUCT_FUNCTION (node->decl)); callback (data); if (!flag_wpa) not ICE ... Index: gcc/passes.c =================================================================== --- gcc/passes.c (revision 209677) +++ gcc/passes.c (working copy) @@ -1506,18 +1506,24 @@ do_per_function (void (*callback) (void { struct cgraph_node *node; FOR_EACH_DEFINED_FUNCTION (node) - if (node->analyzed && gimple_has_body_p (node->decl) + if (node->analyzed + && cgraph_function_with_gimple_body_p (node) && (!node->clone_of || node->decl != node->clone_of->decl)) { - push_cfun (DECL_STRUCT_FUNCTION (node->decl)); - callback (data); if (!flag_wpa) + cgraph_get_body (node); + if (gimple_has_body_p (node->decl)) { - free_dominance_info (CDI_DOMINATORS); - free_dominance_info (CDI_POST_DOMINATORS); + push_cfun (DECL_STRUCT_FUNCTION (node->decl)); + callback (data); + if (!flag_wpa) + { + free_dominance_info (CDI_DOMINATORS); + free_dominance_info (CDI_POST_DOMINATORS); + } + pop_cfun (); + ggc_collect (); } - pop_cfun (); - ggc_collect (); } } } works though.
Seems like simd_clone_create gets along with this because it does if (!cgraph_function_with_gimple_body_p (old_node)) return NULL; cgraph_get_body (old_node); new_node = cgraph_function_versioning (old_node, vNULL, NULL, NULL, false, NULL, NULL, "simdclone"); and cgraph_function_versioning applies IPA transforms?
Anyway, that push/pop_cfun stuff should be pushed to the callbacks, not be done in do_per_function (). Likewise the ggc_collect (). Btw, same code in do_per_function_toporder.
Created attachment 32662 [details] cleanup patch
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60911 > > --- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> --- > Ah, and tree-ssa-structalias.c does > > /* Build the constraints. */ > FOR_EACH_DEFINED_FUNCTION (node) > { > varinfo_t vi; > /* Nodes without a body are not interesting. Especially do not > visit clones at this point for now - we get duplicate decls > there for inline clones at least. */ > if (!cgraph_function_with_gimple_body_p (node) || node->clone_of) > continue; > cgraph_get_body (node); > > but doesn't apply IPA transforms. I think the kludge in the pass manager > should do that instead. I wonder how the above doesn't trigger the > in_lto_p assert in cgraph_get_body though ... (maybe the odd DECL_RESULT > check makes sure we don't trigger that path). Odd DECL_RESULT check is check if the function is already in memory. It probably should be documented though. The purpose of function is to implement lazy loading of function bodies. If body is already loaded it should do nothing. Making pass manager to read all function bodies prior first enabled late small_ipa pass is definitely possible, but it would mean that every late IPA pass will push up peak memory use of ltrans stage quite a bit. I think PTA should be done on whole program and not here. Late IPA passes (in addition to experiments where this is OK) may be useful for stuff that doesn't need to touch whole program all the time, I think Jakub's SIMD stuff is a good example. What about adding parameter to cgraph_get_body to apply transformations? (inliner and clone materialization needs bodies w/o transforms)
> > We also have, in execute_one_pass, > > /* SIPLE IPA passes do not handle callgraphs with IPA transforms in it. > Apply all trnasforms first. */ > if (pass->type == SIMPLE_IPA_PASS) > { > bool applied = false; > do_per_function (apply_ipa_transforms, (void *)&applied); > if (applied) > symtab_remove_unreachable_nodes (true, dump_file); > /* Restore current_pass. */ > current_pass = pass; > > but that doesn't seem to work ... at least with LTO? Which is > because for main() > > FOR_EACH_DEFINED_FUNCTION (node) > if (node->analyzed && gimple_has_body_p (node->decl) > && (!node->clone_of || node->decl != node->clone_of->decl)) > { > > this doesn't trigger (for f.constprop it does). > gimple_has_body_p (node->decl) is false. yeah, this is something I missed with the lazy loading. There is predicate cgraph_function_with_gimple_body_p with does not care about the lazy stuff (i.e. will return same value for both). The problem is that we do not want to run the verifiers that are also called by this for_each.* wrapper. As written in previous post, for 4.10, I think we should have cgraph_get_body to lazily load body and apply transformations/produce the clone. In short it should do whatever necessary to actually get the real body to the optimizer. I want to keep the default optimization queue to not necessarily load everything into memory. Therefore I think late IPA passes should be responsible to call cgraph_get_body themselves to avoid growth in ltrans peak memory use for passes that touch just few functions (like this SIMD stuff). In long term IPA-PTA, that is an example of pass that touch all bodies, should go to real IPA queue IMO and be reimplemented by one of the more scalable algorithms perhaps with context senstivity. In fact I have student (Lada Laska) who is looking into the topic. For local passes I think it should be pass manager's responsibilty to load the bodies, as it is now. This would hopefully not be difficult to use and it won't boost peak ltrans memory that now, with WPA improvements, seems to be a bottleneck http://hubicka.blogspot.ca/2014/04/linktime-optimization-in-gcc-2-firefox.html at least for firefox and -flto=16 (it should actually reduce the peak) For 4.9 I would suggest simply making the above code to apply transformations even with LTO (i.e. drop the gimple_has_body_p check from the patch and replace it cgraph_function_with_gimple_body_p). Will look into it now.
Index: gcc/passes.c =================================================================== --- gcc/passes.c (revision 209742) +++ gcc/passes.c (working copy) @@ -2194,8 +2194,26 @@ execute_one_pass (opt_pass *pass) Apply all trnasforms first. */ if (pass->type == SIMPLE_IPA_PASS) { + struct cgraph_node *node; bool applied = false; - do_per_function (apply_ipa_transforms, (void *)&applied); + FOR_EACH_DEFINED_FUNCTION (node) + if (node->analyzed + && cgraph_function_with_gimple_body_p (node) + && (!node->clone_of || node->decl != node->clone_of->decl)) + { + if (!node->global.inlined_to + && node->ipa_transforms_to_apply.exists ()) + { + cgraph_get_body (node); + push_cfun (DECL_STRUCT_FUNCTION (node->decl)); + execute_all_ipa_transforms (); + rebuild_cgraph_edges (); + free_dominance_info (CDI_DOMINATORS); + free_dominance_info (CDI_POST_DOMINATORS); + pop_cfun (); + applied = true; + } + } if (applied) symtab_remove_unreachable_nodes (true, dump_file); /* Restore current_pass. */ should work and avoid all the issues I ran into with the patches patching do_per_function. I'm testing that for trunk/4.9 now with possible cleanups done on trunk only.
Author: rguenth Date: Fri Apr 25 07:44:40 2014 New Revision: 209779 URL: http://gcc.gnu.org/viewcvs?rev=209779&root=gcc&view=rev Log: 2014-04-25 Richard Biener <rguenther@suse.de> PR ipa/60911 * passes.c (apply_ipa_transforms): Inline into only caller ... (execute_one_pass): ... here. Properly bring in function bodies for nodes we want to apply IPA transforms to. * gcc.dg/lto/pr60911_0.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/lto/pr60911_0.c Modified: trunk/gcc/ChangeLog trunk/gcc/passes.c trunk/gcc/testsuite/ChangeLog
Author: rguenth Date: Fri Apr 25 07:48:06 2014 New Revision: 209781 URL: http://gcc.gnu.org/viewcvs?rev=209781&root=gcc&view=rev Log: 2014-04-25 Richard Biener <rguenther@suse.de> PR ipa/60911 * passes.c (apply_ipa_transforms): Inline into only caller ... (execute_one_pass): ... here. Properly bring in function bodies for nodes we want to apply IPA transforms to. * gcc.dg/lto/pr60911_0.c: New testcase. Added: branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/lto/pr60911_0.c Modified: branches/gcc-4_9-branch/gcc/ChangeLog branches/gcc-4_9-branch/gcc/passes.c branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
Fixed.