Minimal options to reproduce regression (x86, 64-bit): -O3 -flto The reason behind the regression is that since r256888 a cost_compare function is not inlined into spec_qsort. These two functions are in different source files. I've managed to force cost_compare to be inlined by creating in the same source file a copy of spec_qsort function with explicit calls of cost_compare. This reverted performance to r256887 level.
Let me take a look.
I can confirm the regression on a Haswell CPU.
So the problematic predictor change is: - DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (91), 0) + DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (71), 0) It lower frequency of BB 5: primal_bea_mpp (int64_t m, struct arc_t * arcs, struct arc_t * stop_arcs, int64_t * basket_sizes, struct BASKET * * perm, int thread, struct arc_t * * end_arc, int64_t step, int64_t num_threads, int64_t max_elems) { ... <bb 33> [local count: 12634988]: READY: # DEBUG BEGIN_STMT _113 = *_40; _114 = (sizetype) _113; _115 = _114 + 1; _116 = _115 * 8; _117 = perm_148(D) + _116; _118 = *_117; _118->number = -1; # DEBUG BEGIN_STMT _122 = *_40; if (_122 == 0) goto <bb 34>; [29.93%] else goto <bb 35>; [70.07%] <bb 34> [local count: 3781652]: # DEBUG BEGIN_STMT // predicted unlikely by early return (on trees) predictor. goto <bb 36>; [100.00%] <bb 35> [local count: 8853336]: # DEBUG BEGIN_STMT _127 = (long unsigned int) _122; _128 = perm_148(D) + 8; spec_qsort (_128, _127, 8, cost_compare); # DEBUG BEGIN_STMT _176 = MEM[(struct BASKET * *)perm_148(D) + 8B]; <bb 36> [local count: 12634988]: # _135 = PHI <0B(34), _176(35)> return _135; } It's function defined in pbeampp.c. Then in IPA CP we end up with: Evaluating opportunities for spec_qsort/353. - considering value arc_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1) good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 987, scc) -> evaluation: 314, threshold: 500 good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 987, scc) -> evaluation: 643, threshold: 500 Creating a specialized node of spec_qsort/353. replacing param #2 size_t with const 8 replacing param #3 int (*<T813>) (const void *, const void *) with const arc_compare Accounting size:173.00, time:191.00 on predicate exec:(true) Accounting size:3.00, time:2.00 on new predicate exec:(not inlined) the new node is spec_qsort.constprop/377. ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 71 converting indirect call in spec_qsort.constprop to direct call to arc_compare controlled uses count of param 3 bumped down to 8 ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 218 converting indirect call in spec_qsort.constprop to direct call to arc_compare controlled uses count of param 3 bumped down to 7 ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 267 converting indirect call in spec_qsort.constprop to direct call to arc_compare controlled uses count of param 3 bumped down to 6 ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 348 converting indirect call in spec_qsort.constprop to direct call to arc_compare controlled uses count of param 3 bumped down to 5 - considering value cost_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1) good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 701, scc) -> evaluation: 223, threshold: 500 good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 701, scc) -> evaluation: 457, threshold: 500 - Creating a specialized node of spec_qsort/353 for all known contexts. replacing param #2 size_t with const 8 Accounting size:173.00, time:191.00 on predicate exec:(true) Accounting size:3.00, time:2.00 on new predicate exec:(not inlined) the new node is spec_qsort.constprop/378. Evaluating opportunities for spec_qsort/353. - adding an extra caller spec_qsort.constprop/377 of spec_qsort.constprop/377 - considering value cost_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1) good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 701, scc) -> evaluation: 223, threshold: 500 good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 701, scc) -> evaluation: 457, threshold: 500 Marking node as dead: spec_qsort/353. Here you can see evaluation is 457, which the threshold is 500. Another question is why IPA inline does not inline the call. Answer is here: IPA function summary for primal_bea_mpp.constprop/366 inlinable global time: 127.809798 self size: 134 global size: 132 min size: 10 self stack: 0 global stack: 0 size:112.000000, time:108.000000 size:7.000000, time:2.000000, executed if:(not inlined) size:2.000000, time:2.000000, nonconst if:(op3 changed) size:1.000000, time:1.000000, nonconst if:(op7 changed) size:3.000000, time:3.000000, nonconst if:(op9 changed) size:1.000000, time:1.000000, nonconst if:(op4 changed) size:1.000000, time:1.000000, nonconst if:(op8 changed) loop iterations:(op8 changed) loop stride:(op8 changed) calls: spec_qsort.constprop/378 function body not available loop depth: 0 freq:0.70 size: 5 time: 14 callee size:138 stack: 0 op2 is compile time invariant op3 is compile time invariant Note that spec_qsort is a recursive function, so I guess spec_qsort.constprop/378 is clone created due to the recursion. Leaving to Martin Jambor as he's the IPA CP expert! P.S. Martin please apply following patch because one dump output has the dash and second not: diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index 4202c999675..edc0bda3eca 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -4625,7 +4625,7 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset, return false; if (dump_file) - fprintf (dump_file, " Creating a specialized node of %s.\n", + fprintf (dump_file, " - Creating a specialized node of %s.\n", node->dump_name ()); callers = gather_edges_for_value (val, node, caller_count);
In my (more recent) checkout of trunk, the estimated benefit is even lower, only 270. Raising devirtualization hint to compensate seems excessive, I need: diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index ee41a8d55b7..09ba92ef14d 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -2612,6 +2612,25 @@ devirtualization_time_bonus (struct cgraph_node *node, res += 7 / ((int)speculative + 1); } + if (res) + /* Devirtualization is likely more important in recursive callers because + those cannot be entirely inlined themselves. */ + for (cgraph_edge *e = node->callees; e; e = e->next_callee) + { + enum availability avail; + cgraph_node *callee = e->callee->function_symbol (&avail); + if (e->caller == callee + && avail >= AVAIL_AVAILABLE) + { + res = 8 * res; + + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "Recursive devirtualization bohus %i\n", + res); + break; + } + } + return res; } Which is excessive (but it may make sense to boost IPA-CP bonuses for self-recursive functions a bit since those cannot be dealt with completely by inlining). Nevertless I already have a prototype patch which can explain to IPA-CP that only there is only one remaining value in the contexts for which we did not clone and replace it there too.
Created attachment 43588 [details] Prototype patch I have to leave the office now but this is the (only very very lightly tested) prototype patch that fixes the regression as described above. It still needs a bit of care before submitting, however.
Seems we have a handle on this bug?
Any progress here? I don't see the patch posted on gcc-patches.
Created attachment 43872 [details] Current patch This patch, which speeds up mcf and which is more consistent with the IPA-CP design and does not miscompile mcf, has passed LTO bootstrap and testing, I will try it on a few bigger applications and submit it shortly afterwards.
I have posted a proposed fix to the mailing list as: https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00419.html (please ignore the stuff I mistakenly pasted to the subject line).
Author: jamborm Date: Wed Apr 11 13:30:53 2018 New Revision: 259319 URL: https://gcc.gnu.org/viewcvs?rev=259319&root=gcc&view=rev Log: Improve IPA-CP handling of self-recursive calls 2018-04-11 Martin Jambor <mjambor@suse.cz> PR ipa/84149 * ipa-cp.c (propagate_vals_across_pass_through): Expand comment. (cgraph_edge_brings_value_p): New parameter dest_val, check if it is not the same as the source val. (cgraph_edge_brings_value_p): New parameter. (gather_edges_for_value): Pass destination value to cgraph_edge_brings_value_p. (perhaps_add_new_callers): Likewise. (get_info_about_necessary_edges): Likewise and exclude values brought only by self-recursive edges. (create_specialized_node): Redirect only clones of self-calling edges. (+self_recursive_pass_through_p): New function. (find_more_scalar_values_for_callers_subset): Use it. (find_aggregate_values_for_callers_subset): Likewise. (known_aggs_to_agg_replacement_list): Removed. (decide_whether_version_node): Re-calculate known constants for all remaining context clones. Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-cp.c
So hopefully fixed.
Created attachment 43973 [details] Unreduced test-case Unfortunately I see one more ICE (reduced from Firefox w/ -O3). I'm currently reducing that, but hopefully will be possible to debug for unreduced version.
(In reply to Martin Liška from comment #12) > Created attachment 43973 [details] > Unreduced test-case > > Unfortunately I see one more ICE (reduced from Firefox w/ -O3). I'm > currently reducing that, but hopefully will be possible to debug for > unreduced version. $ ./xg++ -B. -O3 -c ~/Downloads/ice.ii /home/marxin/Downloads/ice.ii:5441:214: internal compiler error: in create_specialized_node, at ipa-cp.c:3870 already_AddRefed<nsIURI> nsImageLoadingContent::GetCurrentRequestFinalURI() { nsCOMPtr<nsIURI> uri; if (mCurrentRequest) { mCurrentRequest->GetFinalURI(getter_AddRefs(uri)); } return uri.forget(); } ^ 0x1624b99 create_specialized_node ../../gcc/ipa-cp.c:3870 0x16252a2 decide_about_value<tree_node*> ../../gcc/ipa-cp.c:4699 0x1626440 decide_whether_version_node ../../gcc/ipa-cp.c:4743 0x1626440 ipcp_decision_stage ../../gcc/ipa-cp.c:4906 0x1626440 ipcp_driver ../../gcc/ipa-cp.c:5086
New ICE has a new PR85447.
Hi All, I created a new issue PR85449 that occurs after this change. IPA seems to be mixing up some of the recursive calls in a specialized function causing an invalid code path.