Bug 84149 - [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10% performance regression with r256888
Summary: [8 Regression] SPEC CPU2017 505.mcf/605.mcf ~10% performance regression with ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 8.0
: P1 normal
Target Milestone: 8.0
Assignee: Martin Jambor
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec 84016 84613
  Show dependency treegraph
 
Reported: 2018-01-31 13:55 UTC by Alexander Nesterovskiy
Modified: 2018-04-18 10:38 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-02-05 00:00:00


Attachments
Prototype patch (1.75 KB, patch)
2018-03-07 19:23 UTC, Martin Jambor
Details | Diff
Current patch (2.55 KB, patch)
2018-04-06 14:43 UTC, Martin Jambor
Details | Diff
Unreduced test-case (602.21 KB, application/x-bzip)
2018-04-18 08:33 UTC, Martin Liška
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Alexander Nesterovskiy 2018-01-31 13:55:47 UTC
Minimal options to reproduce regression (x86, 64-bit):
-O3 -flto

The reason behind the regression is that since r256888 a cost_compare function is not inlined into spec_qsort.
These two functions are in different source files.
I've managed to force cost_compare to be inlined by creating in the same source file a copy of spec_qsort function with explicit calls of cost_compare.
This reverted performance to r256887 level.
Comment 1 Martin Liška 2018-02-05 13:40:06 UTC
Let me take a look.
Comment 2 Martin Liška 2018-02-08 15:19:57 UTC
I can confirm the regression on a Haswell CPU.
Comment 3 Martin Liška 2018-02-13 12:14:47 UTC
So the problematic predictor change is:

- DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (91), 0)
+ DEF_PREDICTOR (PRED_NULL_RETURN, "null return", HITRATE (71), 0)

It lower frequency of BB 5:

primal_bea_mpp (int64_t m, struct arc_t * arcs, struct arc_t * stop_arcs, int64_t * basket_sizes, struct BASKET * * perm, int thread, struct arc_t * * end_arc, int64_t step, int64_t num_threads, int64_t max_elems)
{
...
  <bb 33> [local count: 12634988]:
READY:
  # DEBUG BEGIN_STMT
  _113 = *_40;
  _114 = (sizetype) _113;
  _115 = _114 + 1;
  _116 = _115 * 8;
  _117 = perm_148(D) + _116;
  _118 = *_117;
  _118->number = -1;
  # DEBUG BEGIN_STMT
  _122 = *_40;
  if (_122 == 0)
    goto <bb 34>; [29.93%]
  else
    goto <bb 35>; [70.07%]

  <bb 34> [local count: 3781652]:
  # DEBUG BEGIN_STMT
  // predicted unlikely by early return (on trees) predictor.
  goto <bb 36>; [100.00%]

  <bb 35> [local count: 8853336]:
  # DEBUG BEGIN_STMT
  _127 = (long unsigned int) _122;
  _128 = perm_148(D) + 8;
  spec_qsort (_128, _127, 8, cost_compare);
  # DEBUG BEGIN_STMT
  _176 = MEM[(struct BASKET * *)perm_148(D) + 8B];

  <bb 36> [local count: 12634988]:
  # _135 = PHI <0B(34), _176(35)>
  return _135;
}

It's function defined in pbeampp.c.

Then in IPA CP we end up with:

Evaluating opportunities for spec_qsort/353.
 - considering value arc_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1)
     good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 987, scc) -> evaluation: 314, threshold: 500
     good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 987, scc) -> evaluation: 643, threshold: 500
  Creating a specialized node of spec_qsort/353.
    replacing param #2 size_t with const 8
    replacing param #3 int (*<T813>) (const void *, const void *) with const arc_compare
		Accounting size:173.00, time:191.00 on predicate exec:(true)
		Accounting size:3.00, time:2.00 on new predicate exec:(not inlined)
     the new node is spec_qsort.constprop/377.
ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 71
converting indirect call in spec_qsort.constprop to direct call to arc_compare
     controlled uses count of param 3 bumped down to 8
ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 218
converting indirect call in spec_qsort.constprop to direct call to arc_compare
     controlled uses count of param 3 bumped down to 7
ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 267
converting indirect call in spec_qsort.constprop to direct call to arc_compare
     controlled uses count of param 3 bumped down to 6
ipa-prop: Discovered an indirect call to a known target (spec_qsort.constprop/377 -> arc_compare/144), for stmt with uid 348
converting indirect call in spec_qsort.constprop to direct call to arc_compare
     controlled uses count of param 3 bumped down to 5
 - considering value cost_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1)
     good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 701, scc) -> evaluation: 223, threshold: 500
     good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 701, scc) -> evaluation: 457, threshold: 500
 - Creating a specialized node of spec_qsort/353 for all known contexts.
    replacing param #2 size_t with const 8
		Accounting size:173.00, time:191.00 on predicate exec:(true)
		Accounting size:3.00, time:2.00 on new predicate exec:(not inlined)
     the new node is spec_qsort.constprop/378.

Evaluating opportunities for spec_qsort/353.
 - adding an extra caller spec_qsort.constprop/377 of spec_qsort.constprop/377
 - considering value cost_compare for param #3 int (*<T813>) (const void *, const void *) (caller_count: 1)
     good_cloning_opportunity_p (time: 143, size: 269, freq_sum: 701, scc) -> evaluation: 223, threshold: 500
     good_cloning_opportunity_p (time: 942, size: 866, freq_sum: 701, scc) -> evaluation: 457, threshold: 500
  Marking node as dead: spec_qsort/353.

Here you can see evaluation is 457, which the threshold is 500.

Another question is why IPA inline does not inline the call. Answer is here:

IPA function summary for primal_bea_mpp.constprop/366 inlinable
  global time:     127.809798
  self size:       134
  global size:     132
  min size:       10
  self stack:      0
  global stack:    0
    size:112.000000, time:108.000000
    size:7.000000, time:2.000000,  executed if:(not inlined)
    size:2.000000, time:2.000000,  nonconst if:(op3 changed)
    size:1.000000, time:1.000000,  nonconst if:(op7 changed)
    size:3.000000, time:3.000000,  nonconst if:(op9 changed)
    size:1.000000, time:1.000000,  nonconst if:(op4 changed)
    size:1.000000, time:1.000000,  nonconst if:(op8 changed)
  loop iterations:(op8 changed)
  loop stride:(op8 changed)
  calls:
    spec_qsort.constprop/378 function body not available
      loop depth: 0 freq:0.70 size: 5 time: 14 callee size:138 stack: 0
       op2 is compile time invariant
       op3 is compile time invariant

Note that spec_qsort is a recursive function, so I guess spec_qsort.constprop/378 is
clone created due to the recursion.

Leaving to Martin Jambor as he's the IPA CP expert!
P.S. Martin please apply following patch because one dump output has the dash and second not:

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 4202c999675..edc0bda3eca 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -4625,7 +4625,7 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
     return false;
 
   if (dump_file)
-    fprintf (dump_file, "  Creating a specialized node of %s.\n",
+    fprintf (dump_file, " - Creating a specialized node of %s.\n",
             node->dump_name ());
 
   callers = gather_edges_for_value (val, node, caller_count);
Comment 4 Martin Jambor 2018-03-07 19:15:25 UTC
In my (more recent) checkout of trunk, the estimated benefit is even
lower, only 270.  Raising devirtualization hint to compensate seems
excessive, I need:

diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index ee41a8d55b7..09ba92ef14d 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -2612,6 +2612,25 @@ devirtualization_time_bonus (struct cgraph_node *node,
 	res += 7 / ((int)speculative + 1);
     }
 
+  if (res)
+    /*  Devirtualization is likely more important in recursive callers because
+	those cannot be entirely inlined themselves.  */
+    for (cgraph_edge *e = node->callees; e; e = e->next_callee)
+      {
+	enum availability avail;
+	cgraph_node *callee = e->callee->function_symbol (&avail);
+	if (e->caller == callee
+	    && avail >= AVAIL_AVAILABLE)
+	  {
+	    res = 8 * res;
+
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "Recursive devirtualization bohus %i\n",
+		       res);
+	    break;
+	  }
+      }
+
   return res;
 }
 
Which is excessive (but it may make sense to boost IPA-CP bonuses for
self-recursive functions a bit since those cannot be dealt with
completely by inlining).  Nevertless I already have a prototype patch which
can explain to IPA-CP that only there is only one remaining value in the
contexts for which we did not clone and replace it there too.
Comment 5 Martin Jambor 2018-03-07 19:23:06 UTC
Created attachment 43588 [details]
Prototype patch

I have to leave the office now but this is the (only very very lightly tested) prototype patch that fixes the regression as described above.  It still needs a bit of care before submitting, however.
Comment 6 Richard Biener 2018-03-27 10:25:37 UTC
Seems we have a handle on this bug?
Comment 7 Jakub Jelinek 2018-04-04 16:14:52 UTC
Any progress here?  I don't see the patch posted on gcc-patches.
Comment 8 Martin Jambor 2018-04-06 14:43:09 UTC
Created attachment 43872 [details]
Current patch

This patch, which speeds up mcf and which is more consistent with the IPA-CP design and does not miscompile mcf, has passed LTO bootstrap and testing, I will try it on a few bigger applications and submit it shortly afterwards.
Comment 9 Martin Jambor 2018-04-10 08:13:13 UTC
I have posted a proposed fix to the mailing list as:

https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00419.html

(please ignore the stuff I mistakenly pasted to the subject line).
Comment 10 Martin Jambor 2018-04-11 13:31:27 UTC
Author: jamborm
Date: Wed Apr 11 13:30:53 2018
New Revision: 259319

URL: https://gcc.gnu.org/viewcvs?rev=259319&root=gcc&view=rev
Log:
Improve IPA-CP handling of self-recursive calls

2018-04-11  Martin Jambor  <mjambor@suse.cz>

	PR ipa/84149
	* ipa-cp.c (propagate_vals_across_pass_through): Expand comment.
	(cgraph_edge_brings_value_p): New parameter dest_val, check if it is
	not the same as the source val.
	(cgraph_edge_brings_value_p): New parameter.
	(gather_edges_for_value): Pass destination value to
	cgraph_edge_brings_value_p.
	(perhaps_add_new_callers): Likewise.
	(get_info_about_necessary_edges): Likewise and exclude values brought
	only by self-recursive edges.
	(create_specialized_node): Redirect only clones of self-calling edges.
	(+self_recursive_pass_through_p): New function.
	(find_more_scalar_values_for_callers_subset): Use it.
	(find_aggregate_values_for_callers_subset): Likewise.
	(known_aggs_to_agg_replacement_list): Removed.
	(decide_whether_version_node): Re-calculate known constants for all
	remaining context clones.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ipa-cp.c
Comment 11 Jakub Jelinek 2018-04-11 13:36:48 UTC
So hopefully fixed.
Comment 12 Martin Liška 2018-04-18 08:33:54 UTC
Created attachment 43973 [details]
Unreduced test-case

Unfortunately I see one more ICE (reduced from Firefox w/ -O3). I'm currently reducing that, but hopefully will be possible to debug for unreduced version.
Comment 13 Martin Liška 2018-04-18 08:35:19 UTC
(In reply to Martin Liška from comment #12)
> Created attachment 43973 [details]
> Unreduced test-case
> 
> Unfortunately I see one more ICE (reduced from Firefox w/ -O3). I'm
> currently reducing that, but hopefully will be possible to debug for
> unreduced version.

$ ./xg++ -B. -O3 -c ~/Downloads/ice.ii 

/home/marxin/Downloads/ice.ii:5441:214: internal compiler error: in create_specialized_node, at ipa-cp.c:3870
     already_AddRefed<nsIURI> nsImageLoadingContent::GetCurrentRequestFinalURI() {   nsCOMPtr<nsIURI> uri;   if (mCurrentRequest) {     mCurrentRequest->GetFinalURI(getter_AddRefs(uri));   }   return uri.forget(); }
                                                                                                                                                                                                                      ^
0x1624b99 create_specialized_node
	../../gcc/ipa-cp.c:3870
0x16252a2 decide_about_value<tree_node*>
	../../gcc/ipa-cp.c:4699
0x1626440 decide_whether_version_node
	../../gcc/ipa-cp.c:4743
0x1626440 ipcp_decision_stage
	../../gcc/ipa-cp.c:4906
0x1626440 ipcp_driver
	../../gcc/ipa-cp.c:5086
Comment 14 Martin Liška 2018-04-18 08:58:17 UTC
New ICE has a new PR85447.
Comment 15 Tamar Christina 2018-04-18 10:38:28 UTC
Hi All,

I created a new issue PR85449 that occurs after this change. IPA seems to be mixing up some of the recursive calls in a specialized function causing an invalid code path.