Bug 78496 - [7 Regression] Missed opportunities for jump threading
Summary: [7 Regression] Missed opportunities for jump threading
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P2 normal
Target Milestone: 8.0
Assignee: Jeffrey A. Law
URL:
Keywords: missed-optimization
Depends on:
Blocks: jumpthreading
  Show dependency treegraph
 
Reported: 2016-11-23 15:02 UTC by Yuri Rumyantsev
Modified: 2019-11-14 09:38 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work: 8.0
Known to fail: 7.5.0
Last reconfirmed: 2016-12-22 00:00:00


Attachments
test-case to reproduce, compile with -O3 option. (1.83 KB, text/plain)
2016-11-23 15:02 UTC, Yuri Rumyantsev
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yuri Rumyantsev 2016-11-23 15:02:29 UTC
Created attachment 40131 [details]
test-case to reproduce, compile with -O3 option.

We noticed a huge performance drop on one important benchmark which is caused by hoisting and collecting comparisons participated in conditional branches. Here is comments provided by Richard on it:

Note this is a general issue with PRE which tends to
see partial redundancies when it can compute an expression to a
constant on one edge.  There is nothing wrong with that but the
particular example shows the lack of a cost model with respect
to register pressure (same applies to other GIMPLE optimization passes).

In this case we have a lot of expression anticipated from the same
blocks where on one incoming edge their value is constant.  Profitability
here really depends on the "distance" of the to be inserted PHI and
its use I guess.

We're missing quite some jump-threading here as well:

  <bb 16>:
  # x1_197 = PHI <x1_261(15), x1_435(123), x1_435(105)>
  # _407 = PHI <_16(15), _16(123), 0(105)>
  # aa1_410 = PHI <aa1_185(15), aa1_185(123), aa1_216(105)>
  # d1_413 = PHI <d1_191(15), d1_191(123), d1_432(105)>
  # w1_416 = PHI <w1_260(15), w1_260(123), 0(105)>
  # v1_377 = PHI <v1_558(15), v1_558(123), 0(105)>
  # oo1_371 = PHI <oo1_567(15), oo1_567(123), oo1_194(105)>
  # ss1_376 = PHI <ss1_576(15), ss1_576(123), ss1_192(105)>
  # r1_609 = PHI <r1_585(15), r1_585(123), r1_190(105)>
  # _612 = PHI <_596(15), _596(123), _188(105)>
  # out_ind_lsm.82_322 = PHI <out_ind_lsm.82_321(15),
out_ind_lsm.82_321(123), out_ind_lsm.82_532(105)>
  _549 = w1_416 <= 899;
  _548 = _407 > 839;
  _541 = _548 & _549;
  if (_541 != 0)
    goto <bb 17>;
  else
    goto <bb 124>;

here 105 -> 16 -> 124 (forwarder) -> 18 which would eventually
make PRE behave somewhat saner (avoding the far distances).

The case appears with phicprop1 (or rather DOM, itself missing
a followup transform with respect to folding a degenerate constant
PHI plus the followup secondary threading opportunities).  The
backwards threader doesn't exploit the above opportunity though.
Our forward threaders (like DOM) do.  Unfortunately it requires
quite a few iterations to get all opportunities exploited...
(inserting 9 DOM/phi-only-cprop pass pairs "helps")

I suggest to open a bugreport for this.  Jeff may want to look at
the threading issue (I believe the backward threader _does_ iterate).

I attach a test-case to reproduce an issue.
Comment 1 Jeffrey A. Law 2016-12-22 21:42:53 UTC
There's definitely a missing jump thread here.  ANd we've got code that's supposed to handle this.
Comment 2 Jeffrey A. Law 2016-12-22 22:38:17 UTC
So first you need to be aware that the backwards threader does not do simplifications nor does it know how to analyze logical operations.  Adding those capabilities to the backward threader is on the TODO list for gcc-8.

The forward walking threader does have some code to optimize stuff like this:

 _549 = w1_416 <= 899;
  _548 = _407 > 839;
  _541 = _548 & _549;
  if (_541 != 0)

Essentially it'll want to try and prove that _548 or _549 is zero, which in turn would allow it to know that _541 is zero.

The problem is almost certainly the fact that we don't iterate the forward jump threader.

In particular if we look at block 105 after forward jump threading:

;;   basic block 105, loop depth 1, count 0, freq 3060, maybe hot
;;    prev block 104, next block 106, flags: (NEW, REACHABLE, VISITED)
;;    pred:       11 (FALSE_VALUE,EXECUTABLE)
;;                99 [100.0%]  (FALLTHRU,EXECUTABLE)
  # aa1_216 = PHI <aa1_258(11), aa1_502(99)>
  # d1_206 = PHI <d1_432(11), d1_432(99)>
  # v1_196 = PHI <0(11), 0(99)>
  # oo1_194 = PHI <oo1_565(11), r1_254(99)>
  # ss1_192 = PHI <ss1_574(11), ss1_257(99)>
  # r1_190 = PHI <r1_583(11), 0(99)>
  # _188 = PHI <1(11), _500(99)>
  _184 = v1_196 * 3600;
  _182 = (unsigned int) _184;
  _180 = _182 / 60;
  w1_178 = (int) _180;
  _177 = w1_178 <= 449;
  _176 = _180 > 389;
  _169 = _176 & _177;
  goto <bb 16>; [100.00%]

We would need to propagate v1_196 = 0 into the uses, which in turn result in _180 having the value 0.  That value in turn would feed into the PHI at the start of bb16:

;;   basic block 16, loop depth 1, count 0, freq 8500, maybe hot
;;   Invalid sum of incoming frequencies 6375, should be 8500
;;    prev block 15, next block 17, flags: (NEW, REACHABLE, VISITED)
;;    pred:       15 [100.0%]  (FALLTHRU,EXECUTABLE)
;;                14 [21.9%]  (FALSE_VALUE,EXECUTABLE)
;;                105 [100.0%]  (FALLTHRU)
  # x1_197 = PHI <x1_261(15), x1_435(14), x1_435(105)>
  # _407 = PHI <_16(15), _16(14), _180(105)>
  # aa1_410 = PHI <aa1_185(15), aa1_185(14), aa1_216(105)>
  # d1_413 = PHI <d1_191(15), d1_191(14), d1_206(105)>
  # w1_416 = PHI <w1_260(15), w1_260(14), w1_178(105)>
  # v1_377 = PHI <v1_558(15), v1_558(14), v1_196(105)>
  # oo1_371 = PHI <oo1_567(15), oo1_567(14), oo1_194(105)>
  # ss1_376 = PHI <ss1_576(15), ss1_576(14), ss1_192(105)>
  # r1_609 = PHI <r1_585(15), r1_585(14), r1_190(105)>
  # _612 = PHI <_596(15), _596(14), _188(105)>
  _549 = w1_416 <= 899;
  _548 = _407 > 839;
  _541 = _548 & _549;
  if (_541 != 0)
    goto <bb 17>; [50.00%]
  else
    goto <bb 18>; [50.00%]


Once the value of _180 is propagated into the RHS of the assignment to _407 we'd be able to thread this block.  But there's not another forward jump threading pass that runs prior to PRE.  

I looked at ways to introduce iteration without the full DOM pass years ago.  It was pretty obvious that the most interesting things happened as a result of exposing degenerate PHIs.  But I wasn't ever able to make a leap from that to a low overhead iterating jump threader.  What did come out of it was the phi-only cprop pass which propagates away the degenerate PHIs.

Through the years I've become increasingly convinced that forward, equivalence table based jump threading is the wrong model.  THe right model (IMHO) is an on-demand backwards substitution based threader.

Sebastian started that work and is what you would find in tree-ssa-threadbackwards.c.  I continue to extend that towards the long term vision.

Essentially we want to raise the query _541 != 0 an back substitute/fold
(_548 & _549) != 0
((_407 > 839) & _549) != 0
((_180 > 839) & _540) != 0
(((_182 / 60) > 839) & _540) != 0
((((unsigned int _184) / 60) > 839) & _540) != 0
(((((unsigned int) (v1_196 * 3600))/ 60) > 839) & _540) != 0
(((((unsigned int) (0) * 3600)) / 60) > 839) & _540) != 0

At which point it simplifies to 0 != 0

We're not to that point yet, but that's the direction its going.
Comment 3 Jeffrey A. Law 2017-01-25 21:22:44 UTC
So I've been pondering this a bit more.  Essentially I see two paths forward.  

One is to enhance the backwards threader so that it can do more general lookups/simplifications.  I've cobbled together some code for this, but it's certainly not gcc-7 material and even once working it may not necessarily capture this case.

The second would be to take advantage of some of the DOM restructuring we did a year or so ago and do a block local jump threading pass where the edges/blocks to consider are seeded by phi-only-cprop.  Depending on the cost iterating phi-only-cprop and a local threader might be acceptable -- particularly if we throttled back threading elsewhere.  

I haven't cobbled together any code for the second approach, but it shouldn't be hard to prototype a non-iterating implementation.  Most of the infrastructure we need is already in place.
Comment 4 Jeffrey A. Law 2017-02-07 00:38:44 UTC
More thoughts on how we might approach resolving.

To tackle in the backwards threader I think we need to change the model for how backwards threading works.

Right now it starts walking up the use-def chain to find a constant value for the given SSA_NAME.  If it finds a constant, any constant, then we have a jump threading opportunity.  We extract the conditional from the last block in the path, plug the constant into the right place and determine where it's going to go.  Then we record the jump thread path.

The conceptual change we need to make is to be able to raise the query "find paths with where X has a constant value and record those paths and the value found for X".

That would allow us to do things like simplify a bit-and by determining if one operand is zero along a particular path.  Then we can examine how that feeds a conditional.  Note carefully how determining the object has a constant value other than 0 does not allow simplification.

Given that kind of structure we could probably look to tackle this BZ in the backward threader.

We could look to improve DOM's threader.  Right now we don't pick up the secondary threading opportunities because of limitations imposed by our SSA graph updating code.  Converting that code to use the SEME copier like the backwards threader uses would allow removal of that limitation and likely allow detection of at least some cascading jump threading effects.

Anyway, I wanted to get those thoughts recorded here so they're not lost.
Comment 5 Jeffrey A. Law 2017-03-22 05:40:46 UTC
More comments.  As has been noted, this looks like a case where we need iteration to fully optimize.  However, there are things we can do to improve VRP's jump threading which should have a direct positive impact on this test.


vrp has code which will forward propagate something like

x = (typecast) y;
if (x == 42)

Into

if (y == 42)

That's fine and good, except it really makes it hard for jump threading to thread through such conditionals since we may have had ASSERT_EXPRs to give us a nice range for X, but not for Y.

From my experiments, deferring that transformation until after jump threading would definitely help this testcase.  Doing so allows VRP1 to jump thread a number of range tests of w1.  The total number of vrp & dom jump threads are currently 15.  Fixing increases that to 57.

We also have cases where we've got something like this:

;;   basic block 13, loop depth 1, count 0, freq 8500, maybe hot
;;    prev block 12, next block 129, flags: (NEW, REACHABLE, VISITED)
;;    pred:       11 [50.0%]  (FALSE_VALUE,EXECUTABLE)
;;                12 [100.0%]  (FALLTHRU,EXECUTABLE)
  out_ind.71_313 = out_ind;
  OUT[out_ind.71_313] = ss1_233;
  _314 = out_ind.71_313 + 1;
  out_ind = _314;
  aa1_258 = ye1_171 + aa1_186;
  if (v1_179 > 0)
    goto <bb 14>; [64.00%]
  else
    goto <bb 129>; [36.00%]
;;    succ:       14 [64.0%]  (TRUE_VALUE,EXECUTABLE)
;;                129 [36.0%]  (FALSE_VALUE,EXECUTABLE)

;;   basic block 129, loop depth 1, count 0, freq 3060, maybe hot
;;    prev block 13, next block 14, flags: (NEW)
;;    pred:       13 [36.0%]  (FALSE_VALUE,EXECUTABLE)
  v1_378 = ASSERT_EXPR <v1_179, v1_179 <= 0>;
  goto <bb 16>; [100.00%]
;;    succ:       16 [100.0%]  (FALLTHRU)

Where we have this range computed for v1_378:
v1_378: [0, 0]  EQUIVALENCES: { v1_179 } (1 elements)


Presumably we don't simplify the ASSERT_EXPR because it was thought that they're just going to be dropped.  But by simplifying the ASSERT_EXPR into an equality test, we can then propagate a value of 0 for v1_179 into BB16 when it's reached via BB129.  That in turn allows us to thread the jump earlier (in VRP1 rather than in DOM2, which is usually helpful).


The one I still haven't cracked looks like this:

;;   basic block 18, loop depth 1, count 0, freq 2125, maybe hot
;;    prev block 131, next block 19, flags: (NEW, REACHABLE, VISITED)
;;    pred:       17 [50.0%]  (TRUE_VALUE,EXECUTABLE)
  w1_390 = ASSERT_EXPR <w1_389, w1_389 <= 449>;
  _17 = 450 - w1_390;
  _18 = (unsigned int) _17;
  _19 = _18 * oo1_227;
  _20 = _19 / 3600;
  _21 = _20 + _310;
  x1_261 = (int) _21;
;;    succ:       19 [100.0%]  (FALLTHRU,EXECUTABLE)

;;   basic block 19, loop depth 1, count 0, freq 8500, maybe hot
;;    prev block 18, next block 132, flags: (NEW, REACHABLE, VISITED)
;;    pred:       130 [100.0%]  (FALLTHRU)
;;                131 [100.0%]  (FALLTHRU)
;;                18 [100.0%]  (FALLTHRU,EXECUTABLE)
  # x1_197 = PHI <x1_206(130), x1_206(131), x1_261(18)>
  if (w1_260 > 839)
    goto <bb 20>; [50.00%]
  else
    goto <bb 132>; [50.00%]
;;    succ:       20 [50.0%]  (TRUE_VALUE,EXECUTABLE)
;;                132 [50.0%]  (FALSE_VALUE,EXECUTABLE)

We've got an obviously threadable jump when BB19 is reached via BB18.

w1_260 is not used by the ASSERT_EXPR in BB18, so we don't find that ASSERT_EXPR, which we need to be able to thread the jump.  I can see expensive ways to get to the ASSERT_EXPR, but nothing I'd consider clean enough yet.  I suspect this happens enough that if we were to catch it in VRP1 that we'd probably be reasonably close to catching the jump threads early enough to essentially resolve this BZ.
Comment 6 Jeffrey A. Law 2017-03-23 20:37:14 UTC
So I've got a hack that allows me to evaluate the effect of the last example from c#5.  So let's look at how the number of realized jump threads is affected by the various tweaks I'm playing with:


    VRP1  DOM2    DOM3   VRP2  thread{1,2,3,4}
base 6      9       0      0       0 
p1   27     9       3      0       0
p2   30     6       0     21       0
p3   51     6       0      0       0
p4   51     6       0      0       6

VRP/DOM/thread columns count the number of jump threads realized by that pass.
  


p1 defers simplification of conditionals until after VRP threading is done.  It's clearly finding many previously missed jump threads and even exposes some new opportunities for DOM3. 

p2 simplifies ASSERT_EXPRs from relational to equality tests.  It moves a small number of jump threads from DOM into VRP1, and also picks up a ton of new jump threads to VRP2. 

p3 catches the case in the last example of c#5.  It moves all those cases exposed for VRP2 by patch #2 to occur in VRP1. 

p4 retunes the FSM threader slightly.  I'm still experimenting here, but it does catch some stuff that was previously missed *and* exposes some new opportunities.  Ie, it catches stuff in thread2 and exposes stuff for thread3.

It's pretty clear that the patches can significantly improve the jump threading we're doing for this testcase and do so early in the pipeline, where they're the most beneficial.


We don't have a good description of the primary effect we're looking for, but I'll guess that we're finding partial redundancies for those edges with constant PHI arguments.  Many of those should just go away completely.  If we look at the pre dumps, we start with 278 partial redundancies.  The first patch drops that to 73.  The second patch drops it further to 61 where it stabilizes.  So we're certainly seeing a lot fewer partial redundancies.
Comment 7 Jeffrey A. Law 2017-03-28 03:38:09 UTC
While I've got changes which I think will address the problems in this BZ; given how late we are in stage4, I'm going to defer to gcc-8.
Comment 8 Jeffrey A. Law 2017-05-03 16:34:17 UTC
Author: law
Date: Wed May  3 16:33:45 2017
New Revision: 247556

URL: https://gcc.gnu.org/viewcvs?rev=247556&root=gcc&view=rev
Log:
	PR tree-optimization/78496
	* tree-vrp.c (simplify_cond_using_ranges_1): Renamed
	from simplify_cond_using_ranges.  Split off code to walk
	backwards through casts into ...
	(simplify_cond_using_ranges_2): New function.
	(simplify_stmt_using_ranges): Call simplify_cond_using_ranges_1.
	(execute_vrp): After identifying jump threads, call
	simplify_cond_using_ranges_2.

	PR tree-optimization/78496
	* gcc.dg/tree-ssa/ssa-thread-15.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-15.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vrp.c
Comment 9 Jeffrey A. Law 2017-05-06 15:04:12 UTC
Author: law
Date: Sat May  6 15:03:40 2017
New Revision: 247721

URL: https://gcc.gnu.org/viewcvs?rev=247721&root=gcc&view=rev
Log:
	PR tree-optimization/78496
	* tree-vrp.c (simplify_assert_expr_using_ranges): New function.
	(simplify_stmt_using_ranges): Call it.
	(vrp_dom_walker::before_dom_children): Extract equivalences
	from an ASSERT_EXPR with an equality comparison against a
	constant.

	PR tree-optimization/78496
	* gcc.dg/tree-ssa/ssa-thread-16.c: New test.
	* gcc.dg/tree-ssa/ssa-thread-17.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-16.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-17.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vrp.c
Comment 10 Jeffrey A. Law 2017-05-06 18:21:02 UTC
Author: law
Date: Sat May  6 18:20:31 2017
New Revision: 247722

URL: https://gcc.gnu.org/viewcvs?rev=247722&root=gcc&view=rev
Log:
	PR tree-optimization/78496
	* tree-vrp.c (simplify_assert_expr_using_ranges): Remove debugging
	code.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vrp.c
Comment 11 Jeffrey A. Law 2017-05-07 15:11:27 UTC
Author: law
Date: Sun May  7 15:10:55 2017
New Revision: 247727

URL: https://gcc.gnu.org/viewcvs?rev=247727&root=gcc&view=rev
Log:
2017-05-07  Jeff Law  <law@redhat.com>

	Revert:
	2017-05-06  Jeff Law  <law@redhat.com>
	PR tree-optimization/78496
	* tree-vrp.c (simplify_assert_expr_using_ranges): Remove debugging
	code.

	PR tree-optimization/78496
	* tree-vrp.c (simplify_assert_expr_using_ranges): New function.
	(simplify_stmt_using_ranges): Call it.
	(vrp_dom_walker::before_dom_children): Extract equivalences
	from an ASSERT_EXPR with an equality comparison against a
	constant.

	Revert:
	2017-05-06  Jeff Law  <law@redhat.com>
	PR tree-optimization/78496
	* gcc.dg/tree-ssa/ssa-thread-16.c: New test.
	* gcc.dg/tree-ssa/ssa-thread-17.c: New test.

Removed:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-16.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-thread-17.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vrp.c
Comment 12 Jeffrey A. Law 2017-12-04 16:14:56 UTC
Author: law
Date: Mon Dec  4 16:14:24 2017
New Revision: 255387

URL: https://gcc.gnu.org/viewcvs?rev=255387&root=gcc&view=rev
Log:
	PR tree-optimizatin/78496
	* gimple-ssa-evrp-analyze.h
	(evrp_range_analyzer::get_vr_values): Simplify.
	* gimple-ssa-evrp-analyze.c: Corresponding changes.
	* tree-ssa-dom.c: Include alloc-pool.h, tree-vrp.h, vr-values.h
	and gimple-ssa-evrp-analyze.h.
	(dom_opt_dom_walker class): Add evrp_range_analyzer member.
	(simplify_stmt_for_jump_threading): Copy a blob of code from
	tree-vrp.c to use ranges to simplify statements.
	(dom_opt_dom_walker::before_dom_children): Call
	evrp_range_analyzer::{enter,record_ranges_from_stmt} methods.
	(dom_opt_dom_walker::after_dom_children): Similarly for
	evrp_range_analyzer::leave.
	(dom_opt_dom_walker::optimize_stmt): Use EVRP ranges to optimize
	conditionals.

	PR tree-optimization/78496
	* gcc.dg/builtin-unreachable-6.c: Disable DOM.
	* gcc.dg/builtin-unreachable-6a.c: New test.
	* gcc.dg/tree-ssa/20030922-1.c: No longer XFAIL.
	* gcc.dg/ssa-dom-branch-1.c: Tweak expected output.

Added:
    trunk/gcc/testsuite/gcc.dg/builtin-unreachable-6a.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-evrp-analyze.h
    trunk/gcc/gimple-ssa-evrp.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/builtin-unreachable-6.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/20030922-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-branch-1.c
    trunk/gcc/tree-ssa-dom.c
Comment 13 Jeffrey A. Law 2017-12-04 16:15:52 UTC
Fixed on the trunk.   No plans to backport.
Comment 14 Jakub Jelinek 2018-05-02 10:04:39 UTC
GCC 8.1 has been released.
Comment 15 Jakub Jelinek 2018-07-26 11:00:03 UTC
GCC 8.2 has been released.
Comment 16 Richard Biener 2019-11-14 09:38:06 UTC
Fixed in GCC 8.