Bug 32306 - [6/7/8 Regression] redundant && || not eliminated
Summary: [6/7/8 Regression] redundant && || not eliminated
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.3.0
: P2 normal
Target Milestone: 6.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 19794
  Show dependency treegraph
 
Reported: 2007-06-12 14:45 UTC by Pranav Bhandarkar
Modified: 2017-11-24 16:23 UTC (History)
12 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.1.2
Known to fail:
Last reconfirmed: 2011-11-30 00:00:00


Attachments
Testcase displaying the said behaviour (214 bytes, text/plain)
2007-06-12 14:48 UTC, Pranav Bhandarkar
Details
Code Generated by 4.1 (749 bytes, application/octet-stream)
2007-06-12 14:50 UTC, Pranav Bhandarkar
Details
Code Generated by 4.3 (1.34 KB, application/octet-stream)
2007-06-12 14:50 UTC, Pranav Bhandarkar
Details
A smaller Testcase that displays the said behaviour (151 bytes, application/octet-stream)
2007-06-12 21:10 UTC, Pranav Bhandarkar
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Pranav Bhandarkar 2007-06-12 14:45:16 UTC
For the following Code Snippet
void bar ()
{

  b1 = foo(1);
  b2 = foo(1);
  b3 = foo(1);
  b4 = foo(1);
  b5 = foo(1);
  b6 = foo(1);
  b7 = foo(1);
  b8 = foo(1);
  b9 = foo(1);
  b10 = foo(1);
  b11 = foo(1);
  b12 = foo(1);

  array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[2] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[3] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[4] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[5] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[6] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[7] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[8] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[9] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[10] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;

  return;
}

Where b ( from b1 to b12) are all declared static short b1, static short b2 etc.
and array is static short array[11].
This should generate code such as

if (b1 == 0) goto L1 else goto L2
L2:
if (b2 == 0) goto L1 else goto L3
L3:
if (b3 == 0) goto L1 else goto L4
L4:
if (b4 == 0) goto L1 else goto L5
L5:
if (b5 == 0) goto L1 else goto L6
L6:
if (b6 == 0) goto L1 else goto L7
L7:
if (b7 == 0) goto L1 else goto L8
L8:
if (b8 == 0) goto L1 else goto L9
L9:
if (b9 == 0) goto L1 else goto L10
L10:
if (b10 == 0) goto L1 else goto L11
L11:
if (b11 == 0) goto L1 else goto L12
L12:
if (b12 == 0) goto L1 else goto L13
L13:
array[i]=1 (for i from 0 to 10)
return

L1:
array[i]=0 (for i from 0 to 10)
return

This is exactly what 4.1 generates but 4.3 fails to combine the if sequences.
Version Details:
GNU C version 4.3.0 20070316 (experimental) (arm-none-eabi)
        compiled by GNU C version 3.4.6 (Ubuntu 3.4.6-1ubuntu2).
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Comment 1 Pranav Bhandarkar 2007-06-12 14:48:15 UTC
Created attachment 13686 [details]
Testcase displaying the said behaviour
Comment 2 Pranav Bhandarkar 2007-06-12 14:50:01 UTC
Created attachment 13687 [details]
Code Generated by 4.1
Comment 3 Pranav Bhandarkar 2007-06-12 14:50:33 UTC
Created attachment 13688 [details]
Code Generated by 4.3
Comment 4 Richard Biener 2007-06-12 15:06:52 UTC
Try to narrow it down to sth shorter.  (Looks like a jump-threading issue)
Comment 5 Pranav Bhandarkar 2007-06-12 21:10:56 UTC
Created attachment 13694 [details]
A smaller Testcase that displays the said behaviour

Reduced the testcase. Reducing to less than b1 thru b6 results in the compiler being able to combine the if conditions at the tree level. Therefore couldnt reduce it to less than b6. array elements written have been reduced to 4.
Comment 6 Pranav Bhandarkar 2007-06-12 21:14:29 UTC
dom, in 4.1 is able to combine the if conditions unlike 4.3.
Comment 7 Richard Biener 2007-06-12 21:57:36 UTC
The difference is that we iterated jump threading in DOM in 4.1 but do so no longer.  On the mainline each dom and vrp pass figures more jump threading
opportunities - just not enough.  With 4.1 the third and last DOM pass doesn't
find any more.
Comment 9 Mark Mitchell 2007-10-09 19:21:27 UTC
Change target milestone to 4.2.3, as 4.2.2 has been released.
Comment 10 Steven Bosscher 2008-01-09 13:49:25 UTC
This is a code generation difference due to a design choice.  IMHO this should not be a P2 regression.
Comment 11 Joseph S. Myers 2008-02-01 16:54:09 UTC
4.2.3 is being released now, changing milestones of open bugs to 4.2.4.
Comment 12 Joseph S. Myers 2008-05-19 20:23:10 UTC
4.2.4 is being released, changing milestones to 4.2.5.
Comment 13 Paolo Bonzini 2008-11-20 08:48:05 UTC
I agree with Steven that the bug title does not make sense.  It is the same as complaining that IRA has a regression because it disables regmove.

OTOH, this *is* a regression and the bug should stay open.  For example, it could be fixed by doing some of the tree optimizations with && not lowered to jumps (just shooting).
Comment 14 Ramana Radhakrishnan 2009-01-15 10:35:08 UTC
Redo CCs
Comment 15 Joseph S. Myers 2009-03-31 20:10:27 UTC
Closing 4.2 branch.
Comment 16 Richard Biener 2009-08-04 12:28:14 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 17 Richard Biener 2010-05-22 18:11:34 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 18 Richard Biener 2011-06-27 12:14:35 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 19 Richard Biener 2012-01-12 13:02:25 UTC
Shorter testcase, compilable and to the point.  We are not able to CSE
the b1 && ... && b8 sequence because we produce control-flow for it
during gimplification.

void bar (short *array,
          short b1, short b2, short b3, short b4,
          short b5, short b6, short b7, short b8)
{
  array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
  array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
}
Comment 20 Jakub Jelinek 2012-03-13 12:48:02 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 21 Richard Biener 2012-07-02 11:53:42 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 22 Steven Bosscher 2012-11-09 23:13:53 UTC
(In reply to comment #19)
> Shorter testcase, compilable and to the point.  We are not able to CSE
> the b1 && ... && b8 sequence because we produce control-flow for it
> during gimplification.

Do you know why that is so?
Comment 23 Jakub Jelinek 2013-01-10 11:26:09 UTC
It generally depends on the branch cost, e.g. out of #c19 testcase on x86_64 we generate:
  D.1729 = b1 != 0;
  D.1730 = b2 != 0;
  D.1731 = D.1729 & D.1730;
  if (D.1731 != 0) goto <D.1732>; else goto <D.1727>;
  <D.1732>:
  D.1733 = b3 != 0;
  D.1734 = b4 != 0;
  D.1735 = D.1733 & D.1734;
  if (D.1735 != 0) goto <D.1736>; else goto <D.1727>;
  <D.1736>:
etc., but on other targets it could have just one comparison per conditional jump, or on the other side 4.  While the tests don't have side-effects, turning too many &&s into &s wouldn't result in very good code, though of course would make it easier to perform some optimizations.  Anyway, you can write it in the source as a series of ifs:
  array[0] = 1;
  if (!b1)
    array[0] = 0;
  else if (!b2)
    array[0] = 0;
  else if (!b3)
    array[0] = 0;
...
and we wouldn't be able to optimize it anyway.
Comment 24 Jakub Jelinek 2013-04-12 15:17:03 UTC
GCC 4.6.4 has been released and the branch has been closed.
Comment 25 Richard Biener 2014-06-12 13:46:56 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 26 Jakub Jelinek 2014-12-19 13:36:25 UTC
GCC 4.8.4 has been released.
Comment 27 Richard Biener 2015-06-23 08:19:03 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 28 Jakub Jelinek 2015-06-26 19:59:33 UTC
GCC 4.9.3 has been released.
Comment 29 Bernd Schmidt 2016-01-26 14:40:39 UTC
(In reply to Richard Biener from comment #19)
> Shorter testcase, compilable and to the point.  We are not able to CSE
> the b1 && ... && b8 sequence because we produce control-flow for it
> during gimplification.
> 
> void bar (short *array,
>           short b1, short b2, short b3, short b4,
>           short b5, short b6, short b7, short b8)
> {
>   array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
>   array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
> }

This seems to be optimized as of 4.7, but the original testcase still needs work. Maybe sccvn could be taught to recreate the conditionals?
Comment 30 Richard Biener 2016-08-03 10:43:11 UTC
GCC 4.9 branch is being closed
Comment 31 Martin Sebor 2017-01-13 17:00:36 UTC
With GCC 7.0 I don't see much of a difference between the code emitted for the original test case and for the smaller one from comment #19.  Can this bug be resolved or am I missing something?

For convenience, here are both test cases:

#if !ORIGINAL

void bar (short *array,
	  short b1, short b2, short b3, short b4,
	  short b5, short b6, short b7, short b8)
{
  array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
  array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8;
}

#else

void bar (short *array,
	  short b1, short b2, short b3, short b4,
	  short b5, short b6, short b7, short b8,
	  short b9, short b10, short b11, short b12)
{
  array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[2] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[3] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[4] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[5] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[6] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[7] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[8] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[9] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
  array[10] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12;
}
#endif
Comment 32 Richard Biener 2017-01-16 08:14:16 UTC
> grep if t.c.227t.optimized  | wc -l
109

should be 12 if CSE worked.
Comment 33 Jeffrey A. Law 2017-03-17 17:58:53 UTC
Martin,

Richi's right. If you look at the .optimized dump you'll an huge number of redundant conditionals.

We should compute b1 && b2 && b3 ... && b12 once and store the result into each array element.

But the way we generate code for this can make it awful hard for the optimizers to see the giant common subexpression.

If you look at the shape of the CFG and explore how to expose the various redundancies, you'll see there's a cascading effect.  ie, you expose some, simplify and new opportunities appear.

Jump threading had the ability to do this many years ago.  We kept iterating DOM and forward jump threading until nothing changed.  But that was very expensive relative to what was gained and we removed the iteration code.

There's various things in the pipeline that might (or might not) help this specific BZ over time.  It's certainly in the back of my mind as we continue the transition away from the forward equivalency table based jump threading to the backward walking of the use-def chains.

The backwards approach holds a lot more promise  to address this BZ.  I'm pretty sure the stuff planned would allow it to find all the threads in a single pass.  The question of cost of the backwards walking and  block copying necessary to isolate/expose the redundancies.
Comment 34 Jakub Jelinek 2017-03-24 18:26:43 UTC
So, could we in SCCVN recognize similarly to maybe_optimize_range_tests inter-bb && and || tests at the end stored through PHI into an SSA_NAME without side-effects in between and handle it (hash it and compare) as if it was actually using && or || (or & and | ?)?
Comment 35 Jakub Jelinek 2017-10-10 13:28:15 UTC
GCC 5 branch is being closed
Comment 36 Jeffrey A. Law 2017-11-24 16:23:24 UTC
Just a couple notes.  I'm not currently looking at this, but this is probably the best bug to track thoughts around how to try and capture secondary effects of jump threading without re-running all of DOM.

-- Taken from another BZ so that it doesn't get lost --

I looked at ways to introduce iteration without the full DOM pass years ago.  It was pretty obvious that the most interesting things happened as a result of exposing degenerate PHIs.  But I wasn't ever able to make a leap from that to a low overhead iterating jump threader.  What did come out of it was the phi-only cprop pass which propagates away the degenerate PHIs.

--

And today's thought.  In addition to degenerate PHIs the other key property which indicates an exposed secondary opportunity is a reduction in the number of preds for a block.  Particularly when we can drop from N to 1 pred.

ssa--dom-simplify-1 is a good example, particularly if one disables the VRP threader.   Prior to DOM we'll see:

;;   basic block 2, loop depth 0, count 1073741825 (estimated locally), maybe hot
;;    prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED)
;;    pred:       ENTRY [always]  count:1073741826 (estimated locally) (FALLTHRU,EXECUTABLE)
  if (x_3(D) > 3)
    goto <bb 3>; [33.00%]
  else
    goto <bb 4>; [67.00%]
;;    succ:       3 [33.0% (guessed)]  count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                4 [67.0% (guessed)]  count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 3, loop depth 0, count 354334802 (estimated locally), maybe hot
;;    prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED)
;;    pred:       2 [33.0% (guessed)]  count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  frob (1);
;;    succ:       4 [always (guessed)]  count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 4, loop depth 0, count 1073741825 (estimated locally), maybe hot
;;    prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;    pred:       3 [always (guessed)]  count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE)
;;                2 [67.0% (guessed)]  count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE)
  if (x_3(D) > 2)
    goto <bb 5>; [33.00%]
  else
    goto <bb 6>; [67.00%]
;;    succ:       5 [33.0% (guessed)]  count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE)
;;                6 [67.0% (guessed)]  count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE)

;;   basic block 5, loop depth 0, count 354334802 (estimated locally), maybe hot
;;    prev block 4, next block 6, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 [33.0% (guessed)]  count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE)
  frob (x_3(D));
;;    succ:       6 [always (guessed)]  count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE)

;;   basic block 6, loop depth 0, count 1073741825 (estimated locally), maybe hot
;;    prev block 5, next block 1, flags: (NEW, REACHABLE, VISITED)
;;    pred:       4 [67.0% (guessed)]  count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                5 [always (guessed)]  count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE)
  return;
;;    succ:       EXIT [always (guessed)]  count:1073741825 (estimated locally)

}


DOM is going to discover the 3->4->5 jump thread easily.    But there are no PHIs in BB4 that would trigger any kind of reanalysis.  But the # preds for BB4 drops from 2 to 1.  This is important as the remaining path into BB4 can be further optimized after we realize the 3->4->5 jump thread.

It feels as if when we discover a degenerate PHI or the incoming preds drops to 1, then we ought to start a re-analysis.  The blocks involved would start at the idom of the affected block, covering the dominance tree and the PHI nodes as the dominance frontier.

I thought I explored that idom/dominance tree/dominance frontier idea years ago and likely dismissed it as incomplete (consider how scrambled the dominance tree can get after threading).  But while I could certainly conjure up scenarios where it's incomplete, it might be "good enough" to catch the secondary opportunities without a fully iterating DOM.

--

Of course I'm also very interested to evaluate if any of that is necessary with Aldy's recent work on the backwards threader.