Summary: | [11/12/13/14 Regression] redundant && || not eliminated | ||
---|---|---|---|
Product: | gcc | Reporter: | Pranav Bhandarkar <pranav.bhandarkar> |
Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
Status: | NEW --- | ||
Severity: | normal | CC: | bonzini, dave, dimhen, gcc-bugs, icera-sdkteam-gnu, jakub, jeffreyalaw, ktietz, msebor, ramana.r |
Priority: | P2 | Keywords: | missed-optimization |
Version: | 4.3.0 | ||
Target Milestone: | 11.5 | ||
Host: | Target: | ||
Build: | Known to work: | 4.1.2 | |
Known to fail: | Last reconfirmed: | 2023-07-15 00:00:00 | |
Bug Depends on: | |||
Bug Blocks: | 19794 | ||
Attachments: |
Testcase displaying the said behaviour
Code Generated by 4.1 Code Generated by 4.3 A smaller Testcase that displays the said behaviour |
Description
Pranav Bhandarkar
2007-06-12 14:45:16 UTC
Created attachment 13686 [details]
Testcase displaying the said behaviour
Created attachment 13687 [details]
Code Generated by 4.1
Created attachment 13688 [details]
Code Generated by 4.3
Try to narrow it down to sth shorter. (Looks like a jump-threading issue) Created attachment 13694 [details]
A smaller Testcase that displays the said behaviour
Reduced the testcase. Reducing to less than b1 thru b6 results in the compiler being able to combine the if conditions at the tree level. Therefore couldnt reduce it to less than b6. array elements written have been reduced to 4.
dom, in 4.1 is able to combine the if conditions unlike 4.3. The difference is that we iterated jump threading in DOM in 4.1 but do so no longer. On the mainline each dom and vrp pass figures more jump threading opportunities - just not enough. With 4.1 the third and last DOM pass doesn't find any more. Change target milestone to 4.2.3, as 4.2.2 has been released. This is a code generation difference due to a design choice. IMHO this should not be a P2 regression. 4.2.3 is being released now, changing milestones of open bugs to 4.2.4. 4.2.4 is being released, changing milestones to 4.2.5. I agree with Steven that the bug title does not make sense. It is the same as complaining that IRA has a regression because it disables regmove. OTOH, this *is* a regression and the bug should stay open. For example, it could be fixed by doing some of the tree optimizations with && not lowered to jumps (just shooting). Redo CCs Closing 4.2 branch. GCC 4.3.4 is being released, adjusting target milestone. GCC 4.3.5 is being released, adjusting target milestone. 4.3 branch is being closed, moving to 4.4.7 target. Shorter testcase, compilable and to the point. We are not able to CSE the b1 && ... && b8 sequence because we produce control-flow for it during gimplification. void bar (short *array, short b1, short b2, short b3, short b4, short b5, short b6, short b7, short b8) { array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; } 4.4 branch is being closed, moving to 4.5.4 target. The 4.5 branch is being closed, adjusting target milestone. (In reply to comment #19) > Shorter testcase, compilable and to the point. We are not able to CSE > the b1 && ... && b8 sequence because we produce control-flow for it > during gimplification. Do you know why that is so? It generally depends on the branch cost, e.g. out of #c19 testcase on x86_64 we generate: D.1729 = b1 != 0; D.1730 = b2 != 0; D.1731 = D.1729 & D.1730; if (D.1731 != 0) goto <D.1732>; else goto <D.1727>; <D.1732>: D.1733 = b3 != 0; D.1734 = b4 != 0; D.1735 = D.1733 & D.1734; if (D.1735 != 0) goto <D.1736>; else goto <D.1727>; <D.1736>: etc., but on other targets it could have just one comparison per conditional jump, or on the other side 4. While the tests don't have side-effects, turning too many &&s into &s wouldn't result in very good code, though of course would make it easier to perform some optimizations. Anyway, you can write it in the source as a series of ifs: array[0] = 1; if (!b1) array[0] = 0; else if (!b2) array[0] = 0; else if (!b3) array[0] = 0; ... and we wouldn't be able to optimize it anyway. GCC 4.6.4 has been released and the branch has been closed. The 4.7 branch is being closed, moving target milestone to 4.8.4. GCC 4.8.4 has been released. The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3. GCC 4.9.3 has been released. (In reply to Richard Biener from comment #19) > Shorter testcase, compilable and to the point. We are not able to CSE > the b1 && ... && b8 sequence because we produce control-flow for it > during gimplification. > > void bar (short *array, > short b1, short b2, short b3, short b4, > short b5, short b6, short b7, short b8) > { > array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; > array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; > } This seems to be optimized as of 4.7, but the original testcase still needs work. Maybe sccvn could be taught to recreate the conditionals? GCC 4.9 branch is being closed With GCC 7.0 I don't see much of a difference between the code emitted for the original test case and for the smaller one from comment #19. Can this bug be resolved or am I missing something? For convenience, here are both test cases: #if !ORIGINAL void bar (short *array, short b1, short b2, short b3, short b4, short b5, short b6, short b7, short b8) { array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; } #else void bar (short *array, short b1, short b2, short b3, short b4, short b5, short b6, short b7, short b8, short b9, short b10, short b11, short b12) { array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[2] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[3] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[4] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[5] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[6] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[7] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[8] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[9] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; array[10] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8 && b9 && b10 && b11 && b12; } #endif > grep if t.c.227t.optimized | wc -l
109
should be 12 if CSE worked.
Martin, Richi's right. If you look at the .optimized dump you'll an huge number of redundant conditionals. We should compute b1 && b2 && b3 ... && b12 once and store the result into each array element. But the way we generate code for this can make it awful hard for the optimizers to see the giant common subexpression. If you look at the shape of the CFG and explore how to expose the various redundancies, you'll see there's a cascading effect. ie, you expose some, simplify and new opportunities appear. Jump threading had the ability to do this many years ago. We kept iterating DOM and forward jump threading until nothing changed. But that was very expensive relative to what was gained and we removed the iteration code. There's various things in the pipeline that might (or might not) help this specific BZ over time. It's certainly in the back of my mind as we continue the transition away from the forward equivalency table based jump threading to the backward walking of the use-def chains. The backwards approach holds a lot more promise to address this BZ. I'm pretty sure the stuff planned would allow it to find all the threads in a single pass. The question of cost of the backwards walking and block copying necessary to isolate/expose the redundancies. So, could we in SCCVN recognize similarly to maybe_optimize_range_tests inter-bb && and || tests at the end stored through PHI into an SSA_NAME without side-effects in between and handle it (hash it and compare) as if it was actually using && or || (or & and | ?)? GCC 5 branch is being closed Just a couple notes. I'm not currently looking at this, but this is probably the best bug to track thoughts around how to try and capture secondary effects of jump threading without re-running all of DOM. -- Taken from another BZ so that it doesn't get lost -- I looked at ways to introduce iteration without the full DOM pass years ago. It was pretty obvious that the most interesting things happened as a result of exposing degenerate PHIs. But I wasn't ever able to make a leap from that to a low overhead iterating jump threader. What did come out of it was the phi-only cprop pass which propagates away the degenerate PHIs. -- And today's thought. In addition to degenerate PHIs the other key property which indicates an exposed secondary opportunity is a reduction in the number of preds for a block. Particularly when we can drop from N to 1 pred. ssa--dom-simplify-1 is a good example, particularly if one disables the VRP threader. Prior to DOM we'll see: ;; basic block 2, loop depth 0, count 1073741825 (estimated locally), maybe hot ;; prev block 0, next block 3, flags: (NEW, REACHABLE, VISITED) ;; pred: ENTRY [always] count:1073741826 (estimated locally) (FALLTHRU,EXECUTABLE) if (x_3(D) > 3) goto <bb 3>; [33.00%] else goto <bb 4>; [67.00%] ;; succ: 3 [33.0% (guessed)] count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE) ;; 4 [67.0% (guessed)] count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE) ;; basic block 3, loop depth 0, count 354334802 (estimated locally), maybe hot ;; prev block 2, next block 4, flags: (NEW, REACHABLE, VISITED) ;; pred: 2 [33.0% (guessed)] count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE) frob (1); ;; succ: 4 [always (guessed)] count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE) ;; basic block 4, loop depth 0, count 1073741825 (estimated locally), maybe hot ;; prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED) ;; pred: 3 [always (guessed)] count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE) ;; 2 [67.0% (guessed)] count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE) if (x_3(D) > 2) goto <bb 5>; [33.00%] else goto <bb 6>; [67.00%] ;; succ: 5 [33.0% (guessed)] count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE) ;; 6 [67.0% (guessed)] count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE) ;; basic block 5, loop depth 0, count 354334802 (estimated locally), maybe hot ;; prev block 4, next block 6, flags: (NEW, REACHABLE, VISITED) ;; pred: 4 [33.0% (guessed)] count:354334800 (estimated locally) (TRUE_VALUE,EXECUTABLE) frob (x_3(D)); ;; succ: 6 [always (guessed)] count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE) ;; basic block 6, loop depth 0, count 1073741825 (estimated locally), maybe hot ;; prev block 5, next block 1, flags: (NEW, REACHABLE, VISITED) ;; pred: 4 [67.0% (guessed)] count:719407025 (estimated locally) (FALSE_VALUE,EXECUTABLE) ;; 5 [always (guessed)] count:354334802 (estimated locally) (FALLTHRU,EXECUTABLE) return; ;; succ: EXIT [always (guessed)] count:1073741825 (estimated locally) } DOM is going to discover the 3->4->5 jump thread easily. But there are no PHIs in BB4 that would trigger any kind of reanalysis. But the # preds for BB4 drops from 2 to 1. This is important as the remaining path into BB4 can be further optimized after we realize the 3->4->5 jump thread. It feels as if when we discover a degenerate PHI or the incoming preds drops to 1, then we ought to start a re-analysis. The blocks involved would start at the idom of the affected block, covering the dominance tree and the PHI nodes as the dominance frontier. I thought I explored that idom/dominance tree/dominance frontier idea years ago and likely dismissed it as incomplete (consider how scrambled the dominance tree can get after threading). But while I could certainly conjure up scenarios where it's incomplete, it might be "good enough" to catch the secondary opportunities without a fully iterating DOM. -- Of course I'm also very interested to evaluate if any of that is necessary with Aldy's recent work on the backwards threader. Another thought for dealing with this BZ: Use the infrastructure Alex built to identify blocks that are in effect forwarders (ie, they need not be copied for jump threading). Use that knowledge to thread deeper in the CFG each iteration. So I prototyped the idea in c#37. It helps, but is not sufficient to address this case. Given that it doesn't fully address this case, I don't think it's warranted at this stage in development. I'll queue it up for gcc-9. I'm also going to push this BZ out to gcc-9. GCC 11.1 has been released, retargeting bugs to GCC 11.2. GCC 11.2 is being released, retargeting bugs to GCC 11.3 GCC 11.3 is being released, retargeting bugs to GCC 11.4. (In reply to Richard Biener from comment #19) > Shorter testcase, compilable and to the point. We are not able to CSE > the b1 && ... && b8 sequence because we produce control-flow for it > during gimplification. > > void bar (short *array, > short b1, short b2, short b3, short b4, > short b5, short b6, short b7, short b8) > { > array[0] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; > array[1] = b1 && b2 && b3 && b4 && b5 && b6 && b7 && b8; > } This reduced testcase we optimize since GCC 12. GCC 11.4 is being released, retargeting bugs to GCC 11.5. |