[Bug tree-optimization/99788] missed optimization for dead code elimination at -O3 (vs. -O1)
rguenth at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Fri Mar 26 11:48:26 GMT 2021
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99788
Richard Biener <rguenth at gcc dot gnu.org> changed:
What            |Removed     |Added
----------------------------------------------------------------------------
Last reconfirmed|            |2021-03-26
Version         |unknown     |11.0
Component       |ipa         |tree-optimization
Ever confirmed  |0           |1
Status          |UNCONFIRMED |NEW
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. At -O3 we inline e(). Inside the out-of-line e() we eliminate
the call to foo because the preceding for() loop does not terminate (CCP
figures this out), but in the inline copy the loop header PHI is not yet
simplified at the point CCP runs (and CCP doesn't run again later):
<bb 3> [local count: 43379093]:
a = 1;
a.3_4 = a;
<bb 4> [local count: 350976297]:
# a.3_3 = PHI <a.3_5(4), a.3_4(3)>
a.2_6 = (unsigned char) a.3_3;
_7 = a.2_6 + 2;
_8 = (char) _7;
a = _8;
a.3_5 = a;
if (a.3_5 != 0)
goto <bb 4>; [87.64%]
else
goto <bb 5>; [12.36%]
<bb 5> [local count: 43379093]:
foo ();
vs.
<bb 3> [local count: 955630225]:
# a.3_22 = PHI <_3(3), 1(2)>
a.2_1 = (unsigned char) a.3_22;
_2 = a.2_1 + 2;
_3 = (char) _2;
a = _3;
if (_3 != 0)
goto <bb 3>; [89.00%]
else
goto <bb 4>; [11.00%]
<bb 4> [local count: 118111600]:
foo ();
and the difference starts with loop header copying, which is applied to
the out-of-line copy but not to the inline copy of the loop.
Analyzing loop 1
Loop 1 is not do-while loop: latch is not empty.
Will duplicate bb 4
Not duplicating bb 3: it is single succ.
Duplicating header of the loop 1 up to edge 4->3, 3 insns.
Loop 1 is do-while loop
Loop 1 is now do-while loop.
vs.
Analyzing loop 1
Analyzing loop 2
Loop 2 is not do-while loop: latch is not empty.
Not duplicating bb 5: optimizing for size.
where the decision to optimize for size is made because this is main().
Renaming main() to baz() fixes the issue.
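As an illustration of what loop header copying does at the source level, here is a minimal sketch. It is not GCC's implementation, and the function names sum_while and sum_rotated are hypothetical. The header test is duplicated in front of the loop as a guard, and the loop itself becomes a do-while with the test at the bottom:

```c
/* Header-tested form: the exit condition is checked at the top of every
   iteration, so the loop is not in do-while form. */
static int sum_while(int n)
{
    int s = 0, i = 0;
    while (i < n) {
        s += i;
        i++;
    }
    return s;
}

/* Shape after header copying: the test is copied in front of the loop as
   a guard, and the loop tests its condition at the bottom (do-while). */
static int sum_rotated(int n)
{
    int s = 0, i = 0;
    if (i < n) {
        do {
            s += i;
            i++;
        } while (i < n);
    }
    return s;
}
```

Both functions compute the same result for every n; only the position of the exit test differs.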
But I wonder why we inline e() into cold main at all. Honza? I see
Processing frequency f/9
Called by main/11 that is normal or hot
t.c:24:3: note: Inlining f/9 to main/11 with frequency 1.00
so here main() is normal or hot but loop header copying sees
optimize_loop_for_size_p () == true!?
IPA inlining sees
Considering d/10 with 20 size
to be inlined into main/11 in t.c:17
Estimated badness is -0.000046, frequency 0.00.
Badness calculation for main/11 -> d/10
size growth 16, time 8428.908463 unspec 8428.908463
-0.000011: guessed profile. frequency 0.000400, count -1 caller count -1
time saved 0.004400 overall growth -4 (current) -4 (original) -4 (compensated)
Adjusted by hints -0.000046
Updated mod-ref summary for main/11
loads:
Limits: 32 bases, 16 refs
Every base
stores:
Limits: 32 bases, 16 refs
Accounting size:17.00, time:2.97 on predicate exec:(true)
Processing frequency d/10
Called by main/11 that is executed once
Processing frequency e/13
Called by d/10 that is executed once
Node e/13 promoted to executed once.
Accounting size:-2.00, time:-0.00 on predicate exec:(true)
Accounting size:1.00, time:0.40 on predicate exec:(true)
t.c:17:5: optimized: Inlined d/10 into main/11 which now has time 8.370758 and size 24, net change of -4.
So something is off with how we handle the speed/size optimization
decision. Note that the loop copy in main also appears to become cold
because it is guarded by if (b), which is predicted as very cold:
<bb 2> [local count: 1073741824]:
b.0_2 = b;
if (b.0_2 != 0)
goto <bb 8>; [0.04%]
else
goto <bb 7>; [99.96%]
<bb 8> [local count: 429496]:
<bb 3> [local count: 43379093]:
a = 1;
goto <bb 5>; [100.00%]
<bb 4> [local count: 350976297]:
a.2_6 = (unsigned char) a.3_5;
_7 = a.2_6 + 2;
_8 = (char) _7;
a = _8;
<bb 5> [local count: 394355390]:
a.3_5 = a;
if (a.3_5 != 0)
goto <bb 4>; [89.00%]
else
goto <bb 6>; [11.00%]
Still, when the function is not called main(), the
optimize_loop_for_size_p () predicate does not evaluate to true (with the
exact same local profile as above!).
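For reference, the non-termination that makes the call to foo () dead can be checked by hand. The following is a minimal sketch that models the loop body from the dumps (start at 1, add 2 with 8-bit wrap-around, exit when the value is 0). The helper names step and reaches_zero_from_one are hypothetical, and unsigned char is used as an assumption to keep the wrap-around well defined:

```c
#include <stdbool.h>

/* One iteration of the loop body from the dump:
   a.2 = (unsigned char) a; _7 = a.2 + 2; a = (char) _7;
   modelled here on unsigned char so the 8-bit wrap-around is well defined. */
static unsigned char step(unsigned char a)
{
    return (unsigned char)(a + 2);
}

/* Returns true if the value ever becomes 0 when starting from 1.
   256 steps are enough to cover every possible 8-bit state. */
static bool reaches_zero_from_one(void)
{
    unsigned char a = 1;
    for (int i = 0; i < 256; i++) {
        a = step(a);
        if (a == 0)
            return true;
    }
    return false;
}
```

Since the value starts odd and each step adds 2, it stays odd and never reaches 0, so the exit edge to the block calling foo () is never taken.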