[Bug tree-optimization/99788] missed optimization for dead code elimination at -O3 (vs. -O1)

Fri Mar 26 11:48:26 GMT 2021

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99788

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2021-03-26
            Version|unknown                     |11.0
          Component|ipa                         |tree-optimization
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The issue is that at -O3 we inline e().  Inside e() itself we
eliminate the call to foo since the preceding for() loop does not terminate
(CCP figures this out), but the inline copy has the loop header PHI not yet
simplified at the point CCP runs (and CCP doesn't run again later):

  <bb 3> [local count: 43379093]:
  a = 1;
  a.3_4 = a;

  <bb 4> [local count: 350976297]:
  # a.3_3 = PHI <a.3_5(4), a.3_4(3)>
  a.2_6 = (unsigned char) a.3_3;
  _7 = a.2_6 + 2;
  _8 = (char) _7;
  a = _8;
  a.3_5 = a;
  if (a.3_5 != 0)
    goto <bb 4>; [87.64%]
  else
    goto <bb 5>; [12.36%]

  <bb 5> [local count: 43379093]:
  foo ();

vs.

  <bb 3> [local count: 955630225]:
  # a.3_22 = PHI <_3(3), 1(2)>
  a.2_1 = (unsigned char) a.3_22;
  _2 = a.2_1 + 2;
  _3 = (char) _2;
  a = _3;
  if (_3 != 0)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111600]:
  foo ();

and the difference starts with loop header copying, which is applied to
the out-of-line copy of the loop but not to the inline copy.

Analyzing loop 1
Loop 1 is not do-while loop: latch is not empty.
    Will duplicate bb 4
  Not duplicating bb 3: it is single succ.
Duplicating header of the loop 1 up to edge 4->3, 3 insns.
Loop 1 is do-while loop
Loop 1 is now do-while loop.

vs.

Analyzing loop 1
Analyzing loop 2
Loop 2 is not do-while loop: latch is not empty.
  Not duplicating bb 5: optimizing for size.

where the decision to optimize for size is made because this is main().
Renaming main() to baz() fixes the issue.
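As a rough illustration (not the reporter's test case), the loop in the
dumps can be modeled as `for (a = 1; a != 0; a += 2)` on an 8-bit value.
The sketch below is an assumption-laden reconstruction from the GIMPLE
above; the function names are made up for this example.  It checks that
such a loop can never reach 0, and shows the shape loop header copying
produces: the exit test is duplicated in front of the loop so the body
becomes a do-while.  With a constant start value the duplicated test
folds away, which would explain the constant 1 on the entry edge of the
PHI in the out-of-line dump.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if an 8-bit counter that starts at
   START and advances by STEP revisits START before ever becoming 0,
   i.e. a loop of the form `for (a = START; a != 0; a += STEP)` cannot
   exit.  Returns 0 if the exit test eventually succeeds.  */
static int loop_never_exits(unsigned char start, unsigned char step)
{
    unsigned char a = start;
    for (int i = 0; i < 256; i++) {
        if (a == 0)
            return 0;                 /* exit test succeeds */
        a = (unsigned char)(a + step);
        if (a == start)
            return 1;                 /* full cycle without seeing 0 */
    }
    /* The cycle length of an 8-bit counter divides 256.  */
    assert(!"unreachable");
    return 0;
}

/* Original shape: the exit test sits in the loop header.  Only call
   this with arguments for which the loop terminates.  */
static unsigned count_top_tested(unsigned char start, unsigned char step)
{
    unsigned iterations = 0;
    for (unsigned char a = start; a != 0; a = (unsigned char)(a + step))
        iterations++;
    return iterations;
}

/* Shape after header copying: a copy of the exit test guards entry,
   and the loop itself tests at the bottom (do-while).  */
static unsigned count_header_copied(unsigned char start, unsigned char step)
{
    unsigned iterations = 0;
    unsigned char a = start;
    if (a != 0) {                     /* duplicated header test */
        do {
            iterations++;
            a = (unsigned char)(a + step);
        } while (a != 0);
    }
    return iterations;
}
```

For start 1 and step 2 the counter stays odd, so loop_never_exits (1, 2)
returns 1; the two counting variants agree for any terminating input.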

But I wonder why we inline e() into cold main at all.  Honza?  I see

Processing frequency f/9
  Called by main/11 that is normal or hot
t.c:24:3: note: Inlining f/9 to main/11 with frequency 1.00

so here main() is normal or hot but loop header copying sees
optimize_loop_for_size_p () == true!?

IPA inlining sees

Considering d/10 with 20 size
 to be inlined into main/11 in t.c:17
 Estimated badness is -0.000046, frequency 0.00.
    Badness calculation for main/11 -> d/10
      size growth 16, time 8428.908463 unspec 8428.908463
      -0.000011: guessed profile. frequency 0.000400, count -1 caller count -1
time saved 0.004400 overall growth -4 (current) -4 (original) -4 (compensated)
      Adjusted by hints -0.000046
Updated mod-ref summary for main/11
  loads:
    Limits: 32 bases, 16 refs
    Every base
  stores:
    Limits: 32 bases, 16 refs
                Accounting size:17.00, time:2.97 on predicate exec:(true)
Processing frequency d/10
  Called by main/11 that is executed once
Processing frequency e/13
  Called by d/10 that is executed once
Node e/13 promoted to executed once.
                Accounting size:-2.00, time:-0.00 on predicate exec:(true)
                Accounting size:1.00, time:0.40 on predicate exec:(true)
t.c:17:5: optimized:  Inlined d/10 into main/11 which now has time 8.370758 and size 24, net change of -4.

so something is off with how we process speed/size optimization.  Note
that the loop copy in main also seems to become cold because it is guarded
by if (b), whose true edge is predicted as very cold:

  <bb 2> [local count: 1073741824]:
  b.0_2 = b;
  if (b.0_2 != 0)
    goto <bb 8>; [0.04%]
  else
    goto <bb 7>; [99.96%]

  <bb 8> [local count: 429496]:

  <bb 3> [local count: 43379093]:
  a = 1;
  goto <bb 5>; [100.00%]

  <bb 4> [local count: 350976297]:
  a.2_6 = (unsigned char) a.3_5;
  _7 = a.2_6 + 2;
  _8 = (char) _7;
  a = _8;

  <bb 5> [local count: 394355390]:
  a.3_5 = a;
  if (a.3_5 != 0)
    goto <bb 4>; [89.00%]
  else
    goto <bb 6>; [11.00%]

Still, when the function is not called main(), the
optimize_loop_for_size_p () predicate does not evaluate to true (with the
exact same local profile as above!).
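To get a feel for how cold the guarded path is, the local counts in the
dump can be cross-checked by scaling a block count with the printed edge
probability.  This is only a sketch: scale_count is a hypothetical helper
using plain truncating integer arithmetic, whereas GCC's own profile code
uses its own fixed-point representation, and the percentages in the dump
are rounded, so exact agreement is not guaranteed in general.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: scale a basic-block count by an edge
   probability given as the fraction NUM/DEN, truncating the result.  */
static uint64_t scale_count(uint64_t count, uint64_t num, uint64_t den)
{
    assert(den != 0);
    return count * num / den;
}
```

With the numbers above, scale_count (1073741824, 4, 10000) gives 429496,
the count shown for bb 8 behind the 0.04% edge, and
scale_count (394355390, 89, 100) gives 350976297, the count of bb 4.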