100314 – missed optimization for dead code elimination at -O3 (vs. -O1) (inlining differences due to missed dse)

Bug 100314 - missed optimization for dead code elimination at -O3 (vs. -O1) (inlining differences due to missed dse)

Summary: missed optimization for dead code elimination at -O3 (vs. -O1) (inlining dif...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	ipa (show other bugs)
Version:	12.0

Importance:	P3 normal
Target Milestone:	15.0
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	100220
	Show dependency tree / graph

Reported:	2021-04-28 10:17 UTC by Zhendong Su
Modified:	2024-05-16 09:05 UTC (History)
CC List:	5 users (show)

See Also:	100220 100188
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2022-01-11 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Zhendong Su 2021-04-28 10:17:26 UTC

[592] % gcctk -v
Using built-in specs.
COLLECT_GCC=gcctk
COLLECT_LTO_WRAPPER=/local/suz-local/software/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk/configure --disable-bootstrap --prefix=/local/suz-local/software/local/gcc-trunk --enable-languages=c,c++ --disable-werror --enable-multilib --with-system-zlib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20210428 (experimental) [master revision 852dd866e2f:9b04e5b2651:b81e2d5e76a6bcc71f45b122e8b5538ddb7ebf4c] (GCC) 
[593] % 
[593] % gcctk -O1 -S -o O1.s small.c
[594] % gcctk -O3 -S -o O3.s small.c
[595] % 
[595] % wc O1.s O3.s
  21   45  392 O1.s
  42   86  686 O3.s
  63  131 1078 total
[596] % 
[596] % grep foo O1.s
[597] % grep foo O3.s
	jmp	foo
[598] % 
[598] % cat small.c
extern void foo(void);
static int a, *c, g, **j;
int b;
static void e() {
  int k, *l[5] = {&k, &k, &k, &k, &k};
  while (g) {
    j = &l[0];
    b++;
  }
}
static void d(int m) {
  int **h[30] = {&c}, ***i[1] = {&h[3]};
  if (m)
    foo();
  e();
}
int main() {
  d(a);
  return 0;
}

Comment 1 Andrew Pinski 2021-09-25 10:05:04 UTC

Confirmed, similar issue to PR 100220.

Comment 2 Richard Biener 2022-01-11 09:53:36 UTC

the key is to notice that 'a' is not modified and thus zero and propagate that to 'd'.  With -O3 IPA figures out 'a' is not modified but that isn't taken into
account by IPA CP which sees

  a.0_1 = a;
  d (a.0_1);

with -O1 we happen to inline everything during IPA and nothing early while
with -O3 'e' is early inlined into 'd' (but not d into main) and IPA
refuses to inline d into main:

t.c:18:3: missed:   not inlinable: main/7 -> d/6, --param large-stack-frame-growth limit reached

sth is off with the stack limit I guess, maybe it's not cummulated correctly
at -O1.

Comment 3 Jan Hubicka 2022-01-11 10:36:55 UTC

With -O1 i get:
IPA function summary for main/7 inlinable
  global time:     72.936364
  self size:       6
  global size:     19
  min size:       16
  self stack:      0
  global stack:    44
    size:15.000000, time:67.636364
    size:3.000000, time:2.000000,  executed if:(not inlined)
  calls:
    d/6 inlined
      freq:1.00
      Stack frame offset 0, callee self size 0
      e/5 inlined
        freq:1.00
        Stack frame offset 0, callee self size 44
      foo/8 function body not available
        freq:0.33 loop depth: 0 size: 1 time: 10

So we get 44 bytes of stack use.

However with -O3 we see:
IPA function summary for main/7 inlinable
  global time:     14.000000
  self size:       6
  global size:     6
  min size:       3
  self stack:      0
  global stack:    0
    size:1.000000, time:1.000000
    size:3.000000, time:2.000000,  executed if:(not inlined)
  calls:
    d/6 function not considered for inlining
      freq:1.00 loop depth: 0 size: 2 time: 11 callee size:11 stack:284

IPA function summary for d/6 inlinable
  global time:     87.936364
  self size:       23
  global size:     23
  min size:       17
  self stack:      284
  global stack:    284
    size:17.000000, time:80.636364
    size:3.000000, time:2.000000,  executed if:(not inlined)
    size:2.000000, time:2.000000,  nonconst if:(op0 changed)
  calls:
    foo/8 function body not available
      freq:0.33 loop depth: 0 size: 1 time: 10 predicate: (op0 != 0)

So now the stack is 284 bytes for d which hits the limits.
It seems to be array h[30] (240 bytes) and l[5] (40 bytes).

With -O1 array h is optimized out completely however with -O3 I see:
void d (int m)                                                                  
{                                                                               
  int k;                                                                        
  int * l[5];                                                                   
  int * * h[30];                                                                
  int b.1_8;                                                                    
  int _9;                                                                       
  int g.2_10;                                                                   
                                                                                
  <bb 2> [local count: 118111600]:                                              
  MEM <char[232]> [(int * *[30] *)&h + 8B] = {};                                
  h[0] = &c;                                                                    
  if (m_5(D) != 0)                                                              
    goto <bb 3>; [33.00%]                                                       
  else                                                                          
    goto <bb 4>; [67.00%]                                                       
                                                                                
  <bb 3> [local count: 38976828]:                                               
  foo ();                                                                       
                                                                                
  <bb 4> [local count: 118111600]:                                              
  l[0] = &k;                                                                    
  l[1] = &k;                                                                    
  l[2] = &k;                                                                    
  l[3] = &k;                                                                    
  l[4] = &k;                                                                    
  goto <bb 6>; [100.00%]                                                        
                                                                                
  <bb 5> [local count: 955630225]:                                              
  j = &l[0];                                                                    
  b.1_8 = b;                                                                    
  _9 = b.1_8 + 1;                                                               
  b = _9;                                                                       
                                                                                
  <bb 6> [local count: 1073741824]:                                             
  g.2_10 = g;                                                                   
  if (g.2_10 != 0)                                                              
    goto <bb 5>; [89.00%]                                                       
  else                                                                          
    goto <bb 7>; [11.00%]                                                       
                                                                                
  <bb 7> [local count: 118111600]:                                              
  k ={v} {CLOBBER};                                                             
  l ={v} {CLOBBER};                                                             
  h ={v} {CLOBBER};                                                             
  return;                                                                       
                                                                                
}                                                                               

So h is dead but DSE did not happen since at DSE time we see:

  <bb 2> :                                                                      
  MEM <char[232]> [(int * *[30] *)&h + 8B] = {};                                
  h[0] = &c;                                                                    
  i[0] = &h[3];                                                                 
  if (m_6(D) != 0)                                                              
    goto <bb 3>; [INV]                                                          
  else                                                                          
    goto <bb 4>; [INV]                                                          
                                                                                
  <bb 3> :                                                                      
  foo ();                                                                       
                                                                                
  <bb 4> :                                                                      

Which keeps h alive.

Comment 4 Jan Hubicka 2022-01-11 10:40:34 UTC

However i is also dead at dse time:
void d (int m)                                                                  
{                                                                               
  int k;                                                                        
  int * l[5];                                                                   
  int * * * i[1];                                                               
  int * * h[30];                                                                
  int b.1_11;                                                                   
  int _12;                                                                      
  int g.2_13;                                                                   
                                                                                
  <bb 2> :                                                                      
  MEM <char[232]> [(int * *[30] *)&h + 8B] = {};                                
  h[0] = &c;                                                                    
  i[0] = &h[3];                                                                 
  if (m_6(D) != 0)                                                              
    goto <bb 3>; [INV]                                                          
  else                                                                          
    goto <bb 4>; [INV]                                                          
                                                                                
  <bb 3> :                                                                      
  foo ();                                                                       
                                                                                
  <bb 4> :                                                                      
  l[0] = &k;                                                                    
  l[1] = &k;                                                                    
  l[2] = &k;                                                                    
  l[3] = &k;                                                                    
  l[4] = &k;                                                                    
  goto <bb 6>; [100.00%]                                                        
                                                                                
  <bb 5> :                                                                      
  j = &l[0];                                                                    
  b.1_11 = b;                                                                   
  _12 = b.1_11 + 1;                                                             
  b = _12;                                                                      
                                                                                
  <bb 6> :                                                                      
  g.2_13 = g;                                                                   
  if (g.2_13 != 0)                                                              
    goto <bb 5>; [89.00%]                                                       
  else                                                                          
    goto <bb 7>; [11.00%]                                                       
                                                                                
  <bb 7> :                                                                      
  k ={v} {CLOBBER};                                                             
  l ={v} {CLOBBER};                                                             
  h ={v} {CLOBBER};                                                             
  i ={v} {CLOBBER};                                                             
  return;                                                                       
                                                                                
}                                                                               

Not sure why DSE is not optimizing this out.

Comment 5 Andrew Pinski 2023-08-18 02:32:25 UTC

(In reply to Richard Biener from comment #2)
> the key is to notice that 'a' is not modified and thus zero and propagate
> that to 'd'.  With -O3 IPA figures out 'a' is not modified but that isn't
> taken into

And that is the same issue as PR 100188 (and maybe others).

Comment 6 Richard Biener 2024-05-15 16:44:56 UTC

Testing a patch.

Comment 7 GCC Commits 2024-05-16 09:04:19 UTC

The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:1e0ae1f52741f7e0133661659ed2d210f939a398

commit r15-571-g1e0ae1f52741f7e0133661659ed2d210f939a398
Author: Richard Biener <rguenther@suse.de>
Date:   Wed May 15 18:32:37 2024 +0200

    tree-optimization/79958 - make DSE track multiple paths
    
    DSE currently gives up when the path we analyze forks.  This leads
    to multiple missed dead store elimination PRs.  The following fixes
    this by recursing for each path and maintaining the visited bitmap
    to avoid visiting CFG re-merges multiple times.  The overall cost
    is still limited by the same bound, it's just more likely we'll hit
    the limit now.  The patch doesn't try to deal with byte tracking
    once a path forks but drops info on the floor and only handling
    fully dead stores in that case.
    
            PR tree-optimization/79958
            PR tree-optimization/109087
            PR tree-optimization/100314
            PR tree-optimization/114774
            * tree-ssa-dse.cc (dse_classify_store): New forwarder.
            (dse_classify_store): Add arguments cnt and visited, recurse
            to track multiple paths when we end up with multiple defs.
    
            * gcc.dg/tree-ssa/ssa-dse-48.c: New testcase.
            * gcc.dg/tree-ssa/ssa-dse-49.c: Likewise.
            * gcc.dg/tree-ssa/ssa-dse-50.c: Likewise.
            * gcc.dg/tree-ssa/ssa-dse-51.c: Likewise.
            * gcc.dg/graphite/pr80906.c: Avoid DSE of last data reference
            in loop.
            * g++.dg/ipa/devirt-24.C: Adjust for extra DSE.
            * g++.dg/warn/Wuninitialized-pr107919-1.C: Use more important
            -O2 optimization level, -O1 regresses.

Comment 8 Richard Biener 2024-05-16 09:05:48 UTC

Should be fixed for GCC 15.