[592] % gcctk -v Using built-in specs. COLLECT_GCC=gcctk COLLECT_LTO_WRAPPER=/local/suz-local/software/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk/configure --disable-bootstrap --prefix=/local/suz-local/software/local/gcc-trunk --enable-languages=c,c++ --disable-werror --enable-multilib --with-system-zlib Thread model: posix Supported LTO compression algorithms: zlib gcc version 12.0.0 20210428 (experimental) [master revision 852dd866e2f:9b04e5b2651:b81e2d5e76a6bcc71f45b122e8b5538ddb7ebf4c] (GCC) [593] % [593] % gcctk -O1 -S -o O1.s small.c [594] % gcctk -O3 -S -o O3.s small.c [595] % [595] % wc O1.s O3.s 21 45 392 O1.s 42 86 686 O3.s 63 131 1078 total [596] % [596] % grep foo O1.s [597] % grep foo O3.s jmp foo [598] % [598] % cat small.c extern void foo(void); static int a, *c, g, **j; int b; static void e() { int k, *l[5] = {&k, &k, &k, &k, &k}; while (g) { j = &l[0]; b++; } } static void d(int m) { int **h[30] = {&c}, ***i[1] = {&h[3]}; if (m) foo(); e(); } int main() { d(a); return 0; }
Confirmed, similar issue to PR 100220.
the key is to notice that 'a' is not modified and thus zero and propagate that to 'd'. With -O3 IPA figures out 'a' is not modified but that isn't taken into account by IPA CP which sees a.0_1 = a; d (a.0_1); with -O1 we happen to inline everything during IPA and nothing early while with -O3 'e' is early inlined into 'd' (but not d into main) and IPA refuses to inline d into main: t.c:18:3: missed: not inlinable: main/7 -> d/6, --param large-stack-frame-growth limit reached sth is off with the stack limit I guess, maybe it's not cummulated correctly at -O1.
With -O1 i get: IPA function summary for main/7 inlinable global time: 72.936364 self size: 6 global size: 19 min size: 16 self stack: 0 global stack: 44 size:15.000000, time:67.636364 size:3.000000, time:2.000000, executed if:(not inlined) calls: d/6 inlined freq:1.00 Stack frame offset 0, callee self size 0 e/5 inlined freq:1.00 Stack frame offset 0, callee self size 44 foo/8 function body not available freq:0.33 loop depth: 0 size: 1 time: 10 So we get 44 bytes of stack use. However with -O3 we see: IPA function summary for main/7 inlinable global time: 14.000000 self size: 6 global size: 6 min size: 3 self stack: 0 global stack: 0 size:1.000000, time:1.000000 size:3.000000, time:2.000000, executed if:(not inlined) calls: d/6 function not considered for inlining freq:1.00 loop depth: 0 size: 2 time: 11 callee size:11 stack:284 IPA function summary for d/6 inlinable global time: 87.936364 self size: 23 global size: 23 min size: 17 self stack: 284 global stack: 284 size:17.000000, time:80.636364 size:3.000000, time:2.000000, executed if:(not inlined) size:2.000000, time:2.000000, nonconst if:(op0 changed) calls: foo/8 function body not available freq:0.33 loop depth: 0 size: 1 time: 10 predicate: (op0 != 0) So now the stack is 284 bytes for d which hits the limits. It seems to be array h[30] (240 bytes) and l[5] (40 bytes). With -O1 array h is optimized out completely however with -O3 I see: void d (int m) { int k; int * l[5]; int * * h[30]; int b.1_8; int _9; int g.2_10; <bb 2> [local count: 118111600]: MEM <char[232]> [(int * *[30] *)&h + 8B] = {}; h[0] = &c; if (m_5(D) != 0) goto <bb 3>; [33.00%] else goto <bb 4>; [67.00%] <bb 3> [local count: 38976828]: foo (); <bb 4> [local count: 118111600]: l[0] = &k; l[1] = &k; l[2] = &k; l[3] = &k; l[4] = &k; goto <bb 6>; [100.00%] <bb 5> [local count: 955630225]: j = &l[0]; b.1_8 = b; _9 = b.1_8 + 1; b = _9; <bb 6> [local count: 1073741824]: g.2_10 = g; if (g.2_10 != 0) goto <bb 5>; [89.00%] else goto <bb 7>; [11.00%] <bb 7> [local count: 118111600]: k ={v} {CLOBBER}; l ={v} {CLOBBER}; h ={v} {CLOBBER}; return; } So h is dead but DSE did not happen since at DSE time we see: <bb 2> : MEM <char[232]> [(int * *[30] *)&h + 8B] = {}; h[0] = &c; i[0] = &h[3]; if (m_6(D) != 0) goto <bb 3>; [INV] else goto <bb 4>; [INV] <bb 3> : foo (); <bb 4> : Which keeps h alive.
However i is also dead at dse time: void d (int m) { int k; int * l[5]; int * * * i[1]; int * * h[30]; int b.1_11; int _12; int g.2_13; <bb 2> : MEM <char[232]> [(int * *[30] *)&h + 8B] = {}; h[0] = &c; i[0] = &h[3]; if (m_6(D) != 0) goto <bb 3>; [INV] else goto <bb 4>; [INV] <bb 3> : foo (); <bb 4> : l[0] = &k; l[1] = &k; l[2] = &k; l[3] = &k; l[4] = &k; goto <bb 6>; [100.00%] <bb 5> : j = &l[0]; b.1_11 = b; _12 = b.1_11 + 1; b = _12; <bb 6> : g.2_13 = g; if (g.2_13 != 0) goto <bb 5>; [89.00%] else goto <bb 7>; [11.00%] <bb 7> : k ={v} {CLOBBER}; l ={v} {CLOBBER}; h ={v} {CLOBBER}; i ={v} {CLOBBER}; return; } Not sure why DSE is not optimizing this out.
(In reply to Richard Biener from comment #2) > the key is to notice that 'a' is not modified and thus zero and propagate > that to 'd'. With -O3 IPA figures out 'a' is not modified but that isn't > taken into And that is the same issue as PR 100188 (and maybe others).
Testing a patch.
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>: https://gcc.gnu.org/g:1e0ae1f52741f7e0133661659ed2d210f939a398 commit r15-571-g1e0ae1f52741f7e0133661659ed2d210f939a398 Author: Richard Biener <rguenther@suse.de> Date: Wed May 15 18:32:37 2024 +0200 tree-optimization/79958 - make DSE track multiple paths DSE currently gives up when the path we analyze forks. This leads to multiple missed dead store elimination PRs. The following fixes this by recursing for each path and maintaining the visited bitmap to avoid visiting CFG re-merges multiple times. The overall cost is still limited by the same bound, it's just more likely we'll hit the limit now. The patch doesn't try to deal with byte tracking once a path forks but drops info on the floor and only handling fully dead stores in that case. PR tree-optimization/79958 PR tree-optimization/109087 PR tree-optimization/100314 PR tree-optimization/114774 * tree-ssa-dse.cc (dse_classify_store): New forwarder. (dse_classify_store): Add arguments cnt and visited, recurse to track multiple paths when we end up with multiple defs. * gcc.dg/tree-ssa/ssa-dse-48.c: New testcase. * gcc.dg/tree-ssa/ssa-dse-49.c: Likewise. * gcc.dg/tree-ssa/ssa-dse-50.c: Likewise. * gcc.dg/tree-ssa/ssa-dse-51.c: Likewise. * gcc.dg/graphite/pr80906.c: Avoid DSE of last data reference in loop. * g++.dg/ipa/devirt-24.C: Adjust for extra DSE. * g++.dg/warn/Wuninitialized-pr107919-1.C: Use more important -O2 optimization level, -O1 regresses.
Should be fixed for GCC 15.