Bug 92657

Summary: High stack usage due ftree-ch
Product: gcc Reporter: Adhemerval Zanella <adhemerval.zanella>
Component: rtl-optimization Assignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal Keywords: missed-optimization
Priority: P3    
Version: 10.0   
Target Milestone: ---   
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2019-11-26 00:00:00
Attachments: High stack usage due ftree-ch

Description Adhemerval Zanella 2019-11-25 13:00:15 UTC
Created attachment 47351
High stack usage due ftree-ch

The code snippet (gcc_free_ch_stack.c) shows high stack usage.  With GCC 9.2.1 I see the following stack usage (in bytes) using -fstack-usage along with -O2:

arm                     632
aarch64                 448
powerpc                 912
powerpc64le             560
s390                    600
s390x                   632
i386                    1376
x86_64                  784

The same issue also shows up on the master branch. It seems to be due to the -ftree-ch pass, which feeds -ftree-loop-im, -ftree-pre, -fmove-loop-invariants, and -fgcse. Andrew Pinski suggested it is mostly due to the lack of a good register pressure estimate for loop invariant code motion.

Andrew also suggested using -fno-tree-loop-im -fno-tree-pre -fno-gcse; however, even with these options the resulting stack usage is not on par with the -Os option (which disables -ftree-ch).  On powerpc64le:

$ ./gcc/xgcc -v 2>&1 | grep 'gcc version'
gcc version 10.0.0 20191121 (experimental) (GCC) 

$ ./gcc/xgcc -B gcc -O2 stack_usage.c -fstack-usage -c; cat stack_usage.su
stack_usage.c:157:6:mlx5e_grp_sw_update_stats	496	static

$ ./gcc/xgcc -B gcc -O2 stack_usage.c -fstack-usage -c -fno-tree-loop-im -fno-tree-pre -fno-move-loop-invariants -fno-gcse; cat stack_usage.su
stack_usage.c:157:6:mlx5e_grp_sw_update_stats	176	static

$ ./gcc/xgcc -B gcc -Os stack_usage.c -fstack-usage -c; cat stack_usage.su
stack_usage.c:157:6:mlx5e_grp_sw_update_stats	32	static
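
For reference, each .su line above consists of three tab-separated fields: the function identifier (file:line:column:name), the stack size in bytes, and a qualifier such as static. Below is a minimal C sketch for extracting the size from one such line, so that runs with different flag sets can be compared mechanically. The helper name su_stack_bytes is hypothetical and not part of GCC.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the stack size in bytes from one line of a
 * -fstack-usage (.su) file, or -1 if the line does not look like
 * "<identifier>\t<bytes>\t<qualifier>".
 */
long su_stack_bytes(const char *line)
{
    assert(line != NULL);

    /* The first tab separates the function identifier from the size. */
    const char *tab = strchr(line, '\t');
    if (tab == NULL)
        return -1;

    char *end;
    long bytes = strtol(tab + 1, &end, 10);

    /* Reject lines with no digits or a negative size after the tab. */
    if (end == tab + 1 || bytes < 0)
        return -1;

    /* The number must be followed by a tab, a newline, or end of string. */
    if (*end != '\t' && *end != '\n' && *end != '\0')
        return -1;

    return bytes;
}
```

Applied to the three powerpc64le results above, this would return 496, 176, and 32 respectively.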
Comment 1 Andrew Pinski 2019-11-25 14:07:49 UTC
Again, this is not due to tree-ch at all.  This is due to the code motion passes moving invariant loads/stores out of the loop.  The tree-ch pass just allows those passes to work.

All three (gcse, tree pre and tree lim) need to be disabled to see the difference as all three are able to do the transformation.
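
To illustrate the kind of transformation meant here, the following is a minimal C sketch. It is not the attached testcase, and the struct and function names are hypothetical. sum_in_memory updates the totals through a pointer on every iteration, while sum_hoisted is a hand-written equivalent of what moving the loads and stores out of the loop amounts to: each total is loaded once, kept in a local for the whole loop, and stored back once at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; real code may have far more fields. */
struct chan_stats { uint64_t packets, bytes, drops, errors; };
struct sw_totals  { uint64_t packets, bytes, drops, errors; };

/* Loop form: the totals are read and written in memory on each iteration. */
void sum_in_memory(struct sw_totals *total,
                   const struct chan_stats *chan, size_t n)
{
    assert(total != NULL && (chan != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        total->packets += chan[i].packets;
        total->bytes   += chan[i].bytes;
        total->drops   += chan[i].drops;
        total->errors  += chan[i].errors;
    }
}

/* Hand-hoisted form: one live local per counter across the whole loop. */
void sum_hoisted(struct sw_totals *total,
                 const struct chan_stats *chan, size_t n)
{
    assert(total != NULL && (chan != NULL || n == 0));
    uint64_t packets = total->packets, bytes = total->bytes;
    uint64_t drops = total->drops, errors = total->errors;

    for (size_t i = 0; i < n; i++) {
        packets += chan[i].packets;
        bytes   += chan[i].bytes;
        drops   += chan[i].drops;
        errors  += chan[i].errors;
    }

    /* Single store per counter after the loop. */
    total->packets = packets;
    total->bytes   = bytes;
    total->drops   = drops;
    total->errors  = errors;
}
```

With four counters the hoisted form is harmless. With many more counters than the target has registers, the locals that are live across the loop no longer fit and the excess has to be spilled to the stack, which is the effect being discussed.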
Comment 2 Adhemerval Zanella 2019-11-25 14:11:25 UTC
(In reply to Andrew Pinski from comment #1)
> Again, this is not due to tree-ch at all.  This is due to the code motion
> passes moving invariant loads/stores out of the loop.  The tree-ch pass just
> allows those passes to work.
> 
> All three (gcse, tree pre and tree lim) need to be disabled to see the
> difference as all three are able to do the transformation.

Sorry if I was not clear: tree-ch is not the culprit itself, but rather it enables further optimizations that increase register pressure.  As I noted, disabling gcse, tree pre, and tree lim does reduce total stack usage, but it does not reach the same level as disabling tree-ch.
Comment 3 Adhemerval Zanella 2019-11-25 14:26:45 UTC
(In reply to Adhemerval Zanella from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > Again, this is not due to tree-ch at all.  This is due to the code motion
> > passes moving invariant loads/stores out of the loop.  The tree-ch pass just
> > allows those passes to work.
> > 
> > All three (gcse, tree pre and tree lim) need to be disabled to see the
> > difference as all three are able to do the transformation.
> 
> Sorry if I was not clear: tree-ch is not the culprit itself, but rather it
> enables further optimizations that increase register pressure.  As I noted,
> disabling gcse, tree pre, and tree lim does reduce total stack usage, but
> it does not reach the same level as disabling tree-ch.

Ok, gcse, tree pre and tree lim are just three of the flags that are increasing the stack usage.  Other flags enabled by -O2 but not by -Os are increasing it as well.

Maybe changing the title to "High stack usage with tree-loop-im, tree-pre, and gcse"?
Comment 4 Richard Biener 2019-11-26 07:51:04 UTC
From a quick look it's a classical testcase for excessive store-motion plus
PRE and GCSE managing to do half of that.

So in essence there are probably duplicates of this bug and what we miss
is something of a register pressure estimation framework on GIMPLE (we do
have multiple sketches of that spread across some passes).  The main issue
here is (as can be seen here) that implementing such estimation in one
pass doesn't solve the issue but merely pushes it elsewhere.

Note that for i?86 with SSE, STV is also an offender:

t.c:157:6:mlx5e_grp_sw_update_stats     1376    static
t.c:157:6:mlx5e_grp_sw_update_stats     936     static   with -mno-stv
Comment 5 Arnd Bergmann 2020-01-05 11:15:03 UTC
Submitted a workaround for the warning that triggered this bug report in the Linux kernel:

https://lore.kernel.org/lkml/20200104215156.689245-1-arnd@arndb.de/