Bug 70232 - [6 regression] excessive stack usage with -O2
Summary: [6 regression] excessive stack usage with -O2
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.0
Importance: P1 normal
Target Milestone: 6.0
Assignee: Jeffrey A. Law
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-03-15 00:36 UTC by Arnd Bergmann
Modified: 2016-03-23 09:43 UTC
8 users

See Also:
Host:
Target: arm-linux-gnueabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-03-16 00:00:00


Attachments
partially reduced test case (1.03 KB, text/plain)
2016-03-15 00:36 UTC, Arnd Bergmann
simpler test case without manual byte swap (788 bytes, text/plain)
2016-03-16 17:27 UTC, Arnd Bergmann

Description Arnd Bergmann 2016-03-15 00:36:18 UTC
Created attachment 37964 [details]
partially reduced test case

Using gcc-6.0 for building the Linux kernel, I see one file (out of many hundred randconfig test builds so far) that occasionally hits the maximum per-function stack size limit. I have only tested on 32-bit ARM, and I have manually reduced the test case:

$ arm-linux-gnueabi-gcc -c -o lpfc_scsi.o -O2 -Wframe-larger-than=1024 lpfc_scsi.c
lpfc_scsi.c: In function 'lpfc_find_next_oas_lun':
lpfc_scsi.c:117:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]
$ arm-linux-gnueabi-gcc --version                                                                                                       
arm-linux-gnueabi-gcc (GCC) 6.0.0 20160313 (experimental)

I see this with any combination of other options as long as -O2 is set, but not with -O1, -O3, or -Os. Normal stack usage for this function is usually 128 bytes.
Comment 1 Andrew Pinski 2016-03-15 00:42:03 UTC
> I have only tested on 32-bit ARM,

This looks to be a 32-bit ARM-only issue.  AArch64, with both LP64 and
ILP32, does not have a stack size issue.
For both of those we get:
        stp     x29, x30, [sp, -128]!

So only 128 bytes.
Comment 2 Ramana Radhakrishnan 2016-03-16 09:14:06 UTC
(In reply to Andrew Pinski from comment #1)
> > I have only tested on 32-bit ARM,
>
> This looks to be a 32-bit ARM-only issue.  AArch64, with both LP64 and
> ILP32, does not have a stack size issue.
> For both of those we get:
>         stp     x29, x30, [sp, -128]!
>
> So only 128 bytes.

Confirmed.

GCC 5 uses about 136 bytes.
GCC 6 uses 1160 bytes at -O2 for the reduced testcase.
Comment 3 Ramana Radhakrishnan 2016-03-16 11:45:19 UTC
(In reply to Ramana Radhakrishnan from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > > I have only tested on 32-bit ARM,
> >
> > This looks to be a 32-bit ARM-only issue.  AArch64, with both LP64 and
> > ILP32, does not have a stack size issue.
> > For both of those we get:
> >         stp     x29, x30, [sp, -128]!
> >
> > So only 128 bytes.
>
> Confirmed.
>
> GCC 5 uses about 136 bytes.
> GCC 6 uses 1160 bytes at -O2 for the reduced testcase.

A few differences noted:

- AArch32 ends up inlining the memcpy.  However, -fno-builtin makes very little difference to the stack size used, so that's really not an issue here.

- AArch64 does not inline the memcpy at all.

The register allocator also appears to manage much better on AArch64 than on AArch32 with this shape of CFG; that is possibly one difference.  The other difference I've noticed between GCC 5 and GCC 6 is that dom2 / jump threading looks like it duplicates a lot more blocks in GCC 6 compared to GCC 5, and I need to go back and look into that one.

The problem appears to trigger at --param fsm-maximum-phi-arguments=7, but that's just a diagnostic tool to see what the trigger point is.

CCing law and vmakarov to see if they have any insights in this case.
Comment 4 Arnd Bergmann 2016-03-16 13:54:25 UTC
I've tried out a few more things as well, to see if the alignment of the struct lpfc_name type or the builtin memcpy makes a difference.  Replacing the array of eight bytes with a single uint64_t, and using scalar operations instead of string functions, makes very little difference, so it seems to be neither of them.

However, I think the wwn_to_uint64_t() function is what causes the problem.  This is supposed to be turned into a direct load or a byte-reversing load depending on endianness, but this apparently does not happen.

Adding -mbig-endian to the compiler flags brings the stack usage down, so presumably the optimization step that identifies byteswap patterns is what causes the stack growth.
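
For reference, wwn_to_uint64_t() has roughly the following shape (a sketch reconstructed from the shift/or patterns visible in the dumps below, not copied verbatim from the attached test case): an open-coded big-endian load that the compiler should be able to fold into a single (byte-reversing) 64-bit load.

#include <stdint.h>

static inline uint64_t wwn_to_uint64_t(const uint8_t *wwn)
{
        /* assemble the eight bytes MSB-first, independent of host endianness */
        return ((uint64_t)wwn[0] << 56) | ((uint64_t)wwn[1] << 48) |
               ((uint64_t)wwn[2] << 40) | ((uint64_t)wwn[3] << 32) |
               ((uint64_t)wwn[4] << 24) | ((uint64_t)wwn[5] << 16) |
               ((uint64_t)wwn[6] << 8)  |  (uint64_t)wwn[7];
}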
Comment 5 Richard Biener 2016-03-16 14:09:13 UTC
(In reply to Arnd Bergmann from comment #4)
> I've tried out a few more things as well, to see if the alignment of the
> struct lpfc_name type or the builtin memcpy makes a difference.  Replacing
> the array of eight bytes with a single uint64_t, and using scalar
> operations instead of string functions, makes very little difference, so
> it seems to be neither of them.
> 
> However, I think the wwn_to_uint64_t() function is what causes the problem.
> This is supposed to be turned into a direct load or a byte-reversing load
> depending on endianness, but this apparently does not happen.

I'm not aware of such a transform - we have the 'bswap' pass on GIMPLE,
but that only looks for the direct load case (-mbig-endian):

64 bit load in target endianness found at: _95 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _125 = MEM[(uint8_t *)target_wwpn_16(D)];
64 bit load in target endianness found at: _167 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _197 = MEM[(uint8_t *)target_wwpn_16(D)];
64 bit load in target endianness found at: _236 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _266 = MEM[(uint8_t *)target_wwpn_16(D)];

as there is no "standard" target-independent way to express a byte-reversed
load (an optab or so).  The closest we'd have is to "vectorize" this as
this as

  tem = load v8qi;
  tem = vec_perm <tem, tem, { 7, 6, 5, 4, 3, 2, 1, 0 }>;
  ... = VIEW_CONVERT <uint64_t, tem>;

if both the vector mode exists and the constant permute is handled
by the target.  The target would then need to combine the load
and the permute into a reversing load (if the target indeed has such
an instruction).
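
In source terms that sketch corresponds roughly to the following GNU C vector-extension code (illustrative only -- the transform would be done internally by the compiler, not written by hand, and the function name is made up):

#include <stdint.h>
#include <string.h>

typedef uint8_t v8qi __attribute__ ((vector_size (8)));

static uint64_t load_reversed_u64 (const uint8_t *p)
{
  v8qi tem;
  uint64_t r;

  memcpy (&tem, p, sizeof tem);    /* tem = load v8qi */
  tem = __builtin_shuffle (tem, (v8qi) { 7, 6, 5, 4, 3, 2, 1, 0 });
  memcpy (&r, &tem, sizeof r);     /* VIEW_CONVERT <uint64_t, tem> */
  return r;
}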

> Adding -mbig-endian to the compiler flags brings the stack usage down, so
> presumably the optimization step that identifies byteswap patterns is what
> causes the stack growth.
Comment 6 Richard Biener 2016-03-16 14:14:34 UTC
Btw, the bswap pass for arm doesn't detect any bswap32 or bswap64 instruction.

There is

(define_expand "bswapsi2"
  [(set (match_operand:SI 0 "s_register_operand" "=r")
        (bswap:SI (match_operand:SI 1 "s_register_operand" "r")))]
"TARGET_EITHER && (arm_arch6 || !optimize_size)"

but no bswapdi2.
Comment 7 Richard Biener 2016-03-16 14:18:23 UTC
The bswap pass could learn to split this up into two loads and bswaps and
combine the results with a single shift and or.
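
In C terms the suggested lowering would be something like this (a sketch assuming a little-endian target with only a 32-bit bswap instruction; the function name is illustrative):

#include <stdint.h>
#include <string.h>

static uint64_t load_bswapped_u64 (const uint8_t *p)
{
  uint32_t lo, hi;

  memcpy (&lo, p, 4);      /* bytes p[0..3] */
  memcpy (&hi, p + 4, 4);  /* bytes p[4..7] */
  /* two loads and two 32-bit bswaps, combined with one shift and one or */
  return ((uint64_t) __builtin_bswap32 (lo) << 32) | __builtin_bswap32 (hi);
}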
Comment 8 Richard Biener 2016-03-16 14:28:52 UTC
But I'm not sure how this became a regression - does GCC 5 really do better here?

With -fno-tree-dominator-opts the warning no longer triggers and we get

        sub     sp, sp, #428

so it may be another jump-threading-related regression.  With additional
-fno-tree-vrp we get to

        sub     sp, sp, #140

> grep '<< 56' t.c.211t.optimized
  _68 = _38 << 56;
  _98 = _97 << 56;

for both cases, while with VRP and DOM enabled we get

> grep '<< 56' t.c.211t.optimized
  _68 = _38 << 56;
  _98 = _97 << 56;
  _137 = _138 << 56;
  _375 = _376 << 56;
  _342 = _343 << 56;
  _170 = _169 << 56;
  _603 = _602 << 56;
  _636 = _635 << 56;
  _209 = _208 << 56;
  _683 = _682 << 56;
  _716 = _715 << 56;
  _239 = _238 << 56;
  _523 = _522 << 56;
  _556 = _555 << 56;
  _300 = _301 << 56;
  _37 = _76 << 56;
  _443 = _442 << 56;
  _476 = _475 << 56;

so we get nine(!) copies of it.
Comment 9 Thomas Preud'homme 2016-03-16 14:56:42 UTC
(In reply to Richard Biener from comment #7)
> The bswap pass could learn to split this up into two loads and bswaps and
> combine the results with a single shift and or.

I'll add it to the bswap TODO list. Thanks.
Comment 10 Richard Biener 2016-03-16 14:58:29 UTC
(In reply to Thomas Preud'homme from comment #9)
> (In reply to Richard Biener from comment #7)
> > The bswap pass could learn to split this up into two loads and bswaps and
> > combine the results with a single shift and or.
> 
> I'll add it to the bswap TODO list. Thanks.

Actually GCC handles this already via __builtin_bswap64 and generic code in optabs.c.

But somehow bswap64 is still not detected for me on arm (it works on x86-64 with
-m32).
Comment 11 Arnd Bergmann 2016-03-16 16:53:50 UTC
For reference, I have sent a patch to the kernel to replace the open-coded byteswap with a __builtin_bswap64. This produces much better object code with both gcc-5 and gcc-6 than the original version, so it's certainly a good thing to do regardless of the resolution in gcc-6:

http://thread.gmane.org/gmane.linux.kernel/2178301

For reference, these are the sizes of stack usage and function length:

gcc-5.3.1, linux-4.5:
Comment 12 Arnd Bergmann 2016-03-16 17:27:09 UTC
Created attachment 37991 [details]
simpler test case without manual byte swap

For reference, I have sent a patch to the kernel to replace the open-coded byteswap with a __builtin_bswap64. This produces much better object code with both gcc-5 and gcc-6 than the original version and uses the native swap instructions, so it's certainly a good thing to do regardless of the resolution in gcc-6:

http://thread.gmane.org/gmane.linux.kernel/2178301

For reference, these are the sizes of stack usage and function length using the actual source code from the kernel:

                         stack     length
gcc-5.3.1, linux-4.5:     156      1146
gcc-5.3.1, patched:        28       804
gcc-6.0.0, linux-4.5:    1192      5144
gcc-6.0.0, patched:        76      1612

I have now adapted the test case to no longer use unaligned data, memcpy, or manual byte swaps. The object code generated by gcc-6 is nowhere near as bad as with the original example, but still considerably worse than what I get with gcc-5.
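
For illustration, the patched conversion boils down to something like this (a sketch assuming a little-endian host; the exact kernel code may differ in detail):

#include <stdint.h>
#include <string.h>

static inline uint64_t wwn_to_uint64_t (const uint8_t *wwn)
{
  uint64_t v;

  memcpy (&v, wwn, sizeof v);    /* unaligned-safe 8-byte load */
  return __builtin_bswap64 (v);  /* big-endian wire order to host order */
}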
Comment 13 Jakub Jelinek 2016-03-17 15:53:27 UTC
So, except for regressions of at most 16 bytes, there were just two commits that made the #c0 testcase use significantly more stack:
r227307 (from 140 to 604) and r229685 (from 612 to 1152), both FSM-related.
Comment 14 Jeffrey A. Law 2016-03-21 23:21:34 UTC
WRT c#8, the FSM threader is much better at finding jump threading opportunities that involve multiple blocks in the thread path.  I'm still not entirely happy with the heuristics around that.

Closely related, the FSM threader is particularly poor at identifying jump threads which use identical paths (starting from different incoming edges to the path).  The FSM threader makes a unique copy of the path for each jump thread.  The old threader would create one jump thread path for that case and redirect all appropriate edges to that common path -- this can significantly reduce unnecessary block copying.  I suspect (but have not confirmed) that this limitation of the FSM threader is playing a role here.  Addressing that is near the top of the gcc-7 queue.

Anyway, I'll try to take a deeper look tomorrow.
Comment 15 Jeffrey A. Law 2016-03-22 16:23:30 UTC
So this is definitely related to the FSM threader not being able to share a single jump threading path.  Here's an example:

j.c.110t.dom2:  Registering FSM jump thread: (23, 25) incoming edge;  (25, 28)  (28, 27)  (27, 11)  (11, 13)  (13, 14)  (14, 24) nocopy; (14, 24) 
j.c.110t.dom2:  Registering FSM jump thread: (22, 25) incoming edge;  (25, 28)  (28, 27)  (27, 11)  (11, 13)  (13, 14)  (14, 24) nocopy; (14, 24) 


Note that the only difference is the incoming edge.  The old threader would duplicate the path once and redirect both incoming edges to that single duplicate path.   The FSM threader doesn't have that ability and won't until gcc-7.

In bb27 and bb11 we have a fairly significant sequence of shift and ior operations, which results in the large number of duplicated statements.

I'm a bit surprised that the statements-to-copy clamp didn't kick in here and that's what I'll be looking at for gcc-6.  For gcc-7 sharing jump thread paths is high on the TODO list.
Comment 16 Jeffrey A. Law 2016-03-22 18:48:05 UTC
So for the thread paths noted in c#15 we have the following pieces of data:

1. 69 statements to copy

2. 7 blocks to copy

3. Threads through latch, but does not create an irreducible loop

4. Eliminates a simple conditional branch, not a multiway branch


tree-ssa-threadbackward is used for two cases.  Its first (and original) purpose was to catch FSM loops.  Those are characterized by threading through the latch and eliminating a multiway branch at the end of the path.  These are very profitable and as such they allow for a lot of copying (100 statements).

The second case is as a replacement for the older forward-walking jump threading done by VRP/DOM that we're phasing out.  These are less profitable and have much lower copying thresholds.

#4 means that we've got the latter case.  But the code is not identifying it as such.  The code tests !threaded_through_latch when it should be testing !(threaded_through_latch && threaded_multiway_branch).

Fixing that results in all those threads around the loop being rejected as needing too many statement copies.  Only two small/short jump threads are realized.  All the undesirable copying is gone and we have a stack size of 140 bytes.
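
In pseudo-C, the corrected check looks roughly like this (a sketch of the logic only, not the exact GCC source; reject_path () is a hypothetical helper, and the budgets correspond to --param max-fsm-thread-path-insns and --param max-jump-thread-duplication-stmts):

  /* A path only earns the generous FSM copying budget (about 100
     statements) when it both threads through the loop latch and
     eliminates a multiway branch.  */
  bool real_fsm_thread_p = threaded_through_latch && threaded_multiway_branch;

  /* The buggy code tested only !threaded_through_latch here, so the
     c#15 paths (through the latch, but ending in a simple conditional
     branch) claimed the large budget and copied 69 statements.  */
  if (!real_fsm_thread_p
      && n_insns > max_jump_thread_duplication_stmts /* default 15 */)
    reject_path ();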
Comment 17 Jeffrey A. Law 2016-03-22 21:33:06 UTC
Author: law
Date: Tue Mar 22 21:32:34 2016
New Revision: 234409

URL: https://gcc.gnu.org/viewcvs?rev=234409&root=gcc&view=rev
Log:
	PR target/70232
	* tree-ssa-threadbackward.c
	(fsm_find_control_statement_thread_paths): Correctly distinguish
	between old style jump threads vs FSM jump threads.

	PR target/70232
	* gcc.dg/tree-ssa/pr70232.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr70232.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-threadbackward.c
Comment 18 Jeffrey A. Law 2016-03-22 21:41:35 UTC
Fixed on the trunk.
Comment 19 Arnd Bergmann 2016-03-23 09:43:05 UTC
I've confirmed that current gcc trunk addresses the original problem in the Linux kernel build, not just the reduced test case. Thanks a lot for working on this!