Created attachment 37964 [details]
partially reduced test case

Using gcc-6.0 for building the Linux kernel, I see one file (out of many hundred randconfig test builds so far) that occasionally hits the maximum per-function stack size limit. I have only tested on 32-bit ARM, and I have manually reduced the test case:

$ arm-linux-gnueabi-gcc -o lpfc_scsi.o -O2 -Wframe-larger-than=1024 lpfc_scsi.c
lpfc_scsi.c: In function 'lpfc_find_next_oas_lun':
lpfc_scsi.c:117:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]

$ arm-linux-gnueabi-gcc --version
arm-linux-gnueabi-gcc (GCC) 6.0.0 20160313 (experimental)

I see this with all options as long as -O2 is set, but not with -O3, -O1, or -Os. Normal stack usage for this function is usually 128 bytes.
> I have only tested on 32-bit ARM,

Looks to be a 32-bit ARM issue only. AArch64, both LP64 and ILP32, does not have a stack size issue. For both of those we get:

  stp x29, x30, [sp, -128]!

So only 128 bytes.
(In reply to Andrew Pinski from comment #1)
> Looks to be a 32-bit ARM issue only. AArch64, both LP64 and ILP32, does
> not have a stack size issue.

Confirmed.

GCC 5 uses about 136 bytes; GCC 6 uses 1160 bytes at -O2 for the reduced testcase.
(In reply to Ramana Radhakrishnan from comment #2)
> GCC 5 uses about 136 bytes; GCC 6 uses 1160 bytes at -O2 for the reduced
> testcase.

A few differences noted:

- AArch32 ends up inlining the memcpy. However, -fno-builtin makes very little difference to the stack size used, so that's really not an issue here.
- AArch64 does not inline the memcpy at all.

It does appear, though, that the register allocator manages much better on AArch64 than on AArch32 with the shape of the CFG. That is possibly one difference. The other difference I've noticed between GCC 5 and GCC 6 is that dom2 / jump threading looks like it duplicates a lot more blocks in GCC 6 compared to GCC 5, and I need to go back and look into that one.

The problem appears to stem from --param fsm-maximum-phi-arguments=7, but that's just a diagnostic tool to see what the trigger point is.

CCing law and vmakarov to see if they have any insights in this case.
I've tried out a few more things as well, to see if the alignment of the struct lpfc_name type or the builtin memcpy makes a difference. Replacing the array of eight bytes with a single uint64_t and scalar operations instead of string functions makes very little difference, so it seems to be neither of them.

However, I think the wwn_to_uint64_t() function is what causes the problem. This is supposed to be turned into a direct load or a byte-reversing load depending on endianness, but this apparently does not happen. Adding -mbig-endian to the compiler flags brings the stack usage down, so presumably the optimization step that identifies byteswap patterns is what causes the stack growth.
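For reference, the helper in question is an open-coded big-endian 64-bit load, roughly like this (a sketch modeled on the kernel's wwn_to_u64(); the exact form in lpfc may differ):

  #include <stdint.h>

  /* Build a 64-bit value from eight bytes in big-endian order,
     one shift/or per byte. */
  static inline uint64_t wwn_to_uint64_t(const uint8_t *wwn)
  {
          return (uint64_t)wwn[0] << 56 | (uint64_t)wwn[1] << 48 |
                 (uint64_t)wwn[2] << 40 | (uint64_t)wwn[3] << 32 |
                 (uint64_t)wwn[4] << 24 | (uint64_t)wwn[5] << 16 |
                 (uint64_t)wwn[6] <<  8 | (uint64_t)wwn[7];
  }

On a big-endian target this collapses into a single 64-bit load; on little-endian ARM it would need a byte-reversed load instead.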
(In reply to Arnd Bergmann from comment #4)
> However, I think the wwn_to_uint64_t() function is what causes the problem.
> This is supposed to be turned into a direct load or a byte-reversing load
> depending on endianness, but this apparently does not happen.

I'm not aware of such a transform - we have the 'bswap' pass on GIMPLE, but that only looks for the direct load case (-mbig-endian):

64 bit load in target endianness found at: _95 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _125 = MEM[(uint8_t *)target_wwpn_16(D)];
64 bit load in target endianness found at: _167 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _197 = MEM[(uint8_t *)target_wwpn_16(D)];
64 bit load in target endianness found at: _236 = MEM[(uint8_t *)vport_wwpn_14(D)];
64 bit load in target endianness found at: _266 = MEM[(uint8_t *)target_wwpn_16(D)];

as there is no "standard" target-independent way to express a byte-reversed load (optab or so). The closest we'd have is to "vectorize" this as

  tem = load v8qi;
  tem = vec_perm <tem, tem, { 7, 6, 5, 4, 3, 2, 1, 0 }>;
  ... = VIEW_CONVERT <uint64_t, tem>;

if both the vector mode exists and the constant permute is handled by the target. The target would then need to combine the load and the permute into a reversing load (if the target indeed has such an instruction).
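At the source level, that idea would look something like the following (a sketch using GNU C vector extensions; __builtin_shuffle is the generic permute builtin, and whether the load and the permute fold into a single reversing instruction is entirely up to the target):

  #include <stdint.h>
  #include <string.h>

  typedef uint8_t v8qi __attribute__((vector_size(8)));

  static uint64_t load_u64_byte_reversed(const uint8_t *p)
  {
          v8qi tem;
          memcpy(&tem, p, sizeof tem);              /* tem = load v8qi */
          tem = __builtin_shuffle(tem, (v8qi){7, 6, 5, 4, 3, 2, 1, 0});
          uint64_t result;                          /* VIEW_CONVERT */
          memcpy(&result, &tem, sizeof result);
          return result;
  }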
Btw, the bswap pass for arm doesn't detect any bswap32 or bswap64 instruction. There is

(define_expand "bswapsi2"
  [(set (match_operand:SI 0 "s_register_operand" "=r")
        (bswap:SI (match_operand:SI 1 "s_register_operand" "r")))]
  "TARGET_EITHER && (arm_arch6 || !optimize_size)"

but no bswapdi2.
The bswap pass could learn to split this up into two loads and bswaps and combine the results with a single shift and or.
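In source form, that expansion would compute something like this (a sketch for a little-endian host: an 8-byte big-endian read built from two 32-bit loads, two bswap32s, one shift, and one or):

  #include <stdint.h>
  #include <string.h>

  static uint64_t load_u64_be(const uint8_t *p)
  {
          uint32_t hi, lo;
          memcpy(&hi, p, 4);      /* first four bytes -> high half */
          memcpy(&lo, p + 4, 4);  /* last four bytes -> low half  */
          return ((uint64_t)__builtin_bswap32(hi) << 32)
                 | __builtin_bswap32(lo);
  }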
But not sure how this became a regression - does GCC 5 really do better here?

With -fno-tree-dominator-opts the warning no longer triggers and we get

  sub sp, sp, #428

so it may be another jump-threading related regression. With additional -fno-tree-vrp we get to

  sub sp, sp, #140

> grep '<< 56' t.c.211t.optimized
  _68 = _38 << 56;
  _98 = _97 << 56;

for both, and with VRP and DOM

> grep '<< 56' t.c.211t.optimized
  _68 = _38 << 56;
  _98 = _97 << 56;
  _137 = _138 << 56;
  _375 = _376 << 56;
  _342 = _343 << 56;
  _170 = _169 << 56;
  _603 = _602 << 56;
  _636 = _635 << 56;
  _209 = _208 << 56;
  _683 = _682 << 56;
  _716 = _715 << 56;
  _239 = _238 << 56;
  _523 = _522 << 56;
  _556 = _555 << 56;
  _300 = _301 << 56;
  _37 = _76 << 56;
  _443 = _442 << 56;
  _476 = _475 << 56;

so we get nine(!) copies of it.
(In reply to Richard Biener from comment #7)
> The bswap pass could learn to split this up into two loads and bswaps and
> combine the results with a single shift and or.

I'll add it to the bswap TODO list. Thanks.
(In reply to Thomas Preud'homme from comment #9)
> I'll add it to the bswap TODO list. Thanks.

Actually the pass handles it via __builtin_bswap64 already, plus generic code in optabs.c. But somehow bswap64 is still not detected for me on arm (it works on x86-64 with -m32).
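For testing the detection, an open-coded shift/mask idiom like the following is the kind of input the bswap pass is meant to recognize as __builtin_bswap64 (a sketch; whether this form is matched on arm is exactly what's in question):

  #include <stdint.h>

  /* Open-coded 64-bit byte swap: one mask plus shift per byte. */
  uint64_t swap64(uint64_t x)
  {
          return ((x & 0x00000000000000ffULL) << 56)
               | ((x & 0x000000000000ff00ULL) << 40)
               | ((x & 0x0000000000ff0000ULL) << 24)
               | ((x & 0x00000000ff000000ULL) <<  8)
               | ((x & 0x000000ff00000000ULL) >>  8)
               | ((x & 0x0000ff0000000000ULL) >> 24)
               | ((x & 0x00ff000000000000ULL) >> 40)
               | ((x & 0xff00000000000000ULL) >> 56);
  }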
Created attachment 37991 [details]
simpler test case without manual byte swap

For reference, I have sent a patch to the kernel to replace the open-coded byteswap with a __builtin_bswap64. This produces much better object code with both gcc-5 and gcc-6 than the original version and uses the native swap instructions, so it's certainly a good thing to do regardless of the resolution in gcc-6: http://thread.gmane.org/gmane.linux.kernel/2178301

For reference, these are the sizes of stack usage and function length using the actual source code from the kernel:

                       stack   length
gcc-5.3.1, linux-4.5:    156     1146
gcc-5.3.1, patched:       28      804
gcc-6.0.0, linux-4.5:   1192     5144
gcc-6.0.0, patched:       76     1612

I have adapted the test case now to no longer use unaligned data, memcpy, or manual byte swaps. The object code generated by gcc-6 is nowhere near as bad as with the original example, but still considerably worse than what I get with gcc-5.
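The direction of that patch is roughly the following (a sketch, not the actual kernel patch; my understanding is the real one goes through the kernel's own accessors such as get_unaligned_be64()):

  #include <stdint.h>
  #include <string.h>

  /* One 64-bit load plus a single byte swap, instead of eight
     byte loads combined with shifts and ors. */
  static inline uint64_t wwn_to_u64(const uint8_t *wwn)
  {
          uint64_t v;
          memcpy(&v, wwn, sizeof v);
          return __builtin_bswap64(v);    /* assumes a little-endian host */
  }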
So, except for regressions of at most 16 bytes, there were just two commits that made the #c0 testcase use significantly more stack: r227307 (from 140 to 604 bytes) and r229685 (from 612 to 1152 bytes), both FSM related.
WRT c#8, the FSM threader is much better at finding jump threading opportunities which involve multiple blocks in the thread path. I'm still not entirely happy with the heuristics around that.

Closely related, the FSM threader is particularly poor at identifying jump threads which use identical paths (starting from different incoming edges to the path). The FSM threader makes a unique copy of the path for each jump thread. The old threader would create one jump thread path for that case and redirect all appropriate edges to that common path - this can significantly reduce unnecessary block copying. I suspect (but have not confirmed) that this limitation of the FSM threader is playing a role here. Addressing that is near the top of the gcc-7 queue.

Anyway, I'll try to take a deeper look tomorrow.
So this is definitely related to the FSM threader not being able to share a single jump threading path. Here's an example:

j.c.110t.dom2: Registering FSM jump thread: (23, 25) incoming edge; (25, 28) (28, 27) (27, 11) (11, 13) (13, 14) (14, 24) nocopy; (14, 24)
j.c.110t.dom2: Registering FSM jump thread: (22, 25) incoming edge; (25, 28) (28, 27) (27, 11) (11, 13) (13, 14) (14, 24) nocopy; (14, 24)

Note that the only difference is the incoming edge. The old threader would duplicate the path once and redirect both incoming edges to that single duplicate path. The FSM threader doesn't have that ability and won't until gcc-7.

In bb27 and bb11 we have a fairly significant sequence of shifts and ior operations that result in the large number of duplicated statements. I'm a bit surprised that the statements-to-copy clamp didn't kick in here, and that's what I'll be looking at for gcc-6. For gcc-7, sharing jump thread paths is high on the TODO list.
So for the thread paths noted in c#15 we have the following pieces of data:

1. 69 statements to copy
2. 7 blocks to copy
3. Threads through the latch, but does not create an irreducible loop
4. Eliminates a simple conditional branch, not a multiway branch

tree-ssa-threadbackward is used for two cases. Its first (and original) purpose was to catch FSM loops. Those are characterized by threading through the latch and eliminating a multiway branch at the end of the path. These are very profitable, and as such they allow for a lot of copying (100 statements).

The second case is as a replacement for the older forward-walking jump threading done by VRP/DOM that we're phasing out. These are less profitable and have much lower copying thresholds.

#4 means that we've got the latter case, but the code is not identifying it as such: it tests !threaded_through_latch when it should be testing !(threaded_through_latch && threaded_multiway_branch). Fixing that results in all those threads around the loop being rejected as needing too many statement copies. Only two small/short jump threads are realized. All the undesirable copying is gone and we have a stack size of 140 bytes. See the sketch below for the shape of the check.
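A minimal sketch of that predicate follows (names and thresholds are illustrative only, not the real GCC internals; see the actual change to fsm_find_control_statement_thread_paths in r234409):

  #include <stdbool.h>

  /* A path qualifies as an FSM thread - and thus for the generous
     statement-copy budget - only when it threads through the loop
     latch AND eliminates a multiway branch. */
  static bool
  fsm_thread_p (bool threaded_through_latch, bool threaded_multiway_branch)
  {
    return threaded_through_latch && threaded_multiway_branch;
  }

  /* Buggy form: testing only !threaded_through_latch let a latch path
     ending in a simple conditional claim the FSM budget.
       if (!threaded_through_latch) ...
     Fixed form:
       if (!fsm_thread_p (threaded_through_latch,
                          threaded_multiway_branch)) ...  */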
Author: law
Date: Tue Mar 22 21:32:34 2016
New Revision: 234409

URL: https://gcc.gnu.org/viewcvs?rev=234409&root=gcc&view=rev
Log:
	PR target/70232
	* tree-ssa-threadbackward.c
	(fsm_find_control_statement_thread_paths): Correctly distinguish
	between old style jump threads vs FSM jump threads.

	PR target/70232
	* gcc.dg/tree-ssa/pr70232.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr70232.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-threadbackward.c
Fixed on the trunk.
I've confirmed that the current gcc trunk addresses the original problem in the Linux kernel build as well as in the reduced test case. Thanks a lot for working on this!