Bug 79149 - bad optimization on MIPS and ARM leading to excessive stack usage in some cases
Summary: bad optimization on MIPS and ARM leading to excessive stack usage in some cases
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 7.0
: P3 normal
Target Milestone: ---
Assignee: Maxim Kuvyrkov
URL:
Keywords: missed-optimization
Depends on: 11488
Blocks:
  Show dependency treegraph
 
Reported: 2017-01-19 15:38 UTC by Arnd Bergmann
Modified: 2017-06-16 14:12 UTC (History)
6 users (show)

See Also:
Host:
Target: arm-linux-gnueabi, mips-linux
Build:
Known to work:
Known to fail: 7.0
Last reconfirmed: 2017-06-16 00:00:00


Attachments
compressed preprocessed source for wp512 crypto from linux kernel (204.18 KB, application/gzip)
2017-01-19 15:38 UTC, Arnd Bergmann
Details
wp512 reference source code, standalone version (19.30 KB, text/plain)
2017-01-20 11:31 UTC, Arnd Bergmann
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Arnd Bergmann 2017-01-19 15:38:07 UTC
Created attachment 40546 [details]
compressed preprocessed source for wp512 crypto from linux kernel

Build-testing the Linux kernel on MIPS shows warnings about possible kernel stack overflow in the kernel's crypto libraries, in particular the wp512 algorithm, as the stack frame grows beyond the normal per-function limit of 1024 bytes we use in the kernel.

I have been able to avoid the problem by turning off two specific optimizations,
and I have reproduced the same thing with both gcc-4.9 and gcc-7.0:

$ /home/arnd/cross-gcc/bin/mips-linux-gcc-7.0.0 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/mips-linux-gcc-7.0.0 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 304 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/mips-linux-gcc-4.9.3 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 1096 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/mips-linux-gcc-4.9.3 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic 
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]

To cross-check the problem, I have tried compiling the same file on ARM, which shows similar results but stays below the warning limit:

/home/arnd/cross-gcc/bin/arm-linux-gnueabi-gcc-7.0.0 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 816 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/arm-linux-gnueabi-gcc-7.0.0 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 344 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/arm-linux-gnueabi-gcc-4.9.3 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i  
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 840 bytes is larger than 100 bytes [-Wframe-larger-than=]

$ /home/arnd/cross-gcc/bin/arm-linux-gnueabi-gcc-4.9.3 -fno-strict-aliasing -O2 -c -Wall -Wframe-larger-than=100 -Wno-pointer-sign  wp512.i   -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic
../../crypto/wp512.c: In function 'wp512_process_buffer':
../../crypto/wp512.c:987:1: warning: the frame size of 376 bytes is larger than 100 bytes [-Wframe-larger-than=]

However, using an x86 compiler, the frame for the same source is always under 300 bytes, and the options have no effect.
Comment 1 Arnd Bergmann 2017-01-20 09:22:30 UTC
Additional information:

I see the same behavior to a varying degree on most other architectures (but notably not x86) using the preprocessed source from the MIPS kernel configuration, these are always one run with -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic and one run without:

=== /home/arnd/cross-gcc/bin/aarch64-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 368 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/alpha-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 240 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1136 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2092 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2208 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/cris-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/frv-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 296 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 192 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 276 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 644 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/i386-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m32r-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 332 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 716 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m68k-linux-gcc-6.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/microblaze-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 280 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1108 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 208 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1328 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1096 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-7.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 304 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 144 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1088 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 208 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1088 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/s390-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 456 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/sh3-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 292 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 292 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/sparc64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 208 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 992 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/sparc-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 320 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 672 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/x86_64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/x86_64-linux-gcc-6.1.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/x86_64-linux-gcc-7.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 256 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 256 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/xtensa-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 304 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1152 bytes is larger than 100 bytes [-Wframe-larger-than=]

I also tried reproducing the behavior with the official reference implementation of wp512, from http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html, which is  very similar in the wp512_process_buffer() function but so far does not show the problem!
Comment 2 Arnd Bergmann 2017-01-20 11:31:10 UTC
Created attachment 40554 [details]
wp512 reference source code, standalone version

After checking a bit more, I found that the reference source code implementation does behave exactly like the in-kernel version after all, and I was able to do some performance timing (using qemu-user) on it as well.

Building Whirlpool.c using "mips64el-linux-gnuabi64-gcc-5 -O2 -Wframe-larger-than=100 Whirlpool.c -o Whirlpool-mips-smallstack -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" in this case uses 256 bytes of stack in the processBuffer and run for 87 seconds doing 10000000 iterations in qemu, while the version without "fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" takes 230 seconds and needs 1520 bytes of stack. The extra time is apparently spent spilling registers to the stack.

The same test with arm32 shows a less significant version of the same behavior, with the stack shrinking from 832 to 352 bytes, and the time improving from 301 seconds to 217 seconds.

Obviously it would be helpful to do the same tests on actual hardware, as benchmarking in an emulated machine can be very misleading.
Comment 3 Andrew Pinski 2017-01-20 12:25:04 UTC
There is another bug referring to the pre-register allocation schedule messing up. X86 does not have this scheduler turNed on by default.
Comment 4 Andrew Pinski 2017-01-20 12:25:56 UTC
Can you try -fno-sched-insns?
Comment 5 Arnd Bergmann 2017-01-20 16:14:32 UTC
-fno-schedule-insns is comparable in stack frame size to "-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" on all architectures (give or take a few bytes), but actually produces much better code.

In my simulated mips64 run, I see these numbers:

-O2:  49.0Mbit/s
-O2 -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic: 109.7 Mbit/s
-O2 -fno-schedule-insns: 179.2 Mbit/s

The trend is the same on arm an aarch64 for emulated runs, and I confirmed earlier that the results on real hardware are comparable to what we get in qemu.
Comment 6 Maxim Kuvyrkov 2017-01-20 20:01:25 UTC
Without looking at the code (it's 11pm) my guess is that 1st scheduling pass is misbehaving in some way, most likely it is doing a lot of interblock moves.  One of the big differences between x86 and ARM/MIPS scheduling is that x86 disables interblock scheduling.  Does -fno-schedule-insns fix the warnings on ARM/MIPS?
Comment 7 Andrew Pinski 2017-01-20 20:06:20 UTC
See also PR11488.
Comment 8 Andrew Pinski 2017-01-20 20:08:01 UTC
The other thing to do is try with -fsched-pressure .  PowerPC turns on -fsched-pressure by default (see PR 11488).
Comment 9 Arnd Bergmann 2017-01-20 21:16:08 UTC
The warning seems to reliably disappear with -fno-schedule-insns, on every combination I've tried it produces better (smaller stack and faster code) or identical results to -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic for the test case, but the margins vary a lot depending on gcc version and architecture. I tried various gcc versions on ARM, as well as gcc-4.9.3 across many architectures.

"-fsched-pressure" on mips64 helps a lot (factor 2 in both frame size and performance) but is still worse than "-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" or "-fno-schedule-insns", which give factor 2.5 to 3.5 in performance and reduce the stack size to what it should be (220 to 272 bytes).
I tried gcc-4.9 and gcc-7.0 here, which show the same behavior, though "gcc-4.9 -fno-schedule-insns" is faster by a good margin (factor 1.1 to 1.2) compared to the second fastest ("gcc-7 -fno-schedule-insns").

On arm and aarch64, "-fsched-pressure" has no effect on this test case that I can see (have not compared the object files, but frame size and performance are unchanged). "-fno-schedule-insns" is noticeably better, with frame size of 350 bytes on ARM compared to the default 824, and performance better by factor 1.6 compared to the default -O2. I also looked at the frame sizes on my older arm compilers and saw the same on all 8 versions I have (4.5 through 7.0).
Comment 10 Arnd Bergmann 2017-01-22 19:58:00 UTC
(In reply to Arnd Bergmann from comment #9)

> "-fsched-pressure" on mips64 helps a lot
> ...
> On arm and aarch64, "-fsched-pressure" has no effect

I realized later that on arm and aarch64, -fsched-pressure is enabled by default. Disabling it on these two makes it as bad as mips64, which has it disabled by default.
Comment 11 Arnd Bergmann 2017-02-06 17:06:06 UTC
I've submitted a workaround for the kernel now, addressing the stack usage warning on MIPS, as well as performance on ARM and others:

https://patchwork.kernel.org/patch/9555183/

The patch has two different workarounds, as I found that adding -Wno-schedule-insns gives us the best results on the whirlpool512 code for both stack size and performance by a wide margin, while -fsched-pressure is better on stack size for "serpent" across architectures and compiler versions

However, it is interesting to notice that arm-linux-gnueabi-gcc-7 produces worse
results with the serpent source code in terms of stack size with the default
"-fsched-pressure" ("press") than older versions, and worse than -fno-schedule-insns (nosched):

                                    default press   nopress nosched
    arm-linux-gnueabi-gcc-4.4.7     592                     440
    arm-linux-gnueabi-gcc-4.5.4     776     448     776     544
    arm-linux-gnueabi-gcc-4.6.4     776     448     776     544
    arm-linux-gnueabi-gcc-4.7.4     768     448     768     544
    arm-linux-gnueabi-gcc-4.8.5     488     488     776     544
    arm-linux-gnueabi-gcc-4.9.3     552     552     776     536
    arm-linux-gnueabi-gcc-5.3.1     552     552     776     536
    arm-linux-gnueabi-gcc-6.1.1     560     560     776     536
    arm-linux-gnueabi-gcc-7.0.1     616     616     808     536

If we want to continue investigating this, I can try to construct a standalone test case for performance testing on 'serpent' as well.
Comment 12 Wilco 2017-02-07 15:40:22 UTC
Does wp512 use 64-bit types? If so, this is likely PR77308.
Comment 13 Arnd Bergmann 2017-02-07 16:10:46 UTC
(In reply to wilco from comment #12)
> Does wp512 use 64-bit types? If so, this is likely PR77308.

Yes, as seen in the attachment it uses lots of 64-bit operations. However, it sounds like PR77308 is ARM specific, but I see the same behavior
on most other architectures, including 64-bit ones. Quoting the
kernel patch I linked to, with stack frame sizes for the function
depending on architecture, optimization flags and compiler version
(only 4.9 and 7.0 here, there is little difference anyway)

default: -O2
press:	 -O2 -fsched-pressure
nopress: -O2 -fschedule-insns -fno-sched-pressure
nosched: -O2 -no-schedule-insns (disables sched-pressure)

				default	press	nopress	nosched
alpha-linux-gcc-4.9.3		1136	848	1136	176
am33_2.0-linux-gcc-4.9.3	2100	2076	2100	2104
arm-linux-gnueabi-gcc-4.9.3	848	848	1048	352
cris-linux-gcc-4.9.3		272	272	272	272
frv-linux-gcc-4.9.3		1128	1000	1128	280
hppa64-linux-gcc-4.9.3		1128	336	1128	184
hppa-linux-gcc-4.9.3		644	308	644	276
i386-linux-gcc-4.9.3		352	352	352	352
m32r-linux-gcc-4.9.3		720	656	720	268
microblaze-linux-gcc-4.9.3	1108	604	1108	256
mips64-linux-gcc-4.9.3		1328	592	1328	208
mips-linux-gcc-4.9.3		1096	624	1096	240
powerpc64-linux-gcc-4.9.3	1088	432	1088	160
powerpc-linux-gcc-4.9.3		1080	584	1080	224
s390-linux-gcc-4.9.3		456	456	624	360
sh3-linux-gcc-4.9.3		292	292	292	292
sparc64-linux-gcc-4.9.3		992	240	992	208
sparc-linux-gcc-4.9.3		680	592	680	312
x86_64-linux-gcc-4.9.3		224	240	272	224
xtensa-linux-gcc-4.9.3		1152	704	1152	304

aarch64-linux-gcc-7.0.0		224	224	1104	208
arm-linux-gnueabi-gcc-7.0.1	824	824	1048	352
mips-linux-gcc-7.0.0		1120	648	1120	272
mips64-linux-gcc-7.0.0		1072	608	1072	224
x86_64-linux-gcc-7.0.1		240	240	304	240
Comment 14 Wilco 2017-02-08 13:45:49 UTC
(In reply to Arnd Bergmann from comment #13)
> (In reply to wilco from comment #12)
> > Does wp512 use 64-bit types? If so, this is likely PR77308.
> 
> Yes, as seen in the attachment it uses lots of 64-bit operations. However,
> it sounds like PR77308 is ARM specific, but I see the same behavior
> on most other architectures, including 64-bit ones. Quoting the
> kernel patch I linked to, with stack frame sizes for the function
> depending on architecture, optimization flags and compiler version
> (only 4.9 and 7.0 here, there is little difference anyway)

The 64-bit expansion issues in PR77308 are ARM specific indeed, but it also shows scheduling causes unnecessary high register pressure and extra spilling.

Your results indicate that -fsched-pressure should really be the default. And given that it still results in more spilling compared to not scheduling, it probably needs to be made less aggressive or compute pressure more accurately.
Comment 15 Ramana Radhakrishnan 2017-06-16 14:12:24 UTC
Well given all the comments, confirmed then ... :)