Bug 83356 - [7 Regression] excessive stack usage compiling with -O2 -fsanitize=bounds -fsanitize=object-size
Summary: [7 Regression] excessive stack usage compiling with -O2 -fsanitize=bounds -fs...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: sanitizer (show other bugs)
Version: 7.2.1
: P2 normal
Target Milestone: 8.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
: 85362 (view as bug list)
Depends on:
Blocks: 83651
  Show dependency treegraph
 
Reported: 2017-12-10 20:50 UTC by Arnd Bergmann
Modified: 2019-11-14 10:52 UTC (History)
6 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work: 7.1.0, 8.0
Known to fail: 7.2.0, 7.4.0, 7.5.0
Last reconfirmed: 2017-12-18 00:00:00


Attachments
The AES crypto implementation from linux, preprocessed (157.97 KB, application/gzip)
2017-12-10 20:50 UTC, Arnd Bergmann
Details
gcc7-pr83356.patch (1.20 KB, patch)
2017-12-19 16:22 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Arnd Bergmann 2017-12-10 20:50:16 UTC
Created attachment 42826 [details]
The AES crypto implementation from linux, preprocessed

After upgrading from gcc-7.1.1 (20170717) to 7.2.1 (20171130), I observed a warning when building the kernel in configurations with UBSAN:

x86_64-linux-gcc-7.2.1 -Wframe-larger-than=1024 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing -O2 -Wall -c aes_generic.i 
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 3840 bytes is larger than 1024 bytes [-Wframe-larger-than=]
/git/arm-soc/crypto/aes_generic.c: In function 'aes_decrypt':
/git/arm-soc/crypto/aes_generic.c:1441:1: warning: the frame size of 3840 bytes is larger than 1024 bytes [-Wframe-larger-than=]

With gcc-7.1.1, the stack frames were 304 bytes each, gcc-8.0.0 uses 512 bytes, and with "gcc-7.2.1 -Os -fsanitize=bounds -fsanitize=object-size", it's even smaller at 56 bytes.

I have only noticed this regression with one file in the kernel so far, crypto/aes_generic.c, all other files apparently stay below the stack frame warning limit.
Comment 1 Martin Liška 2017-12-18 13:18:34 UTC
Confirmed, only GCC 7 branch is affected by gcc-7-branch@251376:

Where related part is backport of PR target/81921:

            PR target/81921
            * config/i386/i386.c: Include symbol-summary.h, ipa-prop.h
            and ipa-inline.h.
            (ix86_can_inline_p): When ix86_fpmath flags do not match
            check whether the callee uses FP math at all.
    
            * gcc/testsuite/gcc.target/i386/pr81921.c: New testcase.

That is probably wrong:
https://gcc.gnu.org/ml/gcc-patches/2017-11/msg00178.html

Anyway it's an inlining that is enabled, let me reduce a test-case that will illustrate the problem.
Comment 2 Martin Liška 2017-12-18 13:42:58 UTC
(In reply to Martin Liška from comment #1)
> Confirmed, only GCC 7 branch is affected by gcc-7-branch@251376:
> 
> Where related part is backport of PR target/81921:
> 
>             PR target/81921
>             * config/i386/i386.c: Include symbol-summary.h, ipa-prop.h
>             and ipa-inline.h.
>             (ix86_can_inline_p): When ix86_fpmath flags do not match
>             check whether the callee uses FP math at all.
>     
>             * gcc/testsuite/gcc.target/i386/pr81921.c: New testcase.
> 

So no, it's a different story: backport of trunk@254218 is responsible for that.
Can you please Richi take a look?
Comment 3 Richard Biener 2017-12-18 14:19:11 UTC
Well, there's nothing we can do here if PRE decides to do more now (the possible effect of the fix).  We are simply finding a _lot_ more opportunities to hoist/PRE stuff (all sanitizer stuff...).  And PRE has no throttling on whatsoever doing that.  It PREs/hoists the arguments to UBSAN_{BOUNDS,OBJECT_SIZE} but not
the IFNs themselves (those are neither pure nor const and I'm not sure PRE
would handle them at all).

In the process of doing this it moves address calculation to common place
and thus enlarging the lifetime of the addressed objects.

There are a lot of redundant sanitization checks but the IFNs have
side-effects (I believe they eventually abort (), right?) so we can't
make them pure or const.  I believe we don't have any category of
calls that may abort but can be commoned.  PRE just does the job for
the arguments...

I believe the way the function is written is just asking for troubles...

Maybe sanitize instrumentation can also be more clever avoiding all the
redundancies...  (seems to happen on GENERIC already for BOUNDS).

That said, can't see an easy workaround but to change the source and/or
not use -fsanitize= and expect decent code quality.
Comment 4 Jakub Jelinek 2017-12-18 14:33:14 UTC
(In reply to Richard Biener from comment #3)
(I believe they eventually abort (), right?) so we can't

Well, not abort, either runtime error message + die, or runtime error message + continue, or __builtin_trap depending on options.

> Maybe sanitize instrumentation can also be more clever avoiding all the
> redundancies...  (seems to happen on GENERIC already for BOUNDS).

The sanopt pass removes the redundancies that can be removed.
Comment 5 Arnd Bergmann 2017-12-18 15:10:14 UTC
(In reply to Richard Biener from comment #3)

> That said, can't see an easy workaround but to change the source and/or
> not use -fsanitize= and expect decent code quality.

I don't see a good way to modify the source code without risking performance regressions in other configurations (arch, compiler version or flags), and the AES cipher is a rather performance critical part of the kernel.

I have experimentally shown that passing either -fno-tree-sra and/or -fno-tree-pre improves the stack usage in this file significantly (312 bytes with ubsan, 192 bytes without), though it's still around twice as much as with gcc-7.1.1, both with and without sanitizers.

I have not yet tried to analyze or measure the resulting object code though.
Comment 6 rguenther@suse.de 2017-12-18 15:11:34 UTC
On Mon, 18 Dec 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> 
> --- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #3)
> (I believe they eventually abort (), right?) so we can't
> 
> Well, not abort, either runtime error message + die, or runtime error message +
> continue, or __builtin_trap depending on options.
> 
> > Maybe sanitize instrumentation can also be more clever avoiding all the
> > redundancies...  (seems to happen on GENERIC already for BOUNDS).
> 
> The sanopt pass removes the redundancies that can be removed.

Hum, that runs pretty late...  can't we run it at the end of early
opts (as well?)?  That would also remove load from optimizers.
Comment 7 Jakub Jelinek 2017-12-18 15:20:06 UTC
(In reply to rguenther@suse.de from comment #6)
> On Mon, 18 Dec 2017, jakub at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83356
> > 
> > --- Comment #4 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #3)
> > (I believe they eventually abort (), right?) so we can't
> > 
> > Well, not abort, either runtime error message + die, or runtime error message +
> > continue, or __builtin_trap depending on options.
> > 
> > > Maybe sanitize instrumentation can also be more clever avoiding all the
> > > redundancies...  (seems to happen on GENERIC already for BOUNDS).
> > 
> > The sanopt pass removes the redundancies that can be removed.
> 
> Hum, that runs pretty late...  can't we run it at the end of early
> opts (as well?)?  That would also remove load from optimizers.

Well, right now that pass does both the optimizations (removing redundant checks; this is the sanopt_optimize part) and lowering of the IFNs etc.
The optimization part supposedly could be done multiple times, the rest must be done just once and should be done late.
So, if we add a bool to the pass (set_pass_param, clone) which would say if it is the late pass and we'd return right after the sanopt_optimize call if not the latest one, this could be done.
Comment 8 Jakub Jelinek 2017-12-18 21:00:48 UTC
I'll try it tomorrow.
Comment 9 Jakub Jelinek 2017-12-19 16:22:14 UTC
Created attachment 42921 [details]
gcc7-pr83356.patch

Untested patch for the second sanopt.  This doesn't change anything on this testcase though, because we don't sanopt_optimize IFN_UBSAN_BOUNDS nor IFN_UBSAN_OBJECT_SIZE.  The latter is sometimes (in 230 out of 702 cases on this testcase) folded away during objsz2 pass and can't be folded or redundancy optimized earlier anyway.  On this testcase, I don't think most of the other calls are actually redundant, what is true is that most of the UBSAN_BOUNDS calls use highest array index 255 and the indexes are unsigned_int >> 24 or some other extractions of unsigned char.  By using value ranges we could get rid of most of them, but I'm actually hesistant here, because value ranges are computed with the assumption UB doesn't happen in the program, while sanitizers are the way to detect those UBs.  Unfortunately, we don't have different kinds of ranges, ranges that are proven for any kind of execution, including ones with UB, vs. ranges that derive from lack of UB.  In cases like:
  _74 = _1197 & 255;
  _419 = (int) _74;
  UBSAN_BOUNDS (0B, _419, 255);
or:
  _360 = _1197 >> 24;
  _428 = (int) _360;
  UBSAN_BOUNDS (0B, _428, 255);
or:
  _1200 = (unsigned char) _1199;
  _372 = (int) _1199;
  UBSAN_BOUNDS (0B, _372, 255);
perhaps we could at some point remove those ifns, because no matter what the arguments will provably be all >= 0 and <= 255.  Though, without using full VR that sounds quite hackish and in many other real-world cases the array size won't be that nice and we won't be able to prove anything.
Comment 10 Arnd Bergmann 2017-12-20 21:27:07 UTC
I sent a (rather crude) workaround to the kernel mailing list now, mainly to get the crypto maintainers involved in this as well. I also did further testing and found that in the entire kernel, this is the only place that caused this particular problem:

- comparing gcc-7.1.1 and gcc-7.2.1, many of the 72326 functions with >64 byte stack usage use less stack in gcc-7.2, only the AES code uses much more

- In 30 architectures I tested, the stack frames are more than twice as big as the second-largest function, around 2KB for 32-bit architectures and 4KB for 64-bit architectures. The only exception is aarch64, which uses only 500 bytes here.

- There are a handful of other functions that show a very substantial stack size increase with UBSAN (enic_rq_service() 2240 bytes, fnic_rq_cmpl_handler_cont() 2240 bytes, dsp3780I_EnableDSP() 1016 bytes), but all of them are the same way in older compilers as well, those are not regressions. 

- From looking at the assembler output for the AES cipher, it appears that gcc-7.1 and gcc-8 are both worse than gcc-6, with or without UBSAN, but only gcc-7.2 with UBSAN is drastically worse. As mentioned in my patch description, this needs to be tested better. There is a test module in the kernel to measure the throughput of the AES code, but I have not tried it myself.

https://patchwork.kernel.org/patch/10126567/
Comment 11 Arnd Bergmann 2018-01-12 12:49:06 UTC
The second version of my workaround (build with 'gcc -Os' on gcc-7.1+) was merged into mainline linux: https://patchwork.kernel.org/patch/10143607/
Comment 12 Arnd Bergmann 2018-01-14 21:47:57 UTC
Unfortunately that patch caused a regression (nothing to do with the compiler really, just the way powerpc linux uses some libgcc functions), and I've done some more investigation. The new finding is that selectively turning off '-fcode-hoisting' on gcc-7.2.1 restores the behavior we had on gcc-7.2.1, solving both the stack consumption problem with UBSAN and the performance regression (with and without UBSAN) as described in pr83651, with performance better than my previous workarounds that replaced -O2 with -Os, -O1, or -O2 -fno-tree-sra -fno-tree-pre.

I have to do more performance tests to send an updated kernel patch, but maybe this already helps analyze the problem further.
Comment 13 Richard Biener 2018-01-25 08:28:12 UTC
GCC 7.3 is being released, adjusting target milestone.
Comment 14 Jakub Jelinek 2018-04-12 07:27:16 UTC
*** Bug 85362 has been marked as a duplicate of this bug. ***
Comment 15 Richard Biener 2019-11-14 10:52:07 UTC
Works in GCC 8.