Bug 86214 - [8 Regression] Strongly increased stack usage
Summary: [8 Regression] Strongly increased stack usage
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.1.0
: P2 normal
Target Milestone: 9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-06-19 16:48 UTC by Steinar H. Gunderson
Modified: 2021-05-14 11:00 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-06-21 00:00:00


Attachments
Test case (260 bytes, text/plain)
2018-06-19 16:48 UTC, Steinar H. Gunderson
Details
Test case #2 (1.07 KB, text/plain)
2018-07-04 11:59 UTC, Steinar H. Gunderson
Details
Unreduced test case (533.40 KB, application/x-xz)
2018-07-04 15:23 UTC, Steinar H. Gunderson
Details
gcc9-pr86214.patch (2.26 KB, patch)
2019-01-17 11:10 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Steinar H. Gunderson 2018-06-19 16:48:38 UTC
Created attachment 44296 [details]
Test case

Hi,

We noticed that MySQL does not pass its test suite when compiled with GCC 8; it runs out of stack. (GCC 7 is fine.) A reduced test case is included (mostly by C-Reduce, but it needed some help by hand); most of it appears to be fluff that keeps the compiler from just optimizing away the entire thing, but the gist of it seems to be that it inlines the bg::bl() function several times without caring that it balloons the stack size, and then doesn't manage to shrink the stack again by overlapping variables. Putting the noinline attribute on bg::bl() seems to be a workaround for now.

For comparison:

> g++-7 -O2 -Wstack-usage=1 -Wno-return-type -Wno-unused-result -c stack.i
stack.i: In function ‘void c()’:
stack.i:34:6: warning: stack usage is 8240 bytes [-Wstack-usage=]
 void c() {
      ^

> g++-8 -O2 -Wstack-usage=1 -Wno-return-type -Wno-unused-result -c stack.i                 
stack.i: In function ‘void c()’:
stack.i:34:6: warning: stack usage is 32816 bytes [-Wstack-usage=]
 void c() {
      ^

The actual, unreduced file can be found at https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0ins.cc#L926 (the line is positioned on a function whose adding noinline helps, although I don't think it corresponds directly to bg::bl; I think bg::bl might be ib::error, and the 8192-sized buffer comes from ib::logger::msg).
Comment 1 Alexander Monakov 2018-06-21 16:58:57 UTC
Inlining decisions are not so different between 7/8, the main difference is gcc-8 translates b::x() into __builtin_unreachable and warns accordingly:

warning: no return statement in function returning non-void [-Wreturn-type]

   b x(b) {}
           ^

and with that change gcc-8 no longer manages to prove that big arrays have non-overlapping lifetimes.

If I change the source to well-formed 'void x(b) {}', it compiles as desired.

So, assuming the original MySQL source is free of that warning, the testcase is too aggressively reduced and no longer reflects the original issue. Can you please re-reduce?
Comment 2 Steinar H. Gunderson 2018-07-04 08:18:45 UTC
OK, starting a reduce that also checks for no -Wreturn-type warnings.
Comment 3 Steinar H. Gunderson 2018-07-04 11:59:49 UTC
Created attachment 44349 [details]
Test case #2
Comment 4 Steinar H. Gunderson 2018-07-04 12:00:58 UTC
New test case added. It's quite a bit larger than the old one, but has no -Wreturn-type warning.
Comment 5 Alexander Monakov 2018-07-04 14:00:55 UTC
Sorry, this still seems over-reduced: the 'cmp' variable is uninitialized, and gcc-7 completely optimizes out the block with large stack usage guarded by 'cmp == 0' test, so gcc-7 vs gcc-8 is not directly comparable. It's strange that gcc-7 optimizes that out, but it's a different issue.

Can you attach the unreduced preprocessed source, and if you make another attempt at reducing, perhaps enable most warnings?

That said, it seems gcc is not very good at re-discovering non-overlapping stack allocations introduced by inlining. Looking at your testcase I came up with the following minimal test:

struct S{~S();};

void f(void *);

inline void ff()
{
  char c[1000];
  f(c);
}

void g(int n)
{
  S s;
  char c[100];
  f(c);
  if (n) ff(), ff();
}

(there's no regression vs. gcc-7 on this example, but gcc-4.6 used to get a better result by consuming 1100 bytes rather than 2100).
Comment 6 Steinar H. Gunderson 2018-07-04 14:39:41 UTC
I wouldn't be surprised if everything is really over-reduced, and that GCC 7 just happens not to hit this by luck.

I'll be making a checkout of the public git repository and try to reproduce there, so that I can give you an unreduced version.
Comment 7 Steinar H. Gunderson 2018-07-04 15:23:37 UTC
Created attachment 44351 [details]
Unreduced test case

There; unreduced, from public MySQL. This is preprocessed with GCC 7, and compiles both with 7 and 8 (the latter triggers too big stack).
Comment 8 Alexander Monakov 2018-07-04 17:34:39 UTC
Removing the 'waiting' status.
Comment 9 Jakub Jelinek 2018-07-26 11:01:58 UTC
GCC 8.2 has been released.
Comment 10 Richard Biener 2018-12-20 12:46:43 UTC
The issue with Alex testcase is that inlining when faced with

ff ()
{
  charD.10 cD.2379[1000];

  <bb 2> :
  f (&cD.2379);

  <bb 3> :
  cD.2379 ={v} {CLOBBER};
  return;

  <bb 4> :
<L0>:
  cD.2379 ={v} {CLOBBER};
  resx 1

}

ends up unifying the resx 1 parts, making the lifetime of the two instances
overlap:

g (intD.9 nD.2380)
{
  charD.10 cD.2411[1000];
  charD.10 cD.2410[1000];
...
  <bb 4> :
  f (&cD.2410);

  <bb 5> :
  cD.2410 ={v} {CLOBBER};
  f (&cD.2411);

  <bb 6> :
  cD.2411 ={v} {CLOBBER};
... all fine up to here

  <bb 9> :
<L4>:
  sD.2383 ={v} {CLOBBER};
  cD.2399 ={v} {CLOBBER};
  resx 1

oops.  When analyzing lifetime of variables during RTL expansion we might
want to not consider blocks where only the clobbers mention the vars thus
virtually move those up as far as possible.
Comment 11 Jakub Jelinek 2019-01-16 19:12:13 UTC
Looking at the #c7 testcase, confirming that ~ 29KB stack in one of the functions.  The problem is that msg has char buf[8192]; variable in it and is inline, gets inlined into a function 3 times and can throw.
ehcleanup1 removes the buf (and str) clobbers that were the only reason to have an EH pad that just rethrows (and I agree it is a good idea to do that, because otherwise inliner thinks the functions are more expensive than they actually are).
But then the function into which this function is ultimately inlined has some finalization (destructors) covering the code into which it has been inlined, so the former EH with no successor block because it would throw externally now becomes EH edge from the code to a landing block onto which everything is marked as conflicting, there is no clobber at all for the variables.

So, I wonder if we shouldn't add in such cases clobbers to the start of those landing pads in the situation, either for all variables that live in memory, or at least for the larger ones.  Will play with it tomorrow.

There is another thing - I've noticed add_stack_var_conflict is often called with x == y, shouldn't we return right away in that case?  We don't need to record that a var conflicts with itself, we later on return that no variable conflicts with itself.

Note, before r255104 we weren't inlining msg into the bigger function and thus the issue was latent.

The workaround for MySQL, at least for -O2, would be to move logger:msg definition out from the class, so it is not inline, then at least gcc trunk doesn't want to inline it and you don't run into this.
Comment 12 Jakub Jelinek 2019-01-17 08:05:45 UTC
Author: jakub
Date: Thu Jan 17 08:05:12 2019
New Revision: 268009

URL: https://gcc.gnu.org/viewcvs?rev=268009&root=gcc&view=rev
Log:
	PR tree-optimization/86214
	* cfgexpand.c (add_stack_var_conflict): Don't add any conflicts
	if x == y.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgexpand.c
Comment 13 Jakub Jelinek 2019-01-17 08:27:10 UTC
Short testcase that shows what's going on on the #c7 testcase:
typedef __SIZE_TYPE__ size_t;
struct A { A (); ~A (); int a; void qux (const char *); };
int bar (char *);

static inline __attribute__((always_inline)) A
foo ()
{
  char b[8192];
  int x = bar (b);
  A s;
  if (x > 0 && (size_t) x < sizeof b)
    s.qux (b);
  return s;
}

void
baz ()
{
  A a;
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
}
Comment 14 Jakub Jelinek 2019-01-17 08:53:37 UTC
(In reply to Jakub Jelinek from comment #11)
> The workaround for MySQL, at least for -O2, would be to move logger:msg
> definition out from the class, so it is not inline, then at least gcc trunk
> doesn't want to inline it and you don't run into this.

Or just add -fconserve-stack to the g++ options.
Comment 15 Jakub Jelinek 2019-01-17 08:58:24 UTC
Ah, the #c13 testcase shows the issue, but in the early inliner, not in IPA inliner like is triggered on the #c7 testcase.
This modified testcase triggers it in the IPA inliner:
typedef __SIZE_TYPE__ size_t;
struct A { A (); ~A (); int a; void qux (const char *); };
int bar (char *);

static inline A
foo ()
{
  char b[8192];
  int x = bar (b);
  A s;
  if (x > 0 && (size_t) x < sizeof b)
    s.qux (b);
  return s;
}

void
baz ()
{
  A a;
  char c[1024];
  bar (c);
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
  foo (); foo (); foo (); foo (); foo ();
}
Comment 16 Jakub Jelinek 2019-01-17 11:10:33 UTC
Created attachment 45449 [details]
gcc9-pr86214.patch

Untested patch.  Not 100% sure if there must be just one landing pad, will see during bootstrap/regtest.  If there must be at most one, the second field can go and it can just clear the first field.
Comment 17 Jakub Jelinek 2019-01-18 10:07:59 UTC
Author: jakub
Date: Fri Jan 18 10:07:27 2019
New Revision: 268067

URL: https://gcc.gnu.org/viewcvs?rev=268067&root=gcc&view=rev
Log:
	PR tree-optimization/86214
	* tree-inline.h (struct copy_body_data): Add
	add_clobbers_to_eh_landing_pads member.
	* tree-inline.c (add_clobbers_to_eh_landing_pad): New function.
	(copy_edges_for_bb): Call it if EH edge destination is <
	id->add_clobbers_to_eh_landing_pads.  Fix a comment typo.
	(expand_call_inline): Set id->add_clobbers_to_eh_landing_pads
	if flag_stack_reuse != SR_NONE and clear it afterwards.

	* g++.dg/opt/pr86214-1.C: New test.
	* g++.dg/opt/pr86214-2.C: New test.

Added:
    trunk/gcc/testsuite/g++.dg/opt/pr86214-1.C
    trunk/gcc/testsuite/g++.dg/opt/pr86214-2.C
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-inline.c
    trunk/gcc/tree-inline.h
Comment 18 Jakub Jelinek 2019-01-18 10:15:29 UTC
Should be fixed now on the trunk.
Likely not going to backport.
Comment 19 Steinar H. Gunderson 2019-01-18 10:27:31 UTC
Thanks for fixing. IIRC we just added a noinline attribute somewhere in the code, so we already have a workaround.
Comment 20 Christophe Lyon 2019-01-22 13:49:42 UTC
(In reply to Jakub Jelinek from comment #17)
> Author: jakub
> Date: Fri Jan 18 10:07:27 2019
> New Revision: 268067
> 
> URL: https://gcc.gnu.org/viewcvs?rev=268067&root=gcc&view=rev
> Log:
> 	PR tree-optimization/86214
> 	* tree-inline.h (struct copy_body_data): Add
> 	add_clobbers_to_eh_landing_pads member.
> 	* tree-inline.c (add_clobbers_to_eh_landing_pad): New function.
> 	(copy_edges_for_bb): Call it if EH edge destination is <
> 	id->add_clobbers_to_eh_landing_pads.  Fix a comment typo.
> 	(expand_call_inline): Set id->add_clobbers_to_eh_landing_pads
> 	if flag_stack_reuse != SR_NONE and clear it afterwards.
> 
> 	* g++.dg/opt/pr86214-1.C: New test.
> 	* g++.dg/opt/pr86214-2.C: New test.
> 
> Added:
>     trunk/gcc/testsuite/g++.dg/opt/pr86214-1.C
>     trunk/gcc/testsuite/g++.dg/opt/pr86214-2.C
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/testsuite/ChangeLog
>     trunk/gcc/tree-inline.c
>     trunk/gcc/tree-inline.h

Hi Jakub,

Since you committed this patch, I've noticed regressions on some arm targets:
FAIL: 23_containers/list/requirements/exception/basic.cc execution test
libstdc++.log has this:
spawn [open ...]
N10__gnu_test12functor_base19iterator_operationsINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEEEE
end count 2
N10__gnu_test12functor_base25const_iterator_operationsINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEEEE
end count 3
N10__gnu_test12functor_base11erase_pointINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1ELb0EEE
end count 4
N10__gnu_test12functor_base11erase_rangeINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1ELb0EEE
end count 5
N10__gnu_test12functor_base12insert_pointINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1ELb0EEE
end count 6
N10__gnu_test12functor_base7emplaceINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb0EEE
end count 7
N10__gnu_test12functor_base13emplace_pointINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1ELb0ELb0EEE
end count 8
N10__gnu_test12functor_base13emplace_frontINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 9
N10__gnu_test12functor_base12emplace_backINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 10
N10__gnu_test12functor_base9pop_frontINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 11
N10__gnu_test12functor_base8pop_backINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 12
N10__gnu_test12functor_base10push_frontINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 13
N10__gnu_test12functor_base9push_backINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb1EEE
end count 14
N10__gnu_test12functor_base6rehashINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEELb0EEE
end count 15
N10__gnu_test12functor_base4swapINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS4_21throw_allocator_limitIS5_EEEEEE
end count 16
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

The qemu trace does not seem very helpful (and is probably incomplete):
IN: _ZN10__gnu_test12basic_safetyINSt7__cxx114listIN9__gnu_cxx17throw_value_limitENS3_21throw_allocator_limitIS4_EEEEE3runEv
0x0001c724:  e5943008  ldr      r3, [r4, #8]
0x0001c728:  e59d2010  ldr      r2, [sp, #0x10]
0x0001c72c:  e59d101c  ldr      r1, [sp, #0x1c]
0x0001c730:  e3530000  cmp      r3, #0
0x0001c734:  e5987000  ldr      r7, [r8]
0x0001c738:  e5812000  str      r2, [r1]
0x0001c73c:  e5896000  str      r6, [sb]
0x0001c740:  e5882000  str      r2, [r8]
0x0001c744:  0a000510  beq      #0x1db8c

----------------
IN: 
0x40adc6bc:  e0849009  add      sb, r4, sb
0x40adc6c0:  e59f29c8  ldr      r2, [pc, #0x9c8]
0x40adc6c4:  e5993004  ldr      r3, [sb, #4]
0x40adc6c8:  e08f2002  add      r2, pc, r2
0x40adc6cc:  e3833001  orr      r3, r3, #1
0x40adc6d0:  e1550002  cmp      r5, r2
0x40adc6d4:  e59da018  ldr      sl, [sp, #0x18]
0x40adc6d8:  e5893004  str      r3, [sb, #4]
0x40adc6dc:  0a000002  beq      #0x40adc6ec

----------------
IN: 
0x40adc6ec:  e59f39a0  ldr      r3, [pc, #0x9a0]
0x40adc6f0:  e2846008  add      r6, r4, #8
0x40adc6f4:  e08f3003  add      r3, pc, r3
0x40adc6f8:  e593102c  ldr      r1, [r3, #0x2c]
0x40adc6fc:  e3510000  cmp      r1, #0
0x40adc700:  1affff13  bne      #0x40adc354

----------------
IN: 
0x40adc704:  e1a00006  mov      r0, r6
0x40adc708:  e28dd044  add      sp, sp, #0x44
0x40adc70c:  e8bd8ff0  pop      {r4, r5, r6, r7, r8, sb, sl, fp, pc}

----------------
IN: 
0x40adc6ac:  e085240c  add      r2, r5, ip, lsl #8
0x40adc6b0:  e2822028  add      r2, r2, #0x28
0x40adc6b4:  e3a03001  mov      r3, #1
0x40adc6b8:  eaffffce  b        #0x40adc5f8


I've noticed this on:
arm-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a9
--with-fpu vfp
RUNTESTFLAGS: -march=armv5t

arm-none-linux-gnueabihf
--with-mode arm
--with-cpu arm10tdmi
--with-fpu vfp

arm-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a5
--with-fpu vfpv3-d16-fp16

armeb-none-linux-gnueabihf
--with-mode arm
--with-cpu cortex-a9
--with-fpu vfpv3-d16-fp16

Other arm-none-linux-gnueabihf configs I test still run fine.
Comment 21 Jakub Jelinek 2019-02-22 15:21:04 UTC
GCC 8.3 has been released.
Comment 22 Jakub Jelinek 2020-03-04 09:31:59 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 23 Jakub Jelinek 2021-05-14 11:00:39 UTC
The GCC 8 branch is being closed, fixed in GCC 9.1.