Bug 80290 - [6/7/8 Regression] g++ uses unreasonable amount of memory compiling nested string maps
Summary: [6/7/8 Regression] g++ uses unreasonable amount of memory compiling nested st...
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++ (show other bugs)
Version: 7.0.1
: P2 normal
Target Milestone: 6.5
Assignee: Jason Merrill
URL:
Keywords: memory-hog
Depends on:
Blocks:
 
Reported: 2017-04-03 14:08 UTC by Andrew Jeffery
Modified: 2017-07-04 08:47 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-04-04 00:00:00


Attachments
Source file demonstrating bad behaviour (105.75 KB, text/plain)
2017-04-03 14:08 UTC, Andrew Jeffery
Details
unincluded testcase (1.35 KB, text/plain)
2017-04-04 08:01 UTC, Richard Biener
Details
testcase just using <map> (753 bytes, text/plain)
2017-04-04 10:18 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Jeffery 2017-04-03 14:08:49 UTC
Created attachment 41107 [details]
Source file demonstrating bad behaviour

g++ uses an unreasonable amount of memory (greater than 5GB) when compiling the attached code. A source archive is attached, and the test can be run with `make`. Testing suggests this issue occurs with x86_64 g++s 5.4 and 6.2 from Ubuntu Yakkety, and both 6.3 and 7.0.1 built from source. The bug was originally encountered under a Yocto/Poky build cross-compiling from x86_64 to ARMv6 (targeting an ARM1176 core).

Alan Modra's tests indicate g++ 4.8 performs better than any of the previously mentioned versions. Behaviour in clang 3.8 also appears reasonable.

GCCs built from source were built with a configure line derived from Ubuntu Yakkety's g++ builds, with minor modifications to restrict the supported languages and architectures, and to change prefixes.

Test case is built with `g++ -std=c++11 fru-gen.ii -o fru-gen.o`

$ ~/gcc/master/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/home/andrew/gcc/master/bin/g++
COLLECT_LTO_WRAPPER=/home/andrew/gcc/master/libexec/gcc/x86_64-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../configure -v --enable-languages=c,c++ --prefix=/home/andrew/gcc/master --enable-shared --enable-linker-build-id --without-included-gettext --enable-threads=posix --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-gtk-cairo --with-arch-directory=amd64 --disable-werror --disable-multilib --with-abi=m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --enable-bootstrap
Thread model: posix
gcc version 7.0.1 20170401 (experimental) (GCC)
Comment 1 Richard Biener 2017-04-04 08:01:13 UTC
Confirmed.  The initializer is not unreasonably large IMHO but is certainly the problem.

clang++ uses 500MB for me when not optimizing, g++ 4.8 and g++ 5 around 200MB.
Starting with g++ 6 we use gigabytes.

Seems to be caused by libstdc++ changes given g++ 7 preprocessed source uses
as much memory with g++ 5 as with 7 (even worse, at 12GB and still compiling...)
Comment 2 Richard Biener 2017-04-04 08:01:44 UTC
Created attachment 41116 [details]
unincluded testcase
Comment 3 Richard Biener 2017-04-04 08:07:11 UTC
All memory is allocated during 

Run till exit from #84 0x000000000080a9f7 in check_initializer (
    decl=<var_decl 0x7ffff4ecdea0 frus>, init=<constructor 0x7ffff4dc2678>, 
    flags=5, cleanups=0x7fffffffd7f0)
    at /space/rguenther/src/svn/gcc-7-branch/gcc/cp/decl.c:6318
...
Run till exit from #104 0x00000000009d4b2f in expand_default_init (
    binfo=<tree_binfo 0x7ffff4ece960>, 
    true_exp=<var_decl 0x7ffff4ecdea0 frus>, 
    exp=<var_decl 0x7ffff4ecdea0 frus>, init=<constructor 0x7ffff4dc2678>, 
    flags=5, complain=3)
    at /space/rguenther/src/svn/gcc-7-branch/gcc/cp/init.c:1787

where it all the time does template instantiation and overload resolution.  I wonder if some trivial caching of "last elements overload set" during initializer processing could speed up things dramatically and avoid generating so much garbage
(we don't GC during a single check_initializer call).
Comment 4 Richard Biener 2017-04-04 08:50:32 UTC
For an unoptimized compiler the profile points at conversion_obstack_alloc called 16 million times when simplifying the testcase to contain a single initializer element.

Note we essentially have a std::map<unsigned, std::map<std::string, std::map<std::string, std::map<std::string, std::map<std::string, std::string> > > > > initializer.

Simplifying the initializer down to

extern const FruMap frus = {
   {1,{
         {"/system/chassis/motherboard/cpu0",{
             {"xyz.openbmc_project.Inventory.Decorator.Asset",{
                 {"PartNumber",{
                     {"IPMIFruSection","Board"},
                 }},
             }},
        }},
   }},
};

shows that adding another element to the innermost map increases memory use from
107MB to 124MB.  So even individual elements are quite costly.

For the above simplest case we still have 1.4 _million_ calls to conversion_obstack_alloc.

The C++ code is also a heavy allocator of tree vectors and tree lists during
overload resolution / template instantiation and matching.
Comment 5 Jonathan Wakely 2017-04-04 09:36:45 UTC
(In reply to Richard Biener from comment #1)
> Seems to be caused by libstdc++ changes given g++ 7 preprocessed source uses
> as much memory with g++ 5 as with 7 (even worse, at 12GB and still
> compiling...)

I doubt there's anything we can do to reduce it without losing standards conformance in some way, but I'll take a look and what changed.
Comment 6 Richard Biener 2017-04-04 10:11:52 UTC
Seems we build many functions (instantiate templates?) and then unify them via
fn_type_unification?  I believe the root cause is somewhere deeply embedded in the C++ FE ...

But again, using preprocessed source with GCC 5 libstdc++ with GCC 7 compiles the thing in just 250MB ram... (note this is all with the old ABI classes but quick testing shows new std::string doesn't make a difference here).
Comment 7 Richard Biener 2017-04-04 10:18:55 UTC
Created attachment 41120 [details]
testcase just using <map>

Reduced testcase.
Comment 8 Jonathan Wakely 2017-04-04 11:01:23 UTC
The drastic slowdown started with r225189

We can't realistically revert that, so we'll have to find a way to make the FE faster. I don't really know what I'm talking about, but I like this idea:

(In reply to Richard Biener from comment #3)
> I wonder if some trivial caching of "last elements overload set" during
> initializer processing could speed up things dramatically and avoid
> generating so much garbage
> (we don't GC during a single check_initializer call).
Comment 9 Jonathan Wakely 2017-04-04 11:09:47 UTC
Reduced:

#include <utility>

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

#define INIT { 1, \
  { "/system/chassis/motherboard/cpu0", \
    { "xyz.openbmc_project.Inventory.Decorator.Asset", \
      { "PartNumber", \
        {"IPMIFruSection","Board"}, \
      }, \
    } \
  }, \
}

extern const FruMap frus[] = {
INIT,
INIT,
INIT,
INIT,
INIT,
};


GCC 5:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 (17%) wall    1323 kB (15%) ggc
 phase parsing           :   0.04 (100%) usr   0.01 (100%) sys   0.05 (83%) wall    7520 kB (84%) ggc
 |name lookup            :   0.01 (25%) usr   0.00 ( 0%) sys   0.03 (50%) wall     581 kB ( 7%) ggc
 |overload resolution    :   0.01 (25%) usr   0.00 ( 0%) sys   0.01 (17%) wall    1867 kB (21%) ggc
 parser (global)         :   0.02 (50%) usr   0.00 ( 0%) sys   0.02 (33%) wall    2692 kB (30%) ggc
 parser struct body      :   0.01 (25%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     911 kB (10%) ggc
 template instantiation  :   0.01 (25%) usr   0.01 (100%) sys   0.03 (50%) wall    3239 kB (36%) ggc
 TOTAL                 :   0.04             0.01             0.06               8941 kB


GCC 6:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1397 kB ( 1%) ggc
 phase parsing           :   0.62 (98%) usr   0.19 (100%) sys   0.81 (96%) wall  125355 kB (99%) ggc
 phase opt and generate  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall     231 kB ( 0%) ggc
 |name lookup            :   0.01 ( 2%) usr   0.01 ( 5%) sys   0.00 ( 0%) wall    1155 kB ( 1%) ggc
 |overload resolution    :   0.24 (38%) usr   0.04 (21%) sys   0.27 (32%) wall   68949 kB (54%) ggc
 callgraph construction  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     231 kB ( 0%) ggc
 callgraph optimization  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 preprocessing           :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     187 kB ( 0%) ggc
 parser (global)         :   0.36 (57%) usr   0.10 (53%) sys   0.55 (65%) wall  103307 kB (81%) ggc
 parser struct body      :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1227 kB ( 1%) ggc
 parser inl. meth. body  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      95 kB ( 0%) ggc
 template instantiation  :   0.21 (33%) usr   0.09 (47%) sys   0.24 (29%) wall   19154 kB (15%) ggc
 symout                  :   0.02 ( 3%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1387 kB ( 1%) ggc
 TOTAL                 :   0.63             0.19             0.84             127016 kB
Comment 10 Jonathan Wakely 2017-04-04 11:12:23 UTC
Or even just:

#include <utility>

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

#define INIT { 1, { "", { "", { "", {"",""}, }, } }, }

extern const FruMap frus[] = {
INIT,
INIT,
INIT,
INIT,
INIT,
};

All the time is spent in implicit_conversion().
Comment 11 Jonathan Wakely 2017-04-04 11:17:18 UTC
With gcc-6 every extra member of the init-list increases memory usage dramatically:

 TOTAL     :   0.64             0.17             0.82             126289 kB
 TOTAL     :   0.76             0.23             1.00             147632 kB
 TOTAL     :   0.88             0.26             1.15             168975 kB

but with gcc-5:

 TOTAL        :   0.03             0.01             0.06               8390 kB
 TOTAL        :   0.03             0.01             0.05               8727 kB
 TOTAL        :   0.05             0.02             0.08               9064 kB
Comment 12 Richard Biener 2017-04-04 12:04:18 UTC
switching offender to the C++ FE
Comment 13 Richard Biener 2017-04-04 12:47:34 UTC
typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > FruMap;
extern const FruMap frus[] = {
    { 1, { "", { "", {"",""}, }, }, },
};

is twice as fast as

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

extern const FruMap frus[] = {
    { 1, { "", { "", { "", {"",""}, }, } }, },
};

the latter with 70713 implicit_conversion calls while the former with "only"
14331 implicit_conversion calls.  Removing another std::pair<> level reduces
it to 4389 calls (still 10689 tsubst_copy_and_build calls(!)).

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, const char *> > > FruMap;
extern const FruMap frus[] = {
    { 1, { "", {"",""}, }, },
};  

constexpr evaluation seems to cause quite some bits of it.  The constexpr hash
is quite big and we arrive with following exprs to evaluate/cache:

{}
0 || 0
std::_PCC<false, const char*, const char*>::_ConstructiblePair<const char*, const char*> ()
!std::_PCC<false, const char*, const char*>::_ImplicitlyConvertiblePair<const char*, const char*> ()
std::_PCC<false, const char*, const char*>::_ConstructiblePair<const char*, const char*> () && !std::_PCC<false, const char*, const char*>::_ImplicitlyConvertiblePair<const char*, const char*> ()

and for the && we don't use the cached LHS we computed above because
we don't use maybe_constant_value for example in cxx_eval_logical_expression
but we recurse into cxx_eval_constant_expression!

A simple patch like

Index: gcc/cp/constexpr.c
===================================================================
--- gcc/cp/constexpr.c  (revision 246678)
+++ gcc/cp/constexpr.c  (working copy)
@@ -3860,6 +3860,9 @@ lookup_placeholder (const constexpr_ctx
   return ob;
 }
 
+static GTY((deletable)) hash_map<tree, tree> *cv_cache;
+
+
 /* Attempt to reduce the expression T to a constant value.
    On failure, issue diagnostic and return error_mark_node.  */
 /* FIXME unify with c_fully_fold */
@@ -3874,6 +3877,11 @@ cxx_eval_constant_expression (const cons
   constexpr_ctx new_ctx;
   tree r = t;
 
+  tree *cached;
+  if (cv_cache
+      && (cached = cv_cache->get (t)))
+    return *cached;
+
   if (jump_target && *jump_target)
     {
       /* If we are jumping, ignore all statements/expressions except those
@@ -4824,8 +4832,6 @@ fold_simple (tree t)
    Otherwise, if T does not have TREE_CONSTANT set, returns T.
    Otherwise, returns a version of T without TREE_CONSTANT.  */
 
-static GTY((deletable)) hash_map<tree, tree> *cv_cache;
-
 tree
 maybe_constant_value (tree t, tree decl)
 {

should improve this, but while it catches some cases it doesn't result in
an overall improvement.  The patch is likely incorrect anyway.

But I think it shows the issue (and likely the patch doesn't help because
"equivalent" calls are not detected as such by the cv_cache map which
ends up with pointer equivalence AFAICS).

Oh, and there are constexpr evaluations that do not go through the cache
anyway (potential_constant_expression, cxx_constant_value)
Comment 14 Richard Biener 2017-04-04 12:50:41 UTC
(In reply to Richard Biener from comment #13)
> typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *,
> std::pair<const char *, const char *> > > > FruMap;
> extern const FruMap frus[] = {
>     { 1, { "", { "", {"",""}, }, }, },
> };
> 
> is twice as fast as
> 
> typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *,
> std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;
> 
> extern const FruMap frus[] = {
>     { 1, { "", { "", { "", {"",""}, }, } }, },
> };
> 
> the latter with 70713 implicit_conversion calls while the former with "only"
> 14331 implicit_conversion calls.  Removing another std::pair<> level reduces
> it to 4389 calls (still 10689 tsubst_copy_and_build calls(!)).

Similar amount of cxx_eval_call_expression so it boils down to constexpr handling...
Comment 15 Jason Merrill 2017-04-19 06:10:58 UTC
Adding a timevar to cxx_eval_outermost_constant_expr shows minimal time spent in constexpr evaluation; it also doesn't show up in -fmem-report.  So constexpr isn't the problem.

It seems to me that the problem is that we added a bunch of constructor template overloads, and made all of them rely on doing further overload resolution in order to evaluate the SFINAE condition, and that's a lot more work for the compiler.  And then there are a few places, notably push_tinst_level, that allocate memory that isn't freed until the next GC, so in this case it keeps building up.

We ought to be able to be smarter about caching previous overload resolution results, but that would be new machinery that doesn't seem suitable for stage 4.

I'll keep poking.
Comment 16 rguenther@suse.de 2017-04-19 08:05:54 UTC
On April 19, 2017 8:10:58 AM GMT+02:00, "jason at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80290
>
>--- Comment #15 from Jason Merrill <jason at gcc dot gnu.org> ---
>Adding a timevar to cxx_eval_outermost_constant_expr shows minimal time
>spent
>in constexpr evaluation; it also doesn't show up in -fmem-report.  So
>constexpr
>isn't the problem.

The number of calls suggest otherwise.  Are you sure the timevar isn't pushed away here (overload resolution seemed to be triggered by constexpr evaluation, so better constrxpr caching should help)

>It seems to me that the problem is that we added a bunch of constructor
>template overloads, and made all of them rely on doing further overload
>resolution in order to evaluate the SFINAE condition, and that's a lot
>more
>work for the compiler.  And then there are a few places, notably
>push_tinst_level, that allocate memory that isn't freed until the next
>GC, so
>in this case it keeps building up.

I suppose adding GC points isn't easily possible either?

>We ought to be able to be smarter about caching previous overload
>resolution
>results, but that would be new machinery that doesn't seem suitable for
>stage
>4.
>
>I'll keep poking.
Comment 17 Richard Biener 2017-07-04 08:47:22 UTC
GCC 6.4 is being released, adjusting target milestone.