Bug 80290 - [6/7/8/9 Regression] g++ uses unreasonable amount of memory compiling nested string maps
Summary: [6/7/8/9 Regression] g++ uses unreasonable amount of memory compiling nested ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: c++ (show other bugs)
Version: 7.0.1
: P2 normal
Target Milestone: 7.4
Assignee: Jason Merrill
URL:
Keywords: deferred, memory-hog, patch
Depends on:
Blocks:
 
Reported: 2017-04-03 14:08 UTC by Andrew Jeffery
Modified: 2018-06-28 03:22 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-04-04 00:00:00


Attachments
Source file demonstrating bad behaviour (105.75 KB, text/plain)
2017-04-03 14:08 UTC, Andrew Jeffery
Details
unincluded testcase (1.35 KB, text/plain)
2017-04-04 08:01 UTC, Richard Biener
Details
testcase just using <map> (753 bytes, text/plain)
2017-04-04 10:18 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Andrew Jeffery 2017-04-03 14:08:49 UTC
Created attachment 41107 [details]
Source file demonstrating bad behaviour

g++ uses an unreasonable amount of memory (greater than 5GB) when compiling the attached code. A source archive is attached, and the test can be run with `make`. Testing suggests this issue occurs with x86_64 g++s 5.4 and 6.2 from Ubuntu Yakkety, and both 6.3 and 7.0.1 built from source. The bug was originally encountered under a Yocto/Poky build cross-compiling from x86_64 to ARMv6 (targeting an ARM1176 core).

Alan Modra's tests indicate g++ 4.8 performs better than any of the previously mentioned versions. Behaviour in clang 3.8 also appears reasonable.

GCCs built from source were built with a configure line derived from Ubuntu Yakkety's g++ builds, with minor modifications to restrict the supported languages and architectures, and to change prefixes.

Test case is built with `g++ -std=c++11 fru-gen.ii -o fru-gen.o`

$ ~/gcc/master/bin/g++ -v
Using built-in specs.
COLLECT_GCC=/home/andrew/gcc/master/bin/g++
COLLECT_LTO_WRAPPER=/home/andrew/gcc/master/libexec/gcc/x86_64-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../configure -v --enable-languages=c,c++ --prefix=/home/andrew/gcc/master --enable-shared --enable-linker-build-id --without-included-gettext --enable-threads=posix --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-gtk-cairo --with-arch-directory=amd64 --disable-werror --disable-multilib --with-abi=m64 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --enable-bootstrap
Thread model: posix
gcc version 7.0.1 20170401 (experimental) (GCC)
Comment 1 Richard Biener 2017-04-04 08:01:13 UTC
Confirmed.  The initializer is not unreasonably large IMHO but is certainly the problem.

clang++ uses 500MB for me when not optimizing, g++ 4.8 and g++ 5 around 200MB.
Starting with g++ 6 we use gigabytes.

Seems to be caused by libstdc++ changes given g++ 7 preprocessed source uses
as much memory with g++ 5 as with 7 (even worse, at 12GB and still compiling...)
Comment 2 Richard Biener 2017-04-04 08:01:44 UTC
Created attachment 41116 [details]
unincluded testcase
Comment 3 Richard Biener 2017-04-04 08:07:11 UTC
All memory is allocated during 

Run till exit from #84 0x000000000080a9f7 in check_initializer (
    decl=<var_decl 0x7ffff4ecdea0 frus>, init=<constructor 0x7ffff4dc2678>, 
    flags=5, cleanups=0x7fffffffd7f0)
    at /space/rguenther/src/svn/gcc-7-branch/gcc/cp/decl.c:6318
...
Run till exit from #104 0x00000000009d4b2f in expand_default_init (
    binfo=<tree_binfo 0x7ffff4ece960>, 
    true_exp=<var_decl 0x7ffff4ecdea0 frus>, 
    exp=<var_decl 0x7ffff4ecdea0 frus>, init=<constructor 0x7ffff4dc2678>, 
    flags=5, complain=3)
    at /space/rguenther/src/svn/gcc-7-branch/gcc/cp/init.c:1787

where it all the time does template instantiation and overload resolution.  I wonder if some trivial caching of "last elements overload set" during initializer processing could speed up things dramatically and avoid generating so much garbage
(we don't GC during a single check_initializer call).
Comment 4 Richard Biener 2017-04-04 08:50:32 UTC
For an unoptimized compiler the profile points at conversion_obstack_alloc called 16 million times when simplifying the testcase to contain a single initializer element.

Note we essentially have a std::map<unsigned, std::map<std::string, std::map<std::string, std::map<std::string, std::map<std::string, std::string> > > > > initializer.

Simplifying the initializer down to

extern const FruMap frus = {
   {1,{
         {"/system/chassis/motherboard/cpu0",{
             {"xyz.openbmc_project.Inventory.Decorator.Asset",{
                 {"PartNumber",{
                     {"IPMIFruSection","Board"},
                 }},
             }},
        }},
   }},
};

shows that adding another element to the innermost map increases memory use from
107MB to 124MB.  So even individual elements are quite costly.

For the above simplest case we still have 1.4 _million_ calls to conversion_obstack_alloc.

The C++ code is also a heavy allocator of tree vectors and tree lists during
overload resolution / template instantiation and matching.
Comment 5 Jonathan Wakely 2017-04-04 09:36:45 UTC
(In reply to Richard Biener from comment #1)
> Seems to be caused by libstdc++ changes given g++ 7 preprocessed source uses
> as much memory with g++ 5 as with 7 (even worse, at 12GB and still
> compiling...)

I doubt there's anything we can do to reduce it without losing standards conformance in some way, but I'll take a look and what changed.
Comment 6 Richard Biener 2017-04-04 10:11:52 UTC
Seems we build many functions (instantiate templates?) and then unify them via
fn_type_unification?  I believe the root cause is somewhere deeply embedded in the C++ FE ...

But again, using preprocessed source with GCC 5 libstdc++ with GCC 7 compiles the thing in just 250MB ram... (note this is all with the old ABI classes but quick testing shows new std::string doesn't make a difference here).
Comment 7 Richard Biener 2017-04-04 10:18:55 UTC
Created attachment 41120 [details]
testcase just using <map>

Reduced testcase.
Comment 8 Jonathan Wakely 2017-04-04 11:01:23 UTC
The drastic slowdown started with r225189

We can't realistically revert that, so we'll have to find a way to make the FE faster. I don't really know what I'm talking about, but I like this idea:

(In reply to Richard Biener from comment #3)
> I wonder if some trivial caching of "last elements overload set" during
> initializer processing could speed up things dramatically and avoid
> generating so much garbage
> (we don't GC during a single check_initializer call).
Comment 9 Jonathan Wakely 2017-04-04 11:09:47 UTC
Reduced:

#include <utility>

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

#define INIT { 1, \
  { "/system/chassis/motherboard/cpu0", \
    { "xyz.openbmc_project.Inventory.Decorator.Asset", \
      { "PartNumber", \
        {"IPMIFruSection","Board"}, \
      }, \
    } \
  }, \
}

extern const FruMap frus[] = {
INIT,
INIT,
INIT,
INIT,
INIT,
};


GCC 5:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 (17%) wall    1323 kB (15%) ggc
 phase parsing           :   0.04 (100%) usr   0.01 (100%) sys   0.05 (83%) wall    7520 kB (84%) ggc
 |name lookup            :   0.01 (25%) usr   0.00 ( 0%) sys   0.03 (50%) wall     581 kB ( 7%) ggc
 |overload resolution    :   0.01 (25%) usr   0.00 ( 0%) sys   0.01 (17%) wall    1867 kB (21%) ggc
 parser (global)         :   0.02 (50%) usr   0.00 ( 0%) sys   0.02 (33%) wall    2692 kB (30%) ggc
 parser struct body      :   0.01 (25%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     911 kB (10%) ggc
 template instantiation  :   0.01 (25%) usr   0.01 (100%) sys   0.03 (50%) wall    3239 kB (36%) ggc
 TOTAL                 :   0.04             0.01             0.06               8941 kB


GCC 6:

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1397 kB ( 1%) ggc
 phase parsing           :   0.62 (98%) usr   0.19 (100%) sys   0.81 (96%) wall  125355 kB (99%) ggc
 phase opt and generate  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.02 ( 2%) wall     231 kB ( 0%) ggc
 |name lookup            :   0.01 ( 2%) usr   0.01 ( 5%) sys   0.00 ( 0%) wall    1155 kB ( 1%) ggc
 |overload resolution    :   0.24 (38%) usr   0.04 (21%) sys   0.27 (32%) wall   68949 kB (54%) ggc
 callgraph construction  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     231 kB ( 0%) ggc
 callgraph optimization  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 preprocessing           :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     187 kB ( 0%) ggc
 parser (global)         :   0.36 (57%) usr   0.10 (53%) sys   0.55 (65%) wall  103307 kB (81%) ggc
 parser struct body      :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1227 kB ( 1%) ggc
 parser inl. meth. body  :   0.01 ( 2%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      95 kB ( 0%) ggc
 template instantiation  :   0.21 (33%) usr   0.09 (47%) sys   0.24 (29%) wall   19154 kB (15%) ggc
 symout                  :   0.02 ( 3%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1387 kB ( 1%) ggc
 TOTAL                 :   0.63             0.19             0.84             127016 kB
Comment 10 Jonathan Wakely 2017-04-04 11:12:23 UTC
Or even just:

#include <utility>

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

#define INIT { 1, { "", { "", { "", {"",""}, }, } }, }

extern const FruMap frus[] = {
INIT,
INIT,
INIT,
INIT,
INIT,
};

All the time is spent in implicit_conversion().
Comment 11 Jonathan Wakely 2017-04-04 11:17:18 UTC
With gcc-6 every extra member of the init-list increases memory usage dramatically:

 TOTAL     :   0.64             0.17             0.82             126289 kB
 TOTAL     :   0.76             0.23             1.00             147632 kB
 TOTAL     :   0.88             0.26             1.15             168975 kB

but with gcc-5:

 TOTAL        :   0.03             0.01             0.06               8390 kB
 TOTAL        :   0.03             0.01             0.05               8727 kB
 TOTAL        :   0.05             0.02             0.08               9064 kB
Comment 12 Richard Biener 2017-04-04 12:04:18 UTC
switching offender to the C++ FE
Comment 13 Richard Biener 2017-04-04 12:47:34 UTC
typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > FruMap;
extern const FruMap frus[] = {
    { 1, { "", { "", {"",""}, }, }, },
};

is twice as fast as

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;

extern const FruMap frus[] = {
    { 1, { "", { "", { "", {"",""}, }, } }, },
};

the latter with 70713 implicit_conversion calls while the former with "only"
14331 implicit_conversion calls.  Removing another std::pair<> level reduces
it to 4389 calls (still 10689 tsubst_copy_and_build calls(!)).

typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *, const char *> > > FruMap;
extern const FruMap frus[] = {
    { 1, { "", {"",""}, }, },
};  

constexpr evaluation seems to cause quite some bits of it.  The constexpr hash
is quite big and we arrive with following exprs to evaluate/cache:

{}
0 || 0
std::_PCC<false, const char*, const char*>::_ConstructiblePair<const char*, const char*> ()
!std::_PCC<false, const char*, const char*>::_ImplicitlyConvertiblePair<const char*, const char*> ()
std::_PCC<false, const char*, const char*>::_ConstructiblePair<const char*, const char*> () && !std::_PCC<false, const char*, const char*>::_ImplicitlyConvertiblePair<const char*, const char*> ()

and for the && we don't use the cached LHS we computed above because
we don't use maybe_constant_value for example in cxx_eval_logical_expression
but we recurse into cxx_eval_constant_expression!

A simple patch like

Index: gcc/cp/constexpr.c
===================================================================
--- gcc/cp/constexpr.c  (revision 246678)
+++ gcc/cp/constexpr.c  (working copy)
@@ -3860,6 +3860,9 @@ lookup_placeholder (const constexpr_ctx
   return ob;
 }
 
+static GTY((deletable)) hash_map<tree, tree> *cv_cache;
+
+
 /* Attempt to reduce the expression T to a constant value.
    On failure, issue diagnostic and return error_mark_node.  */
 /* FIXME unify with c_fully_fold */
@@ -3874,6 +3877,11 @@ cxx_eval_constant_expression (const cons
   constexpr_ctx new_ctx;
   tree r = t;
 
+  tree *cached;
+  if (cv_cache
+      && (cached = cv_cache->get (t)))
+    return *cached;
+
   if (jump_target && *jump_target)
     {
       /* If we are jumping, ignore all statements/expressions except those
@@ -4824,8 +4832,6 @@ fold_simple (tree t)
    Otherwise, if T does not have TREE_CONSTANT set, returns T.
    Otherwise, returns a version of T without TREE_CONSTANT.  */
 
-static GTY((deletable)) hash_map<tree, tree> *cv_cache;
-
 tree
 maybe_constant_value (tree t, tree decl)
 {

should improve this, but while it catches some cases it doesn't result in
an overall improvement.  The patch is likely incorrect anyway.

But I think it shows the issue (and likely the patch doesn't help because
"equivalent" calls are not detected as such by the cv_cache map which
ends up with pointer equivalence AFAICS).

Oh, and there are constexpr evaluations that do not go through the cache
anyway (potential_constant_expression, cxx_constant_value)
Comment 14 Richard Biener 2017-04-04 12:50:41 UTC
(In reply to Richard Biener from comment #13)
> typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *,
> std::pair<const char *, const char *> > > > FruMap;
> extern const FruMap frus[] = {
>     { 1, { "", { "", {"",""}, }, }, },
> };
> 
> is twice as fast as
> 
> typedef std::pair<unsigned, std::pair<const char *, std::pair<const char *,
> std::pair<const char *, std::pair<const char *, const char *> > > > > FruMap;
> 
> extern const FruMap frus[] = {
>     { 1, { "", { "", { "", {"",""}, }, } }, },
> };
> 
> the latter with 70713 implicit_conversion calls while the former with "only"
> 14331 implicit_conversion calls.  Removing another std::pair<> level reduces
> it to 4389 calls (still 10689 tsubst_copy_and_build calls(!)).

Similar amount of cxx_eval_call_expression so it boils down to constexpr handling...
Comment 15 Jason Merrill 2017-04-19 06:10:58 UTC
Adding a timevar to cxx_eval_outermost_constant_expr shows minimal time spent in constexpr evaluation; it also doesn't show up in -fmem-report.  So constexpr isn't the problem.

It seems to me that the problem is that we added a bunch of constructor template overloads, and made all of them rely on doing further overload resolution in order to evaluate the SFINAE condition, and that's a lot more work for the compiler.  And then there are a few places, notably push_tinst_level, that allocate memory that isn't freed until the next GC, so in this case it keeps building up.

We ought to be able to be smarter about caching previous overload resolution results, but that would be new machinery that doesn't seem suitable for stage 4.

I'll keep poking.
Comment 16 rguenther@suse.de 2017-04-19 08:05:54 UTC
On April 19, 2017 8:10:58 AM GMT+02:00, "jason at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80290
>
>--- Comment #15 from Jason Merrill <jason at gcc dot gnu.org> ---
>Adding a timevar to cxx_eval_outermost_constant_expr shows minimal time
>spent
>in constexpr evaluation; it also doesn't show up in -fmem-report.  So
>constexpr
>isn't the problem.

The number of calls suggest otherwise.  Are you sure the timevar isn't pushed away here (overload resolution seemed to be triggered by constexpr evaluation, so better constrxpr caching should help)

>It seems to me that the problem is that we added a bunch of constructor
>template overloads, and made all of them rely on doing further overload
>resolution in order to evaluate the SFINAE condition, and that's a lot
>more
>work for the compiler.  And then there are a few places, notably
>push_tinst_level, that allocate memory that isn't freed until the next
>GC, so
>in this case it keeps building up.

I suppose adding GC points isn't easily possible either?

>We ought to be able to be smarter about caching previous overload
>resolution
>results, but that would be new machinery that doesn't seem suitable for
>stage
>4.
>
>I'll keep poking.
Comment 17 Richard Biener 2017-07-04 08:47:22 UTC
GCC 6.4 is being released, adjusting target milestone.
Comment 18 Jason Merrill 2018-02-16 20:13:21 UTC
Not currently working on this.
Comment 19 Alexandre Oliva 2018-04-05 04:29:06 UTC
Mine
Comment 20 Alexandre Oliva 2018-04-16 22:36:05 UTC
Patch posted
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00709.html
Comment 21 Alexandre Oliva 2018-04-18 05:17:58 UTC
Author: aoliva
Date: Wed Apr 18 05:17:26 2018
New Revision: 259457

URL: https://gcc.gnu.org/viewcvs?rev=259457&root=gcc&view=rev
Log:
[PR c++/80290] recycle tinst garbage sooner

tinst_level objects are created during template instantiation, and
they're most often quite short-lived, but since there's no intervening
garbage collection, they accumulate throughout the pass while most by
far could be recycled after a short while.  The original testcase in
PR80290, for example, creates almost 55 million tinst_level objects,
all but 10 thousand of them without intervening GC, but we don't need
more than 284 of them live at a time.

Furthermore, in many cases, TREE_LIST objects are created to stand for
the decl in tinst_level.  In most cases, those can be recycled as soon
as the tinst_level object is recycled; in some relatively common
cases, they are modified and reused in up to 3 tinst_level objects.
In the same testcase, TREE_LISTs are used in all but 3 thousand of the
tinst_level objects, and we don't need more than 89 tinst_level
objects with TREE_LISTs live at a time.  Furthermore, all but 2
thousand of those are created without intervening GC.

This patch arranges for tinst_level objects to be refcounted (I've
seen as many as 20 live references to a single tinst_level object in
my testing), and for pending_template, tinst_level and TREE_LIST
objects that can be recycled to be added to freelists; that's much
faster than ggc_free()ing them and reallocating them shortly
thereafter.  In fact, the introduction of such freelists is what makes
this entire patch lighter-weight, when it comes to memory use, and
faster.  With refcounting alone, the total memory footprint was still
about the same, and we spent more time.

In order to further reduce memory use, I've arranged for us to create
TREE_LISTs lazily, only at points that really need them (when printing
error messages).  This grows tinst_level by an additional pointer, but
since a TREE_LIST holds a lot more than an extra pointer, and
tinst_levels with TREE_LISTs used to be allocated tens of thousands of
times more often than plain decl ones, we still save memory overall.

I was still not quite happy about growing memory use in cases that
used template classes but not template overload resolution, so I
changed the type of the errors field to unsigned short, from int.
With that change, in_system_header_p and refcount move into the same
word or half-word that used to hold errors, releasing a full word,
bringing tinst_level back to its original size, now without any
padding.

The errors field is only used to test whether we've had any errors
since the expansion of some template, to skip the expansion of further
templates.  If we get 2^16 errors or more, it is probably reasonable
to refrain from expanding further templates, even if we would expand
them before this change.

With these changes, compile time for the original testcase at -O0,
with default checking enabled, is cut down by some 3.7%, total GCable
memory allocation is cut down by almost 20%, and total memory use (as
reported by GNU time as maxresident) is cut down by slightly over 15%.


for  gcc/cp/ChangeLog

	PR c++/80290
	* cp-tree.h (struct tinst_level): Split decl into tldcl and
	targs.  Add private split_list_p, tree_list_p, and not_list_p
	inline const predicates; to_list private member function
	declaration; free public member function declaration; list_p,
	get_node and maybe_get_node accessors, and refcount data
	member.  Narrow errors to unsigned short.
	* error.c (print_instantiation_full_context): Use new
	accessors.
	(print_instantiation_partial_context_line): Likewise.  Drop
	const from tinst_level-typed parameter.
	* mangle.c (mangle_decl_string): Likewise.
	* pt.c (freelist): New template class.
	(tree_list_freelist_head): New var.
	(tree_list_freelist): New fn, along with specializations.
	(tinst_level_freelist_head): New var.
	(pending_template_freelist_head): Likewise.
	(tinst_level_freelist, pending_template_freelist): New fns.
	(tinst_level::to_list, tinst_level::free): Define.
	(inc_refcount_use, dec_refcount_use): New fns for tinst_level.
	(set_refcount_ptr): New template fn.
	(add_pending_template): Adjust for refcounting, freelists and
	new accessors.
	(neglectable_inst_p): Take a NULL d as a non-DECL.
	(limit_bad_template_recursion): Use new accessors.
	(push_tinst_level): New overload to create the list.
	(push_tinst_level_loc): Make it static, split decl into two
	args, adjust tests and initialization to cope with split
	lists, use freelist, adjust for refcounting.
	(push_tinst_level_loc): New wrapper with the old interface.
	(pop_tinst_level): Adjust for refcounting.
	(record_last_problematic_instantiation): Likewise.
	(reopen_tinst_level): Likewise.  Use new accessors.
	(instantiate_alias_template): Adjust for split list.
	(fn_type_unification): Likewise.
	(get_partial_spec_bindings): Likewise.
	(instantiate_pending_templates): Use new accessors.  Adjust
	for refcount.  Release pending_template to freelist.
	(instantiating_current_function_p): Use new accessors.

Modified:
    trunk/gcc/cp/ChangeLog
    trunk/gcc/cp/cp-tree.h
    trunk/gcc/cp/error.c
    trunk/gcc/cp/mangle.c
    trunk/gcc/cp/pt.c
Comment 22 Alexandre Oliva 2018-04-18 06:02:58 UTC
The patch is in.  I suppose there's still room for improvement, but this was as far as I could get for now.
Comment 23 Andrew Jeffery 2018-04-18 06:52:12 UTC
> The patch is in.  I suppose there's still room for improvement, but this was as far as I could get for now.

Thanks for your effort! I'll take it for a test drive when I get a moment, we hit this case still with one of our codebases.

Is there a chance the pach will be backported? I'd love to apply it to the 7.3.0 release (which we're using via Yocto).
Comment 24 Richard Biener 2018-04-18 09:31:07 UTC
While memory use improved the testcase in comment#10 still shows substantial increase in memory and time for each added init element:

1.24user 0.04system 0:01.28elapsed 100%CPU (0avgtext+0avgdata 170772maxresident)k
1.49user 0.07system 0:01.56elapsed 99%CPU (0avgtext+0avgdata 206396maxresident)k
0inputs+0outputs (0major+50914minor)pagefaults 0swaps
1.70user 0.05system 0:01.76elapsed 100%CPU (0avgtext+0avgdata 233904maxresident)k

so we improved some constant factor but didn't really address the underlying
issue.

Numbers from before the patch (on a loaded system, sorry) were

3.87user 0.10system 0:04.33elapsed 91%CPU (0avgtext+0avgdata 235760maxresident)k
0inputs+0outputs (0major+54479minor)pagefaults 0swaps
5.23user 0.16system 0:06.06elapsed 88%CPU (0avgtext+0avgdata 283860maxresident)k
0inputs+0outputs (0major+64971minor)pagefaults 0swaps
6.17user 0.15system 0:06.90elapsed 91%CPU (0avgtext+0avgdata 323796maxresident)k
0inputs+0outputs (0major+74942minor)pagefaults 0swaps
Comment 25 Andrew Jeffery 2018-04-19 00:22:10 UTC
I built and ran gcc master over the project that was causing us pain, and it still does. Here's the output:

$ /usr/bin/time make libapphandler_la-sensor-gen.lo
  CXX      libapphandler_la-sensor-gen.lo
TINST: ctor 2 dtor 2 max 1 end 0 long 2 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 1 reused: 1
TINST: ctor 178 dtor 178 max 1 end 0 long 176 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 1 reused: 2
TINST: ctor 589 dtor 589 max 12 end 0 long 411 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 2 reused: 14
TINST: ctor 876 dtor 844 max 33 end 32 long 411 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 2 reused: 47
TINST: ctor 1105 dtor 1055 max 52 end 50 long 411 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 2 reused: 67
TINST: ctor 5179 dtor 5057 max 124 end 122 long 4074 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 9 reused: 141
TINST: ctor 7298 dtor 7143 max 163 end 155 long 4074 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 9 reused: 182
TINST: ctor 10737 dtor 10391 max 347 end 346 long 4074 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 9 reused: 374
TINST: ctor 20890547 dtor 20889843 max 704 end 704 long 20879810 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 9 reused: 732
TINST: ctor 20912080 dtor 20911687 max 791 end 393 long 20879810 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 23 reused: 819
TINST: ctor 20917617 dtor 20917349 max 791 end 268 long 20879810 LST: ctor 0 dtor 0 max 0 end 0 long 0 REFS: 23 reused: 834
96.24user 2.97system 1:39.31elapsed 99%CPU (0avgtext+0avgdata 5058160maxresident)k
37592inputs+23512outputs (192major+1376999minor)pagefaults 0swaps
Comment 26 Alexandre Oliva 2018-04-19 04:31:04 UTC
I don't know that the so-called underlying problem is fixable.  Several template constructors are now involved that weren't in earlier versions of the standard, and so we have to instantiate them.  Even if we recover some of the temporary memory with the patch that went in, Jason's comment 15 still (and again) applies.
Which means this should now be marked as deferred.
Comment 27 Alexandre Oliva 2018-04-19 06:39:38 UTC
Author: aoliva
Date: Thu Apr 19 06:39:06 2018
New Revision: 259486

URL: https://gcc.gnu.org/viewcvs?rev=259486&root=gcc&view=rev
Log:
PR c++/80290
* cp-tree.h (tinst_level::free): Fix whitespace.

Modified:
    trunk/gcc/cp/ChangeLog
    trunk/gcc/cp/cp-tree.h
Comment 28 Jason Merrill 2018-06-25 20:08:23 UTC
This is definitely an algorithmic complexity issue.  With 4 levels of pair, we do 3886 substitutions of default arguments.  With 5 levels, we do 23326, a bit more than 5x as many.

The basic problem seems to be that for function parameters that have no deducible template parameters, we check convertibility as part of type deduction, and then we check convertibility again once we've actually formed the candidate.  And many of the pair constructor templates use the pair template parameters in their parameter types, which are non-deducible in this context.

With a simple example like

template <class T, class U>
struct A
{ 
  template <class U2 = U>
  A(T, U) { }
};

using A1 = A<int, int>;
using A2 = A<int, A1>;
using A3 = A<int, A2>;

A1 a1 = { 1, 1 };
A2 a2 = { 1, { 1, 1 } };
A3 a3 = { 1, { 1, { 1, 1 } } };

We end up considering the constructor template once for a1, three times (rather than twice) for a2, and seven times (rather than three times) for a3.

If the second parameter is U2, and therefore deduced, the number of constructor calls scales linearly as expected.
Comment 29 Jason Merrill 2018-06-27 03:00:12 UTC
Author: jason
Date: Wed Jun 27 02:59:38 2018
New Revision: 262172

URL: https://gcc.gnu.org/viewcvs?rev=262172&root=gcc&view=rev
Log:
	PR c++/80290 - memory-hog with std::pair.

	* pt.c (fn_type_unification): Add convs parameter.
	(check_non_deducible_conversion): Remember conversion.
	(check_non_deducible_conversions): New.  Do checks here.
	(type_unification_real): Not here.  Remove flags parm.
	* call.c (add_function_candidate): Make convs a parameter.
	Don't recalculate the conversion if it's already set.
	(add_template_candidate_real): Allocate convs here.
	(good_conversion, conv_flags): New.

When the std::pair constructors got more complex to handle, it aggravated a
preexisting algorithmic problem in template overload resolution:

As part of template argument deduction in a call, once we've deduced all
the template arguments we can but before we substitute them to form an
actual declaration, for any function parameters that don't involve template
parameters we need to check that it's possible to convert the argument to
the parameter type (wg21.link/cwg1391).

As a result, we end up calculating the conversion twice: once here, and
then again in add_function_candidate as part of normal overload resolution.
Normally this isn't a big deal, but when the argument is a multiply-nested
initializer list, doubling the conversion processing at each level leads to
combinatorial explosion.

The patch for trunk avoids the duplication by remembering the conversion we
calculate at deduction time and then reusing it in overload resolution
rather than calculating it again.

Modified:
    trunk/gcc/cp/ChangeLog
    trunk/gcc/cp/call.c
    trunk/gcc/cp/class.c
    trunk/gcc/cp/cp-tree.h
    trunk/gcc/cp/pt.c
Comment 30 Jason Merrill 2018-06-27 03:05:48 UTC
Author: jason
Date: Wed Jun 27 03:05:17 2018
New Revision: 262174

URL: https://gcc.gnu.org/viewcvs?rev=262174&root=gcc&view=rev
Log:
	PR c++/80290 - memory-hog with std::pair.

	* pt.c (type_unification_real): Skip non-dependent conversion
	check for a nested list argument.
	(braced_init_depth): New.

Modified:
    branches/gcc-8-branch/gcc/cp/ChangeLog
    branches/gcc-8-branch/gcc/cp/pt.c
Comment 31 Jason Merrill 2018-06-28 00:25:53 UTC
Author: jason
Date: Thu Jun 28 00:25:21 2018
New Revision: 262204

URL: https://gcc.gnu.org/viewcvs?rev=262204&root=gcc&view=rev
Log:
	PR c++/80290 - memory-hog with std::pair.

	* pt.c (type_unification_real): Skip non-dependent conversion
	check for a nested list argument.
	(braced_init_depth): New.

Modified:
    branches/gcc-7-branch/gcc/cp/ChangeLog
    branches/gcc-7-branch/gcc/cp/pt.c
Comment 32 Jason Merrill 2018-06-28 03:22:16 UTC
Fixed for 7.4/8.2/9.