66059 – make_integer_sequence should use a log(N) implementation

Bug 66059 - make_integer_sequence should use a log(N) implementation

Summary: make_integer_sequence should use a log(N) implementation

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	libstdc++ (show other bugs)
Version:	4.9.0

Importance:	P3 normal
Target Milestone:	5.5
Assignee:	Not yet assigned to anyone

URL:
Keywords:	compile-time-hog

Duplicates (1):	68642 (view as bug list)
Depends on:
Blocks:

Reported:	2015-05-07 19:53 UTC by rhalbersma
Modified:	2017-05-17 17:18 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-05-07 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description rhalbersma 2015-05-07 19:53:19 UTC

The following program fails to compile for g++ 4.9.0 and further 

#include <utility>

int main()
{
    constexpr auto N = 1024;
    using T = std::make_integer_sequence<int, N>;
    static_assert(T::size() == N, "");
}

fatal error: template instantiation depth exceeds maximum of 900 (use -ftemplate-depth= to increase the maximum)

Note that Clang with -stdlib=libc++ can compile integer sequences of up to one million for machines with enough memory, and around 65K for online compilers such as wandbox.com. Clang + libstdc++ requires -ftemplate-depth=N for N beyond 256. 

Changing g++ default template instantiation depth beyond 900 is not a viable solution. It simply masks the problem of excessive template instantiations. 

Instead, the implementation of make_integer_sequence<T, N> should be made to scale as log(N). See e.g. http://stackoverflow.com/a/17426611/819272

Comment 1 Jonathan Wakely 2015-05-07 21:43:14 UTC

Yep, been meaning to do this for some time.

Comment 2 rhalbersma 2015-11-16 17:42:12 UTC

Apparently VC and Clang have compiler hooks for this: https://www.reddit.com/r/cpp/comments/3t0nrc/true_story_efficient_packing_tales_of_c/cx26s02

Comment 3 Daniel Frey 2015-11-17 05:19:20 UTC

A better O(log N) library-only solution than the linked one is available at https://github.com/taocpp/sequences/blob/master/include/tao/seq/make_integer_sequence.hpp

Jonathan, we already exchanged some mails a few weeks ago about std::tuple improvements. The above implementation can be used if the compiler-intrinsic still needs time to materialize, the rest should be independently implementable. If you are interested, let me know how to proceed.

Comment 4 Jonathan Wakely 2015-11-17 11:39:10 UTC

I already have a better make_integer_sequence, but have been trying to write an intrinsic.

Comment 5 Jonathan Wakely 2015-11-17 19:54:43 UTC

Fixed on trunk

Comment 6 Jonathan Wakely 2015-11-17 19:55:04 UTC

Author: redi
Date: Tue Nov 17 19:54:33 2015
New Revision: 230496

URL: https://gcc.gnu.org/viewcvs?rev=230496&root=gcc&view=rev
Log:
PR libstdc++/66059 optimise _Build_index_tuple

	PR libstdc++/66059
	* include/std/utility (_Build_index_tuple): Optimise.

Modified:
    trunk/libstdc++-v3/ChangeLog
    trunk/libstdc++-v3/include/std/utility

Comment 7 Jonathan Wakely 2015-11-17 20:14:44 UTC

(In reply to Daniel Frey from comment #3)
> A better O(log N) library-only solution than the linked one is available at
> https://github.com/taocpp/sequences/blob/master/include/tao/seq/
> make_integer_sequence.hpp

The version I came up with is very close to Xeo's at stackoverflow. I tried something more like yours and it used a LOT more memory.

Here's what I tested:

#include <stddef.h>

namespace std
{
  template<size_t... _Indexes> struct _Index_tuple { };

#if DUP
  template<typename _ITup, int _Odd> struct _Itup_dup;

  template<size_t... _Ind>
    struct _Itup_dup<_Index_tuple<_Ind...>, 0>
    {
      static constexpr size_t _Nm = sizeof...(_Ind);
      using __type = _Index_tuple<_Ind..., _Nm + _Ind...>;
    };

  template<size_t... _Ind>
    struct _Itup_dup<_Index_tuple<_Ind...>, 1>
    {
      static constexpr size_t _Nm = sizeof...(_Ind);
      using __type = _Index_tuple<_Ind..., _Nm + _Ind..., 2 * _Nm>;
    };

  // Builds an _Index_tuple<0, 1, 2, ..., _Num-1>.
  template<size_t _Num>
    struct _Build_index_tuple
    {
      using __type = typename _Itup_dup<
	typename _Build_index_tuple<_Num / 2>::__type, _Num % 2>::__type;
    };
#else
  template<typename _ITup, typename _Jtup> struct _Itup_cat;

  template<size_t... _Ind1, size_t... _Ind2>
    struct _Itup_cat<_Index_tuple<_Ind1...>, _Index_tuple<_Ind2...>>
    {
      using __type = _Index_tuple<_Ind1..., (_Ind2 + sizeof...(_Ind1))...>;
    };

  template<size_t _Num>
    struct _Build_index_tuple
    : _Itup_cat<typename _Build_index_tuple<_Num / 2>::__type,
                typename _Build_index_tuple<_Num - _Num / 2>::__type>
    { };
#endif

  template<>
    struct _Build_index_tuple<1> { typedef _Index_tuple<0> __type; };

  template<>
    struct _Build_index_tuple<0> { typedef _Index_tuple<> __type; };

}

template<typename T, typename U> struct is_same { static const bool value = false; };
template<typename T> struct is_same<T, T> { static const bool value = true; };

int main()
{
    constexpr auto N = 1024;
    using T = std::_Build_index_tuple<N>::__type;
    static_assert(sizeof(T), "");
    static_assert( is_same<std::_Build_index_tuple<10>::__type, std::_Index_tuple<0, 1, 2, 3, 4, 5, 6, 7, 8, 9>>::value, "");
    static_assert( is_same<std::_Build_index_tuple<9>::__type, std::_Index_tuple<0, 1, 2, 3, 4, 5, 6, 7, 8>>::value, "");

}



tmp$ g++11 it.cc -ftime-report 

Execution times (seconds)
 phase setup             :   0.01 (11%) usr   0.00 ( 0%) sys   0.02 (20%) wall    1396 kB (15%) ggc
 phase parsing           :   0.07 (78%) usr   0.01 (100%) sys   0.08 (80%) wall    7235 kB (80%) ggc
 phase opt and generate  :   0.01 (11%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     421 kB ( 5%) ggc
 |name lookup            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 (10%) wall     111 kB ( 1%) ggc
 |overload resolution    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 (10%) wall      66 kB ( 1%) ggc
 garbage collection      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 (10%) wall       0 kB ( 0%) ggc
 template instantiation  :   0.07 (78%) usr   0.01 (100%) sys   0.07 (70%) wall    6763 kB (75%) ggc
 initialize rtl          :   0.01 (11%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       8 kB ( 0%) ggc
 TOTAL                 :   0.09             0.01             0.10               9066 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.
tmp$ g++11 it.cc -ftime-report -DDUP

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.01 (33%) sys   0.02 ( 2%) wall    1396 kB ( 3%) ggc
 phase parsing           :   0.91 (99%) usr   0.02 (67%) sys   0.93 (98%) wall   45729 kB (96%) ggc
 phase finalize          :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 garbage collection      :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 preprocessing           :   0.01 ( 1%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      30 kB ( 0%) ggc
 parser struct body      :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      22 kB ( 0%) ggc
 template instantiation  :   0.89 (97%) usr   0.02 (67%) sys   0.91 (96%) wall   45262 kB (95%) ggc
 TOTAL                 :   0.92             0.03             0.95              47577 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

As you can see, defining DUP makes it 10 times slower and using 5 times as much memory. Looking at your implementation I see only one relevant difference, you pass N/2 as a template argument instead of calculating it with sizeof... i.e.

  template<typename _ITup, size_t N, int _Odd> struct _Itup_dup;

  template<size_t... _Ind, size_t _Nm>
    struct _Itup_dup<_Index_tuple<_Ind...>, _Nm, 0>
    {
      using __type = _Index_tuple<_Ind..., _Nm + _Ind...>;
    };

  template<size_t... _Ind, size_t _Nm>
    struct _Itup_dup<_Index_tuple<_Ind...>, _Nm, 1>
    {
      using __type = _Index_tuple<_Ind..., _Nm + _Ind..., 2 * _Nm>;
    };

  // Builds an _Index_tuple<0, 1, 2, ..., _Num-1>.
  template<size_t _Num>
    struct _Build_index_tuple
    : typename _Itup_dup<
	typename _Build_index_tuple<_Num / 2>::__type, _Num / 2, _Num % 2>
    { };

and indeed this is even faster than what I just committed. I didn't realise sizeof... would have such an impact.

My one can be optimised to be almost as fast by avoiding sizeof... in _Itup_cat.

Comment 8 Daniel Frey 2015-11-17 21:03:40 UTC

(In reply to Jonathan Wakely from comment #7)
> (In reply to Daniel Frey from comment #3)
> > A better O(log N) library-only solution than the linked one is available at
> > https://github.com/taocpp/sequences/blob/master/include/tao/seq/
> > make_integer_sequence.hpp
> 
> The version I came up with is very close to Xeo's at stackoverflow. I tried
> something more like yours and it used a LOT more memory.
> 
> Here's what I tested:
> 
> [...snip...]
> 
> As you can see, defining DUP makes it 10 times slower and using 5 times as
> much memory. Looking at your implementation I see only one relevant
> difference, you pass N/2 as a template argument instead of calculating it
> with sizeof... i.e.

Yeah, avoiding sizeof... is crucial. I don't really know why, I just know it is. And only for GCC, Clang doesn't care if you use sizeof... or pass in the size explicitly. This behavior from GCC might be worth a look from the compiler people, as there is no reason why it should use so much more memory.

> and indeed this is even faster than what I just committed. I didn't realise
> sizeof... would have such an impact.
> 
> My one can be optimised to be almost as fast by avoiding sizeof... in
> _Itup_cat.

I guess my version (without the sizeof...) is still better the larger the numbers get, but it won't make a real difference in practice as it's kind of unrealistic to generate index sequences with 100000 or more elements.

Comment 9 Adrian Wielgosik 2015-11-18 12:33:59 UTC

I did a simple comparison of the improved libstdc++ implementation and Daniel's one - the difference in time and resource usage is still significant.

$ cat with_stdlib.cpp

#include <utility>
std::make_index_sequence<5000> thing;

$ cat with_tao.cpp

#include <make_integer_sequence.hpp>
tao::seq::make_index_sequence<5000> thing;

$ ./g++trunk with_stdlib.cpp -c -ftime-report

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1386 kB ( 1%) ggc
 phase parsing           :   0.97 (100%) usr   0.01 (100%) sys   0.97 (98%) wall  118230 kB (99%) ggc
 phase finalize          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall       0 kB ( 0%) ggc
 |name lookup            :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall     440 kB ( 0%) ggc
 |overload resolution    :   0.03 ( 3%) usr   0.00 ( 0%) sys   0.03 ( 3%) wall     335 kB ( 0%) ggc
 parser (global)         :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1459 kB ( 1%) ggc
 parser struct body      :   0.02 ( 2%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall    1049 kB ( 1%) ggc
 parser inl. func. body  :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 1%) wall      37 kB ( 0%) ggc
 template instantiation  :   0.95 (98%) usr   0.01 (100%) sys   0.94 (95%) wall  115432 kB (96%) ggc
 TOTAL                 :   0.97             0.01             0.99             119627 kB
Internal checks disabled; compiler is not suited for release.
Configure with --enable-checking=release to enable checks.

$ ./g++trunk -I sequences/include/tao/seq/ with_tao.cpp -c -ftime-report

Execution times (seconds)
 phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1386 kB (16%) ggc
 phase parsing           :   0.10 (100%) usr   0.01 (100%) sys   0.11 (100%) wall    7413 kB (84%) ggc
 |name lookup            :   0.01 (10%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     450 kB ( 5%) ggc
 preprocessing           :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 9%) wall     192 kB ( 2%) ggc
 parser (global)         :   0.01 (10%) usr   0.00 ( 0%) sys   0.01 ( 9%) wall    1483 kB (17%) ggc
 parser struct body      :   0.02 (20%) usr   0.00 ( 0%) sys   0.02 (18%) wall    1063 kB (12%) ggc
 template instantiation  :   0.07 (70%) usr   0.01 (100%) sys   0.07 (64%) wall    4563 kB (52%) ggc
 TOTAL                 :   0.10             0.01             0.11               8811 kB
Internal checks disabled; compiler is not suited for release.
Configure with --enable-checking=release to enable checks.

Comment 10 Jonathan Wakely 2015-11-18 20:29:33 UTC

(In reply to Adrian Wielgosik from comment #9)
>  TOTAL                 :   0.97             0.01             0.99           
> 119627 kB

By passing in the length of the first sequence instead of using sizeof... the std::make_integer_sequence result comes down to:

 TOTAL                 :   0.23             0.01             0.25              15114 kB

Which is very reasonable, and within an order of magnitude of Daniel's, which I think is good enough (especially if we're going to get a compiler intrinsic eventually anyway).

But rather than making that change I think the right thing to do is to fix the front-end so that sizeof... doesn't make such a big difference and we don't have to jump through hoops to avoid it. I'll file a PR for that tomorrow.

Comment 11 Jonathan Wakely 2015-11-18 20:49:46 UTC

(In reply to Jonathan Wakely from comment #7)
> The version I came up with is very close to Xeo's at stackoverflow. I tried
> something more like yours and it used a LOT more memory.
> 
> Here's what I tested:
> 
> #include <stddef.h>
> 
> namespace std
> {
>   template<size_t... _Indexes> struct _Index_tuple { };
> 
> #if DUP
>   template<typename _ITup, int _Odd> struct _Itup_dup;
> 
>   template<size_t... _Ind>
>     struct _Itup_dup<_Index_tuple<_Ind...>, 0>
>     {
>       static constexpr size_t _Nm = sizeof...(_Ind);
>       using __type = _Index_tuple<_Ind..., _Nm + _Ind...>;
>     };

It turns out that the huge cost of this version is not due to sizeof... it's due to the static variable, _Nm

Avoiding sizeof... helps too, but not as much as eliminating the variable and writing it as:

      using __type = _Index_tuple<_Ind..., sizeof...(_Ind) + _Ind...>;

Comment 12 Jonathan Wakely 2015-11-18 21:09:22 UTC

I've opened PR 68422 for the sizeof... inefficiency.

Comment 13 Jonathan Wakely 2015-12-01 22:37:37 UTC

*** Bug 68642 has been marked as a duplicate of this bug. ***

Comment 14 rhalbersma 2015-12-19 08:57:56 UTC

Ping to get this merged into the upcoming 5.4 release.

Comment 15 Jonathan Wakely 2017-05-17 17:18:28 UTC

Fixed for 5.5 too.

Comment 16 Jonathan Wakely 2017-05-17 17:18:41 UTC

Author: redi
Date: Wed May 17 17:18:07 2017
New Revision: 248163

URL: https://gcc.gnu.org/viewcvs?rev=248163&root=gcc&view=rev
Log:
PR libstdc++/66059 optimise _Build_index_tuple

Backport from mainline
2015-11-17  Jonathan Wakely  <jwakely@redhat.com>

	PR libstdc++/66059
	* include/std/utility (_Build_index_tuple): Optimise.

Modified:
    branches/gcc-5-branch/libstdc++-v3/ChangeLog
    branches/gcc-5-branch/libstdc++-v3/include/std/utility