Bug 62056 - Long compile times with large tuples
Summary: Long compile times with large tuples
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 5.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: patch
Depends on:
Blocks:
 
Reported: 2014-08-07 20:05 UTC by Agustín Bergé
Modified: 2014-10-07 11:25 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-09-30 00:00:00


Attachments
draft flat tuple implementation (5.83 KB, text/plain)
2014-08-07 20:05 UTC, Agustín Bergé
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Agustín Bergé 2014-08-07 20:05:24 UTC
Created attachment 33270 [details]
draft flat tuple implementation

The recursive implementation of `std::tuple` causes noticeable longer compilation times and memory usage than a non-recursive implementation would. Furthermore, with a max template depth of 256 (Clang's default), the following test case results in a compilation error when attempting to use `make_tuple`  with more than 17 arguments:

    #include <tuple>

    struct T {} t;

    int main()
    {
      auto tt = std::make_tuple(
        t, t, t, t, t, t, t, t,
        t, t, t, t, t, t, t, t,
        t, t
        );
      return sizeof(std::get<15>(tt));
    }

The times reported for building this example with one less argument to `make_tuple` are:

 TOTAL                 :   0.45             34300 kB
 TOTAL (-O2)           :   0.37             28160 kB

With a flat implementation (draft attached), the times reported are:

 TOTAL                 :   0.31             21245 kB
 TOTAL (-O2)           :   0.21             18056 kB

With the default template depth limit of 900, the limit is reached with more than 63 elements for the recursive implementation, and more than 253 elements for the flat implementation (reached while building an `index_sequence`).

The times reported for building a similar example but with 63 elements are:

 TOTAL (recursive)     :   0.96            102293 kB
 TOTAL (recursive -O2) :   0.58             77839 kB
 TOTAL (flat)          :   0.27             33737 kB
 TOTAL (flat -O2)      :   0.20             27213 kB
Comment 1 Piotr Dziwinski 2014-09-28 12:48:12 UTC
I can confirm this bug report as I also noticed longer compilation times with GCC's implementation of `std::tuple`, specifically when switching from `std::tr1::tuple` to `std::tuple` (problem originally reported in Google Test library: https://groups.google.com/forum/#!topic/googletestframework/TGrf26S65n0).

Using the same sample code as above, when I replace `std::tuple` to `std::tr1::tuple`, I get better compilation time (tested with GCC 4.9.1):

std::tuple       0m0.388s
std::tr1::tuple  0m0.134s

This difference is small in such small example, but it has noticeable impact on larger projects (total compilation time increase counted in minutes).

I would also second the proposal to fix this issue by implementing flat version of std::tuple. Perhaps the existing std::tr1::tuple implementation can be re-used here?
Comment 2 Jonathan Wakely 2014-09-29 08:36:34 UTC
(In reply to Piotr Dziwinski from comment #1)
> I would also second the proposal to fix this issue by implementing flat
> version of std::tuple. Perhaps the existing std::tr1::tuple implementation
> can be re-used here?

GCC's std::tr1:tuple is not a flat implementation, and does not conform to the C++11 requirements, so no, that's not an option.
Comment 3 Piotr Dziwinski 2014-09-29 14:21:07 UTC
(In reply to Jonathan Wakely from comment #2)
> (In reply to Piotr Dziwinski from comment #1)
> > I would also second the proposal to fix this issue by implementing flat
> > version of std::tuple. Perhaps the existing std::tr1::tuple implementation
> > can be re-used here?
> 
> GCC's std::tr1:tuple is not a flat implementation, and does not conform to
> the C++11 requirements, so no, that's not an option.

Oh, you're right - somehow I got convinced that std::tr1::tuple was a flat implementation since it compiles so much faster. But this raises a question - why do the two recursive implementations have such different compile times? Perhaps by analyzing the differences between std::tr1::tuple and std::tuple, we can learn something to our advantage?
Comment 4 Jonathan Wakely 2014-09-29 14:43:40 UTC
tr1::tuple doesn't support perfect-forwarding or move semantics

tr1::tuple doesn't support uses-allocator construction

tr1::tuple doesn't support 'final' classes

tr1::tuple doesn't have correct exception specifications

tr1::tuple doesn't prevent implicit conversions that would use explicit constructors

tr1::tuple doesn't support tuple concatenation

If you can add all those features to the <tr1/tuple> implementation so that it meets the C++11 requirements and it still compiles faster then I'd be interested in your analysis of the remaining differences. Otherwise I'm going to assume the difference is because the <tuple> header contains more than twice as many lines of code and several additional features.
Comment 5 Piotr Dziwinski 2014-09-30 06:24:43 UTC
(In reply to Jonathan Wakely from comment #4)
> tr1::tuple doesn't support perfect-forwarding or move semantics
> 
> tr1::tuple doesn't support uses-allocator construction
> 
> tr1::tuple doesn't support 'final' classes
> 
> tr1::tuple doesn't have correct exception specifications
> 
> tr1::tuple doesn't prevent implicit conversions that would use explicit
> constructors
> 
> tr1::tuple doesn't support tuple concatenation
> 
> If you can add all those features to the <tr1/tuple> implementation so that
> it meets the C++11 requirements and it still compiles faster then I'd be
> interested in your analysis of the remaining differences. Otherwise I'm
> going to assume the difference is because the <tuple> header contains more
> than twice as many lines of code and several additional features.

Ok, I understand it now. I was speaking from only somewhat experienced user perspective and I did not realise the deeper implications of standard compliance.

Just for the record, I did some testing and found two important factors here are:
 - dependency to <array> (to enable it in tuple concatenation) - (change in compile time 0.357s -> 0.231s),
 - allocator constructors - (change 0.231s -> 0.185s).

So in the end please ignore my interruption with `std::tr1::tuple`. It seems the recursive version of `std::tuple` is not going to be optimized easily and the new flat implementation is the way to go here.
Comment 6 Agustín Bergé 2014-09-30 13:34:48 UTC
(In reply to Piotr Dziwinski from comment #5)
> It seems
> the recursive version of `std::tuple` is not going to be optimized easily
> and the new flat implementation is the way to go here.

Well, not necessarily, It's not `std::tuple` which is at fault, but the compiler (gcc and every other). Note that this issue was moved from libstdc++ to c++, so that's promising!
Comment 7 Manuel López-Ibáñez 2014-09-30 14:33:52 UTC
(In reply to Agustín Bergé from comment #6)
> Well, not necessarily, It's not `std::tuple` which is at fault, but the
> compiler (gcc and every other). Note that this issue was moved from
> libstdc++ to c++, so that's promising!

I don't see any analysis of what the "bug" actually is. If there is a bug in g++, you should be able to trigger it with a testcase not including <tuple>.

On the other hand, if you want to provide an alternative implementation of <tuple>, you should join libstc++: https://gcc.gnu.org/onlinedocs/libstdc++/manual/appendix_contributing.html
Comment 8 Agustín Bergé 2014-09-30 15:15:09 UTC
(In reply to Manuel López-Ibáñez from comment #7)
> (In reply to Agustín Bergé from comment #6)
> > Well, not necessarily, It's not `std::tuple` which is at fault, but the
> > compiler (gcc and every other). Note that this issue was moved from
> > libstdc++ to c++, so that's promising!
> 
> I don't see any analysis of what the "bug" actually is. If there is a bug in
> g++, you should be able to trigger it with a testcase not including <tuple>.
> 
> On the other hand, if you want to provide an alternative implementation of
> <tuple>, you should join libstc++:
> https://gcc.gnu.org/onlinedocs/libstdc++/manual/appendix_contributing.html

This issue started targeting libstdc++, and later someone reassigned it to c++. The recursive tuple implementation in libstdc++ causes our use of large tuples to more than double compilation time when compared to a flat implementation. Hence the inclusion of <tuple>.

There is no bug, but a QoI issue that makes `std::tuple` use prohibitive for large tuples. Whether the QoI issue belongs with libc++ or g++, I am not entitled to say. The underlying issue is a g++ issue, but there are known compile-time efficient implementations of `std::tuple`. Of course, a solution in the compiler would benefit a wider audience so it would be preferable. And again, no solution would be fine as well given that this is QoI.

As for a "bug" analysis (which is a requirement that I was unaware of), I haven't put this particular combination of implementation and compiler under a profiler. I have done however enough experimentation on similar code bases and the results are consistent, so while I might be off in the specific details it should still give you a strong hint of where to look. The issue boils down to name mangling and lookup: the longer the name the more memory consumption and cpu usage. For a recursive variadic implementation, instantiating a tuple of N elements with aggregated mangled length M results in the mangling of N bases with average mangled length M/2. For a flat variadic implementation, it results in the mangling of N bases of average mangled length M/N instead. As the tuple gets large and the names get long (and mangled names get very long very fast), the recursive implementation turns into a compilation time (and memory!) hog way faster than a flat implementation does.

Please let me know if I can help you in any other way.
Comment 9 Manuel López-Ibáñez 2014-09-30 17:38:44 UTC
(In reply to Agustín Bergé from comment #8)
> Please let me know if I can help you in any other way.

Thanks for the detailed explanation.


As for the g++ issue, then a minimal testcase that does not include the whole of <tuple> might be useful to see what could be speed up: See https://gcc.gnu.org/bugs/minimize.html The testcase itself might generate code programmatically using macros, for example, but less code without any headers is easier to analyze.

I cannot say if the libstdc++ implementation could be better, but the way to prove it is simply to submit a new implementation and convince the maintainers that it is better with numbers. I'm sure they will be delighted to receive such contribution. But a draft is not enough. The new implementation needs to pass all testcases in the testsuite.
Comment 10 Agustín Bergé 2014-09-30 20:19:55 UTC
(In reply to Manuel López-Ibáñez from comment #9)
> I cannot say if the libstdc++ implementation could be better

From a compile-time performance point of view it is, check the implementation attached to this issue.

> but the way to
> prove it is simply to submit a new implementation and convince the
> maintainers that it is better with numbers. I'm sure they will be delighted
> to receive such contribution. But a draft is not enough. The new
> implementation needs to pass all testcases in the testsuite.

Such implementation and numbers were initially attached to this issue. The draft implementation is feature complete (modulo bugs) and is implemented with the minimal possible number of changes from the current implementation. The entire functionality is retained, the layout is switch to left-to-right (instead of reversed), I have not analyzed codegen.

I was directly instructed by libstdc++ developers to submit this issue, I was told they were not aware of the poor compile-time performance of tuple. I submitted the issue as instructed. I decided to add the proof-of-concept implementation so that people could understand the implications by simply replacing a header. I added the analysis of the underlying issue upon your request.

I am not interested in pursuing this further without as much as a hint that the compile-time performance issue would even be considered. Remember, this is QoI, so completely ignoring this issue is a reasonable resolution. After all, this is a known issue with known solutions (or I should say workarounds, as the underlying issue is not addressed), and there are other concerns to keep in mind besides compile-time performance.
Comment 11 Manuel López-Ibáñez 2014-10-01 20:35:14 UTC
Jonathan, what should we do about this? Is this implementation better than the one in libstdc++?

If so, what steps are needed to get this integrated?
Comment 12 Manuel López-Ibáñez 2014-10-01 20:37:27 UTC
I think this is a libstdc++ issue for now. It would be nice to have a simpler testcase for the g++ compile time problem, then we can open a different independent report.
Comment 13 Manuel López-Ibáñez 2014-10-01 23:36:38 UTC
Thus, I guess this is confirmed, just waiting for a review.
Comment 14 Jonathan Wakely 2014-10-02 08:54:02 UTC
(In reply to Manuel López-Ibáñez from comment #11)
> Jonathan, what should we do about this? Is this implementation better than
> the one in libstdc++?

I don't know, I haven't looked.

Agustin, do you have a copyright assignment?
Comment 15 Agustín Bergé 2014-10-05 20:17:49 UTC
(In reply to Jonathan Wakely from comment #14)
> Agustin, do you have a copyright assignment?

I do not have one. The attached <tuple> is derived from libstdc++'s <tuple> and I have left the license in place, wouldn't that be enough? I did not intend for the attachment to be a patch (it is not a patch, I did not consider coding styles, I did not run tests, etc), but merely a proof of concept implementation so that others could measure the compile-time performance win of a flat implementation.
Comment 16 Jonathan Wakely 2014-10-07 11:25:01 UTC
(In reply to Agustín Bergé from comment #15)
> (In reply to Jonathan Wakely from comment #14)
> > Agustin, do you have a copyright assignment?
> 
> I do not have one. The attached <tuple> is derived from libstdc++'s <tuple>
> and I have left the license in place, wouldn't that be enough?

No. You are free to release your derived work under the terms of the GPL, but that is not sufficient for the code to be accepted into the FSF repository.

> I did not
> intend for the attachment to be a patch (it is not a patch, I did not
> consider coding styles, I did not run tests, etc), but merely a proof of
> concept implementation so that others could measure the compile-time
> performance win of a flat implementation.

And that's fine. But to respond to Manu in comment 11 and comment 13, we need more than just a review, since the proof of concept attached here cannot be used without the usual legal prerequisites being met. Another option would be for someone else to provide a flat implementation, but again that is more than just a review.