Bug 108487 - [12/13 Regression] ~20-30x slowdown in populating std::vector from std::ranges::iota_view
Summary: [12/13 Regression] ~20-30x slowdown in populating std::vector from std::range...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 12.2.0
: P2 normal
Target Milestone: 14.3
Assignee: Jonathan Wakely
URL:
Keywords: missed-optimization
Depends on:
Blocks: std::vector
  Show dependency treegraph
 
Reported: 2023-01-20 22:20 UTC by Mark Bourgeault
Modified: 2025-10-10 23:17 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work: 10.3.0, 14.2.1, 15.0
Known to fail: 10.4.0, 11.1.0, 12.4.0, 13.3.0, 14.2.0
Last reconfirmed: 2023-03-27 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Bourgeault 2023-01-20 22:20:32 UTC
Using -std=c++20 -O3, comparing gcc 12.2 vs. gcc 10.3:
 * fn2 is 20-30x slower on gcc 12.2 (i.e. 2000-3000% more) 
 * fn1 is ~20% slower on gcc 12.2 

This test was run on an 52 core Intel Xeon Gold 6278C CPU.  Tests on www.godbolt.org directionally align with these findings.  It seems the slowdown was introduced in 10.4 & 11.1.  The trunk has identical performance to 12.2.

#include <vector>
#include <ranges>
#include <ctime>
#include <iostream>

__attribute__((noinline)) std::vector<int> fn1(int n)
{
    auto v = std::vector<int>(n);
    for(int i = 0; i != n; ++i)
        v[i] = i;
    return v;
}

__attribute__((noinline)) std::vector<int> fn2(int n)
{
    auto rng = std::ranges::iota_view{0, n};
    return std::vector<int>{rng.begin(), rng.end()};
}

int main() {
    int n = 100'000;
    int times = 100'000;

    auto t0 = std::clock();
    for (int i = 0; i < times; ++i)
        fn1(n);            
    auto t1 = std::clock();
    for (int i = 0; i < times; ++i)
        fn2(n);            
    auto t2 = std::clock();
    std::cout << t1 - t0 << '\n';
    std::cout << t2 - t1 << '\n';
    return 0;
}

P.S. 20% slowdown for a common vector population is still significant IMO.  I am not sure that qualifies as a bug.  I did not file one on account of the 'fn1' slowdown.
Comment 1 Alexander Monakov 2023-01-21 11:48:06 UTC
Regarding fn1, would you mind re-running the test on your Xeon CPU with fn2 removed from the source code and -falign-loops=32 added to gcc command line? For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary.

Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable.

Regarding fn2, this may partially be a library issue, compiling preprocessed source from gcc-10.4 using gcc-10.2 also exhibits the problem. Inner loop becomes significantly more complicated. Bisecting should be helpful.
Comment 2 Mark Bourgeault 2023-01-21 13:32:53 UTC
>> For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary
Yes, it does appear that this is the explanation for the difference.  Here are the full results:

original code
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms

fn1 only
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 
>> Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable.
Yes, I know.  I really only used godbolt to reach the conclusion that the issue still exists on trunk.
Comment 3 Alexander Monakov 2023-01-21 15:38:31 UTC
libstdc++ uses a less efficient specialization of _M_range_initialize.

With 10.2 headers, we use

      template<typename _ForwardIterator>
 void
 _M_range_initialize(_ForwardIterator __first, _ForwardIterator __last,
       std::forward_iterator_tag)
 {
   const size_type __n = std::distance(__first, __last);
   this->_M_impl._M_start
     = this->_M_allocate(_S_check_init_len(__n, _M_get_Tp_allocator()));
   this->_M_impl._M_end_of_storage = this->_M_impl._M_start + __n;
   this->_M_impl._M_finish =
     std::__uninitialized_copy_a(__first, __last,
     this->_M_impl._M_start,
     _M_get_Tp_allocator());
 }

but with 10.4 headers, we use

      template<typename _InputIterator>
 void
 _M_range_initialize(_InputIterator __first, _InputIterator __last,
       std::input_iterator_tag)
 {
   try {
     for (; __first != __last; ++__first)

       emplace_back(*__first);

   } catch(...) {
     clear();
     throw;
   }
 }
Comment 4 Jonathan Wakely 2023-01-21 23:45:03 UTC
That's because iota_view's iterator is only an input iterator in gcc-10, as a result of r10-9796-g1cb39945993c89
Comment 5 Jonathan Wakely 2023-01-21 23:50:24 UTC
See https://cplusplus.github.io/LWG/issue3291 for the rationale.

See PR 100070 for suggestions to deal with such iterators better.
Comment 6 Jonathan Wakely 2023-01-21 23:50:56 UTC
(In reply to Jonathan Wakely from comment #4)
> That's because iota_view's iterator is only an input iterator in gcc-10,

I meant to say gcc-10.4 there.
Comment 7 Mark Bourgeault 2023-01-22 03:07:06 UTC
>> See PR 100070 for suggestions to deal with such iterators better.

Unless I'm missing something, there's nothing in that PR that a *user* can do to achieve the gcc 10.3 performance w/ std::iota_view.
Comment 8 Mark Bourgeault 2023-01-22 10:33:50 UTC
What about something like this?

#if __cplusplus >= 201709L
      template<typename _InputIterator,
	       typename = std::_RequireInputIter<_InputIterator>>
	vector(_InputIterator __first, _InputIterator __last,
	       const allocator_type& __a = allocator_type())
	: _Base(__a)
	{
          struct PseudoRange { _InputIterator begin(); _InputIterator end(); };
          if constexpr (std::ranges::random_access_range<PseudoRange>) {
            _M_range_initialize(__first, __last,
			      std::random_access_iterator_tag());
          }
          else if constexpr (std::ranges::forward_range<PseudoRange>) {
            _M_range_initialize(__first, __last, std::forward_iterator_tag());
          }
          else {
	    _M_range_initialize(__first, __last,
			      std::__iterator_category(__first));
	}
#else
  ...
#endif

-----

Here is a PoC:

#include <vector>
#include <ranges>
#include <iostream>

template<typename I>
void test(I first, I last) {
    struct PseudoRange { I begin(); I end(); };
    if constexpr (std::ranges::random_access_range<PseudoRange>) {
        std::cout << "RA\n"; 
    }
    else if constexpr (std::ranges::forward_range<PseudoRange>) {
        std::cout << "F\n"; 
    }
    else  {
        std::cout << "I\n"; 
    }
}

int main() {
    auto rng = std::ranges::iota_view{0, 10} | std::views::transform([](int i) { return i*i; });
    test(rng.begin(), rng.end());
    auto rng2 = std::ranges::iota_view{0, 10} | std::views::filter([](int i) { return i%2; });
    test(rng2.begin(), rng2.end());
    return 0;
}
Comment 9 Jonathan Wakely 2023-01-22 17:14:34 UTC
I think we just want to dispatch on iterator concept not iterator category.
Comment 10 Richard Biener 2023-07-07 10:44:45 UTC
GCC 10 branch is being closed.
Comment 11 Richard Biener 2024-07-19 13:19:08 UTC
GCC 11 branch is being closed.
Comment 12 Jonathan Wakely 2025-03-24 18:40:12 UTC
We still don't have code to make this more efficient (that would be Bug 100070) but the correct way to construct a vector from iota_view is C++23's std::ranges::to

Adding this function:

__attribute__((noinline)) std::vector<int> fn3(int n)
{
  return std::ranges::to<std::vector<int>>(std::ranges::iota_view{0, n});
}

And then calling it from main() I get:

3678697
35870273
3000809

So this is significantly better than the naive loop.
Comment 13 Jonathan Wakely 2025-03-25 16:37:52 UTC
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2025-March/679257.html

With that change the code in comment 12 gives:

3699085
2928473
2922233

These numbers and the ones in comment 12 both produced with trunk and -O2.
Comment 14 Jonathan Wakely 2025-03-25 17:47:36 UTC
Fixed on trunk so far.
Comment 15 GCC Commits 2025-03-25 17:47:58 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:e200f53a5556516ec831e6b7a34aaa0f10a4ab0a

commit r15-8904-ge200f53a5556516ec831e6b7a34aaa0f10a4ab0a
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Tue Mar 25 13:24:08 2025 +0000

    libstdc++: Optimize std::vector construction from input iterators [PR108487]
    
    LWG 3291 make std::ranges::iota_view's iterator have input_iterator_tag
    as its iterator_category, even though it satisfies the C++20
    std::forward_iterator concept. This means that the traditional
    std::vector::vector(InputIterator, InputIterator) constructor treats
    iota_view iterators as input iterators, because it only understands the
    C++17 iterator requirements, not the C++20 iterator concepts. This
    results in a loop that calls emplace_back for each individual element of
    the iota_view, requiring the vector to reallocate repeatedly as the
    values are inserted. This makes it unnecessarily slow to construct a
    vector from an iota_view.
    
    This change adds a new _M_range_initialize_n function for initializing a
    vector from a range (which doesn't have to be common) and a size. This
    new function can be used by vector(InputIterator, InputIterator) and
    vector(from_range_t, R&&) when std::ranges::distance can be used to get
    the size. It can also be used by the _M_range_initialize overload that
    gets the size for a Cpp17ForwardIterator pair using std::distance, and
    by the vector(initializer_list) constructor.
    
    With this new function constructing a std::vector from iota_view does
    a single allocation of the correct size and so doesn't need to
    reallocate in a loop.
    
    Previously the _M_range_initialize overload for Cpp17ForwardIterator was
    using a local RAII _Guard_alloc object to own the storage, but that was
    redundant. The _Vector_base can own the storage right away, and its
    destructor will deallocate it if _M_range_initialize exits via an
    exception.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/108487
            * include/bits/stl_vector.h (vector(initializer_list)): Call
            _M_range_initialize_n instead of _M_range_initialize.
            (vector(InputIterator, InputIterator)): Use _M_range_initialize_n
            for C++20 sized sentinels and forward iterators.
            (vector(from_range_t, R&&)): Use _M_range_initialize_n for sized
            ranges and forward ranges.
            (vector::_M_range_initialize(FwIt, FwIt, forward_iterator_tag)):
            Likewise.
            (vector::_M_range_initialize_n): New function.
            * testsuite/23_containers/vector/cons/108487.cc: New test.
    
    Reviewed-by: Tomasz KamiÅski <tkaminsk@redhat.com>
Comment 16 GCC Commits 2025-03-31 10:37:22 UTC
The releases/gcc-14 branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:5a830c6cd54d376ee23043381c6ed761559e1e08

commit r14-11483-g5a830c6cd54d376ee23043381c6ed761559e1e08
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Tue Mar 25 13:24:08 2025 +0000

    libstdc++: Optimize std::vector construction from input iterators [PR108487]
    
    LWG 3291 make std::ranges::iota_view's iterator have input_iterator_tag
    as its iterator_category, even though it satisfies the C++20
    std::forward_iterator concept. This means that the traditional
    std::vector::vector(InputIterator, InputIterator) constructor treats
    iota_view iterators as input iterators, because it only understands the
    C++17 iterator requirements, not the C++20 iterator concepts. This
    results in a loop that calls emplace_back for each individual element of
    the iota_view, requiring the vector to reallocate repeatedly as the
    values are inserted. This makes it unnecessarily slow to construct a
    vector from an iota_view.
    
    This change adds a new _M_range_initialize_n function for initializing a
    vector from a range (which doesn't have to be common) and a size. This
    new function can be used by vector(InputIterator, InputIterator) when
    std::ranges::distance can be used to get the size. It can also be used
    by the _M_range_initialize overload that gets the size for a
    Cpp17ForwardIterator pair using std::distance, and by the
    vector(initializer_list) constructor.
    
    With this new function constructing a std::vector from iota_view does
    a single allocation of the correct size and so doesn't need to
    reallocate in a loop.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/108487
            * include/bits/stl_vector.h (vector(initializer_list)): Call
            _M_range_initialize_n instead of _M_range_initialize.
            (vector(InputIterator, InputIterator)): Use _M_range_initialize_n
            for C++20 sized sentinels and forward iterators.
            (vector::_M_range_initialize(FwIt, FwIt, forward_iterator_tag)):
            Use _M_range_initialize_n.
            (vector::_M_range_initialize_n): New function.
            * testsuite/23_containers/vector/cons/108487.cc: New test.
    
    gcc/testsuite/ChangeLog:
    
            * g++.dg/tree-ssa/initlist-opt1.C: Match _M_range_initialize_n
            instead of _M_range_initialize.
            * g++.dg/tree-ssa/initlist-opt2.C: Likewise.
    
    Reviewed-by: Tomasz KamiÅski <tkaminsk@redhat.com>
    
    (cherry picked from commit e200f53a5556516ec831e6b7a34aaa0f10a4ab0a)
Comment 17 Jonathan Wakely 2025-07-03 20:25:54 UTC
This can cause problems for code which isn't ready for C++20, as described at https://gcc.gnu.org/gcc-15/porting_to.html#cxx20-iterators

As a result, I don't think it should be backported any further, so the performance regression won't be fixed for GCC 12 and 13.