Using -std=c++20 -O3, comparing gcc 12.2 vs. gcc 10.3: * fn2 is 20-30x slower on gcc 12.2 (i.e. 2000-3000% more) * fn1 is ~20% slower on gcc 12.2 This test was run on an 52 core Intel Xeon Gold 6278C CPU. Tests on www.godbolt.org directionally align with these findings. It seems the slowdown was introduced in 10.4 & 11.1. The trunk has identical performance to 12.2. #include <vector> #include <ranges> #include <ctime> #include <iostream> __attribute__((noinline)) std::vector<int> fn1(int n) { auto v = std::vector<int>(n); for(int i = 0; i != n; ++i) v[i] = i; return v; } __attribute__((noinline)) std::vector<int> fn2(int n) { auto rng = std::ranges::iota_view{0, n}; return std::vector<int>{rng.begin(), rng.end()}; } int main() { int n = 100'000; int times = 100'000; auto t0 = std::clock(); for (int i = 0; i < times; ++i) fn1(n); auto t1 = std::clock(); for (int i = 0; i < times; ++i) fn2(n); auto t2 = std::clock(); std::cout << t1 - t0 << '\n'; std::cout << t2 - t1 << '\n'; return 0; } P.S. 20% slowdown for a common vector population is still significant IMO. I am not sure that qualifies as a bug. I did not file one on account of the 'fn1' slowdown.
Regarding fn1, would you mind re-running the test on your Xeon CPU with fn2 removed from the source code and -falign-loops=32 added to gcc command line? For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary. Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable. Regarding fn2, this may partially be a library issue, compiling preprocessed source from gcc-10.4 using gcc-10.2 also exhibits the problem. Inner loop becomes significantly more complicated. Bisecting should be helpful.
>> For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary Yes, it does appear that this is the explanation for the difference. Here are the full results: original code * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms fn1 only * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms >> Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable. Yes, I know. I really only used godbolt to reach the conclusion that the issue still exists on trunk.
libstdc++ uses a less efficient specialization of _M_range_initialize. With 10.2 headers, we use template<typename _ForwardIterator> void _M_range_initialize(_ForwardIterator __first, _ForwardIterator __last, std::forward_iterator_tag) { const size_type __n = std::distance(__first, __last); this->_M_impl._M_start = this->_M_allocate(_S_check_init_len(__n, _M_get_Tp_allocator())); this->_M_impl._M_end_of_storage = this->_M_impl._M_start + __n; this->_M_impl._M_finish = std::__uninitialized_copy_a(__first, __last, this->_M_impl._M_start, _M_get_Tp_allocator()); } but with 10.4 headers, we use template<typename _InputIterator> void _M_range_initialize(_InputIterator __first, _InputIterator __last, std::input_iterator_tag) { try { for (; __first != __last; ++__first) emplace_back(*__first); } catch(...) { clear(); throw; } }
That's because iota_view's iterator is only an input iterator in gcc-10, as a result of r10-9796-g1cb39945993c89
See https://cplusplus.github.io/LWG/issue3291 for the rationale. See PR 100070 for suggestions to deal with such iterators better.
(In reply to Jonathan Wakely from comment #4) > That's because iota_view's iterator is only an input iterator in gcc-10, I meant to say gcc-10.4 there.
>> See PR 100070 for suggestions to deal with such iterators better. Unless I'm missing something, there's nothing in that PR that a *user* can do to achieve the gcc 10.3 performance w/ std::iota_view.
What about something like this? #if __cplusplus >= 201709L template<typename _InputIterator, typename = std::_RequireInputIter<_InputIterator>> vector(_InputIterator __first, _InputIterator __last, const allocator_type& __a = allocator_type()) : _Base(__a) { struct PseudoRange { _InputIterator begin(); _InputIterator end(); }; if constexpr (std::ranges::random_access_range<PseudoRange>) { _M_range_initialize(__first, __last, std::random_access_iterator_tag()); } else if constexpr (std::ranges::forward_range<PseudoRange>) { _M_range_initialize(__first, __last, std::forward_iterator_tag()); } else { _M_range_initialize(__first, __last, std::__iterator_category(__first)); } #else ... #endif ----- Here is a PoC: #include <vector> #include <ranges> #include <iostream> template<typename I> void test(I first, I last) { struct PseudoRange { I begin(); I end(); }; if constexpr (std::ranges::random_access_range<PseudoRange>) { std::cout << "RA\n"; } else if constexpr (std::ranges::forward_range<PseudoRange>) { std::cout << "F\n"; } else { std::cout << "I\n"; } } int main() { auto rng = std::ranges::iota_view{0, 10} | std::views::transform([](int i) { return i*i; }); test(rng.begin(), rng.end()); auto rng2 = std::ranges::iota_view{0, 10} | std::views::filter([](int i) { return i%2; }); test(rng2.begin(), rng2.end()); return 0; }
I think we just want to dispatch on iterator concept not iterator category.
GCC 10 branch is being closed.
GCC 11 branch is being closed.
We still don't have code to make this more efficient (that would be Bug 100070) but the correct way to construct a vector from iota_view is C++23's std::ranges::to Adding this function: __attribute__((noinline)) std::vector<int> fn3(int n) { return std::ranges::to<std::vector<int>>(std::ranges::iota_view{0, n}); } And then calling it from main() I get: 3678697 35870273 3000809 So this is significantly better than the naive loop.
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2025-March/679257.html With that change the code in comment 12 gives: 3699085 2928473 2922233 These numbers and the ones in comment 12 both produced with trunk and -O2.
Fixed on trunk so far.
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>: https://gcc.gnu.org/g:e200f53a5556516ec831e6b7a34aaa0f10a4ab0a commit r15-8904-ge200f53a5556516ec831e6b7a34aaa0f10a4ab0a Author: Jonathan Wakely <jwakely@redhat.com> Date: Tue Mar 25 13:24:08 2025 +0000 libstdc++: Optimize std::vector construction from input iterators [PR108487] LWG 3291 make std::ranges::iota_view's iterator have input_iterator_tag as its iterator_category, even though it satisfies the C++20 std::forward_iterator concept. This means that the traditional std::vector::vector(InputIterator, InputIterator) constructor treats iota_view iterators as input iterators, because it only understands the C++17 iterator requirements, not the C++20 iterator concepts. This results in a loop that calls emplace_back for each individual element of the iota_view, requiring the vector to reallocate repeatedly as the values are inserted. This makes it unnecessarily slow to construct a vector from an iota_view. This change adds a new _M_range_initialize_n function for initializing a vector from a range (which doesn't have to be common) and a size. This new function can be used by vector(InputIterator, InputIterator) and vector(from_range_t, R&&) when std::ranges::distance can be used to get the size. It can also be used by the _M_range_initialize overload that gets the size for a Cpp17ForwardIterator pair using std::distance, and by the vector(initializer_list) constructor. With this new function constructing a std::vector from iota_view does a single allocation of the correct size and so doesn't need to reallocate in a loop. Previously the _M_range_initialize overload for Cpp17ForwardIterator was using a local RAII _Guard_alloc object to own the storage, but that was redundant. The _Vector_base can own the storage right away, and its destructor will deallocate it if _M_range_initialize exits via an exception. libstdc++-v3/ChangeLog: PR libstdc++/108487 * include/bits/stl_vector.h (vector(initializer_list)): Call _M_range_initialize_n instead of _M_range_initialize. (vector(InputIterator, InputIterator)): Use _M_range_initialize_n for C++20 sized sentinels and forward iterators. (vector(from_range_t, R&&)): Use _M_range_initialize_n for sized ranges and forward ranges. (vector::_M_range_initialize(FwIt, FwIt, forward_iterator_tag)): Likewise. (vector::_M_range_initialize_n): New function. * testsuite/23_containers/vector/cons/108487.cc: New test. Reviewed-by: Tomasz KamiÅski <tkaminsk@redhat.com>
The releases/gcc-14 branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>: https://gcc.gnu.org/g:5a830c6cd54d376ee23043381c6ed761559e1e08 commit r14-11483-g5a830c6cd54d376ee23043381c6ed761559e1e08 Author: Jonathan Wakely <jwakely@redhat.com> Date: Tue Mar 25 13:24:08 2025 +0000 libstdc++: Optimize std::vector construction from input iterators [PR108487] LWG 3291 make std::ranges::iota_view's iterator have input_iterator_tag as its iterator_category, even though it satisfies the C++20 std::forward_iterator concept. This means that the traditional std::vector::vector(InputIterator, InputIterator) constructor treats iota_view iterators as input iterators, because it only understands the C++17 iterator requirements, not the C++20 iterator concepts. This results in a loop that calls emplace_back for each individual element of the iota_view, requiring the vector to reallocate repeatedly as the values are inserted. This makes it unnecessarily slow to construct a vector from an iota_view. This change adds a new _M_range_initialize_n function for initializing a vector from a range (which doesn't have to be common) and a size. This new function can be used by vector(InputIterator, InputIterator) when std::ranges::distance can be used to get the size. It can also be used by the _M_range_initialize overload that gets the size for a Cpp17ForwardIterator pair using std::distance, and by the vector(initializer_list) constructor. With this new function constructing a std::vector from iota_view does a single allocation of the correct size and so doesn't need to reallocate in a loop. libstdc++-v3/ChangeLog: PR libstdc++/108487 * include/bits/stl_vector.h (vector(initializer_list)): Call _M_range_initialize_n instead of _M_range_initialize. (vector(InputIterator, InputIterator)): Use _M_range_initialize_n for C++20 sized sentinels and forward iterators. (vector::_M_range_initialize(FwIt, FwIt, forward_iterator_tag)): Use _M_range_initialize_n. (vector::_M_range_initialize_n): New function. * testsuite/23_containers/vector/cons/108487.cc: New test. gcc/testsuite/ChangeLog: * g++.dg/tree-ssa/initlist-opt1.C: Match _M_range_initialize_n instead of _M_range_initialize. * g++.dg/tree-ssa/initlist-opt2.C: Likewise. Reviewed-by: Tomasz KamiÅski <tkaminsk@redhat.com> (cherry picked from commit e200f53a5556516ec831e6b7a34aaa0f10a4ab0a)
This can cause problems for code which isn't ready for C++20, as described at https://gcc.gnu.org/gcc-15/porting_to.html#cxx20-iterators As a result, I don't think it should be backported any further, so the performance regression won't be fixed for GCC 12 and 13.