Using -std=c++20 -O3, comparing gcc 12.2 vs. gcc 10.3: * fn2 is 20-30x slower on gcc 12.2 (i.e. 2000-3000% more) * fn1 is ~20% slower on gcc 12.2 This test was run on an 52 core Intel Xeon Gold 6278C CPU. Tests on www.godbolt.org directionally align with these findings. It seems the slowdown was introduced in 10.4 & 11.1. The trunk has identical performance to 12.2. #include <vector> #include <ranges> #include <ctime> #include <iostream> __attribute__((noinline)) std::vector<int> fn1(int n) { auto v = std::vector<int>(n); for(int i = 0; i != n; ++i) v[i] = i; return v; } __attribute__((noinline)) std::vector<int> fn2(int n) { auto rng = std::ranges::iota_view{0, n}; return std::vector<int>{rng.begin(), rng.end()}; } int main() { int n = 100'000; int times = 100'000; auto t0 = std::clock(); for (int i = 0; i < times; ++i) fn1(n); auto t1 = std::clock(); for (int i = 0; i < times; ++i) fn2(n); auto t2 = std::clock(); std::cout << t1 - t0 << '\n'; std::cout << t2 - t1 << '\n'; return 0; } P.S. 20% slowdown for a common vector population is still significant IMO. I am not sure that qualifies as a bug. I did not file one on account of the 'fn1' slowdown.
Regarding fn1, would you mind re-running the test on your Xeon CPU with fn2 removed from the source code and -falign-loops=32 added to gcc command line? For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary. Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable. Regarding fn2, this may partially be a library issue, compiling preprocessed source from gcc-10.4 using gcc-10.2 also exhibits the problem. Inner loop becomes significantly more complicated. Bisecting should be helpful.
>> For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary Yes, it does appear that this is the explanation for the difference. Here are the full results: original code * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms fn1 only * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms >> Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable. Yes, I know. I really only used godbolt to reach the conclusion that the issue still exists on trunk.
libstdc++ uses a less efficient specialization of _M_range_initialize. With 10.2 headers, we use template<typename _ForwardIterator> void _M_range_initialize(_ForwardIterator __first, _ForwardIterator __last, std::forward_iterator_tag) { const size_type __n = std::distance(__first, __last); this->_M_impl._M_start = this->_M_allocate(_S_check_init_len(__n, _M_get_Tp_allocator())); this->_M_impl._M_end_of_storage = this->_M_impl._M_start + __n; this->_M_impl._M_finish = std::__uninitialized_copy_a(__first, __last, this->_M_impl._M_start, _M_get_Tp_allocator()); } but with 10.4 headers, we use template<typename _InputIterator> void _M_range_initialize(_InputIterator __first, _InputIterator __last, std::input_iterator_tag) { try { for (; __first != __last; ++__first) emplace_back(*__first); } catch(...) { clear(); throw; } }
That's because iota_view's iterator is only an input iterator in gcc-10, as a result of r10-9796-g1cb39945993c89
See https://cplusplus.github.io/LWG/issue3291 for the rationale. See PR 100070 for suggestions to deal with such iterators better.
(In reply to Jonathan Wakely from comment #4) > That's because iota_view's iterator is only an input iterator in gcc-10, I meant to say gcc-10.4 there.
>> See PR 100070 for suggestions to deal with such iterators better. Unless I'm missing something, there's nothing in that PR that a *user* can do to achieve the gcc 10.3 performance w/ std::iota_view.
What about something like this? #if __cplusplus >= 201709L template<typename _InputIterator, typename = std::_RequireInputIter<_InputIterator>> vector(_InputIterator __first, _InputIterator __last, const allocator_type& __a = allocator_type()) : _Base(__a) { struct PseudoRange { _InputIterator begin(); _InputIterator end(); }; if constexpr (std::ranges::random_access_range<PseudoRange>) { _M_range_initialize(__first, __last, std::random_access_iterator_tag()); } else if constexpr (std::ranges::forward_range<PseudoRange>) { _M_range_initialize(__first, __last, std::forward_iterator_tag()); } else { _M_range_initialize(__first, __last, std::__iterator_category(__first)); } #else ... #endif ----- Here is a PoC: #include <vector> #include <ranges> #include <iostream> template<typename I> void test(I first, I last) { struct PseudoRange { I begin(); I end(); }; if constexpr (std::ranges::random_access_range<PseudoRange>) { std::cout << "RA\n"; } else if constexpr (std::ranges::forward_range<PseudoRange>) { std::cout << "F\n"; } else { std::cout << "I\n"; } } int main() { auto rng = std::ranges::iota_view{0, 10} | std::views::transform([](int i) { return i*i; }); test(rng.begin(), rng.end()); auto rng2 = std::ranges::iota_view{0, 10} | std::views::filter([](int i) { return i%2; }); test(rng2.begin(), rng2.end()); return 0; }
I think we just want to dispatch on iterator concept not iterator category.
GCC 10 branch is being closed.
GCC 11 branch is being closed.