Bug 108487 - [12/13/14/15 Regression] ~20-30x slowdown in populating std::vector from std::ranges::iota_view
Summary: [12/13/14/15 Regression] ~20-30x slowdown in populating std::vector from std:...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: 12.2.0
: P2 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: std::vector
  Show dependency treegraph
 
Reported: 2023-01-20 22:20 UTC by Mark Bourgeault
Modified: 2024-12-21 07:13 UTC (History)
3 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-03-27 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Bourgeault 2023-01-20 22:20:32 UTC
Using -std=c++20 -O3, comparing gcc 12.2 vs. gcc 10.3:
 * fn2 is 20-30x slower on gcc 12.2 (i.e. 2000-3000% more) 
 * fn1 is ~20% slower on gcc 12.2 

This test was run on an 52 core Intel Xeon Gold 6278C CPU.  Tests on www.godbolt.org directionally align with these findings.  It seems the slowdown was introduced in 10.4 & 11.1.  The trunk has identical performance to 12.2.

#include <vector>
#include <ranges>
#include <ctime>
#include <iostream>

__attribute__((noinline)) std::vector<int> fn1(int n)
{
    auto v = std::vector<int>(n);
    for(int i = 0; i != n; ++i)
        v[i] = i;
    return v;
}

__attribute__((noinline)) std::vector<int> fn2(int n)
{
    auto rng = std::ranges::iota_view{0, n};
    return std::vector<int>{rng.begin(), rng.end()};
}

int main() {
    int n = 100'000;
    int times = 100'000;

    auto t0 = std::clock();
    for (int i = 0; i < times; ++i)
        fn1(n);            
    auto t1 = std::clock();
    for (int i = 0; i < times; ++i)
        fn2(n);            
    auto t2 = std::clock();
    std::cout << t1 - t0 << '\n';
    std::cout << t2 - t1 << '\n';
    return 0;
}

P.S. 20% slowdown for a common vector population is still significant IMO.  I am not sure that qualifies as a bug.  I did not file one on account of the 'fn1' slowdown.
Comment 1 Alexander Monakov 2023-01-21 11:48:06 UTC
Regarding fn1, would you mind re-running the test on your Xeon CPU with fn2 removed from the source code and -falign-loops=32 added to gcc command line? For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary.

Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable.

Regarding fn2, this may partially be a library issue, compiling preprocessed source from gcc-10.4 using gcc-10.2 also exhibits the problem. Inner loop becomes significantly more complicated. Bisecting should be helpful.
Comment 2 Mark Bourgeault 2023-01-21 13:32:53 UTC
>> For fn1, assembly of the inner loop should be identical, so I think the 20% you were seeing may result from different loop alignment with respect to 32b fetch boundary
Yes, it does appear that this is the explanation for the difference.  Here are the full results:

original code
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms

fn1 only
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 
>> Also please note that cloud instances backing godbolt.org have different CPUs, so timing results from different runs are not directly comparable.
Yes, I know.  I really only used godbolt to reach the conclusion that the issue still exists on trunk.
Comment 3 Alexander Monakov 2023-01-21 15:38:31 UTC
libstdc++ uses a less efficient specialization of _M_range_initialize.

With 10.2 headers, we use

      template<typename _ForwardIterator>
 void
 _M_range_initialize(_ForwardIterator __first, _ForwardIterator __last,
       std::forward_iterator_tag)
 {
   const size_type __n = std::distance(__first, __last);
   this->_M_impl._M_start
     = this->_M_allocate(_S_check_init_len(__n, _M_get_Tp_allocator()));
   this->_M_impl._M_end_of_storage = this->_M_impl._M_start + __n;
   this->_M_impl._M_finish =
     std::__uninitialized_copy_a(__first, __last,
     this->_M_impl._M_start,
     _M_get_Tp_allocator());
 }

but with 10.4 headers, we use

      template<typename _InputIterator>
 void
 _M_range_initialize(_InputIterator __first, _InputIterator __last,
       std::input_iterator_tag)
 {
   try {
     for (; __first != __last; ++__first)

       emplace_back(*__first);

   } catch(...) {
     clear();
     throw;
   }
 }
Comment 4 Jonathan Wakely 2023-01-21 23:45:03 UTC
That's because iota_view's iterator is only an input iterator in gcc-10, as a result of r10-9796-g1cb39945993c89
Comment 5 Jonathan Wakely 2023-01-21 23:50:24 UTC
See https://cplusplus.github.io/LWG/issue3291 for the rationale.

See PR 100070 for suggestions to deal with such iterators better.
Comment 6 Jonathan Wakely 2023-01-21 23:50:56 UTC
(In reply to Jonathan Wakely from comment #4)
> That's because iota_view's iterator is only an input iterator in gcc-10,

I meant to say gcc-10.4 there.
Comment 7 Mark Bourgeault 2023-01-22 03:07:06 UTC
>> See PR 100070 for suggestions to deal with such iterators better.

Unless I'm missing something, there's nothing in that PR that a *user* can do to achieve the gcc 10.3 performance w/ std::iota_view.
Comment 8 Mark Bourgeault 2023-01-22 10:33:50 UTC
What about something like this?

#if __cplusplus >= 201709L
      template<typename _InputIterator,
	       typename = std::_RequireInputIter<_InputIterator>>
	vector(_InputIterator __first, _InputIterator __last,
	       const allocator_type& __a = allocator_type())
	: _Base(__a)
	{
          struct PseudoRange { _InputIterator begin(); _InputIterator end(); };
          if constexpr (std::ranges::random_access_range<PseudoRange>) {
            _M_range_initialize(__first, __last,
			      std::random_access_iterator_tag());
          }
          else if constexpr (std::ranges::forward_range<PseudoRange>) {
            _M_range_initialize(__first, __last, std::forward_iterator_tag());
          }
          else {
	    _M_range_initialize(__first, __last,
			      std::__iterator_category(__first));
	}
#else
  ...
#endif

-----

Here is a PoC:

#include <vector>
#include <ranges>
#include <iostream>

template<typename I>
void test(I first, I last) {
    struct PseudoRange { I begin(); I end(); };
    if constexpr (std::ranges::random_access_range<PseudoRange>) {
        std::cout << "RA\n"; 
    }
    else if constexpr (std::ranges::forward_range<PseudoRange>) {
        std::cout << "F\n"; 
    }
    else  {
        std::cout << "I\n"; 
    }
}

int main() {
    auto rng = std::ranges::iota_view{0, 10} | std::views::transform([](int i) { return i*i; });
    test(rng.begin(), rng.end());
    auto rng2 = std::ranges::iota_view{0, 10} | std::views::filter([](int i) { return i%2; });
    test(rng2.begin(), rng2.end());
    return 0;
}
Comment 9 Jonathan Wakely 2023-01-22 17:14:34 UTC
I think we just want to dispatch on iterator concept not iterator category.
Comment 10 Richard Biener 2023-07-07 10:44:45 UTC
GCC 10 branch is being closed.
Comment 11 Richard Biener 2024-07-19 13:19:08 UTC
GCC 11 branch is being closed.