Bug 104719 - Use of `std::move` in libstdc++ leads to worsened debug performance
Summary: Use of `std::move` in libstdc++ leads to worsened debug performance
Status: RESOLVED DUPLICATE of bug 96780
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: unknown
: P3 enhancement
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-28 16:20 UTC by Vittorio Romeo
Modified: 2022-12-31 19:58 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-02-28 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Vittorio Romeo 2022-02-28 16:20:58 UTC
`std::accumulate` is defined as follows in `libstdc++`:

```
  template<typename _InputIterator, typename _Tp>
    _GLIBCXX20_CONSTEXPR
    inline _Tp
    accumulate(_InputIterator __first, _InputIterator __last, _Tp __init)
    {
      // concept requirements
      __glibcxx_function_requires(_InputIteratorConcept<_InputIterator>)
      __glibcxx_requires_valid_range(__first, __last);

      for (; __first != __last; ++__first)
	__init = _GLIBCXX_MOVE_IF_20(__init) + *__first;
      return __init;
    }
```

Where `_GLIBCXX_MOVE_IF_20` is:

```
#if __cplusplus > 201703L
// _GLIBCXX_RESOLVE_LIB_DEFECTS
// DR 2055. std::move in std::accumulate and other algorithms
# define _GLIBCXX_MOVE_IF_20(_E) std::move(_E)
#else
# define _GLIBCXX_MOVE_IF_20(_E) _E
#endif
```

When compiling a program using `std::accumulate` in debug mode, under `-Og`, there is a noticeable performance impact due to the presence of `std::move`.
- With `std::move`: https://quick-bench.com/q/h_M_AUs3pgBE3bYr82rsA1_VtjU
- Without `std::move`: https://quick-bench.com/q/ysis2b1CgIZkRsO2cqfjZm9Jkio

This performance degradation is one example of why many people (especially in the gamedev community) are not adopting standard library algorithms and modern C++ more widely. 

Would it be possible to replace `std::move` calls internal to `libstdc++` with a cast, or some sort of compiler intrinsic? Or maybe mark `std::move` as "always inline" even without optimizations enabled? 

Related issue for libc++:
https://github.com/llvm/llvm-project/issues/53689
Comment 1 Jonathan Wakely 2022-02-28 16:46:09 UTC
If it's not inlined at -Og then that's a compiler problem, GCC inlines it at -Og:
https://quick-bench.com/q/d1H4VjD_Xns4ef7iOfweIy6rPNM
Comment 2 Jonathan Wakely 2022-02-28 16:50:29 UTC
(In reply to Vittorio Romeo from comment #0)
> Would it be possible to replace `std::move` calls internal to `libstdc++`
> with a cast,

No, absolutely not.

> or some sort of compiler intrinsic?

No, but the compiler could just fold it away (see PR 96780).

> Or maybe mark `std::move`
> as "always inline" even without optimizations enabled? 

Maybe, although since your benchmark shows there is no problem with GCC at -Og I think this is a Clang problem, not libstdc++ or libc++.
Comment 3 Eric Gallager 2022-02-28 16:56:47 UTC
Seems related to some of the requests for more warnings about uses of std::move that might degrade performance, e.g. bug 67906, bug 81159, bug 90428

Also, Marek's blog post about std::move seems relevant here: https://developers.redhat.com/blog/2019/04/12/understanding-when-not-to-stdmove-in-c
Comment 4 Vittorio Romeo 2022-02-28 17:00:20 UTC
I see that `std::move` is indeed inlined with `-Og`, my apologies on not noticing that. 

I like the idea of having the compiler itself fold calls to things like `std::move` and `std::forward` as suggested in the linked https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96780. 

But I think this issue I opened should be more general for any standard library function that ends up impacting debug performance. Another common example in the gamedev community is `std::vector`.

In this benchmark, which uses `-Og`, you can notice a large performance difference between a `std::vector<int>` and `int*` dynamic array for operations that I believe should have equal performance:
- https://quick-bench.com/q/lrS4I-lmDJ3VFP8L8rG2YHGXO-8
- https://quick-bench.com/q/Uf-t79n7uYWAKdThOL_wxSp12Y0

Are the above results also something that should be handled on the compiler side of things? Or would, for example, marking `std::vector::operator[]` and `std::vector::iterator::operator*` as `always_inline` remove the performance discrepancy?
Comment 5 Jonathan Wakely 2022-02-28 17:16:50 UTC
(In reply to Eric Gallager from comment #3)
> Seems related to some of the requests for more warnings about uses of
> std::move that might degrade performance, e.g. bug 67906, bug 81159, bug
> 90428
> 
> Also, Marek's blog post about std::move seems relevant here:
> https://developers.redhat.com/blog/2019/04/12/understanding-when-not-to-
> stdmove-in-c

I don't think any of those are relevant, they're about *unnecessary* uses of std::move where that is either entirely redundant or doesn't change semantics but prevents the optimizer doing better on its own. This is about *necessary* uses of std::move, and removing the move here would change semantics and potentially break the code. The request is to replace it with some kind of magic that does the same as std::move without actually writing std::move.
Comment 6 Vittorio Romeo 2022-02-28 17:20:45 UTC
> The request is to replace it with some kind of magic that does the same as std::move without actually writing std::move.

More generally speaking, ensure that function such as `std::move`, `std::forward`, `std::vector<T>::operator[]`, `std::vector<T>::iterator::operator*`, and so on never appear in debugging call stacks and do not affect performance in `-Og` (or even `-O0`. 

I think the title for my issue is a bit too specific, but I'd like to make this a wider discussion in how to mitigate debug performance overhead caused by C++ standard library abstractions.
Comment 7 Jonathan Wakely 2022-02-28 17:24:27 UTC
(In reply to Vittorio Romeo from comment #4)
> I like the idea of having the compiler itself fold calls to things like
> `std::move` and `std::forward` as suggested in the linked
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96780. 

That benefits user code too, so that users won't have to use static_cast<T&&> in their own code because "std::move is slow".

> But I think this issue I opened should be more general for any standard
> library function that ends up impacting debug performance. Another common
> example in the gamedev community is `std::vector`.

Does the gamedev community actually use -Og though? Because if they only use -O0 (as is certainly the case for some) then the solution is "don't do that". Do they even use GCC/libstdc++ at all? Would they actually be more likely to if we change anything here?

For those who are using -Og, adding always_inline to trivial accessors shouldn't be necessary, because GCC _should_ always inline them anyway.
 
> In this benchmark, which uses `-Og`, you can notice a large performance
> difference between a `std::vector<int>` and `int*` dynamic array for
> operations that I believe should have equal performance:
> - https://quick-bench.com/q/lrS4I-lmDJ3VFP8L8rG2YHGXO-8
> - https://quick-bench.com/q/Uf-t79n7uYWAKdThOL_wxSp12Y0
> 
> Are the above results also something that should be handled on the compiler
> side of things? Or would, for example, marking `std::vector::operator[]` and
> `std::vector::iterator::operator*` as `always_inline` remove the performance
> discrepancy?

Somebody will have to investigate whether lack of inlining is the problem there. My guess would be that it's not due to trivial functions like op[] failing to be inlined, and so always_inline wouldn't help (except at -O0, but "don't do that" if you need performance :-)
Comment 8 Jonathan Wakely 2022-02-28 17:37:06 UTC
(In reply to Vittorio Romeo from comment #6)
> > The request is to replace it with some kind of magic that does the same as std::move without actually writing std::move.
> 
> More generally speaking, ensure that function such as `std::move`,
> `std::forward`, `std::vector<T>::operator[]`,
> `std::vector<T>::iterator::operator*`, and so on never appear in debugging
> call stacks and do not affect performance in `-Og` (or even `-O0`. 

They will still appear in debugging stacks, because GCC emits debug info even for inline functions. That means that GDB makes it look like the function was entered and exited, even if the actual code is completely inlined. That was the original topic of PR 96780 (because I don't think it's useful for std::move and std::forward). I think what *should* matters is actual raw performance, not how many stack frames you see in the debugger. However, reading twitter and reddit, I think people would complain about insane levels of complexity in the std::lib *even if it has zero overhead*.

> I think the title for my issue is a bit too specific, but I'd like to make
> this a wider discussion in how to mitigate debug performance overhead caused
> by C++ standard library abstractions.

OK, confirming as an enhancement, but somebody will need to do measurements and propose specific patches. I'm not aware of any low-hanging fruit for runtime perf, so the problems need to be identified before they can be fixed.
Comment 9 Vittorio Romeo 2022-03-01 02:23:06 UTC
I have done some benchmarking for three use cases, both with `-O0` and `-Og`, hacking my `libstdc++` headers to add `[[gnu::always_inline]]` where deemed appropriate. 

---

The use cases were:

1. `operator[]` benchmark -- `vector_squareop` and `carray_squareop` as seen above

2. Iterator benchmark -- `vector_iter` and `carray_iter` as seen above

3. Algorithm benchmark -- `std::accumulate` versus a raw `for` loop

---

All the benchmark results, benchmarking rig specs, and used code available here:
https://gist.github.com/vittorioromeo/efa005d44ccd4ec7279181768a0c1f0b

---

In short, these are the results:

- For all benchmarks, when using `-O0` without any modification to `libstdc++`, the overhead of the Standard Library can be huge (+25-400%).

- For all benchmarks, when using `-Og` without any modification to `libstdc++`, the overhead of the Standard Library is small (+5-15%).

- For the `operator[]` benchmark, when using `-O0` after applying `[[gnu::always_inline]]` to all the functions touched by the benchmark, we reduce the overhead from 25% to around 10%.

- For the `operator[]` benchmark, when using `-Og` after applying `[[gnu::always_inline]]` to all the functions touched by the benchmark, we reduce the overhead from 34% to around 11%. 

- For the iterator benchmark, when using `-O0` after applying `[[gnu::always_inline]]` to all the functions touched by the benchmark, we reduce the overhead from 302% to around 186%. 

- For the iterator benchmark, when using `-Og` after applying `[[gnu::always_inline]]` to all the functions touched by the benchmark, we reduce the overhead from 11% to around 8%. 

- For the algorithm benchmark, when using `-O0` after applying `[[gnu::always_inline]]` to all the functions touched by the benchmark, we reduce the overhead from 304% to around 47%.

- For the algorithm benchmark, when using `-Og`, independently of whether we modify `libstdc++` or not, the overhead is around 36%.
Comment 10 Jonathan Wakely 2022-03-01 20:12:41 UTC
There is a downside to always_inline, which might be similar to the stack usage problem mentioned in the llvm ticket. If a function grows too large, the optimiser can just give up and stop inlining. But for always_inline that gives a fatal error with no way to fix it. The attribute tells the compiler kitty isn't allowed to give up, the call must *always* be inline. If we liberally sprinkle it all over the std::lib and it results in compiler errors, users can't do anything about it. The code won't compile, period. Annotations that usually improve performance, but then sometimes give unavoidable errors are not a good idea.
Comment 11 Jonathan Wakely 2022-03-01 20:14:04 UTC
Stoopid autocorrect. s/kitty/it/
Comment 12 Jason Merrill 2022-03-10 23:04:51 UTC
(In reply to Vittorio Romeo from comment #9)
> - For the `operator[]` benchmark, when using `-Og` after applying
> `[[gnu::always_inline]]` to all the functions touched by the benchmark, we
> reduce the overhead from 34% to around 11%. 

Quoting from your gist:

-Og, without `[[gnu::always_inline]]` on `operator[]`
carray_squareop_mean          440 ns          439 ns            3
vector_squareop_mean          661 ns          662 ns            3
-Og, with `[[gnu::always_inline]]` on `operator[]`
vector_squareop_mean          494 ns          491 ns            3

Which looks significant...

...but I don't see this when I run your test myself; the vector_squareop results (and the generated code) are unaffected by adding always_inline.  In particular there are no calls to operator[].

carray_squareop          547 ns          546 ns      1272023
vector_squareop          591 ns          589 ns      1227215

carray:

        movslq  %edx, %rax
        leaq    (%rbx,%rax,4), %rcx
        movl    (%rcx), %eax
        addl    $1, %eax
        movl    %eax, (%rcx)
        movl    %eax, (%rcx)
        addl    $1, %edx

vector:

        movslq  %ecx, %rax
        salq    $2, %rax
        addq    (%rsp), %rax
        movl    (%rax), %esi
        leal    1(%rsi), %edx
        movl    %edx, (%rax)
        movl    %edx, (%rax)
        addl    $1, %ecx

It seems the main difference between the two is that the vector version needs to keep loading the base pointer from the stack (%rsp) for some reason, rather than keep it in a register %rbx.  This doesn't seem like an inlining issue at all.

The double move in both versions is curious.
Comment 13 Jason Merrill 2022-03-26 23:37:13 UTC
I believe the primary concern of this PR is resolved by Patrick's patch for PR96780; you can now use -ffold-simple-inlines to avoid calls to std::move at -O0.  Please open separate PRs for individual standard library abstraction penalty issues.

*** This bug has been marked as a duplicate of bug 96780 ***
Comment 14 Jonathan Wakely 2022-03-27 07:35:12 UTC
I have a patch to remove indirections in std::array which I'll commit for GCC 13.
Comment 15 GCC Commits 2022-05-04 15:34:43 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:ef8d5ac08b5e60f35c52087d88c0235c8ce6b65b

commit r13-113-gef8d5ac08b5e60f35c52087d88c0235c8ce6b65b
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Fri Mar 25 10:28:28 2022 +0000

    libstdc++: Simplify std::array accessors [PR104719]
    
    This removes the __array_traits::_S_ref and __array_traits::_S_ptr
    accessors, which only exist to make the special case of std::array<T, 0>
    syntactically well-formed.
    
    By changing the empty type used as the std::array<T, 0>::_M_elems data
    member to support operator[] and conversion to a pointer, we can write
    code using the natural syntax. The indirection through _S_ref and
    _S_ptr is removed for the common case, and a function call is only used
    for the special case of zero-size arrays.
    
    The invalid member access for zero-sized arrays is changed to use
    __builtin_trap() instead of a null dereference. This guarantees a
    runtime error if it ever gets called, instead of undefined behaviour
    that is likely to get optimized out as unreachable.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/104719
            * include/std/array (__array_traits::_S_ref): Remove.
            (__array_traits::_S_ptr): Remove.
            (__array_traits<T, 0>::_Type): Define operator[] and operator T*
            to provide an array-like API.
            (array::_AT_Type): Remove public typeef.
            (array::operator[], array::at, array::front, array::back): Use
            index operator to access _M_elems instead of _S_ref.
            (array::data): Use implicit conversion from _M_elems to pointer.
            (swap(array&, array&)): Use __enable_if_t helper.
            (get<I>): Use index operator to access _M_elems.
            * testsuite/23_containers/array/tuple_interface/get_neg.cc:
            Adjust dg-error line numbers.
Comment 16 GCC Commits 2022-05-04 15:34:49 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:22399ad6edcd4a2903b05196b59eec3159ceaa38

commit r13-114-g22399ad6edcd4a2903b05196b59eec3159ceaa38
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Wed Apr 27 16:09:06 2022 +0100

    libstdc++: Add always_inline to the simplest std::array accessors [PR104719]
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/104719
            * include/std/array (array::size(), array::max_size())
            (array::empty(), array::data()): Add  always_inline attribute.
Comment 17 cqwrteur 2022-05-04 15:49:25 UTC
(In reply to Jonathan Wakely from comment #14)
> I have a patch to remove indirections in std::array which I'll commit for
> GCC 13.

shouldn't it also add [[__gnu__::__artificial__]] attribute?
Comment 18 Jonathan Wakely 2022-05-04 15:55:45 UTC
No.