Created attachment 45259 [details]
specialize std::find to memchr for character searches in contiguous memory

If std::find() is called with contiguous random access iterators and a trivial char-sized value, then calling memchr() is much more efficient than calling into the generic __find_if().

The attached patch implements this optimization. That means it specializes a std::find helper on the iterator category and the value and calls __builtin_memchr() if possible.

I've tested it on Fedora 27 and Fedora 29 (libstdc++-8.2.1-5.fc29.x86_64), but it should apply to the master branch as well.

Note that the patch uses techniques similar to what is already used for other algorithms in bits/stl_algo.h and bits/stl_algobase.h for detecting contiguous random access iterators and specializing on them (e.g. std::equal calling memcmp if possible, other algorithms calling memmove, etc.).

I benchmarked some memchr()/std::find() implementations, and one relevant result was that glibc's memchr() outperforms libstdc++'s status-quo std::find() implementation (which does some manual loop unrolling in the __find_if helper) by a factor of about 1.5 (it always outperforms std::find(), but the factors vary between CPU families). Also, on many CPUs, std::find() is even slower than a simple loop when used for character searches.

See also: https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
I realise this is many years later, but please send patches to the gcc-patches & libstdc++ mailing lists. Patches on BZ are just used to store WIPs/drafts.
(In reply to Georg Sauthoff from comment #0)
> Created attachment 45259 [details]
> specialize std::find to memchr for character searches in contiguous memory
>
> If std::find() is called with contiguous random access iterators and a
> trivial char sized value, then calling memchr() is much more efficient than
> calling into the generic __find_if().
>
> The attached patch implements this optimization.
>
> That means it specializes a std::find helper on the iterator category and
> the value and calls __builtin_memchr() if possible.

Why specialize on the iterator category, when the __is_simple boolean already checks if the iterator is a pointer?

The condition of a trivial byte-sized type seems insufficient, because you could have:

struct B {
  char c;
  bool operator==(const B& b) const { return true; }
};

I would prefer to simply do:

--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -3846,6 +3846,32 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
       __glibcxx_function_requires(_EqualOpConcept<
		typename iterator_traits<_InputIterator>::value_type, _Tp>)
       __glibcxx_requires_valid_range(__first, __last);
+
+#if __cpp_if_constexpr
+      using _ValT = typename iterator_traits<_InputIterator>::value_type;
+      if constexpr (is_same_v<_ValT, _Tp>)
+	if constexpr (__is_byte<_ValT>::__value)
+#if __cpp_lib_concepts
+	  if constexpr (contiguous_iterator<_InputIterator>)
+	    {
+	      if (const size_t __n = __last - __first)
+		{
+		  auto __p0 = std::to_address(__first);
+		  if (auto __p1 = __builtin_memchr(__p0, __val, __n))
+		    return __first + (__p1 - __p0);
+		}
+	      return __last;
+	    }
+#else
+	  if constexpr (is_pointer_v<_InputIterator>)
+	    {
+	      if (const size_t __n = __last - __first)
+		if (auto __p = __builtin_memchr(__first, __val, __n))
+		  return __p;
+	      return __last;
+	    }
+#endif
+#endif
       return std::__find_if(__first, __last,
		__gnu_cxx::__ops::__iter_equals_val(__val));
     }

I think we're going to remove the manual loop unrolling in __find_if for GCC 15, which should allow the compiler to optimize it better, potentially auto-vectorizing. That might make memchr less advantageous, but I think it's worth doing anyway.
(In reply to Jonathan Wakely from comment #2)
> +      using _ValT = typename iterator_traits<_InputIterator>::value_type;
> +      if constexpr (is_same_v<_ValT, _Tp>)
> +	if constexpr (__is_byte<_ValT>::__value)

We can do better than this. We can use memchr to find a char in a range of signed char, or even to find an int in a range of signed char, as long as we're careful about values. For example, given s = "abc"sv, std::find(s.begin(), s.end(), 'a'+0) should find a match, but std::find(s.begin(), s.end(), 'a'+256) should not (even though memchr would match in that case, because it does (unsigned char)('a'+256)).

We should also ensure that std::ranges::find gets the same optimizations.
Sam, thank you for the hint and surfacing it again. (In reply to Jonathan Wakely from comment #2) [..] > I would prefer to do simply: [..] Yes, please go ahead with your approach. > I think we're going to remove the manual loop unrolling in __find_if for GCC > 15, which should allow the compiler to optimize it better, potentially > auto-vectorizing. That might make memchr less advantageous, but I think it's > worth doing anyway. This is great news!
> I think we're going to remove the manual loop unrolling in __find_if for GCC
> 15, which should allow the compiler to optimize it better, potentially
> auto-vectorizing. That might make memchr less advantageous, but I think it's
> worth doing anyway.

Even with code-size flags (-Os), memchr still gives the best of both worlds, since auto-vectorization increases code size.
> We can use memchr to find a char in a range of signed char, or even to find
> an int in a range of signed char, as long as we're careful about values.

+1, this approach should fix the bug I reported: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115040
Is there a plan to push a patch for this?
Yes, but it's only a missed-optimization bug so there are much higher priorities.
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html Rerunning benchmarks with this patch would be very welcome.
With this patch, find of int8_t gets converted to memchr, using the testcase from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115040 as an example.

With the patch posted in https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html

```
bool find_epi8(const std::vector<int8_t>& v) {
    return std::find(v.begin(), v.end(), 42) != v.end();
}
```

$ gcc -O3 -ftree-vectorize -march=pantherlake test.cpp -S -o test.s

```
	.globl	_Z9find_epi8RKSt6vectorIaSaIaEE
	.type	_Z9find_epi8RKSt6vectorIaSaIaEE, @function
_Z9find_epi8RKSt6vectorIaSaIaEE:
.LFB1535:
	.cfi_startproc
	pushq	%rbx
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movq	8(%rdi), %rbx
	movq	(%rdi), %rdi
	testq	%rdi, %rdi
	je	.L2
	movq	%rbx, %rdx
	subq	%rdi, %rdx
	xorl	%eax, %eax
	testq	%rdx, %rdx
	jle	.L1
	movl	$42, %esi
	call	memchr
	movq	%rax, %rdx
	cmpq	%rax, %rbx
	setne	%al
	testq	%rdx, %rdx
	setne	%dl
	andl	%edx, %eax
.L1:
	popq	%rbx
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
	.p2align 4,,10
	.p2align 3
.L2:
	.cfi_restore_state
	cmpq	$3, %rbx
	jg	.L4
	cmpq	$2, %rbx
	je	.L10
	cmpq	$3, %rbx
	je	.L6
	xorl	%eax, %eax
	cmpq	$1, %rbx
	jne	.L1
.L7:
	cmpb	$42, (%rdi)
	sete	%al
	cmpq	%rdi, %rbx
	setne	%dl
	andl	%edx, %eax
	popq	%rbx
	.cfi_def_cfa_offset 8
	ret
.L10:
	.cfi_restore_state
	xorl	%edx, %edx
	movl	$1, %edi
.L5:
	cmpb	$42, (%rdx)
	movl	$1, %eax
	jne	.L7
	popq	%rbx
	.cfi_remember_state
	.cfi_def_cfa_offset 8
	ret
.L6:
	.cfi_restore_state
	cmpb	$42, 0
	movl	$1, %eax
	je	.L1
	movl	$2, %edi
	movl	$1, %edx
	jmp	.L5
	.cfi_endproc
	.section	.text.unlikely
	.cfi_startproc
	.type	_Z9find_epi8RKSt6vectorIaSaIaEE.cold, @function
_Z9find_epi8RKSt6vectorIaSaIaEE.cold:
.LFSB1535:
.L4:
	.cfi_def_cfa_offset 16
	.cfi_offset 3, -16
	movzbl	0, %eax
	ud2
	.cfi_endproc
.LFE1535:
	.text
	.size	_Z9find_epi8RKSt6vectorIaSaIaEE, .-_Z9find_epi8RKSt6vectorIaSaIaEE
	.section	.text.unlikely
	.size	_Z9find_epi8RKSt6vectorIaSaIaEE.cold, .-_Z9find_epi8RKSt6vectorIaSaIaEE.cold
```
(In reply to Jonathan Wakely from comment #9)
> Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html
> 
> Rerunning benchmarks with this patch would be very welcome.

OK, I have tested the patches on AArch64 and also compared against our vectorized cases.

I tested the following needle positions (where the element it's searching for is found): 1, 0, 100, 1000, 10000, -1, where 0 means never found and -1 means find last. Results are:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x |  vect   | memchr  |
+--------+-----------+---------+---------+
|      1 |   -11.67% | -14.95% | -12.92% |
|      0 |   137.48% | -82.31% | -83.36% |
|    100 |     3.75% |  17.06% |   8.02% |
|   1000 |   -10.34% | -10.83% |   0.29% |
|  10000 |    -3.25% |  -4.97% |  -5.19% |
|     -1 |    10.28% | -31.17% | -33.93% |
+--------+-----------+---------+---------+

So it looks like, staying with scalar, the unrolling still has a positive effect, but calling memchr mostly pays off on longer searches. The vector loop as currently vectorized has about a 10% unneeded overhead which we will be working on this year, but otherwise it's also a significant win for longer searches.

So perhaps an idea is to use memchr for bytes, and for everything else remove the unrolled code and let the vectorizer take care of it, and if that fails let the RTL or tree unroller do it?
For completeness, my benchmark was:

#include <algorithm>
#include <string>
#include <fstream>
#include <sstream>
#include <stdio.h>

#ifndef NEEDLE
#define NEEDLE 50
#endif

#ifndef ITERS
#define ITERS 1000
#endif

__attribute__((noipa))
bool doIt (const char* s, char v, size_t len)
{
  const char* l = s + len;
  const char* r = std::find (s, l, v);
  return (r != l);
}

int main ()
{
  std::ifstream t("find.data");
  std::stringstream buffer;
  buffer << t.rdbuf();
  std::string content = buffer.str();

  if (NEEDLE > 0)
    content[NEEDLE-1] = '|';
  else if (NEEDLE < 0)
    content[content.length()-1] = '|';

  bool found = false;
  for (int i = 0; i < ITERS; i++)
    found = found | doIt (content.c_str (), '|', content.length ());

  return found;
}
I had a bug in the benchmark: I forgot to use taskset. These are the correct numbers:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x |  vect   | memchr  |
+--------+-----------+---------+---------+
|      1 |    -0.14% | 174.95% | 373.69% |
|      0 |     0.00% | -90.60% | -95.21% |
|    100 |     0.03% | -80.28% | -80.39% |
|   1000 |     0.00% | -89.46% | -94.06% |
|  10000 |     0.00% | -90.33% | -95.19% |
|     -1 |     0.00% | -90.60% | -95.21% |
+--------+-----------+---------+---------+

So this shows that on modern cores the unrolled scalar code has no influence, so we should just remove it. It also shows that memchr is faster across the board except for the shortest searches, and that for the rest the vectorizer does a pretty good job. We'll get some additional speedups there soon as well, but memchr should still win as it's hand-tuned.

So I think for 1-byte elements we should use memchr, and for the rest remove the unrolled code and let the vectorizer handle it.
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:de19b516edbf919d31e9d22fdbf6066342d904a2

commit r15-1857-gde19b516edbf919d31e9d22fdbf6066342d904a2
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Wed Jun 5 16:01:26 2024 +0100

    libstdc++: Use memchr to optimize std::find [PR88545]
    
    This optimizes std::find to use memchr when searching for an integer in
    a range of bytes.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/88545
            PR libstdc++/115040
            * include/bits/cpp_type_traits.h (__can_use_memchr_for_find):
            New variable template.
            * include/bits/ranges_util.h (__find_fn): Use memchr when
            possible.
            * include/bits/stl_algo.h (find): Likewise.
            * testsuite/25_algorithms/find/bytes.cc: New test.
(In reply to GCC Commits from comment #13)
> commit r15-1857-gde19b516edbf919d31e9d22fdbf6066342d904a2
> 
>     libstdc++: Use memchr to optimize std::find [PR88545]
>     
>     This optimizes std::find to use memchr when searching for an integer in
>     a range of bytes.

I see libc++ also uses wmemchr for some cases. I noticed it when looking into https://hachyderm.io/@frog@idtech.space/112757633248218604
The wmemchr case is covered by PR 115040
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:e69456ff9a54ba3e9c93842b05757b7d8fff6d9d

commit r15-2356-ge69456ff9a54ba3e9c93842b05757b7d8fff6d9d
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Thu Jul 4 12:01:29 2024 +0100

    libstdc++: Remove __find_if unrolling for random access iterators
    
    As the numbers in PR libstdc++/88545 show, the manual loop unrolling
    in std::__find_if doesn't actually help these days, and it prevents
    the compiler from auto-vectorizing. Remove the dispatching on
    iterator_category and just use the simple loop for all iterator
    categories.
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/stl_algobase.h (__find_if): Remove overloads for
            dispatching on iterator_category. Do not unroll loop manually.
            * include/bits/stl_algo.h (__find_if_not): Remove
            iterator_category argument from __find_if call.
I think this can be closed as FIXED now.