Bug 88545 - std::find compile to memchr in trivial random access cases (patch)
Summary: std::find compile to memchr in trivial random access cases (patch)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++ (show other bugs)
Version: unknown
: P3 enhancement
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL: https://gcc.gnu.org/pipermail/gcc-pat...
Keywords: missed-optimization, patch
Depends on:
Blocks:
 
Reported: 2018-12-18 20:29 UTC by Georg Sauthoff
Modified: 2024-12-21 07:18 UTC (History)
4 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-01-07 00:00:00


Attachments
specialize std::find to memchr for character searches in continous memory (798 bytes, patch)
2018-12-18 20:29 UTC, Georg Sauthoff

Description Georg Sauthoff 2018-12-18 20:29:00 UTC
Created attachment 45259 [details]
specialize std::find to memchr for character searches in continous memory

If std::find() is called with contiguous random access iterators and a trivial char-sized value, then calling memchr() is much more efficient than calling into the generic __find_if().

The attached patch implements this optimization.

That is, it specializes a std::find helper on the iterator category and the value type, and calls __builtin_memchr() if possible.

I've tested it on Fedora 27 and Fedora 29 (libstdc++-8.2.1-5.fc29.x86_64), but it should apply to the master branch as well.

Note that the patch uses techniques similar to those already used for other algorithms in bits/stl_algo.h and bits/stl_algobase.h for detecting contiguous random access iterators and dispatching to specializations (e.g. std::equal calling memcmp when possible, other algorithms calling memmove, etc.).

I benchmarked several memchr()/std::find() implementations; one relevant result was that glibc's memchr() outperforms libstdc++'s status-quo std::find() implementation (which does some manual loop unrolling in the __find_if helper) by a factor of roughly 1.5. It always outperforms std::find(), though the factor varies between CPU families. Also, on many CPUs, std::find() is even slower than a simple loop when used for character searches.

See also: https://gms.tf/stdfind-and-memchr-optimizations.html#measurements
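For readers skimming the report, the fast path the patch proposes is essentially the following. This is a hypothetical sketch, not the attached patch: find_char is an illustrative name, and the real patch dispatches via a std::find helper rather than a free function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch of the memchr fast path for a contiguous range of char:
// delegate the search to memchr and translate its pointer-or-null
// result back into the find() convention of returning `last` on miss.
const char* find_char(const char* first, const char* last, char value)
{
    if (first == last)
        return last;
    const void* p = std::memchr(first, static_cast<unsigned char>(value),
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char*>(p) : last;
}
```

The generic __find_if loop compares elements one by one, whereas memchr is typically a hand-tuned, vectorized libc routine, which is where the measured speedup comes from.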
Comment 1 Sam James 2024-05-10 23:22:01 UTC
I realise this is many years later, but please send patches to the gcc-patches & libstdc++ mailing lists.

Patches on BZ are just used to store WIPs/drafts.
Comment 2 Jonathan Wakely 2024-05-13 13:00:14 UTC
(In reply to Georg Sauthoff from comment #0)
> Created attachment 45259 [details]
> specialize std::find to memchr for character searches in continous memory
> 
> If std::find() is called with continuous random access iterators and a
> trivial char sized value, then calling memchr() is much more efficient than
> calling into the generic __find_if().
> 
> The attached patch implements this optimization.
> 
> That means it specializes a std::find helper on the iterator category and
> the value and calls __builtin_memchr() if possible.

Why specialize on the iterator category, when the __is_simple boolean already checks if the iterator is a pointer?

The condition of a trivial byte-sized type seems insufficient, because you could have:

struct B {
  char c;
  bool operator==(const B& b) const { return true; }
};



I would prefer to do simply:

--- a/libstdc++-v3/include/bits/stl_algo.h
+++ b/libstdc++-v3/include/bits/stl_algo.h
@@ -3846,6 +3846,32 @@ _GLIBCXX_BEGIN_NAMESPACE_ALGO
       __glibcxx_function_requires(_EqualOpConcept<
                typename iterator_traits<_InputIterator>::value_type, _Tp>)
       __glibcxx_requires_valid_range(__first, __last);
+
+#if __cpp_if_constexpr
+      using _ValT = typename iterator_traits<_InputIterator>::value_type;
+      if constexpr (is_same_v<_ValT, _Tp>)
+       if constexpr (__is_byte<_ValT>::__value)
+#if __cpp_lib_concepts
+         if constexpr (contiguous_iterator<_InputIterator>)
+           {
+             if (const size_t __n = __last - __first)
+               {
+                 auto __p0 = std::to_address(__first);
+                 if (auto __p1 = __builtin_memchr(__p0, __val, __n))
+                   return __first + (__p1 - __p0);
+               }
+             return __last;
+           }
+#else
+         if constexpr (is_pointer_v<_InputIterator>)
+           {
+             if (const size_t __n = __last - __first)
+               if (auto __p = __builtin_memchr(__first, __val, __n))
+                 return __p;
+             return __last;
+           }
+#endif
+#endif
       return std::__find_if(__first, __last,
                            __gnu_cxx::__ops::__iter_equals_val(__val));
     }


I think we're going to remove the manual loop unrolling in __find_if for GCC 15, which should allow the compiler to optimize it better, potentially auto-vectorizing. That might make memchr less advantageous, but I think it's worth doing anyway.
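For context, the non-unrolled loop that __find_if would reduce to looks roughly like this (a sketch, not the exact libstdc++ code); a plain loop of this shape is what the compiler can auto-vectorize.

```cpp
#include <cassert>

// Simple __find_if-style loop with the manual 4x unrolling removed.
// Modern compilers can auto-vectorize this form directly.
template <typename It, typename Pred>
It find_if_simple(It first, It last, Pred pred)
{
    while (first != last && !pred(*first))
        ++first;
    return first;
}
```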
Comment 3 Jonathan Wakely 2024-05-13 14:18:14 UTC
(In reply to Jonathan Wakely from comment #2)
> +      using _ValT = typename iterator_traits<_InputIterator>::value_type;
> +      if constexpr (is_same_v<_ValT, _Tp>)
> +       if constexpr (__is_byte<_ValT>::__value)

We can do better than this.

We can use memchr to find a char in a range of signed char, or even to find an int in a range of signed char, as long as we're careful about values.

For example, given s = "abc"sv, std::find(s.begin(), s.end(), 'a'+0) should find a match, but std::find(s.begin(), s.end(), 'a'+256) should not (even though memchr would match in that case, because it does (unsigned char)('a'+256)).

We should also ensure that std::ranges::find gets the same optimizations.
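The value-range guard described above can be sketched as follows. This is an illustration under assumed names (find_int_in_chars is not the committed helper): before delegating to memchr, check that the wider needle value is actually representable as the element type, since memchr truncates its needle to unsigned char.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Sketch: searching for an int needle in a range of char. Reject needles
// that cannot equal any char element (e.g. 'a' + 256) before calling
// memchr, which would otherwise truncate the needle and falsely match.
const char* find_int_in_chars(const char* first, const char* last, int needle)
{
    if (static_cast<char>(needle) != needle)   // value not representable as char
        return last;                           // no element can compare equal
    if (first == last)
        return last;
    const void* p = std::memchr(first, needle,
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char*>(p) : last;
}
```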
Comment 4 Georg Sauthoff 2024-05-14 22:33:53 UTC
Sam, thank you for the hint and for surfacing this again.

(In reply to Jonathan Wakely from comment #2)
[..]
> I would prefer to do simply:
[..]

Yes, please go ahead with your approach.


> I think we're going to remove the manual loop unrolling in __find_if for GCC
> 15, which should allow the compiler to optimize it better, potentially
> auto-vectorizing. That might make memchr less advantageous, but I think it's
> worth doing anyway.

This is great news!
Comment 5 AK 2024-05-14 22:55:32 UTC
> I think we're going to remove the manual loop unrolling in __find_if for GCC
> 15, which should allow the compiler to optimize it better, potentially
> auto-vectorizing. That might make memchr less advantageous, but I think it's
> worth doing anyway.

And even with code-size flags (-Os), memchr still gives the best of both worlds, since auto-vectorization increases code size.
Comment 6 AK 2024-05-14 22:57:56 UTC
> We can use memchr to find a char in a range of signed char, or even to find an int in a range of signed char, as long as we're careful about values.

+1, this approach should fix the bug I reported: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115040
Comment 7 AK 2024-06-03 18:37:20 UTC
Is there a plan to push a patch for this?
Comment 8 Jonathan Wakely 2024-06-05 09:16:45 UTC
Yes, but it's only a missed-optimization bug so there are much higher priorities.
Comment 9 Jonathan Wakely 2024-06-05 15:33:50 UTC
Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html

Rerunning benchmarks with this patch would be very welcome.
Comment 10 AK 2024-06-27 18:04:27 UTC
With this patch find of int8_t gets converted to memchr.
Using testcase from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115040 as example. With the patch posted in https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html

```
bool find_epi8(const std::vector<int8_t>& v) {
    return std::find(v.begin(), v.end(), 42) != v.end();
}
```

$ gcc -O3 -ftree-vectorize -march=pantherlake test.cpp -S -o test.s

```
 .globl  _Z9find_epi8RKSt6vectorIaSaIaEE
  .type _Z9find_epi8RKSt6vectorIaSaIaEE, @function
_Z9find_epi8RKSt6vectorIaSaIaEE:
.LFB1535:
  .cfi_startproc
  pushq %rbx
  .cfi_def_cfa_offset 16
  .cfi_offset 3, -16
  movq  8(%rdi), %rbx
  movq  (%rdi), %rdi
  testq %rdi, %rdi
  je  .L2
  movq  %rbx, %rdx
  subq  %rdi, %rdx
  xorl  %eax, %eax
  testq %rdx, %rdx
  jle .L1
  movl  $42, %esi
  call  memchr
  movq  %rax, %rdx
  cmpq  %rax, %rbx
  setne %al
  testq %rdx, %rdx
  setne %dl
  andl  %edx, %eax
.L1:
  popq  %rbx
  .cfi_remember_state
  .cfi_def_cfa_offset 8
  ret
 .p2align 4,,10
  .p2align 3
.L2:
  .cfi_restore_state
  cmpq  $3, %rbx
  jg  .L4
  cmpq  $2, %rbx
  je  .L10
  cmpq  $3, %rbx
  je  .L6
  xorl  %eax, %eax
  cmpq  $1, %rbx
  jne .L1
.L7:
  cmpb  $42, (%rdi)
  sete  %al
  cmpq  %rdi, %rbx
  setne %dl
  andl  %edx, %eax
  popq  %rbx
  .cfi_def_cfa_offset 8
  ret
.L10:
  .cfi_restore_state
  xorl  %edx, %edx
  movl  $1, %edi
.L5:
  cmpb  $42, (%rdx)
  movl  $1, %eax
  jne .L7
  popq  %rbx
  .cfi_remember_state
  .cfi_def_cfa_offset 8
  ret
.L6:
  .cfi_restore_state
  cmpb  $42, 0
  movl  $1, %eax
  je  .L1
  movl  $2, %edi
  movl  $1, %edx
  jmp .L5
  .cfi_endproc
  .section  .text.unlikely
  .cfi_startproc
  .type _Z9find_epi8RKSt6vectorIaSaIaEE.cold, @function
_Z9find_epi8RKSt6vectorIaSaIaEE.cold:
.LFSB1535:
.L4:
 .cfi_def_cfa_offset 16
  .cfi_offset 3, -16
  movzbl  0, %eax
  ud2
  .cfi_endproc
.LFE1535:
  .text
  .size _Z9find_epi8RKSt6vectorIaSaIaEE, .-_Z9find_epi8RKSt6vectorIaSaIaEE
  .section  .text.unlikely
  .size _Z9find_epi8RKSt6vectorIaSaIaEE.cold, .-_Z9find_epi8RKSt6vectorIaSaIaEE.cold
```
Comment 11 Tamar Christina 2024-07-01 08:43:21 UTC
(In reply to Jonathan Wakely from comment #9)
> Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html
> 
> Rerunning benchmarks with this patch would be very welcome.

OK, I have tested the patches on AArch64 and also compared against our vectorized cases.

I tested 6 needle positions (the index at which the element being searched for is found):

1
0
100
1000
10000
-1

where 0 means the needle is never found, and -1 means it is the last element:

Results are:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x | vect    | memchr  |
+--------+-----------+---------+---------+
| 1      | -11.67%   | -14.95% | -12.92% |
| 0      | 137.48%   | -82.31% | -83.36% |
| 100    | 3.75%     | 17.06%  | 8.02%   |
| 1000   | -10.34%   | -10.83% | 0.29%   |
| 10000  | -3.25%    | -4.97%  | -5.19%  |
| -1     | 10.28%    | -31.17% | -33.93% |
+--------+-----------+---------+---------+

So it looks like, staying with scalar, the unrolling still has a positive effect, but calling memchr helps on longer searches while hurting shorter ones.

The vector loop as currently vectorized has about 10% of unneeded overhead, which we will be working on this year. But otherwise it's also a significant win for longer searches.

So perhaps an idea is to use memchr for bytes, for everything else remove the unrolled code and let the vectorizer take care of it, and if that fails let the RTL or tree unroller do it?

for completeness, my benchmark was:

#include <algorithm>
#include <string>
#include <fstream>
#include <sstream>
#include <stdio.h>

#ifndef NEEDLE
#define NEEDLE 50
#endif

#ifndef ITERS
#define ITERS 1000
#endif

__attribute__((noipa))
bool
doIt (const char* s, char v, size_t len)
{
  const char* l = s + len;
  const char* r = std::find (s, l, v);
  return (r != l);
}

int main ()
{
  std::ifstream t("find.data");
  std::stringstream buffer;
  buffer << t.rdbuf();
  std::string content = buffer.str();
  if (NEEDLE > 0)
    content[NEEDLE-1] = '|';
  else if (NEEDLE < 0)
    content[content.length()-1] = '|';

  bool found = false;
  for (int i = 0; i < ITERS; i++)
     found = found | doIt (content.c_str (), '|', content.length ());

  return found;
}
Comment 12 Tamar Christina 2024-07-01 13:29:22 UTC
I had a bug in the benchmark setup: I forgot to run it under taskset.

These are the correct ones:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x | vect    | memchr  |
+--------+-----------+---------+---------+
| 1      | -0.14%    | 174.95% | 373.69% |
| 0      | 0.00%     | -90.60% | -95.21% |
| 100    | 0.03%     | -80.28% | -80.39% |
| 1000   | 0.00%     | -89.46% | -94.06% |
| 10000  | 0.00%     | -90.33% | -95.19% |
| -1     | 0.00%     | -90.60% | -95.21% |
+--------+-----------+---------+---------+

This shows that on modern cores the unrolled scalar loop has no benefit, so we should just remove it.

It also shows that memchr is universally faster and that, for the rest, the vectorizer does a pretty good job. We'll get some additional speedups there soon as well, but memchr should still win since it's hand-tuned.

So I think for 1-byte we should use memchr and the rest remove the unrolled code and let the vectorizer handle it.
Comment 13 GCC Commits 2024-07-05 11:19:56 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:de19b516edbf919d31e9d22fdbf6066342d904a2

commit r15-1857-gde19b516edbf919d31e9d22fdbf6066342d904a2
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Wed Jun 5 16:01:26 2024 +0100

    libstdc++: Use memchr to optimize std::find [PR88545]
    
    This optimizes std::find to use memchr when searching for an integer in
    a range of bytes.
    
    libstdc++-v3/ChangeLog:
    
            PR libstdc++/88545
            PR libstdc++/115040
            * include/bits/cpp_type_traits.h (__can_use_memchr_for_find):
            New variable template.
            * include/bits/ranges_util.h (__find_fn): Use memchr when
            possible.
            * include/bits/stl_algo.h (find): Likewise.
            * testsuite/25_algorithms/find/bytes.cc: New test.
Comment 14 Andrew Pinski 2024-07-15 22:48:05 UTC
(In reply to GCC Commits from comment #13)
> The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:
> 
> https://gcc.gnu.org/g:de19b516edbf919d31e9d22fdbf6066342d904a2
> 
> commit r15-1857-gde19b516edbf919d31e9d22fdbf6066342d904a2
> Author: Jonathan Wakely <jwakely@redhat.com>
> Date:   Wed Jun 5 16:01:26 2024 +0100
> 
>     libstdc++: Use memchr to optimize std::find [PR88545]
>     
>     This optimizes std::find to use memchr when searching for an integer in
>     a range of bytes.

I see libc++ will also use wmemchr in some cases.
I noticed it when looking into https://hachyderm.io/@frog@idtech.space/112757633248218604 .
Comment 15 Jonathan Wakely 2024-07-16 08:29:49 UTC
The wmemchr case is covered by PR 115040
Comment 16 GCC Commits 2024-07-27 11:48:57 UTC
The master branch has been updated by Jonathan Wakely <redi@gcc.gnu.org>:

https://gcc.gnu.org/g:e69456ff9a54ba3e9c93842b05757b7d8fff6d9d

commit r15-2356-ge69456ff9a54ba3e9c93842b05757b7d8fff6d9d
Author: Jonathan Wakely <jwakely@redhat.com>
Date:   Thu Jul 4 12:01:29 2024 +0100

    libstdc++: Remove __find_if unrolling for random access iterators
    
    As the numbers in PR libstdc++/88545 show, the manual loop unrolling in
    std::__find_if doesn't actually help these days, and it prevents the
    compiler from auto-vectorizing.
    
    Remove the dispatching on iterator_category and just use the simple loop
    for all iterator categories.
    
    libstdc++-v3/ChangeLog:
    
            * include/bits/stl_algobase.h (__find_if): Remove overloads for
            dispatching on iterator_category. Do not unroll loop manually.
            * include/bits/stl_algo.h (__find_if_not): Remove
            iterator_category argument from __find_if call.
Comment 17 Jonathan Wakely 2024-12-10 15:26:26 UTC
I think this can be closed as FIXED now.