Bug 103798 - memchr with a (small) constant string should be expanded inline.
Summary: memchr with a (small) constant string should be expanded inline.
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 12.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2021-12-22 07:03 UTC by Zhao Wei Liew
Modified: 2022-08-28 07:42 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2021-12-22 00:00:00


Attachments
A patch (762 bytes, patch)
2022-06-16 17:37 UTC, H.J. Lu
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Zhao Wei Liew 2021-12-22 07:03:17 UTC
Given the following code:

#include <string_view>

size_t findFirstE_slow(std::string_view sv) {
  return sv.find_first_of("eE");
}

size_t findFirstE_fast(std::string_view sv) {
  auto it{sv.begin()};
  for (; it != sv.end() && *it != 'e' && *it != 'E'; ++it)
    ;
  return it == sv.end() ? std::string_view::npos : size_t(it - sv.begin());
}


x86-64 gcc (trunk) -std=c++20 -O3 generates the following assembly:

.LC0:
        .string "eE"
findFirstE_slow(std::basic_string_view<char, std::char_traits<char> >):
        push    r12
        push    rbp
        push    rbx
        test    rdi, rdi
        je      .L4
        mov     rbx, rdi
        mov     rbp, rsi
        xor     r12d, r12d
        jmp     .L3
.L8:
        add     r12, 1
        cmp     rbx, r12
        je      .L4
.L3:
        movsx   esi, BYTE PTR [rbp+0+r12]
        mov     edx, 2
        mov     edi, OFFSET FLAT:.LC0
        call    memchr
        test    rax, rax
        je      .L8
        mov     rax, r12
        pop     rbx
        pop     rbp
        pop     r12
        ret
.L4:
        mov     r12, -1
        pop     rbx
        pop     rbp
        mov     rax, r12
        pop     r12
        ret

findFirstE_fast(std::basic_string_view<char, std::char_traits<char> >):
        add     rdi, rsi
        cmp     rdi, rsi
        je      .L13
        mov     rax, rsi
        jmp     .L12
.L15:
        add     rax, 1
        cmp     rdi, rax
        je      .L13
.L12:
        movzx   edx, BYTE PTR [rax]
        and     edx, -33
        cmp     dl, 69
        jne     .L15
        sub     rax, rsi
        ret
.L13:
        mov     rax, -1
        ret


Even though both findFirstE_slow() and findFirstE_fast() are logically equivalent, findFirstE_slow() calls memchr() for every character in the input string.

findFirstE_fast() does what we would reasonably expect, comparing every character in the input string with 'e' and 'E'.

In practice, when the set of characters to be found is small, findFirstE_slow() runs noticeably slower than findFirstE_fast(). This seems to be a missed optimization in char_traits<char>::find(), which affects several find-related methods in string_view.

Relevant StackOverflow question with quick-bench output, Compiler Explorer output, and more discussion: https://stackoverflow.com/questions/70433152/missed-optimization-with-string-viewfind-first-of
Comment 1 Andrew Pinski 2021-12-22 07:24:32 UTC
Really the issue is:
  _8 = __builtin_memchr ("eE", _7, 2);
  if (_8 != 0B)


Should just be expanded to _7 == 'e' || _7 == 'E' .

That is:
int f(char a)
{
   return  __builtin_memchr ("e", a, 1) != 0;
}
int f1(char a)
{
   return a == 'e';
}
int g(char a)
{
   return  __builtin_memchr ("eE", a, 2) != 0;
}
int g1(char a)
{
   return a == 'e' || a == 'E';
}

f and f1 should produce the same assembly
and g and g1 should produce the same (similar) assembly.
Comment 2 H.J. Lu 2022-06-16 17:37:53 UTC
Created attachment 53154 [details]
A patch

Like this?
Comment 3 GCC Commits 2022-07-14 21:10:52 UTC
The master branch has been updated by H.J. Lu <hjl@gcc.gnu.org>:

https://gcc.gnu.org/g:c6cf555a88f8ffb8ec85c2b94fe656ab217260ea

commit r13-1699-gc6cf555a88f8ffb8ec85c2b94fe656ab217260ea
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri Jun 17 07:33:06 2022 -0700

    Simplify memchr with small constant strings
    
    When memchr is applied on a constant string of no more than the bytes of
    a word, simplify memchr by checking each byte in the constant string.
    
    int f (int a)
    {
       return  __builtin_memchr ("AE", a, 2) != 0;
    }
    
    is simplified to
    
    int f (int a)
    {
      return ((char) a == 'A' || (char) a == 'E') != 0;
    }
    
    gcc/
    
            PR tree-optimization/103798
            * tree-ssa-forwprop.cc: Include "tree-ssa-strlen.h".
            (simplify_builtin_call): Inline memchr with constant strings of
            no more than the bytes of a word.
            * tree-ssa-strlen.cc (use_in_zero_equality): Make it global.
            * tree-ssa-strlen.h (use_in_zero_equality): New.
    
    gcc/testsuite/
    
            PR tree-optimization/103798
            * c-c++-common/pr103798-1.c: New test.
            * c-c++-common/pr103798-2.c: Likewise.
            * c-c++-common/pr103798-3.c: Likewise.
            * c-c++-common/pr103798-4.c: Likewise.
            * c-c++-common/pr103798-5.c: Likewise.
            * c-c++-common/pr103798-6.c: Likewise.
            * c-c++-common/pr103798-7.c: Likewise.
            * c-c++-common/pr103798-8.c: Likewise.
            * c-c++-common/pr103798-9.c: Likewise.
            * c-c++-common/pr103798-10.c: Likewise.
Comment 4 Jan-Benedict Glaw 2022-08-28 07:42:32 UTC
I just started looking through the automated builds logs and noticed that three configurations (`.../configure --enable-werror-always --enable-languages=all --disable-gcov --disable-shared --disable-threads --target=rl78-elf --without-headers`, and with --target=avr-elf as well as with --target=gcc-pru-elf) break during `make V=1 all-gcc`:

[...]
/usr/lib/gcc-snapshot/bin/g++  -fno-PIE -c   -g -O2   -DIN_GCC  -DCROSS_DIRECTORY_STRUCTURE   -fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall -Wno-narrowing -Wwrite-strings -W
cast-qual -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -fno-common  -DHAVE_CONFIG_H -I. -I. -I../../gcc/gcc -I../../gcc/gcc/. -I../../gcc/gcc/../include -I../../gcc/gcc/../libcpp/include -I../../gcc/gcc/../libcody  -I../../gcc/gcc/../libdecnumber -I../../gcc/gcc/../libdecnumber/dpd -I../libdecnumber -I../../gcc/gcc/../libbacktrace   -o tree-ssa-forwprop.o -MT tree-ssa-forwprop.o -MMD -MP -MF ./.deps/tree-ssa-forwprop.TPo ../../gcc/gcc/tree-ssa-forwprop.cc
../../gcc/gcc/tree-ssa-forwprop.cc: In function 'bool simplify_builtin_call(gimple_stmt_iterator*, tree)':
../../gcc/gcc/tree-ssa-forwprop.cc:1258:42: error: array subscript 1 is outside array bounds of 'tree_node* [1]' [-Werror=array-bounds]
 1258 |             op[i - 1] = fold_convert_loc (loc, boolean_type_node,
      |                         ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
 1259 |                                           fold_build2_loc (loc,
      |                                           ~~~~~~~~~~~~~~~~~~~~~
 1260 |                                                            BIT_IOR_EXPR,
      |                                                            ~~~~~~~~~~~~~
 1261 |                                                            boolean_type_node,
      |                                                            ~~~~~~~~~~~~~~~~~~
 1262 |                                                            op[i - 1],
      |                                                            ~~~~~~~~~~
 1263 |                                                            op[i]));
      |                                                            ~~~~~~~
In file included from ../../gcc/gcc/system.h:707,
                 from ../../gcc/gcc/tree-ssa-forwprop.cc:21:
../../gcc/gcc/../include/libiberty.h:733:36: note: at offset 8 into object of size [0, 8] allocated by '__builtin_alloca'
  733 | # define alloca(x) __builtin_alloca(x)
      |                    ~~~~~~~~~~~~~~~~^~~
../../gcc/gcc/../include/libiberty.h:365:40: note: in expansion of macro 'alloca'
  365 | #define XALLOCAVEC(T, N)        ((T *) alloca (sizeof (T) * (N)))
      |                                        ^~~~~~
../../gcc/gcc/tree-ssa-forwprop.cc:1250:22: note: in expansion of macro 'XALLOCAVEC'
 1250 |           tree *op = XALLOCAVEC (tree, isize);
      |                      ^~~~~~~~~~
cc1plus: all warnings being treated as errors
make[1]: *** [Makefile:1146: tree-ssa-forwprop.o] Error 1


This also happens for --target=avr-elf and --target=gcc-pru-elf .

Thank,
  Jan-Benedict