Bug 80813 - [12/13/14/15 Regression] x86: std::vector<bool>::operator[] could be somewhat faster using BT instead of SHL
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 8.0
Importance: P3 normal
Target Milestone: 12.5
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Blocks: std::vector
Reported: 2017-05-18 04:47 UTC by Peter Cordes
Modified: 2025-01-23 14:51 UTC
CC: 3 users

Target: x86_64-*-*, i?86-*-*
Last reconfirmed: 2024-12-20 00:00:00


Description Peter Cordes 2017-05-18 04:47:51 UTC
This actually applies to all cases of testing a single bit in an array, not just vector<bool>.

#include <vector>
bool f(const std::vector<bool>& v, std::size_t x) {
    return v[x];
}

compiles with -O3 to this asm on GCC 5.x, 6.x, 7.1, and 8:

        # https://godbolt.org/g/OHsgcv
        movq    (%rdi), %rdx
        movq    %rsi, %r8
        movl    %esi, %ecx
        shrq    $6, %r8
        movl    $1, %eax
        salq    %cl, %rax
        testq   %rax, (%rdx,%r8,8)
        setne   %al
        ret

(or with -march=haswell, it uses SHLX for the 1<<x)
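
For reference, here is the generalization claimed at the top: a plain word array gets the same codegen.  A minimal sketch (the test_bit helper is hypothetical, not from the report); GCC emits the same mov $1 / variable-count SHL / TEST sequence for it:

#include <cstdint>
#include <cstddef>

// Test one bit of a flat bitmap stored as 64-bit words.
bool test_bit(const std::uint64_t* words, std::size_t x) {
    // mask = 1 << (x % 64), then AND against the containing word
    return (words[x / 64] & (std::uint64_t(1) << (x % 64))) != 0;
}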

It would be more efficient to do 1<<x using
  xor  %eax,%eax
  bts  %rsi,%rax

but gcc doesn't do this, even with -march=haswell.  (On AMD CPUs, BTS reg,reg is 2 m-ops, but it's probably about break-even if it saves a MOV to copy the shift count into ECX.  For AMD with BMI2, mov $1,reg plus SHLX is probably optimal despite the larger code size.)

---

But it's even better to load the word and then test the bit with BT, like clang does.

There's also no need to do a 64-bit load.  We can avoid a lot of REX prefixes by doing a 32-bit load.

        # hand-crafted sequence I think is optimal, modified from clang output
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi        # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq

Using BT takes fewer instructions, and variable-count SHL is slower than BT on Intel SnB-family CPUs.  According to Agner Fog's tables, SHL %cl,%reg is 3 uops on SnB/HSW/SKL, but BT %reg,%reg is 1 uop on both Intel and AMD CPUs.  (SHLX is 1 uop, so BMI2 avoids the x86 legacy baggage of leaving flags untouched when count==0.)

If we were branching on it, TEST/JZ could macro-fuse on Intel and AMD CPUs, but BT/JNC couldn't.  Without BMI2 the BT strategy is still a speed win; with BMI2 it's only a win on code size.
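
To illustrate the branching case, a hedged sketch (visit_if_set and on_set are hypothetical, not from the report): writing the test with a mask lets the compiler emit TEST plus a conditional branch, which can macro-fuse, whereas a BT/JNC pair cannot:

#include <cstdint>
#include <cstddef>

extern void on_set();  // hypothetical external callee

void visit_if_set(const std::uint64_t* words, std::size_t x) {
    // the bit test feeds a conditional branch; the final
    // TEST + JCC pair can macro-fuse into a single uop
    if (words[x / 64] & (std::uint64_t(1) << (x % 64)))
        on_set();
}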


----

Actually, if not branching on it, we could just isolate the bit with a shift and mask, instead of a TEST and SETCC.

        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi
        movl    (%rax,%rsi,4), %eax
        shrx    %ecx, %eax, %eax
        andl    $1, %eax
        retq

This is as cheap as the BT/SETB sequence if BMI2 is available, but it zero-extends the bool for free to fill a whole register instead of leaving it only in the low 8 bits.  (OTOH, BT/SETNC lets us invert the bit for free.)

The latency should be the same as with SHL/TEST/SETNZ or BT/SETC, because TEST with a memory operand still runs as a separate load followed by a 1-cycle test, and the load can start before the mask operand is ready.  So there are always two 1-cycle ALU operations after the load.

Without BMI2, using shr %cl,%eax would be worse on Intel SnB-family, because it has higher latency than BT and sits on the critical path.  (That's unlike the case where SHL builds a mask in parallel with the load: there, load-use latency hides the latency of setting up the mask, unless ALU resource conflicts delay it.)
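
A source-level sketch of that shift-and-mask form (the bit_shift_mask helper is hypothetical; it uses 64-bit words, whereas the hand-written asm above uses 32-bit loads to save REX prefixes):

#include <cstdint>
#include <cstddef>

bool bit_shift_mask(const std::uint64_t* words, std::size_t x) {
    // shift the target bit down to bit 0, then isolate it; with
    // -mbmi2 the variable shift can compile to SHRX, giving the
    // SHRX + AND sequence shown above
    return (words[x / 64] >> (x % 64)) & 1;
}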
Comment 1 Andrew Pinski 2021-07-21 05:45:34 UTC
Note the code is worse in GCC 11+; there seems to be a missed optimization at the GIMPLE level ...
Comment 2 Jan Hubicka 2024-12-20 13:24:04 UTC
trunk does:
f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
        testq   %rsi, %rsi
        leaq    63(%rsi), %rax
        movq    (%rdi), %rdx
        cmovns  %rsi, %rax
        sarq    $6, %rax
        leaq    (%rdx,%rax,8), %rdx
        movq    %rsi, %rax
        sarq    $63, %rax
        shrq    $58, %rax
        addq    %rax, %rsi
        andl    $63, %esi
        subq    %rax, %rsi
        jns     .L2
        addq    $64, %rsi
        subq    $8, %rdx
.L2:
        movl    $1, %eax
        shlx    %rsi, %rax, %rax
        andq    (%rdx), %rax
        setne   %al
        ret

The corresponding GIMPLE dump:

Removing basic block 5
bool f (const struct vector & v, size_t x)
{ 
  difference_type __n;
  _Bit_type * const SR.16;
  _Bit_type * _4;
  long int __n.0_5;
  long unsigned int _12;
  long unsigned int _13;
  long unsigned int _14;
  bool _15;
  long int _16;
  long int _20;
  long unsigned int _21;
  long unsigned int _22;
  _Bit_type * _23;
  _Bit_type * _26;
  unsigned int _42;
  
  <bb 2> [local count: 1073741824]:
  _4 = v_2(D)->D.25666._M_impl.D.25135._M_start.D.16486._M_p;
  __n.0_5 = (long int) x_3(D);
  _20 = __n.0_5 / 64;
  _21 = (long unsigned int) _20;
  _22 = _21 * 8;
  _23 = _4 + _22;
  __n_24 = __n.0_5 % 64;
  if (__n_24 < 0)
    goto <bb 3>; [41.00%]
  else
    goto <bb 4>; [59.00%]
  
  <bb 3> [local count: 440234144]:
  __n_25 = __n_24 + 64;
  _26 = _23 + 18446744073709551608;

  <bb 4> [local count: 1073741824]:
  # SR.16_41 = PHI <_26(3), _23(2)>
  # _16 = PHI <__n_25(3), __n_24(2)>
  _42 = (unsigned int) _16;
  _12 = 1 << _42;
  _13 = *SR.16_41;
  _14 = _12 & _13;
  _15 = _14 != 0;
  return _15;
  
} 

This is a regression since GCC 7, which produces more reasonable code:
f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
        movq    (%rdi), %rdx
        movq    %rsi, %rcx
        movq    %rsi, %rax
        movl    $1, %esi
        shrq    $6, %rcx
        shlx    %rax, %rsi, %rsi
        andq    (%rdx,%rcx,8), %rsi
        setne   %al
        ret

clang (with libstdc++):
f(std::vector<bool, std::allocator<bool>> const&, unsigned long):
        leaq    63(%rsi), %rax
        testq   %rsi, %rsi
        cmovnsq %rsi, %rax
        sarq    $6, %rax
        shlq    $3, %rax
        addq    (%rdi), %rax
        movabsq $-9223372036854775808, %rcx
        leaq    63(%rcx), %rdx
        andq    %rsi, %rdx
        xorl    %edi, %edi
        cmpq    %rcx, %rdx
        setbe   %dil
        movq    -8(%rax,%rdi,8), %rax
        btq     %rsi, %rax
        setb    %al
        retq

clang with libc++

f(std::__1::vector<bool, std::__1::allocator<bool>> const&, unsigned long):
        movq    (%rdi), %rax
        movq    %rsi, %rcx
        shrq    $6, %rcx
        movq    (%rax,%rcx,8), %rax
        btq     %rsi, %rax
        setb    %al
        retq
Comment 3 Jan Hubicka 2024-12-20 13:50:29 UTC
OK, so the horrid codegen is because vector<bool>'s operator[] is implemented using an iterator:
  return begin()[__n];
and the iterator's operator[] is implemented using:

    _GLIBCXX20_CONSTEXPR
    void
    _M_incr(ptrdiff_t __i)
    {
      _M_assume_normalized();
      difference_type __n = __i + _M_offset;
      _M_p += __n / int(_S_word_bit);
      __n = __n % int(_S_word_bit);
      if (__n < 0)
        {
          __n += int(_S_word_bit);
          --_M_p;
        }
      _M_offset = static_cast<unsigned int>(__n);
    }


So the conditional is a check for negative __n, which makes sense for the iterator's operator[] but not for the vector's.  We could add a __builtin_unreachable hint that __n is at most max_size (see the sketch below), but since operator[] is such a common operation on vectors, I think it makes sense to make the compiler's life easier and micro-optimize it directly in the vector.
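
A sketch of what that hint could look like (untested, just for illustration): keep the iterator implementation, but promise the compiler that __n is never out of range, so the negative-__n fixup in _M_incr can be optimized away:

      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      reference
      operator[](size_type __n)
      {
        __glibcxx_requires_subscript(__n);
        if (__n > max_size ())          // never true for a valid index
          __builtin_unreachable ();     // hint: __n is in [0, max_size]
        return begin()[__n];
      }

The patch below takes the direct approach instead: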

diff --git a/libstdc++-v3/include/bits/stl_bvector.h b/libstdc++-v3/include/bits/stl_bvector.h
index 341eee33b21..43a07bc3e55 100644
--- a/libstdc++-v3/include/bits/stl_bvector.h
+++ b/libstdc++-v3/include/bits/stl_bvector.h
@@ -1132,7 +1141,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       operator[](size_type __n)
       {
        __glibcxx_requires_subscript(__n);
-       return begin()[__n];
+       return _Bit_reference (this->_M_impl._M_start._M_p
+                              + __n / int(_S_word_bit),
+                              __n % int(_S_word_bit));
       }
 
       _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
@@ -1140,7 +1151,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       operator[](size_type __n) const
       {
        __glibcxx_requires_subscript(__n);
-       return begin()[__n];
+       return _Bit_reference (this->_M_impl._M_start._M_p
+                              + __n / int(_S_word_bit),
+                              __n % int(_S_word_bit));
       }
 
     protected:

With this I now get:
.LFB1248:
        .cfi_startproc
        movq    (%rdi), %rax
        movq    %rsi, %rdx
        shrq    $6, %rdx
        andq    (%rax,%rdx,8), %rsi
        andl    $63, %esi
        setne   %al
        ret
That is certainly better, though it still does not use BT.
Not sure if _M_incr can be tweaked for less horrid codegen.
Comment 4 Jan Hubicka 2024-12-20 13:57:25 UTC
The _Bit_reference constructor takes a mask, not a bit position, so the patch needs a correction:
      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      reference
      operator[](size_type __n)
      { 
        __glibcxx_requires_subscript(__n);
        return _Bit_reference (this->_M_impl._M_start._M_p
                               + __n / int(_S_word_bit),
                               1UL << __n % int(_S_word_bit));
      }

      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      const_reference
      operator[](size_type __n) const
      { 
        __glibcxx_requires_subscript(__n);
        return _Bit_reference (this->_M_impl._M_start._M_p
                               + __n / int(_S_word_bit),
                               1UL << __n % int(_S_word_bit));
      }
This gets us back to the original code:
_Z1fRKSt6vectorIbSaIbEEm:
.LFB1248:
        .cfi_startproc
        movq    %rdi, %rax
        movq    %rsi, %rdi
        movl    %esi, %ecx
        movq    (%rax), %rdx
        shrq    $6, %rdi
        movl    $1, %eax
        salq    %cl, %rax
        andq    (%rdx,%rdi,8), %rax
        setne   %al
        ret
Comment 5 Jan Hubicka 2024-12-20 14:22:39 UTC
Combine constructs:

(set (reg:CCZ 17 flags)
    (compare:CCZ (zero_extract:DI (mem:DI (plus:DI (mult:DI (reg:DI 111 [ _8 ])
                        (const_int 8 [0x8]))
                    (reg/f:DI 112 [ v_2(D)->D.25666._M_impl.D.25135._M_start.D.16486._M_p ])) [3 *_10+0 S8 A64])
            (const_int 1 [0x1])
            (subreg:QI (reg:SI 116 [ _12 ]) 0))
        (const_int 0 [0])))

While bt pattern is:


(define_insn "*bt<mode>"
  [(set (reg:CCC FLAGS_REG)
    (compare:CCC
      (zero_extract:SWI48
        (match_operand:SWI48 0 "nonimmediate_operand" "r,m")
        (const_int 1)
        (match_operand:QI 1 "nonmemory_operand" "q<S>,<S>"))
      (const_int 0)))]
  ""
{
  switch (get_attr_mode (insn))
    {
    case MODE_SI:
      return "bt{l}\t{%k1, %k0|%k0, %k1}";

    case MODE_DI:
      return "bt{q}\t{%q1, %0|%0, %q1}";

    default:
      gcc_unreachable ();
    }
} 
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set (attr "mode")
    (if_then_else
      (and (match_test "CONST_INT_P (operands[1])")
           (match_test "INTVAL (operands[1]) < 32"))
      (const_string "SI")
      (const_string "<MODE>")))])

It fails since BT produces the carry flag (CCC) while we want the zero flag (CCZ)...
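
For reference, BT really does report the bit in CF; a minimal inline-asm sketch (the bt_word helper is hypothetical, not part of any patch) using GCC's x86 flag-output constraint "=@ccc" to consume the carry flag directly:

#include <cstdint>

bool bt_word(std::uint64_t word, std::uint64_t bit) {
    bool set;
    // BT copies the selected bit into CF; "=@ccc" reads CF as a bool
    asm("btq %2, %1" : "=@ccc"(set) : "r"(word), "r"(bit));
    return set;
}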
Comment 6 Jan Hubicka 2024-12-27 20:16:22 UTC
A patch to make operator[] branchless again has been posted:
https://gcc.gnu.org/pipermail/gcc-patches/2024-December/672286.html

The main problem with auto-generating BT is that it needs a change of the condition from CCZ to CCC, which requires some combine trickery.
Comment 7 GCC Commits 2025-01-23 14:51:45 UTC
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:2d55c0161562f96d2230cd132b494a5d06352a23

commit r15-7163-g2d55c0161562f96d2230cd132b494a5d06352a23
Author: Jan Hubicka <jh@suse.cz>
Date:   Thu Jan 23 15:50:50 2025 +0100

    Optimize vector<bool>::operator[]
    
    the following testcase:
    
      bool f(const std::vector<bool>& v, std::size_t x) {
        return v[x];
      }
    
    is compiled as:
    
    f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
            testq   %rsi, %rsi
            leaq    63(%rsi), %rax
            movq    (%rdi), %rdx
            cmovns  %rsi, %rax
            sarq    $6, %rax
            leaq    (%rdx,%rax,8), %rdx
            movq    %rsi, %rax
            sarq    $63, %rax
            shrq    $58, %rax
            addq    %rax, %rsi
            andl    $63, %esi
            subq    %rax, %rsi
            jns     .L2
            addq    $64, %rsi
            subq    $8, %rdx
    .L2:
            movl    $1, %eax
            shlx    %rsi, %rax, %rax
            andq    (%rdx), %rax
            setne   %al
            ret
    
    which is quite expensive for a simple bit access in a bitmap.  The reason is
    that the bit access is implemented using iterators:
            return begin()[__n];
    which in turn handles the case where __n is negative, yielding the extra
    conditional.
    
        _GLIBCXX20_CONSTEXPR
        void
        _M_incr(ptrdiff_t __i)
        {
          _M_assume_normalized();
          difference_type __n = __i + _M_offset;
          _M_p += __n / int(_S_word_bit);
          __n = __n % int(_S_word_bit);
          if (__n < 0)
            {
              __n += int(_S_word_bit);
              --_M_p;
            }
          _M_offset = static_cast<unsigned int>(__n);
        }
    
    We could use __builtin_unreachable to declare that __n is in the range
    0...max_size (), but I think it is better to implement it directly, since the
    resulting code is shorter and much easier to optimize.
    
    We now produce:
    .LFB1248:
            .cfi_startproc
            movq    (%rdi), %rax
            movq    %rsi, %rdx
            shrq    $6, %rdx
            andq    (%rax,%rdx,8), %rsi
            andl    $63, %esi
            setne   %al
            ret
    
    The testcase suggests:
            movq    (%rdi), %rax
            movl    %esi, %ecx
            shrq    $5, %rsi        # does still need to be 64-bit
            movl    (%rax,%rsi,4), %eax
            btl     %ecx, %eax
            setb    %al
            retq
    That is still one instruction shorter.
    
    libstdc++-v3/ChangeLog:
    
            PR target/80813
            * include/bits/stl_bvector.h (vector<bool, _Alloc>::operator []): Do
            not use iterators.
    
    gcc/testsuite/ChangeLog:
    
            PR target/80813
            * g++.dg/tree-ssa/bvector-3.C: New test.