This actually applies to all cases of testing a single bit in an array, not just vector<bool>.

#include <vector>
bool f(const std::vector<bool>& v, std::size_t x) { return v[x]; }

compiles with -O3 to this asm, on gcc 5.x, 6.x, 7.1, and 8:

        # https://godbolt.org/g/OHsgcv
        movq    (%rdi), %rdx
        movq    %rsi, %r8
        movl    %esi, %ecx
        shrq    $6, %r8
        movl    $1, %eax
        salq    %cl, %rax
        testq   %rax, (%rdx,%r8,8)
        setne   %al
        ret

(or with -march=haswell, it uses SHLX for the 1<<x)

It would be more efficient to do 1<<x using

        xor     %eax, %eax
        bts     %rsi, %rax

but gcc doesn't do this, even with -march=haswell.  (On AMD CPUs, BTS reg,reg is 2 m-ops, but it's probably about break-even if it saves a mov instruction to copy the shift count to ecx.  For AMD with BMI2, mov $1,reg and shlx is probably optimal despite the larger code size.)

---

But it's even better to load and then test the bit with BT, like clang does.  There's also no need to do a 64-bit load; we can avoid a lot of REX prefixes by doing a 32-bit load.

        # hand-crafted sequence I think is optimal, modified from clang output
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi              # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq

Using BT is fewer instructions, and variable-count SHL is slower than BT on Intel SnB-family CPUs.  According to Agner Fog's tables, SHL %cl,%reg is 3 uops on SnB/HSW/SKL, but BT %reg,%reg is 1 uop on Intel and AMD CPUs.  (SHLX is 1 uop, so BMI2 avoids the x86 legacy baggage of leaving flags untouched if count==0.)

If we were branching on it, TEST/JZ could macro-fuse on Intel and AMD CPUs, but BT/JNC couldn't.  Even then, the BT strategy is still a speed win, or with BMI2 only a win on code size.

----

Actually, if not branching on it, we could just isolate the bit with a shift and mask, instead of a TEST and SETCC:

        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi
        movl    (%rax,%rsi,4), %eax
        shrx    %ecx, %eax, %eax
        andl    $1, %eax
        retq

This is as cheap as the bt/setb sequence if BMI2 is available, but it zero-extends the bool for free to fill a register instead of leaving it only in the low 8 bits.  (OTOH, BT/SETNC lets us invert the bit for free.)

The latency should be the same as with SHL/TEST/SETNZ or with BT/SETC, because TEST with a memory operand still runs as a separate load and then a 1-cycle test, and the load can start before the mask operand is ready.  So there are always two 1-cycle ALU operations after the load.

Without BMI2, using shr %cl,%eax would be worse on Intel SnB-family, because it has higher latency than BT and it sits on the critical path (unlike when shl is used to make a mask in parallel with the load, where load-use latency hides the latency of setting up the mask unless ALU resource conflicts delay it).
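For reference, here is the same load / shift / mask strategy written out in plain C++.  This is a hand-written sketch of the idea discussed above (the function name and the 32-bit word layout are my own illustration, not anything from libstdc++); whether the compiler actually selects shrx/and, bt/setc, or shl/test for it depends on the target and compiler version.

#include <cstddef>
#include <cstdint>

// Test bit x in a flat bit array stored as 32-bit words.  Using 32-bit
// words means the load needs no REX prefix, as discussed above; the
// shift count only needs the low 5 bits.
bool test_bit32(const std::uint32_t* words, std::size_t x)
{
    std::uint32_t w = words[x / 32];   // word containing bit x
    return (w >> (x % 32)) & 1;        // isolate the bit: shift and mask
}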
Note the code is worse in GCC 11+; there seems to be a missed optimization at the GIMPLE level ...
trunk does:

f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
        testq   %rsi, %rsi
        leaq    63(%rsi), %rax
        movq    (%rdi), %rdx
        cmovns  %rsi, %rax
        sarq    $6, %rax
        leaq    (%rdx,%rax,8), %rdx
        movq    %rsi, %rax
        sarq    $63, %rax
        shrq    $58, %rax
        addq    %rax, %rsi
        andl    $63, %esi
        subq    %rax, %rsi
        jns     .L2
        addq    $64, %rsi
        subq    $8, %rdx
.L2:
        movl    $1, %eax
        shlx    %rsi, %rax, %rax
        andq    (%rdx), %rax
        setne   %al
        ret

Removing basic block 5
bool f (const struct vector & v, size_t x)
{
  difference_type __n;
  _Bit_type * const SR.16;
  _Bit_type * _4;
  long int __n.0_5;
  long unsigned int _12;
  long unsigned int _13;
  long unsigned int _14;
  bool _15;
  long int _16;
  long int _20;
  long unsigned int _21;
  long unsigned int _22;
  _Bit_type * _23;
  _Bit_type * _26;
  unsigned int _42;

  <bb 2> [local count: 1073741824]:
  _4 = v_2(D)->D.25666._M_impl.D.25135._M_start.D.16486._M_p;
  __n.0_5 = (long int) x_3(D);
  _20 = __n.0_5 / 64;
  _21 = (long unsigned int) _20;
  _22 = _21 * 8;
  _23 = _4 + _22;
  __n_24 = __n.0_5 % 64;
  if (__n_24 < 0)
    goto <bb 3>; [41.00%]
  else
    goto <bb 4>; [59.00%]

  <bb 3> [local count: 440234144]:
  __n_25 = __n_24 + 64;
  _26 = _23 + 18446744073709551608;

  <bb 4> [local count: 1073741824]:
  # SR.16_41 = PHI <_26(3), _23(2)>
  # _16 = PHI <__n_25(3), __n_24(2)>
  _42 = (unsigned int) _16;
  _12 = 1 << _42;
  _13 = *SR.16_41;
  _14 = _12 & _13;
  _15 = _14 != 0;
  return _15;
}

This is a regression since gcc 7, which produces more reasonable code:

f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
        movq    (%rdi), %rdx
        movq    %rsi, %rcx
        movq    %rsi, %rax
        movl    $1, %esi
        shrq    $6, %rcx
        shlx    %rax, %rsi, %rsi
        andq    (%rdx,%rcx,8), %rsi
        setne   %al
        ret

clang:

f(std::vector<bool, std::allocator<bool>> const&, unsigned long):
        leaq    63(%rsi), %rax
        testq   %rsi, %rsi
        cmovnsq %rsi, %rax
        sarq    $6, %rax
        shlq    $3, %rax
        addq    (%rdi), %rax
        movabsq $-9223372036854775808, %rcx
        leaq    63(%rcx), %rdx
        andq    %rsi, %rdx
        xorl    %edi, %edi
        cmpq    %rcx, %rdx
        setbe   %dil
        movq    -8(%rax,%rdi,8), %rax
        btq     %rsi, %rax
        setb    %al
        retq

clang with libc++:

f(std::__1::vector<bool, std::__1::allocator<bool>> const&, unsigned long):
        movq    (%rdi), %rax
        movq    %rsi, %rcx
        shrq    $6, %rcx
        movq    (%rax,%rcx,8), %rax
        btq     %rsi, %rax
        setb    %al
        retq
OK, so the horrid codegen is because bvector's [] operator is implemented using an iterator:

  return begin()[__n];

and the iterator's [] operator is implemented using:

  _GLIBCXX20_CONSTEXPR void
  _M_incr(ptrdiff_t __i)
  {
    _M_assume_normalized();
    difference_type __n = __i + _M_offset;
    _M_p += __n / int(_S_word_bit);
    __n = __n % int(_S_word_bit);
    if (__n < 0)
      {
        __n += int(_S_word_bit);
        --_M_p;
      }
    _M_offset = static_cast<unsigned int>(__n);
  }

So the conditional is a check for negative __n, which makes sense for the iterator's [], but not for vector's [].  We could add a __builtin_unreachable hint that __n is at most max_size, but since [] is such a common operation on vectors, I think it makes sense to make the compiler's life easier and micro-optimize it in the container itself.

diff --git a/libstdc++-v3/include/bits/stl_bvector.h b/libstdc++-v3/include/bits/stl_bvector.h
index 341eee33b21..43a07bc3e55 100644
--- a/libstdc++-v3/include/bits/stl_bvector.h
+++ b/libstdc++-v3/include/bits/stl_bvector.h
@@ -1132,7 +1141,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       operator[](size_type __n)
       {
 	__glibcxx_requires_subscript(__n);
-	return begin()[__n];
+	return _Bit_reference (this->_M_impl._M_start._M_p +
+			       __n / int(_S_word_bit),
+			       __n % int(_S_word_bit));
       }
 
       _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
@@ -1140,7 +1151,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER
       operator[](size_type __n) const
       {
 	__glibcxx_requires_subscript(__n);
-	return begin()[__n];
+	return _Bit_reference (this->_M_impl._M_start._M_p +
+			       __n / int(_S_word_bit),
+			       __n % int(_S_word_bit));
       }
 
     protected:

With this I now get:

.LFB1248:
        .cfi_startproc
        movq    (%rdi), %rax
        movq    %rsi, %rdx
        shrq    $6, %rdx
        andq    (%rax,%rdx,8), %rsi
        andl    $63, %esi
        setne   %al
        ret

That is certainly better.  Still does not use BT.  Not sure if _M_incr can be tweaked for less horrid codegen.
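To make the role of that conditional concrete, here is a standalone sketch (my own illustration with made-up function names, not the libstdc++ sources) of why the signed iterator arithmetic needs the fix-up branch while an unsigned container index does not: signed / and % truncate toward zero, so a negative offset leaves a negative remainder that has to be adjusted, whereas an unsigned index lowers directly to a shift and a mask.

#include <cstddef>

// Signed offset, as in _M_incr: i / 64 and i % 64 truncate toward zero,
// so e.g. i == -1 gives word 0 and remainder -1, which must be fixed up.
void split_signed(std::ptrdiff_t i, std::ptrdiff_t& word, int& bit)
{
    word = i / 64;
    int n = static_cast<int>(i % 64);
    if (n < 0)       // only possible when i is negative
    {
        n += 64;
        --word;      // step back one word, like --_M_p
    }
    bit = n;
}

// Unsigned index, as in the patched operator[]: no negative case exists,
// so this is just shr $6 and and $63, with no branch.
void split_unsigned(std::size_t i, std::size_t& word, unsigned& bit)
{
    word = i / 64;
    bit = static_cast<unsigned>(i % 64);
}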
_Bit_reference's constructor takes a mask, not a bit position:

      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      reference
      operator[](size_type __n)
      {
	__glibcxx_requires_subscript(__n);
	return _Bit_reference (this->_M_impl._M_start._M_p
			       + __n / int(_S_word_bit),
			       1UL << __n % int(_S_word_bit));
      }

      _GLIBCXX_NODISCARD _GLIBCXX20_CONSTEXPR
      const_reference
      operator[](size_type __n) const
      {
	__glibcxx_requires_subscript(__n);
	return _Bit_reference (this->_M_impl._M_start._M_p
			       + __n / int(_S_word_bit),
			       1UL << __n % int(_S_word_bit));
      }

This gets us back to the original code:

_Z1fRKSt6vectorIbSaIbEEm:
.LFB1248:
        .cfi_startproc
        movq    %rdi, %rax
        movq    %rsi, %rdi
        movl    %esi, %ecx
        movq    (%rax), %rdx
        shrq    $6, %rdi
        movl    $1, %eax
        salq    %cl, %rax
        andq    (%rdx,%rdi,8), %rax
        setne   %al
        ret
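As a side note, a reduced stand-in (hypothetical names, not the real _Bit_reference class) showing why the distinction matters: the reference type stores a word pointer plus a precomputed single-bit mask, so passing the raw bit position where the mask is expected would test the wrong bit.

#include <cstddef>
#include <cstdint>

// Reduced stand-in for the bit-reference idea: a word pointer and a mask.
struct BitRef
{
    std::uint64_t* word;
    std::uint64_t  mask;   // must be 1UL << bit, not the bit index itself

    operator bool() const { return (*word & mask) != 0; }
};

// Correct construction: shift the bit index into a single-bit mask.
BitRef make_ref(std::uint64_t* base, std::size_t n)
{
    return BitRef{ base + n / 64, std::uint64_t(1) << (n % 64) };
}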
Combine constructs:

(set (reg:CCZ 17 flags)
     (compare:CCZ (zero_extract:DI (mem:DI (plus:DI (mult:DI (reg:DI 111 [ _8 ])
                                                             (const_int 8 [0x8]))
                                                    (reg/f:DI 112 [ v_2(D)->D.25666._M_impl.D.25135._M_start.D.16486._M_p ])) [3 *_10+0 S8 A64])
                                   (const_int 1 [0x1])
                                   (subreg:QI (reg:SI 116 [ _12 ]) 0))
                  (const_int 0 [0])))

while the bt pattern is:

(define_insn "*bt<mode>"
  [(set (reg:CCC FLAGS_REG)
	(compare:CCC
	  (zero_extract:SWI48
	    (match_operand:SWI48 0 "nonimmediate_operand" "r,m")
	    (const_int 1)
	    (match_operand:QI 1 "nonmemory_operand" "q<S>,<S>"))
	  (const_int 0)))]
  ""
{
  switch (get_attr_mode (insn))
    {
    case MODE_SI:
      return "bt{l}\t{%k1, %k0|%k0, %k1}";
    case MODE_DI:
      return "bt{q}\t{%q1, %0|%0, %q1}";
    default:
      gcc_unreachable ();
    }
}
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set (attr "mode")
	(if_then_else
	  (and (match_test "CONST_INT_P (operands[1])")
	       (match_test "INTVAL (operands[1]) < 32"))
	  (const_string "SI")
	  (const_string "<MODE>")))])

It fails to match because BT produces the carry flag (CCC) while we want the zero flag (CCZ)...
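At the source level, the two forms combine would have to bridge are equivalent, which is what would make a CCZ-to-CCC rewrite legal in the first place.  A trivial C++ sketch of the equivalence (my own illustration, not GCC code; assumes the bit index is in range):

#include <cstdint>

// TEST-style form: materialize a mask and check whether the AND is zero
// (the zero-flag / CCZ shape that combine builds above).
bool via_zero_flag(std::uint64_t x, unsigned b)   // assumes b < 64
{
    return (x & (std::uint64_t(1) << b)) != 0;
}

// BT-style form: BT copies bit b of x into the carry flag, i.e. it
// computes (x >> b) & 1 without materializing the mask (the CCC shape).
bool via_carry_flag(std::uint64_t x, unsigned b)  // assumes b < 64
{
    return (x >> b) & 1;
}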
Patch to make operator[] branchless again posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-December/672286.html

The main problem with auto-generating BT is that it needs a change of the conditional from CCZ to CCC, which requires some combine trickery.
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:2d55c0161562f96d2230cd132b494a5d06352a23

commit r15-7163-g2d55c0161562f96d2230cd132b494a5d06352a23
Author: Jan Hubicka <jh@suse.cz>
Date:   Thu Jan 23 15:50:50 2025 +0100

    Optimize vector<bool>::operator[]

    The following testcase:

        bool f(const std::vector<bool>& v, std::size_t x) { return v[x]; }

    is compiled as:

        f(std::vector<bool, std::allocator<bool> > const&, unsigned long):
                testq   %rsi, %rsi
                leaq    63(%rsi), %rax
                movq    (%rdi), %rdx
                cmovns  %rsi, %rax
                sarq    $6, %rax
                leaq    (%rdx,%rax,8), %rdx
                movq    %rsi, %rax
                sarq    $63, %rax
                shrq    $58, %rax
                addq    %rax, %rsi
                andl    $63, %esi
                subq    %rax, %rsi
                jns     .L2
                addq    $64, %rsi
                subq    $8, %rdx
        .L2:
                movl    $1, %eax
                shlx    %rsi, %rax, %rax
                andq    (%rdx), %rax
                setne   %al
                ret

    which is quite expensive for a simple bit access in a bitmap.  The reason
    is that the bit access is implemented using iterators:

        return begin()[__n];

    which in turn care about the situation where __n is negative, yielding the
    extra conditional:

        _GLIBCXX20_CONSTEXPR void
        _M_incr(ptrdiff_t __i)
        {
          _M_assume_normalized();
          difference_type __n = __i + _M_offset;
          _M_p += __n / int(_S_word_bit);
          __n = __n % int(_S_word_bit);
          if (__n < 0)
            {
              __n += int(_S_word_bit);
              --_M_p;
            }
          _M_offset = static_cast<unsigned int>(__n);
        }

    We could use __builtin_unreachable to declare that __n is in the range
    0...max_size(), but I think it is better to implement operator[] directly,
    since the resulting code is shorter and much easier to optimize.  We now
    produce:

        .LFB1248:
                .cfi_startproc
                movq    (%rdi), %rax
                movq    %rsi, %rdx
                shrq    $6, %rdx
                andq    (%rax,%rdx,8), %rsi
                andl    $63, %esi
                setne   %al
                ret

    The testcase suggests:

        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi              # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq

    which is still one instruction shorter.

    libstdc++-v3/ChangeLog:

            PR target/80813
            * include/bits/stl_bvector.h (vector<bool, _Alloc>::operator []):
            Do not use iterators.

    gcc/testsuite/ChangeLog:

            PR target/80813
            * g++.dg/tree-ssa/bvector-3.C: New test.