[Bug target/80813] New: x86: std::vector<bool>::operator[] could be somewhat faster using BT instead of SHL

peter at cordes dot ca <gcc-bugzilla@gcc.gnu.org>
Thu May 18 05:37:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80813

            Bug ID: 80813
           Summary: x86: std::vector<bool>::operator[] could be somewhat
                    faster using BT instead of SHL
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

This actually applies to all cases of testing a single bit in an array, not
just vector<bool>.

#include <vector>
bool f(const std::vector<bool>& v, std::size_t x) {
    return v[x];
}
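As a minimal illustration of the general case (a hand-rolled bitmap rather than vector<bool>; the helper name `test_bit` is hypothetical, not from the report), this source form produces the same build-a-mask-and-TEST pattern:

```cpp
#include <cstddef>
#include <cstdint>

// Hand-rolled bitmap lookup: same shape as vector<bool>::operator[].
// GCC builds a 1<<(x%64) mask and TESTs it against the loaded word,
// instead of using BT.
bool test_bit(const std::uint64_t* words, std::size_t x) {
    return words[x / 64] & (std::uint64_t{1} << (x % 64));
}
```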

compiles with -O3 to this asm on gcc 5.x, 6.x, 7.1, and 8:

        # https://godbolt.org/g/OHsgcv
        movq    (%rdi), %rdx
        movq    %rsi, %r8
        movl    %esi, %ecx
        shrq    $6, %r8
        movl    $1, %eax
        salq    %cl, %rax
        testq   %rax, (%rdx,%r8,8)
        setne   %al
        ret

(or with -march=haswell, it uses SHLX for the 1<<x)

It would be more efficient to do 1<<x using
  xor  %eax,%eax
  bts  %rsi,%rax

but gcc doesn't do this, even with -march=haswell.  (On AMD CPUs, BTS reg,reg
is 2 m-ops, but it's probably still about break-even there if it saves a MOV to
copy the shift count into %ecx.  For AMD CPUs with BMI2, MOV $1,%reg + SHLX is
probably optimal despite the larger code size.)

---

But it's even better to load and then test the bit with BT like clang does.

There's also no need to do a 64-bit load.  We can avoid a lot of REX prefixes
by doing a 32-bit load.

        # hand-crafted sequence I think is optimal, modified from clang output
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi        # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq

Using BT takes fewer instructions, and a variable-count SHL is slower than BT
on Intel SnB-family CPUs.  According to Agner Fog's tables, SHL %cl,%reg is 3
uops on SnB/HSW/SKL, but BT %reg,%reg is 1 uop on both Intel and AMD CPUs.
(SHLX is also 1 uop: BMI2 shifts avoid the x86 legacy baggage of leaving flags
untouched when count==0, which is what makes the legacy variable-count shifts
expensive.)

If we were branching on the result, TEST/JZ could macro-fuse on Intel and AMD
CPUs, but BT/JNC couldn't.  Even so, the BT strategy is a speed win without
BMI2, and with BMI2 it's still a win on code-size.


----

Actually, if not branching on it, we could just isolate the bit with a shift
and mask, instead of a TEST and SETCC.

        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi
        movl    (%rax,%rsi,4), %eax
        shrx    %ecx, %eax, %eax
        andl    $1, %eax
        retq

This is as cheap as the BT/SETB sequence when BMI2 is available, and it
zero-extends the bool for free to fill a whole register, instead of leaving it
only in the low 8 bits.  (OTOH, BT/SETNC lets us invert the bit for free.)
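In source terms, this shift-and-mask strategy corresponds to shifting the loaded word right by the bit index and masking, rather than building a one-bit mask and testing.  A minimal sketch (the helper name `bit_value` is hypothetical), which with BMI2 maps naturally onto SHRX + AND:

```cpp
#include <cstddef>
#include <cstdint>

// Shift-and-mask form: shift the loaded 32-bit word right by (x % 32)
// and isolate the low bit, instead of TESTing against a 1<<x mask.
// The result is already zero-extended to a full register.
unsigned bit_value(const std::uint32_t* words, std::size_t x) {
    return (words[x / 32] >> (x % 32)) & 1u;
}
```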

The latency should be the same as with SHL/TEST/SETNZ or with BT/SETC, because
TEST with a memory operand still runs as a separate load plus a 1-cycle test.
The load can start before the mask operand is ready, so there are always two
1-cycle ALU operations after the load.

Without BMI2, using SHR %cl,%eax would be worse on Intel SnB-family: it has
higher latency than BT and sits on the critical path (unlike the SHL strategy,
where the mask is built in parallel with the load, so load-use latency hides
the latency of setting up the mask unless ALU resource conflicts delay it).
