[Bug target/80813] New: x86: std::vector<bool>::operator[] could be somewhat faster using BT instead of SHL
peter at cordes dot ca
gcc-bugzilla@gcc.gnu.org
Thu May 18 05:37:00 GMT 2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80813
Bug ID: 80813
Summary: x86: std::vector<bool>::operator[] could be somewhat
faster using BT instead of SHL
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
This actually applies to all cases of testing a single bit in an array, not
just vector<bool>.
#include <vector>
bool f(const std::vector<bool>& v, std::size_t x) {
return v[x];
}
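A minimal stand-alone analogue (the function name is mine, not from the testcase): testing one bit of a plain uint64_t array builds the same 1<<x mask, so it hits the same codegen question as the vector<bool> case.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical plain-array equivalent of vector<bool>::operator[]:
// select the 64-bit word, then test one bit within it.
bool test_bit(const std::uint64_t* words, std::size_t x) {
    return (words[x / 64] >> (x % 64)) & 1;
}
```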
compiles with -O3 to this asm, on gcc5.x, 6.x, 7.1, and 8:
# https://godbolt.org/g/OHsgcv
        movq    (%rdi), %rdx
        movq    %rsi, %r8
        movl    %esi, %ecx
        shrq    $6, %r8
        movl    $1, %eax
        salq    %cl, %rax
        testq   %rax, (%rdx,%r8,8)
        setne   %al
        ret
(or with -march=haswell, it uses SHLX for the 1<<x)
It would be more efficient to do 1<<x using
        xor     %eax,%eax
        bts     %rsi,%rax
but gcc doesn't do this, even with -march=haswell. (On AMD CPUs, BTS reg,reg
is 2 m-ops, but it's probably about break-even there if it saves a MOV to copy
the shift count into ECX. For AMD with BMI2, mov $1,reg and SHLX is probably
optimal despite the larger code size.)
---
But it's even better to load and then test the bit with BT like clang does.
There's also no need to do a 64-bit load. We can avoid a lot of REX prefixes
by doing a 32-bit load.
# hand-crafted sequence I think is optimal, modified from clang output
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi                # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq
Using BT takes fewer instructions, and variable-count SHL is slower than BT on
Intel SnB-family CPUs. According to Agner Fog's tables, SHL %cl,%reg is 3 uops
on SnB/HSW/SKL, but BT %reg,%reg is 1 uop on both Intel and AMD CPUs. (SHLX is
also 1 uop: BMI2 avoids the x86 legacy baggage of leaving flags untouched when
count==0.)
If we were branching on the result, TEST/JZ could macro-fuse on Intel and AMD
CPUs, but BT/JNC couldn't. Even so, the BT strategy is still a speed win
without BMI2; with BMI2 it's only a code-size win.
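To make the branching case concrete, here is a sketch (my naming, not from the testcase) where the single-bit test feeds a branch rather than a setcc, which is where TEST/JZ macro-fusion would matter:

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical caller that branches on each bit: a conditional branch on
// a single-bit test, the case where TEST/JZ can macro-fuse but BT/JNC can't.
int count_set(const std::uint64_t* words, std::size_t n) {
    int c = 0;
    for (std::size_t x = 0; x < n; ++x)
        if ((words[x / 64] >> (x % 64)) & 1)   // branch on one bit
            ++c;
    return c;
}
```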
----
Actually, if not branching on it, we could just isolate the bit with a shift
and mask, instead of a TEST and SETCC.
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi
        movl    (%rax,%rsi,4), %eax
        shrx    %ecx, %eax, %eax
        andl    $1, %eax
        retq
This is as cheap as the bt/setb sequence when BMI2 is available, and it
zero-extends the bool for free to fill a whole register, instead of leaving it
only in the low 8 bits. (OTOH, BT/SETNC lets us invert the bit for free.)
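In C++ source terms (the function name is mine), the shift-and-mask version corresponds to loading a 32-bit word and isolating the bit directly, the form a BMI2 target could lower to SHRX/AND:

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of the shift-and-mask approach on 32-bit words: the result is
// already zero-extended to a full register (0 or 1), no setcc needed.
std::uint32_t bit_as_int(const std::uint32_t* words, std::size_t x) {
    return (words[x / 32] >> (x % 32)) & 1u;
}
```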
The latency should be the same as with SHL/TEST/SETNZ or with BT/SETC, because
TEST with a memory operand still runs as a separate load and then 1-cycle test.
The load can start before the mask operand is ready. So there are always two
1-cycle ALU operations after the load.
Without BMI2, using shr %cl,%eax would be worse on Intel SnB-family: it has
higher latency than BT, and it sits on the critical path. (That's unlike the
SHL in gcc's current sequence, which builds the mask in parallel with the
load, so the load-use latency hides the latency of setting up the mask unless
ALU resource conflicts delay it.)