the code int test_bit(unsigned long *words, int bit) { int wsize = (sizeof *words) * 8; return (words[bit / wsize] & (1 << (bit % wsize))) != 0; } can compile to xor %rax, %rax bt %rsi, (%rdi) setc %al but instead compiles to a much longer sequence, using many more registers, which is probably slower as well. If gcc recognized this common idiom (like it recognizes bit rotate sequences), smaller and more optimal code would be generated (especially if the result of the test is in an if statement - it could boil down to a bt; jc sequence).
Confirmed, not a regression.
The resulting code for -march=opteron: test_bit: .LFB2: leal 63(%rsi), %edx testl %esi, %esi movl %esi, %eax cmovns %esi, %edx sarl $31, %eax shrl $26, %eax sarl $6, %edx leal (%rsi,%rax), %ecx movslq %edx,%rdx andl $63, %ecx subl %eax, %ecx movl $1, %eax sall %cl, %eax cltq testq %rax, (%rdi,%rdx,8) setne %al movzbl %al, %eax ret For -march=nocona the code is even uglier.
This is what the i386 machine description has to say about BT and friends: ;; %%% bts, btr, btc, bt. ;; In general these instructions are *slow* when applied to memory, ;; since they enforce atomic operation. When applied to registers, ;; it depends on the cpu implementation. They're never faster than ;; the corresponding and/ior/xor operations, so with 32-bit there's ;; no point. But in 64-bit, we can't hold the relevant immediates ;; within the instruction itself, so operating on bits in the high ;; 32-bits of a register becomes easier. ;; ;; These are slow on Nocona, but fast on Athlon64. We do require the use ;; of btrq and btcq for corner cases of post-reload expansion of absdf and ;; negdf respectively, so they can never be disabled entirely.
Benchmark results, 32 bit code, various methods On an athlon 64: bts reg, (reg): 1 cycle bts reg, (mem): 3 cycles C code (reg): 1 cycle C code (mem): 5 cycles On a Xeon: bts reg, (reg): 6 cycles bts reg, (mem): 15 cycles C code (reg): 1 cycle C code (mem): 5 cycles Looks like a very small win on athlon 64 when modifying memory.
Created attachment 11243 [details] benchmark for various set_bit() implementions
oops, the benchmark was for bts. will do again for bt.
Created attachment 11244 [details] bt instruction benchmark redone the test for test_bit(), this time always forcing a memory access: Athlon 64: bt: 3 cycles generic: 3 cycles Xeon: bt: 10 cycles generic: 4 cycles so, bt might be usable for -Os, but likely not with the effort.
Code size issue
Note there is a bug in the original testcase. It should be: int test_bit(unsigned long *words, int bit) { int wsize = (sizeof *words) * 8; return (words[bit / wsize] & (1ul << (bit % wsize))) != 0; } if int is 32bit and long is 64bit, you would have gotten the wrong result.
With the fixed testcase we get: movq %rsi, %rax movq %rsi, %rcx shrq $6, %rax andl $63, %ecx movq (%rdi,%rax,8), %rax shrq %cl, %rax andl $1, %eax ICC can produce the btq but with extra instructions still: movq %rsi, %rax #5.25 shrq $6, %rax #5.25 movq (%rdi,%rax,8), %rdx #5.13 xorl %eax, %eax #5.61 btq %rsi, %rdx #5.61 setc %al