Bug 47769 - [missed optimization] use of btr (bit test and reset)
Summary: [missed optimization] use of btr (bit test and reset)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.5.0
Importance: P3 minor
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2011-02-16 18:43 UTC by Matthias Kretz
Modified: 2018-04-24 14:48 UTC
CC: 5 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-02-16 19:46:40


Attachments
Test code to see whether btr gets used automatically and to compare speed (351 bytes, text/x-c++src)
2011-02-17 10:00 UTC, Matthias Kretz
TimeStampCounter class for benchmarking (806 bytes, text/x-chdr)
2011-02-17 10:01 UTC, Matthias Kretz

Description Matthias Kretz 2011-02-16 18:43:57 UTC
The code:

tmp &= ~(1 << bit);

gets translated to actual shift, not, and and instructions. Instead, GCC could emit a single btr instruction (which modifies the flags; unwanted, but acceptable):

btr %[bit], %[tmp]

The btr instruction has a latency of 1 cycle and throughput of 0.5 cycles on all recent Intel CPUs and thus outperforms the shift + not + and combination.

Rationale:
I make use of this pattern for iteration over a bitmask. I use bsf (_bit_scan_forward(bitmask)) to find the lowest set bit. To find the next one I have to mask off the last found bit and currently have to use inline assembly to get a btr instruction there. Alternatively an intrinsic for btr and friends might make sense.
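
To make the pattern concrete, here is a minimal sketch of that iteration (function and callback names are illustrative, not from the attachments; __builtin_ctzll stands in for the bsf intrinsic on 64-bit masks):

#include <stdint.h>

void visit(unsigned bit);                         /* assumed user callback */

void for_each_set_bit(uint64_t mask)
{
    while (mask) {
        unsigned bit = __builtin_ctzll(mask);     /* bsf/tzcnt: index of the lowest set bit */
        visit(bit);
        mask &= ~(1ULL << bit);                   /* the pattern that should become btr */
        /* Current workaround to actually get btr, as mentioned above:
           __asm__("btr %1, %0" : "+r"(mask) : "r"((uint64_t)bit) : "cc");  */
    }
}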
Comment 1 Richard Biener 2011-02-16 19:46:40 UTC
Can you provide a testcase that can be compiled please?

Cut&pasting from i386.md:

;; %%% bts, btr, btc, bt.
;; In general these instructions are *slow* when applied to memory,
;; since they enforce atomic operation.  When applied to registers,
;; it depends on the cpu implementation.  They're never faster than
;; the corresponding and/ior/xor operations, so with 32-bit there's
;; no point.  But in 64-bit, we can't hold the relevant immediates
;; within the instruction itself, so operating on bits in the high
;; 32-bits of a register becomes easier.
;;
;; These are slow on Nocona, but fast on Athlon64.  We do require the use
;; of btrq and btcq for corner cases of post-reload expansion of absdf and
;; negdf respectively, so they can never be disabled entirely.

....

(define_insn "*btrq"
  [(set (zero_extract:DI (match_operand:DI 0 "register_operand" "+r")
                         (const_int 1)
                         (match_operand:DI 1 "const_0_to_63_operand" ""))
        (const_int 0))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_64BIT && (TARGET_USE_BT || reload_completed)"
  "btr{q}\t{%1, %0|%0, %1}"
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set_attr "mode" "DI")])

and

  /* X86_TUNE_USE_BT */
  m_AMD_MULTIPLE | m_ATOM | m_CORE2I7 | m_GENERIC,

so it appears it should already be used by default.
Comment 2 Matthias Kretz 2011-02-17 10:00:55 UTC
Created attachment 23375 [details]
Test code to see whether btr gets used automatically and to compare speed

compile with
g++ -O3 -march=core2 -msse4 -Wall -funroll-loops

-funroll-loops is not required to see the speedup, but it shows that higher instruction-level parallelism can be achieved with btr.
Comment 3 Matthias Kretz 2011-02-17 10:01:34 UTC
Created attachment 23376 [details]
TimeStampCounter class for benchmarking
Comment 4 Matthias Kretz 2012-03-29 09:55:46 UTC
ping.
Are you still waiting for more input from me?
Comment 5 Matthias Kretz 2013-05-03 11:45:49 UTC
Another ping.
The bug status is still WAITING...
Comment 6 Peter Cordes 2017-09-19 15:56:45 UTC
This seems to be partially fixed in gcc8.0:

#include <stdint.h>
uint64_t btr_variable(uint64_t x, unsigned bit) {
        //bit = 53;  // produces btr in older gcc, too.
    return x & ~(1ULL << bit);
}

gcc 8.0:
        movq    %rdi, %rax
        btrq    %rsi, %rax
        ret

vs. gcc7.2 -O3 -mtune=haswell
        movl    %esi, %ecx
        movq    $-2, %rdx
        rolq    %cl, %rdx
        movq    %rdx, %rax      # this is dumb, should have put the mask in rax in the first place
        andq    %rdi, %rax
        ret

Or with bit=53:
        movabsq $-9007199254740993, %rax
        andq    %rdi, %rax
        ret

btr $53, %rax  only has 2 per clock throughput instead of 4 per clock for AND, but a 10-byte mov instruction to set up the constant is almost never going to be worth it for -mtune=haswell.  It takes up extra slots in the uop cache.

---

The inner loop from Matthias's attached program *really* confuses gcc, so badly that it never gets to the btr pattern, apparently.

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 3) {
        tmp &= ~(1UL << bit);
    }
    return tmp;
}

        movq    %rdi, %rax
        xorl    %ecx, %ecx
        movl    $1, %esi
.L5:
        movq    %rsi, %rdx       # start with 1UL every time
        salq    %cl, %rdx
        addq    $3, %rcx
        notq    %rdx             # what happened to rotating -2?
        andq    %rdx, %rax
        cmpq    $66, %rcx
        jne     .L5
        ret


This is obviously horrible, but the right answer isn't btr in a loop; it's what clang does:

        movabsq $7905747460161236406, %rax # imm = 0x6DB6DB6DB6DB6DB6 every third bit unset
        andq    %rdi, %rax
        retq
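
As a sanity check on that constant, a small sketch (not from the attached test) that applies the loop's masking to an all-ones value:

#include <stdint.h>

uint64_t expected_mask(void)
{
    uint64_t m = ~0ULL;                       /* start with all bits set */
    for (unsigned bit = 0; bit < 64; bit += 3)
        m &= ~(1ULL << bit);                  /* clear bits 0, 3, 6, ..., 63 */
    return m;                                 /* 0x6DB6DB6DB6DB6DB6, matching clang */
}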

gcc does spot this with `bit += 7`, I guess because with fewer iterations it decides to try fully unrolling and then can optimize.

With a constant shift count and an inline function call, gcc manages to get really confused auto-vectorizing the loop:

uint64_t btr64(uint64_t x, unsigned bit) {
    bit = 53;
    return x & ~(1ULL << bit);
}

unsigned long cfunc_one(unsigned long tmp) {
    for (unsigned long bit = 0; bit < sizeof(unsigned long) * 8; bit += 7) {
        //tmp &= ~(1UL << bit);
        tmp = btr64(tmp, bit);
    }
    return tmp;
}


        movdqa  .LC0(%rip), %xmm0    # constant with both halves the same.
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        pand    %xmm1, %xmm0
        movq    %xmm0, %rax
         # The above is equivalent to mov .LC0(%rip), %rax
        andq    %rdi, %rax
        ret



(In reply to Richard Biener from comment #1)
> Can you provide a testcase that can be compiled please?
> 
> Cut&pasting from i386.md:
> 
> ;; %%% bts, btr, btc, bt.
> ;; In general these instructions are *slow* when applied to memory,
> ;; since they enforce atomic operation.

This error is fixed in the current version: https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/i386/i386.md

They're slow because of crazy-CISC semantics, and aren't atomic without a lock prefix.  btr %rax, (%rdi) uses %rax as a bit index into memory relative to %rdi, so the actual byte or dword or qword eventually accessed is *not* determined by the addressing mode alone.  It's micro-coded as several uops.
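
Roughly, the qword register-index memory form behaves like this C sketch (an approximation of the architectural semantics, ignoring sign and operand-size corner cases):

#include <stdint.h>

/* Approximate meaning of  btr %rax, (%rdi) :
   the qword that is actually read and written is selected by the
   bit index itself, not just by the addressing mode. */
void btr_mem(uint64_t *base, int64_t bitindex)
{
    base[bitindex >> 6] &= ~(1ULL << (bitindex & 63));
}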

>  When applied to registers,
> ;; it depends on the cpu implementation.  They're never faster than
> ;; the corresponding and/ior/xor operations, so with 32-bit there's
> ;; no point.  But in 64-bit, we can't hold the relevant immediates
> ;; within the instruction itself, so operating on bits in the high
> ;; 32-bits of a register becomes easier.

This section is talking about using it with an immediate operand, like
  btr  $53, %rax
because  and $imm64, %rax  doesn't exist, only  and $sign_extended_imm32, %rax.
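
For example (an illustration, not from the .md comment): masks that clear one of bits 0..30 still fit in a sign-extended imm32, but masks for bits 31..63 do not, which is where  btr $imm, %reg  helps:

#include <stdint.h>

uint64_t clear_bit5(uint64_t x)  { return x & ~(1ULL << 5);  }   /* andq $-33, %rax suffices */
uint64_t clear_bit53(uint64_t x) { return x & ~(1ULL << 53); }   /* needs movabs+and, or btr $53 */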

Does `(set_attr "type" "alu1")` mean gcc thinks it only has 1 per clock throughput?  Or that it competes with other "alu1" instructions?

On Intel since Sandybridge, bt/btr/bts/btc reg,reg or imm,reg is 2 per clock.  It's 1 per clock on Bulldozer-family and Jaguar, 2 per clock on Ryzen.

On Silvermont / KNL, they're 1 per clock occupying both integer ports.