25671 – test_bit() compilation does not expand to "bt" instruction

Bug 25671 - test_bit() compilation does not expand to "bt" instruction

Summary: test_bit() compilation does not expand to "bt" instruction

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.0.2

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	code-size, missed-optimization

Depends on:
Blocks:	16996
	Show dependency tree / graph

Reported:	2006-01-04 15:24 UTC by Avi Kivity
Modified:	2024-10-24 23:36 UTC (History)
CC List:	3 users (show)

See Also:
Host:
Target:	x86_64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-03-06 00:00:00

Attachments
benchmark for various set_bit() implementions (500 bytes, text/plain) 2006-04-11 15:38 UTC, Avi Kivity	Details
bt instruction benchmark (515 bytes, text/plain) 2006-04-11 15:53 UTC, Avi Kivity	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Avi Kivity 2006-01-04 15:24:53 UTC

the code

int test_bit(unsigned long *words, int bit)
{
    int wsize = (sizeof *words) * 8;
    return (words[bit / wsize] & (1 << (bit % wsize))) != 0;
}

can compile to

    xor %rax, %rax
    bt  %rsi, (%rdi)
    setc %al

but instead compiles to a much longer sequence, using many more registers, which is probably slower as well. If gcc recognized this common idiom (like it recognizes bit rotate sequences), smaller and more optimal code would be generated (especially if the result of the test is in an if statement - it could boil down to a bt; jc sequence).

Comment 1 Andrew Pinski 2006-01-04 15:33:26 UTC

Confirmed, not a regression.

Comment 2 Steven Bosscher 2006-04-10 20:18:15 UTC

The resulting code for -march=opteron:

test_bit:
.LFB2:
        leal    63(%rsi), %edx
        testl   %esi, %esi
        movl    %esi, %eax
        cmovns  %esi, %edx
        sarl    $31, %eax
        shrl    $26, %eax
        sarl    $6, %edx
        leal    (%rsi,%rax), %ecx
        movslq  %edx,%rdx
        andl    $63, %ecx
        subl    %eax, %ecx
        movl    $1, %eax
        sall    %cl, %eax
        cltq
        testq   %rax, (%rdi,%rdx,8)
        setne   %al
        movzbl  %al, %eax
        ret

For -march=nocona the code is even uglier.

Comment 3 Steven Bosscher 2006-04-10 20:31:49 UTC

This is what the i386 machine description has to say about BT and friends:

;; %%% bts, btr, btc, bt.
;; In general these instructions are *slow* when applied to memory,
;; since they enforce atomic operation.  When applied to registers,
;; it depends on the cpu implementation.  They're never faster than
;; the corresponding and/ior/xor operations, so with 32-bit there's
;; no point.  But in 64-bit, we can't hold the relevant immediates
;; within the instruction itself, so operating on bits in the high
;; 32-bits of a register becomes easier.
;;
;; These are slow on Nocona, but fast on Athlon64.  We do require the use
;; of btrq and btcq for corner cases of post-reload expansion of absdf and
;; negdf respectively, so they can never be disabled entirely.

Comment 4 Avi Kivity 2006-04-11 15:36:58 UTC

Benchmark results, 32 bit code, various methods

On an athlon 64:

   bts reg, (reg):  1 cycle
   bts reg, (mem):  3 cycles
   C code (reg):    1 cycle
   C code (mem):    5 cycles

On a Xeon:

   bts reg, (reg):  6 cycles
   bts reg, (mem): 15 cycles
   C code (reg):    1 cycle
   C code (mem):    5 cycles

Looks like a very small win on athlon 64 when modifying memory.

Comment 5 Avi Kivity 2006-04-11 15:38:08 UTC

Created attachment 11243 [details]
benchmark for various set_bit() implementions

Comment 6 Avi Kivity 2006-04-11 15:39:34 UTC

oops, the benchmark was for bts. will do again for bt.

Comment 7 Avi Kivity 2006-04-11 15:53:36 UTC

Created attachment 11244 [details]
bt instruction benchmark

redone the test for test_bit(), this time always forcing a memory access:

Athlon 64:

     bt:      3 cycles
     generic: 3 cycles

Xeon:

     bt:     10 cycles
     generic: 4 cycles

so, bt might be usable for -Os, but likely not with the effort.

Comment 8 Steven Bosscher 2006-04-11 23:03:44 UTC

Code size issue

Comment 9 Andrew Pinski 2021-08-19 20:46:58 UTC

Note there is a bug in the original testcase.
It should be:

int test_bit(unsigned long *words, int bit)
{
    int wsize = (sizeof *words) * 8;
    return (words[bit / wsize] & (1ul << (bit % wsize))) != 0;
}

if int is 32bit and long is 64bit, you would have gotten the wrong result.

Comment 10 Andrew Pinski 2021-08-19 20:49:32 UTC

With the fixed testcase we get:

        movq    %rsi, %rax
        movq    %rsi, %rcx
        shrq    $6, %rax
        andl    $63, %ecx
        movq    (%rdi,%rax,8), %rax
        shrq    %cl, %rax
        andl    $1, %eax

ICC can produce the btq but with extra instructions still:
        movq      %rsi, %rax                                    #5.25
        shrq      $6, %rax                                      #5.25
        movq      (%rdi,%rax,8), %rdx                           #5.13
        xorl      %eax, %eax                                    #5.61
        btq       %rsi, %rdx                                    #5.61
        setc      %al