Bug 97759 - Could std::has_single_bit be faster?
Summary: Could std::has_single_bit be faster?
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: libstdc++
Version: 10.2.1
Importance: P3 enhancement
Target Milestone: ---
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-11-09 00:01 UTC by gcc-bugs
Modified: 2024-03-04 22:56 UTC
9 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-03-04 00:00:00


Attachments
CMakeLists.txt (358 bytes, text/plain)
2020-11-09 23:28 UTC, gcc-bugs
has_single_bit_benchmark.cpp (327 bytes, text/x-csrc)
2020-11-09 23:28 UTC, gcc-bugs

Description gcc-bugs 2020-11-09 00:01:25 UTC
Hello gcc-team,

we are thrilled that C++20 offers efficient bit-manipulation functions, and that we could replace some of our own implementations with the standardized ones, making the code more accessible.

I replaced our implementation and noticed that `std::has_single_bit` was around 30% slower than what we had before. (The other functions matched our timings.)

Additionally, we have a (micro-)benchmark that compares the standard arithmetic bit trick (https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2) with an implementation that checks popcount == 1. We decided to use the arithmetic version because we measured it to be faster than popcount on our machines (mostly Intel processors).

Interestingly, it seems that the popcount benchmark matches the std::has_single_bit time-wise, so I guess that std::has_single_bit is implemented via popcount.

These timings can be reproduced at https://quick-bench.com/q/Y28keu_mSh25WwhO05T4SKrbHpk

I don't know how to fix this, but I would expect the optimizer to recognize popcount(x) == 1 and know that there is a more efficient version. Alternatively, the implementation could be changed to the arithmetic version, and the optimizer could replace it with a popcount where that is more efficient on some architectures.
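For reference, the two idioms under discussion boil down to the following (a minimal standalone sketch, not the attached benchmark; function names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

/* Arithmetic bit trick: x is a power of two iff clearing its lowest
   set bit yields zero and x itself is non-zero.  */
static bool has_single_bit_arithmetic (uint32_t x)
{
  return x != 0 && (x & (x - 1)) == 0;
}

/* What std::has_single_bit boils down to: population count equal to 1.  */
static bool has_single_bit_popcount (uint32_t x)
{
  return __builtin_popcount (x) == 1;
}
```

Without `-mpopcnt`, the second version calls the `__popcountdi2` libgcc routine, which is where the gap comes from.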

Thank you!
Comment 1 Thomas Koenig 2020-11-09 05:41:03 UTC
Could you post the benchmark and the exact architecture where the arithmetic
version is faster?
Comment 2 Richard Biener 2020-11-09 08:16:39 UTC
I think using popcount () in the standard library is reasonable and instead
GCC should eventually optimize popcount () == 1 during RTL expansion.  An
alternative would be to add an explicit __builtin_is_pow2 ().  Note
matching popcount (x) == 1 || x == 0 for the simple arithmetic trick would
be always good IMHO.
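The equivalence is easy to verify exhaustively for small types: popcount (x) == 1 || x == 0 holds exactly when (x & (x - 1)) == 0. A quick sketch using GCC's builtin:

```c
#include <stdint.h>

/* Exhaustively verify for all 16-bit values that
   popcount (x) == 1 || x == 0  is equivalent to  (x & (x - 1)) == 0.  */
static int popcount_arithmetic_equivalent (void)
{
  for (uint32_t x = 0; x <= UINT16_MAX; ++x)
    {
      int via_popcount = (__builtin_popcount (x) == 1) || (x == 0);
      int via_arithmetic = (x & (x - 1)) == 0;
      if (via_popcount != via_arithmetic)
        return 0;
    }
  return 1;
}
```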
Comment 3 Hongtao.liu 2020-11-09 09:13:32 UTC
for testcase:
---
#include<stdbool.h>

bool
is_power2_popcnt (int a)
{
  return __builtin_popcount (a) == 1;
}

bool
is_power2_arithmetic (int a)
{
  return !(a & (a - 1)) && a;
}
---

gcc -O2 -mavx2 -S got

---
	.file	"test.c"
	.text
	.p2align 4
	.globl	is_power2_popcnt
	.type	is_power2_popcnt, @function
is_power2_popcnt:
.LFB0:
	.cfi_startproc
	popcntl	%edi, %edi
	cmpl	$1, %edi
	sete	%al
	ret
	.cfi_endproc
.LFE0:
	.size	is_power2_popcnt, .-is_power2_popcnt
	.p2align 4
	.globl	is_power2_arithmetic
	.type	is_power2_arithmetic, @function
is_power2_arithmetic:
.LFB1:
	.cfi_startproc
	leal	-1(%rdi), %eax
	testl	%edi, %eax
	sete	%al
	testl	%edi, %edi
	setne	%dl
	andl	%edx, %eax
	ret
	.cfi_endproc
---

The latency of popcnt is 3; the other instructions have a latency of 1.

I can't tell which version is better from static observation alone.
Comment 4 Jakub Jelinek 2020-11-09 10:44:23 UTC
Looking at the referenced benchmark, the options that would result in a popcnt* instruction aren't used, though, and in the assembly it calls __popcountdi2.
So, I guess we want at least:
1) for POPCOUNT (x) == 1 use x && (x & (x - 1)) == 0 if optab_handler (popcount_optab, mode) == CODE_FOR_nothing (a fuzzy case are the double-word cases where we'd end up with popcount (x >> wordbits) + popcount ((word) x) == 1
vs. double-word x && (x & (x - 1)) == 0)
2) the case Richi wrote about, optimize POPCOUNT (x) <= 1 or POPCOUNT (x) == 1 || x == 0 to (x & (x - 1)) == 0 always
Comment 5 Jakub Jelinek 2020-11-09 11:00:28 UTC
Unfortunately we don't TER calls, so the expander doesn't see POPCOUNT (x) == 1 or POPCOUNT (x) <= 1,
and the isel pass seems to be (at least so far) extremely vector-specific;
another option is to do it in match.pd or in tree-ssa-math-opts.c.
Comment 6 Jonathan Wakely 2020-11-09 11:13:20 UTC
As an aside, libstdc++ does already use the ((x-1) & x) == 0 idiom in <bits/uniform_int_dist.h> where we are happy for zero to be treated as a power of two (because we call _Power_of_2(n+1) and we want the result to be true for n==numeric_limits<decltype(n)>::max()). We could replace _Power_of_2(n+1) with std::__has_single_bit(n+1) || (n+1)==0 but would need to define __has_single_bit for C++11 as well as C++14.
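A sketch of that idiom (the name below is illustrative, not the actual libstdc++ internal): it deliberately reports true for zero, which is what matters for the _Power_of_2(n+1) call when n + 1 wraps around to zero:

```c
#include <stdint.h>

/* Illustrative version of the idiom used in <bits/uniform_int_dist.h>:
   true for powers of two AND for zero.  */
static int power_of_2_or_zero (uint32_t x)
{
  return ((x - 1) & x) == 0;
}
```

In the n == numeric_limits<...>::max() case, n + 1 wraps to 0 (unsigned arithmetic), and the helper correctly returns true.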
Comment 7 gcc-bugs 2020-11-09 23:28:18 UTC
Created attachment 49530 [details]
CMakeLists.txt
Comment 8 gcc-bugs 2020-11-09 23:28:38 UTC
Created attachment 49531 [details]
has_single_bit_benchmark.cpp
Comment 9 gcc-bugs 2020-11-09 23:55:19 UTC
Thank you for the many responses!

(In reply to Thomas Koenig from comment #1)
> Could you post the benchmark and the exact architecture where the arithmetic
> version is faster?

```
> cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 158
model name	: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
stepping	: 9
microcode	: 0xd6
cpu MHz		: 3489.761
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 22
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
vmx flags	: vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple shadow_vmcs pml ept_mode_based_exec
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips	: 5802.42
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:
```

I added the sources.

You can execute it by putting it into one folder and execute

```
> cmake -DCMAKE_BUILD_TYPE=Release ./
> make VERBOSE=1

/usr/bin/c++  -I/benchmark/_deps/gbenchmark_fetch_content-src/src/../include -O3 -DNDEBUG -std=gnu++2a -o CMakeFiles/has_single_bit_benchmark.dir/has_single_bit_benchmark.cpp.o -c /benchmark/has_single_bit_benchmark.cpp
```

```
> ./has_single_bit_benchmark --benchmark_repetitions=9 --benchmark_min_time=2

2020-11-10T00:32:52+01:00
Running ./has_single_bit_benchmark
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.57, 0.82, 1.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.45 ns         6.44 ns            9
has_single_bit_arithmetic_median       6.44 ns         6.44 ns            9
has_single_bit_arithmetic_stddev      0.113 ns        0.111 ns            9

has_single_bit_popcount_mean           8.84 ns         8.82 ns            9
has_single_bit_popcount_median         8.84 ns         8.82 ns            9
has_single_bit_popcount_stddev        0.060 ns        0.061 ns            9

has_single_bit_std_mean                9.23 ns         9.21 ns            9
has_single_bit_std_median              9.23 ns         9.20 ns            9
has_single_bit_std_stddev             0.046 ns        0.048 ns            9
```

----

I thought about it and tried the same again with `-march=native`, and noticed that popcount was now (slightly) more efficient. Some more testing showed that `-mpopcnt` is all that is needed to make this test fly.

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.81 ns         6.81 ns            9
has_single_bit_arithmetic_median       6.81 ns         6.80 ns            9
has_single_bit_arithmetic_stddev      0.201 ns        0.200 ns            9

has_single_bit_popcount_mean           6.47 ns         6.47 ns            9
has_single_bit_popcount_median         6.46 ns         6.46 ns            9
has_single_bit_popcount_stddev        0.043 ns        0.042 ns            9

has_single_bit_std_mean                6.50 ns         6.50 ns            9
has_single_bit_std_median              6.51 ns         6.50 ns            9
has_single_bit_std_stddev             0.031 ns        0.031 ns            9


So the use case would be generic builds, like Debian binaries.
Comment 10 gcc-bugs 2020-11-10 00:16:37 UTC
And maybe a related question:

I know that an arithmetic implementation might auto-vectorize, but would a popcount implementation do that too?

Since AVX512_BITALG (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=popcnt&expand=4350) introduced popcount for smaller vector element widths, what about instruction sets like AVX2 and lower?
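As an illustration of why the arithmetic form vectorizes easily (a hypothetical testcase, not from the attachments): it uses only subtract/and/compare, which have vector forms on SSE2/AVX2, whereas a vector popcount on small element widths needs AVX512_BITALG or an emulation.

```c
#include <stddef.h>
#include <stdint.h>

/* The arithmetic form is a candidate for auto-vectorization at any
   vector width, since every operation has a straightforward vector
   equivalent.  */
static void mark_powers_of_two (const uint32_t *in, uint8_t *out, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    out[i] = in[i] != 0 && (in[i] & (in[i] - 1)) == 0;
}
```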
Comment 11 Hongtao.liu 2020-11-10 02:15:36 UTC
(In reply to gcc-bugs from comment #10)
> And maybe a related question:
> 
> I know that an arithmetic implementation might auto-vectorize, but would a
> popcount implementation do that too?
> 
> Since AVX512_BITALG
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/
> #text=popcnt&expand=4350) retro-introduced popcount for smaller vector
> lengths, what is with instruction sets like AVX2 and lower?

I opened bug 97770 for it.
Comment 12 Jonathan Wakely 2020-11-10 08:57:40 UTC
r11-4843 should have removed the small difference between the std and popcount benchmarks.

I still see a small advantage for the arithmetic version on a Skylake i7-6700 CPU @ 3.40GHz and an i7-8650U CPU @ 1.90GHz, though.
Comment 13 Andrew Pinski 2021-08-03 05:51:58 UTC
For aarch64, we get:
is_power2_popcnt(int):
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        cmp     w0, 1
        cset    w0, eq
        ret
is_power2_arithmetic(int):
        sub     w1, w0, #1
        tst     w1, w0
        ccmp    w0, 0, 4, eq
        cset    w0, ne
        ret

Obviously, is_power2_arithmetic is better here.
Note that clang uses popcnt for both (on both x86 and arm64) when popcount can be expanded inline, and falls back to test/set/and when it cannot.
Comment 14 Peter Cordes 2022-03-03 11:08:09 UTC
Agreed with the idea of doing the popcount(x) == 1 peephole replacement in the compiler, rather than forcing the header to figure out whether we have efficient popcount or not.

If we have BMI1, BLSR (Reset Lowest Set Bit) sets CF=1 if the input was zero, and ZF=1 if the output is zero.  Unfortunately none of the standard jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first.
(If Intel had designed it to produce ZF and CF inverted, it would be non-intuitive for all other uses but would have allowed blsr / jcc to implement if(has_single_bit(x)).)

CF=1, ZF=0  impossible: input was zero, output was non-zero
CF=1, ZF=1  input was zero
CF=0, ZF=0  input had multiple bits set
CF=0, ZF=1  input had a single bit set.
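The table can be modeled in plain C to make the flag semantics concrete (a sketch; the real instruction is BMI1 BLSR, e.g. via the _blsr_u32 intrinsic, which of course produces flags rather than struct fields):

```c
#include <stdint.h>

/* Model of BLSR (reset lowest set bit): result = x & (x - 1).
   CF is set from the *input* (CF=1 iff the input was zero);
   ZF is set from the *output* (ZF=1 iff the result is zero).  */
struct blsr_flags { uint32_t value; int cf; int zf; };

static struct blsr_flags blsr_model (uint32_t x)
{
  struct blsr_flags r;
  r.value = x & (x - 1);
  r.cf = (x == 0);
  r.zf = (r.value == 0);
  return r;
}

/* has_single_bit corresponds to the CF=0, ZF=1 row of the table.  */
static int has_single_bit_via_blsr (uint32_t x)
{
  struct blsr_flags r = blsr_model (x);
  return !r.cf && r.zf;
}
```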

If we're going to branch on it anyway after inlining, a branchy strategy is probably good:

singlebit_bmi2_branchy:
   xor    %eax, %eax
   blsr   %edi, %edi    #  ZF=1 means we cleared the last bit, or the input was zero
   jc     .Linput_zero  # input was zero, return 0 regardless of ZF
   setz   %al
 .Linput_zero:
   ret


And when we want a boolean in a register, a combination of setz and cmovc can materialize one.  With a clever choice of registers, we can even avoid giving setcc a false dependency on a register that isn't already part of its dep chain.

singlebit_bmi2_cmov:
   blsr    %edi, %eax
   setz    %al         # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
   cmovc   %edi, %eax  # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
   ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit, instead of here where only the low 8 bits are actually 0 / 1.  (Which is fine for returning a bool in all the mainstream calling conventions)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF, allowing JA:

   blsr   %edi,%eax
   setnz  %al             # AL = !ZF
   dec    %al             # 1-1 -> ZF=1,  0-1 -> ZF=0.  ZF=!ZF without affecting CF
   # seta   %al             # set on CF=0 and ZF=0
   ja     was_single_bit    # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it doesn't cost any extra partial-FLAGS merging uops, because JA simply has both parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2 uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts of FLAGS, while others need either CF or SPAZO not both.  Interestingly, Zen1/2 have that effect but not Zen3)

I don't know how AMD handles dec / ja partial-flags shenanigans.  Intel Haswell would I think have a flags-merging uop; older Intel doesn't support BMI1 so P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing a boolean vs. creating a FLAGS condition to branch on, being branchless itself), so any single benchmark would make one of them look good.  If your data almost never (or always) has an all-zero input, the JC in the first version will predict well.  After inlining, if the caller branches on the bool result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is only 4 uops, same as this on Skylake. (test can macro-fuse with jnz).

The uops are all dependent on each other, so it also has the same latency (to detect a branch miss) as popcnt / macro-fused cmp/je which is 2 uops.  The only thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and  ah, (1<<6) | (1<<0) / cmp  ah, 1<<6    to directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we branch directly on PF after AND with that 2-bit mask?  CF=1 ZF=0 is impossible, so the only other odd-parity case is CF=0 ZF=1.  AMD and Intel can macro-fuse test/jp.

   blsr  %edi, %eax
   lahf
   test  $(1<<6) | (1<<0), %ah        # check ZF and CF.
   jpo   was_single_bit               # ZF != CF means CF=0, ZF=1 because the other way is impossible.

Also possible of course is the straightforward 2x setcc and AND to materialize a boolean in the bottom byte of EAX.  Good ILP, only 3 cycle latency from input to result on Intel, but that's the same as setz/cmovc which is fewer uops and can avoid false dependencies without destroying the input.

  blsr   %edi, %eax
  setnc  %al
  setz   %dil
  and    %edi, %eax     # or test %dil, %al

----

If BLSR is available, POPCNT is also definitely available, so these need to be compared against POPCNT / cmp / je (or sete).  With that in mind, BLSR / 2x JCC is actually worse for the front-end, 3 uops because it can't macro-fuse.  It has slightly lower latency before a branch miss could be discovered, but it has 2 separate branches instead of just 1.

Potentially we could take one of the JCCs off the fast path with BLSR/JCC/JCC, at the expense of branch-prediction throughput when jumping to another branch on some CPUs.

Low-power cores like Silvermont-family don't do macro-fusion, and Alder Lake suddenly made Gracemont significantly more relevant for tune=generic than Intel low-power chips were before.  They have single-uop 3c latency popcnt, and 1 uop 3c latency for BMI1 instructions like blsr.


---

Also note that AMD Zen has 1-cycle latency popcnt as a single uop (down from 4c in Bulldozer-family), but BLSR is a 2-uop instruction on AMD even including Zen3.  (Perhaps due to setting CF from the input instead of output?  Or they just decode it to sub/and uops.)

So with a -mtune that indicates AMD, we should favour popcnt/cmp/setz to materialize a boolean.  (And write the setz target with popcnt to break *its* false dependency, because unlike Intel before IceLake, popcnt itself doesn't have a false output dependency on AMD.)

But with Intel tuning and BMI2 available, we should consider blsr/setz/cmovc if not branching on the result.  That might be too niche to be worth adding/maintaining code to look for it; probably most use-cases of single-bit checks *do* branch on it instead of assigning a boolean result somewhere.
Comment 15 Andrew Pinski 2024-03-04 22:56:45 UTC
>popcount (x) == 1 || x == 0

That could be optimized to just `popcount (x) <= 1`.

I am going to look to see what is left in GCC 15.