Hello gcc-team,

We are thrilled that C++20 offers efficient standardized bit operations, and that we could replace some of our own implementations with them, making the code more accessible. After switching over, I noticed that `std::has_single_bit` was around 30% slower than what we had before. (The other functions matched our timings.)

Additionally, we have a (micro-)benchmark that compares the standard arithmetic bit trick (https://graphics.stanford.edu/~seander/bithacks.html#DetermineIfPowerOf2) with an implementation that checks popcount == 1. We decided to use the arithmetic version because we measured it to be faster than popcount on our machines (mostly Intel processors). Interestingly, the popcount benchmark matches std::has_single_bit time-wise, so I guess that std::has_single_bit is implemented via popcount. The timings can be reproduced here: https://quick-bench.com/q/Y28keu_mSh25WwhO05T4SKrbHpk

I don't know how to fix this, but I would expect the optimizer to recognize popcount(x) == 1 and know that there is a more efficient version. Alternatively, the implementation could be changed to the arithmetic one, where again the optimizer could decide to replace it with a popcount if that is more efficient on some architecture. Thank you!
Could you post the benchmark and the exact architecture where the arithmetic version is faster?
I think using popcount () in the standard library is reasonable, and instead GCC should eventually optimize popcount () == 1 during RTL expansion. An alternative would be to add an explicit __builtin_is_pow2 (). Note that matching popcount (x) == 1 || x == 0 to the simple arithmetic trick would always be good IMHO.
for testcase:
---
#include <stdbool.h>

bool is_power2_popcnt (int a)
{
  return __builtin_popcount (a) == 1;
}

bool is_power2_arithmetic (int a)
{
  return !(a & (a - 1)) && a;
}
---
gcc -O2 -mavx2 -S got
---
	.file	"test.c"
	.text
	.p2align 4
	.globl	is_power2_popcnt
	.type	is_power2_popcnt, @function
is_power2_popcnt:
.LFB0:
	.cfi_startproc
	popcntl	%edi, %edi
	cmpl	$1, %edi
	sete	%al
	ret
	.cfi_endproc
.LFE0:
	.size	is_power2_popcnt, .-is_power2_popcnt
	.p2align 4
	.globl	is_power2_arithmetic
	.type	is_power2_arithmetic, @function
is_power2_arithmetic:
.LFB1:
	.cfi_startproc
	leal	-1(%rdi), %eax
	testl	%edi, %eax
	sete	%al
	testl	%edi, %edi
	setne	%dl
	andl	%edx, %eax
	ret
	.cfi_endproc
---
Latency of popcnt is 3; the others are 1. Can't tell which version is better from static observation.
Looking at the referenced benchmark, options that would result in a popcnt* instruction aren't used though, and in the assembly it calls __popcountdi2. So, I guess we want at least:
1) for POPCOUNT (x) == 1 use x && (x & (x - 1)) == 0 if optab_handler (popcount_optab, mode) == CODE_FOR_nothing (a fuzzy case are the double-word cases, where we'd end up with popcount (x >> wordbits) + popcount ((word) x) == 1 vs. double-word x && (x & (x - 1)) == 0)
2) the case Richi wrote about: optimize POPCOUNT (x) <= 1 or POPCOUNT (x) == 1 || x == 0 to (x & (x - 1)) == 0 always
Unfortunately we don't TER calls, so the expander doesn't see POPCOUNT (x) == 1 or POPCOUNT (x) <= 1, and the isel pass seems to be (at least so far) extremely vector-specific; another option is to do it in match.pd or in tree-ssa-math-opts.c.
As an aside, libstdc++ does already use the ((x-1) & x) == 0 idiom in <bits/uniform_int_dist.h> where we are happy for zero to be treated as a power of two (because we call _Power_of_2(n+1) and we want the result to be true for n==numeric_limits<decltype(n)>::max()). We could replace _Power_of_2(n+1) with std::__has_single_bit(n+1) || (n+1)==0 but would need to define __has_single_bit for C++11 as well as C++14.
Created attachment 49530 [details] CMakeLists.txt
Created attachment 49531 [details] has_single_bit_benchmark.cpp
Thank you for the many responses.

(In reply to Thomas Koenig from comment #1)
> Could you post the benchmark and the exact architecture where the arithmetic
> version is faster?

```
> cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
stepping        : 9
microcode       : 0xd6
cpu MHz         : 3489.761
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple shadow_vmcs pml ept_mode_based_exec
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds
bogomips        : 5802.42
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
```

I added the sources.
You can execute it by putting everything into one folder and running

```
> cmake -DCMAKE_BUILD_TYPE=Release ./
> make VERBOSE=1
/usr/bin/c++ -I/benchmark/_deps/gbenchmark_fetch_content-src/src/../include -O3 -DNDEBUG -std=gnu++2a -o CMakeFiles/has_single_bit_benchmark.dir/has_single_bit_benchmark.cpp.o -c /benchmark/has_single_bit_benchmark.cpp
```

```
> ./has_single_bit_benchmark --benchmark_repetitions=9 --benchmark_min_time=2
2020-11-10T00:32:52+01:00
Running ./has_single_bit_benchmark
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.57, 0.82, 1.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.45 ns         6.44 ns            9
has_single_bit_arithmetic_median       6.44 ns         6.44 ns            9
has_single_bit_arithmetic_stddev      0.113 ns        0.111 ns            9
has_single_bit_popcount_mean           8.84 ns         8.82 ns            9
has_single_bit_popcount_median         8.84 ns         8.82 ns            9
has_single_bit_popcount_stddev        0.060 ns        0.061 ns            9
has_single_bit_std_mean                9.23 ns         9.21 ns            9
has_single_bit_std_median              9.23 ns         9.20 ns            9
has_single_bit_std_stddev             0.046 ns        0.048 ns            9
```

I thought about it and tried the same again with `-march=native`, and noticed that popcount was now (slightly) more efficient. Some more testing showed that `-mpopcnt` is all that is needed to make this test fly.
```
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
has_single_bit_arithmetic_mean         6.81 ns         6.81 ns            9
has_single_bit_arithmetic_median       6.81 ns         6.80 ns            9
has_single_bit_arithmetic_stddev      0.201 ns        0.200 ns            9
has_single_bit_popcount_mean           6.47 ns         6.47 ns            9
has_single_bit_popcount_median         6.46 ns         6.46 ns            9
has_single_bit_popcount_stddev        0.043 ns        0.042 ns            9
has_single_bit_std_mean                6.50 ns         6.50 ns            9
has_single_bit_std_median              6.51 ns         6.50 ns            9
has_single_bit_std_stddev             0.031 ns        0.031 ns            9
```

So the use case would be generic builds, like Debian binaries.
And maybe a related question: I know that an arithmetic implementation might auto-vectorize, but would a popcount implementation do that too? Since AVX512_BITALG (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=popcnt&expand=4350) retro-introduced popcount for smaller element widths, what about instruction sets like AVX2 and lower?
(In reply to gcc-bugs from comment #10)
> And maybe a related question:
>
> I know that an arithmetic implementation might auto-vectorize, but would a
> popcount implementation do that too?
>
> Since AVX512_BITALG
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/
> #text=popcnt&expand=4350) retro-introduced popcount for smaller vector
> lengths, what is with instruction sets like AVX2 and lower?
I opened PR 97770 for it.
r11-4843 should have removed the small difference between the std and popcount benchmarks. I still see a small advantage for the arithmetic version on Skylake (i7-6700 CPU @ 3.40GHz) and on an i7-8650U CPU @ 1.90GHz, though.
For aarch64, we get:

is_power2_popcnt(int):
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        cmp     w0, 1
        cset    w0, eq
        ret
is_power2_arithmetic(int):
        sub     w1, w0, #1
        tst     w1, w0
        ccmp    w0, 0, 4, eq
        cset    w0, ne
        ret

Obviously is_power2_arithmetic is better.

Note clang uses popcnt for both (on both x86 and arm64) if popcnt can be done inline, and uses test/set/and if popcnt does not exist.
Agreed with the idea of having expansion do the popcount(x) == 1 peephole replacement in the compiler, not forcing the header to figure out whether we have efficient popcount or not.

If we have BMI1, its BLSR (reset lowest set bit) sets CF=1 if the input was zero, and ZF=1 if the output is zero. Unfortunately none of the standard jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first. (If Intel had designed it to produce ZF and CF inverted, it would be non-intuitive for all other uses but would have allowed blsr / jcc to implement if(has_single_bit(x)).)

CF=1, ZF=0   impossible: input was zero, output was non-zero
CF=1, ZF=1   input was zero
CF=0, ZF=0   input had multiple bits set
CF=0, ZF=1   input had a single bit set

If we're going to branch on it anyway after inlining, a branchy strategy is probably good:

singlebit_bmi2_branchy:
    xor   %eax, %eax
    blsr  %edi, %edi      # ZF=1 means we cleared the last bit, or the input was zero
    jc    .Linput_zero    # input was zero, return 0 regardless of ZF
    setz  %al
.Linput_zero:
    ret

And when we want a boolean in a register, a combination of setz and cmovc can materialize one. With clever choice of registers, we can even avoid giving setcc a false dependency on a register that isn't already part of its dep chain.

singlebit_bmi2_cmov:
    blsr  %edi, %eax
    setz  %al             # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
    cmovc %edi, %eax      # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
    ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit, instead of here where only the low 8 bits are actually 0 / 1. (Which is fine for returning a bool in all the mainstream calling conventions.) (This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF, allowing JA:

    blsr  %edi, %eax
    setnz %al             # AL = !ZF
    dec   %al             # 1-1 -> ZF=1, 0-1 -> ZF=0.  ZF=!ZF without affecting CF
    # seta %al            # set on CF=0 and ZF=0
    ja    was_single_bit  # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it doesn't cost any extra partial-FLAGS merging uops, because JA simply has both parts of FLAGS as separate inputs. (This is why CMOVA / CMOVBE are still 2 uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts of FLAGS, while others need either CF or SPAZO, not both. Interestingly, Zen1/2 have that effect but not Zen3.) I don't know how AMD handles dec / ja partial-flags shenanigans. Intel Haswell would, I think, have a flags-merging uop; older Intel doesn't support BMI1, so P6-family is irrelevant. https://stackoverflow.com/a/49868149/224132

I haven't benchmarked these because they have different use-cases (materializing a boolean vs. creating a FLAGS condition to branch on while being branchless itself), so any single benchmark would make one of them look good. If your data almost never (or always) has an all-zero input, the JC in the first version will predict well. After inlining, if the caller branches on the bool result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful. Unlike branching separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common while single-bit inputs are rare. But blsr/setz/cmovc + test/jnz is only 4 uops, same as this on Skylake (test can macro-fuse with jnz). The uops are all dependent on each other, so it also has the same latency (to detect a branch miss) as popcnt / macro-fused cmp/je, which is 2 uops. The only thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and ah, (1<<6) | (1<<0) / cmp ah, 1<<6 to directly check that ZF=1 and CF=0. I doubt that's useful. Or hmm, can we branch directly on PF after AND with that 2-bit mask?
CF=1 ZF=0 is impossible, so the only other odd-parity case is CF=0 ZF=1. AMD and Intel can macro-fuse test/jp.

    blsr  %edi, %eax
    lahf
    test  $(1<<6) | (1<<0), %ah   # check ZF and CF
    jpo   was_single_bit          # ZF != CF means CF=0, ZF=1 because the other way is impossible

Also possible, of course, is the straightforward 2x setcc and AND to materialize a boolean in the bottom byte of EAX. Good ILP, only 3 cycle latency from input to result on Intel, but that's the same as setz/cmovc, which is fewer uops and can avoid false dependencies without destroying the input.

    blsr  %edi, %eax
    setnc %al
    setz  %dil
    and   %edi, %eax   # or test %dil, %al

----

If BLSR is available, POPCNT is also definitely available, so these need to be compared against POPCNT / cmp / je (or sete). With that in mind, BLSR / 2x JCC is actually worse for the front-end: 3 uops, because it can't macro-fuse. It has slightly lower latency before a branch miss could be discovered, but it has 2 separate branches instead of just 1. Potentially we could take one of the JCCs off the fast-path with BLSR/JCC/JCC, at the expense of branch-prediction throughput when jumping to another branch on some CPUs.

Low-power cores like Silvermont-family don't do macro-fusion, and Alder Lake suddenly made Gracemont significantly more relevant for tune=generic than Intel low-power chips were before. They have single-uop 3c latency popcnt, and 1 uop 3c latency for BMI1 instructions like blsr.

---

Also note that AMD Zen has 1-cycle latency popcnt as a single uop (down from 4c in Bulldozer-family), but BLSR is a 2-uop instruction on AMD, even including Zen3. (Perhaps due to setting CF from the input instead of the output? Or they just decode it to sub/and uops.) So with a -mtune that indicates AMD, we should favour popcnt/cmp/setz to materialize a boolean. (And write the setz target with popcnt to break *its* false dependency, because unlike Intel before Ice Lake, popcnt itself doesn't have a false output dependency on AMD.)
But with Intel tuning and BMI1 available, we should consider blsr/setz/cmovc if not branching on the result. That might be too niche to be worth adding/maintaining code to look for it; probably most use-cases of single-bit checks *do* branch on it instead of assigning a boolean result somewhere.
> popcount (x) == 1 || x == 0
That could be optimized to just `popcount (x) <= 1`. I am going to look to see what is left in GCC 15.