On Sandy/Ivy Bridge and Haswell processors, the instruction

  popcnt src, dest

appears to have a false dependency on the destination register dest. Even though the instruction only writes to dest, it waits until dest is ready before executing. This causes a loss in performance, as explained here: http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
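For reference, here is a minimal sketch of the kind of reduction loop where this shows up (illustrative only, not the exact benchmark from the SO question; the function name is mine):

  /* Illustrative sketch: a reduction loop in which the compiler tends to
     reuse a single destination register for every popcnt, so on affected
     CPUs each popcnt falsely depends on the previous write to that
     register.  Build with e.g. -O2 -mpopcnt. */
  #include <stdint.h>
  #include <stddef.h>

  uint64_t count_bits(const uint64_t *buf, size_t n)
  {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; i++)
          sum += __builtin_popcountll(buf[i]);  /* emitted as popcnt */
      return sum;
  }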
Please clarify - is this a defect in the CPU? Can you point to an official erratum? In that case we might want to adjust the scheduler description used for GENERIC tuning (and for the specific broken CPUs). I'm not sure whether we want to disable popcnt use completely.
I think adjusting the scheduling won't help much: rather than making the compiler aware of the increased latency, you would need either the register allocator to avoid using a recently written hard register for popcnt (I'm not aware of such a capability), or, as a stopgap measure, the compiler could issue a dependency-breaking instruction (xor %reg, %reg) just before the popcnt.
This seems to be specific to some recent Intel CPUs; I am not sure which other CPUs are affected. There is no official erratum for this behavior AFAIK. As Alexander suggested, it would be a great idea to have a workaround for this in GCC for the affected CPUs.
> Not sure if we want to disable popcnt use completely.

No matter how this is fixed, do not disable popcnt! Even with the false dependency it is still the fastest instruction for popcounting: the false dependency makes it slower, but it is still faster than a hand-written version. The easiest fix IMHO would be issuing xor %r, %r on the output register. It seems to work extremely well, as you can see in the answer to the linked SO question.
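For illustration, a source-level sketch of that idea (a hypothetical helper, not what GCC actually emits; it just zeroes the destination with inline asm before the popcnt):

  /* Hypothetical helper, for illustration only: break the false output
     dependency by zeroing the destination register before popcnt.
     The early-clobber ("=&r") keeps the source out of that register. */
  #include <stdint.h>

  static inline uint64_t popcnt64_nodep(uint64_t x)
  {
      uint64_t r;
      __asm__ ("xorq %0, %0\n\t"
               "popcntq %1, %0"
               : "=&r" (r)
               : "r" (x));
      return r;
  }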
Maybe there are a lot more instructions with such a false dependency; popcnt may only be the tip of the iceberg. I don't think Intel got only this operation wrong while all other SSE/AVX/... instructions are correct; I rather suspect a whole group of operations is implemented like popcnt. The source code in the linked SO question makes a good testbed for other operations as well: simply replace popcount with another intrinsic and check whether the performance deviations occur.
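A sketch of such a swap (illustrative only; the function name and the choice of lzcnt are mine, not from the SO code): keep the reduction loop and only change the intrinsic, e.g. to __builtin_clzll, which becomes lzcnt with -mlzcnt.

  /* Illustrative swap: same reduction shape as the popcount testbed, but
     with lzcnt (via __builtin_clzll) as the operation under test.  The
     "| 1" keeps the argument nonzero, since __builtin_clzll(0) is
     undefined.  Build with e.g. -O2 -mlzcnt. */
  #include <stdint.h>
  #include <stddef.h>

  uint64_t count_leading_zeros(const uint64_t *buf, size_t n)
  {
      uint64_t sum = 0;
      for (size_t i = 0; i < n; i++)
          sum += (uint64_t) __builtin_clzll(buf[i] | 1);
      return sum;
  }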
I don't see any issues with 'false dependency' on HSW. I've got SEP data on it. For the unsigned variant (with LEA instructions):

0x400b30   52  161  lea     0x1(%rdx),%ecx
0x400b33   53    0  popcnt  (%rbx,%rax,8),%rax
0x400b39   54  353  lea     0x2(%rdx),%r8d
0x400b3d   55    0  popcnt  (%rbx,%rcx,8),%rcx
0x400b43   56  170  add     %rax,%rcx
0x400b46   57   25  lea     0x3(%rdx),%esi
0x400b49   58  332  popcnt  (%rbx,%r8,8),%rax
0x400b4f   59  196  add     %rax,%rcx
0x400b52   60  199  popcnt  (%rbx,%rsi,8),%rax
0x400b58   61  235  add     %rax,%rcx
0x400b5b   62  414  lea     0x4(%rdx),%eax
0x400b5e   63    0  add     %rcx,%r14
0x400b61   64  312  mov     %rax,%rdx
0x400b64   65    0  cmp     %rax,%r12
0x400b67   66    0  ja      400b30 <main+0xb0>

we don't see any performance anomaly with popcnt. But for the 2nd loop we have:

0x400c50  118    0  popcnt  -0x8(%rdx),%rax
0x400c56  119    0  popcnt  (%rdx),%rcx
0x400c5b  120 1086  add     %rax,%rcx
0x400c5e  121  492  popcnt  0x8(%rdx),%rax
0x400c64  122    3  add     %rcx,%rax
0x400c67  123  507  add     $0x20,%rdx
0x400c6b  124    0  popcnt  -0x10(%rdx),%rcx
0x400c71  125  955  add     %rax,%rcx
0x400c74  126  479  add     %rcx,%r13
0x400c77  127  489  cmp     %rsi,%rdx
0x400c7a  128    0  jne     400c50 <main+0x1d0>

So far I can't imagine what the problem is.
Please ignore my previous comment - if we insert a zeroing of the destination register before each popcnt (and lzcnt), performance is restored.

Original test results:

unsigned 83886630000 0.848533 sec 24.715 GB/s
uint64_t 83886630000 1.37436 sec 15.2592 GB/s

Fixed popcnt:

unsigned 90440370000 0.853753 sec 24.5639 GB/s
uint64_t 83886630000 0.694458 sec 30.1984 GB/s

Here is the assembly for the 2nd loop:

.L16:
	xorq	%rax, %rax
	popcntq	-8(%rdx), %rax
	xorq	%rcx, %rcx
	popcntq	(%rdx), %rcx
	addq	%rax, %rcx
	xorq	%rax, %rax
	popcntq	8(%rdx), %rax
	addq	%rcx, %rax
	addq	$32, %rdx
	xorq	%rcx, %rcx
	popcntq	-16(%rdx), %rcx
	addq	%rax, %rcx
	addq	%rcx, %r13
	cmpq	%rsi, %rdx
	jne	.L16
@Yuri: Note, however, that the result of your fixed u32 version seems to be wrong.
That is not the u32 version but the u64 one. The first (u32) loop looks like:

.L23:
	leal	1(%rdx), %ecx
	xorq	%rax, %rax
	popcntq	(%rbx,%rax,8), %rax
	leal	2(%rdx), %r8d
	xorq	%rcx, %rcx
	popcntq	(%rbx,%rcx,8), %rcx
	addq	%rax, %rcx
	leal	3(%rdx), %esi
	xorq	%rax, %rax
	popcntq	(%rbx,%r8,8), %rax
	addq	%rax, %rcx
	xorq	%rax, %rax
	popcntq	(%rbx,%rsi,8), %rax
	addq	%rax, %rcx
	leal	4(%rdx), %eax
	addq	%rcx, %r14
	movq	%rax, %rdx
	cmpq	%rax, %r12
	ja	.L23
Confirmed at least.
Author: uros
Date: Mon Aug 18 18:00:52 2014
New Revision: 214112

URL: https://gcc.gnu.org/viewcvs?rev=214112&root=gcc&view=rev
Log:
	PR target/62011
	* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
	New tune flag.
	* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
	* config/i386/i386.md (unspec) <UNSPEC_INSN_FALSE_DEP>: New unspec.
	(ffs<mode>2): Do not expand with tzcnt for
	TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ffssi2_no_cmove): Ditto.
	(*tzcnt<mode>_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ctz<mode>2): New expander.
	(*ctz<mode>2_falsedep_1): New insn_and_split pattern.
	(*ctz<mode>2_falsedep): New insn.
	(*ctz<mode>2): Rename from ctz<mode>2.
	(clz<mode>2_lzcnt): New expander.
	(*clz<mode>2_lzcnt_falsedep_1): New insn_and_split pattern.
	(*clz<mode>2_lzcnt_falsedep): New insn.
	(*clz<mode>2): Rename from ctz<mode>2.
	(popcount<mode>2): New expander.
	(*popcount<mode>2_falsedep_1): New insn_and_split pattern.
	(*popcount<mode>2_falsedep): New insn.
	(*popcount<mode>2): Rename from ctz<mode>2.
	(*popcount<mode>2_cmp): Remove.
	(*popcountsi2_cmp_zext): Ditto.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.h
	trunk/gcc/config/i386/i386.md
	trunk/gcc/config/i386/x86-tune.def
Author: uros
Date: Thu Aug 21 18:03:49 2014
New Revision: 214279

URL: https://gcc.gnu.org/viewcvs?rev=214279&root=gcc&view=rev
Log:
	Backport from mainline
	2014-08-19  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.md (*ctz<mode>2_falsedep_1): Don't clear
	destination if it is used in source.
	(*clz<mode>2_lzcnt_falsedep_1): Likewise.
	(*popcount<mode>2_falsedep_1): Likewise.

	Backport from mainline
	2014-08-18  Uros Bizjak  <ubizjak@gmail.com>

	PR target/62011
	* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
	New tune flag.
	* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
	* config/i386/i386.md (unspec) <UNSPEC_INSN_FALSE_DEP>: New unspec.
	(ffs<mode>2): Do not expand with tzcnt for
	TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ffssi2_no_cmove): Ditto.
	(*tzcnt<mode>_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ctz<mode>2): New expander.
	(*ctz<mode>2_falsedep_1): New insn_and_split pattern.
	(*ctz<mode>2_falsedep): New insn.
	(*ctz<mode>2): Rename from ctz<mode>2.
	(clz<mode>2_lzcnt): New expander.
	(*clz<mode>2_lzcnt_falsedep_1): New insn_and_split pattern.
	(*clz<mode>2_lzcnt_falsedep): New insn.
	(*clz<mode>2): Rename from ctz<mode>2.
	(popcount<mode>2): New expander.
	(*popcount<mode>2_falsedep_1): New insn_and_split pattern.
	(*popcount<mode>2_falsedep): New insn.
	(*popcount<mode>2): Rename from ctz<mode>2.
	(*popcount<mode>2_cmp): Remove.
	(*popcountsi2_cmp_zext): Ditto.

Modified:
	branches/gcc-4_9-branch/gcc/ChangeLog
	branches/gcc-4_9-branch/gcc/config/i386/i386.h
	branches/gcc-4_9-branch/gcc/config/i386/i386.md
	branches/gcc-4_9-branch/gcc/config/i386/x86-tune.def
Fixed for 4.9.2+.
For what it's worth, and because Richard asked for it above, there are Intel errata for this, at least as of Haswell, for example HSD146: "POPCNT Instruction May Take Longer to Execute Than Expected". It mentions only popcnt, and I found it for Haswell, Skylake (SKL029) and Broadwell. The text is:

POPCNT Instruction May Take Longer to Execute Than Expected
Problem: POPCNT instruction execution with a 32 or 64 bit operand may be delayed until previous non-dependent instructions have executed.
Implication: Software using the POPCNT instruction may experience lower performance than expected.
Workaround: None identified
Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.
(In reply to Travis Downs from comment #16)
> Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.

How can this be confirmed? As far as I can see, it is fixed for popcnt. Could you show a reproducer?
FYI, in the Intel 10th/11th Generation Processor Errata Tables there is no popcnt problem. In the 9th Generation Errata Table, this problem still exists.