Bug 62011 - False Data Dependency in popcnt instruction
Summary: False Data Dependency in popcnt instruction
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: unknown
Importance: P3 normal
Target Milestone: 4.9.2
Assignee: Not yet assigned to anyone
URL: http://stackoverflow.com/questions/25...
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2014-08-04 14:29 UTC by Andev
Modified: 2017-11-16 19:32 UTC
CC: 8 users

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-08-13 00:00:00


Description Andev 2014-08-04 14:29:52 UTC
On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt src, dest

appears to have a false dependency on the destination register dest: even though the instruction only writes to dest, it waits until dest is ready before executing.

This causes a loss in performance as explained here:
http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
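
To make the effect concrete, here is a minimal benchmark sketch in the spirit of the linked SO question (the buffer size, repetition count, and timing harness are illustrative, not the original test). Built with e.g. g++ -O2 -mpopcnt, each __builtin_popcountll lowers to a single popcnt, and on affected CPUs the unrolled chain stalls on the previous write to the shared destination register:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>
#include <chrono>

int main() {
    // Illustrative data set; the linked SO benchmark uses its own sizes.
    std::vector<uint64_t> buf(1 << 20, 0x1234567890abcdefULL);
    uint64_t count = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 1000; ++rep)
        for (std::size_t i = 0; i < buf.size(); i += 4) {
            // Four independent popcounts; on affected CPUs they still
            // serialize on the false output dependency of popcnt.
            count += __builtin_popcountll(buf[i]);
            count += __builtin_popcountll(buf[i + 1]);
            count += __builtin_popcountll(buf[i + 2]);
            count += __builtin_popcountll(buf[i + 3]);
        }
    auto t1 = std::chrono::steady_clock::now();
    std::printf("count=%llu  %.3f s\n", (unsigned long long)count,
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}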
Comment 1 Richard Biener 2014-08-05 09:07:22 UTC
Please clarify - is this a defect in the CPU?  Can you point to an official
erratum?

In which case we might want to adjust the scheduler description used for
GENERIC tuning (and for the specific broken CPUs).  Not sure if we want to
disable popcnt use completely.
Comment 2 Alexander Monakov 2014-08-05 11:49:01 UTC
I think adjusting the scheduler description won't help much: rather than making the compiler aware of the increased latency, you'd need either the register allocator to avoid using a recently written hard register for popcnt (I'm not aware of such a capability), or, as a stopgap measure, the compiler could issue a dependency-breaking instruction (xor %reg, %reg) just before popcnt.
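
For illustration, the dependency-breaking idiom described above can be sketched with GCC extended inline asm (an illustrative wrapper, not what the compiler emits internally):

#include <cstdint>

static inline uint64_t popcnt_nodep(uint64_t x) {
    uint64_t r;
    // The xor zeroes the destination first; Intel CPUs recognize
    // xor reg,reg as dependency-breaking, so the following popcnt
    // no longer waits on the old value of the register.
    __asm__("xorq %0, %0\n\t"
            "popcntq %1, %0"
            : "=&r"(r)   // early clobber: r must not alias x
            : "r"(x));
    return r;
}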
Comment 3 Andev 2014-08-05 13:40:48 UTC
This seems to be specific to some recent Intel CPUs. I am not sure which other CPUs are affected. There is no official erratum for this behavior AFAIK.

As Alexander suggested, it would be a good idea to have a workaround for this in gcc for these specific CPUs.
Comment 4 finis 2014-08-07 09:18:13 UTC
> Not sure if we want to
> disable popcnt use completely.

No matter how this is fixed, do not disable popcnt! Even with the false dependency it is still the fastest instruction for popcounting: the false dependency makes it slower, but it is still faster than a hand-written version.

The easiest fix IMHO would be to use xor %r, %r on the output register. It seems to work extremely well, as you can see in the answer to the linked SO question.
Comment 5 finis 2014-08-07 09:22:11 UTC
Maybe there are many more instructions with such a false dependency; popcnt may only be the tip of the iceberg. I don't think Intel got only this operation wrong while all other SSE/AVX/... instructions are correct; more likely a whole group of operations is implemented like popcnt. The source code in the linked SO question provides a good testbed for other operations as well: simply replace popcount with another intrinsic and check whether the performance anomaly occurs.
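As a sketch of that testbed idea, the popcount call can simply be swapped for another builtin; under the usual -m flags these lower to single instructions (which flag maps to which instruction is stated here as an assumption of this sketch, not taken from the bug report):

#include <cstdint>

// popcnt (-mpopcnt)
static inline int probe_popcnt(uint64_t x) { return __builtin_popcountll(x); }
// tzcnt (-mbmi); note __builtin_ctzll(0) is undefined without BMI semantics
static inline int probe_tzcnt(uint64_t x)  { return __builtin_ctzll(x); }
// lzcnt (-mlzcnt); same caveat for a zero input
static inline int probe_lzcnt(uint64_t x)  { return __builtin_clzll(x); }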
Comment 6 Yuri Rumyantsev 2014-08-11 14:37:26 UTC
I don't see any issue with the 'false dependency' on HSW. I've got SEP profiling data for it:

for the unsigned variant (with LEA instructions):

0x400b30 52 161 lea    0x1(%rdx),%ecx 
0x400b33 53 0 popcnt (%rbx,%rax,8),%rax 
0x400b39 54 353 lea    0x2(%rdx),%r8d 
0x400b3d 55 0 popcnt (%rbx,%rcx,8),%rcx 
0x400b43 56 170 add    %rax,%rcx 
0x400b46 57 25 lea    0x3(%rdx),%esi 
0x400b49 58 332 popcnt (%rbx,%r8,8),%rax 
0x400b4f 59 196 add    %rax,%rcx 
0x400b52 60 199 popcnt (%rbx,%rsi,8),%rax 
0x400b58 61 235 add    %rax,%rcx 
0x400b5b 62 414 lea    0x4(%rdx),%eax 
0x400b5e 63 0 add    %rcx,%r14 
0x400b61 64 312 mov    %rax,%rdx 
0x400b64 65 0 cmp    %rax,%r12 
0x400b67 66 0 ja     400b30 <main+0xb0> 

and we don't see any performance anomaly with popcnt.

But for the 2nd loop we have:

0x400c50 118 0 popcnt -0x8(%rdx),%rax 
0x400c56 119 0 popcnt (%rdx),%rcx 
0x400c5b 120 1086 add    %rax,%rcx 
0x400c5e 121 492 popcnt 0x8(%rdx),%rax 
0x400c64 122 3 add    %rcx,%rax 
0x400c67 123 507 add    $0x20,%rdx 
0x400c6b 124 0 popcnt -0x10(%rdx),%rcx 
0x400c71 125 955 add    %rax,%rcx 
0x400c74 126 479 add    %rcx,%r13 
0x400c77 127 489 cmp    %rsi,%rdx 
0x400c7a 128 0 jne    400c50 <main+0x1d0>

So far I can't imagine what the problem is.
Comment 7 Yuri Rumyantsev 2014-08-12 08:38:50 UTC
Please ignore my previous comment - if we insert a zeroing of the destination register before each popcnt (and lzcnt), performance is restored:

original test results:

unsigned        83886630000     0.848533 sec    24.715 GB/s
uint64_t        83886630000     1.37436 sec     15.2592 GB/s

fixed popcnt:

unsigned        90440370000     0.853753 sec    24.5639 GB/s
uint64_t        83886630000     0.694458 sec    30.1984 GB/s

Here is assembly for 2nd loop:

.L16:
	xorq    %rax, %rax	
	popcntq	-8(%rdx), %rax
	xorq    %rcx, %rcx	
	popcntq	(%rdx), %rcx
	addq	%rax, %rcx
	xorq    %rax, %rax	
	popcntq	8(%rdx), %rax
	addq	%rcx, %rax
	addq	$32, %rdx
	xorq    %rcx, %rcx	
	popcntq	-16(%rdx), %rcx
	addq	%rax, %rcx
	addq	%rcx, %r13
	cmpq	%rsi, %rdx
	jne	.L16
Comment 8 finis 2014-08-12 10:48:58 UTC
@Yuri: Note, however, that the result of your fixed u32 version seems to be wrong.
Comment 9 Yuri Rumyantsev 2014-08-12 11:25:55 UTC
That is not the u32 version but the u64 one. The first loop (the u32 version) looks like:

.L23:
	leal	1(%rdx), %ecx
	xorq    %rax, %rax	
	popcntq	(%rbx,%rax,8), %rax
	leal	2(%rdx), %r8d
	xorq    %rcx, %rcx	
	popcntq	(%rbx,%rcx,8), %rcx
	addq	%rax, %rcx
	leal	3(%rdx), %esi
	xorq    %rax, %rax	
	popcntq	(%rbx,%r8,8), %rax
	addq	%rax, %rcx
	xorq    %rax, %rax	
	popcntq	(%rbx,%rsi,8), %rax
	addq	%rax, %rcx
	leal	4(%rdx), %eax
	addq	%rcx, %r14
	movq	%rax, %rdx
	cmpq	%rax, %r12
	ja	.L23
Comment 10 Richard Biener 2014-08-13 08:23:26 UTC
Confirmed at least.
Comment 11 uros 2014-08-18 18:01:23 UTC
Author: uros
Date: Mon Aug 18 18:00:52 2014
New Revision: 214112

URL: https://gcc.gnu.org/viewcvs?rev=214112&root=gcc&view=rev
Log:
	PR target/62011
	* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
	New tune flag.
	* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
	* config/i386/i386.md (unspec) <UNSPEC_INSN_FALSE_DEP>: New unspec.
	(ffs<mode>2): Do not expand with tzcnt for
	TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ffssi2_no_cmove): Ditto.
	(*tzcnt<mode>_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ctz<mode>2): New expander.
	(*ctz<mode>2_falsedep_1): New insn_and_split pattern.
	(*ctz<mode>2_falsedep): New insn.
	(*ctz<mode>2): Rename from ctz<mode>2.
	(clz<mode>2_lzcnt): New expander.
	(*clz<mode>2_lzcnt_falsedep_1): New insn_and_split pattern.
	(*clz<mode>2_lzcnt_falsedep): New insn.
	(*clz<mode>2_lzcnt): Rename from clz<mode>2_lzcnt.
	(popcount<mode>2): New expander.
	(*popcount<mode>2_falsedep_1): New insn_and_split pattern.
	(*popcount<mode>2_falsedep): New insn.
	(*popcount<mode>2): Rename from popcount<mode>2.
	(*popcount<mode>2_cmp): Remove.
	(*popcountsi2_cmp_zext): Ditto.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/x86-tune.def
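
One way to observe the new expanders in action (the file name below is hypothetical; the flags are ordinary GCC options): compile a one-liner with a tuning that enables X86_TUNE_AVOID_FALSE_DEP_FOR_BMI and check that an xor precedes the popcnt:

// falsedep.cc -- compile with: g++ -O2 -march=haswell -S falsedep.cc -o -
#include <cstdint>

uint64_t f(uint64_t x) { return __builtin_popcountll(x); }
// Expected output, roughly: xorl %eax, %eax ; popcntq %rdi, %rax ; ret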
Comment 12 uros 2014-08-21 18:04:21 UTC
Author: uros
Date: Thu Aug 21 18:03:49 2014
New Revision: 214279

URL: https://gcc.gnu.org/viewcvs?rev=214279&root=gcc&view=rev
Log:
	Backport from mainline
	2014-08-19  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386.md (*ctz<mode>2_falsedep_1): Don't clear
	destination if it is used in source.
	(*clz<mode>2_lzcnt_falsedep_1): Likewise.
	(*popcount<mode>2_falsedep_1): Likewise.

	Backport from mainline
	2014-08-18  Uros Bizjak  <ubizjak@gmail.com>

	PR target/62011
	* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
	New tune flag.
	* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_BMI): New define.
	* config/i386/i386.md (unspec) <UNSPEC_INSN_FALSE_DEP>: New unspec.
	(ffs<mode>2): Do not expand with tzcnt for
	TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ffssi2_no_cmove): Ditto.
	(*tzcnt<mode>_1): Disable for TARGET_AVOID_FALSE_DEP_FOR_BMI.
	(ctz<mode>2): New expander.
	(*ctz<mode>2_falsedep_1): New insn_and_split pattern.
	(*ctz<mode>2_falsedep): New insn.
	(*ctz<mode>2): Rename from ctz<mode>2.
	(clz<mode>2_lzcnt): New expander.
	(*clz<mode>2_lzcnt_falsedep_1): New insn_and_split pattern.
	(*clz<mode>2_lzcnt_falsedep): New insn.
	(*clz<mode>2_lzcnt): Rename from clz<mode>2_lzcnt.
	(popcount<mode>2): New expander.
	(*popcount<mode>2_falsedep_1): New insn_and_split pattern.
	(*popcount<mode>2_falsedep): New insn.
	(*popcount<mode>2): Rename from popcount<mode>2.
	(*popcount<mode>2_cmp): Remove.
	(*popcountsi2_cmp_zext): Ditto.


Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/config/i386/i386.h
    branches/gcc-4_9-branch/gcc/config/i386/i386.md
    branches/gcc-4_9-branch/gcc/config/i386/x86-tune.def
Comment 13 Uroš Bizjak 2014-08-21 18:06:31 UTC
Fixed for 4.9.2+.
Comment 14 Uroš Bizjak 2014-08-21 18:06:49 UTC
.
Comment 15 Travis Downs 2017-11-11 17:40:14 UTC
For what it's worth, and because Richard asked for it above, there is an Intel erratum for this, at least as of Haswell: HSD146, "POPCNT Instruction May Take Longer to Execute Than Expected".

It mentions only popcnt, and I found it for Haswell (HSD146), Skylake (SKL029) and Broadwell. The text is:

POPCNT Instruction May Take Longer to Execute Than Expected

Problem:
POPCNT instruction execution with a 32 or 64 bit operand may be delayed until 
previous non-dependent instructions have executed.

Implication:
Software using the POPCNT instruction may experience lower performance than expected. 

Workaround:
None identified
Comment 16 Travis Downs 2017-11-11 19:42:37 UTC
Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.
Comment 17 Andrew Senkevich 2017-11-16 19:32:08 UTC
(In reply to Travis Downs from comment #16)
> Also, this is fixed for Skylake for tzcnt and lzcnt but not popcnt.

How can this be confirmed? As far as I can see, it is fixed for popcnt. Could you show a reproducer?