Created attachment 36146 [details]
The std::bitset<> version

I have attached two small, semantically equivalent C++14 programs. One uses std::bitset<26> for its operations; the other uses raw unsigned int. The unsigned int version runs 53% slower than the bitset<> version, as compiled with g++-5.1 and run on a 2013-era Haswell i7-4770. While this represents, perhaps, a stunning triumph in the optimization of inline member and lambda functions operating on structs, it may equally represent an intensely embarrassing, even mystifying, failure to optimize the underlying raw integer operations.

For both, build and test were:

$ g++-5 -O3 -march=native -mtune=native -g3 -Wall $PROG.cc
$ time ./a.out | wc -l
2818

Times on a 3.2GHz Haswell are consistently 0.25s for the unsigned int version and 0.16s for the std::bitset<26> version. These programs are archived at <https://github.com/ncm/nytm-spelling-bee/>.

The runtimes of the two versions are identical as built and run on my 2009 Westmere 2.4GHz i5-M520, and about the same as the integer version on Haswell.
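For readers without the attachments, here is a rough sketch of the kind of equivalence involved. This is not the attached code: the aliases, helpers, and the 'a'..'z' encoding are assumptions, but the subset test matches the hot loop quoted later in this report.

    // Sketch only: each word is represented as the set of letters it uses,
    // either as a std::bitset<26> or as bits of an unsigned int.
    #include <bitset>

    using LettersBits = std::bitset<26>;   // the attachment 36146 flavor
    using LettersInt  = unsigned int;      // the attachment 36147 flavor

    inline LettersBits bit_of(char c)  { return LettersBits{}.set(c - 'a'); }
    inline LettersInt  mask_of(char c) { return 1u << (c - 'a'); }

    // "word uses no letter outside seven": the test that dominates the runtime.
    inline bool subset(const LettersBits& word, const LettersBits& seven)
    { return (word & ~seven).none(); }

    inline bool subset(LettersInt word, LettersInt seven)
    { return (word & ~seven) == 0; }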
Created attachment 36147 [details]
The unsigned int version
The 4.9.2 release, "Debian 4.9.2-10", does not exhibit this bug. When built with g++-4.9, the unsigned int version is slightly faster than the std::bitset<> version. The g++-5 release used was "Debian 5.1.1-9 20150602".

The Haswell host is running under a VirtualBox VM, with /proc/cpuinfo reporting stepping 3, microcode 0x19, and flags:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm

The compiler used for the test on the Westmere M520, which appears not to exhibit the bug, was a snapshot "g++ (GCC) 6.0.0 20150504".
Created attachment 36159 [details]
bitset, but using an inlined container adapter, not lambdas, and slow

This version compiles just as badly as the integer version, even with gcc-4.9.
Also fails with 5.2.1 (Debian 5.2.1-15) 5.2.1 20150808.

As noted, the third version of the program, using bitset but not using lambdas, is as slow as the version using unsigned int -- even when built with gcc-4.9. (Recall that the int version and the first bitset version run fast when built with gcc-4.9.)

Confirmed that on Westmere, compiled with -march=native, all versions run at about the same speed with all compiler versions reported, and that this runtime is about the same as the slow Haswell speed despite the very different clock rates.
My preliminary conclusion is that a hardware optimization present in Haswell but not in Westmere fails to find, in the unsigned int test case, an opportunity that it does find in the original bitset version, as compiled by gcc-5.

I have also observed that adding an assertion that the array index is not negative, before the first array access, slows the program a further 100% on Westmere.

Note that the entire data set fits in L3 cache on all tested targets, so memory bandwidth is not a factor. To my inexperienced eye the effects look like branch mispredictions. I do not understand why a 3.4 GHz DDR3 Haswell runs as slowly as a 2.4 GHz DDR2 Westmere when branch prediction (or whatever it is) fails.
It seems worth adding that the same failure occurs without "-march=native".
(In reply to ncm from comment #6)
> It seems worth adding that the same failure occurs without "-march=native".

Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?
(In reply to Uroš Bizjak from comment #7)
> Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?

Also, perf is able to record execution profiles; it will help you find hot spots.
I did experiment with -m[no-]bmi[2] a fair bit. It all made a significant difference in the instructions emitted, but exactly zero difference in runtime. That's actually not surprising at all; those instructions get decomposed into micro-ops that exactly match those from the equivalent instructions, and are cached, and the loops that dominate runtime execute out of the micro-op cache. The only real effect is maybe slightly shorter object code, which could matter in a program dominated by bus traffic with loops too big to cache well. I say "maybe slightly shorter" because instruction-set extension instructions are actually huge, mostly prefixes. I.e. most of the BMI stuff is marketing fluff, added mainly to make the competition waste money matching them instead of improving the product.
I found this, which at first blush seems like it might be relevant. The hardware complained about here is the same Haswell i7-4770.

http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
Aha, Uroš, I see your name in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Please forgive me for "teaching" you about micro-ops.

The code being generated for all versions does use (e.g.) "popcntq %rax, %rax" almost everywhere. Not quite everywhere -- I see one "popcntq %rax, %rdx" -- but certainly in all the performance-sensitive bits.
As regards hot spots, the program has two:

    int score[7] = { 0, };
    for (Letters word : words)
/**/    if (!(word & ~seven))
            for_each_in_seven([&](Letters letter, int place) {
                if (word & letter)
/**/                score[place] += (word == seven) ? 3 : 1;
            });

The first is executed 300M times, the second 3.3M times. Inserting a counter bump before the second eliminates the slowdown:

            if (word & letter) {
                ++count;
/**/            score[place] += (word == seven) ? 3 : 1;
            }

This fact seems consequential. The surrounding for_each_in_seven loop isn't doing popcounts, but is doing "while (v &= ~-v)". I have repeated the tests using -m[no-]bmi[2], with identical results (i.e. no effect).
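For context, a hypothetical reconstruction of for_each_in_seven: this is not copied from the program; the shape is inferred from the neg/and and not/and sequence visible in the generated code quoted in the next comment, and the place ordering is an assumption.

    // Hypothetical reconstruction (not the actual source): peel one letter of
    // 'seven' off per iteration, lowest set bit first, while 'place' counts down.
    using Letters = unsigned int;

    template <typename F>
    inline void for_each_in_seven(Letters seven, F f)
    {
        int place = 6;
        for (Letters rest = seven; rest != 0; rest &= ~-rest, --place) {
            Letters letter = rest & -rest;   // isolate the lowest set bit
            f(letter, place);
        }
    }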
This is essentially the entire difference between the versions of puzzlegen-int.cc without, and with, the added "++count;" line referenced above (modulo register assignments and branch labels) that sidesteps the +50% pessimization:

(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions specified. Output with "-mbmi -mbmi2" has different instructions, but they do not noticeably affect run time on the Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
        movl    (%rdi), %esi    # MEM[base: _244, offset: 0], word
        testl   %r11d, %esi     # D.66634, word
        jne     .L138   #,
        xorl    %eax, %eax      # tmp419
        cmpl    %esi, %r12d     # word, seven
        leaq    208(%rsp), %rcx #, tmp574
        sete    %al     #, tmp419
        movl    %r12d, %edx     # seven, seven
        leal    1(%rax,%rax), %r8d      #, D.66619
        .p2align 4,,10
        .p2align 3
 .L140:
        movl    %edx, %eax      # seven, D.66634
        negl    %eax    # D.66634
        andl    %edx, %eax      # seven, D.66622
        testl   %eax, %esi      # D.66622, word
        je      .L139   #,
        addl    %r8d, 24(%rcx)  # D.66619, MEM[base: _207, offset: 24B]
+       addl    $1, %ebx        #, count
 .L139:
        notl    %eax    # D.66622
        subq    $4, %rcx        #, ivtmp.424
        andl    %eax, %edx      # D.66622, seven
        jne     .L140   #,
        addq    $4, %rdi        #, ivtmp.428
        cmpq    %rdi, %r10      # ivtmp.428, D.66637
        jne     .L141   #,

I tried a version of the program with a fixed-length loop (over 'place' in [6..0]) so that branches do not depend on the results of "rest &= ~-rest". The compiler unrolled the loop, but the program ran at the pessimized speed with or without the "++count" line.

I am very curious whether this has been reproduced on others' Haswells, and on Ivybridge and Skylake.
A notable difference between the g++-4.9 and g++-5 output is that, while both hoist the "(word == seven)" comparison out of the innermost loop, gcc-4.9 splits the inner loop into two versions, one that increments scores by 3 and another that increments by 1, whereas g++-5 saves 3 or 1 into a register and uses the same inner loop for both cases. (A sketch of the two shapes appears just below.)

Rewriting the critical loop
- to run with separate inner loops
  - does not slow down the fast g++-4.9-compiled program, but
  - fails to speed up the slow g++-5-compiled program;
- to precompute a 1 or 3 increment, with one inner loop for both cases
  - does slow down the previously fast g++-4.9-compiled program, and
  - does not change the speed of the slow g++-5-compiled program.
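To make the two shapes concrete, a simplified sketch follows. It is illustration only: the function names are invented, and the bit-walking loop is modeled on the code quoted earlier, not copied from either compiler's output.

    using Letters = unsigned int;

    // gcc-4.9 shape: the (word == seven) test is hoisted and the inner loop is
    // duplicated ("unswitched"), so each copy adds a compile-time constant.
    void score_unswitched(Letters word, Letters seven, int score[7])
    {
        int place = 6;
        if (word == seven) {
            for (Letters rest = seven; rest; rest &= ~-rest, --place)
                if (word & rest & -rest) score[place] += 3;
        } else {
            for (Letters rest = seven; rest; rest &= ~-rest, --place)
                if (word & rest & -rest) score[place] += 1;
        }
    }

    // g++-5 shape: one inner loop for both cases, increment carried in a register.
    void score_single(Letters word, Letters seven, int score[7])
    {
        int inc = (word == seven) ? 3 : 1;
        int place = 6;
        for (Letters rest = seven; rest; rest &= ~-rest, --place)
            if (word & rest & -rest) score[place] += inc;
    }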
(In reply to ncm from comment #13)
> I am very curious whether this has been reproduced on others' Haswells,
> and on Ivybridge and Skylake.

I could reproduce this with a Haswell i7-5820K (affected) and also tested on an Ivybridge i7-3770K (not affected).

Haswell:

GCC 4.9.3 native
real    0m0.150s
user    0m0.149s
sys     0m0.001s

GCC 6.0.0 native
real    0m0.213s
user    0m0.212s
sys     0m0.000s

GCC 6.0.0 native + count
real    0m0.166s
user    0m0.165s
sys     0m0.000s

================

GCC 4.9.3 generic
real    0m0.128s
user    0m0.127s
sys     0m0.001s

GCC 6.0.0 generic
real    0m0.213s
user    0m0.213s
sys     0m0.000s

GCC 6.0.0 generic + count
real    0m0.125s
user    0m0.124s
sys     0m0.000s

Ivybridge:

GCC 4.9.2 native
real    0m0.218s
user    0m0.216s
sys     0m0.000s

GCC 6.0.0 native
real    0m0.186s
user    0m0.184s
sys     0m0.000s

GCC 6.0.0 native + count
real    0m0.266s
user    0m0.264s
sys     0m0.000s

================

GCC 4.9.2 generic
real    0m0.224s
user    0m0.220s
sys     0m0.004s

GCC 6.0.0 generic
real    0m0.183s
user    0m0.180s
sys     0m0.000s

GCC 6.0.0 generic + count
real    0m0.181s
user    0m0.176s
sys     0m0.004s

Here "native" means "-march=native -mtune=native" and "generic" means no arch options. "+ count" means adding a counter into the loop and writing its value to a global volatile variable after the loop.
Confirmed. Neither GCC 5 nor trunk are unswitching the loop in main.
GCC 5.3 is being released, adjusting target milestone.
It is far from clear to me that gcc-5's choice to put the increment value in a register, and use just one loop body, is wrong. Rather, it appears that an incidental choice in the placement order of basic blocks or in register assignment interacts badly with a bug in Haswell branch prediction, value dependency tracking, the micro-op cache, or something. An actual fix for this would need to identify and step around Haswell's sensitivity to whichever detail of code generation this program happens upon.
bitset<> version:
real    0m1.073s
user    0m1.052s
sys     0m0.021s

unsigned int:
real    0m0.903s
user    0m0.883s
sys     0m0.019s

bitset with container adapter:
real    0m1.519s
user    0m1.499s
sys     0m0.019s
(In reply to Andrew Pinski from comment #19)
> bitset<> version:
> real    0m1.073s
> user    0m1.052s
> sys     0m0.021s
>
> unsigned int:
> real    0m0.903s
> user    0m0.883s
> sys     0m0.019s
>
> bitset with container adapter:
> real    0m1.519s
> user    0m1.499s
> sys     0m0.019s

I should say this is on a ThunderX T88 pass 2 running at 1.9GHz.
Created attachment 37629 [details]
Test case using __builtin_expect()

I wondered if this might match some other effects I've seen recently, so I tested this on Skylake and Haswell with g++ 5.3.0. My current belief is that everything here is expected behavior, and there is no bug in either the compiler or the processor.

The code spends most of its time in a tight loop that depends on runtime input, and the compiler doesn't have any reason to know which branch is more likely. The addition of "count" changes the heuristic in recent compilers and, by chance, changes it for the worse. More directly, one can get the same effect with __builtin_expect(). Here are the (excerpted) results I see on Skylake:

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null

     210.098864      task-clock (msec)        #    1.001 CPUs utilized
     711,412,710     cycles                   #    3.386 GHz
   1,930,668,439     instructions             #    2.71  insns per cycle
     625,004,194     branches                 # 2974.810 M/sec
       2,920,721     branch-misses            #    0.47% of all branches

     0.209838807 seconds time elapsed

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen -DUSE_EXPECT
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null

     122.037566      task-clock (msec)        #    1.001 CPUs utilized
     413,227,074     cycles                   #    3.386 GHz
   1,930,678,854     instructions             #    4.67  insns per cycle
     625,011,748     branches                 # 5121.470 M/sec
       2,952,030     branch-misses            #    0.47% of all branches

     0.121937547 seconds time elapsed

Both versions have the same number of instructions, branches, and branch-misses. The version with USE_EXPECT is almost twice as fast because it executes almost twice as many instructions per cycle. GCC does a great job of producing a very tight loop that executes in one cycle per iteration:

 77.21 │210:┌─→add    $0x4,%rcx
  0.66 │    │  cmp    %rcx,%r8
       │    │↓ je     260
  4.20 │219:│  mov    (%rcx),%edi
  0.88 │    │  test   %r9d,%edi
       │    └──jne    210

The cmp/je and test/jne instructions fuse into one µop each, so that the processor can execute the resulting four µops at one iteration per cycle. If the loop is rearranged so that the middle branch is taken, each iteration takes a minimum of 2 cycles instead.

My suspicion was that we'd see GCC splitting the instructions in a way that prevented fusion (which I think it often does), but I think it's doing an excellent job here. It's possible there is a different effect in one of the other test cases, but I don't think the case that involves 'count' needs fixing.
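The attachment itself is not inlined above, so here is a minimal sketch of the technique, under the assumption that USE_EXPECT gates a __builtin_expect() on the hot rejection test. The LIKELY macro, the dummy data, and main() are invented for illustration; only the shape of the test follows the program quoted earlier.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    using Letters = unsigned int;

    #ifdef USE_EXPECT
    #  define LIKELY(x) __builtin_expect(!!(x), 1)
    #else
    #  define LIKELY(x) (x)
    #endif

    int main()
    {
        // Dummy word list: roughly 9 of 10 words use a letter outside 'seven'.
        std::vector<Letters> words(3000000);
        for (std::size_t i = 0; i < words.size(); ++i)
            words[i] = (i % 10) ? 0x1ffffu : 0x3u;

        const Letters seven = 0x7fu;     // a seven-letter candidate set
        long kept = 0;

        for (Letters word : words)
            if (LIKELY(word & ~seven))   // hint: rejection is the common case
                continue;
            else
                ++kept;                  // stands in for the scoring inner loop

        std::printf("%ld\n", kept);
    }

Built with and without -DUSE_EXPECT, the only difference is the layout hint on the rejecting branch.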
(In reply to Nathan Kurz from comment #21)
> My current belief is that everything here is expected behavior, and there
> is no bug with either the compiler or processor.
>
> The code spends most of its time in a tight loop that depends on runtime
> input, and the compiler doesn't have any reason to know which branch is
> more likely. The addition of "count" changes the heuristic in recent
> compilers, and by chance, changes it for the worse.

I am going to disagree, carefully. It seems clear, in any case, that Haswell is off the hook.

1. As a correction: *without* the count takes twice as long to run as with, or when using bitset<>.

2. As a heuristic, favoring a branch to skip a biggish loop body evidently has much less downside than favoring the branch into it. Maybe Gcc already has such a heuristic, and the addition of 7 conditional increments in the loop, or whatever overhead bitset<> adds, was enough to push it over?

Westmere runs both instruction sequences (with and without __builtin_expect) the same. Maybe on Westmere the loop takes two cycles regardless of code placement, and Gcc is (still) tuned for the older timings?
> 1. As a correction: *without* the count takes twice as long to run as with,
> or when using bitset<>.

Oops, I did say that backwards. My tests agree with what you say.

> 2. As a heuristic, favoring a branch to skip a biggish loop body evidently
> has much less downside than favoring the branch into it.

A correctly predicted branch-not-taken will be the same speed or one cycle faster than a correctly predicted branch-taken. Modern Intel can sustain 2 fused cmp/jcc µops per cycle, but only if the first is not-taken.

> Maybe Gcc already has such a heuristic, and the addition of 7 conditional
> increments in the loop, or whatever overhead bitset<> adds, was enough
> to push it over?

I presume this is true, but don't know the GCC internals. I did test, though, with clang++ 3.8.0 and icpc 16.01, and found that both made the same layout choices as GCC in the absence of __builtin_expect() and had approximately the same speed on Skylake (about 2 cycles per iteration). With USE_EXPECT, clang and gcc were equally fast (one cycle), and icc was at an intermediate speed (1.2-ish).

My reason for thinking this is not a bug is that the fastest choice will depend on the contents of the word list. Regardless of layout, there will be one option that is slightly faster than the other. I guess it's reasonable to ask, though, whether it's better by default to try to save one cycle on an already very fast empty loop, or to save one cycle on a more expensive loop. But the real gain (if there is one) will be matching the layout to the runtime behavior, for which the compiler requires outside information.

> Westmere runs both instruction sequences (with and without __builtin_expect)
> the same. Maybe on Westmere the loop takes two cycles regardless of code
> placement, and Gcc is (still) tuned for the older timings?

This matches my belief: Haswell and Skylake have the potential of doing the inner loop in a single cycle if the layout matches the runtime pattern, while Nehalem (Westmere's sibling) takes at least 2 cycles regardless. This is due to the capabilities for "Macro-op Fusion" across the generations. See the sections with that heading for Nehalem, Sandy Bridge, Haswell, and Skylake here: http://www.agner.org/optimize/microarchitecture.pdf

Often the limiting factor for very tight loops like this is that only 4 µops can be "retired" per cycle. If both cmp/jcc instructions are fused (cmp here includes a different subset of cmp/test/add/sub for different generations), and if the first branch is not taken, it's possible to do both plus the 'mov' and 'add' in a single cycle on Haswell and Skylake. If, as on previous generations, they are not both fused (see Agner's PDF), or if the first branch is taken, there is a two-cycle minimum for the loop.

My recommendation to any GCC internals experts reading this would be to read Agner's suggestions for promoting fusion, and change the loop optimizers to respect them. While it is not happening in this case, I feel I often see situations where GCC puts 'mov' or other non-flag-changing instructions between the 'cmp' and 'jcc', defeating macro-op fusion and creating slower loops. Intel's icc frequently does a better job at avoiding this, and thus often produces slightly faster loops.
GCC 5.4 is being released, adjusting target milestone.
GCC 5 branch is being closed
Still fails on Skylake (i7-6700HQ) with gcc 8.1.0. The good news is that code from clang++-7.0.0 is slow for all four versions.
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
> My reason for thinking this is not a bug is that the fastest choice will
> depend on the contents of the word list. Regardless of layout, there will
> be one option that is slightly faster than the other. I guess it's
> reasonable to ask, though, whether it's better by default to try to save one
> cycle on an already very fast empty loop, or to save one cycle on a more
> expensive loop. But the real gain (if there is one) will be matching the
> layout to the runtime behavior, for which the compiler requires outside
> information.

Saving one cycle on a two-cycle loop can have a much larger effect than saving one cycle off a fifty-cycle loop. Even if the fifty-cycle loop is the norm, an extra cycle costs only 2%; but if the two-cycle loop is the more common, as in this case, saving the one cycle is a big win.
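For scale, a back-of-envelope check using only numbers already in this report: the rejecting outer test runs about 300M times, and the Skylake perf counts above differ by 711,412,710 - 413,227,074 = ~298M cycles between the two layouts, i.e. almost exactly one extra cycle per outer iteration. At ~3.4 GHz that is roughly 0.09 s, the same size as the gap between the 0.25 s and 0.16 s Haswell timings in the original report.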
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.