Bug 67153 - [6/7/8 Regression] integer optimizations 53% slower than std::bitset<>
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 5.1.1
Importance: P2 normal
Target Milestone: 6.5
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
 
Reported: 2015-08-07 23:54 UTC by ncm
Modified: 2017-10-10 13:27 UTC
CC: 4 users

Target: x86_64-*-*
Known to work: 4.9.2
Known to fail: 5.2.1
Last reconfirmed: 2015-11-18 00:00:00


Attachments
The std::bitset<> version (855 bytes, text/x-csrc), 2015-08-07 23:54 UTC, ncm
The unsigned int version (832 bytes, text/x-csrc), 2015-08-07 23:55 UTC, ncm
bitset, but using an inlined container adapter, not lambdas, and slow (1.07 KB, text/x-csrc), 2015-08-10 02:37 UTC, ncm
Test case using __builtin_expect() (980 bytes, text/x-csrc), 2016-02-08 11:29 UTC, Nathan Kurz

Description ncm 2015-08-07 23:54:19 UTC
Created attachment 36146 [details]
The std::bitset<> version

I have attached two small, semantically equivalent C++14 programs.

One uses std::bitset<26> for its operations; the other uses raw unsigned int.
The one that uses unsigned int runs 53% slower than the bitset<> version, as
compiled with g++-5.1 and running on a 2013-era Haswell i7-4770.  While this
represents, perhaps, a stunning triumph in the optimization of inline member
and lambda functions operating on structs, it may represent an equally
intensely embarrassing, even mystifying, failure for optimization of the
underlying raw integer operations.

For both, build and test was with

  $ g++-5 -O3 -march=native -mtune=native -g3 -Wall $PROG.cc
  $ time ./a.out | wc -l
  2818

Times on a 3.2GHz Haswell are consistently 0.25s for the unsigned int
version, 0.16s for the std::bitset<26> version.

These programs are archived at <https://github.com/ncm/nytm-spelling-bee/>.

The runtimes of the two versions are identical as built and run on my
2009 Westmere 2.4GHz i5-M520, and about the same as the integer version
on Haswell.
Comment 1 ncm 2015-08-07 23:55:32 UTC
Created attachment 36147 [details]
The unsigned int version
Comment 2 ncm 2015-08-08 05:03:10 UTC
The 4.9.2 release, "Debian 4.9.2-10", does not exhibit this bug.  When built 
with g++-4.9, the unsigned int version is slightly faster than the std::bitset<> 
version.  The g++-5 release used was "Debian 5.1.1-9 20150602".

The Haswell host is running under a virtualbox VM, with /proc/cpuinfo
reporting stepping 3, microcode 0x19, and flags: fpu vme de pse tsc msr 
pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 
ht syscall nx rdtscp lm constant_tsc rep_good nopl pni ssse3 lahf_lm

The compiler used for the test on the Westmere M520, which appears not to
exhibit the bug, was a snapshot "g++ (GCC) 6.0.0 20150504".
Comment 3 ncm 2015-08-10 02:37:39 UTC
Created attachment 36159 [details]
bitset, but using an inlined container adapter, not lambdas, and slow

This version compiles just as badly as the integer version, even by gcc-4.9.
Comment 4 ncm 2015-08-10 11:30:55 UTC
Also fails with 5.2.1, "g++ (Debian 5.2.1-15) 5.2.1 20150808".
As noted, the third version of the program, using bitset but not using
lambdas, is as slow as the version using unsigned int -- even when built
using gcc-4.9.  (Recall the int version and the first bitset version
run fast when built with gcc-4.9.)

Confirmed that on Westmere, compiled -march=native, all versions 
run about the same speed with all versions of the compiler reported,
and this runtime is about the same as the slow Haswell speed despite
the very different clock rate.
Comment 5 ncm 2015-08-10 11:47:11 UTC
My preliminary conclusion is that a hardware optimization present in Haswell
but not in Westmere fails to find, in the unsigned int test case, an
opportunity that it does find in the original bitset version, as compiled by gcc-5.

I have also observed that adding an assertion that the array index is not
negative, before the first array access, slows the program a further 100%, 
on Westmere.

Note that the entire data set fits in L3 cache on all tested targets, so
memory bandwidth does not figure.

To my inexperienced eye the effects look like branch mispredictions.
I do not understand why a 3.4 GHz DDR3 Haswell runs as slowly as a 
2.4 GHz DDR2 Westmere, when branch prediction (or whatever it is) 
fails.
Comment 6 ncm 2015-08-11 17:39:36 UTC
It seems worth adding that the same failure occurs without "-march=native".
Comment 7 Uroš Bizjak 2015-08-11 18:17:21 UTC
(In reply to ncm from comment #6)
> It seems worth adding that the same failure occurs without "-march=native".

Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?
Comment 8 Uroš Bizjak 2015-08-11 18:19:43 UTC
(In reply to Uroš Bizjak from comment #7)

> Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?

Also, perf is able to record execution profiles, it will help you find hot spots.
Comment 9 ncm 2015-08-13 00:32:50 UTC
I did experiment with -m[no-]bmi[2] a fair bit.  It all made a significant
difference in the instructions emitted, but exactly zero difference in 
runtime. That's actually not surprising at all; those instructions get 
decomposed into micro-ops that exactly match those from the equivalent
instructions, and are cached, and the loops that dominate runtime execute 
out of the micro-op cache.  The only real effect is maybe slightly shorter
object code, which could matter in a program dominated by bus traffic
with loops too big to cache well.  I say "maybe slightly shorter" because
instruction-set extension instructions are actually huge, mostly prefixes.

I.e. most of the BMI stuff is marketing fluff, added mainly to make the 
competition waste money matching them instead of improving the product.
Comment 10 ncm 2015-08-13 01:49:20 UTC
I found this, which at first blush seems like it might be relevant.
The hardware complained about here is the same Haswell i7-4770.

http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
Comment 11 ncm 2015-08-13 02:24:14 UTC
Aha, Uroš, I see your name in 

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Please forgive me for "teaching" you about micro-ops.

The code being generated for all versions does use (e.g.)
"popcntq %rax, %rax" almost everywhere.  Not quite everywhere -- I see 
one "popcntq %rax, %rdx" -- but certainly in all the performance-sensitive 
bits.
Comment 12 ncm 2015-08-13 14:32:17 UTC
As regards hot spots, the program has two:

        int score[7] = { 0, };
        for (Letters word : words)
/**/        if (!(word & ~seven))
                for_each_in_seven([&](Letters letter, int place) {
                    if (word & letter)
/**/                    score[place] += (word == seven) ? 3 : 1;
                });

The first is executed 300M times, the second 3.3M times.
Inserting a counter bump before the second eliminates the slowdown:

                    if (word & letter) {
                        ++count;
/**/                    score[place] += (word == seven) ? 3 : 1;
                    }

This fact seems consequential.  The surrounding for_each_in_seven
loop isn't doing popcounts, but is doing "while (v &= -v)".

I have repeated tests using -m[no-]bmi[2], with identical results
(i.e. no effect).
Comment 13 ncm 2015-08-16 21:43:19 UTC
This is essentially the entire difference between the versions of
puzzlegen-int.cc without, and with, the added "++count;" line 
referenced above (modulo register assignments and branch labels)
that sidesteps the +50% pessimization:

(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions 
specified.  Output with "-mbmi -mbmi2" has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
        movl    (%rdi), %esi    # MEM[base: _244, offset: 0], word
        testl   %r11d, %esi     # D.66634, word
        jne     .L138           #,
        xorl    %eax, %eax      # tmp419
        cmpl    %esi, %r12d     # word, seven
        leaq    208(%rsp), %rcx #, tmp574
        sete    %al             #, tmp419
        movl    %r12d, %edx     # seven, seven
        leal    1(%rax,%rax), %r8d      #, D.66619
        .p2align 4,,10
        .p2align 3
  .L140:
        movl    %edx, %eax      # seven, D.66634
        negl    %eax    # D.66634
        andl    %edx, %eax      # seven, D.66622
        testl   %eax, %esi      # D.66622, word
        je      .L139   #,
        addl    %r8d, 24(%rcx)  # D.66619, MEM[base: _207, offset: 24B]
+       addl    $1, %ebx        #, count
 .L139:
        notl    %eax            # D.66622
        subq    $4, %rcx        #, ivtmp.424
        andl    %eax, %edx      # D.66622, seven
        jne     .L140           #,
        addq    $4, %rdi        #, ivtmp.428
        cmpq    %rdi, %r10      # ivtmp.428, D.66637
        jne     .L141           #,

I tried a version of the program with a fixed-length loop (over 
'place' in [6..0]) so that branches do not depend on results of
"rest &= ~-rest".  The compiler unrolled the loop, but the program
ran at pessimized speed with or without the "++count" line.

I am very curious whether this has been reproduced on others' Haswells,
and on Ivybridge and Skylake.
Comment 14 ncm 2015-08-16 23:25:04 UTC
A notable difference between g++-4.9 output and g++-5 output is that,
while both hoist the "(word == seven)" comparison out of the innermost
loop, gcc-4.9 splits inner loop into two versions, one that increments 
scores by 3 and another that increments by 1, where g++-5 saves 3 or 1
into a register and uses the same inner loop for both cases.

Rewriting the critical loop
  - to run with separate inner loops
     - does not slow down the fast g++-4.9-compiled program, but
     - fails to speed up the slow g++-5-compiled program.
  - to precompute a 1 or 3 increment, with one inner loop for both cases
     - does slow down the previously fast g++-4.9-compiled program, and
     - does not change the speed of the slow g++-5-compiled program
Comment 15 Mikhail Maltsev 2015-08-17 03:06:12 UTC
(In reply to ncm from comment #13)
> I am very curious whether this has been reproduced on others' Haswells,
> and on Ivybridge and Skylake.

I could reproduce this with Haswell i7-5820K (affected) and also tested on Ivybridge i7-3770K (not affected).

Haswell:

GCC 4.9.3 native

real    0m0.150s
user    0m0.149s
sys     0m0.001s
GCC 6.0.0 native

real    0m0.213s
user    0m0.212s
sys     0m0.000s
GCC 6.0.0 native + count

real    0m0.166s
user    0m0.165s
sys     0m0.000s

================

GCC 4.9.3 generic

real    0m0.128s
user    0m0.127s
sys     0m0.001s
GCC 6.0.0 generic

real    0m0.213s
user    0m0.213s
sys     0m0.000s
GCC 6.0.0 generic + count

real    0m0.125s
user    0m0.124s
sys     0m0.000s


Ivybridge:

GCC 4.9.2 native

real    0m0.218s
user    0m0.216s
sys     0m0.000s
GCC 6.0.0 native

real    0m0.186s
user    0m0.184s
sys     0m0.000s
GCC 6.0.0 native + count

real    0m0.266s
user    0m0.264s
sys     0m0.000s

================

GCC 4.9.2 generic

real    0m0.224s
user    0m0.220s
sys     0m0.004s
GCC 6.0.0 generic

real    0m0.183s
user    0m0.180s
sys     0m0.000s
GCC 6.0.0 generic + count

real    0m0.181s
user    0m0.176s
sys     0m0.004s

Here "native" means "-march=native -mtune=native" and "generic" means no arch options; "+count" means adding a counter into the loop and writing its value to a global volatile variable after the loop.
Comment 16 Richard Biener 2015-11-18 11:41:27 UTC
Confirmed.  Neither GCC 5 nor trunk unswitches the loop in main.
Comment 17 Richard Biener 2015-12-04 10:46:29 UTC
GCC 5.3 is being released, adjusting target milestone.
Comment 18 ncm 2015-12-04 13:23:01 UTC
It is far from clear to me that gcc-5's choice to put the increment value in a register and use just one loop body is wrong. Rather, it appears that an incidental choice in the placement order of basic blocks, or in register assignment, interacts badly with a quirk of Haswell branch prediction, value-dependency tracking, the micro-op cache, or something else.  An actual fix would need to identify and step around Haswell's sensitivity to whichever detail of code generation this program happens upon.
Comment 19 Andrew Pinski 2016-01-05 00:31:22 UTC
bitset<> version
real    0m1.073s
user    0m1.052s
sys     0m0.021s

unsigned int
real    0m0.903s
user    0m0.883s
sys     0m0.019s

bitset with container adapter:
real    0m1.519s
user    0m1.499s
sys     0m0.019s
Comment 20 Andrew Pinski 2016-01-05 00:32:09 UTC
(In reply to Andrew Pinski from comment #19)
> bitset<> version
> real    0m1.073s
> user    0m1.052s
> sys     0m0.021s
> 
> unsigned int
> real    0m0.903s
> user    0m0.883s
> sys     0m0.019s
> 
> bitset with container adapter:
> real    0m1.519s
> user    0m1.499s
> sys     0m0.019s

I should say this is on a ThunderX T88 pass 2 running at 1.9GHz.
Comment 21 Nathan Kurz 2016-02-08 11:29:39 UTC
Created attachment 37629 [details]
Test case using __builtin_expect()

I wondered if this might match some other effects I've seen recently, so I tested this on Skylake and Haswell with g++ 5.3.0.  My current belief is that everything here is expected behavior, and there is no bug with either the compiler or processor.  

The code spends most of its time in a tight loop that depends on runtime input, and the compiler doesn't have any reason to know which branch is more likely.   The addition of "count" changes the heuristic in recent compilers, and by chance, changes it for the worse.   

More directly, one can get the same effect with __builtin_expect().  Here are the (excerpted) results I see on Skylake:  

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        210.098864      task-clock (msec)         #    1.001 CPUs utilized
       711,412,710      cycles                    #    3.386 GHz
     1,930,668,439      instructions              #    2.71  insns per cycle
       625,004,194      branches                  # 2974.810 M/sec
         2,920,721      branch-misses             #    0.47% of all branches
       0.209838807 seconds time elapsed

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o puzzlegen -DUSE_EXPECT
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        122.037566      task-clock (msec)         #    1.001 CPUs utilized
       413,227,074      cycles                    #    3.386 GHz
     1,930,678,854      instructions              #    4.67  insns per cycle
       625,011,748      branches                  # 5121.470 M/sec
         2,952,030      branch-misses             #    0.47% of all branches
       0.121937547 seconds time elapsed

Both versions have the same number of instructions, branches, and branch-misses.  The version with USE_EXPECT is almost twice as fast because it executes almost twice as many instructions per cycle.  GCC does a great job of producing a very tight loop that executes in one cycle per iteration: 

   77.21 │210:┌─→add    $0x4,%rcx 
    0.66 │    │  cmp    %rcx,%r8  
         │    │↓ je     260       
    4.20 │219:│  mov    (%rcx),%edi 
    0.88 │    │  test   %r9d,%edi   
         │    └──jne    210

The cmp/je and test/jne instructions each fuse into one µop, so the processor can execute the remaining 4 fused µops at one iteration per cycle. If the loop is rearranged so that the middle branch is taken, each iteration takes a minimum of 2 cycles instead.

My suspicion was that we'd see GCC splitting the instructions in a way that prevented fusion (which I think it often does), but I think it's doing an excellent job here.  It's possible there is a different effect in one of the other test cases, but I don't think the case that involves 'count' needs fixing.
Comment 22 ncm 2016-02-08 19:11:43 UTC
(In reply to Nathan Kurz from comment #21)
> My current belief is
> that everything here is expected behavior, and there is no bug with either
> the compiler or processor.  
> 
> The code spends most of its time in a tight loop that depends on runtime
> input, and the compiler doesn't have any reason to know which branch is more
> likely.   The addition of "count" changes the heuristic in recent compilers,
> and by chance, changes it for the worse.  

I am going to disagree, carefully.  It seems clear, in any case, that Haswell
is off the hook.

1. As a correction: *without* the count takes twice as long to run as with,
   or when using bitset<>.

2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
   has much less downside than favoring the branch into it.  Maybe Gcc
   already has such a heuristic, and the addition of 7 conditional 
   increments in the loop, or whatever overhead bitset<> adds, was enough
   to push it over?  

Westmere runs both instruction sequences (with and without __builtin_expect)
the same.  Maybe on Westmere the loop takes two cycles regardless of code 
placement, and Gcc is (still) tuned for the older timings?
Comment 23 Nathan Kurz 2016-02-08 21:51:02 UTC
> 1. As a correction: *without* the count takes twice as long to run as with,
>    or when using bitset<>.

Oops, I did say that backwards.  My tests agree with what you say.  
 
> 2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
>    has much less downside than favoring the branch into it.  

A correctly predicted branch-not-taken will be the same speed or one cycle faster than a correctly predicted branch-taken.   Modern Intel can sustain 2 fused cmp/jcc µops per cycle, but only if the first is not-taken.  

> Maybe Gcc already has such a heuristic, and the addition of 7 conditional 
> increments in the loop, or whatever overhead bitset<> adds, was enough
> to push it over?  

I presume this is true, but don't know the GCC internals.  I did test though with clang++ 3.8.0 and icpc 16.01, and found that both made the same layout choices as GCC in the absence of __builtin_expect() and had approximately the same speed on Skylake (about 2 cycles per iteration).  With USE_EXPECT, clang and gcc were equally fast (one cycle), and icc was at an intermediate speed (1.2-ish).  

My reason for thinking this is not a bug is that the fastest choice will depend on the contents of the word list.   Regardless of layout, there will be one option that is slightly faster than the other.   I guess it's reasonable to ask, though, whether it's better by default to try to save one cycle on an already very fast empty loop, or to save one cycle on a more expensive loop. But the real gain (if there is one) will be matching the layout to the runtime behavior, for which the compiler requires outside information.   

> Westmere runs both instruction sequences (with and without __builtin_expect)
> the same.  Maybe on Westmere the loop takes two cycles regardless of code 
> placement, and Gcc is (still) tuned for the older timings?


This matches my belief:  Haswell and Skylake can execute the inner loop in a single cycle if the layout matches the runtime pattern, while Nehalem (Westmere's sibling) takes at least 2 cycles regardless.   The difference comes down to each generation's "Macro-op Fusion" capabilities.  See the sections with that heading for Nehalem, Sandy Bridge, Haswell, and Skylake here: http://www.agner.org/optimize/microarchitecture.pdf.

Often the limiting factor for very tight loops like this is that only 4 µops can be "retired" per cycle.   If both cmp/jcc instructions are fused (cmp here includes a different subset of cmp/test/add/sub for different generations), and if the first branch is not taken, it's possible to do both plus the 'mov' and 'add' in a single cycle on Haswell and Skylake.  If as on previous generations they are not both fused (see Agner's PDF), or if the first branch is taken, there is a two cycle minimum for the loop.  

My recommendation to any GCC internals experts reading this would be to read Agner's suggestions for promoting fusion, and change the loop optimizers to respect them. While it is not happening in this case, I feel I often see situations where GCC puts 'mov' or other non-flag-changing instructions between the 'cmp' and 'jcc', defeating macro-op fusion and creating slower loops.  Intel's icc frequently does a better job of avoiding this, and thus often produces slightly faster loops.
Comment 24 Richard Biener 2016-06-03 10:06:21 UTC
GCC 5.4 is being released, adjusting target milestone.
Comment 25 Jakub Jelinek 2017-10-10 13:27:46 UTC
GCC 5 branch is being closed