Created attachment 41053 [details] self contained benchmark of scimark2 MC while chasing the regression I then found identified and solved in #79389 I discovered that pgo manages to do much worse than the regression above. The symptom is the same: a huge increase in branch-miss. This is not a regression: it is the same at least since gcc5.3 Attached a self contained single file, copy of scimark2 MC, and a couple of scripts to compile and run it just tar -xzf fullMC.tgz cd fullMC # standard compilation -O2 -O3 ./runit # same with pgo passes ./dopgo or just do [innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g -fprofile-generate=pgo ; time ./a.out 1.848u 0.000s 0:01.85 99.4% 0+0k 0+8io 0pf+0w [innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time ./a.out 0.967u 0.001s 0:00.96 100.0% 0+0k 0+0io 0pf+0w [innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out 0.328u 0.000s 0:00.32 100.0% 0+0k 0+0io 0pf+0w for reference: cat dopgo cat /proc/cpuinfo | grep name | head -n 1 gcc -v rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out on my machine the result is # standard compilation [innocent@vinavx3 fullMC]$ ./runit model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) gcc -O2 fullMC.c -g real 0m0.489s user 0m0.485s sys 0m0.002s Performance counter stats for './a.out': 486.303424 task-clock (msec) # 0.999 CPUs utilized 1901271534 cycles # 3.910 GHz 6403589598 instructions # 3.37 insn per cycle 700683389 branches # 1440.836 M/sec 13582 branch-misses # 0.00% of all branches 0.486571089 seconds time elapsed gcc -O3 fullMC.c -g real 0m0.330s user 0m0.330s sys 0m0.000s Performance counter stats for './a.out': 327.385696 task-clock (msec) # 0.999 CPUs utilized 1279958668 cycles # 3.910 GHz 5009002909 instructions # 3.91 insn per cycle 306481761 branches # 936.149 M/sec 10805 branch-misses # 0.00% of all branches 0.327637485 seconds time elapsed // pro generation and use (perf after use...) [innocent@vinavx3 fullMC]$ ./dopgo model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) Performance counter stats for './a.out': 964.399833 task-clock (msec) # 1.000 CPUs utilized 3770455888 cycles # 3.910 GHz 5007987488 instructions # 1.33 insn per cycle 816525627 branches # 846.667 M/sec 88982233 branch-misses # 10.90% of all branches 0.964699603 seconds time elapsed Performance counter stats for './a.out': 964.540691 task-clock (msec) # 1.000 CPUs utilized 3771010753 cycles # 3.910 GHz 5007957589 instructions # 1.33 insn per cycle 816522043 branches # 846.540 M/sec 88992086 branch-misses # 10.90% of all branches 0.964758684 seconds time elapsed
For some reason, recently fixed if-conversion (PR79389) does not trigger in PGO case. There is still a jump with -O2: mulsd %xmm0, %xmm5 mulsd %xmm2, %xmm2 addsd %xmm2, %xmm5 ucomisd %xmm5, %xmm4 jb .L17 .L16: addl $1, %ebp .L17: addl $1, %edi cmpl %edi, %ebx je .L5 Since this asm corresponds to random operands, the jump can't be predicted: for (count=0; count<Num_samples; count++) { double x= Random_nextDouble(R); double y= Random_nextDouble(R); if ( x*x + y*y <= 1.0) under_curve ++; } Based on the discussion in PR79389, and the fact that -O2 and -O3 both compile to a jump, I suspect that loop splitting cost model should be fine tuned to also handle PGO case. Note that Adding some CCs.
On Mon, 27 Mar 2017, ubizjak at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197 > > Uroš Bizjak <ubizjak at gmail dot com> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > Status|UNCONFIRMED |NEW > Last reconfirmed| |2017-03-27 > CC| |jakub at gcc dot gnu.org, > | |rguenth at gcc dot gnu.org > Ever confirmed|0 |1 > > --- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> --- > For some reason, recently fixed if-conversion (PR79389) does not trigger in PGO > case. There is still a jump with -O2: > > mulsd %xmm0, %xmm5 > mulsd %xmm2, %xmm2 > addsd %xmm2, %xmm5 > ucomisd %xmm5, %xmm4 > jb .L17 > .L16: > addl $1, %ebp > .L17: > addl $1, %edi > cmpl %edi, %ebx > je .L5 > > Since this asm corresponds to random operands, the jump can't be predicted: > > for (count=0; count<Num_samples; count++) > { > double x= Random_nextDouble(R); > double y= Random_nextDouble(R); > > if ( x*x + y*y <= 1.0) > under_curve ++; > > } > > Based on the discussion in PR79389, and the fact that -O2 and -O3 both compile > to a jump, I suspect that loop splitting cost model should be fine tuned to > also handle PGO case. Note that > > Adding some CCs. Not sure - loop splitting isn't done here and doing it would remove the if-conversion opportunity. I think that if FDO says either the true or false edge is very likely then not if-converting the loop is best? Or is a well-predicted conditional move as good as a well-predicted if? 10% missed branches would be more than /* When branch is predicted to be taken with probability lower than this threshold (in percent), then it is considered well predictable. */ DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME, "predictable-branch-outcome", "Maximal estimated outcome of branch considered predictable.", 2, 0, 50) so it shouldn't affect if-conversion... Are we sure we're not hitting some architectural limitation here? Like disabling the loop stream cache because of size or the CFG? (otoh we have calls in the loop(?)).
(In reply to rguenther@suse.de from comment #2) > I think that if FDO says either the true or false edge is very likely > then not if-converting the loop is best? Or is a well-predicted > conditional move as good as a well-predicted if? 10% missed branches > would be more than Please note that when if-conversion succeeded through noce_try_addcc, we don't care about prediction anymore. The conversion converts: ucomisd %xmm5, %xmm4 jb .L17 .L16: addl $1, %ebp .L17: to: ucomisd %xmm0, %xmm3 # 195 *cmpiudf/2 [length = 4] sbbl $-1, %ebx # 196 subsi3_carry/1 [length = 3] IMO, this conversion should always be performed, as it is always a win.
According to my analysis, this is mostly caused by different inlining decisions with regards to inlining new_Random_seed into MonteCarlo_integrate. Inlining happens at profile-generate time, but does not happen at profile-use time. This appears to throw off edge probabilities, and also prevents the compiler from seeing that R->haveRange accessed in Random_nextDouble (which is inlined) is always 0. Declaring new_Random_seed (which is called once) as 'inline __attribute__((always_inline))' makes code generation sane again.
On trunk, manually fixing up inlining is not enough: trunk additionally needs -fno-tracer, otherwise crucial if-conversion of 'if (k < 0) k += m1;' is prevented.
On Tue, 28 Mar 2017, amonakov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197 > > --- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> --- > On trunk, manually fixing up inlining is not enough: trunk additionally needs > -fno-tracer, otherwise crucial if-conversion of 'if (k < 0) k += m1;' is > prevented. That means -ftracer has similar issues as path splitting. -ftracer simply looks for a path with high probability so supposedly the distorted profile might be the reason for this?
No, with fixed-up inlining -ftracer sees reasonable edge probabilities. The reason tracer makes things worse here, is that it clones the path leading to a 50%/50% conditional branch (and correctly stops at that branch), making the tiny BB under that branch ineligible(?) for if-conversion. We go from <BB0> if (!cond) goto <BB2> <BBcond> var = VAL; // this can eventually become a cmov <BB2> to <BB0_1> if (!cond) goto <BB2> goto <BBcond> ... <BB0_2> if (!cond) goto <BB2> <BBcond> var = VAL; // this doesn't become a cmov <BB2> I think in principle if-conversion could still do its job here by duplicating the conditional var=VAL assignment under BB0_1. Here's a silly compile-only sample where -O2 -ftracer is worse than -O2 due to this effect: void f(long n, signed char *x) { for (; n; n--) { long a=x[n], b; if (!a) a = 42; b = x[a]; if (b < 0) b += a; x[b] = 0; } }
On Wed, 29 Mar 2017, amonakov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197 > > --- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> --- > No, with fixed-up inlining -ftracer sees reasonable edge probabilities. The > reason tracer makes things worse here, is that it clones the path leading to a > 50%/50% conditional branch (and correctly stops at that branch), making the > tiny BB under that branch ineligible(?) for if-conversion. We go from > > <BB0> > if (!cond) goto <BB2> > <BBcond> > var = VAL; // this can eventually become a cmov > <BB2> > > to > > <BB0_1> > if (!cond) goto <BB2> > goto <BBcond> > > ... > > <BB0_2> > if (!cond) goto <BB2> > <BBcond> > var = VAL; // this doesn't become a cmov > <BB2> > > > I think in principle if-conversion could still do its job here by duplicating > the conditional var=VAL assignment under BB0_1. > > Here's a silly compile-only sample where -O2 -ftracer is worse than -O2 due to > this effect: > > void f(long n, signed char *x) > { > for (; n; n--) { > long a=x[n], b; > if (!a) > a = 42; > b = x[a]; > if (b < 0) > b += a; > x[b] = 0; > } > } It's hard for tracer to predict followup optimization opportunities in the isolated path and weight that against missed if-conversion opportunities. Of course the fact that late phiopt runs after tracer / split-paths doesn't help, nor that phiopt doesn't catch this kind of if-conversion (the idea was that the RTL pass can do _much_ better in assessing if-conversion const/benefit). Tracer doesn't have a very sophisticated cost model either.
The original idea of tracing was that we can pro-actively duplicate tails and rely on crossjumping to merge the paths back if they did not trigger context sensitive optimizations. Nowdays crossjumping much weaker than it used to be because it does not handle well RTL genertated from SSA and we do not do any gimple level tail merging after tracer which we probably should.