Bug 80197 - pgo dramatically pessimizes scimark2 MonteCarlo benchmark
Summary: pgo dramatically pessimizes scimark2 MonteCarlo benchmark
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 7.0.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 79703
  Show dependency treegraph
 
Reported: 2017-03-26 13:41 UTC by vincenzo Innocente
Modified: 2024-09-24 15:27 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-03-27 00:00:00


Attachments
self contained benchmark of scimark2 MC (2.48 KB, application/x-gzip)
2017-03-26 13:41 UTC, vincenzo Innocente
Details

Note You need to log in before you can comment on or make changes to this bug.
Description vincenzo Innocente 2017-03-26 13:41:58 UTC
Created attachment 41053 [details]
self contained benchmark of scimark2 MC

while chasing the regression I then found identified and solved in #79389
I discovered that pgo manages to do much worse than the regression above.
The symptom is the same: a huge increase in branch-miss.
This is not a regression: it is the same at least since gcc5.3
Attached a self contained single file, copy of scimark2 MC, and a couple of scripts to compile and run it

just
tar -xzf fullMC.tgz
cd fullMC
# standard compilation -O2 -O3
./runit 
# same with pgo passes
./dopgo

or just do
[innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g -fprofile-generate=pgo ; time ./a.out
1.848u 0.000s 0:01.85 99.4%	0+0k 0+8io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time ./a.out
0.967u 0.001s 0:00.96 100.0%	0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out
0.328u 0.000s 0:00.32 100.0%	0+0k 0+0io 0pf+0w


for reference:
cat dopgo
cat /proc/cpuinfo | grep name | head -n 1
gcc -v
rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out
rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out


on my machine the result is
# standard compilation
[innocent@vinavx3 fullMC]$ ./runit 
model name	: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 
gcc -O2 fullMC.c -g

real	0m0.489s
user	0m0.485s
sys	0m0.002s

 Performance counter stats for './a.out':

        486.303424      task-clock (msec)         #    0.999 CPUs utilized          
        1901271534      cycles                    #    3.910 GHz                    
        6403589598      instructions              #    3.37  insn per cycle         
         700683389      branches                  # 1440.836 M/sec                  
             13582      branch-misses             #    0.00% of all branches        

       0.486571089 seconds time elapsed

gcc -O3 fullMC.c -g

real	0m0.330s
user	0m0.330s
sys	0m0.000s

 Performance counter stats for './a.out':

        327.385696      task-clock (msec)         #    0.999 CPUs utilized          
        1279958668      cycles                    #    3.910 GHz                    
        5009002909      instructions              #    3.91  insn per cycle         
         306481761      branches                  #  936.149 M/sec                  
             10805      branch-misses             #    0.00% of all branches        

       0.327637485 seconds time elapsed


// pro generation and use (perf after use...)
[innocent@vinavx3 fullMC]$ ./dopgo 
model name	: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 

 Performance counter stats for './a.out':

        964.399833      task-clock (msec)         #    1.000 CPUs utilized          
        3770455888      cycles                    #    3.910 GHz                    
        5007987488      instructions              #    1.33  insn per cycle         
         816525627      branches                  #  846.667 M/sec                  
          88982233      branch-misses             #   10.90% of all branches        

       0.964699603 seconds time elapsed


 Performance counter stats for './a.out':

        964.540691      task-clock (msec)         #    1.000 CPUs utilized          
        3771010753      cycles                    #    3.910 GHz                    
        5007957589      instructions              #    1.33  insn per cycle         
         816522043      branches                  #  846.540 M/sec                  
          88992086      branch-misses             #   10.90% of all branches        

       0.964758684 seconds time elapsed
Comment 1 Uroš Bizjak 2017-03-27 08:41:21 UTC
For some reason, recently fixed if-conversion (PR79389) does not trigger in PGO case. There is still a jump with -O2:

	mulsd	%xmm0, %xmm5
	mulsd	%xmm2, %xmm2
	addsd	%xmm2, %xmm5
	ucomisd	%xmm5, %xmm4
	jb	.L17
.L16:
	addl	$1, %ebp
.L17:
	addl	$1, %edi
	cmpl	%edi, %ebx
	je	.L5

Since this asm corresponds to random operands, the jump can't be predicted:

        for (count=0; count<Num_samples; count++)
        {
            double x= Random_nextDouble(R);
            double y= Random_nextDouble(R);

            if ( x*x + y*y <= 1.0)
                 under_curve ++;
            
        }

Based on the discussion in PR79389, and the fact that -O2 and -O3 both compile to a jump, I suspect that loop splitting cost model should be fine tuned to also handle PGO case. Note that

Adding some CCs.
Comment 2 rguenther@suse.de 2017-03-27 09:27:20 UTC
On Mon, 27 Mar 2017, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
> 
> Uroš Bizjak <ubizjak at gmail dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED                 |NEW
>    Last reconfirmed|                            |2017-03-27
>                  CC|                            |jakub at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>      Ever confirmed|0                           |1
> 
> --- Comment #1 from Uroš Bizjak <ubizjak at gmail dot com> ---
> For some reason, recently fixed if-conversion (PR79389) does not trigger in PGO
> case. There is still a jump with -O2:
> 
>         mulsd   %xmm0, %xmm5
>         mulsd   %xmm2, %xmm2
>         addsd   %xmm2, %xmm5
>         ucomisd %xmm5, %xmm4
>         jb      .L17
> .L16:
>         addl    $1, %ebp
> .L17:
>         addl    $1, %edi
>         cmpl    %edi, %ebx
>         je      .L5
> 
> Since this asm corresponds to random operands, the jump can't be predicted:
> 
>         for (count=0; count<Num_samples; count++)
>         {
>             double x= Random_nextDouble(R);
>             double y= Random_nextDouble(R);
> 
>             if ( x*x + y*y <= 1.0)
>                  under_curve ++;
> 
>         }
> 
> Based on the discussion in PR79389, and the fact that -O2 and -O3 both compile
> to a jump, I suspect that loop splitting cost model should be fine tuned to
> also handle PGO case. Note that
> 
> Adding some CCs.

Not sure - loop splitting isn't done here and doing it would remove
the if-conversion opportunity.

I think that if FDO says either the true or false edge is very likely
then not if-converting the loop is best?  Or is a well-predicted
conditional move as good as a well-predicted if?  10% missed branches
would be more than

/* When branch is predicted to be taken with probability lower than this
   threshold (in percent), then it is considered well predictable. */
DEFPARAM (PARAM_PREDICTABLE_BRANCH_OUTCOME,
          "predictable-branch-outcome",
          "Maximal estimated outcome of branch considered predictable.",
          2, 0, 50)

so it shouldn't affect if-conversion...

Are we sure we're not hitting some architectural limitation here?  Like
disabling the loop stream cache because of size or the CFG?
(otoh we have calls in the loop(?)).
Comment 3 Uroš Bizjak 2017-03-27 14:32:30 UTC
(In reply to rguenther@suse.de from comment #2)

> I think that if FDO says either the true or false edge is very likely
> then not if-converting the loop is best?  Or is a well-predicted
> conditional move as good as a well-predicted if?  10% missed branches
> would be more than

Please note that when if-conversion succeeded through noce_try_addcc, we don't care about prediction anymore. The conversion converts:

	ucomisd	%xmm5, %xmm4
	jb	.L17
.L16:
	addl	$1, %ebp
.L17:

to:

	ucomisd	%xmm0, %xmm3	# 195	*cmpiudf/2	[length = 4]
	sbbl	$-1, %ebx	# 196	subsi3_carry/1	[length = 3]

IMO, this conversion should always be performed, as it is always a win.
Comment 4 Alexander Monakov 2017-03-27 16:43:25 UTC
According to my analysis, this is mostly caused by different inlining decisions with regards to inlining new_Random_seed into MonteCarlo_integrate.  Inlining happens at profile-generate time, but does not happen at profile-use time.  This appears to throw off edge probabilities, and also prevents the compiler from seeing that R->haveRange accessed in Random_nextDouble (which is inlined) is always 0.

Declaring new_Random_seed (which is called once) as 'inline __attribute__((always_inline))' makes code generation sane again.
Comment 5 Alexander Monakov 2017-03-28 17:10:18 UTC
On trunk, manually fixing up inlining is not enough: trunk additionally needs -fno-tracer, otherwise crucial if-conversion of 'if (k < 0) k += m1;' is prevented.
Comment 6 rguenther@suse.de 2017-03-29 07:16:37 UTC
On Tue, 28 Mar 2017, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
> 
> --- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> On trunk, manually fixing up inlining is not enough: trunk additionally needs
> -fno-tracer, otherwise crucial if-conversion of 'if (k < 0) k += m1;' is
> prevented.

That means -ftracer has similar issues as path splitting.  -ftracer
simply looks for a path with high probability so supposedly the
distorted profile might be the reason for this?
Comment 7 Alexander Monakov 2017-03-29 14:15:59 UTC
No, with fixed-up inlining -ftracer sees reasonable edge probabilities. The reason tracer makes things worse here, is that it clones the path leading to a 50%/50% conditional branch (and correctly stops at that branch), making the tiny BB under that branch ineligible(?) for if-conversion. We go from

<BB0>
if (!cond) goto <BB2>
  <BBcond>
  var = VAL;   // this can eventually become a cmov
<BB2>

to

<BB0_1>
if (!cond) goto <BB2>
goto <BBcond>

...

<BB0_2>
if (!cond) goto <BB2>
  <BBcond>
  var = VAL;  // this doesn't become a cmov
<BB2>


I think in principle if-conversion could still do its job here by duplicating the conditional var=VAL assignment under BB0_1.

Here's a silly compile-only sample where -O2 -ftracer is worse than -O2 due to this effect:

void f(long n, signed char *x)
{
  for (; n; n--) {
    long a=x[n], b;
    if (!a)
      a = 42;
    b = x[a];
    if (b < 0)
      b += a;
    x[b] = 0;
  }
}
Comment 8 rguenther@suse.de 2017-03-30 07:50:02 UTC
On Wed, 29 Mar 2017, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197
> 
> --- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> No, with fixed-up inlining -ftracer sees reasonable edge probabilities. The
> reason tracer makes things worse here, is that it clones the path leading to a
> 50%/50% conditional branch (and correctly stops at that branch), making the
> tiny BB under that branch ineligible(?) for if-conversion. We go from
> 
> <BB0>
> if (!cond) goto <BB2>
>   <BBcond>
>   var = VAL;   // this can eventually become a cmov
> <BB2>
> 
> to
> 
> <BB0_1>
> if (!cond) goto <BB2>
> goto <BBcond>
> 
> ...
> 
> <BB0_2>
> if (!cond) goto <BB2>
>   <BBcond>
>   var = VAL;  // this doesn't become a cmov
> <BB2>
> 
> 
> I think in principle if-conversion could still do its job here by duplicating
> the conditional var=VAL assignment under BB0_1.
> 
> Here's a silly compile-only sample where -O2 -ftracer is worse than -O2 due to
> this effect:
> 
> void f(long n, signed char *x)
> {
>   for (; n; n--) {
>     long a=x[n], b;
>     if (!a)
>       a = 42;
>     b = x[a];
>     if (b < 0)
>       b += a;
>     x[b] = 0;
>   }
> }

It's hard for tracer to predict followup optimization opportunities
in the isolated path and weight that against missed if-conversion
opportunities.

Of course the fact that late phiopt runs after tracer / split-paths
doesn't help, nor that phiopt doesn't catch this kind of
if-conversion (the idea was that the RTL pass can do _much_ better
in assessing if-conversion const/benefit).

Tracer doesn't have a very sophisticated cost model either.
Comment 9 Jan Hubicka 2017-07-24 22:12:47 UTC
The original idea of tracing was that we can pro-actively duplicate tails and rely on crossjumping to merge the paths back if they did not trigger context sensitive optimizations.  Nowdays crossjumping much weaker than it used to be because it does not handle well RTL genertated from SSA and we do not do any gimple level tail merging after tracer which we probably should.