Bug 79389 - [7 Regression] 30% performance regression in SciMark2 MonteCarlo
Summary: [7 Regression] 30% performance regression in SciMark2 MonteCarlo
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P1 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 79703
  Show dependency treegraph
 
Reported: 2017-02-06 13:12 UTC by krister.walfridsson
Modified: 2019-11-19 23:32 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-06 00:00:00


Attachments
The relevant source code and generated assembler before/after this change (1.24 KB, application/gzip)
2017-02-06 13:12 UTC, krister.walfridsson
Details
gcc7-pr79389.patch (1.80 KB, patch)
2017-02-22 14:02 UTC, Jakub Jelinek
Details | Diff
patch (1.17 KB, text/plain)
2017-02-23 12:02 UTC, Richard Biener
Details
revised patch (1.54 KB, text/plain)
2017-02-23 14:53 UTC, Richard Biener
Details

Note You need to log in before you can comment on or make changes to this bug.
Description krister.walfridsson 2017-02-06 13:12:52 UTC
Created attachment 40676 [details]
The relevant source code and generated assembler before/after this change

The MonteCarlo test from the old SciMark2 (http://math.nist.gov/scimark) used in the Phoronix compiler test suite has regressed 30% compared to GCC 6.3 when run on Intel i7 6800K Broadwell (compiled with "-O3 -march=native").

The regression was introduced in r238005:
------------------------------------------------------------------------
r238005 | rguenth | 2016-07-05 15:25:47 +0200 (tis, 05 jul 2016) | 7 lines

2016-07-05  Richard Biener  <rguenther@suse.de>

        * gimple-ssa-split-paths.c (find_block_to_duplicate_for_splitting_pa):
	Handle empty else block.
	(is_feasible_trace): Likewise.
	(split_paths): Likewise.

------------------------------------------------------------------------
and has the effect that the if-statement in the loop

    for (count=0; count<Num_samples; count++)
    {
      double x= Random_nextDouble(R);
      double y= Random_nextDouble(R);
      if ( x*x + y*y <= 1.0)
        under_curve ++;
    }
is changed from a cmov to a branch, which mispredicts.
Comment 1 Richard Biener 2017-02-06 13:21:19 UTC
Possibly related to PR77283.  Confirmed by looking at http://gcc.opensuse.org/c++bench-czerny/scimark2/
Comment 2 Richard Biener 2017-02-06 13:28:51 UTC
So we think the splitting is useful because on one path it exposes that under_curve is unchanged which might expose a jump threading opportunity.
Not in this case so the possible jump threading opportunity needs to be
further verified by some heuristics (a use in a conditional maybe?).
Comment 3 Uroš Bizjak 2017-02-06 14:42:55 UTC
Please note that clang if-converts:

            if ( x*x + y*y <= 1.0)
                 under_curve ++;

to SETcc + ADD:

  5,64 │      movsd  (%rsp),%xmm1
       │      mulsd  %xmm1,%xmm1
       │      mulsd  %xmm0,%xmm0
 23,59 │      addsd  %xmm1,%xmm0
 18,54 │      ucomis 0xb32(%rip),%xmm0        # 403190 <TINY_LU_SIZE+0x258>
 10,88 │      setbe  %al
 12,31 │      movzbl %al,%eax
 10,68 │      add    %eax,%ebp
  7,11 │      dec    %ebx
       │    ↑ jne    30

OTOH, gcc emits LEA + CMOVE to conditionally increase under_curve:

  7,46 │      movsd  0x8(%rsp),%xmm1
       │      lea    0x1(%rbx),%eax
       │      movsd  0x861(%rip),%xmm2        # 403448 <RESOLUTION_DEFAULT+0x48>
       │      mulsd  %xmm0,%xmm0
 25,81 │      mulsd  %xmm1,%xmm1
       │      addsd  %xmm0,%xmm1
 22,23 │      comisd %xmm1,%xmm2
  8,43 │      cmovae %eax,%ebx
 21,60 │      add    $0x1,%ebp
       │      cmp    %ebp,%r13d
       │    ↑ jne    30

LEA insn moves ebx+1 to eax and CMOVE effectively decides between ebx and ebx+1. Considering that CMOVE is problematic for some x86 targets, IMO clang code is more "universal" for these short loops, since CMOVE is avoided.

Note also that clang reverses the loop to use DEC insn instead of ADD/CMP.

Putting loop reversal aside, this is the case of missing if-conversion.
Comment 4 Uroš Bizjak 2017-02-22 10:27:44 UTC
The testcase for missing if-conversion:

--cut here--
extern double Random_nextDouble (void);

int foo (int Num_samples)
{
  int count;
  int under_curve = 0;

  for (count=0; count<Num_samples; count++)
    {
      double x= Random_nextDouble ();
      double y= Random_nextDouble ();
      if ( x*x + y*y <= 1.0)
        under_curve ++;
    }

  return under_curve;
}
--cut here--

gcc -O2:

        xorl    %ebx, %ebx
        ...
.L5:
        call    Random_nextDouble
        movsd   %xmm0, 8(%rsp)
        call    Random_nextDouble
        movsd   8(%rsp), %xmm1
++>     leal    1(%rbx), %eax
        mulsd   %xmm0, %xmm0
        mulsd   %xmm1, %xmm1
        movsd   .LC0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        ucomisd %xmm1, %xmm2
-->     cmovnb  %eax, %ebx
        addl    $1, %ebp
        cmpl    %ebp, %r12d
        jne     .L5

gcc -O3 is even more "innovative":

        xorl    %r12d, %r12d
        ...
.L2:
        call    Random_nextDouble
        movsd   %xmm0, 8(%rsp)
        call    Random_nextDouble
        movsd   8(%rsp), %xmm1
        mulsd   %xmm0, %xmm0
        mulsd   %xmm1, %xmm1
        movsd   .LC0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        ucomisd %xmm1, %xmm2
-->     jnb     .L13
        addl    $1, %ebx
        cmpl    %ebx, %ebp
        jne     .L2

.L13:
        addl    $1, %ebx
++>     addl    $1, %r12d
        cmpl    %ebp, %ebx
        jne     .L2

IF-conversion to SETcc + ZEXT + ADD would avoid cmove/jump in this short loop.

(I'm not sure if tree or RTL ifconvert pass should catch this situation.)
Comment 5 rguenther@suse.de 2017-02-22 10:29:36 UTC
On Wed, 22 Feb 2017, ubizjak at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79389
> 
> Uroš Bizjak <ubizjak at gmail dot com> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org
> 
> --- Comment #4 from Uroš Bizjak <ubizjak at gmail dot com> ---
> The testcase for missing if-conversion:
> 
> --cut here--
> extern double Random_nextDouble (void);
> 
> int foo (int Num_samples)
> {
>   int count;
>   int under_curve = 0;
> 
>   for (count=0; count<Num_samples; count++)
>     {
>       double x= Random_nextDouble ();
>       double y= Random_nextDouble ();
>       if ( x*x + y*y <= 1.0)
>         under_curve ++;
>     }
> 
>   return under_curve;
> }
> --cut here--
> 
> gcc -O2:
> 
>         xorl    %ebx, %ebx
>         ...
> .L5:
>         call    Random_nextDouble
>         movsd   %xmm0, 8(%rsp)
>         call    Random_nextDouble
>         movsd   8(%rsp), %xmm1
> ++>     leal    1(%rbx), %eax
>         mulsd   %xmm0, %xmm0
>         mulsd   %xmm1, %xmm1
>         movsd   .LC0(%rip), %xmm2
>         addsd   %xmm0, %xmm1
>         ucomisd %xmm1, %xmm2
> -->     cmovnb  %eax, %ebx
>         addl    $1, %ebp
>         cmpl    %ebp, %r12d
>         jne     .L5
> 
> gcc -O3 is even more "innovative":
> 
>         xorl    %r12d, %r12d
>         ...
> .L2:
>         call    Random_nextDouble
>         movsd   %xmm0, 8(%rsp)
>         call    Random_nextDouble
>         movsd   8(%rsp), %xmm1
>         mulsd   %xmm0, %xmm0
>         mulsd   %xmm1, %xmm1
>         movsd   .LC0(%rip), %xmm2
>         addsd   %xmm0, %xmm1
>         ucomisd %xmm1, %xmm2
> -->     jnb     .L13
>         addl    $1, %ebx
>         cmpl    %ebx, %ebp
>         jne     .L2
> 
> .L13:
>         addl    $1, %ebx
> ++>     addl    $1, %r12d
>         cmpl    %ebp, %ebx
>         jne     .L2
> 
> IF-conversion to SETcc + ZEXT + ADD would avoid cmove/jump in this short loop.
> 
> (I'm not sure if tree or RTL ifconvert pass should catch this situation.)

There is only RTL ifconvert left.
Comment 6 Jakub Jelinek 2017-02-22 12:21:38 UTC
Well, at least in the -O2 case, it clearly shows that (RTL) if-conversion does happen.
If SETCC + ZEXT + ADD is always a win compared to LEAL x+1 + CMOV*, then we should just expand the cmov in that case to that or combine or peephole it to that.
Now, with -m32 -march=i586 -O2 (so that cmov can't be used), I see noce_try_addcc (which is exactly that, SETCC + ZEXT + ADD) fails because the condition is complex:
(ne (and:QI (subreg:QI (zero_extract:SI (reg:HI 100)
                (const_int 8 [0x8])
                (const_int 8 [0x8])) 0)
        (const_int 5 [0x5]))
    (const_int 0 [0]))
and in that case it attempts to emit:
(insn 53 0 0 (set (reg:SI 102)
        (eq:SI (reg:CCNO 17 flags)
            (const_int 0 [0]))) -1
     (nil))
but that isn't recognized by the backend.
Comment 7 Jakub Jelinek 2017-02-22 12:56:39 UTC
For -O2 -m64 the reason why noce_try_addcc doesn't trigger is that we have if_info->cond of
(unlt (reg:DF 100)
    (reg:DF 97))
and reverse_comparison_code on it returns UNKNOWN.
Though, that UNLT comes from:
noce_get_condition which calls canonicalize_condition on the original
(ge (reg:CCFP 17 flags)
    (const_int 0 [0]))
condition with reverse set to true.
Now, if I call canonicalize_condition on the original ge with reversed set to false, I get:
(ge (reg:DF 100)
    (reg:DF 97))
The FPU compare reversions are just way too many to be understandable, I have no idea why UNLT can't be reversed, while GE can be reversed to UNLT.
Comment 8 Uroš Bizjak 2017-02-22 13:11:02 UTC
(In reply to Jakub Jelinek from comment #6)
> Well, at least in the -O2 case, it clearly shows that (RTL) if-conversion
> does happen.
> If SETCC + ZEXT + ADD is always a win compared to LEAL x+1 + CMOV*, then we
> should just expand the cmov in that case to that or combine or peephole it
> to that.
> Now, with -m32 -march=i586 -O2 (so that cmov can't be used), I see
> noce_try_addcc (which is exactly that, SETCC + ZEXT + ADD) fails because the
> condition is complex:
> (ne (and:QI (subreg:QI (zero_extract:SI (reg:HI 100)
>                 (const_int 8 [0x8])
>                 (const_int 8 [0x8])) 0)
>         (const_int 5 [0x5]))
>     (const_int 0 [0]))

You can use -march=i686, so fcomi will be emitted instead of fcomp/fnstsw/test.
Comment 9 Jakub Jelinek 2017-02-22 13:29:43 UTC
With:
--- ifcvt.c.jj2	2017-01-23 18:09:45.000000000 +0100
+++ ifcvt.c	2017-02-22 14:25:15.758296571 +0100
@@ -777,6 +777,9 @@ struct noce_if_info
   /* The jump condition.  */
   rtx cond;
 
+  /* Reversed jump condition.  */
+  rtx rev_cond;
+
   /* New insns should be inserted before this one.  */
   rtx_insn *cond_earliest;
 
@@ -898,7 +901,7 @@ noce_emit_store_flag (struct noce_if_inf
       && (normalize == 0 || STORE_FLAG_VALUE == normalize))
     {
       rtx src = gen_rtx_fmt_ee (code, GET_MODE (x), XEXP (cond, 0),
-			    XEXP (cond, 1));
+			        XEXP (cond, 1));
       rtx set = gen_rtx_SET (x, src);
 
       start_sequence ();
@@ -1553,11 +1556,10 @@ noce_try_addcc (struct noce_if_info *if_
 
   if (GET_CODE (if_info->a) == PLUS
       && rtx_equal_p (XEXP (if_info->a, 0), if_info->b)
-      && (reversed_comparison_code (if_info->cond, if_info->jump)
-	  != UNKNOWN))
+      && if_info->rev_cond)
     {
-      rtx cond = if_info->cond;
-      enum rtx_code code = reversed_comparison_code (cond, if_info->jump);
+      rtx cond = if_info->rev_cond;
+      enum rtx_code code = GET_CODE (cond);
 
       /* First try to use addcc pattern.  */
       if (general_operand (XEXP (cond, 0), VOIDmode)
@@ -4064,6 +4066,11 @@ noce_find_if_block (basic_block test_bb,
   if_info.else_bb = else_bb;
   if_info.join_bb = join_bb;
   if_info.cond = cond;
+  rtx_insn *rev_cond_earliest;
+  if_info.rev_cond = noce_get_condition (jump, &rev_cond_earliest,
+					 !then_else_reversed);
+  gcc_assert (if_info.rev_cond == NULL_RTX
+	      || rev_cond_earliest == cond_earliest);
   if_info.cond_earliest = cond_earliest;
   if_info.jump = jump;
   if_info.then_else_reversed = then_else_reversed;

I get:
	comisd	%xmm1, %xmm2
	sbbl	$-1, %ebp
instead of:
	leal	1(%rbx), %eax
...
	comisd	%xmm1, %xmm2
	cmovnb	%eax, %ebx
(plus slightly different RA).
Comment 10 Jakub Jelinek 2017-02-22 14:02:29 UTC
Created attachment 40815 [details]
gcc7-pr79389.patch

Full untested patch.
Comment 11 Uroš Bizjak 2017-02-22 16:23:55 UTC
(In reply to Jakub Jelinek from comment #10)
> Created attachment 40815 [details]
> gcc7-pr79389.patch
> 
> Full untested patch.

I can confirm that MonteCarlo benchmark considerably improves with -O2, but please note that your patch is ineffective with -Ofast for some reason, and the testcase still compiles to a jump.
Comment 12 Jakub Jelinek 2017-02-22 17:08:10 UTC
At -O3 it is caused by the r238005 change as written earlier, that is really not a beneficial optimization in this case.  -O3 -fno-split-paths should still see the benefit of the patch I've attached.  So we need some patch from Richard too or a way to undo the damage afterwards (still in GIMPLE or on RTL).
Comment 13 Jakub Jelinek 2017-02-22 18:50:26 UTC
Note that my patch doesn't bootstrap, so I'll need to debug it.
Comment 14 Richard Biener 2017-02-23 12:02:21 UTC
Created attachment 40818 [details]
patch

Patch I am testing for the loop splitting cost model.  I still believe loop splitting on its own is not profitable and it should be integrated with threading and CSE.
Comment 15 Richard Biener 2017-02-23 14:43:36 UTC
(In reply to Richard Biener from comment #14)
> Created attachment 40818 [details]
> patch
> 
> Patch I am testing for the loop splitting cost model.  I still believe loop
> splitting on its own is not profitable and it should be integrated with
> threading and CSE.

It regresses

FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-not dom3 "bit_xor"
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 0;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 1;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_[0-9]+. with constant .0." 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_[0-9]+. with constant .1." 1

this testcase wants to "thread"

   if (bufferstep != 0)
    ...
   bufferstep ^= 1;

once which the backward threading code does only if we first split the path.
This effectively rotates the loop.  The threading doesn't seem to matter
though, what matters is the xor that can be conditionally simplified.

Thus another "enabler" for path splitting would be

  if (x ==/!= CST)
    ...
  use-of-x;

so we can constant propagate.
Comment 16 Richard Biener 2017-02-23 14:53:16 UTC
Created attachment 40819 [details]
revised patch
Comment 17 Jeffrey A. Law 2017-02-23 21:37:05 UTC
Note this was extracted from the main motivator for path splitting, so it's probably wise to make sure we get this right.

The key is blob of code:

for ()
  {
     ...
     if (bufferstep)
        outputbuffer = (delta << 4) & 0xf0;
      else
        *outp++ = (delta & 0x0f) | outputbuffer;
      bufferstep = !bufferstep;
  }

Path splitting essentially turns that into:


for ()
  {
     ...
     if (bufferstep)
       {
         outputbuffer = (delta << 4) & 0xf0;
         bufferstep = !bufferstep;
       }
      else
       {
         *outp++ = (delta & 0x0f) | outputbuffer;
         bufferstep = !bufferstep;
       }
  }

DOM should know that bufferstep has the range [0,1] and thus can record a path sensitive equivalence for bufferstep on both arms resulting in:


for ()
  {
     ...
     if (bufferstep)
       {
         outputbuffer = (delta << 4) & 0xf0;
         bufferstep = 0;
       }
      else
       {
         *outp++ = (delta & 0x0f) | outputbuffer;
         bufferstep = 1;
       }
  }

And we propagate the assignment to bufferstep away into a PHI node.

As long as we preserve that (simplification of the xor), we're good.

Long term we'd like to thread the if (bufferstep) test since bufferstep is just a flip-flop and the state for every iteration can be known at compile time.  BUt that's a future TODO.  Again, preserve the ability to simplify the assignment to bufferstep at the bottom of the loop and we're good.
Comment 18 Jakub Jelinek 2017-02-23 22:05:52 UTC
Author: jakub
Date: Thu Feb 23 22:05:19 2017
New Revision: 245690

URL: https://gcc.gnu.org/viewcvs?rev=245690&root=gcc&view=rev
Log:
	PR tree-optimization/79389
	* ifcvt.c (struct noce_if_info): Add rev_cond field.
	(noce_reversed_cond_code): New function.
	(noce_emit_store_flag): Use rev_cond if non-NULL instead of
	reversed_comparison_code.  Formatting fix.
	(noce_try_store_flag): Test rev_cond != NULL in addition to
	reversed_comparison_code.
	(noce_try_store_flag_constants): Likewise.
	(noce_try_store_flag_mask): Likewise.
	(noce_try_addcc): Use rev_cond if non-NULL instead of
	reversed_comparison_code.
	(noce_try_cmove_arith): Likewise.  Formatting fixes.
	(noce_try_minmax, noce_try_abs): Clear rev_cond.
	(noce_find_if_block): Initialize rev_cond.
	(find_cond_trap): Call noce_get_condition with then_bb == trap_bb
	instead of false as last argument never attempt to reverse it
	afterwards.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ifcvt.c
Comment 19 Richard Biener 2017-02-24 08:05:04 UTC
Author: rguenth
Date: Fri Feb 24 08:04:31 2017
New Revision: 245696

URL: https://gcc.gnu.org/viewcvs?rev=245696&root=gcc&view=rev
Log:
2017-02-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/79389
	* gimple-ssa-split-paths.c (is_feasible_trace): Verify more
	properly that a threading opportunity exists.  Detect conditional
	copy/constant propagation opportunities.

	* gcc.dg/tree-ssa/split-path-10.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-10.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-split-paths.c
    trunk/gcc/testsuite/ChangeLog
Comment 20 Richard Biener 2017-02-24 08:05:12 UTC
Fixed.
Comment 21 Richard Biener 2017-02-24 11:52:04 UTC
Author: rguenth
Date: Fri Feb 24 11:51:33 2017
New Revision: 245713

URL: https://gcc.gnu.org/viewcvs?rev=245713&root=gcc&view=rev
Log:
2017-02-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/79389
	* gimple-ssa-split-paths.c (is_feasible_trace): Properly skip
	debug insns.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-split-paths.c