Created attachment 40676 [details]
The relevant source code and generated assembler before/after this change

The MonteCarlo test from the old SciMark2 (http://math.nist.gov/scimark) used
in the Phoronix compiler test suite has regressed by 30% compared to GCC 6.3
when run on an Intel i7-6800K (Broadwell) (compiled with "-O3 -march=native").

The regression was introduced in r238005:

------------------------------------------------------------------------
r238005 | rguenth | 2016-07-05 15:25:47 +0200 (tis, 05 jul 2016) | 7 lines

2016-07-05  Richard Biener  <rguenther@suse.de>

	* gimple-ssa-split-paths.c
	(find_block_to_duplicate_for_splitting_paths): Handle empty else
	block.
	(is_feasible_trace): Likewise.
	(split_paths): Likewise.

------------------------------------------------------------------------

and has the effect that the if-statement in the loop

    for (count=0; count<Num_samples; count++)
      {
        double x= Random_nextDouble(R);
        double y= Random_nextDouble(R);
        if ( x*x + y*y <= 1.0)
           under_curve ++;
      }

is changed from a cmov to a branch, which mispredicts.
Possibly related to PR77283. Confirmed by looking at http://gcc.opensuse.org/c++bench-czerny/scimark2/
So we think the splitting is useful because on one path it exposes that
under_curve is unchanged, which might expose a jump-threading opportunity.
That doesn't happen in this case, so the potential jump-threading opportunity
needs to be verified by further heuristics (a use in a conditional, maybe?).
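For reference, a source-level sketch of what the split-paths transform does to
this loop: the shared loop latch is duplicated into both arms of the if, which
is why the -O3 assembly later in this thread has two add/cmp/jne tails. This
is only an illustration of the resulting CFG shape, not the output of any GCC
pass:

--cut here--
extern double Random_nextDouble (void);

/* Before path splitting: both arms join at a single latch.  */
int monte_carlo_joined (int Num_samples)
{
  int count = 0, under_curve = 0;
  while (count < Num_samples)
    {
      double x = Random_nextDouble ();
      double y = Random_nextDouble ();
      if (x*x + y*y <= 1.0)
        under_curve++;
      count++;                      /* shared latch */
    }
  return under_curve;
}

/* After path splitting: the latch is duplicated into each arm.  */
int monte_carlo_split (int Num_samples)
{
  int count = 0, under_curve = 0;
  while (count < Num_samples)
    {
      double x = Random_nextDouble ();
      double y = Random_nextDouble ();
      if (x*x + y*y <= 1.0)
        {
          under_curve++;
          count++;                  /* duplicated latch, then-arm */
        }
      else
        count++;                    /* duplicated latch, else-arm */
    }
  return under_curve;
}
--cut here--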
Please note that clang if-converts:

      if ( x*x + y*y <= 1.0)
         under_curve ++;

to SETcc + ADD:

  5,64 │       movsd  (%rsp),%xmm1
       │       mulsd  %xmm1,%xmm1
       │       mulsd  %xmm0,%xmm0
 23,59 │       addsd  %xmm1,%xmm0
 18,54 │       ucomis 0xb32(%rip),%xmm0        # 403190 <TINY_LU_SIZE+0x258>
 10,88 │       setbe  %al
 12,31 │       movzbl %al,%eax
 10,68 │       add    %eax,%ebp
  7,11 │       dec    %ebx
       │     ↑ jne    30

OTOH, gcc emits LEA + CMOVE to conditionally increase under_curve:

  7,46 │       movsd  0x8(%rsp),%xmm1
       │       lea    0x1(%rbx),%eax
       │       movsd  0x861(%rip),%xmm2        # 403448 <RESOLUTION_DEFAULT+0x48>
       │       mulsd  %xmm0,%xmm0
 25,81 │       mulsd  %xmm1,%xmm1
       │       addsd  %xmm0,%xmm1
 22,23 │       comisd %xmm1,%xmm2
  8,43 │       cmovae %eax,%ebx
 21,60 │       add    $0x1,%ebp
       │       cmp    %ebp,%r13d
       │     ↑ jne    30

The LEA insn moves ebx+1 to eax, and the CMOVE effectively decides between ebx
and ebx+1.

Considering that CMOVE is problematic for some x86 targets, IMO the clang code
is more "universal" for these short loops, since CMOVE is avoided. Note also
that clang reverses the loop to use a DEC insn instead of ADD/CMP.

Putting loop reversal aside, this is a case of missing if-conversion.
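At the source level, that SETcc + ADD shape corresponds to adding the
comparison result directly instead of branching on it. A minimal sketch (the
function names are illustrative, not from SciMark2):

--cut here--
/* Branchy form: compiles to a jump or a cmov.  */
int count_branchy (double x, double y, int under_curve)
{
  if (x*x + y*y <= 1.0)
    under_curve++;
  return under_curve;
}

/* Branch-free form: the comparison yields 0 or 1 (SETcc + MOVZBL on x86),
   which is added unconditionally -- the shape clang produces above.  */
int count_setcc (double x, double y, int under_curve)
{
  return under_curve + (x*x + y*y <= 1.0);
}
--cut here--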
The testcase for missing if-conversion:

--cut here--
extern double Random_nextDouble (void);

int foo (int Num_samples)
{
  int count;
  int under_curve = 0;

  for (count=0; count<Num_samples; count++)
    {
      double x= Random_nextDouble ();
      double y= Random_nextDouble ();
      if ( x*x + y*y <= 1.0)
        under_curve ++;
    }

  return under_curve;
}
--cut here--

gcc -O2:

        xorl    %ebx, %ebx
...
.L5:
        call    Random_nextDouble
        movsd   %xmm0, 8(%rsp)
        call    Random_nextDouble
        movsd   8(%rsp), %xmm1
++>     leal    1(%rbx), %eax
        mulsd   %xmm0, %xmm0
        mulsd   %xmm1, %xmm1
        movsd   .LC0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        ucomisd %xmm1, %xmm2
-->     cmovnb  %eax, %ebx
        addl    $1, %ebp
        cmpl    %ebp, %r12d
        jne     .L5

gcc -O3 is even more "innovative":

        xorl    %r12d, %r12d
...
.L2:
        call    Random_nextDouble
        movsd   %xmm0, 8(%rsp)
        call    Random_nextDouble
        movsd   8(%rsp), %xmm1
        mulsd   %xmm0, %xmm0
        mulsd   %xmm1, %xmm1
        movsd   .LC0(%rip), %xmm2
        addsd   %xmm0, %xmm1
        ucomisd %xmm1, %xmm2
-->     jnb     .L13
        addl    $1, %ebx
        cmpl    %ebx, %ebp
        jne     .L2

.L13:
        addl    $1, %ebx
++>     addl    $1, %r12d
        cmpl    %ebp, %ebx
        jne     .L2

If-conversion to SETcc + ZEXT + ADD would avoid the cmov/jump in this short
loop.

(I'm not sure if the tree or RTL ifconvert pass should catch this situation.)
On Wed, 22 Feb 2017, ubizjak at gmail dot com wrote:

> The testcase for missing if-conversion:
> [...]
> IF-conversion to SETcc + ZEXT + ADD would avoid cmove/jump in this short
> loop.
>
> (I'm not sure if tree or RTL ifconvert pass should catch this situation.)

There is only RTL ifconvert left.
Well, at least in the -O2 case, it clearly shows that (RTL) if-conversion does
happen.

If SETCC + ZEXT + ADD is always a win compared to LEAL x+1 + CMOV*, then we
should just expand the cmov in that case to that, or combine or peephole it to
that.

Now, with -m32 -march=i586 -O2 (so that cmov can't be used), I see
noce_try_addcc (which is exactly that, SETCC + ZEXT + ADD) fails because the
condition is complex:

(ne (and:QI (subreg:QI (zero_extract:SI (reg:HI 100)
                (const_int 8 [0x8])
                (const_int 8 [0x8])) 0)
        (const_int 5 [0x5]))
    (const_int 0 [0]))

and in that case it attempts to emit:

(insn 53 0 0 (set (reg:SI 102)
        (eq:SI (reg:CCNO 17 flags)
            (const_int 0 [0]))) -1
     (nil))

but that isn't recognized by the backend.
For -O2 -m64 the reason why noce_try_addcc doesn't trigger is that we have an
if_info->cond of

(unlt (reg:DF 100)
    (reg:DF 97))

and reverse_comparison_code on it returns UNKNOWN. That UNLT comes from
noce_get_condition, which calls canonicalize_condition on the original

(ge (reg:CCFP 17 flags)
    (const_int 0 [0]))

condition with reverse set to true. Now, if I call canonicalize_condition on
the original ge with reverse set to false, I get:

(ge (reg:DF 100)
    (reg:DF 97))

The FPU compare reversals are just way too many to be understandable; I have
no idea why UNLT can't be reversed, while GE can be reversed to UNLT.
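(For context on the GE/UNLT pairing: under IEEE semantics the logical negation
of a >= b must account for NaNs, so reversing GE yields UNLT, "unordered or
less than". A small standalone illustration of just the comparison semantics,
nothing GCC-internal:)

--cut here--
#include <math.h>
#include <stdio.h>

int main (void)
{
  double a = NAN, b = 1.0;

  /* Both ordered comparisons are false when an operand is NaN, so
     !(a >= b) is not the same as (a < b); the logical reverse of
     GE is UNLT.  */
  printf ("a >= b: %d  a < b: %d  !(a >= b): %d\n",
          a >= b, a < b, !(a >= b));
  return 0;
}
--cut here--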
(In reply to Jakub Jelinek from comment #6)
> Now, with -m32 -march=i586 -O2 (so that cmov can't be used), I see
> noce_try_addcc (which is exactly that, SETCC + ZEXT + ADD) fails because
> the condition is complex:
> [...]

You can use -march=i686, so fcomi will be emitted instead of
fcomp/fnstsw/test.
With:

--- ifcvt.c.jj2	2017-01-23 18:09:45.000000000 +0100
+++ ifcvt.c	2017-02-22 14:25:15.758296571 +0100
@@ -777,6 +777,9 @@ struct noce_if_info
   /* The jump condition.  */
   rtx cond;
 
+  /* Reversed jump condition.  */
+  rtx rev_cond;
+
   /* New insns should be inserted before this one.  */
   rtx_insn *cond_earliest;
 
@@ -898,7 +901,7 @@ noce_emit_store_flag (struct noce_if_inf
       && (normalize == 0 || STORE_FLAG_VALUE == normalize))
     {
       rtx src = gen_rtx_fmt_ee (code, GET_MODE (x), XEXP (cond, 0),
-			    XEXP (cond, 1));
+				XEXP (cond, 1));
       rtx set = gen_rtx_SET (x, src);
 
       start_sequence ();
@@ -1553,11 +1556,10 @@ noce_try_addcc (struct noce_if_info *if_
 
   if (GET_CODE (if_info->a) == PLUS
       && rtx_equal_p (XEXP (if_info->a, 0), if_info->b)
-      && (reversed_comparison_code (if_info->cond, if_info->jump)
-	  != UNKNOWN))
+      && if_info->rev_cond)
     {
-      rtx cond = if_info->cond;
-      enum rtx_code code = reversed_comparison_code (cond, if_info->jump);
+      rtx cond = if_info->rev_cond;
+      enum rtx_code code = GET_CODE (cond);
 
       /* First try to use addcc pattern.  */
       if (general_operand (XEXP (cond, 0), VOIDmode)
@@ -4064,6 +4066,11 @@ noce_find_if_block (basic_block test_bb,
   if_info.else_bb = else_bb;
   if_info.join_bb = join_bb;
   if_info.cond = cond;
+  rtx_insn *rev_cond_earliest;
+  if_info.rev_cond = noce_get_condition (jump, &rev_cond_earliest,
+					 !then_else_reversed);
+  gcc_assert (if_info.rev_cond == NULL_RTX
+	      || rev_cond_earliest == cond_earliest);
   if_info.cond_earliest = cond_earliest;
   if_info.jump = jump;
   if_info.then_else_reversed = then_else_reversed;

I get:

        comisd  %xmm1, %xmm2
        sbbl    $-1, %ebp

instead of:

        leal    1(%rbx), %eax
...
        comisd  %xmm1, %xmm2
        cmovnb  %eax, %ebx

(plus slightly different RA).
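(Why the sbbl works: sbbl $-1, %ebp computes ebp - (-1) - CF = ebp + 1 - CF,
and comisd %xmm1, %xmm2 sets CF exactly when %xmm2 < %xmm1, so %ebp is
incremented exactly when 1.0 >= x*x + y*y: a branch-free conditional
increment. A tiny self-contained check of the arithmetic identity, with
illustrative names:)

--cut here--
#include <assert.h>

/* What "sbb $-1, x" computes for a given carry flag value cf.  */
static unsigned sbb_minus_one (unsigned x, unsigned cf)
{
  return x - (unsigned) -1 - cf;   /* == x + 1 - cf (mod 2^32) */
}

int main (void)
{
  assert (sbb_minus_one (41, 0) == 42);   /* CF clear: increment */
  assert (sbb_minus_one (41, 1) == 41);   /* CF set: unchanged */
  return 0;
}
--cut here--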
Created attachment 40815 [details]
gcc7-pr79389.patch

Full untested patch.
(In reply to Jakub Jelinek from comment #10)
> Created attachment 40815 [details]
> gcc7-pr79389.patch
>
> Full untested patch.

I can confirm that the MonteCarlo benchmark improves considerably with -O2,
but please note that your patch is ineffective with -Ofast for some reason,
and the testcase still compiles to a jump.
At -O3 it is caused by the r238005 change as written earlier; that is really
not a beneficial optimization in this case. -O3 -fno-split-paths should still
see the benefit of the patch I've attached. So we need some patch from
Richard too, or a way to undo the damage afterwards (still in GIMPLE, or on
RTL).
Note that my patch doesn't bootstrap, so I'll need to debug it.
Created attachment 40818 [details]
patch

Patch I am testing for the path splitting cost model. I still believe path
splitting on its own is not profitable and that it should be integrated with
threading and CSE.
(In reply to Richard Biener from comment #14)
> Patch I am testing for the path splitting cost model. I still believe path
> splitting on its own is not profitable and that it should be integrated
> with threading and CSE.

It regresses

FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-not dom3 "bit_xor"
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 0;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Folded to: _[0-9]+ = 1;" 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_[0-9]+. with constant .0." 1
FAIL: gcc.dg/tree-ssa/pr69270.c scan-tree-dump-times dom3 "Replaced .bufferstep_[0-9]+. with constant .1." 1

This testcase wants to "thread"

  if (bufferstep != 0)
  ...
  bufferstep ^= 1;

once, which the backward threading code does only if we first split the path.
This effectively rotates the loop. The threading doesn't seem to matter
though; what matters is the xor that can be conditionally simplified.

Thus another "enabler" for path splitting would be

  if (x ==/!= CST)
  ...
  use-of-x;

so we can constant propagate.
Created attachment 40819 [details]
revised patch
Note this was extracted from the main motivator for path splitting, so it's
probably wise to make sure we get this right. The key blob of code is:

  for ()
    {
      ...
      if (bufferstep)
        outputbuffer = (delta << 4) & 0xf0;
      else
        *outp++ = (delta & 0x0f) | outputbuffer;
      bufferstep = !bufferstep;
    }

Path splitting essentially turns that into:

  for ()
    {
      ...
      if (bufferstep)
        {
          outputbuffer = (delta << 4) & 0xf0;
          bufferstep = !bufferstep;
        }
      else
        {
          *outp++ = (delta & 0x0f) | outputbuffer;
          bufferstep = !bufferstep;
        }
    }

DOM should know that bufferstep has the range [0,1] and thus can record a
path-sensitive equivalence for bufferstep on both arms, resulting in:

  for ()
    {
      ...
      if (bufferstep)
        {
          outputbuffer = (delta << 4) & 0xf0;
          bufferstep = 0;
        }
      else
        {
          *outp++ = (delta & 0x0f) | outputbuffer;
          bufferstep = 1;
        }
    }

And we propagate the assignment to bufferstep away into a PHI node. As long as
we preserve that (simplification of the xor), we're good.

Long term we'd like to thread the if (bufferstep) test, since bufferstep is
just a flip-flop and the state for every iteration can be known at compile
time. But that's a future TODO. Again, preserve the ability to simplify the
assignment to bufferstep at the bottom of the loop and we're good.
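(To make the flip-flop observation concrete, here is a minimal, self-contained
sketch of the long-term threading idea: since bufferstep alternates
0,1,0,1,..., the loop can be unrolled by two and the test removed in the
steady state. All names and bodies are illustrative stand-ins for the ADPCM
code, not the output of any GCC pass:)

--cut here--
#include <stdio.h>

/* Original shape: one flip-flop test per iteration.  */
static int sum_with_test (int n)
{
  int bufferstep = 0, acc = 0;
  for (int i = 0; i < n; i++)
    {
      if (bufferstep)
        acc += 2;               /* stand-in for the bufferstep == 1 arm */
      else
        acc += 1;               /* stand-in for the bufferstep == 0 arm */
      bufferstep = !bufferstep;
    }
  return acc;
}

/* Threaded shape: unroll by two, no test in the steady state.  */
static int sum_threaded (int n)
{
  int acc = 0, i = 0;
  for (; i + 2 <= n; i += 2)
    acc += 1 + 2;               /* bufferstep == 0 arm, then == 1 arm */
  if (i < n)
    acc += 1;                   /* odd trip count: one bufferstep == 0 arm */
  return acc;
}

int main (void)
{
  for (int n = 0; n < 8; n++)
    if (sum_with_test (n) != sum_threaded (n))
      printf ("mismatch at %d\n", n);
  return 0;
}
--cut here--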
Author: jakub
Date: Thu Feb 23 22:05:19 2017
New Revision: 245690

URL: https://gcc.gnu.org/viewcvs?rev=245690&root=gcc&view=rev
Log:
	PR tree-optimization/79389
	* ifcvt.c (struct noce_if_info): Add rev_cond field.
	(noce_reversed_cond_code): New function.
	(noce_emit_store_flag): Use rev_cond if non-NULL instead of
	reversed_comparison_code.  Formatting fix.
	(noce_try_store_flag): Test rev_cond != NULL in addition to
	reversed_comparison_code.
	(noce_try_store_flag_constants): Likewise.
	(noce_try_store_flag_mask): Likewise.
	(noce_try_addcc): Use rev_cond if non-NULL instead of
	reversed_comparison_code.
	(noce_try_cmove_arith): Likewise.  Formatting fixes.
	(noce_try_minmax, noce_try_abs): Clear rev_cond.
	(noce_find_if_block): Initialize rev_cond.
	(find_cond_trap): Call noce_get_condition with then_bb == trap_bb
	instead of false as last argument, never attempt to reverse it
	afterwards.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/ifcvt.c
Author: rguenth
Date: Fri Feb 24 08:04:31 2017
New Revision: 245696

URL: https://gcc.gnu.org/viewcvs?rev=245696&root=gcc&view=rev
Log:
2017-02-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/79389
	* gimple-ssa-split-paths.c (is_feasible_trace): Verify more
	properly that a threading opportunity exists.  Detect conditional
	copy/constant propagation opportunities.

	* gcc.dg/tree-ssa/split-path-10.c: New testcase.

Added:
	trunk/gcc/testsuite/gcc.dg/tree-ssa/split-path-10.c
Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/gimple-ssa-split-paths.c
	trunk/gcc/testsuite/ChangeLog
Fixed.
Author: rguenth
Date: Fri Feb 24 11:51:33 2017
New Revision: 245713

URL: https://gcc.gnu.org/viewcvs?rev=245713&root=gcc&view=rev
Log:
2017-02-24  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/79389
	* gimple-ssa-split-paths.c (is_feasible_trace): Properly skip
	debug insns.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/gimple-ssa-split-paths.c