Bug 56309 - conditional moves instead of compare and branch result in almost 2x slower code
Summary: conditional moves instead of compare and branch result in almost 2x slower code
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.7.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: cmov
Reported: 2013-02-13 20:21 UTC by arturomdn
Modified: 2021-09-04 20:33 UTC
CC: 13 users

See Also:
Host:
Target: x86
Build:
Known to work:
Known to fail:
Last reconfirmed: 2013-02-14 00:00:00


Attachments
Self contained source file with parameter x passed by value (slow) (90.29 KB, application/octet-stream)
2013-02-13 20:21 UTC, arturomdn
Self contained source file with parameter x passed by reference (fast) (90.29 KB, application/octet-stream)
2013-02-13 20:22 UTC, arturomdn
simple reproducer (623 bytes, text/x-csrc)
2015-08-06 09:35 UTC, Yuri Rumyantsev

Description arturomdn 2013-02-13 20:21:06 UTC
Created attachment 29442 [details]
Self contained source file with parameter x passed by value (slow)

This bug report reflects the analysis of a question asked in stackoverflow

http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939

When an unsigned long long parameter to a function is passed by reference instead of by value the result is a dramatic almost 2x improvement in speed when compiled with -O3.  Given that the function is inlined this is unexpected.  Upon closer inspection it was found that the code generated is quite different, as if passing the parameter by value enables an optimization (use of x86 conditional moves) that backfires, possibly by suffering an unexpected stall in the processor.

Two files are attached

by-val-O3.ii
by-ref-O3.ii

They differ only in the way the unsigned long long parameter "x" is passed.

./by-val-O3
Took 11.85 seconds total.

./by-ref-O3
Took 6.67 seconds total.
Comment 1 arturomdn 2013-02-13 20:22:51 UTC
Created attachment 29443 [details]
Self contained source file with parameter x passed by reference (fast)

This source file differs from by-val-O3.ii only in the way parameter "x" is passed.
Comment 2 Andrew Pinski 2013-02-13 20:25:58 UTC
Which target is this on?  On some (most non x86 targets) conditional moves are faster than compare and branch.
Comment 3 arturomdn 2013-02-13 20:29:12 UTC
Intel Xeon X5570 @ 2.93GHz

(In reply to comment #2)
> Which target is this on?  On some (most non x86 targets) conditional moves are
> faster than compare and branch.
Comment 4 Uroš Bizjak 2013-02-14 08:16:26 UTC
(In reply to comment #0)

> When an unsigned long long parameter to a function is passed by reference
> instead of by value the result is a dramatic almost 2x improvement in speed
> when compiled with -O3.  Given that the function is inlined this is unexpected.
>  Upon closer inspection it was found that the code generated is quite
> different, as if passing the parameter by value enables an optimization (use of
> x86 conditional moves) that backfires, possibly by suffering an unexpected
> stall in the processor.

This is due to an x86 processor deficiency (HW bug?) with cmov insns. Please see the explanations in PR53346 and PR54073.

There is no solution (yet?); the compiler should probably detect this situation and avoid the conversion to cmov.
Comment 5 Uroš Bizjak 2013-02-14 08:22:33 UTC
For reference, some quotes from Honza:

-- PR54073--
The decision on whether to use cmov or jmp was always tricky on x86
architectures. Cmov increases dependency chains and register pressure (both
values need to be loaded in) and has a long opcode. So a jump sequence, if well
predicted, flows better through the out-of-order core. If badly predicted it
is, of course, a disaster. I think more modern CPUs solved the problems with
the long latency of cmov, but the dependency chains are still there.

[...]

We should do something about rnflow. I will look into that.
The usual wisdom is that, lacking profile feedback, one should handle non-loop
branches as unpredictable and loop branches as predictable, so everything
handled by ifconvert falls into the first category. With profile feedback one
can see the branch probability and, if it is close to 0 or REG_BR_PROB_BASE,
treat the branch as predictable. We handle this with the predictable_edge_p
parameter passed to BRANCH_COST (which is itself gross, but for years we were
not able to come up with something saner).
-- /PR54073 --

-- PR53046 --
Well, as I wrote in the other PR, the main problem with cmov is the extension
of the dependency chain.  For a well-predicted sequence with a conditional jump
there is no update of rbx, so the loop executes faster, because the
loads/stores/comparisons execute "in parallel". The load in the next iteration
can then happen speculatively before the condition from the previous iteration
is resolved. With cmov in it, there is a dependence on rbx for all the other
computations in the loop.

I guess there is no locally available information suggesting that the
particular branch is well predictable, at least without profile feedback (where
we won't disable the conversion anyway).

[...]
-- /PR53046 --
Comment 6 Uroš Bizjak 2013-02-14 08:27:52 UTC
(In reply to comment #5)
> -- PR53046 --
> [...]
> -- /PR53046 --

Actually, PR53346.
Comment 7 Yuri Rumyantsev 2013-02-14 14:49:31 UTC
My 2 cents.

We can enhance the if-conversion phase in GCC by analyzing the conditional jump probability and, if we have a skewed hammock, not perform the transformation. In general this requires profiling, but for this specific example we can use a so-called "static profiler"; e.g. in our example we have
    if (tmp >= imax)
where imax = 1 << 32, i.e. a huge constant. For such conditions we can assume that the branch is likely not taken and reject if-conversion of such a hammock.

The other possibility is to implement support for __builtin_expect in GCC, but that requires source code changes too (a sketch of such an annotation follows).
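
[Editor's note, not part of the original report: a minimal sketch of what such a source-level annotation could look like, using the tmp/imax/numbits/carry names from the excerpt above and assuming the compiler honors the hint when deciding whether to if-convert.]

    /* Mark the carry-overflow branch as unlikely, giving a static hint
       that this hammock is heavily skewed.  */
    if (__builtin_expect(tmp >= imax, 0)) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
    } else {
            carry = 0;
    }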
Comment 8 arturomdn 2013-02-14 15:53:15 UTC
It is possible (just a guess) that the extra compare is causing an interlock in the processor since the first cmov is issued speculatively and the condition won't be confirmed until the first compare has executed.  Someone from Intel could tell us exactly why the original sequence is so disastrous and suggest an alternative that still uses cmov and is better than jmp.

I wonder if instead of emitting this sequence

   shr    $0x20,%rdi
   and    $0xffffffff,%ecx
   cmp    %r8,%rdx
   cmovbe %r11,%rdi
   add    $0x1,%rax
   cmp    %r8,%rdx
   cmovbe %rdx,%rcx

it would do this instead

   shr    $0x20,%rdi
   and    $0xffffffff,%ecx
   add    $0x1,%rax
   cmp    %r8,%rdx
   cmovbe %r11,%rdi
   cmovbe %rdx,%rcx
Comment 9 arturomdn 2013-02-14 16:00:49 UTC
I found in the Intel optimization guide an example of this idiom of comparing once and issuing two cmov back-to-back... so the problem isn't the two cmov, but possibly introducing the 2nd compare that splits them.

not_equal:
   movzx eax, BYTE PTR[esi+edx]
   movzx edx, BYTE PTR[edi+edx]
   cmp eax, edx
   cmova eax, ONE
   cmovb eax, NEG_ONE
   jmp ret_tag

Taken from example 10-16 of
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
Comment 10 arturomdn 2013-02-14 16:43:23 UTC
Might be worth mentioning here what I said in the stackoverflow answer, that in this particular case the entire conditional branch can be avoided because it is redundant.

This code

    if (tmp >= imax) {
        carry = tmp >> numbits;    // <---- A
        tmp &= imax - 1;                // <---- B
    } else {
        carry = 0;                            // <---- C
    }

can be reduced to

    carry = tmp >> numbits;
    tmp &= imax - 1;

Proof:
    1) numbits is 32
    2) imax is 1ULL << 32 so lower 32 bits are zero
    3) imax - 1 is 0xFFFFFFFF (see 2)
    4) if tmp >= imax then tmp has bits set in upper 32 bits
    5) otherwise if tmp < imax then tmp does not have bits set in upper 32 bits
    6) statement "A" is equivalent to "C" when tmp < imax (because of 4)
    7) statement "B" is a NOP when tmp < imax (because of 3 and 5)
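
[Editor's note: for completeness, a small standalone check of this reduction (an editor's sketch, not one of the attached testcases), assuming numbits == 32 and imax == 1ULL << 32 as in the proof above.]

    #include <cassert>
    #include <initializer_list>

    int main()
    {
        const unsigned numbits = 32;
        const unsigned long long imax = 1ULL << numbits;

        for (unsigned long long val : { 0ULL, 1ULL, imax - 1, imax,
                                        imax + 1, ~0ULL }) {
            // Original branchy form.
            unsigned long long t1 = val, carry1;
            if (t1 >= imax) {
                carry1 = t1 >> numbits;
                t1 &= imax - 1;
            } else {
                carry1 = 0;
            }
            // Reduced branchless form.
            unsigned long long t2 = val;
            unsigned long long carry2 = t2 >> numbits;
            t2 &= imax - 1;

            assert(t1 == t2 && carry1 == carry2);
        }
        return 0;
    }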
Comment 11 Steven Bosscher 2013-02-14 16:59:04 UTC
(In reply to comment #8)
> I wonder if instead of emitting this sequence
> 
>    shr    $0x20,%rdi
>    and    $0xffffffff,%ecx
>    cmp    %r8,%rdx
>    cmovbe %r11,%rdi
>    add    $0x1,%rax
>    cmp    %r8,%rdx
>    cmovbe %rdx,%rcx
> 
> it would do this instead
> 
>    shr    $0x20,%rdi
>    and    $0xffffffff,%ecx
>    add    $0x1,%rax
>    cmp    %r8,%rdx
>    cmovbe %r11,%rdi
>    cmovbe %rdx,%rcx

GCC fails to do so because the flags are clobbered between the two
cmovs, preventing code motion to group the two cmovs:

  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

If you do this change manually in your code (compile with -S, "fix"
the .s file and assemble it), does that speed up your code?
Comment 12 Steven Bosscher 2013-02-14 17:10:02 UTC
A bit more clear with insn 195 added:

  195: flags:CC=cmp(r124:DI,r235:DI)
  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

insns 195 and 201 compute the same condition but GCC can't eliminate
the comparison in insns 201 because insn 199 clobbers the flags (i.e.
destroys the result from insn 195).

As for speed, of course I can measure that myself:

--- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
+++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
@@ -357,9 +357,8 @@
        shrq    $32, %rdi
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
-       addq    $1, %rax
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
+       addq    $1, %rax
        cmpq    %rbp, %rax
        movq    %rcx, -8(%rsi,%rax,8)
        jne     .L50

unmodified: Took 14.31 seconds total.
modified:   Took 13.04 seconds total.

So re. comment #9: it's not the problem but it'd be a small improvement.
Comment 13 Steven Bosscher 2013-02-14 17:26:08 UTC
$ g++ --version
g++ (GCC) 4.7.2 ...
$ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
Took 10.25 seconds total.
Comment 14 arturomdn 2013-02-14 17:30:54 UTC
I also did the experiment, with the same results... it got faster but not as
fast as the version with
conditional branch instead of conditional moves:

./by-ref-O3
Took 6.65 seconds total.

./by-val-O3 
Took 11.64 seconds total.

./by-val-fixed-O3 
Took 9.94 seconds total.

--- by-val-O3.s 2013-02-14 11:27:28.109856000 -0600
+++ by-val-fixed-O3.s   2013-02-14 11:11:07.312317000 -0600
@@ -679,13 +679,12 @@
        shrq    $32, %rdi
        .loc 4 25 0
        andl    $4294967295, %ecx
+       addq    $1, %rax
        cmpq    %r8, %rdx
        cmovbe  %r11, %rdi
 .LVL43:
        .loc 4 29 0
-       addq    $1, %rax
 .LVL44:
-       cmpq    %r8, %rdx
        cmovbe  %rdx, %rcx
 .LVL45:
        .loc 4 21 0
Comment 15 Uroš Bizjak 2013-02-14 17:34:15 UTC
(In reply to comment #13)
> $ g++ --version
> g++ (GCC) 4.7.2 ...
> $ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
> Took 10.25 seconds total.

This is a red herring, see PR53346c11:

--q--
You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option
makes all the difference!
--/q--

Please try with -ftree-vectorize -fno-tree-loop-if-convert.
Comment 16 arturomdn 2013-02-14 17:42:55 UTC
With -ftree-vectorize -fno-tree-loop-if-convert flags it generated this for the loop in question:

.L39:
        movq    %rdi, %rdx
        addq    (%rsi,%rax,8), %rcx
        imulq   (%r9,%rax,8), %rdx
        addq    %rcx, %rdx
        xorl    %ecx, %ecx
        cmpq    %r10, %rdx
        jbe     .L38
        movq    %rdx, %rcx
        andl    $4294967295, %edx
        shrq    $32, %rcx
.L38:
        addq    $1, %rax
        cmpq    %r8, %rax
        movq    %rdx, -8(%rsi,%rax,8)
        jne     .L39

And it executed fast:

./by-val-O3-flags
Took 6.74 seconds total.
Comment 17 Uroš Bizjak 2013-02-14 17:49:46 UTC
(In reply to comment #16)

> And it executed fast:
> 
> ./by-val-O3-flags
> Took 6.74 seconds total.

The solution to all these PRs is trivial: kill -ftree-loop-if-convert on x86, until infrastructure to detect possible stalls is implemented. This decision needs lots of benchmarking, though.
Comment 18 Uroš Bizjak 2013-02-14 18:23:12 UTC
The following patch is a big-hammer approach to the problem, intended only for benchmarking:

--cut here--
Index: common/config/i386/i386-common.c
===================================================================
--- common/config/i386/i386-common.c    (revision 195988)
+++ common/config/i386/i386-common.c    (working copy)
@@ -710,6 +710,8 @@
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
+    /* Turn off -ftree-loop-if-convert by default.  */
+    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },
 
 #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
     SUBTARGET_OPTIMIZATION_OPTIONS,
--cut here--

OTOH, it looks like on x86 the only problematic insns are scalar integer cmov instructions. With some (machine-dependent?) pass inserted at the right spot in the sequence of passes, appropriate cmovs could be decomposed back into jumps, for some yet unknown definition of "appropriate".
Comment 19 Uroš Bizjak 2013-02-14 18:35:05 UTC
Back to "target" component.
Comment 20 Steven Bosscher 2013-02-14 22:17:23 UTC
(In reply to comment #12)
> --- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
> +++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
> @@ -357,9 +357,8 @@
>         shrq    $32, %rdi
>         cmpq    %r8, %rdx
>         cmovbe  %r11, %rdi
> -       addq    $1, %rax
> -       cmpq    %r8, %rdx
>         cmovbe  %rdx, %rcx
> +       addq    $1, %rax
>         cmpq    %rbp, %rax
>         movq    %rcx, -8(%rsi,%rax,8)
>         jne     .L50
> 
> unmodified: Took 14.31 seconds total.
> modified:   Took 13.04 seconds total.
> 
> So re. comment #9: it's not the problem but it'd be a small improvement.

FWIW this comes from not eliminating the condition expression in
the conditional moves that ifcvt creates:

  tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
  carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;

I'm surprised this form is allowed at all, I'd expect we only allow
is_gimple_reg() for a COND_EXPR_COND in a RHS context.

Anyway -- separate problem.
Comment 21 Igor Zamyatin 2013-02-15 06:49:53 UTC
(In reply to comment #18)
> Following patch is a big hammer approach to the problem, intended only for
> benchmarking
> 
> --cut here--
> Index: common/config/i386/i386-common.c
> ===================================================================
> --- common/config/i386/i386-common.c    (revision 195988)
> +++ common/config/i386/i386-common.c    (working copy)
> @@ -710,6 +710,8 @@
>      /* Turn off -fschedule-insns by default.  It tends to make the
>         problem with not enough registers even worse.  */
>      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> +    /* Turn off -ftree-loop-if-convert by default.  */
> +    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },
> 
>  #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
>      SUBTARGET_OPTIMIZATION_OPTIONS,
> --cut here--
> 

But in this case we are giving up vectorization for some cases, aren't we?
Comment 22 Jakub Jelinek 2013-02-15 07:14:44 UTC
Yeah.  We need to either find another way to present the vectorizer with the if-converted statements (perhaps similar to how pattern matching works; I think if-conversion only handles inner-most loops, so maybe just as an alternative gimple_seq for the single bb of the loop body?), or we need some kind of if-unconversion pass.
See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch, which contains a much more limited version of if-unconversion; unfortunately it is going to be harder to handle more complex conditions efficiently.
Comment 23 Richard Biener 2013-02-15 09:33:22 UTC
(In reply to comment #20)
> (In reply to comment #12)
> > --- by-val-O3.s.orig    2013-02-14 18:06:56.000000000 +0100
> > +++ by-val-O3.s 2013-02-14 18:07:23.000000000 +0100
> > @@ -357,9 +357,8 @@
> >         shrq    $32, %rdi
> >         cmpq    %r8, %rdx
> >         cmovbe  %r11, %rdi
> > -       addq    $1, %rax
> > -       cmpq    %r8, %rdx
> >         cmovbe  %rdx, %rcx
> > +       addq    $1, %rax
> >         cmpq    %rbp, %rax
> >         movq    %rcx, -8(%rsi,%rax,8)
> >         jne     .L50
> > 
> > unmodified: Took 14.31 seconds total.
> > modified:   Took 13.04 seconds total.
> > 
> > So re. comment #9: it's not the problem but it'd be a small improvement.
> 
> FWIW this comes from not eliminating the condition expression in
> the conditional moves that ifcvt creates:
> 
>   tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
>   carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;
> 
> I'm surprised this form is allowed at all, I'd expect we only allow
> is_gimple_reg() for a COND_EXPR_COND in a RHS context.

Yeah, it's on my list (even with partial patches available ...).
Note that then vectorizing a COND_EXPR is not different from being
able to vectorize a comparison statement.

   pred_2 = tmp_93 > 4294967295;
   tmp_97 = pred_2 ? tmp_95 : tmp_93;
   carry_105 = pred_2 ? carry_94 : 0;

this form, both vectorized and not vectorized, has issues when doing
initial instruction selection during expand (with multiple uses of
pred_2 we don't TER it).  So it might be that we present combine and
other RTL optimizers with initial code they will not be able to handle
as well as what we do right now.

> Anyway -- separate problem.

Indeed.
Comment 24 Richard Biener 2013-02-15 09:36:23 UTC
(In reply to comment #22)
> Yeah.  We need to either find another way how to present vectorizer with the
> ifconverted statement (perhaps similar to how pattern matching works, I think
> if-conversion only handles inner-most loops, so maybe just as an alternative
> gimple_seq for the single bb of the loop body?), or we need some kind of
> if-unconversion pass.
> See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch which
> contains much more limited version of if-unconversion, unfortunately it is
> going to be harder to handle more complex conditions efficiently.

Well, ideally if-conversion would work as part of vectorization.
if-conversion is nothing but pattern detection - but of course the
complication is that the vectorizer would need to be taught to vectorize
loops with more than a single basic block (even though the vectorized
loop would again contain a single basic block).
Comment 25 Uroš Bizjak 2013-02-17 08:35:17 UTC
(In reply to comment #21)
> (In reply to comment #18)

> But in this case we are giving up vectorization for some cases, aren't we?

Yes, and please note that a vectorized conditional move doesn't result in a cmov instruction, but in a sequence of SSE logic insns. So, as far as x86 targets are concerned, this bug has nothing to do with the vectorizer.

Maybe we should decompose scalar cmovs into the same sequence?
Comment 26 Uroš Bizjak 2013-02-17 08:38:24 UTC
Another analysis by Jake in PR54037:

I have done quite a bit of analysis on cmov performance across x86
architectures, so I will share here in case it helps:

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less-so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
performance of cmov vs cmp+branch is determined more by the alignment of the
target of the branch, than by the prediction rate of the branch.  The
instruction decoders would incur penalties on certain types of unaligned branch
targets (when taken), or when decoding sequences of instructions that contained
multiple branches within a 16byte "fetch" window (taken or not).  cmov was
sometimes handy for avoiding those.

With regard to more current Intel Core and AMD Bulldozer/Bobcat architecture:

I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mis-predicted.  In my tests, the
cmov only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards and
rarely occurs in practice.  Above 97% prediction rates, cmov is typically
slower than cmp+branch. Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck.  This holds true for both Core and Bulldozer. 
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode.  Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces.  In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a controlled
environment.  Furthermore, to my knowledge the status of cmov vs. branch
performance on x86 will not be changing anytime soon.  cmov will continue to be
a liability well into the next couple architecture releases from Intel and AMD.
Piledriver will have added fetch/decode resources but should also have a
smaller mispredict penalty, so it's doubtful cmov will gain much advantage
there either.

Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
matching Intel Core and AMD Bulldozer/Bobcat families.


There is one good use-case for cmov on x86:  Mis-predicted conditions inside of
loops.  Currently there's no way to force that behavior in situations where I,
the programmer, am fully aware that the condition is chaotic/random.  A builtin
cmov or condition hint would be nice.  For now I'm forced to address those
(fortunately infrequent) situations via inline asm.
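
[Editor's note: for illustration only, a minimal sketch (not Jake's code) of the kind of forced cmov the last paragraph refers to, using GNU C extended asm; the function name select_below and its operands a, b and limit are hypothetical.]

    // Returns (a < limit) ? b : a, forced through cmp + cmovb regardless
    // of what the optimizer would otherwise pick for this condition.
    static inline unsigned long long
    select_below(unsigned long long a, unsigned long long b,
                 unsigned long long limit)
    {
        unsigned long long result = a;
        asm("cmp %[lim], %[val]\n\t"   // sets CF iff a < limit
            "cmovb %[alt], %[res]"     // result = b when CF is set
            : [res] "+r" (result)
            : [val] "r" (a), [lim] "r" (limit), [alt] "r" (b)
            : "cc");
        return result;
    }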
Comment 27 Uroš Bizjak 2013-02-17 10:10:16 UTC
Some additional analysis:

http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html
Comment 28 icegood1980@gmail.com 2013-09-22 06:37:15 UTC
Hmmm, I also found that on x86 using __builtin_expect makes the code slower...
Comment 29 Jakub Jelinek 2013-10-18 09:12:13 UTC
FYI, I've posted a patchset to do tree if-conversion only for vectorization and not for scalar basic blocks (and also for conditional load/store support in vectorized loops using AVX vmaskmov* and AVX2 masked gathers) in
http://gcc.gnu.org/ml/gcc-patches/2013-10/msg01373.html
I'd appreciate any benchmark results for that.

On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.
Comment 30 Uroš Bizjak 2013-10-18 10:06:23 UTC
(In reply to Jakub Jelinek from comment #29)

> On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to
> be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.

According to the detailed analysis in comment 26, perhaps a new tuning flag or target hook could be used for if-conversion of scalars. Also, I think that -Os can still use cmovs due to the smaller code size.
Comment 31 rguenther@suse.de 2013-10-18 11:47:11 UTC
On Fri, 18 Oct 2013, ubizjak at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
> 
> --- Comment #30 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Jakub Jelinek from comment #29)
> 
> > On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to
> > be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.
> 
> According to detailed analysis in Comment 26, perhaps a new tuning flag or
> target hook could be used for if-conversion of scalars. Also, I think that -Os
> can still use cmoves due to smaller code size.

But RTL if-conversion still will generate cmov, right?
Comment 32 Jakub Jelinek 2013-10-18 11:57:51 UTC
(In reply to rguenther@suse.de from comment #31)
> But RTL if-conversion still will generate cmov, right?

Sure.

The current if-conversion is performed without any cost model or anything similar, say even for:
void
foo (int *a, int *b, int *c)
{
  int i;
  for (i = 0; i < 1024; i += 32)
    {
      int t1, t2;
#define V(N) \
      t1 = a[i + N]; t2 = b[i + N]; \
      a[i + N] = c[i + N / 8] ? t1 * (N + 1) / 3 + 21: t2 * (N + 3) / 17 + 9;
      V(0) V(1) V(2) V(3) V(4) V(5) V(6) V(7) V(8) V(9)
      V(10) V(11) V(12) V(13) V(14) V(15) V(16) V(17) V(18) V(19)
      V(20) V(21) V(22) V(23) V(24) V(25) V(26) V(27) V(28) V(29)
      V(30) V(31)
    }
}
it will just if-convert everything together, performing all the operations unconditionally and doing lots of conditional moves.
If my patches are the way to go, supposedly the vectorizer cost model should take into account the non-if-converted loop cost instead of the if-converted loop cost and compare that to the cost of the vectorized if-converted loop; though with multiple basic blocks in the loop's body there won't be a single cost, perhaps one can compute a minimum, maximum and average cost or something similar (also taking branch probabilities into account).
And, for targets where a tree if-conversion would be useful, after adding some cost model and limiting how many statements we are willing to run unconditionally, parts of the if-conversion framework could be reused for a different pass.
Comment 33 Yuri Rumyantsev 2015-08-06 09:32:02 UTC
With the current compiler there is no performance difference between the by-ref and by-val test cases, but if we turn off the if-conversion transformation we get a ~2x speed-up on an Intel(R) Xeon(R) CPU X5670 @ 2.93GHz:

 ./t1.exe
Took 11.55 seconds total.
 ./t1.noifcvt.exe                            
Took 6.51 seconds total.

The test will be attached.
This is caused by skewed conditional branch probabilities for the loop:

    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
            tmp = x*(*rhs_it) + data[i] + carry;
            if (tmp >= imax) {
                    carry = tmp >> numbits;
                    tmp &= imax - 1;
            } else {
                    carry = 0;
            }
            data[i++] = tmp;
    }

Only 2.5% of the conditional branches are not taken, since imax represents MAX_INT32, and the profile estimation phase needs to be fixed to set up an unlikely probability for integral comparisons with huge constants.
To cope with this issue we may implement Jakub's approach and design an oracle for if-conversion profitability which simply computes region (loop) costs for the if-converted and non-if-converted regions (the cost of all acyclic paths).
Using such an approach we can see that, for a fixed profile, hammock predication is not profitable, and if vectorization is not successful the loop must be restored to the original one.
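
[Editor's aside: the reduction from comment 10 also applies to this loop; a branchless sketch of the body (assuming numbits == 32 and imax == 1ULL << 32, as in the testcase) sidesteps the cmov-vs-branch question entirely.]

    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
            tmp = x*(*rhs_it) + data[i] + carry;
            carry = tmp >> numbits;   // 0 whenever tmp < imax
            tmp &= imax - 1;          // no-op whenever tmp < imax
            data[i++] = tmp;
    }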
Comment 34 Yuri Rumyantsev 2015-08-06 09:35:38 UTC
Created attachment 36138 [details]
simple reproducer

Use -O3 -std=c++14 to compile, and add -fno-tree-loop-if-convert to get better performance.
Comment 35 Uroš Bizjak 2015-12-15 18:07:28 UTC
(In reply to Uroš Bizjak from comment #26)
> Another analysis by Jake in PR54037:

Eh, PR 54073.
Comment 36 Peter Cordes 2021-09-04 20:24:37 UTC
Related:  a similar case of cmov being a worse choice, for a threshold condition with an array input that happens to already be sorted:

https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2

GCC with -fprofile-generate / -fprofile-use does correctly decide to use branches.

GCC 7 and later (including current trunk) with -O3 -fno-tree-vectorize de-optimize by putting the cmov on the critical path, instead of using it as part of creating a zero/non-zero input for the add (PR82666).  If you do allow full -O3, then vectorization is effective, though.
Comment 37 Peter Cordes 2021-09-04 20:26:35 UTC
Correction: PR82666 is that the cmov on the critical path happens even at -O2 (with GCC 7 and later), not just with -O3 -fno-tree-vectorize.

Anyway, that's related, but probably separate from choosing to do if-conversion or not after inlining.