Created attachment 29442 [details]
Self-contained source file with parameter x passed by value (slow)

This bug report reflects the analysis of a question asked on Stack Overflow:
http://stackoverflow.com/questions/14805641/why-does-changing-const-ull-to-const-ull-in-function-parameter-result-in-pe/14819939#14819939

When an unsigned long long parameter to a function is passed by reference instead of by value, the result is a dramatic, almost 2x improvement in speed when compiled with -O3. Given that the function is inlined, this is unexpected. On closer inspection the generated code turned out to be quite different, as if passing the parameter by value enables an optimization (use of x86 conditional moves) that backfires, possibly by suffering an unexpected stall in the processor.

Two files are attached, by-val-O3.ii and by-ref-O3.ii. They differ only in the way the unsigned long long parameter "x" is passed.

./by-val-O3
Took 11.85 seconds total.
./by-ref-O3
Took 6.67 seconds total.
Created attachment 29443 [details]
Self-contained source file with parameter x passed by reference (fast)

This source file differs from by-val-O3.ii only in the way parameter "x" is passed.
Which target is this on? On some targets (most non-x86 targets) conditional moves are faster than compare and branch.
Intel Xeon X5570 @ 2.93GHz

(In reply to comment #2)
> Which target is this on? On some targets (most non-x86 targets) conditional
> moves are faster than compare and branch.
(In reply to comment #0)
> When an unsigned long long parameter to a function is passed by reference
> instead of by value, the result is a dramatic, almost 2x improvement in speed
> when compiled with -O3. Given that the function is inlined, this is unexpected.
> On closer inspection the generated code turned out to be quite different, as
> if passing the parameter by value enables an optimization (use of x86
> conditional moves) that backfires, possibly by suffering an unexpected stall
> in the processor.

This is due to an x86 processor deficiency (HW bug?) with cmove insns. Please see the explanations in PR53346 and PR54073. There is no solution (yet?); the compiler should probably detect this situation and avoid the conversion to cmove.
For reference, some quotes from Honza:

-- PR54073 --
The decision on whether to use cmov or jmp was always tricky on x86 architectures. Cmov increases dependency chains and register pressure (both values need to be loaded in) and has a long opcode. So a jump sequence, if well predicted, flows better through the out-of-order core. If badly predicted it is, of course, a disaster. I think more modern CPUs solved the problems with the long latency of cmov, but the dependency chains are still there.
[...]
We should do something about rnflow. I will look into that. The usual wisdom is that, lacking profile feedback, one should handle non-loop branches as unpredictable and loop branches as predictable, so everything handled by ifconvert falls into the first category. With profile feedback one can see the branch probability and, if it is close to 0 or REG_BR_PROB_BASE, treat the branch as predictable. We handle this with the predictable_edge_p parameter passed to BRANCH_COST (which by itself is gross, but for years we were not able to come up with something saner).
-- /PR54073 --

-- PR53046 --
Well, as I wrote in the other PR, the main problem of cmov is the extension of the dependency chain. For a well-predicted sequence with a conditional jump there is no update of rbx, so the loop executes faster, because the loads/stores/comparisons execute "in parallel". The load in the next iteration can then happen speculatively before the condition from the previous iteration is resolved. With cmov in it, there is a dependence on rbx for all the other computations in the loop. I guess there is no locally available information suggesting that the particular branch is well predictable, at least without profile feedback (where we won't disable the conversion anyway).
[...]
-- /PR53046 --
(In reply to comment #5)
> -- PR53046 --
> [...]
> -- /PR53046 --

Actually, PR53346.
My 2 cents. We can enhance the if-conversion phase in GCC by analyzing the conditional jump probability and not performing the transformation if we have a skewed hammock. In general this requires profiling, but for this specific example we can use a so-called "static profiler"; e.g. in our example we have

  if (tmp >= imax)

where imax = 1ULL << 32, i.e. a huge constant. For such conditions we can assume that the branch is likely not taken and reject if-conversion of such a hammock. Another possibility is to implement support for __builtin_expect here, but that requires source code changes too.
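To illustrate the __builtin_expect suggestion, here is a minimal, hedged sketch of what the source change might look like (the variable names follow the testcase's inner loop; whether the if-conversion passes actually honor the hint is a separate question):

  /* Sketch only: annotate the rarely-taken carry branch.
     __builtin_expect(cond, 0) tells GCC the condition is expected to be false. */
  if (__builtin_expect(tmp >= imax, 0)) {
      carry = tmp >> numbits;
      tmp &= imax - 1;
  } else {
      carry = 0;
  }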
It is possible (just a guess) that the extra compare is causing an interlock in the processor, since the first cmov is issued speculatively and the condition won't be confirmed until the first compare has executed. Someone from Intel could tell us exactly why the original sequence is so disastrous and suggest an alternative that still uses cmov and is better than jmp.

I wonder if instead of emitting this sequence

  shr    $0x20,%rdi
  and    $0xffffffff,%ecx
  cmp    %r8,%rdx
  cmovbe %r11,%rdi
  add    $0x1,%rax
  cmp    %r8,%rdx
  cmovbe %rdx,%rcx

it would do this instead

  shr    $0x20,%rdi
  and    $0xffffffff,%ecx
  add    $0x1,%rax
  cmp    %r8,%rdx
  cmovbe %r11,%rdi
  cmovbe %rdx,%rcx
I found in the Intel optimization guide an example of this idiom of comparing once and issuing two cmovs back-to-back... so the problem isn't the two cmovs, but possibly the introduction of the 2nd compare that splits them.

not_equal:
  movzx eax, BYTE PTR[esi+edx]
  movzx edx, BYTE PTR[edi+edx]
  cmp   eax, edx
  cmova eax, ONE
  cmovb eax, NEG_ONE
  jmp   ret_tag

Taken from example 10-16 of
http://www.intel.com/content/dam/doc/manual/64-ia-32-architectures-optimization-manual.pdf
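At the source level that Intel example is essentially a three-way compare. A hedged C sketch (hypothetical code, not from the attached testcases) that a compiler *may* lower to a single cmp followed by back-to-back cmova/cmovb, since one cmp sets all the flags both selects need:

  /* Returns 1 if a > b, -1 if a < b, 0 otherwise.
     Both conditions are derivable from the flags of one cmp,
     so the two conditional moves need not be split by a second compare. */
  static int cmp3(unsigned char a, unsigned char b)
  {
      int r = 0;
      r = (a > b) ?  1 : r;
      r = (a < b) ? -1 : r;
      return r;
  }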
Might be worth mentioning here what I said in the stackoverflow answer: in this particular case the entire conditional branch can be avoided because it is redundant. This code

  if (tmp >= imax) {
      carry = tmp >> numbits;  // <---- A
      tmp &= imax - 1;         // <---- B
  } else {
      carry = 0;               // <---- C
  }

can be reduced to

  carry = tmp >> numbits;
  tmp &= imax - 1;

Proof:
1) numbits is 32
2) imax is 1ULL << 32, so its lower 32 bits are zero
3) imax - 1 is 0xFFFFFFFF (see 2)
4) if tmp >= imax then tmp has bits set in its upper 32 bits
5) otherwise, if tmp < imax, then tmp does not have bits set in its upper 32 bits
6) statement "A" is equivalent to "C" when tmp < imax (because of 5)
7) statement "B" is a NOP when tmp < imax (because of 3 and 5)
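A quick self-contained check of that equivalence (illustrative only; the names follow the testcase, and the values around the 2^32 boundary are just sample points):

  #include <cassert>
  #include <initializer_list>

  int main()
  {
      const unsigned numbits = 32;
      const unsigned long long imax = 1ULL << numbits;

      for (unsigned long long tmp : { 0ULL, 1ULL, imax - 1, imax, imax + 123, ~0ULL }) {
          // Branchy version (statements A/B/C above).
          unsigned long long t1 = tmp, c1;
          if (t1 >= imax) { c1 = t1 >> numbits; t1 &= imax - 1; }
          else            { c1 = 0; }

          // Branchless version.
          unsigned long long t2 = tmp;
          unsigned long long c2 = t2 >> numbits;
          t2 &= imax - 1;

          assert(t1 == t2 && c1 == c2);
      }
      return 0;
  }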
(In reply to comment #8)
> I wonder if instead of emitting this sequence
>
>   shr    $0x20,%rdi
>   and    $0xffffffff,%ecx
>   cmp    %r8,%rdx
>   cmovbe %r11,%rdi
>   add    $0x1,%rax
>   cmp    %r8,%rdx
>   cmovbe %rdx,%rcx
>
> it would do this instead
>
>   shr    $0x20,%rdi
>   and    $0xffffffff,%ecx
>   add    $0x1,%rax
>   cmp    %r8,%rdx
>   cmovbe %r11,%rdi
>   cmovbe %rdx,%rcx

GCC fails to do so because the flags are clobbered between the two cmovs, preventing code motion from grouping the two cmovs:

  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

If you do this change manually in your code (compile with -S, "fix" the .s file and assemble it), does that speed up your code?
A bit more clear with insn 195 added:

  195: flags:CC=cmp(r124:DI,r235:DI)
  197: r116:DI={(gtu(flags:CC,0))?r125:DI:r233:DI}
  199: {r110:DI=r110:DI+0x1;clobber flags:CC;}
  201: flags:CC=cmp(r124:DI,r235:DI)
  202: r221:DI={(gtu(flags:CC,0))?r126:DI:r124:DI}

Insns 195 and 201 compute the same condition, but GCC can't eliminate the comparison in insn 201 because insn 199 clobbers the flags (i.e. destroys the result of insn 195).

As for speed, of course I can measure that myself:

--- by-val-O3.s.orig	2013-02-14 18:06:56.000000000 +0100
+++ by-val-O3.s	2013-02-14 18:07:23.000000000 +0100
@@ -357,9 +357,8 @@
 	shrq	$32, %rdi
 	cmpq	%r8, %rdx
 	cmovbe	%r11, %rdi
-	addq	$1, %rax
-	cmpq	%r8, %rdx
 	cmovbe	%rdx, %rcx
+	addq	$1, %rax
 	cmpq	%rbp, %rax
 	movq	%rcx, -8(%rsi,%rax,8)
 	jne	.L50

unmodified: Took 14.31 seconds total.
modified:   Took 13.04 seconds total.

So re. comment #9: it's not the problem, but it would be a small improvement.
$ g++ --version
g++ (GCC) 4.7.2
...
$ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
Took 14.31 seconds total.
$ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
Took 10.25 seconds total.
I also did the experiment, with the same results... it got faster, but not as fast as the version with a conditional branch instead of conditional moves:

./by-ref-O3
Took 6.65 seconds total.
./by-val-O3
Took 11.64 seconds total.
./by-val-fixed-O3
Took 9.94 seconds total.

--- by-val-O3.s	2013-02-14 11:27:28.109856000 -0600
+++ by-val-fixed-O3.s	2013-02-14 11:11:07.312317000 -0600
@@ -679,13 +679,12 @@
 	shrq	$32, %rdi
 	.loc 4 25 0
 	andl	$4294967295, %ecx
+	addq	$1, %rax
 	cmpq	%r8, %rdx
 	cmovbe	%r11, %rdi
 .LVL43:
 	.loc 4 29 0
-	addq	$1, %rax
 .LVL44:
-	cmpq	%r8, %rdx
 	cmovbe	%rdx, %rcx
 .LVL45:
 	.loc 4 21 0
(In reply to comment #13)
> $ g++ --version
> g++ (GCC) 4.7.2
> ...
> $ g++ -O3 -std=c++11 by-val-O3.ii ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fvect-cost-model ; ./a.out
> Took 14.31 seconds total.
> $ g++ -O3 -std=c++11 by-val-O3.ii -fno-tree-vectorize ; ./a.out
> Took 10.25 seconds total.

This is a red herring, see PR53346#c11:

--q--
You are right. -ftree-vectorize implies -ftree-loop-if-convert and this option makes all the difference!
--/q--

Please try with -ftree-vectorize -fno-tree-loop-if-convert.
With the -ftree-vectorize -fno-tree-loop-if-convert flags it generated this for the loop in question:

.L39:
	movq	%rdi, %rdx
	addq	(%rsi,%rax,8), %rcx
	imulq	(%r9,%rax,8), %rdx
	addq	%rcx, %rdx
	xorl	%ecx, %ecx
	cmpq	%r10, %rdx
	jbe	.L38
	movq	%rdx, %rcx
	andl	$4294967295, %edx
	shrq	$32, %rcx
.L38:
	addq	$1, %rax
	cmpq	%r8, %rax
	movq	%rdx, -8(%rsi,%rax,8)
	jne	.L39

And it executed fast:

./by-val-O3-flags
Took 6.74 seconds total.
(In reply to comment #16)
> And it executed fast:
>
> ./by-val-O3-flags
> Took 6.74 seconds total.

The solution to all these PRs is trivial: kill -ftree-loop-if-convert on x86 until the infrastructure to detect possible stalls is implemented. This decision needs lots of benchmarking, though.
The following patch is a big-hammer approach to the problem, intended only for benchmarking:

--cut here--
Index: common/config/i386/i386-common.c
===================================================================
--- common/config/i386/i386-common.c	(revision 195988)
+++ common/config/i386/i386-common.c	(working copy)
@@ -710,6 +710,8 @@
     /* Turn off -fschedule-insns by default.  It tends to make the
        problem with not enough registers even worse.  */
     { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
+    /* Turn off -ftree-loop-if-convert by default.  */
+    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },
 
 #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
     SUBTARGET_OPTIMIZATION_OPTIONS,
--cut here--

OTOH, it looks like on x86 the only problematic insns are scalar integer cmov instructions. With some (machine-dependent?) pass inserted at the right spot in the sequence of passes, appropriate cmovs can be decomposed back to jumps, for some yet unknown definition of "appropriate".
Back to "target" component.
(In reply to comment #12)
> --- by-val-O3.s.orig	2013-02-14 18:06:56.000000000 +0100
> +++ by-val-O3.s	2013-02-14 18:07:23.000000000 +0100
> @@ -357,9 +357,8 @@
>  	shrq	$32, %rdi
>  	cmpq	%r8, %rdx
>  	cmovbe	%r11, %rdi
> -	addq	$1, %rax
> -	cmpq	%r8, %rdx
>  	cmovbe	%rdx, %rcx
> +	addq	$1, %rax
>  	cmpq	%rbp, %rax
>  	movq	%rcx, -8(%rsi,%rax,8)
>  	jne	.L50
>
> unmodified: Took 14.31 seconds total.
> modified:   Took 13.04 seconds total.
>
> So re. comment #9: it's not the problem, but it would be a small improvement.

FWIW this comes from not eliminating the condition expression in the conditional moves that ifcvt creates:

  tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
  carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;

I'm surprised this form is allowed at all; I'd expect we only allow is_gimple_reg() for a COND_EXPR_COND in a RHS context. Anyway -- separate problem.
(In reply to comment #18)
> The following patch is a big-hammer approach to the problem, intended only
> for benchmarking:
>
> --cut here--
> Index: common/config/i386/i386-common.c
> ===================================================================
> --- common/config/i386/i386-common.c	(revision 195988)
> +++ common/config/i386/i386-common.c	(working copy)
> @@ -710,6 +710,8 @@
>      /* Turn off -fschedule-insns by default.  It tends to make the
>         problem with not enough registers even worse.  */
>      { OPT_LEVELS_ALL, OPT_fschedule_insns, NULL, 0 },
> +    /* Turn off -ftree-loop-if-convert by default.  */
> +    { OPT_LEVELS_ALL, OPT_ftree_loop_if_convert, NULL, 0 },
>
>  #ifdef SUBTARGET_OPTIMIZATION_OPTIONS
>      SUBTARGET_OPTIMIZATION_OPTIONS,
> --cut here--

But in this case we are giving up vectorization for some cases, aren't we?
Yeah. We need to either find another way to present the vectorizer with the if-converted statements (perhaps similar to how pattern matching works; I think if-conversion only handles inner-most loops, so maybe just as an alternative gimple_seq for the single bb of the loop body?), or we need some kind of if-unconversion pass. See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch, which contains a much more limited version of if-unconversion; unfortunately it is going to be harder to handle more complex conditions efficiently.
(In reply to comment #20)
> (In reply to comment #12)
> > --- by-val-O3.s.orig	2013-02-14 18:06:56.000000000 +0100
> > +++ by-val-O3.s	2013-02-14 18:07:23.000000000 +0100
> > @@ -357,9 +357,8 @@
> >  	shrq	$32, %rdi
> >  	cmpq	%r8, %rdx
> >  	cmovbe	%r11, %rdi
> > -	addq	$1, %rax
> > -	cmpq	%r8, %rdx
> >  	cmovbe	%rdx, %rcx
> > +	addq	$1, %rax
> >  	cmpq	%rbp, %rax
> >  	movq	%rcx, -8(%rsi,%rax,8)
> >  	jne	.L50
> >
> > unmodified: Took 14.31 seconds total.
> > modified:   Took 13.04 seconds total.
> >
> > So re. comment #9: it's not the problem, but it would be a small improvement.
>
> FWIW this comes from not eliminating the condition expression in the
> conditional moves that ifcvt creates:
>
>   tmp_97 = tmp_93 > 4294967295 ? tmp_95 : tmp_93;
>   carry_105 = tmp_93 > 4294967295 ? carry_94 : 0;
>
> I'm surprised this form is allowed at all; I'd expect we only allow
> is_gimple_reg() for a COND_EXPR_COND in a RHS context.

Yeah, it's on my list (even with partial patches available ...). Note that vectorizing a COND_EXPR is then no different from being able to vectorize a comparison statement.

  pred_2 = tmp_93 > 4294967295;
  tmp_97 = pred_2 ? tmp_95 : tmp_93;
  carry_105 = pred_2 ? carry_94 : 0;

This form, both vectorized and not vectorized, has issues when doing initial instruction selection during expand (with multiple uses of pred_2 we don't TER it). So it might be that we present combine and the other RTL optimizers with initial code they will not be able to handle as well as what we do right now.

> Anyway -- separate problem.

Indeed.
(In reply to comment #22)
> Yeah. We need to either find another way to present the vectorizer with the
> if-converted statements (perhaps similar to how pattern matching works; I
> think if-conversion only handles inner-most loops, so maybe just as an
> alternative gimple_seq for the single bb of the loop body?), or we need some
> kind of if-unconversion pass.
> See the http://gcc.gnu.org/ml/gcc-patches/2012-11/msg01659.html patch, which
> contains a much more limited version of if-unconversion; unfortunately it is
> going to be harder to handle more complex conditions efficiently.

Well, ideally if-conversion would work as part of vectorization. If-conversion is nothing else than pattern detection - but of course the complication is that the vectorizer would need to be taught to vectorize loops with more than a single basic-block (even though the vectorized loop would again contain a single basic-block).
(In reply to comment #21)
> (In reply to comment #18)
> But in this case we are giving up vectorization for some cases, aren't we?

Yes, and please note that a vectorized conditional move doesn't result in a cmov instruction, but in a sequence of SSE logic insns. So, as far as x86 targets are concerned, this bug has nothing to do with the vectorizer.

Maybe we should decompose scalar cmoves into the same sequence?
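For illustration, the "sequence of SSE logic insns" for a vectorized conditional move is essentially a mask-and-blend. A hedged sketch with SSE2 intrinsics (a hypothetical helper, not actual GCC output; it assumes the comparison mask has already been computed as all-ones/all-zeros per lane):

  #include <emmintrin.h>

  /* Per-lane select: result = mask ? a : b, where each mask lane is
     either all ones or all zeros.  This and/andnot/or sequence is the
     kind of code a vectorized COND_EXPR lowers to (SSE4.1 could use
     pblendvb instead). */
  static inline __m128i select128(__m128i mask, __m128i a, __m128i b)
  {
      return _mm_or_si128(_mm_and_si128(mask, a),
                          _mm_andnot_si128(mask, b));
  }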
Another analysis by Jake in PR54037:

I have done quite a bit of analysis on cmov performance across x86 architectures, so I will share here in case it helps.

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also (but less so) for AMD Athlon/Phenom chips. In the AMD Athlon/Phenom case the performance of cmov vs. cmp+branch is determined more by the alignment of the target of the branch than by the prediction rate of the branch. The instruction decoders would incur penalties on certain types of unaligned branch targets (when taken), or when decoding sequences of instructions that contained multiple branches within a 16-byte "fetch" window (taken or not). cmov was sometimes handy for avoiding those.

With regard to the more current Intel Core and AMD Bulldozer/Bobcat architectures: I have found that use of conditional moves (cmov) is only beneficial if the branch that the move is replacing is badly mis-predicted. In my tests, cmov only became clearly "optimal" when the branch was predicted correctly less than 92% of the time, which is abysmal by modern branch predictor standards and rarely occurs in practice. Above 97% prediction rates, cmov is typically slower than cmp+branch. Inside loops that contain branches with prediction rates approaching 100% (as is the case presented by the OP), cmov becomes a severe performance bottleneck. This holds true for both Core and Bulldozer. Bulldozer has less efficient branching than the i7, but is also severely bottlenecked by its limited fetch/decode. cmov requires executing more total instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from the added register pressure that cmov introduces. In practice, the prognosis for cmov being "optimal" is even worse than what I've observed in a controlled environment.

Furthermore, to my knowledge the status of cmov vs. branch performance on x86 will not be changing anytime soon. cmov will continue to be a liability well into the next couple of architecture releases from Intel and AMD. Piledriver will have added fetch/decode resources but should also have a smaller mispredict penalty, so it's doubtful cmov will gain much advantage there either. Therefore I would recommend setting -fno-tree-loop-if-convert for all -march settings matching the Intel Core and AMD Bulldozer/Bobcat families.

There is one good use-case for cmov on x86: mis-predicted conditions inside of loops. Currently there's no way to force that behavior in situations where I, the programmer, am fully aware that the condition is chaotic/random. A builtin cmov or condition hint would be nice. For now I'm forced to address those (fortunately infrequent) situations via inline asm.
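(This is not Jake's harness, which isn't attached; just a minimal, hedged sketch of the kind of experiment described above. It times a threshold loop over data that is either random - branch mispredicted roughly half the time - or sorted - branch nearly 100% predictable. Whether the compiler emits cmov or cmp+branch for the inner condition depends on the compiler version and flags, so inspect the generated asm before drawing conclusions.)

  #include <algorithm>
  #include <chrono>
  #include <cstdio>
  #include <random>
  #include <vector>

  static long run(const std::vector<int>& v)
  {
      long sum = 0;
      for (int loops = 0; loops < 2000; ++loops)
          for (int x : v)
              if (x >= 128)          // candidate for if-conversion to cmov
                  sum += x;
      return sum;
  }

  int main()
  {
      std::vector<int> v(32768);
      std::mt19937 gen(42);
      std::uniform_int_distribution<int> dist(0, 255);
      for (int& x : v) x = dist(gen);

      for (int pass = 0; pass < 2; ++pass) {
          if (pass == 1) std::sort(v.begin(), v.end());
          auto t0 = std::chrono::steady_clock::now();
          long s = run(v);
          auto t1 = std::chrono::steady_clock::now();
          std::printf("%s: %.3f s (sum=%ld)\n", pass ? "sorted" : "random",
                      std::chrono::duration<double>(t1 - t0).count(), s);
      }
      return 0;
  }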
Some additional analysis: http://gcc.gnu.org/ml/gcc/2008-02/msg00601.html
Hmmm, I also found that on x86, using __builtin_expect makes the code slower...
FYI, I've posted a patchset to only do tree if-conversion for vectorization and not for scalar basic blocks (and also for conditional load/store support in vectorized loops using AVX vmaskmov* and AVX2 masked gathers):
http://gcc.gnu.org/ml/gcc-patches/2013-10/msg01373.html
I'd appreciate any benchmark results for that.

On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.
(In reply to Jakub Jelinek from comment #29)
> On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to
> be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.

According to the detailed analysis in comment #26, perhaps a new tuning flag or target hook could be used for if-conversion of scalars. Also, I think that -Os can still use cmoves, for the smaller code size.
On Fri, 18 Oct 2013, ubizjak at gmail dot com wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
>
> --- Comment #30 from Uroš Bizjak <ubizjak at gmail dot com> ---
> (In reply to Jakub Jelinek from comment #29)
>
> > On the by-ref-O3.ii testcase there is no change, but by-val-O3.ii seems to
> > be consistently faster, around 20% on AMD 8354 and 35% on Intel 2600.
>
> According to the detailed analysis in comment #26, perhaps a new tuning flag
> or target hook could be used for if-conversion of scalars. Also, I think that
> -Os can still use cmoves, for the smaller code size.

But RTL if-conversion still will generate cmov, right?
(In reply to rguenther@suse.de from comment #31)
> But RTL if-conversion still will generate cmov, right?

Sure. The current if-conversion is performed without any cost model or anything similar; say, even for

void
foo (int *a, int *b, int *c)
{
  int i;
  for (i = 0; i < 1024; i += 32)
    {
      int t1, t2;
#define V(N) \
      t1 = a[i + N]; t2 = b[i + N]; \
      a[i + N] = c[i + N / 8] ? t1 * (N + 1) / 3 + 21 : t2 * (N + 3) / 17 + 9;
      V(0) V(1) V(2) V(3) V(4) V(5) V(6) V(7)
      V(8) V(9) V(10) V(11) V(12) V(13) V(14) V(15)
      V(16) V(17) V(18) V(19) V(20) V(21) V(22) V(23)
      V(24) V(25) V(26) V(27) V(28) V(29) V(30) V(31)
    }
}

it will just if-convert everything together, performing all the operations unconditionally and doing lots of conditional moves.

If my patches are the way to go, supposedly the vectorizer cost model should take into account the non-if-converted loop cost instead of the if-converted loop cost and compare that to the cost of the vectorized if-converted loop. With multiple basic blocks in the loop's body there won't be a single cost, but perhaps one can compute the minimum, maximum, and average cost or something similar (also taking branch probabilities into account). And, for targets where a tree if-conversion would be useful, after adding some cost model and limiting how many statements we would be running unconditionally, parts of the if-conversion framework could be used for a different pass.
With the current compiler there is no performance difference between the by-ref and by-val test-cases, but if we turn off the if-convert transformation we get a ~2X speed-up on an Intel(R) Xeon(R) CPU X5670 @ 2.93GHz:

./t1.exe
Took 11.55 seconds total.
./t1.noifcvt.exe
Took 6.51 seconds total.

The test will be attached. This is caused by skewed conditional branch probabilities for the loop:

    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it)
    {
        tmp = x*(*rhs_it) + data[i] + carry;
        if (tmp >= imax) {
            carry = tmp >> numbits;
            tmp &= imax - 1;
        } else {
            carry = 0;
        }
        data[i++] = tmp;
    }

Only 2.5% of the conditional branches are not taken, since imax is 2^32, i.e. a huge constant, and the profile estimation phase needs to be fixed to set up an unlikely probability for integral comparisons with huge constants. To cope with this issue we may implement Jakub's approach of designing an oracle for if-conversion profitability which simply computes region (loop) costs for the if-converted and non-if-converted regions (the cost of all acyclic paths). Using such an approach we can see that, for the fixed profile, hammock predication is not profitable, and if vectorization is not successful the loop must be restored to the original one.
Created attachment 36138 [details]
Simple reproducer

Use the -O3 -std=c++14 options to compile, and -fno-tree-loop-if-convert to get better performance.
(In reply to Uroš Bizjak from comment #26)
> Another analysis by Jake in PR54037:

Eh, PR 54073.
Related: a similar case of cmov being a worse choice, for a threshold condition with an array input that happens to already be sorted: https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 GCC with -fprofile-generate / -fprofile-use does correctly decide to use branches. GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize de-optimizes by putting the CMOV on the critical path, instead of as part of creating a zero/non-zero input for the ADD. PR82666. If you do allow full -O3, then vectorization is effective, though.
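For reference, the profile-feedback workflow mentioned above is the standard two-step build (file names here are placeholders, and the profiling run should use representative input):

g++ -O3 -fprofile-generate test.cc -o test
./test            # writes .gcda profile data
g++ -O3 -fprofile-use test.cc -o test

With the measured branch probabilities available, GCC can see that the condition is highly predictable and prefer a branch over cmov for this kind of loop.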
Correction, PR82666 is that the cmov on the critical path happens even at -O2 (with GCC7 and later). Not just with -O3 -fno-tree-vectorize. Anyway, that's related, but probably separate from choosing to do if-conversion or not after inlining.