Created attachment 27860 [details]
SciMark Monte Carlo test

On an Intel Core i7 CPU (see the attached screenshot), the MonteCarlo score (Mflops, higher is better) drops:

GCC 4.2.x - 380
GCC 4.7.x - 265

i.e. GCC 4.7.x is about 43% slower (380/265 ~= 1.43). SciMark 2.0 sources can be fetched here:
http://math.nist.gov/scimark2/download_c.html
The results are obtained from here:
http://openbenchmarking.org/result/1207077-SU-GCCPERFOR59

This benchmarks GCC 4.2 through GCC 4.8, with each compiler built the same way and with CFLAGS/CXXFLAGS set to "-O3 -march=native" prior to test installation and execution. Benchmarking is for a future article on Phoronix.com by Michael Larabel. Testing was done on an Intel Core i7 and an AMD Opteron 2384 using the 64-bit (x86_64 target) build of Ubuntu Linux.

The CPU is a Nehalem-microarchitecture "Clarksfield" (45 nm) part, the Intel® Core™ i7-720QM Processor (6M Cache, 1.60 GHz):
http://ark.intel.com/products/43122/Intel-Core-i7-720QM-Processor-%286M-Cache-1_60-GHz%29
Our autotesters show a jump of this magnitude between:

K8:   good r171332, bad r171367
K10:  good r171339, bad r171360
IA64: good r182218, bad r182265

This needs further bisection; there are a few candidates within the r171339:r171360 range. IA64 is supposedly something else (the fix for PR21617 pops up there).

Confirmed, at least.
-flto helps a lot in this case (CPU=K10):

-O3:        MonteCarlo: Mflops:   319.57
-O3 -flto:  MonteCarlo: Mflops:   921.67
If they are single-file benchmarks, a simple -fwhole-program would do, too. (I wonder if we can auto-detect -fwhole-program from within the gcc driver; if one performs non-partial linking on a single source input, that should be safe. Quite a "benchmark" thing to do, though.)
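For reference, the two approaches above as build commands (hypothetical invocations; the actual file names and test-harness commands are not part of this report):

  # multi-file build, cross-module optimization via LTO:
  gcc -O3 -march=native -flto *.c -o scimark2 -lm

  # hypothetical single-file benchmark, where -fwhole-program is safe:
  gcc -O3 -march=native -fwhole-program bench.c -o bench -lm

-fwhole-program lets the optimizer assume nothing outside the translation unit references its symbols, which is what makes the single-source, direct-link case a candidate for driver auto-detection.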
Is this the same as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53397 ?
(In reply to comment #5)
> Is this the same as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53397 ?

No. Monte Carlo is independent of FFT. I can confirm the huge drop of the FFT score with -march=amdfam10. (-flto doesn't help in that case.)
GCC 4.7.2 has been released.
Created attachment 28674 [details]
gcc48-pr54073.patch

On x86_64-linux on a SandyBridge CPU with -O3 -march=corei7-avx I've tracked it down to the http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=171341 change, in particular the emit_conditional_move part of the changes. Before the change, emit_conditional_move would completely ignore the predicate on the comparison operand (operands[1]); starting with r171341 it honors it. And movsicc's ordered_comparison_operator would give up on the UNLT comparison in the MonteCarlo testcase, while ix86_expand_int_movcc expands it just fine, and at least in that loop it is beneficial to use

	vucomisd %xmm0, %xmm1
	cmovae   %eax, %ebp

instead of:

.L4:
	addl $1, %ebx
	...
	vucomisd %xmm0, %xmm2
	jb .L4

The attached proof-of-concept patch attempts to just restore the 4.6 and earlier behavior by allowing in all comparisons. Of course it may well need better tuning than that; I meant it just as a start for discussion.

vanilla trunk:

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1886.79
FFT             Mflops:  1726.97    (N=1024)
SOR             Mflops:  1239.20    (100 x 100)
MonteCarlo:     Mflops:   374.13
Sparse matmult  Mflops:  1956.30    (N=1000, nz=5000)
LU              Mflops:  4137.37    (M=100, N=100)

patched trunk:

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1910.08
FFT             Mflops:  1726.97    (N=1024)
SOR             Mflops:  1239.20    (100 x 100)
MonteCarlo:     Mflops:   528.94
Sparse matmult  Mflops:  1949.03    (N=1000, nz=5000)
LU              Mflops:  4106.27    (M=100, N=100)
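For context, a minimal sketch of the kernel in question (simplified; SciMark's real MonteCarlo_integrate uses its own Random type, for which the next_double() helper below is a hypothetical stand-in). The x*x + y*y <= 1.0 test is what ends up as either the jb loop or the cmovae sequence quoted above:

  #include <stdlib.h>

  /* Hypothetical stand-in for SciMark's Random_nextDouble ().  */
  static double
  next_double (void)
  {
    return rand () / (double) RAND_MAX;
  }

  /* Estimate pi by counting random points of the unit square that
     fall inside the quarter circle.  The comparison is essentially a
     coin flip, so it mispredicts about half the time, which is why
     the branchless cmov sequence wins here.  */
  double
  monte_carlo_integrate (int num_samples)
  {
    int under_curve = 0;

    for (int count = 0; count < num_samples; count++)
      {
	double x = next_double ();
	double y = next_double ();
	if (x * x + y * y <= 1.0)
	  under_curve++;
      }
    return ((double) under_curve / num_samples) * 4.0;
  }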
(In reply to comment #8)
> The attached proof-of-concept patch attempts to just restore the 4.6 and
> earlier behavior by allowing in all comparisons. Of course it may well need
> better tuning than that; I meant it just as a start for discussion.

The results look terrific; I hope this patch will be merged ASAP.
(In reply to comment #8)
> The attached proof-of-concept patch attempts to just restore the 4.6 and
> earlier behavior by allowing in all comparisons. Of course it may well need
> better tuning than that; I meant it just as a start for discussion.

Please see PR53346, from comment 14 onwards, especially H.J.'s comment:

-quote-
I was told that cmov wins if branch is mispredicted, otherwise
cmov loses. We will investigate if we can improve cmov in GCC.
-/quote-
(In reply to comment #10)
> Please see PR53346, from comment 14 onwards, especially H.J.'s comment:
>
> -quote-
> I was told that cmov wins if branch is mispredicted, otherwise
> cmov loses. We will investigate if we can improve cmov in GCC.
> -/quote-

Possibly. But even so, movsicc etc. isn't automatically the right choice whenever the comparison is ordered and the wrong one otherwise; it is desirable or undesirable depending on whether the compiler can guess if the condition will be predicted well. I guess that in MonteCarlo the x*x + y*y <= 1.0 condition can't be predicted well, and that is why cmov helps so much there.
The decision on whether to use cmov or a jump was always tricky on x86 architectures. Cmov increases dependency chains and register pressure (both values need to be loaded in) and has a long encoding. So a jump sequence, if well predicted, flows better through the out-of-order core. If badly predicted it is, of course, a disaster. I think more modern CPUs have solved the problems with the long latency of cmov, but the dependency chains are still there.

This patch fixes a bug in a pattern rather than tweaking the heuristic on predictability. As such I think it is OK for mainline. We should do something about rnflow; I will look into that.

The usual wisdom is that, lacking profile feedback, one should handle non-loop branches as unpredictable and loop branches as predictable, so everything handled by if-conversion falls into the first category. With profile feedback one can see the branch probability and, if it is close to 0 or REG_BR_PROB_BASE, treat the branch as predictable. We handle this with the predictable_edge_p parameter passed to BRANCH_COST (which by itself is gross, but for years we have not been able to come up with something saner).

Honza
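For readers who don't know the hook Honza refers to: BRANCH_COST receives a speed-vs-size flag plus the predictability bit, and if-conversion compares the cost of the branchless sequence against it. Roughly the i386 shape (paraphrased from config/i386/i386.h of that era; exact details vary by GCC version):

  /* ix86_branch_cost comes from the active tuning table, or from an
     explicit -mbranch-cost=N.  */
  #define BRANCH_COST(speed_p, predictable_p) \
    (!(speed_p) ? 2 : (predictable_p) ? 0 : ix86_branch_cost)

So a branch deemed predictable is costed as free, meaning if-conversion will almost never replace it with cmov, while an unpredictable branch is costed per the tuning table.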
Author: jakub
Date: Fri Nov 16 11:40:39 2012
New Revision: 193554

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193554

Log:
	PR target/54073
	* config/i386/i386.md (mov<mode>cc): Use comparison_operator
	instead of ordered_comparison_operator resp.
	ix86_fp_comparison_operator predicates.
	* config/i386/i386.c (ix86_expand_fp_movcc): Reject TImode
	or for -m32 DImode comparisons.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.c
	trunk/gcc/config/i386/i386.md
Hopefully fixed on the trunk, not planning to backport it right now.
*** Bug 53346 has been marked as a duplicate of this bug. ***
Hi,

I have done quite a bit of analysis on cmov performance across x86 architectures, so I will share it here in case it helps.

Quick summary: conditional moves on Intel Core/Xeon and AMD Bulldozer architectures should probably be avoided "as a rule."

History: conditional moves were beneficial for the Intel Pentium 4, and also (but less so) for AMD Athlon/Phenom chips. In the AMD Athlon/Phenom case the performance of cmov vs. cmp+branch is determined more by the alignment of the target of the branch than by the prediction rate of the branch. The instruction decoders would incur penalties on certain types of unaligned branch targets (when taken), or when decoding sequences of instructions that contained multiple branches within a 16-byte "fetch" window (taken or not). Cmov was sometimes handy for avoiding those.

With regard to the more current Intel Core and AMD Bulldozer/Bobcat architectures: I have found that the use of conditional moves (cmov) is only beneficial if the branch that the move is replacing is badly mispredicted. In my tests, cmov only became clearly "optimal" when the branch was predicted correctly less than 92% of the time, which is abysmal by modern branch-predictor standards and rarely occurs in practice. Above 97% prediction rates, cmov is typically slower than cmp+branch. Inside loops that contain branches with prediction rates approaching 100% (as is the case presented by the OP), cmov becomes a severe performance bottleneck. This holds true for both Core and Bulldozer. Bulldozer has less efficient branching than the i7, but is also severely bottlenecked by its limited fetch/decode. Cmov requires executing more total instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from the added register pressure that cmov introduces. In practice, the prognosis for cmov being "optimal" is even worse than what I've observed in a controlled environment.

Furthermore, to my knowledge the status of cmov vs. branch performance on x86 will not be changing anytime soon. Cmov will continue to be a liability well into the next couple of architecture releases from Intel and AMD. Piledriver will have added fetch/decode resources but should also have a smaller mispredict penalty, so it's doubtful cmov will gain much advantage there either. Therefore I would recommend setting -fno-tree-loop-if-convert for all -march values matching the Intel Core and AMD Bulldozer/Bobcat families.

There is one good use case for cmov on x86: mispredicted conditions inside of loops. Currently there's no way to force that behavior in situations where I, the programmer, am fully aware that the condition is chaotic/random. A builtin cmov or a condition hint would be nice. For now I'm forced to address those (fortunately infrequent) situations via inline asm.
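To make the inline-asm workaround from the last paragraph concrete, here is a hypothetical helper (my sketch, not code from the thread): select_lt() pins the selection to a cmov via GCC extended asm on x86/x86-64, so the compiler cannot turn it back into a branch:

  /* Returns (a < b) ? x : y using a forced cmov.  */
  static inline int
  select_lt (int a, int b, int x, int y)
  {
    int r = y;
    __asm__ ("cmpl %2, %1\n\t"   /* flags := a - b     */
	     "cmovl %3, %0"      /* if (a < b) r = x   */
	     : "+r" (r)
	     : "r" (a), "r" (b), "r" (x)
	     : "cc");
    return r;
  }

This is exactly the sort of thing a builtin cmov or a condition hint would make unnecessary.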
(In reply to comment #16)
> I have done quite a bit of analysis on cmov performance across x86
> architectures, so I will share it here in case it helps.

I have moved this discussion to PR56309. Let's keep this PR open for the eventual backport of the patch in comment #13 to the 4.7 branch.
GCC 4.7.3 is being released, adjusting target milestone.
Fixed for 4.8.0.