Created attachment 27860 [details]
SciMark Monte Carlo test

On an Intel Core i7 CPU (see the attached screenshot), the MonteCarlo score (Mflops, higher is better) drops:

GCC 4.2.x - 380
GCC 4.7.x - 265

i.e. GCC 4.7.x is about 43% slower (380/265 ~= 1.43). SciMark 2.0 sources can be fetched here:
http://math.nist.gov/scimark2/download_c.html
The results are obtained from here:
http://openbenchmarking.org/result/1207077-SU-GCCPERFOR59

This benchmarks GCC 4.2 through GCC 4.8, with each compiler built the same way and with CFLAGS/CXXFLAGS set to "-O3 -march=native" prior to test installation and execution. Benchmarking is for a future article on Phoronix.com by Michael Larabel. Testing was done on an Intel Core i7 and an AMD Opteron 2384 using the 64-bit (x86_64 target) build of Ubuntu Linux.

The CPU is a Nehalem-microarchitecture "Clarksfield" (45 nm) part, the Intel® Core™ i7-720QM Processor (6M Cache, 1.60 GHz):
http://ark.intel.com/products/43122/Intel-Core-i7-720QM-Processor-%286M-Cache-1_60-GHz%29
Our autotesters show a jump of this magnitude between:

K8:   good r171332, bad r171367
K10:  good r171339, bad r171360
IA64: good r182218, bad r182265

This needs further bisection; there are a few candidates within the r171339:r171360 range. IA64 is supposedly something else (the fix for PR21617 pops up there).

Confirmed, at least.
-flto helps a lot in this case (CPU=K10):

-O3:        MonteCarlo: Mflops:   319.57
-O3 -flto:  MonteCarlo: Mflops:   921.67
If they are single-file benchmarks, a simple -fwhole-program would do, too. (I wonder if we can auto-detect -fwhole-program from within the gcc driver; if one performs non-partial linking on a single source input, that should be safe. Quite a "benchmark" thing to do, though.)
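For reference, the two approaches above as build commands (hypothetical invocations; the actual file names and test-harness commands are not part of this report):

  # multi-file build, cross-module optimization via LTO:
  gcc -O3 -march=native -flto *.c -o scimark2 -lm

  # hypothetical single-file benchmark, where -fwhole-program is safe:
  gcc -O3 -march=native -fwhole-program bench.c -o bench -lm

-fwhole-program lets the optimizer assume nothing outside the translation unit references its symbols, which is what makes the single-source, direct-link case a candidate for driver auto-detection.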
Is this the same as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53397 ?
(In reply to comment #5)
> Is this the same as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53397 ?

No. Monte Carlo is independent of FFT. I can confirm the huge drop of the FFT score with -march=amdfam10. (-flto doesn't help in that case.)
GCC 4.7.2 has been released.
Created attachment 28674 [details]
gcc48-pr54073.patch

On x86_64-linux on a SandyBridge CPU with -O3 -march=corei7-avx I've tracked it down to the http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=171341 change, in particular the emit_conditional_move part of the changes. Before the change, emit_conditional_move would completely ignore the predicate on the comparison operand (operands[1]); starting with r171341 it honors it. And movsicc's ordered_comparison_operator would give up on the UNLT comparison in the MonteCarlo testcase, while ix86_expand_int_movcc expands it just fine, and at least in that loop it is beneficial to use

	vucomisd %xmm0, %xmm1
	cmovae   %eax, %ebp

instead of:

.L4:
	addl $1, %ebx
	...
	vucomisd %xmm0, %xmm2
	jb .L4

The attached proof-of-concept patch attempts to just restore the 4.6 and earlier behavior by allowing in all comparisons. Of course it may well need better tuning than that; I meant it just as a start for discussion.

vanilla trunk:

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1886.79
FFT             Mflops:  1726.97    (N=1024)
SOR             Mflops:  1239.20    (100 x 100)
MonteCarlo:     Mflops:   374.13
Sparse matmult  Mflops:  1956.30    (N=1000, nz=5000)
LU              Mflops:  4137.37    (M=100, N=100)

patched trunk:

**                                                              **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to pozo@nist.gov)     **
**                                                              **
Using       2.00 seconds min time per kenel.
Composite Score:         1910.08
FFT             Mflops:  1726.97    (N=1024)
SOR             Mflops:  1239.20    (100 x 100)
MonteCarlo:     Mflops:   528.94
Sparse matmult  Mflops:  1949.03    (N=1000, nz=5000)
LU              Mflops:  4106.27    (M=100, N=100)
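For context, a minimal sketch of the kernel in question (simplified; SciMark's real MonteCarlo_integrate uses its own Random type, for which the next_double() helper below is a hypothetical stand-in). The x*x + y*y <= 1.0 test is what ends up as either the jb loop or the cmovae sequence quoted above:

  #include <stdlib.h>

  /* Hypothetical stand-in for SciMark's Random_nextDouble ().  */
  static double
  next_double (void)
  {
    return rand () / (double) RAND_MAX;
  }

  /* Estimate pi by counting random points of the unit square that
     fall inside the quarter circle.  The comparison is essentially a
     coin flip, so it mispredicts about half the time, which is why
     the branchless cmov sequence wins here.  */
  double
  monte_carlo_integrate (int num_samples)
  {
    int under_curve = 0;

    for (int count = 0; count < num_samples; count++)
      {
	double x = next_double ();
	double y = next_double ();
	if (x * x + y * y <= 1.0)
	  under_curve++;
      }
    return ((double) under_curve / num_samples) * 4.0;
  }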
(In reply to comment #8)
> The attached proof-of-concept patch attempts to just restore the 4.6 and
> earlier behavior by allowing in all comparisons. Of course it may well need
> better tuning than that; I meant it just as a start for discussion.

The results look terrific; I hope this patch will be merged ASAP.
(In reply to comment #8)
> The attached proof-of-concept patch attempts to just restore the 4.6 and
> earlier behavior by allowing in all comparisons. Of course it may well need
> better tuning than that; I meant it just as a start for discussion.

Please see PR53346, from comment 14 onwards, especially H.J.'s comment:

-quote-
I was told that cmov wins if branch is mispredicted, otherwise
cmov loses. We will investigate if we can improve cmov in GCC.
-/quote-
(In reply to comment #10)
> Please see PR53346, from comment 14 onwards, especially H.J.'s comment:
>
> -quote-
> I was told that cmov wins if branch is mispredicted, otherwise
> cmov loses. We will investigate if we can improve cmov in GCC.
> -/quote-

Possibly. But even so, movsicc etc. isn't automatically the right choice whenever the comparison is ordered and the wrong one otherwise; it is desirable or undesirable depending on whether the compiler can guess if the condition will be predicted well. I guess that in MonteCarlo the x*x + y*y <= 1.0 condition can't be predicted well, and that is why cmov helps so much there.
The decision on whether to use cmov or a jump was always tricky on x86 architectures. Cmov increases dependency chains and register pressure (both values need to be loaded in) and has a long encoding. So a jump sequence, if well predicted, flows better through the out-of-order core. If badly predicted it is, of course, a disaster. I think more modern CPUs have solved the problems with the long latency of cmov, but the dependency chains are still there.

This patch fixes a bug in a pattern rather than tweaking the heuristic on predictability. As such I think it is OK for mainline. We should do something about rnflow; I will look into that.

The usual wisdom is that, lacking profile feedback, one should handle non-loop branches as unpredictable and loop branches as predictable, so everything handled by if-conversion falls into the first category. With profile feedback one can see the branch probability and, if it is close to 0 or REG_BR_PROB_BASE, treat the branch as predictable. We handle this with the predictable_edge_p parameter passed to BRANCH_COST (which by itself is gross, but for years we have not been able to come up with something saner).

Honza
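For readers who don't know the hook Honza refers to: BRANCH_COST receives a speed-vs-size flag plus the predictability bit, and if-conversion compares the cost of the branchless sequence against it. Roughly the i386 shape (paraphrased from config/i386/i386.h of that era; exact details vary by GCC version):

  /* ix86_branch_cost comes from the active tuning table, or from an
     explicit -mbranch-cost=N.  */
  #define BRANCH_COST(speed_p, predictable_p) \
    (!(speed_p) ? 2 : (predictable_p) ? 0 : ix86_branch_cost)

So a branch deemed predictable is costed as free, meaning if-conversion will almost never replace it with cmov, while an unpredictable branch is costed per the tuning table.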
Author: jakub
Date: Fri Nov 16 11:40:39 2012
New Revision: 193554

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193554

Log:
	PR target/54073
	* config/i386/i386.md (mov<mode>cc): Use comparison_operator
	instead of ordered_comparison_operator resp.
	ix86_fp_comparison_operator predicates.
	* config/i386/i386.c (ix86_expand_fp_movcc): Reject TImode
	or for -m32 DImode comparisons.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/config/i386/i386.c
	trunk/gcc/config/i386/i386.md
Hopefully fixed on the trunk, not planning to backport it right now.
*** Bug 53346 has been marked as a duplicate of this bug. ***
Hi,

I have done quite a bit of analysis on cmov performance across x86 architectures, so I will share it here in case it helps.

Quick summary: conditional moves on Intel Core/Xeon and AMD Bulldozer architectures should probably be avoided "as a rule."

History: conditional moves were beneficial for the Intel Pentium 4, and also (but less so) for AMD Athlon/Phenom chips. In the AMD Athlon/Phenom case the performance of cmov vs. cmp+branch is determined more by the alignment of the target of the branch than by the prediction rate of the branch. The instruction decoders would incur penalties on certain types of unaligned branch targets (when taken), or when decoding sequences of instructions that contained multiple branches within a 16-byte "fetch" window (taken or not). Cmov was sometimes handy for avoiding those.

With regard to the more current Intel Core and AMD Bulldozer/Bobcat architectures: I have found that the use of conditional moves (cmov) is only beneficial if the branch that the move is replacing is badly mispredicted. In my tests, cmov only became clearly "optimal" when the branch was predicted correctly less than 92% of the time, which is abysmal by modern branch-predictor standards and rarely occurs in practice. Above 97% prediction rates, cmov is typically slower than cmp+branch. Inside loops that contain branches with prediction rates approaching 100% (as is the case presented by the OP), cmov becomes a severe performance bottleneck. This holds true for both Core and Bulldozer. Bulldozer has less efficient branching than the i7, but is also severely bottlenecked by its limited fetch/decode. Cmov requires executing more total instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from the added register pressure that cmov introduces. In practice, the prognosis for cmov being "optimal" is even worse than what I've observed in a controlled environment.

Furthermore, to my knowledge the status of cmov vs. branch performance on x86 will not be changing anytime soon. Cmov will continue to be a liability well into the next couple of architecture releases from Intel and AMD. Piledriver will have added fetch/decode resources but should also have a smaller mispredict penalty, so it's doubtful cmov will gain much advantage there either. Therefore I would recommend setting -fno-tree-loop-if-convert for all -march values matching the Intel Core and AMD Bulldozer/Bobcat families.

There is one good use case for cmov on x86: mispredicted conditions inside of loops. Currently there's no way to force that behavior in situations where I, the programmer, am fully aware that the condition is chaotic/random. A builtin cmov or a condition hint would be nice. For now I'm forced to address those (fortunately infrequent) situations via inline asm.
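To make the inline-asm workaround from the last paragraph concrete, here is a hypothetical helper (my sketch, not code from the thread): select_lt() pins the selection to a cmov via GCC extended asm on x86/x86-64, so the compiler cannot turn it back into a branch:

  /* Returns (a < b) ? x : y using a forced cmov.  */
  static inline int
  select_lt (int a, int b, int x, int y)
  {
    int r = y;
    __asm__ ("cmpl %2, %1\n\t"   /* flags := a - b     */
	     "cmovl %3, %0"      /* if (a < b) r = x   */
	     : "+r" (r)
	     : "r" (a), "r" (b), "r" (x)
	     : "cc");
    return r;
  }

This is exactly the sort of thing a builtin cmov or a condition hint would make unnecessary.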
(In reply to comment #16)
> I have done quite a bit of analysis on cmov performance across x86
> architectures, so I will share it here in case it helps.

I have moved this discussion to PR56309. Let's keep this PR open for the eventual backport of the patch in comment #13 to the 4.7 branch.
GCC 4.7.3 is being released, adjusting target milestone.
Fixed for 4.8.0.