This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[Bug tree-optimization/54073] [4.7 Regression] SciMark Monte Carlo test performance has seriously decreased in recent GCC releases


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54073

Jake Stine <jake.stine at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jake.stine at gmail dot com

--- Comment #16 from Jake Stine <jake.stine at gmail dot com> 2013-02-16 19:12:05 UTC ---
Hi,

I have done quite a bit of analysis on cmov performance across x86
architectures, so I will share here in case it helps:

Quick summary: Conditional moves on Intel Core/Xeon and AMD Bulldozer
architectures should probably be avoided "as a rule."

History: Conditional moves were beneficial for the Intel Pentium 4, and also
(but less-so) for AMD Athlon/Phenom chips.  In the AMD Athlon/Phenom case the
performance of cmov vs cmp+branch is determined more by the alignment of the
target of the branch, than by the prediction rate of the branch.  The
instruction decoders would incur penalties on certain types of unaligned branch
targets (when taken), or when decoding sequences of instructions that contained
multiple branches within a 16byte "fetch" window (taken or not).  cmov was
sometimes handy for avoiding those.

With regard to more current Intel Core and AMD Bulldozer/Bobcat architecture:

I have found that use of conditional moves (cmov) is only beneficial if the
branch that the move is replacing is badly mis-predicted.  In my tests, the
cmov only became clearly "optimal" when the branch was predicted correctly less
than 92% of the time, which is abysmal by modern branch predictor standards and
rarely occurs in practice.  Above 97% prediction rates, cmov is typically
slower than cmp+branch. Inside loops that contain branches with prediction
rates approaching 100% (as is the case presented by the OP), cmov becomes a
severe performance bottleneck.  This holds true for both Core and Bulldozer. 
Bulldozer has less efficient branching than the i7, but is also severely
bottlenecked by its limited fetch/decode.  Cmov requires executing more total
instructions, and that makes Bulldozer very unhappy.

Note that my tests involved relatively simple loops that did not suffer from
the added register pressure that cmov introduces.  In practice, the prognosis
for cmov being "optimal" is even worse than what I've observed in a controlled
environment.  Furthermore, to my knowledge the status of cmov vs. branch
performance on x86 will not be changing anytime soon.  cmov will continue to be
a liability well into the next couple architecture releases from Intel and AMD.
 Piledriver will have added fetch/decode resources but should also have a
smaller mispredict penalty, so its doubtful cmov will gain much advantages
there either.

Therefore I would recommend setting -fno-tree-loop-if-convert for all -march
matching Intel Core and AMD Bulldozer/Bobcat families.


There is one good use-case for cmov on x86:  Mis-predicted conditions inside of
loops.  Currently there's no way to force that behavior in situations where I,
the programmer, am fully aware that the condition is chaotic/random.  A builtin
cmov or condition hint would be nice.  For now I'm forced to address those
(fortunately infrequent) situations via inline asm.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]