bitwise & optimization

Tue Jun 9 15:44:00 GMT 2015

Hi!

This is very similar to my post from around June 2007, and to something 
similar that Linus Torvalds posted 6 months later. I remember one of the GCC 
team members showing the middle finger: the message was that they simply 
wanted to keep Intel ahead of AMD in terms of speed and make sure that GCC 
couldn't rival other compilers in terms of speed (that is the implication 
of not doing this optimization in branchy code).

For my chess program Diep I've posted even more horrible optimizations. 
GCC has the tendency to put such branches in places where I know that 
falling through is going to give lots of mispredicted branches (as the 
total number of branches is too large for the processor to keep track of). 
GCC managed to mess up even further in other places, generating a jump to 
the end of the function and then back. Instruction-wise, the target was 
also outside of the AMD instruction lookahead, which really is slower than 
generating a few CMOV-type instructions or using fewer branches.
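
To make the CMOV point concrete, here is a minimal sketch of the same loop 
written once with a branch and once as a plain select. The function names 
and the data are made up for illustration and are not taken from Diep. 
Whether a given compiler version really emits a CMOV (or a jump) for either 
form depends on the target and the flags, so check the assembly rather than 
assume it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branchy form: the source asks for one conditional jump per element,
// which is expensive when the condition is hard to predict.
std::int64_t sum_above_branchy(const int* v, std::size_t n, int threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > threshold)
            sum += v[i];
    }
    return sum;
}

// Select form: the ternary has no side effects, so the compiler is free
// (but not obliged) to lower it to a CMOV or a mask instead of a jump.
std::int64_t sum_above_select(const int* v, std::size_t n, int threshold) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int add = (v[i] > threshold) ? v[i] : 0;
        sum += add;
    }
    return sum;
}

// Usage sketch: both forms must return the same result.
void check_sum_above() {
    const int v[] = {5, -3, 12, 7, 0, 9};
    assert(sum_above_branchy(v, 6, 6) == 28);  // 12 + 7 + 9
    assert(sum_above_select(v, 6, 6) == 28);
}
```

Comparing the output of "g++ -O3 -S" for the two functions shows what the 
compiler actually decided to do with each form.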

Not rewriting this ugly part of the GCC compiler is the reason why Intel 
C++ is roughly 10-15% faster than GCC, especially in 64-bit mode, and why 
the generated code runs faster on Intel than on AMD processors: the 
instruction lookahead is larger on Intel, whereas OBJECTIVELY the 
generated code is a lot SLOWER.

Ideally you really want the statistics generated with whatever GCC offers 
nowadays, like -fprofile-generate, to be used so that really every branch 
gets parameterized.
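
Until the compiler does this from profile data on its own, the closest 
thing available by hand is a per-branch hint. The sketch below uses the 
GCC builtin __builtin_expect. The macro names and the function are 
hypothetical, and a hint like this is only a manual stand-in for measured 
statistics: it tells GCC which way a branch usually goes, not how often.

```cpp
#include <cassert>

// __builtin_expect is a GCC (and Clang) builtin. The macro names below
// are hypothetical conveniences, not something GCC provides.
#define HINT_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HINT_UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical hot function: the division-by-zero path is marked as rare,
// so the compiler can keep the common path as the fall-through.
int checked_div(int num, int den) {
    if (HINT_UNLIKELY(den == 0))
        return 0;       // rare path
    return num / den;   // common path
}

// Usage sketch: the hint changes code layout only, never the result.
void check_checked_div() {
    assert(checked_div(10, 2) == 5);
    assert(checked_div(10, 0) == 0);
}
```

The measured alternative is to build with -fprofile-generate, run a 
representative workload, and rebuild with -fprofile-use.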

Yet GCC seems to mess up in a lot of ways before such optimizations 
even get a chance to take effect.

If it parameterized branches that way, it would be a compiler that can 
generate code that's really objectively fast, whereas it's dog slow right 
now for branchy code.

Kind Regards,

Vincent Diepeveen
The Netherlands

On Tue, 9 Jun 2015, Fisnik Kastrati wrote:

> To whom it may concern,
>
> I'm turning to you with regards to an unwanted optimization that g++ (v. 
> 4.8.2) is generating, see the code in the following link:
> http://goo.gl/3NVjyc
>
> The assembly code generated for both methods "amp", "ampamp" is 
> practically the same, when using the optimization flag "-O3". However, I'm 
> interested to have a single jump for the code in the method "amp", as branch 
> misprediction penalty is very high otherwise. Is there any optimization flag 
> that I should set, in order to avoid this feature when using "-O3"? I.e., I'd 
> like a generated code similar to icc 13.
>
>
> Thank you in advance
>
>
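
For readers who cannot open the link, the question is roughly about a pair 
of functions like the following. This is a hypothetical reconstruction: 
only the names "amp" and "ampamp" come from the message above, while the 
parameters and bodies are assumptions. Because the comparisons have no 
side effects, the compiler may turn either form into the other, which 
would explain why -O3 produces practically the same assembly for both.

```cpp
#include <cassert>

// Hypothetical reconstruction: only the names amp/ampamp come from the
// question. Parameters and bodies are assumptions.

// Bitwise &: both comparisons are always evaluated and then combined,
// so the source asks for a single test and a single jump.
int amp(int a, int b, int lo, int hi) {
    if ((a > lo) & (b < hi))
        return 1;
    return 0;
}

// Logical &&: short-circuit evaluation, which naturally maps to two jumps.
int ampamp(int a, int b, int lo, int hi) {
    if ((a > lo) && (b < hi))
        return 1;
    return 0;
}

// Usage sketch: the two forms agree for every combination of outcomes.
void check_amp_variants() {
    for (int a = 0; a <= 2; ++a)
        for (int b = 0; b <= 2; ++b)
            assert(amp(a, b, 1, 1) == ampamp(a, b, 1, 1));
}
```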