Math Optimization Flags

Some useful links:

http://gcc.gnu.org/wiki/GeertBosch contains information regarding the semantics of floating point math in GCC.

http://gcc.gnu.org/wiki/FP_BOF contains a summary of the BOF on the general policy for floating point arithmetic in GCC (2007 GCC Summit).

Optimizing SSE and x87 Math

Benjamin Redelings asked a question about math builtins:

I am currently trying to get my code working under g++ 4.0, because of the possibility for improved optimization. My code includes both
  (1) matrix multiplies / dot products
  (2) a lot of log1p / exp calls.
Currently I get quite a LARGE speedup from using -ffast-math, presumably because of the use of FP math builtins for (2). However, if SSE support works well in gcc 4.0, I could also gain a noticeable speedup for (1). Uros Bizjak said that FP math builtins were disabled with -mfpmath=sse, because they were not advantageous on x86_64. However, if this is true, I will not be able to use SSE in my code because I use a lot of log()/exp() calls. What are the prospects that math builtins could be re-enabled for *ia32*?

In reply to this mail, Uros Bizjak wrote a recommendation on which flags to use for math optimization:

If x87 intrinsics are needed, then you should use -mfpmath=387. This flag does not mean that all SSE instructions will be disabled, but it instructs the compiler to use x87 registers and instructions for float and double (for SSE2) math calculations.

Consider this code:

double test(double a, double b) {
 double c;

 c = 1.0 + sin (a + b*b);

 return c;
}

Without disabling the intrinsic sin function, -mfpmath=sse would generate:

       movsd   16(%ebp), %xmm0
       mulsd   %xmm0, %xmm0
       addsd   8(%ebp), %xmm0
       movsd   %xmm0, (%esp)
       fsin
       fstpl   -8(%ebp)
       movsd   -8(%ebp), %xmm0
       addsd   .LC1, %xmm0
       movsd   %xmm0, -8(%ebp)
       fldl    -8(%ebp)

You can see that a lot of moving is needed to get the values into the right registers. With -mfpmath=387, the generated asm code looks a lot better:

       fldl    16(%ebp)
       fmul    %st(0), %st
       faddl   8(%ebp)
       popl    %ebp
       fsin
       faddl   .LC1

The cost of moving values between registers and memory, for either SSE or x87 registers, is quite high, and could easily kill the performance gains of SSE code (additional min/max insns and the non-stack nature of SSE registers, without the need for exchanging registers to death).

However, if you need an x87 intrinsic in SSE code, you can use -mfpmath=sse,387 or you could use the long double versions of the intrinsic functions. These are always enabled. By using sinl() in the above code, you get an x87 fsin instruction even with -mfpmath=sse (please note that as long as "long double" math is needed, the calculation stays in x87):

       movsd   16(%ebp), %xmm0
       mulsd   %xmm0, %xmm0
       addsd   8(%ebp), %xmm0
       movsd   %xmm0, -8(%ebp)
       fldl    -8(%ebp)
       fsin
       fld1
       leave
       faddp   %st, %st(1)

With -mfpmath=387, SSE code can be produced by SSE intrinsics. A (scalar!) example:

#include <xmmintrin.h>

float test (float a, float b) {
 __m128 A, B, C;
 float c;

 A = _mm_set_ss(a);
 B = _mm_set_ss(b);

 // stuff
 C = _mm_mul_ss(A, B);

 _mm_store_ss(&c, C);
 return c;
}

will be compiled to:

       movss   8(%ebp), %xmm0
       movss   12(%ebp), %xmm1
       mulss   %xmm1, %xmm0
       movss   %xmm0, -4(%ebp)
       flds    -4(%ebp)

So, your dot product kernel should be implemented with (vectorized) SSE intrinsics, and other scalar code with x87 instructions. This will minimize register shuffling and optimize resource usage: scalar code should use x87 (to use it with x87 intrinsics) and vector code should use the SSE unit. You have a lot of possibilities to control resource usage that were actually impossible with previous gcc versions.
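
A vectorized dot product kernel along these lines could be sketched as follows. The function name dot_sse and its signature are hypothetical, unaligned loads are used so that no alignment assumption is needed, and a plain scalar loop handles the tail (and the whole array when SSE is not available). Treat it as an illustration of the approach rather than a tuned kernel.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dot product of two float arrays of length n. */
float dot_sse(const float *x, const float *y, size_t n) {
    float sum = 0.0f;
    size_t i = 0;
#if defined(__SSE__)
    /* Four partial sums, one per SSE lane. */
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Because the partial sums are accumulated per lane, the result can differ in the last bits from a strictly left-to-right scalar sum.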

However, it is true that the real performance impact of disabled x87 intrinsics will only show once the ABI and the math library functions are changed to something similar to x86_64.

HTH, Uros.

A note on how to use the vector extensions is in the Using GCC manual. See the Vector extensions page, and look at the appropriate builtins for the architecture you care about.
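
As a rough illustration of the vector extensions, the same kind of dot product can be written without architecture-specific intrinsics. The type name v4sf and the function name dot_vec are chosen here for the example, and subscripting a vector value (acc[0]) is assumed to be available, which requires a GCC release considerably newer than 4.0.

```c
#include <stddef.h>
#include <string.h>

/* A vector of four floats (16 bytes), using the GCC vector extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: dot product of two float arrays of length n. */
float dot_vec(const float *x, const float *y, size_t n) {
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4sf vx, vy;
        /* memcpy avoids alignment and aliasing assumptions on x and y. */
        memcpy(&vx, x + i, sizeof vx);
        memcpy(&vy, y + i, sizeof vy);
        acc += vx * vy;                    /* element-wise multiply-add */
    }
    /* Sum the four lanes, then handle the remaining 0..3 elements. */
    float sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

The compiler maps the vector operations to whatever the target provides, so on ia32 the SSE unit is only used when it is enabled (for example with -msse).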