Some useful links: one page contains information regarding the semantics of floating point math in GCC; another contains a summary of the general policy of floating point arithmetic in GCC, from the BOF at the 2007 GCC Summit.

Optimizing SSE and x87 Math

Benjamin Redelings asked a question about math builtins

In reply to this mail, Uros Bizjak wrote a recommendation on which flags to use for math optimization:

If x87 intrinsics are needed, then you should use -mfpmath=387. This flag does not disable all SSE instructions; it instructs the compiler to use x87 registers and instructions for float and double math calculations (double being the type that SSE2 would otherwise handle).

Consider this code:

#include <math.h>

double test(double a, double b) {
 double c;

 c = 1.0 + sin (a + b*b);

 return c;
}

Without disabling the intrinsic sin function, -mfpmath=sse would generate:

       movsd   16(%ebp), %xmm0
       mulsd   %xmm0, %xmm0
       addsd   8(%ebp), %xmm0
       movsd   %xmm0, (%esp)
       fstpl   -8(%ebp)
       movsd   -8(%ebp), %xmm0
       addsd   .LC1, %xmm0
       movsd   %xmm0, -8(%ebp)
       fldl    -8(%ebp)

You can see that a lot of moves are needed to get the values into the right registers. With -mfpmath=387, the produced asm code looks a lot better:

       fldl    16(%ebp)
       fmul    %st(0), %st
       faddl   8(%ebp)
       popl    %ebp
       faddl   .LC1

The cost of moving values between registers and memory, for either SSE or x87 registers, is quite high, and could easily kill the performance gains of SSE code (additional min/max insns, and the non-stack nature of SSE registers, which avoids exchanging registers to death).

However, if you need an x87 intrinsic in SSE code, you can use -mfpmath=sse,387, or you could use the long double versions of the intrinsic functions. These are always enabled. By using sinl() in the above code, you get the x87 sin intrinsic even with -mfpmath=sse (please note that as long as "long double" math is needed, the calculation stays in x87):

       movsd   16(%ebp), %xmm0
       mulsd   %xmm0, %xmm0
       addsd   8(%ebp), %xmm0
       movsd   %xmm0, -8(%ebp)
       fldl    -8(%ebp)
       faddp   %st, %st(1)

With -mfpmath=387, SSE code can be produced by SSE intrinsics. A (scalar!) example:

#include <xmmintrin.h>

float test (float a, float b) {
 __m128 A, B, C;
 float c;

 A = _mm_set_ss(a);
 B = _mm_set_ss(b);

 // stuff
 C = _mm_mul_ss(A, B);

 _mm_store_ss(&c, C);
 return c;
}

will be compiled to:

       movss   8(%ebp), %xmm0
       movss   12(%ebp), %xmm1
       mulss   %xmm1, %xmm0
       movss   %xmm0, -4(%ebp)
       flds    -4(%ebp)

So, your dot product kernel should be implemented with (vectorized) SSE intrinsics, and other scalar code with x87 instructions. This will minimize register shuffling and optimize resource usage: scalar code should use x87 (so that it can use the x87 intrinsics) and vector code should use the SSE unit. You have a lot of possibilities to control resource usage that were actually impossible with previous gcc versions.
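As an illustration of the vectorized half of that advice, here is a minimal sketch of a dot product kernel written with SSE intrinsics. It is not the code from the original thread: the name dot_sse is hypothetical, and the sketch assumes unaligned input arrays (hence _mm_loadu_ps) and sums the four lanes through a small temporary array.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: dot product of two float arrays of length n. */
float dot_sse(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();   /* four partial sums, all zero */
    float lanes[4];
    float sum;
    size_t i = 0;

    /* Vector part: multiply and accumulate four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned loads */
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }

    /* Horizontal sum of the four partial sums. */
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        sum += x[i] * y[i];

    return sum;
}
```

On a 32-bit x86 target this would be built with SSE enabled (for example -msse) together with -mfpmath=387, so that the kernel uses the SSE unit while the surrounding scalar code stays on x87.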

However, it is true that the real performance impact of disabled x87 intrinsics will show when the ABI and math library functions are changed to something similar to x86_64.

HTH, Uros.

A note on how to use the vector extensions is in the Using GCC manual. See the Vector extensions page, and look at the appropriate builtins for the architecture you care about.
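A minimal sketch of the vector extensions, for orientation: the vector_size attribute and element-wise arithmetic on vector types are GCC extensions described in the manual, while the names v4sf and madd4 here are only illustrative. Subscripting a vector with [] requires a reasonably recent GCC.

```c
/* A vector of four floats (16 bytes), declared with GCC's vector_size
 * attribute rather than with target-specific intrinsics. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: element-wise a * b + c. The compiler picks the
 * instructions for the target, e.g. SSE on x86. */
v4sf madd4(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}
```

Unlike the intrinsics in xmmintrin.h, code written this way is not tied to one architecture, at the cost of less direct control over which instructions are emitted.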

None: Math_Optimization_Flags (last edited 2008-01-10 19:39:00 by localhost)