Some useful links:
http://gcc.gnu.org/wiki/GeertBosch contains information regarding the semantics of floating point math in GCC.
http://gcc.gnu.org/wiki/FP_BOF contains a summary of the BOF on the general policy for floating point arithmetic in GCC (2007 GCC Summit).
Optimizing SSE and x87 Math
Benjamin Redelings asked a question about math builtins:
- I am currently trying to get my code working under g++ 4.0, because of the possibility for improved optimization. My code includes both
(1) matrix multiplies / dot products
(2) a lot of log1p / exp calls.
In reply to this mail, Uros Bizjak recommended which flags to use for math optimization:
If x87 intrinsics are needed, then you should use -mfpmath=387. This flag does not mean that all SSE instructions will be disabled, but it instructs the compiler to use x87 registers and instructions for float and double (for SSE2) math calculations.
Consider this code:
#include <math.h>

double test(double a, double b)
{
    double c;

    c = 1.0 + sin (a + b*b);
    return c;
}
Without disabling the intrinsic sin function, -mfpmath=sse would generate:
        movsd   16(%ebp), %xmm0
        mulsd   %xmm0, %xmm0
        addsd   8(%ebp), %xmm0
        movsd   %xmm0, (%esp)
        fsin
        fstpl   -8(%ebp)
        movsd   -8(%ebp), %xmm0
        addsd   .LC1, %xmm0
        movsd   %xmm0, -8(%ebp)
        fldl    -8(%ebp)
You can see that a lot of moves are needed to get the values into the right registers. With -mfpmath=387, the produced asm code looks a lot better:
        fldl    16(%ebp)
        fmul    %st(0), %st
        faddl   8(%ebp)
        popl    %ebp
        fsin
        faddl   .LC1
The cost of moving a value between memory and either an SSE or an x87 register is quite high, and it can easily kill the performance gains of SSE code (the additional min/max insns, and the non-stack nature of SSE registers, which avoids exchanging registers to death).
However, if you need an x87 intrinsic in SSE code, you can use -mfpmath=sse,387, or you could use the long double versions of the intrinsic functions. These are always enabled. By using sinl() in the above code, you get the x87 fsin instruction even with -mfpmath=sse (please note that as long as "long double" math is needed, the calculation stays in x87):
        movsd   16(%ebp), %xmm0
        mulsd   %xmm0, %xmm0
        addsd   8(%ebp), %xmm0
        movsd   %xmm0, -8(%ebp)
        fldl    -8(%ebp)
        fsin
        fld1
        leave
        faddp   %st, %st(1)
With -mfpmath=387, SSE code can be produced by SSE intrinsics. A (scalar!) example:
#include <xmmintrin.h>

float test (float a, float b)
{
    __m128 A, B, C;
    float c;

    A = _mm_set_ss(a);
    B = _mm_set_ss(b);

    // stuff
    C = _mm_mul_ss(A, B);

    _mm_store_ss(&c, C);
    return c;
}
will be compiled to:
        movss   8(%ebp), %xmm0
        movss   12(%ebp), %xmm1
        mulss   %xmm1, %xmm0
        movss   %xmm0, -4(%ebp)
        flds    -4(%ebp)
So, your dot product kernel should be implemented with (vectorized) SSE intrinsics, and other scalar code with x87 instructions. This will minimize register shuffling and optimize resource usage: scalar code should use x87 (so that it can use the x87 intrinsics) and vector code should use the SSE unit. You have a lot of possibilities to control resource usage that were actually impossible with previous gcc versions.
However, it is true that the real performance impact of disabled x87 intrinsics will only show once the ABI and the math library functions are changed to something similar to x86_64.
HTH, Uros.
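As an illustration of the advice in the mail above, here is a minimal sketch of a vectorized single-precision dot product written with SSE intrinsics. It is not code from the original thread: the function name dot_sse and its signature are hypothetical, the inputs are assumed to be plain (possibly unaligned) float arrays, and only SSE1 intrinsics from <xmmintrin.h> are used.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics (x86 only) */

/* Hypothetical helper, not from the original mail:
   dot product of two float arrays of length n. */
float dot_sse(const float *x, const float *y, size_t n)
{
    __m128 acc = _mm_setzero_ps();   /* four running partial sums */
    size_t i = 0;

    /* Main loop: four products per iteration, accumulated lane-wise.
       Unaligned loads are used because alignment is not assumed. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }

    /* Horizontal sum of the four lanes of acc. */
    __m128 hi    = _mm_movehl_ps(acc, acc);   /* lanes 2,3 moved to lanes 0,1 */
    __m128 sum2  = _mm_add_ps(acc, hi);       /* lane 0 = a0+a2, lane 1 = a1+a3 */
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1  = _mm_add_ss(sum2, lane1);   /* lane 0 = a0+a1+a2+a3 */

    float result;
    _mm_store_ss(&result, sum1);

    /* Scalar tail for the remaining 0..3 elements. */
    for (; i < n; ++i)
        result += x[i] * y[i];

    return result;
}
```

On 32-bit x86 this would typically be built with something like gcc -O2 -msse -mfpmath=387, so that the kernel uses the SSE unit while the surrounding scalar code stays on x87. Note that the lane-wise accumulation sums the products in a different order than a plain scalar loop, so results can differ in the last bits.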
A note on how to use the vector extensions is in the Using GCC manual. See the Vector extensions page, and look at the appropriate builtins for the architecture you care about.
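For a rough idea of what the vector extensions look like, here is a small sketch using the vector_size attribute. The type name v4sf and the helpers madd4 and sum4 are made up for this example, and subscripting a vector with [] assumes a reasonably recent GCC.

```c
/* GCC vector extension: four floats packed into one 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: element-wise a*b + c.
   The usual arithmetic operators work lane by lane on vector types. */
v4sf madd4(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}

/* Hypothetical helper: sum of the four lanes, read by subscripting. */
float sum4(v4sf v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```

Unlike the <xmmintrin.h> intrinsics, this form is not tied to one instruction set: the compiler chooses the instructions for the target, and falls back to scalar code when no suitable vector unit is enabled.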