Fldcw, rounding and optimizations


I have code like this:

#define DEFAULT_CW 0x0000037F
int _roundceil=DEFAULT_CW|0x0800;
int _roundfloor=DEFAULT_CW|0x0400;
int _roundtrunc=DEFAULT_CW|0x0c00;
int _roundnormal=DEFAULT_CW;
static inline void fpu_set_roundceil () { asm ("fldcw _roundceil"); }
static inline void fpu_set_roundfloor () { asm ("fldcw _roundfloor"); }
static inline void fpu_set_roundtrunc () { asm ("fldcw _roundtrunc"); }
static inline void fpu_set_roundnormal () { asm ("fldcw _roundnormal"); }
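
For reference, the constants above only touch the rounding-control (RC) field of the x87 control word, bits 10-11. Here is a small sketch that checks this. The names RC_MASK, rc_field and check_control_words are made up for this post and are not part of my actual program.

```c
#include <assert.h>

#define DEFAULT_CW 0x0000037F
#define RC_MASK    0x0C00  /* rounding-control field: bits 10-11 of the control word */

/* Return the 2-bit rounding-control value: 0 nearest, 1 down, 2 up, 3 truncate. */
static int rc_field(int cw)
{
    return (cw & RC_MASK) >> 10;
}

/* Check that each mode word differs from DEFAULT_CW only in the RC field. */
static void check_control_words(void)
{
    const int words[4] = { DEFAULT_CW, DEFAULT_CW | 0x0400,
                           DEFAULT_CW | 0x0800, DEFAULT_CW | 0x0c00 };
    int i;
    for (i = 0; i < 4; i++) {
        assert((words[i] & ~RC_MASK) == (DEFAULT_CW & ~RC_MASK));
        assert(rc_field(words[i]) == i);
    }
}
```

So _roundfloor selects RC=1 (toward minus infinity), _roundceil RC=2 (toward plus infinity) and _roundtrunc RC=3 (toward zero), with the exception masks and precision bits left as in DEFAULT_CW.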

I know about fenv.h, but I'm trying to save some cycles. With the fpu_set_* functions above, the function calls are inlined away and the only thing left is a single fldcw per change in the rounding mode. However, the FPU optimizer in gcc/g++ is very fragile in this situation. For complete code, see

I've tested with egcs 2.91, g++ 2.95 and gcc-3.4-20030806. With the exact code at the above URL, the FPU optimizer produces erroneous code. How do I know it's erroneous? The floating point output at -O0 isn't the same as at -O1. More importantly, -O0 produces an interval that contains the actual correct answer, but -O1 doesn't. Somehow -O2 produces the same thing as -O0, but I'm not sure that happens all the time.

It is understandable that the FPU optimizing code would treat some calculations as duplicates and merge them if it doesn't take the rounding mode into account. However, this behavior is unacceptable for my application. How do I make certain that the FPU optimizing code doesn't screw up my program?

Also, should this be considered a bug in g++ (the FPU optimizer changes the semantics of my program, hence the optimizer is wrong) or a bug in my code (I'm inserting inline assembler behind g++'s back, which g++ can't be expected to grok anyway)? Are the g++ FPU optimizations always correct in the absence of rounding mode tweakage? (Do all programs output the exact same floating point numbers at all optimization levels, if the floating point rounding mode is left alone?)

The weird thing is that if I use fenv.h, so far g++ produces the correct code at all optimization levels (although I haven't done extensive tests yet). Is it because a function call somehow serves as a barrier to reordering FPU operations, and fenv.h introduces a call per rounding mode change? (This cost is prohibitive for me, by the way.) If a function call introduces a barrier preventing g++ from reordering FPU operations, shouldn't an inline asm statement do the same?

Lastly, if I want to optimize with current compilers, but I want the semantics of my program to be correct, what optimization flags should I use to prevent the reordering of FPU operations that's causing me problems right now?


Sebastien Loisel

PS. Please cc me in your replies.

