


x87 float truncation/accuracy (gcc vs. icc/msvc)


My boss recently alerted me to an anomalous performance change in
a piece of code he was working on.  The reduced test case is shown
below:

float foo(float *x)
{
  int i;
  float y = 0.0;
  for (i=0; i<10; i++)
    y += 2.0*x[i];
  return y;
}

which on x87 runs at less than half the speed of the same code with the
"2.0" changed to "2.0f"!  The cause can be seen in the assembly output:

foo:    subl    $4, %esp
        fldz
        movl    8(%esp), %edx
        xorl    %eax, %eax
.L6:    flds    (%edx,%eax,4)
        incl    %eax
        cmpl    $9, %eax
        fadd    %st(0), %st
        faddp   %st, %st(1)
        fstps   (%esp)           <--- here
        flds    (%esp)           <--- here
        jle     .L6
        popl    %eax
        ret


The two instructions marked "here" are removed when the float constant
is used, but remain with the double constant, even with -ffast-math.
The problem is that these instructions round %st(0) to "float" precision
by storing it to memory and reading it back in again on every iteration.
Changing the type of "y" to double also avoids the problem (even though
the addition is actually performed in XFmode, we don't attempt to round
the result correctly down to DFmode).
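
For reference, here is a sketch of the two source-level workarounds
described above (the function names are just labels for this sketch,
not anything in the original code):

/* Variant 1: use a float constant, so the whole computation stays in
   SFmode and no truncation is needed inside the loop.  */
float foo_fconst(float *x)
{
  int i;
  float y = 0.0f;
  for (i=0; i<10; i++)
    y += 2.0f*x[i];
  return y;
}

/* Variant 2: accumulate in double, so the sum is kept in a register
   and only converted down to float once, at the return.  */
float foo_dacc(float *x)
{
  int i;
  double y = 0.0;
  for (i=0; i<10; i++)
    y += 2.0*x[i];
  return (float)y;
}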

In the original code, the two variants (differing only in the
coefficient's type) actually produced significantly different results.
Particularly interesting is that both the Microsoft Visual C/C++
compiler and Intel's icc *by default* completely optimized away this
"float_truncate", producing incorrectly rounded results.

The same problem also hurts GCC on the popular "mflop" benchmark.
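
To make the accuracy difference concrete, here is a small
self-contained sketch (mine, not from the original code) contrasting a
sum that is rounded back to float on every iteration with one kept in
extended precision and rounded only once.  It assumes a C99 compiler
(for the hex float constants) and the default round-to-nearest mode:

#include <stdio.h>

/* Round the running sum back to float on every iteration, as GCC's
   fstps/flds sequence does.  The volatile store forces the rounding
   even when the value would otherwise stay in an x87 register.  */
static float sum_truncated(const float *x, int n)
{
  volatile float y = 0.0f;
  int i;
  for (i=0; i<n; i++)
    y = (float)(y + 2.0*x[i]);
  return y;
}

/* Keep the running sum in extended precision and round only once at
   the end, which is roughly what dropping the float_truncate does.  */
static float sum_extended(const float *x, int n)
{
  long double y = 0.0L;
  int i;
  for (i=0; i<n; i++)
    y += 2.0*x[i];
  return (float)y;
}

int main(void)
{
  /* 2*x[0] is 1.0; each remaining term is a quarter of an ulp of
     1.0f, so it is lost whenever the sum is rounded to float.  */
  float x[10] = { 0.5f, 0x1p-26f, 0x1p-26f, 0x1p-26f, 0x1p-26f,
                  0x1p-26f, 0x1p-26f, 0x1p-26f, 0x1p-26f, 0x1p-26f };
  printf("truncated: %.9g\n", sum_truncated(x, 10));  /* 1           */
  printf("extended:  %.9g\n", sum_extended(x, 10));   /* ~1.00000024 */
  return 0;
}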


My interest now is how best to guard this transformation/optimization
behind flag_unsafe_math_optimizations.  My analysis so far is that this
is an i386.md-specific transformation.  On many machines "float"
operations are faster than "double", and their hardware often supports
efficient "double->float" conversion.  The IA-32 architecture, on the
other hand, seems unique in that commercial compilers feel free to
treat "truncdfsf2" as a no-op, in the same way as "extendsfdf2".

Do any of the x86 backend gurus have any suggestions as to how best
to implement "truncdfsf2" as a move between x87 registers, but as a
regular "fst*s" instruction for memory targets?  My initial attempt
was to simply guard the following splitter with !flag_unsafe_math_...

(define_split
  [(set (match_operand:SF 0 "register_operand" "")
        (float_truncate:SF
         (match_operand:DF 1 "fp_register_operand" "")))
   (clobber (match_operand:SF 2 "memory_operand" ""))]
  "TARGET_80387 && reload_completed"
  [(set (match_dup 2) (float_truncate:SF (match_dup 1)))
   (set (match_dup 0) (match_dup 2))]
  "")

Alas this failed miserably.

Any advice would be much appreciated.  I've confirmed that GCC performs
the related "safe" constant folding optimizations, such as converting
"(float)((double)f1 op (double)f2)" into "f1 op f2" for floating point
values f1 and f2, where op is one of add, sub or mul.  For "mul", for
example, the product of two 24-bit IEEE "float" mantissas can't overflow
the 53-bit mantissa of an IEEE double, so there's no double rounding and
the floating point multiplication returns the same (perfectly rounded)
result.  These don't help the code above, however, which is
fundamentally unsafe and not normally a win except on Intel.
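
For what it's worth, here is a small sketch (not from the original
report) of the kind of check that backs up the "mul" argument: the
double product of two floats is exact, so rounding it once to float
has to agree with the single-precision multiply.  The volatiles are
there to force both results down to SFmode even on x87:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  int i;
  srand(1);
  for (i=0; i<1000000; i++) {
    float a = (float)rand() / 3.0f;
    float b = (float)rand() / 7.0f;
    volatile float single = a * b;
    volatile float via_double = (float)((double)a * (double)b);
    if (single != via_double) {
      printf("mismatch: %.9g * %.9g\n", (double)a, (double)b);
      return 1;
    }
  }
  printf("no double-rounding mismatches in 1000000 trials\n");
  return 0;
}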

Roger
--

