This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Problems with FP compares / mixed SSE and I387 code


Hello!

From recent discussions regarding SSE performance [1], and from real-world application performance, where adding -mno-80387 gives quite an improvement, it looks like mixing SSE and I387 code causes a big performance problem. Another example is [2], where it was shown that "The slowdown is caused by the shuffling of floating point values between SSE registers and the x87 registers...", which led to disabling all x87 intrinsics for SSE math.

Some progress was achieved by separating SSE and x87 floating point operators by TARGET_SSE_MATH, and it was again shown that a small (ehm...) oversight in i386.md produced another big performance problem [3].

In light of these reports, I would suggest this solution:
- with TARGET_SSE_MATH, _all_ SFmode values should be processed in SSE registers.
- with TARGET_SSE_MATH && TARGET_SSE2, _all_ SFmode and DFmode values should be processed in SSE registers.
- Only XFmode (and DFmode in case of !TARGET_SSE2) values are allowed to enter FP registers.
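
The three rules above can be written as a small decision function. This is only a sketch: the enum and function names are hypothetical and are not GCC identifiers, and the two flags merely stand in for TARGET_SSE_MATH and TARGET_SSE2.

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; not GCC identifiers.  */
enum fp_mode { MODE_SF, MODE_DF, MODE_XF };
enum fp_unit { UNIT_X87, UNIT_SSE };

/* sse_math stands in for TARGET_SSE_MATH, sse2 for TARGET_SSE2.
   Returns the register file in which a value of MODE should live.  */
static enum fp_unit
unit_for_mode (enum fp_mode mode, bool sse_math, bool sse2)
{
  if (!sse_math)
    return UNIT_X87;            /* plain x87 math: everything on the x87 stack */

  switch (mode)
    {
    case MODE_SF:
      return UNIT_SSE;          /* SFmode always goes to SSE */
    case MODE_DF:
      return sse2 ? UNIT_SSE : UNIT_X87;   /* DFmode needs SSE2 */
    case MODE_XF:
      return UNIT_X87;          /* XFmode has no SSE equivalent */
    }
  return UNIT_X87;
}
```

Under this sketch, a value never changes unit for a given set of flags, which is exactly the property that avoids the register shuffling.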


Anything else should be considered a bug, because of its huge performance impact. An x87->SSE or SSE->x87 move goes through memory and takes ~14 cycles on P4, whereas (for example) an FP multiply takes 7 cycles... There is a notable exception for SSE cvttsd2si, where it is a win for x87 code, but even in this case, TARGET_NOCONA should use the fisttp x87 insn.
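
To put those numbers in proportion, here is a deliberately naive estimate that just adds up the cycle counts quoted above. The additive model and the function name are assumptions for illustration; a real P4 overlaps instructions, so this is not a measurement.

```c
/* Approximate P4 cycle counts quoted in the message.  */
enum { TRANSFER_CYCLES = 14, FMUL_CYCLES = 7 };

/* Naive serial estimate (an assumption: no pipelining or overlap):
   each multiply and each x87<->SSE transfer costs its full latency.  */
static int
chain_cycles (int multiplies, int transfers)
{
  return multiplies * FMUL_CYCLES + transfers * TRANSFER_CYCLES;
}
```

In this model, one multiply whose operand is moved into the other unit and whose result is moved back costs chain_cycles (1, 2) = 35 cycles against 7 for the multiply alone.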

As a side note, only "unsupported" x87 intrinsics need to be disabled. XFmode intrinsics should stay enabled, because XFmode values are already in x87 registers and no shuffling penalty occurs. The same goes for DFmode values in the !TARGET_SSE2 case.

I have tried to attack FP compares, but there are quite a few problems, as x87/SSE compares (fcomi/comis?), x87/SSE cmoves and SSE min/max code are badly interconnected. It is not possible to just disable FP compares, because an SSE cmove can be converted to an FP cmove under some conditions. This deficiency is shown in the simple example below:

double test (double a, double b) {
 double x;

 x = (a < b) ? a : 1.0;
 return x;
}

When compiled with '-O2 -ffast-math -march=pentium4 -S -mfpmath=sse -fomit-frame-pointer', we get this:
test:
subl $12, %esp
movsd 16(%esp), %xmm0
fld1
movsd %xmm0, (%esp)
fldl (%esp)
movsd 24(%esp), %xmm1
comisd %xmm0, %xmm1
fcmovbe %st(1), %st
fstp %st(1)
addl $12, %esp
ret


This happens because the output register is expected to be st(0), so reload gets confused. The sse_movdfcc pattern now gets a mixture of registers with the relevant reloads...

(insn:HI 47 53 31 0 (parallel [
           (set (reg/v:DF 8 st [orig:58 x ] [58])
               (if_then_else:DF (lt (reg/v:DF 21 xmm0 [orig:60 a ] [60])
                       (reg:DF 22 xmm1))
                   (reg:DF 9 st(1))
                   (reg/v:DF 8 st [orig:58 x ] [58])))
           (clobber (scratch:DF))
           (clobber (reg:CC 17 flags))
       ]) 484 {sse_movdfcc} (insn_list:REG_DEP_TRUE 6 (nil))
   (nil))

This pattern is then processed by splitters that depend on:
 "!SSE_REG_P (operands[0]) && reload_completed"
or
 "SSE_REG_P (operands[0]) && reload_completed"

and we get the above asm code.

Now, let us force the output register to be an SSE register. Using "return sqrt(x);" is enough, as x87's fsqrt is disabled by TARGET_SSE_MATH. This forces the reload pass to use an SSE register as the output register of the SSE cmove. The resulting asm code is something to show!

test:
       subl    $12, %esp
       movsd   16(%esp), %xmm0
       movapd  %xmm0, %xmm2
       movsd   .LC1, %xmm1
       cmpltsd 24(%esp), %xmm2
       andpd   %xmm2, %xmm0
       andnpd  %xmm1, %xmm2
       orpd    %xmm2, %xmm0
       sqrtsd  %xmm0, %xmm0
       movsd   %xmm0, (%esp)
       fldl    (%esp)
       addl    $12, %esp
       ret

Well - no additional costly memory moves. Even the remaining memory moves could be eliminated with some kind of sse_regparm attribute (as is the default on x86_64) and by returning the resulting FP value in an SSE register for static functions.
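
For readers less familiar with the cmpltsd/andpd/andnpd/orpd idiom in the listing above, the following portable C sketch shows the same branchless select on the bit patterns of the doubles. The function names are hypothetical, and the third argument stands in for the constant loaded from .LC1, whose value is not shown in the listing.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern and back (no conversion).  */
static uint64_t
bits_of (double d)
{
  uint64_t u;
  memcpy (&u, &d, sizeof u);
  return u;
}

static double
double_of (uint64_t u)
{
  double d;
  memcpy (&d, &u, sizeof d);
  return d;
}

/* Hypothetical helper: returns a if a < b, otherwise other.  */
double
select_lt (double a, double b, double other)
{
  uint64_t mask = (a < b) ? UINT64_MAX : 0;   /* cmpltsd: all ones or all zeros */
  uint64_t keep = bits_of (a) & mask;         /* andpd:   a where the mask is set */
  uint64_t alt = bits_of (other) & ~mask;     /* andnpd:  other where it is clear */
  return double_of (keep | alt);              /* orpd:    merge the two halves */
}
```

The point of the idiom is that the whole selection stays in SSE registers: no flags, no x87 cmove, and therefore no trip through memory.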

Actually, SSE and x87 code share the same execution resources, so even though the insn opcodes are different, these instructions compete for the same hardware. A multiplier takes up quite a large silicon area, so I doubt that there will be multiple multipliers on chip in the near future. Currently x87, SSE and _integer_ multiply insns share the same multiplier... So the question is whether -mfpmath=sse,387 is really worth implementing...

As SSE code will be used more and more, these bugs should be fixed to get SSE to its claimed performance [4]. I guess this is the reason that SSE code is consistently 5% slower than x87 code in the povray benchmark.

Another problem with x87 code is that x87 does not provide FP compares (fcomi) with memory operands. SSE compares can take a memory operand, so every compare that gets combined to use a memory operand will be implemented as an SSE compare, with all the necessary register shuffling.

The problems with FP compares are summarized in [5].

Uros.

[1] http://gcc.gnu.org/ml/gcc/2005-01/msg00345.html
[2] http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01877.html
[3] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19240
[4] 'info gcc', i386 and x86-64 options, -mfpmath=sse
[5] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19252

