This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Problems with FP compares / mixed SSE and I387 code
- From: Uros Bizjak <uros at kss-loka dot si>
- To: gcc-patches at gcc dot gnu dot org
- Date: Fri, 07 Jan 2005 10:34:48 +0100
- Subject: Problems with FP compares / mixed SSE and I387 code
Hello!
From recent discussions regarding SSE performance [1], and from
real-world application performance, where adding -mno-80387 makes quite
an improvement, it looks like mixing SSE and I387 code really causes
big performance problems. Another example is [2], where it was shown
that "The slowdown is caused by the shuffling of floating point values
between SSE registers and the x87 registers...", which led to the
disabling of all x87 intrinsics for SSE math.
Some progress was achieved by separating SSE and x87 floating point
operators by TARGET_SSE_MATH, and it was again shown that a small
(ehm...) oversight in i386.md produced another big performance problem [3].
Regarding these reports, I would suggest this solution:
- with TARGET_SSE_MATH, _all_ SFmode values should be processed in SSE
registers.
- with TARGET_SSE_MATH && TARGET_SSE2, _all_ SFmode and DFmode values
should be processed in SSE registers.
- Only XFmode (and DFmode in case of !TARGET_SSE2) values are allowed to
enter FP registers.
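To make the proposed rule concrete, here is a minimal C sketch (the function name is mine): compiled with -O2 -msse2 -mfpmath=sse, the float and double operands should stay entirely in %xmm registers, and only the long double (XFmode) value should touch the x87 stack:

```c
#include <assert.h>

/* float (SFmode) and double (DFmode) arithmetic should stay in SSE
   registers under -msse2 -mfpmath=sse; only the long double (XFmode)
   operand is allowed to enter the x87 register stack. */
double mix(float f, double d, long double ld)
{
    return f * d + (double) ld;
}
```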
Anything else should be considered a bug, because of its huge
performance impact. An x87->SSE or SSE->x87 move goes through memory and
takes ~14 cycles on P4, where e.g. an FP multiply takes 7 cycles... There
is a notable exception for SSE cvttsd2si, which is a win even for x87
code, but even in this case, TARGET_NOCONA should use the fisttp x87 insn.
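To illustrate that exception, here is a minimal sketch (the function name is mine): a plain C truncating conversion is a single cvttsd2si with SSE2, whereas pre-SSE3 x87 code has to switch the FP control word to round-to-zero around fistp, which is exactly what fisttp on Nocona avoids:

```c
#include <assert.h>

/* Truncating double->int conversion: one cvttsd2si instruction with
   SSE2.  Without it, x87 must save/restore the FP control word around
   fistp to get round-to-zero, or use SSE3's fisttp, which truncates
   regardless of the control word. */
int trunc_to_int(double x)
{
    return (int) x;
}
```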
As a side note, only the "unsupported" x87 intrinsics need to be
disabled. So XFmode intrinsics should still be active, because XFmode
values are already in x87 registers and no shuffling penalty occurs. The
same goes for DFmode values in the !TARGET_SSE2 case.
I have tried to attack FP compares, but there are quite a few problems,
as the x87/SSE compare (fcomi/comis?), x87/SSE cmove and SSE min/max
code is badly interconnected. It is not possible to simply disable FP
compares, because an SSE cmove can be converted to an FP cmove under
some conditions. This deficiency is shown in the simple example below:
double test (double a, double b) {
    double x;
    x = (a < b) ? a : 1.0;
    return x;
}
When compiled with '-O2 -ffast-math -march=pentium4 -S -mfpmath=sse
-fomit-frame-pointer', we get this:
test:
subl $12, %esp
movsd 16(%esp), %xmm0
fld1
movsd %xmm0, (%esp)
fldl (%esp)
movsd 24(%esp), %xmm1
comisd %xmm0, %xmm1
fcmovbe %st(1), %st
fstp %st(1)
addl $12, %esp
ret
This happens because the output register is expected to be st(0), so
reload gets confused. The sse_movdfcc pattern now gets a mixture of
registers with the relevant reloads...
(insn:HI 47 53 31 0 (parallel [
(set (reg/v:DF 8 st [orig:58 x ] [58])
(if_then_else:DF (lt (reg/v:DF 21 xmm0 [orig:60 a ] [60])
(reg:DF 22 xmm1))
(reg:DF 9 st(1))
(reg/v:DF 8 st [orig:58 x ] [58])))
(clobber (scratch:DF))
(clobber (reg:CC 17 flags))
]) 484 {sse_movdfcc} (insn_list:REG_DEP_TRUE 6 (nil))
(nil))
This pattern is then processed by splitters conditioned on:
"!SSE_REG_P (operands[0]) && reload_completed"
or
"SSE_REG_P (operands[0]) && reload_completed"
and we get the asm code above.
Now, let us force the output register to be an SSE register. Adding
"return sqrt(x);" is enough, as x87's fsqrt is disabled by
TARGET_SSE_MATH. This forces the reload pass to use an SSE register as
the output register of the SSE cmove. The resulting asm code is
something to show!
test:
subl $12, %esp
movsd 16(%esp), %xmm0
movapd %xmm0, %xmm2
movsd .LC1, %xmm1
cmpltsd 24(%esp), %xmm2
andpd %xmm2, %xmm0
andnpd %xmm1, %xmm2
orpd %xmm2, %xmm0
sqrtsd %xmm0, %xmm0
movsd %xmm0, (%esp)
fldl (%esp)
addl $12, %esp
ret
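For reference, the cmpltsd/andpd/andnpd/orpd sequence above is the standard branchless select; a rough portable C model of it (the helper name and the bit-twiddling are mine) looks like this:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of the mask-based select the compiler emits:
   cmpltsd builds an all-ones/all-zeros mask from (a < b), then
   andpd/andnpd/orpd pick a or 1.0 without any branch. */
static double select_lt(double a, double b)
{
    uint64_t ua, uone, mask, res;
    double one = 1.0, r;

    memcpy(&ua, &a, sizeof ua);
    memcpy(&uone, &one, sizeof uone);
    mask = (a < b) ? ~(uint64_t) 0 : 0;   /* cmpltsd */
    res = (mask & ua) | (~mask & uone);   /* andpd / andnpd / orpd */
    memcpy(&r, &res, sizeof r);
    return r;                             /* (a < b) ? a : 1.0 */
}
```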
Well - no additional costly memory moves. Even the remaining memory
moves could be eliminated with some kind of sse_regparm attribute (as is
the default on x86_64) and by returning the resulting FP value in an SSE
register for static functions.
Actually, SSE and x87 code share the same execution resources. So even
when the insn opcodes are different, these instructions fight for the
same resources. And a multiplier takes up quite a large silicon area, so
I doubt that multiple multipliers will be present on a chip in the near
future. Currently x87, SSE and _integer_ multiply insns share the same
multiplier... So the question is whether -mfpmath=sse,387 is really
worth implementing...
As SSE code will be used more and more, these bugs should be fixed to
get SSE to its claimed performance [4]. I guess this is the reason that
SSE code is consistently 5% slower than x87 code in the povray benchmark.
Another problem with x87 code lies in the fact that x87 does not provide
FP compares (fcomi) with memory arguments. SSE compares can take a
memory argument, so every compare that gets combined to use a memory
argument will be implemented as an SSE compare, with all the necessary
register shuffling.
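A minimal sketch of that asymmetry (the function name is mine): with comisd the compiler can fold the load into the compare, while fcomi would first need an explicit fldl of the memory operand:

```c
#include <assert.h>

/* comisd accepts a memory source operand, so `a > *p` can compile to a
   single `comisd (mem), %xmm`; x87's fcomi takes only register-stack
   operands, forcing an extra fldl of *p first. */
int gt_mem(double a, const double *p)
{
    return a > *p;
}
```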
The problems with FP compares are summarized in [5].
Uros.
[1] http://gcc.gnu.org/ml/gcc/2005-01/msg00345.html
[2] http://gcc.gnu.org/ml/gcc-patches/2004-11/msg01877.html
[3] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19240
[4] 'info gcc', i386 and x86-64 options, -mfpmath=sse
[5] http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19252