SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]
Uros Bizjak
ubizjak@gmail.com
Tue Oct 10 21:23:00 GMT 2006
Menezes, Evandro wrote:
>>I have results for povray-3.6.1 on "Intel(R) Xeon(TM) CPU
>>3.60GHz", 32bit code:
>>
>>-pipe -Wno-multichar -O3 -mfpmath={387,sse} -ffast-math
>>-D__NO_MATH_INLINES -march=pentium4 -mtune=pentium4 -malign-double
>>-minline-all-stringops
>>
>>The results for _official_ povray.ini benchmark show nothing
>>conclusive, with
>>
>>28m11.082s for -mfpmath=sse and
>>28m24.763s for -mfpmath=387
>>
>>Please note, that in this case, mfpmath=387 uses x87 intrinsics, and
>>SSE uses register-passing convention for local functions. I'll
>>benchmark Athlon XP soon.
>>
>>
Oops, this should read Athlon 64.
I have re-run the official povray-3.6.1 benchmark on
vendor_id : AuthenticAMD
cpu family : 15
model : 47
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 2
cpu MHz : 1809.276
cache size : 512 KB
On Fedora Core 4 (2.6.11-1.1369_FC4 #1 Thu Jun 2 22:56:33 EDT 2005
x86_64 x86_64 x86_64 GNU/Linux) using out of the box glibc:
GNU C Library development release version 2.3.5, by Roland McGrath et al.
...
Compiled by GNU CC version 4.0.0 20050525 (Red Hat 4.0.0-9).
Compiled on a Linux 2.4.20 system on 2005-05-30.
Povray was compiled using "-pipe -Wno-multichar -O3 -march=k8 -mtune=k8
-ffast-math -minline-all-stringops" for SSE.
The result of benchmark run was:
user 27m43.635s
The 387 benchmark was compiled with -mfpmath=387 added to the compile flags.
The result of the benchmark run was:
user 28m40.049s
Now for the fun part.
I have speculated that the slowdown was due to costly SSE->mem->x87
moves. These moves should be avoided as much as possible, and this fact
was already proved some time ago (this is actually the reason why x87
intrinsics are disabled for SSE math). To test this speculation, -msse3
was added to the compile flags to enable generation of the fisttp instruction.
The effect of the fisttp instruction is to replace sequences like:
41e7ef: dd 5c 24 08 fstpl 0x8(%rsp)
41e7f3: 66 0f 12 44 24 08 movlpd 0x8(%rsp),%xmm0
41e7f9: f2 0f 2c d0 cvttsd2si %xmm0,%edx
with
40ef57: db 4c 24 fc fisttpl 0xfffffffffffffffc(%rsp)
40ef5b: 8b 44 24 fc mov 0xfffffffffffffffc(%rsp),%eax
and this way many x87->mem->SSE moves were removed. The result of
benchmark run is now:
user 27m27.141s
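
For readers who want to look at this kind of sequence themselves, here is a minimal sketch (not taken from povray; the function name is made up for illustration). It truncates a floating-point product to an integer, which is the operation behind the cvttsd2si and fisttp sequences above. The instructions actually emitted depend on the compiler version and flags, so treat it as something to experiment with rather than a guaranteed reproduction.

```c
/* Hypothetical helper, not from the povray sources: multiply two
   doubles and truncate the product to int.  With -mfpmath=387 the
   product is computed on the x87 stack, so the conversion has to
   get the value out of an x87 register one way or another. */
int scaled_trunc(double a, double b)
{
    /* C defines double-to-int conversion as truncation toward zero,
       independent of the current FPU rounding mode. */
    return (int)(a * b);
}
```

Compiling this with "gcc -O2 -S -mfpmath=387", with and without -msse3, and comparing the two assembly outputs should show whether the conversion goes through memory into an SSE register or uses fisttp.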
In comparison to the previous 387 run, this shows the effect of memory moves
(-msse3 changed some 350 cvttsd2si sequences into fisttp). So, at this
point the x87 code of a real-world application (which is, BTW, part of the
SPEC suite) beats x86_64 SSE, despite the fact that SSE has twice as
many non-stacked FP registers and implements a register passing convention
(thus avoiding memory moves). Following that, by implementing an x87 register
passing convention we would surely remove at least some of the more than
1300 remaining movlpds and some of the 900 remaining movsd instructions (all
with one memory operand).
>Well, FSIN returns results accurate to only 11 mantissa bits for angles around multiples of 90° above 10E5...
>
>
>
One has to make a lot of circles to reach 10e5 degrees ;)
Uros.