SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]

Tue Oct 10 21:23:00 GMT 2006

Menezes, Evandro wrote:

>>I have results for povray-3.6.1 on "Intel(R) Xeon(TM) CPU 
>>3.60GHz", 32bit code:
>>
>>-pipe -Wno-multichar -O3 -mfpmath={387,sse} -ffast-math
>>-D__NO_MATH_INLINES -march=pentium4 -mtune=pentium4 -malign-double
>>-minline-all-stringops
>>
>>The results for _official_ povray.ini benchmark show nothing 
>>conclusive, with
>>
>>28m11.082s for -mfpmath=sse and
>>28m24.763s for -mfpmath=387
>>
>>Please note, that in this case, mfpmath=387 uses x87 intrinsics, and
>>SSE uses register-passing convention for local functions. I'll
>>benchmark Athlon XP soon.
>>    
>>
Oops, this should read Athlon 64.

I have re-run the official povray-3.6.1 benchmark on

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 47
model name      : AMD Athlon(tm) 64 Processor 3000+
stepping        : 2
cpu MHz         : 1809.276
cache size      : 512 KB

On Fedora Core 4 (2.6.11-1.1369_FC4 #1 Thu Jun 2 22:56:33 EDT 2005 
x86_64 x86_64 x86_64 GNU/Linux) using out of the box glibc:

GNU C Library development release version 2.3.5, by Roland McGrath et al.
...
Compiled by GNU CC version 4.0.0 20050525 (Red Hat 4.0.0-9).
Compiled on a Linux 2.4.20 system on 2005-05-30.

Povray was compiled using "-pipe -Wno-multichar -O3 -march=k8 -mtune=k8 
-ffast-math -minline-all-stringops" for SSE.
The result of the benchmark run was:
user    27m43.635s

The 387 benchmark was compiled with -mfpmath=387 added to the compile flags.
The result of the benchmark was:
user    28m40.049s

Now for the fun part.
I have speculated that the slowdown was due to costly SSE->mem->x87 
moves. These moves should be avoided as much as possible, and this fact 
was already proved some time ago (this is actually the reason why x87 
intrinsics are disabled for SSE math). To test this speculation, -msse3 
was added to the compile flags to enable generation of the fisttp instruction.

The effect of the fisttp instruction is to replace sequences like:
  41e7ef:       dd 5c 24 08             fstpl  0x8(%rsp)
  41e7f3:       66 0f 12 44 24 08       movlpd 0x8(%rsp),%xmm0
  41e7f9:       f2 0f 2c d0             cvttsd2si %xmm0,%edx

with

  40ef57:       db 4c 24 fc             fisttpl 0xfffffffffffffffc(%rsp)
  40ef5b:       8b 44 24 fc             mov    0xfffffffffffffffc(%rsp),%eax

and this way many x87->mem->SSE moves were removed. The result of 
benchmark run is now:
user    27m27.141s
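
For illustration, the kind of C source that gives rise to such sequences
is a truncating double-to-int conversion of a value computed in x87
registers. The sketch below is not taken from the povray sources; the
function name is hypothetical, and the instructions actually emitted
depend on the compiler version and flags.

```c
/* Hypothetical example, not from povray.  With -mfpmath=387 the product
   is expected to be computed on the x87 stack; the cast then needs a
   truncating conversion to int.  C requires the cast to round toward
   zero, which is what both cvttsd2si and fisttp do regardless of the
   current rounding mode.  */
int
scale_to_int (double a, double b)
{
  return (int) (a * b);
}
```

Compiling this once with -mfpmath=387 and once with -msse3 added, and
comparing the objdump -d output, may show whether a given compiler makes
the same substitution.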

In comparison to the previous 387 run, this shows the effect of memory moves 
(-msse3 changed some 350 cvttsd2si sequences into fisttp). So, at this 
point x87 code of a real-world application (which is, BTW, part of the 
SPEC suite) beats x86_64 SSE, despite the fact that SSE has twice as 
many non-stacked FP registers and implements a register passing convention 
(thus avoiding memory moves). Following that, by implementing an x87 register 
passing convention we would surely remove at least some of the more than 
1300 remaining movlpds and some of the 900 remaining movsd instructions (all 
with one memory operand).
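
To put the three Athlon 64 "user" times side by side, here is a small
sketch for turning them into relative differences. The helper names are
made up for this example; only the timings come from the runs above.

```c
/* Convert a "user" time as printed by time, e.g. 27m43.635s,
   into seconds.  */
double
to_seconds (int minutes, double seconds)
{
  return minutes * 60.0 + seconds;
}

/* Percentage by which run time T differs from BASELINE;
   a negative result means T is the faster run.  */
double
percent_vs (double t, double baseline)
{
  return (t - baseline) / baseline * 100.0;
}
```

With the timings above, percent_vs (to_seconds (28, 40.049),
to_seconds (27, 43.635)) is about +3.4 (plain 387 vs. SSE), and
percent_vs (to_seconds (27, 27.141), to_seconds (27, 43.635)) is about
-1.0 (387 with -msse3 vs. SSE).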

>Well, FSIN returns results accurate to only 11 mantissa bits for angles around multiples of 90° above 10E5...
>
>  
>
One has to make a lot of circles to reach 10e5 degrees ;)

Uros.