This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]


Menezes, Evandro wrote:

I have results for povray-3.6.1 on "Intel(R) Xeon(TM) CPU 3.60GHz", 32bit code:

-pipe -Wno-multichar -O3 -mfpmath={387,sse} -ffast-math
-D__NO_MATH_INLINES -march=pentium4 -mtune=pentium4 -malign-double
-minline-all-stringops

The results for _official_ povray.ini benchmark show nothing conclusive, with

28m11.082s for -mfpmath=sse and
28m24.763s for -mfpmath=387

Please note, that in this case, mfpmath=387 uses x87 intrinsics, and
SSE uses register-passing convention for local functions. I'll
benchmark Athlon XP soon.


Oops, this should read Athlon 64.

I have re-run official povray-3.6.1 benchmark on

vendor_id       : AuthenticAMD
cpu family      : 15
model           : 47
model name      : AMD Athlon(tm) 64 Processor 3000+
stepping        : 2
cpu MHz         : 1809.276
cache size      : 512 KB

On Fedora Core 4 (2.6.11-1.1369_FC4 #1 Thu Jun 2 22:56:33 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux) using out of the box glibc:

GNU C Library development release version 2.3.5, by Roland McGrath et al.
...
Compiled by GNU CC version 4.0.0 20050525 (Red Hat 4.0.0-9).
Compiled on a Linux 2.4.20 system on 2005-05-30.

Povray was compiled using "-pipe -Wno-multichar -O3 -march=k8 -mtune=k8 -ffast-math -minline-all-stringops" for SSE.
The result of benchmark run was:
user 27m43.635s


The 387 benchmark was compiled with -mfpmath=387 added to the compile flags.
The result of benchmark was:
user    28m40.049s

Now for the fun part.
I speculated that the slowdown was due to costly moves between x87 and SSE registers through memory. These moves should be avoided as much as possible, which was already shown some time ago (it is actually the reason why x87 intrinsics are disabled for SSE math). To test this speculation, -msse3 was added to the compile flags to enable generation of the fisttp instruction.


The effect of the fisttp instruction is to replace sequences like:
 41e7ef:       dd 5c 24 08             fstpl  0x8(%rsp)
 41e7f3:       66 0f 12 44 24 08       movlpd 0x8(%rsp),%xmm0
 41e7f9:       f2 0f 2c d0             cvttsd2si %xmm0,%edx

with

 40ef57:       db 4c 24 fc             fisttpl 0xfffffffffffffffc(%rsp)
 40ef5b:       8b 44 24 fc             mov    0xfffffffffffffffc(%rsp),%eax

and this way many x87->mem->SSE moves were removed. The result of benchmark run is now:
user 27m27.141s
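
For illustration only, here is a minimal C sketch of the kind of source construct that leads to the sequences above: a truncating double-to-int conversion. The function name truncate_to_int is hypothetical and is not taken from the povray sources. With x87 math the value lives on the x87 stack, so without fisttp the compiler has to get it into an SSE register through memory before cvttsd2si can be used.

```c
/* Hypothetical helper, not from povray: a C cast from double to int
   truncates toward zero, which is the operation that cvttsd2si and
   fisttp both perform. */
int truncate_to_int(double x)
{
    return (int)x;
}
```

Compiling a file like this with the flags from this post, with and without -msse3 added to -mfpmath=387, and inspecting the assembly from -S should show which sequence is chosen. The exact output depends on the GCC version.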


In comparison to the previous 387 run, this shows the effect of memory moves (-msse3 changed some 350 cvttsd2si sequences into fisttp). So, at this point, the x87 code of a real-world application (which is, BTW, part of a SPEC suite) beats x86_64 SSE, despite the fact that SSE has twice as many non-stacked FP registers and implements a register-passing convention (thus avoiding memory moves). Following that, by implementing an x87 register-passing convention we would surely remove at least some of the more than 1300 remaining movlpds and some of the 900 remaining movsd instructions (all with one memory operand).
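
To put the three Athlon 64 "user" times side by side, the small C sketch below converts them to seconds and computes relative differences. The helper names to_seconds and percent_change are made up for this illustration. By this arithmetic the 387 run with -msse3 is roughly 4% faster than the plain 387 run and roughly 1% faster than the SSE run.

```c
/* Hypothetical helpers for comparing the timings quoted in this post. */

/* Convert a time of the form XmY.Zs into seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Change of t relative to baseline, in percent (negative means faster). */
double percent_change(double baseline, double t)
{
    return (t - baseline) / baseline * 100.0;
}
```

For example, percent_change(to_seconds(28, 40.049), to_seconds(27, 27.141)) compares the plain 387 run against the 387 run with -msse3.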

Well, FSIN returns results accurate to only 11 mantissa bits for angles around multiples of 90° above 10E5...



One has to make a lot of circles to reach 10e5 degrees ;)

Uros.


