This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
- From: Uros Bizjak <uros at kss-loka dot si>
- To: gcc at gcc dot gnu dot org
- Date: Mon, 24 Jan 2005 17:45:01 +0100
- Subject: [BENCHMARK] Povray official benchmark scores for 387 and SSE insnsets
Hello!
The following are the scores for the official Povray benchmark on a
P4 at 3.2GHz with an 800MHz FSB.
gcc version 4.0.0 20050124 (experimental)
PovRay version 3.50c, stripped.
Compile flags (-mfpmath=??? stands for sse or 387, as in the results below):
-O3 -march=pentium4 -mfpmath=??? -ffast-math -D__NO_MATH_INLINES
-finline-functions -fomit-frame-pointer -funroll-loops
-fexpensive-optimizations -malign-double -foptimize-sibling-calls
-minline-all-stringops -Wno-multichar
-mfpmath=sse:
Time For Parse: 0 hours 0 minutes 2.0 seconds (2 seconds)
Time For Photon: 0 hours 0 minutes 35.0 seconds (35 seconds)
Time For Trace: 0 hours 29 minutes 29.0 seconds (1769 seconds)
Total Time: 0 hours 30 minutes 6.0 seconds (1806 seconds)
-mfpmath=387:
Time For Parse: 0 hours 0 minutes 2.0 seconds (2 seconds)
Time For Photon: 0 hours 0 minutes 35.0 seconds (35 seconds)
Time For Trace: 0 hours 28 minutes 27.0 seconds (1707 seconds)
Total Time: 0 hours 29 minutes 4.0 seconds (1744 seconds)
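To put the two totals side by side: Parse and Photon times are identical,
so the whole 62 second gap comes from the Trace phase, and the sse build
is about 3.6% slower overall than the 387 build. A minimal sketch of that
arithmetic follows; the function names are made up for illustration and
are not part of Povray or GCC.

```c
#include <assert.h>
#include <stdio.h>

/* Convert an "H hours M minutes S seconds" reading to seconds. */
long to_seconds(long hours, long minutes, long seconds)
{
    return hours * 3600 + minutes * 60 + seconds;
}

/* Percentage by which `slower` exceeds `faster`, relative to `faster`. */
double percent_slower(long slower, long faster)
{
    assert(faster > 0);
    return 100.0 * (double)(slower - faster) / (double)faster;
}

/* Print the comparison for the two "Total Time" lines reported above. */
void report_totals(void)
{
    long sse = to_seconds(0, 30, 6);  /* -mfpmath=sse total */
    long x87 = to_seconds(0, 29, 4);  /* -mfpmath=387 total */

    printf("sse: %ld s, 387: %ld s, difference: %ld s (%.1f%%)\n",
           sse, x87, sse - x87, percent_slower(sse, x87));
}
```

Calling report_totals() from a main() prints the two totals in seconds
together with the difference.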
It should be noted that:
- The 387 version can take a lot of shortcuts by optimizing math built-in
functions, for example:
grep fsincos povray.asm.387 | wc -l
56
- Math builtins are effectively disabled for the sse version.
- Another handicap for the sse version is that every return value from a
math function travels from the st(0) x87 reg to memory and back to an
SSE reg:
804b91d: f2 0f 11 04 24 movsd %xmm0,(%esp,1)
804b922: e8 7d f5 ff ff call 0x804aea4
804b927: dd 5c 24 28 fstpl 0x28(%esp,1)
804b92b: f2 0f 10 4c 24 28 movsd 0x28(%esp,1),%xmm1
804b931: f2 0f 5c 4b 10 subsd 0x10(%ebx),%xmm1
- In both cases the register allocator still can't figure out which
register set it _really_ wants: SSE insns show up in the 387 build and
x87 insns in the sse build, so these values travel the same way as in
the example above:
grep pxor povray.asm.387 | wc -l
21
grep fldz povray.asm.sse | wc -l
122
grep fld1 povray.asm.sse | wc -l
171
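The return-value round trip from the list above can be reproduced with a
very small test case. On 32-bit x86 the usual calling convention returns
a double in st(0), so with -mfpmath=sse the caller has to store it to the
stack and reload it into an SSE register before it can use it. The sketch
below is only an illustration: scale_by_two() is a hypothetical stand-in
for a libm call, and struct sample is made up so that the subtracted
operand sits at offset 0x10 as in the listing.

```c
/* Hypothetical stand-in for a math library call: any function that
   returns a double.  noinline (a GCC attribute) keeps the call visible
   in the generated assembly. */
__attribute__((noinline)) double scale_by_two(double x)
{
    return x * 2.0;
}

/* Made-up layout: member c is at byte offset 0x10. */
struct sample {
    double a;
    double b;
    double c;
};

/* Call a double-returning function, then subtract a value loaded from
   memory, mirroring the call / subsd sequence in the listing. */
double call_then_subtract(double x, const struct sample *s)
{
    return scale_by_two(x) - s->c;
}
```

Compiling this with something like "gcc -m32 -O2 -msse2 -mfpmath=sse -S"
and looking for fstpl followed by movsd after the call should show the
same pattern, although the exact code depends on the GCC version.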
With all this in mind, the -mfpmath=sse results are very good and show
the effect of rth's TARGET_SSE_MATH cleanups.
As a last thought: is there a reason not to create an ABI similar to the
one x86_64 has for its FP math? If SFmode and DFmode floats could be
passed to and from functions in SSE registers instead of via memory, I'm
sure some benefit could be shown for FP performance. Perhaps a multilib
approach could be introduced, where an appropriate sse-only
(sub-)library [with parameters passed in SSE registers] would be linked
in for -mfpmath=sse, instead of the current x87 one [with inline asm
troubles :)].
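
For a rough idea of what such an interface looks like at the source
level: later GCC releases document an sseregparm function attribute for
32-bit x86, which passes up to three floating-point arguments in SSE
registers instead of on the stack. It is not part of the setup
benchmarked here, and the sketch below is only an assumption-laden
illustration; the SSE_ABI macro and scale_and_offset() are hypothetical
names, and the attribute is applied only where it is expected to be
accepted.

```c
/* Use the attribute only on 32-bit x86 with SSE2 enabled; elsewhere the
   macro expands to nothing and the default ABI applies. */
#if defined(__i386__) && defined(__SSE2__)
#define SSE_ABI __attribute__((sseregparm))
#else
#define SSE_ABI
#endif

/* Hypothetical FP helper whose double arguments would be passed in SSE
   registers when SSE_ABI is active. */
SSE_ABI double scale_and_offset(double x, double y)
{
    return x * y + 1.0;
}
```

Every caller must see the same declaration, since the attribute changes
the calling convention; that is the same constraint a separate sse-only
library would impose.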
Uros.