This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]


On Tue, 10 Oct 2006, Uros Bizjak wrote:

> Menezes, Evandro wrote:
> 
> > >I have results for povray-3.6.1 on "Intel(R) Xeon(TM) CPU 3.60GHz", 32bit
> > >code:
> > >
> > >-pipe -Wno-multichar -O3 -mfpmath={387,sse} -ffast-math
> > >-D__NO_MATH_INLINES -march=pentium4 -mtune=pentium4 -malign-double
> > >-minline-all-stringops
> > >
> > >The results for _official_ povray.ini benchmark show nothing conclusive,
> > >with
> > >
> > >28m11.082s for -mfpmath=sse and
> > >28m24.763s for -mfpmath=387
> > >
> > >Please note that in this case -mfpmath=387 uses x87 intrinsics, and
> > >-mfpmath=sse uses the register-passing convention for local functions.
> > >I'll benchmark Athlon XP soon.
> > >    
> > >
> Oops, this should read Athlon 64.
> 
> I have re-run official povray-3.6.1 benchmark on
> 
> vendor_id       : AuthenticAMD
> cpu family      : 15
> model           : 47
> model name      : AMD Athlon(tm) 64 Processor 3000+
> stepping        : 2
> cpu MHz         : 1809.276
> cache size      : 512 KB
> 
> On Fedora Core 4 (2.6.11-1.1369_FC4 #1 Thu Jun 2 22:56:33 EDT 2005 x86_64
> x86_64 x86_64 GNU/Linux) using out of the box glibc:
> 
> GNU C Library development release version 2.3.5, by Roland McGrath et al.
> ...
> Compiled by GNU CC version 4.0.0 20050525 (Red Hat 4.0.0-9).
> Compiled on a Linux 2.4.20 system on 2005-05-30.
> 
> Povray was compiled using "-pipe -Wno-multichar -O3 -march=k8 -mtune=k8
> -ffast-math -minline-all-stringops" for SSE.
> The result of benchmark run was:
> user    27m43.635s
> 
> 387 benchmark was compiled with -mfpmath=387 added to compile flags.
> The result of benchmark was:
> user    28m40.049s
> 
> Now for the fun part.
> I have speculated that the slowdown was due to costly x87->mem->SSE moves.
> These moves should be avoided as much as possible, and this fact was already
> proved some time ago (this is actually the reason why x87 intrinsics are
> disabled for SSE math). To prove this speculation, -msse3 was added to compile
> flags to enable generation of fisttp instruction.
> 
> The effect of fisttp instruction is to substitute sequences like:
>  41e7ef:       dd 5c 24 08             fstpl  0x8(%rsp)
>  41e7f3:       66 0f 12 44 24 08       movlpd 0x8(%rsp),%xmm0
>  41e7f9:       f2 0f 2c d0             cvttsd2si %xmm0,%edx
> 
> with
> 
>  40ef57:       db 4c 24 fc             fisttpl 0xfffffffffffffffc(%rsp)
>  40ef5b:       8b 44 24 fc             mov    0xfffffffffffffffc(%rsp),%eax
> 
> and this way many x87->mem->SSE moves were removed. The result of benchmark
> run is now:
> user    27m27.141s
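
As a minimal sketch of the kind of source that leads to the sequences
quoted above: a C cast from double to int truncates toward zero. The
function name trunc_to_int and the self_check helper are hypothetical and
not taken from povray. Assuming the value is live in an x87 register,
the first sequence has to store it to memory and reload it into an SSE
register for cvttsd2si, while fisttp truncates and stores the integer
directly. The code actually generated depends on the GCC version and
flags used.

```c
#include <assert.h>

/* Hypothetical example: a truncating double-to-int conversion.
 * The C cast rounds toward zero, which is the operation that
 * cvttsd2si (SSE2) and fisttp (SSE3, on the x87 stack) implement. */
int trunc_to_int(double x)
{
    return (int)x;
}

/* Hypothetical self-check showing the truncation semantics. */
int self_check(void)
{
    assert(trunc_to_int(2.9) == 2);
    assert(trunc_to_int(-2.9) == -2);
    return 0;
}
```

Compiling such a function with and without -msse3 (alongside
-mfpmath=387) and inspecting the result with objdump -d is one way to see
whether fisttp is being generated.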

You should try the SSE support for C99 rounding functions patch
(http://www.suse.de/~rguenther/patches-0917.tar).  It gave a ~2%
improvement on SPEC CPU2006 povray.

Richard.

--
Richard Guenther <rguenther@suse.de>
Novell / SUSE Labs

