This is the mail archive of the
mailing list for the GCC project.
Re: SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]
- From: Jan Hubicka <hubicka at ucw dot cz>
- To: Uros Bizjak <ubizjak at gmail dot com>
- Cc: "Menezes, Evandro" <evandro dot menezes at amd dot com>, Roger Sayle <roger at eyesopen dot com>, Michael Matz <matz at suse dot de>, Jan Hubicka <hubicka at ucw dot cz>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Richard Guenther <rguenther at suse dot de>
- Date: Wed, 11 Oct 2006 00:53:22 +0200
- Subject: Re: SSE vs. x87 povray deathmatch [was: Re: [RFC PATCH, x86_64] Use -mno-sse[,2] to fall back to x87 FP ...]
- References: <1449F58C868D8D4E9C72945771150BDF5218EB@SAUSEXMB1.amd.com> <452C07D8.email@example.com>
> Menezes, Evandro wrote:
> Povray was compiled using "-pipe -Wno-multichar -O3 -march=k8 -mtune=k8
> -ffast-math -minline-all-stringops" for SSE.
> The result of benchmark run was:
> user 27m43.635s
> 387 benchmark was compiled with -mfpmath=387 added to compile flags.
> The result of benchmark was:
> user 28m40.049s
> and this way many x87->mem->SSE moves were removed. The result of
> benchmark run is now:
> user 27m27.141s
Hmm, fun indeed ;)
If you manage to get any instruction-level oprofiles of routines that
execute faster on x87 than on SSE, I would definitely be interested to
see them. I will try to get this done for both benchmarks myself sometime
later this week or next week, if time allows.
In addition to the math functions already mentioned, comparing SSE to x87
performance is tricky, especially for code working on floats, as C
introduces many "implicit" float to double conversions that are no-ops on
x87 but rather expensive on SSE. I did some work on eliminating these
by adding folders around common offenders (such as fabs), but perhaps we
need more, especially for -ffast-math. Sadly, many programs are written in
a manner that does those conversions in nontrivial cases for no good
reason, and I guess it is more or less a matter of re-optimizing those
applications for new hardware (I guess povray is a good example of an
application that got extensive tuning for x87 hardware; Dhrystone is not).
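
To make the implicit-conversion point concrete, here is a minimal sketch in C. The function names are hypothetical and are not taken from povray or any benchmark; the snippet only illustrates the source-level pattern. An unsuffixed constant such as 0.5 has type double, so the float operand is widened, the multiply is done in double, and the result is narrowed again on return. The same thing happens when fabs() is called on a float where fabsf() would do.

```c
/* Hypothetical example: the constant 0.5 is a double, so x is converted
   to double, multiplied in double precision, and the result is converted
   back to float on return. */
float halve_via_double(float x)
{
    return x * 0.5;
}

/* Same computation kept in single precision: the f suffix makes the
   constant a float, so no conversion is implied by the source. */
float halve_float(float x)
{
    return x * 0.5f;
}
```

Whether the compiler actually emits conversion instructions for the first version depends on the GCC version, the flags, and which folders apply, so treat this as a description of the source pattern rather than of the generated code.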
Other common causes of slowdowns with SSE are the lack of reversed-order
instructions (you can do reg=reg-reg2, but not reg=reg2-reg, while x87
allows both) and sometimes increased instruction length causing
decoder stalls. None of those should, however, show significantly enough
to outweigh all the x87 fxch braindamage... So let's give it a try,
identifying and hopefully fixing the SSE codegen issues.
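
As an illustration of the reversed-order case, consider the following sketch. The function name is hypothetical and this is not a measured hot spot from either benchmark. The running value sits in the subtrahend position, which x87 can handle with a reversed subtract (fsubr), while the two-operand SSE subsd always uses its destination register as the minuend.

```c
/* Hypothetical example: acc = x[i] - acc keeps the running value as the
   subtrahend, i.e. the reg = reg2 - reg form that SSE cannot express in a
   single destructive instruction. */
double reversed_sub_chain(const double *x, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = x[i] - acc;
    return acc;
}
```

Whether this costs an extra register move in practice depends on register allocation and the GCC version, so an instruction-level profile would be needed to confirm it for any real routine.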