This is the mail archive of the
mailing list for the GCC project.
Re: SSE and SSE2 intrinsics
> I'm afraid they're "I tried compiling these bits of the huge source
> distribution of the FSL MRI image analyser with -march=pentium4 and
> without"; I haven't produced small examples, though I'll have a go at that
> this evening if you want.
-march=pentium4 is not very well tuned, but it should not produce worse
code than -march=something_else does on a Pentium 4. If the image analyser
uses floats heavily, I would guess the problem is the mixing of SSE and x87 FP.
I will prepare a patch to swap the register preference so you can check whether
it solves the problem.
> Behavior I noticed by comparing gprof -l output was: transferring from FP to
> SSE registers just to use the CVTTSD2SI commands, and
> [unsigned char a,b,e; int c,d]
> a = (a<b ? a:b);
> if (c<d) e=(b>e?b:e);
Yes, we do use SSE for a very small subset of FP operations that are considered
"safe". The problem is that we tell the compiler "using SSE is better", but the
register allocator is unable to discover that if register X is used together
with another register Y that must be in x87, then X must be in x87 as well,
even when all operations handling it are available in SSE.
I have had patches to regclass that make this possible for about two years, but
they didn't get in.
> compiling using byte-sized registers without -march=pentium4 and dword-sized
> ones otherwise, and suddenly becoming the hottest spot in the code; in both
> cases it was compiled in conditional-jump-over-one-instruction style, where
> I was slightly expecting cmov.
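For reference, the two patterns described above can be written as the small C functions below. This is only a sketch: the function names are hypothetical, and the types follow the quoted declaration. Which instructions the compiler emits (cmov, a conditional jump, CVTTSD2SI, byte or dword registers) depends on the GCC version and on the -march and FP math options, so it is worth inspecting the gcc -S output rather than assuming.

```c
/* Hypothetical reconstruction of the quoted fragment; names are illustrative. */

/* a = (a < b ? a : b): byte-sized minimum, a natural candidate for cmov. */
unsigned char min_u8(unsigned char a, unsigned char b)
{
    return a < b ? a : b;
}

/* if (c < d) e = (b > e ? b : e): a byte maximum guarded by an int compare. */
unsigned char guarded_max_u8(unsigned char b, unsigned char e, int c, int d)
{
    if (c < d)
        e = (b > e ? b : e);
    return e;
}

/* Truncating double -> int conversion; this is the operation that the
   CVTTSD2SI instruction implements when SSE2 is used for it. */
int truncate_to_int(double x)
{
    return (int)x;
}
```

Compiling these once with and once without -march=pentium4 and diffing the assembly should give a much smaller test case than the full image analyser.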
> > I am not sure what you are shooting for; -msse2 -march=pentium4 just
> > enables the presence of SSE2 builtins.
> Oh. I had expected -march=pentium4 to do what -mfpmath=sse does -- at
I was fighting for that behaviour, but lost the battle. The problem is that x87
uses 80-bit temporaries for everything, while SSE(2) uses 32-bit or 64-bit ones,
making results less exact (though IEEE-conforming). Some programs (glibc) rely
on the x87 behaviour, so it is unsafe to enable that option by default.
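To make the precision difference concrete, here is a minimal sketch in C. It assumes that long double is the 80-bit x87 format, which holds for GCC on x86 but not on every target, and it uses long double only as a stand-in for an x87 temporary. The function names are made up for the illustration.

```c
#include <float.h>
#include <math.h>

/* Intermediate product rounded to a 64-bit double, which is what scalar
   SSE2 arithmetic does. The volatile store ensures that no wider
   temporary survives, whatever FP unit the compiler picks. */
double scale_roundtrip_double(double x)
{
    volatile double t = x * 10.0;      /* overflows to +inf for x = 1e308 */
    return t / 10.0;
}

/* The same computation with a long double temporary, standing in for an
   80-bit x87 register (assumption: long double is wider than double). */
double scale_roundtrip_extended(double x)
{
    volatile long double t = (long double)x * 10.0L;
    return (double)(t / 10.0L);
}
```

With x = 1e308 the first function returns infinity, because the product does not fit in a double. The second returns a finite value wherever long double has the larger exponent range. Code that silently depends on the second behaviour is what breaks when scalar FP moves to SSE.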
Also, -march=pentium4 implies -msse2. -msse2 just tells the compiler that the
instruction set is available. -mfpmath=sse tells the compiler to automatically
use it for scalar FP code generation.
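One way to see the distinction from inside a program is to test the compiler's predefined macros. The sketch below assumes a GCC recent enough to define __SSE2__ (the instruction set was announced as available) and __SSE2_MATH__ (scalar double arithmetic is being generated with SSE2, as with -mfpmath=sse, which is how the GCC manual spells the option). Older snapshots may not define these macros, and the function names are hypothetical.

```c
/* Returns 1 if the compiler was told SSE2 instructions exist
   (for example via -msse2 or -march=pentium4), otherwise 0. */
int sse2_enabled(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if scalar double-precision math is generated with SSE2
   (for example via -mfpmath=sse on top of -msse2), otherwise 0. */
int sse2_scalar_math(void)
{
#ifdef __SSE2_MATH__
    return 1;
#else
    return 0;
#endif
}
```

On a 32-bit x86 build with only -march=pentium4, one would expect the first function to return 1 and the second 0. Adding -mfpmath=sse should flip the second.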
The SSE set is also useful in a limited way for integer operations and of
course for vectorized code, but these bits are not implemented yet. They
need the register allocator patch as well as some other important changes.
> least, that's the behaviour I saw on the Intel compiler. I've collected the
> 20020204 snapshot, and will comment more in a couple of days when I've had
> some time to play -- my P4 system is at home and my only Net connection at
> college, so I carry snapshots back and forth on my Windows laptop.
> Err, does -mfpmath=sse also use SSE2 for DF-mode operations, or do I need
> to set it to sse2 for those? And is it documented anywhere? -- Google shows
sse is enough - it enables use of SSE2 if you tell the compiler that SSE2
is available via the -march switch. There is a manual distributed with the
sources, so you can read it. The manual on the web covers the 3.0 release, which does not support
> no uses of the word on the Web.
> > To get some benefit, you need to either use the intrinsics, in which case
> > the code will not compile without those -m options, or use -mfpmath=sse
> > to enable use of SSE instructions for floating point, which should improve
> > performance of FP code, though not through parallelization.
> Indeed; I already have some code which relies on the ICC intrinsics, which
> I'd rather like to be able to compile with the normal gcc tools.
I would be very interested in the results. Theoretically it should work now,
assuming you are not using callbacks from library calls that were compiled
by an old gcc and misalign the stack, but I am not quite sure everything is in
place. We should get it working before branching, so this is top priority right now.
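If you want a small test case for the intrinsics, something along these lines would do. It is only a sketch: add_floats is a made-up name, the intrinsics are the standard <xmmintrin.h> ones (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps), and the scalar fallback is there so the file still compiles when SSE is not enabled.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* dst[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration. Unaligned loads and stores are used so the
       routine does not depend on 16-byte alignment of the buffers. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop when SSE is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The unaligned forms only protect the caller's buffers. If the compiler spills an __m128 value to the stack it may still use aligned moves, which is where a misaligned stack from an old caller can bite.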