Gautam Sewani
Fri Jun 6 08:58:00 GMT 2008

As I mentioned, I am using intrinsics (the Intel SSE2 intrinsics in
emmintrin.h, to be specific). I do not wish to transfer data
between x87 and xmm registers. But when I move a __m128d variable
(a data type for use with the SSE2 intrinsics) to a 2-element double
array (to perform some calculation on each double individually),
gcc generates x87 FPU code for that. I do not want to use
the x87 FPU at all because, as you said, there is no way of moving
data between x87 and XMM registers without going through memory.
Therefore I want to know a method, compiler switch, etc. that will cause
gcc to *not* generate x87 FPU code.
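
To make the situation concrete, here is a minimal sketch of the two
patterns involved. The function names sum_lanes and scaled_sum are
hypothetical, made up for illustration, and are not taken from the real
code. The first keeps the per-element work in xmm registers using
scalar SSE2 intrinsics; the second stores the vector to a 2-element
double array and does ordinary scalar arithmetic on it, which is the
pattern described above.

```c
#include <emmintrin.h>

/* Hypothetical helper: adds the two lanes of a packed-double vector
   using only SSE2 intrinsics, so the arithmetic itself stays in xmm
   registers. */
static double sum_lanes(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       /* hi = { v[1], v[1] } */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* low lane: v[0] + v[1] */
}

/* Hypothetical helper: the store-to-array pattern. The vector is
   written to a small double array and each element is then used in
   plain scalar code. */
static double scaled_sum(__m128d v)
{
    double tmp[2];
    _mm_storeu_pd(tmp, v);                    /* tmp[0] = low, tmp[1] = high */
    return tmp[0] * 2.0 + tmp[1] * 3.0;
}
```

With v = _mm_set_pd(4.0, 1.5), i.e. high lane 4.0 and low lane 1.5,
sum_lanes(v) gives 5.5 and scaled_sum(v) gives 15. One thing that may
be worth checking, as an assumption about the cause rather than a
diagnosis: on 32-bit x86 the calling convention returns double values
in st(0), so a function that returns double ends with an x87 load even
under -msse2 -mfpmath=sse. Writing the result through a pointer, or
having the helper inlined, avoids that return path.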

On Thu, Jun 5, 2008 at 3:11 AM, Tim Prince wrote:
> Gautam Sewani wrote:
>> Hi,
>> I am trying to vectorize a piece of code using SSE2 intrinsics (the
>> ones in emmintrin.h). I am using double precision floating point
>> arithmetic. The running times I obtained were very similar with and
>> without the vectorization. I suspect the reason for this is that in
>> the vectorized code, I am storing the contents of a packed xmm
>> register (represented by an __m128d variable) into a double array.
>> Looking into the assembly code generated, I saw that for this, the
>> contents of the xmm register were first saved to a memory location and
>> then loaded into the x87 FPU stack. Apparently there is no direct way
>> to transfer data between x87 and xmm registers. One way to eliminate
>> this would be to use xmm registers for all floating point
>> calculations. But in spite of using -march=prescott and -mfpmath=sse,
>> x87 instructions like fld and fstp are still used. Is there any way to
>> force GCC to use only the xmm registers for all floating point
>> calculations? (I tried using the -mno-80387 option but I am getting
>> lots of weird linker errors with that.) Or is there any way to move
>> data between x87 and xmm registers without using memory as an
>> intermediary?
> Moves between x87 and xmm registers must always go through memory.  I'm not
> clear on why you want to use x87 registers in vectorized code, or whether
> you really need intrinsics rather than auto-vectorization.  Using gcc, if
> you have a combination of vectorizable and non-vectorizable code, to get a
> benefit from vectorization, you must split your loops so that you have
> entirely vectorizable code in loops which you want speeded up.  gcc doesn't
> "distribute" automatically for vectorization.
> gcc auto-vectorization has been improving in recent versions.
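
The loop splitting suggested above can be sketched as follows. This is
an illustration only: the function names mixed and split, the array
names, and the arithmetic are all hypothetical. In mixed, the
recurrence on acc sits in the same loop as the independent per-element
work. In split, the independent work has a loop of its own, which is
the form the auto-vectorizer (for example with -O2 -ftree-vectorize)
has a chance of handling.

```c
#include <stddef.h>

/* One loop mixing independent work with a loop-carried recurrence. */
void mixed(double *restrict out, double *restrict run,
           const double *restrict a, const double *restrict b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i] + 1.0;   /* independent per element */
        acc = acc * 0.5 + out[i];     /* depends on the previous iteration */
        run[i] = acc;
    }
}

/* The same computation, split by hand into two loops. */
void split(double *restrict out, double *restrict run,
           const double *restrict a, const double *restrict b, size_t n)
{
    /* Entirely vectorizable: no dependency between iterations. */
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i] + 1.0;

    /* The recurrence stays scalar, in its own loop. */
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc = acc * 0.5 + out[i];
        run[i] = acc;
    }
}
```

Both functions perform the same operations in the same order for each
element, so they should produce the same out and run arrays. Whether
the first loop in split actually gets vectorized depends on the gcc
version and flags, so it is worth confirming in the generated assembly.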

