HELP With Slow SSE Code

Tue Jun 6 00:19:00 GMT 2006

I am new to the world of SSE, but in trying to speed up some C code I have
run into a wall which is both perplexing and frustrating (since I can't find
a solution).  I am hoping someone here can provide the help I seek.  I thank
you for all your assistance!

My code (a watered-down version) is below.  It runs on a Pentium 4 based
machine and is compiled with gcc 4.02 using these options:

-O3 -Wall -march=pentium4 -msse2 -mfpmath=sse

// standard C #include files are put here
#include <emmintrin.h> // I will actually eventually be using sse2 and 
                                 // sse instructions
#include <mm_malloc.h>

int main(void)
{
float *ptr1,*ptr2,*ptr3,*tptr1,*tptr2;
__m128 m1,m2,m3,*sptr1,*sptr2,*sptr3;
int i,j,arraysize=1000,loopcount=10;

// allocate space for dynamic arrays that are aligned to 16-byte boundary
// (note that arraysize will actually be read into this program in the final
// version)
ptr1=(float *) _mm_malloc(arraysize*sizeof(float),16);
ptr2=(float *) _mm_malloc(arraysize*sizeof(float),16);
ptr3=(float *) _mm_malloc(arraysize*sizeof(float),16);


// fill in two of the arrays with some numbers

  sptr1=(__m128 *) ptr1; // cast to size 128 bits
  sptr2=(__m128 *) ptr2;
  sptr3=(__m128 *) ptr3;

    m3=_mm_mul_ps(m1,m2); // use SSE intrinsic instruction to
                // multiply two numbers (note that even if I use *sptrx
                // instead of mx I will get the same speed problem).

return 0;
}
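
The listing above does not show the timing loop itself.  A minimal sketch of
what such a loop might look like follows, assuming it walks the three arrays
four floats at a time through the __m128 pointers and repeats the whole pass
loopcount times.  The function name timing_loop, the loop bounds and the
pointer increments are assumptions made for illustration; only the statements
that read through sptr1, call _mm_mul_ps and write through sptr3 are taken
from the post.  It also assumes arraysize is a multiple of 4 and that all
three pointers are 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>
#include <mm_malloc.h>  /* _mm_malloc/_mm_free for callers that allocate */

/* Hypothetical reconstruction of the timing loop (not the original code). */
void timing_loop(float *ptr1, float *ptr2, float *ptr3,
                 int arraysize, int loopcount)
{
    __m128 m1, m2, m3, *sptr1, *sptr2, *sptr3;
    int i, j;

    /* assumption: the array length is a whole number of 4-float groups */
    assert(arraysize % 4 == 0);

    for (j = 0; j < loopcount; j++) {
        /* restart at the beginning of each array on every pass */
        sptr1 = (__m128 *) ptr1;
        sptr2 = (__m128 *) ptr2;
        sptr3 = (__m128 *) ptr3;

        for (i = 0; i < arraysize / 4; i++) {
            m1 = *sptr1++;            /* aligned 128-bit read */
            m2 = *sptr2++;
            m3 = _mm_mul_ps(m1, m2);  /* four float multiplies at once */
            *sptr3++ = m3;            /* aligned 128-bit write */
        }
    }
}
```

Called with three buffers obtained from _mm_malloc(...,16), this leaves the
element-wise products of ptr1 and ptr2 in ptr3 after every pass.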


So my speed problem is as follows.  Without the line "*sptr3=m3;" the TIMING
LOOP works as expected, that is, it runs four times faster than the same loop
using normal float values instead of quad-sized float values (i.e. __m128).
With the line "*sptr3=m3;" inside the TIMING LOOP, the code runs about 3 times
slower than when using normal float values.  For some reason writing to the
pointer location of type __m128 seems to slow things down, but reading from
it is fine (e.g. line "m1=*sptr1;").  If I instead write the
computed/multiplied data to a static array (though I truly need a dynamic
array), such as

x.m[j*i]=m3;  // that is, replace line *sptr3=m3 with this line

where, say,

union {
__m128 m[1000*10];
float f[1000*10][4];
} x;

then the program runs as fast as expected.  So what may I be doing wrong
with my code such that I do not effectively take advantage of SSE
capabilities in the pentium 4?
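
One thing that may be worth ruling out before comparing the timings (this is
an assumption, not a diagnosis of the code above): when the store is removed,
the products are never used, so an optimizing compiler is allowed to discard
the multiply or even the whole loop, and the fast case may not be doing the
same work as the slow one.  The sketch below keeps the loop free of stores to
the output array but adds the products into an accumulator and returns their
sum, so the result stays observable.  The names quad and mul_sum_no_store are
hypothetical, and the arrays are assumed to be 16-byte aligned with a length
that is a multiple of 4.

```c
#include <assert.h>
#include <emmintrin.h>
#include <mm_malloc.h>  /* _mm_malloc/_mm_free for callers that allocate */

/* Same idea as the union in the post: view one __m128 as four floats. */
union quad {
    __m128 m;
    float  f[4];
};

/* Hypothetical helper: multiply two arrays four floats at a time and
   return the sum of all products instead of storing them. */
float mul_sum_no_store(const float *ptr1, const float *ptr2, int arraysize)
{
    const __m128 *sptr1 = (const __m128 *) ptr1;
    const __m128 *sptr2 = (const __m128 *) ptr2;
    union quad acc;
    int i;

    /* assumption: the array length is a whole number of 4-float groups */
    assert(arraysize % 4 == 0);

    acc.m = _mm_setzero_ps();
    for (i = 0; i < arraysize / 4; i++) {
        /* per-lane running sum of the products */
        acc.m = _mm_add_ps(acc.m, _mm_mul_ps(sptr1[i], sptr2[i]));
    }

    /* add the four lanes together through the float view of the union */
    return acc.f[0] + acc.f[1] + acc.f[2] + acc.f[3];
}
```

Printing or otherwise using the returned value inside the timed region keeps
the multiplies from being treated as dead code, which makes the comparison
against the version that writes through sptr3 more meaningful.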
