This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Effectiveness of SSE on Athlon XP with gcc 3.2
- From: "Andres Chavarria" <el_andrecillo at hotmail dot com>
- To: gcc-help at gcc dot gnu dot org
- Date: Fri, 29 Nov 2002 13:22:01 +0000
- Subject: Re: Effectiveness of SSE on Athlon XP with gcc 3.2
Well, after some thinking and analyzing the assembler code, I found the answer
to my question. The bottleneck is the memory read throughput.
If I use a smaller array, the data is read a lot faster and I get a speedup
factor of at least 3. The Pentium 4 probably has a bigger cache, so it could
handle the large amount of data better...
Andres
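
A rough way to check the cache argument above is to compare the size of the array with the size of the cache. The sketch below is only an illustration: the helper names working_set_bytes and fits_in_cache are made up for this note, and the cache size is a parameter the caller has to look up for the CPU in question.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one pass over an array of n floats. */
size_t working_set_bytes(size_t n)
{
    return n * sizeof(float);
}

/* Returns 1 if an array of n floats fits into a cache of cache_bytes
   bytes, 0 otherwise. cache_bytes is supplied by the caller (assumption:
   you know the cache size of your own CPU). */
int fits_in_cache(size_t n, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return working_set_bytes(n) <= cache_bytes;
}
```

With 4-byte floats, the array of 307200 elements from the original message occupies 1228800 bytes (about 1.2 MB), so fits_in_cache(307200, 256 * 1024) returns 0 for a hypothetical 256 KB cache, while a 16384-element array (64 KB) would fit.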
From: "Andres Chavarria" <el_andrecillo@hotmail.com>
To: gcc-help@gcc.gnu.org
Subject: Effectiveness of SSE on Athlon XP with gcc 3.2
Date: Thu, 28 Nov 2002 15:45:49 +0000
Hi,
I was wondering how effective the parallelization using SSE is, so I wrote
a test program that sums up the values of a vector. The SSE-free code
looks like this:
// non SSE, numbers is an array of size cNumber
float sum1,sum2,sum3,sum4;
float val;
sum1=0.0f; sum2=0.0f;sum3=0.0f; sum4=0.0f;
for(i=0;i<cNumber;i+=4){
sum1+=*(numbers++);
sum2+=*(numbers++);
sum3+=*(numbers++);
sum4+=*(numbers++);
}
val=(sum1+sum2)+(sum3+sum4);
And the one using SSE looks like this:
// with SSE, numbers is an array of size cNumber, declared with
// __attribute__((aligned(16))) to align it
register v4sf val1,val2;
float __attribute__((aligned(16))) momVal[4];
float val;
val1=__builtin_ia32_loadups(numbers);
for(i=4;i<cNumber;i+=4){
val2=__builtin_ia32_loadaps(numbers+i);
val1=__builtin_ia32_addps(val2,val1);
}
__builtin_ia32_storeups(momVal,val1);
val=(momVal[0]+momVal[1])+(momVal[2]+momVal[3]);
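
For comparison, here is a minimal self-contained sketch of both variants written with the portable intrinsics from <xmmintrin.h> instead of the __builtin_ia32_* functions. It is not the code that was timed above: the function names sum_scalar and sum_sse are made up for this sketch, it uses unaligned loads so that the input does not have to be 16-byte aligned, and it falls back to the scalar loop when the compiler does not define __SSE__.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar sum with four independent accumulators.
   n must be a multiple of 4. */
float sum_scalar(const float *numbers, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        s0 += numbers[i];
        s1 += numbers[i + 1];
        s2 += numbers[i + 2];
        s3 += numbers[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

/* SSE sum: one 4-float vector add per iteration.
   n must be a multiple of 4. */
float sum_sse(const float *numbers, size_t n)
{
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();   /* four partial sums, all zero */
    float lanes[4];
    size_t i;
    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(numbers + i));
    _mm_storeu_ps(lanes, acc);       /* write the partial sums back */
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#else
    return sum_scalar(numbers, n);   /* no SSE available: scalar fallback */
#endif
}
```

Both functions add the elements in the same per-lane order, so for the same input they should return the same value. If the array is known to be 16-byte aligned, _mm_load_ps can replace _mm_loadu_ps.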
I used gcc 3.2 on a Linux machine with kernel 2.4.19 (SuSE 8.1
distribution) and compiled it as follows:
gcc -msse -march=athlon-xp -o ssetest main.cpp -lstdc++
If I compare the calculation time of the first code with that of the second
one on an Athlon XP 2200+ (actually 1800 MHz), I get only a small speedup
(array size 307200):
calc. time without SSE: 2.72 sec
calc. time with SSE: 2.05 sec
speedup factor 2.72/2.05 = 1.33
Theoretically the speedup factor should be somewhat smaller than 4, so I'm
quite disappointed... The first time I compiled and ran a similar program, I
did it on a mobile Pentium 4 (1800 MHz) processor (options -msse
-march=pentium4). I got a speedup factor of about 3.9, which was superb.
I don't think it is because of the processor (otherwise the Athlon XP
wouldn't stand a chance against the Pentium 4, and a report of such a
difference would be present somewhere on the net...). I suspect that the SSE
code I wrote is far from good, or that gcc does something wrong for the
Athlon XP.
I no longer have access to the Pentium 4 machine, so I can't compare the
above code directly. It would be great if someone could test it on a
Pentium 4, or if someone has a hint about what is going on.
Thanx in advance,
Andres