This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.


Re: Effectiveness of SSE on Athlon XP with gcc 3.2

From: "Andres Chavarria" <el_andrecillo at hotmail dot com>
To: gcc-help at gcc dot gnu dot org
Date: Fri, 29 Nov 2002 13:22:01 +0000
Subject: Re: Effectiveness of SSE on Athlon XP with gcc 3.2

Well, after some thinking and analyzing the assembler code, I found the answer to my question. The bottleneck is the memory read throughput. If I use a smaller array, the data is read a lot faster and I get a speedup factor of at least 3. The Pentium 4 probably has a bigger cache, so it could handle the large amount of data better...
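
To make the cache argument concrete: one pass over the 307200-float array from the original post reads 307200 x 4 = 1228800 bytes (about 1.2 MB). If that is more than the CPU's cache can hold, every pass has to stream from main memory. Below is a minimal sketch for checking this on a given machine. The helper names (working_set_bytes, sum4, time_passes) are hypothetical and not from the original test program, and the asm barrier assumes a GCC-compatible compiler.

```cpp
#include <chrono>
#include <cstddef>

// Bytes read by one pass over n floats (hypothetical helper).
std::size_t working_set_bytes(std::size_t n) {
    return n * sizeof(float);
}

// Scalar sum with four independent accumulators, as in the non-SSE loop.
// Assumes n is a multiple of 4.
float sum4(const float* p, std::size_t n) {
    float s1 = 0.0f, s2 = 0.0f, s3 = 0.0f, s4 = 0.0f;
    for (std::size_t i = 0; i < n; i += 4) {
        s1 += p[i];
        s2 += p[i + 1];
        s3 += p[i + 2];
        s4 += p[i + 3];
    }
    return (s1 + s2) + (s3 + s4);
}

// Runs `passes` sums over the first n floats and returns the elapsed seconds.
// The accumulated result is written to *total so the work is not optimized away.
double time_passes(const float* data, std::size_t n, int passes, float* total) {
    const auto start = std::chrono::steady_clock::now();
    float acc = 0.0f;
    for (int k = 0; k < passes; ++k) {
        acc += sum4(data, n);
        // Compiler barrier (GCC/Clang syntax): keeps the sum from being
        // hoisted out of the loop, so the array is re-read on every pass.
        asm volatile("" ::: "memory");
    }
    const auto stop = std::chrono::steady_clock::now();
    *total = acc;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_passes once with n = 307200 and once with a much smaller n (for example 16384 floats, which is 64 KB), and comparing the seconds per element, should show whether the large array is limited by memory reads.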

Andres

From: "Andres Chavarria" <el_andrecillo@hotmail.com>
To: gcc-help@gcc.gnu.org
Subject: Effectiveness of SSE on Athlon XP with gcc 3.2
Date: Thu, 28 Nov 2002 15:45:49 +0000

Hi,

I was wondering how effective parallelization with SSE is, so I wrote a test program that sums up the values of a vector. The SSE-free code looks like this:

// non-SSE; numbers is an array of size cNumber (assumed to be a multiple of 4)
float sum1,sum2,sum3,sum4;
float val;

sum1=0.0f; sum2=0.0f; sum3=0.0f; sum4=0.0f;
for(i=0;i<cNumber;i+=4){
    sum1+=*(numbers++);
    sum2+=*(numbers++);
    sum3+=*(numbers++);
    sum4+=*(numbers++);
}
// combine all four partial sums
val=(sum1+sum2)+(sum3+sum4);

And the one using SSE looks like this:

// with SSE; numbers is an array of size cNumber, aligned with __attribute__((aligned(16)))
register v4sf val1,val2;
float __attribute__((aligned(16))) momVal[4];
float val;

// start with the first four elements
val1=__builtin_ia32_loadups(numbers);

// add four floats per iteration
for(i=4;i<cNumber;i+=4){
    val2=__builtin_ia32_loadaps(numbers+i);
    val1=__builtin_ia32_addps(val2,val1);
}

// store the four partial sums and combine them
__builtin_ia32_storeups(momVal,val1);

val=(momVal[0]+momVal[1])+(momVal[2]+momVal[3]);
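
For comparison, the same SSE loop can be written with the intrinsics from <xmmintrin.h> instead of the __builtin_ia32_* builtins. This is only a sketch, assuming an x86 target with SSE enabled and a C++11 compiler; the function name sum_sse is hypothetical and not part of the original test program.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (_mm_load_ps, _mm_add_ps, _mm_store_ps)

// Sums n floats, four at a time, with SSE.
// Assumes `numbers` is 16-byte aligned and n is a non-zero multiple of 4.
float sum_sse(const float* numbers, std::size_t n) {
    // Start with the first four elements.
    __m128 acc = _mm_load_ps(numbers);
    // Add four floats per iteration using aligned loads.
    for (std::size_t i = 4; i < n; i += 4) {
        acc = _mm_add_ps(acc, _mm_load_ps(numbers + i));
    }
    // Store the four partial sums and combine them.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Since both the input and the temporary are 16-byte aligned here, the sketch uses the aligned load and store throughout.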

I used gcc 3.2 on a Linux machine with kernel 2.4.19 (distribution SuSE 8.1) and compiled it the following way:

gcc -msse -march=athlon-xp -o ssetest main.cpp -lstdc++

If I compare the calculation time of the first code with that of the second on an Athlon XP 2200+ (actually 1800 MHz), I get only a small speedup (array size 307200):

calc. time without SSE: 2.72 sec
calc. time with SSE: 2.05 sec
speedup factor 2.72/2.05 = 1.33

Theoretically the speedup factor should be somewhat smaller than 4, so I'm quite disappointed... The first time I compiled and ran a similar program, I did it on a mobile Pentium 4 (1800 MHz) processor (options -msse -march=pentium4). There I got a speedup factor of about 3.9, which was superb.

I don't think it is because of the processor (otherwise the Athlon XP wouldn't stand a chance against the Pentium 4, and such a difference would have been reported somewhere on the net...). I suspect that the SSE code I wrote is far from good, or that gcc does something wrong for the Athlon XP.

I no longer have access to the Pentium 4 machine, so I can't compare the above code directly anymore. It would be great if someone could test it on a Pentium 4, or if someone has a hint about what is going on.

Thanks in advance,

Andres
