(pain reloaded) Why worse performace in euclidean distance with SSE2?

Brian Budge brian.budge@gmail.com
Fri Apr 11 21:15:00 GMT 2008


Hi Dario -

I got more than a 10% improvement on my machine, but I looked at the
code, and I must confess, I'm not sure what you're doing.  It seems as
though the 0 and 1 cases are not equivalent code.

Can you explain the actual problem you're trying to solve?

On Wed, Apr 9, 2008 at 10:57 AM, Dario Bahena Tapia <dario.mx@gmail.com> wrote:
> Jello,
>
>  I am still depressed about my performance. I have made following
>  "improvements" in code:
>
>  1. I am not using sqrt function anymore (it seems squared euclidean
>  distance is itself a metric)
>
>  2. I am using single precision calculations
>
>  3. I am using a different memory organization (SOA), now I have two
>  big arrays with all X coordinates in one and all Y coordinates in the
>  other.
>
>  4. I have changed logic to save once current point in registers and
>  iterate, in a SIMD fashion, over the rest.
>
>  5. I am turning on full brutal optimization in gcc (-O3)
>
>  Still with these changes, I only got an small improvement in time ...
>  around 10%. I would have expected an improvement of around 75% given I
>  am now doing 4 operations at a time (and that is if we do not count
>  pipelining which may exec more than one SSE2 instruction at a "time"
>  ... but unsure here, could be telling nonsense crap ;-).
>
>  Please, hardcore intel gods of assembler optimization, provide some
>  feedback for this poor beast of the forest.
>
>  Thanks
>
>  PS: Attached are the program, dummy makefile, sample usage and the
>  assembler generated. An input size of 24,000 is enough to see the
>  problem (dimension must be multiple of 8).
>
>  On Tue, Apr 8, 2008 at 9:55 AM, Dario Bahena Tapia <dario.mx@gmail.com> wrote:
>  > Hello,
>  >
>  >  Yeah, others have suggested as well changing the way i process them in
>  >  order to allow for that. Working there ;-|
>  >
>  >  Will consider the other suggestions as well !!!
>  >
>  >  Thanks.
>  >
>  >
>  >
>  >
>  >  On Tue, Apr 8, 2008 at 2:07 AM, Zuxy Meng <zuxy.meng@gmail.com> wrote:
>  >  > Hi,
>  >  >
>  >  >  "Dario Bahena Tapia" <dario.mx@gmail.com>
>  >  > 写入消息新闻:3d104d6f0804070617u47213cc8nbc697dab9dc262b5@mail.gmail.com...
>  >  >
>  >  >
>  >  >
>  >  > > Hello,
>  >  > >
>  >  > > I have just begun to play with SSE2 and gcc intrinsics. Indeed, maybe
>  >  > > this question is not exactly about gcc  ... but I think gcc lists are
>  >  > > a very good place to find help from  hardcore assembler hackers ;-1
>  >  > >
>  >  > > I have a program which makes heavy usage of euclidean distance
>  >  > > function. The serial version is:
>  >  > >
>  >  > > inline static double dist(int i,int j)
>  >  > > {
>  >  > >  double xd = C[i][X] - C[j][X];
>  >  > >  double yd = C[i][Y] - C[j][Y];
>  >  > >  return rint(sqrt(xd*xd + yd*yd));
>  >  > > }
>  >  > >
>  >  > > As you can see each C[i] is an array of two double which represents a
>  >  > > 2D vector (indexes 0 and 1 are coordinates X,Y respectively). I tried
>  >  > > to vectorize the function using SSE2 and gcc intrinsics, here is the
>  >  > > result:
>  >  > >
>  >  > > inline static double dist_sse(int i,int j)
>  >  > > {
>  >  > >  double d;
>  >  > >  __m128d xmm0,xmm1;
>  >  > >  xmm0 =_mm_load_pd(C[i]);
>  >  > >  xmm1 = _mm_load_pd(C[j]);
>  >  > >  xmm0 = _mm_sub_pd(xmm0,xmm1);
>  >  > >  xmm1 = xmm0;
>  >  > >  xmm0 = _mm_mul_pd(xmm0,xmm1);
>  >  > >  xmm1 = _mm_shuffle_pd(xmm0, xmm0, _MM_SHUFFLE2(1, 1));
>  >  > >  xmm0 = _mm_add_pd(xmm0,xmm1);
>  >  > >  xmm0 = _mm_sqrt_pd(xmm0);
>  >  > >  _mm_store_sd(&d,xmm0);
>  >  > >  return rint(d);
>  >  > > }
>  >  > >
>  >  > > Of course each C[i] was aligned as SSE2 expects:
>  >  > >
>  >  > > for(i=0; i<D; i++)
>  >  > > C[i] = (double *) _mm_malloc(2 * sizeof(double), 16);
>  >  > >
>  >  > > And in order to activate the SSE2 features, I am using the following
>  >  > > flags for gcc (my computer is a laptop):
>  >  > >
>  >  > > CFLAGS = -O -Wall -march=pentium-m -msse2
>  >  > >
>  >  > > The vectorized version of the function seems to be correct, given it
>  >  > > provides same results as serial counterpart. However, the performace
>  >  > > is poor; execution time of program increases in approximately 50% (for
>  >  > > example, in calculating the distance of every pair of points from a
>  >  > > set of 10,000, the serial version takes around 8 seconds while
>  >  > > vectorized flavor takes 12).
>  >  > >
>  >  > > I was expecting a better time given that:
>  >  > >
>  >  > > 1. The difference of X and Y is done in parallel
>  >  > > 2. The product of each difference coordinate with itself is also done
>  >  > > in parallel
>  >  > > 3. The sqrt function used is hardware implemented (although serial
>  >  > > sqrt implementation could also take advantage of hardware)
>  >  > >
>  >  > > I suppose the problem here is my lack of experience programming in
>  >  > > assembler in general, and in particular with SSE2. Therefore, I am
>  >  > > looking for advice.
>  >  > >
>  >  >
>  >  >  1. First of all, you didn't extract the parallelism in your algorithm. SSE2
>  >  > won't help you if all you want is to pick up two points at random indices
>  >  > and calculate the distance. However it will help you a lot when you
>  >  > calculate the distances between a given point and 1 million others whose
>  >  > indices are sequential.
>  >  >
>  >  >  2. Unroll the loop to hide the latency of square root as much as possible.
>  >  >
>  >  >  3. Since the final result is an integer, you may consider using "float"
>  >  > instead of "double". That'll give you a performance boost even without SSE2.
>  >  > And rsqrtps comes in handy too, if its precision is acceptable.
>  >  >
>  >  >  --
>  >  >  Zuxy
>  >  >
>  >  >
>  >
>


More information about the Gcc-help mailing list