This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Bad code for complex arithmetic in egcs++


Hi,

I'm testing the performance of various C++ compilers on SOLARIS 2.6
for SPARC and x86. At least in one case, egcs gives bad performance.
That seems to be due to a problem with the optimization, not due
to problems inherent in C++, as shown by an analysis of the produces
assembler-files. I used egcs-1.1.1 for the tests.

One of the test cases (believe it or not, we actually
use it :-) ) is an addition of two complex vectors, a=a+b, where a and
b are vectors of fixed length. The code is implemented in a loop with integer
variable i. The assignment is choosen from one of the following:

    ptr[i]=ptr2[i]+ptr[i];

or

    ptr[i]=complex(real(ptr2[i])+real(ptr[i]),imag(ptr2[i])+imag(ptr[i]));

or

  ((double *)ptr)[2*i]=((double *)ptr2)[2*i]+((double *)ptr)[2*i];
  ((double *)ptr)[2*i+1]=((double *)ptr2)[2*i+1]+((double *)ptr)[2*i+1];

(and complex is complex<double>). I compiled with -O3, no other options.
No comments on how the coding should be done most efficiently: After all,
all are valid C++-assignments that should be, more or less, treated
the same.

Actually, I'd expect that after optimization, all of these take the same
time since all of them implement the same assignment. Unfortunately, that's
wrong. With egcs, the first two take .44 seconds, the third one takes .17
seconds on vectors of decent size. On the other hand, the third
one is not nice since it's not using the C++ structure.

Now, I thought that as usual C++ was just slower than C, but it turned out that
the figures for SUN's CC with -fast are 0.40 for the first implementation, again
0.17 for the third, but 0.11 for the second one. That's a quarter of the
time taken for the same implementation in egcs! So I started to wonder
why egcs was always slow and why CC is so fast. I looked at the assembler
output, and it's just horrible. I will quote it here, sorry if I missed
an additional optimizing step in the assembler.

Here goes the assembler output as produced by SUN CC for the second
implementation (the optimal one):

.L900000131:
	ldd     [%o2],%f0
	add     %o1,1,%o1
	add     %o0,16,%o0
	ldd     [%o2+8],%f30
	cmp     %o1,1024
	add     %o2,16,%o2
	ldd     [%o0-8],%f4
	ldd     [%o0-16],%f2
	faddd   %f30,%f4,%f30
	faddd   %f0,%f2,%f0
	std     %f30,[%o0-8]
	bl      .L900000131
	std     %f0,[%o0-16]

o2 has the value of ptr+i, o0 has the value of ptr2+i, o1 has the value of i.
In each iteration, get the four values involved in the computation,
do the addition, store into the correct places and increment i, ptr+i
and ptr+2*i. I think that looks like hand-coded, just the way everybody
would do it. Now look at the code produced by egcs, again for the
second implementation:

.LL625:
        ldd [%o4],%f6
        add %i0,1,%i0
        ldd [%g2],%f4
        cmp %i0,1023
        faddd %f4,%f6,%f4
        ldd [%o5],%f8
        add %g2,16,%g2
        ldd [%g3],%f2
        faddd %f2,%f8,%f2
!       std %f4,[%fp-32] 
!       ldd [%fp-32],%o0
        add %g3,16,%g3
!       std %o0,[%o4]
?       std %f2,[%fp-24]
?       ldd [%fp-24],%o2
        add %o4,16,%o4
?       std %o2,[%o5]
        ble .LL625
        add %o5,16,%o5

See the difference? First, four pointers are used, so that four integer
additions have to be carried out. However, that's probably not the
trouble: In the rows marked with !, f4 is ready to be written to [o4].
Instead, f4 is written to external memory, *fetched back* from external
memory, and then written to [o4]. WHY? I have no idea. I think I might even
use awk to identify this situation and improve the code. Note: f4 has not
been clobbered in any way. [%fp-32] is never used. Writing the value out
and getting it back simply has *no effect*. Same for the lines
marked with ?.

By the way, this problem has already existed in gcc2.x.

I must acknowledge that I don't understand this. Has anyone got an
explanation for it? Or, better yet, anybody has a fix (or, blush,
a command line option)? I append the source.

-- 
"Dr. Frank Wuebbeling (frank.wuebbeling@math.uni-muenster.de)"
                      (http://www.uni-muenster.de/math/users/wuebbel)
Institut fuer Numerische Mathematik der Universität Münster
FON (+49) 251 - 83 33795, FAX (+49) 251 - 83 32714
Einsteinstr. 62, D-48149 Muenster, Germany

--- test.C ---
#include "complex.h"
#include "math.h"
#include "mytime.h"

#ifdef __GNUC__
#define complex complex<double>
#endif

#define TYPE complex
#define NMAX 1024
#define ITERS 2048

main(int argc, char ** argv)
{
 TYPE * ptr, * ptr2, * ptr3;
 int i,k;
 int tim=-time(0);


 ptr=new TYPE[NMAX];
 ptr2=new TYPE[NMAX];
 ptr3=new TYPE[NMAX];

 srand48(tim);
 for (i=0;i<NMAX;i++)
 	{ptr[i]=drand48(); ptr2[i]=drand48();}
 taketime(1);
 for (k=0;k<ITERS;k++)
  for (i=0;i<NMAX;i++)
   {
    ptr[i]=ptr2[i]+ptr[i];
   }
 taketime(2);
 for (i=0;i<NMAX;i++) ptr3[i]=ptr[i];
 srand48(tim);
 for (i=0;i<NMAX;i++)
 	{ptr[i]=drand48(); ptr2[i]=drand48();}
 taketime(3);
 for (k=0;k<ITERS;k++)
  for (i=0;i<NMAX;i++)
   {
    ptr[i]=complex(real(ptr2[i])+real(ptr[i]),imag(ptr2[i])+imag(ptr[i]));
   }
 taketime(4);
 srand48(tim);
 for (i=0;i<NMAX;i++)
 	{ptr[i]=drand48(); ptr2[i]=drand48();}
 taketime(5);
 for (k=0;k<ITERS;k++)
  for (i=0;i<NMAX;i++)
 {
  ((double *)ptr)[2*i]=((double *)ptr2)[2*i]+((double *)ptr)[2*i];
  ((double *)ptr)[2*i+1]=((double *)ptr2)[2*i+1]+((double *)ptr)[2*i+1];
 }
 taketime(6);
 for (i=0;i<NMAX;i++) if (ptr3[i]!=ptr[i]) fprintf(stderr,"Error.\n");

 
 showtime(1,2);
 showtime(3,4);
 showtime(5,6);
 
}

--- timer.C ---

#include "stdio.h"
#include "stdlib.h"
#include "time.h"
#include "sys/times.h"
#include "sys/timeb.h"
#include "time.h"

#define MAXTIMERS 10
extern "C"      int ftime(struct timeb *tp);

struct tms mytms[MAXTIMERS];
struct timeb mytimeb[MAXTIMERS];

void
taketime(int k)
{
 if (k>MAXTIMERS) {fprintf(stderr,"timer: too large.\n"); exit(-1);}
 times(mytms+k);
 ftime(mytimeb+k);
 fprintf(stderr,"Setting timer %d\n",k);
}

void
showtime(int k, int l)
{
  fprintf(stderr,"Showing timers %d:%d\n\n",k,l);
  fprintf(stderr,
    "Wallclock: %f\n",(float)(mytimeb[l].millitm-mytimeb[k].millitm)/1000.+
	(float)(mytimeb[l].time-mytimeb[k].time));
  fprintf(stderr,
    "Usertime:  %f\nSystemtime: %f\n\n",(mytms[l].tms_utime-mytms[k].tms_utime)/100.,(mytms[l].tms_stime-mytms[k].tms_stime)/100.);

}

--- mytime.h ---

#include "stdlib.h"
#include "time.h"
#include "sys/times.h"
#include "sys/timeb.h"
#include "stdio.h"

void taketime(int);
void showtime(int,int);


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]