gcc 4.0.0 generates slower code than gcc 3.4.3 for the BLAS "axpy" operation. (This is no doubt specific to IA32, and perhaps also to the processor version.) The program is below; here are the timing results:

                        gcc 3.4.3   gcc 4.0.0
  Method                cpu secs    cpu secs
  z[]=x[]+alpha*y[]     1.45        1.72
  z[]=z[]+alpha*y[]     1.47        2.03
  z[]=z[]+y[]           1.44        1.57

The second method is a common special case of the first, so it is unfortunate that gcc 4 does poorly on it.

========

The program is in two files to defeat inlining: rzvaxpy.c and zvaxpy.c, and here is the script I used to compile/run them:

  for m in METH1 METH2 METH3
  do
    for cc in gcc343 gcc400
    do
      $cc -march=i686 -O3 -D$m rzvaxpy.c zvaxpy.c
      echo $cc $m `(time a.out)2>&1`
    done
  done

==== zvaxpy.c

  void zvaxpy(double *z, double *x, double *y, int n, double alpha)
  {
    int i;
  #if defined(METH1)
    for (i = 0; i < n; i++)
      z[i] = x[i] + alpha * y[i];
  #elif defined(METH2)
    for (i = 0; i < n; i++)
      z[i] = z[i] + alpha * y[i];
  #else
    for (i = 0; i < n; i++)
      z[i] = z[i] + y[i];
  #endif
  }

==== rzvaxpy.c

  #include <stdio.h>

  #define N 100
  #define NITER ((300*1000*1000)/N)

  double a[100], b[100];

  extern void zvaxpy(double *, double *, double *, int, double);

  int main()
  {
    int i;
    double sum;

    for (i = 0; i < 100; i++) {
      a[i] = 0;
      b[i] = 1;
    }
    for (i = 0; i < NITER; i++)
      zvaxpy(a, a, b, N, 1.1);
    sum = 0;
    for (i = 0; i < N; i++)
      sum += a[i];
    printf("sum %g\n", sum);
    return 0;
  }
I think this basically goes back to the correct selection of induction variables (IVs) and the use of i386 addressing modes (scaled-index forms such as a*4+b). There are other bugs already open about that.
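To illustrate what IV selection means here, the following is a rough sketch in C of two equivalent ways to write the METH3 loop. The function names add_indexed and add_pointers are hypothetical and are not part of the test program. The first form keeps a single index, which on IA32 can be expressed with scaled-index operands. The second form steps a separate pointer through each array. Which form a given gcc version actually chooses internally is not shown by this sketch. It only shows that both forms compute the same result.

```c
/* Indexed form: one induction variable (i) addresses both arrays.
   On IA32 each access can be a base + index*8 operand. */
void add_indexed(double *z, const double *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = z[i] + y[i];
}

/* Pointer form: each array has its own induction variable (z and y),
   and the loop ends when z reaches the end pointer. */
void add_pointers(double *z, const double *y, int n)
{
    double *zend = z + n;
    while (z != zend) {
        *z = *z + *y;
        z++;
        y++;
    }
}
```

The indexed form needs one increment per iteration, while the pointer form needs one increment per array, so the choice between them affects both instruction count and register pressure.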
This has been fixed in 4.1.0. We now get:

  .L4:
          fldl    (%edx,%eax,8)
          faddl   (%ebx,%eax,8)
          fstpl   (%edx,%eax,8)
          incl    %eax
          cmpl    %eax, %ecx
          jne     .L4

Likewise for all methods.