21550 – [4.0/4.1 Regression] i686 floating point performance 33% slower than gcc 3.4.3

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.0.0

Importance:	P2 normal
Target Milestone:	4.1.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2005-05-13 15:22 UTC by Tom Truscott
Modified:	2005-10-16 22:25 UTC
CC List:	1 user

See Also:
Host:
Target:	i686--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Description Tom Truscott 2005-05-13 15:22:03 UTC

gcc 4.0.0 generates slower code than gcc 3.4.3 for the BLAS "axpy" operation.
(This is no doubt specific to IA32, and perhaps also to the processor version.)
The program is below; here are the timing results:

                   gcc 3.4.3    gcc 4.0.0
Method              cpu secs     cpu secs
z[]=x[]+alpha*y[]     1.45         1.72
z[]=z[]+alpha*y[]     1.47         2.03
z[]=z[]+y[]           1.44         1.57
                                                                                
The second method is a common special case of the first,
so it is unfortunate that gcc 4 does poorly on it.

========
The program is in two files to defeat inlining: rzvaxpy.c and zvaxpy.c
and here is the script I used to compile/run them:

for m in METH1 METH2 METH3
do
   for cc in gcc343 gcc400
   do
      $cc -march=i686 -O3 -D$m rzvaxpy.c zvaxpy.c
      echo $cc $m `(time a.out)2>&1`
   done
done

==== zvaxpy.c

void
zvaxpy(double *z, double *x, double *y, int n, double alpha)
{
   int i;
                                                                                
#if defined(METH1)
   for (i = 0; i < n; i++) z[i] = x[i] + alpha * y[i];
#elif defined(METH2)
   for (i = 0; i < n; i++) z[i] = z[i] + alpha * y[i];
#else
   for (i = 0; i < n; i++) z[i] = z[i] +  y[i];
#endif
}

==== rzvaxpy.c

#include <stdio.h>
                                                                                
#define N 100
#define NITER ((300*1000*1000)/N)
double a[100], b[100];
                                                                                
extern void zvaxpy(double *, double *, double *, int, double);
                                                                                
int
main()
{
   int i;
   double sum;
   for (i = 0; i < 100; i++) { a[i] = 0; b[i] = 1; }
   for (i = 0; i < NITER; i++) zvaxpy(a,a, b, N, 1.1);
   sum = 0; for (i = 0; i < N; i++) sum += a[i];
   printf("sum %g\n", sum);
   return 0;
}

Comment 1 Andrew Pinski 2005-05-13 18:03:45 UTC

I think this basically goes back to the correct selection of IVs (induction variables) and i386 addressing modes, i.e. a*4+b and such; there are other bugs already open about that.
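
As a rough illustration of what "selection of IVs" means here, the METH2 loop can be written by hand in two equivalent ways: with a single index that the hardware scales in a base+index*8 address, or with pointers that are bumped each iteration. This is only a sketch of the two shapes an optimizer chooses between; it is not the code gcc 4.0.0 actually generated (that listing is not in this report), and the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical name. One induction variable (i); each access can use the
 * i386 scaled-index addressing mode, i.e. base + i*8 for a double. */
void axpy_indexed(double *z, const double *y, int n, double alpha)
{
    int i;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        z[i] = z[i] + alpha * y[i];
}

/* Hypothetical name. Two pointer induction variables plus an end pointer;
 * no scaled index is needed, but more values are live and updated in the
 * loop, which matters on a register-starved target. */
void axpy_pointer_bump(double *z, const double *y, int n, double alpha)
{
    double *zend;
    assert(n >= 0);
    zend = z + n;
    for (; z != zend; z++, y++)
        *z = *z + alpha * *y;
}
```

Both functions compute the same result; which form is faster depends on the target and on how many registers and address computations each one costs inside the loop.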

Comment 2 Andrew Pinski 2005-10-16 22:25:03 UTC

This has been fixed in 4.1.0.
We now get:
.L4:
        fldl    (%edx,%eax,8)
        faddl   (%ebx,%eax,8)
        fstpl   (%edx,%eax,8)
        incl    %eax
        cmpl    %eax, %ecx
        jne     .L4

Likewise for all methods.
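
For readers less used to AT&T syntax, the loop in the listing can be read back into C roughly as follows. The register-to-variable mapping (%edx as z, %ebx as y, %eax as i, %ecx as n) is inferred from the instructions rather than stated in the comment, the check for an empty array is assumed to sit outside the listed loop, and the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical name. A C reading of the .L4 loop (the z[] = z[] + y[] case). */
void meth3_like_listing(double *z, const double *y, int n)
{
    int i = 0;                  /* %eax */
    assert(n >= 0);
    if (n == 0)                 /* assumed guard, not part of the listing */
        return;
    do {
        double t = z[i];        /* fldl  (%edx,%eax,8) */
        t += y[i];              /* faddl (%ebx,%eax,8) */
        z[i] = t;               /* fstpl (%edx,%eax,8) */
        i++;                    /* incl  %eax          */
    } while (i != n);           /* cmpl  %eax, %ecx ; jne .L4 */
}
```

The point of the listing is that a single counter serves as the index for both arrays and as the loop test, so the loop body carries only one register update per iteration.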