Consider the following program:
------
struct str{
  int f[3];
};

int main (int argc, char *argv[])
{
  double d[1];
  double* pd=d;
  str s;
  int* pf=&s.f[0];
  s.f[0]=s.f[1]=s.f[2]=0;
  for (int i=0;i<1000000000;++i) {
#ifdef SLOW1
    pd[s.f[0]+s.f[1]*s.f[2]]=1;
#elif defined(SLOW2)
    d[pf[0]+pf[1]*pf[2]]=1;
#else
    pd[pf[0]+pf[1]*pf[2]]=1;
#endif
  }
}
------
After compilation with gcc-3.2 -O9 on Linux Mandrake 9.0, Athlon 900MHz:
time ./a.out of the SLOW[12] versions gives >10s
time ./a.out of the "!defined(SLOW[12])" version gives 3.50s

Logically, all three variants do the same thing...

Release:
gcc version 3.2 (Mandrake Linux 9.0 3.2-1mdk); also with gcc-3.1

Environment:
Linux 2.4.19-16mdk i686 GNU/Linux Mandrake 9.0
AMD Athlon(tm) Processor stepping 2 cpu MHz 908.111

How-To-Repeat:
gcc -O9 test.cpp ; time ./a.out
gcc -O9 -DSLOW1 test.cpp ; time ./a.out
gcc -O9 -DSLOW2 test.cpp ; time ./a.out
Fix:
gcc-2.96 doesn't have this problem. Sorry, but I don't have access to any official releases other than 3.1 and 3.2.
Hello,

your problem as stated is fixed on the gcc 3.3 branch and mainline. Please note that -O9 is the same thing as -O3 (-On for n>3 is identical to -O3). Second, your code performs identically for me with -O3, -O3 -DSLOW1 and -O3 -DSLOW2 with gcc 3.3.

With gcc mainline, things are somewhat more bizarre. With -O3 or -O3 -DSLOW1, the code is about 50% slower than with gcc 3.3, but with -O3 -DSLOW2 it's the same speed as gcc 3.3 (i.e. faster than with the other two options). Very strange.

Dara
There is a performance regression here from 3.3; other than that, the -DSLOW[12] issue is fixed, and all three variants perform the same.
Here is the difference in the code:

--- temp.s      Sat Dec 27 22:51:13 2003
+++ temp1.s     Sat Dec 27 22:51:04 2003
@@ -5,21 +5,20 @@
        .type   main, @function
 main:
        pushl   %ebp
-       xorl    %edx, %edx
+       movl    $999999999, %eax
        movl    %esp, %ebp
        subl    $40, %esp
-       leal    -32(%ebp), %ecx
-       movl    $0, -16(%ebp)
+       leal    -32(%ebp), %edx
        andl    $-16, %esp
-       movl    $0, -20(%ebp)
+       movl    $0, -16(%ebp)
        subl    $16, %esp
+       movl    $0, -20(%ebp)
        movl    $0, -24(%ebp)
-       movl    $999999999, %eax
        .p2align 4,,15
 .L5:
-       movl    $0, (%ecx,%edx,8)
+       movl    $0, (%edx)
        decl    %eax
-       movl    $1072693248, 4(%ecx,%edx,8)
+       movl    $1072693248, 4(%edx)
        jns     .L5
        leave
        xorl    %eax, %eax

The main difference is the use of (%ecx,%edx,8) vs (%edx).
But this does not produce any difference in performance (at least on Pentium 4 or Pentium 3):

tin:~/src/gnu/gcctest>time ./a.out
2.010u 0.000s 0:02.01 100.0%    0+0k 0+0io 69pf+0w
tin:~/src/gnu/gcctest>gcc -std=c99 -O3 pr8776.c -DSLOW2
tin:~/src/gnu/gcctest>!tim
time ./a.out
2.010u 0.000s 0:02.01 100.0%    0+0k 0+0io 69pf+0w
tin:~/src/gnu/gcctest>gcc -std=c99 -O3 pr8776.c -DSLOW1
tin:~/src/gnu/gcctest>!time
time ./a.out
2.010u 0.000s 0:02.01 100.0%    0+0k 0+0io 69pf+0w

So this is fixed on the mainline and 3.3.3.
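
For anyone reading the assembly above: the two movl stores in the .L5 loop are the assignment of the double value 1 to d[0], written as two 32-bit halves. The sketch below is not part of the original test case; it assumes a 64-bit IEEE-754 double, and the names Words, split_double and check_one_point_zero are made up for illustration. It shows that the low half of 1.0 is 0 and the high half is 1072693248 (0x3FF00000), which on little-endian x86 land at offsets 0 and 4.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the two 32-bit halves of a double.
struct Words {
    std::uint32_t low;   // bits 0..31  (offset 0 on little-endian x86)
    std::uint32_t high;  // bits 32..63 (offset 4 on little-endian x86)
};

// Split a double into its halves. Assumes a 64-bit IEEE-754 double.
Words split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bits
    return Words{static_cast<std::uint32_t>(bits & 0xFFFFFFFFu),
                 static_cast<std::uint32_t>(bits >> 32)};
}

// Checks the two immediates that the loop stores for the value 1.0.
void check_one_point_zero() {
    const Words w = split_double(1.0);
    assert(w.low == 0u);            // movl $0, (%edx)
    assert(w.high == 1072693248u);  // movl $1072693248, 4(%edx)
}
```

Calling check_one_point_zero() from a small main() should pass silently on x86. Since pf[0]+pf[1]*pf[2] is always 0 here, both versions of the loop write the same eight bytes on every iteration; only the addressing mode differs.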