I see a 14% slowdown with the SciMark sparse matrix multiplication benchmark when going from 3.4.3 to 4.0.0 on my Gentoo box. Flags are -O3 -march=athlon-xp -fomit-frame-pointer. I compiled and linked in one run of gcc, and ran the executable from the command line with "time". 4.0's performance gets better (closer to 3.4's, which remains roughly constant) as functions from the other files are moved into main.c. On those grounds, my own inexpert opinion is that this regression stems from function inlining. Earlier tests showed that the two versions are much closer when -fomit-frame-pointer isn't used. Will upload the preprocessed test cases that you're so fond of ;).
Created attachment 8933: preprocessed testcase files
It would be nice if we could get 4.1.x numbers.
I get the following with -O3 -march=pentium4 -fomit-frame-pointer on a pentium4 gentoo machine:

gcc-3.4.6   gcc-4.0.2   gcc-4.1.1
2.69s       4.14s       3.26s

These are all with gentoo's patches. Also, current mainline is the same as gcc-4.1.1.

I can confirm that the difference without -fomit-frame-pointer is much smaller. In fact, 3.4.6 and 4.1.1 are almost the same without it.
I get on a Pentium 4, with

  -O3 -march=pentium4 -fomit-frame-pointer -o bench Random.i SparseCompRow.i array.i kernel.i main.i

3.4.6: 3.48s
4.0.3: 4.44s
4.1.1: 4.12s
4.2.0: 4.13s
Can someone try the mainline again after Paolo B.'s patch?
IMO the problem here is in IVopts. Using gcc-3.x, the innermost loop compiles to:

.L15:
        movl    (%edi,%edx,4), %eax
        fldl    (%ebp,%edx,8)
        addl    $1, %edx
        fmull   (%esi,%eax,8)
        cmpl    %ecx, %edx
        faddp   %st, %st(1)
        jl      .L15

and with current SVN gcc-4.2 into:

.L12:
        movl    (%ecx), %eax
        fldl    (%ebp,%eax,8)
        fmull   (%edx)
        faddp   %st, %st(1)
        addl    $1, %ebx
        addl    $4, %ecx
        addl    $8, %edx
        cmpl    %esi, %ebx
        jne     .L12

Adding -fno-ivopts, this loop gets compiled into:

.L12:
        movl    (%edi,%edx,4), %eax
        fldl    (%esi,%eax,8)
        fmull   (%ebp,%edx,8)
        faddp   %st, %st(1)
        addl    $1, %edx
        cmpl    %edx, %ecx
        jg      .L12

Timings (-O3 -march=pentium4 -fomit-frame-pointer):

gcc-3.2:                0m2.301s
gcc-4.2:                0m2.713s
gcc-4.2 + -fno-ivopts:  0m2.473s

with:

gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
gcc version 4.2.0 20060816 (experimental)

I think that remaining time difference is due to strange loop above innermost:

gcc-3.2:

        fld     %st(0)
.L16:
        movl    36(%esp), %eax
        fld     %st(0)
        movl    4(%eax,%ebx,4), %ecx
        movl    (%eax,%ebx,4), %edx
        cmpl    %ecx, %edx
        jge     .L23
.L15:
        movl    (%edi,%edx,4), %eax
        fldl    (%ebp,%edx,8)
        addl    $1, %edx
        fmull   (%esi,%eax,8)
        cmpl    %ecx, %edx
        faddp   %st, %st(1)
        jl      .L15
.L23:
        movl    28(%esp), %eax
        fstpl   (%eax,%ebx,8)
        addl    $1, %ebx
        cmpl    24(%esp), %ebx
        jl      .L16

========

gcc-4.2:

.L8:
        movl    36(%esp), %edx
        movl    (%edx,%edi,4), %eax
        movl    4(%edx,%edi,4), %esi
        fldz
        cmpl    %esi, %eax
        jge     .L11
        fstp    %st(0)
        movl    40(%esp), %ebx
        leal    (%ebx,%eax,4), %ecx
        movl    32(%esp), %ebx
        leal    (%ebx,%eax,8), %edx
        fldz
        xorl    %ebx, %ebx
        subl    %eax, %esi
.L12:
        movl    (%ecx), %eax
        fldl    (%ebp,%eax,8)
        fmull   (%edx)
        faddp   %st, %st(1)
        addl    $1, %ebx
        addl    $4, %ecx
        addl    $8, %edx
        cmpl    %esi, %ebx
        jne     .L12
.L11:
        movl    28(%esp), %eax
        fstpl   (%eax,%edi,8)
        addl    $1, %edi
        cmpl    24(%esp), %edi
        jne     .L8

========

and gcc-4.2 -fno-ivopts:

.L8:
        leal    (%ebx,%ebx), %eax
        movl    40(%esp), %edx
        movl    (%edx,%eax,2), %edx
        movl    %edx, (%esp)
        movl    40(%esp), %edx
        movl    4(%edx,%eax,2), %ecx
        fldz
        cmpl    %ecx, (%esp)
        jge     .L11
        fstp    %st(0)
        movl    (%esp), %edx
        fldz
.L12:
        movl    (%edi,%edx,4), %eax
        fldl    (%esi,%eax,8)
        fmull   (%ebp,%edx,8)
        faddp   %st, %st(1)
        addl    $1, %edx
        cmpl    %edx, %ecx
        jg      .L12
.L11:
        movl    32(%esp), %ecx
        fstpl   (%ecx,%ebx,8)
        addl    $1, %ebx
        cmpl    %ebx, 28(%esp)
        jg      .L8
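For readers who prefer C to assembly, the two inner-loop shapes above can be sketched roughly as follows. This is only an illustration reconstructed from the listings, not the benchmark source: the function names and the cp/vp/k/n variables are hypothetical, and the array roles (col, val, x) are inferred from the addressing modes. The first function keeps a single counter and indexes both arrays from it, as in the gcc-3.x and -fno-ivopts loops; the second walks two pointers plus a counter, which is what the IVopts output does and why it needs three additions per iteration.

```c
#include <stddef.h>

/* Indexed form: one counter i, both arrays addressed as base + i*scale.
   Roughly the shape of the gcc-3.x and -fno-ivopts inner loops. */
static double row_dot_indexed(const double *x, const double *val,
                              const int *col, int rowR, int rowRp1)
{
    double sum = 0.0;
    for (int i = rowR; i < rowRp1; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Strength-reduced form: two running pointers plus a separate counter.
   Roughly the shape of the gcc-4.2 inner loop with IVopts enabled.
   cp, vp, k and n are hypothetical names for the extra induction variables. */
static double row_dot_pointers(const double *x, const double *val,
                               const int *col, int rowR, int rowRp1)
{
    double sum = 0.0;
    if (rowR >= rowRp1)              /* guard, like the cmpl/jge before .L12 */
        return sum;

    const int *cp = col + rowR;      /* advances by 4 bytes per iteration */
    const double *vp = val + rowR;   /* advances by 8 bytes per iteration */
    int n = rowRp1 - rowR;           /* trip count, like subl %eax, %esi */

    for (int k = 0; k != n; k++) {
        sum += x[*cp] * *vp;
        cp++;
        vp++;
    }
    return sum;
}
```

Both functions perform the same multiplications and additions in the same order, so they return identical results; only the address arithmetic differs.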
(In reply to comment #6)
> I think that remaining time difference is due to strange loop above innermost:

... due to strange _header_ above innermost loop ...

The problem is that we load zero in both arms of "if". This is what I get in .099t.optimized (using gcc-4.2 -O2 -fno-ivopts):

<L1>:;
  r.0 = (unsigned int) r;
  D.1556 = r.0 * 4;
  rowR = *((int *) D.1556 + row);
  rowRp1 = *((int *) D.1556 + row + 4B);
  if (rowR < rowRp1) goto <L41>; else goto <L42>;

<L42>:;
  sum = 0.0;
  goto <bb 5> (<L4>);

<L41>:;
  i = rowR;
  sum = 0.0;

Assignment to sum should be moved before if...

SSE is able to somehow CSE zero load during RTL:

.L8:
        movl    20(%ebp), %edx
        movapd  %xmm2, %xmm1
        movl    (%edx,%ebx,4), %eax
        movl    4(%edx,%ebx,4), %ecx
        cmpl    %ecx, %eax
        jge     .L11
        movl    %eax, %edx
        .p2align 4,,7
.L12:
It is also interesting that -march=pentium4 produces the following "de-optimized" code, adding a couple more instructions and wasting the %eax register:

.L8:
        leal    (%ebx,%ebx), %eax
        movl    40(%esp), %edx
        movl    (%edx,%eax,2), %edx
        movl    %edx, (%esp)
        movl    40(%esp), %edx
        movl    4(%edx,%eax,2), %ecx
        movapd  %xmm2, %xmm1
        cmpl    %ecx, (%esp)
        jge     .L11
        movl    (%esp), %edx
.L12:

Some additional timings (gcc-4.2 -O2 -fomit-frame-pointer):

-march=pentium4:                            0m2.756s
-march=pentium4 -fno-ivopts:                0m2.500s
-march=pentium4 -fno-ivopts -mfpmath=sse:   0m2.461s
-msse2 -fno-ivopts -mfpmath=sse:            0m2.311s

In the last case, the generated code is equal to the code generated by gcc-3.2:

.L8:
        movl    36(%esp), %edx
        movapd  %xmm2, %xmm1
        movl    (%edx,%ebx,4), %eax
        movl    4(%edx,%ebx,4), %ecx
        cmpl    %ecx, %eax
        jge     .L11
        movl    %eax, %edx
        .p2align 4,,7
.L12:
        movl    (%edi,%edx,4), %eax
        movsd   (%esi,%eax,8), %xmm0
        mulsd   (%ebp,%edx,8), %xmm0
        addl    $1, %edx
        cmpl    %edx, %ecx
        addsd   %xmm0, %xmm1
        jg      .L12
Fixed on the mainline by: http://gcc.gnu.org/ml/gcc-patches/2006-08/msg01036.html
(In reply to comment #9)
> Fixed on the mainline by:
> http://gcc.gnu.org/ml/gcc-patches/2006-08/msg01036.html

Not really, the above patch fixed only one of three problems. The other two remain, that is:

- ivopts problem (see comment #6)
- -march=pentium4 (see comment #8)

I'll try to see which option causes the problems described in comment #8.
Open regression with no activity since February 14. Ping?
(In reply to comment #11)
> Open regression with no activity since February 14. Ping?

Current 4.3 SVN produces (-O3 -march=pentium4):

.L8:
        movl    (%ebx), %eax            #* ivtmp.33, tmp84
        fldl    (%edi,%eax,8)           #
        fmull   (%ecx)                  #* ivtmp.35
        faddp   %st, %st(1)             #,
        addl    $1, %edx                #, i
        addl    $4, %ebx                #, ivtmp.33
        addl    $8, %ecx                #, ivtmp.35
        cmpl    %edx, %esi              # i, rowRp1
        jg      .L8                     #,

(+ -fno-ivopts):

.L8:
        movl    (%esi,%edx,4), %eax     #, tmp76
        fldl    (%edi,%eax,8)           #* x
        movl    16(%ebp), %eax          # val,   <-----here
        fmull   (%eax,%edx,8)           #
        faddp   %st, %st(1)             #,
        addl    $1, %edx                #, i
        cmpl    %edx, %ecx              # i, rowRp1
        jg      .L8                     #,

We regressed vs. 4.2 in this case; "val" is loaded inside the loop.
Re. comment #7, "Assignment to sum should be moved before if..." This is called code hoisting, and it is not performed in GCC except with -Os. See bug 24001, bug 33315, and other reports about the same issue in Bugzilla.
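To make the hoisting point concrete, here is a rough C rendering of the control flow in the dump from comment #7, next to the form with the assignment hoisted. It is a sketch under assumptions: the function names are hypothetical, and the loop body is filled in from the assembly in comment #6 rather than taken from the dump, which only shows the header.

```c
#include <stddef.h>

/* Shape of the dump in comment #7: "sum = 0.0" appears in both arms. */
static double row_sum_unhoisted(const double *x, const double *val,
                                const int *col, int rowR, int rowRp1)
{
    double sum;
    if (rowR < rowRp1) {
        int i = rowR;
        sum = 0.0;                   /* first copy of the zero load */
        do {
            sum += x[col[i]] * val[i];
            i++;
        } while (i < rowRp1);
    } else {
        sum = 0.0;                   /* second copy of the zero load */
    }
    return sum;
}

/* Hoisted form: the common assignment is moved above the branch,
   so only one zero load remains. */
static double row_sum_hoisted(const double *x, const double *val,
                              const int *col, int rowR, int rowRp1)
{
    double sum = 0.0;
    if (rowR < rowRp1) {
        int i = rowR;
        do {
            sum += x[col[i]] * val[i];
            i++;
        } while (i < rowRp1);
    }
    return sum;
}
```

The two functions are equivalent; the second is the transformation that code hoisting would perform on the first.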
We still generate:

.L8:
        movl    (%ebx), %eax
        addl    $1, %edx
        addl    $4, %ebx
        fldl    (%edi,%eax,8)
        fmull   (%ecx)
        addl    $8, %ecx
        cmpl    %edx, %esi
        faddp   %st, %st(1)
        jg      .L8

This could IMO be optimized on RTL level, to use %edx as a count variable:

.L8:
        movl    (%ebx,%edx,4), %eax
        fldl    (%edi,%eax,8)
        fmull   (%ecx,%edx,8)
        addl    $1, %edx
        cmpl    %edx, %esi
        faddp   %st, %st(1)
        jg      .L8

This would be optimal code for this loop.
Closing 4.1 branch.
See comment #7 and comment #13.
Confirmed:

3.3 -O2               3.11s
4.1 -O2               3.44s
4.4 -O2               3.36s
4.4 -O2 -fno-ivopts   3.00s
For whatever reason, we're now faster than GCC 3.3. -O2 -fno-ivopts is still faster than -O2; I will open an enhancement request for that.
Closing 4.2 branch.
WONTFIX on the 4.3 branch.