In the following C source file test.c, int compare(char const * p1, int n1, char const * p2, int n2) { char const * q1 = p1 + n1; char const * q2 = p2 + n2; while (p1 < q1 && p2 < q2) { int n = *--q1 - *--q2; if (n) { return n; } } return n1 - n2; } int main(void) { char str[1000]; int i; for (i = 0; i < 1000000; ++i) { compare(str, 1000, str, 1000); } } compiled with gcc -O2 test.c the loop body within compare takes 14 instructions on GCC 3.4.3 Linux x86, compared to only 11 instructions on GCC 3.2.2 Linux x86 (see disassemblies below), and the GCC 3.4.3 output takes substantially longer to run than the GCC 3.2.2 output: 3.4.3> time ./a.out 4.690u 0.001s 0:04.71 99.5% 3.2.2> time ./a.out 3.533u 0.002s 0:03.55 99.4% This seems to bite us on OpenOffice.org, which contains an oft-called function similar to compare above, and new versions of OOo (built with GCC 3.4) run slower than old versions (built with GCC 3.2). The question that remains for us is whether this performance loss is specific to the given function, or could be a general problem. The dissasemblies: 08048360 <compare>: ! gcc 3.4.3 8048360: 55 push %ebp 8048361: 89 e5 mov %esp,%ebp 8048363: 57 push %edi 8048364: 56 push %esi 8048365: 53 push %ebx 8048366: 8b 45 0c mov 0xc(%ebp),%eax 8048369: 8b 7d 08 mov 0x8(%ebp),%edi 804836c: 8b 75 10 mov 0x10(%ebp),%esi 804836f: 8d 1c 07 lea (%edi,%eax,1),%ebx 8048372: 8b 45 14 mov 0x14(%ebp),%eax 8048375: 8d 0c 06 lea (%esi,%eax,1),%ecx 8048378: 90 nop 8048379: 8d b4 26 00 00 00 00 lea 0x0(%esi),%esi 8048380: 39 df cmp %ebx,%edi ! <-+ 8048382: 0f 92 c0 setb %al ! | 8048385: 31 d2 xor %edx,%edx ! | 8048387: 39 ce cmp %ecx,%esi ! | 8048389: 0f 92 c2 setb %dl ! | 804838c: 85 d0 test %edx,%eax ! | 804838e: 74 13 je 80483a3 ! | 8048390: 4b dec %ebx ! | 8048391: 49 dec %ecx ! | 8048392: 0f be 01 movsbl (%ecx),%eax ! | 8048395: 0f be 13 movsbl (%ebx),%edx ! | 8048398: 29 c2 sub %eax,%edx ! | 804839a: 89 d0 mov %edx,%eax ! | 804839c: 74 e2 je 8048380 ! --+ 804839e: 5b pop %ebx 804839f: 5e pop %esi 80483a0: 5f pop %edi 80483a1: 5d pop %ebp 80483a2: c3 ret 80483a3: 5b pop %ebx 80483a4: 8b 45 0c mov 0xc(%ebp),%eax 80483a7: 8b 55 14 mov 0x14(%ebp),%edx 80483aa: 5e pop %esi 80483ab: 29 d0 sub %edx,%eax 80483ad: 5f pop %edi 80483ae: 5d pop %ebp 80483af: c3 ret 08048370 <compare>: ! gcc 3.2.2 8048370: 55 push %ebp 8048371: 89 e5 mov %esp,%ebp 8048373: 57 push %edi 8048374: 8b 45 0c mov 0xc(%ebp),%eax 8048377: 56 push %esi 8048378: 8b 7d 08 mov 0x8(%ebp),%edi 804837b: 53 push %ebx 804837c: 8b 75 10 mov 0x10(%ebp),%esi 804837f: 8d 1c 38 lea (%eax,%edi,1),%ebx 8048382: 8b 45 14 mov 0x14(%ebp),%eax 8048385: 39 df cmp %ebx,%edi 8048387: 8d 0c 30 lea (%eax,%esi,1),%ecx 804838a: 73 1a jae 80483a6 804838c: 39 ce cmp %ecx,%esi 804838e: 73 16 jae 80483a6 8048390: 4b dec %ebx ! <-+ 8048391: 49 dec %ecx ! | 8048392: 0f be 01 movsbl (%ecx),%eax ! | 8048395: 0f be 13 movsbl (%ebx),%edx ! | 8048398: 29 c2 sub %eax,%edx ! | 804839a: 89 d0 mov %edx,%eax ! | 804839c: 75 10 jne 80483ae ! | 804839e: 39 df cmp %ebx,%edi ! | 80483a0: 73 04 jae 80483a6 ! | 80483a2: 39 ce cmp %ecx,%esi ! | 80483a4: 72 ea jb 8048390 ! --+ 80483a6: 8b 45 0c mov 0xc(%ebp),%eax 80483a9: 8b 55 14 mov 0x14(%ebp),%edx 80483ac: 29 d0 sub %edx,%eax 80483ae: 5b pop %ebx 80483af: 5e pop %esi 80483b0: 5f pop %edi 80483b1: 5d pop %ebp 80483b2: c3 ret
I cannot test this right now but this might be fixed on the mainline.
Hmm, this looks like a branch cost problem or related to that. I think you can get all the speed back by supplying -mbranch-cost=1 but I could be wrong. Hmm, I wonder if 3.2.x was compiled for i486 (or i386) instead of i686 which means this might not be a regression. Can you give the output of "gcc -v" for both GCC's?
"I think you can get all the speed back by supplying -mbranch-cost=1 but I could be wrong." No, adding -mbranch-cost=1 leads to only a very minor performance improvement. "Can you give the output of "gcc -v" for both GCC's?" 3.4.3> gcc -v Reading specs from /export/home/sb93797/gcc-3.4.3-inst/lib/gcc/i686-pc-linux-gnu/3.4.3/specs Configured with: ../gcc-3.4.3/configure --prefix=/export/home/sb93797/gcc-3.4.3-inst Thread model: posix gcc version 3.4.3 3.2.2> gcc -v Reading specs from /so/env/gcc_3.2.2_linux_libc2.20/bin/../lib/gcc-lib/i686-pc-linux-gnu/3.2.2/specs Configured with: /export/home/obo/gcc-3.2.2/configure --prefix=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20 --with-as=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20/bin/as --with-ld=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20/bin/ld --enable-languages=c,c++ Thread model: posix gcc version 3.2.2
Confirmed.
I just reproduced this on two flavors of pentium 4 using -O3 -mtune=pentium. The regression is worse, sometimes much worse, with -fPIC. Times for gcc-2.95.3, gcc-3.4.3, gcc-4.0.0, gcc-4.1-20050603: test run 1: /proc/cpuinfo: cpu family 15, model 2, Intel Pentium 4 CPU 2.80GHz. no -fPIC: 2.2, 2.7. 0.6. 0.6 seconds with fPIC: 2.2, 2.9, 2.9, 2.9 seconds test run 2: /proc/cpuinfo: cpu family 15, model 4, Intel Pentium 4 CPU 3.20GHz no -fPIC: 1.9, 3.2, 0.5, n/a seconds with fPIC: 1.9, 3.6, 3.6, n/a seconds
For gcc4, the no-PIC case looks pretty good to me ;-)
Dan, can you show the assembler output for 2.95.3 and 4.0 (why is 4.1 "n/a"??) for the -fPIC case?
Moving to 4.0.2 pre Mark.
Can anybody try killloop-branch? Just a wild guess.
There is also a (minor) register allocation problem in both outputs: 8048392: 0f be 01 movsbl (%ecx),%eax ! | 8048395: 0f be 13 movsbl (%ebx),%edx ! | 8048398: 29 c2 sub %eax,%edx ! | 804839a: 89 d0 mov %edx,%eax ! | 804839c: 74 e2 je 8048380 ! --+ This could obviously become 8048392: 0f be 11 movsbl (%ecx),%edx 8048395: 0f be 03 movsbl (%ebx),%eax 8048398: 2b c2 sub %edx,%eax 804839c: 74 e2 je 8048380
The code I get from -mbranch-cost=1 is extremely similar to what 3.2.2 produced: .L7: cmpl %ebx, %edi jae .L2 cmpl %ecx, %esi jae .L2 decl %ebx movsbl -1(%ecx),%eax decl %ecx movsbl (%ebx),%edx subl %eax, %edx movl %edx, %eax je .L7 .L2: The front-end is correctly producing a TRUTH_AND_EXPR, i.e. the short-circuiting is not required. RTL expansion will use jae (short-circuited evaluation) or seta, depending on the branch cost. However, the tuning here is wrong because on a pentium4 I do get a 40% speedup from -mbranch-cost=1.
Created attachment 10016 [details] patch to fix the bug Indeed, RTL expansion will use jae (short-circuited evaluation) or seta, depending on the branch cost, but in a very involuted way which screws up the tuning. This patch makes it more direct.
The patch is not good because I forgot to check fallthrough edges. The code I added goes above, just before "case CALL_EXPR:". Restarting a bootstrap with a correct patch.
Subject: Bug 19672 CVSROOT: /cvs/gcc Module name: gcc Changes by: bonzini@gcc.gnu.org 2005-10-19 10:37:31 Modified files: gcc : ChangeLog dojump.c Log message: 2005-10-18 Paolo Bonzini <bonzini@gnu.org> PR #19672 * dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR here. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10182&r2=2.10183 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&r1=1.42&r2=1.43
Subject: Bug 19672 CVSROOT: /cvs/gcc Module name: gcc Branch: gcc-4_0-branch Changes by: bonzini@gcc.gnu.org 2005-10-19 10:41:09 Modified files: gcc : dojump.c ChangeLog Log message: 2005-10-18 Paolo Bonzini <bonzini@gnu.org> PR #19672 * dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR here. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&only_with_tag=gcc-4_0-branch&r1=1.37.8.1&r2=1.37.8.2 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-4_0-branch&r1=2.7592.2.470&r2=2.7592.2.471
Subject: Bug 19672 CVSROOT: /cvs/gcc Module name: gcc Branch: gcc-3_4-branch Changes by: bonzini@gcc.gnu.org 2005-10-19 10:49:57 Modified files: gcc : ChangeLog dojump.c Log message: 2005-10-19 Paolo Bonzini <bonzini@gnu.org> PR #19672 * dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR like TRUTH_ANDIF_EXPR and TRUTH_ORIF_EXPR, if the branch cost is low enough. Patches: http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=2.2326.2.923&r2=2.2326.2.924 http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=1.9.4.2&r2=1.9.4.3
Patches committed to all active branches.
Patch had to be reverted on 3.4
Fixed for 4.0.x and higher. Won't fix in 3.4.x