When compiling a program with computed gotos with gcc 4.4.2, it runs significantly slower (up to a factor of 10) than when it is compiled with e.g. gcc 4.1/4.3 with the same optimization flags (-O2 or -O3).

A small dummy test program without header file dependencies is attached. I am compiling with a command line like "gcc -O3 test.c -o testp.4.4.2", and running the generated executable without arguments, like "./testp.4.4.2".

Generating CPU-specific instructions, e.g. "-march=athlon64", seems to make no difference. I have also tried "-fno-gcse" (as recommended in the docs) to no avail. I get the same results with targets x86_64 and i686 on Novell SLES 10 and Arch Linux.

Interestingly enough, I do not see this problem on any Intel processor I have tried, but I have seen the slowdown on all AMD processors I have tried (e.g. Dual-Core AMD Opteron Processor 2216 and AMD Turion 64 X2 Mobile Technology TL-60). In fact, the exact same two binaries resulting from compilation with gcc 4.4.2 and gcc 4.3 for i686, which show a significant performance difference on an AMD, do not show any significant difference on an Intel Core 2 Duo T7500.

Some observations:

1. On AMD there is a huge difference in the number of mispredicted branches between the program compiled with gcc-4.4.2 and the program compiled with earlier compilers.
See for instance the following output from oprofile:

---
Counted RETIRED_INDIRECT_BRANCHES_MISPREDICTED events (Retired Indirect Branches Mispredicted) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS events (Retired Mispredicted Branch Instructions) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_TAKEN_BRANCH_INSTRUCTIONS events (Retired taken branch instructions) with a unit mask of 0x00 (No unit mask) count 500
RETIRED_INDIRE...|RETIRED_MISPRE...|RETIRED_TAKEN_...|
  samples|      %|  samples|      %|  samples|      %|
------------------------------------------------------
   185416 88.7799    186587 82.8723    381826 48.1913 testp.4.4.2
     5605  2.6838      6275  2.7870    157401 19.8660 testp.4.3

2. Gcc 4.3 generates the following assembler around the "eq:" label in the attached program:

  4004c0: 48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004c7: 74 21                   je     4004ea <main+0x6a>
  4004c9: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004d0: 48 63 c5                movslq %ebp,%rax
  4004d3: 48 8b 44 c4 b0          mov    -0x50(%rsp,%rax,8),%rax
  4004d8: ff e0                   jmpq   *%rax

While gcc 4.4.2 will generate an additional jump instruction:

  4004c0: ff e0                   jmpq   *%rax
  ...
  4004d8: 48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004df: 74 21                   je     400502 <main+0x82>
  4004e1: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004e8: 48 63 c5                movslq %ebp,%rax
  4004eb: 48 8b 44 c4 88          mov    -0x78(%rsp,%rax,8),%rax
  4004f0: eb ce                   jmp    4004c0 <main+0x40>

3. I see the same behaviour with a month-old snapshot of gcc 4.5.

Examples of compilers used (I have tried a number of different builds on different targets):

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++,fortran,objc,obj-c++,ada --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic
Thread model: posix
gcc version 4.4.2 (GCC)

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++ --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic --disable-werror --enable-checking=release --program-suffix=-4.3 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.3.3 (GCC)

Test program:
=============
#define VALUE 100000000

int main(int argc, char *argv[])
{
    void *ops[] = { &&inc, &&eq, &&gt, &&lt, &&gte, &&lte, &&zero, &&not_implemented, &&exit };
    long i = 0;
    int next_op = argc; //unknown at compile time...
    int fail_op = 0; //inc

    goto *ops[0];

inc:
    i++;
    goto *ops[next_op];
eq:
    if (!(i == VALUE)) goto handle_fail;
    return 0;
gt:
    if (!(i > VALUE)) goto handle_fail;
    return 0;
lt:
    if (!(i < VALUE)) goto handle_fail;
    return 0;
gte:
    if (!(i >= VALUE)) goto handle_fail;
    return 0;
lte:
    if (!(i <= VALUE)) goto handle_fail;
    return 0;
zero:
    if (!(i == 0)) goto handle_fail;
    return 0;
not_implemented:
    fail_op = 8; //exit
    goto handle_fail;
exit:
    return -1;
handle_fail:
    goto *ops[fail_op];
}
There is a pass "duplicate_computed_gotos" that should take care of this. Why does it not work in this case?
Caused by revision 139760. http://gcc.gnu.org/viewcvs?view=revision&revision=139760
So the profiling information (which is not always accurate without real profiling) says the code is not executed that often. I guess someone needs to tune them better for computed gotos unless people really want to do profiling runs first to get better performance ...
I would just go back to the old status (of GCC 4.3 and earlier) rather than deciding for each basic block individually whether to unfactor or not. Could you please see if the attached patch makes the slow-down disappear?

Index: bb-reorder.c
===================================================================
--- bb-reorder.c (revision 155661)
+++ bb-reorder.c (working copy)
@@ -1981,7 +1981,9 @@ gate_duplicate_computed_gotos (void)
 {
   if (targetm.cannot_modify_jumps_p ())
     return false;
-  return (optimize > 0 && flag_expensive_optimizations);
+  return (optimize > 0
+          && flag_expensive_optimizations
+          && ! optimize_function_for_size_p (cfun));
 }
 
 
@@ -2072,9 +2074,6 @@ duplicate_computed_gotos (void)
           || single_pred_p (single_succ (bb)))
         continue;
 
-      if (!optimize_bb_for_size_p (bb))
-        continue;
-
       /* The successor block has to be a duplication candidate. */
       if (!bitmap_bit_p (candidates, single_succ (bb)->index))
         continue;
Thanks for the quick patch! Unfortunately, it only works for me with the option "-march=athlon64". Is this intentional ("-march" is not needed for gcc-4.3)? Am I doing something wrong?

$ gcc-4.3 -v && /opt/gcc/bin/gcc-4.4.2-new -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++ --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic --disable-werror --enable-checking=release --program-suffix=-4.3 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.3.3 (GCC)
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.4.2/configure --prefix=/opt/gcc --enable-shared --enable-languages=c,c++ --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic --disable-werror --enable-checking=release --program-suffix=-4.4.2-new --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.4.2 (GCC)

$ gcc-4.3 -g -O3 test.c -o testp.4.3 && /opt/gcc/bin/gcc-4.4.2-new -g -Wall -O3 test.c -o testp.4.4.2
$ time ./testp.4.3 && time ./testp.4.4.2

real    0m0.889s
user    0m0.880s
sys     0m0.000s

real    0m4.043s
user    0m4.036s
sys     0m0.003s

$ gcc-4.3 -g -O3 test.c -o testp.4.3 && /opt/gcc/bin/gcc-4.4.2-new -g -Wall -march=athlon64 -O3 test.c -o testp.4.4.2
$ time ./testp.4.3 && time ./testp.4.4.2

real    0m0.888s
user    0m0.880s
sys     0m0.000s

real    0m0.638s
user    0m0.627s
sys     0m0.003s
I will try to distclean and rebuild from scratch to confirm my statement above.
Summary: The patch works great when building gcc from trunk (revision 155680). Both the supplied test program and the real application are optimized.

With gcc-4.4.2 I get the optimization for the test program only with e.g. -march=athlon64 or -mtune=native (which is an improvement; previously I could not get it to work even with these options). Without -mtune/-march, the optimization seems to bail out on the following check in bb-reorder.c@@duplicate_computed_gotos(void):

      /* Obviously the block has to end in a computed jump. */
      if (!computed_jump_p (BB_END (bb)))
        continue;

I assume the patch was written for 4.5, so maybe testing it on 4.4.2 is a bit premature. Hope it helps anyway.
Subject: Bug 42621

Author: steven
Date: Sun Jan 10 23:31:30 2010
New Revision: 155796

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155796
Log:
    PR rtl-optimization/42621
    * bb-reorder.c (gate_duplicated_computed_gotos): Only run if not
    optimizing for size.
    (duplicate_computed_gotos): Remove now-redundant check.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bb-reorder.c
Fixed for 4.5 so far.
Please note that computed gotos are factored out because "they are a hell to deal with" in tree-cfg.c:build_gimple_cfg(). This means that they MUST be unfactored again, as promised in the comment, without leaving this to another optimization step that may or may not be enabled.

Also, for our product there are 97 "extra jumps" and 95 of them are long jumps, i.e.:

   12be0: ff e1             jmp *%ecx
   ...
   12dda: e9 01 fe ff ff    jmp 12be0 <main_loop+0x220>
   ...

so this is a serious pessimisation of both speed and size :(
Is this the same bug as PR 39284?
(In reply to comment #9)
> Fixed for 4.5 sofar.

Doesn't appear to be fixed in GCC 4.5.2 (under Gentoo Linux).

PS: The additional "jmp" instruction (as in the bug description) even appears to be generated in case of -O0.

PPS: As noted by other, this bug is likely a duplicate to bug 39284 and bug 43868.
(In reply to comment #12)
> PPS: As noted by other, this bug is likely a duplicate to bug 39284 and bug
> 43868.

As noted by others, this bug is likely a duplicate to bug 39284 and bug 43686. Sorry about the typos.
Fixed in 4.5+, 4.4 is no longer supported.