The following uses a file "ceval.i" that I'll upload shortly and a nightly build of gcc-4.4.0-20090223. The file is a version of Python's core interpreter loop modified to use computed gotos in order to help the processor's branch predictor do a better job than with a switch statement. However, gcc combines the computed gotos into only a few indirect jumps, producing a result nearly as bad as the switch.

The computed gotos are all of the form "goto *next_label;". About half of the instances counted below are reachable, as measured by the asm volatile farther down.

$ grep -c -F 'goto *next_label' ceval.i
256

I compile with -fno-gcse because http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html#index-fgcse-646 says it helps, and with "--param max-goto-duplication-insns=100000" because Diego Novillo suggested tweaking that parameter.

$ gcc-4.4 -m32 -pthread -fno-strict-aliasing -g -fwrapv -O3 -Wall -Wstrict-prototypes -fno-gcse --param max-goto-duplication-insns=100000 -S -dA ceval.i -o ceval.s
$ egrep -c 'jmp[[:space:]]*\*' ceval.s
4

I tried to make the common instruction sequence leading up to the indirect jump shorter by putting an asm volatile before it, but that didn't help. You can see the result by running:

$ sed 's!goto \*next_label!asm volatile ("/* Computed goto */":::"memory");goto *next_label!' < ceval.i > ceval-asm.i
$ gcc-4.4 -m32 -pthread -fno-strict-aliasing -g -fwrapv -O3 -fno-gcse --param max-goto-duplication-insns=100000 -S -dA ceval-asm.i -o ceval-asm.s
$ egrep -c 'Computed goto' ceval-asm.s
130
$ egrep -c 'jmp[[:space:]]*\*' ceval-asm.s
4

One of the relevant bits of assembly (search for "Computed goto" in ceval-asm.s) is:

.L1125:
	# basic block 66
.LBE835:
	# ../src/Python/ceval.c:1000
	.loc 1 1000 0
	jmp	*-72(%ebp)
.LVL558:
	.p2align 4,,7
	.p2align 3
.L1433:
	# basic block 67
	...
	# basic block 114
#APP
# 11 "../src/Include/ceval-vm.i" 1
	/* Computed goto */
# 0 "" 2
#NO_APP
	movl	%esi, %ebx
.LVL599:
	.p2align 4,,7
	.p2align 3
.L484:
	# basic block 115
	# ../src/Python/ceval.c:1000
	.loc 1 1000 0
	movl	$1, %eax
.LVL600:
	movl	%ebx, %esi
.LVL601:
	jmp	.L1125

The documentation for max-goto-duplication-insns says:

  The maximum number of instructions to duplicate to a block that jumps
  to a computed goto. To avoid O(N^2) behavior in a number of passes,
  GCC factors computed gotos early in the compilation process, and
  unfactors them as late as possible. Only computed jumps at the end of
  basic blocks with no more than max-goto-duplication-insns are
  unfactored. The default value is 8.

Since .L1125 and basic block 66 have only a single instruction, the computed goto should have been unfactored.

$ gcc-4.4 -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: /home/dnovillo/perf/sbox/gcc/local.x86_64/src/configure --prefix=/home/dnovillo/perf/sbox/gcc/local.x86_64/inst --srcdir=/home/dnovillo/perf/sbox/gcc/local.x86_64/src --with-gmp=/home/dnovillo/i686 --with-mpfr=/home/dnovillo/i686 --build=i686-pc-linux-gnu --enable-languages=c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.0 20090223 (experimental) (GCC)
Created attachment 17354 [details] ceval.i
This is by design; GCSE is the one which pulls back the computed gotos IIRC.
The documentation says you *may* get better performance, not that you *will*.
Taking out -fno-gcse doesn't change the result.

$ gcc-4.4 -m32 -pthread -fno-strict-aliasing -g -fwrapv -O3 --param max-goto-duplication-insns=100000 -S -dA ceval.i -o ceval.s
$ egrep -c 'jmp[[:space:]]*\*' ceval.s
4
Should unfactor. We have pass_duplicate_computed_gotos for this. We should look into this and see why it doesn't work. Someone added an "optimize_for_size_p()" check in duplicate_computed_gotos(). That is just *stupid*. If there should be a check like that, it should be per function, not per basic block. I bet that's the cause.
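To illustrate the distinction being drawn here, the two shapes of the check inside duplicate_computed_gotos() might look roughly like this (a sketch using GCC's internal predicates; the surrounding control flow is my assumption and this is not compilable outside the GCC tree):

```c
/* Hypothetical sketch.  Per-basic-block form of the check (the one
   objected to above): duplication is suppressed block by block.  */
if (optimize_bb_for_size_p (bb))
  continue;

/* Per-function form, as suggested: skip the whole pass only when
   the function itself is being optimized for size.  */
if (!optimize_function_for_speed_p (cfun))
  return 0;
```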
rguenther@murzim:/tmp> gcc-4.4 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000 -S ceval.i
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
4
rguenther@murzim:/tmp> gcc-4.3 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000 -S ceval.i
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
5
rguenther@murzim:/tmp> gcc-4.2 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000 -S ceval.i
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
5
rguenther@murzim:/tmp> gcc-4.1 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000 -S ceval.i
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
19

How many gotos do you expect?
I'd like to get gcc not to combine any of them, which I believe would produce 130 indirect jumps, as many as the asm volatiles that survived optimization.
Hmm, the conditional is bogus (there should not be a "!"), but even after patching this we don't duplicate. The reason is that BB 71 (containing the indirect jump) is reached via 2 BBs containing a memory store. I guess it is a result of crossjumping. I will send a patch fixing the conditional, but it makes little difference.

Honza
There shouldn't be a "!". Just use optimize_function_for_speed_p(cfun).
Honza, what do the basic blocks 2 and 71 look like for you, exactly? I see no memory load. But I have local crossjumping patches -- as you know ;-) -- so I am probably not looking at the same dumps as you are.
Hmm, it is not exactly a load. In the first case I get:

(code_label 12524 16482 12523 70 1249 "" [4 uses])
(note 12523 12524 149 70 [bb 70] NOTE_INSN_BASIC_BLOCK)
(insn:TI 149 12523 13690 70 ../src/Include/ceval-vm.i:47
        (set (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                    (const_int -64 [0xffffffffffffffc0])) [72 %sfp+-40 S4 A32])
            (const_int 1 [0x1])) 47 {*movsi_1}
        (expr_list:REG_EQUAL (const_int 1 [0x1])
            (nil)))
(insn 13690 149 1351 70
        (set (reg/v:SI 0 ax [orig:155 why ] [155])
            (const_int 1 [0x1])) 47 {*movsi_1} (nil))
(code_label 1351 13690 1352 71 382 "" [0 uses])
(note 1352 1351 1353 71 [bb 71] NOTE_INSN_BASIC_BLOCK)
(jump_insn:TI 1353 1352 1354 71 ../src/Python/ceval.c:1000
        (set (pc)
            (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                    (const_int -60 [0xffffffffffffffc4])) [72 %sfp+-36 S4 A32]))
        640 {*indirect_jump} (nil))
(barrier 1354 1353 1477)

So there are 4 edges reaching the set of WHY. In the second case it is a move of WHY to 1:

(code_label 1363 1365 1349 150 384 "" [127 uses])
(note 1349 1363 1350 150 [bb 150] NOTE_INSN_BASIC_BLOCK)
(insn:TI 1350 1349 19980 150 ../src/Python/ceval.c:1000
        (set (reg/v:SI 0 ax [orig:155 why ] [155])
            (const_int 1 [0x1])) 47 {*movsi_1} (nil))
(note 19980 1350 19979 151 [bb 151] NOTE_INSN_BASIC_BLOCK)
(jump_insn 19979 19980 19982 151 ../src/Python/ceval.c:1000
        (set (pc)
            (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                    (const_int -60 [0xffffffffffffffc4])) [72 %sfp+-36 S4 A32]))
        640 {*indirect_jump} (nil))
(barrier 19982 19979 1373)

that prevents duplicating. Probably ordinary bb-reorder could be convinced to handle this well? This doesn't seem to happen for 64-bit compilation. I also posted the patch fixing the optimize_for_size check.
Any updates on this (after 5 years)? Or any workarounds?

I tried to optimize my x86 instruction parser by using computed gotos. This post (https://blogs.oracle.com/nike/entry/fast_interpreter_using_gcc_s) suggests that it should give good speed improvements, and the compilation results shown there looked promising. But the results with newer GCC versions were so bad that I went back to a simple switch(). The 2 major speed improvements (fewer instructions / improved branch prediction) are completely killed by this bug.

Another thing that I find suspicious (maybe this is some crazy optimization I just don't understand) is that GCC no longer generates a single JMP with a memory operand, but first loads the memory into a register and then does a register-based JMP, even when the load operation is exactly the same in all cases.

From the comments it looks like there is already a working patch. Why has it not been committed? I'd really appreciate it if you could fix this asap.

PS: It would be great if you could make this work for switch statements in a loop as well! Normally people don't hassle with computed gotos; they use a switch. If the switch is in a loop and the cases go back directly to the switch statement, the additional jump should be eliminated, possibly duplicating n instructions from the top of the loop before the switch.
Author: ktietz
Date: Mon Jun 23 21:52:31 2014
New Revision: 211919

URL: https://gcc.gnu.org/viewcvs?rev=211919&root=gcc&view=rev
Log:
	PR target/39284
	* passes.def (peephole2): Move peephole2 pass before sched2 pass.
	* config/i386/i386.md (peephole2): Combine memories and indirect jumps.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/passes.def
I think we can close this bug for now. The patch won't be backported.
Author: rth
Date: Mon Jun 30 20:14:42 2014
New Revision: 212172

URL: https://gcc.gnu.org/viewcvs?rev=212172&root=gcc&view=rev
Log:
	PR rtl-opt/61608
	PR target/39284
	* bb-reorder.c (pass_duplicate_computed_gotos::execute): Cleanup
	the cfg if there were any changes.
	* passes.def: Revert move of peephole2 after reorder_blocks;
	move duplicate_computed_gotos before peephole2.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bb-reorder.c
    trunk/gcc/passes.def
(In reply to Kai Tietz from comment #14)
> I think we can close that bug for now.
> Patch won't be backported.

OK, closed.