Bug 39284 - Computed gotos combined too aggressively
Summary: Computed gotos combined too aggressively
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.4.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Depends on:
Reported: 2009-02-23 22:42 UTC by Jeffrey Yasskin
Modified: 2014-06-30 20:15 UTC (History)
6 users (show)

See Also:
Known to work:
Known to fail:
Last reconfirmed: 2009-02-23 23:19:12

ceval.i (56.78 KB, text/plain)
2009-02-23 22:44 UTC, Jeffrey Yasskin

Note You need to log in before you can comment on or make changes to this bug.
Description Jeffrey Yasskin 2009-02-23 22:42:31 UTC
The following uses a file "ceval.i" that I'll upload shortly and a nightly build of gcc-4.4.0-20090223. The file is a version of Python's core interpretation loop modified to use computed gotos in order to help the processor's branch predictor do a better job than with a switch statement. However, gcc combines the computed gotos into only a few instructions, producing a result nearly as bad as the switch.

The computed gotos are all of the form "goto *next_label;". About half of the instances counted are reachable, as measured by the asm volatile farther down.
$ grep -c -F 'goto *next_label' ceval.i

I compile with -fno-gcse because http://gcc.gnu.org/onlinedocs/gcc-4.3.3/gcc/Optimize-Options.html#index-fgcse-646 says it helps and with "--param max-goto-duplication-insns=100000" because Diego Novillo suggested tweaking that parameter.
$ gcc-4.4 -m32  -pthread  -fno-strict-aliasing -g -fwrapv -O3 -Wall -Wstrict-prototypes -fno-gcse --param max-goto-duplication-insns=100000  -S -dA ceval.i -o ceval.s
$ egrep -c 'jmp[[:space:]]*\*' ceval.s

I tried to make the common instruction sequence leading up to the indirection jump shorter by putting an asm volatile before it, but that didn't work. You can see the result by running:

$ sed 's!goto \*next_label!asm volatile ("/* Computed goto */":::"memory");goto *next_label!' < ceval.i > ceval-asm.i
$ gcc-4.4 -m32  -pthread  -fno-strict-aliasing -g -fwrapv -O3 -fno-gcse --param max-goto-duplication-insns=100000  -S -dA ceval-asm.i -o ceval-asm.s
$ egrep -c 'Computed goto' ceval-asm.s
$ egrep -c 'jmp[[:space:]]*\*' ceval-asm.s

One of the relevant bits of assembly (search for "Computed goto" in ceval-asm.s) is:

	# basic block 66
	# ../src/Python/ceval.c:1000
	.loc 1 1000 0
	jmp	*-72(%ebp)
	.p2align 4,,7
	.p2align 3
	# basic block 67


	# basic block 114
# 11 "../src/Include/ceval-vm.i" 1
	/* Computed goto */
# 0 "" 2
	movl	%esi, %ebx
	.p2align 4,,7
	.p2align 3
	# basic block 115
	# ../src/Python/ceval.c:1000
	.loc 1 1000 0
	movl	$1, %eax
	movl	%ebx, %esi
	jmp	.L1125

The documentation for max-goto-duplication-insns says:

    The maximum number of instructions to duplicate to a block that jumps to a computed goto. To avoid O(N^2) behavior in a number of passes, GCC factors computed gotos early in the compilation process, and unfactors them as late as possible. Only computed jumps at the end of a basic blocks with no more than max-goto-duplication-insns are unfactored. The default value is 8. 

Since .L1125 and basic block 66 have only a single instruction, the computed goto should have been unfactored.

$ gcc-4.4 -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: /home/dnovillo/perf/sbox/gcc/local.x86_64/src/configure --prefix=/home/dnovillo/perf/sbox/gcc/local.x86_64/inst --srcdir=/home/dnovillo/perf/sbox/gcc/local.x86_64/src --with-gmp=/home/dnovillo/i686 --with-mpfr=/home/dnovillo/i686 --build=i686-pc-linux-gnu --enable-languages=c++,fortran,objc,obj-c++
Thread model: posix
gcc version 4.4.0 20090223 (experimental) (GCC)
Comment 1 Jeffrey Yasskin 2009-02-23 22:44:04 UTC
Created attachment 17354 [details]
Comment 2 Andrew Pinski 2009-02-23 22:46:25 UTC
This is by design; GCSE is the one which pulls back the computed gotos IIRC.
Comment 3 Andrew Pinski 2009-02-23 22:48:52 UTC
It says may but not will get better performance.
Comment 4 Jeffrey Yasskin 2009-02-23 22:58:32 UTC
Taking out -fno-gcse doesn't change the result.

$ gcc-4.4 -m32  -pthread  -fno-strict-aliasing -g -fwrapv -O3 --param max-goto-duplication-insns=100000  -S -dA ceval.i -o ceval.s
$ egrep -c 'jmp[[:space:]]*\*' ceval.s
Comment 5 Steven Bosscher 2009-02-23 23:19:12 UTC
Should unfactor.  We have pass_duplicate_computed_gotos for this.  We should look into this, see why it doesn't work.

Someone added a "optimize_for_size_p()" check in duplicate_computed_gotos().  That is just *stupid*.  If there should be a check like that, it should be per function, not per basic block.  I bet that's the cause.
Comment 6 Richard Biener 2009-02-24 10:17:42 UTC
rguenther@murzim:/tmp> gcc-4.4 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000  -S ceval.i 
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
rguenther@murzim:/tmp> gcc-4.3 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000  -S ceval.i 
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
rguenther@murzim:/tmp> gcc-4.2 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000  -S ceval.i 
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s
rguenther@murzim:/tmp> gcc-4.1 -m32 -fno-strict-aliasing -fwrapv -O3 --param max-goto-duplication-insns=100000  -S ceval.i 
rguenther@murzim:/tmp> egrep -c 'jmp[[:space:]]*\*' ceval.s

How many gotos do you expect?
Comment 7 Jeffrey Yasskin 2009-02-24 15:26:10 UTC
I'd like to get gcc not to combine any of them, which I believe would produce 130, as many as the asm volatiles that survived optimization.
Comment 8 Jan Hubicka 2009-06-08 20:55:44 UTC
the conditional is bogus, there should not be !
but still after patching this we don't duplicate.  The reason is that the BB 71 (containing conditional jump) is reached via 2 BBs containing memory load.  I guess it is result of crossjumping.

I will send patch fixing the conditional, but it makes little difference.

Comment 9 Steven Bosscher 2009-06-08 22:43:39 UTC
There shouldn't be a "!".  Just make use optimize_function_for_speed_p(cfun).
Comment 10 Steven Bosscher 2009-06-08 22:45:48 UTC
Honza, what do the basic blocks 2 and 71 look like for you, exactly?  I see no memory load.  But I have local crossjumping patches -- as you know ;-) -- so I am probably not looking at the same dumps as you are.
Comment 11 Jan Hubicka 2009-06-09 15:18:43 UTC
Hmm, it is not exactly load.  In first case I get:

(code_label 12524 16482 12523 70 1249 "" [4 uses])

(note 12523 12524 149 70 [bb 70] NOTE_INSN_BASIC_BLOCK)

(insn:TI 149 12523 13690 70 ../src/Include/ceval-vm.i:47 (set (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                (const_int -64 [0xffffffffffffffc0])) [72 %sfp+-40 S4 A32])
        (const_int 1 [0x1])) 47 {*movsi_1} (expr_list:REG_EQUAL (const_int 1 [0x1])

(insn 13690 149 1351 70 (set (reg/v:SI 0 ax [orig:155 why ] [155])
        (const_int 1 [0x1])) 47 {*movsi_1} (nil))

(code_label 1351 13690 1352 71 382 "" [0 uses])

(note 1352 1351 1353 71 [bb 71] NOTE_INSN_BASIC_BLOCK)

(jump_insn:TI 1353 1352 1354 71 ../src/Python/ceval.c:1000 (set (pc)
        (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                (const_int -60 [0xffffffffffffffc4])) [72 %sfp+-36 S4 A32])) 640 {*indirect_jump} (nil))

(barrier 1354 1353 1477)

So there are 4 edges reaching WHY set.  In the second case it is move of WHY to 1:

(code_label 1363 1365 1349 150 384 "" [127 uses])

(note 1349 1363 1350 150 [bb 150] NOTE_INSN_BASIC_BLOCK)

(insn:TI 1350 1349 19980 150 ../src/Python/ceval.c:1000 (set (reg/v:SI 0 ax [orig:155 why ] [155])
        (const_int 1 [0x1])) 47 {*movsi_1} (nil))

(note 19980 1350 19979 151 [bb 151] NOTE_INSN_BASIC_BLOCK)

(jump_insn 19979 19980 19982 151 ../src/Python/ceval.c:1000 (set (pc)
        (mem/c:SI (plus:SI (reg/f:SI 6 bp)
                (const_int -60 [0xffffffffffffffc4])) [72 %sfp+-36 S4 A32])) 640 {*indirect_jump} (nil))

(barrier 19982 19979 1373)

that prevents duplicating.  Probably ordirnary bb-reorder should be convincable to handle this well?
This don't seem to happen at 64bit compilation.

I also posted the patch fixing optimize_for_size check
Comment 12 Timo Kreuzer 2014-03-26 21:14:18 UTC
Any updates on this (after 5 years)? Or any workarounds?

I tried to do some optimization to my x86 instruction parser code by using computed gotos. This post (https://blogs.oracle.com/nike/entry/fast_interpreter_using_gcc_s) suggests that it should give good speed improvements and the shown compilation results looked promising.
But the results with newer GCC versions were so bad, I went back to a simple switch()

The 2 major speed improvements (less instructions / improved branch prediction) are completely killed by this bug.
Another thing, that I find suspicious (maybe this is some crazy optimization I just don't understand) is that GCC doesn't generate a single JMP with a memory operand anymore, but first loads the memory into a register and then does a register based JMP, even when the load operation is exactly the same in all cases.

From the comments it looks like there is already a working patch. Why is it not committed? I'd really appreaciate if you could fix this asap.

PS: It would be great if you could make this work for switch statements in a loop as well! Normally people don't hassle with computed gotos, they use a switch. If this is in a loop and cases go back directly to the switch statement, the additional jump should be eliminated, possibly duplicating n instructions from the top of the loop before the switch.
Comment 13 Kai Tietz 2014-06-23 21:53:02 UTC
Author: ktietz
Date: Mon Jun 23 21:52:31 2014
New Revision: 211919

URL: https://gcc.gnu.org/viewcvs?rev=211919&root=gcc&view=rev
        PR target/39284
        * passes.def (peephole2): Move peephole2 pass before
        before sched2 pass.
        * config/i386/i386.md (peehole2): Combine memories
        and indirect jumps.

Comment 14 Kai Tietz 2014-06-23 21:55:35 UTC
I think we can close that bug for now.
Patch won't be backported.
Comment 15 Richard Henderson 2014-06-30 20:15:16 UTC
Author: rth
Date: Mon Jun 30 20:14:42 2014
New Revision: 212172

URL: https://gcc.gnu.org/viewcvs?rev=212172&root=gcc&view=rev
PR rtl-opt/61608

        PR target/39284
        * bb-reorder.c (pass_duplicate_computed_gotos::execute): Cleanup
        the cfg if there were any changes.
        * passes.def: Revert move of peephole2 after reorder_blocks;
        move duplicate_computed_gotos before peephole2.