I triggered this is the inner loop of the CPU emulation code of openMSX (http://openmsx.sf.net/). I tried to reduce the code. Below is the smallest code I could come with up that still shows the problem: ------------------------------------------- struct Clock { void f(); void add(unsigned n) { a += n; } int a; }; struct CPU : Clock { virtual ~CPU(); unsigned char readSlow(); void execute(); void delay() { add(2); } unsigned char readFast() { if (unsigned char* p = ptrs[addr >> 8]) { // fast-path delay(); // ### 1 delay(); // ### 2 return p[addr & 255]; } else { // slow-path return readSlow(); } } typedef void (CPU::*FuncPtr)(); static FuncPtr tab[256]; unsigned char* ptrs[256]; unsigned addr; }; void CPU::execute() { f(); while (true) { unsigned char b = readFast(); delay(); // # 3 (this->*tab[b])(); } } ---------------------------------------- When compiled with SVN revision 128037 on a linux x86_64 machine: > g++ -O3 -S CPU.ii > cat -n CPU.s 1 .file "CPU.ii" 2 .text 3 .align 2 4 .p2align 4,,15 5 .globl _ZN3CPU7executeEv 6 .type _ZN3CPU7executeEv, @function 7 _ZN3CPU7executeEv: 8 .LFB5: 9 pushq %rbp 10 .LCFI0: 11 leaq 8(%rdi), %rbp 12 pushq %rbx 13 .LCFI1: 14 movq %rdi, %rbx 15 movq %rbp, %rdi 16 subq $8, %rsp 17 .LCFI2: 18 call _ZN5Clock1fEv 19 .p2align 4,,10 20 .p2align 3 21 .L6: 22 movl 2064(%rbx), %eax 23 shrl $8, %eax 24 mov %eax, %eax 25 movq 16(%rbx,%rax,8), %rdx 26 testq %rdx, %rdx 27 je .L2 28 movl 8(%rbx), %eax ### 29 addl $2, %eax ### 1 30 movl %eax, (%rbp) ### 31 movl 8(%rbx), %eax ### 32 addl $2, %eax ### 2 33 movl %eax, (%rbp) ### 34 movzbl 2064(%rbx), %eax 35 movzbl (%rdx,%rax), %edx 36 .L3: 37 movl 8(%rbx), %eax # 38 addl $2, %eax # 3 39 movl %eax, (%rbp) # 40 movzbl %dl, %eax 41 salq $4, %rax 42 movq _ZN3CPU3tabE(%rax), %rdx 43 testb $1, %dl 44 jne .L4 45 movq %rbx, %rdi 46 addq _ZN3CPU3tabE+8(%rax), %rdi 47 call *%rdx 48 jmp .L6 49 .p2align 4,,10 50 .p2align 3 51 .L4: 52 movq %rbx, %rdi 53 addq _ZN3CPU3tabE+8(%rax), %rdi 54 movq (%rdi), %rax 55 movq -1(%rdx,%rax), %rdx 56 call *%rdx 57 jmp .L6 58 .L2: 59 movq %rbx, %rdi 60 call _ZN3CPU8readSlowEv 61 movl %eax, %edx 62 .p2align 4,,4 63 .p2align 3 64 jmp .L3 [skipped the rest of the output] The missed optimization is visible in lines 28-33. It's also strange to me why reading the variable is done via 8(%rbx) while writing is done via (%rbp). gcc-4.2.1 does a better job on this, it optimizes the two consecutive delay() functions to just: addl $4, 8(%rbx) Additionally I would have prefered that all three delay() functions would be collapsed into a single instruction in the fast code path (and partly duplicated as a+=4; readSlow(); a+=2; in the slow path). But I understand this might be more difficult to implement.
# VUSE <tab_56, SMT.9_58, SMT.10_60> D.2581_35 = this_2(D)->D.2503.a; D.2582_36 = (unsigned int) D.2581_35; D.2583_37 = D.2582_36 + 2; D.2584_38 = (int) D.2583_37; # tab_76 = VDEF <tab_56> # SMT.9_77 = VDEF <SMT.9_58> D.2529_3->a = D.2584_38; # VUSE <tab_76, SMT.9_77, SMT.10_60> D.2586_40 = this_2(D)->D.2503.a; D.2587_41 = (unsigned int) D.2586_40; D.2588_42 = D.2587_41 + 2; D.2589_43 = (int) D.2588_42; # tab_78 = VDEF <tab_76> # SMT.9_79 = VDEF <SMT.9_77> D.2529_3->a = D.2589_43; hmmm, aliasing
The problem is that forwprop doesn't propagate addr_exprs to memory reference stmts in early optimization anymore (due to the volatile issues) and value numbering cannot deal with the different (but same) load/store addresses: D.2605_34 = &this_2(D)->D.2527; # VUSE <tab_57, SMT.9_59, SMT.11_63> D.2606_35 = this_2(D)->D.2527.a; ... # tab_77 = VDEF <tab_57> # SMT.11_78 = VDEF <SMT.11_63> D.2605_34->a = D.2609_38; (also aliasing computes different answers here, for whatever reason). With scheduling an extra forwprop pass before FRE the second _load_ of a is eliminated, but DSE still cannot figure the dead store: # VUSE <tab_57, SMT.9_59, SMT.11_63> D.2606_35 = this_2(D)->D.2527.a; D.2607_36 = (unsigned int) D.2606_35; D.2608_37 = D.2607_36 + 2; D.2609_38 = (int) D.2608_37; # tab_77 = VDEF <tab_57> # SMT.9_93 = VDEF <SMT.9_59> # SMT.11_78 = VDEF <SMT.11_63> this_2(D)->D.2527.a = D.2609_38; D.2612_41 = (unsigned int) D.2609_38; D.2613_42 = D.2612_41 + 2; D.2614_43 = (int) D.2613_42; # tab_79 = VDEF <tab_77> # SMT.9_94 = VDEF <SMT.9_93> # SMT.11_80 = VDEF <SMT.11_78> this_2(D)->D.2527.a = D.2614_43;
That is, rtl level DSE removes the dead store: _ZN3CPU7executeEv: .LFB5: pushq %rbx .LCFI0: movq %rdi, %rbx leaq 8(%rdi), %rdi call _ZN5Clock1fEv .p2align 4,,10 .p2align 3 .L6: movl 2064(%rbx), %eax shrl $8, %eax mov %eax, %eax movq 16(%rbx,%rax,8), %rdx testq %rdx, %rdx je .L2 movzbl 2064(%rbx), %eax addl $4, 8(%rbx) movzbl (%rdx,%rax), %eax .L3: movzbl %al, %eax addl $2, 8(%rbx) salq $4, %rax movq _ZN3CPU3tabE(%rax), %rdx testb $1, %dl jne .L4 movq %rbx, %rdi addq _ZN3CPU3tabE+8(%rax), %rdi call *%rdx jmp .L6 .p2align 4,,10 .p2align 3 .L4: movq %rbx, %rdi addq _ZN3CPU3tabE+8(%rax), %rdi movq (%rdi), %rax movq -1(%rdx,%rax), %rdx call *%rdx jmp .L6 .L2: movq %rbx, %rdi call _ZN3CPU8readSlowEv .p2align 4,,6 .p2align 3 jmp .L3 .LFE5:
I have a patch that makes it work apart from the tree level DSE issue.
Subject: Bug 33291 Author: rguenth Date: Tue Sep 4 08:38:56 2007 New Revision: 128068 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128068 Log: 2007-09-04 Richard Guenther <rguenther@suse.de> PR tree-optimization/33291 * tree-pretty-print.c (dump_generic_node): Dump all qualifiers for pointer types, not only first. Dump qualifiers for aggregate types as well. * tree-ssa-ccp.c (maybe_fold_offset_to_array_ref): Always use the canonical type for building ARRAY_REFs. * gimplify.c (canonicalize_addr_expr): Clean up. The correct validness check is compatibility of the pointer types. Always use the canonical type for building ARRAY_REFs and ADDR_EXPRs. * tree-ssa-forwprop.c (forward_propagate_addr_expr): Revert change that disabled propagation of ADDR_EXPRs into statements with volatile ops. * gcc.dg/volatile2.c: New testcase. * gcc.dg/pr32721.c: Adjust volatile reference pattern. * gcc.dg/tree-ssa/forwprop-1.c: Remove xfail. * gcc.dg/tree-ssa/forwprop-2.c: Likewise. * gcc.dg/tree-ssa/pr17141-1.c: Likewise. Added: trunk/gcc/testsuite/gcc.dg/volatile2.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimplify.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/pr32721.c trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-1.c trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-2.c trunk/gcc/testsuite/gcc.dg/tree-ssa/pr17141-1.c trunk/gcc/tree-pretty-print.c trunk/gcc/tree-ssa-ccp.c trunk/gcc/tree-ssa-forwprop.c
Fixed.
Thanks for looking into this so quickly! I confirm the problem is solved for the reduced testcase. However in my original code the dead-store is not eliminated. Do you want me to file a separate bug report for that? .... mov (%rbx),%edx movzbl %cl,%edi lea 0x3(%rdx),%r8d add $0x5,%edx mov %r8d,(%rbx) movzbl (%rsi,%rdi,1),%eax mov %edx,(%rbx) ....
Yes please.