Bug 33291

Summary: [4.3 Regression] a+=2; a+=2 not simplified to a+=4; with -O3 (ok with gcc-4.2.1)
Product: gcc Reporter: Wouter Vermaelen <vermaelen.wouter>
Component: tree-optimizationAssignee: Richard Biener <rguenth>
Status: RESOLVED FIXED    
Severity: minor CC: gcc-bugs, rguenth
Priority: P3 Keywords: alias, missed-optimization, TREE
Version: 4.3.0   
Target Milestone: 4.3.0   
Host: Target:
Build: Known to work: 4.2.1
Known to fail: 4.3.0 Last reconfirmed: 2007-09-03 14:36:17

Description Wouter Vermaelen 2007-09-03 11:23:52 UTC
I triggered this is the inner loop of the CPU emulation code of openMSX (http://openmsx.sf.net/). I tried to reduce the code. Below is the smallest code I could come with up that still shows the problem:

-------------------------------------------
struct Clock {
 void f();
 void add(unsigned n) { a += n; }
 int a;
};

struct CPU : Clock {
 virtual ~CPU();
 unsigned char readSlow();
 void execute();

 void delay() { add(2); }
 unsigned char readFast() {
  if (unsigned char* p = ptrs[addr >> 8]) {
   // fast-path
   delay();     // ### 1
   delay();     // ### 2
   return p[addr & 255];
  } else {
   // slow-path
   return readSlow();
  }
 }

 typedef void (CPU::*FuncPtr)();
 static FuncPtr tab[256];
 unsigned char* ptrs[256];
 unsigned addr;
};

void CPU::execute() {
 f();
 while (true) {
  unsigned char b = readFast();
  delay();       // # 3
  (this->*tab[b])();
 }
}
----------------------------------------

When compiled with SVN revision 128037 on a linux x86_64 machine:

> g++ -O3 -S CPU.ii
> cat -n CPU.s
     1          .file   "CPU.ii"
     2          .text
     3          .align 2
     4          .p2align 4,,15
     5  .globl _ZN3CPU7executeEv
     6          .type   _ZN3CPU7executeEv, @function
     7  _ZN3CPU7executeEv:
     8  .LFB5:
     9          pushq   %rbp
    10  .LCFI0:
    11          leaq    8(%rdi), %rbp
    12          pushq   %rbx
    13  .LCFI1:
    14          movq    %rdi, %rbx
    15          movq    %rbp, %rdi
    16          subq    $8, %rsp
    17  .LCFI2:
    18          call    _ZN5Clock1fEv
    19          .p2align 4,,10
    20          .p2align 3
    21  .L6:
    22          movl    2064(%rbx), %eax
    23          shrl    $8, %eax
    24          mov     %eax, %eax
    25          movq    16(%rbx,%rax,8), %rdx
    26          testq   %rdx, %rdx
    27          je      .L2
    28          movl    8(%rbx), %eax            ###
    29          addl    $2, %eax                 ### 1
    30          movl    %eax, (%rbp)             ###
    31          movl    8(%rbx), %eax            ###
    32          addl    $2, %eax                 ### 2
    33          movl    %eax, (%rbp)             ###
    34          movzbl  2064(%rbx), %eax
    35          movzbl  (%rdx,%rax), %edx
    36  .L3:
    37          movl    8(%rbx), %eax            #
    38          addl    $2, %eax                 # 3
    39          movl    %eax, (%rbp)             #
    40          movzbl  %dl, %eax
    41          salq    $4, %rax
    42          movq    _ZN3CPU3tabE(%rax), %rdx
    43          testb   $1, %dl
    44          jne     .L4
    45          movq    %rbx, %rdi
    46          addq    _ZN3CPU3tabE+8(%rax), %rdi
    47          call    *%rdx
    48          jmp     .L6
    49          .p2align 4,,10
    50          .p2align 3
    51  .L4:
    52          movq    %rbx, %rdi
    53          addq    _ZN3CPU3tabE+8(%rax), %rdi
    54          movq    (%rdi), %rax
    55          movq    -1(%rdx,%rax), %rdx
    56          call    *%rdx
    57          jmp     .L6
    58  .L2:
    59          movq    %rbx, %rdi
    60          call    _ZN3CPU8readSlowEv
    61          movl    %eax, %edx
    62          .p2align 4,,4
    63          .p2align 3
    64          jmp     .L3
    [skipped the rest of the output]

The missed optimization is visible in lines 28-33. It's also strange to me why reading the variable is done via 8(%rbx) while writing is done via (%rbp).

gcc-4.2.1 does a better job on this, it optimizes the two consecutive delay() functions to just:   addl $4, 8(%rbx)




Additionally I would have prefered that all three delay() functions would be collapsed into a single instruction in the fast code path (and partly duplicated as   a+=4; readSlow(); a+=2;   in the slow path). But I understand this might be more difficult to implement.
Comment 1 Andrew Pinski 2007-09-03 11:28:53 UTC
  # VUSE <tab_56, SMT.9_58, SMT.10_60>
  D.2581_35 = this_2(D)->D.2503.a;
  D.2582_36 = (unsigned int) D.2581_35;
  D.2583_37 = D.2582_36 + 2;
  D.2584_38 = (int) D.2583_37;
  # tab_76 = VDEF <tab_56>
  # SMT.9_77 = VDEF <SMT.9_58>
  D.2529_3->a = D.2584_38;
  # VUSE <tab_76, SMT.9_77, SMT.10_60>
  D.2586_40 = this_2(D)->D.2503.a;
  D.2587_41 = (unsigned int) D.2586_40;
  D.2588_42 = D.2587_41 + 2;
  D.2589_43 = (int) D.2588_42;
  # tab_78 = VDEF <tab_76>
  # SMT.9_79 = VDEF <SMT.9_77>
  D.2529_3->a = D.2589_43;


hmmm, aliasing
Comment 2 Richard Biener 2007-09-03 12:01:46 UTC
The problem is that forwprop doesn't propagate addr_exprs to memory reference
stmts in early optimization anymore (due to the volatile issues) and 
value numbering cannot deal with the different (but same) load/store addresses:

  D.2605_34 = &this_2(D)->D.2527;
  # VUSE <tab_57, SMT.9_59, SMT.11_63>
  D.2606_35 = this_2(D)->D.2527.a;
...
  # tab_77 = VDEF <tab_57>
  # SMT.11_78 = VDEF <SMT.11_63>
  D.2605_34->a = D.2609_38;

(also aliasing computes different answers here, for whatever reason).  With
scheduling an extra forwprop pass before FRE the second _load_ of a is
eliminated, but DSE still cannot figure the dead store:

  # VUSE <tab_57, SMT.9_59, SMT.11_63>
  D.2606_35 = this_2(D)->D.2527.a;
  D.2607_36 = (unsigned int) D.2606_35;
  D.2608_37 = D.2607_36 + 2;
  D.2609_38 = (int) D.2608_37;
  # tab_77 = VDEF <tab_57>
  # SMT.9_93 = VDEF <SMT.9_59>
  # SMT.11_78 = VDEF <SMT.11_63>
  this_2(D)->D.2527.a = D.2609_38;
  D.2612_41 = (unsigned int) D.2609_38;
  D.2613_42 = D.2612_41 + 2;
  D.2614_43 = (int) D.2613_42;
  # tab_79 = VDEF <tab_77>
  # SMT.9_94 = VDEF <SMT.9_93>
  # SMT.11_80 = VDEF <SMT.11_78>
  this_2(D)->D.2527.a = D.2614_43;
Comment 3 Richard Biener 2007-09-03 12:04:54 UTC
That is, rtl level DSE removes the dead store:

_ZN3CPU7executeEv:
.LFB5:
        pushq   %rbx
.LCFI0:
        movq    %rdi, %rbx
        leaq    8(%rdi), %rdi
        call    _ZN5Clock1fEv
        .p2align 4,,10
        .p2align 3
.L6:
        movl    2064(%rbx), %eax
        shrl    $8, %eax
        mov     %eax, %eax
        movq    16(%rbx,%rax,8), %rdx
        testq   %rdx, %rdx
        je      .L2
        movzbl  2064(%rbx), %eax
        addl    $4, 8(%rbx)
        movzbl  (%rdx,%rax), %eax
.L3:
        movzbl  %al, %eax
        addl    $2, 8(%rbx)
        salq    $4, %rax
        movq    _ZN3CPU3tabE(%rax), %rdx
        testb   $1, %dl
        jne     .L4
        movq    %rbx, %rdi
        addq    _ZN3CPU3tabE+8(%rax), %rdi
        call    *%rdx
        jmp     .L6
        .p2align 4,,10
        .p2align 3
.L4:
        movq    %rbx, %rdi
        addq    _ZN3CPU3tabE+8(%rax), %rdi
        movq    (%rdi), %rax
        movq    -1(%rdx,%rax), %rdx
        call    *%rdx
        jmp     .L6
.L2:
        movq    %rbx, %rdi
        call    _ZN3CPU8readSlowEv
        .p2align 4,,6
        .p2align 3
        jmp     .L3
.LFE5:
Comment 4 Richard Biener 2007-09-03 14:36:17 UTC
I have a patch that makes it work apart from the tree level DSE issue.
Comment 5 Richard Biener 2007-09-04 08:39:09 UTC
Subject: Bug 33291

Author: rguenth
Date: Tue Sep  4 08:38:56 2007
New Revision: 128068

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128068
Log:
2007-09-04  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/33291
	* tree-pretty-print.c (dump_generic_node): Dump all
	qualifiers for pointer types, not only first.  Dump
	qualifiers for aggregate types as well.
	* tree-ssa-ccp.c (maybe_fold_offset_to_array_ref): Always
	use the canonical type for building ARRAY_REFs.
	* gimplify.c (canonicalize_addr_expr): Clean up.  The
	correct validness check is compatibility of the pointer
	types.  Always use the canonical type for building
	ARRAY_REFs and ADDR_EXPRs.
	* tree-ssa-forwprop.c (forward_propagate_addr_expr): Revert
	change that disabled propagation of ADDR_EXPRs into statements
	with volatile ops.

	* gcc.dg/volatile2.c: New testcase.
	* gcc.dg/pr32721.c: Adjust volatile reference pattern.
	* gcc.dg/tree-ssa/forwprop-1.c: Remove xfail.
	* gcc.dg/tree-ssa/forwprop-2.c: Likewise.
	* gcc.dg/tree-ssa/pr17141-1.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/volatile2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimplify.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/pr32721.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr17141-1.c
    trunk/gcc/tree-pretty-print.c
    trunk/gcc/tree-ssa-ccp.c
    trunk/gcc/tree-ssa-forwprop.c

Comment 6 Richard Biener 2007-09-04 08:41:42 UTC
Fixed.
Comment 7 Wouter Vermaelen 2007-09-04 12:11:40 UTC
Thanks for looking into this so quickly!

I confirm the problem is solved for the reduced testcase. However in my original code the dead-store is not eliminated. Do you want me to file a separate bug report for that?

    ....
    mov    (%rbx),%edx
    movzbl %cl,%edi
    lea    0x3(%rdx),%r8d
    add    $0x5,%edx
    mov    %r8d,(%rbx)
    movzbl (%rsi,%rdi,1),%eax
    mov    %edx,(%rbx)
    ....
Comment 8 Richard Biener 2007-09-04 12:13:08 UTC
Yes please.