42494 – [4.4 Regression] Missed dead-code-elimination

Bug 42494 - [4.4 Regression] Missed dead-code-elimination

Summary: [4.4 Regression] Missed dead-code-elimination

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	4.4.0

Importance:	P3 normal
Target Milestone:	4.5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2009-12-25 05:42 UTC by Shih-wei Liao
Modified:	2012-03-13 13:04 UTC (History)
CC List:	10 users (show)

See Also:
Host:	i686-linux
Target:	arm-eabi
Build:	i686-linux
Known to work:	4.2.1, 4.5.0
Known to fail:	4.4.3
Last reconfirmed:	2010-01-07 09:49:13

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Shih-wei Liao 2009-12-25 05:42:27 UTC

For the following code:

extern void func();
extern char outbuf[];
extern int outcnt;
extern int bool_var;
void test ()
{
 char flags;
 flags = 0;
 outcnt = 0;
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (bool_var) flags = 2;
 outbuf[outcnt] = flags;
 if (outcnt == 1) func ();
}

I found that GCC 4.4.0 generates the following code:

       .code   16
       .file   "outcnt.c"
       .text
       .align  2
       .global test
       .code   16
       .thumb_func
       .type   test, %function
test:
       push    {r4, lr}
       ldr     r3, .L7
       mov     r2, #0
       str     r2, [r3]
       ldr     r3, .L7+4
       mov     r2, #2
       ldr     r3, [r3]
       cmp     r3, #0
       bne     .L3
       mov     r2, #0
.L3:
       ldr     r3, .L7
       ldr     r1, .L7+8
       ldr     r3, [r3]
       strb    r2, [r1, r3]
       cmp     r3, #1
       bne     .L5
       bl      func
.L5:
       @ sp needed for prologue
       pop     {r4, pc}
.L8:
       .align  2
.L7:
       .word   outcnt
       .word   bool_var
       .word   outbuf
       .size   test, .-test
       .ident  "GCC: (GNU) 4.4.0"

, while GCC 4.2.1 generates the following code:

       .code   16
       .file   "outcnt.c"
       .text
       .align  2
       .global test
       .code 16
       .thumb_func
       .type   test, %function
test:
       push    {lr}
       ldr     r2, .L6
       mov     r3, #0
       str     r3, [r2]
       ldr     r3, .L6+4
       ldr     r3, [r3]
       cmp     r3, #0
       beq     .L2
       mov     r2, #2
       b       .L4
.L2:
       mov     r2, #0
.L4:
       ldr     r3, .L6+8
       @ sp needed for prologue
       strb    r2, [r3]
       pop     {pc}
.L7:
       .align  2
.L6:
       .word   outcnt
       .word   bool_var
       .word   outbuf
       .size   test, .-test
       .ident  "GCC: (GNU) 4.2.1"

The code snippet has a lot of dead code that is not completely eliminated by gcc 4.4.0 (but correctly eliminated by gcc 4.2.1).
Because outcnt == 0, all lines 'if (outcnt == 1)' can be eliminated. Because outbuf and outcnt are global external symbols of different types, they can not be aliased, so the last statement can be also eliminated.
gcc 4.2.1 output is 40 bytes, gcc 4.4.0 output is 52 bytes. It also inserts 'bl func'. 

This code snippet was extracted from gzip benchmark, but got changed quite a bit.

Comment 1 Ramana Radhakrishnan 2009-12-29 10:35:21 UTC

I can confirm this as a problem on the 4.4 branch and I can see this isn't a problem on trunk .



        .arch armv5te
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 18, 4
        .code   16
        .file   "t.c"
        .text
        .align  1
        .global test
        .code   16
        .thumb_func
        .type   test, %function
test:
        ldr     r3, .L4
        mov     r2, #0
        str     r2, [r3]
        ldr     r3, .L4+4
        @ sp needed for prologue
        ldr     r2, [r3]
        sub     r3, r2, #1
        sbc     r2, r2, r3
        ldr     r3, .L4+8
        lsl     r2, r2, #1
        strb    r2, [r3]
        bx      lr
.L5:
        .align  2
.L4:
        .word   outcnt
        .word   bool_var
        .word   outbuf
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.0 20091223 (experimental)"

Comment 2 Steven Bosscher 2009-12-29 11:31:09 UTC

The "if (outcnt == 1) func ();" bit is optimized for me with gcc-4.4.2 on x86_64 at -O1 and -O2, but not at -Os. I was a bit too hasty to call this alias related, it seems. The O2 and Os tree dumps start to diverge in the dom1 dump, where we have just one call to func after dom1 at -O2 and still two calls at -O2.

Comment 3 Ramana Radhakrishnan 2009-12-29 12:44:56 UTC

Steven,

(In reply to comment #2)
> The "if (outcnt == 1) func ();" bit is optimized for me with gcc-4.4.2 on
> x86_64 at -O1 and -O2, but not at -Os. I was a bit too hasty to call this alias
> related, it seems.

Yes it seems to go away at O1 and O2. I didn't see any comment from you earlier so not sure what you are referring to here. 

> The O2 and Os tree dumps start to diverge in the dom1 dump,
> where we have just one call to func after dom1 at -O2 and still two calls at
> -O2.
> 

FWIW I saw this removed by dom2 in trunk at -Os. In any case I think the component ought to be tree-optimization and changed.

Cheers
Ramana

Comment 4 Richard Biener 2009-12-30 23:34:07 UTC

The issue is that we do not do predicated value-numbering so we need multiple
invocations of it.  Which all of FRE, DOM and PRE do - but PRE is not run at
-Os in 4.4.

And yes, the optimization done by 4.5 is correct.

You could say "fixed" in 4.5 - or wait for someone to implement predicated
value-numbering to also fix a testcase with one more if (outcnt == 1) func ();

Comment 5 Richard Biener 2009-12-31 15:36:20 UTC

Btw, in 4.4 there is one DOM pass removed for compile-time.

We won't fix this for 4.4, the particular testcase is fixed in 4.5.

Comment 6 Shih-wei Liao 2010-01-07 09:18:51 UTC

For the trunk snapshot of 20100102, GCC 4.5.0 indeed removes most of the redundancy. However, -O1 and -Os still produce an extra instruction, while -O2 doesn't. Do we care about an extra instruction below? Thanks.

The instruction is marked as "redundant" below.

GCC trunk's -Os:
Disassembly of section .text:
00000000 <test>:
   0:	e59f3024 	ldr	r3, [pc, #36]	; 2c <test+0x2c>
   4:	e3a02000 	mov	r2, #0
   8:	e5832000 	str	r2, [r3]
   c:	e59f301c 	ldr	r3, [pc, #28]	; 30 <test+0x30>
  10:	e5932000 	ldr	r2, [r3]
  14:	e59f3018 	ldr	r3, [pc, #24]	; 34 <test+0x34>
  18:	e3520000 	cmp	r2, #0
  1c:	13a02002 	movne	r2, #2 
  20:	03a02000 	moveq	r2, #0 ;redundant
  24:	e5c32000 	strb	r2, [r3]
  28:	e12fff1e 	bx	lr
	...

GCC trunk's -O1:
Disassembly of section .text:
00000000 <test>:
   0:	e3a02000 	mov	r2, #0
   4:	e59f3020 	ldr	r3, [pc, #32]	; 2c <test+0x2c>
   8:	e5832000 	str	r2, [r3]
   c:	e59f301c 	ldr	r3, [pc, #28]	; 30 <test+0x30>
  10:	e5932000 	ldr	r2, [r3]
  14:	e3520000 	cmp	r2, #0
  18:	13a02002 	movne	r2, #2
  1c:	03a02000 	moveq	r2, #0	;redundant
  20:	e59f300c 	ldr	r3, [pc, #12]	; 34 <test+0x34>
  24:	e5c32000 	strb	r2, [r3]
  28:	e12fff1e 	bx	lr
	...

Below, GCC trunk's -O2 doesn't produce the redundant instruction.
-O2:
Disassembly of section .text:
00000000 <test>:
   0:	e59f3020 	ldr	r3, [pc, #32]	; 28 <test+0x28>
   4:	e59f2020 	ldr	r2, [pc, #32]	; 2c <test+0x2c>
   8:	e5933000 	ldr	r3, [r3]
   c:	e3a01000 	mov	r1, #0
  10:	e3530000 	cmp	r3, #0
  14:	e5821000 	str	r1, [r2]
  18:	e59f2010 	ldr	r2, [pc, #16]	; 30 <test+0x30>
  1c:	13a03002 	movne	r3, #2
  20:	e5c23000 	strb	r3, [r2]
  24:	e12fff1e 	bx	lr
	...

Comment 7 Steven Bosscher 2010-01-07 09:49:13 UTC

*sigh* With all the Google power, why can't anyone there try to debug this?

I'll have a look.

Comment 8 Steven Bosscher 2010-01-07 21:29:58 UTC

The diff between -O2 and -Os starts in combine:

diff -ur O2/t.c.175r.combine Os/t.c.175r.combine
--- O2/t.c.175r.combine 2010-01-07 22:24:04.000000000 +0100
+++ Os/t.c.175r.combine 2010-01-07 22:23:14.000000000 +0100
@@ -4,18 +4,20 @@
 starting the processing of deferred insns
 ending the processing of deferred insns
 df_analyze called
-insn_cost 7: 12
+insn_cost 7: 8
 insn_cost 8: 4
 insn_cost 9: 4
-insn_cost 10: 12
-insn_cost 11: 12
+insn_cost 10: 8
+insn_cost 11: 4
 insn_cost 12: 4
-insn_cost 46: 4
-insn_cost 17: 12
+insn_cost 46: 16
+insn_cost 17: 8
 insn_cost 19: 4
-rejecting combination of insns 12 and 46
-original costs 4 + 4 = 8
-replacement cost 12
+deferring deletion of insn with uid = 12.
+modifying insn i3    46 {r133:SI={(r138:SI!=0x0)?0x2:0x0};clobber cc:CC;}
+      REG_UNUSED: cc:CC
+      REG_DEAD: r138:SI
+deferring rescan insn with uid = 46.
 (note# 0 # 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

 (note# # # 2 NOTE_INSN_FUNCTION_BEG)
@@ -40,16 +42,17 @@
         (expr_list:REG_EQUAL (mem/c/i:SI (symbol_ref:SI ("bool_var") [flags 0xc0]  <var_decl # bool_var>) [2 bool_var+0 S4 A32])
             (nil))))

-(insn# # # 2 t.c:13 (set (reg:CC 24 cc)
-        (compare:CC (reg:SI 138 [ bool_var ])
-            (const_int 0 [0x0])))# {*arm_cmpsi_insn} (nil))
-
-(insn# # # 2 t.c:8 (set (reg/v:SI 133 [ flags ])
-        (if_then_else:SI (eq (reg:CC 24 cc)
-                (const_int 0 [0x0]))
-            (reg:SI 138 [ bool_var ])
-            (const_int 2 [0x2])))# {*movsicc_insn} (expr_list:REG_DEAD (reg:SI 138 [ bool_var ])
-        (expr_list:REG_DEAD (reg:CC 24 cc)
+(note# # # 2 NOTE_INSN_DELETED)
+
+(insn# # # 2 t.c:8 (parallel [
+            (set (reg/v:SI 133 [ flags ])
+                (if_then_else:SI (ne (reg:SI 138 [ bool_var ])
+                        (const_int 0 [0x0]))
+                    (const_int 2 [0x2])
+                    (const_int 0 [0x0])))
+            (clobber (reg:CC 24 cc))
+        ])# {movcond} (expr_list:REG_UNUSED (reg:CC 24 cc)
+        (expr_list:REG_DEAD (reg:SI 138 [ bool_var ])
             (nil))))

 (insn# # # 2 t.c:14 (set (reg/f:SI 139)
@@ -60,7 +63,10 @@
         (expr_list:REG_DEAD (reg/v:SI 133 [ flags ])
             (nil))))
 starting the processing of deferred insns
+deleting insn with uid = 12.
+rescanning insn with uid = 46.
+deleting insn with uid = 46.
 ending the processing of deferred insns

-;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),
-;; 0 successes.
+;; Combiner totals: 13 attempts, 13 substitutions (2 requiring new space),
+;; 1 successes.

Comment 9 Steven Bosscher 2010-01-07 21:36:43 UTC

At -Os, in the .188r.postreload RTL dump:

(insn 46 12 17 2 t.c:8 (parallel [
            (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
                (if_then_else:SI (ne (reg:SI 2 r2 [orig:138 bool_var ] [138])
                        (const_int 0 [0x0]))
                    (const_int 2 [0x2])
                    (const_int 0 [0x0])))
            (clobber (reg:CC 24 cc))
        ]) 291 {movcond} (nil))



And one pass later, in the .190r.split2 RTL dump:

(insn 49 12 50 2 t.c:8 (set (reg:CC 24 cc)
        (compare:CC (reg:SI 2 r2 [orig:138 bool_var ] [138])
            (const_int 0 [0x0]))) 220 {*arm_cmpsi_insn} (nil))
                    
(insn 50 49 51 2 t.c:8 (cond_exec (ne (reg:CC 24 cc)
            (const_int 0 [0x0]))
        (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
            (const_int 2 [0x2]))) 2367 {neon_vornv2di+77} (nil))

(insn 51 50 17 2 t.c:8 (cond_exec (eq (reg:CC 24 cc)
            (const_int 0 [0x0]))
        (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
            (const_int 0 [0x0]))) 2367 {neon_vornv2di+77} (nil))


There's your redundant mov, from an insn splitter.

Comment 10 Steven Bosscher 2010-01-07 22:08:07 UTC

An ARM maintainer can look at a solution for this (special splitters, maybe a peephole2, perhaps a post-ce3 special DCE pass to clean up cond_exec silliness.
(See also bug 21803, for similar cond_exec idiocy on ia64.)

Comment 11 Steven Bosscher 2010-02-08 16:45:19 UTC

Trunk today (r156595) optimizes this at -O1, -Os, and -O2 in the tree optimizers. The .fre pass removes the first func call, then .dom1 removes the next two. The .dom2 pass removes the remaining one.

If I add another, say, 10 lines of "if (outcnt == 1) func ();", then dom1 removes them all (even with -fno-tree-fre).

So something changed in (or for) DOM that allows it to clean up this mess completely. I would like to know what... Perhaps the OP can try to bisect to identify what was changed here (in the hope that it is something that can be back-ported to GCC 4.4)?

Comment 12 Jing Yu 2011-04-16 10:36:27 UTC

I am on leave from 02/01/2011 to 05/30/2011. I may not reply your
email during this period.

If you have Android toolchain questions/issues/requests, please
contact Doug (dougkwan@google.com) or my manager Bhaskar
(bjanakiraman@google.com).

Thanks,
Jing

Comment 13 Arnaud Lacombe 2011-05-13 19:26:40 UTC

It would seem that GCC 4.5.1 is still generating a call to func(), at -Os. Considering the following reduced test-case:

extern int outcnt;
void test ()
{
 outcnt = 0;
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
 if (outcnt == 1) func ();
}

At -O2:

/usr/libexec/gcc/x86_64-redhat-linux/4.5.1/cc1 -O2 -o - test.c
        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <cp> <inline> <static-var> <pure-const>Assembling functions:
 test   .text
        .p2align 4,,15
.globl test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.1 20100924 (Red Hat 4.5.1-4)"
        .section        .note.GNU-stack,"",@progbits


Now, at -Os:

/usr/libexec/gcc/x86_64-redhat-linux/4.5.1/cc1 -Os -o - test.c
        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <cp> <inline> <static-var> <pure-const>Assembling functions:
 test   .text
.globl test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        cmpl    $1, outcnt(%rip)
        jne     .L1
        xorl    %eax, %eax
        jmp     func
.L1:
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.1 20100924 (Red Hat 4.5.1-4)"
        .section        .note.GNU-stack,"",@progbits

Comment 14 Jing Yu 2011-05-13 19:34:35 UTC

I am on leave from 02/01/2011 to 05/30/2011. I may not reply your
email during this period.

If you have Android toolchain questions/issues/requests, please
contact Doug (dougkwan@google.com) or my manager Bhaskar
(bjanakiraman@google.com).

Thanks,
Jing

Comment 15 Arnaud Lacombe 2011-05-13 20:14:08 UTC

trunk from early May also fails with one more "if (outcnt == 1) func ();" line at the end of the function:

% ./gcc/cc1 -Os -o - test.c

        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <ipa-profile> <cp> <inline> <pure-const> <static-var>Assembling functions:
 test   .text
        .globl  test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        cmpl    $0, outcnt(%rip)
        je      .L1
        xorl    %eax, %eax
        jmp     func
.L1:
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.7.0 20110416 (experimental)"
        .section        .note.GNU-stack,"",@progbits

Target and host are x86-64-linux. -O2 and above are fine.

Comment 16 Richard Biener 2011-05-18 10:48:08 UTC

I don't see any calls to func() for the original testcase that survive at
-O2 or -Os for 4.5, 4.6 and current trunk.

Comment 17 Jing Yu 2011-05-18 11:06:35 UTC

I am on leave from 02/01/2011 to 05/30/2011. I may not reply your
email during this period.

If you have Android toolchain questions/issues/requests, please
contact Doug (dougkwan@google.com) or my manager Bhaskar
(bjanakiraman@google.com).

Thanks,
Jing

Comment 18 Arnaud Lacombe 2011-05-18 15:17:48 UTC

Yes, the original test case is fine now, but not the updated snippet I posted. Maybe should I open a new bug ?

Comment 19 Jakub Jelinek 2012-03-13 13:04:09 UTC

Fixed in 4.5+, 4.4 is no longer supported.