For the following code:

extern void func();
extern char outbuf[];
extern int outcnt;
extern int bool_var;

void test ()
{
  char flags;
  flags = 0;
  outcnt = 0;
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (bool_var) flags = 2;
  outbuf[outcnt] = flags;
  if (outcnt == 1) func ();
}

I found that GCC 4.4.0 generates the following code:

        .code   16
        .file   "outcnt.c"
        .text
        .align  2
        .global test
        .code   16
        .thumb_func
        .type   test, %function
test:
        push    {r4, lr}
        ldr     r3, .L7
        mov     r2, #0
        str     r2, [r3]
        ldr     r3, .L7+4
        mov     r2, #2
        ldr     r3, [r3]
        cmp     r3, #0
        bne     .L3
        mov     r2, #0
.L3:
        ldr     r3, .L7
        ldr     r1, .L7+8
        ldr     r3, [r3]
        strb    r2, [r1, r3]
        cmp     r3, #1
        bne     .L5
        bl      func
.L5:
        @ sp needed for prologue
        pop     {r4, pc}
.L8:
        .align  2
.L7:
        .word   outcnt
        .word   bool_var
        .word   outbuf
        .size   test, .-test
        .ident  "GCC: (GNU) 4.4.0"

while GCC 4.2.1 generates the following code:

        .code   16
        .file   "outcnt.c"
        .text
        .align  2
        .global test
        .code   16
        .thumb_func
        .type   test, %function
test:
        push    {lr}
        ldr     r2, .L6
        mov     r3, #0
        str     r3, [r2]
        ldr     r3, .L6+4
        ldr     r3, [r3]
        cmp     r3, #0
        beq     .L2
        mov     r2, #2
        b       .L4
.L2:
        mov     r2, #0
.L4:
        ldr     r3, .L6+8
        @ sp needed for prologue
        strb    r2, [r3]
        pop     {pc}
.L7:
        .align  2
.L6:
        .word   outcnt
        .word   bool_var
        .word   outbuf
        .size   test, .-test
        .ident  "GCC: (GNU) 4.2.1"

The code snippet has a lot of dead code that is not completely eliminated by gcc 4.4.0 (but is correctly eliminated by gcc 4.2.1). Because outcnt == 0, all of the 'if (outcnt == 1)' lines can be eliminated. Because outbuf and outcnt are global external symbols of different types, they cannot be aliased, so the last statement can also be eliminated.

The gcc 4.2.1 output is 40 bytes, the gcc 4.4.0 output is 52 bytes. The 4.4.0 output also contains a 'bl func'.

This code snippet was extracted from the gzip benchmark, but got changed quite a bit.
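To make the expected result concrete, here is a minimal sketch of what the function should reduce to once every 'if (outcnt == 1)' test is folded away. The definitions of outbuf, outcnt, bool_var and func, as well as func_calls and test_reduced, are stand-ins invented for this sketch so that it is self-contained; in the report they are extern. The sketch assumes outbuf has at least one element, so the store to outbuf[0] stays inside outbuf and cannot change outcnt.

```c
/* Stand-in definitions (assumptions): the report only declares these extern. */
char outbuf[16];
int outcnt;
int bool_var;
int func_calls;                     /* bookkeeping for this sketch only */

void func(void) { func_calls++; }   /* hypothetical body */

/* The function as posted in the report. */
void test(void)
{
    char flags;
    flags = 0;
    outcnt = 0;
    if (outcnt == 1) func();
    if (outcnt == 1) func();
    if (outcnt == 1) func();
    if (bool_var) flags = 2;
    outbuf[outcnt] = flags;
    if (outcnt == 1) func();
}

/* What is left once all four dead tests are removed. */
void test_reduced(void)
{
    outcnt = 0;
    outbuf[0] = bool_var ? 2 : 0;
}
```

Both functions leave outcnt and outbuf[0] in the same state and never reach func(), which matches the 4.2.1 output above.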
I can confirm this is a problem on the 4.4 branch, and I can see it isn't a problem on trunk:

        .arch armv5te
        .fpu softvfp
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 1
        .eabi_attribute 30, 4
        .eabi_attribute 18, 4
        .code   16
        .file   "t.c"
        .text
        .align  1
        .global test
        .code   16
        .thumb_func
        .type   test, %function
test:
        ldr     r3, .L4
        mov     r2, #0
        str     r2, [r3]
        ldr     r3, .L4+4
        @ sp needed for prologue
        ldr     r2, [r3]
        sub     r3, r2, #1
        sbc     r2, r2, r3
        ldr     r3, .L4+8
        lsl     r2, r2, #1
        strb    r2, [r3]
        bx      lr
.L5:
        .align  2
.L4:
        .word   outcnt
        .word   bool_var
        .word   outbuf
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.0 20091223 (experimental)"
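For anyone puzzled by the sub/sbc/lsl sequence in the trunk output: as far as I can tell it computes (bool_var != 0) << 1 without a branch. Below is a small C model of those three instructions. It assumes that the Thumb-1 sub sets the carry flag in the usual ARM way (carry set means no borrow); flags_branchless and flags_reference are names made up for this sketch.

```c
#include <stdint.h>

/* Models: sub r3, r2, #1 ; sbc r2, r2, r3 ; lsl r2, r2, #1 */
uint32_t flags_branchless(uint32_t bool_var)
{
    uint32_t r2 = bool_var;
    uint32_t r3 = r2 - 1u;            /* sub r3, r2, #1 */
    uint32_t carry = (r2 >= 1u);      /* assumed: carry set when there is no borrow */
    r2 = r2 - r3 - (1u - carry);      /* sbc r2, r2, r3 computes r2 - r3 - !carry */
    return r2 << 1;                   /* lsl r2, r2, #1 */
}

/* The source-level meaning: flags = bool_var ? 2 : 0 */
uint32_t flags_reference(uint32_t bool_var)
{
    return bool_var ? 2u : 0u;
}
```

For bool_var == 0 the subtraction borrows, so sbc yields 0; for any non-zero value it yields 1, which the shift turns into 2.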
The "if (outcnt == 1) func ();" bit is optimized for me with gcc-4.4.2 on x86_64 at -O1 and -O2, but not at -Os. I was a bit too hasty to call this alias related, it seems. The -O2 and -Os tree dumps start to diverge in the dom1 dump, where we have just one call to func after dom1 at -O2 and still two calls at -Os.
Steven,

(In reply to comment #2)
> The "if (outcnt == 1) func ();" bit is optimized for me with gcc-4.4.2 on
> x86_64 at -O1 and -O2, but not at -Os. I was a bit too hasty to call this alias
> related, it seems.

Yes it seems to go away at O1 and O2. I didn't see any comment from you earlier so not sure what you are referring to here.

> The O2 and Os tree dumps start to diverge in the dom1 dump,
> where we have just one call to func after dom1 at -O2 and still two calls at
> -O2.
>

FWIW I saw this removed by dom2 in trunk at -Os. In any case I think the component ought to be tree-optimization and changed.

Cheers
Ramana
The issue is that we do not do predicated value-numbering so we need multiple invocations of it. Which all of FRE, DOM and PRE do - but PRE is not run at -Os in 4.4. And yes, the optimization done by 4.5 is correct. You could say "fixed" in 4.5 - or wait for someone to implement predicated value-numbering to also fix a testcase with one more if (outcnt == 1) func ();
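A small sketch of why the later tests are harder than the first one, as I understand it: func() is external, so it may write outcnt, and every test after the first reloads outcnt behind a possible call. Knowing that the call sits under a condition already proven false is the "predicated" part. The body given to func() here is hypothetical and chosen only to show that a call can change the outcome of the next test; chain and func_calls are likewise invented for this sketch.

```c
int outcnt;
int func_calls;              /* bookkeeping for this sketch only */

/* Hypothetical body for func(): it writes the global that the caller tests. */
void func(void)
{
    func_calls++;
    outcnt = 0;
}

void chain(int start)
{
    outcnt = start;
    if (outcnt == 1) func();   /* decidable from the store just above */
    if (outcnt == 1) func();   /* outcnt is reloaded after a possible call */
}
```

With start == 0 neither call is reached. With start == 1 the first call runs and turns the second test false, so the second test cannot be folded from the initial store alone; it needs the fact that the first call was not taken.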
Btw, in 4.4 there is one DOM pass removed for compile-time. We won't fix this for 4.4, the particular testcase is fixed in 4.5.
For the trunk snapshot of 20100102, GCC 4.5.0 indeed removes most of the redundancy. However, -O1 and -Os still produce an extra instruction, while -O2 doesn't. Do we care about an extra instruction below? Thanks.

The instruction is marked as "redundant" below.

GCC trunk's -Os:

Disassembly of section .text:

00000000 <test>:
   0:   e59f3024    ldr     r3, [pc, #36]   ; 2c <test+0x2c>
   4:   e3a02000    mov     r2, #0
   8:   e5832000    str     r2, [r3]
   c:   e59f301c    ldr     r3, [pc, #28]   ; 30 <test+0x30>
  10:   e5932000    ldr     r2, [r3]
  14:   e59f3018    ldr     r3, [pc, #24]   ; 34 <test+0x34>
  18:   e3520000    cmp     r2, #0
  1c:   13a02002    movne   r2, #2
  20:   03a02000    moveq   r2, #0          ;redundant
  24:   e5c32000    strb    r2, [r3]
  28:   e12fff1e    bx      lr
        ...

GCC trunk's -O1:

Disassembly of section .text:

00000000 <test>:
   0:   e3a02000    mov     r2, #0
   4:   e59f3020    ldr     r3, [pc, #32]   ; 2c <test+0x2c>
   8:   e5832000    str     r2, [r3]
   c:   e59f301c    ldr     r3, [pc, #28]   ; 30 <test+0x30>
  10:   e5932000    ldr     r2, [r3]
  14:   e3520000    cmp     r2, #0
  18:   13a02002    movne   r2, #2
  1c:   03a02000    moveq   r2, #0          ;redundant
  20:   e59f300c    ldr     r3, [pc, #12]   ; 34 <test+0x34>
  24:   e5c32000    strb    r2, [r3]
  28:   e12fff1e    bx      lr
        ...

Below, GCC trunk's -O2 doesn't produce the redundant instruction.

-O2:

Disassembly of section .text:

00000000 <test>:
   0:   e59f3020    ldr     r3, [pc, #32]   ; 28 <test+0x28>
   4:   e59f2020    ldr     r2, [pc, #32]   ; 2c <test+0x2c>
   8:   e5933000    ldr     r3, [r3]
   c:   e3a01000    mov     r1, #0
  10:   e3530000    cmp     r3, #0
  14:   e5821000    str     r1, [r2]
  18:   e59f2010    ldr     r2, [pc, #16]   ; 30 <test+0x30>
  1c:   13a03002    movne   r3, #2
  20:   e5c23000    strb    r3, [r2]
  24:   e12fff1e    bx      lr
        ...
*sigh* With all the Google power, why can't anyone there try to debug this? I'll have a look.
The diff between -O2 and -Os starts in combine:

diff -ur O2/t.c.175r.combine Os/t.c.175r.combine
--- O2/t.c.175r.combine 2010-01-07 22:24:04.000000000 +0100
+++ Os/t.c.175r.combine 2010-01-07 22:23:14.000000000 +0100
@@ -4,18 +4,20 @@
 starting the processing of deferred insns
 ending the processing of deferred insns
 df_analyze called
-insn_cost 7: 12
+insn_cost 7: 8
 insn_cost 8: 4
 insn_cost 9: 4
-insn_cost 10: 12
-insn_cost 11: 12
+insn_cost 10: 8
+insn_cost 11: 4
 insn_cost 12: 4
-insn_cost 46: 4
-insn_cost 17: 12
+insn_cost 46: 16
+insn_cost 17: 8
 insn_cost 19: 4
-rejecting combination of insns 12 and 46
-original costs 4 + 4 = 8
-replacement cost 12
+deferring deletion of insn with uid = 12.
+modifying insn i3 46 {r133:SI={(r138:SI!=0x0)?0x2:0x0};clobber cc:CC;}
+ REG_UNUSED: cc:CC
+ REG_DEAD: r138:SI
+deferring rescan insn with uid = 46.
 (note# 0 # 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
 (note# # # 2 NOTE_INSN_FUNCTION_BEG)
@@ -40,16 +42,17 @@
 (expr_list:REG_EQUAL (mem/c/i:SI (symbol_ref:SI ("bool_var") [flags 0xc0] <var_decl # bool_var>) [2 bool_var+0 S4 A32])
 (nil))))
-(insn# # # 2 t.c:13 (set (reg:CC 24 cc)
- (compare:CC (reg:SI 138 [ bool_var ])
- (const_int 0 [0x0])))# {*arm_cmpsi_insn} (nil))
-
-(insn# # # 2 t.c:8 (set (reg/v:SI 133 [ flags ])
- (if_then_else:SI (eq (reg:CC 24 cc)
- (const_int 0 [0x0]))
- (reg:SI 138 [ bool_var ])
- (const_int 2 [0x2])))# {*movsicc_insn} (expr_list:REG_DEAD (reg:SI 138 [ bool_var ])
- (expr_list:REG_DEAD (reg:CC 24 cc)
+(note# # # 2 NOTE_INSN_DELETED)
+
+(insn# # # 2 t.c:8 (parallel [
+ (set (reg/v:SI 133 [ flags ])
+ (if_then_else:SI (ne (reg:SI 138 [ bool_var ])
+ (const_int 0 [0x0]))
+ (const_int 2 [0x2])
+ (const_int 0 [0x0])))
+ (clobber (reg:CC 24 cc))
+ ])# {movcond} (expr_list:REG_UNUSED (reg:CC 24 cc)
+ (expr_list:REG_DEAD (reg:SI 138 [ bool_var ])
 (nil))))
 (insn# # # 2 t.c:14 (set (reg/f:SI 139)
@@ -60,7 +63,10 @@
 (expr_list:REG_DEAD (reg/v:SI 133 [ flags ])
 (nil))))
 starting the processing of deferred insns
+deleting insn with uid = 12.
+rescanning insn with uid = 46.
+deleting insn with uid = 46.
 ending the processing of deferred insns
-;; Combiner totals: 12 attempts, 12 substitutions (2 requiring new space),
-;; 0 successes.
+;; Combiner totals: 13 attempts, 13 substitutions (2 requiring new space),
+;; 1 successes.
At -Os, in the .188r.postreload RTL dump:

(insn 46 12 17 2 t.c:8 (parallel [
            (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
                (if_then_else:SI (ne (reg:SI 2 r2 [orig:138 bool_var ] [138])
                        (const_int 0 [0x0]))
                    (const_int 2 [0x2])
                    (const_int 0 [0x0])))
            (clobber (reg:CC 24 cc))
        ]) 291 {movcond} (nil))

And one pass later, in the .190r.split2 RTL dump:

(insn 49 12 50 2 t.c:8 (set (reg:CC 24 cc)
        (compare:CC (reg:SI 2 r2 [orig:138 bool_var ] [138])
            (const_int 0 [0x0]))) 220 {*arm_cmpsi_insn} (nil))

(insn 50 49 51 2 t.c:8 (cond_exec (ne (reg:CC 24 cc)
            (const_int 0 [0x0]))
        (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
            (const_int 2 [0x2]))) 2367 {neon_vornv2di+77} (nil))

(insn 51 50 17 2 t.c:8 (cond_exec (eq (reg:CC 24 cc)
            (const_int 0 [0x0]))
        (set (reg/v:SI 2 r2 [orig:133 flags ] [133])
            (const_int 0 [0x0]))) 2367 {neon_vornv2di+77} (nil))

There's your redundant mov, from an insn splitter.
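To spell out why the second cond_exec is redundant: source and destination share r2, and the eq arm only runs when r2 is already 0, so writing 0 into it changes nothing. A rough C rendering of the two instruction sequences follows; flags_split and flags_movne_only are names invented for this sketch, and the int parameter stands in for register r2.

```c
/* Split form: cmp r2, #0 ; movne r2, #2 ; moveq r2, #0 */
int flags_split(int r2)
{
    int ne = (r2 != 0);       /* cmp r2, #0 */
    if (ne)  r2 = 2;          /* movne r2, #2 */
    if (!ne) r2 = 0;          /* moveq r2, #0: r2 is already 0 on this path */
    return r2;
}

/* -O2 form: cmp ; movne only */
int flags_movne_only(int r2)
{
    if (r2 != 0) r2 = 2;      /* movne */
    return r2;
}
```

The two functions agree for every input, which is why dropping the moveq is safe whenever the selected value for the eq arm equals the tested register.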
An ARM maintainer can look at a solution for this (special splitters, maybe a peephole2, perhaps a post-ce3 special DCE pass to clean up cond_exec silliness). See also bug 21803, for similar cond_exec idiocy on ia64.
Trunk today (r156595) optimizes this at -O1, -Os, and -O2 in the tree optimizers. The .fre pass removes the first func call, then .dom1 removes the next two. The .dom2 pass removes the remaining one. If I add another, say, 10 lines of "if (outcnt == 1) func ();", then dom1 removes them all (even with -fno-tree-fre). So something changed in (or for) DOM that allows it to clean up this mess completely. I would like to know what... Perhaps the OP can try to bisect to identify what was changed here (in the hope that it is something that can be back-ported to GCC 4.4)?
It would seem that GCC 4.5.1 is still generating a call to func() at -Os. Consider the following reduced test case:

extern int outcnt;

void test ()
{
  outcnt = 0;
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
  if (outcnt == 1) func ();
}

At -O2:

/usr/libexec/gcc/x86_64-redhat-linux/4.5.1/cc1 -O2 -o - test.c
        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <cp> <inline> <static-var> <pure-const>Assembling functions:
 test
        .text
        .p2align 4,,15
.globl test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.1 20100924 (Red Hat 4.5.1-4)"
        .section        .note.GNU-stack,"",@progbits

Now, at -Os:

/usr/libexec/gcc/x86_64-redhat-linux/4.5.1/cc1 -Os -o - test.c
        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <cp> <inline> <static-var> <pure-const>Assembling functions:
 test
        .text
.globl test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        cmpl    $1, outcnt(%rip)
        jne     .L1
        xorl    %eax, %eax
        jmp     func
.L1:
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.5.1 20100924 (Red Hat 4.5.1-4)"
        .section        .note.GNU-stack,"",@progbits
trunk from early May also fails with one more "if (outcnt == 1) func ();" line at the end of the function:

% ./gcc/cc1 -Os -o - test.c
        .file   "test.c"
 test
Analyzing compilation unit
Performing interprocedural optimizations
 <*free_lang_data> <visibility> <early_local_cleanups> <whole-program> <ipa-profile> <cp> <inline> <pure-const> <static-var>Assembling functions:
 test
        .text
        .globl  test
        .type   test, @function
test:
.LFB0:
        .cfi_startproc
        movl    $0, outcnt(%rip)
        cmpl    $0, outcnt(%rip)
        je      .L1
        xorl    %eax, %eax
        jmp     func
.L1:
        ret
        .cfi_endproc
.LFE0:
        .size   test, .-test
        .ident  "GCC: (GNU) 4.7.0 20110416 (experimental)"
        .section        .note.GNU-stack,"",@progbits

Target and host are x86-64-linux. -O2 and above are fine.
I don't see any calls to func() for the original testcase that survive at -O2 or -Os for 4.5, 4.6 and current trunk.
Yes, the original test case is fine now, but not the updated snippet I posted. Should I open a new bug?
Fixed in 4.5+, 4.4 is no longer supported.