The following code (by Alex Vod.) struct A { char a[400]; float* c; }; struct A glob; void func(); void func1(float*); int func2(float*, int*); void func3(float*); void test(int *p) { func1(glob.c); if (func2(glob.c, p)) { func(); } func3(glob.c); } is compiled by gcc 4.2.1 to 56 bytes and 4.4.0 to 64 bytes. The problem is that it calculates address glob.c *twice*, but there is no need to - it doesn't change as glob is a global variable. gcc 4.2.1 output: push {r4, r5, r6, lr} ldr r3, .L5 ldr r2, .L5+4 .LPIC0: add r3, pc ldr r5, [r3, r2] mov r6, #200 lsl r6, r6, #1 mov r4, r0 ldr r0, [r5, r6] bl func1 ldr r0, [r5, r6] mov r1, r4 bl func2 cmp r0, #0 beq .L2 bl func .L2: ldr r0, [r5, r6] bl func3 @ sp needed for prologue pop {r4, r5, r6, pc} gcc 4.4.0 output: push {r3, r4, r5, r6, r7, lr} ldr r4, .L5 ldr r3, .L5+4 .LPIC0: add r4, pc ldr r6, [r4, r3] mov r5, #200 lsl r5, r5, #1 mov r7, r0 ldr r0, [r6, r5] bl func1 ldr r0, [r6, r5] mov r1, r7 bl func2 cmp r0, #0 beq .L2 bl func .L2: ldr r3, .L5+4 @ sp needed for prologue ldr r2, [r4, r3] mov r3, #200 // this is the calculation of glob.c address lsl r3, r3, #1 // it is redundant, as r5 already contains #400 ldr r0, [r2, r3] bl func3 pop {r3, r4, r5, r6, r7, pc} I believe gcc 4.2.1 works correctly due to RTL CSE pass. Before CSE: ;; Start of basic block 4, registers live: (nil) (code_label 34 33 35 4 2 "" [1 uses]) (note 35 34 37 4 [bb 4] NOTE_INSN_BASIC_BLOCK) (insn 37 35 38 4 (set (reg:SI 113) // put attention to this insn (unspec:SI [ (symbol_ref:SI ("glob") <var_decl 0xf7d5a000 glob>) ] 3)) 148 {pic_load_addr_thumb} (nil) (nil)) (insn 38 37 39 4 (set (reg/f:SI 112) // put attention to this insn (mem/u/c:SI (plus:SI (reg:SI 103) (reg:SI 113)) [0 S4 A32])) 146 {*thumb_movsi_insn} (nil) (expr_list:REG_EQUAL (symbol_ref:SI ("glob") <var_decl 0xf7d5a000 glob>) (nil))) (insn 39 38 40 4 (set (reg:SI 114) // put attention to this insn (const_int 400 [0x190])) 146 {*thumb_movsi_insn} (nil) (nil)) (insn 40 39 41 4 (set (reg:SI 115 [ glob.c ]) (mem/s/f/c:SI (plus:SI (reg/f:SI 112) (reg:SI 114)) [5 glob.c+0 S4 A32])) 146 {*thumb_movsi_insn} (nil) (nil)) (insn 41 40 42 4 (set (reg:SI 0 r0 [ glob.c ]) (reg:SI 115 [ glob.c ])) 146 {*thumb_movsi_insn} (nil) (nil)) (call_insn 42 41 43 4 (parallel [ (call (mem:SI (symbol_ref:SI ("func3") [flags 0x41] <function_decl 0xf7d5c100 func3>) [0 S4 A32]) (const_int 0 [0x0])) (use (const_int 0 [0x0])) (clobber (reg:SI 14 lr)) ]) 231 {*call_insn} (nil) (nil) (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0 [ glob.c ])) (nil))) ;; End of basic block 4, registers live: after CSE: ;; Start of basic block 4, registers live: (nil) (code_label 34 33 35 4 2 "" [1 uses]) (note 35 34 40 4 [bb 4] NOTE_INSN_BASIC_BLOCK) (insn 40 35 41 4 (set (reg:SI 115 [ glob.c ]) // all mentioned instructions were removed (mem/s/f/c:SI (plus:SI (reg/f:SI 104) (reg:SI 106)) [5 glob.c+0 S4 A32])) 146 {*thumb_movsi_insn} (nil) (nil)) (insn 41 40 42 4 (set (reg:SI 0 r0 [ glob.c ]) (reg:SI 115 [ glob.c ])) 146 {*thumb_movsi_insn} (nil) (nil)) (call_insn 42 41 43 4 (parallel [ (call (mem:SI (symbol_ref:SI ("func3") [flags 0x41] <function_decl 0xf7d5c100 func3>) [0 S 4 A32]) (const_int 0 [0x0])) (use (const_int 0 [0x0])) (clobber (reg:SI 14 lr)) ]) 231 {*call_insn} (nil) (nil) (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0 [ glob.c ])) (nil))) ;; End of basic block 4, registers live: For some reason, this CSE pass doesn't work in gcc 4.3.1 or gcc-4.4.0
Probably doesn't work in CSE because -fcse-skip-blocks was removed (for good reason). You should try to find out why gcse.c doesn't eliminate this. There could be (should have been?) a REG_EQUAL note to take care of this.
I would say that at -Os we don't run RTL PRE?
CPROP should handle this, since the address is a "constant" if a proper REG_EQUAL note is placed on the (set (reg) (address)).
Actually, no - you're right, this is not something CPROP would handle.
Playing with a classic-GCSE pass (resurrected from svn...). No idea yet if it will help, but nonetheless mine in the mean time.
Created attachment 19446 [details] Classic GCSE, resurrected (with some improvements) With this patch (not bootstrapped/tested/etc.), I get the following code: test: push {r4, r5, r6, lr} ldr r3, .L4 ldr r2, .L4+4 .LPIC0: add r3, pc ldr r5, [r3, r2] mov r4, #200 lsl r4, r4, #1 mov r6, r0 ldr r0, [r5, r4] bl func1 ldr r0, [r5, r4] mov r1, r6 bl func2 cmp r0, #0 beq .L2 bl func .L2: mov r3, #200 lsl r3, r3, #1 ldr r0, [r5, r3] bl func3 @ sp needed for prologue pop {r4, r5, r6, pc} The constant load of #400 is not eliminated, because classic GCSE is not value numbering, so to it, the following expressions in insns 11 and 31 are not equivalent: 11 r141:SI=0x190 12 r142:SI=[r139:SI+r141:SI] REG_EQUAL: [const(`glob'+0x190)] 30 r149:SI=0x190 31 r150:SI=[r139:SI+r149:SI] The REG_EQUAL note is probably not exploited because it is a CONST and want_to_gcse_p rejects it (someone should check if that is indeed the reason). I have no intention at this point to finish this patch.
For the record, options for compiler to get the asm output of the comments: "-march=armv5te -mthumb -mthumb-interwork -fpic -Os"
Created attachment 20379 [details] Classic GCSE, resurrected (with some improvements) Updated patch for trunk r158281. Bootstrapped and tested on ia64-unknown-linux-gnu (with this new old pass enabled at -O2 or higher with an incremental patch). Also tested the effect on size using CSiBE with various different options. Baseline options are "-Os -mcpu=arm7tdmi -fno-short-enums", other options are "-fpic", "-mthumb", and "-fpic -mthumb": trunk patched diff ratio baseline 3542232 3537080 5152 99.8546 baseline+fpic 3792475 3782747 9728 99.7435 baseline+mthumb 2693008 2690942 2066 99.9233 baseline+fpic+mthumb 2840739 2836169 4570 99.8391 So in all cases there is a code size reduction, but it's not very much. All in all I'm not sure if it's worth the cost. I'm also still working on integrating this with the hoisting pass. That would make this pass easier to justify (it'd be almost free, HOIST already computes everything).
GCC 4.3.5 is being released, adjusting target milestone.
Steven, I'm shamelessly stealing this PR from you. There are two sides to this missed optimization: 1. Calculation of PIC address is not CSE'd; this is the same as PR42495 and will be fixed there. 2. Constant "400", which expands to 2 instructions is not CSE'd. After addressing the first issue, the second problem can be fixed by asking hoist to CSE "complicated" constants and making sure that RA will rematerialize them instead of spilling under high register pressure. We can define a constant "complicated" if it takes a PARALLEL -- (parallel [(set (reg1) (const)) (clobber reg2)]) -- to set it; this is a common way of defining instructions that should be split later and require a temporary register to hold intermediate value. I'm now testing a patch that makes ARM backend to expand constants into parallels instead of sequences of two instructions and tweaking hoist to gcse "complicated" const_int's. The result is the following: test: push {r4, r5, r6, lr} ldr r3, .L3 ldr r2, .L3+4 .LPIC0: add r3, pc ldr r5, [r3, r2] mov r4, #200 lsl r4, r4, #1 mov r6, r0 ldr r0, [r5, r4] bl func1 ldr r0, [r5, r4] mov r1, r6 bl func2 cmp r0, #0 beq .L2 bl func .L2: ldr r0, [r5, r4] bl func3 @ sp needed for prologue pop {r4, r5, r6, pc} .L4: .align 2 .L3: .word _GLOBAL_OFFSET_TABLE_-(.LPIC0+4) .word glob(GOT)
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 19:34:55 2010 New Revision: 162590 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162590 Log: PR rtl-optimization/40956 PR target/42495 PR middle-end/42574 * gcse.c (compute_code_hoist_vbeinout): Consider more expressions for hoisting. (hoist_code): Count occurences in current block too. Modified: trunk/gcc/ChangeLog trunk/gcc/gcse.c
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 19:38:10 2010 New Revision: 162592 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162592 Log: PR target/42495 PR middle-end/42574 * gcse.c (hoist_expr_reaches_here_p): Remove excessive check. Modified: trunk/gcc/ChangeLog trunk/gcc/gcse.c
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 19:42:15 2010 New Revision: 162594 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162594 Log: PR target/42495 PR middle-end/42574 * config/arm/arm.c (thumb1_size_rtx_costs): Add cost for "J" constants. * config/arm/arm.md (define_split "J", define_split "K"): Make IRA/reload friendly. Modified: trunk/gcc/ChangeLog trunk/gcc/config/arm/arm.c trunk/gcc/config/arm/arm.md
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 19:44:51 2010 New Revision: 162595 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162595 Log: PR target/42495 PR middle-end/42574 * config/arm/arm.c (legitimize_pic_address): Use gen_calculate_pic_address pattern to emit calculation of PIC address. (will_be_in_index_register): New function. (arm_legitimate_address_outer_p, thumb2_legitimate_address_p,) (thumb1_legitimate_address_p): Use it provided !strict_p. * config/arm/arm.md (calculate_pic_address): New expand and split. Modified: trunk/gcc/ChangeLog trunk/gcc/config/arm/arm.c trunk/gcc/config/arm/arm.md
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 19:48:15 2010 New Revision: 162597 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162597 Log: PR target/42495 PR middle-end/42574 * basic-block.h (get_dominated_to_depth): Declare. * dominance.c (get_dominated_to_depth): New function, use get_all_dominated_blocks as a base. (get_all_dominated_blocks): Use get_dominated_to_depth. * gcse.c (occr_t, VEC (occr_t, heap)): Define. (hoist_exprs): Remove. (alloc_code_hoist_mem, free_code_hoist_mem): Update. (compute_code_hoist_vbeinout): Add debug print outs. (hoist_code): Partially rewrite, simplify. Use get_dominated_to_depth. * params.def (PARAM_MAX_HOIST_DEPTH): New parameter to avoid quadratic behavior. * params.h (MAX_HOIST_DEPTH): New macro. * doc/invoke.texi (max-hoist-depth): Document. Modified: trunk/gcc/ChangeLog trunk/gcc/basic-block.h trunk/gcc/doc/invoke.texi trunk/gcc/dominance.c trunk/gcc/gcse.c trunk/gcc/params.def trunk/gcc/params.h
Subject: Bug 42574 Author: mkuvyrkov Date: Tue Jul 27 21:06:31 2010 New Revision: 162600 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162600 Log: PR rtl-optimization/40956 PR target/42495 PR middle-end/42574 * gcc.target/arm/pr40956.c, gcc.target/arm/pr42495.c, * gcc.target/arm/pr42574.c: Add tests. Added: trunk/gcc/testsuite/gcc.target/arm/pr40956.c trunk/gcc/testsuite/gcc.target/arm/pr42495.c trunk/gcc/testsuite/gcc.target/arm/pr42574.c Modified: trunk/gcc/testsuite/ChangeLog
Should be fixed now by the above patch series.