Bug 42574 - [4.3/4.4/4.5/4.6 Regression] Address of global variable is calculated multiple times (missed CSE)
Summary: [4.3/4.4/4.5/4.6 Regression] Address of global variable is calculated multipl...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.4.0
: P2 normal
Target Milestone: 4.3.6
Assignee: Maxim Kuvyrkov
URL:
Keywords: missed-optimization
Depends on:
Blocks: 16996
  Show dependency treegraph
 
Reported: 2010-01-01 17:27 UTC by Shih-wei Liao
Modified: 2010-07-27 21:11 UTC (History)
2 users (show)

See Also:
Host:
Target: arm-eabi
Build:
Known to work: 4.1.2 4.2.4
Known to fail: 4.3.3 4.4.2 4.5.0 4.6.0
Last reconfirmed: 2010-02-08 12:15:39


Attachments
Classic GCSE, resurrected (with some improvements) (5.34 KB, patch)
2010-01-02 14:08 UTC, Steven Bosscher
Details | Diff
Classic GCSE, resurrected (with some improvements) (5.62 KB, patch)
2010-04-14 20:49 UTC, Steven Bosscher
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Shih-wei Liao 2010-01-01 17:27:45 UTC
The following code  (by Alex Vod.)

struct A {
 char a[400];
 float* c;
};
struct A glob;
void func();
void func1(float*);
int func2(float*, int*);
void func3(float*);

void test(int *p) {
 func1(glob.c);
 if (func2(glob.c, p)) {
   func();
 }
 func3(glob.c);
}

is compiled by gcc 4.2.1 to 56 bytes and 4.4.0 to 64 bytes. The problem is that it calculates address glob.c *twice*, but there is no need to - it doesn't change as glob is a global variable.
gcc 4.2.1 output:
       push    {r4, r5, r6, lr}
       ldr     r3, .L5
       ldr     r2, .L5+4
.LPIC0:
       add     r3, pc
       ldr     r5, [r3, r2]
       mov     r6, #200
       lsl     r6, r6, #1
       mov     r4, r0
       ldr     r0, [r5, r6]
       bl      func1
       ldr     r0, [r5, r6]
       mov     r1, r4
       bl      func2
       cmp     r0, #0
       beq     .L2
       bl      func
.L2:
       ldr     r0, [r5, r6]
       bl      func3
       @ sp needed for prologue
       pop     {r4, r5, r6, pc}

gcc 4.4.0 output:
       push    {r3, r4, r5, r6, r7, lr}
       ldr     r4, .L5
       ldr     r3, .L5+4
.LPIC0:
       add     r4, pc
       ldr     r6, [r4, r3]
       mov     r5, #200
       lsl     r5, r5, #1
       mov     r7, r0
       ldr     r0, [r6, r5]
       bl      func1
       ldr     r0, [r6, r5]
       mov     r1, r7
       bl      func2
       cmp     r0, #0
       beq     .L2
       bl      func
.L2:
       ldr     r3, .L5+4
       @ sp needed for prologue
       ldr     r2, [r4, r3]
       mov     r3, #200   // this is the calculation of glob.c address
       lsl     r3, r3, #1    // it is redundant, as r5 already contains #400
       ldr     r0, [r2, r3]
       bl      func3
       pop     {r3, r4, r5, r6, r7, pc}

I believe gcc 4.2.1 works correctly due to RTL CSE pass.
Before CSE:

;; Start of basic block 4, registers live: (nil)
(code_label 34 33 35 4 2 "" [1 uses])
(note 35 34 37 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 37 35 38 4 (set (reg:SI 113) // put attention to this insn
       (unspec:SI [
               (symbol_ref:SI ("glob") <var_decl 0xf7d5a000 glob>)
           ] 3)) 148 {pic_load_addr_thumb} (nil)
   (nil))
(insn 38 37 39 4 (set (reg/f:SI 112)   // put attention to this insn
       (mem/u/c:SI (plus:SI (reg:SI 103)
               (reg:SI 113)) [0 S4 A32])) 146 {*thumb_movsi_insn} (nil)
   (expr_list:REG_EQUAL (symbol_ref:SI ("glob") <var_decl 0xf7d5a000 glob>)
       (nil)))
(insn 39 38 40 4 (set (reg:SI 114)             // put attention to this insn
       (const_int 400 [0x190])) 146 {*thumb_movsi_insn} (nil)
   (nil))
(insn 40 39 41 4 (set (reg:SI 115 [ glob.c ])
       (mem/s/f/c:SI (plus:SI (reg/f:SI 112)
               (reg:SI 114)) [5 glob.c+0 S4 A32])) 146 {*thumb_movsi_insn} (nil)
   (nil))
(insn 41 40 42 4 (set (reg:SI 0 r0 [ glob.c ])
       (reg:SI 115 [ glob.c ])) 146 {*thumb_movsi_insn} (nil)
   (nil))
(call_insn 42 41 43 4 (parallel [
           (call (mem:SI (symbol_ref:SI ("func3") [flags 0x41] <function_decl 0xf7d5c100 func3>) [0 S4 A32])
               (const_int 0 [0x0]))
           (use (const_int 0 [0x0]))
           (clobber (reg:SI 14 lr))
       ]) 231 {*call_insn} (nil)
   (nil)
   (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0 [ glob.c ]))
       (nil)))
;; End of basic block 4, registers live:

after CSE:

;; Start of basic block 4, registers live: (nil)
(code_label 34 33 35 4 2 "" [1 uses])
(note 35 34 40 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 40 35 41 4 (set (reg:SI 115 [ glob.c ]) // all mentioned instructions were removed
       (mem/s/f/c:SI (plus:SI (reg/f:SI 104)
               (reg:SI 106)) [5 glob.c+0 S4 A32])) 146 {*thumb_movsi_insn} (nil)
   (nil))
(insn 41 40 42 4 (set (reg:SI 0 r0 [ glob.c ])
       (reg:SI 115 [ glob.c ])) 146 {*thumb_movsi_insn} (nil)
   (nil))
(call_insn 42 41 43 4 (parallel [
           (call (mem:SI (symbol_ref:SI ("func3") [flags 0x41] <function_decl 0xf7d5c100 func3>) [0 S
4 A32])
               (const_int 0 [0x0]))
           (use (const_int 0 [0x0]))
           (clobber (reg:SI 14 lr))
       ]) 231 {*call_insn} (nil)
   (nil)
   (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0 [ glob.c ]))
       (nil)))
;; End of basic block 4, registers live:

For some reason, this CSE pass doesn't work in gcc 4.3.1 or gcc-4.4.0
Comment 1 Steven Bosscher 2010-01-01 20:16:29 UTC
Probably doesn't work in CSE because -fcse-skip-blocks was removed (for good reason). You should try to find out why gcse.c doesn't eliminate this. There could be (should have been?) a REG_EQUAL note to take care of this.
Comment 2 Richard Biener 2010-01-01 20:39:27 UTC
I would say that at -Os we don't run RTL PRE?
Comment 3 Steven Bosscher 2010-01-01 20:44:11 UTC
CPROP should handle this, since the address is a "constant" if a proper REG_EQUAL note is placed on the (set (reg) (address)).
Comment 4 Steven Bosscher 2010-01-01 21:00:08 UTC
Actually, no - you're right, this is not something CPROP would handle.
Comment 5 Steven Bosscher 2010-01-02 00:12:45 UTC
Playing with a classic-GCSE pass (resurrected from svn...). No idea yet if it will help, but nonetheless mine in the mean time.
Comment 6 Steven Bosscher 2010-01-02 14:08:32 UTC
Created attachment 19446 [details]
Classic GCSE, resurrected (with some improvements)

With this patch (not bootstrapped/tested/etc.), I get the following code:

test:
	push	{r4, r5, r6, lr}
	ldr	r3, .L4
	ldr	r2, .L4+4
.LPIC0:
	add	r3, pc
	ldr	r5, [r3, r2]
	mov	r4, #200
	lsl	r4, r4, #1
	mov	r6, r0
	ldr	r0, [r5, r4]
	bl	func1
	ldr	r0, [r5, r4]
	mov	r1, r6
	bl	func2
	cmp	r0, #0
	beq	.L2
	bl	func
.L2:
	mov	r3, #200
	lsl	r3, r3, #1
	ldr	r0, [r5, r3]
	bl	func3
	@ sp needed for prologue
	pop	{r4, r5, r6, pc}


The constant load of #400 is not eliminated, because classic GCSE is not value numbering, so to it, the following expressions in insns 11 and 31 are not equivalent:

   11 r141:SI=0x190
   12 r142:SI=[r139:SI+r141:SI]
      REG_EQUAL: [const(`glob'+0x190)]

   30 r149:SI=0x190
   31 r150:SI=[r139:SI+r149:SI]

The REG_EQUAL note is probably not exploited because it is a CONST and want_to_gcse_p rejects it (someone should check if that is indeed the reason).

I have no intention at this point to finish this patch.
Comment 7 Steven Bosscher 2010-01-02 14:10:17 UTC
For the record, options for compiler to get the asm output of the comments:
"-march=armv5te -mthumb -mthumb-interwork -fpic -Os"
Comment 8 Steven Bosscher 2010-04-14 20:49:54 UTC
Created attachment 20379 [details]
Classic GCSE, resurrected (with some improvements)

Updated patch for trunk r158281.

Bootstrapped and tested on ia64-unknown-linux-gnu (with this new old pass enabled at -O2 or higher with an incremental patch).

Also tested the effect on size using CSiBE with various different options. Baseline options are "-Os -mcpu=arm7tdmi -fno-short-enums", other options are "-fpic", "-mthumb", and "-fpic -mthumb":

                     trunk   patched  diff    ratio
baseline             3542232 3537080  5152    99.8546
baseline+fpic        3792475 3782747  9728    99.7435
baseline+mthumb      2693008 2690942  2066    99.9233
baseline+fpic+mthumb 2840739 2836169  4570    99.8391

So in all cases there is a code size reduction, but it's not very much. All in all I'm not sure if it's worth the cost.

I'm also still working on integrating this with the hoisting pass. That would make this pass easier to justify (it'd be almost free, HOIST already computes everything).
Comment 9 Richard Biener 2010-05-22 18:13:48 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 10 Maxim Kuvyrkov 2010-06-08 15:24:25 UTC
Steven, I'm shamelessly stealing this PR from you.

There are two sides to this missed optimization:

1. Calculation of PIC address is not CSE'd; this is the same as PR42495 and will be fixed there.
2. Constant "400", which expands to 2 instructions is not CSE'd.

After addressing the first issue, the second problem can be fixed by asking hoist to CSE "complicated" constants and making sure that RA will rematerialize them instead of spilling under high register pressure.

We can define a constant "complicated" if it takes a PARALLEL -- (parallel [(set (reg1) (const)) (clobber reg2)]) -- to set it; this is a common way of defining instructions that should be split later and require a temporary register to hold intermediate value.

I'm now testing a patch that makes ARM backend to expand constants into parallels instead of sequences of two instructions and tweaking hoist to gcse "complicated" const_int's.

The result is the following:

test:
	push	{r4, r5, r6, lr}
	ldr	r3, .L3
	ldr	r2, .L3+4
.LPIC0:
	add	r3, pc
	ldr	r5, [r3, r2]
	mov	r4, #200
	lsl	r4, r4, #1
	mov	r6, r0
	ldr	r0, [r5, r4]
	bl	func1
	ldr	r0, [r5, r4]
	mov	r1, r6
	bl	func2
	cmp	r0, #0
	beq	.L2
	bl	func
.L2:
	ldr	r0, [r5, r4]
	bl	func3
	@ sp needed for prologue
	pop	{r4, r5, r6, pc}
.L4:
	.align	2
.L3:
	.word	_GLOBAL_OFFSET_TABLE_-(.LPIC0+4)
	.word	glob(GOT)
Comment 11 Maxim Kuvyrkov 2010-07-27 19:35:23 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 19:34:55 2010
New Revision: 162590

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162590
Log:
	PR rtl-optimization/40956
	PR target/42495
	PR middle-end/42574
	* gcse.c (compute_code_hoist_vbeinout): Consider more expressions
	for hoisting.
	(hoist_code): Count occurences in current block too.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gcse.c

Comment 12 Maxim Kuvyrkov 2010-07-27 19:38:28 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 19:38:10 2010
New Revision: 162592

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162592
Log:
	PR target/42495
	PR middle-end/42574
	* gcse.c (hoist_expr_reaches_here_p): Remove excessive check.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gcse.c

Comment 13 Maxim Kuvyrkov 2010-07-27 19:42:33 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 19:42:15 2010
New Revision: 162594

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162594
Log:
	PR target/42495
	PR middle-end/42574
	* config/arm/arm.c (thumb1_size_rtx_costs): Add cost for "J" constants.
	* config/arm/arm.md (define_split "J", define_split "K"): Make
	IRA/reload friendly.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/arm.md

Comment 14 Maxim Kuvyrkov 2010-07-27 19:45:11 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 19:44:51 2010
New Revision: 162595

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162595
Log:
	PR target/42495
	PR middle-end/42574
	* config/arm/arm.c (legitimize_pic_address): Use
	gen_calculate_pic_address pattern to emit calculation of PIC address.
	(will_be_in_index_register): New function.
	(arm_legitimate_address_outer_p, thumb2_legitimate_address_p,)
	(thumb1_legitimate_address_p): Use it provided !strict_p.
	* config/arm/arm.md (calculate_pic_address): New expand and split.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm.c
    trunk/gcc/config/arm/arm.md

Comment 15 Maxim Kuvyrkov 2010-07-27 19:48:33 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 19:48:15 2010
New Revision: 162597

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162597
Log:
	PR target/42495
	PR middle-end/42574
	* basic-block.h (get_dominated_to_depth): Declare.
	* dominance.c (get_dominated_to_depth): New function, use
	get_all_dominated_blocks as a base.
	(get_all_dominated_blocks): Use get_dominated_to_depth.

	* gcse.c (occr_t, VEC (occr_t, heap)): Define.
	(hoist_exprs): Remove.
	(alloc_code_hoist_mem, free_code_hoist_mem): Update.
	(compute_code_hoist_vbeinout): Add debug print outs.
	(hoist_code): Partially rewrite, simplify.  Use get_dominated_to_depth.

	* params.def (PARAM_MAX_HOIST_DEPTH): New parameter to avoid
	quadratic behavior.
	* params.h (MAX_HOIST_DEPTH): New macro.
	* doc/invoke.texi (max-hoist-depth): Document.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/basic-block.h
    trunk/gcc/doc/invoke.texi
    trunk/gcc/dominance.c
    trunk/gcc/gcse.c
    trunk/gcc/params.def
    trunk/gcc/params.h

Comment 16 Maxim Kuvyrkov 2010-07-27 21:06:56 UTC
Subject: Bug 42574

Author: mkuvyrkov
Date: Tue Jul 27 21:06:31 2010
New Revision: 162600

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=162600
Log:
	PR rtl-optimization/40956
	PR target/42495
	PR middle-end/42574
	* gcc.target/arm/pr40956.c, gcc.target/arm/pr42495.c,
	* gcc.target/arm/pr42574.c: Add tests.

Added:
    trunk/gcc/testsuite/gcc.target/arm/pr40956.c
    trunk/gcc/testsuite/gcc.target/arm/pr42495.c
    trunk/gcc/testsuite/gcc.target/arm/pr42574.c
Modified:
    trunk/gcc/testsuite/ChangeLog

Comment 17 Maxim Kuvyrkov 2010-07-27 21:11:35 UTC
Should be fixed now by the above patch series.