Bug 40615 - unnecessary CSE
Summary: unnecessary CSE
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.5.0
Importance: P3 enhancement
Target Milestone: 4.6.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2009-07-02 07:39 UTC by Carrot
Modified: 2010-06-08 15:30 UTC
CC List: 4 users

See Also:
Host:
Target: arm-eabi, powerpc*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-07-02 08:53:25


Attachments
test case (109 bytes, text/plain)
2009-07-02 07:39 UTC, Carrot

Description Carrot 2009-07-02 07:39:12 UTC
Compiling the attached source code with the options -march=armv5te -mthumb -Os -fno-exceptions, gcc generates:

        push    {r4, lr}
        sub     sp, sp, #8
        add     r4, sp, #4    // redundant
        mov     r0, r4        // add  r0, sp, 4
        bl      _ZN1XC1Ev
        mov     r0, r4        // add  r0, sp, 4
        bl      _Z3barP1X
        mov     r0, r4        // add  r0, sp, 4
        bl      _ZN1XD1Ev
        add     sp, sp, #8
        @ sp needed for prologue
        pop     {r4, pc}

As mentioned in the comments, the CSE into r4 is unnecessary: we can recompute the value of (sp + 4) each time we need it, and with this approach we save one instruction.
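
For reference, a minimal sketch of what the attached test case presumably looks like, reconstructed from the mangled names in the assembly (_ZN1XC1Ev, _Z3barP1X, _ZN1XD1Ev); the member, the function name foo and the exact layout are guesses, and the real 109-byte attachment may differ:

// Hypothetical reconstruction of the attached test case, inferred from the
// mangled names in the generated assembly; not the actual attachment.
struct X {
    X();           // X::X(), called via _ZN1XC1Ev
    ~X();          // X::~X(), called via _ZN1XD1Ev
    int i;         // small payload so X occupies the stack slot at sp + 4
};

void bar(X *);     // _Z3barP1X

void foo()
{
    X x;           // &x is passed three times: constructor, bar(), destructor
    bar(&x);
}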
Comment 1 Carrot 2009-07-02 07:39:36 UTC
Created attachment 18120 [details]
test case
Comment 2 Ramana Radhakrishnan 2009-07-02 08:53:25 UTC
This looks like one of those rematerialization problems albeit with the stack pointer this time. 
Comment 3 Steven Bosscher 2009-07-02 09:15:51 UTC
Is there a C test case?  Can you add objdump of the gcc-generated asm and the fixed asm to show the impact on code size? (/me is surprised that 3*"add r0,sp,4" is smaller than 1*"add r0,sp,4"+3*"mov r0,r4"... Thumb is amazing :-)
Comment 4 Ramana Radhakrishnan 2009-07-02 09:39:40 UTC
(In reply to comment #3)
> Is there a C test case?  Can you add objdump of the gcc-generated asm and the
> fixed asm to show the impact on code size? (/me is surprised that 3*"add
> r0,sp,4" is smaller than 1*"add r0,sp,4"+3*"mov r0,r4"... Thumb is amazing :-)

The length of add r0,sp,4 and mov r0,r4 is the same for Thumb1 (16 bits).
 

I suppose the ideal generated code would be something like this, modulo errors with stack alignment in the prologue and epilogue.

We also don't need r4 in that case :), so we save a store and a load (r4 no longer needs to be pushed and popped) as well as one instruction overall: smaller and faster by one instruction, with reduced register usage.



        push    {lr}
        sub     sp, sp, #12   @ 8-byte stack alignment
        add     r0, sp, #4
        bl      _ZN1XC1Ev
        add     r0, sp, #4
        bl      _Z3barP1X
        add     r0, sp, #4
        bl      _ZN1XD1Ev
        add     sp, sp, #12   @ 8-byte stack alignment
        @ sp needed for prologue
        pop     {pc}
Comment 5 Andrew Pinski 2009-09-29 05:50:06 UTC
PowerPC has the same issue.  Instructions on PPC are all the same size, so three adds are better than one add plus three register moves.
Here is a C example:
int f(int *a);
int g(int *a);
int h(int *a);

void hh(void)
{
  int t;
  f(&t);
  g(&t);
  h(&t);
}

--- CUT ---

Most RISC targets will have the same issue, since most have fixed-length instructions.

I bet this has to do with hard registers.
Comment 6 Andrew Pinski 2009-09-29 05:51:28 UTC
Before CSE:
(insn 13 2 5 2 t.c:8 (set (reg:SI 119)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))) -1 (nil))

(insn 5 13 6 2 t.c:8 (set (reg:SI 3 3)
        (reg:SI 119)) 332 {*movsi_internal1} (nil))
...
(insn 14 6 7 2 t.c:9 (set (reg:SI 120)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))) -1 (nil))

(insn 7 14 8 2 t.c:9 (set (reg:SI 3 3)
        (reg:SI 120)) 332 {*movsi_internal1} (nil))

And then after CSE, we remove the adds, but we should instead have folded the adds into the moves.
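
In other words, each add should be folded into the following move so that the hard register is set directly from the eliminable-register expression, roughly like this (a hand-written sketch combining insns 13 and 5 above, not an actual dump; insn numbers and surrounding fields are illustrative):

(insn 5 13 6 2 t.c:8 (set (reg:SI 3 3)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))))

Then no pseudo holds the common (sfp + 8) value and nothing has to live in a callee-saved register across the calls.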
Comment 7 Bernd Schmidt 2010-06-04 12:44:28 UTC
Subject: Bug 40615

Author: bernds
Date: Fri Jun  4 12:44:01 2010
New Revision: 160260

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=160260
Log:
	PR rtl-optimization/39871
	PR rtl-optimization/40615
	PR rtl-optimization/42500
	PR rtl-optimization/42502
	* ira.c (init_reg_equiv_memory_loc): New function.
	(ira): Call it twice.
	* reload.h (calculate_elim_costs_all_insns): Declare.
	* ira-costs.c: Include "reload.h".
	(regno_equiv_gains): New static variable.
	(init_costs): Allocate it.
	(finish_costs): Free it.
	(ira_costs): Call calculate_elim_costs_all_insns.
	(find_costs_and_classes): Take estimated elimination costs
	into account.
	(ira_adjust_equiv_reg_cost): New function.
	* ira.h (ira_adjust_equiv_reg_cost): Declare it.
	* reload1.c (init_eliminable_invariants, free_reg_equiv,
	elimination_costs_in_insn, note_reg_elim_costly): New static
	functions.
	(elim_bb): New static variable.
	(reload): Move code out of here into init_eliminable_invariants and
	free_reg_equiv.  Call them.
	(calculate_elim_costs_all_insns): New function.
	(eliminate_regs_1): Declare.  Add extra arg FOR_COSTS;
	all callers changed.  If FOR_COSTS is true, don't call alter_reg,
	but call note_reg_elim_costly if we turned a valid memory address
	into an invalid one.
	* Makefile.in (ira-costs.o): Depend on reload.h.

testsuite/
	PR rtl-optimization/39871
	PR rtl-optimization/40615
	PR rtl-optimization/42500
	PR rtl-optimization/42502
	* gcc.target/arm/eliminate.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/arm/eliminate.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/ira-costs.c
    trunk/gcc/ira.c
    trunk/gcc/ira.h
    trunk/gcc/reload.h
    trunk/gcc/reload1.c
    trunk/gcc/testsuite/ChangeLog

Comment 8 Bernd Schmidt 2010-06-07 22:49:52 UTC
Fixed.