Bug 40615 - unnecessary CSE
Summary: unnecessary CSE
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.5.0
Importance: P3 enhancement
Target Milestone: 4.6.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2009-07-02 07:39 UTC by Carrot
Modified: 2010-06-08 15:30 UTC
CC List: 4 users

See Also:
Host:
Target: arm-eabi, powerpc*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-07-02 08:53:25


Attachments
test case (109 bytes, text/plain)
2009-07-02 07:39 UTC, Carrot

Description Carrot 2009-07-02 07:39:12 UTC
Compiling the attached source code with the options -march=armv5te -mthumb -Os -fno-exceptions, gcc generates:

        push    {r4, lr}
        sub     sp, sp, #8
        add     r4, sp, #4    // redundant
        mov     r0, r4        // add  r0, sp, 4
        bl      _ZN1XC1Ev
        mov     r0, r4        // add  r0, sp, 4
        bl      _Z3barP1X
        mov     r0, r4        // add  r0, sp, 4
        bl      _ZN1XD1Ev
        add     sp, sp, #8
        @ sp needed for prologue
        pop     {r4, pc}

As mentioned in the comments, the CSE into r4 is unnecessary: we can recompute the value of (sp + 4) each time we need it, and with this approach we save one instruction.
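
For reference, a minimal sketch of what the attached test case presumably looks like, reconstructed from the mangled names in the assembly (_ZN1XC1Ev, _Z3barP1X, _ZN1XD1Ev); the member, the function name foo and the exact layout are guesses, and the real 109-byte attachment may differ:

// Hypothetical reconstruction of the attached test case, inferred from the
// mangled names in the generated assembly; not the actual attachment.
struct X {
    X();           // X::X(), called via _ZN1XC1Ev
    ~X();          // X::~X(), called via _ZN1XD1Ev
    int i;         // small payload so X occupies the stack slot at sp + 4
};

void bar(X *);     // _Z3barP1X

void foo()
{
    X x;           // &x is passed three times: constructor, bar(), destructor
    bar(&x);
}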
Comment 1 Carrot 2009-07-02 07:39:36 UTC
Created attachment 18120 [details]
test case
Comment 2 Ramana Radhakrishnan 2009-07-02 08:53:25 UTC
This looks like one of those rematerialization problems albeit with the stack pointer this time. 
Comment 3 Steven Bosscher 2009-07-02 09:15:51 UTC
Is there a C test case?  Can you add objdump of the gcc-generated asm and the fixed asm to show the impact on code size? (/me is surprised that 3*"add r0,sp,4" is smaller than 1*"add r0,sp,4"+3*"mov r0,r4"... Thumb is amazing :-)
Comment 4 Ramana Radhakrishnan 2009-07-02 09:39:40 UTC
(In reply to comment #3)
> Is there a C test case?  Can you add objdump of the gcc-generated asm and the
> fixed asm to show the impact on code size? (/me is surprised that 3*"add
> r0,sp,4" is smaller than 1*"add r0,sp,4"+3*"mov r0,r4"... Thumb is amazing :-)

The length of add r0,sp,4 and mov r0,r4 is the same for Thumb1 (16 bits).
 

I suppose the ideal generated code would be something like this, modulo errors with stack alignment in the prologue and epilogue.

We also don't need r4 in that case :), so we save a store and a load (r4 no longer needs to be pushed and popped) as well as one instruction overall: smaller and faster by one instruction, with reduced register usage.



        push    {lr}
        sub     sp, sp, #12   @ 8-byte stack alignment
        add     r0, sp, #4
        bl      _ZN1XC1Ev
        add     r0, sp, #4
        bl      _Z3barP1X
        add     r0, sp, #4
        bl      _ZN1XD1Ev
        add     sp, sp, #12   @ 8-byte stack alignment
        @ sp needed for prologue
        pop     {pc}
Comment 5 Andrew Pinski 2009-09-29 05:50:06 UTC
PowerPC has the same issue.  Instructions on PPC are all the same size, so three adds are better than one add plus three register moves.
Here is a C example:
int f(int *a);
int g(int *a);
int h(int *a);

void hh(void)
{
  int t;
  f(&t);
  g(&t);
  h(&t);
}

--- CUT ---

Most RISC targets will have the same issue, since most have fixed-length instructions.

I bet this has to do with hard registers.
Comment 6 Andrew Pinski 2009-09-29 05:51:28 UTC
Before CSE:
(insn 13 2 5 2 t.c:8 (set (reg:SI 119)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))) -1 (nil))

(insn 5 13 6 2 t.c:8 (set (reg:SI 3 3)
        (reg:SI 119)) 332 {*movsi_internal1} (nil))
...
(insn 14 6 7 2 t.c:9 (set (reg:SI 120)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))) -1 (nil))

(insn 7 14 8 2 t.c:9 (set (reg:SI 3 3)
        (reg:SI 120)) 332 {*movsi_internal1} (nil))

And then after CSE, we remove the adds, but we should instead have folded the adds into the moves.
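
In other words, each add should be folded into the following move so that the hard register is set directly from the eliminable-register expression, roughly like this (a hand-written sketch combining insns 13 and 5 above, not an actual dump; insn numbers and surrounding fields are illustrative):

(insn 5 13 6 2 t.c:8 (set (reg:SI 3 3)
        (plus:SI (reg/f:SI 113 sfp)
            (const_int 8 [0x8]))))

Then no pseudo holds the common (sfp + 8) value and nothing has to live in a callee-saved register across the calls.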
Comment 7 Bernd Schmidt 2010-06-04 12:44:28 UTC
Subject: Bug 40615

Author: bernds
Date: Fri Jun  4 12:44:01 2010
New Revision: 160260

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=160260
Log:
	PR rtl-optimization/39871
	PR rtl-optimization/40615
	PR rtl-optimization/42500
	PR rtl-optimization/42502
	* ira.c (init_reg_equiv_memory_loc): New function.
	(ira): Call it twice.
	* reload.h (calculate_elim_costs_all_insns): Declare.
	* ira-costs.c: Include "reload.h".
	(regno_equiv_gains): New static variable.
	(init_costs): Allocate it.
	(finish_costs): Free it.
	(ira_costs): Call calculate_elim_costs_all_insns.
	(find_costs_and_classes): Take estimated elimination costs
	into account.
	(ira_adjust_equiv_reg_cost): New function.
	* ira.h (ira_adjust_equiv_reg_cost): Declare it.
	* reload1.c (init_eliminable_invariants, free_reg_equiv,
	elimination_costs_in_insn, note_reg_elim_costly): New static
	functions.
	(elim_bb): New static variable.
	(reload): Move code out of here into init_eliminable_invariants and
	free_reg_equiv.  Call them.
	(calculate_elim_costs_all_insns): New function.
	(eliminate_regs_1): Declare.  Add extra arg FOR_COSTS;
	all callers changed.  If FOR_COSTS is true, don't call alter_reg,
	but call note_reg_elim_costly if we turned a valid memory address
	into an invalid one.
	* Makefile.in (ira-costs.o): Depend on reload.h.

testsuite/
	PR rtl-optimization/39871
	PR rtl-optimization/40615
	PR rtl-optimization/42500
	PR rtl-optimization/42502
	* gcc.target/arm/eliminate.c: New test.


Added:
    trunk/gcc/testsuite/gcc.target/arm/eliminate.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/ira-costs.c
    trunk/gcc/ira.c
    trunk/gcc/ira.h
    trunk/gcc/reload.h
    trunk/gcc/reload1.c
    trunk/gcc/testsuite/ChangeLog

Comment 8 Bernd Schmidt 2010-06-07 22:49:52 UTC
Fixed.