Compile the following code with -mthumb -O2 -Os:

    extern void foo(int*, const char*, int);

    void test(const char name[], int style)
    {
        foo(0, name, style);
    }

I got:

    push {r4, lr}
    mov  r3, r0    // A
    mov  r2, r1    // B
    mov  r0, #0    // C
    mov  r1, r3    // D
    bl   foo
    pop  {r4, pc}

Instructions A and D together just move r0 to r1. They can be replaced with the single instruction "mov r1, r0" placed between B and C.
Created attachment 17962 [details]
test case shows the redundant register move

This problem occurs quite frequently if both caller and callee have multiple parameters.
This might be caused by scheduling before the register allocator (or maybe the lack of it).
"might be" is such a useless statement. Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume? Then you should have no trouble finding out: 1) Where the move comes from 2) Why postreload (the post-reload CSE pass) does not eliminate the redundant move
Thank you, Steven.

(In reply to comment #3)
> "might be" is such a useless statement.
>
> Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume? Then
> you should have no trouble finding out:
> 1) Where the move comes from

This is the RTL dump before RA; everything is in a normal state:

cat obj/reg.c.173r.asmcons

;; Function test (test)

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 133 [ name ])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg:SI 0 r0 [ name ])
        (nil)))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 134 [ style ])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg:SI 1 r1 [ style ])
        (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 8 7 9 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 133 [ name ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v/f:SI 133 [ name ])
        (nil)))
(insn 9 8 10 2 src/./reg.c:4 (set (reg:SI 2 r2)
        (reg/v:SI 134 [ style ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v:SI 134 [ style ])
        (nil)))
(call_insn 10 9 0 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] <function_decl 0xf7f6a680 foo>) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
     (expr_list:REG_DEAD (reg:SI 2 r2)
        (expr_list:REG_DEAD (reg:SI 1 r1)
            (expr_list:REG_DEAD (reg:SI 0 r0)
                (nil))))
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))

Here is the RTL dump after RA, quite straightforward but inefficient:

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 3 r3 [orig:133 name ] [133])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 2 r2 [orig:134 style ] [134])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 8 7 10 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 3 r3 [orig:133 name ] [133])) 168 {*thumb1_movsi_insn}
     (nil))
(call_insn 10 8 18 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] <function_decl 0xf7f6a680 foo>) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
     (nil)
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))
(note 18 10 0 NOTE_INSN_DELETED)

> 2) Why postreload (the post-reload CSE pass) does not eliminate the redundant
> move

It seems the post-reload CSE pass can't handle this case: instruction C clobbers r0, so instruction D can't use r0 as its source. To make this work, we would first have to move instruction D before C. As Andrew said, we need to improve scheduling before RA to handle this.
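The clobbering constraint described above can be sketched with a toy check (plain Python, not GCC's actual postreload code; the instruction encoding and helper name are invented for illustration): a copy's source register can be forwarded into a later use only if no intervening instruction writes that source. Instruction C writes r0, which blocks the forwarding; swapping C and D unblocks it.

```python
def can_forward_copy(seq, copy_idx, use_idx):
    """Can the copy at copy_idx have its source substituted for its dest
    at use_idx?  Only if nothing in between writes the copy's source."""
    dst, src = seq[copy_idx]
    for d, _ in seq[copy_idx + 1:use_idx]:
        if d == src:            # source register clobbered in between
            return False
    return True

# Post-RA Thumb sequence from this PR, as (dest, source) pairs:
seq = [("r3", "r0"),   # A
       ("r2", "r1"),   # B
       ("r0", "#0"),   # C  clobbers r0
       ("r1", "r3")]   # D  could read r0 directly, if it survived

print(can_forward_copy(seq, 0, 3))          # False: C kills r0 first
swapped = [seq[0], seq[1], seq[3], seq[2]]  # move D before C
print(can_forward_copy(swapped, 0, 2))      # True: "mov r1, r0" is legal
```

If forwarding succeeds, insn A becomes dead and can be deleted, which is exactly the one-instruction saving the bug report asks for.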
Hmm, I was under the impression that postreload-cse could move instructions too, but that was just wishful thinking.

I don't really see how the scheduler can solve this. The scheduler would have to know that the live ranges of r0 and r133 overlap, but we don't have an interference graph in the scheduler. Maybe calculating conflicts locally and adjusting scheduling priorities based on that? What improvement to pre-RA scheduling do you have in mind?

Perhaps an easier solution would be to change the order of the parameter loads/stores. Right now GCC goes from this pre-RA RTL to this post-RA RTL:

(set (reg:SI 133) (reg:SI r0))
(set (reg:SI 134) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r133))
(set (reg:SI r2) (reg:SI r134))
(call "foo")

==>

(set (reg:SI r3) (reg:SI r0))
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r3))
(set (reg:SI r2) (reg:SI r2))   // NOP move, so eliminated
(call "foo")

==>

(set (reg:SI r3) (reg:SI r0))
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r3))
(call "foo")

There is always going to be a conflict if GCC generates the above RTL before register allocation, so maybe the generated RTL should be changed. If GCC emitted the parameter stores before the call in reverse order, the conflicts would go away too:

(set (reg:SI 133) (reg:SI r0))
(set (reg:SI 134) (reg:SI r1))
(set (reg:SI r2) (reg:SI r134))
(set (reg:SI r1) (reg:SI r133))
(set (reg:SI r0) (const_int 0))
(call "foo")

==>

(set (reg:SI r0) (reg:SI r0))   // NOP move, so eliminated
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r2) (reg:SI r2))   // NOP move, so eliminated
(set (reg:SI r1) (reg:SI r0))
(set (reg:SI r0) (const_int 0))
(call "foo")

==>

(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r1) (reg:SI r0))
(set (reg:SI r0) (const_int 0))
(call "foo")
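The effect of reversing the parameter moves can be checked with a small toy script (plain Python, not GCC code; the pseudo names and the hard-register allocations are the ones shown in the dumps in this thread, not computed by any allocator): apply the allocation to the pre-RA copy sequence and delete the self-moves that fall out.

```python
def lower(seq, alloc):
    """Substitute hard regs for pseudos and drop no-op moves."""
    out = []
    for dst, src in seq:
        d = alloc.get(dst, dst)
        s = alloc.get(src, src)
        if d != s:              # self-moves like (set r2 r2) are eliminated
            out.append((d, s))
    return out

# Argument moves emitted left-to-right (current order):
forward = [("p133", "r0"), ("p134", "r1"),
           ("r0", "#0"), ("r1", "p133"), ("r2", "p134")]
# Argument moves emitted right-to-left (proposed order):
reverse = [("p133", "r0"), ("p134", "r1"),
           ("r2", "p134"), ("r1", "p133"), ("r0", "#0")]

print(len(lower(forward, {"p133": "r3", "p134": "r2"})))  # 4 moves remain
print(len(lower(reverse, {"p133": "r0", "p134": "r2"})))  # 3 moves remain
```

With the reversed order the allocator can give p133 the same hard register as its source (r0), so the incoming copy becomes a no-op, matching the three-instruction sequence above.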
(In reply to comment #5)
> Hmm, I was under the impression that postreload-cse could move instructions
> too, but that was just wishful thinking.

I will look into postreload-cse.
(In reply to comment #6)
> (In reply to comment #5)
> > Hmm, I was under the impression that postreload-cse could move instructions
> > too, but that was just wishful thinking.
>
> I will look into postreload-cse.

I've looked at this superficially, and I noticed that in the ARM case (i.e. without -mthumb) there aren't any redundant moves: the register allocator handles it fine. The only difference I can see or think of is that we support sibling calls for ARM mode. We don't support sibling calls for Thumb.
PPC has the same issue with -fno-schedule-insns:

test:
        stwu 1,-16(1)
        mflr 0
        stw 0,20(1)
        mr 0,3
        mr 5,4
        li 3,0
        mr 4,0
        bl foo
        lwz 0,20(1)
        addi 1,1,16
        mtlr 0
        blr
Reconfirmed with r156595 and r156650 (with and without -fschedule-insns, and -fsched-pressure makes no difference either). Vlad, you added some register pressure awareness to the scheduler. Do you think there is a way to teach the scheduler to unconditionally move insns, or at least simple moves, around if it reduces register pressure? (See comment #5 in this PR.)
This now works fine with trunk on both powerpc and arm thumb, with -O2 and -Os and with or without -fno-schedule-insns. Closing as fixed; please reopen if any case still does not work.