Compile the following code with -mthumb -O2 -Os:

    extern void foo(int*, const char*, int);

    void test(const char name[], int style)
    {
        foo(0, name, style);
    }

I got:

    push {r4, lr}
    mov  r3, r0    // A
    mov  r2, r1    // B
    mov  r0, #0    // C
    mov  r1, r3    // D
    bl   foo
    pop  {r4, pc}

Instructions A and D together just move r0 to r1. They can be replaced with the single instruction "mov r1, r0" placed between B and C.
Created attachment 17962 [details]
test case shows the redundant register move

This problem occurs quite frequently if both caller and callee have multiple parameters.
This might be caused by scheduling before the register allocator (or maybe the lack of it).
"might be" is such a useless statement. Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume? Then you should have no trouble finding out: 1) Where the move comes from 2) Why postreload (the post-reload CSE pass) does not eliminate the redundant move
Thank you, Steven.

(In reply to comment #3)
> "might be" is such a useless statement.
>
> Carrot, you are aware of the -fdump-rtl-all and -dAP options, I assume? Then
> you should have no trouble finding out:
> 1) Where the move comes from

This is the RTL dump before RA; everything is in a normal state:

cat obj/reg.c.173r.asmcons

;; Function test (test)

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 133 [ name ])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg:SI 0 r0 [ name ])
        (nil)))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 134 [ style ])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg:SI 1 r1 [ style ])
        (nil)))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 8 7 9 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 133 [ name ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v/f:SI 133 [ name ])
        (nil)))
(insn 9 8 10 2 src/./reg.c:4 (set (reg:SI 2 r2)
        (reg/v:SI 134 [ style ])) 168 {*thumb1_movsi_insn}
     (expr_list:REG_DEAD (reg/v:SI 134 [ style ])
        (nil)))
(call_insn 10 9 0 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] <function_decl 0xf7f6a680 foo>) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
     (expr_list:REG_DEAD (reg:SI 2 r2)
        (expr_list:REG_DEAD (reg:SI 1 r1)
            (expr_list:REG_DEAD (reg:SI 0 r0)
                (nil))))
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))

Here is the RTL dump after RA, quite straightforward but inefficient:

(note 1 0 5 NOTE_INSN_DELETED)
(note 5 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 5 3 2 src/./reg.c:3 (set (reg/v/f:SI 3 r3 [orig:133 name ] [133])
        (reg:SI 0 r0 [ name ])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 3 2 4 2 src/./reg.c:3 (set (reg/v:SI 2 r2 [orig:134 style ] [134])
        (reg:SI 1 r1 [ style ])) 168 {*thumb1_movsi_insn}
     (nil))
(note 4 3 7 2 NOTE_INSN_FUNCTION_BEG)
(insn 7 4 8 2 src/./reg.c:4 (set (reg:SI 0 r0)
        (const_int 0 [0x0])) 168 {*thumb1_movsi_insn}
     (nil))
(insn 8 7 10 2 src/./reg.c:4 (set (reg:SI 1 r1)
        (reg/v/f:SI 3 r3 [orig:133 name ] [133])) 168 {*thumb1_movsi_insn}
     (nil))
(call_insn 10 8 18 2 src/./reg.c:4 (parallel [
            (call (mem:SI (symbol_ref:SI ("foo") [flags 0x41] <function_decl 0xf7f6a680 foo>) [0 S4 A32])
                (const_int 0 [0x0]))
            (use (const_int 0 [0x0]))
            (clobber (reg:SI 14 lr))
        ]) 256 {*call_insn}
     (nil)
    (expr_list:REG_DEP_TRUE (use (reg:SI 2 r2))
        (expr_list:REG_DEP_TRUE (use (reg:SI 1 r1))
            (expr_list:REG_DEP_TRUE (use (reg:SI 0 r0))
                (nil)))))
(note 18 10 0 NOTE_INSN_DELETED)

> 2) Why postreload (the post-reload CSE pass) does not eliminate the redundant
> move

It seems the post-reload CSE pass can't handle this case: instruction C clobbers r0, so instruction D can't use r0 as its source. To make this work, we would first have to move instruction D before C. As Andrew said, we need to improve scheduling before RA to handle this.
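The clobbering constraint described above can be sketched with a toy check (plain Python, not GCC's actual postreload code; the instruction encoding and helper name are invented for illustration): a copy's source register can be forwarded into a later use only if no intervening instruction writes that source. Instruction C writes r0, which blocks the forwarding; swapping C and D unblocks it.

```python
def can_forward_copy(seq, copy_idx, use_idx):
    """Can the copy at copy_idx have its source substituted for its dest
    at use_idx?  Only if nothing in between writes the copy's source."""
    dst, src = seq[copy_idx]
    for d, _ in seq[copy_idx + 1:use_idx]:
        if d == src:            # source register clobbered in between
            return False
    return True

# Post-RA Thumb sequence from this PR, as (dest, source) pairs:
seq = [("r3", "r0"),   # A
       ("r2", "r1"),   # B
       ("r0", "#0"),   # C  clobbers r0
       ("r1", "r3")]   # D  could read r0 directly, if it survived

print(can_forward_copy(seq, 0, 3))          # False: C kills r0 first
swapped = [seq[0], seq[1], seq[3], seq[2]]  # move D before C
print(can_forward_copy(swapped, 0, 2))      # True: "mov r1, r0" is legal
```

If forwarding succeeds, insn A becomes dead and can be deleted, which is exactly the one-instruction saving the bug report asks for.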
Hmm, I was under the impression that postreload-cse could move instructions too, but that was just wishful thinking.

I don't really see how the scheduler can solve this. The scheduler would have to know that the live ranges of r0 and r133 overlap, but we don't have an interference graph in the scheduler. Maybe calculating conflicts locally and adjusting scheduling priorities based on that? What improvement to pre-RA scheduling do you have in mind?

Perhaps an easier solution would be to change the order of the parameter loads/stores. Right now GCC goes from this pre-RA RTL to this post-RA RTL:

(set (reg:SI 133) (reg:SI r0))
(set (reg:SI 134) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r133))
(set (reg:SI r2) (reg:SI r134))
(call "foo")

==>

(set (reg:SI r3) (reg:SI r0))
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r3))
(set (reg:SI r2) (reg:SI r2))   // NOP move, so eliminated
(call "foo")

==>

(set (reg:SI r3) (reg:SI r0))
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r0) (const_int 0))
(set (reg:SI r1) (reg:SI r3))
(call "foo")

There is always going to be a conflict if GCC generates the above RTL before register allocation, so maybe the generated RTL should be changed. If GCC emitted the parameter stores before the call in reverse order, the conflicts would go away too:

(set (reg:SI 133) (reg:SI r0))
(set (reg:SI 134) (reg:SI r1))
(set (reg:SI r2) (reg:SI r134))
(set (reg:SI r1) (reg:SI r133))
(set (reg:SI r0) (const_int 0))
(call "foo")

==>

(set (reg:SI r0) (reg:SI r0))   // NOP move, so eliminated
(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r2) (reg:SI r2))   // NOP move, so eliminated
(set (reg:SI r1) (reg:SI r0))
(set (reg:SI r0) (const_int 0))
(call "foo")

==>

(set (reg:SI r2) (reg:SI r1))
(set (reg:SI r1) (reg:SI r0))
(set (reg:SI r0) (const_int 0))
(call "foo")
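The effect of reversing the parameter moves can be checked with a small toy script (plain Python, not GCC code; the pseudo names and the hard-register allocations are the ones shown in the dumps in this thread, not computed by any allocator): apply the allocation to the pre-RA copy sequence and delete the self-moves that fall out.

```python
def lower(seq, alloc):
    """Substitute hard regs for pseudos and drop no-op moves."""
    out = []
    for dst, src in seq:
        d = alloc.get(dst, dst)
        s = alloc.get(src, src)
        if d != s:              # self-moves like (set r2 r2) are eliminated
            out.append((d, s))
    return out

# Argument moves emitted left-to-right (current order):
forward = [("p133", "r0"), ("p134", "r1"),
           ("r0", "#0"), ("r1", "p133"), ("r2", "p134")]
# Argument moves emitted right-to-left (proposed order):
reverse = [("p133", "r0"), ("p134", "r1"),
           ("r2", "p134"), ("r1", "p133"), ("r0", "#0")]

print(len(lower(forward, {"p133": "r3", "p134": "r2"})))  # 4 moves remain
print(len(lower(reverse, {"p133": "r0", "p134": "r2"})))  # 3 moves remain
```

With the reversed order the allocator can give p133 the same hard register as its source (r0), so the incoming copy becomes a no-op, matching the three-instruction sequence above.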
(In reply to comment #5)
> Hmm, I was under the impression that postreload-cse could move instructions
> too, but that was just wishful thinking.

I will look into postreload-cse.
(In reply to comment #6)
> (In reply to comment #5)
> > Hmm, I was under the impression that postreload-cse could move instructions
> > too, but that was just wishful thinking.
>
> I will look into postreload-cse.

I've looked at this superficially, and I noticed that in the ARM case (i.e. without -mthumb) there aren't any redundant moves: the register allocator handles it fine. The only difference I can see or think of is that we support sibling calls for ARM mode. We don't support sibling calls for Thumb.
PPC has the same issue with -fno-schedule-insns:

test:
        stwu 1,-16(1)
        mflr 0
        stw 0,20(1)
        mr 0,3
        mr 5,4
        li 3,0
        mr 4,0
        bl foo
        lwz 0,20(1)
        addi 1,1,16
        mtlr 0
        blr
Reconfirmed with r156595 and r156650 (with and without -fschedule-insns, and -fsched-pressure makes no difference either). Vlad, you added some register pressure awareness to the scheduler. Do you think there is a way to teach the scheduler to unconditionally move insns, or at least simple moves, around if it reduces register pressure? (See comment #5 in this PR.)
This now works fine with trunk on both powerpc and arm thumb, with -O2 and -Os and with or without -fno-schedule-insns. Closing as fixed; please reopen if any case still does not work.