Since the arguments are gimple registers, the gimple optimizers are happy to create many references to them. When lowered to RTL, however, these references compile to memory loads, causing a number of redundant loads. The attached quicksort loop, when compiled with -O2 -fno-loop-optimize (the latter being needed only for clarity of the testcase), produces this funny sequence:

        movl 8(%ebp), %eax # 143 *movsi_1/1 [length = 3]
        movl 8(%ebp), %edx # 171 *movsi_1/1 [length = 3]
        movl 8(%ebp), %ebx # 145 *movsi_1/1 [length = 3]

These are coming from:

  median = data[start];
  pos.22 = start + 1;
  if (end - start <= 1) goto <L6>; else goto <L25>;

where each of these compiles into an RTL expression that looks different to CSE:

(insn 16 15 18 1 (set (reg/v:SI 66 [ median ])
        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
                    (const_int 4 [0x4]))
                (reg/f:SI 70)) [3 data S4 A32])) -1 (nil)
    (nil))

(insn 18 16 20 1 (parallel [
            (set (reg/v:SI 60 [ pos.22 ])
                (plus:SI (reg/v:SI 68 [ start ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 20 18 21 1 (parallel [
            (set (reg:SI 71)
                (minus:SI (reg/v:SI 69 [ end ])
                    (reg/v:SI 68 [ start ])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 21 20 22 1 (set (reg:CCGC 17 flags)
        (compare:CCGC (reg:SI 71)
            (const_int 1 [0x1]))) -1 (nil)
    (nil))

Similarly, we get redundant loads inside the loop itself. Not sure about the solution: making the arguments non-gimple registers does not lead the optimizers to deal with them very nicely, forcing the expander to load memory operands into registers in the prologue would lead to unnecessarily long lifetimes... and forcing memory operands into registers during RTL generation is something we want to avoid ;) Ideas?
Created attachment 7407 [details] testcase
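The attachment itself is not reproduced in this thread. As a rough illustration only, a quicksort of the shape the GIMPLE in the report suggests might look like the sketch below. Only the names data, start, end, median and pos appear in the dumps; the array size and contents, the partition scheme, and the function name quicksort are assumptions and may well differ from the real testcase.

```c
/* Hypothetical global array: the RTL only shows a symbol named "data".
   Size and contents are made up for illustration. */
int data[8] = {5, 3, 8, 1, 9, 2, 7, 4};

/* Sort data[start] .. data[end - 1] in place (assumed interface). */
void quicksort(int start, int end)
{
    int median, pos, i, tmp;

    /* Ranges of zero or one element are already sorted. */
    if (end - start <= 1)
        return;

    median = data[start];   /* pivot value */
    pos = start + 1;        /* first slot of the ">= pivot" region */

    /* Move every element smaller than the pivot in front of pos. */
    for (i = start + 1; i < end; i++) {
        if (data[i] < median) {
            tmp = data[i];
            data[i] = data[pos];
            data[pos] = tmp;
            pos++;
        }
    }

    /* Put the pivot between the two regions. */
    tmp = data[start];
    data[start] = data[pos - 1];
    data[pos - 1] = tmp;

    quicksort(start, pos - 1);
    quicksort(pos, end);
}
```

Calling quicksort(0, 8) sorts the whole array. Note how start is read several times close together (the index, start + 1, and end - start), which is the kind of source pattern the report is about; this sketch is not guaranteed to reproduce the repeated loads from 8(%ebp).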
*** This bug has been marked as a duplicate of 18136 ***
Let's reopen this one, as this is the one with the testcase.
*** Bug 18136 has been marked as a duplicate of this bug. ***
This is really an RTL problem; the problem comes from greg. Before that we have:

(insn:HI 7 11 8 0 (set (reg/v:SI 68 [ start ])
        (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (expr_list:REG_EQUIV (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])
        (nil)))

(insn:HI 16 9 18 0 (set (reg/v:SI 66 [ median ])
        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
                    (const_int 4 [0x4]))
                (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1} (insn_list:REG_DEP_TRUE 7 (nil))
    (nil))

(insn:HI 18 16 20 0 (parallel [
            (set (reg/v:SI 60 [ pos.22 ])
                (plus:SI (reg/v:SI 68 [ start ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) 200 {*addsi_1} (nil)
    (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

(insn:HI 20 18 21 0 (parallel [
            (set (reg:SI 71)
                (minus:SI (reg/v:SI 69 [ end ])
                    (reg/v:SI 68 [ start ])))
            (clobber (reg:CC 17 flags))
        ]) 233 {*subsi_1} (insn_list:REG_DEP_TRUE 8 (nil))
    (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

but after that we get:

(insn 153 9 16 0 (set (reg:SI 0 ax)
        (mem/i:SI (plus:SI (reg/f:SI 6 bp)
                (const_int 8 [0x8])) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (nil))

(insn:HI 16 153 154 0 (set (reg:SI 0 ax)
        (mem/s:SI (plus:SI (mult:SI (reg:SI 0 ax)
                    (const_int 4 [0x4]))
                (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1} (insn_list:REG_DEP_TRUE 7 (nil))
    (nil))

(insn 154 16 155 0 (set (mem:SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [4 median+0 S4 A8])
        (reg:SI 0 ax)) 44 {*movsi_1} (nil)
    (nil))

(insn 155 154 18 0 (set (reg/v:SI 3 bx [orig:60 pos.22 ] [60])
        (mem/i:SI (plus:SI (reg/f:SI 6 bp)
                (const_int 8 [0x8])) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (nil))

Oh, why is reload doing this?
Subject: Re: arguments being gimple registers cause redundant memory loads

> 
> ------- Additional Comments From pinskia at gcc dot gnu dot org 2004-10-25 03:33 -------
> This is really a rtl problem, the problem comes from greg. before that we have:
> (insn:HI 7 11 8 0 (set (reg/v:SI 68 [ start ])
>         (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
>     (expr_list:REG_EQUIV (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])
>         (nil)))
> 
> (insn:HI 16 9 18 0 (set (reg/v:SI 66 [ median ])
>         (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
>                     (const_int 4 [0x4]))
>                 (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1}
> (insn_list:REG_DEP_TRUE 7 (nil))
>     (nil))
> 
> (insn:HI 18 16 20 0 (parallel [
>             (set (reg/v:SI 60 [ pos.22 ])
>                 (plus:SI (reg/v:SI 68 [ start ])
>                     (const_int 1 [0x1])))
>             (clobber (reg:CC 17 flags))
>         ]) 200 {*addsi_1} (nil)
>     (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> 
> (insn:HI 20 18 21 0 (parallel [
>             (set (reg:SI 71)
>                 (minus:SI (reg/v:SI 69 [ end ])
>                     (reg/v:SI 68 [ start ])))
>             (clobber (reg:CC 17 flags))
>         ]) 233 {*subsi_1} (insn_list:REG_DEP_TRUE 8 (nil))
>     (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))

Yep, I was sleepy enough to misread the patterns and mess up the bug entry, sorry for that. Hmm, this does not look that bad after all, but the 20% slowdown in the loop is still interesting; I will look into it more deeply later today.

Honza
A simpler example which shows the problem. Compile with -O1 -fno-ivopts:

void fcpy(float *restrict a, float *restrict b,
          float *restrict aa, float *restrict bb, int n)
{
  int i;
  for (i = 0; i < n; i++) {
    aa[i] = a[i];
    bb[i] = b[i];
  }
}

You will see that we pull the load of aa into the loop, which is wrong.
Note that the small example is wrong, as it is not related at all: we just don't have enough registers, so we use the argument's location.
Fixed, at least it looks to be. Most likely by:

2004-11-25  Andrew Pinski  <pinskia@physics.uc.edu>

        parts of PR rtl-opt/18463, rtl-opt/17647
        * cse.c (canon_for_address): New function.
        (find_best_addr): Call canon_for_address before getting the address's
        cost when checking if we should take that address.

But I don't know for sure.