18137 – [4.0 Regression] arguments being gimple registers cause redundant memory loads


Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.0.0

Importance:	P2 normal
Target Milestone:	4.0.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Duplicates (1):	18136
Depends on:
Blocks:

Reported:	2004-10-25 00:50 UTC by Jan Hubicka
Modified:	2004-11-26 05:30 UTC
CC List:	4 users

See Also:
Host:
Target:	i686-linux
Build:
Known to work:
Known to fail:
Last reconfirmed:	2004-10-25 03:33:34

Attachments
testcase (479 bytes, text/plain) 2004-10-25 00:51 UTC, Jan Hubicka

Description Jan Hubicka 2004-10-25 00:50:59 UTC

Since the arguments are gimple registers, the gimple optimizers are happy to create many references to them.
When these are lowered to RTL, however, they compile to memory loads, causing a number of redundant loads.
The attached quicksort loop, when compiled with -O2 -fno-loop-optimize (the latter being needed only for clarity of the testcase), produces such a funny sequence:
        movl    8(%ebp), %eax   # 143   *movsi_1/1      [length = 3]
        movl    8(%ebp), %edx   # 171   *movsi_1/1      [length = 3]
        movl    8(%ebp), %ebx   # 145   *movsi_1/1      [length = 3]
These are coming from:
  median = data[start];
  pos.22 = start + 1;
  if (end - start <= 1) goto <L6>; else goto <L25>;
where each of these compiles into an RTL expression that looks different to CSE:
(insn 16 15 18 1 (set (reg/v:SI 66 [ median ])
        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
                    (const_int 4 [0x4]))
                (reg/f:SI 70)) [3 data S4 A32])) -1 (nil)
    (nil))

(insn 18 16 20 1 (parallel [
            (set (reg/v:SI 60 [ pos.22 ])
                (plus:SI (reg/v:SI 68 [ start ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 20 18 21 1 (parallel [
            (set (reg:SI 71)
                (minus:SI (reg/v:SI 69 [ end ])
                    (reg/v:SI 68 [ start ])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 21 20 22 1 (set (reg:CCGC 17 flags)
        (compare:CCGC (reg:SI 71)
            (const_int 1 [0x1]))) -1 (nil)
    (nil))
Similarly, we get redundant loads inside the loop itself.

Not sure about a solution: making arguments non-gimple registers means the optimizers do not deal with them very nicely; forcing
the expander to load memory operands into registers in the prologue would lead to unnecessarily long lifetimes;
and forcing memory operands to registers in RTL generation is something we want to avoid ;)
Ideas?
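
For readers without the attachment, the source shape behind the three GIMPLE statements quoted earlier in this description is roughly the following. This is a hypothetical reconstruction, not the attached testcase: the name `partition_sketch`, the loop body, and the contents of `data` are assumptions made for illustration only. The point is that the argument `start` is used three times before the loop.

```c
/* Hypothetical reconstruction, NOT the attached testcase. */
int data[16] = {5, 3, 8, 1, 9, 2, 7, 4, 6};

/* Partition data[start..end) around data[start] and return the
   final index of the pivot.  Each marked use of `start` is a place
   where a separate load from the argument slot could appear. */
int partition_sketch(int start, int end)
{
    int median = data[start];   /* use 1 of `start` */
    int pos = start + 1;        /* use 2 of `start` */
    int i;

    if (end - start <= 1)       /* use 3 of `start` */
        return start;

    for (i = start + 1; i < end; i++) {
        if (data[i] < median) {
            /* Swap the smaller element into the left part. */
            int tmp = data[i];
            data[i] = data[pos];
            data[pos] = tmp;
            pos++;
        }
    }

    /* Move the pivot between the two parts. */
    data[start] = data[pos - 1];
    data[pos - 1] = median;
    return pos - 1;
}
```

Compiling something of this shape with `-O2 -fno-loop-optimize -S` on i686 and counting the loads from `8(%ebp)` is one way to see whether the argument is reloaded; whether this sketch reproduces the exact sequence above has not been checked.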

Comment 1 Jan Hubicka 2004-10-25 00:51:27 UTC

Created attachment 7407 [details]
testcase

Comment 2 Andrew Pinski 2004-10-25 01:03:02 UTC


*** This bug has been marked as a duplicate of 18136 ***

Comment 3 Andrew Pinski 2004-10-25 01:08:57 UTC

Let's reopen this one, as this is the one with the testcase.

Comment 4 Andrew Pinski 2004-10-25 01:09:25 UTC

*** Bug 18136 has been marked as a duplicate of this bug. ***

Comment 5 Andrew Pinski 2004-10-25 03:33:34 UTC

This is really an RTL problem; the problem comes from greg. Before that we have:
(insn:HI 7 11 8 0 (set (reg/v:SI 68 [ start ])
        (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (expr_list:REG_EQUIV (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])
        (nil)))

(insn:HI 16 9 18 0 (set (reg/v:SI 66 [ median ])
        (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
                    (const_int 4 [0x4]))
                (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1} (insn_list:REG_DEP_TRUE 7 (nil))
    (nil))

(insn:HI 18 16 20 0 (parallel [
            (set (reg/v:SI 60 [ pos.22 ])
                (plus:SI (reg/v:SI 68 [ start ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) 200 {*addsi_1} (nil)
    (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

(insn:HI 20 18 21 0 (parallel [
            (set (reg:SI 71)
                (minus:SI (reg/v:SI 69 [ end ])
                    (reg/v:SI 68 [ start ])))
            (clobber (reg:CC 17 flags))
        ]) 233 {*subsi_1} (insn_list:REG_DEP_TRUE 8 (nil))
    (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))


but after that we get:
(insn 153 9 16 0 (set (reg:SI 0 ax)
        (mem/i:SI (plus:SI (reg/f:SI 6 bp)
                (const_int 8 [0x8])) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (nil))

(insn:HI 16 153 154 0 (set (reg:SI 0 ax)
        (mem/s:SI (plus:SI (mult:SI (reg:SI 0 ax)
                    (const_int 4 [0x4]))
                (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1} (insn_list:REG_DEP_TRUE 7 (nil))
    (nil))

(insn 154 16 155 0 (set (mem:SI (plus:SI (reg/f:SI 6 bp)
                (const_int -16 [0xfffffffffffffff0])) [4 median+0 S4 A8])
        (reg:SI 0 ax)) 44 {*movsi_1} (nil)
    (nil))

(insn 155 154 18 0 (set (reg/v:SI 3 bx [orig:60 pos.22 ] [60])
        (mem/i:SI (plus:SI (reg/f:SI 6 bp)
                (const_int 8 [0x8])) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
    (nil))


Oh, why is reload doing this?

Comment 6 Jan Hubicka 2004-10-25 09:20:12 UTC

Subject: Re:  arguments being gimple registers cause redundant memory loads

> 
> ------- Additional Comments From pinskia at gcc dot gnu dot org  2004-10-25 03:33 -------
> This is really a rtl problem, the problem comes from greg. before that we have:
> (insn:HI 7 11 8 0 (set (reg/v:SI 68 [ start ])
>         (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])) 44 {*movsi_1} (nil)
>     (expr_list:REG_EQUIV (mem/i:SI (reg/f:SI 16 argp) [3 start+0 S4 A32])
>         (nil)))
> 
> (insn:HI 16 9 18 0 (set (reg/v:SI 66 [ median ])
>         (mem/s:SI (plus:SI (mult:SI (reg/v:SI 68 [ start ])
>                     (const_int 4 [0x4]))
>                 (symbol_ref:SI ("data") <var_decl 0x416db6c8 data>)) [3 data S4 A32])) 44 {*movsi_1} (insn_list:REG_DEP_TRUE 7 (nil))
>     (nil))
> 
> (insn:HI 18 16 20 0 (parallel [
>             (set (reg/v:SI 60 [ pos.22 ])
>                 (plus:SI (reg/v:SI 68 [ start ])
>                     (const_int 1 [0x1])))
>             (clobber (reg:CC 17 flags))
>         ]) 200 {*addsi_1} (nil)
>     (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))
> 
> (insn:HI 20 18 21 0 (parallel [
>             (set (reg:SI 71)
>                 (minus:SI (reg/v:SI 69 [ end ])
>                     (reg/v:SI 68 [ start ])))
>             (clobber (reg:CC 17 flags))
>         ]) 233 {*subsi_1} (insn_list:REG_DEP_TRUE 8 (nil))
>     (expr_list:REG_UNUSED (reg:CC 17 flags)
>         (expr_list:REG_UNUSED (reg:CC 17 flags)
>             (nil))))

Yep, I was sleepy enough to misread the patterns and mess up the bug
entry, sorry for that.  Hmm, this does not look that bad after all, but
still the 20% slowdown in the loop is interesting; I will look into it
deeper later today.

Honza

Comment 7 Andrew Pinski 2004-11-25 23:54:32 UTC

A simpler example which shows the problem.
Compile with -O1 -fno-ivopts:
void 
fcpy(float *restrict a,  float *restrict b, 
     float *restrict aa, float *restrict bb, int n) 
{ 
        int i; 
        for(i = 0; i < n; i++) { 
                aa[i]=a[i]; 
                bb[i]=b[i]; 
        } 
} 

You will see that we pull the load of aa into the loop, which is wrong.
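
To experiment with this example, a self-contained driver along the following lines may help; it is only a sketch. The function body is repeated so the block compiles on its own, and the helper `fcpy_check` is a hypothetical name that is not part of the report.

```c
#include <string.h>

void
fcpy(float *restrict a,  float *restrict b,
     float *restrict aa, float *restrict bb, int n)
{
        int i;
        for(i = 0; i < n; i++) {
                aa[i]=a[i];
                bb[i]=b[i];
        }
}

/* Hypothetical helper: returns 1 if both destination arrays
   match their sources after the copy, 0 otherwise. */
int
fcpy_check(void)
{
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {5.0f, 6.0f, 7.0f, 8.0f};
        float aa[4] = {0}, bb[4] = {0};

        fcpy(a, b, aa, bb, 4);
        return memcmp(a, aa, sizeof a) == 0
            && memcmp(b, bb, sizeof b) == 0;
}
```

Adding `-S` to the flags above should emit the assembly for inspection; `-std=gnu99` may also be needed so that `restrict` is accepted, which is an assumption about the compiler's default dialect.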

Comment 8 Andrew Pinski 2004-11-26 05:13:53 UTC

Note the small example is wrong, as it is not related at all: we just don't have enough registers, so we use the
argument's location.

Comment 9 Andrew Pinski 2004-11-26 05:30:59 UTC

Fixed, at least it looks to be.
Most likely by:
2004-11-25  Andrew Pinski <pinskia@physics.uc.edu>

        parts of PR rtl-opt/18463, rtl-opt/17647
        * cse.c (canon_for_address): New function.
        (find_best_addr): Call canon_for_address before getting the
        address's cost when checking if we should take that address.

But I don't know for sure.