Bug 40730 - redundant memory load
Summary: redundant memory load
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.5.0
: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 20070
Blocks:
  Show dependency treegraph
 
Reported: 2009-07-13 08:57 UTC by Carrot
Modified: 2023-12-28 00:05 UTC (History)
6 users (show)

See Also:
Host: i686-linux
Target: arm-eabi
Build: i686-linux
Known to work: 12.1.0, 13.1.0
Known to fail: 10.1.0, 11.1.0, 4.5.0, 5.1.0
Last reconfirmed: 2010-01-08 09:34:27


Attachments
test case (161 bytes, text/plain)
2009-07-13 08:58 UTC, Carrot
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carrot 2009-07-13 08:57:35 UTC
Compile the attached source code with options -Os -mthumb -march=armv5te -fno-strict-aliasing, Gcc generates:

iterate:
        push    {lr}
        ldr     r3, [r1]        // C
        b       .L5
.L4:
        ldr     r3, [r3, #8]    // D
.L5:
        str     r3, [r0]        //  A
        ldr     r3, [r0]        //  B
        cmp     r3, #0
        beq     .L3
        ldr     r2, [r3, #4]
        cmp     r2, #0
        beq     .L4
.L3:
        str     r3, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

Pay attention to instructions marked as A and B. Instruction A store r3 to [r0] but insn B load it back to r3.

The instruction A was originally put after instruction C and D. After register allocation, they were allocated to the same registers and looks exactly same. In pass csa, cleanup_cfg was called and it found the same instructions and moved them before instruction B. Now instruction B is obviously redundant.

Is it OK to remove this kind of redundant code in pass dce?
Comment 1 Carrot 2009-07-13 08:58:13 UTC
Created attachment 18183 [details]
test case
Comment 2 Richard Biener 2009-07-13 09:50:47 UTC
-fgcse-las should do the trick.  Note that PRE would do this kind of
optimization on the tree-level, but it is disabled with -Os (so is gcse).

<bb 2>:
  D.1614_2 = p2_1(D)->front;
  p1_3(D)->head = D.1614_2;
  goto <bb 4>;

<bb 3>:
  D.1616_8 = D.1615_4->next;
  p1_3(D)->head = D.1616_8;

<bb 4>:
  D.1615_4 = p1_3(D)->head;
Comment 3 Steven Bosscher 2009-07-13 10:10:07 UTC
And no, it is *not* OK to remove this kind of redundant code in DCE. The load may be redundant, but it is not dead. 

It is not clear to me why cleanup_cfg would move that insn.  Perhaps you can show what is going on with the RTL dumps (use -fdump-rtl-all-slim for the most readable results).
Comment 4 Carrot 2009-07-14 09:14:05 UTC
In TREE level, the two stores are different statements. Only after register allocation, the two stores get same register and make the load redundant.

try_crossjump_bb tries to find same instruction sequence in all predecessors of a basic block bb, and move that code sequence to head of bb. It is triggered by this function, and the store is moved just before the load.

I tried -fgcse-las but it couldn't do the work.

(In reply to comment #2)
> -fgcse-las should do the trick.  Note that PRE would do this kind of
> optimization on the tree-level, but it is disabled with -Os (so is gcse).
> 
> <bb 2>:
>   D.1614_2 = p2_1(D)->front;
>   p1_3(D)->head = D.1614_2;
>   goto <bb 4>;
> 
> <bb 3>:
>   D.1616_8 = D.1615_4->next;
>   p1_3(D)->head = D.1616_8;
> 
> <bb 4>:
>   D.1615_4 = p1_3(D)->head;
> 

Comment 5 Steven Bosscher 2009-07-14 09:18:38 UTC
As you said, try_crossjump_bb tries to find the same instruction sequence in *all* predecessors of a basic block bb. Meaning that the load must have been redundant even before cross jumping occurred.

If you are right, and this redundancy is not exposed until after register allocation, then this may be another case that postreload-gcse should handle (but probably does not).

There are a couple of bugs about the need for a more powerful postreload-gcse. We should perhaps group them (in a meta-bug for example) and make a plan for a fix...
Comment 6 Steven Bosscher 2009-07-14 09:20:25 UTC
Carrot, can you please try this test case with my patch "crossjump_abstract.diff" from Bug 20070 applied?
Comment 7 Carrot 2009-07-15 08:07:10 UTC
(In reply to comment #6)
> Carrot, can you please try this test case with my patch
> "crossjump_abstract.diff" from Bug 20070 applied?
> 

I tried your patch. It did remove the redundant memory load. Following is the output

        push    {lr}
        ldr     r3, [r1]
.L6:
        str     r3, [r0]
        mov     r2, r3          // M
        cmp     r3, #0
        bne     .L5
        b       .L3
.L4:
        ldr     r3, [r3, #8]
        b       .L6
.L5:
        ldr     r1, [r3, #4]
        cmp     r1, #0
        beq     .L4
.L3:
        str     r2, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

In pass ifcvt it noticed the difference of two stores is the different pseudo register number and there is no conflict between the two pseudo registers, so it rename one of them to the same as another and do basic block cross jump on them earlier. Then pass iterate.c.161r.cse2 detected the redundant load and remove it.

But it introduced another redundant move instruction marked as M. At the place r2 is used, r3 still contain the same result as r2, so we can also use r3 there. I think this is another problem.
Comment 8 Steven Bosscher 2009-07-15 09:47:09 UTC
That redundant move has to be a separate issue, indeed.  I would expect the register allocator to coalesce those registers.

I hadn't expected this.  I thought the result would be just the removal of the redundant load, but the code that comes out is bigger (14 instructions instead of 13) and has a completely different structure.

I'll see if I can understand what is going on. Thus, mine.
Comment 9 Steven Bosscher 2010-01-08 09:34:19 UTC
With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my patch for bug 20070 applied, I get the following code:

iterate:
        push    {lr}
        ldr     r3, [r1]
.L6:
        str     r3, [r0]
        sub     r2, r3, #0
        bne     .L5
        b       .L3
.L4:
        ldr     r3, [r3, #8]
        b       .L6
.L5:
        ldr     r1, [r3, #4]
        cmp     r1, #0
        beq     .L4
.L3:
        str     r2, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

Carrot, could you please double-check that this is still correct code?
Comment 10 Carrot 2010-01-11 06:47:00 UTC
(In reply to comment #9)
> With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my
> patch for bug 20070 applied, I get the following code:
> 
> iterate:
>         push    {lr}
>         ldr     r3, [r1]
> .L6:
>         str     r3, [r0]
>         sub     r2, r3, #0
>         bne     .L5
>         b       .L3
> .L4:
>         ldr     r3, [r3, #8]
>         b       .L6
> .L5:
>         ldr     r1, [r3, #4]
>         cmp     r1, #0
>         beq     .L4
> .L3:
>         str     r2, [r0, #12]
>         @ sp needed for prologue
>         pop     {pc}
> 
> Carrot, could you please double-check that this is still correct code?
> 

Yes, it is correct.
There are still 13 instructions, I think it is related to unoptimized basic block order.
Comment 11 stevenb.gcc@gmail.com 2010-01-11 08:22:57 UTC
Subject: Re:  redundant memory load

On Mon, Jan 11, 2010 at 7:47 AM, carrot at google dot com
<gcc-bugzilla@gcc.gnu.org> wrote:
>> iterate:
>>         push    {lr}
>>         ldr     r3, [r1]
>> .L6:
>>         str     r3, [r0]
>>         sub     r2, r3, #0
>>         bne     .L5
>>         b       .L3
>> .L4:
>>         ldr     r3, [r3, #8]
>>         b       .L6
>> .L5:
>>         ldr     r1, [r3, #4]
>>         cmp     r1, #0
>>         beq     .L4
>> .L3:
>>         str     r2, [r0, #12]
>>         @ sp needed for prologue
>>         pop     {pc}
>>
>> Carrot, could you please double-check that this is still correct code?
>>
>
> Yes, it is correct.
> There are still 13 instructions, I think it is related to unoptimized basic
> block order.

Yes, I would have expected the block starting with .L4 to be *after*
the block starting with .L5, something like so:

iterate:
        push    {lr}
        ldr     r3, [r1]
.L6:
        str     r3, [r0]
        sub     r2, r3, #0
        beq     .L3
.L5:
        ldr     r1, [r3, #4]
        cmp     r1, #0
        bne     .L3
        ldr     r3, [r3, #8]
        b       .L6
.L3:
        str     r2, [r0, #12]
        @ sp needed for prologue
        pop     {pc}

Does that look correct?  And if so, could you see if there is an open
bug report about this; or otherwise file a new PR and add me to the
CC-list?
Comment 12 Carrot 2010-01-11 08:55:52 UTC
(In reply to comment #11)
> Yes, I would have expected the block starting with .L4 to be *after*
> the block starting with .L5, something like so:
> 
> iterate:
>         push    {lr}
>         ldr     r3, [r1]
> .L6:
>         str     r3, [r0]
>         sub     r2, r3, #0
>         beq     .L3
> .L5:
>         ldr     r1, [r3, #4]
>         cmp     r1, #0
>         bne     .L3
>         ldr     r3, [r3, #8]
>         b       .L6
> .L3:
>         str     r2, [r0, #12]
>         @ sp needed for prologue
>         pop     {pc}
> 
> Does that look correct?  And if so, could you see if there is an open
> bug report about this; or otherwise file a new PR and add me to the
> CC-list?
> 

It is correct. The basic block ordering issue (-Os) has been observed several times. Following are related PRs:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396
Comment 13 Steven Bosscher 2019-03-05 08:21:11 UTC
Not working on this...
Comment 14 Andrew Pinski 2023-05-15 00:10:08 UTC
Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c).
Comment 15 Andrew Pinski 2023-05-15 00:24:06 UTC
(In reply to Andrew Pinski from comment #14)
> Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c).

The first redundant load was fixed by r11-408-g84935c9822183c.

The extra store was fixed was fixed by r12-897-gde56f95afaaa22 .

But it is still fixed fully.