Compile the attached source code with options -Os -mthumb -march=armv5te -fno-strict-aliasing, Gcc generates: iterate: push {lr} ldr r3, [r1] // C b .L5 .L4: ldr r3, [r3, #8] // D .L5: str r3, [r0] // A ldr r3, [r0] // B cmp r3, #0 beq .L3 ldr r2, [r3, #4] cmp r2, #0 beq .L4 .L3: str r3, [r0, #12] @ sp needed for prologue pop {pc} Pay attention to instructions marked as A and B. Instruction A store r3 to [r0] but insn B load it back to r3. The instruction A was originally put after instruction C and D. After register allocation, they were allocated to the same registers and looks exactly same. In pass csa, cleanup_cfg was called and it found the same instructions and moved them before instruction B. Now instruction B is obviously redundant. Is it OK to remove this kind of redundant code in pass dce?
Created attachment 18183 [details] test case
-fgcse-las should do the trick. Note that PRE would do this kind of optimization on the tree-level, but it is disabled with -Os (so is gcse). <bb 2>: D.1614_2 = p2_1(D)->front; p1_3(D)->head = D.1614_2; goto <bb 4>; <bb 3>: D.1616_8 = D.1615_4->next; p1_3(D)->head = D.1616_8; <bb 4>: D.1615_4 = p1_3(D)->head;
And no, it is *not* OK to remove this kind of redundant code in DCE. The load may be redundant, but it is not dead. It is not clear to me why cleanup_cfg would move that insn. Perhaps you can show what is going on with the RTL dumps (use -fdump-rtl-all-slim for the most readable results).
In TREE level, the two stores are different statements. Only after register allocation, the two stores get same register and make the load redundant. try_crossjump_bb tries to find same instruction sequence in all predecessors of a basic block bb, and move that code sequence to head of bb. It is triggered by this function, and the store is moved just before the load. I tried -fgcse-las but it couldn't do the work. (In reply to comment #2) > -fgcse-las should do the trick. Note that PRE would do this kind of > optimization on the tree-level, but it is disabled with -Os (so is gcse). > > <bb 2>: > D.1614_2 = p2_1(D)->front; > p1_3(D)->head = D.1614_2; > goto <bb 4>; > > <bb 3>: > D.1616_8 = D.1615_4->next; > p1_3(D)->head = D.1616_8; > > <bb 4>: > D.1615_4 = p1_3(D)->head; >
As you said, try_crossjump_bb tries to find the same instruction sequence in *all* predecessors of a basic block bb. Meaning that the load must have been redundant even before cross jumping occurred. If you are right, and this redundancy is not exposed until after register allocation, then this may be another case that postreload-gcse should handle (but probably does not). There are a couple of bugs about the need for a more powerful postreload-gcse. We should perhaps group them (in a meta-bug for example) and make a plan for a fix...
Carrot, can you please try this test case with my patch "crossjump_abstract.diff" from Bug 20070 applied?
(In reply to comment #6) > Carrot, can you please try this test case with my patch > "crossjump_abstract.diff" from Bug 20070 applied? > I tried your patch. It did remove the redundant memory load. Following is the output push {lr} ldr r3, [r1] .L6: str r3, [r0] mov r2, r3 // M cmp r3, #0 bne .L5 b .L3 .L4: ldr r3, [r3, #8] b .L6 .L5: ldr r1, [r3, #4] cmp r1, #0 beq .L4 .L3: str r2, [r0, #12] @ sp needed for prologue pop {pc} In pass ifcvt it noticed the difference of two stores is the different pseudo register number and there is no conflict between the two pseudo registers, so it rename one of them to the same as another and do basic block cross jump on them earlier. Then pass iterate.c.161r.cse2 detected the redundant load and remove it. But it introduced another redundant move instruction marked as M. At the place r2 is used, r3 still contain the same result as r2, so we can also use r3 there. I think this is another problem.
That redundant move has to be a separate issue, indeed. I would expect the register allocator to coalesce those registers. I hadn't expected this. I thought the result would be just the removal of the redundant load, but the code that comes out is bigger (14 instructions instead of 13) and has a completely different structure. I'll see if I can understand what is going on. Thus, mine.
With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my patch for bug 20070 applied, I get the following code: iterate: push {lr} ldr r3, [r1] .L6: str r3, [r0] sub r2, r3, #0 bne .L5 b .L3 .L4: ldr r3, [r3, #8] b .L6 .L5: ldr r1, [r3, #4] cmp r1, #0 beq .L4 .L3: str r2, [r0, #12] @ sp needed for prologue pop {pc} Carrot, could you please double-check that this is still correct code?
(In reply to comment #9) > With "GCC: (GNU) 4.5.0 20100108 (experimental) [trunk revision 155731]" and my > patch for bug 20070 applied, I get the following code: > > iterate: > push {lr} > ldr r3, [r1] > .L6: > str r3, [r0] > sub r2, r3, #0 > bne .L5 > b .L3 > .L4: > ldr r3, [r3, #8] > b .L6 > .L5: > ldr r1, [r3, #4] > cmp r1, #0 > beq .L4 > .L3: > str r2, [r0, #12] > @ sp needed for prologue > pop {pc} > > Carrot, could you please double-check that this is still correct code? > Yes, it is correct. There are still 13 instructions, I think it is related to unoptimized basic block order.
Subject: Re: redundant memory load On Mon, Jan 11, 2010 at 7:47 AM, carrot at google dot com <gcc-bugzilla@gcc.gnu.org> wrote: >> iterate: >> push {lr} >> ldr r3, [r1] >> .L6: >> str r3, [r0] >> sub r2, r3, #0 >> bne .L5 >> b .L3 >> .L4: >> ldr r3, [r3, #8] >> b .L6 >> .L5: >> ldr r1, [r3, #4] >> cmp r1, #0 >> beq .L4 >> .L3: >> str r2, [r0, #12] >> @ sp needed for prologue >> pop {pc} >> >> Carrot, could you please double-check that this is still correct code? >> > > Yes, it is correct. > There are still 13 instructions, I think it is related to unoptimized basic > block order. Yes, I would have expected the block starting with .L4 to be *after* the block starting with .L5, something like so: iterate: push {lr} ldr r3, [r1] .L6: str r3, [r0] sub r2, r3, #0 beq .L3 .L5: ldr r1, [r3, #4] cmp r1, #0 bne .L3 ldr r3, [r3, #8] b .L6 .L3: str r2, [r0, #12] @ sp needed for prologue pop {pc} Does that look correct? And if so, could you see if there is an open bug report about this; or otherwise file a new PR and add me to the CC-list?
(In reply to comment #11) > Yes, I would have expected the block starting with .L4 to be *after* > the block starting with .L5, something like so: > > iterate: > push {lr} > ldr r3, [r1] > .L6: > str r3, [r0] > sub r2, r3, #0 > beq .L3 > .L5: > ldr r1, [r3, #4] > cmp r1, #0 > bne .L3 > ldr r3, [r3, #8] > b .L6 > .L3: > str r2, [r0, #12] > @ sp needed for prologue > pop {pc} > > Does that look correct? And if so, could you see if there is an open > bug report about this; or otherwise file a new PR and add me to the > CC-list? > It is correct. The basic block ordering issue (-Os) has been observed several times. Following are related PRs: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41004 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41396
Not working on this...
Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c).
(In reply to Andrew Pinski from comment #14) > Fixed for GCC 12 by r12-897-gde56f95afaaa22 (and r11-408-g84935c9822183c). The first redundant load was fixed by r11-408-g84935c9822183c. The extra store was fixed was fixed by r12-897-gde56f95afaaa22 . But it is still fixed fully.