Compile the attached source code with options -Os -march=armv5te -mthumb. GCC generates the following code snippet:

  ...
        cmp     r0, r2
        bne     .L5
        b       .L15            <--- A
.L9:
        ldr     r3, [r1]
        cmp     r3, #0
        beq     .L7
        str     r0, [r1, #8]
        b       .L8
.L7:
        str     r3, [r1, #8]
.L8:
        ldr     r1, [r1, #4]
        b       .L12            <--- C
.L15:
        mov     r0, #1          <--- B
.L12:
        cmp     r1, r2          <--- D
        bne     .L9
  ...

Instruction A jumps to B, which then falls through to D. Instruction C jumps to D. No other instruction jumps to B, so we can move B to just before A, make A jump directly to D, and remove C.

There are two functions that could potentially perform this optimization: merge_blocks_move and try_forward_edges. try_forward_edges can only redirect branches across a series of forwarder blocks; it cannot move the target block before the forwarder blocks. merge_blocks_move merges blocks b and c only when neither of them is a forwarder block; in this case block A is a forwarder block, so they are not merged.
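For illustration, here is a hand-edited sketch (not actual compiler output) of what the code could look like after the transformation described above: B is hoisted before A, A is retargeted to D, and C is deleted, saving one branch overall.

  ...
        cmp     r0, r2
        bne     .L5
        mov     r0, #1          @ B, hoisted before A
        b       .L12            @ A, now jumps straight to D
.L9:
        ldr     r3, [r1]
        cmp     r3, #0
        beq     .L7
        str     r0, [r1, #8]
        b       .L8
.L7:
        str     r3, [r1, #8]
.L8:
        ldr     r1, [r1, #4]    @ falls through to .L12; C is no longer needed
.L12:
        cmp     r1, r2          @ D
        bne     .L9
  ...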
Created attachment 18326
test case
Why does the basic block reordering pass also not handle this?
I would suggest looking closely at SSA expansion as well. I think there is still a regression relative to pre-SSA expand in that PHI nodes are not expanded optimally at -Os (pre-SSA expand split common PHI args into extra blocks to minimize copy instructions, which is something that could be done as a tree-level transform). For example, splitting

  BB5: # a_1 = PHI <a_2(2), a_2(3), b_3(4)>

into

  BB6: # tmp_4 = PHI <a_2(2), a_2(3)>
  BB5: # a_1 = PHI <tmp_4(6), b_3(4)>

is something we usually don't want and in fact undo during mergephi, but it needs one less copy if a_2 and tmp_4 can be coalesced together.
(In reply to comment #2)
> Why does the basic block reordering pass also not handle this?

Basic block reordering is disabled with -Os. The basic block reordering algorithm is for performance only and usually increases code size, so it is not run when optimizing for size. For this specific case, though, the extra branch is removed when I compile with -O2.
Basic block reordering *should* be altered to work with -Os. It's definitely required to clean up silly jump sequences that exception handling creates. There's no reason, in principle, that the ordering algorithm can't be adjusted to minimize the number of branches, or the total size of all of the branch instructions (though that could require some clever changes to the shorten_branches code).
Re. comment #5 -- rth, any suggestions what an algorithm would look like to "minimize the number of branches, or the total size of all of the branch instructions"? And what do you mean with "some clever changes to the shorten_branches code"? Did you mean the calculation of insn sizes, or avoiding padding, or something else completely? Hope you can help a bit...
First off, I think cselim should also be run at -Os. And then we have the issue of tree-ssa-tail-merge not running at -Os.
Ehm, why does tree tail-merge not run at -Os? It's a size optimization, after all!
May be fixed by the patch that was applied for PR54364.
Resulting code from trunk r193358:

primal_update_flow:
        push    {r4, r5, lr}
        mov     r3, #1
        mov     r4, #0
.L2:
        cmp     r0, r2
        beq     .L11
        ldr     r5, [r0]
        cmp     r5, #0
        beq     .L3
        str     r4, [r0, #8]
        b       .L4
.L3:
        str     r3, [r0, #8]
.L4:
        ldr     r0, [r0, #4]
        b       .L2
.L11:
        mov     r0, #1
.L6:
        cmp     r1, r2
        beq     .L12
        ldr     r3, [r1]
        cmp     r3, #0
        beq     .L7
        str     r0, [r1, #8]
        b       .L8
.L7:
        str     r3, [r1, #8]
.L8:
        ldr     r1, [r1, #4]
        b       .L6
.L12:
        @ sp needed
        pop     {r4, r5, pc}

So the issue of comment #0 is fixed. With "-march=armv7 -mthumb" the two stores to "[r0, #8]" are properly if-converted; I don't see anything that could be improved further. So, fixed.
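For reference, with -march=armv7 -mthumb the if-converted diamond in each loop collapses into a Thumb-2 IT block with conditional stores. An illustrative, hand-written shape for the first loop body (not the actual compiler output, which is not quoted in this report) would be roughly:

        ldr     r5, [r0]
        cmp     r5, #0
        ite     ne
        strne   r4, [r0, #8]    @ taken when [r0] != 0
        streq   r3, [r0, #8]    @ taken when [r0] == 0
        ldr     r0, [r0, #4]

This removes the two branches around the conditional stores.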