Description: A non-optimal code sequence is illustrated. Several invariant instructions could be hoisted from a loop. A store with update form could be used. A branch on count instruction could be used. Any or all would improve the loop's performance. Duplicate using gcc 3.5 and command line: gcc -O3 -m64 -c test.c Testcase: #define SOME_CONST 20 unsigned short *x; int y; int main () { int i; for (i = 0; i<= y+SOME_CONST; i++) x[i] = 0; return 0; } Assembly: Currently gcc 3.5 generates the following code: ld 5,.LC0@toc(2) -- load base of "y" lwz 3,0(5) -- load "y" addi 9,3,20 -- compute "y+SOME_CONST" cmpwi 7,9,0 -- determine if loop should be entered. blt- 7,.L2 -- branch around loop if not. ld 7,.LC1@toc(2) -- load address of "x" li 8,0 -- initialize "i" li 6,0 -- load value to store. .L4: ld 10,0(7) -- load base of "x" - loop invariant sldi 12,8,1 -- compute index into "x" (i * 2) addi 0,8,1 -- increment i extsw 8,0 -- sign extend i sthx 6,12,10 -- store "x[i]" lwz 4,0(5) -- load "y" - loop invariant addi 11,4,20 -- compute "y+SOME_CONST" - loop invariant cmpw 0,11,8 -- compare result of add with i bge+ 0,.L4 -- loop back. Remove the invariant instructions and use a store with update and this code improves to: ld 5,.LC0@toc(2) -- load base of "y" lwz 3,0(5) -- load "y" addi 9,3,20 -- compute "y+SOME_CONST" cmpwi 7,9,0 -- determine if loop should be entered. blt- 7,.L7 -- branch around loop if not. ld 7,.LC1@toc(2) -- load address of "x" li 8,0 -- initialize "i" li 6,0 -- load value to store. ld 10,0(7) -- load base of "x" - loop invariant .L5: addi 0,8,1 -- increment i extsw 8,0 -- sign extend i sthu 6,2(10) -- use store with update instead of sldi/sthx cmpw 0,9,8 -- compare result of add with i bge+ 0,.L5 -- loop back. This could be further improved to the following with the use of bct: ld 5,.LC0@toc(2) -- load base of "y" li 8,0 -- initialize "i" lwz 3,0(5) -- load "y" addi 9,3,20 -- compute "y+SOME_CONST" cmpwi 7,9,0 -- determine if loop should be entered. blt- 7,.L7 -- branch around loop if not. ld 7,.LC1@toc(2) -- load address of "x" li 6,0 -- load value to store. ld 10,0(7) -- load base of "x" - loop invariant mtctr 11 -- load count register .L5: sthu 6,2(10) -- use store with update instead of sldi/sthx bdnz+ 0,.L5 -- loop back. The loop has gone from 8 instructions and a branch to 1 instruction and a branch.
It would be nice if you could try this on the lno branch, since that is where all the loop work is going on at the moment.
I doubt LNO would help here, it looks like a backend issue. But you can give it a try.
Not a tree optimization issue.
But an rtl one though as I would not think this does not effect another target with doloops.
Branch on count is not used since we do not know whether the loop is not infinite (which would be the case if y == MAX_INT - SOME_CONST). The fact that neither of the two rtl loop invariant motion passes nor PRE moves the invariants out of the loop is weird. I will have a look at that. Zdenek
Invariant motion does not recognize y as invariant, since we believe that store to x[i] may alias y. As far as I understand the C standard, this is wrong, since short should not be compatible with int, but maybe I misunderstood this. Diego?
Similarly for the load of x -- x is assumed to be clobbered by the store to x[i].
Actually this is the problem which I am seeing and I posted an initialize fix, the problem comes from the fact that the cost that lim thinks it should do the load is just one. See <http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02534.html> for the patch and <http://gcc.gnu.org/ml/gcc/2004-08/msg01545.html> for the my findings.
Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02662.html>. Related aliasing bug here in PR 17252.
Patch causes other problems and I don't have time to investage them.
FSF HEAD 2004-11-11 gives better code, with the invariants moved out of the loop, .main: ld 9,.LC1@toc(2) lwz 11,0(9) addi 11,11,20 extsw 11,11 cmpwi 7,11,0 blt 7,.L2 ld 9,.LC2@toc(2) li 10,0 li 7,0 ld 8,0(9) .align 4 .L4: addi 0,10,1 sldi 9,10,1 extsw 10,0 sthx 7,9,8 cmpw 7,11,10 bge 7,.L4 .L2: li 3,0 blr but this has 6 insns in the loop instead of 5. Investigating ...
Woops I I filed PR 18431 (which I think is the same problem well the testcases are the same), I will note I copied both -m32 and -m64 loops to show where the problem is and with arrays we get much better code.
We cannot generate better code, without having a different meaning for the sign_extend action that occurs in the loop. As zdenek points out, we cannot use dbra, because we cannot tell if the loop will actually terminate -- because of the <= continuation condition. on ppc64 ints are 32 bits and pointers are 64 bits. the base address 'x' is a pointer to an array that we cannot tell the bounds of -- x might point into the middle of an array. Therefore both negative and positive ints of any value might be valid offsets. If the loop is unbounded, then there is the possibility of wrap-around during the increment of i. What happens in this case is undefined. We could therefore remove the sign-extension from int->long, and optimize further on the basis that undefined things never happen in a well-formed program. Unfortunately we cannot do this with the current RTL structure. The sign-extend operation is used in two different circumstances, (a) when it really is sign extending a shorter (valid) value to a longer representation, and (b) when we're truncating a possibly out-of-range value. (b) is undefined, yet we do not represent that undefinedness, and cannot distinguish it from (a), which is well defined. I will add a meta-bug noting this problem with extend operations.
Now I see what is the difference between this and PR 18431, <= vs <.
All P1 enhancements not targeted towards 4.1, moving to P5.
Fixed in 4.2.0 by the patch which fixed PR 18527.
I forgot to say the loop now looks like: L4: sthx r0,r2,r11 addi r2,r2,2 bdnz L4