Bug 16803 - PowerPC - invariant code motion could be removed from loop.
Summary: PowerPC - invariant code motion could be removed from loop.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.0.0
: P5 enhancement
Target Milestone: 4.2.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 18446 18527
Blocks:
  Show dependency treegraph
 
Reported: 2004-07-28 17:00 UTC by Pete Steinmetz
Modified: 2006-01-28 21:18 UTC (History)
4 users (show)

See Also:
Host: powerpc64-linux
Target: powerpc64-linux
Build: powerpc64-linux
Known to work:
Known to fail:
Last reconfirmed: 2005-10-04 06:22:42


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Pete Steinmetz 2004-07-28 17:00:38 UTC
Description:
A non-optimal code sequence is illustrated. Several invariant instructions could be hoisted from a loop.  A store with update form could be used.  A branch on count instruction could be used.  Any or all would improve the loop's performance.  Duplicate using gcc 3.5 and command line:

gcc -O3 -m64 -c test.c

Testcase:
#define SOME_CONST 20
unsigned short *x;
int y;

int main ()
{
   int i;

   for (i = 0; i<= y+SOME_CONST; i++)
       x[i] = 0;

   return 0;
}

Assembly:
Currently gcc 3.5 generates the following code:

	ld 5,.LC0@toc(2) -- load base of "y"
	lwz 3,0(5)       -- load "y"
	addi 9,3,20      -- compute "y+SOME_CONST"
	cmpwi 7,9,0      -- determine if loop should be entered.
	blt- 7,.L2       -- branch around loop if not.
	ld 7,.LC1@toc(2) -- load address of "x"
	li 8,0           -- initialize "i"
	li 6,0           -- load value to store.
.L4:
	ld 10,0(7)       -- load base of "x" - loop invariant
	sldi 12,8,1      -- compute index into "x" (i * 2)
	addi 0,8,1       -- increment i
	extsw 8,0        -- sign extend i
	sthx 6,12,10     -- store "x[i]"
	lwz 4,0(5)       -- load "y" - loop invariant
	addi 11,4,20     -- compute "y+SOME_CONST" - loop invariant
	cmpw 0,11,8      -- compare result of add with i
	bge+ 0,.L4       -- loop back.

Remove the invariant instructions and use a store with update and this code improves to:

	ld 5,.LC0@toc(2) -- load base of "y"
	lwz 3,0(5)       -- load "y"
	addi 9,3,20      -- compute "y+SOME_CONST"
	cmpwi 7,9,0      -- determine if loop should be entered.
	blt- 7,.L7       -- branch around loop if not.
	ld 7,.LC1@toc(2) -- load address of "x"
	li 8,0           -- initialize "i"
	li 6,0           -- load value to store.
	ld 10,0(7)       -- load base of "x" - loop invariant
.L5:
	addi 0,8,1       -- increment i
	extsw 8,0        -- sign extend i
	sthu 6,2(10)     -- use store with update instead of sldi/sthx
	cmpw 0,9,8       -- compare result of add with i
	bge+ 0,.L5       -- loop back.

This could be further improved to the following with the use of bct:

	ld 5,.LC0@toc(2) -- load base of "y"
	li 8,0           -- initialize "i"
	lwz 3,0(5)       -- load "y"
	addi 9,3,20      -- compute "y+SOME_CONST"
	cmpwi 7,9,0      -- determine if loop should be entered.
	blt- 7,.L7       -- branch around loop if not.
	ld 7,.LC1@toc(2) -- load address of "x"
	li 6,0           -- load value to store.
	ld 10,0(7)       -- load base of "x" - loop invariant
	mtctr 11         -- load count register
.L5:
	sthu 6,2(10)     -- use store with update instead of sldi/sthx
	bdnz+ 0,.L5      -- loop back.

The loop has gone from 8 instructions and a branch to 1 instruction and a branch.

Comment 1 Falk Hueffner 2004-07-28 18:33:51 UTC
It would be nice if you could try this on the lno branch, since that is
where all the loop work is going on at the moment.
Comment 2 Steven Bosscher 2004-07-30 17:41:54 UTC
I doubt LNO would help here, it looks like a backend issue. 
But you can give it a try. 
 
Comment 3 Steven Bosscher 2004-07-30 17:49:18 UTC
Not a tree optimization issue. 
Comment 4 Andrew Pinski 2004-07-30 19:00:23 UTC
But an rtl one though as I would not think this does not effect another target with doloops.
Comment 5 Zdenek Dvorak 2004-07-30 19:15:26 UTC
Branch on count is not used since we do not know whether the loop is not 
infinite (which would be the case if y == MAX_INT - SOME_CONST).

The fact that neither of the two rtl loop invariant motion passes nor PRE moves 
the invariants out of the loop is weird.  I will have a look at that.

Zdenek
Comment 6 Zdenek Dvorak 2004-08-30 12:55:25 UTC
Invariant motion does not recognize y as invariant, since we believe that store 
to x[i] may alias y.

As far as I understand the C standard, this is wrong, since short should not be 
compatible with int, but maybe I misunderstood this.  Diego?
Comment 7 Zdenek Dvorak 2004-08-30 13:39:29 UTC
Similarly for the load of x -- x is assumed to be clobbered by the store
to x[i].
Comment 8 Andrew Pinski 2004-08-31 04:02:27 UTC
Actually this is the problem which I am seeing and I posted an initialize fix, the problem comes from 
the fact that the cost that lim thinks it should do the load is just one.

See <http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02534.html> for the patch
and <http://gcc.gnu.org/ml/gcc/2004-08/msg01545.html> for the my findings.
Comment 9 Andrew Pinski 2004-08-31 22:41:57 UTC
Patch here: <http://gcc.gnu.org/ml/gcc-patches/2004-08/msg02662.html>.

Related aliasing bug here in PR 17252.
Comment 10 Andrew Pinski 2004-09-27 03:47:13 UTC
Patch causes other problems and I don't have time to investage them.
Comment 11 Nathan Sidwell 2004-11-11 17:40:11 UTC
FSF HEAD 2004-11-11 gives better code, with the invariants moved out of the loop,
.main:
        ld 9,.LC1@toc(2)
        lwz 11,0(9)
        addi 11,11,20
        extsw 11,11
        cmpwi 7,11,0
        blt 7,.L2
        ld 9,.LC2@toc(2)
        li 10,0
        li 7,0
        ld 8,0(9)
        .align 4
.L4:
        addi 0,10,1
        sldi 9,10,1
        extsw 10,0
        sthx 7,9,8
        cmpw 7,11,10
        bge 7,.L4
.L2:
        li 3,0
        blr

but this has 6 insns in the loop instead of 5.  Investigating ...
Comment 12 Andrew Pinski 2004-11-11 18:19:07 UTC
Woops I I filed PR 18431 (which I think is the same problem well the testcases are the same), I will note 
I copied both -m32 and -m64 loops to show where the problem is and with arrays we get much better 
code.
Comment 13 Nathan Sidwell 2004-11-12 09:24:03 UTC
We cannot generate better code, without having a different meaning for the
sign_extend action that occurs in the loop.
As zdenek points out, we cannot use dbra, because we cannot tell if the loop
will actually terminate -- because of the <= continuation condition.  on ppc64
ints are 32 bits and pointers are 64 bits.  the base address 'x' is a pointer
to an array that we cannot tell the bounds of -- x might point into the middle
of an array.  Therefore both negative and positive ints of any value might
be valid offsets.

If the loop is unbounded, then there is the possibility of wrap-around during
the increment of i.  What happens in this case is undefined.  We could therefore
remove the sign-extension from int->long, and optimize further on the basis that
undefined things never happen in a well-formed program.  Unfortunately we cannot
do this with the current RTL structure.  The sign-extend operation is used in
two different circumstances, (a) when it really is sign extending a shorter
(valid) value to a longer representation, and (b) when we're truncating a
possibly out-of-range value.  (b) is undefined, yet we do not represent that
undefinedness, and cannot distinguish it from (a), which is well defined.

I will add a meta-bug noting this problem with extend operations.
Comment 14 Andrew Pinski 2004-11-15 01:58:45 UTC
Now I see what is the difference between this and PR 18431, <= vs <.
Comment 15 Andrew Pinski 2005-11-02 17:16:13 UTC
All P1 enhancements not targeted towards 4.1, moving to P5.
Comment 16 Andrew Pinski 2006-01-07 16:23:54 UTC
Fixed in 4.2.0 by the patch which fixed PR 18527.
Comment 17 Andrew Pinski 2006-01-07 16:24:32 UTC
I forgot to say the loop now looks like:
L4:
        sthx r0,r2,r11
        addi r2,r2,2
        bdnz L4