I apologize if it is a well disguised feature, but I am forced to consider this being a performance regression/bug. In the following trivial example: void VecADD( long long *In1, long long *In2, long long *Out, unsigned int samples ){ int i; for (i = 0; i < samples; i++) { Out[i] = In1[i] + In2[i]; } } there is an implicit imprecision in the way C is used - type of 'samples' is unsigned, while type of 'i' is signed. The problem on the high level - induction variable analysis fails for this loop, which impairs further tree level loop optimizations from functioning properly (including autoincrement). In my port performance is off by 50% for this loop. GCC 3.4.6 was able to handle this situation fine. What I believe to be the problem at the lowest level is a non-minimal (or overly restrictive) SSA representation right before the iv detection: VecADD (In1, In2, Out, samples) { int i; long long int D.1857; long long int D.1856; long long int * D.1855; long long int D.1854; long long int * D.1853; long long int * D.1852; unsigned int D.1851; unsigned int i.0; <bb 2>: <bb 6>: # i_10 = PHI <0(2)> i.0_5 = (unsigned int) i_10; if (i.0_5 < samples_4(D)) goto <bb 3>; else goto <bb 5>; <bb 3>: # i.0_9 = PHI <i.0_3(4), i.0_5(6)> # i_14 = PHI <i_1(4), i_10(6)> D.1851_6 = i.0_9 * 8; D.1852_8 = Out_7(D) + D.1851_6; D.1853_12 = In1_11(D) + D.1851_6; D.1854_13 = *D.1853_12; D.1855_17 = In2_16(D) + D.1851_6; D.1856_18 = *D.1855_17; D.1857_19 = D.1854_13 + D.1856_18; *D.1852_8 = D.1857_19; i_20 = i_14 + 1; <bb 4>: # i_1 = PHI <i_20(3)> i.0_3 = (unsigned int) i_1; if (i.0_3 < samples_4(D)) goto <bb 3>; else goto <bb 5>; <bb 5>: return; } The two PHI nodes in the beginning of BB3 break the iv detection. Same example when types of ‘i’ and ‘samples’ would match will be analyzed perfectly fine with the SSA at the same point looking like this: VecADD (In1, In2, Out, samples) { int i; long long int D.1857; long long int D.1856; long long int * D.1855; long long int D.1854; long long int * D.1853; long long int * D.1852; unsigned int D.1851; unsigned int i.0; <bb 2>: <bb 6>: # i_9 = PHI <0(2)> if (i_9 < samples_3(D)) goto <bb 3>; else goto <bb 5>; <bb 3>: # i_13 = PHI <i_1(4), i_9(6)> i.0_4 = (unsigned int) i_13; D.1851_5 = i.0_4 * 8; D.1852_7 = Out_6(D) + D.1851_5; D.1853_11 = In1_10(D) + D.1851_5; D.1854_12 = *D.1853_11; D.1855_16 = In2_15(D) + D.1851_5; D.1856_17 = *D.1855_16; D.1857_18 = D.1854_12 + D.1856_17; *D.1852_7 = D.1857_18; i_19 = i_13 + 1; <bb 4>: # i_1 = PHI <i_19(3)> if (i_1 < samples_3(D)) goto <bb 3>; else goto <bb 5>; <bb 5>: return; } On one hand I seem to understand that a danger of signed/unsigned overflow at increment can force this kind of conservatism, but on the high level this situation was handled fine by gcc 3.4.6 and is handled with no issues by another SSA based compiler. If there is a way to relax this strict interpretation of C rules by GCC 4.3.2, I would gladly learn about it, but my brief flag mining exercise yielded no results. Thank you.
Well first off, iv-opts on the tree level should be the place where autoincrement is helped out. See PR 31849 which I think this is a duplicate of.
Andrew, thank you for the prompt reply. I have seen PR 31849 and used the patch suggested. Without it autoincrement would not work at all for either case. But the patch seems to deal with the case when iv _were_ detected and generally alters cost computation to take that addressing mode into account, while in this case I see that ivs are _not_ detected in the first place. This seems to be a more fundamental issue - should SSA be made "minimal" at that point or should iv detection be improved to "see" trough “aliased” PHI node – I lack better words to describe it.
This bug is a bit funny, autoincrement is not the issue, just the detecting of the induction variable's limits correctly. For an example I noticed that if I compile this on a LP64 target, the induction variable is found correctly.
Here is a good testcase: void VecADD( int *In1, int *In2, int *Out, unsigned int samples ){ int i; for (i = 0; i < samples; i++) { Out[i] = In1[i]; } } This testcase should be vectorized with -maltivec -O3 on PowerPC but is not while it is with -m64.
I think this boils down to the usual POINTER_PLUS fallout.
Subject: Re: loop iv detection failure Sent from my iPhone On Jan 16, 2009, at 1:57 AM, "rguenth at gcc dot gnu dot org" <gcc-bugzilla@gcc.gnu.org > wrote: > > > ------- Comment #5 from rguenth at gcc dot gnu dot org 2009-01-16 > 09:57 ------- > I think this boils down to the usual POINTER_PLUS fallout. It failed in 4.1 also so nope :). > > > > -- > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=38856 >
Here is a testcase where IV can be found but we end up with two copies of the same register in the loop: int f(unsigned long s, unsigned long *c) { long j; for(j = 0;j < s; j++) { c[j]=j; } } Which is basically the same issue.
Hello, This has not been touched in a while, nevertheless the issue still exist with 4.5.1
I think the original issue has been fixed since GCC 4.5.0 or so.