With patch from here:
http://gcc.gnu.org/ml/gcc-patches/2010-02/msg00668.html
IVopts begin to create IVs for expressions like &a0[i][j][0]. This may cause
regressions in stack usage and code size (also possibly speed). Test case:

/* ---8<--- */
enum { N = 123 };
int a0[N][N][N], a1[N][N][N], a2[N][N][N], a3[N][N][N],
    a4[N][N][N], a5[N][N][N], a6[N][N][N], a7[N][N][N];

int foo()
{
  int i, j, k, s = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        {
          s += a0[i][j][k];
          s += a1[i][j][k];
          s += a2[i][j][k];
          s += a3[i][j][k];
          s += a4[i][j][k];
          s += a5[i][j][k];
          s += a6[i][j][k];
          s += a7[i][j][k];
        }
  return s;
}
/* ---8<--- */

Without the patch, IVopts produce one IV for the j loop and 8 IVs for the k
loop. With the patch, IVopts additionally produce 8 IVs for the j loop (with
a 123*4 increment), 4 of which live on the stack (on x86-64, -O2).

Creation of IVs that live on the stack is likely due to inexact register
pressure estimation in IVopts. However, it would be nice if IVopts could
notice that it is cheaper to take the final value of the inner loop IVs
(e.g. &a0[i][j][k]) instead of incrementing an IV holding &a0[i][j][0] by
123*4. That would decrease register pressure and allow generating perfect
code for the test case.
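To make the suggestion concrete, here is a source-level sketch of the
final-value reuse (an illustration only, not the output of any patch; it
shows just the a0 stream and reuses N and a0 from the test case above, and
the function name is hypothetical):

int foo_final_value()
{
  int i, j, k, s = 0;
  int *p = &a0[0][0][0];
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      {
        int *q = p;           /* k-loop IV starts at &a0[i][j][0]    */
        for (k = 0; k < N; k++)
          s += *q++;          /* q ends at &a0[i][j][N] ...          */
        p = q;                /* ... which is exactly &a0[i][j+1][0] */
      }
  return s;
}

The j-loop pointer p is never incremented on its own; it just picks up the
inner loop's final value, which is the register pressure reduction asked for.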
Created attachment 20001 [details]
Simplify increments in IVopts using final values of inner loop IVs

A quick & dirty attempt to implement register pressure reduction in outer
loops by using final values of inner loop IVs. Currently, given

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      s += a[i][j];

we generate something like

<bb1>
L1:
  s.0 = PHI(0, s.2)
  i.0 = PHI(0, i.1)
  ivtmp.0 = &a[i.0][0]
<bb2>
L2:
  s.1 = PHI(s.0, s.2)
  j.0 = PHI(122, j.1)
  ivtmp.1 = PHI(ivtmp.0, ivtmp.2)
  s.2 = s.1 + MEM(ivtmp.1)
  ivtmp.2 = ivtmp.1 + 4
  j.1 = j.0 - 1
  if (j.1 >= 0) goto L2
<bb3>
  i.1 = i.0 + 1
  if (i.1 <= 122) goto L1

This together with the patch mentioned in the previous comment allows
generating:

  ivtmp.0 = &a[0][0]
<bb1>
L1:
  s.0 = PHI(0, s.2)
  i.0 = PHI(122, i.1)
  ivtmp.1 = PHI(ivtmp.0, ivtmp.4)
<bb2>
L2:
  s.1 = PHI(s.0, s.2)
  j.0 = PHI(122, j.1)
  ivtmp.2 = PHI(ivtmp.1, ivtmp.3)
  s.2 = s.1 + MEM(ivtmp.2)
  ivtmp.3 = ivtmp.2 + 4
  j.1 = j.0 - 1
  if (j.1 >= 0) goto L2
<bb3>
  ivtmp.4 = ivtmp.3  // would be ivtmp.4 = ivtmp.1 + stride
  i.1 = i.0 - 1
  if (i.1 >= 0) goto L1

The improvement is that ivtmp.1 is not live across the inner loop.

The approach is to store final values of IVs in a hashtable, mapping the
SSA_NAME of the initial value in the preheader to an aff_tree with the final
value, and then to try to replace increments of new IVs with uses of IVs
from inner loops (currently I just implemented a brute-force loop over all
IV uses to find a useful entry in that hashtable).

Does this make sense and sound acceptable?
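As a concrete check of the relation the hashtable records (final value =
initial value + step * number of iterations), here is a toy, self-contained
C program; it is deliberately not the GCC API, just the arithmetic being
exploited for the two-level example above:

#include <assert.h>

enum { N = 123 };
static int a[N][N];

int main(void)
{
  int i = 5;
  int *iv = &a[i][0];          /* initial value, as in the preheader   */
  for (int j = 0; j < N; j++)
    iv++;                      /* inner-loop increment, step = 4 bytes */
  /* Final value: initial + 123*4 bytes, i.e. the start of row i+1.    */
  assert(iv == &a[i + 1][0]);
  return 0;
}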
Subject: Re: Teaching SCEV about ADDR_EXPR causes regression

> This together with the patch mentioned in the previous comment allows
> generating:
>
>   ivtmp.0 = &a[0][0]
> <bb1>
> L1:
>   s.0 = PHI(0, s.2)
>   i.0 = PHI(122, i.1)
>   ivtmp.1 = PHI(ivtmp.0, ivtmp.4)
> <bb2>
> L2:
>   s.1 = PHI(s.0, s.2)
>   j.0 = PHI(122, j.1)
>   ivtmp.2 = PHI(ivtmp.1, ivtmp.3)
>   s.2 = s.1 + MEM(ivtmp.2)
>   ivtmp.3 = ivtmp.2 + 4
>   j.1 = j.0 - 1
>   if (j.1 >= 0) goto L2
> <bb3>
>   ivtmp.4 = ivtmp.3  // would be ivtmp.4 = ivtmp.1 + stride
>   i.1 = i.0 - 1
>   if (i.1 >= 0) goto L1
>
> The improvement is that ivtmp.1 is not live across the inner loop.
>
> The approach is to store final values of IVs in a hashtable, mapping the
> SSA_NAME of the initial value in the preheader to an aff_tree with the
> final value, and then to try to replace increments of new IVs with uses
> of IVs from inner loops (currently I just implemented a brute-force loop
> over all IV uses to find a useful entry in that hashtable).
>
> Does this make sense and sound acceptable?

The approach seems OK. However, it is not immediately clear that performing
the replacement is a good idea -- it trades off register pressure for
creating new dependences, i.e., it makes register allocation easier, but
scheduling harder. So, some performance testing is necessary to check this.

Zdenek
Note that for the three-level loop example, GCC chooses one IV for both the
j and k loops, and thus generates pretty clean output on x86_64 at -O2.

For the simple example, GCC can now eliminate the comparison IV using the
address candidate, and generates the code below:

  <bb 2>:
  ivtmp.18_14 = (unsigned long) &a;
  _10 = &a + 60516;
  _28 = (unsigned long) _10;
  goto <bb 7>;

  <bb 3>:

  <bb 4>:
  # s_18 = PHI <s_9(3), s_19(7)>
  # ivtmp.9_12 = PHI <ivtmp.9_1(3), ivtmp.9_4(7)>
  _22 = (void *) ivtmp.9_12;
  _8 = MEM[base: _22, offset: 0B];
  s_9 = _8 + s_18;
  ivtmp.9_1 = ivtmp.9_12 + 4;
  if (ivtmp.9_1 != _27)
    goto <bb 3>;
  else
    goto <bb 5>;

  <bb 5>:
  # s_17 = PHI <s_9(4)>
  ivtmp.18_15 = ivtmp.18_5 + 492;
  if (ivtmp.18_15 != _28)
    goto <bb 6>;
  else
    goto <bb 8>;

  <bb 6>:

  <bb 7>:
  # s_19 = PHI <s_17(6), 0(2)>
  # ivtmp.18_5 = PHI <ivtmp.18_15(6), ivtmp.18_14(2)>
  ivtmp.9_4 = ivtmp.18_5;
  _29 = ivtmp.18_5 + 492;
  _27 = _29;
  goto <bb 4>;

  <bb 8>:
  # s_16 = PHI <s_17(5)>
  return s_16;

With this, the following GIMPLE optimizers are able to CSE
"ivtmp.18_5 + 492". As a result, optimal code is generated, as seen in the
optimized dump:

  <bb 2>:
  ivtmp.18_14 = (unsigned long) &a;
  _28 = (unsigned long) &MEM[(void *)&a + 60516B];
  goto <bb 5>;

  <bb 3>:
  # s_18 = PHI <s_9(3), s_19(5)>
  # ivtmp.9_12 = PHI <ivtmp.9_1(3), ivtmp.18_5(5)>
  _22 = (void *) ivtmp.9_12;
  _8 = MEM[base: _22, offset: 0B];
  s_9 = _8 + s_18;
  ivtmp.9_1 = ivtmp.9_12 + 4;
  if (ivtmp.9_1 != _29)
    goto <bb 3>;
  else
    goto <bb 4>;

  <bb 4>:
  if (_28 != _29)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  # s_19 = PHI <s_9(4), 0(2)>
  # ivtmp.18_5 = PHI <_29(4), ivtmp.18_14(2)>
  _29 = ivtmp.18_5 + 492;
  goto <bb 3>;

  <bb 6>:
  return s_9;

This is in effect the transformation you wanted in this PR, but I doubt GCC
could do it if it could not eliminate the inner loop's comparison using
ivtmp.9_1 in the first place. On the other hand, in cases where an IV's
final value can be used, GCC is more likely to be able to eliminate the
comparison IV.
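For reference, the "simple example" these dumps correspond to is presumably
the two-level loop shown in the attachment comment above; the constants
check out (492 = 123*4 is the row stride, and 60516 = 123*123*4 is the size
of the whole array):

enum { N = 123 };
int a[N][N];

int foo()
{
  int i, j, s = 0;
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      s += a[i][j];
  return s;
}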
Looks like the ADDR_EXPR issue was fixed by revision 185129.
Looks indeed fixed on trunk:

        xorl    %eax, %eax
.L2:
        leaq    -60516(%rsi), %rdx
        .p2align 4,,10
        .p2align 3
.L6:
        leaq    492(%rdx), %rcx
        .p2align 4,,10
        .p2align 3
.L3:
        addl    a0(%rdx), %eax
        addq    $4, %rdx
        addl    a1-4(%rdx), %eax
        addl    a2-4(%rdx), %eax
        addl    a3-4(%rdx), %eax
        addl    a4-4(%rdx), %eax
        addl    a5-4(%rdx), %eax
        addl    a6-4(%rdx), %eax
        addl    a7-4(%rdx), %eax
        cmpq    %rdx, %rcx
        jne     .L3
        cmpq    %rcx, %rsi
        jne     .L6
        addq    $60516, %rsi
        cmpq    $7503984, %rsi
        jne     .L2