This is a missed-optimization bug. The following reduced test case illustrates the problem. It doesn't do anything useful, but just compile it with mipsisa32r2-elfoabi-gcc -S -mtune=24kc -G4096 -O2 example4.c #define N 511 #define M 9 long A[N]; long B[N]; long AA[N]; long BB[N]; long tA; long tB; void foo (unsigned iterations) { unsigned loop_cnt; static long *aLow; static long *bLow; static long *aHi; static long *bHi; static long n1; static long n2; static long l; static long i; static long j; static long k; for (loop_cnt = 0; loop_cnt < iterations; loop_cnt ++) { /* This is the loop we're interested in. */ for (i = 0; i < N; i ++) { AA[i] = A[i]; BB[i] = B[i]; } /* The rest of this stuff is just here to add some context to the outer loop. */ for (k = 1; k <= M; k++) { n1 = 1 << k; n2 = n1 >> 1; for (j = 0; j < n2; j++) { for (i = j; i < N; i += n1) { l = i + n2; aLow = &A[l]; bLow = &B[l]; aHi = &A[i]; bHi = &B[i]; A[l] = *aHi - tA; B[l] = *bHi - tB; A[i] += tA; B[i] += tB; } } } } } The -G option forces the global variables to use GP-relative addressing, which involves an extra addition. Thus the first nested loop should be optimized as if it were written: { long *t1 = AA; long *t2 = A; long *t3 = BB; long *t4 = B; for (i = 0; i < N; i++) { *t1 = *t2; *t3 = *t4; t1++; t2++; t3++; t4++; } } In 4.3.1, though, it is producing code with GP-relative addressing inside the loop, so that the loop body has 9 adds instead of 5. Mainline head does a better job and at least pulls out the references to A and B (which also appear in the second nested loop). PRE is working fine, and pulling the invariant GP-relative addressing of all four variables all the way out of the outer loop. However, this means the lifetimes of the corresponding pseudo-registers span the entire outer loop, and the register allocator is (correctly) giving priority to the more localized pseudos in the more deeply nested loops that follow. Having failed to allocate a hardware register to span the entire lifetime of the pseudos, reload stupidly re-inserts the previously hoisted GP-relative address computation at the point of reference, inside the first nested loop. I think what is needed is more smarts to make it understand that it should try allocating a register just around the inner loop if it can't get one for the entire outer loop, before giving up. Any thoughts on where the best place for this to happen would be? Can this be done entirely within the register allocator or do we need another pass to identify places where we can potentially shorten the lifetimes of pseudos? While this example is specific to MIPS with the GP-relative addressing, I can see that the underlying PRE/register allocation conflict is a more general problem that probably crops up in lots of other code with similar structure of outer-loop-containing-multiple-inner-loops. -Sandra
This sounds more like an IV issue rather than a PRE issue.
Actually this is an IV-opt issue rather than a RA/PRE issue. For GCC 4.1.1 targeting an internal PPC port, I get the following on the tree level at the end: <L1>:; D.2011 = (long int *) ivtmp.92; MEM[base: &AA, index: D.2011] = MEM[base: &A, index: D.2011]; MEM[base: &BB, index: D.2011] = MEM[base: &B, index: D.2011]; ivtmp.92 = ivtmp.92 + 4B; if (ivtmp.92 != 2044B) goto <L1>; else goto <L49>; Which should show that this is IV-opts issue rather than a PRE issue at all.
One other tidbit: the MIPS SDE 3.4.4-based toolchain produced the desired code for this test case. It's really a 4.* regression, not an enhancement.
n
Maybe I'm just being clueless here, but I don't understand why this bug was re-categorized. In my original analysis, I traced the bad code directly to the RA pass un-doing the results of previous optimizations. Andrew, if you think it is going wrong somewhere else, can you provide more details as to where, and what code you think ought to be coming out of that pass that isn't? I could perhaps chew on this some more if I knew what to look for.
This bug appears to be fixed in mainline HEAD now. Here's an excerpt showing the generated code for the inner loop in the example program now: addiu $21,$28,%gp_rel(AA) addiu $10,$28,%gp_rel(A) addiu $20,$28,%gp_rel(BB) addiu $9,$28,%gp_rel(B) li $19,2044 # 0x7fc li $18,10 # 0xa move $2,$0 .L3: addu $8,$10,$2 addu $3,$9,$2 lw $24,0($8) addu $14,$21,$2 lw $8,0($3) addu $3,$20,$2 addiu $2,$2,4 sw $24,0($14) bne $2,$19,.L3 sw $8,0($3) All 4 gp_rel address computations pulled outside the loop, and only 5 adds inside. I'm not sure what fixed this, but it does seem fixed.
I tweaked IVOPTS a few months ago to help in similar cases on PowerPC: 2009-05-29 Eric Botcazou <ebotcazou@adacore.com> * tree-ssa-loop-ivopts.c (strip_offset_1) <MULT_EXPR>: New case. (force_expr_to_var_cost) <NEGATE_EXPR>: Likewise. (ptr_difference_cost): Use affine combinations to compute it. (difference_cost): Likewise. (get_computation_cost_at): Compute more accurate cost for addresses if the ratio is a multiplier allowed in addresses. For non-addresses, consider that an additional offset or symbol is added only once. Maybe this helped for MIPS in this case.