36223 – IV-opt is not optimal for mips

Bug 36223 - IV-opt is not optimal for mips

Summary: IV-opt is not optimal for mips

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.3.1

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:
Blocks:

Reported:	2008-05-12 18:58 UTC by Sandra Loosemore
Modified:	2009-08-25 00:37 UTC (History)
CC List:	3 users (show)

See Also:
Host:	i686-pc-linux-gnu
Target:	mipsisa32r2-elfoabi
Build:	i686-pc-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sandra Loosemore 2008-05-12 18:58:34 UTC

This is a missed-optimization bug.  The following reduced test case illustrates the problem.  It doesn't do anything useful, but just compile it with

mipsisa32r2-elfoabi-gcc -S -mtune=24kc -G4096 -O2 example4.c


#define N 511
#define M 9

long A[N];
long B[N];

long AA[N];
long BB[N];

long tA;
long tB;

void foo (unsigned iterations)
{
  unsigned loop_cnt;
  static long *aLow;
  static long *bLow;
  static long *aHi;
  static long *bHi;
  static long n1;
  static long n2;
  static long l;
  static long i;
  static long j;
  static long k;

  for (loop_cnt = 0; loop_cnt < iterations; loop_cnt ++) {

    /* This is the loop we're interested in.  */
    for (i = 0; i < N; i ++) {
      AA[i] = A[i];
      BB[i] = B[i];
    }

    /* The rest of this stuff is just here to add some context to the
       outer loop.  */
    for (k = 1; k <= M; k++) {
      n1 = 1 << k;
      n2 = n1 >> 1;
      for (j = 0; j < n2; j++) {
        for (i = j; i < N; i += n1) {
          l = i + n2;
          aLow = &A[l];
          bLow = &B[l];
          aHi = &A[i];
          bHi = &B[i];
          
          A[l] = *aHi - tA;
          B[l] = *bHi - tB;
          A[i] += tA;
          B[i] += tB;
        }
      }
    }

  }
}


The -G option forces the global variables to use GP-relative addressing, which involves an extra addition.  Thus the first nested loop should be optimized as if it were written:

{
    long *t1 = AA;
    long *t2 = A;
    long *t3 = BB;
    long *t4 = B;

    for (i = 0; i < N; i++)  {
      *t1 = *t2;
      *t3 = *t4;
      t1++; t2++; t3++; t4++;
    }
  }

In 4.3.1, though, it is producing code with GP-relative addressing inside the loop, so that the loop body has 9 adds instead of 5.  Mainline head does a better job and at least pulls out the references to A and B (which also appear in the second nested loop).

PRE is working fine, and pulling the invariant GP-relative addressing of all four variables all the way out of the outer loop.  However, this means the lifetimes of the corresponding pseudo-registers span the entire outer loop, and the register allocator is (correctly) giving priority to the more localized pseudos in the more deeply nested loops that follow.  Having failed to allocate a hardware register to span the entire lifetime of the pseudos, reload stupidly re-inserts the previously hoisted GP-relative address computation at the point of reference, inside the first nested loop.

I think what is needed is more smarts to make it understand that it should try allocating a register just around the inner loop if it can't get one for the entire outer loop, before giving up.  Any thoughts on where the best place for this to happen would be?  Can this be done entirely within the register allocator or do we need another pass to identify places where we can potentially shorten the lifetimes of pseudos?  

While this example is specific to MIPS with the GP-relative addressing, I can see that the underlying PRE/register allocation conflict is a more general problem that probably crops up in lots of other code with similar structure of outer-loop-containing-multiple-inner-loops.

-Sandra

Comment 1 Andrew Pinski 2008-05-12 19:02:09 UTC

This sounds more like an IV issue rather than a PRE issue.

Comment 2 Andrew Pinski 2008-05-12 19:09:24 UTC

Actually this is an IV-opt issue rather than a RA/PRE issue.
For GCC 4.1.1 targeting an internal PPC port, I get the following on the tree level at the end:
<L1>:;
  D.2011 = (long int *) ivtmp.92;
  MEM[base: &AA, index: D.2011] = MEM[base: &A, index: D.2011];
  MEM[base: &BB, index: D.2011] = MEM[base: &B, index: D.2011];
  ivtmp.92 = ivtmp.92 + 4B;
  if (ivtmp.92 != 2044B) goto <L1>; else goto <L49>;

Which should show that this is IV-opts issue rather than a PRE issue at all.

Comment 3 Sandra Loosemore 2008-05-12 19:10:56 UTC

One other tidbit:  the MIPS SDE 3.4.4-based toolchain produced the desired code for this test case.  It's really a 4.* regression, not an enhancement.

Comment 4 Hans-Peter Nilsson 2008-06-02 23:11:29 UTC

Comment 5 Sandra Loosemore 2008-06-30 02:05:02 UTC

Maybe I'm just being clueless here, but I don't understand why this bug was re-categorized.  In my original analysis, I traced the bad code directly to the RA pass un-doing the results of previous optimizations.  Andrew, if you think it is going wrong somewhere else, can you provide more details as to where, and what code you think ought to be coming out of that pass that isn't?  I could perhaps chew on this some more if I knew what to look for.

Comment 6 Sandra Loosemore 2009-08-24 22:36:32 UTC

This bug appears to be fixed in mainline HEAD now.  Here's an excerpt showing the generated code for the inner loop in the example program now:

        addiu   $21,$28,%gp_rel(AA)
        addiu   $10,$28,%gp_rel(A)
        addiu   $20,$28,%gp_rel(BB)
        addiu   $9,$28,%gp_rel(B)
        li      $19,2044                        # 0x7fc
        li      $18,10                  # 0xa
        move    $2,$0
.L3:
        addu    $8,$10,$2
        addu    $3,$9,$2
        lw      $24,0($8)
        addu    $14,$21,$2
        lw      $8,0($3)
        addu    $3,$20,$2
        addiu   $2,$2,4
        sw      $24,0($14)
        bne     $2,$19,.L3
        sw      $8,0($3)

All 4 gp_rel address computations pulled outside the loop, and only 5 adds inside.  I'm not sure what fixed this, but it does seem fixed.

Comment 7 Eric Botcazou 2009-08-25 00:37:26 UTC

I tweaked IVOPTS a few months ago to help in similar cases on PowerPC:

2009-05-29  Eric Botcazou  <ebotcazou@adacore.com>

	* tree-ssa-loop-ivopts.c (strip_offset_1) <MULT_EXPR>: New case.
	(force_expr_to_var_cost) <NEGATE_EXPR>: Likewise.
	(ptr_difference_cost): Use affine combinations to compute it.
	(difference_cost): Likewise.
	(get_computation_cost_at): Compute more accurate cost for addresses
	if the ratio is a multiplier allowed in addresses.
	For non-addresses, consider that an additional offset or symbol is
	added only once.

Maybe this helped for MIPS in this case.