Bug 20643 - [4.2 Regression] Tree loop optimizer does worse job than RTL loop optimizer
Summary: [4.2 Regression] Tree loop optimizer does worse job than RTL loop optimizer
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 4.1.0
: P2 normal
Target Milestone: 4.3.0
Assignee: Not yet assigned to anyone
URL:
Keywords: alias, missed-optimization
Depends on:
Blocks:
 
Reported: 2005-03-25 22:21 UTC by Steve Ellcey
Modified: 2009-03-30 15:43 UTC (History)
5 users (show)

See Also:
Host:
Target:
Build:
Known to work: 4.3.0
Known to fail: 4.0.4 4.2.5
Last reconfirmed: 2006-07-16 13:52:59


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Steve Ellcey 2005-03-25 22:21:47 UTC
In the attached test case, 3.4.* GCC generates better code than 4.0 (or 4.1)
because it moves more loop invariant code out of the inner loop of P7Viterbi. 
The problem seems to be in the alias analysis which determines what can be moved
out of that loop.  If you change the field M, which is unused, from int to float
then the 4.* GCC generates better code.  I tried the structure-alias branch to
see if that helped and it didn't.  See the email string starting at
http://gcc.gnu.org/ml/gcc/2005-03/msg00835.html for some more info.

Test case:

#define L_CONST 500

void *malloc(long size);

struct plan7_s {
  int M;
  int **tsc;                   /* transition scores     [0.6][1.M-1]        */
};

struct dpmatrix_s {
  int **mmx;
};
struct dpmatrix_s *mx;



void
AllocPlan7Body(struct plan7_s *hmm, int M) 
{
  int i;

  hmm->tsc    = malloc (7 * sizeof(int *));
  hmm->tsc[0] = malloc ((M+16) * sizeof(int));
  mx->mmx = (int **) malloc(sizeof(int *) * (L_CONST+1));
  for (i = 0; i <= L_CONST; i++) {
    mx->mmx[i] = malloc (M+2+16);
  }
  return;
}  

void
P7Viterbi(int L, int M, struct plan7_s *hmm, int **mmx)
{
  int   i,k;
  
  for (i = 1; i <= L; i++) {
    for (k = 1; k <= M; k++) {
      mmx[i][k] = mmx[i-1][k-1] + hmm->tsc[0][k-1];
    }
  }
}

main ()
{
	struct plan7_s *hmm;
	char dsq[L_CONST];
        int i;

	hmm = (struct plan7_s *) malloc (sizeof (struct plan7_s));
	mx = (struct dpmatrix_s *) malloc (sizeof (struct dpmatrix_s));
	AllocPlan7Body(hmm, 10);
        for (i = 0; i < 600000; i++) {
                P7Viterbi(500, 10, hmm, mx->mmx);
        }
}
Comment 1 Daniel Berlin 2005-03-26 00:02:58 UTC
Subject: Re:  New: Tree loop optimizer does
	worse job than RTL loop optimizer

On Fri, 2005-03-25 at 22:21 +0000, sje at cup dot hp dot com wrote:
> In the attached test case, 3.4.* GCC generates better code than 4.0 (or 4.1)
> because it moves more loop invariant code out of the inner loop of P7Viterbi. 
> The problem seems to be in the alias analysis which determines what can be moved
> out of that loop.  If you change the field M, which is unused, from int to float
> then the 4.* GCC generates better code.  I tried the structure-alias branch to
> see if that helped and it didn't. 

Structure aliasing can't disambiguate this yet. It will soon enough.

I haven't been putting the remainder of the part2 work on the
struct-aliasing branch because it's all of one file and a very small
amount of integration work, and it wasn't worth keeping merges up to
date for it.



Comment 2 Andrew Pinski 2005-03-26 00:48:44 UTC
And as I have mentioned in the email, do-loop is not being done because well it would be invalid to do 
it as we change an infinite loop to a finite one, see PR 19210.
Comment 3 Andrew Pinski 2005-06-15 03:18:44 UTC
Confirmed.
Comment 4 Steve Ellcey 2005-10-03 15:10:19 UTC
Just to clarify the main issue here, my main concern here is not the conversion of the for loop to a do loop but rather the alias disambiguation and the movement of loop invariant code out of the inner most loop in P7Viterbi.  Because the new alias analysis is not as good as it used to be (with respect to structures) more code is left in the inner loop.  This is why the test case is so much slower now than before.
Comment 5 Mark Mitchell 2005-10-31 02:51:25 UTC
Dan, is there any chance of fixing this for 4.1?
Comment 6 Daniel Berlin 2005-10-31 03:43:37 UTC
Realistically?
No.
I'm about to start solving it on the improved-aliasing branch.
Comment 7 Mark Mitchell 2006-02-24 00:25:50 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 8 Paolo Bonzini 2006-04-18 14:15:40 UTC
Mark, I don't believe there is any chance that it be fixed in 4.0/4.1?
Comment 9 Richard Biener 2006-04-28 14:06:49 UTC
This is really because we do not decompose structs pointed to by parameters for their elements, so a write to an int clobbers all of plan7_s.

With -O2 timings on i686 are for me (averages of three runs)

3.4.6    5.4s
4.0.3    7.4s
4.1.0    6.1s
4.2.0    6.6s

manually PREing gives 5.2s for mainline.
Comment 10 Mark Mitchell 2006-05-25 02:32:48 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 11 Atsushi Nemoto 2006-07-16 13:21:18 UTC
I have a similer optimization problem with this tiny function.

void foo(int *a)
{
	int i;
	for (i = 0; i < 100; i++)
		a[0] += a[1];
}

All gcc 4.x I tried generate load and store in inner loop.
On http://gcc.gnu.org/ml/gcc/2006-07/msg00282.html, I posted some results.
Is this a same probelm?  If not, I should file another bug report.  Thank you.
Comment 12 Steven Bosscher 2006-07-16 13:52:59 UTC
The test case in comment #11 looks like a classic store motion opportunity to me.  GCC 3.3 performs the store motion, GCC 4.2 r115467 does not.

Zdenek, I thought tree-ssa-lim should be able to do store motion in loops?
Comment 13 Zdenek Dvorak 2006-07-17 11:54:48 UTC
(In reply to comment #12)
> The test case in comment #11 looks like a classic store motion opportunity to
> me.  GCC 3.3 performs the store motion, GCC 4.2 r115467 does not.
> 
> Zdenek, I thought tree-ssa-lim should be able to do store motion in loops?

Yes, however again, the alias analysis does not tell it that a[0] does not alias a[1]; once that is fixed, tree-ssa-lim should work just fine.
Comment 14 Daniel Berlin 2006-07-17 13:34:48 UTC
Subject: Re:  [4.0/4.1/4.2 Regression] Tree loop
 optimizer does worse job than RTL loop optimizer

rakdver at gcc dot gnu dot org wrote:
> ------- Comment #13 from rakdver at gcc dot gnu dot org  2006-07-17 11:54 -------
> (In reply to comment #12)
>> The test case in comment #11 looks like a classic store motion opportunity to
>> me.  GCC 3.3 performs the store motion, GCC 4.2 r115467 does not.
>>
>> Zdenek, I thought tree-ssa-lim should be able to do store motion in loops?
> 
> Yes, however again, the alias analysis does not tell it that a[0] does not
> alias a[1]; once that is fixed, tree-ssa-lim should work just fine.

This is because it's an incoming parameter, and as a result, this
doesn't look at all like an array access, but just a random pointer access.

I have no plans to make the alias analysis algorithm reconstruct array
indexes from random pointer arithmetic.

--Dan
Comment 15 Atsushi Nemoto 2006-07-18 16:53:13 UTC
(In reply to comment #14)
> This is because it's an incoming parameter, and as a result, this
> doesn't look at all like an array access, but just a random pointer access.
> 
> I have no plans to make the alias analysis algorithm reconstruct array
> indexes from random pointer arithmetic.

I do not think reconstructing array indexes are needed,
but is it hard to tell that *(a+0) never be an alias of *(a+1) ?
Comment 16 Daniel Berlin 2006-07-18 17:03:53 UTC
Subject: Re:  [4.0/4.1/4.2 Regression] Tree loop
 optimizer does worse job than RTL loop optimizer

anemo at mba dot ocn dot ne dot jp wrote:
> ------- Comment #15 from anemo at mba dot ocn dot ne dot jp  2006-07-18 16:53 -------
> (In reply to comment #14)
>> This is because it's an incoming parameter, and as a result, this
>> doesn't look at all like an array access, but just a random pointer access.
>>
>> I have no plans to make the alias analysis algorithm reconstruct array
>> indexes from random pointer arithmetic.
> 
> I do not think reconstructing array indexes are needed,
> but is it hard to tell that *(a+0) never be an alias of *(a+1) ?

We already do say this when we know the offsets.  The point is that
without reconstructing the array indexes, we'd have to follow use-def
chains for *every single pointer access* on *every single operand
update*, in order to attempt to get the offsets and disambiguate them.

This is incredibly slow.




Comment 17 Andrew Pinski 2007-06-11 00:53:59 UTC
the pointer_plus branch improves the code here (I can't tell if it fixes the problem fully).
Comment 18 Andrew Pinski 2007-07-04 17:58:53 UTC
Yep this was fixed by Pointer_plus.
The load of hmm->tsc is no longer in the inner most loop.
Comment 19 Joseph S. Myers 2008-07-04 16:50:59 UTC
Closing 4.1 branch.
Comment 20 Joseph S. Myers 2009-03-30 15:43:07 UTC
Closing 4.2 branch, fixed in 4.3.