Bug 17863 - [4.3/4.4/4.5 Regression] performance loss (TER register pressure and inlining limits problems)
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.0.0
Importance: P4 normal
Target Milestone: 4.3.5
Assignee: Not yet assigned to anyone
Keywords: missed-optimization, ra
Duplicates: 18704
Blocks: 18704
 
Reported: 2004-10-06 15:33 UTC by kunert
Modified: 2010-03-21 12:13 UTC
CC: 4 users

Known to fail: 4.0.4
Last reconfirmed: 2009-12-28 15:40:33


Attachments
testcase (3.65 KB, text/plain), 2004-10-06 15:34 UTC, kunert

Description kunert 2004-10-06 15:33:23 UTC
I see a threefold performance loss. This is a rather big chunk of code and it is
probably difficult to extract a small testcase, because there is no pronounced
hot spot. To reproduce, just compile and run the code.

command line: g++ -O3 -march=pentium4 ttest.cc -o t34a -static

Execution time of the compiled code:

with GCC 3.4.1: 6.6s
with today's GCC 4.0.0: 18.9s

The machine is a 2.4 GHz P4.
Comment 1 kunert 2004-10-06 15:34:31 UTC
Created attachment 7295 [details]
testcase
Comment 2 Andrew Pinski 2004-10-06 17:02:35 UTC
Looks like IV-OPTS is doing something wrong.
Comment 3 Andrew Pinski 2004-10-06 18:40:54 UTC
I was wrong; this is just a case where we are not inlining as much as we should be.
Comment 4 Andrew Pinski 2004-11-02 15:48:29 UTC
Jan, can you look into this testcase? This is one where we don't inline as much as 3.4 did.
(For ppc-darwin it is much worse, as templates have to go through a stub now.)  Maybe
CALL_INSN_COST should be bumped.
Comment 5 Andrew Pinski 2004-12-06 05:20:51 UTC
*** Bug 18704 has been marked as a duplicate of this bug. ***
Comment 6 Andrew Pinski 2004-12-24 20:36:17 UTC
Reduced testcase:
const int LMAX = 4;
const int LMAX41 = 4*LMAX+1;
const int LMAX12 = (LMAX+1)*(LMAX+2)/2;
template<int n>
inline double accu1( const double* p1, const double* p2 )
{
     double d = *p1 * *p2;
     return d + accu1<n-1>( ++p1, ++p2 );
}
template <> inline double accu1<0>( const double* p1, const double* p2 )
{
    return p1[0] * p2[0]; 
}
template <int ny, int nz>
inline double accu2( const double* py, const double* pz, const double* h )
{
    const double d = accu1<nz>( pz, h ) * *py;
    if( ny == 0 ) return d;
    return d + accu2<(ny ? ny-1 : -1), (ny ? nz : -1 )>( ++py, pz, ++h );
}
template<>
inline double accu2<-1, -1>( const double* , const double* , const double* )
{
    return 0.0;
}
template <int ny, int nz>
inline double accu( const double* py, const double* pz, const double* h )
{
    if( ny == 0 ) return accu1<nz>( pz, h );
    else  if( nz == 0 ) return accu1<ny>( py, h );
    else if( nz >= ny ) return accu2<ny, nz>( py, pz, h );
    else return accu2<nz, ny>( pz, py, h );
}
template <>
inline double accu<0,0>( const double* , const double* , const double* h )
{
    return *h;
}
#define SWYZ( Y, Z ) ((Y+Z) * (Y+Z+1) / 2+Z)
#define CASA( Y, Z ) case SWYZ( Y, Z ):         \
        *ap1 = accu<Y, Z>( py, pz, dxb );      \
    if( z1 == 0 ) break;                        \
    ++ap1;                                      \
    z1--; py += LMAX41; pz -= LMAX41;
#define CAS( Y, Z ) case SWYZ( Y, Z ): *ap1 = accu<Y, Z>( py, pz, dxb ); break
#define CAS1( Y ) CASA( Y, 1 );  CAS( Y+1, 0 );
#define CAS2( Y ) CASA( Y, 2 ); CAS1( Y+1 );
#define CAS3( Y ) CASA( Y, 3 ); CAS2( Y+1 );
#define CAS4( Y ) CASA( Y, 4 ); CAS3( Y+1 );

double f(const double *py, const double *pz, double *dxb, double *ap1, int mh_z1234, unsigned int z1)
{
  switch( mh_z1234 )
 {
    CAS( 0, 0 );
    CAS1(0);
    CAS2(0);
    CAS3(0);
    CAS4(0);
  }
}


When we do -O3 or -O2, we don't inline accu1<1> into accu1<2> at all. Why?
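
For reference, fully expanding the recursion by hand, accu1<2> reduces to a three-term dot product (a sketch, not part of the testcase; the function name is made up):

inline double accu1_2_expanded( const double* p1, const double* p2 )
{
    // what accu1<2> computes once accu1<1> and accu1<0> are inlined
    return p1[0] * p2[0] + ( p1[1] * p2[1] + p1[2] * p2[2] );
}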
Comment 7 Jan Hubicka 2004-12-24 21:09:34 UTC
Subject: Re:  [4.0 Regression] threefold performance loss, not inlining as much

> [reduced testcase from comment #6 snipped]
> 
> When we do -O3 or -O2, we don't inline accu1<1> into accu1<2> at all. Why?

Because we inline other functions before we get to this one, and
inline-unit-growth is reached...
Actually, for very small units the inline-unit-growth limit seems to be
a bit too tight, so we might think about bypassing the limit for very
small units, but this won't solve the original testcase anyway...

The profiling branch seems to get this testcase right and inline everything
due to slightly different code size estimates, but it does not work
particularly well on the tramp3d testcase (right now it is slightly worse
than mainline without profiling; I have a patch to bring it back to
mainline levels, but it is obviously far from optimal...)

I am unsure whether we can come up with a more realistic cost model without
actually trying to inline the function and seeing how much it optimizes, as
suggested by some papers (though apparently not very suitable for a
production compiler, I would say), but I am all ears for ideas ;))

Honza
Comment 8 Steven Bosscher 2005-01-28 01:04:19 UTC
Final callgraph for amd64: 
double accu1(const double*, const double*) [with int n = 2]/21: 22 insns (29 
after inlining) needed inlinable asm_written 
  called by: 
  calls: 
double f(const double*, const double*, double*, double*, int, unsigned 
int)/17: 281 insns (2583 after inlining) needed inlinable asm_written 
  called by: 
  calls: 
 
Final callgraph for i686: 
 
double accu2(const double*, const double*, const double*) [with int ny = 2, 
int nz = 1]/29: 43 insns (102 after inlining) reachable inlinable asm_written 
  called by: 
  calls: 
double accu1(const double*, const double*) [with int n = 3]/25: 28 insns 
needed inlinable asm_written 
  called by: 
  calls: 
double accu1(const double*, const double*) [with int n = 2]/21: 28 insns 
needed inlinable asm_written 
  called by: 
  calls: 
double accu1(const double*, const double*) [with int n = 1]/18: 28 insns (27 
after inlining) needed inlinable asm_written 
  called by: 
  calls: 
double f(const double*, const double*, double*, double*, int, unsigned 
int)/17: 311 insns (3003 after inlining) needed inlinable asm_written 
  called by: 
  calls: 
 
 
Comment 9 Paolo Bonzini 2005-02-03 16:49:52 UTC
To the reporter: in this case you probably want __attribute__ ((leafify)), just 
in case, though you are right in expecting the compiler to inline it.
Comment 10 Richard Biener 2005-02-03 17:32:24 UTC
Subject: Re:  [4.0 Regression] threefold performance
 loss, not inlining as much

bonzini at gcc dot gnu dot org wrote:

> To the reporter: in this case you probably want __attribute__ ((leafify)), just 
> in case, though you are right in expecting the compiler to inline it.

But of course attribute leafify is not available without patching your 
gcc sources.
Comment 11 kunert 2005-02-08 10:13:29 UTC
Subject: Re:  [4.0 Regression] threefold performance
 loss, not inlining as much

Good idea.

However, this provides just another knob to tune the inlining and it is not obvious where to apply that attribute for the best performance. I played with the present dozen or so parameters for some hours to come close to the old (aka 3.4) performance, and I definitely can't handle even more. 

Thank you
Thomas Kunert
Comment 12 Richard Biener 2005-02-24 17:09:49 UTC
With __attribute__((leafify)) attached to v4c_quad and mult_pq, runtime goes down
from 16.0s to 4.4s with recent gcc 4.0.  For gcc 3.4.3 the runtimes are 5.0s and 4.9s.

We indeed do not do very well at estimating the size of template metaprograms in 4.0.
Comment 13 kunert 2005-02-25 09:52:04 UTC
Subject: Re:  [4.0 Regression] threefold performance
 loss, not inlining as much

Wow. Many thanks for that analysis.  Now I will go and fetch the patch. Since nobody seems to care about improving the inlining parameters, I'd love to see this patch in 4.0.

Thomas Kunert

Comment 14 Richard Biener 2005-02-25 10:06:25 UTC
Patch at
http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01571.html

improves the testcase from 16.2s to 12.1s (3.4: 5.0s) - i.e., still not good
enough.  As we (with the patch) still have size estimates for the functions
that are 15-40% higher than for 3.4, we'd probably need to bump our inlining
limits accordingly, say by 20%.
Comment 15 Giovanni Bajo 2005-02-25 16:43:48 UTC
Why isn't this a critical regression? We're regressing *badly* on code 
generation.
Comment 16 Richard Biener 2005-02-25 16:56:25 UTC
Yes, the regression is even worse on the closed duplicate, bug 18704.  There you can
also find some analysis of inline parameter tuning.
Comment 17 Steven Bosscher 2005-03-02 11:35:03 UTC
Performance bugs are never critical. 
Comment 18 Steven Bosscher 2005-03-02 11:36:19 UTC
Updated patch here: 
http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01796.html 
Comment 19 Steven Bosscher 2005-03-05 18:49:13 UTC
Even with Richard Guenther's patches, the only thing that really helps is 
setting --param large-function-growth=200, or more.  The default is 100. 
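
For reference, a complete invocation along those lines might be (a sketch: the file name and remaining flags are taken from the original report, the param value from the comment above):

g++ -O3 -march=pentium4 --param large-function-growth=200 ttest.cc -o t34a -static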
Comment 20 Richard Biener 2005-03-05 19:03:42 UTC
Subject: Re:  [4.0/4.1 Regression] threefold
 performance loss, not inlining as much

steven at gcc dot gnu dot org wrote:
> Even with Richard Guenther's patches, the only thing that really helps is 
> setting --param large-function-growth=200, or more.  The default is 100. 

Yup, this is probably one of the testcases where -fobey-inline would
help.  Or, of course, profile-directed inlining.
Comment 21 dank 2005-06-27 04:54:09 UTC
I just verified the regression here with -march=pentium on a pentium 4.
On the original testcase, I got runtimes of 7.0, 4.9, 8.1, and 7.0
seconds with gcc-2.95.3, gcc-3.4.3, gcc-4.0.0, and gcc-4.1-20050603
using just -O3 (no -static).
Comment 22 Steven Bosscher 2005-06-27 06:38:28 UTC
As you can see from those numbers Dan Kegel posted, this kind of test case 
is very sensitive to the intermediate representation presented to the inliner 
and to inliner heuristics.  Personally, I don't think it is worth keeping a 
bug report like this open.  This kind of almost-random behavior is similar 
to the reload failures on x86: you'll always be able to construct a test case 
that will fail with a "fixed" compiler for one particular case. 
 
However, in this case there are still a number of things that are interesting 
to look at: 
1) GCC 4.1 has profile guided inlining.  Does it help in this case? 
2) Does the early inlining patch [1] help? 
3) Does the pre-inlining patch and the IPA stuff help? (i.e. try the 
   tree-profiling-branch with all pistons firing ;-) 
 
Personally, I expect that this is a case where the pre-inline optimizations 
may be helpful. 
 
Could someone construct a graphical representation of the call graphs for 
GCC 3.4 and GCC 4.1 and compare them?  I'm very curious which function (or 
functions) are apparently inlined only by GCC 3.4 and not by any other 
release. 
   
 
[1] http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01839.html 
 
Comment 23 Anthony Danalis 2005-06-30 22:16:31 UTC
I'm looking at the reduced testcase from comment #6,
and I noticed that f() is declared double, but does not return anything.
Thus the code doesn't compile with -O3 -Wall -Werror.
If I fix the bug by adding a "return(return *ap1)",
or by declaring f() to be void, the performance regression disappears.

Here's the test harness I used to call the minimized testcase:

#include <iostream>
using namespace std;

// declaration of f() from the reduced testcase in comment #6
double f(const double *py, const double *pz, double *dxb, double *ap1,
         int mh_z1234, unsigned int z1);

int main(int argc, char *argv[]){
    double ay[100][100];
    const double *py, *pz;
    double *dxb, *ap1;
    double sum=0;
    int i,j,k;

    for(i=0; i<100; i++){
        for(j=0; j<100; j++){
            ay[i][j] = 1000*(i+1)+2*(j+1);
        }
    }
    py  = ay[0];
    pz  = ay[1];
    dxb = ay[2];
    ap1 = ay[3];

    for(k=0; k<100; k++){
        for(i=0; i<10000; i++){
            for(j=0; j<12; j++){
                sum += f(py,pz,dxb,ap1,j,5);
                sum /= 2;
            }
        }
    }
    cout << sum << endl;
    return 0;
}

Is that ok?   I compiled this with -O3 -mtune=pentium.

Runtimes *without* the fix to f() were
0.31s, 8.72s, 8.83s and 8.80s when compiled with g++
2.95.3, 3.4.3, 4.0.0 and 4.1.0-20050625, respectively
(making this a large performance regression relative to gcc-2.95.3).
Runtimes *with* the fix were
0.34s, 0.28s, 0.36s, 0.32s when compiled with g++
2.95.3, 3.4.3, 4.0.0 and 4.1.0-20050625, respectively.
Comment 24 Anthony Danalis 2005-06-30 22:24:23 UTC
I meant to say "return(*ap1)" not "return(return *ap1)"
Comment 25 dank 2005-07-01 03:44:14 UTC
Anthony, it looks like the runtimes with the fix
match the runtimes from the larger testcase reasonably
well; at least they're faster on gcc-3.4.3 where they're
supposed to be.

So maybe we should try to answer the questions from comment #22
for the reduced testcase.
Comment 26 Steven Bosscher 2005-10-29 22:36:31 UTC
Waiting for someone to look into this...
Comment 27 Mark Mitchell 2005-10-31 00:37:54 UTC
It seems unlikely to me that this is going to be release-critical, so I've downgraded it to P4.

Our inlining heuristics are notoriously easy to perturb.  Probably, to do substantially better, we'll need a more sophisticated cost model; or, as Jan suggests, actually try inlining things and see how much that helps.  I'm absolutely certain that for every release we'll have at least one program that is inlined worse than in the previous release; the question is how we're doing overall, and whether we're being particularly stupid.  So, I think the question is whether we're actually being particularly stupid on this test case.
Comment 28 Jan Hubicka 2005-10-31 18:36:02 UTC
I get 0m8.052s on 3.4 and 0m8.127s on mainline on an Athlon.  This hardly counts as a regression.
This is actually an effect of some cost tweaks we did relative to the gimplifier a while ago.
The reduced testcase fits in the limits now too.
Unlike Mark, I would not claim that our inliner heuristics are more notorious than any others ;))
Comment 29 Jan Hubicka 2005-10-31 19:15:17 UTC
Actually, I have to reopen this.  When playing around on a Pentium M and an Opteron, I still get roughly a 20% regression (6s to 8s); the 4.1 and 4.0 scores are about the same on both machines.  For some reason this doesn't reproduce on my Athlon.
This, however, doesn't seem to be related to the inlining limits, as ramping them up so that everything gets inlined still doesn't make the testcase perform better, so someone would probably have to oprofile it and figure out what is still regressing.
I think it is non-critical, though.
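
For whoever picks this up, a profiling session with the legacy OProfile tools might look like this (a sketch; the flags are assumed and may need adjusting for the kernel and setup):

opcontrol --no-vmlinux --reset
opcontrol --start
./a.out                        # run the compiled testcase
opcontrol --dump && opcontrol --shutdown
opreport -l ./a.out            # per-symbol breakdown of where the time goes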
Comment 30 James A. Morrison 2005-11-24 17:34:11 UTC
On powerpc-linux, I get the following timings:
Using the following command line:
g++ -O3 -o t41 -mcpu=7450 -mtune=7450 pr17863.cc -static


    real          user
3.4 0m11.761s   0m11.148s
4.0 0m10.196s   0m9.495s
4.1 0m17.824s   0m16.832s
4.1 0m11.547s   0m10.502s -- With attribute flatten
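
For readers unfamiliar with it: __attribute__((flatten)) asks GCC to inline the complete call tree into the annotated function, sidestepping the usual size heuristics.  A minimal sketch of its use (the functions below are hypothetical, not from the testcase):

static inline double dot2( const double* a, const double* b )
{
    return a[0] * b[0] + a[1] * b[1];
}

__attribute__((flatten))
double hot( const double* a, const double* b )
{
    // dot2 is inlined here regardless of the inliner's size estimates
    return dot2( a, b );
}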
Comment 31 Steven Bosscher 2006-02-09 23:14:56 UTC
Someone will have to investigate this one more thoroughly.

According to comment #29, inlining is no longer the problem, so the bug subject line was no longer correct.
Comment 32 Gabriel Dos Reis 2007-01-18 03:06:11 UTC
No fix will happen for GCC-4.0.x
Comment 33 Steven Bosscher 2007-12-18 20:02:25 UTC
Honza, since you re-opened this, perhaps you can give new timings?
Comment 34 Jan Hubicka 2008-01-30 17:13:53 UTC
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./test-3.4

real    0m5.692s
user    0m5.588s
sys     0m0.012s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./test-3.4

real    0m5.536s
user    0m5.492s
sys     0m0.016s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.438s
user    0m7.388s
sys     0m0.024s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.392s
user    0m7.336s
sys     0m0.008s

This is with current mainline and -O2 -march=athlon-xp.  I had to remove two "static" keywords on explicit template specializations.
So we still regress here.

Honza
Comment 35 Jan Hubicka 2008-01-30 17:58:58 UTC
So, for a more proper analysis: the testcase is quite challenging for the inlining heuristics, and by introducing early inlining and reducing the call cost we now inline less than we did back when I claimed that we inline everything.  However, making everything inlined again still does not solve the problem.

For the inline decisions, the problematic bit seems to be accu1 and friends.  They are templates built from simpler templates of the same form.  For n=0:
double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
  double D.4655;
  double D.4654;
  double D.4653;

<bb 2>:
  D.4654_2 = *p1_1(D);
  D.4655_4 = *p2_3(D);
  D.4653_5 = D.4654_2 * D.4655_4;
  return D.4653_5;

}

For n>=1 we simply copy the body a few times; here is n=1:
double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
  double D.17506;
  double D.17507;
  double D.17505;
  double d;
  double D.6664;
  double D.6663;
  double D.6662;

<bb 2>:
  D.6662_2 = *p1_1(D);
  D.6663_4 = *p2_3(D);
  d_5 = D.6662_2 * D.6663_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.17506_11 = *p1_7;
  D.17507_12 = *p2_6;
  D.17505_13 = D.17506_11 * D.17507_12;
  D.6664_9 = d_5 + D.17505_13;
  return D.6664_9;

}
The early inliner handles this well until the function grows too large; that happens at n=4, and for n=5 we end up not inlining:
double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
  double d;
  double D.6697;
  double D.6696;
  double D.6695;
  double D.6694;

<bb 2>:
  D.6694_2 = *p1_1(D);
  D.6695_4 = *p2_3(D);
  d_5 = D.6694_2 * D.6695_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.6697_8 = accu1 (p1_7, p2_6);
  D.6696_9 = D.6697_8 + d_5;
  return D.6696_9;

}
This is as expected: for n=4 the code is definitely longer than the call sequence, having 4 FP multiplies, 4 adds and 8 loads; I don't think a simple heuristic can reasonably expect it to simplify.

We inline these functions later, during late inlining, as expected, but since there are just too many calls to them, we eventually hit the large-function and large-unit limits.

Now, to get everything inlined one needs --param inline-call-cost=9999 --param max-inline-insns-single=999999 (the second is needed for DCubuc::DCubic, which is just big IMO).

Now with this:
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time /aux/hubicka/gcc-install/bin/g++  -O3 ttest.cc  -fpermissive --static -march=athlon-xp  -Winline --param inline-call-cost=9999 --param max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param inline-unit-growth limit reached
ttest.cc:468: warning: called from here

real    1m0.934s
user    0m59.736s
sys     0m1.204s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.055s
user    0m7.052s
sys     0m0.000s

We still have a long way to go to GCC 3.4 performance (5s, see my previous post).  I suspect that aliasing simply gives up.  Setting inline-call-cost to 1 (the other extreme) leads to 6.9s.
Comment 36 Jan Hubicka 2008-01-30 18:14:53 UTC
Looking at the .optimized dump, one obvious problem is that we keep a lot of pointer arithmetic that should be forward propagated:
<L147>:;
  D.184420 = *pz;
  p1 = pz + 8;
  D.184422 = *p1;
  p1 = p1 + 8;
  D.184424 = *p1;
  p1 = p1 + 8;
  D.184426 = *p1;
  p1 = p1 + 8;
  D.184428 = *p1;
  p1 = p1 + 8;
  D.184430 = *p1;
  p1 = p1 + 8;
  D.184432 = *p1;
  D.184434 = *(p1 + 8);

Those seem to be all just array manipulations.

Honza
Comment 37 stevenb.gcc@gmail.com 2008-01-30 20:13:39 UTC
Subject: Re:  [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> Those seem to be all just array manipulations.

AFAICT, they are exactly in the form that some targets like (e.g.
auto-inc/dec and SMALL_REGISTER_CLASS targets).
Comment 38 Jan Hubicka 2008-01-30 23:19:43 UTC
Subject: Re:  [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> AFAICT, they are exactly in the form that some targets like (e.g.
> auto-inc/dec and SMALL_REGISTER_CLASS targets).

Yep, but all the pointer arithmetic keeps us from realizing that we are doing
quite simple manipulations on an array, and from propagating loads/stores through.
CSE undoes this later in the game, so we end up with normal offsetted
addressing.  Doing it earlier should make load/store elimination happier.

Honza
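
To illustrate the two forms under discussion, a minimal sketch (hypothetical functions; the testcase's 8-byte increments are written as double indices):

double cascaded( const double* pz )
{
    // cascaded increments: each load depends on the previous pointer value
    const double* p1 = pz + 1;
    double a = *p1;
    p1 = p1 + 1;
    double b = *p1;
    return a + b;
}

double offsetted( const double* pz )
{
    // base-plus-offset form: the loads are independent and easy to reorder
    double a = *(pz + 1);
    double b = *(pz + 2);
    return a + b;
}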
Comment 39 rguenther@suse.de 2008-01-31 09:39:08 UTC
Subject: Re:  [4.0/4.1/4.2/4.3 Regression]
 performance loss (not inlining as much??)

On Wed, 30 Jan 2008, stevenb dot gcc at gmail dot com wrote:

> > Those seem to be all just array manipulations.
> 
> AFAICT, they are exactly in the form that some targets like (e.g.
> auto-inc/dec and SMALL_REGISTER_CLASS targets).

They also don't operate on arrays, so we cannot use ARRAY_REF in the
IL.

Richard.
Comment 40 Jan Hubicka 2008-02-01 16:47:41 UTC
Well, I still meant that simplifying the cascaded additions to an accumulating pointer into direct additions from the base makes the code simplify.  I implemented the trick experimentally in fwprop and will attach it later, but the patch by itself doesn't help.

What happens is now obvious.  For a sequence like:

  D.185211 = *py.4861;
  d = D.180590 * D.185211;
  p1 = py.4861 + 8;
  D.185213 = *p1;
  d = D.180606 * D.185213;
  p1 = py.4861 + 16;
  D.185215 = *p1;
  d = D.180642 * D.185215;
  p1 = py.4861 + 24;
  D.185217 = *p1;
  d = D.180728 * D.185217;
  p1 = py.4861 + 32;
  D.185219 = *p1;
  d = D.180888 * D.185219;
  p1 = py.4861 + 40;
  D.185221 = *p1;
  d = D.181157 * D.185221;
  p1 = py.4861 + 48;
  D.185223 = *p1;
  D.185098 = D.181571 * D.185223;
  D.185094 = d + D.185098;
  D.185090 = d + D.185094;
  D.185086 = d + D.185090;
  D.185082 = d + D.185086;
  D.185078 = d + D.185082;
  D.185074 = d + D.185078;
  D.185210 = *pz;
  d = D.185074 * D.185210;
  py = pz + 8; 
  d = D.180606 * D.185211;
  d = D.180642 * D.185213;
  d = D.180728 * D.185215;
  d = D.180888 * D.185217;
  d = D.181157 * D.185219;
  d = D.181571 * D.185221;
  D.185130 = D.182177 * D.185223;
  D.185126 = d + D.185130;
  D.185122 = d + D.185126;
  D.185118 = d + D.185122;
  D.185114 = d + D.185118;
  D.185110 = d + D.185114;
  D.185106 = d + D.185110;
  D.185225 = *py;
  d = D.185106 * D.185225;
  py = pz + 16;
  d = D.180642 * D.185211;
  d = D.180728 * D.185213;
  d = D.180888 * D.185215;
  d = D.181157 * D.185217;
  d = D.181571 * D.185219;
  d = D.182177 * D.185221;
  D.185162 = D.183023 * D.185223;
  D.185158 = d + D.185162;
  D.185154 = d + D.185158;
  D.185150 = d + D.185154;
  D.185146 = d + D.185150;
  D.185142 = d + D.185146;
  D.185138 = d + D.185142;
  D.185240 = *py;
  d = D.185138 * D.185240;
  py = pz + 24;
  d = D.180728 * D.185211;
  d = D.180888 * D.185213;
  d = D.181157 * D.185215;
  d = D.181571 * D.185217;
  d = D.182177 * D.185219;
  d = D.183023 * D.185221;
  D.185194 = D.184168 * D.185223;
  D.185190 = d + D.185194;
  D.185186 = d + D.185190;
  D.185182 = d + D.185186;
  D.185178 = d + D.185182;
  D.185174 = d + D.185178;
  D.185170 = d + D.185174;
  D.185255 = *py;
  d = D.185170 * D.185255;
  D.185134 = d + d;
  D.185102 = d + D.185134;
  D.185195 = d + D.185102;
  *ap1.4607 = D.185195;
  if (z1.4734 == 0)
    goto <bb 339> (<L351>);
  else
    goto <bb 144>;

that accumulates values from an array into a few variables, TER merges all the arithmetic into a single giant expression, leaving the loads in front of it.

<L262>:;
  D.197135 = *pz;
  D.197137 = *(pz + 8);
  D.197139 = *(pz + 16);
  D.197141 = *(pz + 24);
  D.197143 = *(pz + 32);
  D.197145 = *(pz + 40);
  D.197147 = *(pz + 48);
  D.197149 = *(pz + 56);
  D.197151 = *(pz + 64);
  D.197153 = *(pz + 72);
  D.197155 = *(pz + 80);
  D.197157 = *(pz + 88);
  D.197159 = *(pz + 96);
  *ap1.4658 = (D.180590 * D.197135 + (D.180606 * D.197137 + (D.180642 * D.197139 + (D.180728 * D.197141 + (D.180888 * D.197143 + (D.181157 * D.197145 + (D.181571 * D.197147 + (D.182177 * D.197149 + (D.183023 * D.197151 + (D.184168 * D.197153 + (D.185672 * D.197155 + (D.187606 * D.197157 + D.190042 * D.197159)))))))))))) * *py.4912 + ((D.180606 * D.197135 + (D.180642 * D.197137 + (D.180728 * D.197139 + (D.180888 * D.197141 + (D.181157 * D.197143 + (D.181571 * D.197145 + (D.182177 * D.197147 + (D.183023 * D.197149 + (D.184168 * D.197151 + (D.185672 * D.197153 + (D.187606 * D.197155 + (D.190042 * D.197157 + D.193063 * D.197159)))))))))))) * *(py.4912 + 8) + (D.180642 * D.197135 + (D.180728 * D.197137 + (D.180888 * D.197139 + (D.181157 * D.197141 + (D.181571 * D.197143 + (D.182177 * D.197145 + (D.183023 * D.197147 + (D.184168 * D.197149 + (D.185672 * D.197151 + (D.187606 * D.197153 + (D.190042 * D.197155 + (D.193063 * D.197157 + D.196753 * D.197159)))))))))))) * *(py.4912 + 16));
  if (z1.4780 == 0)
    goto <bb 339> (<L351>);
  else
    goto <bb 251>;


With the fwprop patch and -fno-tree-ter I get 5.1s, the same as with pre-4.0 GCC.  Why is TER not placing the loads into the expressions in the first place?  This seems like quite a common pattern for reducing register pressure to me.

I have to leave now, but I will play with it further, check whether the fwprop patch is needed, and polish it.

Honza
Comment 41 Andrew Macleod 2008-02-01 17:46:06 UTC
TER will not substitute a load into an expression if there is more than one use of the load, and your sample shows multiple uses of each load.  If it did this substitution, it could be introducing worse code; it doesn't know.  (TER is also strictly a single-block replacement.)

A standalone register-pressure analysis could determine that those loads should all be substituted because pressure is too high on an 8-register machine.  If there were 128 registers available, however, it might not want to substitute the loads.  TER just doesn't have that kind of information.
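
To illustrate the rule with a minimal sketch (hypothetical functions, not from the testcase):

double single_use( const double* p, double x )
{
    double s = *p;   // one use: TER can substitute the load into the return expression
    return s * x;
}

double multi_use( const double* p, double x )
{
    double t = *p;   // two uses: TER leaves the load in its own statement
    return t * t + x;
}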
Comment 42 Jan Hubicka 2008-02-01 22:33:56 UTC
So, for a summary:

With TER disabled, I get 6.2s, so we are still worse than 3.4, which gets 5.6s.

With the call-cost inline parameter increased and TER disabled, I get 5.3s.

With the forwprop fix, TER disabled and the inline parameter increased, I get 5.2s.

With forwprop alone I get 7.1s.

With forwprop and TER disabled, 5.8s.

Other combinations bring no difference.  TER increasing register pressure is the major offender here, masking the other improvements.

I don't see how to tackle the TER and inlining-limits issues with the current organization of the compiler.  It is probably a GCC 4.4 thing, if we get lucky.

Honza
Comment 43 Jan Hubicka 2008-02-01 22:45:08 UTC
Subject: Re:  [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> TER will not replace any load into an expression if there is more than one use
> of the load. Your sample shows multiple uses of each load. If it did this
> substitution, it could be introducing worse code, it doesn't know.   (TER is
> also strictly a single block replacement as well).

I noticed that now too.  The code after reordering by TER simply needs
even more registers live, by changing how the temporaries overlap.
There is probably no simple heuristic to control this...

Honza
Comment 44 Andrew Pinski 2008-02-01 22:55:05 UTC
(In reply to comment #42)
> I don't see how to track the TER and inlining limits issue with current
> organization of compiler. It is probably GCC-4.4 thing if we get lucky.

I'd rather not have TER changed; I will mention that the register pressure issue is really because our RA is not able to deal with it correctly.  Also, people can write their code that way.  TER can in some cases be thought of as a scheduler, and I can tell you that pushing loads further up helps targets like PowerPC and others.
Comment 45 Joseph S. Myers 2008-07-04 16:33:51 UTC
Closing 4.1 branch.
Comment 46 Jan Hubicka 2009-02-08 11:50:42 UTC
With the new RA we seem to do better on this testcase now:

hubicka@occam:~$ time ./a.out-3.4

real    0m5.448s
user    0m5.440s
sys     0m0.012s
hubicka@occam:~$ time ./a.out

real    0m5.834s
user    0m5.836s
sys     0m0.000s
hubicka@occam:~$ 

There is still a small regression.
Comment 47 Joseph S. Myers 2009-03-31 16:20:19 UTC
Closing 4.2 branch.
Comment 48 Richard Biener 2009-08-04 12:26:05 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 49 Steven Bosscher 2009-12-28 15:40:33 UTC
To make the test case work, I had to fix two errors by removing "static" keywords:

ttest.cc:105: error: explicit template specialization cannot have a storage class
ttest.cc:117: error: explicit template specialization cannot have a storage class

With that fixed, I timed the compiled binaries for x86_64 and for i386.


Compiled for x86_64 (with "g++-4.5.0 -O3 ttest.cc -static -fpermissive -o ttest45" etc.):

stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest36 ; done

real	0m4.238s
user	0m4.210s
sys	0m0.030s

real	0m4.209s
user	0m4.190s
sys	0m0.000s

real	0m4.193s
user	0m4.170s
sys	0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest41 ; done

real	0m3.733s
user	0m3.720s
sys	0m0.010s

real	0m3.632s
user	0m3.620s
sys	0m0.000s

real	0m3.662s
user	0m3.630s
sys	0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest42 ; done

real	0m3.292s
user	0m3.260s
sys	0m0.020s

real	0m3.338s
user	0m3.300s
sys	0m0.010s

real	0m3.264s
user	0m3.260s
sys	0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest43 ; done

real	0m3.515s
user	0m3.500s
sys	0m0.020s

real	0m3.463s
user	0m3.420s
sys	0m0.000s

real	0m3.518s
user	0m3.490s
sys	0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest44 ; done

real	0m3.467s
user	0m3.420s
sys	0m0.010s

real	0m3.378s
user	0m3.380s
sys	0m0.000s

real	0m3.434s
user	0m3.400s
sys	0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest45 ; done

real	0m0.284s
user	0m0.280s
sys	0m0.000s

real	0m0.202s
user	0m0.180s
sys	0m0.000s

real	0m0.183s
user	0m0.180s
sys	0m0.000s




Compiled for i386 (with "g++-4.5.0 -O3 -m32 -march=pentium4 ttest.cc -static -fpermissive -o ttest45" etc.):

stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest36 ; done

real	0m4.092s
user	0m4.080s
sys	0m0.010s

real	0m3.954s
user	0m3.940s
sys	0m0.020s

real	0m3.988s
user	0m3.970s
sys	0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest42 ; done

real	0m5.818s
user	0m5.810s
sys	0m0.010s

real	0m5.828s
user	0m5.770s
sys	0m0.030s

real	0m5.813s
user	0m5.790s
sys	0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest43 ; done

real	0m5.379s
user	0m5.360s
sys	0m0.010s

real	0m5.419s
user	0m5.370s
sys	0m0.030s

real	0m5.382s
user	0m5.360s
sys	0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest44 ; done

real	0m4.430s
user	0m4.410s
sys	0m0.020s

real	0m4.433s
user	0m4.390s
sys	0m0.010s

real	0m4.389s
user	0m4.380s
sys	0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest45 ; done

real	0m0.230s
user	0m0.220s
sys	0m0.010s

real	0m0.236s
user	0m0.220s
sys	0m0.000s

real	0m0.216s
user	0m0.210s
sys	0m0.000s


So GCC 4.4 with -m32 still has a ~10% performance regression compared to GCC 3.4, but GCC 4.5 appears to optimize the test case away (but I am not sure that the result is correct -- how to check for correctness?).

For -m64 (x86-64), all GCC4 versions are better than GCC 3.4, and GCC 4.2 gives the best performance.

Reconfirmed for 32-bit x86, then.
Comment 50 Steven Bosscher 2010-03-21 12:13:05 UTC
Performance loss is within acceptable limits (by the "you give some, you take some" principle).  GCC 4.5 optimizes the test case away completely.  I see no reason to do anything more here.  Fixed for GCC 4.5 and GCC 4.4.  Won't fix for GCC 4.3.