I see a threefold performance loss. This is a rather big chunk of code and it is probably difficult to extract a small testcase, because there is no pronounced hot spot. To reproduce, just compile and run the code.

Command line: g++ -O3 -march=pentium4 ttest.cc -o t34a -static

Execution time of the compiled code:
  with GCC 3.4.1:          6.6s
  with today's GCC 4.0.0: 18.9s

The machine is a 2.4 GHz P4.
Created attachment 7295 [details] testcase
Looks like IV-OPTS is doing something wrong.
I was wrong, this is just the case where we are not inlining as much as we should be.
Jan, can you look into this testcase? This is one where we don't inline as much as 3.4 did (for ppc-darwin it is much worse, as templates have to go through a stub now). Maybe CALL_INSN_COST should be bumped.
*** Bug 18704 has been marked as a duplicate of this bug. ***
Reduced testcase:

const int LMAX = 4;
const int LMAX41 = 4*LMAX+1;
const int LMAX12 = (LMAX+1)*(LMAX+2)/2;

template<int n>
inline double accu1( const double* p1, const double* p2 )
{
  double d = *p1 * *p2;
  return d + accu1<n-1>( ++p1, ++p2 );
}

template <> inline double accu1<0>( const double* p1, const double* p2 )
{
  return p1[0] * p2[0];
}

template <int ny, int nz>
inline double accu2( const double* py, const double* pz, const double* h )
{
  const double d = accu1<nz>( pz, h ) * *py;
  if( ny == 0 ) return d;
  return d + accu2<(ny ? ny-1 : -1), (ny ? nz : -1 )>( ++py, pz, ++h );
}

template<>
inline double accu2<-1, -1>( const double* , const double* , const double* )
{
  return 0.0;
}

template <int ny, int nz>
inline double accu( const double* py, const double* pz, const double* h )
{
  if( ny == 0 ) return accu1<nz>( pz, h );
  else if( nz == 0 ) return accu1<ny>( py, h );
  else if( nz >= ny ) return accu2<ny, nz>( py, pz, h );
  else return accu2<nz, ny>( pz, py, h );
}

template <>
inline double accu<0,0>( const double* , const double* , const double* h )
{
  return *h;
}

#define SWYZ( Y, Z ) ((Y+Z) * (Y+Z+1) / 2+Z)
#define CASA( Y, Z ) case SWYZ( Y, Z ): \
  *ap1 = accu<Y, Z>( py, pz, dxb ); \
  if( z1 == 0 ) break; \
  ++ap1; \
  z1--; py += LMAX41; pz -= LMAX41;
#define CAS( Y, Z ) case SWYZ( Y, Z ): *ap1 = accu<Y, Z>( py, pz, dxb ); break
#define CAS1( Y ) CASA( Y, 1 ); CAS( Y+1, 0 );
#define CAS2( Y ) CASA( Y, 2 ); CAS1( Y+1 );
#define CAS3( Y ) CASA( Y, 3 ); CAS2( Y+1 );
#define CAS4( Y ) CASA( Y, 4 ); CAS3( Y+1 );

double f(const double *py, const double *pz, double *dxb, double *ap1, int mh_z1234, unsigned int z1)
{
  switch( mh_z1234 )
  {
    CAS( 0, 0 );
    CAS1(0);
    CAS2(0);
    CAS3(0);
    CAS4(0);
  }
}

When we do -O3 or -O2, we don't inline accu1<1> into accu1<2> at all. Why?
Subject: Re: [4.0 Regression] threefold performance loss, not inlining as much

> When we do -O3 or -O2, we don't inline accu1<1> into accu1<2> at all, why?

Because we inline other functions before we get into this one and inline-unit-growth is reached... Actually for very small units the inline-unit-growth limit seems to be a bit too tight, so we might think about bypassing this limit for very small units, but this won't solve the original testcase anyway...

The profiling branch seems to get this testcase right and inline everything, due to slightly different code size estimates, but it does not work particularly well on the tramp3d testcase (right now it is slightly worse than mainline without profiling; I have a patch to bring it back to mainline levels that is obviously far from optimal...).

I am unsure whether we can come up with a more realistic cost model without actually trying to inline the function and seeing how much it optimizes, as suggested by some papers (but apparently not very suitable for a production compiler, I would say), but I am all ears for ideas ;))

Honza
Final callgraph for amd64:

double accu1(const double*, const double*) [with int n = 2]/21:
  22 insns (29 after inlining) needed inlinable asm_written
  called by:
  calls:
double f(const double*, const double*, double*, double*, int, unsigned int)/17:
  281 insns (2583 after inlining) needed inlinable asm_written
  called by:
  calls:

Final callgraph for i686:

double accu2(const double*, const double*, const double*) [with int ny = 2, int nz = 1]/29:
  43 insns (102 after inlining) reachable inlinable asm_written
  called by:
  calls:
double accu1(const double*, const double*) [with int n = 3]/25:
  28 insns needed inlinable asm_written
  called by:
  calls:
double accu1(const double*, const double*) [with int n = 2]/21:
  28 insns needed inlinable asm_written
  called by:
  calls:
double accu1(const double*, const double*) [with int n = 1]/18:
  28 insns (27 after inlining) needed inlinable asm_written
  called by:
  calls:
double f(const double*, const double*, double*, double*, int, unsigned int)/17:
  311 insns (3003 after inlining) needed inlinable asm_written
  called by:
  calls:
To the reporter: in this case you probably want __attribute__ ((leafify)), just in case, though you are right in expecting the compiler to inline it.
Subject: Re: [4.0 Regression] threefold performance loss, not inlining as much bonzini at gcc dot gnu dot org wrote: > To the reporter: in this case you probably want __attribute__ ((leafify)), just > in case, though you are right in expecting the compiler to inline it. But of course attribute leafify is not available without patching your gcc sources.
Subject: Re: [4.0 Regression] threefold performance loss, not inlining as much Good idea. However, this provides just another knob to tune the inlining and it is not obvious where to apply that attribute for the best performance. I played with the present dozen or so parameters for some hours to come close to the old (aka 3.4) performance, and I definitely can't handle even more. Thank you Thomas Kunert bonzini at gcc dot gnu dot org wrote: > ------- Additional Comments From bonzini at gcc dot gnu dot org 2005-02-03 16:49 ------- > To the reporter: in this case you probably want __attribute__ ((leafify)), just > in case, though you are right in expecting the compiler to inline it. >
With __attribute__((leafify)) attached to v4c_quad and mult_pq, runtime goes down from 16.0s to 4.4s with recent gcc 4.0. For gcc 3.4.3 the runtimes are 5.0s and 4.9s. We indeed do not do very well at estimating the size of template metaprograms in 4.0.
Subject: Re: [4.0 Regression] threefold performance loss, not inlining as much Wow. Many thanks for that analysis. Now I will go and fetch the patch. Since nobody seems to care about improving the inlining parameters, I'd love to see this patch in 4.0. Thomas Kunert rguenth at gcc dot gnu dot org wrote: > ------- Additional Comments From rguenth at gcc dot gnu dot org 2005-02-24 17:09 ------- > With __attribute__((leafify)) sticked to v4c_quad and mult_pq runtime goes down > from 16.0s to 4.4s with recent gcc 4.0. For gcc 3.4.3 runtimes are 5.0s and 4.9s. > > We indeed do not very well on estimating the size of template metaprograms in 4.0. >
Patch at http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01571.html improves the testcase from 16.2s to 12.1s (3.4: 5.0s) -- i.e., still not good enough. As we still have (with the patch) size estimates for the functions that are 15-40% higher than for 3.4, we'd probably need to bump our inlining limits accordingly, say by 20%.
Why isn't this a critical regression? We're regressing *badly* on code generation.
Yes, the regression is even worse on the closed-duplicate #18704. There you can also find some analysis of inline parameter tuning.
Performance bugs are never critical.
Updated patch here: http://gcc.gnu.org/ml/gcc-patches/2005-02/msg01796.html
Even with Richard Guenther's patches, the only thing that really helps is setting --param large-function-growth=200, or more. The default is 100.
Subject: Re: [4.0/4.1 Regression] threefold performance loss, not inlining as much steven at gcc dot gnu dot org wrote: > ------- Additional Comments From steven at gcc dot gnu dot org 2005-03-05 18:49 ------- > Even with Richard Guenther's patches, the only thing that really helps is > setting --param large-function-growth=200, or more. The default is 100. Yup, this is probably one of the testcases, where -fobey-inline would help. Or of course profile directed inlining.
I just verified the regression here with -march=pentium on a pentium 4. On the original testcase, I got runtimes of 7.0, 4.9, 8.1, and 7.0 seconds with gcc-2.95.3, gcc-3.4.3, gcc-4.0.0, and gcc-4.1-20050603 using just -O3 (no -static).
As you can see from the numbers Dan Kegel posted, this kind of test case is very sensitive to the intermediate representation presented to the inliner and to the inliner heuristics. Personally, I don't think it is worth keeping a bug report like this open. This kind of almost-random behavior is similar to the reload failures on x86: you'll always be able to construct a test case that will fail with a "fixed" compiler for one particular case.

However, in this case there are still a number of things that are interesting to look at:
1) GCC 4.1 has profile guided inlining. Does it help in this case?
2) Does the early inlining patch [1] help?
3) Do the pre-inlining patch and the IPA stuff help? (i.e. try the tree-profiling-branch with all pistons firing ;-)

Personally, I expect that this is a case where the pre-inline optimizations may be helpful. Could someone construct a graphical representation of the call graphs for GCC 3.4 and GCC 4.1 and compare them? I'm very curious which function (or functions) are apparently inlined only by GCC 3.4 and not by any other release.

[1] http://gcc.gnu.org/ml/gcc-patches/2005-06/msg01839.html
I'm looking at the reduced testcase from comment #6, and I noticed that f() is declared double, but does not return anything. Thus the code doesn't compile with -O3 -Wall -Werror. If I fix the bug by adding a "return(return *ap1)", or by declaring f() to be void, the performance regression disappears.

Here's the test harness I used to call the minimized testcase:

int main(int argc, char *argv[])
{
  double ay[100][100];
  const double *py, *pz;
  double *dxb, *ap1;
  double sum = 0;
  int i, j, k;
  for( i = 0; i < 100; i++ ){
    for( j = 0; j < 100; j++ ){
      ay[i][j] = 1000*(i+1) + 2*(j+1);
    }
  }
  py = ay[0];
  pz = ay[1];
  dxb = ay[2];
  ap1 = ay[3];
  for( k = 0; k < 100; k++ ){
    for( i = 0; i < 10000; i++ ){
      for( j = 0; j < 12; j++ ){
        sum += f(py, pz, dxb, ap1, j, 5);
        sum /= 2;
      }
    }
  }
  cout << sum << endl;
  return 0;
}

Is that ok? I compiled this with -O3 -mtune=pentium.

Runtimes *without* the fix to f() were 0.31s, 8.72s, 8.83s and 8.80s when compiled with g++ 2.95.3, 3.4.3, 4.0.0 and 4.1.0-20050625, respectively (making this a large performance regression relative to gcc-2.95.3). Runtimes *with* the fix were 0.34s, 0.28s, 0.36s and 0.32s, respectively, with the same compilers.
I meant to say "return(*ap1)" not "return(return *ap1)"
Anthony, it looks like the runtimes with the fix match the runtimes from the larger testcase reasonably well; at least they're faster on gcc-3.4.3, where they're supposed to be. So maybe we should try to answer the questions from comment #22 for the reduced testcase.
Waiting for someone to look into this...
It seems unlikely to me that this is going to be release-critical, so I've downgraded it to P4. Our inlining heuristics are notoriously easy to perturb. Probably, to do substantially better, we'll need a more sophisticated cost model; or, as Jan suggests, actually try inlining things and see how much that helps. I'm absolutely certain that for every release we'll have at least one program that was inlined worse than in the previous release; the question is how we're doing overall, and whether we're being particularly stupid. So, I think the question is whether we're actually being particularly stupid on this test case.
I get 0m8.052s on 3.4 and 0m8.127s on mainline on Athlon. This hardly counts as a regression. This is actually the effect of some cost tweaks we did relative to the gimplifier a while ago. The reduced testcase fits in the limits now too. Unlike Mark, I would not claim that our inliner heuristics are actually more notorious than any others ;))
Actually I have to reopen this. When playing around on PentiumM or Opteron, I still get roughly a 20% regression (6s to 8s); 4.1 and 4.0 scores are about the same on both machines. For some reason this doesn't reproduce on my Athlon. This however doesn't seem to be related to the inlining limits, as ramping them up so that everything gets inlined still won't make the testcase perform better, so someone will probably have to oprofile it and figure out what is still regressing. I think it is non-critical though.
On powerpc-linux, I get the following timings, using the following command line:
g++ -O3 -o t41 -mcpu=7450 -mtune=7450 pr17863.cc -static

        real        user
3.4     0m11.761s   0m11.148s
4.0     0m10.196s   0m9.495s
4.1     0m17.824s   0m16.832s
4.1     0m11.547s   0m10.502s   -- with attribute flatten
Someone will have to investigate this one better. According to comment #29, inlining is no longer a problem so the bug subject line was no longer correct.
No fix will happen for GCC-4.0.x
Honza, since you re-opened this, perhaps you can give new timings?
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./test-3.4
real    0m5.692s
user    0m5.588s
sys     0m0.012s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./test-3.4
real    0m5.536s
user    0m5.492s
sys     0m0.016s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out
real    0m7.438s
user    0m7.388s
sys     0m0.024s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out
real    0m7.392s
user    0m7.336s
sys     0m0.008s

This is with current mainline and -O2 -march=athlon-xp. I had to remove two "static" keywords in the explicit template instantiations. So we still regress here.

Honza
So, for a more proper analysis. The testcase is quite challenging for the inlining heuristics, and by introducing early inlining and reducing the call cost we now inline less than we used to at the time I claimed that we inline everything. However, making everything inlined again still does not solve the problem.

For the inline decisions, the problematic bit seems to be accu1 and friends. They are templates built from simpler templates of the same form. For n=1:

double accu1(const double*, const double*) [with int n = 0] (p1, p2)
{
  double D.4655;
  double D.4654;
  double D.4653;

<bb 2>:
  D.4654_2 = *p1_1(D);
  D.4655_4 = *p2_3(D);
  D.4653_5 = D.4654_2 * D.4655_4;
  return D.4653_5;
}

With n>1 we simply copy the body a few times:

double accu1(const double*, const double*) [with int n = 1] (p1, p2)
{
  double D.17506;
  double D.17507;
  double D.17505;
  double D.17505;
  double d;
  double D.6664;
  double D.6663;
  double D.6662;

<bb 2>:
  D.6662_2 = *p1_1(D);
  D.6663_4 = *p2_3(D);
  d_5 = D.6662_2 * D.6663_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.17506_11 = *p1_7;
  D.17507_12 = *p2_6;
  D.17505_13 = D.17506_11 * D.17507_12;
  D.6664_9 = d_5 + D.17505_13;
  return D.6664_9;
}

The early inliner handles this well until the function grows too large; that happens at n=4, and for n=5 we end up not inlining:

double accu1(const double*, const double*) [with int n = 5] (p1, p2)
{
  double d;
  double D.6697;
  double D.6696;
  double D.6695;
  double D.6694;

<bb 2>:
  D.6694_2 = *p1_1(D);
  D.6695_4 = *p2_3(D);
  d_5 = D.6694_2 * D.6695_4;
  p2_6 = p2_3(D) + 8;
  p1_7 = p1_1(D) + 8;
  D.6697_8 = accu1 (p1_7, p2_6);
  D.6696_9 = D.6697_8 + d_5;
  return D.6696_9;
}

This is as expected: for n=4 the code is definitely longer than the call sequence, having 4 FP multiplies, 4 adds and 8 loads, and I don't think a simple heuristic can reasonably expect it to simplify. We do inline these functions later, in late inlining, as expected, but since there are just too many calls of them, we eventually hit the large-function and large-unit limits.
Now, to get everything inlined one needs --param inline-call-cost=9999 --param max-inline-insns-single=999999 (the second is needed for DCubuc::DCubic, which is just big IMO). Now with this:

hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time /aux/hubicka/gcc-install/bin/g++ -O3 ttest.cc -fpermissive --static -march=athlon-xp -Winline --param inline-call-cost=9999 --param max-inline-insns-single=999999
ttest.cc: In function 'void testv4c()':
ttest.cc:21: warning: inlining failed in call to 'tcdata::tcdata()': --param inline-unit-growth limit reached
ttest.cc:468: warning: called from here

real    1m0.934s
user    0m59.736s
sys     0m1.204s
hubicka@occam:/aux/hubicka/trunk-write/buidl2$ time ./a.out

real    0m7.055s
user    0m7.052s
sys     0m0.000s

We still have a long way to go to GCC 3.4 performance (5s, see my previous post). I suspect that aliasing simply gives up. Setting inline-call-cost to 1 (the other extreme) leads to 6.9s.
Looking at the .optimized dump, one obvious problem is that we keep a lot of pointer arithmetic that should be forward propagated:

<L147>:;
  D.184420 = *pz;
  p1 = pz + 8;
  D.184422 = *p1;
  p1 = p1 + 8;
  D.184424 = *p1;
  p1 = p1 + 8;
  D.184426 = *p1;
  p1 = p1 + 8;
  D.184428 = *p1;
  p1 = p1 + 8;
  D.184430 = *p1;
  p1 = p1 + 8;
  D.184432 = *p1;
  D.184434 = *(p1 + 8);

Those seems to be all just array manipulations.

Honza
Subject: Re: [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??) > Those seems to be all just array manipulations. AFAICT, they are exactly in the form that some targets like it (e.g. auto-inc/dec and SMALL_REGISTER_CLASS targets).
Subject: Re: [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> AFAICT, they are exactly in the form that some targets like it (e.g.
> auto-inc/dec and SMALL_REGISTER_CLASS targets).

Yep, but all the pointer arithmetic prevents us from realizing that we are doing quite simple manipulations on an array, and from propagating the loads/stores through. CSE undoes this later in the game, so we end up with normal offsetted addressing anyway; doing it earlier should make load/store elimination happier.

Honza
Subject: Re: [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> > Those seems to be all just array manipulations.
>
> AFAICT, they are exactly in the form that some targets like it (e.g.
> auto-inc/dec and SMALL_REGISTER_CLASS targets).

They also don't operate on arrays, so we cannot use ARRAY_REF in the IL.

Richard.
Well, I still meant that simplifying the cascaded additions into the accumulator into direct additions from the base makes the code simplify. I experimentally implemented the trick in fwprop and will attach it later, but the patch itself doesn't help. What happens is now obvious. For a sequence like:

  D.185211 = *py.4861;
  d = D.180590 * D.185211;
  p1 = py.4861 + 8;
  D.185213 = *p1;
  d = D.180606 * D.185213;
  p1 = py.4861 + 16;
  D.185215 = *p1;
  d = D.180642 * D.185215;
  p1 = py.4861 + 24;
  D.185217 = *p1;
  d = D.180728 * D.185217;
  p1 = py.4861 + 32;
  D.185219 = *p1;
  d = D.180888 * D.185219;
  p1 = py.4861 + 40;
  D.185221 = *p1;
  d = D.181157 * D.185221;
  p1 = py.4861 + 48;
  D.185223 = *p1;
  D.185098 = D.181571 * D.185223;
  D.185094 = d + D.185098;
  D.185090 = d + D.185094;
  D.185086 = d + D.185090;
  D.185082 = d + D.185086;
  D.185078 = d + D.185082;
  D.185074 = d + D.185078;
  D.185210 = *pz;
  d = D.185074 * D.185210;
  py = pz + 8;
  d = D.180606 * D.185211;
  d = D.180642 * D.185213;
  d = D.180728 * D.185215;
  d = D.180888 * D.185217;
  d = D.181157 * D.185219;
  d = D.181571 * D.185221;
  D.185130 = D.182177 * D.185223;
  D.185126 = d + D.185130;
  D.185122 = d + D.185126;
  D.185118 = d + D.185122;
  D.185114 = d + D.185118;
  D.185110 = d + D.185114;
  D.185106 = d + D.185110;
  D.185225 = *py;
  d = D.185106 * D.185225;
  py = pz + 16;
  d = D.180642 * D.185211;
  d = D.180728 * D.185213;
  d = D.180888 * D.185215;
  d = D.181157 * D.185217;
  d = D.181571 * D.185219;
  d = D.182177 * D.185221;
  D.185162 = D.183023 * D.185223;
  D.185158 = d + D.185162;
  D.185154 = d + D.185158;
  D.185150 = d + D.185154;
  D.185146 = d + D.185150;
  D.185142 = d + D.185146;
  D.185138 = d + D.185142;
  D.185240 = *py;
  d = D.185138 * D.185240;
  py = pz + 24;
  d = D.180728 * D.185211;
  d = D.180888 * D.185213;
  d = D.181157 * D.185215;
  d = D.181571 * D.185217;
  d = D.182177 * D.185219;
  d = D.183023 * D.185221;
  D.185194 = D.184168 * D.185223;
  D.185190 = d + D.185194;
  D.185186 = d + D.185190;
  D.185182 = d + D.185186;
  D.185178 = d + D.185182;
  D.185174 = d + D.185178;
  D.185170 = d + D.185174;
  D.185255 = *py;
  d = D.185170 * D.185255;
  D.185134 = d + d;
  D.185102 = d + D.185134;
  D.185195 = d + D.185102;
  *ap1.4607 = D.185195;
  if (z1.4734 == 0) goto <bb 339> (<L351>); else goto <bb 144>;

that accumulates values from the array into a few variables, TER merges all the arithmetic into a single giant expression, leaving the loads in front of it:

<L262>:;
  D.197135 = *pz;
  D.197137 = *(pz + 8);
  D.197139 = *(pz + 16);
  D.197141 = *(pz + 24);
  D.197143 = *(pz + 32);
  D.197145 = *(pz + 40);
  D.197147 = *(pz + 48);
  D.197149 = *(pz + 56);
  D.197151 = *(pz + 64);
  D.197153 = *(pz + 72);
  D.197155 = *(pz + 80);
  D.197157 = *(pz + 88);
  D.197159 = *(pz + 96);
  *ap1.4658 = (D.180590 * D.197135 + (D.180606 * D.197137 + (D.180642 * D.197139
    + (D.180728 * D.197141 + (D.180888 * D.197143 + (D.181157 * D.197145
    + (D.181571 * D.197147 + (D.182177 * D.197149 + (D.183023 * D.197151
    + (D.184168 * D.197153 + (D.185672 * D.197155 + (D.187606 * D.197157
    + D.190042 * D.197159)))))))))))) * *py.4912
    + ((D.180606 * D.197135 + (D.180642 * D.197137 + (D.180728 * D.197139
    + (D.180888 * D.197141 + (D.181157 * D.197143 + (D.181571 * D.197145
    + (D.182177 * D.197147 + (D.183023 * D.197149 + (D.184168 * D.197151
    + (D.185672 * D.197153 + (D.187606 * D.197155 + (D.190042 * D.197157
    + D.193063 * D.197159)))))))))))) * *(py.4912 + 8)
    + (D.180642 * D.197135 + (D.180728 * D.197137 + (D.180888 * D.197139
    + (D.181157 * D.197141 + (D.181571 * D.197143 + (D.182177 * D.197145
    + (D.183023 * D.197147 + (D.184168 * D.197149 + (D.185672 * D.197151
    + (D.187606 * D.197153 + (D.190042 * D.197155 + (D.193063 * D.197157
    + D.196753 * D.197159)))))))))))) * *(py.4912 + 16));
  if (z1.4780 == 0) goto <bb 339> (<L351>); else goto <bb 251>;

With the patch for fwprop and -fno-tree-ter I get 5.1s, which is the same as pre-GCC-4.0. Why is TER not placing the loads into the expressions in the first place? This seems like quite a common pattern for killing register pressure.
I have to leave but will play with it further, try if fwprop patch is needed and polish it. Honza
TER will not replace a load into an expression if there is more than one use of the load, and your sample shows multiple uses of each load. If it did this substitution, it could be introducing worse code; it doesn't know. (TER is also strictly a single-block replacement as well.) A standalone register pressure analysis could determine that those loads should all be substituted because pressure is too high on an 8-register machine. If there were 128 registers available, however, then it may not want to substitute the loads. TER just doesn't have that kind of information.
So, for a summary:
- With TER disabled, I get 6.2s, so we are still worse than 3.4's 5.6s.
- With the call-cost inline parameter increased and TER disabled, I get 5.3s.
- With the forwprop fix, TER disabled and the inline parameter increased, I get 5.2s.
- With forwprop alone, I get 7.1s.
- With forwprop and TER disabled, I get 5.8s.
- Other combinations bring no difference.

TER increasing register pressure is the major offender here, masking the other improvements. I don't see how to tackle the TER and inlining-limits issues with the current organization of the compiler. It is probably a GCC 4.4 thing if we get lucky.

Honza
Subject: Re: [4.0/4.1/4.2/4.3 Regression] performance loss (not inlining as much??)

> TER will not replace any load into an expression if there is more than one use
> of the load. Your sample shows multiple uses of each load. If it did this
> substitution, it could be introducing worse code, it doesn't know. (TER is
> also strictly a single block replacement as well).

I noticed that now too. The code after reordering by TER simply needs even more registers live, by changing how the temporaries overlap. There is probably no simple heuristic to control this...

Honza
(In reply to comment #42)
> I don't see how to track the TER and inlining limits issue with current
> organization of compiler. It is probably GCC-4.4 thing if we get lucky.

I'd rather not have TER changed; I will just mention that the register pressure issue is really there because our RA is not able to deal with it correctly. Also, people can write their code that way by hand. TER can in some cases be thought of as a scheduler, and I will tell you that pushing loads further up helps targets like PowerPC and others.
Closing 4.1 branch.
With new-RA we seem to do better on this testcase now:

hubicka@occam:~$ time ./a.out-3.4
real    0m5.448s
user    0m5.440s
sys     0m0.012s
hubicka@occam:~$ time ./a.out
real    0m5.834s
user    0m5.836s
sys     0m0.000s

Still, there is a small regression.
Closing 4.2 branch.
GCC 4.3.4 is being released, adjusting target milestone.
To make the test case work, I had to fix two errors by removing "static" keywords:

ttest.cc:105: error: explicit template specialization cannot have a storage class
ttest.cc:117: error: explicit template specialization cannot have a storage class

With that fixed, I timed the compiled binaries for x86_64 and for i386.

Compiled for x86_64 (with "g++-4.5.0 -O3 ttest.cc -static -fpermissive -o ttest45" etc.):

stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest36 ; done
real 0m4.238s  user 0m4.210s  sys 0m0.030s
real 0m4.209s  user 0m4.190s  sys 0m0.000s
real 0m4.193s  user 0m4.170s  sys 0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest41 ; done
real 0m3.733s  user 0m3.720s  sys 0m0.010s
real 0m3.632s  user 0m3.620s  sys 0m0.000s
real 0m3.662s  user 0m3.630s  sys 0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest42 ; done
real 0m3.292s  user 0m3.260s  sys 0m0.020s
real 0m3.338s  user 0m3.300s  sys 0m0.010s
real 0m3.264s  user 0m3.260s  sys 0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest43 ; done
real 0m3.515s  user 0m3.500s  sys 0m0.020s
real 0m3.463s  user 0m3.420s  sys 0m0.000s
real 0m3.518s  user 0m3.490s  sys 0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest44 ; done
real 0m3.467s  user 0m3.420s  sys 0m0.010s
real 0m3.378s  user 0m3.380s  sys 0m0.000s
real 0m3.434s  user 0m3.400s  sys 0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest45 ; done
real 0m0.284s  user 0m0.280s  sys 0m0.000s
real 0m0.202s  user 0m0.180s  sys 0m0.000s
real 0m0.183s  user 0m0.180s  sys 0m0.000s

Compiled for i386 (with "g++-4.5.0 -O3 -m32 -march=pentium4 ttest.cc -static -fpermissive -o ttest45" etc.):

stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest36 ; done
real 0m4.092s  user 0m4.080s  sys 0m0.010s
real 0m3.954s  user 0m3.940s  sys 0m0.020s
real 0m3.988s  user 0m3.970s  sys 0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest42 ; done
real 0m5.818s  user 0m5.810s  sys 0m0.010s
real 0m5.828s  user 0m5.770s  sys 0m0.030s
real 0m5.813s  user 0m5.790s  sys 0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest43 ; done
real 0m5.379s  user 0m5.360s  sys 0m0.010s
real 0m5.419s  user 0m5.370s  sys 0m0.030s
real 0m5.382s  user 0m5.360s  sys 0m0.010s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest44 ; done
real 0m4.430s  user 0m4.410s  sys 0m0.020s
real 0m4.433s  user 0m4.390s  sys 0m0.010s
real 0m4.389s  user 0m4.380s  sys 0m0.000s
stevenb@stevenb-laptop:~/t$ for f in 1 2 3 ; do time ./ttest45 ; done
real 0m0.230s  user 0m0.220s  sys 0m0.010s
real 0m0.236s  user 0m0.220s  sys 0m0.000s
real 0m0.216s  user 0m0.210s  sys 0m0.000s

So GCC 4.4 with -m32 still has a ~10% performance regression compared to GCC 3.4, but GCC 4.5 appears to optimize the test case away (though I am not sure that the result is correct -- how to check for correctness?). For -m64 (x86-64), all GCC 4 versions are better than GCC 3.4, and GCC 4.2 gives the best performance.

Reconfirmed for 32-bit x86, then.
Performance loss within acceptable limits (by the "you give some, you take some" principle). GCC 4.5 optimizes the test case away completely. I see no reason to do anything more here. Fixed for GCC 4.5 and GCC 4.4. Won't fix for GCC 4.3.