After revision 159852

Author: pault
Date: Wed May 26 05:11:04 2010 UTC (4 days, 12 hours ago)
Changed paths: 4
Log Message:
2010-05-26  Paul Thomas  <pault@gcc.gnu.org>

    PR fortran/40011
    * resolve.c (resolve_global_procedure): Resolve the gsymbol's
    namespace before trying to reorder the gsymbols.

2010-05-26  Paul Thomas  <pault@gcc.gnu.org>

    PR fortran/40011
    * gfortran.dg/whole_file_19.f90 : New test.

the executable of the polyhedron test rnflow.f90 is ~27% slower when compiled with -fwhole-program -flto:

[macbook] lin/test% gfcpf -v
Using built-in specs.
COLLECT_GCC=gfcpf
COLLECT_LTO_WRAPPER=/opt/gcc/gcc4.6pf/libexec/gcc/x86_64-apple-darwin10/4.6.0/lto-wrapper
Target: x86_64-apple-darwin10
Configured with: ../p_work/configure --prefix=/opt/gcc/gcc4.6pf --mandir=/opt/gcc/gcc4.6pf/share/man --infodir=/opt/gcc/gcc4.6pf/share/info --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --target=x86_64-apple-darwin10 --enable-languages=c,fortran --with-gmp=/opt/sw64 --with-libiconv-prefix=/opt/sw64 --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --with-cloog=/opt/sw64 --with-ppl=/opt/sw64 --with-mpc=/opt/sw64 --enable-lto
Thread model: posix
gcc version 4.6.0 20100526 (experimental) [trunk revision 159851] (GCC)
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.826u 0.686s 0:26.52 99.9% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.506u 0.674s 0:26.19 99.9% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.772u 0.678s 0:26.46 99.9% 0+0k 0+0io 0pf+0w

[macbook] lin/test% gfcp -v
Using built-in specs.
COLLECT_GCC=gfcp
COLLECT_LTO_WRAPPER=/opt/gcc/gcc4.6p/libexec/gcc/x86_64-apple-darwin10/4.6.0/lto-wrapper
Target: x86_64-apple-darwin10
Configured with: ../p_work/configure --prefix=/opt/gcc/gcc4.6p --mandir=/opt/gcc/gcc4.6p/share/man --infodir=/opt/gcc/gcc4.6p/share/info --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --target=x86_64-apple-darwin10 --enable-languages=c,fortran --with-gmp=/opt/sw64 --with-libiconv-prefix=/opt/sw64 --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --with-cloog=/opt/sw64 --with-ppl=/opt/sw64 --with-mpc=/opt/sw64 --enable-lto
Thread model: posix
gcc version 4.6.0 20100526 (experimental) [trunk revision 159852] (GCC)
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.841u 0.696s 0:26.54 99.9% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.540u 0.677s 0:26.22 99.9% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
32.627u 0.685s 0:33.31 99.9% 0+0k 0+0io 0pf+0w   <--- ~27% slower

As it has been noticed previously the executable of fatigue.f90 is ~30% faster when compiled with -fwhole-program:

[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
9.031u 0.006s 0:09.04 99.8% 0+0k 0+1io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.448u 0.004s 0:06.47 99.5% 0+0k 0+1io 0pf+0w
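As a quick check of the percentages quoted above, the relative change in user time can be computed directly from the `time` output. This is only a minimal sketch in Python; the helper name `percent_change` is made up for illustration, and the timings are copied from the runs above.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`, in percent."""
    return (after - before) / before * 100.0


# User times in seconds, copied from the `time a.out` runs above.
rnflow_r159851 = 25.772         # -fwhole-program -flto, revision 159851
rnflow_r159852 = 32.627         # -fwhole-program -flto, revision 159852
fatigue_whole_file = 9.031      # -fwhole-file -flto
fatigue_whole_program = 6.448   # -fwhole-program

print(f"rnflow:  {percent_change(rnflow_r159851, rnflow_r159852):+.1f}%")
print(f"fatigue: {percent_change(fatigue_whole_file, fatigue_whole_program):+.1f}%")
```

A positive value means the second run is slower, a negative value means it is faster.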
I'll attach the assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto for revisions 159851 and 159852. The assembly is the same with and without -fwhole-program (probably obvious). However, when it is assembled and linked with gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow_wp5*.s, the timing depends on the revision used to generate the assembly, but not on the compiler revision.
Insufficient analysis. This sounds more like a dup of the profile estimate being messed up by inlining.
Created attachment 20780 [details] Assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159851
Created attachment 20781 [details] Assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159852
Output of gprof on darwin:

Revision 159851:

                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                                       520605        _dgetf2_ [81]
                0.00        0.00      64/1041192     ___timctr_MOD_gettim [1429]
                0.00        0.00    6548/1041192     _dswap_ [4112]
                0.00        0.00 1034580/1041192     _xerbla_ [83]
[81]     0.0    0.00        0.00 1041192+520605  _dgetf2_ [81]
                0.00        0.00   64137/110864      _dgetrf_ [82]
                                       520605        _dgetf2_ [81]
-----------------------------------------------
                                        13315        _dgetrf_ [82]
                0.00        0.00       8/110864      ___timctr_MOD_gettim [1429]
                0.00        0.00    6548/110864      _dswap_ [4112]
                0.00        0.00    6685/110864      __dyld_func_lookup [1665]
                0.00        0.00   33486/110864      _xerbla_ [83]
                0.00        0.00   64137/110864      _dgetf2_ [81]
[82]     0.0    0.00        0.00  110864+13315   _dgetrf_ [82]
                0.00        0.00       1/1           _main [85]
                                        13315        _dgetrf_ [82]
-----------------------------------------------
                0.00        0.00   10872/10872       _dswap_ [4112]
[83]     0.0    0.00        0.00   10872         _xerbla_ [83]
                0.00        0.00 1034580/1041192     _dgetf2_ [81]
                0.00        0.00   33486/110864      _dgetrf_ [82]
-----------------------------------------------
                0.00        0.00       1/1           _main [85]
[84]     0.0    0.00        0.00       1         __start [84]
-----------------------------------------------
                0.00        0.00       1/1           _dgetrf_ [82]
[85]     0.0    0.00        0.00       1         _main [85]
                0.00        0.00       1/1           __start [84]
-----------------------------------------------
...

   %  cumulative    self              self    total
 time   seconds   seconds    calls  ms/call  ms/call name
  0.0       0.00     0.00  1561733     0.00     0.00  _dgetf2_ [81]
  0.0       0.00     0.00   110927     0.00     0.00  _dgetrf_ [82]
  0.0       0.00     0.00    10872     0.00     0.00  _xerbla_ [83]
  0.0       0.00     0.00        1     0.00     0.00  __start [84]
  0.0       0.00     0.00        1     0.00     0.00  _main [85]

================================================================================

Revision 159852:

                                  called/total       parents
index  %time    self descendents  called+self    name           index
                                  called/total       children

                0.00        0.00    6548/1561733     _dswap_ [4112]
                0.00        0.00 1555185/1561733     _xerbla_ [83]
[81]     0.0    0.00        0.00 1561733         _dgetf2_ [81]
                0.00        0.00   64136/110927      _dgetrf_ [82]
-----------------------------------------------
                                        13315        _dgetrf_ [82]
                0.00        0.00      72/110927      ___timctr_MOD_gettim [1429]
                0.00        0.00    6548/110927      _dswap_ [4112]
                0.00        0.00    6685/110927      __dyld_func_lookup [1665]
                0.00        0.00   33486/110927      _xerbla_ [83]
                0.00        0.00   64136/110927      _dgetf2_ [81]
[82]     0.0    0.00        0.00  110927+13315   _dgetrf_ [82]
                0.00        0.00       1/1           _main [85]
                                        13315        _dgetrf_ [82]
-----------------------------------------------
                0.00        0.00   10872/10872       _dswap_ [4112]
[83]     0.0    0.00        0.00   10872         _xerbla_ [83]
                0.00        0.00 1555185/1561733     _dgetf2_ [81]
                0.00        0.00   33486/110927      _dgetrf_ [82]
-----------------------------------------------
                0.00        0.00       1/1           _main [85]
[84]     0.0    0.00        0.00       1         __start [84]
-----------------------------------------------
                0.00        0.00       1/1           _dgetrf_ [82]
[85]     0.0    0.00        0.00       1         _main [85]
                0.00        0.00       1/1           __start [84]
-----------------------------------------------
...

   %  cumulative    self              self    total
 time   seconds   seconds    calls  ms/call  ms/call name
  0.0       0.00     0.00  5572994     0.00     0.00  _xerbla_ [154]
  0.0       0.00     0.00    20556     0.00     0.00  _dswap_ [155]
  0.0       0.00     0.00    20000     0.00     0.00  ___timctr_MOD_gettim [156]
  0.0       0.00     0.00        3     0.00     0.00  __dyld_func_lookup [157]
  0.0       0.00     0.00        2     0.00     0.00  __start [158]
  0.0       0.00     0.00  5572994     0.00     0.00  _xerbla_ [154]

Eh? That's the BLAS error handler. Something is fishy with your setup.
> Insufficient analysis. This more sounds like a dup of profile-estimate
> messed up by inlining.

Do you mean a dup of pr40106? Or are there others I am not aware of?

> eh? that's the blas error handler. something is fishy with your setup.

Which setup?
At revision 160309, I get

[macbook] lin/test% gfc -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90 --param hot-bb-frequency-fraction=1000
[macbook] lin/test% time a.out > /dev/null
32.601u 0.716s 0:33.35 99.8% 0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90 --param hot-bb-frequency-fraction=2000
[macbook] lin/test% time a.out > /dev/null
25.760u 0.708s 0:26.47 99.9% 0+0k 0+0io 0pf+0w
For what it is worth, on AMD Athlon 64 X2 4800+ / x86-64-linux, I get the following for gfortran -O3 -ffast-math -march=native, both with and without -flto:

0m45.132s -- (options as above)
0m52.731s -- additionally -fwhole-program

That's a +16% increase in run-time with -fwhole-program.
So hot-bb-frequency-fraction solves the whole regression?
[Move comment from IRC #gcc to bugzilla]

(In reply to comment #9)
> For what it is worth, on AMD Athlon 64 X2 4800+ / x86-64-linux, [...]
> That's a +16% increase in run-time with -fwhole-program.

(In reply to comment #10)
> So hot-bb-frequency-fraction solves the whole regression?

For me (cf. system above), --param hot-bb-frequency-fraction=2000 reduces the slowdown due to -fwhole-program from 16% to 3%. (The LTO version with and without -fwhole-file is about 2% slower than the corresponding -fno-lto version.)
I think this is not a gfortran bug. Marked as an LTO one.
Static profile estimation problem, to be exact. LTO is just triggering it by bringing in enough context ;)
I finally found some time to test the various solutions. The easiest is probably the following:

Index: predict.c
===================================================================
--- predict.c   (revision 168047)
+++ predict.c   (working copy)
@@ -126,7 +126,7 @@ maybe_hot_frequency_p (int freq)
   if (node->frequency == NODE_FREQUENCY_EXECUTED_ONCE
       && freq <= (ENTRY_BLOCK_PTR->frequency * 2 / 3))
     return false;
-  if (freq < BB_FREQ_MAX / PARAM_VALUE (HOT_BB_FREQUENCY_FRACTION))
+  if (freq < ENTRY_BLOCK_PTR->frequency / PARAM_VALUE (HOT_BB_FREQUENCY_FRACTION))
     return false;
   return true;
 }

It makes GCC decide on cold basic blocks based not on the innermost loop nest but on the entry block frequency, so many conditionals or EH can render a BB cold, but the fact that it is outside of very many BBs does not. Could you try if this solves the problem?
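To make the effect of the one-line change easier to see, here is a rough Python model of the threshold comparison only. It is a sketch under stated assumptions, not GCC code: the values `BB_FREQ_MAX = 10000` and a default `hot-bb-frequency-fraction` of 1000 are assumed rather than taken from this thread, the `NODE_FREQUENCY_EXECUTED_ONCE` check is left out, and the block frequencies in the example are invented.

```python
# Simplified model of the comparison changed by the patch above.
# Assumption: BB_FREQ_MAX is 10000 and the default fraction is 1000.
BB_FREQ_MAX = 10000


def hot_before_patch(freq: int, fraction: int = 1000) -> bool:
    """Old test: a block is cold if it is below BB_FREQ_MAX / fraction."""
    return not (freq < BB_FREQ_MAX // fraction)


def hot_after_patch(freq: int, entry_freq: int, fraction: int = 1000) -> bool:
    """Patched test: a block is cold if it is below entry_freq / fraction."""
    return not (freq < entry_freq // fraction)


# Hypothetical function with a deep loop nest: the innermost block sits at
# BB_FREQ_MAX, so the entry block and everything near it get tiny values.
entry_freq = 2
block_freq = 6   # a block executed more often than the entry block

print(hot_before_patch(block_freq))                 # fraction 1000
print(hot_before_patch(block_freq, fraction=2000))  # fraction 2000
print(hot_after_patch(block_freq, entry_freq))      # patched, fraction 1000
```

In this invented case the block is treated as cold by the old test at fraction 1000, but as hot either with fraction 2000 or with the patched test, which is at least consistent with the --param experiment reported earlier in this PR.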
> Could you try if this solves the problem?

The patch in comment #14 fixed the problem on x86_64-apple-darwin10 (I cannot say anything for AMD). I have run the polyhedron tests without noticing any slowdown. I'll do a clean regstrap tonight. Thanks for the patch.
The patch in comment #14 fixed the problem on x86_64-apple-darwin10, but causes the following regressions:

FAIL: gcc.dg/autopar/outer-2.c scan-tree-dump-times parloops "parallelizing outer loop" 1
FAIL: gcc.dg/autopar/outer-2.c scan-tree-dump-times optimized "loopfn" 5
FAIL: gcc.dg/tree-ssa/ldist-pr45948.c scan-tree-dump ldist "distributed: split to 3"

which disappear if I revert the patch. Note that something looks uninitialized with the patch:

[macbook] f90/bug% gcc46 -O2 -ftree-loop-distribution -fdump-tree-ldist-details -c /opt/gcc/work/gcc/testsuite/gcc.dg/tree-ssa/ldist-pr45948.c
[macbook] f90/bug% grep distributed ldist-pr45948.c.101t.ldist
Loop -1515870811 distributed: split to 2 loops.
     ^^^^

instead of

Loop 1 distributed: split to 3 loops.
For the record I have also tested the patch in comment #14 on powerpc-apple-darwin9 at revision 168070. Without the patch I get

[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
68.236u 6.947s 1:17.77 96.6% 0+0k 0+0io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
65.229u 6.838s 1:14.61 96.5% 0+0k 0+0io 0pf+0w

Note a slight slowdown with --param hot-bb-frequency-fraction=2000. With the patch I get

[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
69.690u 6.917s 1:19.44 96.4% 0+0k 0+0io 1pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
69.791u 7.225s 1:20.08 96.1% 0+0k 0+0io 0pf+0w

i.e., --param hot-bb-frequency-fraction=2000 does not change the timings, but the resulting code is slower.
Created attachment 23000 [details] assembly for gcc.dg/autopar/outer-2.c at -m32 with r168907 /Users/howarth/work/gcc/xgcc -B/Users/howarth/work/gcc/ /Users/howarth/gcc-4.6-20110116/gcc/testsuite/gcc.dg/autopar/outer-2.c -O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized -S -m32 -o outer-2.s
Created attachment 23001 [details] assembly for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14
Created attachment 23002 [details] parloops for gcc.dg/autopar/outer-2.c at -m32 with r168907
Created attachment 23003 [details] parloops for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14
Created attachment 23004 [details] optimized for gcc.dg/autopar/outer-2.c at -m32 with r168907
Created attachment 23005 [details] optimized for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14
PR 43884 has a similar problem with deep loop nests.
Author: hubicka
Date: Sat Jan 22 21:47:40 2011
New Revision: 169136

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169136
Log:
    PR tree-optimization/43884
    PR lto/44334
    * predict.c (maybe_hot_frequency_p): Use entry block frequency as an base.
    * doc/invoke.texi (hot-bb-frequency-fraction): Update docs.
    * gcc.dg/autopar/outer-2.c: Increase array size.
    * gcc.dg/tree-ssa/ldist-pr45948.c: Update test.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/predict.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/autopar/outer-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-pr45948.c
OK, I committed the branch prediction change. I am a bit confused by the rest of the trail; can you please confirm whether the problem is fixed in all the configurations mentioned?
On x86_64-apple-darwin10 at r169137, the pb05 benchmarks compiled with

benchmark    -O3 -ffast-math    -O3 -ffast-math -funroll-loops    %change
             -funroll-loops     -flto -fwhole-program

ac                8.81                8.81                           0.0
aermod           17.30               17.50                           1.2
air               5.62                5.57                          -0.9
capacita         32.77               33.35                           1.8
channel           1.89                1.89                           0.0
doduc            26.58               26.52                          -0.2
fatigue           8.37                8.36                          -0.1
gas_dyn           4.36                4.35                          -0.2
induct           13.05               13.04                          -0.1
linpk            17.15               17.05                          -0.6
mdbx             11.25               11.26                           0.1
nf               32.14               33.50                           4.2
protein          32.50               32.27                          -0.7
rnflow           24.11               24.84                           3.0
test_fpu          8.22                8.20                          -0.2
tfft              1.89                1.88                          -0.5

Geometric        11.07               11.11                           0.4
Mean
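For anyone wanting to re-derive the summary row, the geometric mean can be recomputed from the per-benchmark times in the table. The following Python sketch is only illustrative; the helper name `geometric_mean` and the list names are mine, and the numbers are copied from the two timing columns above.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logarithms (all values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Run times copied from the table above, in row order (ac ... tfft).
times_plain = [8.81, 17.30, 5.62, 32.77, 1.89, 26.58, 8.37, 4.36,
               13.05, 17.15, 11.25, 32.14, 32.50, 24.11, 8.22, 1.89]
times_lto_wp = [8.81, 17.50, 5.57, 33.35, 1.89, 26.52, 8.36, 4.35,
                13.04, 17.05, 11.26, 33.50, 32.27, 24.84, 8.20, 1.88]

gm_plain = geometric_mean(times_plain)
gm_lto_wp = geometric_mean(times_lto_wp)
change = (gm_lto_wp - gm_plain) / gm_plain * 100.0

print(f"geometric mean, plain:  {gm_plain:.2f}")
print(f"geometric mean, lto+wp: {gm_lto_wp:.2f}")
print(f"change: {change:+.1f}%")
```

The geometric mean is used here rather than the arithmetic mean so that long-running tests such as capacita or nf do not dominate the summary.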
According to http://gcc.gnu.org/ml/gcc-regression/2011-01/msg00375.html revision 169136 caused a bootstrap failure on powerpc-apple-darwin9.8.0:

....
/Users/regress/tbox/native/build/./prev-gcc/xgcc -B/Users/regress/tbox/native/build/./prev-gcc/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/bin/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/bin/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/lib/ -isystem /Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/include -isystem /Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/sys-include -c -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC -W -Wall -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -Wold-style-definition -Wc++-compat -fno-common -DHAVE_CONFIG_H -I. -I. -I/Users/regress/tbox/svn-gcc/gcc -I/Users/regress/tbox/svn-gcc/gcc/. -I/Users/regress/tbox/svn-gcc/gcc/../include -I./../intl -I/Users/regress/tbox/svn-gcc/gcc/../libcpp/include -I/Users/regress/tbox/svn-gcc/gcc/../libdecnumber -I/Users/regress/tbox/svn-gcc/gcc/../libdecnumber/dpd -I../libdecnumber /Users/regress/tbox/svn-gcc/gcc/compare-elim.c -o compare-elim.o
/Users/regress/tbox/svn-gcc/gcc/compare-elim.c: In function 'maybe_select_cc_mode':
/Users/regress/tbox/svn-gcc/gcc/compare-elim.c:407:58: error: unused parameter 'b' [-Werror=unused-parameter]
cc1: all warnings being treated as errors
From http://gcc.gnu.org/ml/gcc-patches/2011-01/msg01607.html the bootstrap failure seems rather to be due to revision 169131. Note that revision 169142 bootstrapped on x86_64-apple-darwin10 configured with --enable-checking=release.
Concerning the timings in comment #27, they may reflect the fact that the inliner is not aggressive enough for Fortran codes and that this gets worse when using -flto. For rnflow.f90 I get

26.75s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer
26.66s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600
27.60s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
27.14s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto
26.79s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=2000 -fwhole-program -flto

The result is more spectacular for fatigue.f90:

8.50s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto
4.69s with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=2000 -fwhole-program -flto

Note that revision 169136 seems to require higher values of -finline-limit: before it, 600 was sufficient to see the speed-up (I have reported that in another PR); now the required value has increased (I have not tried values lower than 2000 yet).
The relevant PR for comment #30 is pr45810 comment #9. The threshold for fatigue.f90 was 322 before revision 169136 and is now 1520 (~x5).
> The relevant pr for comment #30 is pr45810 comment #9. The threshold for
> fatigue.f90 was 322 before revision 169136 and is now 1520 (~x5).

Interesting. Do you know what function we fail to inline? Can you attach the ipa-inline dump from both settings? I know that c-ray also wants increased inline limits. I can increase them a bit, but not by a factor of 5, since that would cause a code size explosion at -O3. (I did some tests on this two weeks ago.)

Honza
Please use -fdump-ipa-inline-details to generate the dump. Perhaps we just miscompute function body size somehow.
The pretty obvious fix to the compare-elim issue is adding ATTRIBUTE_UNUSED to the b parameter. It is used by the SELECT_CC_MODE macro, which is defined to not use it by default.

Honza
> Do you know what function we fail to inline?

It is generalized_hookes_law. I have looked at fatigue.f90 in more detail. With revision 168741, I see the transitions:

9.25s for inline-limit < 214
6.50s for 213 < inline-limit < 322
4.76s for 321 < inline-limit

With revision 169142, I see

9.25s for inline-limit < 214
6.50s for 213 < inline-limit < 322
8.48s for 321 < inline-limit < 1520
4.70s for 1519 < inline-limit

Indeed I may have missed other thresholds (especially in the range 322--1519). I have dumps for values below and above the thresholds (10 of them). Do you want them all, or only a subset? In the latter case, which ones?
Created attachment 23086 [details] bzip2 compressed ipa-inline-details dump without -finline-limit generated at r169137 on x86_64-apple-darwin10 with... gfortran -O3 -ffast-math -funroll-loops -fdump-ipa-inline-details -flto -fwhole-program ../fatigue.f90 -o ../fatigue
Created attachment 23087 [details] bzip2 compressed ipa-inline-details dump with -finline-limit=600 generated at r169137 on x86_64-apple-darwin10 with... gfortran -O3 -ffast-math -funroll-loops -finline-limit=600 -fdump-ipa-inline-details -flto -fwhole-program ../fatigue.f90 -o ../fatigue
Created attachment 23088 [details] bzip2 compressed ipa-inline-details dump with -finline-limit=2000 generated at r169137 on x86_64-apple-darwin10 with... gfortran -O3 -ffast-math -funroll-loops -finline-limit=2000 -fdump-ipa-inline-details -flto -fwhole-program ../fatigue.f90 -o ../fatigue
Created attachment 23089 [details] -finline-limit=321 revision 168741 bzip2 fatigue.f90.048i.inline generated at revision 168741 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=321 -fwhole-program -flto -fdump-ipa-inline-details
Created attachment 23090 [details] -finline-limit=322 revision 168741 bzip2 fatigue.f90.048i.inline generated at revision 168741 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=322 -fwhole-program -flto -fdump-ipa-inline-details
Created attachment 23091 [details] -finline-limit=321 revision 169142 bzip2 fatigue.f90.048i.inline generated at revision 169142 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=321 -fwhole-program -flto -fdump-ipa-inline-details
Created attachment 23092 [details] -finline-limit=322 revision 169142 bzip2 fatigue.f90.048i.inline generated at revision 169142 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=322 -fwhole-program -flto -fdump-ipa-inline-details
On x86_64-apple-darwin10 at r169137, the pb05 benchmarks compiled with

benchmark    -O3 -ffast-math -funroll-loops    -O3 -ffast-math -funroll-loops               %change
             -flto -fwhole-program             -finline-limit=2000 -flto -fwhole-program

ac                8.81                               7.30                                    -17.1
aermod           17.50                              17.43                                     -0.4
air               5.57                               5.57                                      0.0
capacita         33.35                              31.86                                     -4.5
channel           1.89                               1.76                                     -6.9
doduc            26.52                              25.15                                     -5.2
fatigue           8.36                               4.21                                    -49.6
gas_dyn           4.35                               4.28                                     -1.6
induct           13.04                              13.05                                      0.1
linpk            17.05                              17.31                                      1.5
mdbx             11.26                              11.26                                      0.0
nf               33.50                              30.97                                     -7.6
protein          32.27                              32.62                                      1.1
rnflow           24.84                              24.16                                     -2.7
test_fpu          8.20                               8.90                                      8.5
tfft              1.88                               1.94                                      3.2

Geometric        11.11                              10.42                                     -6.2
Mean
Testing...

Index: gcc/params.def
===================================================================
--- gcc/params.def      (revision 169185)
+++ gcc/params.def      (working copy)
@@ -182,7 +182,7 @@ DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
 DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
         "large-function-growth",
         "Maximal growth due to inlining of large function (in percent)",
-        100, 0, 0)
+        400, 0, 0)
 DEFPARAM(PARAM_LARGE_UNIT_INSNS,
         "large-unit-insns",
         "The size of translation unit to be considered large",

shows a major improvement only for fatigue (30%). The same improvement can be achieved at -m32 and -m64 with just an increase of large-function-growth to 200.
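To put the values 100, 200 and 400 into perspective: as I read the parameter description ("Maximal growth due to inlining of large function (in percent)"), the value bounds how much a large function may grow relative to its original size. The Python sketch below only illustrates that arithmetic. The helper name and the 3000-instruction function are hypothetical, and the real inliner applies further conditions (for example, which functions count as large in the first place) that are not modelled here.

```python
def max_size_after_inlining(original_insns: int, growth_percent: int) -> int:
    """Upper bound on a large function's size if it may grow by growth_percent."""
    return original_insns + original_insns * growth_percent // 100


# Hypothetical large function of 3000 instructions (made-up figure).
original = 3000
for growth in (100, 200, 400):
    limit = max_size_after_inlining(original, growth)
    print(f"large-function-growth={growth}: up to {limit} insns "
          f"({limit / original:.1f}x the original size)")
```

Under this reading, the old default of 100 allows a large function to double in size, 200 allows it to triple, and the tested value of 400 allows a fivefold increase.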
On Tue, 25 Jan 2011, howarth at nitro dot med.uc.edu wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44334
>
> --- Comment #44 from Jack Howarth <howarth at nitro dot med.uc.edu> 2011-01-25 03:13:39 UTC ---
> Testing...
>
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def (revision 169185)
> +++ gcc/params.def (working copy)
> @@ -182,7 +182,7 @@ DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
>  DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
>  "large-function-growth",
>  "Maximal growth due to inlining of large function (in percent)",
> - 100, 0, 0)
> + 400, 0, 0)
>  DEFPARAM(PARAM_LARGE_UNIT_INSNS,
>  "large-unit-insns",
>  "The size of translation unit to be considered large",
>
> shows only a major improvement for fatigue (30%). This same improvement can be
> achieved at -m32 and -m64 with just an increase of large-function-growth to
> 200.

We certainly won't adjust params at this stage. There are other cases (that c-ray one) where more aggressive inlining helps, but we should avoid regressing for -O2 and only tune -O3 params eventually.
I sorted out increasing the large function growth ratio as the safest way to deal with (the easier half of) this problem. Unlike the parameters for inline limits, it won't cause code size issues. It just allows somewhat bigger functions and thus stresses the backend more on its linearity. Given that the parameter was never tuned since its inclusion in GCC 4.2, I guess we are not terribly sensitive here. We also improved scalability a bit here, as I tuned the df code a bit for LTO and spaghetti code. Otherwise we need to wait for 4.7 or possibly 4.6.1. That is fine with me. I will still run tests tonight on how increasing the parameter affects our tester. This is the first time I see it hit in a performance-sensitive way. Not sure how common it is in practice since I never really tried to change it.
> I sorted out increasing large function growth ratio as most safe way
> to deal with (easier half of) this problem. Unlike the parameters for
> inline limits it won't cause code size issues. It just allow somewhat
> bigger functions and thus stress more the backend on its linearity.

Well, the choice is not '-finline-limit' versus '--param large-function-growth': some polyhedron tests are sensitive to some value of '-finline-limit' (ac, channel, fatigue, ...) and for most of them '--param large-function-growth' does not change anything.

fatigue is quite peculiar in that there is a big speed-up with -fwhole-program for -finline-limit>=322 and an additional small speed-up for --param large-function-growth>=132. In addition the latter prevents a bad choice with -flto (this should probably be discussed in pr 45810 and this pr closed as fixed).

Note that I am not interested in fine tuning, but in finding some acceptable values of the default parameters that give good results for all (most;-) Fortran codes.
(In reply to comment #47)
> Well, the choice is not '-finline-limit' versus '--param
> large-function-growth': some polyhedron tests are sensitive to some value of
> '-finline-limit' (ac, channel, fatigue, ...) and for most of them '--param
> large-function-growth' does not change anything.
>
> fatigue is quite peculiar in that there is a big speed-up with -fwhole-program
> for -finline-limit>=322 and an additional small speed-up for --param
> large-function-growth>=132. In addition the later prevent a bad choice with
> -flto (this should probably be discussed in pr 45810 and this pr closed as
> fixed).
>
> Note that I am not interested by fine tuning, but to find some acceptable
> values of the default parameters that give good results for all (most;-)
> fortran codes).

In my tests, --param large-function-growth=200 was sufficient to yield 60% of the performance increase in the fatigue benchmark obtained by modifying both -finline-limit and --param large-function-growth. Unlike increasing -finline-limit, none of the other pb05 benchmarks showed even minor regressions in speed.
Since it seems that revision 169136 is the "right fix", I am closing this PR as fixed. Any further discussion about the interaction between -fwhole-program and -flto for the polyhedron test fatigue.f90 should take place in pr45810.