44334 – rnflow.f90 ~27% slower with -fwhole-program -flto after revision 159852


Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	lto
Version:	4.6.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:

Reported:	2010-05-30 17:17 UTC by Dominique d'Humieres
Modified:	2021-12-26 12:35 UTC
CC List:	6 users

See Also:	47033
Host:	x86_64-apple-darwin10
Target:	x86_64-apple-darwin10
Build:	x86_64-apple-darwin10
Known to work:
Known to fail:
Last reconfirmed:	2010-12-19 11:43:48

Attachments
Assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159851 (341.28 KB, application/octet-stream) 2010-05-30 18:10 UTC, Dominique d'Humieres	Details
Assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159852 (339.09 KB, application/octet-stream) 2010-05-30 18:12 UTC, Dominique d'Humieres	Details
assembly for gcc.dg/autopar/outer-2.c at -m32 with r168907 (1.71 KB, text/plain) 2011-01-17 21:12 UTC, Jack Howarth	Details
assembly for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14 (1.71 KB, text/plain) 2011-01-17 21:13 UTC, Jack Howarth	Details
parloops for gcc.dg/autopar/outer-2.c at -m32 with r168907 (5.93 KB, application/octet-stream) 2011-01-17 21:14 UTC, Jack Howarth	Details
parloops for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14 (714 bytes, text/plain) 2011-01-17 21:14 UTC, Jack Howarth	Details
optimized for gcc.dg/autopar/outer-2.c at -m32 with r168907 (2.04 KB, text/plain) 2011-01-17 21:15 UTC, Jack Howarth	Details
optimized for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14 (1023 bytes, text/plain) 2011-01-17 21:16 UTC, Jack Howarth	Details
bzip2 compressed ipa-inline-details dump without -finline-limit (132.88 KB, application/octet-stream) 2011-01-23 15:45 UTC, Jack Howarth	Details
bzip2 compressed ipa-inline-details dump with -finline-limit=600 (178.68 KB, application/octet-stream) 2011-01-23 15:47 UTC, Jack Howarth	Details
bzip2 compressed ipa-inline-details dump with -finline-limit=2000 (205.11 KB, application/octet-stream) 2011-01-23 15:49 UTC, Jack Howarth	Details
-finline-limit=321 revision 168741 (171.61 KB, application/octet-stream) 2011-01-23 16:32 UTC, Dominique d'Humieres	Details
-finline-limit=322 revision168741 (178.41 KB, application/octet-stream) 2011-01-23 16:33 UTC, Dominique d'Humieres	Details
-finline-limit=321 revision 169142 (172.65 KB, application/octet-stream) 2011-01-23 16:35 UTC, Dominique d'Humieres	Details
-finline-limit=322 revision 169142 (180.20 KB, application/octet-stream) 2011-01-23 16:36 UTC, Dominique d'Humieres	Details

Description Dominique d'Humieres 2010-05-30 17:17:22 UTC

After revision 159852

Author:	pault
Date:	Wed May 26 05:11:04 2010 UTC (4 days, 12 hours ago)
Changed paths:	4
Log Message:	
2010-05-26  Paul Thomas  <pault@gcc.gnu.org>

	PR fortran/40011
	* resolve.c (resolve_global_procedure): Resolve the gsymbol's
	namespace before trying to reorder the gsymbols.

2010-05-26  Paul Thomas  <pault@gcc.gnu.org>

	PR fortran/40011
	* gfortran.dg/whole_file_19.f90 : New test.

the executable of the polyhedron test rnflow.f90 is ~27% slower when compiled with -fwhole-program -flto:

[macbook] lin/test% gfcpf -v
Using built-in specs.
COLLECT_GCC=gfcpf
COLLECT_LTO_WRAPPER=/opt/gcc/gcc4.6pf/libexec/gcc/x86_64-apple-darwin10/4.6.0/lto-wrapper
Target: x86_64-apple-darwin10
Configured with: ../p_work/configure --prefix=/opt/gcc/gcc4.6pf --mandir=/opt/gcc/gcc4.6pf/share/man --infodir=/opt/gcc/gcc4.6pf/share/info --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --target=x86_64-apple-darwin10 --enable-languages=c,fortran --with-gmp=/opt/sw64 --with-libiconv-prefix=/opt/sw64 --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --with-cloog=/opt/sw64 --with-ppl=/opt/sw64 --with-mpc=/opt/sw64 --enable-lto
Thread model: posix
gcc version 4.6.0 20100526 (experimental) [trunk revision 159851] (GCC) 
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer rnflow.f90 
[macbook] lin/test% time a.out > /dev/null
25.826u 0.686s 0:26.52 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.506u 0.674s 0:26.19 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcpf -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.772u 0.678s 0:26.46 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -v
Using built-in specs.
COLLECT_GCC=gfcp
COLLECT_LTO_WRAPPER=/opt/gcc/gcc4.6p/libexec/gcc/x86_64-apple-darwin10/4.6.0/lto-wrapper
Target: x86_64-apple-darwin10
Configured with: ../p_work/configure --prefix=/opt/gcc/gcc4.6p --mandir=/opt/gcc/gcc4.6p/share/man --infodir=/opt/gcc/gcc4.6p/share/info --build=x86_64-apple-darwin10 --host=x86_64-apple-darwin10 --target=x86_64-apple-darwin10 --enable-languages=c,fortran --with-gmp=/opt/sw64 --with-libiconv-prefix=/opt/sw64 --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib --with-cloog=/opt/sw64 --with-ppl=/opt/sw64 --with-mpc=/opt/sw64 --enable-lto
Thread model: posix
gcc version 4.6.0 20100526 (experimental) [trunk revision 159852] (GCC) 
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.841u 0.696s 0:26.54 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
25.540u 0.677s 0:26.22 99.9%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90
[macbook] lin/test% time a.out > /dev/null
32.627u 0.685s 0:33.31 99.9%	0+0k 0+0io 0pf+0w             <---  ~27% slower

As noticed previously, the executable of fatigue.f90 is ~30% faster when compiled with -fwhole-program:

[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-file -flto fatigue.f90
[macbook] lin/test% time a.out > /dev/null
9.031u 0.006s 0:09.04 99.8%	0+0k 0+1io 0pf+0w
[macbook] lin/test% gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program fatigue.f90
[macbook] lin/test% time a.out > /dev/null
6.448u 0.004s 0:06.47 99.5%	0+0k 0+1io 0pf+0w
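
The ~27% figure can be checked directly from the user times printed by `time` above. The following is a minimal Python sketch; the helper name `slowdown_percent` is only illustrative and is not part of any tool used in this report.

```python
def slowdown_percent(baseline_s, regressed_s):
    """Relative run-time increase of regressed_s over baseline_s, in percent."""
    return (regressed_s - baseline_s) / baseline_s * 100.0

# User times in seconds for rnflow.f90 built with -fwhole-program -flto,
# copied from the two sessions above.
user_time_r159851 = 25.772
user_time_r159852 = 32.627

print(f"{slowdown_percent(user_time_r159851, user_time_r159852):.1f}% slower")
```

With these inputs the increase comes out slightly under 27%, consistent with the summary line.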

Comment 1 Dominique d'Humieres 2010-05-30 18:06:01 UTC

I'll attach the assembly generated with -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto for revisions 159851 and 159852. It is the same with/without -fwhole-program (probably obvious), however when assembled and linked with 

gfcp -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow_wp5*.s

the timing depends on the revision used to generate the assembly, but not on the compiler revision.

Comment 2 Richard Biener 2010-05-30 18:09:19 UTC

Insufficient analysis.  This sounds more like a dup of profile-estimate
messed up by inlining.

Comment 3 Dominique d'Humieres 2010-05-30 18:10:52 UTC

Created attachment 20780 [details]
Assembly generated with  -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159851

Comment 4 Dominique d'Humieres 2010-05-30 18:12:07 UTC

Created attachment 20781 [details]
Assembly generated with  -O3 -ffast-math -funroll-loops -fomit-frame-pointer -flto and revision 159852

Comment 5 Dominique d'Humieres 2010-05-30 18:30:58 UTC

Output of gprof on darwin:

Revision 159851:

				  called/total       parents 
index  %time    self descendents  called+self    name    	index
				  called/total       children

				  520605             _dgetf2_ [81]
		0.00        0.00      64/1041192     ___timctr_MOD_gettim [1429]
		0.00        0.00    6548/1041192     _dswap_ [4112]
		0.00        0.00 1034580/1041192     _xerbla_ [83]
[81]     0.0    0.00        0.00 1041192+520605 _dgetf2_ [81]
		0.00        0.00   64137/110864      _dgetrf_ [82]
				  520605             _dgetf2_ [81]

-----------------------------------------------

				   13315             _dgetrf_ [82]
		0.00        0.00       8/110864      ___timctr_MOD_gettim [1429]
		0.00        0.00    6548/110864      _dswap_ [4112]
		0.00        0.00    6685/110864      __dyld_func_lookup [1665]
		0.00        0.00   33486/110864      _xerbla_ [83]
		0.00        0.00   64137/110864      _dgetf2_ [81]
[82]     0.0    0.00        0.00  110864+13315  _dgetrf_ [82]
		0.00        0.00       1/1           _main [85]
				   13315             _dgetrf_ [82]

-----------------------------------------------

		0.00        0.00   10872/10872       _dswap_ [4112]
[83]     0.0    0.00        0.00   10872         _xerbla_ [83]
		0.00        0.00 1034580/1041192     _dgetf2_ [81]
		0.00        0.00   33486/110864      _dgetrf_ [82]

-----------------------------------------------

		0.00        0.00       1/1           _main [85]
[84]     0.0    0.00        0.00       1         __start [84]

-----------------------------------------------

		0.00        0.00       1/1           _dgetrf_ [82]
[85]     0.0    0.00        0.00       1         _main [85]
		0.00        0.00       1/1           __start [84]

-----------------------------------------------

...

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  0.0       0.00     0.00  1561733     0.00     0.00  _dgetf2_ [81]
  0.0       0.00     0.00   110927     0.00     0.00  _dgetrf_ [82]
  0.0       0.00     0.00    10872     0.00     0.00  _xerbla_ [83]
  0.0       0.00     0.00        1     0.00     0.00  __start [84]
  0.0       0.00     0.00        1     0.00     0.00  _main [85]

================================================================================

Revision 159852:

				  called/total       parents 
index  %time    self descendents  called+self    name    	index
				  called/total       children

		0.00        0.00    6548/1561733     _dswap_ [4112]
		0.00        0.00 1555185/1561733     _xerbla_ [83]
[81]     0.0    0.00        0.00 1561733         _dgetf2_ [81]
		0.00        0.00   64136/110927      _dgetrf_ [82]

-----------------------------------------------

				   13315             _dgetrf_ [82]
		0.00        0.00      72/110927      ___timctr_MOD_gettim [1429]
		0.00        0.00    6548/110927      _dswap_ [4112]
		0.00        0.00    6685/110927      __dyld_func_lookup [1665]
		0.00        0.00   33486/110927      _xerbla_ [83]
		0.00        0.00   64136/110927      _dgetf2_ [81]
[82]     0.0    0.00        0.00  110927+13315  _dgetrf_ [82]
		0.00        0.00       1/1           _main [85]
				   13315             _dgetrf_ [82]

-----------------------------------------------

		0.00        0.00   10872/10872       _dswap_ [4112]
[83]     0.0    0.00        0.00   10872         _xerbla_ [83]
		0.00        0.00 1555185/1561733     _dgetf2_ [81]
		0.00        0.00   33486/110927      _dgetrf_ [82]

-----------------------------------------------

		0.00        0.00       1/1           _main [85]
[84]     0.0    0.00        0.00       1         __start [84]

-----------------------------------------------

		0.00        0.00       1/1           _dgetrf_ [82]
[85]     0.0    0.00        0.00       1         _main [85]
		0.00        0.00       1/1           __start [84]

-----------------------------------------------

...

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
  0.0       0.00     0.00  5572994     0.00     0.00  _xerbla_ [154]
  0.0       0.00     0.00    20556     0.00     0.00  _dswap_ [155]
  0.0       0.00     0.00    20000     0.00     0.00  ___timctr_MOD_gettim [156]
  0.0       0.00     0.00        3     0.00     0.00  __dyld_func_lookup [157]
  0.0       0.00     0.00        2     0.00     0.00  __start [158]

Comment 6 Richard Biener 2010-05-30 18:48:53 UTC

 0.0       0.00     0.00  5572994     0.00     0.00  _xerbla_ [154]

eh?  that's the blas error handler.  something is fishy with your setup.

Comment 7 Dominique d'Humieres 2010-05-30 18:55:09 UTC

> Insufficient analysis.  This sounds more like a dup of profile-estimate
> messed up by inlining.

Do you mean a dup of pr40106? Or are there others I am not aware of?

> eh?  that's the blas error handler.  something is fishy with your setup.

Which setup?

Comment 8 Dominique d'Humieres 2010-06-05 09:52:18 UTC

At revision 160309, I get

[macbook] lin/test% gfc -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90 --param hot-bb-frequency-fraction=1000
[macbook] lin/test% time a.out > /dev/null
32.601u 0.716s 0:33.35 99.8%	0+0k 0+0io 0pf+0w
[macbook] lin/test% gfc -O3 -ffast-math -funroll-loops -fomit-frame-pointer -fwhole-program -flto rnflow.f90 --param hot-bb-frequency-fraction=2000
[macbook] lin/test% time a.out > /dev/null
25.760u 0.708s 0:26.47 99.9%	0+0k 0+0io 0pf+0w

Comment 9 Tobias Burnus 2010-09-08 21:00:10 UTC

For what it is worth, on AMD Athlon 64 X2 4800+ / x86-64-linux, I get for
gfortran -O3 -ffast-math -march=native -- both with and without -flto:
 0m45.132s -- (options as above)
 0m52.731s -- additionally -fwhole-program

That's a +16% increase in run-time with -fwhole-program.

Comment 10 Jan Hubicka 2010-09-08 21:04:04 UTC

So hot-bb-frequency-fraction solves the whole regression?

Comment 11 Tobias Burnus 2010-09-09 09:00:45 UTC

[Move comment from IRC #gcc to bugzilla]

(In reply to comment #9)
> For what it is worth, on AMD Athlon 64 X2 4800+ / x86-64-linux, [...]
> That's a +16% increase in run-time with -fwhole-program.

(In reply to comment #10)
> So hot-bb-frequency-fraction solves the whole regression?

For me (cf. system above), --param hot-bb-frequency-fraction=2000 reduces the slow down due to -fwhole-program from 16% to 3%. (The LTO version with and without -fwhole-file is about 2% slower than the corresponding -fno-lto version.)

Comment 12 Dominique d'Humieres 2010-11-15 13:29:24 UTC

I think this is not a gfortran bug. Marked as an LTO one.

Comment 13 Jan Hubicka 2010-11-15 15:22:19 UTC

Static profile estimation problem, to be exact. LTO is just triggering it by bringing in enough context ;)

Comment 14 Jan Hubicka 2010-12-19 11:43:48 UTC

I finally found some time to test the various solutions. The easiest is probably the following:
Index: predict.c
===================================================================
--- predict.c   (revision 168047)
+++ predict.c   (working copy)
@@ -126,7 +126,7 @@ maybe_hot_frequency_p (int freq)
   if (node->frequency == NODE_FREQUENCY_EXECUTED_ONCE
       && freq <= (ENTRY_BLOCK_PTR->frequency * 2 / 3))
     return false;
-  if (freq < BB_FREQ_MAX / PARAM_VALUE (HOT_BB_FREQUENCY_FRACTION))
+  if (freq < ENTRY_BLOCK_PTR->frequency / PARAM_VALUE (HOT_BB_FREQUENCY_FRACTION))
     return false;
   return true;
 }
It makes GCC decide which basic blocks are cold based on the entry block frequency rather than on the innermost loop nest. Many conditionals or EH will still render a BB cold, but the mere fact that it sits outside very many loops will not.

Could you try if this solves the problem?
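
To make the effect of the one-line change concrete, here is a toy Python model of the last test in maybe_hot_frequency_p before and after the patch. It is only a sketch under stated assumptions: BB_FREQ_MAX is taken to be 10000 and the default of hot-bb-frequency-fraction to be 1000, the NODE_FREQUENCY_EXECUTED_ONCE check is left out, and the block frequencies used below are made up for illustration rather than taken from rnflow.f90.

```python
BB_FREQ_MAX = 10000               # assumed value of GCC's BB_FREQ_MAX
HOT_BB_FREQUENCY_FRACTION = 1000  # assumed default of the param

def maybe_hot_old(freq, fraction=HOT_BB_FREQUENCY_FRACTION):
    """Pre-patch test: compare against the hottest possible block."""
    return not (freq < BB_FREQ_MAX // fraction)

def maybe_hot_new(freq, entry_freq, fraction=HOT_BB_FREQUENCY_FRACTION):
    """Patched test: compare against the function's entry block."""
    return not (freq < entry_freq // fraction)

# Hypothetical function with a deep loop nest: the innermost block is scaled
# to BB_FREQ_MAX, the entry block ends up at 1, an outer-loop block at 8.
outer_loop_block, entry_block = 8, 1

print(maybe_hot_old(outer_loop_block))                 # cold before the patch
print(maybe_hot_old(outer_loop_block, fraction=2000))  # hot with the workaround
print(maybe_hot_new(outer_loop_block, entry_block))    # hot after the patch
```

In this model a block executed more often than the entry block can no longer be classified as cold merely because an inner loop is much hotter, while a rarely taken branch in a loop-free function (for example frequency 5 against an entry frequency of 10000) is still cold.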

Comment 15 Dominique d'Humieres 2010-12-19 13:37:17 UTC

> Could you try if this solves the problem?

The patch in comment #14 fixed the problem on x86_64-apple-darwin10 (I cannot say anything for AMD). I have run the polyhedron tests without noticing any slow down. I'll do a clean regstrap tonight. Thanks for the patch.

Comment 16 Dominique d'Humieres 2010-12-20 08:57:32 UTC

The patch in comment #14 fixed the problem on x86_64-apple-darwin10, but causes the following regressions:

FAIL: gcc.dg/autopar/outer-2.c scan-tree-dump-times parloops "parallelizing outer loop" 1
FAIL: gcc.dg/autopar/outer-2.c scan-tree-dump-times optimized "loopfn" 5
FAIL: gcc.dg/tree-ssa/ldist-pr45948.c scan-tree-dump ldist "distributed: split to 3"

which disappear if I revert the patch. Note that something looks uninitialized with the patch:

[macbook] f90/bug% gcc46 -O2 -ftree-loop-distribution -fdump-tree-ldist-details -c /opt/gcc/work/gcc/testsuite/gcc.dg/tree-ssa/ldist-pr45948.c
[macbook] f90/bug% grep distributed ldist-pr45948.c.101t.ldist
Loop -1515870811 distributed: split to 2 loops.
          ^^^^
instead of

Loop 1 distributed: split to 3 loops.
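
The bogus loop number is suggestive: read as a 32-bit signed integer, it is a word in which every byte is 0xa5. The short Python check below shows this. That the value comes from reading poisoned or uninitialized memory is an assumption on my part and is not established anywhere in this report.

```python
import struct

loop_number = -1515870811  # value printed in the ldist dump above

# Pack as a little-endian 32-bit signed integer and look at the raw bytes.
raw = struct.pack("<i", loop_number)
print(raw.hex())
```

All four bytes are identical, so the result does not depend on byte order.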

Comment 17 Dominique d'Humieres 2010-12-21 10:46:06 UTC

For the record I have also tested the patch in comment #14 on powerpc-apple-darwin9 at revision 168070. Without the patch I get

[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
68.236u 6.947s 1:17.77 96.6%	0+0k 0+0io 0pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
65.229u 6.838s 1:14.61 96.5%	0+0k 0+0io 0pf+0w

Note a slight slowdown with --param hot-bb-frequency-fraction=2000. With the patch I get

[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 --param hot-bb-frequency-fraction=2000 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
69.690u 6.917s 1:19.44 96.4%	0+0k 0+0io 1pf+0w
[karma] lin/test% gfc -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto rnflow.f90
[karma] lin/test% time a.out > /dev/null
69.791u 7.225s 1:20.08 96.1%	0+0k 0+0io 0pf+0w

i.e., --param hot-bb-frequency-fraction=2000 does not change the timings, but the resulting code is slower.

Comment 18 Jack Howarth 2011-01-17 21:12:03 UTC

Created attachment 23000 [details]
assembly for gcc.dg/autopar/outer-2.c at -m32 with r168907

/Users/howarth/work/gcc/xgcc -B/Users/howarth/work/gcc/ /Users/howarth/gcc-4.6-20110116/gcc/testsuite/gcc.dg/autopar/outer-2.c -O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized -S -m32 -o outer-2.s

Comment 19 Jack Howarth 2011-01-17 21:13:36 UTC

Created attachment 23001 [details]
assembly for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14

Comment 20 Jack Howarth 2011-01-17 21:14:10 UTC

Created attachment 23002 [details]
parloops for gcc.dg/autopar/outer-2.c at -m32 with r168907

Comment 21 Jack Howarth 2011-01-17 21:14:42 UTC

Created attachment 23003 [details]
parloops for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14

Comment 22 Jack Howarth 2011-01-17 21:15:38 UTC

Created attachment 23004 [details]
optimized for gcc.dg/autopar/outer-2.c at -m32 with r168907

Comment 23 Jack Howarth 2011-01-17 21:16:22 UTC

Created attachment 23005 [details]
optimized for gcc.dg/autopar/outer-2.c at -m32 with patch from comment 14

Comment 24 Jan Hubicka 2011-01-22 16:23:49 UTC

PR 43884 has a similar problem with deep loop nests.

Comment 25 Jan Hubicka 2011-01-22 21:47:43 UTC

Author: hubicka
Date: Sat Jan 22 21:47:40 2011
New Revision: 169136

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=169136
Log:
	PR tree-optimization/43884
	PR lto/44334
	* predict.c (maybe_hot_frequency_p): Use entry block frequency as an base.
	* doc/invoke.texi (hot-bb-frequency-fraction): Update docs.
	* gcc.dg/autopar/outer-2.c: Increase array size.
	* gcc.dg/tree-ssa/ldist-pr45948.c: Update test.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/predict.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/autopar/outer-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-pr45948.c

Comment 26 Jan Hubicka 2011-01-22 21:49:19 UTC

OK,
I committed the branch prediction change.  I am a bit confused by the rest of the trail; can you please confirm whether the problem is fixed in all the configurations mentioned?

Comment 27 Jack Howarth 2011-01-23 03:36:02 UTC

On x86_64-apple-darwin10 at r169137, the pb05 benchmarks compiled with

benchmark  -O3 -ffast-math  -O3 -ffast-math -funroll-loops   %change
           -funroll-loops   -flto -fwhole-program

ac            8.81            8.81                            0.0
aermod       17.30           17.50                            1.2 
air           5.62            5.57                           -0.9
capacita     32.77           33.35                            1.8
channel       1.89            1.89                            0.0
doduc        26.58           26.52                           -0.2
fatigue       8.37            8.36                           -0.1
gas_dyn       4.36            4.35                           -0.2
induct       13.05           13.04                           -0.1
linpk        17.15           17.05                           -0.6
mdbx         11.25           11.26                            0.1
nf           32.14           33.50                            4.2
protein      32.50           32.27                           -0.7
rnflow       24.11           24.84                            3.0
test_fpu      8.22            8.20                           -0.2
tfft          1.89            1.88                           -0.5

Geometric    11.07           11.11                            0.4
Mean
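
For readers who want to reproduce the summary row, the following Python sketch recomputes the geometric means and one per-benchmark %change from the numbers in the table above. The function names are illustrative only; the timings are copied from the table.

```python
import math

# (benchmark, seconds without LTO, seconds with -flto -fwhole-program)
timings = [
    ("ac", 8.81, 8.81), ("aermod", 17.30, 17.50), ("air", 5.62, 5.57),
    ("capacita", 32.77, 33.35), ("channel", 1.89, 1.89),
    ("doduc", 26.58, 26.52), ("fatigue", 8.37, 8.36),
    ("gas_dyn", 4.36, 4.35), ("induct", 13.05, 13.04),
    ("linpk", 17.15, 17.05), ("mdbx", 11.25, 11.26), ("nf", 32.14, 33.50),
    ("protein", 32.50, 32.27), ("rnflow", 24.11, 24.84),
    ("test_fpu", 8.22, 8.20), ("tfft", 1.89, 1.88),
]

def geometric_mean(values):
    """exp of the arithmetic mean of the logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def percent_change(old, new):
    return (new - old) / old * 100.0

base_gm = geometric_mean([base for _, base, _ in timings])
lto_gm = geometric_mean([lto for _, _, lto in timings])
print(f"geometric mean: {base_gm:.2f} -> {lto_gm:.2f} "
      f"({percent_change(base_gm, lto_gm):+.1f}%)")
print(f"rnflow: {percent_change(24.11, 24.84):+.1f}%")
```

The geometric mean is used here because it weights each benchmark's relative change equally, regardless of how long the benchmark runs.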

Comment 28 Dominique d'Humieres 2011-01-23 08:44:45 UTC

According to http://gcc.gnu.org/ml/gcc-regression/2011-01/msg00375.html revision 169136 caused a bootstrap failure on powerpc-apple-darwin9.8.0:

....
/Users/regress/tbox/native/build/./prev-gcc/xgcc -B/Users/regress/tbox/native/build/./prev-gcc/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/bin/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/bin/ -B/Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/lib/ -isystem /Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/include -isystem /Users/regress/tbox/objs/powerpc-apple-darwin9.8.0/sys-include    -c   -g -O2 -mdynamic-no-pic -gtoggle -DIN_GCC   -W -Wall -Wwrite-strings -Wcast-qual -Wstrict-prototypes -Wmissing-prototypes -Wmissing-format-attribute -pedantic -Wno-long-long -Wno-variadic-macros -Wno-overlength-strings -Werror -Wold-style-definition -Wc++-compat -fno-common  -DHAVE_CONFIG_H -I. -I. -I/Users/regress/tbox/svn-gcc/gcc -I/Users/regress/tbox/svn-gcc/gcc/. -I/Users/regress/tbox/svn-gcc/gcc/../include -I./../intl -I/Users/regress/tbox/svn-gcc/gcc/../libcpp/include  -I/Users/regress/tbox/svn-gcc/gcc/../libdecnumber -I/Users/regress/tbox/svn-gcc/gcc/../libdecnumber/dpd -I../libdecnumber    /Users/regress/tbox/svn-gcc/gcc/compare-elim.c -o compare-elim.o
/Users/regress/tbox/svn-gcc/gcc/compare-elim.c: In function 'maybe_select_cc_mode':
/Users/regress/tbox/svn-gcc/gcc/compare-elim.c:407:58: error: unused parameter 'b' [-Werror=unused-parameter]
cc1: all warnings being treated as errors

Comment 29 Dominique d'Humieres 2011-01-23 11:17:32 UTC

From http://gcc.gnu.org/ml/gcc-patches/2011-01/msg01607.html  the bootstrap failure seems rather due to revision 169131. Note that revision 169142 bootstrapped on x86_64-apple-darwin10 configured with --enable-checking=release.

Comment 30 Dominique d'Humieres 2011-01-23 11:43:09 UTC

Concerning the timings in comment #27, they may reflect the fact that the inliner is not aggressive enough for Fortran codes and that this gets worse when using -flto:

For rnflow.f90 I get

26.75s   with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer
26.66s   with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600
27.60s   with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -fwhole-program -flto
27.14s   with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto
26.79s  with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=2000 -fwhole-program -flto

The result is more spectacular for fatigue.f90

8.50s    with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=600 -fwhole-program -flto
4.69s    with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=2000 -fwhole-program -flto

Note that revision 169136 seems to require higher values of -finline-limit: before it, 600 was sufficient to see the speed-up (I have reported that in another pr); now the threshold has increased (I have not tried values lower than 2000 yet).

Comment 31 Dominique d'Humieres 2011-01-23 12:06:00 UTC

The relevant pr for comment #30 is pr45810 comment #9. The threshold for fatigue.f90 was 322 before revision 169136 and is now 1520 (~x5).

Comment 32 Jan Hubicka 2011-01-23 13:14:56 UTC

> The relevant pr for comment #30 is pr45810 comment #9. The threshold for
> fatigue.f90 was 322 before revision 169136 and is now 1520 (~x5).
Interesting. Do you know which function we fail to inline?
Can you attach the ipa-inline dump from both settings?
I know that c-ray also wants increased inline limits.  I can increase them a bit,
but not by a factor of 5, since that would cause a code size explosion at -O3.
(I did some tests on this two weeks ago)

Honza

Comment 33 Jan Hubicka 2011-01-23 13:15:27 UTC

Please use -fdump-ipa-inline-details to generate the dump.  Perhaps we just miscompute function body size somehow.

Comment 34 Jan Hubicka 2011-01-23 13:16:34 UTC

The pretty obvious fix to the compare-elim issue is adding ATTRIBUTE_UNUSED to the b parameter.
It is used by SELECT_CC_MODE macro that is defined to not use it by default.

Honza

Comment 35 Dominique d'Humieres 2011-01-23 15:02:43 UTC

> Do you know what function we fail to inline?

It is generalized_hookes_law.

I have looked at fatigue.f90 in more detail. With revision 168741, I see the transitions:

 9.25s for inline-limit < 214
 6.50s for 213 < inline-limit < 322
 4.76s for 321 < inline-limit

With revision 169142, I see

 9.25s for inline-limit < 214
 6.50s for 213 < inline-limit < 322
 8.48s for 321 < inline-limit < 1520
 4.70s for 1519 < inline-limit

Indeed I may have missed other thresholds (especially in the range 322--1519).

I have dumps for values below and above the thresholds (10 of them). Do you want them all, or only a subset? In the latter case, which ones?
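
The two sets of transitions above can be encoded as step functions, which makes it easy to compare the revisions at any given -finline-limit. The Python sketch below does that with `bisect`; the names are illustrative, and the tables contain only the thresholds reported in this comment (as noted, others may exist in the range 322--1519).

```python
from bisect import bisect_right

# Each table: sorted lower bounds at which the timing changes, and the
# run time (seconds) observed in each of the resulting intervals.
R168741 = ([214, 322], [9.25, 6.50, 4.76])
R169142 = ([214, 322, 1520], [9.25, 6.50, 8.48, 4.70])

def observed_time(inline_limit, table):
    """Run time reported for fatigue.f90 at the given -finline-limit."""
    bounds, times = table
    return times[bisect_right(bounds, inline_limit)]

for limit in (213, 321, 322, 600, 1519, 1520, 2000):
    print(limit, observed_time(limit, R168741), observed_time(limit, R169142))
```

At -finline-limit=600, for instance, the lookup gives the fast timing for revision 168741 and the slow one for revision 169142, which is the change described in comment #30.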

Comment 36 Jack Howarth 2011-01-23 15:45:19 UTC

Created attachment 23086 [details]
bzip2 compressed ipa-inline-details dump without -finline-limit

generated at r169137 on x86_64-apple-darwin10 with...

gfortran -O3 -ffast-math -funroll-loops -fdump-ipa-inline-details -flto -fwhole-program ../fatigue.f90 -o ../fatigue

Comment 37 Jack Howarth 2011-01-23 15:47:54 UTC

Created attachment 23087 [details]
bzip2 compressed ipa-inline-details dump with -finline-limit=600

generated at r169137 on x86_64-apple-darwin10 with...

gfortran -O3 -ffast-math -funroll-loops  -finline-limit=600 -fdump-ipa-inline-details -flto
-fwhole-program ../fatigue.f90 -o ../fatigue

Comment 38 Jack Howarth 2011-01-23 15:49:19 UTC

Created attachment 23088 [details]
bzip2 compressed ipa-inline-details dump with -finline-limit=2000

generated at r169137 on x86_64-apple-darwin10 with...

gfortran -O3 -ffast-math -funroll-loops  -finline-limit=2000
-fdump-ipa-inline-details -flto
-fwhole-program ../fatigue.f90 -o ../fatigue

Comment 39 Dominique d'Humieres 2011-01-23 16:32:38 UTC

Created attachment 23089 [details]
-finline-limit=321 revision 168741

bzip2 fatigue.f90.048i.inline generated at revision 168741 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=321 -fwhole-program -flto -fdump-ipa-inline-details

Comment 40 Dominique d'Humieres 2011-01-23 16:33:39 UTC

Created attachment 23090 [details]
-finline-limit=322 revision168741

bzip2 fatigue.f90.048i.inline generated at revision 168741 with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=322 -fwhole-program -flto -fdump-ipa-inline-details

Comment 41 Dominique d'Humieres 2011-01-23 16:35:02 UTC

Created attachment 23091 [details]
-finline-limit=321 revision 169142

bzip2 fatigue.f90.048i.inline generated at revision 169142  with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=321 -fwhole-program -flto -fdump-ipa-inline-details

Comment 42 Dominique d'Humieres 2011-01-23 16:36:00 UTC

Created attachment 23092 [details]
-finline-limit=322 revision 169142

bzip2 fatigue.f90.048i.inline generated at revision 169142  with -Ofast -funroll-loops -ftree-loop-linear -fomit-frame-pointer -finline-limit=322 -fwhole-program -flto -fdump-ipa-inline-details

Comment 43 Jack Howarth 2011-01-23 18:07:36 UTC

On x86_64-apple-darwin10 at r169137, the pb05 benchmarks compiled with

benchmark  -O3 -ffast-math -funroll-loops  -O3 -ffast-math -funroll-loops  %change
           -flto -fwhole-program           -finline-limit=2000 -flto 
                                           -fwhole-program

ac            8.81                          7.30                           -17.1
aermod       17.50                         17.43                            -0.4
air           5.57                          5.57                             0.0
capacita     33.35                         31.86                            -4.5
channel       1.89                          1.76                            -6.9
doduc        26.52                         25.15                            -5.2
fatigue       8.36                          4.21                           -49.6
gas_dyn       4.35                          4.28                            -1.6
induct       13.04                         13.05                             0.1
linpk        17.05                         17.31                             1.5
mdbx         11.26                         11.26                             0.0
nf           33.50                         30.97                            -7.6
protein      32.27                         32.62                             1.1
rnflow       24.84                         24.16                            -2.7
test_fpu      8.20                          8.90                             8.5
tfft          1.88                          1.94                             3.2

Geometric    11.11                         10.42                            -6.2
Mean

Comment 44 Jack Howarth 2011-01-25 03:13:39 UTC

Testing...

Index: gcc/params.def
===================================================================
--- gcc/params.def	(revision 169185)
+++ gcc/params.def	(working copy)
@@ -182,7 +182,7 @@ DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
 DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
 	 "large-function-growth",
 	 "Maximal growth due to inlining of large function (in percent)",
-	 100, 0, 0)
+	 400, 0, 0)
 DEFPARAM(PARAM_LARGE_UNIT_INSNS,
 	 "large-unit-insns",
 	 "The size of translation unit to be considered large",

shows a major improvement only for fatigue (30%). The same improvement can be achieved at -m32 and -m64 by increasing large-function-growth to just 200.
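
As a rough illustration of why this parameter matters, the Python sketch below models the documented meaning of large-function-insns and large-function-growth: once a caller would exceed large-function-insns after inlining, its size may grow by at most large-function-growth percent. This is a simplified model and not the inliner's actual code; the default of 2700 for large-function-insns is an assumption, and the function sizes in the example are made up.

```python
def inlining_allowed(caller_size, size_after_inlining,
                     large_function_insns=2700,   # assumed default
                     large_function_growth=100):  # default before the patch
    """Simplified large-function growth check (illustrative only)."""
    if size_after_inlining <= large_function_insns:
        return True  # caller is not "large", so the growth cap does not apply
    limit = caller_size + caller_size * large_function_growth // 100
    return size_after_inlining <= limit

# Hypothetical caller of 3000 insns that would reach 7000 after inlining.
print(inlining_allowed(3000, 7000))                             # default 100
print(inlining_allowed(3000, 7000, large_function_growth=200))  # raised to 200
```

In this model the default caps the caller at twice its original size, so the hypothetical inlining is rejected at 100 and accepted at 200.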

Comment 45 rguenther@suse.de 2011-01-25 10:20:01 UTC

On Tue, 25 Jan 2011, howarth at nitro dot med.uc.edu wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44334
> 
> --- Comment #44 from Jack Howarth <howarth at nitro dot med.uc.edu> 2011-01-25 03:13:39 UTC ---
> Testing...
> 
> Index: gcc/params.def
> ===================================================================
> --- gcc/params.def    (revision 169185)
> +++ gcc/params.def    (working copy)
> @@ -182,7 +182,7 @@ DEFPARAM(PARAM_LARGE_FUNCTION_INSNS,
>  DEFPARAM(PARAM_LARGE_FUNCTION_GROWTH,
>       "large-function-growth",
>       "Maximal growth due to inlining of large function (in percent)",
> -     100, 0, 0)
> +     400, 0, 0)
>  DEFPARAM(PARAM_LARGE_UNIT_INSNS,
>       "large-unit-insns",
>       "The size of translation unit to be considered large",
> 
> shows only a major improvement for fatigue (30%). This same improvement can be
> achieved at -m32 and -m64 with just an increase of large-function-growth to
> 200.

We certainly won't adjust params at this stage.  There are other cases
(that c-ray one) where more aggressive inlining helps, but we should
avoid regressing for -O2 and only tune -O3 params eventually.

Comment 46 jh 2011-01-25 17:57:38 UTC

I singled out increasing the large function growth ratio as the safest
way to deal with (the easier half of) this problem. Unlike the
parameters for inline limits, it won't cause code size issues. It just
allows somewhat bigger functions and thus puts more stress on the
linearity of the backend.

Given that the parameter has never been tuned since its inclusion in
GCC 4.2, I guess we are not terribly sensitive here. We also improved
scalability a bit here, as I tuned the df code a bit for LTO and
spaghetti code.

Otherwise we need to wait for 4.7 or possibly 4.6.1. That is fine with me.
I will still run tests tonight on how increasing the parameter affects
our tester.  This is the first time I have seen it hit in a
performance-sensitive way. I am not sure how common it is in practice,
since I never really tried to change it.

Comment 47 Dominique d'Humieres 2011-01-25 19:06:04 UTC

> I sorted out increasing large function growth ratio as most safe way  
> to deal with (easier half of) this problem. Unlike the parameters for  
> inline limits it won't cause code size issues. It just allow somewhat  
> bigger functions and thus stress more the backend on its linearity.

Well, the choice is not '-finline-limit' versus '--param large-function-growth': some polyhedron tests are sensitive to some value of '-finline-limit' (ac, channel, fatigue, ...) and for most of them '--param large-function-growth' does not change anything. 

fatigue is quite peculiar in that there is a big speed-up with -fwhole-program for -finline-limit>=322 and an additional small speed-up for --param large-function-growth>=132. In addition the latter prevents a bad choice with -flto (this should probably be discussed in pr 45810 and this pr closed as fixed).

Note that I am not interested in fine tuning, but in finding some acceptable values of the default parameters that give good results for all (most ;-) Fortran codes.

Comment 48 Jack Howarth 2011-01-25 21:29:53 UTC

(In reply to comment #47)

> Well, the choice is not '-finline-limit' versus '--param
> large-function-growth': some polyhedron tests are sensitive to some value of
> '-finline-limit' (ac, channel, fatigue, ...) and for most of them '--param
> large-function-growth' does not change anything. 
> 
> fatigue is quite peculiar in that there is a big speed-up with -fwhole-program
> for -finline-limit>=322 and an additional small speed-up for --param
> large-function-growth>=132. In addition the latter prevents a bad choice with
> -flto (this should probably be discussed in pr 45810 and this pr closed as
> fixed).
> 
> Note that I am not interested by fine tuning, but to find some acceptable
> values of the default parameters that give good results for all (most;-)
> fortran codes).

In my tests, --param large-function-growth=200 was sufficient to yield 60% of the performance increase in the fatigue benchmark obtained by modifying both -finline-limit and --param large-function-growth.  Unlike with an increased -finline-limit, none of the other pb05 benchmarks showed even minor regressions in speed.

Comment 49 Dominique d'Humieres 2011-02-16 17:21:14 UTC

Since it seems that revision 169136 is the "right fix", I am closing this PR as fixed. Any further discussion about the interaction between -fwhole-program and -flto for the polyhedron test fatigue.f90 should take place in pr45810.