Bug 32086 - [4.3 Regression] 10% to 20% Performance Regression Between 4.1.3 and 4.3
Summary: [4.3 Regression] 10% to 20% Performance Regression Between 4.1.3 and 4.3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.3.0
Importance: P2 normal
Target Milestone: 4.3.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 32084
Reported: 2007-05-25 17:09 UTC by Tobias Burnus
Modified: 2007-12-10 17:07 UTC

See Also:
Host:
Target: x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-11-30 10:59:33


Description Tobias Burnus 2007-05-25 17:09:39 UTC
The program induct.f90 of the Polyhedron testsuite, http://www.polyhedron.co.uk/pb05/polyhedron_benchmark_suite.html, runs about 10% slower under 4.3 than under 4.1.3 (20070430 prerelease SUSE Linux).

A cut-down testcase, test2.f90 (attachment 13611 of PR 32084), shows the same result. At least for the testcase, the original tree is almost identical for 4.3 and 4.1.3, which means that the difference must come from the middle end or the back end.

Timings (w/o "volatile"):

a) gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3

induct.f90: 51.65 [100%] vs 46.94 [ 90%] for gfortran 4.3 vs. 4.1.3
test2.f90:   4.60 [100%] vs  4.18 [ 91%]

b) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3

induct.f90: 61.41 [100%] vs 46.94 [ 76%]
test2.f90:   5.45 [100%] vs  4.54 [ 83%]

c) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -mfpmath=sse -O3

induct.f90: 46.12 [100%] vs 46.94 [102%]  (4.3 is better :-)
test2.f90:   4.14 [100%] vs  3.96 [ 96%]

(For the other Polyhedron test cases, the performance loss is smaller: tfft is 4% slower, protein 3%, doduc 3%, channel 2%. In total 4.3 is faster; for fatigue, 4.1.3 takes twice as long as 4.3. See:
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/#rt)
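
The bracketed percentages in the timings above read as the 4.1.3 run time relative to the 4.3 run time (4.3 = 100%). A minimal sketch of that arithmetic, with the function and dictionary names chosen here purely for illustration:

```python
# Timings copied from the description: (gfortran 4.3, gfortran 4.1.3), in seconds.
TIMINGS = {
    "a) induct.f90": (51.65, 46.94),
    "a) test2.f90": (4.60, 4.18),
    "b) induct.f90": (61.41, 46.94),
    "b) test2.f90": (5.45, 4.54),
    "c) induct.f90": (46.12, 46.94),
    "c) test2.f90": (4.14, 3.96),
}


def relative_percent(time_43, time_413):
    """Run time of 4.1.3 expressed as a percentage of the 4.3 run time."""
    return 100.0 * time_413 / time_43


for label, (t43, t413) in TIMINGS.items():
    print(f"{label}: {relative_percent(t43, t413):.1f}%")
```

The computed values agree with the bracketed figures to within one percentage point.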
Comment 1 Uroš Bizjak 2007-11-29 08:09:18 UTC
(In reply to comment #0)

> b) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
> -ftree-loop-linear -O3
> 
> induct.f90: 61.41 [100%] vs 46.94 [ 76%]
> test2.f90:   5.45 [100%] vs  4.54 [ 83%]

I have run the test compiled with the above options, with and without -fvect-cost-model, but on a Xeon 3.2G in 32-bit mode:

w/o -fvect-cost-model:

user    1m40.906s

w/ -fvect-cost-model:

user    0m46.439s
Comment 2 Paolo Bonzini 2007-11-30 05:41:20 UTC
What were the benchmarks where the cost model was slower?
Comment 3 Uroš Bizjak 2007-11-30 06:42:30 UTC
(In reply to comment #1)

> w/ -fvect-cost-model:
> user    0m46.439s

Looking a bit into the generated code, it looks like -fvect-cost-model disables all interesting vectorizations, so it is effectively -fno-tree-vectorize.
Comment 4 Paolo Bonzini 2007-11-30 07:17:31 UTC
So -fvect-cost-model is doing its job.  The vectorizations must not be profitable, or are they?
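
To make "profitable" concrete, here is a toy profitability check. It is only a sketch of the general idea behind a vectorizer cost model and is not GCC's actual model, which this thread does not describe; every name and cost value below is an assumption made up for illustration.

```python
def vectorization_profitable(niters, scalar_iter_cost, vector_iter_cost,
                             vector_factor, setup_overhead):
    """Toy check: is the estimated vector cost below the scalar cost?

    All parameters are hypothetical abstract cost units, not GCC internals.
    """
    scalar_total = niters * scalar_iter_cost
    # Full vector iterations, plus leftover iterations run as scalar code.
    vector_iters, leftover = divmod(niters, vector_factor)
    vector_total = (setup_overhead
                    + vector_iters * vector_iter_cost
                    + leftover * scalar_iter_cost)
    return vector_total < scalar_total


# A short loop cannot amortize the setup overhead; a long one can.
print(vectorization_profitable(3, 1.0, 1.5, 2, 4.0))
print(vectorization_profitable(1000, 1.0, 1.5, 2, 4.0))
```

Under these made-up costs the three-iteration loop is rejected and the thousand-iteration loop is accepted, which is the kind of decision a cost model is expected to make.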
Comment 5 Uroš Bizjak 2007-11-30 10:27:07 UTC
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse induct.f90:
user    1m32.226s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse -fno-tree-vectorize induct.f90
user    0m58.492s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 induct.f90
user    1m40.906s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 -fno-tree-vectorize induct.f90
user    0m46.439s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse -fvect-cost-model induct.f90
user    0m58.168s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 -fvect-cost-model induct.f90
user    0m46.415s

All on:

Family: 15 Model: 4 Stepping: 10 Type: 0 Brand: 0
CPU Model: Pentium 4 D (Foster) Original OEM
Processor name string: Intel(R) Xeon(TM) CPU 3.60GHz

(so -march=opteron is a bit misleading)

I'd say that the vectorizer cost model is doing its job pretty well.
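
For reference, the "user" times above can be converted to seconds and compared directly. A small sketch, assuming the bash-style `NmS.SSSs` format shown in this comment; the helper and dictionary names are illustrative only:

```python
import re


def parse_bash_time(text):
    """Convert a bash `time` value such as '1m40.906s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# User times for induct.f90 from this comment, keyed by -mfpmath setting.
RUNS = {
    "sse": {"vectorized": "1m32.226s", "no-vectorize": "0m58.492s",
            "cost-model": "0m58.168s"},
    "387": {"vectorized": "1m40.906s", "no-vectorize": "0m46.439s",
            "cost-model": "0m46.415s"},
}


def slowdown(fpmath):
    """How many times slower the vectorized build is than -fno-tree-vectorize."""
    run = RUNS[fpmath]
    return parse_bash_time(run["vectorized"]) / parse_bash_time(run["no-vectorize"])


for fpmath in RUNS:
    print(f"-mfpmath={fpmath}: vectorized build is {slowdown(fpmath):.2f}x slower")
```

With these figures the vectorized build is roughly 1.6 times slower with SSE math and roughly 2.2 times slower with 387 math, while the cost-model runs land within a second of the -fno-tree-vectorize runs.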
Comment 6 Paolo Bonzini 2007-11-30 10:59:33 UTC
Looking at http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/gfortran-run.dat and http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/gfortranVecCost-run.dat I think we should turn on the cost model by default, at least for i386.

Uros?
Comment 7 Uroš Bizjak 2007-11-30 13:20:08 UTC
This is with latest SVN:

   Benchmark
        Name     387       387c      sse       ssec
   ---------   -------   -------   -------   -------
          ac     16.57     16.57     22.66     22.65
      aermod     55.85     54.72     50.23     51.08
         air     14.95     15.02     15.92     15.94
    capacita     78.60     78.55     81.89     81.95
     channel      9.90      9.78      9.90      9.65
       doduc     59.82     59.65     67.81     68.97
     fatigue     20.06     18.27     21.65     21.27
     gas_dyn     11.47     11.35     10.62     10.68
      induct     60.60     50.92     73.82     58.74
       linpk     27.26     27.20     28.24     28.17
        mdbx     30.41     30.36     33.33     33.23
          nf     33.63     33.69     31.97     32.03
     protein     73.16     72.87     72.67     72.76
      rnflow     57.18     57.18     42.19     42.46
    test_fpu     20.71     20.86     21.61     21.54
        tfft      4.86      5.10      4.92      5.11

  geom. mean     27.48     27.03     28.26     27.93

gcc version 4.3.0 20071130 (experimental) [trunk revision 130533] (GCC)

Family: 15 Model: 4 Stepping: 10 Type: 0 Brand: 0
CPU Model: Pentium 4 D (Foster) Original OEM
Processor name string: Intel(R) Xeon(TM) CPU 3.60GHz

Compile Command :
-march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3

387 : -mfpmath=387
387c: -mfpmath=387 -fvect-cost-model
sse : -mfpmath=sse
ssec: -mfpmath=sse -fvect-cost-model
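
The bottom row of the table is consistent with the geometric mean of each column. A short sketch that recomputes it from the per-benchmark times (the names COLUMNS, REPORTED and geometric_mean are chosen here for illustration):

```python
import math

# Per-benchmark run times from the table, in row order (ac ... tfft).
COLUMNS = {
    "387": [16.57, 55.85, 14.95, 78.60, 9.90, 59.82, 20.06, 11.47,
            60.60, 27.26, 30.41, 33.63, 73.16, 57.18, 20.71, 4.86],
    "387c": [16.57, 54.72, 15.02, 78.55, 9.78, 59.65, 18.27, 11.35,
             50.92, 27.20, 30.36, 33.69, 72.87, 57.18, 20.86, 5.10],
    "sse": [22.66, 50.23, 15.92, 81.89, 9.90, 67.81, 21.65, 10.62,
            73.82, 28.24, 33.33, 31.97, 72.67, 42.19, 21.61, 4.92],
    "ssec": [22.65, 51.08, 15.94, 81.95, 9.65, 68.97, 21.27, 10.68,
             58.74, 28.17, 33.23, 32.03, 72.76, 42.46, 21.54, 5.11],
}

# Bottom row of the table.
REPORTED = {"387": 27.48, "387c": 27.03, "sse": 28.26, "ssec": 27.93}


def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, times in COLUMNS.items():
    print(f"{name:>4}: computed {geometric_mean(times):.2f}, "
          f"reported {REPORTED[name]:.2f}")
```

Because a geometric mean weights relative changes equally, the large induct improvement moves the summary figure by only a few tenths of a second.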
Comment 8 Paolo Bonzini 2007-11-30 13:30:05 UTC
Testing a one-liner.
Comment 9 Dominique d'Humieres 2007-11-30 15:23:09 UTC
> I think we should turn on cost model by default, at least for i386.

Although information on the cost model is very scarce in the gcc manual, if its goal is to avoid overly costly vectorization, it should certainly be turned on by default. If this increases the execution time, the cost model probably needs some tuning. The induct code is not a good test case on PPC, since it uses double-precision FP. On a 2.16 GHz Core2Duo I get for induct:

-O3 -ffast-math -funroll-loops induct.f90
        93.192u 0.066s 1:33.28 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -fvect-cost-model induct.f90
        73.453u 0.107s 1:13.57 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -mfpmath=387 induct.f90
        105.564u 0.128s 1:45.69 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -mfpmath=387 -fvect-cost-model induct.f90
        79.162u 0.088s 1:19.25 99.9%	0+0k 0+0io 0pf+0w


With the patch in comment #5 of PR34265, the timings are:
without -mfpmath=387  and with or without -fvect-cost-model
        37.014u 0.065s 0:37.08 99.9%	0+0k 0+0io 0pf+0w
with -mfpmath=387  and with or without -fvect-cost-model
        39.820u 0.071s 0:39.89 100.0%	0+0k 0+0io 0pf+0w
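
The lines above are in csh-style `time` format, where the first two fields are user and system seconds. A minimal sketch for extracting them and computing how much -fvect-cost-model reduces the user time; the helper names are illustrative and only the first two fields are interpreted:

```python
def parse_csh_time(line):
    """Return (user_seconds, system_seconds) from a csh-style `time` line."""
    fields = line.split()
    user = float(fields[0].rstrip("u"))
    system = float(fields[1].rstrip("s"))
    return user, system


def reduction_percent(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# User times for induct on the Core2Duo, copied from this comment.
default_user, _ = parse_csh_time("93.192u 0.066s 1:33.28 99.9%\t0+0k 0+1io 0pf+0w")
cost_user, _ = parse_csh_time("73.453u 0.107s 1:13.57 99.9%\t0+0k 0+1io 0pf+0w")
x87_user, _ = parse_csh_time("105.564u 0.128s 1:45.69 99.9%\t0+0k 0+1io 0pf+0w")
x87_cost_user, _ = parse_csh_time("79.162u 0.088s 1:19.25 99.9%\t0+0k 0+0io 0pf+0w")

print(f"default fpmath: {reduction_percent(default_user, cost_user):.1f}% less user time")
print(f"-mfpmath=387:   {reduction_percent(x87_user, x87_cost_user):.1f}% less user time")
```

On these numbers the cost model saves roughly a fifth to a quarter of the user time, while the patched compiler's 37 to 40 seconds is a much larger improvement.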

Comment 10 Paolo Bonzini 2007-12-10 08:34:50 UTC
Subject: Bug 32086

Author: bonzini
Date: Mon Dec 10 08:34:37 2007
New Revision: 130738

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=130738
Log:
2007-12-10  Paolo Bonzini  <bonzini@gnu.org>

	PR target/32086
	* config/i386/i386.c (override_options): Enable -fvect-cost-model.

2007-12-10  Paolo Bonzini  <bonzini@gnu.org>

	PR target/32086
	* gcc.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.
	* g++.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.
	* gfortran.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/vect/vect.exp
    trunk/gcc/testsuite/gcc.dg/vect/vect.exp
    trunk/gcc/testsuite/gfortran.dg/vect/vect.exp

Comment 11 Paolo Bonzini 2007-12-10 08:36:49 UTC
committed, cost model now enabled for i386.
Comment 12 Dominique d'Humieres 2007-12-10 16:03:47 UTC
> committed, cost model now enabled for i386.

Is it working for Intel Core2Duo? At revision 130743 and

Using built-in specs.
Target: i686-apple-darwin9
Configured with: ../gcc-4.3-work/configure --prefix=/opt/gcc/gcc4.3w --mandir=/opt/gcc/gcc4.3w/share/man --infodir=/opt/gcc/gcc4.3w/share/info --build=i686-apple-darwin9 --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib
Thread model: posix
gcc version 4.3.0 20071210 (experimental) (GCC) 

for 'gfc -O3 -ffast-math -funroll-loops induct.f90' (with/without -fvect-cost-model), the execution time is:

93.986u 0.051s 1:34.04 99.9%	0+0k 0+0io 0pf+0w

while for 'gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=2 induct.f90', it is:

76.345u 0.048s 1:16.39 99.9%	0+0k 0+0io 0pf+0w

If yes, the cost model should be tuned for Core2Duo. If no, did I do something wrong with the configure?

Should I open a new PR for these questions?

Comment 13 Paolo Bonzini 2007-12-10 16:37:57 UTC
I think so.
Comment 14 Jack Howarth 2007-12-10 16:41:17 UTC
Dominique,
    What do you get when you use the proposed early-complete-unrolling patch from PR34265 and is there any movement towards getting some form of that patch into gcc trunk?
Comment 15 Richard Biener 2007-12-10 17:07:02 UTC
Early unrolling will be addressed in the next stage1 at the earliest.