32086 – [4.3 Regression] 10% to 20% Performance Regression Between 4.1.3 and 4.3

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.3.0

Importance:	P2 normal
Target Milestone:	4.3.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:

Depends on:
Blocks:	32084

Reported:	2007-05-25 17:09 UTC by Tobias Burnus
Modified:	2007-12-10 17:07 UTC
CC List:	2 users

See Also:
Host:
Target:	x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed:	2007-11-30 10:59:33

Description Tobias Burnus 2007-05-25 17:09:39 UTC

The program induct.f90 of the Polyhedron testsuite, http://www.polyhedron.co.uk/pb05/polyhedron_benchmark_suite.html, runs about 10% slower under 4.3 than under 4.1.3 (20070430 prerelease SUSE Linux).

A cut-down testcase, "test2.f90" (attachment 13611 of PR 32084), shows the same result. At least for the testcase, the original tree is almost identical for 4.3 and 4.1.3, which means that the difference must be in the middle end or the back end.

Timings (w/o "volatile"):

a) gfortran -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -O3

induct.f90: 51.65 [100%] vs 46.94 [ 90%] for gfortran 4.3 vs. 4.1.3
test2.f90:   4.60 [100%] vs  4.18 [ 91%]

b) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3

induct.f90: 61.41 [100%] vs 46.94 [ 76%]
test2.f90:   5.45 [100%] vs  4.54 [ 83%]

c) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -msse3 -mfpmath=sse -O3

induct.f90: 46.12 [100%] vs 46.94 [102%]  (4.3 is better :-)
test2.f90:   4.14 [100%] vs  3.96 [ 96%]

(For the other Polyhedron test cases the performance loss is smaller: tfft is 4% slower, protein 3%, doduc 3%, and channel 2%. Overall, 4.3 is faster; for fatigue, 4.1.3 takes twice as long as 4.3. See:
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/#rt)
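The bracketed percentages above express the 4.1.3 time relative to the 4.3 time (4.3 = 100%). A minimal Python sketch of that arithmetic follows, using only the timings quoted in this description. The names `TIMINGS` and `relative_percent` are illustrative and not part of the report, and results may differ from the quoted figures by a point because of rounding.

```python
# Timings in seconds copied from the description: (gfortran 4.3, gfortran 4.1.3).
# The keys "a", "b", "c" refer to the three option sets listed above.
TIMINGS = {
    ("a", "induct.f90"): (51.65, 46.94),
    ("a", "test2.f90"): (4.60, 4.18),
    ("b", "induct.f90"): (61.41, 46.94),
    ("b", "test2.f90"): (5.45, 4.54),
    ("c", "induct.f90"): (46.12, 46.94),
    ("c", "test2.f90"): (4.14, 3.96),
}


def relative_percent(new_time, old_time):
    """Return old_time as a percentage of new_time (new_time counts as 100%)."""
    return 100.0 * old_time / new_time


if __name__ == "__main__":
    for (option_set, program), (t43, t413) in TIMINGS.items():
        pct = relative_percent(t43, t413)
        print(f"{option_set}) {program:<11} 4.3: {t43:6.2f}  4.1.3: {t413:6.2f}  ({pct:5.1f}%)")
```

A value below 100% means 4.1.3 was faster than 4.3 for that option set; a value above 100% means 4.3 was faster.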

Comment 1 Uroš Bizjak 2007-11-29 08:09:18 UTC

(In reply to comment #0)

> b) gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
> -ftree-loop-linear -O3
> 
> induct.f90: 61.41 [100%] vs 46.94 [ 76%]
> test2.f90:   5.45 [100%] vs  4.54 [ 83%]

I have run the test compiled with the above options, with and without -fvect-cost-model, but on a Xeon 3.2G in 32-bit mode:

w/o -fvect-cost-model:

user    1m40.906s

w/ -fvect-cost-model:

user    0m46.439s

Comment 2 Paolo Bonzini 2007-11-30 05:41:20 UTC

What were the benchmarks where the cost model was slower?

Comment 3 Uroš Bizjak 2007-11-30 06:42:30 UTC

(In reply to comment #1)

> w/ -fvect-cost-model:
> user    0m46.439s

Looking a bit into the generated code, it looks like -fvect-cost-model disables all the interesting vectorizations, so it effectively behaves like -fno-tree-vectorize.

Comment 4 Paolo Bonzini 2007-11-30 07:17:31 UTC

So -fvect-cost-model is doing its job.  The vectorizations must not be profitable, or are they?

Comment 5 Uroš Bizjak 2007-11-30 10:27:07 UTC

gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse induct.f90:
user    1m32.226s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse -fno-tree-vectorize induct.f90
user    0m58.492s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 induct.f90
user    1m40.906s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 -fno-tree-vectorize induct.f90
user    0m46.439s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=sse -fvect-cost-model induct.f90
user    0m58.168s
gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3 -mfpmath=387 -fvect-cost-model induct.f90
user    0m46.415s

All on:

Family: 15 Model: 4 Stepping: 10 Type: 0 Brand: 0
CPU Model: Pentium 4 D (Foster) Original OEM
Processor name string: Intel(R) Xeon(TM) CPU 3.60GHz

(so -march=opteron is a bit misleading)

I'd say that the vectorizer cost model is doing its job pretty well.
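To compare the six runs above more directly, the `user` times can be converted to seconds and expressed as a ratio against the run without any extra flag. The following Python sketch does this with the values from this comment only. The helper names `user_seconds` and `speedup_over_baseline` are hypothetical, and the parser assumes the `NmS.SSSs` format shown above.

```python
import re

# User times copied from the list above, keyed by (-mfpmath value, extra flag).
# An empty extra flag marks the run with no additional vectorizer option.
RUNS = {
    ("sse", ""): "1m32.226s",
    ("sse", "-fno-tree-vectorize"): "0m58.492s",
    ("sse", "-fvect-cost-model"): "0m58.168s",
    ("387", ""): "1m40.906s",
    ("387", "-fno-tree-vectorize"): "0m46.439s",
    ("387", "-fvect-cost-model"): "0m46.415s",
}


def user_seconds(text):
    """Convert a time string such as '1m32.226s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedup_over_baseline(fpmath, flag):
    """Return baseline time divided by the time with `flag` (above 1.0 means faster)."""
    baseline = user_seconds(RUNS[(fpmath, "")])
    return baseline / user_seconds(RUNS[(fpmath, flag)])


if __name__ == "__main__":
    for (fpmath, flag), raw in RUNS.items():
        if flag:
            ratio = speedup_over_baseline(fpmath, flag)
            print(f"-mfpmath={fpmath} {flag}: {user_seconds(raw):.3f}s, {ratio:.2f}x vs. no flag")
```

With these numbers, -fvect-cost-model and -fno-tree-vectorize give nearly the same time for each -mfpmath setting, which is consistent with the observation in comment 3.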

Comment 6 Paolo Bonzini 2007-11-30 10:59:33 UTC

Looking at http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/gfortran-run.dat and http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/gfortranVecCost-run.dat I think we should turn on the cost model by default, at least for i386.

Uros?

Comment 7 Uroš Bizjak 2007-11-30 13:20:08 UTC

This is with latest SVN:

   Benchmark
        Name     387       387c      sse       ssec
   ---------   -------   -------   -------   -------
          ac     16.57     16.57     22.66     22.65
      aermod     55.85     54.72     50.23     51.08
         air     14.95     15.02     15.92     15.94
    capacita     78.60     78.55     81.89     81.95
     channel      9.90      9.78      9.90      9.65
       doduc     59.82     59.65     67.81     68.97
     fatigue     20.06     18.27     21.65     21.27
     gas_dyn     11.47     11.35     10.62     10.68
      induct     60.60     50.92     73.82     58.74
       linpk     27.26     27.20     28.24     28.17
        mdbx     30.41     30.36     33.33     33.23
          nf     33.63     33.69     31.97     32.03
     protein     73.16     72.87     72.67     72.76
      rnflow     57.18     57.18     42.19     42.46
    test_fpu     20.71     20.86     21.61     21.54
        tfft      4.86      5.10      4.92      5.11

                 27.48     27.03     28.26     27.93

gcc version 4.3.0 20071130 (experimental) [trunk revision 130533] (GCC)

Family: 15 Model: 4 Stepping: 10 Type: 0 Brand: 0
CPU Model: Pentium 4 D (Foster) Original OEM
Processor name string: Intel(R) Xeon(TM) CPU 3.60GHz

Compile Command :
-march=opteron -ffast-math -funroll-loops -ftree-loop-linear -msse3 -O3

387 : -mfpmath=387
387c: -mfpmath=387 -fvect-cost-model
sse : -mfpmath=sse
ssec: -mfpmath=sse -fvect-cost-model
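The effect of the cost model on each benchmark can be read off the table as the percentage change from the 387 column to 387c, and from sse to ssec. A short Python sketch of that calculation, using the per-benchmark rows exactly as listed above (the unlabeled summary row is left out); `TABLE`, `percent_change` and `cost_model_effect` are illustrative names, not anything from the benchmark harness.

```python
# Rows copied from the table above: name -> (387, 387c, sse, ssec), in seconds.
TABLE = {
    "ac": (16.57, 16.57, 22.66, 22.65),
    "aermod": (55.85, 54.72, 50.23, 51.08),
    "air": (14.95, 15.02, 15.92, 15.94),
    "capacita": (78.60, 78.55, 81.89, 81.95),
    "channel": (9.90, 9.78, 9.90, 9.65),
    "doduc": (59.82, 59.65, 67.81, 68.97),
    "fatigue": (20.06, 18.27, 21.65, 21.27),
    "gas_dyn": (11.47, 11.35, 10.62, 10.68),
    "induct": (60.60, 50.92, 73.82, 58.74),
    "linpk": (27.26, 27.20, 28.24, 28.17),
    "mdbx": (30.41, 30.36, 33.33, 33.23),
    "nf": (33.63, 33.69, 31.97, 32.03),
    "protein": (73.16, 72.87, 72.67, 72.76),
    "rnflow": (57.18, 57.18, 42.19, 42.46),
    "test_fpu": (20.71, 20.86, 21.61, 21.54),
    "tfft": (4.86, 5.10, 4.92, 5.11),
}


def percent_change(without_model, with_model):
    """Percentage change in run time when the cost model is enabled (negative = faster)."""
    return 100.0 * (with_model - without_model) / without_model


def cost_model_effect(fpmath):
    """Return {benchmark: percent change} for fpmath '387' or 'sse'."""
    offset = 0 if fpmath == "387" else 2
    return {
        name: percent_change(row[offset], row[offset + 1])
        for name, row in TABLE.items()
    }


if __name__ == "__main__":
    for fpmath in ("387", "sse"):
        effect = cost_model_effect(fpmath)
        for name, change in sorted(effect.items(), key=lambda item: item[1]):
            print(f"{fpmath:>3} {name:<9} {change:+6.2f}%")
```

Sorting by change puts the benchmark that gains most from the cost model first and the one that loses most last, for each -mfpmath setting.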

Comment 8 Paolo Bonzini 2007-11-30 13:30:05 UTC

Testing a one-liner.

Comment 9 Dominique d'Humieres 2007-11-30 15:23:09 UTC

> I think we should turn on cost model by default, at least for i386.

Although information on the cost model is very scarce in the gcc manual, if its goal is to avoid overly costly vectorization, it should certainly be turned on by default.  And if this increases the execution time, the cost model probably needs some tuning.  The induct code is not a good test case on PPC, since it uses double-precision FP.  On a 2.16GHz Core2Duo I get for induct:

-O3 -ffast-math -funroll-loops induct.f90
        93.192u 0.066s 1:33.28 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -fvect-cost-model induct.f90
        73.453u 0.107s 1:13.57 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -mfpmath=387 induct.f90
        105.564u 0.128s 1:45.69 99.9%	0+0k 0+1io 0pf+0w
-O3 -ffast-math -funroll-loops -mfpmath=387 -fvect-cost-model induct.f90
        79.162u 0.088s 1:19.25 99.9%	0+0k 0+0io 0pf+0w


With the patch in comment #5 of PR34265, the timings are:
without -mfpmath=387  and with or without -fvect-cost-model
        37.014u 0.065s 0:37.08 99.9%	0+0k 0+0io 0pf+0w
with -mfpmath=387  and with or without -fvect-cost-model
        39.820u 0.071s 0:39.89 100.0%	0+0k 0+0io 0pf+0w

Comment 10 Paolo Bonzini 2007-12-10 08:34:50 UTC

Subject: Bug 32086

Author: bonzini
Date: Mon Dec 10 08:34:37 2007
New Revision: 130738

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=130738
Log:
2007-12-10  Paolo Bonzini  <bonzini@gnu.org>

	PR target/32086
	* config/i386/i386.c (override_options): Enable -fvect-cost-model.

2007-12-10  Paolo Bonzini  <bonzini@gnu.org>

	PR target/32086
	* gcc.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.
	* g++.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.
	* gfortran.dg/vect/vect.exp (DEFAULT_VECTCFLAGS): Disable cost model.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/vect/vect.exp
    trunk/gcc/testsuite/gcc.dg/vect/vect.exp
    trunk/gcc/testsuite/gfortran.dg/vect/vect.exp

Comment 11 Paolo Bonzini 2007-12-10 08:36:49 UTC

committed, cost model now enabled for i386.

Comment 12 Dominique d'Humieres 2007-12-10 16:03:47 UTC

> committed, cost model now enabled for i386.

Is it working for Intel Core2Duo? At revision 130743 and

Using built-in specs.
Target: i686-apple-darwin9
Configured with: ../gcc-4.3-work/configure --prefix=/opt/gcc/gcc4.3w --mandir=/opt/gcc/gcc4.3w/share/man --infodir=/opt/gcc/gcc4.3w/share/info --build=i686-apple-darwin9 --enable-languages=c,c++,fortran,objc,obj-c++,java --with-gmp=/sw --with-libiconv-prefix=/sw --with-system-zlib --x-includes=/usr/X11R6/include --x-libraries=/usr/X11R6/lib
Thread model: posix
gcc version 4.3.0 20071210 (experimental) (GCC) 

for 'gfc -O3 -ffast-math -funroll-loops induct.f90' (with/without -fvect-cost-model), the execution time is:

93.986u 0.051s 1:34.04 99.9%	0+0k 0+0io 0pf+0w

while for 'gfc -O3 -ffast-math -funroll-loops --param min-vect-loop-bound=2 induct.f90', it is:

76.345u 0.048s 1:16.39 99.9%	0+0k 0+0io 0pf+0w

If yes, the cost model should be tuned for Core2Duo. If no, did I do something wrong with the configure?

Should I open a new PR for these questions?

Comment 13 Paolo Bonzini 2007-12-10 16:37:57 UTC

I think so.

Comment 14 Jack Howarth 2007-12-10 16:41:17 UTC

Dominique,
    What do you get when you use the proposed early-complete-unrolling patch from PR34265? Also, is there any movement towards getting some form of that patch into gcc trunk?

Comment 15 Richard Biener 2007-12-10 17:07:02 UTC

Early unrolling will be addressed in the next stage1 at the earliest.