Bug 31897 - [4.3 Regression] 30% speed regression with -m32 on Opteron with rnflow
Summary: [4.3 Regression] 30% speed regression with -m32 on Opteron with rnflow
Status: RESOLVED WONTFIX
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.3.0
Importance: P2 normal
Target Milestone: 4.3.6
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2007-05-10 16:53 UTC by Tobias Burnus
Modified: 2011-06-27 11:12 UTC
CC: 6 users

See Also:
Host:
Target: x86_64-unknown-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2007-12-21 18:57:56


Attachments

Description Tobias Burnus 2007-05-10 16:53:46 UTC
Since 20 April, the Polyhedron test case rnflow has been running 30% slower than before.

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90

Today's gfortran: real    0m58.205s / user    0m56.600s
gfortran 2007-04-20 (r123986): real    0m58.237s / user    0m56.396s
gfortran 2007-04-16 (r123859): real    0m43.912s / user    0m42.403s
gfortran 4.2.0: real    0m45.449s / user    0m43.859s

The slowdown only occurs with this particular combination of options.

Using the following options:

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -mfpmath=sse -msse3 -O3

or compiling for a 64-bit system does not show the slowdown.

See also:
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/run-rnflow.png
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/
and for x86-64 which is not affected, see also:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html

The benchmark can be obtained from:
http://www.polyhedron.co.uk/pb05/polyhedron_benchmark_suite.html
directory: pb05/lin/source
Comment 1 Uroš Bizjak 2007-05-30 15:08:44 UTC
gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90

time ./a.out
user    0m37.982s

profiled run:
user    0m43.147s

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 59.76     17.73    17.73 64527290     0.00     0.00  idamax_
 11.80     21.23     3.50       64     0.05     0.06  gentrs_
  9.47     24.04     2.81       64     0.04     0.32  cptrf2_
  6.94     26.10     2.06     6749     0.00     0.00  cmpcpt_
  4.01     27.29     1.19       64     0.02     0.02  cptrf1_
  3.98     28.47     1.18        1     1.18    26.48  matsim_
  0.78     28.70     0.23        1     0.23     3.17  evlrnf_

gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize rnflow.f90

time ./a.out
user    0m55.031s

profiled run:
user    1m0.124s

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 71.55     31.43    31.43 64527290     0.00     0.00  idamax_
  8.17     35.02     3.59       64     0.06     0.06  gentrs_
  6.65     37.94     2.92       64     0.05     0.53  cptrf2_
  4.89     40.09     2.15     6749     0.00     0.00  cmpcpt_
  2.66     41.26     1.17        1     1.17    40.19  matsim_
  2.53     42.37     1.11       64     0.02     0.02  cptrf1_
  0.80     42.72     0.35        1     0.35     3.70  evlrnf_

However, the idamax_ routine is identical in both cases.
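
For reference, idamax_ is a BLAS-style search for the index of the vector element with the largest absolute value. A minimal sketch of such a routine (not the exact rnflow source; shown only to illustrate what the hot loop does):

! Sketch of an idamax-style routine: return the index of the element with
! the largest absolute value in a strided vector.  Illustration only; the
! routine in rnflow may differ in detail.
integer function idamax (n, dx, incx)
  implicit none
  integer, intent(in) :: n, incx
  double precision, intent(in) :: dx(*)
  double precision :: dmax
  integer :: i, ix
  idamax = 0
  if (n < 1) return
  idamax = 1
  if (n == 1) return
  dmax = abs(dx(1))
  ix = 1 + incx
  do i = 2, n
     if (abs(dx(ix)) > dmax) then
        idamax = i
        dmax = abs(dx(ix))
     end if
     ix = ix + incx
  end do
end function idamax

The body is a plain scalar compare-and-update loop over abs() values, so the slowdown apparently comes from how the surrounding code is generated rather than from this loop itself, which matches the observation above that the routine is identical in both builds.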
Comment 2 Jakub Jelinek 2007-07-04 11:57:43 UTC
Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on Intel quadcore Core2):
/usr/src/gcc-4.2/obj/gcc/gfortran -B /usr/src/gcc-4.2/obj/gcc/ -L /usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -Wl,-rpath,/usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow42
/usr/src/gcc/obj/gcc/gfortran -B /usr/src/gcc/obj/gcc/ -L /usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -Wl,-rpath,/usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow43
gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear -O3 rnflow.f90 -o rnflow41

for i in 1 2 3; do time ./rnflow4$i > /dev/null; time ./rnflow4$i > /dev/null; done

real    0m30.003s
user    0m29.601s
sys     0m0.399s

real    0m29.811s
user    0m29.436s
sys     0m0.370s

real    0m29.875s
user    0m29.468s
sys     0m0.403s

real    0m29.824s
user    0m29.441s
sys     0m0.378s

real    0m26.007s
user    0m25.627s
sys     0m0.376s

real    0m25.822s
user    0m25.403s
sys     0m0.415s
Comment 3 Uroš Bizjak 2007-07-04 12:29:36 UTC
(In reply to comment #2)
> Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on Intel
> quadcore Core2):

On Core2 the bug doesn't trigger, but it shows up on an FC4 machine with:

vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.60GHz
stepping        : 10
cpu MHz         : 3600.970
cache size      : 2048 KB

This is one of the most mysterious bugs I've ever seen. The idamax_ routine is exactly the same for both builds, yet it shows such a difference. I have analyzed this with cachegrind, but nothing sticks out there.
Comment 4 Uroš Bizjak 2007-07-04 12:32:39 UTC
(In reply to comment #1)
> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90
> 
> time ./a.out
> user    0m37.982s

> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize
> rnflow.f90
> 
> time ./a.out
> user    0m55.031s

This is on the same Xeon as in Comment #3.

Comment 5 Uroš Bizjak 2007-07-18 12:05:24 UTC
The problem is in the cptrf2 function when both -mfpmath=387 and -ftree-vectorize are used.
Comment 6 Jan Hubicka 2007-10-09 14:54:59 UTC
It is a little bit sick, but what about implying -mfpmath=sse when -ftree-vectorize is used and SSE is available?

The reason we don't default to -mfpmath=sse is that the extra x87 precision is said to be part of the i386 ABI; with vectorization we are not going to maintain this "feature" anyway.
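
For illustration, a minimal sketch (not from rnflow) of how the extra x87 precision can change a result: built with -m32 -mfpmath=387 the intermediate sum below is typically kept in an 80-bit register and the program prints 1.0, while with -mfpmath=sse every intermediate is rounded to 64-bit double and it prints 0.0 (the exact behaviour depends on optimization level and register spilling):

program x87prec
  implicit none
  double precision :: a, b, c, r
  a = 1.0d16
  b = 1.0d0
  c = -1.0d16
  ! With x87 extended precision the intermediate (a + b) may keep its
  ! low-order bit, so r can come out as 1.0; in strict 64-bit double
  ! precision 1.0d16 + 1.0d0 rounds back to 1.0d16 and r is 0.0.
  r = (a + b) + c
  print *, r
end program x87prec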

I can easily imagine that many users will try -ftree-vectorize and forget about -mfpmath...

Honza
Comment 7 Uroš Bizjak 2007-10-19 17:53:02 UTC
(In reply to comment #6)
> It is little bit sick, but what about implying -mfpmath=sse when
> -ftree-vectorize is used and SSE is available?

Then you will hit the Core2 Duo, which shows the opposite in 32-bit and 64-bit mode:

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=387

user 0m22.785s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=sse

user 0m27.886s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=387

user 0m20.473s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=sse

user 0m25.046s
Comment 8 Uroš Bizjak 2007-12-21 18:57:56 UTC
Confirmed on K8 at http://gcc.gnu.org/ml/gcc-patches/2007-12/msg01042.html
Comment 9 Jan Hubicka 2008-01-19 12:03:22 UTC
Does the regression on the C2 Duo show up even without vectorizing?  It looks like a generic SSE fpmath performance issue.  There should be no reason why SSE math in combination with SSE vectorization should result in a regression... 
Comment 10 Uroš Bizjak 2008-01-19 16:31:18 UTC
(In reply to comment #9)
> Does the regression on the C2 Duo show up even without vectorizing?  It looks
> like a generic SSE fpmath performance issue.  There should be no reason why SSE
> math in combination with SSE vectorization should result in a regression... 

Hm, using the latest SVN, the C2D difference is only marginal:

gfortran -O3 -m64 -march=core2 -msse3 -ffast-math -funroll-loops -ftree-loop-linear

               (default)  -fno-tree-vectorize  -fno-vect-cost-model
-mfpmath=sse     21.37          21.38                21.41
-mfpmath=387     20.73          20.64                20.69

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         X6800  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.422
cache size      : 4096 KB

gcc version 4.3.0 20080119 (experimental) [trunk revision 131650] (GCC) 
Comment 11 Jan Hubicka 2008-01-30 16:04:29 UTC
If you have nothing against it, I would probably go for the route of having -ftree-vectorize imply -mfpmath=sse then.

Since there is now an i386 ABI mailing list, I hope we can move in the direction of having -mfpmath=sse by default on CPUs where it is a win and where it is available. It will however take a few years until we can reasonably ignore non-SSE2 CPUs :(
Comment 12 Richard Biener 2008-03-14 16:47:59 UTC
Adjusting target milestone.
Comment 13 Richard Biener 2008-06-06 14:56:57 UTC
4.3.1 is being released, adjusting target milestone.
Comment 14 Uroš Bizjak 2008-08-23 20:07:18 UTC
Looking at http://users.physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/, it looks like the problem was mysteriously fixed in recent mainline.
Comment 15 Joseph S. Myers 2008-08-27 22:01:50 UTC
4.3.2 is released, changing milestones to 4.3.3.
Comment 16 Richard Biener 2009-01-24 10:19:36 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 17 Richard Biener 2009-08-04 12:28:10 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 18 Richard Biener 2010-05-22 18:11:30 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 19 Richard Biener 2011-06-27 11:12:09 UTC
The GCC 4.3 branch is being closed.