Bug#: 31897
Status: NEW
Assigned To: Not yet assigned to anyone <unassigned@gcc.gnu.org>
Reporter: burnus@gcc.gnu.org

Description:   Last confirmed: 2007-12-21 18:57 Opened: 2007-05-10 16:53
Since 20 April, the Polyhedron test case rnflow runs 30% slower than before.

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -O3 rnflow.f90

Today's gfortran: real    0m58.205s / user    0m56.600s
gfortran 2007-04-20 (r123986): real    0m58.237s / user    0m56.396s
gfortran 2007-04-16 (r123859): real    0m43.912s / user    0m42.403s
gfortran 4.2.0: real    0m45.449s / user    0m43.859s
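
As a quick check of the "30% slower" figure, the user times listed above can be compared directly. This is only a sketch: the helper names `parse_time` and `slowdown_percent` are made up for illustration, and the r123859 build is assumed to be the last good baseline.

```python
def parse_time(text):
    """Convert a `time`-style string such as '0m56.396s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def slowdown_percent(before, after):
    """Percentage by which `after` is slower than `before`."""
    return (parse_time(after) / parse_time(before) - 1.0) * 100.0


# User times copied from the list above.
user_times = {
    "today": "0m56.600s",
    "r123986": "0m56.396s",
    "r123859": "0m42.403s",
    "4.2.0": "0m43.859s",
}

baseline = user_times["r123859"]  # assumed last good revision
for build in ("today", "r123986", "4.2.0"):
    print(f"{build}: {slowdown_percent(baseline, user_times[build]):+.1f}% vs r123859")
```

By this measure the two slow builds are roughly a third slower than r123859 in user time, while 4.2.0 is within a few percent of it.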

This only affects compilation with these options.

Using the following options:

gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-loop-linear
-ftree-vectorize -mfpmath=sse -msse3 -O3

or compiling for a 64bit system does not show this slowdown.

See also:
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/run-rnflow.png
http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/
and for x86-64 which is not affected, see also:
http://www.suse.de/~gcctest/c++bench/polyhedron/polyhedron-summary.txt-2-0.html

The benchmark can be obtained from:
http://www.polyhedron.co.uk/pb05/polyhedron_benchmark_suite.html
directory: pb05/lin/source

------- Comment #1 From Uros Bizjak 2007-05-30 15:08 -------
gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90

time ./a.out
user    0m37.982s

profiled run:
user    0m43.147s

each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 59.76     17.73    17.73 64527290     0.00     0.00  idamax_
 11.80     21.23     3.50       64     0.05     0.06  gentrs_
  9.47     24.04     2.81       64     0.04     0.32  cptrf2_
  6.94     26.10     2.06     6749     0.00     0.00  cmpcpt_
  4.01     27.29     1.19       64     0.02     0.02  cptrf1_
  3.98     28.47     1.18        1     1.18    26.48  matsim_
  0.78     28.70     0.23        1     0.23     3.17  evlrnf_

gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize
rnflow.f90

time ./a.out
user    0m55.031s

profiled run:
user    1m0.124s

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 71.55     31.43    31.43 64527290     0.00     0.00  idamax_
  8.17     35.02     3.59       64     0.06     0.06  gentrs_
  6.65     37.94     2.92       64     0.05     0.53  cptrf2_
  4.89     40.09     2.15     6749     0.00     0.00  cmpcpt_
  2.66     41.26     1.17        1     1.17    40.19  matsim_
  2.53     42.37     1.11       64     0.02     0.02  cptrf1_
  0.80     42.72     0.35        1     0.35     3.70  evlrnf_

However, the idamax_ routine is identical in both cases.
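
To see how much of the difference lands in idamax_, the "self seconds" and "calls" columns of the two profiles can be compared. The sketch below only redoes arithmetic on the numbers already shown; the dictionary and function names are illustrative, not part of any tool.

```python
# (self seconds, calls) per function, copied from the two gprof tables above.
baseline = {
    "idamax_": (17.73, 64527290),
    "gentrs_": (3.50, 64),
    "cptrf2_": (2.81, 64),
    "cmpcpt_": (2.06, 6749),
    "cptrf1_": (1.19, 64),
    "matsim_": (1.18, 1),
    "evlrnf_": (0.23, 1),
}
vectorized = {
    "idamax_": (31.43, 64527290),
    "gentrs_": (3.59, 64),
    "cptrf2_": (2.92, 64),
    "cmpcpt_": (2.15, 6749),
    "cptrf1_": (1.11, 64),
    "matsim_": (1.17, 1),
    "evlrnf_": (0.35, 1),
}


def per_call_ns(entry):
    """Average self time per call, in nanoseconds."""
    self_seconds, calls = entry
    return self_seconds / calls * 1e9


# Extra self time per function when -ftree-vectorize is added.
extra = {name: vectorized[name][0] - baseline[name][0] for name in baseline}
total_extra = sum(extra.values())
idamax_share = extra["idamax_"] / total_extra

print(f"idamax_ per call: {per_call_ns(baseline['idamax_']):.0f} ns -> "
      f"{per_call_ns(vectorized['idamax_']):.0f} ns")
print(f"idamax_ share of the extra self time: {idamax_share:.0%}")
```

Among the functions listed, nearly all of the added self time is attributed to idamax_, even though the call count is unchanged.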

------- Comment #2 From Jakub Jelinek 2007-07-04 11:57 -------
Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on Intel
quadcore Core2):
/usr/src/gcc-4.2/obj/gcc/gfortran -B /usr/src/gcc-4.2/obj/gcc/ -L
/usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/
-Wl,-rpath,/usr/src/gcc-4.2/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/
-m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -O3 rnflow.f90 -o rnflow42
/usr/src/gcc/obj/gcc/gfortran -B /usr/src/gcc/obj/gcc/ -L
/usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/
-Wl,-rpath,/usr/src/gcc/obj/x86_64-unknown-linux-gnu/32/libgfortran/.libs/ -m32
-march=opteron -ffast-math -funroll-loops -ftree-vectorize -ftree-loop-linear
-O3 rnflow.f90 -o rnflow43
gfortran -m32 -march=opteron -ffast-math -funroll-loops -ftree-vectorize
-ftree-loop-linear -O3 rnflow.f90 -o rnflow41

for i in 1 2 3; do time ./rnflow4$i > /dev/null; time ./rnflow4$i > /dev/null;
done

real    0m30.003s
user    0m29.601s
sys     0m0.399s

real    0m29.811s
user    0m29.436s
sys     0m0.370s

real    0m29.875s
user    0m29.468s
sys     0m0.403s

real    0m29.824s
user    0m29.441s
sys     0m0.378s

real    0m26.007s
user    0m25.627s
sys     0m0.376s

real    0m25.822s
user    0m25.403s
sys     0m0.415s

------- Comment #3 From Uros Bizjak 2007-07-04 12:29 -------
(In reply to comment #2)
> Can't reproduce this, gcc 4.3 actually seems to be faster (tests done on Intel
> quadcore Core2):

On core2 the bug doesn't trigger, but it shows on FC4 with:

vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.60GHz
stepping        : 10
cpu MHz         : 3600.970
cache size      : 2048 KB

This is one of the most mysterious bugs I've ever seen. The idamax_ routine is
exactly the same for both builds, but it shows such a difference. I have
analyzed this with cachegrind but nothing sticks out there.
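
For readers unfamiliar with the routine, here is a minimal Python sketch of what an idamax-style function computes. It assumes rnflow's idamax follows the semantics of the reference BLAS IDAMAX (1-based index of the first element with the largest absolute value); the thread does not show the Fortran source, so treat this as an illustration rather than the benchmark's code.

```python
def idamax(n, dx, incx=1):
    """Return the 1-based index of the first element of largest magnitude.

    Follows the reference BLAS convention: 0 is returned when n < 1 or
    incx <= 0, and elements are read with stride incx.
    """
    if n < 1 or incx <= 0:
        return 0
    best_index = 1
    best_value = abs(dx[0])
    for i in range(1, n):
        value = abs(dx[i * incx])
        if value > best_value:  # strict '>' keeps the first maximum
            best_index = i + 1
            best_value = value
    return best_index


print(idamax(4, [1.0, -7.5, 3.0, 7.5]))
```

A routine like this is a short scan with one comparison per element, which is consistent with it being called tens of millions of times in the profiles.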

------- Comment #4 From Uros Bizjak 2007-07-04 12:32 -------
(In reply to comment #1)
> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 rnflow.f90
> 
> time ./a.out
> user    0m37.982s

> gfortran -ffast-math -funroll-loops -O3 -msse3 -mfpmath=387 -ftree-vectorize
> rnflow.f90
> 
> time ./a.out
> user    0m55.031s

This is on the Xeon described in comment #3.

------- Comment #5 From Uros Bizjak 2007-07-18 12:05 -------
The problem is in the cptrf2 function when both -mfpmath=387 and
-ftree-vectorize are used.

------- Comment #6 From Jan Hubicka 2007-10-09 14:54 -------
It is a little bit sick, but what about implying -mfpmath=sse when
-ftree-vectorize is used and SSE is available?

The reason why we don't default to -mfpmath=sse is that the extra precision is
said to be part of the i386 ABI; with vectorization we are not going to maintain
this "feature" anyway.

I can easily imagine that many users will try -ftree-vectorize and forget about
-mfpmath...

Honza

------- Comment #7 From Uros Bizjak 2007-10-19 17:53 -------
(In reply to comment #6)
> It is a little bit sick, but what about implying -mfpmath=sse when
> -ftree-vectorize is used and SSE is available?

Then you will hit Core 2 Duo, which shows the opposite in both 32-bit and 64-bit mode:

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=387

user 0m22.785s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m32 -mfpmath=sse

user 0m27.886s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=387

user 0m20.473s

-O3 -ffast-math -ftree-vectorize -funroll-loops -msse3 -m64 -mfpmath=sse

user 0m25.046s
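
The size of the effect on this Core 2 Duo can be put in relative terms. The sketch below reads all four figures as minutes and seconds of user time (an assumption for the two -mfpmath=sse entries); the names are illustrative.

```python
# User time in seconds for each (mode, fpmath) combination listed above.
user_seconds = {
    ("-m32", "387"): 22.785,
    ("-m32", "sse"): 27.886,
    ("-m64", "387"): 20.473,
    ("-m64", "sse"): 25.046,
}


def sse_penalty_percent(mode):
    """How much slower -mfpmath=sse is than -mfpmath=387 for a given mode."""
    sse = user_seconds[(mode, "sse")]
    x87 = user_seconds[(mode, "387")]
    return (sse / x87 - 1.0) * 100.0


for mode in ("-m32", "-m64"):
    print(f"{mode}: -mfpmath=sse is {sse_penalty_percent(mode):.1f}% slower "
          f"than -mfpmath=387")
```

In both modes the SSE build comes out a little over a fifth slower on this machine, which is the opposite of the Xeon result.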

------- Comment #8 From Uros Bizjak 2007-12-21 18:57 -------
Confirmed on K8 at http://gcc.gnu.org/ml/gcc-patches/2007-12/msg01042.html

------- Comment #9 From Jan Hubicka 2008-01-19 12:03 -------
Does the regression on C2 Duo show even without vectorizing?  It looks like a
generic SSE fpmath performance issue.  There should be no reason why SSE math
in combination with SSE vectorization should result in a regression...

------- Comment #10 From Uros Bizjak 2008-01-19 16:31 -------
(In reply to comment #9)
> Does the regression on C2 Duo show even without vectorizing?  It looks like a
> generic SSE fpmath performance issue.  There should be no reason why SSE math
> in combination with SSE vectorization should result in a regression...

Hm, using latest SVN, the C2D difference is only marginal:

gfortran -O3 -m64 -march=core2 -msse3 -ffast-math -funroll-loops
-ftree-loop-linear

extra option:  (none)   -fno-tree-vectorize   -fno-vect-cost-model
-mfpmath=sse   21.37    21.38                 21.41
-mfpmath=387   20.73    20.64                 20.69

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU         X6800  @ 2.93GHz
stepping        : 5
cpu MHz         : 2933.422
cache size      : 4096 KB

gcc version 4.3.0 20080119 (experimental) [trunk revision 131650] (GCC) 

------- Comment #11 From Jan Hubicka 2008-01-30 16:04 -------
If you have nothing against it, I would probably go the route of having
-ftree-vectorize imply -mfpmath=sse then.

Since there is now an i386 ABI mailing list, I hope we can move in the direction
of having -mfpmath=sse by default on CPUs where it is a win and when it is
available. It will, however, take a few years until we can reasonably ignore
non-SSE2 CPUs :(

------- Comment #12 From Richard Guenther 2008-03-14 16:47 -------
Adjusting target milestone.

------- Comment #13 From Richard Guenther 2008-06-06 14:56 -------
4.3.1 is being released, adjusting target milestone.

------- Comment #14 From Uros Bizjak 2008-08-23 20:07 -------
Looking at http://users.physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/, it
looks like the problem was mysteriously fixed in recent mainline.

------- Comment #15 From Joseph S. Myers 2008-08-27 22:01 -------
4.3.2 is released, changing milestones to 4.3.3.

------- Comment #16 From Richard Guenther 2009-01-24 10:19 -------
GCC 4.3.3 is being released, adjusting target milestone.

------- Comment #17 From Richard Guenther 2009-08-04 12:28 -------
GCC 4.3.4 is being released, adjusting target milestone.
