Bug 17088 - [4.4 Regression] poor fortran optimisation at -O2/3
Summary: [4.4 Regression] poor fortran optimisation at -O2/3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.0.0
Importance: P2 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 13246
Blocks:
Reported: 2004-08-18 20:47 UTC by Joost VandeVondele
Modified: 2008-08-28 16:08 UTC (History)
CC List: 2 users

See Also:
Host:
Target: i686-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-06-18 01:48:07


Attachments
test program (1.60 KB, text/plain)
2004-08-18 20:48 UTC, Joost VandeVondele
ifort asm (8.11 KB, text/plain)
2008-08-28 15:56 UTC, Joost VandeVondele

Description Joost VandeVondele 2004-08-18 20:47:14 UTC
GNU F95 version 3.5.0 20040818 (experimental) (i686-pc-linux-gnu)
        compiled by GNU C version 3.5.0 20040818 (experimental).

The attached program (generated code to perform a small matrix multiply) 
illustrates interesting behaviour. The third number printed is the timing of 
the MULT subroutine and is the one of interest (the second number times the 
MATMUL built-in and the first measures the numerical error).

ifc -O2 test.f90               : 0.38 ! reference
gfortran -O1 test.f90          : 0.26 ! nice, faster than intel
gfortran -O2 test.f90          : 0.50 ! 2x slower than -O1
gfortran -O3 test.f90          : 0.50 ! idem
gfortran -O2 -fnew-ra test.f90 : 7.78 ! 30x slower than -O1

I have no idea what the magical switch would be to get good code at e.g. -O2.
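
For reference, the slowdown factors quoted in the comments above can be recomputed from the five timings. This is a minimal Python sketch; the dictionary labels are just shorthand for the command lines and are not part of the test program.

```python
# Timings (seconds) of the MULT subroutine, copied from the table above.
timings = {
    "ifc -O2": 0.38,
    "gfortran -O1": 0.26,
    "gfortran -O2": 0.50,
    "gfortran -O3": 0.50,
    "gfortran -O2 -fnew-ra": 7.78,
}

# Use the fastest gfortran result (-O1) as the baseline.
baseline = timings["gfortran -O1"]
slowdown = {label: t / baseline for label, t in timings.items()}

for label, ratio in slowdown.items():
    print(f"{label:24s} {ratio:6.2f}x relative to gfortran -O1")
```

With these numbers -O2 and -O3 come out at roughly 1.9x the -O1 time, and -O2 -fnew-ra at roughly 30x.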
Comment 1 Joost VandeVondele 2004-08-18 20:48:47 UTC
Created attachment 6955
test program
Comment 2 Andrew Pinski 2004-08-19 19:00:35 UTC
Confirming to ...
Comment 3 Andrew Pinski 2004-08-19 19:01:10 UTC
Suspending until either new-regalloc branch is merged to mainline, or bug is rechecked against 
new-regalloc branch.
Comment 4 Joost VandeVondele 2004-08-19 19:27:14 UTC
Well, it is not only new-ra that is doing badly (it is of course clearly 
worst, produces interesting asm btw). Even a normal -O2 slows down 
significantly as compared to -O1. 
Comment 5 Andrew Pinski 2004-08-19 19:29:36 UTC
Okay, I was thinking about undoing what I did but then decided against it, and now I am going to just 
confirm it and remove the new-ra part for now.
Comment 6 Steven Bosscher 2005-01-03 12:09:02 UTC
new-ra bug, so SUSPENDING.
Comment 7 Steven Bosscher 2005-01-03 12:09:15 UTC
new-ra bug, so SUSPENDING.
Comment 8 Steven Bosscher 2005-01-03 12:17:44 UTC
On closer inspection this is not a new-ra bug, sorry Joost.

Can you see how the numbers look for you today?  Don't use new-ra, it is
known to be very, very broken.
Comment 9 Joost VandeVondele 2005-01-06 21:30:38 UTC
(In reply to comment #8)
> On closer inspection this is not a new-ra bug, sorry Joost.
> Can you see how the numbers look for you today?  Don't use new-ra, it is
> known to be very, very broken.

timings for -O1 and -O2 are still unchanged for a recent version of gfortran, 
i.e. -O2 is half the speed of -O1
Comment 10 Andrew Pinski 2005-01-06 21:39:02 UTC
It looks to me like the register allocator is f'ing up, as on PPC (where there are more fp registers) -O2 is faster 
(by a factor of 2) than -O1.  It is also one of the reasons why new-ra could be fucking up too.
Comment 11 Andrew Pinski 2005-04-07 07:24:56 UTC
This seems to be fixed on the mainline at least for me:
gold:~>gfortran -O1 t.f90 
gold:~>!./
./a.out ; ./a.out ; ./a.out
  2.220446049250313E-016
   1.62675300000000       0.990850000000000     
  2.220446049250313E-016
   1.57976000000000        1.00884700000000     
  2.220446049250313E-016
   1.64775000000000       0.999848000000000     
gold:~>gfortran -O2 t.f90
gold:~>!./
./a.out ; ./a.out ; ./a.out
  4.440892098500626E-016
   1.49477200000000       0.722890000000000     
  4.440892098500626E-016
   1.53276600000000       0.716892000000000     
  4.440892098500626E-016
   1.53476700000000       0.707892000000000     
gold:~>gfortran -O3 t.f90
gold:~>!./
./a.out ; ./a.out ; ./a.out
  4.440892098500626E-016
   1.51277000000000       0.784881000000000     
  4.440892098500626E-016
   1.52476900000000       0.722890000000000     
  4.440892098500626E-016
   1.54276600000000       0.710892000000000  

Though it should still be possible to improve MATMUL.
Comment 12 Jerry DeLisle 2006-05-02 05:01:42 UTC
With:
$ gfc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../main/configure --prefix=/home/jerry/gcc/usr --enable-languages=c,fortran --disable-libmudflap
Thread model: posix
gcc version 4.2.0 20060424 (experimental)

$ gfc -O2 -march=pentium4 test-optimize.f90      <gfortran
$ ./a.out
  4.440892098500626E-016
  0.748046000000000       0.544034000000000
$ ifc -O2 test-optimize.f90                      <intel
$ ./a.out
  0.000000000000000E+000
  0.460028000000000       0.436027000000000

Still a lot of room for improvement here.  The bottom left number is the time using matmul and the right is the time for the hardcoded multiply.
Comment 13 Joost VandeVondele 2007-07-03 18:15:20 UTC
looks like current mainline is much slower than ifort (300%) on this testcase (on core2).

> ifort -xT -O2 test.f90
> ./a.out
  0.000000000000000E+000
  0.228014000000000       0.228014000000000
> gfortran -O3 -ffast-math -ftree-vectorize -march=native test.f90
> ./a.out
  0.00000000000000
  0.684042000000000       0.280018000000000

0.684042000000000 vs 0.228014000000000 seconds
Comment 14 Joost VandeVondele 2008-08-28 15:55:52 UTC
It looks like 4.4 performs even worse than 4.3 on the attached testcase.

gfortran -ffast-math -march=native -O3 PR17088.f90
trunk: 0.52803299999999997
4.3.0: 0.49202999999999997

ifort -xhost -O2 PR17088.f90
ifort: 0.136008000000000

so trunk is somehow 4 times slower than ifort...

Comment 15 Joost VandeVondele 2008-08-28 15:56:28 UTC
Created attachment 16158
ifort asm

ifort asm as a reference
Comment 16 Joost VandeVondele 2008-08-28 16:08:18 UTC
Actually, I've been misreading the numbers... the timings for the library function (MATMUL) are bad, not those for the generated code, which is reasonable with gfortran as well. I'll close the bug.
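
To make the misreading concrete, here is a small Python sketch that splits the comment 13 numbers by column. It assumes, following the description in comment 12, that the left-hand time is MATMUL and the right-hand time is the hand-coded multiply; the variable names are illustrative only.

```python
# Times (seconds) from comment 13. The column meaning is an assumption
# taken from comment 12: left = MATMUL built-in, right = hand-coded multiply.
ifort = {"matmul": 0.228014, "hand_coded": 0.228014}
gfortran = {"matmul": 0.684042, "hand_coded": 0.280018}

# gfortran time divided by ifort time, per column.
ratios = {kind: gfortran[kind] / ifort[kind] for kind in ifort}

for kind, ratio in ratios.items():
    print(f"{kind:10s} gfortran/ifort = {ratio:.2f}")
```

Under that reading, the 300% figure applies only to the MATMUL library call, while the generated code is about 1.23x the ifort time.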