GNU F95 version 3.5.0 20040818 (experimental) (i686-pc-linux-gnu) compiled by GNU C version 3.5.0 20040818 (experimental).

The attached program (generated code to perform a small matrix multiply) illustrates interesting behaviour. The third number printed is the timing of the MULT subroutine and is the one of interest (the second times the MATMUL built-in and the first measures the numerical error).

ifc -O2 test.f90               : 0.38 ! reference
gfortran -O1 test.f90          : 0.26 ! nice, faster than intel
gfortran -O2 test.f90          : 0.50 ! 2x slower than -O1
gfortran -O3 test.f90          : 0.50 ! idem
gfortran -O2 -fnew-ra test.f90 : 7.78 ! 30x slower than -O1

I have no idea what the magical switch would be to get good code at e.g. -O2.
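For readers without the attachment, here is a minimal sketch of the kind of harness the report describes. It is not the attached test.f90. The matrix size, repetition count, loop order and all names other than MULT and MATMUL are assumptions made for illustration, and the real test case uses generated code rather than a plain triple loop. It prints the same three columns: numerical error, MATMUL time, MULT time.

```fortran
! Hypothetical stand-in for the attached test.f90 (assumed sizes and names).
program small_matmul_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4            ! assumed "small" matrix size
  integer, parameter :: nrep = 1000000   ! assumed repetition count
  real(dp) :: a(n,n), b(n,n), c_ref(n,n), c_hand(n,n)
  real(dp) :: t0, t1, t_matmul, t_mult
  integer :: irep

  call random_number(a)
  call random_number(b)

  ! Time the MATMUL intrinsic (library code in gfortran).
  c_ref = 0.0_dp
  call cpu_time(t0)
  do irep = 1, nrep
     c_ref = c_ref + matmul(a, b)
  end do
  call cpu_time(t1)
  t_matmul = t1 - t0

  ! Time the hand-written MULT subroutine (compiler-generated code).
  c_hand = 0.0_dp
  call cpu_time(t0)
  do irep = 1, nrep
     call mult(a, b, c_hand)
  end do
  call cpu_time(t1)
  t_mult = t1 - t0

  ! Column 1: numerical error, column 2: MATMUL time, column 3: MULT time.
  print *, maxval(abs(c_ref - c_hand)), t_matmul, t_mult

contains

  ! Accumulates a*b into c with plain loops.
  subroutine mult(a, b, c)
    real(dp), intent(in)    :: a(n,n), b(n,n)
    real(dp), intent(inout) :: c(n,n)
    integer :: i, j, k
    do j = 1, n
       do k = 1, n
          do i = 1, n
             c(i,j) = c(i,j) + a(i,k) * b(k,j)
          end do
       end do
    end do
  end subroutine mult

end program small_matmul_timing
```

Both loops accumulate into their result so that the work is less likely to be optimised away, though an aggressive compiler may still hoist the loop-invariant MATMUL call. Separating the two timings this way matters when reading the numbers: the second column reflects the runtime library, and only the third reflects the optimiser and register allocator.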
Created attachment 6955 [details] test program
Confirming to ...
Suspending until either new-regalloc branch is merged to mainline, or bug is rechecked against new-regalloc branch.
Well, it is not only new-ra that is doing badly (it is of course clearly worst, produces interesting asm btw). Even a normal -O2 slows down significantly as compared to -O1.
Okay, I was thinking about undoing what I did but then decided against it; now I am just going to confirm it and remove the new-ra part for now.
new-ra bug, so SUSPENDING.
On closer inspection this is not a new-ra bug, sorry Joost. Can you see how the numbers look for you today? Don't use new-ra, it is known to be very, very broken.
(In reply to comment #8)
> On closer inspection this is not a new-ra bug, sorry Joost.
> Can you see how the numbers look for you today? Don't use new-ra, it is
> known to be very, very broken.

Timings for -O1 and -O2 are still unchanged for a recent version of gfortran, i.e. -O2 is half the speed of -O1.
It looks to me like the register allocator is f'ing up, since on PPC (where there are more FP registers) -O2 is faster than -O1 by a factor of 2. That is also one of the reasons why new-ra could be fucking up too.
This seems to be fixed on the mainline, at least for me:

gold:~>gfortran -O1 t.f90
gold:~>!./
./a.out ; ./a.out ; ./a.out
2.220446049250313E-016 1.62675300000000 0.990850000000000
2.220446049250313E-016 1.57976000000000 1.00884700000000
2.220446049250313E-016 1.64775000000000 0.999848000000000
gold:~>gfortran -O2 t.f90
gold:~>!./
./a.out ; ./a.out ; ./a.out
4.440892098500626E-016 1.49477200000000 0.722890000000000
4.440892098500626E-016 1.53276600000000 0.716892000000000
4.440892098500626E-016 1.53476700000000 0.707892000000000
gold:~>gfortran -O3 t.f90
gold:~>!./
./a.out ; ./a.out ; ./a.out
4.440892098500626E-016 1.51277000000000 0.784881000000000
4.440892098500626E-016 1.52476900000000 0.722890000000000
4.440892098500626E-016 1.54276600000000 0.710892000000000

MATMUL should still be able to be improved, though.
With:

$ gfc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../main/configure --prefix=/home/jerry/gcc/usr --enable-languages=c,fortran --disable-libmudflap
Thread model: posix
gcc version 4.2.0 20060424 (experimental)

$ gfc -O2 -march=pentium4 test-optimize.f90    <gfortran
$ ./a.out
4.440892098500626E-016 0.748046000000000 0.544034000000000

$ ifc -O2 test-optimize.f90    <intel
$ ./a.out
0.000000000000000E+000 0.460028000000000 0.436027000000000

Still a lot of room for improvement here. The middle number is the time using MATMUL and the right one is the time for the hard-coded multiply.
Looks like current mainline is much slower than ifort (300%) on this testcase (on core2).

> ifort -xT -O2 test.f90
> ./a.out
0.000000000000000E+000 0.228014000000000 0.228014000000000

> gfortran -O3 -ffast-math -ftree-vectorize -march=native test.f90
> ./a.out
0.00000000000000 0.684042000000000 0.280018000000000

0.584042000000000 vs 0.228014000000000 seconds
It looks like 4.4 performs even worse than 4.3 on the attached testcase.

gfortran -ffast-math -march=native -O3 PR17088.f90
trunk: 0.52803299999999997
4.3.0: 0.49202999999999997

ifort -xhost -O2 PR17088.f90
ifort: 0.136008000000000

So trunk is somehow 4 times slower than ifort...
Created attachment 16158 [details] ifort asm

ifort asm as a reference
Actually, I've been misreading the numbers... It is the timings for the library function (MATMUL) that are bad, not those for the generated code, which are reasonable with gfortran as well. I'll close the bug.