Bug 31021 - gfortran 20% slower than ifort on CP2K computational kernel
Status: WAITING
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 37150
Blocks: vectorizer
Reported: 2007-03-02 08:38 UTC by Joost VandeVondele
Modified: 2013-03-29 08:15 UTC
CC List: 10 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-02-06 21:34:18


Attachments
gfortran kernel asm (541 bytes, text/plain)
2007-03-02 08:39 UTC, Joost VandeVondele
ifort kernel asm (813 bytes, text/plain)
2007-03-02 08:39 UTC, Joost VandeVondele

Description Joost VandeVondele 2007-03-02 08:38:00 UTC
I've extracted the computational kernel of CP2K (see PR 29975) for easier benchmarking. Together with required utility routines to turn it into a self-contained program and data to test it, I have made it available here:

http://www.pci.unizh.ch/vandevondele/tmp/extracted_collocate.tgz

The summary is that gfortran (yesterday's trunk) is about 20% slower than ifort (ifort (IFORT) 9.1 20060707) on my machine. To reproduce, download and untar the archive above, set the relevant FC in the Makefile, and run:
make
make run

A run takes a few seconds and yields:
gfortran '-O3 -march=native -ffast-math -ffree-form -ftree-vectorize':
 # of primitives       154502
 # computational kernel timings            5
 Kernel time   4.612288
 Kernel time   4.616289
 [...]
ifort  -xP -O3 -free
 # of primitives       154502
 # computational kernel timings            5
 Kernel time   3.796237
 Kernel time   3.800237
[...]

so gfortran is 21.5% slower in this case. I haven't found any options that make gfortran much faster (in fact, the timings are very insensitive to the options used), and the difference is unrelated to any IPO (I actually notice now that ifort is slightly faster at -O2). Since this might be relevant, the timings were taken on:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
stepping        : 6

About 80% of the computational time is spent in a single routine (collocate_core in grid_fast.F), which in turn is dominated by the inner loops in the select case statement, and of those, the one over ig is (or should be) dominant. An example is the loop starting at line 216 of grid_fast.F. Looking at the asm for this loop (my best guess at which code corresponds to it, as I have little experience), my main observation is that it contains 36 mov* instructions with ifort and 51 mov* instructions with gfortran (and the same number of mulsd and addsd), which could explain the slowdown. I'll attach the respective asm.

I'm of course happy to try other compile flags for gfortran, and hints on how to rewrite the kernels to get better performance with gfortran would also be much appreciated.
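For reference, the 21.5% figure can be reproduced from the first "Kernel time" line reported for each compiler above. This is only a small arithmetic sketch; the helper name percent_slower is hypothetical and not part of the testcase.

```python
def percent_slower(t_slow: float, t_fast: float) -> float:
    """Return how much longer t_slow is than t_fast, in percent."""
    return (t_slow / t_fast - 1.0) * 100.0


# First "Kernel time" value printed for each compiler in the description.
gfortran_time = 4.612288
ifort_time = 3.796237

slowdown = percent_slower(gfortran_time, ifort_time)
print(f"gfortran is {slowdown:.1f}% slower than ifort")
```

The ratio is taken relative to the faster (ifort) time, which is the convention that gives "21.5% slower".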
Comment 1 Joost VandeVondele 2007-03-02 08:39:24 UTC
Created attachment 13131
gfortran kernel asm
Comment 2 Joost VandeVondele 2007-03-02 08:39:57 UTC
Created attachment 13132
ifort kernel asm
Comment 3 Tobias Burnus 2007-03-02 09:38:59 UTC
On my "AMD Athlon(tm) 64 X2 Dual Core Processor 4800+", gfortran in x86_64 mode is only 13% slower:
gfortran: Kernel time 5.872366, real 0m33.121s, user 0m32.898s, sys 0m0.088s.
ifort:    Kernel time 5.244328, real 0m28.893s, user 0m28.758s, sys 0m0.076s.
Options: "ifort -xP -O3 -xW -free" and "gfortran -O3 -march=native -ffast-math -ffree-form -ftree-vectorize -funroll-loops".

For grid_fast.F, one difference is which loops are vectorized: ifort vectorizes the loops at lines 44, 469, 483 and 496, whereas gfortran only vectorizes the loops at lines 469 and 496. For the other two, gfortran reports:

grid_fast.F:44: note: not vectorized: complicated access pattern.
          DO lz=1,lz_max(lxy)
             lxyz=lxyz+1
             pyx(1,lxy)=pyx(1,lxy)+pzyx(lxyz)*polz(lxyz,kg)
             pyx(2,lxy)=pyx(2,lxy)+pzyx(lxyz)*polz(lxyz,kg2)
          ENDDO

grid_fast.F:483: note: not vectorized: can't determine dependence between (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
              DO icoef=1,coef_max
                 coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
                 coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
                 coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
                 coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
              ENDDO
Comment 4 Joost VandeVondele 2007-03-02 09:55:05 UTC
(In reply to comment #3)
> On my "AMD Athlon(tm) 64 X2 Dual Core Processor 4800+", gfortran is in x86_64
> mode only 13% slower:
> gfortran: Kernel time 5.872366, real 0m33.121s; user 0m32.898s; sys 0m0.088s.
> Ifort:    Kernel time 5.244328, real 0m28.893s, user 0m28.758s, sys 0m0.076s.
> Options: "ifort -xP -O3 -xW -free" and "gfortran -O3 -march=native -ffast-math
> -ffree-form -ftree-vectorize -funroll-loops".
> 
> For grid_fast.F, one difference is which loops are vectorized; ifort vectorizes
> the loops in line 44, 469, 483 and 496, gfortran only vectorizes the loops in
> line 496 and 469; for the other ones:
> 
> grid_fast.F:44: note: not vectorized: complicated access pattern.
>           DO lz=1,lz_max(lxy)
>              lxyz=lxyz+1
>              pyx(1,lxy)=pyx(1,lxy)+pzyx(lxyz)*polz(lxyz,kg)
>              pyx(2,lxy)=pyx(2,lxy)+pzyx(lxyz)*polz(lxyz,kg2)
>           ENDDO

This might matter a bit, but this is not an inner loop, so I don't think it accounts for much of the time. Having it vectorized would of course be good.

> 
> grid_fast.F:483: note: not vectorized: can't determine dependence between
> (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
>               DO icoef=1,coef_max
>                  coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
>                  coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
>                  coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
>                  coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
>               ENDDO
> 

This part, which is in the default branch of the select case statement, should only be executed in rare cases. I doubt it matters much in the overall timings. Also, this loop has a very short trip count (i.e. coef_max should, for the provided input, be at most 5).
Comment 5 Joost VandeVondele 2007-03-02 18:15:06 UTC
> > 
> > grid_fast.F:483: note: not vectorized: can't determine dependence between
> > (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
> >               DO icoef=1,coef_max
> >                  coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
> >                  coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
> >                  coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
> >                  coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
> >               ENDDO
> > 
> 
> This part, which is in the default part of the switch statement should only be
> executed in rare cases. I doubt it matters much in the overall timings. Also,
> this loop has very short trips (i.e. coef_max should, for the provided input,
> be at most 5).

I verified that the default branch is indeed not called frequently enough for this to matter. However, by deleting all other cases (equivalent, but specialized code), I can time that case, and find:
gfortran: 6.636415
ifort: 5.252329
which means ifort is about 26% faster for the 'case default' branch.
Comment 6 Francois-Xavier Coudert 2008-05-10 12:16:39 UTC
With current trunk, I see gfortran being 5% faster than Intel 10.0 on a Dual-Core AMD Opteron(tm) Processor 2212 at 2GHz. Joost, does this still run too slowly on your particular setup?
Comment 7 Joost VandeVondele 2008-05-10 12:30:12 UTC
(In reply to comment #6)
> With current trunk, I see current mainline gfortran being 5% faster than Intel
> 10.0 on a Dual-Core AMD Opteron(tm) Processor 2212 at 2GHz. Joost, on your
> particular setup, does this still run too slow?

Right now, the testcase from the description is still about 20% slower with gfortran than with ifort.
This is, however, with gfortran 4.3.0. Furthermore, it matters on which CPU you run this (in particular Intel vs. AMD). 

To summarize:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
stepping        : 6

ifort (IFORT) 9.1 20060707
Kernel time   3.812238

gcc version 4.3.0 (GCC)
Kernel time   4.5482836

I'll try to build trunk on this machine and test again, but it might not be for today.


Comment 8 Joost VandeVondele 2008-05-10 13:43:36 UTC
(In reply to comment #7)
> This is, however, with gfortran 4.3.0. 

Trunk is marginally faster than 4.3.0, but still about 20% slower than ifort:
Kernel time   4.5042820
Comment 9 Steven Bosscher 2009-02-06 21:34:18 UTC
Confirmed with gcc 4.3.  Where do we stand today?
Comment 10 Joost VandeVondele 2009-02-07 07:50:26 UTC
(In reply to comment #9)
> Confirmed with gcc 4.3.  Where do we stand today?

Same place:

gfortran -O3 -march=native -ffast-math -ffree-form -ftree-vectorize
gcc version 4.4.0 20090207 (experimental) (GCC)
> ./a.out
 # of primitives       154502
 # computational kernel timings            5
 Kernel time   4.4882798
 Kernel time   4.4922795
 Kernel time   4.4882793

ifort -v
Version 9.1
./a.out
 # of primitives       154502
 # computational kernel timings            5
 Kernel time   3.800237
 Kernel time   3.792237
 Kernel time   3.796237
Comment 11 Joost VandeVondele 2009-06-20 09:59:23 UTC
Some more progress with 4.5.0, but not quite there yet:

./a.out
 # of primitives       154502
 # computational kernel timings            5
 Kernel time   4.3522720
 Kernel time   4.3562722
 Kernel time   4.3522720
 Kernel time   4.3522720
 Kernel time   4.3562717
Comment 12 Richard Biener 2009-06-20 10:46:39 UTC
Usual things to try are: -fno-tree-pre, -fno-ivopts, -fschedule-insns (on top
of the usual -O3 -ffast-math -funroll-loops setting, of course).
Comment 13 Joost VandeVondele 2009-06-20 11:37:01 UTC
(In reply to comment #12)
> Usual things to try are: -fno-tree-pre, -fno-ivopts, -fschedule-insns (on top
> of the usual -O3 -ffast-math -funroll-loops setting, of course).

-O3 -march=native -ffast-math -ffree-form -ftree-vectorize: 4.3482709

added on top of the above independently:
-funroll-loops: 4.2682667
-fschedule-insns: 4.3962746
-fno-tree-pre: 4.4682798
-fno-ivopts: 4.8963070
-funroll-loops -fno-ivopts: 4.7722988
-funroll-loops -fschedule-insns: 4.4242764

So the best so far is:

-O3 -march=native -ffast-math -ffree-form -ftree-vectorize -funroll-loops: 4.2682667
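The comparison above can be tabulated as a quick sketch that ranks each extra flag set against the baseline. The numbers are copied from this comment; the variable and function names are hypothetical.

```python
# Baseline: -O3 -march=native -ffast-math -ffree-form -ftree-vectorize
baseline_time = 4.3482709

# Kernel times measured with extra flags added on top of the baseline.
extra_flag_times = {
    "-funroll-loops": 4.2682667,
    "-fschedule-insns": 4.3962746,
    "-fno-tree-pre": 4.4682798,
    "-fno-ivopts": 4.8963070,
    "-funroll-loops -fno-ivopts": 4.7722988,
    "-funroll-loops -fschedule-insns": 4.4242764,
}


def change_vs_baseline(kernel_time: float) -> float:
    """Percent change relative to the baseline (negative means faster)."""
    return (kernel_time / baseline_time - 1.0) * 100.0


# Sort from fastest to slowest kernel time.
ranked = sorted(extra_flag_times.items(), key=lambda item: item[1])
for flags, kernel_time in ranked:
    print(f"{flags:<34} {kernel_time:.7f}  {change_vs_baseline(kernel_time):+.1f}%")

best_flags, best_time = ranked[0]
```

Only -funroll-loops on its own improves on the baseline, and by less than 2%, which is consistent with the earlier observation that the timings are insensitive to the options used.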
Comment 14 Richard Biener 2013-03-27 11:34:28 UTC
The testcase is lost; the URL no longer works. Can you please attach it here?
Comment 15 Joost VandeVondele 2013-03-27 11:47:12 UTC
New URL:

https://www.dropbox.com/s/g28kdvatrgeu6hm/extracted_collocate.tgz

(contains nearly 2 MB of data needed to run the testcase).

The difference between trunk and ifort has become smaller. I'm now seeing only a 5% difference (on a different CPU):

3.50946712 vs. 3.354490

I adjusted the ifort options in the Makefile to use -xHost.
Comment 16 Joost VandeVondele 2013-03-29 08:15:30 UTC
I believe this is actually testing the same kernel (maybe a slightly older variant) as in PR37150. I would rather revisit this once PR37150 has been fixed.