I've extracted the computational kernel of CP2K (see PR 29975) for easier benchmarking. Together with the utility routines required to turn it into a self-contained program and data to test it, I have made it available here:

http://www.pci.unizh.ch/vandevondele/tmp/extracted_collocate.tgz

The summary is that gfortran (yesterday's trunk) is about 20% slower than ifort (ifort (IFORT) 9.1 20060707) on my machine. To reproduce, untar the above archive and, after specifying the relevant FC in the Makefile, use

    make
    make run

A run takes a few seconds, and yields

gfortran '-O3 -march=native -ffast-math -ffree-form -ftree-vectorize':

    # of primitives 154502
    # computational kernel timings 5
    Kernel time 4.612288
    Kernel time 4.616289
    [...]

ifort -xP -O3 -free

    # of primitives 154502
    # computational kernel timings 5
    Kernel time 3.796237
    Kernel time 3.800237
    [...]

so gfortran is 21.5% slower in this case. I haven't found any options that made gfortran much faster (in fact, timings are very insensitive to the options used), and the gap is unrelated to any IPO (I actually notice now that ifort is slightly faster at -O2). Since this might be relevant, timings are on:

    vendor_id : GenuineIntel
    cpu family : 6
    model : 15
    model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
    stepping : 6

About 80% of the computational time is spent in a single routine (collocate_core in grid_fast.F), which in turn is dominated by the inner loops in the select case statement, and of those, the one over ig is (or should be) dominant; for example, the loop starting at line 216 of grid_fast.F. Looking at the asm for this loop (my best guess at which asm belongs to that loop, as I have little experience), my main observation is that it contains 36 mov* instructions with ifort and 51 mov* instructions with gfortran (and the same number of mulsd and addsd), which could explain the slowdown. I'll attach the respective asm.
I'm of course happy to try other compile flags for gfortran, and also hints on how to rewrite the kernels in order to get better performance with gfortran would be much appreciated.
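
For reference, the 21.5% figure can be reproduced from the kernel times quoted above. The following is only a small arithmetic sketch; the helper names (slowdown_percent, mean) are made up for illustration and are not part of the testcase, and only the two timings shown per compiler are used.

```python
# Kernel times in seconds, copied from the two runs quoted above.
gfortran_times = [4.612288, 4.616289]
ifort_times = [3.796237, 3.800237]


def slowdown_percent(slow, fast):
    """Return how much longer `slow` takes than `fast`, in percent."""
    return (slow / fast - 1.0) * 100.0


def mean(values):
    """Arithmetic mean of a non-empty list of timings."""
    return sum(values) / len(values)


# Compare the first timing of each compiler, then the averages.
first_run = slowdown_percent(gfortran_times[0], ifort_times[0])
averaged = slowdown_percent(mean(gfortran_times), mean(ifort_times))

print(f"first run: gfortran is {first_run:.1f}% slower than ifort")
print(f"averaged:  gfortran is {averaged:.1f}% slower than ifort")
```

Note that "gfortran is X% slower" (ratio of times minus one) is not the same number as "ifort needs X% less time"; the sketch uses the first definition.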
Created attachment 13131: gfortran kernel asm
Created attachment 13132: ifort kernel asm
On my "AMD Athlon(tm) 64 X2 Dual Core Processor 4800+", gfortran is in x86_64 mode only 13% slower:

    gfortran: Kernel time 5.872366, real 0m33.121s, user 0m32.898s, sys 0m0.088s
    ifort:    Kernel time 5.244328, real 0m28.893s, user 0m28.758s, sys 0m0.076s

Options: "ifort -xP -O3 -xW -free" and "gfortran -O3 -march=native -ffast-math -ffree-form -ftree-vectorize -funroll-loops".

For grid_fast.F, one difference is which loops are vectorized; ifort vectorizes the loops in line 44, 469, 483 and 496, gfortran only vectorizes the loops in line 496 and 469; for the other ones:

    grid_fast.F:44: note: not vectorized: complicated access pattern.
      DO lz=1,lz_max(lxy)
         lxyz=lxyz+1
         pyx(1,lxy)=pyx(1,lxy)+pzyx(lxyz)*polz(lxyz,kg)
         pyx(2,lxy)=pyx(2,lxy)+pzyx(lxyz)*polz(lxyz,kg2)
      ENDDO

    grid_fast.F:483: note: not vectorized: can't determine dependence between (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
      DO icoef=1,coef_max
         coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
         coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
         coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
         coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
      ENDDO
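
As an aside on the loop at line 483: read literally, each iteration icoef only reads and writes row icoef of coef, so the iterations do not depend on each other, provided coef and alpha do not overlap in memory (that non-overlap is an assumption here, not something checked against the testcase). The Python sketch below only illustrates this by running the same update in forward and reverse order; the input values are made up, and it says nothing about why GCC's dependence analysis gives up.

```python
def update_coef(coef, alpha_col, g, order):
    """Apply coef[i][j] += alpha_col[i] * g[j] for each row index i in `order`.

    Mirrors the structure of the Fortran loop (0-based here): alpha_col stands
    in for alpha(:, lx) and g for the four scalars g1, g2, g1k, g2k.
    """
    for icoef in order:
        for j in range(4):
            coef[icoef][j] += alpha_col[icoef] * g[j]
    return coef


coef_max = 5                               # small trip count, chosen arbitrarily
alpha_col = [0.5, -1.25, 2.0, 0.75, 3.5]   # made-up stand-in for alpha(:, lx)
g = [1.5, -0.25, 4.0, 0.125]               # made-up g1, g2, g1k, g2k
start = [[float(i + j) for j in range(4)] for i in range(coef_max)]

# Run the identical update over fresh copies in two different iteration orders.
forward = update_coef([row[:] for row in start], alpha_col, g,
                      range(coef_max))
backward = update_coef([row[:] for row in start], alpha_col, g,
                       reversed(range(coef_max)))

# Every element receives exactly one update, so the order cannot matter.
print("order-independent:", forward == backward)
```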
(In reply to comment #3)
> On my "AMD Athlon(tm) 64 X2 Dual Core Processor 4800+", gfortran is in x86_64
> mode only 13% slower:
> gfortran: Kernel time 5.872366, real 0m33.121s; user 0m32.898s; sys 0m0.088s.
> Ifort: Kernel time 5.244328, real 0m28.893s, user 0m28.758s, sys 0m0.076s.
> Options: "ifort -xP -O3 -xW -free" and "gfortran -O3 -march=native -ffast-math
> -ffree-form -ftree-vectorize -funroll-loops".
>
> For grid_fast.F, one difference is which loops are vectorized; ifort vectorizes
> the loops in line 44, 469, 483 and 496, gfortran only vectorizes the loops in
> line 496 and 469; for the other ones:
>
> grid_fast.F:44: note: not vectorized: complicated access pattern.
> DO lz=1,lz_max(lxy)
> lxyz=lxyz+1
> pyx(1,lxy)=pyx(1,lxy)+pzyx(lxyz)*polz(lxyz,kg)
> pyx(2,lxy)=pyx(2,lxy)+pzyx(lxyz)*polz(lxyz,kg2)
> ENDDO

This might matter a bit, but this is not in an inner loop, so I don't think it accounts for a lot of time. Having it vectorized would be good, of course.

>
> grid_fast.F:483: note: not vectorized: can't determine dependence between
> (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
> DO icoef=1,coef_max
> coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
> coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
> coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
> coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
> ENDDO
>

This part, which is in the default part of the switch statement, should only be executed in rare cases. I doubt it matters much in the overall timings. Also, this loop has very short trips (i.e. coef_max should, for the provided input, be at most 5).
> >
> > grid_fast.F:483: note: not vectorized: can't determine dependence between
> > (*coef_447)[D.1967_2320] and (*coef_447)[D.1967_2320]
> > DO icoef=1,coef_max
> > coef(icoef,1)=coef(icoef,1)+alpha(icoef,lx)*g1
> > coef(icoef,2)=coef(icoef,2)+alpha(icoef,lx)*g2
> > coef(icoef,3)=coef(icoef,3)+alpha(icoef,lx)*g1k
> > coef(icoef,4)=coef(icoef,4)+alpha(icoef,lx)*g2k
> > ENDDO
> >
>
> This part, which is in the default part of the switch statement should only be
> executed in rare cases. I doubt it matters much in the overall timings. Also,
> this loop has very short trips (i.e. coef_max should, for the provided input,
> be at most 5).

I verified that the default branch is indeed not called frequently enough for this to matter. However, by deleting all other cases (equivalent, but specialized code), I can time that case, and find:

    gfortran: 6.636415
    ifort:    5.252329

which means gfortran is about 26% slower than ifort for the 'case default' branch.
With current trunk, I see current mainline gfortran being 5% faster than Intel 10.0 on a Dual-Core AMD Opteron(tm) Processor 2212 at 2GHz. Joost, on your particular setup, does this still run too slow?
(In reply to comment #6)
> With current trunk, I see current mainline gfortran being 5% faster than Intel
> 10.0 on a Dual-Core AMD Opteron(tm) Processor 2212 at 2GHz. Joost, on your
> particular setup, does this still run too slow?

Right now, for the testcase in comment 1, gfortran is still about 20% slower than ifort. This is, however, with gfortran 4.3.0. Furthermore, it matters on which CPU you run this (in particular Intel vs. AMD). To summarize:

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 15
    model name : Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
    stepping : 6

    ifort (IFORT) 9.1 20060707
    Kernel time 3.812238

    gcc version 4.3.0 (GCC)
    Kernel time 4.5482836

I'll try to build trunk on this machine and test again, but it might not be today.
(In reply to comment #7)
> This is, however, with gfortran 4.3.0.

Trunk is marginally faster than 4.3.0, but still about 20% slower than ifort:

    Kernel time 4.5042820
Confirmed with gcc 4.3. Where do we stand today?
(In reply to comment #9)
> Confirmed with gcc 4.3. Where do we stand today?

Same place:

gfortran -O3 -march=native -ffast-math -ffree-form -ftree-vectorize
gcc version 4.4.0 20090207 (experimental) (GCC)

    > ./a.out
    # of primitives 154502
    # computational kernel timings 5
    Kernel time 4.4882798
    Kernel time 4.4922795
    Kernel time 4.4882793

ifort -v
Version 9.1

    ./a.out
    # of primitives 154502
    # computational kernel timings 5
    Kernel time 3.800237
    Kernel time 3.792237
    Kernel time 3.796237
Some more progress with 4.5.0, but not quite there yet:

    ./a.out
    # of primitives 154502
    # computational kernel timings 5
    Kernel time 4.3522720
    Kernel time 4.3562722
    Kernel time 4.3522720
    Kernel time 4.3522720
    Kernel time 4.3562717
Usual things to try are: -fno-tree-pre, -fno-ivopts, -fschedule-insns (on top of the usual -O3 -ffast-math -funroll-loops setting, of course).
(In reply to comment #12)
> Usual things to try are: -fno-tree-pre, -fno-ivopts, -fschedule-insns (on top
> of the usual -O3 -ffast-math -funroll-loops setting, of course).

Baseline:

    -O3 -march=native -ffast-math -ffree-form -ftree-vectorize: 4.3482709

Added on top of the above, independently:

    -funroll-loops: 4.2682667
    -fschedule-insns: 4.3962746
    -fno-tree-pre: 4.4682798
    -fno-ivopts: 4.8963070
    -funroll-loops -fno-ivopts: 4.7722988
    -funroll-loops -fschedule-insns: 4.4242764

So the best so far is:

    -O3 -march=native -ffast-math -ffree-form -ftree-vectorize -funroll-loops: 4.2682667
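
To put these numbers in perspective, the timings above can be ranked relative to the baseline. This is only a bookkeeping sketch over the values reported in this comment; the variable and function names are made up for illustration, and single runs like these carry some noise.

```python
# Kernel time (seconds) with the baseline flags:
# -O3 -march=native -ffast-math -ffree-form -ftree-vectorize
baseline = 4.3482709

# Kernel times with extra flags added on top of the baseline.
extra_flags = {
    "-funroll-loops": 4.2682667,
    "-fschedule-insns": 4.3962746,
    "-fno-tree-pre": 4.4682798,
    "-fno-ivopts": 4.8963070,
    "-funroll-loops -fno-ivopts": 4.7722988,
    "-funroll-loops -fschedule-insns": 4.4242764,
}


def relative_change(seconds, reference):
    """Percent change in run time versus `reference` (negative means faster)."""
    return (seconds / reference - 1.0) * 100.0


# Sort from fastest to slowest and report each change against the baseline.
ranked = sorted(extra_flags.items(), key=lambda item: item[1])
for flags, seconds in ranked:
    print(f"{flags:34s} {seconds:.7f}  {relative_change(seconds, baseline):+.1f}%")

best_flags, best_time = ranked[0]
worst_flags, worst_time = ranked[-1]
```

On these numbers only -funroll-loops beats the baseline, by a little under 2%, while -fno-ivopts alone costs more than 12%.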
The testcase is lost; the URL no longer works. Can you please attach it here?
New URL: https://www.dropbox.com/s/g28kdvatrgeu6hm/extracted_collocate.tgz (contains nearly 2 MB of data needed to run the testcase). The difference between trunk and ifort has become smaller. I'm now seeing only a 5% difference (on a different CPU): 3.50946712 vs. 3.354490. In the Makefile, I adjusted the ifort options to use -xHost.
I believe this is actually testing the same kernel (maybe a slightly older variant) as in PR37150. I would rather revisit this once PR37150 has been fixed.