The (to-be-)attached code runs about 15% slower (4.4 vs 4.2) when compiled with:

gfortran -O3 -march=native -funroll-loops -ffast-math test.f90

4.4: 5.060s
4.3: 4.376s
4.2: 4.316s

Most time is spent in PD2VAL. FYI, the cpu is:

cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8218
stepping        : 2
cpu MHz         : 2612.084
cache size      : 1024 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy

(-march -> -march=k8-sse3 -mcx16 -msahf --param l1-cache-size=64 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=k8)

On Core2, 4.4 is actually faster:

4.4:   4.236s
4.3.0: 4.572s

(-march=core2 -mcx16 -msahf --param l1-cache-size=32 --param l1-cache-line-size=64 -mtune=core2)
Created attachment 16788 [details] testcase
Due to the high density of branches in the code, this could easily be a code-layout and/or padding issue. Different architectures have different constraints on their decoders and branch predictors related to branch density; Core introduces further branch limitations for loops that engage the loop stream detector. We make no real attempt to optimize (or even model) this, apart from inserting nops. YMMV with -fschedule-insns.
Oh, maybe try -fno-tree-reassoc as well.
(In reply to comment #2)
> Due to the high density of branches in the code this is easily a code layout
> and/or padding issue. Different architectures have different constraints on
> their decoders and branch predictors related to branch density. Core
> introduces other branch limitations for loops that engage the loop stream
> detector. We do not at all try to properly optimize (or even model) this
> apart from inserting nops. YMMV with -fschedule-insns.

I'm not expert enough to understand this, but you have it right. However, it
remains a regression (on Opteron):

4.4:
-O3 -march=native -funroll-loops -ffast-math                  ==> 5.064s
-O3 -march=native -funroll-loops -ffast-math -fschedule-insns ==> 4.396s

4.3:
-O3 -march=native -funroll-loops -ffast-math                  ==> 4.376s
-O3 -march=native -funroll-loops -ffast-math -fschedule-insns ==> 3.372s

-fno-tree-reassoc has no effect.
(In reply to comment #4)
> 4.3:
> -O3 -march=native -funroll-loops -ffast-math ==> 4.376
> -O3 -march=native -funroll-loops -ffast-math -fschedule-insns ==> 3.372

Strangely,
http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Optimize-Options.html#Optimize-Options
suggests -fschedule-insns is enabled by default at -O3?
Not on all targets though.
But a regression at least on some targets. Confirmed.
(In reply to comment #5)
> (In reply to comment #4)
> > 4.3:
> > -O3 -march=native -funroll-loops -ffast-math ==> 4.376
> > -O3 -march=native -funroll-loops -ffast-math -fschedule-insns ==> 3.372
>
> strangely:
>
> http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Optimize-Options.html#Optimize-Options
> suggests -fschedule-insns is enabled by default at -O3 ?

This may be related to PR 37565. i386.c has:

void
optimization_options (int level, int size ATTRIBUTE_UNUSED)
{
  /* For -O2 and beyond, turn off -fschedule-insns by default.  It tends to
     make the problem with not enough registers even worse.  */
#ifdef INSN_SCHEDULING
  if (level > 1)
    flag_schedule_insns = 0;
#endif
  ...
}
I tried -fschedule-insns on CP2K, which led to an ICE (now PR38403).
If the code layout (see comment #2) is indeed causing the slow-down, this problem might have been fixed along with bug 38074.
(In reply to comment #10)
> If the code layout (see comment #2) is indeed causing the slow-down, this
> problem might have been fixed along with bug 38074.

No, timings are still identical:

gcc version 4.4.0 20081206 (experimental) [trunk revision 142525] (GCC)
 Time for evaluation [s]:    5.028

gcc version 4.3.3 20080912 (prerelease) (GCC)
 Time for evaluation [s]:    4.376

(note that the regression is on Opteron)
/* For -O2 and beyond, turn off -fschedule-insns by default.  It tends to
   make the problem with not enough registers even worse.  */

As risky as this may be (for performance, not correctness), what about changing
"if (level > 1)" to "if (level == 2)"? And what about enabling it on x86-64?
(In reply to comment #12) > /* For -O2 and beyond, turn off -fschedule-insns by default. It tends to > make the problem with not enough registers even worse. */ > > As risky as this may be (for performance, not correctness), what about changing > "if (level > 1)" to "if (level == 2)"? And what about enabling it on x86-64? But even on x86-64 this seems to lead to ICEs (see PR38403).
I wanted to try Vladimir Makarov's new patch for this testcase, but on an
unpatched trunk I notice a serious runtime regression with '-fschedule-insns'
with respect to 4.3.3.

Using as base options (for the attached testcase):
gfortran -O3 -march=native -funroll-loops -ffast-math test.f90

4.3.3 w   -fschedule-insns : 3.372s
4.3.3 w/o -fschedule-insns : 4.384s
4.4.2 w   -fschedule-insns : 4.748s
4.4.2 w/o -fschedule-insns : 4.408s
4.5.0 w   -fschedule-insns : 4.712s
4.5.0 w/o -fschedule-insns : 4.408s

so 4.3 against 4.5 'w -fschedule-insns' is about 40% faster. I guess this is
pretty target specific; I'm running this on an Opteron. This is what -v reports:

Target: x86_64-unknown-linux-gnu
Configured with: /data03/vondele/gcc_trunk/gcc/configure --disable-bootstrap --prefix=/data03/vondele/gcc_trunk/build --enable-languages=c,c++,fortran --disable-multilib --with-ppl=/data03/vondele/gcc_trunk/build/ --with-cloog=/data03/vondele/gcc_trunk/build/
Thread model: posix
gcc version 4.5.0 20090830 (experimental) [trunk revision 151229] (GCC)
COLLECT_GCC_OPTIONS='-O3' '-funroll-loops' '-ffast-math' '-fschedule-insns' '-v' '-shared-libgcc'
/data03/vondele/gcc_trunk/build/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/f951 test.f90 -march=k8-sse3 -mcx16 -msahf --param l1-cache-size=64 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=k8 -quiet -dumpbase test.f90 -auxbase test -O3 -version -funroll-loops -ffast-math -fschedule-insns -fintrinsic-modules-path /data03/vondele/gcc_trunk/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/finclude -o /tmp/ccvGq2CO.s
Please try -O2 and -O2 -funroll-loops too, since -O3 is not always good for speed. (It would be even better if -O2 is not slower and you can find out what the culprit is at -O3; this is not necessarily possible though).
(In reply to comment #15)
> Please try -O2 and -O2 -funroll-loops too, since -O3 is not always good for
> speed. (It would be even better if -O2 is not slower and you can find out
> what the culprit is at -O3; this is not necessarily possible though).

You're right that, without -fschedule-insns, -O2 is faster than -O3 in this
case, but nothing comes close to 4.3 performance. Adding '-fschedule-insns' to
the fastest -O2 choice makes it 20% slower. All numbers with trunk:

-O2 -march=native -funroll-loops -ffast-math:                   4.032
-O2 -march=native -funroll-loops -ffast-math -fschedule-insns:  4.712
-O3 -march=native -funroll-loops -ffast-math:                   4.408
-O2 -march=native -ffast-math:                                 11.373
-O2 -march=native -ffast-math -fschedule-insns:                11.409
-O3 -march=native -ffast-math:                                  4.296
-O3 -march=native -ffast-math -fschedule-insns:                 4.656

I can test other flags if you have a hint.
(In reply to comment #16)
> All numbers with trunk:

With 4.3 there is no difference between -O2 and -O3:

-O2 -march=native -funroll-loops -ffast-math:                   4.388
-O2 -march=native -funroll-loops -ffast-math -fschedule-insns:  3.352
-O3 -march=native -funroll-loops -ffast-math:                   4.380
-O3 -march=native -funroll-loops -ffast-math -fschedule-insns:  3.372
Hello Joost, could you please check if this is still a problem in GCC 4.6?
(In reply to comment #18)
> Hello Joost, could you please check if this is still a problem in GCC 4.6?

I think it still is a minor problem, but (without -fschedule-insns) somewhat
less pronounced (the old hardware is gone, which might make a difference):

4.3 branch
> gfortran -O3 -march=native -funroll-loops -ffast-math -fschedule-insns test.f90 ; ./a.out
 Time for evaluation [s]:    3.478
> gfortran -O3 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
 Time for evaluation [s]:    4.367

4.5 branch
> gfortran -O3 -march=native -funroll-loops -ffast-math -fschedule-insns test.f90 ; ./a.out
 Time for evaluation [s]:    4.839
> gfortran -O3 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
 Time for evaluation [s]:    4.524

4.6 branch
> gfortran -O3 -march=native -funroll-loops -ffast-math -fschedule-insns test.f90 ; ./a.out
 Time for evaluation [s]:    4.997
> gfortran -O3 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
 Time for evaluation [s]:    4.547

FYI: -march=amdfam10 -mcx16 -msahf -mpopcnt -mabm
model name      : AMD Opteron(tm) Processor 6176 SE
Additionally, for trunk, lto/profile-use seem not to help:

> gfortran -O3 -march=native -funroll-loops -ffast-math -flto -fprofile-use test.f90 ; ./a.out
 Time for evaluation [s]:    4.664
> gfortran -O3 -march=native -funroll-loops -ffast-math -fprofile-use test.f90 ; ./a.out
 Time for evaluation [s]:    4.665
... however, the following works great:

> gfortran -O2 -march=native -funroll-loops -ffast-math -ftree-vectorize test.f90 ; ./a.out
 Time for evaluation [s]:    2.700

(notice -O2 instead of -O3; -O2 is thus nearly twice as fast as -O3)
What is the performance with 4.3 -O2? A regression that is limited to -O3 is
(a bit) less important, since -O3 is still a "mixed bag" of optimizations that
might or might not be profitable.
(In reply to comment #22)
> What is the performance with 4.3 -O2?

4.3:
> gfortran -O2 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
 Time for evaluation [s]:    4.373

4.6:
> gfortran -O2 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
 Time for evaluation [s]:    4.347

So, same performance. Given that vectorization only happens at -O3, it is an
important optimization level for numerical codes. Nevertheless, I would propose
to remove the regression tag, and instead refocus the bug on what current trunk
does at -O3 vs -O2 -ftree-vectorize, as noted in comment #21:

> gfortran -O2 -march=native -funroll-loops -ffast-math -ftree-vectorize test.f90 ; ./a.out
 Time for evaluation [s]:    2.694
> gfortran -O3 -march=native -funroll-loops -ffast-math -ftree-vectorize test.f90 ; ./a.out
 Time for evaluation [s]:    4.536
Checked again with current trunk; the situation remains that -O2 is much faster
than -O3:

> gfortran -O2 -march=native -funroll-loops -ffast-math -ftree-vectorize pr38306.f90 ; ./a.out
 Time for evaluation [s]:    2.830
> gfortran -O3 -march=native -funroll-loops -ffast-math -ftree-vectorize pr38306.f90 ; ./a.out
 Time for evaluation [s]:    4.593

The issue is that at -O3 the subroutine PD2VAL is not vectorized, while it is
at -O2.
(In reply to comment #24)
> The issue is that at -O3 the subroutine PD2VAL is not vectorized, while it
> is at -O2.

If you are interested in investigating why this is so by yourself, I would
suggest that you use the various -fdump- options to check what GCC is doing
differently between the two variants.

1) Dump everything you can dump.
2) Then find the earliest optimization pass where they differ (you may even
   use diff to make this faster).
3) Check subsequent dumps to see if that difference is actually what makes -O3
   not vectorize. (At this point you can play with -f*/-fno-* to reduce the
   differences further and isolate the trigger.)
At -O3, the vectorizer says:

pr38306.f90:246: note: not vectorized: the size of group of strided accesses is not a power of 2
D.2258_518 = *c0_193(D)[D.2257_517];
(In reply to comment #25)
> 2) Then find the earliest optimization pass where they differ (you may even
>    use diff to make this faster).

The first point where things differ for PD2VAL is pr38306_xxx.f90.057t.cunrolli;
afterwards, everything seems fully different.
Using the original flags (-O3 -march=native -funroll-loops -ffast-math) on an
AMD Athlon(tm) 64 X2 (close enough to "Opteron": same family 15):

4.2.4: 5.78s
4.3.6: 5.77s
4.4.6: 5.84s
4.5.3: 5.77s
4.6.2: 5.85s
trunk: 5.75s

This seems to be a wash for me, and I cannot reproduce even the originally
reported slowdown in 4.4. There are very many different flag measurements in
this report, which makes it unlikely that this bug will ever be properly
triaged (or even "fixed"). Note that in general we'd like to concentrate on
monitoring performance for standard flags (thus not including -fschedule-insns
where it is not enabled by default). -Ofast -funroll-loops is a reasonable flag
set, as we (still) do not enable loop unrolling by default at -O3.

I'm closing this as WORKSFORME. Please try not to make too much of a mess out
of regression bug reports ;)