Bug 38306 - [4.4/4.5/4.6/4.7 Regression] 15% slowdown w.r.t. 4.3 of computational kernel on some architectures
Status: RESOLVED WORKSFORME
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.4.0
Importance: P2 normal
Target Milestone: 4.4.7
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2008-11-28 16:01 UTC by Joost VandeVondele
Modified: 2012-01-16 12:38 UTC
CC: 8 users

See Also:
Host: x86_64-unknown-linux-gnu
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-09-09 19:01:16


Attachments
testcase (2.04 KB, text/plain)
2008-11-28 16:01 UTC, Joost VandeVondele
Details

Description Joost VandeVondele 2008-11-28 16:01:20 UTC
The (to be) attached code runs ~15% slower (4.4 vs. 4.2) when compiled with:
gfortran -O3 -march=native -funroll-loops  -ffast-math test.f90

4.4:  5.060s
4.3:  4.376s
4.2:  4.316s

Most of the time is spent in PD2VAL.

FYI, the cpu is:

cpu family      : 15
model           : 65
model name      : Dual-Core AMD Opteron(tm) Processor 8218
stepping        : 2
cpu MHz         : 2612.084
cache size      : 1024 KB
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow pni cx16 lahf_lm cmp_legacy svm cr8_legacy

(-march -> -march=k8-sse3 -mcx16 -msahf --param l1-cache-size=64 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=k8)

On Core2, 4.4 is actually faster:

4.4: 4.236s
4.3.0: 4.572s

-march=core2 -mcx16 -msahf --param l1-cache-size=32 --param l1-cache-line-size=64 -mtune=core2
Comment 1 Joost VandeVondele 2008-11-28 16:01:58 UTC
Created attachment 16788 [details]
testcase
Comment 2 Richard Biener 2008-11-30 11:38:52 UTC
Due to the high density of branches in the code this is easily a code layout
and/or padding issue.  Different architectures have different constraints on
their decoders and branch predictors related to branch density.  Core
introduces other branch limitations for loops that engage the loop stream
detector.

We do not at all try to properly optimize (or even model) this apart
from inserting nops.  YMMV with -fschedule-insns.
Comment 3 Richard Biener 2008-11-30 11:48:31 UTC
Oh, maybe try -fno-tree-reassoc as well.
Comment 4 Joost VandeVondele 2008-11-30 16:17:18 UTC
(In reply to comment #2)
> Due to the high density of branches in the code this is easily a code layout
> and/or padding issue.  Different architectures have different constraints on
> their decoders and branch predictors related to branch density.  Core
> introduces other branch limitations for loops that engage the loop stream
> detector.
> We do not at all try to properly optimize (or even model) this apart
> from inserting nops.  YMMV with -fschedule-insns.

I'm not expert enough to understand this, but you are probably right. However, it remains a regression (on Opteron):

4.4: 
-O3 -march=native -funroll-loops  -ffast-math                  ==> 5.064s
-O3 -march=native -funroll-loops  -ffast-math -fschedule-insns ==> 4.396

4.3:
-O3 -march=native -funroll-loops  -ffast-math                  ==> 4.376
-O3 -march=native -funroll-loops  -ffast-math -fschedule-insns ==> 3.372

-fno-tree-reassoc has no effect.
Comment 5 Joost VandeVondele 2008-11-30 16:26:02 UTC
(In reply to comment #4)
> 4.3:
> -O3 -march=native -funroll-loops  -ffast-math                  ==> 4.376
> -O3 -march=native -funroll-loops  -ffast-math -fschedule-insns ==> 3.372

strangely:

http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Optimize-Options.html#Optimize-Options
suggests -fschedule-insns is enabled by default at -O3 ?
Comment 6 Richard Biener 2008-11-30 16:39:06 UTC
Not on all targets though.
Comment 7 Steven Bosscher 2008-12-03 19:01:16 UTC
But a regression at least on some targets.  Confirmed.
Comment 8 H.J. Lu 2008-12-03 21:28:13 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > 4.3:
> > -O3 -march=native -funroll-loops  -ffast-math                  ==> 4.376
> > -O3 -march=native -funroll-loops  -ffast-math -fschedule-insns ==> 3.372
> 
> strangely:
> 
> http://gcc.gnu.org/onlinedocs/gcc-4.3.2/gcc/Optimize-Options.html#Optimize-Options
> suggests -fschedule-insns is enabled by default at -O3 ?
> 

This may be related to PR 37565. i386.c has

void
optimization_options (int level, int size ATTRIBUTE_UNUSED)
{
  /* For -O2 and beyond, turn off -fschedule-insns by default.  It tends to
     make the problem with not enough registers even worse.  */
#ifdef INSN_SCHEDULING
  if (level > 1)
    flag_schedule_insns = 0;
#endif    
Comment 9 Joost VandeVondele 2008-12-04 16:11:32 UTC
I tried -fschedule-insns on CP2K, which led to an ICE (now PR38403).
Comment 10 Steven Bosscher 2008-12-06 15:37:23 UTC
If the code layout (see comment #2) is indeed causing the slow-down, this problem might have been fixed along with bug 38074.
Comment 11 Joost VandeVondele 2008-12-06 18:54:51 UTC
(In reply to comment #10)
> If the code layout (see comment #2) is indeed causing the slow-down, this
> problem might have been fixed along with bug 38074.

No, timings are still identical:

gcc version 4.4.0 20081206 (experimental) [trunk revision 142525] (GCC)
Time for evaluation [s]:                        5.028
gcc version 4.3.3 20080912 (prerelease) (GCC)
Time for evaluation [s]:                        4.376

(note that the regression is on opteron)
Comment 12 Paolo Bonzini 2009-02-11 19:00:40 UTC
  /* For -O2 and beyond, turn off -fschedule-insns by default.  It tends to
     make the problem with not enough registers even worse.  */

As risky as this may be (for performance, not correctness), what about changing "if (level > 1)" to "if (level == 2)"?  And what about enabling it on x86-64?
Comment 13 Joost VandeVondele 2009-02-11 19:25:06 UTC
(In reply to comment #12)
>   /* For -O2 and beyond, turn off -fschedule-insns by default.  It tends to
>      make the problem with not enough registers even worse.  */
> 
> As risky as this may be (for performance, not correctness), what about changing
> "if (level > 1)" to "if (level == 2)"?  And what about enabling it on x86-64?

But even on x86-64 this seems to lead to ICEs (see PR38403). 



Comment 14 Joost VandeVondele 2009-09-01 06:56:19 UTC
I wanted to try Vladimir Makarov's new patch on this testcase, but on an unpatched trunk I notice a serious runtime regression with '-fschedule-insns' with respect to 4.3.3.

Using as base options (for the attached testcase)

gfortran -O3 -march=native -funroll-loops  -ffast-math test.f90

4.3.3 w   -fschedule-insns : 3.372s
4.3.3 w/o -fschedule-insns : 4.384s

4.4.2 w   -fschedule-insns : 4.748s
4.4.2 w/o -fschedule-insns : 4.408s

4.5.0 w   -fschedule-insns : 4.712s
4.5.0 w/o -fschedule-insns : 4.408s

So with -fschedule-insns, 4.3 is about 40% faster than 4.5.

I guess this is pretty target specific, I'm running this on an Opteron, this is what -v reports:

Target: x86_64-unknown-linux-gnu
Configured with: /data03/vondele/gcc_trunk/gcc/configure --disable-bootstrap --prefix=/data03/vondele/gcc_trunk/build --enable-languages=c,c++,fortran --disable-multilib --with-ppl=/data03/vondele/gcc_trunk/build/ --with-cloog=/data03/vondele/gcc_trunk/build/
Thread model: posix
gcc version 4.5.0 20090830 (experimental) [trunk revision 151229] (GCC)
COLLECT_GCC_OPTIONS='-O3'  '-funroll-loops' '-ffast-math' '-fschedule-insns' '-v' '-shared-libgcc'
 /data03/vondele/gcc_trunk/build/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/f951 test.f90 -march=k8-sse3 -mcx16 -msahf --param l1-cache-size=64 --param l1-cache-line-size=64 --param l2-cache-size=1024 -mtune=k8 -quiet -dumpbase test.f90 -auxbase test -O3 -version -funroll-loops -ffast-math -fschedule-insns -fintrinsic-modules-path /data03/vondele/gcc_trunk/build/lib/gcc/x86_64-unknown-linux-gnu/4.5.0/finclude -o /tmp/ccvGq2CO.s
Comment 15 Paolo Bonzini 2009-09-01 08:54:41 UTC
Please try -O2 and -O2 -funroll-loops too, since -O3 is not always good for speed.  (It would be even better if -O2 is not slower and you can find out what the culprit is at -O3; this is not necessarily possible though).
Comment 16 Joost VandeVondele 2009-09-01 09:13:35 UTC
(In reply to comment #15)
> Please try -O2 and -O2 -funroll-loops too, since -O3 is not always good for
> speed.  (It would be even better if -O2 is not slower and you can find out what
> the culprit is at -O3; this is not necessarily possible though).

You're right that, without -fschedule-insns, -O2 is faster than -O3 in this case, but nothing comes close to 4.3 performance. Adding '-fschedule-insns' to the fastest -O2 variant makes it 20% slower.

All numbers with trunk:

 -O2 -march=native -funroll-loops  -ffast-math: 4.032
 -O2 -march=native -funroll-loops  -ffast-math -fschedule-insns: 4.712
 -O3 -march=native -funroll-loops  -ffast-math: 4.408
 -O2 -march=native -ffast-math: 11.373
 -O2 -march=native -ffast-math -fschedule-insns: 11.409
 -O3 -march=native -ffast-math: 4.296
 -O3 -march=native -ffast-math -fschedule-insns: 4.656

I can test other flags if you have a hint.
Comment 17 Joost VandeVondele 2009-09-01 09:17:42 UTC
(In reply to comment #16)
> All numbers with trunk:
With 4.3 there is no difference between -O2 and -O3:

-O2 -march=native -funroll-loops  -ffast-math: 4.388
-O2 -march=native -funroll-loops  -ffast-math -fschedule-insns: 3.352
-O3 -march=native -funroll-loops  -ffast-math: 4.380
-O3 -march=native -funroll-loops  -ffast-math -fschedule-insns: 3.372
Comment 18 Steven Bosscher 2011-02-20 15:22:26 UTC
Hello Joost, could you please check if this is still a problem in GCC 4.6?
Comment 19 Joost VandeVondele 2011-02-20 16:17:33 UTC
(In reply to comment #18)
> Hello Joost, could you please check if this is still a problem in GCC 4.6?

I think it is still a minor problem, but (without -fschedule-insns) somewhat less pronounced (the old hardware is gone, which might make a difference):

4.3 branch

> gfortran -O3 -march=native -funroll-loops  -ffast-math   -fschedule-insns test.f90 ; ./a.out 
Time for evaluation [s]:                        3.478
> gfortran -O3 -march=native -funroll-loops  -ffast-math   test.f90 ; ./a.out 
Time for evaluation [s]:                        4.367

4.5 branch

> gfortran -O3 -march=native -funroll-loops  -ffast-math   -fschedule-insns test.f90 ; ./a.out 
Time for evaluation [s]:                        4.839
> gfortran -O3 -march=native -funroll-loops  -ffast-math  test.f90 ; ./a.out 
Time for evaluation [s]:                        4.524

4.6 branch
> gfortran -O3 -march=native -funroll-loops  -ffast-math   -fschedule-insns test.f90 ; ./a.out 
Time for evaluation [s]:                        4.997
> gfortran -O3 -march=native -funroll-loops  -ffast-math   test.f90 ; ./a.out 
Time for evaluation [s]:                        4.547

FYI: -march=amdfam10 -mcx16 -msahf -mpopcnt -mabm
model name      : AMD Opteron(tm) Processor 6176 SE
Comment 20 Joost VandeVondele 2011-02-20 16:28:00 UTC
Additionally, for trunk, LTO/profile-use do not seem to help:

> gfortran -O3 -march=native -funroll-loops  -ffast-math  -flto -fprofile-use test.f90 ; ./a.out 
Time for evaluation [s]:                        4.664

> gfortran -O3 -march=native -funroll-loops  -ffast-math   -fprofile-use test.f90 ; ./a.out 
Time for evaluation [s]:                        4.665
Comment 21 Joost VandeVondele 2011-02-20 16:32:38 UTC
... however, the following works great:

> gfortran -O2 -march=native -funroll-loops  -ffast-math  -ftree-vectorize test.f90 ; ./a.out 
Time for evaluation [s]:                        2.700

(notice -O2 instead of -O3; -O2 is thus nearly twice as fast as -O3)
Comment 22 Paolo Bonzini 2011-02-21 07:55:35 UTC
What is the performance with 4.3 at -O2?  A regression limited to -O3 is (a bit) less important, since -O3 is still a "mixed bag" of optimizations that may or may not pay off.
Comment 23 Joost VandeVondele 2011-02-21 12:53:30 UTC
(In reply to comment #22)
> What is the performance with 4.3 -O2?  

4.3:
> gfortran -O2 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
Time for evaluation [s]:                        4.373

4.6:
>  gfortran -O2 -march=native -funroll-loops -ffast-math test.f90 ; ./a.out
Time for evaluation [s]:                        4.347

So, same performance.

Given that vectorization only happens at -O3, it is an important optimization level for numerical codes. Nevertheless, I would propose to remove the regression tag and instead refocus the bug on what current trunk does at -O3 vs. -O2 -ftree-vectorize, as noted in comment #21:

> gfortran -O2 -march=native -funroll-loops  -ffast-math  -ftree-vectorize test.f90 ; ./a.out
Time for evaluation [s]:                        2.694

> gfortran -O3 -march=native -funroll-loops  -ffast-math  -ftree-vectorize test.f90 ; ./a.out
Time for evaluation [s]:                        4.536
Comment 24 Joost VandeVondele 2011-09-09 19:06:50 UTC
Checked again with current trunk; the situation remains that -O2 is much faster than -O3:

> gfortran -O2 -march=native -funroll-loops  -ffast-math  -ftree-vectorize pr38306.f90  ; ./a.out
Time for evaluation [s]:                        2.830

> gfortran -O3 -march=native -funroll-loops  -ffast-math  -ftree-vectorize pr38306.f90  ; ./a.out
Time for evaluation [s]:                        4.593

The issue is that at -O3 the subroutine PD2VAL is not vectorized, while it is at -O2.
Comment 25 Manuel López-Ibáñez 2011-09-10 09:43:58 UTC
(In reply to comment #24)
> 
> The issue is that at -O3 the subroutine PD2VAL is not vectorized, while it is
> at -O2.

If you are interested in investigating why this is so yourself, I would suggest using the various -fdump- options to check what GCC is doing differently between the two variants.

1) Dump everything you can dump.

2) Then find the earliest optimization pass where they differ (you may even use diff to make this faster).

3) Check subsequent dumps to see if that difference is actually what makes -O3 not vectorize. (At this point you can play with -f*/-fno-* options to reduce the differences further and isolate the trigger.)
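
(The search in step 2 can be sketched as a small script. The dump files below are synthetic stand-ins for illustration — real dumps would come from adding -fdump-tree-all to the two builds and collecting the results into separate directories:)

```shell
# Synthetic stand-ins for per-pass dump files from an -O2 and an -O3 build.
mkdir -p o2 o3
printf 'same\n' > o2/test.f90.001t.gimple
printf 'same\n' > o3/test.f90.001t.gimple
printf 'O2 version\n' > o2/test.f90.057t.cunrolli
printf 'O3 version\n' > o3/test.f90.057t.cunrolli

# Walk the dumps in pass order and report the first one that differs.
for f in o2/*; do
  g="o3/${f#o2/}"
  if ! diff -q "$f" "$g" >/dev/null 2>&1; then
    echo "first differing dump: ${f#o2/}"
    break
  fi
done
```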
Comment 26 Uroš Bizjak 2011-09-10 12:31:23 UTC
At -O3, vectorizer says:

pr38306.f90:246: note: not vectorized: the size of group of strided accesses is not a power of 2
D.2258_518 = *c0_193(D)[D.2257_517];
Comment 27 Joost VandeVondele 2011-09-13 07:59:06 UTC
(In reply to comment #25)
> 2) Then find the earliest optimization pass where they differ (you may even use
> diff to make this faster).

The first point where things differ for PD2VAL is 

pr38306_xxx.f90.057t.cunrolli

Afterwards, everything seems completely different.
Comment 28 Richard Biener 2012-01-16 12:38:17 UTC
Using original flags (-O3 -march=native -funroll-loops -ffast-math) on
an AMD Athlon(tm) 64 X2 (close enough to an "Opteron"; same family 15):

4.2.4: 5.78s
4.3.6: 5.77s
4.4.6: 5.84s
4.5.3: 5.77s
4.6.2: 5.85s
trunk: 5.75s

It seems to be a wash for me, and I cannot reproduce even the originally
reported slowdown in 4.4.  There are very many different flag measurements
in this report, which makes it unlikely that this bug will ever be
properly triaged (or even "fixed").

Note that in general we'd like to concentrate on monitoring performance
for standard flags (thus, not including -fschedule-insns where it is not
enabled by default).  -Ofast -funroll-loops is a reasonable flag set, as
we (still) do not enable loop unrolling by default at -O3.

I'm closing this as WORKSFORME.  Please try not to make too much of a mess
out of regression bug reports ;)