Created attachment 31833 [details]
Example Fortran source file exhibiting the problem with gfortran 4.8

When switching from gfortran 4.7 to gfortran 4.8, memory usage and compile time increase vastly for some files in our project, e.g. for the attached example file. gfortran 4.8.2 needs 50s to compile it, using up to 1.7 GB of RAM, while gfortran 4.7 compiles it in 7s with a memory usage of 136 MB.

The gfortran command line for the attached example is

  /opt/gcc/4.8.2/bin/gfortran -fno-automatic -ffixed-line-length-none -O2 -c FermionBoxEventempCoupling_Div.F -o output.o

The problem seems to be partly linked to the option -fno-automatic: omitting it avoids the memory increase, but the compile time is still 14s with gfortran 4.8, compared to 6s with gfortran 4.7.

The problem arises with optimization enabled: -O2 and -O1 lead to a similar discrepancy, while with -O0 the problem does not exist.

The gfortran 4.8 which shows the problematic behaviour is built with:

  Target: x86_64-unknown-linux-gnu
  Configured with: ../gcc-4.8.2/configure --prefix=/opt/gcc/4.8.2 --enable-languages=c,c++,fortran
  gcc version 4.8.2 (GCC)

The gfortran 4.7 is the built-in one from openSUSE 12.2:

  Target: x86_64-suse-linux
  Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.7 --enable-ssp --disable-libssp --disable-libitm --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --program-suffix=-4.7 --enable-linux-futex --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
  gcc version 4.7.1 20120723 [gcc-4_7-branch revision 189773] (SUSE Linux)
Confirmed. At -O2 I see for 4.8

 alias stmt walking      :   4.90 ( 8%) usr   0.02 ( 3%) sys   5.05 ( 8%) wall       0 kB ( 0%) ggc
 tree copy propagation   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     513 kB ( 0%) ggc
 combiner                :  45.94 (71%) usr   0.57 (83%) sys  46.57 (71%) wall 1612812 kB (96%) ggc
 integrated RA           :   0.59 ( 1%) usr   0.01 ( 1%) sys   0.58 ( 1%) wall   12481 kB ( 1%) ggc
 LRA hard reg assignment :   7.53 (12%) usr   0.03 ( 4%) sys   7.56 (11%) wall       0 kB ( 0%) ggc
 TOTAL                   :  65.03          0.69         65.92            1681500 kB

and for 4.7

 alias stmt walking      :   2.77 (28%) usr   0.02 (17%) sys   2.80 (28%) wall       0 kB ( 0%) ggc
 tree copy propagation   :   1.64 (17%) usr   0.01 ( 8%) sys   1.67 (17%) wall     513 kB ( 1%) ggc
 integrated RA           :   0.66 ( 7%) usr   0.02 (17%) sys   0.68 ( 7%) wall    6295 kB (10%) ggc
 reload                  :   0.45 ( 5%) usr   0.01 ( 8%) sys   0.46 ( 5%) wall    1593 kB ( 3%) ggc
 reload CSE regs         :   0.74 ( 7%) usr   0.00 ( 0%) sys   0.75 ( 7%) wall    1103 kB ( 2%) ggc
 TOTAL                   :   9.87          0.12         10.01              61922 kB

The combiner slowdown and the LRA thing are worth investigating. I suppose the generated code differs quite a bit ;) It's a very large function with some very large basic block(s).

Current trunk behaves comparably to GCC 4.8.
No clear offender in combine; profiling points at if_then_else_cond (14% of combine time). The LRA slowness is calls to find_hard_regno_for, 93% of lra_assign ().
Started with r190594, which apparently made big changes in some of the GIMPLE passes: e.g. the optimized dump line count went down from 65782 lines (r190593) to 52107 lines (r190594). But for some reason that change results in incomparably more log links and combiner opportunities. In the -fdump-rtl-combine-details dump, 'Successfully matched' counts went up from 9742 to 3921927, and 'Failed to match' counts went up from 53193 to 22538586. So the combiner success rate is approximately the same, around 17.5%, just on roughly 400 times more opportunities.

For the GIMPLE passes, looking just at the sizes of the GIMPLE dumps: until esra the sizes are the same; from fre1 until crited the r190594 dump is slightly to significantly larger than the corresponding r190593 dump; then, surprisingly, the pre dump is on the other side, half the size of the r190593 one; and from sink till optimized the dump sizes roughly track the 65782 vs. 52107 line counts.
With -O2 -fno-automatic -ffixed-line-length-none --param sccvn-max-alias-queries-per-access=1300 the combiner completely disappears from the picture, presumably because FRE/PRE then doesn't give up. With 1200 the combiner already takes 17% of compile time and 77% of memory, and with the default of 1000 it takes 58% of compile time and 96% of memory.
Bah. Add a limiting --param to combine? ....
Dunno what to limit there though.

BTW, was the 1000 default for the sccvn-max-alias-queries-per-access param some carefully chosen value? E.g. does the test that was fixed by the addition of the param fail if it is bumped to, say, 1500 or 2000, or can we just bump the default slightly?
On Fri, 31 Jan 2014, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
>
> --- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Dunno what to limit there though.
>
> BTW, the 1000 default for sccvn-max-alias-queries-per-access param was some
> carefully chosen value, e.g. does the test that was fixed by the addition of
> the param fail if it is bumped to say 1500 or 2000, or can we just bump
> slightly the default?

It was reasonably chosen - the testcase I added it for still uses a lot of
time in alias-walking (~30% I think, which I decided was reasonable for that
testcase). We sure can bump it slightly, but then wouldn't it take only a
slightly larger testcase to re-trigger this bug?
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
So Jan fixed the combine issue when he committed his changes to remove statements which set write-only variables. That essentially brought the combiner down to nothing. Hard to believe, since it was just a dozen or so nodes that were determined to be write-only; but that was double-checked by just hacking out the call to set_writeonly_bit in one of the older compilers I was testing.

Richi's change to improve DOM's handling of aliased loads helps considerably if someone were to disable the IPA propagation bits.

Unfortunately we're still burning a lot of time in the alias walking, so we can't reasonably consider this resolved for gcc-6:

 alias stmt walking      :  38.67 (64%) usr   0.01 ( 7%) sys  38.64 (64%) wall       2 kB ( 0%) ggc
 tree copy propagation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 combiner                :   0.35 ( 1%) usr   0.01 ( 7%) sys   0.36 ( 1%) wall    2709 kB ( 2%) ggc
 integrated RA           :   2.27 ( 4%) usr   0.00 ( 0%) sys   2.31 ( 4%) wall    6634 kB ( 4%) ggc
 LRA hard reg assignment :   4.39 ( 7%) usr   0.01 ( 7%) sys   4.41 ( 7%) wall       0 kB ( 0%) ggc

The alias statement walking is a bit surprising. This is Fortran code after all, so I wouldn't expect a lot of aliasing; I guess it's just a lot of calls into the alias statement walking code (but I haven't looked to confirm that theory).

I think the RA/LRA times are probably worth some investigation as well once we get a handle on what's causing alias stmt walking to go nuts.
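For readers unfamiliar with the write-only-variable removal mentioned above, here is a minimal C analogue (illustrative only, not the Fortran testcase): a file-local variable that is only ever written can have all stores to it deleted, which in turn removes the dead computation that was feeding combine.

```c
/* A file-local variable that no code ever reads.  GCC's IPA analysis
   can mark it write-only and delete every store to it.  */
static int sink;

/* Sum an array; the store to 'sink' inside the loop is dead and can be
   removed without changing observable behaviour.  */
void
compute (int *out, const int *a, int n)
{
  int acc = 0;
  for (int i = 0; i < n; ++i)
    {
      acc += a[i];
      sink = acc;  /* write-only store: removable */
    }
  *out = acc;
}
```

With the store removed, later RTL passes (including combine) simply see far fewer log links to chase, which matches the observed collapse of combiner time.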
(In reply to Jeffrey A. Law from comment #12)
> So Jan fixed the combine issue when he committed his changes to remove
> statements which set write-only variables. That essentially brought the
> combiner down to nothing. Hard to believe since it was just a dozen or so
> nodes that were determined to be write-only. But that was double-checked by
> just hacking out the call to set_writeonly_bit in one of the older compilers
> I was testing.
>
> Richi's change to improve DOM's handling of aliased loads helps considerably
> if someone were to disable the IPA propagation bits.
>
> Unfortunately we're still burning a lot of time in the alias walking. So we
> can't reasonably consider this resolved for gcc-6:
>
> alias stmt walking      :  38.67 (64%) usr   0.01 ( 7%) sys  38.64 (64%) wall       2 kB ( 0%) ggc
> tree copy propagation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
> combiner                :   0.35 ( 1%) usr   0.01 ( 7%) sys   0.36 ( 1%) wall    2709 kB ( 2%) ggc
> integrated RA           :   2.27 ( 4%) usr   0.00 ( 0%) sys   2.31 ( 4%) wall    6634 kB ( 4%) ggc
> LRA hard reg assignment :   4.39 ( 7%) usr   0.01 ( 7%) sys   4.41 ( 7%) wall       0 kB ( 0%) ggc
>
> The alias statement walking is a bit surprising. This is Fortran code after
> all, I wouldn't expect a lot of aliasing, I guess it's just a lot of calls
> into the alias statement walking code (but I haven't looked to confirm that
> theory).

If there were a lot of aliasing it would be faster ;) We spend the time disambiguating (with success). Alias stmt walking can be attributed to FRE, PRE, DCE, DSE and some IPA code. All but the IPA use are throttled (though that doesn't mean they can't still use quite some time in some cases).

I have a local modification to how timevars are accounted (accounting nested sub-systems to individual passes as well, so the overall sum doesn't add up to 100% but to something larger).
That, with an unoptimized compiler and -fno-checking, turns up

 alias stmt walking      :  50.53 (68%) usr   0.06 (46%) sys  50.45 (68%) wall       2 kB ( 0%) ggc
 tree PRE                :   3.40 ( 5%) usr   0.02 (15%) sys   3.42 ( 5%) wall    3579 kB ( 2%) ggc
 tree FRE                :  48.79 (65%) usr   0.05 (38%) sys  48.79 (65%) wall     364 kB ( 0%) ggc
 tree DSE                :   1.89 ( 3%) usr   0.00 ( 0%) sys   1.89 ( 3%) wall       0 kB ( 0%) ggc

so it's FRE to blame (surprisingly, PRE, which runs on top of the FRE engine, is way down here). I will investigate a bit.

> I think the RA/LRA times are probably worth some investigation as well once
> we get a handle on what's causing alias stmt walking to go nuts.
So SCCVN is limited with --param sccvn-max-alias-queries-per-access, which has a quite high default value (1000). There is nothing more elaborate, like an overall budget, on top of that. Lowering the value to 100 gets me to

 alias stmt walking      :  12.15 (16%) usr   0.06 (12%) sys  12.23 (16%) wall       2 kB ( 0%) ggc
 ...
 tree PRE                :   9.48 (13%) usr   0.01 ( 2%) sys   9.48 (13%) wall    1333 kB ( 0%) ggc
 tree FRE                :   9.91 (13%) usr   0.03 ( 6%) sys   9.92 (13%) wall     310 kB ( 0%) ggc
 ...
 combiner                :  31.87 (43%) usr   0.24 (46%) sys  32.10 (43%) wall  528441 kB (80%) ggc
 rest of compilation     :  51.05 (69%) usr   0.31 (60%) sys  51.25 (69%) wall  620347 kB (94%) ggc
 TOTAL                   :  73.85          0.52         74.34             661274 kB

(see the disclaimer above about things not adding up). As we run FRE twice and PRE only once (though PRE uses a cheaper walking mode that gives up earlier), a reasonable result would be FRE at around twice the PRE time.

Note that we run into the above limit all the time with this testcase, which has loads of stores and loads (non-aliased, of course). And we do seem to optimize stuff, as upping the limit results in faster overall compile time. I think we need to limit the walking more intelligently, and/or design some caching of queries that works better, or simply make the queries cheaper. Most of the query cost comes from decomposing our reference trees and repeatedly computing the overall access offset/size (get_ref_base_and_extent), so if we could cache that we might win.

The numbers of loads and stores in the testcase are not that bad: we have 1700 stores and 547 loads after fre1, and 1142 stores and 547 loads after fre2. Before fre1 we had 1714 stores and 15334 loads. So it's clearly fre1 that runs into this compile-time hog (it does at most #loads * #stores alias queries). OTOH we _do_ want to optimize those ~14000 redundant loads! It simply looks like the defs are far away (which happens, for example, if you have an initializer at the start of the function that we can optimize from).
The 547 loads after fre1 is actually with the limit upped to 10000; with the default limit we still have 5548 loads, and with it lowered to 100 we get 11616 loads. Note that compile time would be worse if there were nothing to optimize here (then you hit the worst case of #loads * #stores alias queries).
Note that the wide-int merge brought some extra slowness into get_ref_base_and_extent, which is no longer call-free. For example,

	case BIT_FIELD_REF:
	  bit_offset += wi::to_offset (TREE_OPERAND (exp, 2));
	  break;

is optimized to

  <L24>:
  _68 = exp_403->exp.operands[2];
  MEM[(struct  &)&D.86120] ={v} {CLOBBER};
  _445 = MEM[(unsigned int *)&bit_offset + 24B];
  _448 = MEM[(const union tree_node *)_68].base.u.int_length.offset;
  _449 = (unsigned int) _448;
  _451 = _445 + _449;
  if (_451 == 2)
    goto <bb 31>;
  else
    goto <bb 32>;

  <bb 31>:
  _455 = MEM[(const long int *)&bit_offset];
  _456 = (long unsigned int) _455;
  _457 = MEM[(const long int *)_68 + 16B];
  _458 = (long unsigned int) _457;
  resultl_459 = _456 + _458;
  resultl.194_460 = (long int) resultl_459;
  MEM[(long int *)&D.86120] = resultl.194_460;
  _1156 = resultl.194_460 >= 0;
  _896 = (long int) _1156;
  _895 = -_896;
  MEM[(long int *)&D.86120 + 8B] = _895;
  _462 = _456 ^ resultl_459;
  _463 = _458 ^ resultl_459;
  _464 = _462 & _463;
  _465 = _464 >> 63;
  _466 = (unsigned int) _465;
  _467 = _466 + 1;
  MEM[(struct fixed_wide_int_storage *)&D.86120].len = _467;
  goto <bb 33>;

  <bb 32>:
  _450 = &MEM[(const union tree_node *)_68].int_cst.val[0];
  _468 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.86120].val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, _445, _450, _449, 128, 1, 0B);
  MEM[(struct fixed_wide_int_storage *)&D.86120].len = _468;

  <bb 33>:
  MEM[(struct generic_wide_int *)&bit_offset] = MEM[(struct generic_wide_int *)&D.86120];
  D.86120 ={v} {CLOBBER};
  pretmp_580 = exp_403->exp.operands[0];
  goto <bb 119>;

while in GCC 4.9 it was the simple

  <L21>:
  _64 = exp_1->exp.operands[2];
  SR.212_245 = MEM[(struct double_int *)_64 + 16B];
  SR.213_244 = MEM[(struct double_int *)_64 + 24B];
  MEM[(struct double_int *)&D.52195] = SR.212_245;
  MEM[(struct double_int *)&D.52195 + 8B] = SR.213_244;
  double_int::operator+= (&bit_offset, D.52195);
  pretmp_295 = exp_1->exp.operands[0];
  goto <bb 70>;

(ok, looks like double_int::operator+= wasn't inlined there either). That said, more detailed profiling is needed here, but get_ref_base_and_extent is one of the bottlenecks.
Callgrinding a release-checking stage3 shows get_ref_base_and_extent and find_hard_regno_for_1 at the top. And it shows wi::lshift_large called from get_ref_base_and_extent - exactly what I feared... we do hit both wi::lshift_large and wi::mul_internal.

perf confirms the hot spots get_ref_base_and_extent (9%) and find_hard_regno_for_1 (19%), but wi::lshift_large is somewhat down (1.8%) and wi::mul_internal is at 1%. Note the shifts are all by 3 (BITS_PER_UNIT multiplication). The

      /* The first unfilled output block is a left shift of the first
	 block in XVAL.  The other output blocks contain bits from two
	 consecutive input blocks.  */
      unsigned HOST_WIDE_INT carry = 0;
      for (unsigned int i = skip; i < len; ++i)
	{
	  unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
	  val[i] = (x << small_shift) | carry;
	  carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
	}

loop in lshift_large doesn't seem to be very latency-friendly:

  4.02 │ e0:   mov    (%r11,%r9,8),%rax
  5.54 │ e4:   mov    %rax,%rdi
       │       mov    %r8d,%ecx
       │       shl    %cl,%rdi
  4.23 │       mov    %rdi,%rcx
  3.91 │       or     %r15,%rcx
  2.06 │       mov    %rcx,(%r14,%r9,8)
  7.38 │       mov    %r13d,%ecx
  1.41 │       add    $0x1,%r9
       │       shr    %cl,%rax
  3.04 │       cmp    %r12,%r9
  3.37 │       mov    %rax,%r15
  1.95 │     ↑ je     9e

I wonder if GCC can be more efficient here by special-casing skip == 0, len == 2 and using an __int128 on hosts where that is available. In this case we're shifting xlen == 1 values, but the precision might need 2 blocks (byte to bit precision); special-casing that might also make sense. It helps a bit, but of course all the extra testing has an overhead as well. Maybe a wi::bytes_to_bits helper is a better solution here.

Anyway, somehow caching the get_ref_base_and_extent result (which we re-compute only for the stores, btw, for stmt_may_clobber_ref_p) might help more.

Note that with release-checking the testcase compiles quite fast for me.
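A rough sketch of the suggested fast path (hypothetical code, not an actual wide-int patch), assuming a host with unsigned __int128: when the input is a single 64-bit block and the shift amount is small, the whole result fits in at most two output blocks and the carry loop can be avoided entirely.

```c
#include <stdint.h>

/* Hypothetical fast path for the skip == 0, xlen == 1 case of
   lshift_large: shifting one 64-bit block left by a small amount
   (e.g. the shift by 3 that turns a byte offset into a bit offset)
   yields at most two output blocks.  Assumes shift < 64; anything
   else would fall back to the generic loop (not shown).  Returns
   the number of significant output blocks.  */
static int
lshift_1_small (uint64_t val[2], uint64_t x, unsigned int shift)
{
  unsigned __int128 wide = (unsigned __int128) x << shift;
  val[0] = (uint64_t) wide;         /* low 64 bits */
  val[1] = (uint64_t) (wide >> 64); /* overflow into the second block */
  return val[1] != 0 ? 2 : 1;
}
```

For example, shifting 5 by 3 stays in one block (val[0] == 40), while shifting (1 << 61) by 3 spills into the second block. Whether the branch to select this path pays for itself is exactly the "testing has an overhead as well" caveat above; a dedicated wi::bytes_to_bits helper would avoid the dispatch altogether.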
 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21          0.22         12.42             152002 kB

(that's without mucking with timevars).
GCC 4.9 branch is being closed
WRT c#16 -- I must have totally missed the differences in checking configurations! gcc-6 and trunk are comparable to gcc-4.7 on this test when the compilers are similarly configured (checking was also responsible for the huge register-allocator jump). I'm pretty sure gcc-5 doesn't have the appropriate improvements. I'm removing the gcc-6 and gcc-7 regression markers.
GCC 5 branch has been closed, should be fixed in GCC 6 and later.