Created attachment 31833 [details]
Example Fortran source file exhibiting the problem with gfortran 4.8

When switching from gfortran 4.7 to gfortran 4.8, memory usage and compile time increase vastly for some files in our project, e.g. for the attached example file. gfortran 4.8.2 needs 50s to compile it, using up to 1.7 GB of RAM, while gfortran 4.7 compiles it in 7s with a memory usage of 136 MB.

The gfortran command line for the attached example is

  /opt/gcc/4.8.2/bin/gfortran -fno-automatic -ffixed-line-length-none -O2 -c FermionBoxEventempCoupling_Div.F -o output.o

The problem seems to be partly linked to the option -fno-automatic: omitting it avoids the memory increase, but the compile time is still 14s with gfortran 4.8, compared to 6s with gfortran 4.7.

The problem arises with optimization enabled: -O2 and -O1 lead to a similar discrepancy, while with -O0 the problem does not exist.

The gfortran 4.8 which shows the problematic behaviour is built with:

  Target: x86_64-unknown-linux-gnu
  Configured with: ../gcc-4.8.2/configure --prefix=/opt/gcc/4.8.2 --enable-languages=c,c++,fortran
  gcc version 4.8.2 (GCC)

The gfortran 4.7 is the built-in one from openSUSE 12.2:

  Target: x86_64-suse-linux
  Configured with: ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,java,ada --enable-checking=release --with-gxx-include-dir=/usr/include/c++/4.7 --enable-ssp --disable-libssp --disable-libitm --disable-plugin --with-bugurl=http://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --disable-libgcj --disable-libmudflap --with-slibdir=/lib64 --with-system-zlib --enable-__cxa_atexit --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-version-specific-runtime-libs --enable-linker-build-id --program-suffix=-4.7 --enable-linux-futex --without-system-libunwind --with-arch-32=i586 --with-tune=generic --build=x86_64-suse-linux
  gcc version 4.7.1 20120723 [gcc-4_7-branch revision 189773] (SUSE Linux)
Confirmed. At -O2 I see for 4.8

 alias stmt walking      :   4.90 ( 8%) usr   0.02 ( 3%) sys   5.05 ( 8%) wall       0 kB ( 0%) ggc
 tree copy propagation   :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     513 kB ( 0%) ggc
 combiner                :  45.94 (71%) usr   0.57 (83%) sys  46.57 (71%) wall 1612812 kB (96%) ggc
 integrated RA           :   0.59 ( 1%) usr   0.01 ( 1%) sys   0.58 ( 1%) wall   12481 kB ( 1%) ggc
 LRA hard reg assignment :   7.53 (12%) usr   0.03 ( 4%) sys   7.56 (11%) wall       0 kB ( 0%) ggc
 TOTAL                   :  65.03          0.69         65.92            1681500 kB

and for 4.7

 alias stmt walking      :   2.77 (28%) usr   0.02 (17%) sys   2.80 (28%) wall       0 kB ( 0%) ggc
 tree copy propagation   :   1.64 (17%) usr   0.01 ( 8%) sys   1.67 (17%) wall     513 kB ( 1%) ggc
 integrated RA           :   0.66 ( 7%) usr   0.02 (17%) sys   0.68 ( 7%) wall    6295 kB (10%) ggc
 reload                  :   0.45 ( 5%) usr   0.01 ( 8%) sys   0.46 ( 5%) wall    1593 kB ( 3%) ggc
 reload CSE regs         :   0.74 ( 7%) usr   0.00 ( 0%) sys   0.75 ( 7%) wall    1103 kB ( 2%) ggc
 TOTAL                   :   9.87          0.12         10.01              61922 kB

The combiner slowdown and the LRA thing are worth investigating. I suppose the generated code differs quite a bit ;) It's a very large function with some very large basic block(s).

Current trunk behaves comparably to GCC 4.8.
No clear offender in combine; profiling points at if_then_else_cond (14% of combine time). The LRA slowness is calls to find_hard_regno_for, 93% of lra_assign ().
Started with r190594, which apparently made big changes in some of the GIMPLE passes: e.g. the optimized dump line count went down from 65782 lines (r190593) to 52107 lines (r190594). But for some reason that change results in incomparably more log links and combiner opportunities. In the -fdump-rtl-combine-details dump, 'Successfully matched' counts went up from 9742 to 3921927, and 'Failed to match' counts went up from 53193 to 22538586. So the combiner success rate is approximately the same, around 17.5%, just on roughly 400 times more opportunities.

For the GIMPLE passes, looking just at the sizes of the GIMPLE dumps: until esra the sizes are the same; from fre1 until crited the r190594 dump is slightly to significantly larger than the corresponding r190593 dump; then, surprisingly, the pre dump is on the other side, half the size of the r190593 one; and from sink till optimized the dump sizes roughly track the 65782 vs. 52107 line counts.
With -O2 -fno-automatic -ffixed-line-length-none --param sccvn-max-alias-queries-per-access=1300 the combiner completely disappears from the picture, presumably because FRE/PRE then doesn't give up. With 1200 the combiner already takes 17% of compile time and 77% of memory, and with the default of 1000 it takes 58% of compile time and 96% of memory.
Bah. Add a limiting --param to combine? ....
Dunno what to limit there though.

BTW, was the 1000 default for the sccvn-max-alias-queries-per-access param some carefully chosen value? E.g. does the test that was fixed by the addition of the param fail if it is bumped to, say, 1500 or 2000, or can we just bump the default slightly?
On Fri, 31 Jan 2014, jakub at gcc dot gnu.org wrote:

> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
>
> --- Comment #6 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Dunno what to limit there though.
>
> BTW, the 1000 default for sccvn-max-alias-queries-per-access param was some
> carefully chosen value, e.g. does the test that was fixed by the addition of
> the param fail if it is bumped to say 1500 or 2000, or can we just bump
> slightly the default?

It was reasonably chosen - the testcase I added it for still uses a lot of
time in alias-walking (~30% I think, which I decided was reasonable for that
testcase). We sure can bump it slightly, but then wouldn't it take only a
slightly larger testcase to re-trigger this bug?
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
So Jan fixed the combine issue when he committed his changes to remove statements which set write-only variables. That essentially brought the combiner down to nothing. Hard to believe, since it was just a dozen or so nodes that were determined to be write-only; but that was double-checked by just hacking out the call to set_writeonly_bit in one of the older compilers I was testing.

Richi's change to improve DOM's handling of aliased loads helps considerably if someone were to disable the IPA propagation bits.

Unfortunately we're still burning a lot of time in the alias walking, so we can't reasonably consider this resolved for gcc-6:

 alias stmt walking      :  38.67 (64%) usr   0.01 ( 7%) sys  38.64 (64%) wall       2 kB ( 0%) ggc
 tree copy propagation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 combiner                :   0.35 ( 1%) usr   0.01 ( 7%) sys   0.36 ( 1%) wall    2709 kB ( 2%) ggc
 integrated RA           :   2.27 ( 4%) usr   0.00 ( 0%) sys   2.31 ( 4%) wall    6634 kB ( 4%) ggc
 LRA hard reg assignment :   4.39 ( 7%) usr   0.01 ( 7%) sys   4.41 ( 7%) wall       0 kB ( 0%) ggc

The alias statement walking is a bit surprising. This is Fortran code after all, so I wouldn't expect a lot of aliasing; I guess it's just a lot of calls into the alias statement walking code (but I haven't looked to confirm that theory).

I think the RA/LRA times are probably worth some investigation as well once we get a handle on what's causing alias stmt walking to go nuts.
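For readers unfamiliar with the write-only-variable removal mentioned above, here is a minimal C analogue (illustrative only, not the Fortran testcase): a file-local variable that is only ever written can have all stores to it deleted, which in turn removes the dead computation that was feeding combine.

```c
/* A file-local variable that no code ever reads.  GCC's IPA analysis
   can mark it write-only and delete every store to it.  */
static int sink;

/* Sum an array; the store to 'sink' inside the loop is dead and can be
   removed without changing observable behaviour.  */
void
compute (int *out, const int *a, int n)
{
  int acc = 0;
  for (int i = 0; i < n; ++i)
    {
      acc += a[i];
      sink = acc;  /* write-only store: removable */
    }
  *out = acc;
}
```

With the store removed, later RTL passes (including combine) simply see far fewer log links to chase, which matches the observed collapse of combiner time.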
(In reply to Jeffrey A. Law from comment #12)
> So Jan fixed the combine issue when he committed his changes to remove
> statements which set write-only variables. That essentially brought the
> combiner down to nothing. Hard to believe since it was just a dozen or so
> nodes that were determined to be write-only. But that was double-checked by
> just hacking out the call to set_writeonly_bit in one of the older compilers
> I was testing.
>
> Richi's change to improve DOM's handling of aliased loads helps considerably
> if someone were to disable the IPA propagation bits.
>
> Unfortunately we're still burning a lot of time in the alias walking. So we
> can't reasonably consider this resolved for gcc-6:
>
> alias stmt walking      :  38.67 (64%) usr   0.01 ( 7%) sys  38.64 (64%) wall       2 kB ( 0%) ggc
> tree copy propagation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
> combiner                :   0.35 ( 1%) usr   0.01 ( 7%) sys   0.36 ( 1%) wall    2709 kB ( 2%) ggc
> integrated RA           :   2.27 ( 4%) usr   0.00 ( 0%) sys   2.31 ( 4%) wall    6634 kB ( 4%) ggc
> LRA hard reg assignment :   4.39 ( 7%) usr   0.01 ( 7%) sys   4.41 ( 7%) wall       0 kB ( 0%) ggc
>
> The alias statement walking is a bit surprising. This is Fortran code after
> all, I wouldn't expect a lot of aliasing, I guess it's just a lot of calls
> into the alias statement walking code (but I haven't looked to confirm that
> theory).

If there were a lot of aliasing it would be faster ;) We spend the time disambiguating (with success). Alias stmt walking can be attributed to FRE, PRE, DCE, DSE and some IPA code. All but the IPA use are throttled (though that doesn't mean they can't still use quite some time in some cases).

I have a local modification to how timevars are accounted (accounting nested sub-systems to individual passes as well, so the overall sum doesn't add up to 100% but to something larger).
That, with an unoptimized compiler and -fno-checking, turns up

 alias stmt walking      :  50.53 (68%) usr   0.06 (46%) sys  50.45 (68%) wall       2 kB ( 0%) ggc
 tree PRE                :   3.40 ( 5%) usr   0.02 (15%) sys   3.42 ( 5%) wall    3579 kB ( 2%) ggc
 tree FRE                :  48.79 (65%) usr   0.05 (38%) sys  48.79 (65%) wall     364 kB ( 0%) ggc
 tree DSE                :   1.89 ( 3%) usr   0.00 ( 0%) sys   1.89 ( 3%) wall       0 kB ( 0%) ggc

so it's FRE to blame (surprisingly, PRE, which runs on top of the FRE engine, is way down here). I will investigate a bit.

> I think the RA/LRA times are probably worth some investigation as well once
> we get a handle on what's causing alias stmt walking to go nuts.
So SCCVN is limited with --param sccvn-max-alias-queries-per-access, which has a quite high default value (1000). There is nothing more elaborate, like an overall budget, on top of that. Lowering the value to 100 gets me to

 alias stmt walking      :  12.15 (16%) usr   0.06 (12%) sys  12.23 (16%) wall       2 kB ( 0%) ggc
 ...
 tree PRE                :   9.48 (13%) usr   0.01 ( 2%) sys   9.48 (13%) wall    1333 kB ( 0%) ggc
 tree FRE                :   9.91 (13%) usr   0.03 ( 6%) sys   9.92 (13%) wall     310 kB ( 0%) ggc
 ...
 combiner                :  31.87 (43%) usr   0.24 (46%) sys  32.10 (43%) wall  528441 kB (80%) ggc
 rest of compilation     :  51.05 (69%) usr   0.31 (60%) sys  51.25 (69%) wall  620347 kB (94%) ggc
 TOTAL                   :  73.85          0.52         74.34             661274 kB

(see the disclaimer above about things not adding up). As we run FRE twice and PRE only once (though PRE uses a cheaper walking mode that gives up earlier), a reasonable result would be FRE at around twice the PRE time.

Note that we run into the above limit all the time with this testcase, which has loads of stores and loads (non-aliased, of course). And we do seem to optimize stuff, as upping the limit results in faster overall compile time. I think we need to limit the walking more intelligently, and/or design some caching of queries that works better, or simply make the queries cheaper. Most of the query cost comes from decomposing our reference trees and repeatedly computing the overall access offset/size (get_ref_base_and_extent), so if we could cache that we might win.

The numbers of loads and stores in the testcase are not that bad: we have 1700 stores and 547 loads after fre1, and 1142 stores and 547 loads after fre2. Before fre1 we had 1714 stores and 15334 loads. So it's clearly fre1 that runs into this compile-time hog (it does at most #loads * #stores alias queries). OTOH we _do_ want to optimize those ~14000 redundant loads! It simply looks like the defs are far away (which happens, for example, if you have an initializer at the start of the function that we can optimize from).
The 547 loads after fre1 is actually with the limit upped to 10000; with the default limit we still have 5548 loads, and with it lowered to 100 we get 11616 loads. Note that compile time would be worse if there were nothing to optimize here (then you hit the worst case of #loads * #stores alias queries).
Note that the wide-int merge brought some extra slowness into get_ref_base_and_extent, which is no longer call-free. For example,

	case BIT_FIELD_REF:
	  bit_offset += wi::to_offset (TREE_OPERAND (exp, 2));
	  break;

is optimized to

  <L24>:
  _68 = exp_403->exp.operands[2];
  MEM[(struct  &)&D.86120] ={v} {CLOBBER};
  _445 = MEM[(unsigned int *)&bit_offset + 24B];
  _448 = MEM[(const union tree_node *)_68].base.u.int_length.offset;
  _449 = (unsigned int) _448;
  _451 = _445 + _449;
  if (_451 == 2)
    goto <bb 31>;
  else
    goto <bb 32>;

  <bb 31>:
  _455 = MEM[(const long int *)&bit_offset];
  _456 = (long unsigned int) _455;
  _457 = MEM[(const long int *)_68 + 16B];
  _458 = (long unsigned int) _457;
  resultl_459 = _456 + _458;
  resultl.194_460 = (long int) resultl_459;
  MEM[(long int *)&D.86120] = resultl.194_460;
  _1156 = resultl.194_460 >= 0;
  _896 = (long int) _1156;
  _895 = -_896;
  MEM[(long int *)&D.86120 + 8B] = _895;
  _462 = _456 ^ resultl_459;
  _463 = _458 ^ resultl_459;
  _464 = _462 & _463;
  _465 = _464 >> 63;
  _466 = (unsigned int) _465;
  _467 = _466 + 1;
  MEM[(struct fixed_wide_int_storage *)&D.86120].len = _467;
  goto <bb 33>;

  <bb 32>:
  _450 = &MEM[(const union tree_node *)_68].int_cst.val[0];
  _468 = wi::add_large (&MEM[(struct fixed_wide_int_storage *)&D.86120].val, &MEM[(const struct fixed_wide_int_storage *)&bit_offset].val, _445, _450, _449, 128, 1, 0B);
  MEM[(struct fixed_wide_int_storage *)&D.86120].len = _468;

  <bb 33>:
  MEM[(struct generic_wide_int *)&bit_offset] = MEM[(struct generic_wide_int *)&D.86120];
  D.86120 ={v} {CLOBBER};
  pretmp_580 = exp_403->exp.operands[0];
  goto <bb 119>;

while in GCC 4.9 it was the simple

  <L21>:
  _64 = exp_1->exp.operands[2];
  SR.212_245 = MEM[(struct double_int *)_64 + 16B];
  SR.213_244 = MEM[(struct double_int *)_64 + 24B];
  MEM[(struct double_int *)&D.52195] = SR.212_245;
  MEM[(struct double_int *)&D.52195 + 8B] = SR.213_244;
  double_int::operator+= (&bit_offset, D.52195);
  pretmp_295 = exp_1->exp.operands[0];
  goto <bb 70>;

(ok, looks like double_int::operator+= wasn't inlined there either). That said, more detailed profiling is needed here, but get_ref_base_and_extent is one of the bottlenecks.
Callgrinding a release-checking stage3 shows get_ref_base_and_extent and find_hard_regno_for_1 at the top. And it shows wi::lshift_large called from get_ref_base_and_extent - exactly what I feared... we do hit both wi::lshift_large and wi::mul_internal.

perf confirms the hot spots get_ref_base_and_extent (9%) and find_hard_regno_for_1 (19%), but wi::lshift_large is somewhat down (1.8%) and wi::mul_internal is at 1%. Note the shifts are all by 3 (BITS_PER_UNIT multiplication). The

      /* The first unfilled output block is a left shift of the first
	 block in XVAL.  The other output blocks contain bits from two
	 consecutive input blocks.  */
      unsigned HOST_WIDE_INT carry = 0;
      for (unsigned int i = skip; i < len; ++i)
	{
	  unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
	  val[i] = (x << small_shift) | carry;
	  carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
	}

loop in lshift_large doesn't seem to be very latency-friendly:

  4.02 │ e0:   mov    (%r11,%r9,8),%rax
  5.54 │ e4:   mov    %rax,%rdi
       │       mov    %r8d,%ecx
       │       shl    %cl,%rdi
  4.23 │       mov    %rdi,%rcx
  3.91 │       or     %r15,%rcx
  2.06 │       mov    %rcx,(%r14,%r9,8)
  7.38 │       mov    %r13d,%ecx
  1.41 │       add    $0x1,%r9
       │       shr    %cl,%rax
  3.04 │       cmp    %r12,%r9
  3.37 │       mov    %rax,%r15
  1.95 │     ↑ je     9e

I wonder if GCC can be more efficient here by special-casing skip == 0, len == 2 and using an __int128 on hosts where that is available. In this case we're shifting xlen == 1 values, but the precision might need 2 blocks (byte to bit precision); special-casing that might also make sense. It helps a bit, but of course all the extra testing has an overhead as well. Maybe a wi::bytes_to_bits helper is a better solution here.

Anyway, somehow caching the get_ref_base_and_extent result (which we re-compute only for the stores, btw, for stmt_may_clobber_ref_p) might help more.

Note that with release-checking the testcase compiles quite fast for me.
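A rough sketch of the suggested fast path (hypothetical code, not an actual wide-int patch), assuming a host with unsigned __int128: when the input is a single 64-bit block and the shift amount is small, the whole result fits in at most two output blocks and the carry loop can be avoided entirely.

```c
#include <stdint.h>

/* Hypothetical fast path for the skip == 0, xlen == 1 case of
   lshift_large: shifting one 64-bit block left by a small amount
   (e.g. the shift by 3 that turns a byte offset into a bit offset)
   yields at most two output blocks.  Assumes shift < 64; anything
   else would fall back to the generic loop (not shown).  Returns
   the number of significant output blocks.  */
static int
lshift_1_small (uint64_t val[2], uint64_t x, unsigned int shift)
{
  unsigned __int128 wide = (unsigned __int128) x << shift;
  val[0] = (uint64_t) wide;         /* low 64 bits */
  val[1] = (uint64_t) (wide >> 64); /* overflow into the second block */
  return val[1] != 0 ? 2 : 1;
}
```

For example, shifting 5 by 3 stays in one block (val[0] == 40), while shifting (1 << 61) by 3 spills into the second block. Whether the branch to select this path pays for itself is exactly the "testing has an overhead as well" caveat above; a dedicated wi::bytes_to_bits helper would avoid the dispatch altogether.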
 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21          0.22         12.42             152002 kB

(that's without mucking with timevars).
GCC 4.9 branch is being closed
WRT c#16 -- I must have totally missed the differences in checking configurations! gcc-6 and trunk are comparable to gcc-4.7 on this test when the compilers are similarly configured (checking was also responsible for the huge register-allocator jump). I'm pretty sure gcc-5 doesn't have the appropriate improvements. I'm removing the gcc-6 and gcc-7 regression markers.
GCC 5 branch has been closed, should be fixed in GCC 6 and later.