The attached Fortran file takes unusually long to compile with full optimization on the latest GCC trunk builds. I tried a lot of the optimization flags, and found that -fstrict-aliasing seems to trigger the long compile time. Without this flag the compile time is perfectly reasonable, as you can see from the following numbers (using trunk revision 131325 on i686-pc-linux-gnu, Pentium M 1.7 GHz and 1 GB memory):

-O0: 0.9s
-O1: 4.5s
-O2 -fno-strict-aliasing: 8.0s
-O3 -fno-strict-aliasing: 10.4s

But once I use -fstrict-aliasing, compile time explodes:

-O1 -fstrict-aliasing: 1m10s
-O2: 2m16s
-O3: 1m43s

(-fstrict-aliasing is included in -O2 and -O3, but not in -O1.)

I don't know why -O3 is faster than -O2 here (which seems funny), but anyway all three numbers are way too large in my opinion. I mean: compile time rises by a factor of roughly 15, just due to this one flag! I don't even know what the flag is supposed to do for this code. Maybe someone can explain this to me?

I didn't examine memory consumption very carefully, but this also seems unusually high, as a similar file failed to compile on an old machine (with only 256MB) due to lack of memory(!).

Am I right to assume that something goes wrong here somehow, or can someone tell me why I should expect this kind of behaviour for this code?
Created attachment 14885 [details] test case
Can you give the output of the compiler when -ftime-report is added?
(In reply to comment #2)
> Can you give the output of the compiler when -ftime-report is added?

Sure thing, thanks for the remark. Here it goes:

gfortran-4.3 -c -ftime-report -O1 -fstrict-aliasing Amplitudes.f90

Execution times (seconds)
 garbage collection    :   0.21 ( 0%) usr   0.02 ( 2%) sys   0.24 ( 0%) wall       0 kB ( 0%) ggc
 callgraph construction:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     299 kB ( 0%) ggc
 callgraph optimization:   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      62 kB ( 0%) ggc
 trivially dead code   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 df reaching defs      :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 df live regs          :   0.10 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 df live&initialized regs:  0.04 ( 0%) usr  0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 df reg dead/unused notes:  0.02 ( 0%) usr  0.00 ( 0%) sys   0.01 ( 0%) wall     127 kB ( 0%) ggc
 register information  :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 alias analysis        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     196 kB ( 0%) ggc
 rebuild jump labels   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 parser                :   0.08 ( 0%) usr   0.00 ( 0%) sys   0.08 ( 0%) wall    1848 kB ( 0%) ggc
 inline heuristics     :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       1 kB ( 0%) ggc
 tree gimplify         :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     847 kB ( 0%) ggc
 tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     769 kB ( 0%) ggc
 tree CFG cleanup      :   4.20 ( 5%) usr   0.06 ( 7%) sys   4.60 ( 5%) wall    2550 kB ( 0%) ggc
 tree copy propagation :   0.25 ( 0%) usr   0.01 ( 1%) sys   0.31 ( 0%) wall      24 kB ( 0%) ggc
 tree PTA              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      67 kB ( 0%) ggc
 tree alias analysis   :   0.01 ( 0%) usr   0.01 ( 1%) sys   0.01 ( 0%) wall       2 kB ( 0%) ggc
 tree call clobbering  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree flow sensitive alias: 0.01 ( 0%) usr  0.00 ( 0%) sys   0.00 ( 0%) wall      16 kB ( 0%) ggc
 tree memory partitioning:  0.07 ( 0%) usr  0.00 ( 0%) sys   0.08 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite      :   0.48 ( 1%) usr   0.01 ( 1%) sys   0.53 ( 1%) wall    8455 kB ( 1%) ggc
 tree SSA other        :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA incremental  :   3.26 ( 4%) usr   0.08 ( 9%) sys   3.62 ( 4%) wall   19615 kB ( 2%) ggc
 tree operand scan     :  49.67 (63%) usr   0.48 (52%) sys  57.11 (64%) wall  247409 kB (30%) ggc
 dominator optimization:   0.85 ( 1%) usr   0.02 ( 2%) sys   1.57 ( 2%) wall   14019 kB ( 2%) ggc
 tree SRA              :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      27 kB ( 0%) ggc
 tree STORE-CCP        :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       1 kB ( 0%) ggc
 tree CCP              :   0.07 ( 0%) usr   0.01 ( 1%) sys   0.08 ( 0%) wall      15 kB ( 0%) ggc
 tree PHI const/copy prop:  0.35 ( 0%) usr  0.00 ( 0%) sys   0.37 ( 0%) wall       0 kB ( 0%) ggc
 tree split crit edges :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    2489 kB ( 0%) ggc
 tree reassociation    :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      72 kB ( 0%) ggc
 tree FRE              :  16.12 (20%) usr   0.17 (18%) sys  16.75 (19%) wall  505793 kB (62%) ggc
 tree code sinking     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      41 kB ( 0%) ggc
 tree forward propagate:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       7 kB ( 0%) ggc
 tree conservative DCE :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 tree aggressive DCE   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 tree DSE              :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall    1879 kB ( 0%) ggc
 tree loop bounds      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     160 kB ( 0%) ggc
 loop invariant motion :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       1 kB ( 0%) ggc
 tree canonical iv     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall     138 kB ( 0%) ggc
 scev constant prop    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall      68 kB ( 0%) ggc
 complete unrolling    :   0.30 ( 0%) usr   0.01 ( 1%) sys   0.36 ( 0%) wall    2694 kB ( 0%) ggc
 tree iv optimization  :   0.17 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall    3958 kB ( 0%) ggc
 tree loop init        :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall      52 kB ( 0%) ggc
 tree copy headers     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall    1280 kB ( 0%) ggc
 tree SSA uncprop      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA to normal    :   0.11 ( 0%) usr   0.01 ( 1%) sys   0.12 ( 0%) wall     274 kB ( 0%) ggc
 tree rename SSA copies:   0.05 ( 0%) usr   0.00 ( 0%) sys   0.07 ( 0%) wall       0 kB ( 0%) ggc
 dominance frontiers   :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 expand                :   0.91 ( 1%) usr   0.02 ( 2%) sys   1.01 ( 1%) wall    3726 kB ( 0%) ggc
 lower subreg          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       8 kB ( 0%) ggc
 CSE                   :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall      49 kB ( 0%) ggc
 dead code elimination :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 dead store elim1      :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     249 kB ( 0%) ggc
 dead store elim2      :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall     214 kB ( 0%) ggc
 loop analysis         :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall      83 kB ( 0%) ggc
 branch prediction     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall     154 kB ( 0%) ggc
 combiner              :   0.09 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall     485 kB ( 0%) ggc
 if-conversion         :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall      58 kB ( 0%) ggc
 local alloc           :   0.06 ( 0%) usr   0.00 ( 0%) sys   0.06 ( 0%) wall     217 kB ( 0%) ggc
 global alloc          :   0.26 ( 0%) usr   0.01 ( 1%) sys   0.28 ( 0%) wall    1050 kB ( 0%) ggc
 reload CSE regs       :   0.07 ( 0%) usr   0.00 ( 0%) sys   0.09 ( 0%) wall     258 kB ( 0%) ggc
 thread pro- & epilogue:   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       5 kB ( 0%) ggc
 rename registers      :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 machine dep reorg     :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 reg stack             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      34 kB ( 0%) ggc
 final                 :   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 :  79.11          0.92         89.22          822312 kB

So we have 30% of the memory for "tree operand scan" and 62% for "tree FRE". I have no idea what that means though. I hope you do ;)
But 800MB of memory is definitely *a lot* for such a small file!
Confirmed.

Some comparative timings with the Intel and Sun compilers, which also use very little memory:

$ time gfc -c -O2 janus_slow.f90
real    2m53.448s
user    2m51.425s
sys     0m1.878s

$ time ifort -c -O2 janus_slow.f90
real    0m10.489s
user    0m10.264s
sys     0m0.185s

$ time sunf95 -c -O2 janus_slow.f90
real    0m24.029s
user    0m23.339s
sys     0m0.685s
The problem is that we have loads of unpartitionable SFTs. There are workarounds possible that might only affect performance of the generated code: for example, you can use --param salias-max-array-elements=0, which will reduce the number of unpartitionable SFTs somewhat. Or, as a bigger hammer, disable generating SFTs completely with --param max-fields-for-field-sensitive=0.
Note that this shouldn't really be affected by strict aliasing or not.

Note that the FE-generated IL causes the first alias pass to emit too-conservative alias info as well:

  # VUSE <MPT.5965_208230>
  D.6049_4652 = atmp.1110.data;
  D.6050_4653 = (complex(kind=4)[0:] *) D.6049_4652;
  D.6051_4654 = S.1113_289 + D.3340_4634;
  # VUSE <SFT.......>
  D.6052_4655 = (*D.6050_4653)[D.6051_4654];

I suppose the atmp.1110.data type is something like (void *), so the cast is required. But this really pessimizes the middle-end IL, and it looks like the FE knows it will be complex(kind=4)[4] at the point of creation. Note that fixing this will also improve optimization and thus runtime performance.

I also see the FE creates lots of array temporaries:

  struct array2_complex(kind=4) atmp.1093;
  complex(kind=4) A.1094[4];
  struct array2_complex(kind=4) atmp.1095;
  complex(kind=4) A.1096[4];
  struct array2_complex(kind=4) atmp.1100;
  complex(kind=4) A.1101[4];
  struct array2_complex(kind=4) atmp.1102;
  complex(kind=4) A.1103[4];
  struct array2_complex(kind=4) atmp.1106;
  complex(kind=4) A.1107[4];
  real(kind=4) D.3326;
  ...

instead of re-using a single one. This also causes the internal representation to blow up.

So, to sum up, the situation could be significantly improved by improving the FE. (Like I noticed for I/O statements, where I provided a patch - that was not accepted - to share temporary variables created for the I/O context.)

As we already have bugs for the middle-end partitioning problems, I am making this one a Fortran FE bug instead.
> So, to sum up, the situation could be significantly improved by improving > the FE. (Like I noticed for I/O statements, where I provided a patch - > that was not accepted - to share temporary variables created for the I/O > context) Could you add a link to that patch? I quickly searched my mail archive, but could not find it.
Basically, not re-using array descriptor variables is as bad as not re-using I/O descriptor variables.

I also wonder what this _gfortran_internal_{un,}pack is about? For

  taux=gn(p2+p4)

gfortran generates

  real(kind=4) A.1[4];
  struct array1_real(kind=4) atmp.0;

  atmp.0.dtype = 281;
  atmp.0.dim[0].stride = 1;
  atmp.0.dim[0].lbound = 0;
  atmp.0.dim[0].ubound = 3;
  atmp.0.data = (void *) &A.1;
  atmp.0.offset = 0;
  {
    integer(kind=4) S.2;

    S.2 = 0;
    while (1)
      {
        if (S.2 > 3) goto L.1;
        (*(real(kind=4)[0:] *) atmp.0.data)[S.2] = (*p2)[S.2] + (*p4)[S.2];
        S.2 = S.2 + 1;
      }
    L.1:;
  }
  D.531 = _gfortran_internal_pack (&atmp.0);
  D.533 = gn (D.531);
  if ((real(kind=4)[0:] *) D.531 != (real(kind=4)[0:] *) atmp.0.data)
    {
      _gfortran_internal_unpack (&atmp.0, D.531);
      {
        void * D.532;

        D.532 = D.531;
        if (D.532 != 0B)
          {
            __builtin_free (D.532);
          }
      }
    }

Why does it not know the result of

  if ((real(kind=4)[0:] *) D.531 != (real(kind=4)[0:] *) atmp.0.data)

at the point it creates the IL? I suppose this is about arrays with stride != 1?
This one: http://gcc.gnu.org/ml/gcc-patches/2006-02/msg01762.html

Basically it mentions async I/O as a reason not to do it, but the callee can copy the I/O structure in that case instead (it probably needs to anyway, as currently those structures live on the stack, so with async I/O you would otherwise need to guarantee that the I/O has finished before the current frame is left).
So, currently you can build up arbitrarily large chains of virtual clobbers with

  write (*,*) 'hello'
  write (*,*) 'hello'
  ...

as each of those writes creates two un-partitionable SFTs. For array temporaries this is the same (no idea what the simplest statement is to create one). Without the fix for PR34093 this would even reliably ICE the compiler (and it of course creates compile-time and memory-usage problems, because the growth of the IL is non-linear).
Actually it isn't that bad for I/O as long as there are no pointers to the dt_parm structs in the IL (function arguments are ok).
For array temporaries the pinning of SFTs happens because we have the address of the actual array _data_ in the IL:

  # SFT.10_52 = VDEF <SFT.10_51(D)>
  atmp.0.data = &A.1;

The array descriptor itself is not the problem (the redundant descriptors still consume memory, but should not cause compile-time problems where observed).
As for optimization, the conversion sequence

  atmp.0.data = &A.1;
  D.540_5 = atmp.0.data;
  D.541_6 = (real(kind=4)[0:] *) D.540_5;
  ...
  (*D.541_6)[S.5_1] = D.545_14;

can be optimized to

  A.1[S.5_1] = D.545_14;

with the following patch that makes sure we use (real(kind=4)[4] *) instead of the unknown-size array type.

Index: trans-types.c
===================================================================
--- trans-types.c       (revision 131336)
+++ trans-types.c       (working copy)
@@ -1511,10 +1511,12 @@ gfc_get_array_type_bounds (tree etype, i
   /* TODO: known offsets for descriptors.  */
   GFC_TYPE_ARRAY_OFFSET (fat_type) = NULL_TREE;
 
-  /* We define data as an unknown size array. Much better than doing
+  /* We define data as an array with the correct size. Much better than doing
      pointer arithmetic.  */
   arraytype =
-    build_array_type (etype, gfc_array_range_type);
+    build_array_type (etype, build_range_type (gfc_array_index_type,
+                      gfc_index_zero_node, int_const_binop (MINUS_EXPR, stride,
+                      integer_one_node, 0)));
   arraytype = build_pointer_type (arraytype);
   GFC_TYPE_ARRAY_DATAPTR_TYPE (fat_type) = arraytype;

(The patch needs to be adjusted for the cases where stride is not the actual array size, but you should get the idea.)
Note that the large FRE compile-time is a dup of PR34450; try experimenting with --param sccvn-max-scc-size=N and find an N where the problem vanishes.
I've done some experimenting with older GCC versions I have floating around on my machines:

Compiling the test case with both 4.1.2 and 4.2.1 gives an ICE, so I guess we can't exactly call this a regression.

But compiling with older 4.3 trunk builds works much better: I have a version from August 24 of last year, which runs the test case at full -O3 in a perfect 8s, with 32MB memory usage (that is with --enable-checking=release). And I have two other builds from October 8 and November 9, which both run the test case in about 38s with 85MB memory usage (they are built with --enable-checking=debug, so this is probably just debugging overhead). At least none of them takes the crazy 800MB of recent builds.

To sum up: trunk versions up to at least November 9 work fine on this test case.
Subject: Re: Fortran FE generated IL pessimizes middle-end IL and analysis

(In reply to comment #16)
> To sum up: It seems like trunk versions up to at least November 9 seem to
> work fine on this test case.

Well, we know what introduced this regression on the trunk. It was the fix for PR33870.

Richard.
Most of the operand scanner time is caused by cfg_cleanup:

 tree CFG cleanup      :  93.71 (79%) usr   1.16 (54%) sys  95.03 (78%) wall  474427 kB (28%) ggc

Globbing CFG cleanup (and operand scanner) time to the passes that trigger it reveals:

 tree PHI const/copy prop:  27.30 (16%) usr  0.27 ( 9%) sys  31.33 (16%) wall       9 kB ( 0%) ggc
 tree FRE              :  25.01 (15%) usr   0.75 (25%) sys  27.63 (14%) wall 1061857 kB (62%) ggc
 complete unrolling    : 100.00 (59%) usr   1.46 (49%) sys 111.81 (58%) wall  482221 kB (28%) ggc

wheee ;)

Looking where we spend most of the time, it seems to be building tree lists in ssa_redirect_edge. I have a hack to make this a single allocation of a VEC on the heap; still, in cfg_cleanup remove_forwarder_block_with_phi there is a linear search(!) in the list of pending PHI args.

Profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 48.75     52.85    52.85  2734252     0.00     0.00  ggc_alloc_stat
 11.07     64.85    12.00 35486689     0.00     0.00  add_vars_for_offset
  4.22     69.42     4.57 327924414    0.00     0.00  var_ann
  3.89     73.64     4.22   138084     0.00     0.00  finalize_ssa_stmt_operands
  3.22     77.13     3.49 44525266     0.00     0.00  add_virtual_operand
 ...

The add_vars_for_offset part is PR33870-related.
-fno-tree-loop-optimize reduces compile-time of this testcase from 117s to 35s with -O -fstrict-aliasing. try_unroll_loop_completely calls update_ssa for every loop unrolled (added by r98599), which seems to be necessary - mainly because of the interesting PHI redirect re-use of PENDING_STMT.
Subject: Bug 34683

Author: rguenth
Date: Mon Jan  7 14:49:36 2008
New Revision: 131375

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131375
Log:
2008-01-07  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-ssa-sccvn.c (vuses_to_vec): Pre-allocate the vector of
	VOPs of the needed size to save memory.  Use VEC_quick_push to
	save compile-time.
	(vdefs_to_vec): Likewise.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-sccvn.c
Without the loop optimizer the profile looks like:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  9.77      2.96     2.96 14743868     0.00     0.00  add_vars_for_offset
  8.00      5.38     2.42                             vuses_compare
  7.06      7.51     2.14 44262948     0.00     0.00  add_virtual_operand
  6.65      9.52     2.01 235204852    0.00     0.00  var_ann
  4.83     10.98     1.46 68814811     0.00     0.00  iterative_hash_expr
  4.66     12.39     1.41      172     0.01     0.04  DFS
  4.63     13.79     1.40 69248754     0.00     0.00  htab_find_with_hash
  4.28     15.09     1.30  1229783     0.00     0.00  ggc_alloc_stat
  4.13     16.34     1.25 206399974    0.00     0.00  VN_INFO
  4.00     17.55     1.21 19029637     0.00     0.00  set_bb_for_stmt
  3.74     18.68     1.13    12610     0.00     0.00  get_call_expr_operands
  2.88     19.55     0.87   266239     0.00     0.00  valueize_vuses
  2.31     20.25     0.70 68677290     0.00     0.00  uid_decl_map_eq
  2.25     20.93     0.68   129473     0.00     0.00  visit_reference_op_store
  1.90     21.50     0.58 114366877    0.00     0.00  is_gimple_reg
  1.49     21.95     0.45   147496     0.00     0.00  vec_gc_o_reserve_1
  1.41     22.38     0.43 68653081     0.00     0.00  referenced_var_lookup

and most of the time (cfgcleanup and operand scanner globbed into passes) is spent in

 tree PHI const/copy prop:  27.82 (42%) usr  0.05 ( 5%) sys  27.95 (41%) wall       8 kB ( 0%) ggc
 tree FRE              :  24.94 (37%) usr   0.78 (73%) sys  25.73 (38%) wall  508759 kB (80%) ggc
A histogram over the number of VUSEs/VDEFs that are being sorted by sort_vuses reveals:

    180   2
 281964 256
   1498   4

that is, we call qsort 281964 times with 256 virtual operands in the VEC. Now, curiously(?), the virtual operands are sorted already, but after DECL_UID, not SSA_NAME_VERSION (see operand_build_sort_virtual). Adjusting the tree-vn sorting order to that of the operand scanner lets us improve these numbers to

     27   2
  28665 256
    128   4

a reduction by a factor of 10. Profile after that:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 10.37      2.91     2.91 14769893     0.00     0.00  add_vars_for_offset
  7.91      5.13     2.22 44264182     0.00     0.00  add_virtual_operand
  7.16      7.14     2.01 235231611    0.00     0.00  var_ann
  5.24      8.61     1.47 206399974    0.00     0.00  VN_INFO
  5.20     10.07     1.46      172     0.01     0.04  DFS
  5.17     11.52     1.45 68814811     0.00     0.00  iterative_hash_expr
  4.92     12.90     1.38    12610     0.00     0.00  get_call_expr_operands
  4.74     14.23     1.33  1439173     0.00     0.00  ggc_alloc_stat
  4.63     15.53     1.30 69275355     0.00     0.00  htab_find_with_hash
  3.88     16.62     1.09 19029637     0.00     0.00  set_bb_for_stmt
  3.56     17.62     1.00   266239     0.00     0.00  valueize_vuses
  2.21     18.24     0.62   129473     0.00     0.00  visit_reference_op_store
  1.96     18.79     0.55 68703417     0.00     0.00  uid_decl_map_eq
  1.92     19.33     0.54                             operand_build_cmp
Zdenek, can you have a look if there is sth low-hanging with the tree level loop unroller?
The patch in comment #14 reduces compile time from 120s to 80s. I'll try to get it into shape.
The unroll problem is that if you have multiple loops like

  for (i=0; i<3; ++i)
    a[i] = i;
  ...
  for (i=0; i<3; ++i)
    a[i] = i;

then, even though we are in loop-closed SSA form, we rename all variables in the loop BBs after duplicating them for *every loop* we unroll. This causes the complexity of renaming to be quadratic (you can easily see that for this testcase the number of SSA_NAMEs allocated by the renamer is quadratic in the number of loops in the testcase). Now, what makes it so bad in this testcase is that we mostly deal with virtual operands during renaming.
We do a _lot_ of calls to var_ann during the compilation:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 39.79     30.12    30.12  1311048     0.00     0.00  ggc_alloc_stat
 12.62     39.67     9.55 34186814     0.00     0.00  add_vars_for_offset
  5.77     44.04     4.37 331206978    0.00     0.00  var_ann

so var_ann is not inlined, despite being declared as such.

                0.00    0.00      35522/331206978     add_to_addressable_set [318]
                1.58    0.00  119702754/331206978     add_virtual_operand [12]
                2.79    0.00  211468702/331206978     add_vars_for_offset [17]
[39]     5.8    4.37    0.00  331206978               var_ann [39]
                0.00    0.00     119615/96446282      htab_find_with_hash <cycle 34> [46]

So the add_vars_for_offset loop is responsible for most of the calls (unsurprisingly), and most of the calls are not for global vars. Making sure this call gets inlined brings another 5s of compile-time.
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 12:12:56 2008
New Revision: 131393

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131393
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-flow-inline.h (var_ann): Remove overzealous asserts.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-flow-inline.h
Re. comment #26 More proof that assert checking _is_ expensive. I wonder how much compile time can be blamed overall on release checking (i.e. checking disabled vs. release checking). Have you ever compared e.g. build times of cc1-i files or a linux kernel with a non-checking compiler vs. a release-checking compiler?
No, I didn't. But yes, overzealous asserts in simple inline accessor functions are probably counter-productive. For example, things like

static inline bool
gimple_aliases_computed_p (const struct function *fun)
{
  gcc_assert (fun && fun->gimple_df);
  return fun->gimple_df->aliases_computed_p;
}

would segfault anyway if fun or fun->gimple_df were NULL. Or

static inline function_ann_t
function_ann (const_tree t)
{
  gcc_assert (t);
  gcc_assert (TREE_CODE (t) == FUNCTION_DECL);
  gcc_assert (!t->base.ann
	      || t->base.ann->common.type == FUNCTION_ANN);
  return (function_ann_t) t->base.ann;
}

where multiple asserts generate multiple diagnostic calls. Now, both of these functions are not in hot code-paths.
Another hot assert is

static inline tree
memory_partition (tree sym)
{
  tree tag;

  /* MPTs belong to their own partition.  */
  if (TREE_CODE (sym) == MEMORY_PARTITION_TAG)
    return sym;

  gcc_assert (!is_gimple_reg (sym));

and is_gimple_reg isn't cheap either. For this testcase two thirds of the calls are attributed to add_vars_for_offset (through add_v{use,def}):

                0.50    0.01  135234172/199610365     add_vars_for_offset [27]
[114]    1.1    0.74    0.01  199610365               is_gimple_reg [114]

and it really isn't cheap:

CPU: AMD64 processors, speed 2000 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
545326   36.8318  ggc_alloc_stat
310838   20.9943  add_vars_for_offset
85655     5.7852  add_virtual_operand
69156     4.6709  finalize_ssa_stmt_operands
57057     3.8537  bitmap_set_bit
36755     2.4825  htab_find_with_hash
35915     2.4257  get_expr_operands
28207     1.9051  is_gimple_reg
23340     1.5764  referenced_var_lookup

Wrapping the assert inside ENABLE_CHECKING buys us 2s, or 2.5%. But this is too much micro-optimizing for me ;) I'd rather have the quadratic SSA rewriting in the unroller fixed.
You are of course right that for this test case the bigger win is in the algorithmic issues. On the other hand, for compile time in general, the gcc_asserts are IMHO the worst thing that happened to GCC since the introduction of the garbage collector ;-) It is way too easy to write "gcc_assert (blah)" instead of "if (blah)\nabort ();", and people don't realize how expensive these "cheap" asserts are.

This would then be a nice project for the Wiki beginners-projects pages, perhaps: audit the gcc_assert calls in small "inline" functions and see if there is any measurable gain from removing the ones that don't make sense (like e.g. the ones you identified in comment #30).

But that's not relevant for this bug report, of course :-)
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 16:29:14 2008
New Revision: 131400

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131400
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/34683
	* tree-ssa-operands.c (operand_build_cmp): Export.
	* tree-ssa-operands.h (operand_build_cmp): Declare.
	* tree-vn.c (vuses_compare): Remove.
	(sort_vuses): Use operand_build_cmp.
	(sort_vuses_heap): Likewise.
	* tree-ssa-sccvn.c (vuses_to_vec): Use VEC_reserve, not VEC_alloc
	to re-use old VEC if available.  Do not sort already sorted VUSEs.
	(vdefs_to_vec): Do not sort already sorted VDEFs.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-operands.c
	trunk/gcc/tree-ssa-operands.h
	trunk/gcc/tree-ssa-sccvn.c
	trunk/gcc/tree-vn.c
Subject: Bug 34683

Author: rguenth
Date: Tue Jan  8 21:21:29 2008
New Revision: 131404

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131404
Log:
2008-01-08  Richard Guenther  <rguenther@suse.de>

	PR fortran/34706
	PR tree-optimization/34683
	* trans-types.c (gfc_get_array_type_bounds): Use an array type
	with known size for accesses if that is known.

Modified:
	trunk/gcc/fortran/ChangeLog
	trunk/gcc/fortran/trans-types.c
How does the trunk compare now to the numbers mentioned in comment #16 ?
> How does the trunk compare now to the numbers mentioned in comment #16 ?

Compiling with rev. 131414 gives:

-O1 -fstrict-aliasing: 33sec, 438197 kB
-O2: 97sec, 669637 kB
-O3: 50sec, 392669 kB

This is of course better than a few days ago (thanks to Richard's patches), but it's not even close to the numbers in comment #16.

What worries me about this is not the 33sec compile time (I could live with that, if I had to), but the really large amount of memory. On a machine with only 256MB, compiling takes several minutes due to massive swapping.

Richard: You said that the remaining issue is probably not gonna be fixed for 4.3. I don't quite see the reason for this, since this apparently *was* fixed at some point in the 4.3 trunk (see comment 16). Also, this would really be a shame, since it would make GCC 4.3 practically unusable for our project (http://gibuu.physik.uni-giessen.de/GiBUU/).
Subject: Re: [4.3 Regression] Fortran FE generated IL pessimizes middle-end IL and analysis

(In reply to comment #36)
> You said that the remaining issue is probably not gonna be fixed for 4.3. I
> don't quite see the reason for this, since this apparently *was* fixed at some
> point in the 4.3 trunk (see comment 16).

Yeah, but we won't trade wrong-code bugs for better compile-time here. I have no idea how practical it is to fix the basic-block duplication code (which is the worst offender regarding compile-time in these circumstances of too many VOPs).

> Also this would really be a shame since it would make GCC 4.3 practically
> unusable for our project (http://gibuu.physik.uni-giessen.de/GiBUU/).

There is a workaround available - you can use --param max-fields-for-field-sensitive=0 to disable SFTs, but this may disable useful optimization. This is the route we want to go with 4.4, addressing the missed optimizations in different ways.

Richard.
Created attachment 14908 [details] hack to disable rewriting of VOPs I tried to change the unroller doing only SSA_NAME replacement update after each unrolling and leave rewriting the virtual operands to after the whole pass but either I messed up or this even makes it worse ;) See the prototype hack.
Hmm, looks like I was completely wrong about the cause of the slowdown. We actually run cfg_cleanup after cunroll and merge blocks like

<BB1>
  ...

<BB2>
  # SFT.1_2 = PHI <SFT.1_1 (BB1)>
  ...
  # SFT.1000_2 = PHI <SFT.1000_1 (BB1)>
  # SFT.1_3 = VDEF <SFT.1_2>
  ...
  # SFT.1000_3 = VDEF <SFT.1_2>
  *mem = x;

and in merging the blocks we do (tree_merge_blocks):

      /* Remove all single-valued PHI nodes from block B of the form
	 V_i = PHI <V_j> by propagating V_j to all the uses of V_i.  */
      for (phi = phi_nodes (b); phi; phi = phi_nodes (b))
	{
	  ...
	  replace_uses_by (def, use);
	  remove_phi_node (phi, NULL, true);

BUT! replace_uses_by will, for _each_ PHI node whose uses we replace, update the target stmt! And fold it! We can do better for VOPs.

Preparing a patch.
After you fix this, we're back to "SCCVN is slow with lots of virtual operands". For -O2 (which apparently is the worst case) we now have the following profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 10.33      3.08     3.08 313563133    0.00     0.00  VN_INFO
  8.85      5.72     2.64      365     0.01     0.04  DFS
  7.17      7.86     2.14 96368081     0.00     0.00  iterative_hash_expr
  5.77      9.58     1.72   472085     0.00     0.00  valueize_vuses
  5.43     11.20     1.62   216407     0.00     0.00  visit_reference_op_store
  4.93     12.67     1.47  2093493     0.00     0.00  ggc_alloc_stat
  3.15     13.61     0.94 17042636     0.00     0.00  set_bb_for_stmt
  2.63     14.40     0.79  2442685     0.00     0.00  add_vars_for_offset
  1.74     14.92     0.52                             operand_build_cmp
  1.54     15.38     0.46   226140     0.00     0.00  shared_vuses_from_stmt
  1.48     15.82     0.44   245792     0.00     0.00  vn_reference_lookup
  1.27     16.20     0.38  5272581     0.00     0.00  mark_all_vars_used_1
  1.24     16.57     0.37 16115332     0.00     0.00  bitmap_set_bit
  1.14     16.91     0.34       58     0.01     0.02  remove_unused_locals
  1.14     17.25     0.34  6878571     0.00     0.00  htab_find_slot_with_hash
  1.04     17.56     0.31  4286747     0.00     0.00  add_virtual_operand
  1.04     17.87     0.31   255362     0.00     0.00  vec_gc_o_reserve_1
 ...
Subject: Bug 34683

Author: rguenth
Date: Thu Jan 10 16:59:06 2008
New Revision: 131446

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131446
Log:
2008-01-10  Richard Guenther  <rguenther@suse.de>

	PR middle-end/34683
	* tree-cfg.c (tree_merge_blocks): Do not go through the
	full-blown folding and stmt updating path if we just deal
	with virtual operands.
	* tree-ssa-copy.c (may_propagate_copy): Do not short-cut
	test for abnormal SSA_NAMEs.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-cfg.c
	trunk/gcc/tree-ssa-copy.c
This is probably all we can get for now - maybe another few % with minor tweaks, but nothing earth-shattering. After all, we _do_ have a much larger IL due to the number of VOPs not partitioned.
Actually this is not a regression against a released compiler, as both the 4.1 and the 4.2 branches ICE for this testcase:

gfortran-4.2 -S -o /dev/null t.f90 -O -fstrict-aliasing
t.f90: In function 'ampli':
t.f90:83: internal compiler error: in gfc_trans_array_constructor, at fortran/trans-array.c:1466
Please submit a full bug report, with preprocessed source if appropriate.
See <URL:http://bugs.opensuse.org> for instructions.

So, can we declare this as fixed (we have separate PRs for the huge number of VOPs)?
Yes, fixed.
(In reply to comment #42)
> This is probably all we can get for now - maybe another few % with minor
> tweaks, but nothing earth-shattering. After all, we _do_ have a much larger
> IL due to the number of VOPs not partitioned.

Well, it may not be earth-shattering, but it's still a solid improvement - of more than a few %! With rev. 131446 I get the following numbers:

-O1 -fstrict-aliasing: 11sec, 201825 kB
-O2: 22sec, 392166 kB
-O3: 24sec, 388802 kB

Compile time is reduced by a factor of 4 for -O2, and memory usage is halved for -O1! Nice job, Richard. So I'd say compile time really isn't an issue any more.

But memory consumption is still a factor of 10 above what it was last year. Most of that comes from 'tree FRE' (~140MB), but all the junk in 'tree operand scan' is gone. For -O2 and -O3 another large chunk appears in 'tree PRE' (~160MB).

Are you sure you want to close this bug? For me memory consumption is still an issue, and I would really like to see this fixed at some point. And if there is another bug which is a duplicate of this, could you point me to it?
See e.g. bug 27004 and bug 33974.
Most of the PRE/FRE memory is spent in copied VOP VECs. Unfortunately we cannot move them to heap memory easily, as they get shared in the PRE tables... I tried to be explicit in managing the memory, but it gets really hard (which is probably the reason we stick it in GC memory in the first place). So we could use two variants of all routines, one using GC-allocated VECs and one using heap-allocated VECs, and use the GC-allocated one only for the final SCC iteration and the PRE tables. Smells like a task for C++ templates, or a really ugly hack in C. Obviously this is not going to happen for 4.3 :/
(In reply to comment #47)
> Most of the PRE/FRE memory is spent in copied VOPs VECs. Unfortunately we
> cannot move them to heap memory easily as the get shared in the PRE tables...
> I tried to be explicit in managing the memory, but it gets really hard (which
> is probably the reason we stick it in GC memory in the first place).

In fact, it copies the VOP list of each store statement 50 times on average in each of PRE and FRE, i.e., about 100x total. Cannot sccvn be throttled down somehow so that it does not spend so much memory this way?