Metabug to track all the issues ;)
Quick summary :)
1) -g build is currently broken because of dwarf2out recursion.
2) sqlite still gets miscompiled at 32bit (PR44897), but works now at 64bit for some reason.
3) Workaround attached to PR44846 is needed to avoid ICE due to one decl C++ FE issues.
4) 32bit mozilla now builds fine for me when linked with -O2, but -Os (the default) leads to a segfault at startup, apparently because xpcom components do not reproduce correctly.
5) Older versions of gold seem to have issues.
6) Martin's devirtualization seems to behave funny, doing 7400 clones and then redirecting just about 20 calls.
7) Both Martin and Taras reported an ICE in lto-symtab that I can't reproduce.
8) Mozilla needs some changes, since __attribute__ ((used)) is missing. I will attach a diff.
9) One needs 4GB in /tmp; with sane partitioning this goes down to 1GB.
10) 32bit build gets close to address space limits at the WPA stage; probably we should not mmap all the .o files, since only about 1GB goes to garbage collected memory.
Created attachment 21543 [details] Mozilla changes needed.
mozconfig I use:

export CC="gcc -flto -fuse-linker-plugin"
export CXX="g++ -fwhopr=24 -fuse-linker-plugin -fpermissive"
#export CXX="/builds/slave/tryserver-linux/build/gcc/bin/g++ -fwhopr=16
#-fuse-linker-plugin -static-libstdc++ -fpermissive"
ac_add_options --enable-application=browser
ac_add_options --enable-libxul
#ac_add_options --enable-debug
ac_add_options --enable-optimize
ac_add_options --disable-tests
#ac_add_options --enable-debug-symbols
export LDFLAGS="-Wl,--no-keep-memory"
mk_add_options MOZ_MAKE_FLAGS=-j24
mk_add_options MOZ_OBJDIR=/build-mozilla-scratch-O1
WPA stage profile after (with sane partitioning). Decl reading and merging is major issue. I am surprised we are faster on streaming out than reading.

Execution times (seconds)
 garbage collection    :   5.71 ( 3%) usr   0.00 ( 0%) sys   5.72 ( 3%) wall       0 kB ( 0%) ggc
 callgraph optimization:   1.70 ( 1%) usr   0.00 ( 0%) sys   1.72 ( 1%) wall   13488 kB ( 0%) ggc
 varpool construction  :   0.58 ( 0%) usr   0.01 ( 0%) sys   0.57 ( 0%) wall   43924 kB ( 1%) ggc
 ipa cp                :   1.62 ( 1%) usr   0.02 ( 0%) sys   1.66 ( 1%) wall   70914 kB ( 2%) ggc
 ipa lto gimple in     :   4.28 ( 2%) usr   0.33 ( 4%) sys   4.63 ( 2%) wall      15 kB ( 0%) ggc
 ipa lto gimple out    :   6.45 ( 3%) usr   0.33 ( 4%) sys   6.74 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  48.34 (26%) usr   1.93 (23%) sys  50.30 (26%) wall 3021266 kB (87%) ggc
 ipa lto decl out      :  40.53 (22%) usr   0.19 ( 2%) sys  40.75 (21%) wall       0 kB ( 0%) ggc
 ipa lto decl init I/O :   1.03 ( 1%) usr   0.06 ( 1%) sys   1.08 ( 1%) wall   77094 kB ( 2%) ggc
 ipa lto cgraph I/O    :   0.94 ( 1%) usr   0.21 ( 3%) sys   1.15 ( 1%) wall  237872 kB ( 7%) ggc
 ipa lto decl merge    :  45.14 (24%) usr   1.08 (13%) sys  46.23 (24%) wall     273 kB ( 0%) ggc
 ipa lto cgraph merge  :   0.89 ( 0%) usr   0.00 ( 0%) sys   0.89 ( 0%) wall    5164 kB ( 0%) ggc
 whopr wpa             :   2.38 ( 1%) usr   0.04 ( 0%) sys   2.41 ( 1%) wall       1 kB ( 0%) ggc
 whopr wpa I/O         :   3.08 ( 2%) usr   3.97 (48%) sys   7.38 ( 4%) wall       0 kB ( 0%) ggc
 ipa reference         :   1.55 ( 1%) usr   0.00 ( 0%) sys   1.59 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile           :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.18 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   1.05 ( 1%) usr   0.00 ( 0%) sys   1.04 ( 1%) wall       0 kB ( 0%) ggc
 parser                :   0.58 ( 0%) usr   0.00 ( 0%) sys   0.58 ( 0%) wall   17738 kB ( 1%) ggc
 inline heuristics     :  15.73 ( 8%) usr   0.00 ( 0%) sys  15.74 ( 8%) wall    2974 kB ( 0%) ggc
 callgraph verifier    :   2.56 ( 1%) usr   0.02 ( 0%) sys   2.59 ( 1%) wall       0 kB ( 0%) ggc
 varconst              :   0.01 ( 0%) usr   0.02 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 186.41         8.27       195.10         3491946 kB
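To put a number on "decl reading and merging is major issue", here is a minimal sketch that parses a few rows of the report and sums the usr time of the three decl phases. The helper names (`parse_usr_times`, `share`) are my own and hypothetical; only the rows themselves come from the report, and only the first (usr) column is read.

```python
import re

# Rows copied from the WPA report in this comment; only the usr column is used.
REPORT = """\
ipa lto decl in       :  48.34 (26%) usr   1.93 (23%) sys  50.30 (26%) wall 3021266 kB (87%) ggc
ipa lto decl out      :  40.53 (22%) usr   0.19 ( 2%) sys  40.75 (21%) wall       0 kB ( 0%) ggc
ipa lto decl merge    :  45.14 (24%) usr   1.08 (13%) sys  46.23 (24%) wall     273 kB ( 0%) ggc
inline heuristics     :  15.73 ( 8%) usr   0.00 ( 0%) sys  15.74 ( 8%) wall    2974 kB ( 0%) ggc
TOTAL                 : 186.41         8.27       195.10         3491946 kB
"""

# "<phase name> : <usr seconds> ..." ; the name is everything before the colon.
ROW = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<usr>\d+\.\d+)")


def parse_usr_times(report):
    """Map each phase name (including TOTAL) to its usr time in seconds."""
    times = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            times[match.group("name")] = float(match.group("usr"))
    return times


def share(times, prefix):
    """Fraction of TOTAL usr time spent in phases whose name starts with prefix."""
    part = sum(t for name, t in times.items() if name.startswith(prefix))
    return part / times["TOTAL"]


times = parse_usr_times(REPORT)
decl_share = share(times, "ipa lto decl")
print(f"decl in/out/merge: {decl_share:.1%} of usr time")
```

With these rows the three decl phases account for roughly 72% of the usr time, which is why they dominate everything else in the WPA stage.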
Oprofile of WHOPR build. It is quite surprising how low the usual CPU hogs show up.

113909 7.6329 lto1 lto1 htab_find_slot_with_hash
42787 2.8671 libc-2.11.1.so libc-2.11.1.so _int_malloc
36514 2.4468 lto1 lto1 iterative_hash_hashval_t
36289 2.4317 libelf.so.0.8.12 libelf.so.0.8.12 /usr/lib64/libelf.so.0.8.12
28366 1.9008 lto1 lto1 htab_expand
27648 1.8527 libc-2.11.1.so libc-2.11.1.so memset
27045 1.8123 lto1 lto1 cgraph_edge_badness
26670 1.7871 lto1 lto1 inflate_fast
25955 1.7392 lto1 lto1 lto_input_tree
20010 1.3408 lto1 lto1 lto_input_uleb128
18853 1.2633 lto1 lto1 bitmap_set_bit
16452 1.1024 as as /usr/bin/as
16215 1.0865 lto1 lto1 lto_input_1_unsigned
16141 1.0816 lto1 lto1 lto_output_1_stream
15244 1.0215 libc-2.11.1.so libc-2.11.1.so memcpy
15241 1.0213 lto1 lto1 htab_hash_string
13806 0.9251 lto1 lto1 record_reg_classes.constprop.10
13743 0.9209 lto1 lto1 lto_output_tree
13220 0.8859 lto1 lto1 ggc_internal_alloc_stat
12879 0.8630 libc-2.11.1.so libc-2.11.1.so malloc_consolidate
12847 0.8609 libc-2.11.1.so libc-2.11.1.so _int_free
11712 0.7848 lto1 lto1 lto_streamer_cache_insert_1
11593 0.7768 lto1 lto1 linemap_lookup
11100 0.7438 lto1 lto1 ht_lookup_with_hash
10837 0.7262 lto1 lto1 gtc_visit
10460 0.7009 lto1 lto1 cgraph_estimate_growth
10438 0.6994 lto1 lto1 value_member
9812 0.6575 lto1 lto1 walk_tree_1
9316 0.6243 oprofiled oprofiled /usr/bin/oprofiled
8979 0.6017 libc-2.11.1.so libc-2.11.1.so malloc
8825 0.5914 libc-2.11.1.so libc-2.11.1.so free
8625 0.5780 lto1 lto1 pointer_set_insert
8304 0.5564 lto1 lto1 ggc_set_mark
8276 0.5546 lto1 lto1 type_pair_eq
8089 0.5420 lto1 lto1 gimple_types_compatible_p_1
7981 0.5348 lto1 lto1 lto_output_uleb128_stream
7388 0.4951 lto1 lto1 df_note_compute
7349 0.4924 lto1 lto1 operand_equal_p
7349 0.4924 lto1 lto1 pointer_map_contains
7117 0.4769 lto1 lto1 bitmap_bit_p
7067 0.4736 lto1 lto1 pool_alloc
7030 0.4711 lto1 lto1 verify_cgraph_node
6954 0.4660 lto1 lto1 lto_input_sleb128
6947 0.4655 lto1 lto1 gt_ggc_mx_lang_tree_node
6747 0.4521 libc-2.11.1.so libc-2.11.1.so calloc
6403 0.4291 lto1 lto1 htab_delete
6360 0.4262 lto1 lto1 constrain_operands.part.12
6198 0.4153 lto1 lto1 bitmap_clear_bit
6103 0.4090 lto1 lto1 cse_insn
PR 45679 also reproduces during the -O3 build. I am testing a patch for it now.
Gold shipped with SLES:
GNU gold (GNU Binutils; SUSE Linux Enterprise 11 2.20.0.20100122-0.7.9) 1.
is known to have problems leading to PR45194

The following version:
GNU gold (GNU Binutils 2.20.51.20100706) 1.9
works for me.
Updated summary...
- Last patch needed to get Mozilla working is posted as http://gcc.gnu.org/ml/gcc-patches/2010-10/msg01286.html
- Configuration needs to be done with -fwhopr for C++ and -flto for C, to get around sqlite problem (PR44897)
- Debugging still needs to be disabled
- Recent Gold is needed
- Peak memory use is about 4GB, still more than we should need. It is WPA stage having too many declarations in it.
- We probably could do better on devirtualization in constructors for addref.

With -O3 --param inline-unit-growth -fwhopr=jobserv the code size seems comparable with non-LTO -Os build, speed with non-LTO -O3 build. This seems quite good news. Lacking debug info build seems to be the only remaining showstopper for practical use.
Updated summary, Mozilla now builds with unpatched mainline (with checking disabled)
I am just trying to get Mozilla building with GNU ld instead of gold. The first problem is that Mozilla links some of its libraries as:

/abuild/jh/trunk-install/bin/gcc -O3 -flto -flto-partition=none -fuse-linker-plugin -shared -Wl,-soname -Wl,libplds4.so -o libplds4.so ./plarena.o ./plhash.o ./plvrsion.o -L/abuild/jh/build-mozilla-new7/dist/lib -lnspr4

i.e. -fPIC is missing, which means that we compile into non-PIC code and GNU LD eventually complains about PC32 relocations into symbols that can be overwritten.

Is this valid? If so, we need to work out -fPIC ourselves at LTO time....

Honza
OK,
working around the previous issues we fail with:

/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: gTLSIsMainThread: TLS reference in /tmp/cczRYvg1.ltrans0.ltrans.o mismatches non-TLS definition in nsThreadManager.o.ironly section .text

Dave, is this a GNU LD bug? It seems to me that most likely the nsThreadManager.o.ironly section is the one we got from the lto plugin, and we don't put TLS annotations there because we have no way to do so?

Honza
(In reply to comment #11)
> OK,
> working around the previous issues we fail with:
>
> /abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld:
> gTLSIsMainThread: TLS reference in /tmp/cczRYvg1.ltrans0.ltrans.o mismatches
> non-TLS definition in nsThreadManager.o.ironly section .text
>
> Dave, is this a GNU LD bug? It seems to me that most likely that
> nsThreadManager.o.ironly section is the one got from lto plugin and we don't
> put TLS annotations there because we have no way to do so?

Yeh, precisely. The ironly file is a placeholder into which we put the symbols found in the lto symtab so that they can take part in the link and their resolutions be determined. We have no way of conveying any symbol type info.

We'll need to handle this in the multiple-def linker hook in LD's plugin code, by getting it to copy type info from the newly-added symbols to the ironly ones.

Oh, hang on, that won't work. elflink.c calls _bfd_elf_merge_symbol /before/ _bfd_generic_link_add_one_symbol, which is where the multiple-def hook gets called back from. So it'll error on the mismatch before we get a chance to do anything about it. That's awkward. Need to scratch my head over that for a bit.
> Yeh, precisely. The ironly file is a placeholder into which we put the
> symbols found in the lto symtab so that they can take part in the link and
> their resolutions be determined. We have no way of conveying any symbol type

The error comes out after the lto1 invocation, so why is the ironly section still around? I would expect it to be discarded at that time and replaced by whatever the compiler returns to you.

On the other hand, discarding won't help if there was a non-LTO module referencing a TLS var also used by an LTO module, I guess.
(In reply to comment #13)
> > Yeh, precisely. The ironly file is a placeholder into which we put the
> > symbols found in the lto symtab so that they can take part in the link and
> > their resolutions be determined. We have no way of conveying any symbol type
>
> The error comes out after the lto1 invocation, so why the ironly section is
> still around?
> I would expect it to be discarded at that time and replaced by whatever
> compiler
> returns to you.

It's the symbol from the ironly section that remains, and it gets discarded and replaced by the symbol from the real object file by the linker multiple_definition callback hook when _bfd_generic_link_add_one_symbol is called to add the symbol from the real object file into the link hash table. Unfortunately, the elf linker has some additional checking that it does before calling that routine which preemptively complains about the multiple definition before the linker hook has a chance to replace the original ironly symbol by the new one.
(In reply to comment #10)
> I am just trying to get Mozilla building with GNU ld instead of gold. First
> problem is that Mozilla links some of libraries as:
>
> /abuild/jh/trunk-install/bin/gcc -O3 -flto -flto-partition=none
> -fuse-linker-plugin -shared -Wl,-soname -Wl,libplds4.so -o libplds4.so
> ./plarena.o ./plhash.o ./plvrsion.o -L/abuild/jh/build-mozilla-new7/dist/lib
> -lnspr4
>
> i.e. there is missing -fPIC that means that we compile into non-PIC code and
> GNU LD eventually complains about PC32 relocations into symbols that can be
> overwritten.
>
> Is this valid? If so, we need to work out -fPIC ourselves at LTO time....

It's valid I think and we try to work out fPIC ourselves in the funny LTO option handling code (but the options are not re-applied at ltrans stage I think, so it doesn't work at all with WHOPR).

Richard.

> Honza
> It's valid I think and we try to work out fPIC ourselves in the funny
> LTO option handling code (but the options are not re-applied at ltrans
> stage I think, so it doesn't work at all with WHOPR).

Hmm, the link command above is LTO, not WHOPR. I wonder why we don't work out -fPIC ourselves then...

Honza
Current mainline crashes:

Program received signal SIGSEGV, Segmentation fault.
lto_cgraph_replace_node (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:227
227 if (prevailing_node->same_body_alias)
(gdb) bt
#0 lto_cgraph_replace_node (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:227
#1 lto_symtab_merge_cgraph_nodes_1 (slot=<value optimized out>, data=<value optimized out>) at ../../gcc/lto-symtab.c:798
#2 0x0000000000b0ae08 in htab_traverse_noresize (htab=<value optimized out>, callback=0x60eca0 <lto_symtab_merge_cgraph_nodes_1>, info=0x0) at ../../libiberty/hashtab.c:784
#3 0x00000000004aabf9 in read_cgraph_and_symbols () at ../../gcc/lto/lto.c:2213
#4 lto_main () at ../../gcc/lto/lto.c:2438
#5 0x00000000006cb658 in compile_file (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:579
#6 do_compile (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:1874
#7 toplev_main (argc=2627, argv=0x11a7460) at ../../gcc/toplev.c:1937
#8 0x00007ffff6597bc6 in __libc_start_main () from /lib64/libc.so.6
#9 0x0000000000493411 in _start () at ../sysdeps/x86_64/elf/start.S:113

I guess it is fallout of the merging patch. It is weird since prevailing_node is NULL.

_moz_cairo_surface_destroy/567259(-1) @0x7ffebef47c60 (asm: _moz_cairo_surface_destroy) visibilit: 2 binds_local
called by: CreateSimilarSurface/567227 (0.21 per call) CreateSimilarSurface/567227 (0.14 per call) Init/567225 (0.39 per call) _ZN11gfxASurface7ReleaseEv.part.2/567209 (1.00 per call)
calls:
References:
Refering this function:
$5 = void

I also generated a profile.
samples % image name app name symbol name
228038 25.3225 lto1 lto1 htab_find_slot_with_hash
82588 9.1710 lto1 lto1 iterative_hash_hashval_t
58000 6.4406 lto1 lto1 type_pair_eq
32557 3.6153 lto1 lto1 gimple_lookup_type_leader
31622 3.5115 lto1 lto1 gtc_visit
29149 3.2369 lto1 lto1 htab_expand
27463 3.0496 lto1 lto1 gimple_type_hash_1
24348 2.7037 lto1 lto1 gimple_types_compatible_p
24217 2.6892 lto1 lto1 inflate_fast
21984 2.4412 lto1 lto1 gimple_types_compatible_p_1
21796 2.4203 libc-2.11.1.so libc-2.11.1.so memset
21700 2.4097 libc-2.11.1.so libc-2.11.1.so _int_malloc
17894 1.9870 lto1 lto1 lookup_type_pair.isra.120.constprop.129
16087 1.7864 lto1 lto1 ggc_set_mark
15719 1.7455 lto1 lto1 gt_ggc_mx_lang_tree_node

Our abuse of hashing is making us slow. It is not only type merging but all the hashing during streaming in.
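As a rough check on how much of this profile is hashing, the sketch below adds up the sample percentages of the hash-table related symbols from the top rows. Which symbols count as "hashing" is my own grouping (an assumption, not something stated in the profile), so treat the result as an estimate.

```python
# (symbol, percent of samples) copied from the top rows of the lto1 profile.
PROFILE = [
    ("htab_find_slot_with_hash", 25.3225),
    ("iterative_hash_hashval_t", 9.1710),
    ("type_pair_eq", 6.4406),
    ("gimple_lookup_type_leader", 3.6153),
    ("gtc_visit", 3.5115),
    ("htab_expand", 3.2369),
    ("gimple_type_hash_1", 3.0496),
]

# Assumption: these four are the hash computation and hash-table maintenance
# routines; comparison callbacks such as type_pair_eq are left out.
HASHING = {
    "htab_find_slot_with_hash",
    "iterative_hash_hashval_t",
    "htab_expand",
    "gimple_type_hash_1",
}

hash_share = sum(pct for symbol, pct in PROFILE if symbol in HASHING)
print(f"hashing routines: {hash_share:.2f}% of samples")
```

Under that grouping the hashing routines alone take a bit over 40% of the samples, before counting the equality callbacks they invoke.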
Filed the segfault as PR46940.

It is really a sickness of the mozilla sources, defining an _INT symbol, a _moz symbol and a function of the same name and visibility and using both. In any case we should handle this gracefully too.

Honza
Filed the GNU LD bug as http://sourceware.org/bugzilla/show_bug.cgi?id=12323
(In reply to comment #19)
> Filled in the GNU LD bug as
> http://sourceware.org/bugzilla/show_bug.cgi?id=12323

It should have been fixed on hjl/lto-mixed branch at

http://git.kernel.org/?p=devel/binutils/hjl/x86.git;a=summary
I am re-building now. Martin's edge cgraph_opt streaming fix is needed and flag_shlib needs to be set in lto-options.c

With this fixed, oprofile shows that cc1plus spends a lot of time in lookup_field_1.

40259 5.6000 cc1plus cc1plus lookup_field_1
20275 2.8203 cc1plus cc1plus longest_match
15843 2.2038 libc-2.11.1.so libc-2.11.1.so _int_malloc
12409 1.7261 libc-2.11.1.so libc-2.11.1.so memset
10680 1.4856 cc1plus cc1plus htab_find_slot_with_hash
10471 1.4565 libc-2.11.1.so libc-2.11.1.so vfprintf
9004 1.2525 cc1plus cc1plus deflate_slow
8580 1.1935 cc1plus cc1plus ggc_internal_alloc_stat
8300 1.1545 libc-2.11.1.so libc-2.11.1.so memcpy
8100 1.1267 cc1plus cc1plus ht_lookup_with_hash
8044 1.1189 libpython2.6.so.1.0 libpython2.6.so.1.0 /usr/lib64/libpython2.6.so.1.0
7840 1.0905 cc1plus cc1plus _cpp_lex_direct
6340 0.8819 cc1plus cc1plus pointer_set_insert

I am adding C++ maintainers to CC as this seems like relatively low hanging fruit for a noticeable compilation speedup? It tends to show in oprofile as 5-7% of compile time.
On 1/5/2011 5:36 AM, hubicka at gcc dot gnu.org wrote:

> 40259 5.6000 cc1plus cc1plus
> lookup_field_1

I've looked at this, in the distant past. I don't think the routine itself is *very* low-hanging fruit; it's already using an inline log n algorithm to find a field in most cases, and I bet that's as good as a hash table since n is generally relatively small. But, maybe "in most cases" is wrong; there is a slow-path, and we should confirm that most of the time is in the fast-path code. We could also try a bit of memoization; I wouldn't be surprised if we often lookup "x.y" several times in a row.

More often, when I've looked at this kind of thing, though, I've concluded that the problem was that we were calling the routine too often, rather than the routine itself was too slow. Quite possibly we could improve algorithms that are using lookup_field_1 so that they didn't do so as often, by building caches or otherwise. For that, we'd need to look at the callers of lookup_field_1.

So, in summary, I'd recommend three things:

* Split lookup_field_1 into its fast-path and slow-path code so that we can profile it and figure out which code is taking up most of the time.

* Assuming it's fast-path code, look at the frequent callers and think about how to optimize them.
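To make the memoization idea concrete, here is a minimal sketch of a one-entry "last lookup" cache in front of a log n search over sorted field names. It is not GCC code: `FieldTable`, its `searches` counter and the sample fields are hypothetical stand-ins chosen for illustration, and the real lookup_field_1 works on tree nodes rather than strings.

```python
from bisect import bisect_left


class FieldTable:
    """Hypothetical stand-in for a class type with a sorted field vector."""

    def __init__(self, fields):
        # fields: mapping from field name to some declaration object.
        self.names = sorted(fields)
        self.decls = [fields[name] for name in self.names]
        self.searches = 0   # how many times the log n search actually ran
        self._last = None   # (name, decl) of the most recent lookup

    def _search(self, name):
        """Fast path: binary search over the sorted names."""
        self.searches += 1
        i = bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return self.decls[i]
        return None

    def lookup(self, name):
        """Return the decl for name, reusing the previous answer when the
        same name is looked up several times in a row."""
        if self._last is not None and self._last[0] == name:
            return self._last[1]
        decl = self._search(name)
        self._last = (name, decl)
        return decl


table = FieldTable({"x": "decl_x", "y": "decl_y", "z": "decl_z"})
results = [table.lookup(n) for n in ["y", "y", "y", "x"]]
print(results, table.searches)
```

Whether such a cache pays off depends on how often callers really repeat the same lookup, which is exactly what profiling the callers would have to show first.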
I've updated the mozilla tree and rebuilt with top of tree GCC. The resulting binary seems to work well. Two GCC patches are required:

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00210.html
solving -fPIC issues (with gold this is silently ignored, but we end up with non-PIC shared libraries, which is bad for startup time)

http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00375.html
to solve the problem with undefined aliases while building libxul.

Mozilla patchset seems the same as posted earlier. Will try to move to debug build and also try profile feedback. Memory peaks at 6.5GB, so we will not be able to build in a 32bit environment unless we solve the issues with storing too many types.

Honza
Author: hubicka
Date: Fri Jan 7 18:21:00 2011
New Revision: 168580

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168580
Log:
PR lto/45375
* lto-opt.c (lto_reissue_options): Set flag_shlib.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/lto-opts.c
With current mainline and a release-checking compiler, I can for the first time build mozilla with debug info. 7.5GB of RAM is needed.
This is a great success, although I have to say it's still way too much RAM to ask for. In particular, this excludes the possibility of compiling on a 32-bit architecture.
There is a lot of room for improvement in the WPA memory use, but I am not sure how much we can still fit in 4.6.0...
With fixes for PR47234 and PR47233 I can build -fprofile-generate libxul. I haven't tried yet whether the profile applies, since the build later dies at:

/abuild/jh/trunk-install/bin/g++ -fpermissive -O3 -flto=24 -fuse-linker-plugin -fprofile-generate -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -fno-strict-aliasing -fshort-wchar -pthread -pipe -DNDEBUG -DTRIMMED -g -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/gtk-2.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/dbus-1.0 -I/usr/lib64/dbus-1.0/include -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -I/usr/include/gtk-2.0 -I/usr/lib64/gtk-2.0/include -I/usr/include/atk-1.0 -I/usr/include/cairo -I/usr/include/pango-1.0 -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libpng12 -I/usr/include/gtk-unix-print-2.0 -fPIC -shared -Wl,-z,defs -Wl,-h,libmozgnome.so -o libmozgnome.so nsGnomeModule.o nsAlertsService.o nsAlertsIconListener.o -lpthread -Wl,-rpath-link,/abuild/jh/build-mozilla-new8-prof/dist/bin -Wl,-rpath-link,/usr/local/lib /abuild/jh/build-mozilla-new8-prof/dist/lib/libxpcomglue_s.a -L/abuild/jh/build-mozilla-new8-prof/dist/bin -lxpcom -lmozalloc -L/abuild/jh/build-mozilla-new8-prof/dist/bin -lxpcom -lmozalloc -L/abuild/jh/build-mozilla-new8-prof/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl -lgobject-2.0 -lglib-2.0 -L/lib64 -lnotify -lgtk-x11-2.0 -ldbus-glib-1 -lgdk-x11-2.0 -latk-1.0 -lgio-2.0 -lpangoft2-1.0 -lgdk_pixbuf-2.0 -lpangocairo-1.0 -lcairo -lpango-1.0 -lfreetype -lz -lfontconfig -lgmodule-2.0 -ldbus-1 -lgobject-2.0 -lglib-2.0 -Wl,--version-script -Wl,/abuild/jh/mozilla-central2/mozilla-central/build/unix/gnu-ld-scripts/components-version-script -Wl,-Bsymbolic -ldl
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxUnknownSurface::~gfxUnknownSurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:247: error: undefined reference to 'vtable for gfxASurface'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function gfxASurface::~gfxASurface():../../../dist/include/gfxASurface.h:248: error: undefined reference to 'gfxASurface::RecordMemoryFreed()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x10): error: undefined reference to 'gfxASurface::BeginPrinting(nsAString const&, nsAString const&)'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x18): error: undefined reference to 'gfxASurface::EndPrinting()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x20): error: undefined reference to 'gfxASurface::AbortPrinting()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x28): error: undefined reference to 'gfxASurface::BeginPage()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x30): error: undefined reference to 'gfxASurface::EndPage()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x38): error: undefined reference to 'gfxASurface::Finish()'
/abuild/jh/trunk-install/lib/gcc/x86_64-unknown-linux-gnu/4.6.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /abuild/jh/tmp//cc0wLUAb.ltrans0.ltrans.o: in function _ZTV17gfxUnknownSurface.local.39.3126:cc0wLUAb.ltrans0.o(.data.rel.ro+0x40): error: undefined reference to 'gfxASurface::CreateSimilarSurface(gfxASurface::gfxContentType, gfxIntSize const&)'

Those seem suspicious. I saw a similar problem previously - the vtables are there but they are not finalized. The non-LTO objects don't seem to refer to them, so perhaps we do too much folding... I am a bit lost.
... and hacking around, the profile doesn't read back even with -fprofile-correction

/abuild/jh/trunk-install/bin/gcc -O3 -flto -flto-partition=none -fuse-linker-plugin -fprofile-correction -fprofile-use -o jemalloc.o -c -DOSTYPE=\"Linux2.6.32.12-0\" -DOSARCH=Linux -I/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc -I. -I../../dist/include -I../../dist/include/nsprpub -I/abuild/jh/build-mozilla-new8-prof/dist/include/nspr -I/abuild/jh/build-mozilla-new8-prof/dist/include/nss -fPIC -Wall -W -Wno-unused -Wpointer-arith -Wcast-align -W -pedantic -Wno-long-long -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -include ../../mozilla-config.h -DMOZILLA_CLIENT -MD -MF .deps/jemalloc.pp /abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'arena_malloc':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: note: correcting inconsistent profile data
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'malloc_mutex_unlock':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: error: corrupted profile info: edge from 0 to 2 exceeds maximal count
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c: In function 'malloc_mutex_lock':
/abuild/jh/mozilla-central2/mozilla-central/memory/jemalloc/jemalloc.c:6530:1: error: corrupted profile info: edge from 2 to 3 exceeds maximal count

Will see if this reproduces w/o LTO.
The libmozgnome build issue is now Mozilla PR https://bugzilla.mozilla.org/show_bug.cgi?id=624385
Mozilla now builds with profile feedback and LTO. One needs to train without LTO (i.e. -fprofile-generate -O3 only) and then build with LTO (-fprofile-use -O3 -flto) because of the aforementioned problems with undefined symbols.

Resulting binary works, except for libmozsqlite that gets misoptimized (PR44897). http://gcc.gnu.org/ml/gcc-patches/2011-01/msg00375.html is still needed at the GCC side.
Author: hubicka
Date: Mon Jan 10 23:37:11 2011
New Revision: 168643

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168643
Log:
PR lto/45375
* profile.c (read_profile_edge_counts): Ignore profile inconistency when correcting profile.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/profile.c
Author: hubicka
Date: Mon Jan 10 23:37:45 2011
New Revision: 168644

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168644
Log:
PR lto/45375
* lto-cgraph.c (input_profile_summary): Remove overactive sanity check.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/lto-cgraph.c
Author: hubicka
Date: Tue Jan 11 17:29:52 2011
New Revision: 168666

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=168666
Log:
PR lto/45721
PR lto/45375
* tree.h (symbol_alias_set_t): Move typedef here from varasm.c
(symbol_alias_set_destroy, symbol_alias_set_contains, propagate_aliases_backward): Declare.
* lto-streamer-out.c (struct sets): New sturcture.
(trivally_defined_alias): New function.
(output_alias_pair_p): Rewrite.
(output_unreferenced_globals): Fix output of alias pairs.
(produce_symtab): Likewise.
* ipa.c (function_and_variable_visibility): Set weak alias destination as needed in lto.
* varasm.c (symbol_alias_set_t): Remove.
(symbol_alias_set_destroy): Export.
(propagate_aliases_forward, propagate_aliases_backward): New functions based on ...
(compute_visible_aliases): ... this one; remove.
(trivially_visible_alias): New
(trivially_defined_alias): New.
(remove_unreachable_alias_pairs): Rewrite.
(finish_aliases_1): Reorganize code checking if alias is defined.
* passes.c (rest_of_decl_compilation): Do not call assemble_alias when in LTO mode.

* lto.c (partition_cgraph_node_p, partition_varpool_node_p): Weakrefs are not partitioned.

* testsuite/gcc.dg/lto/pr45721_1.c: New file.
* testsuite/gcc.dg/lto/pr45721_0.c: New file.

Added:
trunk/gcc/testsuite/gcc.dg/lto/pr45721_0.c
trunk/gcc/testsuite/gcc.dg/lto/pr45721_1.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/lto-streamer-out.c
trunk/gcc/lto/ChangeLog
trunk/gcc/lto/lto.c
trunk/gcc/passes.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree.h
trunk/gcc/varasm.c
I looked briefly into the effectiveness of the devirtualization bits and they don't seem to work terribly well. In GCC 4.3 -O3 compiled libxul there are 82155 indirect calls. In mainline -O3 libxul there are 83023 and with LTO there are 87763.

The ipa-prop bits at LTO devirtualize 1 call that is consequently optimized away (since -fno-devirtualize seems the same as -fdevirtualize). I will give http://gcc.gnu.org/ml/gcc-patches/2010-12/msg01214.html a try.

However we _really_ need testcases from Mozilla where devirtualization is valid and we don't do it.
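For scale, the three indirect-call counts can be compared against the GCC 4.3 baseline with a few lines of arithmetic. This is only a sketch over the numbers quoted in this comment; the configuration labels are my own shorthand.

```python
# Indirect call counts in libxul, as quoted in this comment.
INDIRECT_CALLS = {
    "GCC 4.3 -O3": 82155,
    "mainline -O3": 83023,
    "mainline -O3 LTO": 87763,
}

baseline = INDIRECT_CALLS["GCC 4.3 -O3"]

# Absolute and relative change of each configuration versus the 4.3 build.
changes = {
    config: (count - baseline, (count - baseline) / baseline)
    for config, count in INDIRECT_CALLS.items()
}

for config, (delta, ratio) in changes.items():
    print(f"{config}: {delta:+d} calls ({ratio:+.2%})")
```

So mainline without LTO has about 1% more indirect calls than 4.3, and the LTO build close to 7% more, presumably because inlining duplicates call sites faster than devirtualization removes them.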
Hmm, the patch makes no difference, but I also see failure in its testcase

FAIL: g++.dg/ipa/imm-devirt-1.C scan-tree-dump optimized "= B::.*foo"
FAIL: g++.dg/ipa/imm-devirt-2.C scan-tree-dump optimized "= B::.*foo"

so I will wait for Martin to commit rest of his series and/or update the patch.
I tested Martin's devirtualization patch at cgraph build. The net result is a decrease in the number of indirect calls in libxul by 2. The code size decreases by about 3KB, so there is probably more devirtualization happening than just 2 calls, but the subsequent inlining increases the final number of virtual calls again.

So for 4.6.0 we won't see more improvements and we can look into improving devirtualization in 4.7. But having testcases that can be resolved by other compilers, but not by GCC, is a must.
Created attachment 23253 [details]
failing testcase

With current mainline and top of tree mozilla, things seem to go well, sqlite issues are gone. I now however get an elfhack fault:

jh@evans:/abuild/jh/build-mozilla-new9/build/unix/elfhack> /abuild/jh/build-mozilla-new9/build/unix/elfhack/elfhack -b test.so
test.so: terminate called after throwing an instance of 'std::runtime_error'
what(): Section index out of bounds
Aborted (core dumped)

I am attaching the test.so I get, to see if it is an elfhack miscompilation or the binary.
(In reply to comment #38)
> Created attachment 23253 [details]
> failing testcase
>
> With current mainline and top of tree mozilla, things seems to go well, sqlite
> issues are gone. I now however get elfhack fault:
>
> jh@evans:/abuild/jh/build-mozilla-new9/build/unix/elfhack>
> /abuild/jh/build-mozilla-new9/build/unix/elfhack/elfhack -b test.so
> test.so: terminate called after throwing an instance of 'std::runtime_error'
> what(): Section index out of bounds
> Aborted (core dumped)
>
> I am attaching test.so I get to see if it is elfhack miscomplation or the
> binary.

That could well be https://bugzilla.mozilla.org/show_bug.cgi?id=629638
Can you check with a changeset newer than http://hg.mozilla.org/mozilla-central/rev/2772a0cf36d1 ?
(In reply to comment #39)
> That could well be https://bugzilla.mozilla.org/show_bug.cgi?id=629638
> Can you check with a changeset newer than
> http://hg.mozilla.org/mozilla-central/rev/2772a0cf36d1 ?
I have just checked out a fresh copy of mozilla-central by doing
hg clone http://hg.mozilla.org/mozilla-central/
and the elfhack test still segfaults for me (with lto).
(In reply to comment #40) > I have just checked-out mozilla-central entirely by doing > > hg clone http://hg.mozilla.org/mozilla-central/ > > and the elfhack test still segfaults for me (with lto). Segfaults or aborts ?
(In reply to comment #41)
> Segfaults or aborts ?
Segfaults:
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/elfhack -b test.so
test.so: Reduced by 12128 bytes
# Fail if the backup file doesn't exist
[ -f "test.so.bak" ]
# Fail if the new library doesn't contain less relocations
[ $(objdump -R test.so.bak | wc -l) -gt $(objdump -R test.so | wc -l) ]
/home/mjambor/gcc/icln/inst/bin/gcc -o dummy dummy.o test.so
# Will either crash or return exit code 1 if elfhack is broken
LD_LIBRARY_PATH=/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/dummy
make[6]: *** [libs] Segmentation fault
make[6]: Leaving directory `/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack'
...and very early on it seems:
(gdb) bt
#0 0x00007ffff7ff7040 in frame_dummy () from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
#1 0x00007ffff7ff6f5e in _init () from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so
#2 0x00007ffff7ffa710 in ?? ()
#3 0x00007ffff7debe18 in call_init () from /lib64/ld-linux-x86-64.so.2
#4 0x00007ffff7debf47 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5 0x00007ffff7ddeb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
(In reply to comment #42) > (In reply to comment #41) > > > > Segfaults or aborts ? > > Segfaults: > > === > === If you get failures below, please file a bug describing the error > === and your environment (compiler and linker versions), and use > === --disable-elf-hack until this is fixed. > === > /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/elfhack -b > test.so > test.so: Reduced by 12128 bytes > # Fail if the backup file doesn't exist > [ -f "test.so.bak" ] > # Fail if the new library doesn't contain less relocations > [ $(objdump -R test.so.bak | wc -l) -gt $(objdump -R test.so | wc -l) ] > /home/mjambor/gcc/icln/inst/bin/gcc -o dummy dummy.o test.so > # Will either crash or return exit code 1 if elfhack is broken > LD_LIBRARY_PATH=/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack > /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/dummy > make[6]: *** [libs] Segmentation fault > make[6]: Leaving directory > `/home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack' > > ...and very early on it seems: > > (gdb) bt > #0 0x00007ffff7ff7040 in frame_dummy () > from /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so > #1 0x00007ffff7ff6f5e in _init () from > /home/mjambor/mozilla/mc2/objdir-ff-release/build/unix/elfhack/test.so > #2 0x00007ffff7ffa710 in ?? () > #3 0x00007ffff7debe18 in call_init () from /lib64/ld-linux-x86-64.so.2 > #4 0x00007ffff7debf47 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2 > #5 0x00007ffff7ddeb3a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2 Ah, so this is a crash of the test, not of elfhack. Could you attach both test.so and test.so.bak files ?
(In reply to comment #43) > Ah, so this is a crash of the test, not of elfhack. Could you attach both > test.so and test.so.bak files ? Actually, it would be better to just do that on bugzilla.mozilla.org. (please Cc ":glandium" there)
Can you try mozilla-central revision 19f13dea4d4a?
(In reply to comment #45) > Can you try mozilla-central revision 19f13dea4d4a? With that revision the elfhack problems are gone. Thanks!
With the elfhack issues gone, the build now fails with: ---------------------------------------------------------------------- /home/mjambor/gcc/icln/inst/bin/g++ -o js -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -O2 -flto=jobserver -fpermissive -fuse-linker-plugin -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -Os -freorder-blocks -fomit-frame-pointer js.o jsworkers.o -lpthread -O2 -flto=jobserver -fuse-linker-plugin -Wl,-rpath-link,/bin -Wl,-rpath-link,/home/mjambor/mozilla/lto/objdir-ff-release/dist/lib -L../../../dist/bin -L../../../dist/lib -L/home/mjambor/mozilla/lto/objdir-ff-release/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl make[6]: warning: jobserver unavailable: using -j1. Add `+' to parent make rule. /home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x33): error: undefined reference to 'SetVMFrameRegs' /home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x3b): error: undefined reference to 'PushActiveVMFrame' /home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x4d): error: undefined reference to 'PopActiveVMFrame' /home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x6b): error: undefined reference to 'js_InternalThrow' /home/mjambor/binutils/obj/gold/ld-new: /tmp/ccmP9JrU.ltrans0.ltrans.o:(.text+0x7a): error: undefined reference to 'PopActiveVMFrame' collect2: ld returned 1 exit status make[5]: *** [js] Error 1 make[5]: Leaving directory `/home/mjambor/mozilla/lto/objdir-ff-release/js/src/shell' ---------------------------------------------------------------------- I have not been able to have a closer look at the issue yet but hope to do so soon.
Created attachment 23364 [details] Mozilla updates needed
Updated mozilla patch fixing the undefined symbols Martin reported. Sorry, I have had it in my tree for a while, but didn't notice the PR was out of date.
(In reply to comment #48) > Updated mozilla patch fixing the undefined symbols Martin reported. > Sorry, had it in tree for a while, but didn't noticed PR is out of date. Thanks, that resolved these issues. However, now my 8GB machine runs out of memory when linking libxul.so.
> Thanks, that resolved these issues. However, now my 8GB machine runs
> out of memory when linking libxul.so.
That is expected. With Richard's -g fixes, memory usage is slightly over 8GB. Just add some swap; since it goes over 8GB only for a short time during WPA, it might not be that bad.
Honza
I tried again on a machine with more RAM and LTO build succeeded for me as well. Thanks a lot.
Just a warning: Building a -fprofile-generate libxul uses ~13GB of memory. (I have 8GB on my build-system and lto1 got killed several times by the OOM killer, until I added enough swap space.) The build process still fails later on as described in Comment 28.
Building fails with GNU ld (Linux/GNU Binutils) 2.21.51.0.7.20110306: c++ -o xpcshell -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -fshort-wchar -pthread -pipe -DNDEBUG -DTRIMMED -O3 xpcshell.o -lpthread -Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/bin -Wl,-rpath-link,/usr/lib -L../../../../dist/bin -L../../../../dist/lib ../../../../dist/lib/libxpcomglue_s.a -L/var/tmp/mozilla-central/moz-build-dir/dist/bin -lxpcom -lmozalloc -lxul -L/var/tmp/mozilla-central/moz-build-dir/dist/bin -lxpcom -lmozalloc -lxul -Wl,-R/usr/lib64 -L/usr/lib64 -lplds4 -lplc4 -lnspr4 -lpthread -ldl -ldl ../../../../dist/bin/libxul.so: undefined reference to `PR_smprintf_free' ../../../../dist/bin/libxul.so: undefined reference to `PR_SetEnv' ../../../../dist/bin/libxul.so: undefined reference to `PR_Now' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetErrorText' ../../../../dist/bin/libxul.so: undefined reference to `PR_FindFunctionSymbol' ../../../../dist/bin/libxul.so: undefined reference to `PR_PushIOLayer' ../../../../dist/bin/libxul.so: undefined reference to `PR_ntohs' ../../../../dist/bin/libxul.so: undefined reference to `PR_FormatTimeUSEnglish' ../../../../dist/bin/libxul.so: undefined reference to `PR_MemMap' ../../../../dist/bin/libxul.so: undefined reference to `PR_LocalTimeParameters' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetDefaultIOMethods' ../../../../dist/bin/libxul.so: undefined reference to `PR_ReadDir' ../../../../dist/bin/libxul.so: undefined reference to `PR_SetPollableEvent' ../../../../dist/bin/libxul.so: undefined reference to `PR_FindSymbol' /usr/lib/libssl3.so: undefined reference to `PR_OpenAnonFileMap' 
/usr/lib/libssl3.so: undefined reference to `PR_ExportFileMapAsString' ../../../../dist/bin/libxul.so: undefined reference to `PR_Delete' ../../../../dist/bin/libxul.so: undefined reference to `PR_AtomicSet' /usr/lib/libnss3.so: undefined reference to `PR_NewRWLock' ../../../../dist/bin/libxul.so: undefined reference to `PR_SetNetAddr' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetNumberOfProcessors' ../../../../dist/bin/libxul.so: undefined reference to `PR_SecondsToInterval' ../../../../dist/bin/libxul.so: undefined reference to `PR_Close' ../../../../dist/bin/libxul.so: undefined reference to `PR_vsprintf_append' ../../../../dist/bin/libxul.so: undefined reference to `PR_Bind' ../../../../dist/bin/libxul.so: undefined reference to `PR_Sleep' ../../../../dist/bin/libxul.so: undefined reference to `PR_OpenTCPSocket' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetRandomNoise' ../../../../dist/bin/libxul.so: undefined reference to `PR_Send' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetPhysicalMemorySize' ../../../../dist/bin/libxul.so: undefined reference to `PR_NotifyAllCondVar' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetUniqueIdentity' ../../../../dist/bin/libxul.so: undefined reference to `PR_ConnectContinue' ../../../../dist/bin/libxul.so: undefined reference to `PR_snprintf' ../../../../dist/bin/libxul.so: undefined reference to `PR_CreateFileMap' /usr/lib/libnss3.so: undefined reference to `PR_NewTCPSocket' /usr/lib64/libplc4.so: undefined reference to `PR_Assert' ../../../../dist/bin/libxul.so: undefined reference to `PR_htons' ../../../../dist/bin/libxul.so: undefined reference to `PR_FreeAddrInfo' /usr/lib/libnss3.so: undefined reference to `PR_Shutdown' /usr/lib/libssl3.so: undefined reference to `PR_ImportFileMapFromString' /usr/lib/libnss3.so: undefined reference to `PR_EnumerateHostEnt' ../../../../dist/bin/libxul.so: undefined reference to `PR_Malloc' /usr/lib/libnss3.so: undefined 
reference to `PR_SetErrorText' ../../../../dist/bin/libxul.so: undefined reference to `PR_EnumerateAddrInfo' ../../../../dist/bin/libxul.so: undefined reference to `PR_ConvertIPv4AddrToIPv6' ../../../../dist/bin/libxul.so: undefined reference to `PR_WaitProcess' ../../../../dist/bin/libxul.so: undefined reference to `PR_IntervalNow' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetHostByName' ../../../../dist/bin/libxul.so: undefined reference to `LL_MaxUint' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetSocketOption' ../../../../dist/bin/libxul.so: undefined reference to `PR_Free' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetPageShift' ../../../../dist/bin/libxul.so: undefined reference to `PR_LogPrint' ../../../../dist/bin/libxul.so: undefined reference to `PR_JoinThread' /usr/lib/libnss3.so: undefined reference to `PR_VersionCheck' ../../../../dist/bin/libxul.so: undefined reference to `PR_NewThreadPrivateIndex' ../../../../dist/bin/libxul.so: undefined reference to `PR_IsNetAddrType' ../../../../dist/bin/libxul.so: undefined reference to `PR_vsmprintf' ../../../../dist/bin/libxul.so: undefined reference to `PR_Recv' ../../../../dist/bin/libxul.so: undefined reference to `PR_strtod' ../../../../dist/bin/libxul.so: undefined reference to `PR_Notify' ../../../../dist/bin/libxul.so: undefined reference to `PR_Poll' ../../../../dist/bin/libxul.so: undefined reference to `PR_CeilingLog2' ../../../../dist/bin/libxul.so: undefined reference to `PR_SetSocketOption' ../../../../dist/bin/libxul.so: undefined reference to `PR_OpenUDPSocket' ../../../../dist/bin/libxul.so: undefined reference to `PR_PopIOLayer' ../../../../dist/bin/libxul.so: undefined reference to `PR_LoadLibraryWithFlags' ../../../../dist/bin/libxul.so: undefined reference to `PR_dtoa' ../../../../dist/bin/libxul.so: undefined reference to `PR_AtomicDecrement' ../../../../dist/bin/libxul.so: undefined reference to `PR_GetEnv' /usr/lib/libssl3.so: undefined 
reference to `PR_Interrupt' ... gold (1.11) works fine.
Turned out that GNU ld doesn't like "--as-needed"; LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine. (although GNU ld uses way more memory than gold.)
> Just a warning: Building a -fprofile-generate libxul uses
> ~13GB of memory. (I have 8GB on my build-system and lto1
> got killed several times by the OOM killer, until I added
> enough swap space.)
> The build process still fails later on as described in Comment 28.
You can do the -fprofile-generate build without -flto and use -flto only for the final build. It produces the same results and saves _a lot_ of memory ;)
Honza
> Turned out that GNU ld doesn't like "--as-needed";
> LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine.
> (although GNU ld uses way more memory than gold.)
Hmm, seems like a GNU ld bug to me (though I have never used --as-needed). Could you file it, please?
Honza
(In reply to comment #56) > > Turned out that GNU ld doesn't like "--as-needed"; > > LDFLAGS="-Wl,-O1,--hash-style=gnu,--no-keep-memory" works fine. > > (although GNU ld uses way more memory than gold.) > > Hmm, seems like GNU LD bug to me (tough I never used --as-needed) > Could you fill it in, please? Done: http://sourceware.org/bugzilla/show_bug.cgi?id=12557 >You can build -fprofile-generate without -flto and use -flto only for final >build. How do you do this with "make -f client.mk profiledbuild"? Or do you run both phases by hand?
> How do you do this with "make -f client.mk profiledbuild"? To answer my own question: Just edit ./configure and ./js/src/configure and add "-flto=4 -fwhole-program" (or whatever you may prefer) to the PROFILE_USE_CFLAGS variable. Then you can build Firefox with "make -f client.mk profiledbuild". BTW libmozsqlite3.so still gets miscompiled, but Firefox is now snappy as never before ;-)
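The manual edit described above could be scripted roughly as follows. This is a minimal sketch, assuming each configure script contains an assignment of the form PROFILE_USE_CFLAGS="..." on a single line (not verified against any particular mozilla-central revision); the helper name add_profile_use_flags is made up for illustration.

```shell
#!/bin/sh
# Append extra flags to a PROFILE_USE_CFLAGS="..." assignment in a file.
# Assumption: the assignment sits on one line and uses double quotes.
# add_profile_use_flags is a hypothetical helper, not part of the Mozilla tree.
add_profile_use_flags() {
    file=$1
    extra=$2
    tmp=$(mktemp) || return 1
    # Insert the extra flags just before the closing quote of the assignment.
    sed "s|^\([[:space:]]*PROFILE_USE_CFLAGS=\"[^\"]*\)\"|\1 ${extra}\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"    # cat (not mv) keeps the file's permissions
    status=$?
    rm -f "$tmp"
    return $status
}

# Intended use, from the top of the source tree:
#   add_profile_use_flags ./configure "-flto=4 -fwhole-program"
#   add_profile_use_flags ./js/src/configure "-flto=4 -fwhole-program"
#   make -f client.mk profiledbuild
```

Only the profile-use phase gets the LTO flags this way, so the instrumented build stays as cheap as a non-LTO build.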
> > How do you do this with "make -f client.mk profiledbuild"?
>
> To answer my own question:
> Just edit ./configure and ./js/src/configure and add
> "-flto=4 -fwhole-program" (or whatever you may prefer)
> to the PROFILE_USE_CFLAGS variable.
> Then you can build Firefox with "make -f client.mk profiledbuild".
I did not know of the existence of profiledbuild and thus I did that by hand where it was easy. Perhaps the Mozilla build machinery can be told to add -fno-lto to the -fprofile-generate run. Hmm, in fact perhaps GCC should do that by default. Not sure if it is not too late for 4.6, however.
>
> BTW libmozsqlite3.so still gets miscompiled, but Firefox is
> now snappy as never before ;-)
Yes, there is a PR on this, but I have absolutely no idea if it is an sqlite or a GCC bug. Any help is greatly appreciated; sqlite is a big blob of magic for me.
Latest mozilla-central fails here: make[5]: Entering directory `/var/tmp/mozilla-central/moz-build-dir/js/src/shell' js.cpp c++ -o js.o -c -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/ns prpub -I/usr/include/nspr -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-vir tual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fu se-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -O3 -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/js.pp /var/tmp/mozilla-central/js/src/shell/js.cpp jsworkers.cpp c++ -o jsworkers.o -c -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux2.6\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. 
-I../../../dist/include -I../../../dist/include/nsprpub -I/usr/include/nspr -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -O3 -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/jsworkers.pp /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp In file included from /var/tmp/mozilla-central/js/src/shell/js.cpp:97:0: /var/tmp/mozilla-central/js/src/jsobjinlines.h: In member function ‘void JSObject::setArrayLength(uint32)’: /var/tmp/mozilla-central/js/src/jsobjinlines.h:316:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] /usr/bin/python2.7 /var/tmp/mozilla-central/js/src/config/pythonpath.py -I../config /var/tmp/mozilla-central/js/src/config/expandlibs_exec.py --uselist -- c++ -o js -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -g -O3 js.o jsworkers.o -lpthread -Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory -Wl,-rpath-link,/bin -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/lib -L../../../dist/bin -L../../../dist/lib -Wl,-R/usr/lib64 -L/usr/lib64 -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl lto1: internal compiler error: in output_die, at dwarf2out.c:11355 Please submit a full bug report, with preprocessed source if appropriate. See <http://gcc.gnu.org/bugs.html> for instructions. 
make[6]: *** [/tmp/ccC5KSYt.ltrans18.ltrans.o] Error 1 make[6]: *** Waiting for unfinished jobs.... lto-wrapper: make returned 2 exit status /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld: fatal error: lto-wrapper failed collect2: ld returned 1 exit status
My tree still builds (this is a debug info ICE and I build without debug info by default). I will update my tree and try to reproduce it. It would be handy to have a testcase.
and since it doesn't fail at link time, this is a debug info bug, not an LTO one, so if you get a testcase, please open a new PR.
Some stats on size of the compilation unit...
There is 4.5GB of GGC memory; it gets down to 3.9MB after type merging and 3.1MB after cgraph merging.
GIMPLE type table: size 524287, 374001 elements, 4447259 searches, 70070870 collisions (ratio: 15.755968)
GIMPLE type hash table: size 8388593, 3907773 elements, 325621199 searches, 247539125 collisions (ratio: 0.760206)
GIMPLE canonical type table: size 262139, 182719 elements, 655793 searches, 1461075 collisions (ratio: 2.227952)
GIMPLE canonical type hash table: size 2097143, 863737 elements, 30341039 searches, 17653238 collisions (ratio: 0.581827)
GIMPLE type comparison table: size 134217689, 70698639 elements, 153291912 searches, 154719852 collisions (ratio: 1.009315)
[WPA] # of input files: 2721
[WPA] # of input cgraph nodes: 127466
[WPA] # of function bodies: 0
[WPA] GIMPLE type table: size 16381, 55 elements, 55 searches, 2 collisions (ratio: 0.036364)
There are overall 600K cgraph nodes before merging; 127K of those have function bodies.
MMAP pool
[WPA] Compression: 680146043 input bytes, 2436118544 uncompressed bytes (ratio: 3.581758)
[WPA] Size of mmap'd section decls: 421187330 bytes
[WPA] Size of mmap'd section function_body: 232170973 bytes
[WPA] Size of mmap'd section statics: 9978045 bytes
[WPA] Size of mmap'd section cgraph: 6356885 bytes
[WPA] Size of mmap'd section vars: 225276 bytes
[WPA] Size of mmap'd section refs: 1082929 bytes
[WPA] Size of mmap'd section jmpfuncs: 8401591 bytes
[WPA] Size of mmap'd section pureconst: 743014 bytes
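For readers checking the figures: the "ratio" values in the dump are plain quotients (collisions per search for the hash tables, uncompressed bytes per input byte for the compression line). A small sketch that recomputes two of them from the numbers quoted above; the helper name ratio is only illustrative.

```shell
#!/bin/sh
# ratio NUMERATOR DENOMINATOR: print the quotient with six decimals,
# the same precision the statistics dump uses.
ratio() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.6f\n", n / d }'
}

# Collisions per search for the GIMPLE type table.
ratio 70070870 4447259
# Uncompressed bytes per input byte for the WPA section data.
ratio 2436118544 680146043
```

A ratio above 1, as for the GIMPLE type table, means an average lookup hit several occupied slots before finishing.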
Some detailed stats on WPA memory usage.
Before IPA:
ipa-prop.c:2820 (ipa_read_node_info)                      0: 0.0%    8895232: 1.1%   24998944: 0.7%     395040: 0.1%     558297
tree.c:5898 (decl_priority_info)                   12295536: 0.7%          0: 0.0%   27391696: 0.8%          0: 0.0%    2480452
tree.c:1567 (build_string)                         16376223: 0.9%          0: 0.0%   39728388: 1.2%    4876275: 1.1%    1227602
lto-section-in.c:435 (lto_new_in_decl_state)           2280: 0.0%          0: 0.0%   44349120: 1.3%          0: 0.0%     369595
ipa-ref.c:54 (ipa_record_reference)                       0: 0.0%  117135752:14.1%   45299512: 1.3%   38560128: 8.5%     488972
lto-streamer-in.c:1875 (lto_materialize_tree)      44134352: 2.5%          0: 0.0%   66615480: 1.9%       4264: 0.0%    1107669
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw       1480: 0.0%  250512784:30.1%   67551704: 2.0%     157632: 0.0%       7072
cgraph.c:1015 (cgraph_create_edge_1)                      0: 0.0%          0: 0.0%   68064464: 2.0%          0: 0.0%     654466
lto-streamer-in.c:2307 (lto_input_ts_constructor   33062632: 1.9%  111658560:13.4%  102441008: 3.0%   56848328:12.6%     486571
lto/lto.c:214 (lto_read_in_decl_state)                 2288: 0.0%          0: 0.0%  110826912: 3.2%   21320304: 4.7%    2587165
tree.c:1257 (build_int_cst_wide)                  143425600: 8.1%          0: 0.0%  199678728: 5.8%  113095664:25.0%      60257
cgraph.c:459 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  236635872: 6.9%          0: 0.0%     672261
toplev.c:1027 (realloc_for_line_map)                      0: 0.0%  335593472:40.4%  335550464: 9.8%  134297600:29.7%         15
lto-streamer-in.c:1881 (lto_materialize_tree)    1302081688:73.2%          0: 0.0% 1968493840:57.3%   74550688:16.5%   29259517
Total                                            1777935767        831048528       3436852692        452441891         49428016
source location                                     Garbage            Freed             Leak         Overhead            Times
-------------------------------------------------------
after IPA
stringpool.c:75 (alloc_node)                              0: 0.0%          0: 0.0%   17709680: 0.5%          0: 0.0%     442742
stringpool.c:58 (stringpool_ggc_alloc)                    0: 0.0%          0: 0.0%   22641304: 0.7%    1646320: 0.3%     442742
tree.c:1297 (build_int_cst_wide)                   10611640: 0.6%          0: 0.0%   21902960: 0.6%          0: 0.0%     812865
tree.c:5898 (decl_priority_info)                   12376576: 0.7%          0: 0.0%   27310672: 0.8%          0: 0.0%    2480453
lto-section-in.c:435 (lto_new_in_decl_state)         162720: 0.0%          0: 0.0%   44188680: 1.3%          0: 0.0%     369595
tree.c:1567 (build_string)                         17659049: 1.0%          0: 0.0%   38445562: 1.1%    4876275: 1.0%    1227602
cgraph.c:1015 (cgraph_create_edge_1)                      0: 0.0%          0: 0.0%   68064464: 2.0%          0: 0.0%     654466
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw      26888: 0.0%  258338128:27.6%   75336800: 2.2%     171272: 0.0%       7667
gimple.c:4187 (iterative_hash_gimple_type)         78311648: 4.3%          0: 0.0%     260960: 0.0%          0: 0.0%    4910788
ipa-ref.c:54 (ipa_record_reference)                       0: 0.0%  156312592:16.7%   82529352: 2.4%   63464176:13.2%     506799
lto-streamer-in.c:1875 (lto_materialize_tree)      49735872: 2.8%          0: 0.0%   61013960: 1.8%       4264: 0.0%    1107669
lto/lto.c:214 (lto_read_in_decl_state)               315616: 0.0%          0: 0.0%  110513584: 3.2%   21320304: 4.4%    2587165
lto-symtab.c:156 (lto_symtab_register_decl)       130991616: 7.3%          0: 0.0%    2900408: 0.1%          0: 0.0%    2390929
lto-streamer-in.c:2307 (lto_input_ts_constructor   33062632: 1.8%  111658560:12.0%  102441008: 3.0%   56848328:11.8%     486571
cgraph.c:459 (cgraph_allocate_node)                       0: 0.0%          0: 0.0%  236635872: 6.9%          0: 0.0%     672261
toplev.c:1027 (realloc_for_line_map)                      0: 0.0%  335593472:35.9%  335550464: 9.8%  134297600:28.0%         15
tree.c:1257 (build_int_cst_wide)                  144244592: 8.0%          0: 0.0%  198866208: 5.8%  113097680:23.5%      60267
lto-streamer-in.c:1881 (lto_materialize_tree)    1319860448:73.1%          0: 0.0% 1950715080:57.0%   74550688:15.5%   29259517
Total                                            1804556313        934357752       3423228826        480284459         49853300
source location                                     Garbage            Freed             Leak         Overhead            Times
Kind                 Nodes        Bytes
---------------------------------------
decls             11502734   1829746088
types              4430124    744260832
blocks                   1           88
stmts                    0            0
refs                  8173       485872
exprs              2358594    113315792
constants          2245230     86809013
identifiers         442742     17709680
vecs                 60267    116915440
binfos             1107669    110741304
ssa names              309        27192
constructors        310545      9937440
random kinds      10648367    425935048
lang_decl kinds          0            0
lang_type kinds          0            0
omp clauses              0            0
---------------------------------------
Total             33114755   -839083507
---------------------------------------
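A note on the negative byte total in the per-kind table: summing the individual rows gives a value above 2^31, and reinterpreting that sum as a signed 32-bit integer yields exactly the printed figure, which suggests the total is accumulated in a 32-bit counter. A quick check, assuming a shell with 64-bit arithmetic (bash, or dash on a 64-bit host):

```shell
#!/bin/sh
# Per-kind byte counts copied from the table above (zero rows omitted).
sum=0
for bytes in 1829746088 744260832 88 485872 113315792 86809013 \
             17709680 116915440 110741304 27192 9937440 425935048; do
    sum=$((sum + bytes))
done
echo "sum of rows:      $sum"

# Reinterpret the low 32 bits of the sum as a signed 32-bit integer.
wrapped=$((sum % 4294967296))
if [ "$wrapped" -ge 2147483648 ]; then
    wrapped=$((wrapped - 4294967296))
fi
echo "as signed 32-bit: $wrapped"
```

So the real total for these kinds is about 3.46GB, in line with the "Leak" total of the preceding table.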
(In reply to comment #62) > and since it doesn't fail at link time, this is debug info bug, not LTO, so if > you get a testcase, please open a new PR. You're right, it builds fine without "-g" (ac_add_options --disable-debug-symbols). But the build now fails early when elfhack is enabled: with gold: c++ -o elfhack -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fwhole-program -fno-strict-aliasing -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -g -O3 -lpthread -Wl,-O1,--hash-style=gnu,--as-needed,--no-keep-memory -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/bin -Wl,-rpath-link,/usr/lib host_elf.o host_elfhack.o /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccGQbukN.ltrans3.ltrans.o: in function _ZN8Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc.local.402:/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:239: error: undefined reference to 'void Elf_Ehdr_Traits::swap<big_endian, Elf64_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, Elf64_Ehdr&)' /usr/lib/gcc/x86_64-pc-linux-gnu/4.6.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccGQbukN.ltrans3.ltrans.o: in function _ZN8Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc.local.402:/var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:228: error: undefined reference to 'void Elf_Ehdr_Traits::swap<big_endian, Elf32_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, Elf32_Ehdr&)' collect2: ld returned 1 exit status make[7]: *** [elfhack] Error 1 or with gnu-ld: In function `serialize': /var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:239: undefined reference to `void Elf_Ehdr_Traits::swap<big_endian, Elf64_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, 
Elf64_Ehdr&)' /var/tmp/mozilla-central/build/unix/elfhack/elfxx.h:228: undefined reference to `void Elf_Ehdr_Traits::swap<big_endian, Elf32_Ehdr, serializable<Elf_Ehdr_Traits> >(serializable<Elf_Ehdr_Traits>&, Elf32_Ehdr&)' collect2: ld returned 1 exit status see also: https://bugzilla.mozilla.org/show_bug.cgi?id=647458 (but it does look more like a gcc lto bug to me)
On Sun, Apr 03, 2011 at 10:09:06AM +0000, hubicka at gcc dot gnu.org wrote: > Kind Nodes Bytes > --------------------------------------- > decls 11502734 1829746088 > types 4430124 744260832 > blocks 1 88 > stmts 0 0 > refs 8173 485872 > exprs 2358594 113315792 > constants 2245230 86809013 > identifiers 442742 17709680 > vecs 60267 116915440 > binfos 1107669 110741304 > ssa names 309 27192 > constructors 310545 9937440 > random kinds 10648367 425935048 > lang_decl kinds 0 0 > lang_type kinds 0 0 > omp clauses 0 0 > --------------------------------------- > Total 33114755 -839083507 > --------------------------------------- Do folks think it would be useful to include a breakdown by individual TREE_CODE, similar to what's done for RTXes?
(In reply to comment #66) > On Sun, Apr 03, 2011 at 10:09:06AM +0000, hubicka at gcc dot gnu.org wrote: > > Kind Nodes Bytes > > --------------------------------------- > > decls 11502734 1829746088 > > types 4430124 744260832 > > blocks 1 88 > > stmts 0 0 > > refs 8173 485872 > > exprs 2358594 113315792 > > constants 2245230 86809013 > > identifiers 442742 17709680 > > vecs 60267 116915440 > > binfos 1107669 110741304 > > ssa names 309 27192 > > constructors 310545 9937440 > > random kinds 10648367 425935048 > > lang_decl kinds 0 0 > > lang_type kinds 0 0 > > omp clauses 0 0 > > --------------------------------------- > > Total 33114755 -839083507 > > --------------------------------------- > > Do folks think it would be useful to include a breakdown by individual > TREE_CODE, similar to what's done for RTXes? I have posted a patch for this last year, but it seems I forgot to commit it.
On Mon, Apr 04, 2011 at 01:01:27PM +0000, rguenth at gcc dot gnu.org wrote: > > Do folks think it would be useful to include a breakdown by individual > > TREE_CODE, similar to what's done for RTXes? > > I have posted a patch for this last year, but it seems I forgot to commit > it. Well, it'd be most interesting to see the per-code breakdown for Honza's earlier numbers.
On 4/4/2011 3:19 AM, froydnj at codesourcery dot com wrote: > Do folks think it would be useful to include a breakdown by individual > TREE_CODE, similar to what's done for RTXes? Sure couldn't hurt, and I can definitely think of situations where I wanted exactly that. Thank you,
I cannot reproduce the aforementioned elfhack failure. For me the build fails later at
/abuild/jh/trunk-install/bin/g++ -flto=24 -fuse-linker-plugin -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -fno-strict-aliasing -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -g -Os -freorder-blocks -fomit-frame-pointer -fPIC -shared -Wl,-z,defs -Wl,-h,test.so -o test.so test.o
===
=== If you get failures below, please file a bug describing the error
=== and your environment (compiler and linker versions), and use
=== --disable-elf-hack until this is fixed.
===
/abuild/jh/build-mozilla-new11-lto-elfhack/build/unix/elfhack/elfhack -b test.so
test.so: terminate called after throwing an instance of 'std::runtime_error'
what(): Section index out of bounds
make[5]: *** [test.so] Aborted (core dumped)
I tend to believe that this is an elfhack problem. The only way for me to get a similar linker error is to disable the linker plugin and use -fwhole-program. Can you please try to build with -save-temps -fdump-ipa-cgraph and attach the produced *.res and *wpa*cgraph files?
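One way to pass the requested flags through the Mozilla build is via the compiler variables in mozconfig. The fragment below is only a sketch modeled on the mozconfig posted earlier in this bug; the -flto=4 job count and the plain gcc/g++ names are assumptions, and a host tool such as elfhack may pick up its flags from other variables.

```shell
# mozconfig fragment (sketch, not a tested configuration).
# -save-temps keeps intermediate files, which should include the linker
# plugin resolution (*.res) file; -fdump-ipa-cgraph writes the *.cgraph
# dumps, including the WPA-stage ones.
export CC="gcc -flto=4 -fuse-linker-plugin -save-temps -fdump-ipa-cgraph"
export CXX="g++ -flto=4 -fuse-linker-plugin -fpermissive -save-temps -fdump-ipa-cgraph"
```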
Created attachment 23917 [details] -lm.res
Created attachment 23918 [details] elfhack.wpa.000i.cgraph
Jan, elfhack only fails to build if I use: ac_add_options --enable-optimize=-O3 in my .mozconfig. When I delete the =-O3 part everything builds fine.
Interesting. -O3 makes no difference for me. I will look into your dumps to see if I can spot something useful. The behavior I observe is that GCC optimizes away all the strings that are placed into test.so. I didn't look deeper into it (right now I am checking whether I can reproduce your dwarf2out ICE and get a testcase), but I think that is what makes my elfhack test fail. I am surprised it does not happen for you. If GCC fails to link even a program as simple as elfhack, something pretty fundamental must be going wrong. Perhaps it is a linker bug. I had problems with older versions of gold.
(In reply to comment #74) > Interesting. -O3 makes no difference for me. I will look into your dumps if I > can spot something useful. > ... > If GCC fail to link even such a simple program as elfhack is, something pretty > fundamental must go wrong. Perhaps it is linker bug. I had problems with older > versions of gold. The failure only happens with -flto. And the reason is that: c++ -o host_elf.o -c -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long -march=native -fpermissive -flto=4 -fuse-linker-plugin -fno-strict-aliasing -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -Os -I/var/tmp/mozilla-central/build/unix/elfhack -I. -I../../../dist/include -I../../../dist/include/nsprpub -I/usr/include/nspr -I/usr/include/nss -I/usr/include/nspr /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp apparently only compiles correctly in the -Os case. All other optimization switches (-O(0..3) or without -O) lead to the eventual link failure above. And it happens with both gnu-ld and gold (2.21.51.20110402).
Created attachment 23930 [details] Output of -Wl,-Map good
Created attachment 23931 [details] Output of -Wl,-Map bad I've attached the output of "-Wl,-Map,map" of both cases (-Os vs. -O2). Please do a vimdiff of both and search for Elf_Ehdr9serializeERSt14basic_ofstreamIcSt11char_traitsIcEEcc and you'll see that in the good case it lives in its own ltrans file: /tmp/cca0jnrX.ltrans9.ltrans.o while in the bad case it is thrown together with other headers into: /tmp/ccd8WyNK.ltrans3.ltrans.o which then leads to the link error above.
(In reply to comment #75) > (In reply to comment #74) > > Interesting. -O3 makes no difference for me. I will look into your dumps if I > > can spot something useful. > > ... > > If GCC fail to link even such a simple program as elfhack is, something pretty > > fundamental must go wrong. Perhaps it is linker bug. I had problems with older > > versions of gold. > > The failure only happens with -flto. > And the reason is that: > > c++ -o host_elf.o -c -fno-rtti -Wall -Wpointer-arith -Woverloaded-virtual > -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align > -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -Wno-long-long > -march=native -fpermissive -flto=4 -fuse-linker-plugin -fno-strict-aliasing > -fshort-wchar -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -Os > -I/var/tmp/mozilla-central/build/unix/elfhack -I. -I../../../dist/include > -I../../../dist/include/nsprpub -I/usr/include/nspr -I/usr/include/nss > -I/usr/include/nspr /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp > > apparently only compiles correctly in the -Os case. All other optimization > switches (-O(0..3) or without -O) lead to the eventual link failure above. > And it happens with both gnu-ld and gold (2.21.51.20110402). What matters is what is used to build/link test.so, not elfhack itself, and from the look at the command line in comment 70, you're building test.so with unexpected things. It is not meant to be optimized. So, some more variables tweaking would apparently be required in build/unix/elfhack/Makefile.in.
(In reply to comment #78) > What matters is what is used to build/link test.so, not elfhack itself, and > from the look at the command line in comment 70, you're building test.so with > unexpected things. It is not meant to be optimized. So, some more variables > tweaking would apparently be required in build/unix/elfhack/Makefile.in.

There are two different issues that we're talking about:
- The link error when you build with --enable-optimize=-O3. This has nothing to do with test.so AFAICS.
- The test failure Jan reported, which only happens _after_ elfhack is successfully built. In this case your comment above may apply.
Hi,
in the resolution files, the swap functions are already undefined:

  5382 3d06433b UNDEF __assert_fail
  5400 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI13little_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_
  5447 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_
  5455 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI13little_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_
  5459 3d06433b UNDEF _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_

I currently have problems getting past the firewall to my mozilla build, but this seems like another instance of the problem with COMDATs, i.e. host_elfhack including some header that makes use of those functions in something that is inlined and consistently optimized out in normal compilation, but due to COMDAT issues it stays stuck in the LTO output. According to cgraph dump it is used by
Sorry, firefox concluded I want to save changes when I didn't ;) The problem is function Elf_Ehdr::serialize(std::basic_ofstream<char, std::char_traits<char> >&, char, char). What I see is that this function is defined several times in the unmerged cgraph (i.e. it is a comdat inline coming from different .o files) and _some_ of the definitions call a swap function that is not defined, while other definitions call a swap function that is defined. In your build the one that calls the undefined swap wins, resulting in the final link error. I am not sure if this is a GCC bug or elfhack, but I would guess elfhack actually.

This is all a bit tricky since the COMDAT hack comes into play here: GCC is not telling the linker in the LTO symtab about COMDATs for inline functions when their address is not taken, since they should be defined in every unit that needs them. That is not the case here. I think either swap should be keyed in one of the units, which it apparently is not:

swap/622(-1) @0x7f4d15d34000 (asm: _ZN18Elf_RelHack_Traits4swapI10big_endian9Elf32_Rel12serializableIS_EEEvRT1_RT0_) analyzed 19 time, 16 benefit 31 size, 8 benefit externally_visible finalized inlinable
  called by: serialize/623 (0.65 per call) serialize/623 (0.25 per call)
  calls: __builtin_constant_p/466 (1.00 per call) __builtin_constant_p/466 (1.00 per call)
  References:
  Refering this function:

(other defs look like)

swap/825(-1) @0x7f4d15d466e0 (asm: _ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf32_Ehdr12serializableIS_EEEvRT1_RT0_) undef
  called by: serialize/595 (0.25 per call) (can throw external)
  calls:
  References:
  Refering this function:

Or this is some source level bug, i.e. one unit just forward declaring the function while another defines it as comdat inline, which is probably a violation of the one definition rule. Would it be possible for you to look into the preprocessed source files of elfhack and see which units define the serialize function and, among those, what the swap definitions look like?
We could probably make lto-symtab not give up on seeing an UNDEF resolution from the linker in these cases, but I would rather avoid piling up hacks around this COMDAT mess. Honza
(In reply to comment #81) > > The problem is function Elf_Ehdr::serialize(std::basic_ofstream<char, > std::char_traits<char> >&, char, char) ... > Would be possible for you to look into preprocessed source files of elfhack and > see what units define the serialize function and among those how the swap > defintions look like? > I think it would be best if you take a look at the source files yourself once your firewall problem is solved, because there are actually only two of them (elfxx.h and elf.cpp). The instantiation takes place in elfxx.h:431 and elf.cpp:142. BTW when I use -frepo to compile host_elf.o the link error goes away. And if I recompile host_elf.o without -frepo, but leave the host_elf.rpo file, this is what happens: collect: recompiling /var/tmp/mozilla-central/build/unix/elfhack/elf.cpp collect: relinking collect2: '_ZN15Elf_Ehdr_Traits4swapI10big_endian10Elf64_Ehdr12serializableIS_EEEvRT1_RT0_' was assigned to 'host_elf.rpo', but was not defined during recompilation, or vice versa and then the link error from above follows.
> I am not sure if this is GCC bug or elfhack, but I would guess for elfhack actually. I guess you're right, because when I move the swap definitions: template <class endian, typename R, typename T> inline void Elf_Ehdr_Traits::swap(T &t, R &r) ... from elf.cpp to elfxx.h (where they actually belong) the link error vanishes.
(In reply to comment #83) > > I am not sure if this is GCC bug or elfhack, but I would guess for > elfhack actually. > > I guess you're right, because when I move the swap definitions: > > template <class endian, typename R, typename T> > inline void Elf_Ehdr_Traits::swap(T &t, R &r) > ... > > from elf.cpp to elfxx.h (where they actually belong) the > link error vanishes. I'm not convinced they belong there. But wouldn't removing the "inline" keyword work equally well?
Thanks for the analysis. Removing inline should work too. While, as a QoI issue, GCC could find the missing body, I think it is better to avoid more hacks. For 4.7 I will implement the new COMDAT proposal. Does elfhack work for you now?
(In reply to comment #85) > does elfhack work for you now? Yes, no problems anymore.
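For readers following the elfhack part of this thread, here is a minimal sketch of the pattern that made the link error go away: the member template is declared in the class and its inline definition sits next to it, as it would in a header, so every unit that instantiates it also sees the body. All names here (Header, Header_Traits, serialize_big and the two endian tags) are hypothetical stand-ins, not the real elfhack types.

```cpp
#include <cstdint>

// Hypothetical endian tag types; the real elfhack ones differ.
struct little_endian {
    static std::uint32_t conv(std::uint32_t v) { return v; }
};
struct big_endian {
    // Byte-swap a 32-bit value.
    static std::uint32_t conv(std::uint32_t v) {
        return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
               ((v << 8) & 0x00ff0000u) | (v << 24);
    }
};

// Hypothetical stand-in for an ELF header structure.
struct Header {
    std::uint32_t entry;
};

struct Header_Traits {
    // Declaration only, as it would appear inside the class in the header.
    template <class endian, typename R, typename T>
    static void swap(T &t, R &r);
};

// The definition lives next to the declaration (i.e. in the header), so
// an inline comdat caller never ends up referring to a swap body that
// exists only in some other translation unit.
template <class endian, typename R, typename T>
inline void Header_Traits::swap(T &t, R &r) {
    r.entry = endian::conv(t.entry);
}

// Hypothetical inline caller, playing the role of serialize().
inline Header serialize_big(Header in) {
    Header out{};
    Header_Traits::swap<big_endian>(in, out);
    return out;
}
```

The alternative mentioned above, dropping the inline keyword and keeping the definition in one .cpp file, would instead need explicit instantiations in that file for every combination that other units use.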
http://gcc.gnu.org/ml/gcc-patches/2011-04/msg01854.html has updated build time/memory stats. With Michael's WPA patch, we now need about 5GB of address space on a 64bit build, so we might fit in 32bit again.
As a quick status update, mozilla now builds and works with the TOT GCC tree again, after fixes to debug info streaming and clone materialization. -g still fails at PR48724.
This is a callgrind profile for our hashtables, which are consuming most of the time at the WPA stage. It is from the javascript library, but probably close enough for libxul:

  9,413,074 < ipa.c:cgraph_node_set_add (47698x)
  237,777,114 < lto-streamer-in.c:lto_input_location (253470x)
  162,391 < cgraph.c:cgraph_same_body_alias_1 (1125x)
  3,481,459 < lto/lto.c:lto_create_files_from_ids (18272x)
  1,262,433,061 < lto-streamer.c:lto_streamer_cache_insert_1 (9456405x)
  1,721,939 < cgraph.c:cgraph_remove_node (13507x)
  32,443,118 < cgraph.c:cgraph_get_node (254257x)
  15,700,040 < lto/lto.c:remember_with_vars (88495x)
  100,462,329 < lto-streamer.c:lto_streamer_cache_lookup (959530x)
  59,948,506 < lto/lto-object.c:lto_obj_add_section (38584x)
  551,876,527 < gimple.c:gimple_register_type'2 (9863x)
  15,332,148 < lto-symtab.c:lto_symtab_get (148180x)
  123,454,996 < ipa.c:varpool_node_set_find (1090522x)
  497,594,354 < gimple.c:gimple_register_canonical_type (174920x)
  7,723,287 < lto-section-out.c:lto_output_decl_index (48869x)
  1,363,423 < lto-section-in.c:lto_get_function_in_decl_state (13102x)
  60,607,732 < ipa.c:cgraph_node_set_find (526286x)
  3,220,597 < varpool.c:varpool_node (19821x)
  3,316,861 < lto-symtab.c:lto_symtab_register_decl (23462x)
  523,758,152 < lto-streamer-out.c:lto_output_string_with_length (793000x)
  30,909,893 < lto/lto.c:create_subid_section_table (19190x)
  4,593,607 < cgraph.c:cgraph_create_node (22343x)
  223,259 < cgraph.c:cgraph_clone_node (1353x)
  20,940,173 < lto-section-in.c:lto_record_renamed_decl (14960x)
  2,983,016,896 < gimple.c:gimple_register_type (149596x)
  3,876,333 < cgraph.c:cgraph_get_node_or_alias (27793x)
  123,200 < varpool.c:varpool_remove_node (973x)
  46,083,990 < tree.c:build_int_cst_wide (247788x)
  4,703,171 < ipa.c:cgraph_node_set_remove (40839x)
  261,240,516 * libiberty/hashtab.c:htab_find_slo

So it seems that in addition to type merging we have quite a few other problems. varpool_node_set_find seems just stupid, for example.
Per node memory usage statistics for WPA:

Code             Nodes
----------------------------
identifier_node  428715
tree_list        10992455
tree_vec         54594
enumeral_type    49860
integer_type     201079
real_type        1975
pointer_type     1575376
reference_type   102944
array_type       98085
record_type      903172
union_type       17170
void_type        1496
function_type    127906
method_type      1533898
integer_cst      767153
real_cst         15992
string_cst       1224809
function_decl    2473011
label_decl       264118
field_decl       1399608
var_decl         86596
const_decl       510913
parm_decl        5530790
type_decl        964008
result_decl      553028
debug_expr_decl  144282
namespace_decl   9876
constructor      160380
nop_expr         508605
addr_expr        789320
tree_binfo       1090674
Hi, with the patch I just posted for removal of hash tables for cgraph/varpool node set, the situation with hashing is better. We got from 900s WPA stage to 500s WPA stage. Streaming still dominate: ipa lto decl in : 331.26 (56%) usr 5.51 (34%) sys 337.11 (56%) wall 722314 kB (46%) ggc ipa lto decl out : 118.21 (20%) usr 4.37 (27%) sys 122.57 (20%) wall 0 kB ( 0%) ggc ipa lto decl merge : 23.61 ( 4%) usr 0.20 ( 1%) sys 23.83 ( 4%) wall 962 kB ( 0%) ggc inline heuristics : 57.12 (10%) usr 0.14 ( 1%) sys 57.27 ( 9%) wall 227500 kB (14%) ggc TOTAL : 587.02 16.36 604.01 1585790 kB (I have plans for fixing inliner once more prominent problems are solved) Streaming in oprofile: 150985 20.6876 lto1 htab_find_slot_with_hash 71532 9.8012 lto1 gimple_types_compatible_p 55971 7.6690 libc-2.11.1.so _int_malloc 55104 7.5502 lto1 iterative_hash_hashval_t 33160 4.5435 lto1 type_pair_eq 31554 4.3235 libc-2.11.1.so memset 25670 3.5172 lto1 gtc_visit 23972 3.2846 lto1 gimple_type_hash_1 21562 2.9544 lto1 lto_input_tree 15230 2.0868 lto1 gt_ggc_mx_lang_tree_node 14807 2.0288 lto1 inflate_fast callgrind profile (of javascript instead of libxul) shows that tree_map_base hash is the most busy one: 453,603,428 * libiberty/../../libiberty/hashtab.c:htab_find_slot_with_hash'2 33,167,620 > gcc/../../gcc/tree.c:tree_map_base_eq (6633524x) 134,245,948 > libiberty/../../libiberty/hashtab.c:htab_expand (18x) 25,459,797 > gcc/../../gcc/gimple.c:type_pair_eq (2793149x) and the users of hashing: 63,519,720 < /libiberty/hashtab.c:htab_find_slot'2 (676308x) 3,975,492,482 < /libiberty/hashtab.c:htab_find_slot (2179693x) 255,072,048 * /libiberty/hashtab.c:htab_find_slot_with_hash 14,530,222 < /gcc/gimple.c:iterative_hash_gimple_type'2 (52634x) 526,622,873 < /gcc/gimple.c:lookup_type_pair.isra.103.constprop.111 (1621144x) 17,415,611 < /gcc/gimple.c:iterative_hash_gimple_type (100893x) 11,734,620 < /gcc/gimple.c:visit'2 (98730x) 432,531,796 < /gcc/gimple.c:gimple_type_hash_1 (3851023x) 35,405,473 < 
/gcc/gimple.c:visit (319520x) 108,790,992 * /libiberty/hashtab.c:htab_find_slot'2 Oprofile of the whole build shows also problem in decl_assembler_name_equal (because of our stupit alias hacks) and can_inline_edge_p. I will look into those two. 260739 7.1750 lto1 lto1 htab_find_slot_with_hash 151080 4.1574 lto1 lto1 decl_assembler_name_equal 130969 3.6040 libc-2.11.1.so libc-2.11.1.so _int_malloc 100723 2.7717 lto1 lto1 gimple_types_compatible_p 97370 2.6794 lto1 lto1 iterative_hash_hashval_t 75051 2.0653 libc-2.11.1.so libc-2.11.1.so memset 56508 1.5550 lto1 lto1 bitmap_set_bit 53211 1.4643 lto1 lto1 can_inline_edge_p 51613 1.4203 oprofiled oprofiled /usr/bin/oprofiled 49992 1.3757 lto1 lto1 pointer_map_insert 48381 1.3313 lto1 lto1 lto_input_tree 44467 1.2236 lto1 lto1 type_pair_eq 35096 0.9658 libc-2.11.1.so libc-2.11.1.so _int_free 35069 0.9650 lto1 lto1 gtc_visit (this is including ltrans stage) Honza
decl in is now at 96 seconds. The oprofile for streaming in is:

  27469  9.3054  lto1            htab_find_slot_with_hash
  23175  7.8508  libc-2.11.1.so  _int_malloc
  18044  6.1126  lto1            lto_input_tree
  14823  5.0215  libc-2.11.1.so  memset
  14108  4.7792  lto1            gt_ggc_mx_lang_tree_node
  13511  4.5770  lto1            inflate_fast
  11805  3.9991  lto1            gimple_type_eq
  11247  3.8100  lto1            lto_input_uleb128
  11227  3.8033  lto1            ggc_set_mark
  10903  3.6935  lto1            pointer_map_insert

So obviously there is still some room for improvement in merging. I think the malloc calls come mostly from the SCC detection code (we create a lot of temporary obstacks and pointer maps). lto_input_tree can probably take quite a lot of optimizations reducing the amount of data we stream. Plus we don't really need to stream ulebs for everything.

For the whole WPA we now need about 5 minutes; the oprofile is:

  152067  14.9073  lto1            decl_assembler_name_equal
  48258   4.7308   lto1            htab_find_slot_with_hash
  46730   4.5810   lto1            edge_badness
  37954   3.7207   libc-2.11.1.so  _int_malloc
  36692   3.5970   lto1            pointer_map_insert
  30387   2.9789   lto1            do_estimate_growth
  28496   2.7935   lto1            lto_input_tree
  20992   2.0579   lto1            inflate_fast
  20765   2.0356   libc-2.11.1.so  memset
  20264   1.9865   lto1            varpool_node_for_asm
  19784   1.9394   lto1            lto_output_tree
  19121   1.8745   lto1            htab_hash_string
  19053   1.8678   lto1            lto_input_uleb128

The good news is that decl_assembler_name_equal comes from the stupid handling of varpool aliases in varpool_node_for_asm, which will go away with my alias rewrite. edge_badness is easy to track down, too; it is just inliner updating paranoia. Honza
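As background on why lto_input_uleb128 shows up in these profiles: ULEB128 stores an unsigned integer seven bits per byte, with the high bit marking that more bytes follow, so every value costs a loop with a branch per byte on input. The following is a generic sketch of the encoding, not GCC's streamer code; the function names uleb128_encode and uleb128_decode are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append the ULEB128 encoding of value to out: 7 payload bits per byte,
// high bit set on every byte except the last one.
inline void uleb128_encode(std::uint64_t value, std::vector<std::uint8_t> &out) {
    do {
        std::uint8_t byte = static_cast<std::uint8_t>(value & 0x7f);
        value >>= 7;
        if (value != 0)
            byte |= 0x80;  // more bytes follow
        out.push_back(byte);
    } while (value != 0);
}

// Decode one ULEB128 value starting at in[pos]; pos is advanced past it.
inline std::uint64_t uleb128_decode(const std::vector<std::uint8_t> &in,
                                    std::size_t &pos) {
    std::uint64_t result = 0;
    unsigned shift = 0;
    while (pos < in.size()) {
        std::uint8_t byte = in[pos++];
        if (shift < 64)  // ignore payload bits that would not fit in 64 bits
            result |= static_cast<std::uint64_t>(byte & 0x7f) << shift;
        if ((byte & 0x80) == 0)
            break;  // last byte of this value
        shift += 7;
    }
    return result;
}
```

Small values cost one byte, which is why the format is attractive for streaming, but fields with a known fixed width could be read without the per-byte loop.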
Time report:

 ipa lto gimple out    :  10.28 ( 4%) usr   1.05 (11%) sys  11.35 ( 4%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  98.45 (37%) usr   2.23 (24%) sys 100.91 (36%) wall  713587 kB (45%) ggc
 ipa lto decl out      :  82.47 (31%) usr   2.92 (31%) sys  85.84 (31%) wall       0 kB ( 0%) ggc
 inline heuristics     :  31.74 (12%) usr   0.14 ( 1%) sys  32.07 (11%) wall  240317 kB (15%) ggc
 TOTAL                 : 269.41             9.36           279.78            1595687 kB

GIMPLE type table: size 1048573, 427153 elements, 6361837 searches, 23794591 collisions (ratio: 3.740208)
GIMPLE type hash table: size 4194301, 1452245 elements, 72676685 searches, 47569100 collisions (ratio: 0.654530)
GIMPLE canonical type table: size 65521, 48844 elements, 762160 searches, 552280 collisions (ratio: 0.724625)
GIMPLE canonical type hash table: size 1048573, 402512 elements, 2184661 searches, 1627547 collisions (ratio: 0.744988)

Nice improvement. My reading is that the GIMPLE type hash table would be better as a TYPE_UID indexed array (or a pointer map, if that could be made to live in GGC memory). Almost 73 million searches is quite a lot. Honza
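To illustrate the idea of replacing the type hash table with an array indexed by TYPE_UID, here is a rough sketch. It assumes only that UIDs are small dense integers; the class UidHashCache and its methods are hypothetical and do not correspond to any existing GCC data structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache of per-type hash values, indexed directly by UID.
// A lookup is a bounds check plus one array access: no hashing, no
// probing, and therefore no collisions.
class UidHashCache {
public:
    // Returns true and fills hash_out if a hash was stored for this uid.
    bool lookup(std::size_t uid, std::uint32_t &hash_out) const {
        if (uid >= slots_.size() || !slots_[uid].valid)
            return false;
        hash_out = slots_[uid].hash;
        return true;
    }

    // Remember the hash for this uid, growing the array when needed.
    void store(std::size_t uid, std::uint32_t hash) {
        if (uid >= slots_.size())
            slots_.resize(uid + 1);
        slots_[uid].hash = hash;
        slots_[uid].valid = true;
    }

private:
    struct Slot {
        std::uint32_t hash = 0;
        bool valid = false;
    };
    std::vector<Slot> slots_;
};
```

The trade-off is memory: the array is as large as the highest UID seen, so this only pays off when the UIDs are dense, which is an assumption here.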
Callgrinding htab_find_slot_with_hash leads to: 2,535,276,742 < /libiberty/hashtab.c:htab_find_slot'2 (27545437x) [//lto1] 84,947,655,239 < /libiberty/hashtab.c:htab_find_slot (52919141x) [//lto1] 7,097,218,396 * /libiberty/hashtab.c:htab_find_slot_with_hash [//lto1] 172,769,366 < /gcc/gimple.c:iterative_hash_gimple_type'2 (1062343x) [//lto1] 172,240,553 < /gcc/gimple.c:iterative_hash_canonical_type'2 (1385651x) [//lto1] 577,192,890 < /gcc/gimple.c:iterative_hash_gimple_type (3503598x) [//lto1] 272,475,796 < /gcc/gimple.c:visit'2 (2487924x) [//lto1] 5,719,882,429 < /gcc/gimple.c:gimple_type_hash (54720792x) [//lto1] 220,431,173 < /gcc/gimple.c:iterative_hash_canonical_type (1878732x) [//lto1] 1,049,746,336 < /gcc/gimple.c:visit (10902158x) [//lto1] 1,366,941,564 * /libiberty/hashtab.c:htab_find_slot'2 [//lto1] 1,663,235,593 < /gcc/gimple.c:gimple_register_canonical_type (1841890x) [//lto1] 9,524,617,674 < /gcc/lto-streamer-in.c:lto_input_location (11940149x) [//lto1] 88,359,773,304 < /gcc/gimple.c:gimple_register_type_1 (6184225x) [//lto1] 919,314,384 < /gcc/tree.c:build_int_cst_wide (2665535x) [//lto1] 337,283,088 < /gcc/cgraph.c:cgraph_get_node_or_alias (2410404x) [//lto1] 1,856,067,526 < /gcc/lto/lto.c:remember_with_vars (10704387x) [//lto1] 265,696,672 < /gcc/lto-symtab.c:lto_symtab_register_decl (2471602x) [//lto1] 1,020,331,990 < /gcc/lto-symtab.c:lto_symtab_get (10402341x) [//lto1] 952,544,538 * /libiberty/hashtab.c:htab_find_slot [//lto1] So gimple_type_hash (54 million), input_locaiton and remember_with_vars (with about 10 million) seems to be major (ab)users of hashing now. For malloc abuse, the major source is pointer_map_create (66 million calls), and vec_heap_o_reserve_1 (23 million) and obstack_begin (22 million) that leads to... 30,424,353,893 < /gcc/gimple.c:gimple_type_eq (18852945x) [//lto1] 5,578,574,652 < /gcc/gimple.c:gimple_type_hash (3452343x) [//lto1] 401,735,124 * /gcc/pointer-set.c:pointer_map_create [//lto1]
... and 7,456,601,134 < /gcc/gimple.c:gimple_type_eq (18852945x) [//lto1] 1,384,102,312 < /gcc/gimple.c:gimple_type_hash (3452343x) [//lto1] 936,822,402 * ???:_obstack_begin [/lib64/libc-2.11.1.so]
The stream in oprofile is now quite changed:

  33258  9.6313  lto1            htab_find_slot_with_hash
  29679  8.5949  lto1            lto_input_tree
  18338  5.3106  lto1            gt_ggc_mx_lang_tree_node
  15723  4.5533  lto1            ggc_set_mark
  15109  4.3755  lto1            inflate_fast
  13883  4.0204  lto1            ht_lookup_with_hash
  12957  3.7523  lto1            pointer_map_insert
  12433  3.6005  libc-2.11.1.so  memset
  8661   2.5082  lto1            lto_input_uleb128
  8584   2.4859  libc-2.11.1.so  _int_malloc
  6832   1.9785  lto1            ggc_internal_alloc_stat
  6722   1.9467  lto1            ht_lookup

We do have nice improvements in merging and streaming efficiency. Still, burning over 10% in hashing doesn't seem quite reasonable. I am not sure if most of the htab overhead is still the type merging, given that the rest of it is off the profile. It may be something stupid, like the file name hash, which is queried every time the file changes in the location. I should probably re-do the callgraph profile later next week. I do have some extra patches to reduce uleb streaming overhead and to make lto_input_tree a bit more streamlined, which might help a little. I am not sure how much real room is left for simple optimizations in this direction and how much we really need to look into streaming fewer trees.

 garbage collection    :  16.29 ( 6%) usr   0.02 ( 0%) sys  16.33 ( 6%) wall       0 kB ( 0%) ggc
 ipa lto decl in       :  76.15 (28%) usr   2.96 (21%) sys  79.33 (28%) wall  722892 kB (44%) ggc
 ipa lto decl out      :  83.36 (31%) usr   4.58 (32%) sys  88.37 (31%) wall       0 kB ( 0%) ggc
 ipa lto decl merge    :  14.59 ( 5%) usr   0.00 ( 0%) sys  14.64 ( 5%) wall     801 kB ( 0%) ggc
 inline heuristics     :  40.95 (15%) usr   0.19 ( 1%) sys  41.40 (14%) wall  241725 kB (15%) ggc

Memory needed is down, too, at about 4.3GB (in 64bit compilation).
GIMPLE type table: size 1048573, 570402 elements, 5098430 searches, 3158421 collisions (ratio: 0.619489)
GIMPLE type hash table: size 4194301, 1441169 elements, 44401918 searches, 37071081 collisions (ratio: 0.834898)
GIMPLE canonical type table: size 65521, 49079 elements, 896788 searches, 575628 collisions (ratio: 0.641877)
GIMPLE canonical type hash table: size 1048573, 524811 elements, 2845518 searches, 2279153 collisions (ratio: 0.800962)
[WPA] Compression: 424774798 input bytes, 1619588170 uncompressed bytes (ratio: 3.812816)
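For anyone comparing these dumps across runs: the printed ratios are consistent with plain quotients, collisions divided by searches for the hash tables and uncompressed bytes divided by input bytes for the compression line. A tiny sketch that reproduces the figures from this comment (the helper name ratio is made up):

```cpp
#include <cstdint>

// Quotient of two counters as a double, e.g. collisions / searches
// or uncompressed bytes / input bytes.
inline double ratio(std::uint64_t numerator, std::uint64_t denominator) {
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Values copied from the statistics in this comment.
inline double type_table_ratio()      { return ratio(3158421u, 5098430u); }
inline double canonical_table_ratio() { return ratio(575628u, 896788u); }
inline double compression_ratio()     { return ratio(1619588170u, 424774798u); }
```

A collision ratio above 1, as in the earlier dump for the GIMPLE type table, simply means more than one probe collision per search on average.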
Today I noticed by an accident that the following hack: Index: lto-streamer-out.c =================================================================== --- lto-streamer-out.c (revision 174547) +++ lto-streamer-out.c (working copy) @@ -1135,15 +1288,15 @@ lto_output_tree_or_ref (ob, BINFO_OFFSET (expr), ref_p); lto_output_tree_or_ref (ob, BINFO_VTABLE (expr), ref_p); - lto_output_tree_or_ref (ob, BINFO_VIRTUALS (expr), ref_p); + /*lto_output_tree_or_ref (ob, BINFO_VIRTUALS (expr), ref_p);*/ lto_output_tree_or_ref (ob, BINFO_VPTR_FIELD (expr), ref_p); output_uleb128 (ob, VEC_length (tree, BINFO_BASE_ACCESSES (expr))); FOR_EACH_VEC_ELT (tree, BINFO_BASE_ACCESSES (expr), i, t) lto_output_tree_or_ref (ob, t, ref_p); - lto_output_tree_or_ref (ob, BINFO_INHERITANCE_CHAIN (expr), ref_p); - lto_output_tree_or_ref (ob, BINFO_SUBVTT_INDEX (expr), ref_p); + /* Backend do not care about BINFO_INHERITANCE_CHAIN and BINFO_SUBVTT_INDEX. + */ lto_output_tree_or_ref (ob, BINFO_VPTR_INDEX (expr), ref_p); } @@ -2014,7 +2167,7 @@ lto_output_tree_ref (ob, t); /* Output the head of the arguments list. */ - lto_output_tree_ref (ob, DECL_ARGUMENTS (function)); + lto_output_chain (ob, DECL_ARGUMENTS (function), true); /* Output all the SSA names used in the function. 
*/ output_ssa_names (ob, fn); Index: lto-streamer-in.c =================================================================== --- lto-streamer-in.c (revision 174547) +++ lto-streamer-in.c (working copy) @@ -2308,7 +2438,7 @@ while (t); BINFO_OFFSET (expr) = lto_input_tree (ib, data_in); - BINFO_VTABLE (expr) = lto_input_tree (ib, data_in); + /*BINFO_VTABLE (expr) = lto_input_tree (ib, data_in);*/ BINFO_VIRTUALS (expr) = lto_input_tree (ib, data_in); BINFO_VPTR_FIELD (expr) = lto_input_tree (ib, data_in); @@ -2323,8 +2453,6 @@ } } - BINFO_INHERITANCE_CHAIN (expr) = lto_input_tree (ib, data_in); - BINFO_SUBVTT_INDEX (expr) = lto_input_tree (ib, data_in); BINFO_VPTR_INDEX (expr) = lto_input_tree (ib, data_in); } Reduces memory usage from 4.4GB to 2.7GB, so almost halves it and proportionally improves compilation speed. The effect is disabling type based devirtualization. The difference is amount of IL sreamed. W/o hack > [WPA] Compression: 430817772 input bytes, 2004640654 uncompressed bytes (ratio: 4.653106) > [WPA] Size of mmap'd section decls: 267817970 bytes > [WPA] Size of mmap'd section function_body: 144808174 bytes > ipa lto decl in : 74.90 (30%) usr 2.38 (19%) sys 77.51 (29%) wall 722892 kB (44%) ggc (ggc memory info wraps around 4GB limit, have patch for that) With hack: > [WPA] Compression: 308616744 input bytes, 1236371760 uncompressed bytes (ratio: 4.006172) > [WPA] Size of mmap'd section decls: 147396203 bytes > [WPA] Size of mmap'd section function_body: 144662716 bytes > ipa lto decl in : 38.85 (23%) usr 1.18 (12%) sys 40.12 (23%) wall 2674626 kB (75%) ggc The node stats with the patch are as follows: identifier_node 505095 tree_list 1809449 integer_type 175310 pointer_type 1198885 reference_type 65356 array_type 96153 record_type 729335 union_type 14171 function_type 120632 method_type 504881 integer_cst 587216 string_cst 204367 function_decl 909919 label_decl 261908 field_decl 1278114 var_decl 87787 const_decl 327835 parm_decl 1653719 type_decl 771617 
result_decl 559971 debug_expr_decl 147434 constructor 162322 nop_expr 531950 addr_expr 920865 tree_binfo 1013612 (to be compared with my previous stats) Heap vector stats: ipa-prop.c:2053 (ipa_node_duplication_hook) 540408: 0.8% 1046048 21339: 0.2% ipa-inline-analysis.c:2008 (inline_merge_summary 1697908: 2.5% 3086804 99582: 1.1% ipa-reference.c:185 (set_reference_optimization_ 6122784: 9.0% 10353528 10: 0.0% lto-cgraph.c:113 (lto_cgraph_encoder_encode) 6485840: 9.5% 10924352 22118: 0.2% ipa-ref.c:59 (ipa_record_reference) 16005792:23.5% 20789048 534854: 6.0% ipa-inline-analysis.c:647 (inline_summary_alloc) 17904344:26.3% 35257432 11486: 0.1% passes.c:1893 (execute_one_pass) 18076256:26.5% 20971480 474948: 5.3% Total 68129708 8892582 GGC stats: ipa-inline-analysis.c:841 (inline_node_duplicati 0: 0.0% 42428: 0.0% 37876224: 2.3% 2058852: 0.6% 232982 gimple.c:4177 (iterative_hash_gimple_type) 43510016: 2.8% 0: 0.0% 0: 0.0% 0: 0.0% 2719376 lto-symtab.c:156 (lto_symtab_register_decl) 50215704: 3.3% 0: 0.0% 0: 0.0% 0: 0.0% 896709 lto-section-in.c:471 (lto_new_in_decl_state) 165360: 0.0% 0: 0.0% 51424080: 3.2% 0: 0.0% 429912 cgraph.c:1008 (cgraph_create_edge_1) 0: 0.0% 0: 0.0% 77585352: 4.8% 0: 0.0% 746013 lto-streamer-in.c:2477 (lto_input_ts_constructor 34780240: 2.3% 67555760: 8.4% 45650928: 2.8% 33677352:10.4% 271362 ipa-inline-analysis.c:643 (inline_summary_alloc) 0: 0.0% 0: 0.0% 85235448: 5.3% 18126584: 5.6% 1 ipa-ref.c:54 (ipa_record_reference) 0: 0.0% 171658064:21.4% 85633072: 5.3% 68326696:21.0% 554106 lto-streamer-in.c:1934 (lto_materialize_tree) 90241344: 5.9% 0: 0.0% 11233544: 0.7% 5872: 0.0% 1013612 lto/lto.c:217 (lto_read_in_decl_state) 333288: 0.0% 0: 0.0% 130600080: 8.1% 24601136: 7.6% 3009384 toplev.c:1027 (realloc_for_line_map) 0: 0.0% 167815168:20.9% 167778304:10.4% 67182592:20.7% 14 tree.c:1223 (build_int_cst_wide) 200129008:13.0% 0: 0.0% 2046496: 0.1% 66567480:20.5% 40217 cgraph.c:457 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 226542712:14.0% 0: 0.0% 
765347 lto-streamer-in.c:1939 (lto_materialize_tree) 1077917488:70.1% 0: 0.0% 540532272:33.4% 28671712: 8.8% 12277142 Total 1537795379 803354140 1619917572 325016043 27622283 source location Garbage Freed Leak Overhead Times Honza
Martin suggested ignoring BINFOs without FLAG_2 set. It doesn't seem to make much difference:

[WPA] Compression: 430287537 input bytes, 1997250286 uncompressed bytes (ratio: 4.641664)
[WPA] Size of mmap'd section decls: 267483492 bytes
 ipa lto decl in       :  73.75 (29%) usr   2.37 (17%) sys  76.27 (28%) wall  745752 kB (45%) ggc
New build failure with "gold" and gcc 4.7.0 20110615: ake[6]: Entering directory `/var/tmp/mozilla-central/moz-build-dir/js/src/shell' js.cpp c++ -o js.o -c -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux3.0\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. -I../../../dist/include -I../../../dist/include/nsprpub -I/var/tmp/mozilla-central/moz-build-dir/dist/include/nspr -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -fprofile-generate -O3 -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/js.pp /var/tmp/mozilla-central/js/src/shell/js.cpp jsworkers.cpp c++ -o jsworkers.o -c -I../../../dist/system_wrappers_js -include /var/tmp/mozilla-central/js/src/config/gcc_hidden.h -DEXPORT_JS_API -DOSTYPE=\"Linux3.0\" -DOSARCH=Linux -I/var/tmp/mozilla-central/js/src -I.. -I/var/tmp/mozilla-central/js/src/shell -I. 
-I../../../dist/include -I../../../dist/include/nsprpub -I/var/tmp/mozilla-central/moz-build-dir/dist/include/nspr -fPIC -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -fprofile-generate -O3 -DMOZILLA_CLIENT -include ../js-confdefs.h -MD -MF .deps/jsworkers.pp /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘void js::workers::MainQueue::destroy(JSContext*)’: /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:371:16: warning: deleting object of polymorphic class type ‘js::workers::MainQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor] /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘bool js::workers::ThreadPool::start(JSContext*)’: /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:511:20: warning: deleting object of polymorphic class type ‘js::workers::WorkerQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor] /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In member function ‘void js::workers::ThreadPool::shutdown(JSContext*)’: /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:548:16: warning: deleting object of polymorphic class type ‘js::workers::WorkerQueue’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor] /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In static member function ‘static void js::workers::Worker::jsFinalize(JSContext*, JSObject*)’: /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:690:20: warning: deleting object of polymorphic class type ‘js::workers::Worker’ which has non-virtual destructor might cause undefined 
behaviour [-Wdelete-non-virtual-dtor] /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp: In static member function ‘static js::workers::Worker* js::workers::Worker::create(JSContext*, js::workers::WorkerParent*, JSString*, JSObject*)’: /var/tmp/mozilla-central/js/src/shell/jsworkers.cpp:1073:16: warning: deleting object of polymorphic class type ‘js::workers::Worker’ which has non-virtual destructor might cause undefined behaviour [-Wdelete-non-virtual-dtor] In file included from /var/tmp/mozilla-central/js/src/shell/js.cpp:97:0: /var/tmp/mozilla-central/js/src/jsobjinlines.h: In member function ‘void JSObject::setArrayLength(uint32)’: /var/tmp/mozilla-central/js/src/jsobjinlines.h:367:24: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] /usr/bin/python2.7 /var/tmp/mozilla-central/js/src/config/pythonpath.py -I../config /var/tmp/mozilla-central/js/src/config/expandlibs_exec.py --uselist -- c++ -o js -fno-rtti -fno-exceptions -Wall -Wpointer-arith -Woverloaded-virtual -Wsynth -Wno-ctor-dtor-privacy -Wno-non-virtual-dtor -Wcast-align -Wno-invalid-offsetof -Wno-variadic-macros -Werror=return-type -pedantic -Wno-long-long -march=native -ffunction-sections -fdata-sections -fno-strict-aliasing -pthread -pipe -DNDEBUG -DTRIMMED -fprofile-generate -O3 js.o jsworkers.o -lpthread -fprofile-generate -Wl,-rpath-link,/bin -Wl,-rpath-link,/var/tmp/mozilla-central/moz-build-dir/dist/lib -L../../../dist/bin -L../../../dist/lib -L/var/tmp/mozilla-central/moz-build-dir/dist/lib -lplds4 -lplc4 -lnspr4 -lpthread -ldl ../editline/libeditline.a ../libjs_static.a -ldl /var/tmp/mozilla-central/moz-build-dir/js/src/shell/jsworkers.o:jsworkers.cpp:function js::workers::Worker::processOneEvent(): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoRequest::~JSAutoRequest()' is not defined locally 
/var/tmp/mozilla-central/moz-build-dir/js/src/shell/jsworkers.o:jsworkers.cpp:function js::workers::ThreadPool::start(JSContext*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoSuspendRequest::JSAutoSuspendRequest(JSContext*)' is not defined locally ../libjs_static.a(jsapi.o):jsapi.cpp:function StopRequest(JSContext*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::AutoLockGC::~AutoLockGC()' is not defined locally ../libjs_static.a(jsapi.o):jsapi.cpp:function JS_ConvertArgumentsVA: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ../libjs_static.a(jsapi.o):jsapi.cpp:function JS_New: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ../libjs_static.a(jsarray.o):jsarray.cpp:function array_toSource(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::StringBuffer::StringBuffer(JSContext*)' is not defined locally ../libjs_static.a(jsarray.o):jsarray.cpp:function array_toString_sub(JSContext*, JSObject*, int, JSString*, js::Value*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'js::StringBuffer::StringBuffer(JSContext*)' is not defined locally ../libjs_static.a(jsemit.o):jsemit.cpp:function BindNameToSlot(JSContext*, JSCodeGenerator*, JSParseNode*): warning: relocation refers to discarded section 
/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ../libjs_static.a(jsemit.o):jsemit.cpp:function BindNameToSlot(JSContext*, JSCodeGenerator*, JSParseNode*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ../libjs_static.a(jsfun.o):jsfun.cpp:function Function(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ../libjs_static.a(jsfun.o):jsfun.cpp:function Function(JSContext*, unsigned int, js::Value*): warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'JSAutoByteString::~JSAutoByteString()' is not defined locally ... Using the bfd linker instead of "gold" seems to work. gcc-4.6.1 also works fine.
Please note that this error only happens during a profiled build. Normal build seems to be OK.
(In reply to comment #100) > Please note that this error only happens during a profiled build. > Normal build seems to be OK. FWIW: https://bugzilla.mozilla.org/show_bug.cgi?id=664387
Jan, this is caused by: commit 8c1fce46fc02e43e82b05f49894690133a1bcdcf Author: hubicka <hubicka@138bc75d-0d04-0410-961f-82ee72b054a4> Date: Fri Jun 10 20:06:48 2011 +0000 Reverting the commit "fixes" the problem.
Even with 8c1fce46fc0 reverted libxul fails to link during a profiledbuild. Normal build is fine. with bfd: /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: ../../layout/ipc/RenderFrameParent.o: relocation R_X86_64_PC32 against undefined hidden symbol `nsRefPtr<mozilla::layers::ImageContainer>::~nsRefPtr()' can not be used when making a shared object /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: final link failed: Bad value with gold: /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../layout/ipc/RenderFrameParent.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../content/events/src/nsEventStateManager.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../content/xul/templates/src/nsRuleNetwork.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../gfx/thebes/GLContextProviderGLX.o: requires dynamic reloc which may overflow at runtime; recompile with -fPIC /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvlatin/nsUnicodeToUCS2BE.o:nsUnicodeToUCS2BE.cpp:function vtable for nsUnicodeToUTF16BE: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsUnicodeToUTF16BE::~nsUnicodeToUTF16BE()' is not defined locally 
/var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvlatin/nsUnicodeToUCS2BE.o:nsUnicodeToUCS2BE.cpp:function vtable for nsUnicodeToUTF16LE: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsUnicodeToUTF16LE::~nsUnicodeToUTF16LE()' is not defined locally /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvcn/nsGBKToUnicode.o:nsGBKToUnicode.cpp:function vtable for nsGBKToUnicode: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsGBKToUnicode::~nsGBKToUnicode()' is not defined locally /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../intl/uconv/ucvcn/nsGBKToUnicode.o:nsGBKToUnicode.cpp:function vtable for nsGB18030ToUnicode: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'nsGB18030ToUnicode::~nsGB18030ToUnicode()' is not defined locally /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../parser/htmlparser/src/nsHTMLTokens.o:nsHTMLTokens.cpp:function vtable for CAttributeToken: warning: relocation refers to discarded section /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: hidden symbol 'CAttributeToken::~CAttributeToken()' is not defined locally /var/tmp/mozilla-central/moz-build-dir/toolkit/library/../../layout/generic/nsGfxScrollFrame.o:nsGfxScrollFrame.cpp:function vtable for nsHTMLScrollFrame: warning: relocation refers to discarded section ...
> Even with 8c1fce46fc0 reverted libxul fails to link during
> a profiledbuild. Normal build is fine.
I haven't really tested profiledbuild for a while, so I will check. Last time I tried, we were able to build libxul but still had problems building one of the later libraries because of the COMDAT issues. I filed a Mozilla PR for that (the problem really is including some classes but not linking with their implementation).
What worked well for me is to profile w/o LTO and do the final build with LTO. This is the recommended way anyway, as an LTO -fprofile-generate build is unnecessarily expensive. What is the official way of building Mozilla with FDO?
Does the non-FDO problem persist for you? The Jun 10 commit was part of a longer series of alias rewrites and I fixed some of the fallout afterwards (and was able to build Mozilla). I didn't see the particular problem you report, however.
Honza
(In reply to comment #104)
> > Even with 8c1fce46fc0 reverted libxul fails to link during
> > a profiledbuild. Normal build is fine.
>
> I didn't really tested profiledbuild for a while, so I will check.
> Last time I tried we was able to build libxul but still had problems
> building one of later libraries because of the COMDAT issues. I filled
> Mozilla PR for that (the problem really is including some classes but not
> linking with their implementation).
>
> What worked well for me is to profile w/o LTO and LTO final build. This is
> recommended way anyway as LTO -fprofile-generae build is unnecesarily
> expensive.
Yes, that's how I run things normally, too.
> What is the official way of building mozilla with FDO?
Here is what I use:
make -f client.mk profiledbuild
with the following appended to your .mozconfig:
ac_add_options --enable-profile-guided-optimization
mk_add_options PROFILE_GEN_SCRIPT=/home/markus/run-firefox.sh
~ % cat run-firefox.sh
#!/bin/sh
export NO_EM_RESTART=1
sudo -u markus $OBJDIR/dist/bin/firefox -no-remote
This will start the instrumented firefox. Use it for some time. After you close it, the final -fprofile-use build starts.
> Does the non-FDO problem persist for you? The Jul 10 commit was part of
> longer series of alias rewrite and I fixed some of fallout afterwards (and
> was able to build mozilla). Didn't see the particular problem you report
> however.
I only see the problems during an FDO build; non-FDO is fine. (But since it turned out that both issues have nothing to do with LTO, maybe it would be better to file a new bug for them?)
I've opened a new bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49533 with a patch that fixes the issue seen in Comment 99.
Now my build dies on what appears to be configure confusion:
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:43:17: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:52:26: error: 'read' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:52:26: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:62:39: error: 'write' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:62:39: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:74:7: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:74:7: error: invalid type in declaration before ';' token
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:76:7: error: 'close' was not declared in this scope
/abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:76:7: error: invalid type in declaration before ';' token
While I could definitely get around this by adding the proper #includes, it seems that things simply get misconfigured. Martin, you mentioned a similar problem earlier; perhaps you already have a solution?
(In reply to comment #107)
> Now my build dies on what appears to be configure confussion:
> /abuild/jh/mozilla-central2/mozilla-central/ipc/chromium/src/base/file_util_linux.cc:43:17:
> error: 'close' was not declared in this scope
Actually, I think this was caused by the removal of the #include of unistd.h in gthr-posix.h, which means the version of Mozilla you are trying to use has not been updated for that change.
Yep. See: http://gcc.gnu.org/viewcvs?view=revision&revision=176335 http://thread.gmane.org/gmane.comp.gcc.devel/121989
(In reply to comment #107) > > Martin, you mentioned similar problem earlier, perhaps you already have > solution? I went for adding the includes. I wasn't looking into dependencies in much detail and ended up just adding #include <unistd.h> to: - ipc/chromium/src/base/file_util.cc - ipc/chromium/src/base/message_pump_libevent.cc - ipc/chromium/src/base/file_util_linux.cc - toolkit/crashreporter/client/crashreporter_gtk_common.cpp However, I also suspected some configure problem because I also had to tweak #if's in ipc/chromium/src/base/time_posix.cc. The patch that I use to do this is at http://labs.suse.cz/mjambor/undefined_and_pp_errors.diff In order to LTO build mozilla I currently need this one, a patch adding attribute used to various places I got from you and a simple patch fixing mozilla bug 652563.
Mozilla now builds for me with slim LTO objects, i.e. with -flto=24 -fuse-linker-plugin -fno-fat-lto-objects. One needs ar/nm/ranlib that work with slim LTO. I simply set PATH to a directory with the following scripts:
jh@evans:/abuild/jh/trunk-install/bin> cat nm
#!/bin/sh
/usr/bin/nm --plugin /abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so $*
jh@evans:/abuild/jh/trunk-install/bin> cat ar
#!/bin/sh
cmd=$1
shift
/abuild/jh/trunk-install/bin/ar-with-plugin $cmd --plugin /abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so $*
jh@evans:/abuild/jh/trunk-install/bin> cat ranlib
#!/bin/sh
jh@evans:/abuild/jh/trunk-install/bin>
If I were not too lazy to rebuild ranlib, I think it exists with plugin support now, too. Simply disabling it was, however, equally easy. I will do some benchmarks of build time and disk usage. The resulting binary works too, BTW :)
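The wrappers above can also be written as shell functions. This is only a minimal sketch, not the scripts actually used: LTO_PLUGIN, REAL_NM and REAL_AR are hypothetical variables introduced for illustration (their defaults copy the paths quoted above), and "$@" replaces $* so that arguments containing spaces survive.

```sh
#!/bin/sh
# Hypothetical configuration variables; defaults mirror the paths quoted above.
LTO_PLUGIN=${LTO_PLUGIN:-/abuild/jh/trunk-install/libexec/gcc/x86_64-unknown-linux-gnu/4.7.0/liblto_plugin.so}
REAL_NM=${REAL_NM:-/usr/bin/nm}
REAL_AR=${REAL_AR:-/abuild/jh/trunk-install/bin/ar-with-plugin}

# nm: put --plugin in front of the caller's arguments.
nm_with_plugin() {
    "$REAL_NM" --plugin "$LTO_PLUGIN" "$@"
}

# ar: the operation keyword (e.g. "cr") must stay first, so the plugin
# option goes right after it.
ar_with_plugin() {
    cmd=$1
    shift
    "$REAL_AR" "$cmd" --plugin "$LTO_PLUGIN" "$@"
}

# ranlib: intentionally a no-op, as in the empty script above.
ranlib_noop() {
    :
}
```

Where the installed GCC provides gcc-ar, gcc-nm and gcc-ranlib, those may be the simpler option, since they pass the plugin themselves.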
OK, the problem turns out to be a configure issue. The configure script greps the asm output, and with slim LTO it does not find what it expects there, which disables hidden visibility. No surprise this leads to a performance disaster. I use the following hack:
diff -r 06b2977afb85 configure.in
--- a/configure.in Fri Sep 09 23:25:02 2011 -0400
+++ b/configure.in Wed Sep 28 15:30:56 2011 +0200
@@ -3035,7 +3035,7 @@
 int foo __attribute__ ((visibility ("hidden"))) = 1;
 EOF
 ac_cv_visibility_hidden=no
- if ${CC-cc} -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+ if ${CC-cc} -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
 if egrep '\.(hidden|private_extern).*foo' conftest.s >/dev/null; then
 ac_cv_visibility_hidden=yes
 fi
@@ -3051,7 +3051,7 @@
 int foo __attribute__ ((visibility ("default"))) = 1;
 EOF
 ac_cv_visibility_default=no
- if ${CC-cc} -fvisibility=hidden -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+ if ${CC-cc} -fvisibility=hidden -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
 if ! egrep '\.(hidden|private_extern).*foo' conftest.s >/dev/null; then
 ac_cv_visibility_default=yes
 fi
@@ -3070,7 +3070,7 @@
 int foo_default = 1;
 EOF
 ac_cv_visibility_pragma=no
- if ${CC-cc} -Werror -S conftest.c -o conftest.s >/dev/null 2>&1; then
+ if ${CC-cc} -Werror -S -fno-lto conftest.c -o conftest.s >/dev/null 2>&1; then
 if egrep '\.(hidden|private_extern).*foo_hidden' conftest.s >/dev/null; then
 if ! egrep '\.(hidden|private_extern).*foo_default' conftest.s > /dev/null; then
 ac_cv_visibility_pragma=yes
@@ -3092,7 +3092,7 @@
 }
 EOF
 ac_cv_have_visibility_class_bug=no
- if ! ${CXX-g++} ${CXXFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -S -o conftest.S conftest.c > /dev/null 2>&1 ; then
+ if ! ${CXX-g++} ${CXXFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -S -fno-lto -o conftest.S conftest.c > /dev/null 2>&1 ; then
 ac_cv_have_visibility_class_bug=yes
 else
 if test `egrep -c '@PLT|\\$stub' conftest.S` = 0; then
@@ -3116,7 +3116,7 @@
 }
 EOF
 ac_cv_have_visibility_builtin_bug=no
- if ! ${CC-cc} ${CFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -O2 -S -o conftest.S conftest.c > /dev/null 2>&1 ; then
+ if ! ${CC-cc} ${CFLAGS} ${DSO_PIC_CFLAGS} ${DSO_LDOPTS} -O2 -S -fno-lto -o conftest.S conftest.c > /dev/null 2>&1 ; then
 ac_cv_have_visibility_builtin_bug=yes
 else
 if test `grep -c "@PLT" conftest.S` = 0; then
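The idea behind the hack can be shown in isolation. This is a minimal sketch, assuming a compiler that accepts -fno-lto. The function names asm_marks_hidden and probe_hidden_visibility are hypothetical and not part of Mozilla's configure. The probe compiles to assembly with LTO forced off, so the .hidden directive that configure greps for can actually appear in the output.

```sh
#!/bin/sh
# Hypothetical helper: succeed if the assembly file $1 marks symbol $2
# as hidden. Uses the same pattern as the egrep calls in configure.in.
asm_marks_hidden() {
    grep -E "\.(hidden|private_extern).*$2" "$1" >/dev/null
}

# Hypothetical probe: prints "yes" or "no". -fno-lto is passed explicitly
# so that real assembly is emitted even when CC carries
# -flto -fno-fat-lto-objects.
probe_hidden_visibility() {
    dir=$(mktemp -d) || return 1
    cat > "$dir/conftest.c" <<'EOF'
int foo __attribute__ ((visibility ("hidden"))) = 1;
EOF
    result=no
    if ${CC-cc} -Werror -S -fno-lto "$dir/conftest.c" -o "$dir/conftest.s" >/dev/null 2>&1 &&
       asm_marks_hidden "$dir/conftest.s" foo; then
        result=yes
    fi
    rm -rf "$dir"
    echo "$result"
}
```

Keeping the grep in its own function makes it possible to check the pattern against a hand-written assembly file without invoking a compiler.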
Even with PR47247 solved, the -fprofile-generate -flto build fails at
libbrowsercomps.so.ltrans23.ltrans.o:libbrowsercomps.so.ltrans23.o:function _ZTV17gfxUnknownSurface.local.706.2371: error: undefined reference to '_ZN11gfxASurface13BeginPrintingERK9nsAStringS2_'
-fprofile-generate -flto is stupid, since one can profile w/o LTO and get a much faster build. (We also need 15GB for the libxul link.) Still, it seems that we miss some optimization we ought not to miss.
So, a quick summary:
1) -g build is still blocked by the dwarf2out ICE.
2) Build with gold works, but only without -fprofile-generate. An FDO build is also possible, but -fprofile-generate needs -fno-lto (that makes a lot of sense, but we should still fix the bug on the GCC side).
3) With GNU LD, there is still a bug that blocks Mozilla LTO: http://sourceware.org/bugzilla/show_bug.cgi?id=13244
4) Slim LTO works well. Build times are about the same as for non-LTO. One needs the aforementioned configure hacks and ar/nm/ranlib wrappers.
Honza
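Point 2 amounts to a two-stage build: instrument without LTO, then rebuild with the profile and LTO. A rough sketch of the flag selection follows, using only flags mentioned in this bug. The function name fdo_stage_flags and the job count of 24 are illustrative assumptions.

```sh
#!/bin/sh
# Hypothetical helper: print the compiler flags for one FDO stage.
#   generate - instrumented build; LTO is switched off
#   use      - final build; profile feedback plus slim LTO
fdo_stage_flags() {
    case $1 in
        generate) echo "-fprofile-generate -fno-lto" ;;
        use)      echo "-fprofile-use -fprofile-correction -flto=24 -fuse-linker-plugin -fno-fat-lto-objects" ;;
        *)        echo "usage: fdo_stage_flags generate|use" >&2; return 1 ;;
    esac
}
```

It could be used in a mozconfig-style setup as, for example, export CXX="g++ $(fdo_stage_flags generate)" for the first stage and the same with "use" for the second.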
OK, the same errors also happen with the GNU LD build: http://sourceware.org/bugzilla/show_bug.cgi?id=13244 https://bugzilla.mozilla.org/show_bug.cgi?id=691053 I will analyze what happens with -fprofile-generate and gold, but I bet it all fails because we now take the address of the constructor and consequently the constructor is exported out of libxul, but the visibilities are wrong. Honza
Solving http://sourceware.org/bugzilla/show_bug.cgi?id=13245 should make that linker error with -flto -fprofile-generate go away.
"-flto=4 -fno-fat-lto-objects -fprofile-use -fprofile-correction" breaks at js/src/xpconnect/src/dombindings.cpp: ... In file included from /var/tmp/mozilla-central/js/src/xpconnect/src/dombindings.cpp:1109:0: ./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_Add(JSContext*, unsigned int, JS::Value*)’: ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL25HTMLOptionsCollection_AddEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL25HTMLOptionsCollection_AddEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_SetSelectedIndex(JSContext*, JSObject*, long, int, JS::Value*)’: ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_SetSelectedIndexEP9JSContextP8JSObjectliPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_SetSelectedIndexEP9JSContextP8JSObjectliPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot 
./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_GetSelectedIndex(JSContext*, JSObject*, long, JS::Value*)’: ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_GetSelectedIndexEP9JSContextP8JSObjectlPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL38HTMLOptionsCollection_GetSelectedIndexEP9JSContextP8JSObjectlPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_Item(JSContext*, unsigned int, JS::Value*)’: ./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN7mozilla3dom7bindingL26HTMLOptionsCollection_ItemEP9JSContextjPN2JS5ValueE’ found [enabled by default] ./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN7mozilla3dom7bindingL26HTMLOptionsCollection_ItemEP9JSContextjPN2JS5ValueE’ found [enabled by default] ./dombindings_gen.cpp: In member function ‘nsCOMPtr<nsIDOMNode>::~nsCOMPtr()’: ./dombindings_gen.cpp:546:1: warning: no coverage for function ‘_ZN8nsCOMPtrI10nsIDOMNodeED2Ev’ found [enabled by default] ./dombindings_gen.cpp: In function ‘mozilla::dom::binding::HTMLOptionsCollection_NamedItem(JSContext*, unsigned int, JS::Value*)’: ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL31HTMLOptionsCollection_NamedItemEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘arcs’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use 
-Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot ./dombindings_gen.cpp:546:1: error: The control flow of function ‘_ZN7mozilla3dom7bindingL31HTMLOptionsCollection_NamedItemEP9JSContextjPN2JS5ValueE’ does not match its profile data (counter ‘indirect_call’) [-Werror=coverage-mismatch] ./dombindings_gen.cpp:546:1: note: Use -Wno-error=coverage-mismatch to tolerate the mismatch but performance may drop if the function is hot cc1plus: some warnings being treated as errors
Probably a Mozilla bug. See: https://bugzilla.mozilla.org/show_bug.cgi?id=693563
Some up-to-date performance data. WPA peaks at 3.1GB in top now (3261 virt). Overall compile time is 4m32s real, 21m14 user. GGC memory is GC 2248537k -> 1727826k. WPA time report:
callgraph optimization : 1.68 ( 1%) usr 0.00 ( 0%) sys 1.70 ( 1%) wall 16008 kB (11%) ggc
varpool construction : 0.66 ( 0%) usr 0.02 ( 0%) sys 0.68 ( 0%) wall 55300 kB (39%) ggc
ipa cp : 1.70 ( 1%) usr 0.09 ( 1%) sys 1.79 ( 1%) wall 75845 kB (53%) ggc
ipa lto gimple out : 9.40 ( 6%) usr 0.91 (10%) sys 10.36 ( 6%) wall 0 kB ( 0%) ggc
ipa lto decl in : 45.99 (29%) usr 1.66 (19%) sys 47.95 (28%) wall 3285797 kB (2315%) ggc
ipa lto decl out : 35.61 (22%) usr 1.65 (19%) sys 37.23 (22%) wall 0 kB ( 0%) ggc
ipa lto cgraph I/O : 3.73 ( 2%) usr 0.22 ( 2%) sys 3.95 ( 2%) wall 621046 kB (438%) ggc
ipa lto decl merge : 5.75 ( 4%) usr 0.00 ( 0%) sys 5.75 ( 3%) wall 803 kB ( 1%) ggc
ipa lto cgraph merge : 2.79 ( 2%) usr 0.02 ( 0%) sys 2.81 ( 2%) wall 27731 kB (20%) ggc
inline heuristics : 31.32 (19%) usr 0.13 ( 1%) sys 31.48 (18%) wall 252282 kB (178%) ggc
TOTAL : 161.21 8.82 170.40 141952 kB
(I.e. WPA is 60% of overall compilation time; about 1/3 of it is streaming in, 1/3 streaming out, and 1/5th the inliner.)
oprofile of streaming in:
9467 6.8109 lto1 htab_find_slot_with_hash
9036 6.5008 lto1 inflate_fast
6608 4.7540 libc-2.11.1.so memset
6256 4.5008 libc-2.11.1.so _int_malloc
6243 4.4914 lto1 pointer_map_insert
5694 4.0965 lto1 lto_input_tree
5014 3.6072 lto1 gt_ggc_mx_lang_tree_node
4522 3.2533 lto1 streamer_read_tree_bitfields
4463 3.2108 lto1 ggc_set_mark
4087 2.9403 opreport /usr/bin/opreport
3661 2.6339 lto1 ggc_internal_alloc_stat
3475 2.5000 lto1 streamer_read_uhwi
2508 1.8043 lto1 gimple_type_eq
2418 1.7396 lto1 streamer_read_tree_body
2310 1.6619 libc-2.11.1.so memcpy
2292 1.6489 lto1 streamer_tree_cache_insert_1
2255 1.6223 libc-2.11.1.so memcmp
2119 1.5245 lto1 ht_lookup_with_hash
1902 1.3684 lto1 iterative_hash_hashval_t
1885 1.3561 lto1 lto_fixup_types
1884 1.3554 libc-2.11.1.so _int_free
1872 1.3468 lto1 uniquify_nodes
1842 1.3252 lto1 htab_expand
1825 1.3130 oprofiled /usr/bin/oprofiled
1813 1.3043 lto1 adler32
1734 1.2475 lto1 htab_hash_string
1509 1.0856 libc-2.11.1.so _IO_vfscanf
1470 1.0576 libc-2.11.1.so malloc_consolidate
The pointer map and htab time is still mostly type merging, I believe.
oprofile of inliner:
8772 37.9215 lto1 edge_badness
5532 23.9149 lto1 do_estimate_growth_1
1647 7.1200 lto1 update_caller_keys
1484 6.4154 lto1 can_inline_edge_p
744 3.2163 lto1 estimate_calls_size_and_time.isra.32
509 2.2004 lto1 estimate_edge_size_and_time.constprop.65
495 2.1399 lto1 fibheap_consolidate
267 1.1542 lto1 fibheap_extr_min_node
210 0.9078 lto1 cgraph_maybe_hot_edge_p
I.e. easy to handle by cutting down the amount of heap updating.
Stream out:
33711 19.7166 lto1 lto1 varpool_node_for_asm
13947 8.1572 lto1 lto1 decl_assembler_name_equal
8873 5.1896 lto1 lto1 pointer_map_insert
8765 5.1264 lto1 lto1 linemap_lookup
6809 3.9824 lto1 lto1 lto_output_tree
4931 2.8840 lto1 lto1 inflate_fast
4718 2.7594 lto1 lto1 streamer_write_uhwi_stream
3521 2.0593 lto1 lto1 streamer_tree_cache_insert_1
3340 1.9535 lto1 lto1 splay_tree_splay
3293 1.9260 lto1 lto1 streamer_pack_tree_bitfields
3210 1.8774 libc-2.11.1.so libc-2.11.1.so memcpy
3175 1.8570 libc-2.11.1.so libc-2.11.1.so _int_malloc
The assembler name lookups will go away with finishing the alias rewrite.
Oprofile of ltrans stage:
52827 3.3333 lto1 lto1 value_member
45691 2.8830 libc-2.11.1.so libc-2.11.1.so _int_malloc
42528 2.6835 lto1 lto1 bitmap_set_bit
41934 2.6460 oprofiled oprofiled /usr/bin/oprofiled
22353 1.4104 libc-2.11.1.so libc-2.11.1.so memset
21573 1.3612 lto1 lto1 htab_find_slot_with_hash
20936 1.3210 lto1 lto1 ggc_internal_alloc_stat
19608 1.2372 lto1 lto1 record_reg_classes.constprop.10
17423 1.0994 lto1 lto1 bitmap_bit_p
17195 1.0850 lto1 lto1 for_each_rtx_1
13504 0.8521 libc-2.11.1.so libc-2.11.1.so _int_free
12343 0.7788 lto1 lto1 bitmap_clear_bit
11826 0.7462 lto1 lto1 constrain_operands
The slowest of ltrans is:
garbage collection : 1.69 ( 2%) usr 0.01 ( 0%) sys 1.72 ( 2%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 1.52 ( 2%) usr 0.45 ( 9%) sys 1.94 ( 2%) wall 212002 kB (11%) ggc
ipa lto decl in : 1.61 ( 2%) usr 0.19 ( 4%) sys 1.81 ( 2%) wall 147115 kB ( 7%) ggc
cfg cleanup : 1.46 ( 2%) usr 0.03 ( 1%) sys 1.60 ( 2%) wall 5376 kB ( 0%) ggc
df live regs : 2.26 ( 3%) usr 0.03 ( 1%) sys 2.62 ( 3%) wall 0 kB ( 0%) ggc
tree VRP : 2.04 ( 2%) usr 0.05 ( 1%) sys 2.34 ( 2%) wall 126142 kB ( 6%) ggc
tree PTA : 1.97 ( 2%) usr 0.00 ( 0%) sys 2.43 ( 3%) wall 8733 kB ( 0%) ggc
tree PRE : 2.98 ( 3%) usr 0.07 ( 1%) sys 3.83 ( 4%) wall 64875 kB ( 3%) ggc
tree FRE : 1.50 ( 2%) usr 0.01 ( 0%) sys 1.98 ( 2%) wall 33609 kB ( 2%) ggc
expand : 4.11 ( 5%) usr 0.11 ( 2%) sys 4.85 ( 5%) wall 138280 kB ( 7%) ggc
CSE : 1.88 ( 2%) usr 0.04 ( 1%) sys 2.16 ( 2%) wall 2764 kB ( 0%) ggc
CPROP : 1.83 ( 2%) usr 0.04 ( 1%) sys 1.87 ( 2%) wall 21657 kB ( 1%) ggc
integrated RA : 6.84 ( 8%) usr 0.08 ( 2%) sys 7.30 ( 8%) wall 367479 kB (19%) ggc
reload : 2.47 ( 3%) usr 0.04 ( 1%) sys 2.82 ( 3%) wall 8783 kB ( 0%) ggc
reload CSE regs : 2.03 ( 2%) usr 0.01 ( 0%) sys 2.02 ( 2%) wall 19115 kB ( 1%) ggc
scheduling 2 : 3.08 ( 3%) usr 0.03 ( 1%) sys 3.14 ( 3%) wall 3942 kB ( 0%) ggc
final : 11.46 (13%) usr 1.06 (21%) sys 3.62 ( 4%) wall 40822 kB ( 2%) ggc
rest of compilation : 2.97 ( 3%) usr 0.87 (17%) sys 5.22 ( 5%) wall 60101 kB ( 3%) ggc
unaccounted todo : 1.35 ( 2%) usr 0.67 (13%) sys 2.37 ( 2%) wall 0 kB ( 0%) ggc
TOTAL : 89.65 5.08 95.59 1962376 kB
Final is surprisingly slow.
weakref reorg saves about 15 seconds, so we have total WPA time 145s and decl out at 19s (13%). Honza
With the inliner performance fix I am going to push out today, the situation looks as follows:
Execution times (seconds)
phase parsing : 606.20 (98%) usr 21.98 (99%) sys 641.28 (98%) wall 2164274 kB (100%) ggc
phase cgraph : 337.00 (55%) usr 18.52 (83%) sys 367.32 (56%) wall 88841 kB ( 4%) ggc
phase finalize : 10.21 ( 2%) usr 0.28 ( 1%) sys 10.50 ( 2%) wall 0 kB ( 0%) ggc
garbage collection : 33.12 ( 5%) usr 0.04 ( 0%) sys 33.21 ( 5%) wall 0 kB ( 0%) ggc
ipa cp : 3.52 ( 1%) usr 0.15 ( 1%) sys 3.67 ( 1%) wall 93737 kB ( 4%) ggc
ipa lto gimple out : 14.43 ( 2%) usr 1.38 ( 6%) sys 15.89 ( 2%) wall 0 kB ( 0%) ggc
ipa lto decl in : 221.85 (36%) usr 2.52 (11%) sys 225.61 (35%) wall 1153296 kB (53%) ggc
ipa lto decl out : 179.65 (29%) usr 8.60 (39%) sys 198.90 (31%) wall 0 kB ( 0%) ggc
ipa lto cgraph I/O : 4.59 ( 1%) usr 0.50 ( 2%) sys 5.09 ( 1%) wall 550051 kB (25%) ggc
ipa lto decl merge : 9.57 ( 2%) usr 0.00 ( 0%) sys 9.58 ( 1%) wall 291 kB ( 0%) ggc
ipa lto cgraph merge : 6.06 ( 1%) usr 0.00 ( 0%) sys 6.08 ( 1%) wall 14158 kB ( 1%) ggc
whopr wpa : 6.44 ( 1%) usr 0.06 ( 0%) sys 6.54 ( 1%) wall 2 kB ( 0%) ggc
whopr wpa I/O : 2.77 ( 0%) usr 8.03 (36%) sys 11.56 ( 2%) wall 0 kB ( 0%) ggc
ipa reference : 5.16 ( 1%) usr 0.08 ( 0%) sys 5.25 ( 1%) wall 0 kB ( 0%) ggc
ipa profile : 0.55 ( 0%) usr 0.00 ( 0%) sys 0.55 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 5.59 ( 1%) usr 0.02 ( 0%) sys 5.61 ( 1%) wall 0 kB ( 0%) ggc
parser (global) : 3.98 ( 1%) usr 0.04 ( 0%) sys 4.04 ( 1%) wall 0 kB ( 0%) ggc
inline heuristics : 94.38 (15%) usr 0.31 ( 1%) sys 94.90 (15%) wall 342900 kB (16%) ggc
tree CFG cleanup : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
callgraph verifier : 18.53 ( 3%) usr 0.08 ( 0%) sys 18.61 ( 3%) wall 0 kB ( 0%) ggc
varconst : 0.04 ( 0%) usr 0.03 ( 0%) sys 0.14 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 4.70 ( 1%) usr 0.10 ( 0%) sys 4.81 ( 1%) wall 0 kB ( 0%) ggc
TOTAL : 616.43 22.26 651.79 2165706 kB
So memory use is somewhat up (4GB compared to 3.2GB), but Mozilla grew a bit, too, so I think there are no important changes since my last report. Performance-wise we are in better shape than the 4.7 release (I will backport the fix; 4.7 needs over 10 minutes in the inliner), but we are still way too slow, with over 3 minutes needed for streaming in.
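Reports like the one above are easier to compare when the phases are sorted by wall time. The following is a rough sketch, assuming one report row per input line in the "name : N (P%) usr N (P%) sys N (P%) wall N kB (P%) ggc" layout. The function name rank_by_wall is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: read -ftime-report rows on stdin and print
# "<wall seconds> <phase name>", slowest phase first.
rank_by_wall() {
    # Drop the "(NN%)" columns so the remaining fields have fixed positions.
    sed 's/([ 0-9]*%)//g' |
    awk -F ':' 'NF == 2 && $2 ~ /wall/ {
        split($2, f, " ")            # f[1]=usr f[3]=sys f[5]=wall
        name = $1
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        printf "%s %s\n", f[5], name
    }' |
    LC_ALL=C sort -k1,1 -rn
}
```

The TOTAL row carries no "wall" label and is therefore skipped. Note that the "phase ..." rows are aggregates of the rows below them, so they will sort to the top.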
oprofile shows:
139188 15.6963 lto1 lto1 uniquify_nodes
66390 7.4868 lto1 lto1 estimate_edge_growth
52815 5.9560 lto1 lto1 VEC_edge_growth_cache_entry_base_length
47137 5.3157 lto1 lto1 iterative_hash_hashval_t
34037 3.8384 lto1 lto1 htab_find_slot_with_hash
33604 3.7895 lto1 lto1 bp_unpack_value
26584 2.9979 lto1 lto1 do_estimate_growth_1
21410 2.4144 lto1 lto1 ggc_set_mark
17124 1.9311 lto1 lto1 inflate_fast
14464 1.6311 lto1 lto1 streamer_read_uhwi
14204 1.6018 lto1 lto1 lookup_page_table_entry
11430 1.2890 libc-2.11.1.so libc-2.11.1.so memset
11405 1.2861 lto1 lto1 streamer_read_hwi_in_range
11286 1.2727 lto1 lto1 gt_ggc_mx_lang_tree_node
11017 1.2424 lto1 lto1 iterative_hash_gimple_type
10851 1.2237 lto1 lto1 pointer_map_insert
10674 1.2037 lto1 lto1 lto_input_tree
10536 1.1881 lto1 lto1 ht_lookup_with_hash
10269 1.1580 lto1 lto1 streamer_read_uchar
9972 1.1245 lto1 lto1 streamer_read_uchar
9089 1.0250 libc-2.11.1.so libc-2.11.1.so _int_malloc
9086 1.0246 lto1 lto1 alloc_page
6603 0.7446 lto1 lto1 VEC_edge_growth_cache_entry_base_index
Looks like uniquify_nodes got out of control?
Just for comparison, clang with -O4 runs only single threaded and does everything in memory (no streaming out). It uses 3.5GB of memory (peak) and takes 19 minutes to finish...
> Just for comparison, clang with -O4 runs only single threaded and does
> everything in memory (no streaming out). It uses 3.5GB of memory (peak) and
> takes 19 minutes to finish...
Interesting. Microsoft's compiler is also barely in 4GB space, right? Is it with debug info? I will try a non-WHOPR build to see how bad we are. The actual IL is about 1.5GB of the footprint (measuring GGC memory). I think a good part of the rest goes to mmap address space (the object files are rather large). Honza
(In reply to comment #122) > oprofile shows: > 139188 15.6963 lto1 lto1 > uniquify_nodes > 66390 7.4868 lto1 lto1 > estimate_edge_growth > 52815 5.9560 lto1 lto1 > VEC_edge_growth_cache_entry_base_length > 47137 5.3157 lto1 lto1 > iterative_hash_hashval_t > 34037 3.8384 lto1 lto1 > htab_find_slot_with_hash > 33604 3.7895 lto1 lto1 > bp_unpack_value > 26584 2.9979 lto1 lto1 > do_estimate_growth_1 > 21410 2.4144 lto1 lto1 > ggc_set_mark > 17124 1.9311 lto1 lto1 > inflate_fast > 14464 1.6311 lto1 lto1 > streamer_read_uhwi > 14204 1.6018 lto1 lto1 > lookup_page_table_entry > 11430 1.2890 libc-2.11.1.so libc-2.11.1.so memset > 11405 1.2861 lto1 lto1 > streamer_read_hwi_in_range > 11286 1.2727 lto1 lto1 > gt_ggc_mx_lang_tree_node > 11017 1.2424 lto1 lto1 > iterative_hash_gimple_type > 10851 1.2237 lto1 lto1 > pointer_map_insert > 10674 1.2037 lto1 lto1 > lto_input_tree > 10536 1.1881 lto1 lto1 > ht_lookup_with_hash > 10269 1.1580 lto1 lto1 > streamer_read_uchar > 9972 1.1245 lto1 lto1 > streamer_read_uchar > 9089 1.0250 libc-2.11.1.so libc-2.11.1.so _int_malloc > 9086 1.0246 lto1 lto1 alloc_page > 6603 0.7446 lto1 lto1 > VEC_edge_growth_cache_entry_base_index > > looks like uniquify_nodes got out of control? Well - the obvious possibly "slow" part of uniquify nodes is that it walks all fields of record/union types. So - do you have a more detailed profile of uniquify_nodes?
(In reply to comment #124) > > Just for comparison, clang with -O4 runs only single threaded and does > > everything in memory (no streaming out). It uses 3.5GB of memory (peak) and > > takes 19 minutes to finish... > > Interesting. Micsofot's compiler is also barely in 4GB space, right? IIRC Mozilla recently switched to a 64-bit toolchain on windows, because the 32-bit linker ran out of memory. So they are above 4GB already. > Is it with debug info? No.
(In reply to comment #126)
> (In reply to comment #124)
> > > Just for comparison, clang with -O4 runs only single threaded and does
> > > everything in memory (no streaming out). It uses 3.5GB of memory (peak) and
> > > takes 19 minutes to finish...
> >
> > Interesting. Micsofot's compiler is also barely in 4GB space, right?
>
> IIRC Mozilla recently switched to a 64-bit toolchain on windows, because the
> 32-bit linker ran out of memory. So they are above 4GB already.
There is unfortunately no cross-linker in MSVC, so you can't link 32-bit binaries with a 64-bit toolchain. We're in the process of switching to a 64-bit OS with a 32-bit toolchain, which will allow an extra gigabyte of address space. We've gone past the current 3GB limit a couple of times now, at which point we moved some stuff out of libxul. Before that, we hit the 2GB limit, at which point we used the /3GB option that allows for the extra GB.
> Well - the obvious possibly "slow" part of uniquify nodes is that it walks
> all fields of record/union types. So - do you have a more detailed profile
> of uniquify_nodes?
No, I will try to generate annotated sources then. I am a bit puzzled by this - looking at the code, there seems to be nothing inherently expensive in it.
Honza
OK, the slow part of uniquify_nodes is:
  /* Remove us from our main variant list if we are not the variant leader.  */
  if (TYPE_MAIN_VARIANT (t) != t)
    {
      tem = TYPE_MAIN_VARIANT (t);
      while (tem && TYPE_NEXT_VARIANT (tem) != t)
        tem = TYPE_NEXT_VARIANT (tem);
      if (tem)
        TYPE_NEXT_VARIANT (tem) = TYPE_NEXT_VARIANT (t);
      TYPE_NEXT_VARIANT (t) = NULL_TREE;
    }
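Why a loop this small can dominate a profile is easier to see in isolation. The following is a minimal model, not GCC code: `struct type_node`, `unlink_variant`, `worst_case_steps` and `MAX_VARIANTS` are hypothetical names, with the two pointers standing in for TYPE_MAIN_VARIANT and TYPE_NEXT_VARIANT. Since the variant list is singly linked, unlinking one variant means walking from the leader until its predecessor is found, so unlinking many variants of the same leader can cost quadratic time in the worst case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a tree type node; only the two links matter. */
struct type_node {
    struct type_node *main_variant;  /* models TYPE_MAIN_VARIANT */
    struct type_node *next_variant;  /* models TYPE_NEXT_VARIANT */
};

/* Unlink T from its leader's variant list, mirroring the loop quoted above.
   Returns how many list links were followed to find T's predecessor. */
static size_t unlink_variant(struct type_node *t)
{
    size_t steps = 0;
    if (t->main_variant != t) {
        struct type_node *tem = t->main_variant;
        while (tem && tem->next_variant != t) {
            tem = tem->next_variant;
            steps++;
        }
        if (tem)
            tem->next_variant = t->next_variant;
        t->next_variant = NULL;
    }
    return steps;
}

#define MAX_VARIANTS 4096

/* Chain N variants behind one leader, then unlink them starting from the
   far end of the list (the worst case). Returns the total links followed. */
static size_t worst_case_steps(size_t n)
{
    static struct type_node nodes[MAX_VARIANTS + 1];
    size_t i, steps = 0;

    assert(n <= MAX_VARIANTS);
    for (i = 0; i <= n; i++) {
        nodes[i].main_variant = &nodes[0];
        nodes[i].next_variant = (i < n) ? &nodes[i + 1] : NULL;
    }
    for (i = n; i >= 1; i--)
        steps += unlink_variant(&nodes[i]);
    return steps;
}
```

In this model, going from 1000 to 2000 variants roughly quadruples the links followed (499500 versus 1999000). That would fit the profile if some main variants carry very long variant lists, but whether that is what happens in the Mozilla build is not established here.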
After fixing one linker error, I can now build Mozilla with -flto-partition=none. It takes 11GB and 40 minutes, so there is space for improvement ;)
There are some obvious questions, like why IRA needs 63% of GGC memory, and VRP 23%.
Also the -flto-partition=none .text section is now 18% smaller. This is large enough to be declared a bug, but I am not sure how to track it.
Note that my machine has quite poor single-CPU performance, so the compile times are likely not comparable with the LLVM ones reported above (and I also use a debugging build).
ipa lto gimple in : 52.12 ( 2%) usr 3.68 ( 9%) sys 55.72 ( 2%) wall 2998249 kB (84%) ggc
ipa lto decl in : 225.68 ( 8%) usr 2.39 ( 6%) sys 228.17 ( 8%) wall 1124821 kB (31%) ggc
ipa lto cgraph I/O : 4.82 ( 0%) usr 0.44 ( 1%) sys 5.27 ( 0%) wall 684110 kB (19%) ggc
cfg construction : 3.01 ( 0%) usr 0.12 ( 0%) sys 3.29 ( 0%) wall 70205 kB ( 2%) ggc
cfg cleanup : 46.57 ( 2%) usr 0.41 ( 1%) sys 46.69 ( 2%) wall 75005 kB ( 2%) ggc
df live regs : 78.21 ( 3%) usr 0.25 ( 1%) sys 77.55 ( 3%) wall 0 kB ( 0%) ggc
alias analysis : 25.59 ( 1%) usr 0.12 ( 0%) sys 25.88 ( 1%) wall 474769 kB (13%) ggc
parser (global) : 8.62 ( 0%) usr 0.65 ( 2%) sys 10.00 ( 0%) wall 259389 kB ( 7%) ggc
inline heuristics : 87.23 ( 3%) usr 0.51 ( 1%) sys 88.41 ( 3%) wall 451358 kB (13%) ggc
integration : 50.61 ( 2%) usr 1.51 ( 4%) sys 52.67 ( 2%) wall 1479979 kB (41%) ggc
tree CFG cleanup : 46.68 ( 2%) usr 0.43 ( 1%) sys 48.09 ( 2%) wall 70493 kB ( 2%) ggc
tree VRP : 65.88 ( 2%) usr 0.73 ( 2%) sys 66.71 ( 2%) wall 862879 kB (24%) ggc
tree copy propagation : 22.30 ( 1%) usr 0.17 ( 0%) sys 22.11 ( 1%) wall 144298 kB ( 4%) ggc
tree PTA : 46.70 ( 2%) usr 0.06 ( 0%) sys 46.90 ( 2%) wall 100249 kB ( 3%) ggc
tree SSA rewrite : 19.16 ( 1%) usr 0.15 ( 0%) sys 19.09 ( 1%) wall 149347 kB ( 4%) ggc
tree SSA incremental : 27.75 ( 1%) usr 0.61 ( 1%) sys 27.86 ( 1%) wall 72307 kB ( 2%) ggc
tree operand scan : 57.17 ( 2%) usr 3.03 ( 7%) sys 59.92 ( 2%) wall 1296208 kB (36%) ggc
dominator optimization : 35.95 ( 1%) usr 0.21 ( 0%) sys 35.74 ( 1%) wall 311024 kB ( 9%) ggc
tree CCP : 31.61 ( 1%) usr 0.12 ( 0%) sys 31.17 ( 1%) wall 111169 kB ( 3%) ggc
tree PRE : 87.46 ( 3%) usr 0.60 ( 1%) sys 88.62 ( 3%) wall 538859 kB (15%) ggc
tree FRE : 47.37 ( 2%) usr 0.58 ( 1%) sys 45.89 ( 2%) wall 274455 kB ( 8%) ggc
tree aggressive DCE : 8.96 ( 0%) usr 0.22 ( 1%) sys 8.86 ( 0%) wall 137686 kB ( 4%) ggc
tree forward propagate : 10.28 ( 0%) usr 0.10 ( 0%) sys 10.33 ( 0%) wall 56466 kB ( 2%) ggc
tree slp vectorization : 25.42 ( 1%) usr 0.16 ( 0%) sys 25.50 ( 1%) wall 436119 kB (12%) ggc
complete unrolling : 5.81 ( 0%) usr 0.13 ( 0%) sys 6.07 ( 0%) wall 115165 kB ( 3%) ggc
tree vectorization : 1.44 ( 0%) usr 0.05 ( 0%) sys 1.36 ( 0%) wall 31337 kB ( 1%) ggc
tree iv optimization : 13.00 ( 0%) usr 0.08 ( 0%) sys 12.94 ( 0%) wall 185893 kB ( 5%) ggc
dominance computation : 48.61 ( 2%) usr 0.54 ( 1%) sys 47.65 ( 2%) wall 0 kB ( 0%) ggc
expand vars : 18.81 ( 1%) usr 0.09 ( 0%) sys 18.42 ( 1%) wall 167798 kB ( 5%) ggc
expand : 116.32 ( 4%) usr 0.61 ( 1%) sys 116.22 ( 4%) wall 1508612 kB (42%) ggc
forward prop : 23.01 ( 1%) usr 0.36 ( 1%) sys 23.43 ( 1%) wall 130825 kB ( 4%) ggc
CSE : 67.21 ( 2%) usr 0.23 ( 1%) sys 66.28 ( 2%) wall 44439 kB ( 1%) ggc
dead store elim1 : 20.47 ( 1%) usr 0.10 ( 0%) sys 20.83 ( 1%) wall 103309 kB ( 3%) ggc
dead store elim2 : 18.99 ( 1%) usr 0.18 ( 0%) sys 20.48 ( 1%) wall 140398 kB ( 4%) ggc
CPROP : 52.83 ( 2%) usr 0.33 ( 1%) sys 52.91 ( 2%) wall 336514 kB ( 9%) ggc
PRE : 30.60 ( 1%) usr 0.06 ( 0%) sys 30.51 ( 1%) wall 52724 kB ( 1%) ggc
CSE 2 : 37.89 ( 1%) usr 0.04 ( 0%) sys 38.88 ( 1%) wall 29785 kB ( 1%) ggc
combiner : 80.20 ( 3%) usr 0.23 ( 1%) sys 80.57 ( 3%) wall 400168 kB (11%) ggc
integrated RA : 191.13 ( 7%) usr 0.44 ( 1%) sys 190.64 ( 7%) wall 2328880 kB (65%) ggc
reload : 65.46 ( 2%) usr 0.09 ( 0%) sys 67.43 ( 2%) wall 193522 kB ( 5%) ggc
reload CSE regs : 56.71 ( 2%) usr 0.14 ( 0%) sys 56.49 ( 2%) wall 241394 kB ( 7%) ggc
thread pro- & epilogue : 14.43 ( 1%) usr 0.15 ( 0%) sys 14.97 ( 1%) wall 201098 kB ( 6%) ggc
final : 44.77 ( 2%) usr 2.80 ( 6%) sys 48.99 ( 2%) wall 367580 kB (10%) ggc
rest of compilation : 57.58 ( 2%) usr 6.23 (14%) sys 63.50 ( 2%) wall 337908 kB ( 9%) ggc
remove unused locals : 41.68 ( 2%) usr 0.15 ( 0%) sys 42.04 ( 1%) wall 333 kB ( 0%) ggc
TOTAL :2768.94 43.11 2814.85 3588723 kB
(In reply to comment #130) > There are some obvious questions, like why IRA needs 63% of GGC memory, > and VRP 23% > tree VRP : 65.88 ( 2%) usr 0.73 ( 2%) sys 66.71 >( 2%) wall 862879 kB (24%) ggc Is it possible to do this again with gathering statistics enabled? The only thing I can think of for this would be ASSERT_EXPRs and all the rewriting involved for them. > tree slp vectorization : 25.42 ( 1%) usr 0.16 ( 0%) sys 25.50 > ( 1%) wall 436119 kB (12%) ggc This 12% also seems excessive. > CPROP : 52.83 ( 2%) usr 0.33 ( 1%) sys 52.91 > ( 2%) wall 336514 kB ( 9%) ggc And this one also. I'll see if I can understand and explain this one. > integrated RA : 191.13 ( 7%) usr 0.44 ( 1%) sys 190.64 > ( 7%) wall 2328880 kB (65%) ggc Uh, wow! :-(
> > tree VRP : 65.88 ( 2%) usr 0.73 ( 2%) sys 66.71
> >( 2%) wall 862879 kB (24%) ggc
>
> Is it possible to do this again with gathering statistics enabled? The
I started it some time ago, but it takes a while (it runs out of RAM even on my machine ;)
> only thing I can think of for this would be ASSERT_EXPRs and all the
> rewriting involved for them.
It also might be folding creating too much temporary stuff.
> > tree slp vectorization : 25.42 ( 1%) usr 0.16 ( 0%) sys 25.50
> > ( 1%) wall 436119 kB (12%) ggc
>
> This 12% also seems excessive.
Indeed it is.
> > integrated RA : 191.13 ( 7%) usr 0.44 ( 1%) sys 190.64
> > ( 7%) wall 2328880 kB (65%) ggc
>
> Uh, wow! :-(
Yep, seems something is degenerate here. IRA is usually not that big of a memory hog.
Honza
Another thing to observe is that GGC memory is "just" 4GB. I am not sure where the other 8GB goes when our IL is believed to be the major memory consumer and it resides almost completely in GGC memory. Perhaps some of the streaming hashtables get out of control. Also it seems that line number info is about 1GB. It may be a win to write better streaming of locations. The current one enables almost no reuse of locators. Honza
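To make the idea of reusing locators concrete, here is a rough sketch of the simplest possible scheme: a one-entry cache of the previously written location. This is an illustration only and not how the LTO streamer represents locations; `struct loc`, `file_id` and `full_records_needed` are hypothetical names. It counts how many locations would still need a full record if an exact repeat of the previous location could be written as a short marker instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expanded location: which file and which line. */
struct loc {
    unsigned file_id;
    unsigned line;
};

/* Count how many locations need a full record when the writer remembers
   only the previously written location and emits a short "same as
   previous" marker for exact repeats. */
static size_t full_records_needed(const struct loc *locs, size_t n)
{
    size_t i, full = 0;

    assert(locs != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int repeat = i > 0
                     && locs[i].file_id == locs[i - 1].file_id
                     && locs[i].line == locs[i - 1].line;
        if (!repeat)
            full++;
    }
    return full;
}
```

How much this would save depends on how often consecutive streamed locations repeat, which would have to be measured on the real input.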
I tracked down the LTO/WHOPR code size difference. It is EH handling. EH frame is empty for LTO build and quite large for WHOPR. Probably -fno-exceptions getting lost on the way to ltrans?
With memory stats there don't seem to be major surprises:
tree-phinodes.c:129 (allocate_phi_node) 110246192: 0.8% 0: 0.0% 3405296: 0.1% 409376: 0.0% 372408
gimple.c:600 (gimple_build_nop) 119935632: 0.8% 0: 0.0% 252144: 0.0% 0: 0.0% 2503912
gimplify.c:437 (create_tmp_var_raw) 119589760: 0.8% 0: 0.0% 1119200: 0.0% 0: 0.0% 754431
tree-vrp.c:3993 (build_assert_expr_for) 124663296: 0.9% 0: 0.0% 0: 0.0% 0: 0.0% 1298576
emit-rtl.c:3731 (make_jump_insn_raw) 118395600: 0.8% 0: 0.0% 11138960: 0.3% 0: 0.0% 1619182
tree-streamer-in.c:484 (streamer_alloc_tree) 90340024: 0.6% 0: 0.0% 51300472: 1.5% 4376: 0.0% 1420249
simplify-rtx.c:183 (simplify_gen_binary) 153607224: 1.1% 0: 0.0% 619968: 0.0% 0: 0.0% 6426133
fold-const.c:1870 (fold_convert_loc) 154700600: 1.1% 0: 0.0% 2160: 0.0% 0: 0.0% 3867569
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw 80243272: 0.6% 1267966456:15.3% 76357960: 2.2% 11155352: 1.2% 1833025
lto/lto.c:281 (lto_read_in_decl_state) 835696: 0.0% 0: 0.0% 163487336: 4.6% 31116920: 3.4% 4176305
cfg.c:216 (connect_src) 174302184: 1.2% 623048: 0.0% 7861944: 0.2% 133632: 0.0% 4542618
cfg.c:226 (connect_dest) 177198328: 1.2% 5444688: 0.1% 8603432: 0.2% 347648: 0.0% 4628047
tree.c:9115 (make_vector_type) 206615472: 1.4% 0: 0.0% 6720: 0.0% 0: 0.0% 1229894
emit-rtl.c:639 (gen_rtx_MEM) 202133352: 1.4% 0: 0.0% 6629016: 0.2% 0: 0.0% 8698432
dwarf2cfi.c:386 (copy_cfi_row) 212886640: 1.5% 0: 0.0% 0: 0.0% 0: 0.0% 1400570
tree-inline.c:4851 (copy_decl_no_change) 211988960: 1.5% 0: 0.0% 7283480: 0.2% 0: 0.0% 1387268
tree-ssanames.c:78 (init_ssanames) 224107008: 1.6% 252869632: 3.1% 1536: 0.0% 153516032:16.6% 309555
lists.c:144 (alloc_EXPR_LIST) 236354400: 1.7% 0: 0.0% 5798160: 0.2% 0: 0.0% 10089690
gimple.c:2237 (gimple_copy) 268995784: 1.9% 0: 0.0% 4002872: 0.1% 644208: 0.1% 2530798
gimple-streamer-in.c:95 (input_gimple_stmt) 272340080: 1.9% 0: 0.0% 4356168: 0.1% 917040: 0.1% 2550173
tree-inline.c:4331 (copy_tree_r) 286698704: 2.0% 0: 0.0% 2053920: 0.1% 0: 0.0% 5999420
rtl.c:287 (copy_rtx) 291942896: 2.0% 0: 0.0% 318864: 0.0% 0: 0.0% 12315136
emit-rtl.c:393 (gen_raw_REG) 271761568: 1.9% 0: 0.0% 25188032: 0.7% 0: 0.0% 9279675
cselib.c:1896 (cselib_subst_to_values) 299291264: 2.1% 0: 0.0% 0: 0.0% 0: 0.0% 12658684
emit-rtl.c:5427 (init_emit) 354914672: 2.5% 19547728: 0.2% 0: 0.0% 102897600:11.1% 132600
cgraph.c:359 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 401297520:11.4% 0: 0.0% 1286210
emit-rtl.c:3679 (make_insn_raw) 435416472: 3.0% 0: 0.0% 1754496: 0.0% 0: 0.0% 6071819
fold-const.c:7624 (build_fold_addr_expr_with_typ 463283920: 3.2% 0: 0.0% 72880: 0.0% 0: 0.0% 11583920
tree-ssanames.c:141 (make_ssa_name_fn) 459164960: 3.2% 0: 0.0% 5805920: 0.2% 0: 0.0% 5812136
cfg.c:142 (alloc_block) 469702464: 3.3% 0: 0.0% 20328672: 0.6% 0: 0.0% 4375278
toplev.c:964 (realloc_for_line_map) 0: 0.0% 357908640: 4.3% 1073741848:30.4% 184: 0.0% 9
tree.c:1228 (build_int_cst_wide) 1188738504: 8.3% 0: 0.0% 31478720: 0.9% 401175208:43.3% 295230
tree-streamer-in.c:495 (streamer_alloc_tree) 2413661896:16.9% 0: 0.0% 1163973288:32.9% 41183648: 4.4% 28110064
Total 14300758513 8262871404 3534486067 927547008 308001940
source location Garbage Freed Leak Overhead Times
Among the explicitly freed GGC memory there are a few interesting cases:
alias.c:2807 (init_alias_analysis) 0: 0.0% 597580152: 7.2% 0: 0.0% 116629208:12.6% 1033104
reload1.c:663 (grow_reg_equivs) 0: 0.0% 2244546880:27.2% 0: 0.0% 1859904: 0.2% 204226
tree-ssa-operands.c:331 (ssa_operand_alloc) 0: 0.0% 1326537728:16.1% 1024: 0.0% 0: 0.0% 299739
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw 80243272: 0.6% 1267966456:15.3% 76357960: 2.2% 11155352: 1.2% 1833025
Heap vectors:
source location Leak Peak Times
-------------------------------------------------------
ipa-reference.c:171 (set_reference_vars_info) 0: 0.0% 11240664 13: 0.0%
ipa-pure-const.c:236 (set_function_state) 0: 0.0% 13472632 842964: 0.8%
ipa-inline-analysis.c:3010 (read_inline_edge_sum 0: 0.0% 17281356 870489: 0.8%
ipa-prop.c:136 (ipa_initialize_node_params) 0: 0.0% 29039016 666148: 0.6%
ipa-inline-analysis.c:804 (inline_summary_alloc) 0: 0.0% 30037064 1: 0.0%
ipa-prop.h:308 (ipa_check_create_node_params) 0: 0.0% 51448408 1: 0.0%
ipa-prop.h:313 (ipa_check_create_node_params) 0: 0.0% 51448448 1: 0.0%
....
tree-vect-slp.c:1553 (vect_analyze_slp_instance) 49136: 0.1% 80056 3273: 0.0%
tree-vect-slp.c:1521 (vect_analyze_slp_instance) 49256: 0.1% 80136 3273: 0.0%
tree-into-ssa.c:1049 (mark_phi_for_rewrite) 60776: 0.1% 71352 11: 0.0%
cfgloop.c:1151 (get_loop_exit_edges) 310312: 0.6% 316976 310269: 0.3%
tree-into-ssa.c:291 (get_ssa_name_ann) 352928: 0.6% 612512 13: 0.0%
passes.c:2214 (execute_one_pass) 934496: 1.7% 41942992 557113: 0.5%
tree-ssa-structalias.c:3861 (handle_lhs_call) 1491552: 2.6% 2359224 20716: 0.0%
ipa-inline-analysis.c:2645 (inline_merge_summary 2432148: 4.3% 2442960 157716: 0.1%
tree-ssa-loop-im.c:1556 (record_mem_ref_loc) 6634880:11.8% 10465232 595488: 0.6%
tree-ssa-loop-im.c:1545 (record_mem_ref_loc) 7587408:13.5% 12637232 579373: 0.5%
ipa-reference.c:186 (set_reference_optimization_ 10289688:18.3% 11240664 13: 0.0%
lto-cgraph.c:118 (lto_cgraph_encoder_encode) 12756976:22.7% 23348152 25665: 0.0%
ipa-ref.c:55 (ipa_record_reference) 13164872:23.4% 41932432 1000598: 0.9%
Total 56309568 107517917
I will try to look for ipa-ref related leaks... These should not outgrow other IPA structures, but they are not _that_ off.
Bitmap Overall Allocated Peak Leak searched search itr --------------------------------------------------------------------------------- df-problems.c:550 (df_rd_transfer_functio 1401668 550959000 285854280 285854280 1202920 2686239 df-problems.c:4368 (df_md_alloc) 2420865 119625200 103991640 103991640 7882560 876516 df-problems.c:4370 (df_md_alloc) 2420865 47313120 44242920 44242920 0 0 df-problems.c:4366 (df_md_alloc) 2420865 11779160 11744960 11744960 0 0 df-problems.c:4367 (df_md_alloc) 2420865 26404920 26403880 26403880 271729 4 tree-ssa-structalias.c:1249 (build_pred_g 2603931 225511920 225511920 225511920 187843 110177 tree-ssa-tail-merge.c:1316 (deps_ok_for_r 593970 30665680 16874760 16874760 632 40 tree-ssa-structalias.c:5890 (find_what_va 2328862 113793160 102564760 102564760 710275 853412 df-problems.c:1389 (df_live_alloc) 1806260 76241920 12459320 12459320 1826 0 df-problems.c:1390 (df_live_alloc) 1806260 281713360 38869560 38868680 2579692 1190624 df-problems.c:1392 (df_live_alloc) 1806260 991814240 40633200 40629040 221318 201166 dse.c:2452 (copy_fixed_regs) 1132737 90618960 90618960 90618960 0 0 df-problems.c:1391 (df_live_alloc) 1806260 1491519600 40632440 40628480 536753 522104 tree-ssa-loop-im.c:1512 (mem_ref_alloc) 567787 33164080 12373120 12372440 0 0 reload1.c:495 (new_insn_chain) 5276019 402655640 401709040 401709040 24691 0 tree-ssa-pre.c:619 (bitmap_set_new) 32638618 990092880 562280520 562280440 20419995 15879008 tree-ssa-pre.c:620 (bitmap_set_new) 32638618 990371960 574119360 574119280 16846876 10621314 df-problems.c:261 (df_rd_alloc) 2741972 138884160 129954960 129954960 2949744 610463 reload1.c:496 (new_insn_chain) 5276019 151328120 151029880 151029880 388762 10455 tree-ssa-structalias.c:2559 (solve_graph) 3169222 256948000 256292160 256292160 0 0 tree-ssanames.c:90 (init_ssanames) 309555 25951800 12382440 12382200 18777080 7410198 tree-ssa-structalias.c:2113 (label_visit) 5147637 425173040 425173040 425173040 105478 61601 
tree-ssa-structalias.c:1108 (add_implicit 4593393 382459560 382459560 382459560 726652 628375 tree-ssa-structalias.c:1123 (add_pred_gra 3379786 273371640 273371640 273371640 121581 98415 tree-ssa-structalias.c:1144 (add_graph_ed 2917231 246071240 174844960 174844960 681820 290190 df-problems.c:262 (df_rd_alloc) 2741972 530288680 506786360 506786360 0 0 df-problems.c:263 (df_rd_alloc) 2741972 304266640 233174000 233172280 108 108 tree-ssa-structalias.c:361 (new_var_info) 7385339 467574280 360290520 360290520 44320 85263 Alloc-pool Kind Elt size Pools Allocated (elts) Peak (elts) Leak (elts) -------------------------------------------------------------------------------------------------------------- insn_info_pool 56 204084 538278104( 9612109) 830704( 14834) 0( 0) bb_info_pool 56 204084 133331912( 2380927) 133616( 2386) 0( 0) rtx_group_info_pool 112 204084 56406672( 503631) 138768( 1239) 0( 0) Bitmap sets 80 204085 2611089440( 32638618) 8824880( 110311) 0( 0) deferred_change_pool 24 204084 52128( 2172) 288( 12) 0( 0) pre_expr nodes 16 204085 138421792( 8651362) 981200( 61325) 0( 0) cse_store_info_pool 104 1972759 98188584( 944121) 485472( 4668) 0( 0) value 16 843341 462086672( 28880417) 245280( 15330) 0( 0) VN phis 32 408170 88913824( 2778557) 83712( 2616) 0( 0) Constraint pool 32 204085 353203136( 11037598) 594528( 18579) 0( 0) struct case_node pool 48 4743 1096848( 22851) 13680( 285) 0( 0) Variable info pool 72 204085 531744408( 7385339) 601560( 8355) 0( 0) IPA-CP value sources 32 1 4760736( 148773) 4260384( 133137) 0( 0) et_occ pool 48 2116800 3595771776( 74911912) 688128( 14336) 0( 0) VN references 56 408170 323302616( 5773261) 3466680( 61905) 0( 0) et_node pool 64 2116800 2533145216( 39580394) 458880( 7170) 0( 0) dep_node 80 102042 734534240( 9181678) 4233840( 52923) 0( 0) df_chain_block pool 16 251647 436908640( 27306790) 2391808( 149488) 0( 0) IPA-CP values 80 1 5005280( 62566) 5005280( 62566) 0( 0) df_scan ref base 56 204084 6325340840( 112952515) 2948400( 
52650) 0( 0) SRA accesses 120 102043 13514520( 112621) 92760( 773) 0( 0) df_scan ref artificial 64 204084 901356672( 14083698) 899200( 14050) 0( 0) df_scan ref regular 64 204084 2184845888( 34138217) 2431168( 37987) 0( 0) allocnos 160 102042 281957120( 1762232) 1250560( 7816) 0( 0) elt_list 16 843341 619139328( 38696208) 240832( 15052) 0( 0) elt_loc_list 24 843341 1153775424( 48073976) 521760( 21740) 0( 0) df_scan insn 48 204084 926799792( 19308329) 1070400( 22300) 0( 0) live ranges 40 102042 106931600( 2673290) 508880( 12722) 0( 0) df_scan reg 16 204084 934613472( 58413342) 783216( 48951) 0( 0) SRA links 24 102043 402672( 16778) 4848( 202) 0( 0) rtx_store_info_pool 104 204084 19621264( 188666) 213096( 2049) 0( 0) strinfo_struct pool 56 102042 324184( 5789) 1344( 24) 0( 0) edge predicates 40 1 3540840( 88521) 2030280( 50757) 0( 0) original_copy 8 509567 3890016( 486252) 13264( 1658) 0( 0) cost vectors 192 2551050 192202512( 1001054) 419392( 2184) 0( 0) operand entry pool 24 204084 18481680( 770070) 89424( 3726) 0( 0) objects 72 102042 126880704( 1762232) 562752( 7816) 0( 0) deps_list 16 102042 385122400( 24070150) 847120( 52945) 0( 0) cselib_val_list 40 843341 1155216680( 28880417) 613200( 15330) 0( 0) copies 80 102042 27013920( 337674) 324480( 4056) 0( 0) read_info_pool 32 204084 84871968( 2652249) 91104( 2847) 0( 0) GIMPLE statements Kind Stmts Bytes --------------------------------------- assignments 6803719 658739112 phi nodes 372408 112832736 conditionals 1121446 107658816 everything else 3704547 292211544 Kind Nodes Bytes --------------------------------------- decls 15883790 -1764091088 types 6197660 1041206880 blocks 1809846 144787680 stmts 52888 3384832 refs 11131010 561131416 exprs 31414309 1351944944 constants 2761315 97231060 identifiers 1227582 49103280 vecs 295323 417871880 binfos 1420249 141631744 ssa names 5812136 464970880 constructors 340124 8162976 random kinds 3280618 131225128 lang_decl kinds 0 0 lang_type kinds 0 0 omp clauses 0 0 
--------------------------------------- Total 81626850 -1646405684
... and mem reports on WPA stage:
toplev.c:964 (realloc_for_line_map) 0: 0.0% 89473168: 9.4% 268435472:10.3% 160: 0.0% 8
cgraph.c:359 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 401297520:15.3% 0: 0.0% 1286210
tree.c:1228 (build_int_cst_wide) 1188709752:33.7% 0: 0.0% 22765400: 0.9% 399425424:83.1% 208540
tree-streamer-in.c:495 (streamer_alloc_tree) 1950272016:55.3% 0: 0.0% 1143907104:43.7% 41182080: 8.6% 22462122
Total 3527995024 956449616 2618397893 480920037 47749265
source location Garbage Freed Leak Overhead Times
So about 50% trees, 15% cgraph nodes (I do have plans for how to get those smaller), 10% linemaps (I wonder if a simple cache would not save a lot of locators), 5% inline summaries.
I wonder who is producing that 1GB of temporary integer nodes? Someone abusing them for counting too much? It is there before IPA, so it seems to be streaming or type machinery.
Heap vectors:
source location Leak Peak Times
-------------------------------------------------------
ipa-reference.c:186 (set_reference_optimization_ 10289688:10.5% 11240664 13: 0.0%
lto-cgraph.c:118 (lto_cgraph_encoder_encode) 12756976:13.0% 23348152 26300: 0.2%
ipa-ref.c:55 (ipa_record_reference) 13593072:13.8% 41932432 1000565: 6.0%
passes.c:2214 (execute_one_pass) 21214520:21.5% 41942992 557113: 3.3%
ipa-inline-analysis.c:804 (inline_summary_alloc) 30037064:30.5% 30037064 1: 0.0%
Total 98450004 16768143
Bitmap Overall Allocated Peak Leak searched search itr
---------------------------------------------------------------------------------
ipa-reference.c:911 (propagate) 372741 31244280 31223720 31223720 0 0
ipa-reference.c:739 (propagate) 329258 13341680 3058960 3058960 0 0
ipa-reference.c:923 (propagate) 372186 25153920 25138520 25138520 0 0
ipa-reference.c:417 (init_function_info) 487263 19809560 19809560 19809560 551 335
ipa-reference.c:418 (init_function_info) 487263 19584680 19584680 19584680 79 45
ipa-reference.c:747 (propagate) 329351 13229360 3053920 3053920 0 0
Kind Nodes Bytes
---------------------------------------
decls 11059354 1770384416
types 6163492 1035466656
blocks 1 80
stmts 0 0
refs 5243 267944
exprs 1826905 74999944
constants 2198755 72290570
identifiers 538891 21555640
vecs 208540 412624304
binfos 1420249 141631744
ssa names 111 8880
constructors 159169 3820056
random kinds 3270917 130837088
Honza
... and oprofile of compilation stage of -flto-partition=none samples % image name app name symbol name 194976 2.8536 lto1 lto1 alloc_page 109091 1.5966 libc-2.11.1.so libc-2.11.1.so _int_malloc 99458 1.4556 lto1 lto1 operand_equal_p 88092 1.2893 lto1 lto1 record_reg_classes 87508 1.2807 lto1 lto1 bitmap_set_bit 75628 1.1069 lto1 lto1 estimate_edge_growth 68760 1.0064 lto1 lto1 mem_attrs_eq_p 62151 0.9096 lto1 lto1 for_each_rtx_1 58274 0.8529 libc-2.11.1.so libc-2.11.1.so memset 55257 0.8087 libc-2.11.1.so libc-2.11.1.so malloc 52116 0.7628 lto1 lto1 htab_find_slot_with_hash 50481 0.7388 oprofiled oprofiled /usr/bin/oprofiled 42524 0.6224 lto1 lto1 ggc_set_mark 40190 0.5882 lto1 lto1 constrain_operands 40124 0.5872 lto1 lto1 lookup_page_table_entry 39279 0.5749 lto1 lto1 extract_insn 34436 0.5040 lto1 lto1 ggc_internal_alloc_stat 33609 0.4919 lto1 lto1 preprocess_constraints 32843 0.4807 lto1 lto1 get_attr_enabled 32582 0.4769 lto1 lto1 reload_cse_simplify_operands 32573 0.4767 lto1 lto1 bitmap_clear_bit 32278 0.4724 libc-2.11.1.so libc-2.11.1.so malloc_consolidate 29633 0.4337 lto1 lto1 bitmap_bit_p 29593 0.4331 lto1 lto1 find_reg_note 29428 0.4307 libc-2.11.1.so libc-2.11.1.so _int_free 29161 0.4268 lto1 lto1 df_note_bb_compute 28939 0.4235 libc-2.11.1.so libc-2.11.1.so calloc 28794 0.4214 lto1 lto1 cse_insn 28084 0.4110 lto1 lto1 find_reloads 26192 0.3833 lto1 lto1 ix86_decompose_address 25211 0.3690 libc-2.11.1.so libc-2.11.1.so memcpy 25016 0.3661 lto1 lto1 df_ref_create_structure 24321 0.3560 lto1 lto1 nonzero_bits1 24066 0.3522 lto1 lto1 htab_traverse_noresize 23895 0.3497 libc-2.11.1.so libc-2.11.1.so free
So since the last report we managed to double WPA memory usage and compile time... 12m wall, 42m user is needed for WPA build.
Execution times (seconds)
phase opt and generate : 97.34 (21%) usr 0.33 ( 1%) sys 97.70 (20%) wall 98900 kB ( 3%) ggc
phase stream in : 242.70 (51%) usr 5.12 (22%) sys 247.94 (50%) wall 3174311 kB (97%) ggc
phase stream out : 131.99 (28%) usr 17.49 (76%) sys 149.59 (30%) wall 0 kB ( 0%) ggc
garbage collection : 24.01 ( 5%) usr 0.00 ( 0%) sys 24.03 ( 5%)
ipa lto gimple out : 12.59 ( 3%) usr 1.07 ( 5%) sys 13.69 ( 3%) wall 0 kB ( 0%) ggc
ipa lto decl in : 188.50 (40%) usr 3.93 (17%) sys 192.53 (39%) wall 2083552 kB (64%) ggc
ipa lto decl out : 113.33 (24%) usr 8.48 (37%) sys 121.84 (25%) wall 0 kB ( 0%) ggc
ipa lto cgraph I/O : 5.58 ( 1%) usr 0.67 ( 3%) sys 6.25 ( 1%) wall 684122 kB (21%) ggc
ipa lto decl merge : 10.64 ( 2%) usr 0.01 ( 0%) sys 10.64 ( 2%) wall 291 kB ( 0%) ggc
ipa lto cgraph merge : 9.15 ( 2%) usr 0.01 ( 0%) sys 9.17 ( 2%) wall 15100 kB ( 0%) ggc
whopr wpa : 5.80 ( 1%) usr 0.05 ( 0%) sys 5.89 ( 1%) wall 1 kB ( 0%) ggc
whopr wpa I/O : 2.19 ( 0%) usr 7.94 (35%) sys 10.19 ( 2%)
inline heuristics : 61.46 (13%) usr 0.31 ( 1%) sys 61.80 (12%) wall 351753 kB (11%) ggc
callgraph verifier : 15.97 ( 3%) usr 0.06 ( 0%) sys 16.00 ( 3%) wall 0 kB ( 0%) ggc
TOTAL : 472.05 22.94 495.25 3274649 kB
Actually not, I looked up the wrong report. The last report in comment #121 shows:
TOTAL : 616.43 22.26 651.79 2165706 kB
So we actually got noticeably faster, but need more memory. 1GB of GGC space, but a lot more in the top report. I will look into mem report analysis once I am done with merging some other cleanups/speedups.
oprofile of WPA:
649295 18.2243 lto1 lto1 lto_main()
341256 9.5783 lto1 lto1 htab_find_slot_with_hash
126567 3.5525 lto1 lto1 do_estimate_growth_1(cgraph_node*, void*)
97142 2.7266 lto1 lto1 htab_expand
89658 2.5165 libc-2.11.1.so libc-2.11.1.so _int_malloc
82117 2.3048 lto1 lto1 pointer_map_insert(pointer_map_t*, void const*)
60238 1.6907 lto1 lto1 iterative_hash_hashval_t(unsigned int, unsigned int)
58145 1.6320 lto1 lto1 ggc_internal_alloc_stat(unsigned long, char const*, int, char const*)
53679 1.5067 lto1 lto1 linemap_lookup(line_maps*, unsigned int)
47271 1.3268 lto1 lto1 lto_output_tree(output_block*, tree_node*, bool, bool)
43043 1.2081 lto1 lto1 gt_ggc_mx_lang_tree_node(void*)
42675 1.1978 lto1 lto1 verify_cgraph_node(cgraph_node*)
40609 1.1398 lto1 lto1 streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int*, bool)
40245 1.1296 lto1 lto1 ggc_marked_p(void const*)
39474 1.1079 libc-2.11.1.so libc-2.11.1.so memset
38955 1.0934 libc-2.11.1.so libc-2.11.1.so malloc_consolidate
32085 0.9006 lto1 lto1 streamer_write_uhwi_stream(lto_output_stream*, unsigned long)
31965 0.8972 lto1 lto1 ggc_set_mark(void const*)
31406 0.8815 lto1 lto1 lto_input_tree(lto_input_block*, data_in*)
29213 0.8199 lto1 lto1 streamer_read_tree_bitfields(lto_input_block*, tree_node*)
26846 0.7535 lto1 lto1 hash_pointer
25870 0.7261 libc-2.11.1.so libc-2.11.1.so memcpy
We still spend an insanely long time walking types in lto_main (introduced by Michael's patch).
Author: hubicka Date: Sun Aug 19 05:55:20 2012 New Revision: 190509 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=190509 Log: PR lto/45375 * ipa-inline.c (want_inline_small_function_p): Bypass inline limits for hinted functions. (edge_badness): Dump hints; decrease badness for hinted funcitons. * ipa-inline.h (enum inline_hints_vals): New enum. (inline_hints): New type. (edge_growth_cache_entry): Add hints. (dump_inline_summary): Update. (dump_inline_hints): Declare. (do_estimate_edge_hints): Declare. (estimate_edge_hints): New inline function. (reset_edge_growth_cache): Update. * predict.c (cgraph_maybe_hot_edge_p): Do not ice on indirect edges. * ipa-inline-analysis.c (dump_inline_hints): New function. (estimate_edge_devirt_benefit): Return true when function should be hinted. (estimate_calls_size_and_time): New hints argument; set it when devritualization happens. (estimate_node_size_and_time): New hints argument. (do_estimate_edge_time): Cache hints. (do_estimate_edge_growth): Update. (do_estimate_edge_hints): New function Modified: trunk/gcc/ChangeLog trunk/gcc/ipa-inline-analysis.c trunk/gcc/ipa-inline.c trunk/gcc/ipa-inline.h trunk/gcc/predict.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/ipa/iinline-1.c
After the new IonMonkey JIT went in (http://blog.mozilla.org/javascript/2012/09/12/ionmonkey-in-firefox-18/) peak memory use went up. It is now 6.8GB (gcc-4.7 roughly the same: 6.5GB). So we're approaching the point where an 8GB machine isn't enough to build Firefox with LTO...
After updating Mozilla this weekend, I definitely overflow an 8GB machine. The peak in top is around 9-10GB. I checked malloc usage and there are not many surprises. It is about 300MB, mostly GGC overhead, pointer maps and such. Most memory is actually the GGC, about 7GB. Here 5GB survives type and decl merging and is distributed as follows:
cgraph.c:722 (cgraph_allocate_init_indirect_info 1671240: 0.0% 0: 0.0% 8202960: 0.2% 0: 0.0% 246855
tree.c:1226 (build_int_cst_wide) 625825208:12.3% 0: 0.0% 10437744: 0.2% 4863752: 3.1% 325009
ipa-prop.h:471 (ipa_check_create_edge_args) 0: 0.0% 0: 0.0% 16777216: 0.3% 0: 0.0% 1
ipa-inline-analysis.c:3697 (inline_read_section) 0: 0.0% 28298904: 1.6% 21095504: 0.4% 1064480: 0.7% 423701
tree.c:1561 (build_string) 16526800: 0.3% 0: 0.0% 21695715: 0.4% 3395427: 2.2% 864326
ipa-prop.c:3393 (ipa_read_node_info) 0: 0.0% 4302088: 0.2% 25029448: 0.5% 119192: 0.1% 246788
stringpool.c:75 (alloc_node) 0: 0.0% 0: 0.0% 27817760: 0.5% 0: 0.0% 695444
ipa-ref.c:51 (ipa_record_reference) 0: 0.0% 188442816:10.3% 28443272: 0.6% 2114424: 1.4% 1256259
stringpool.c:58 (stringpool_ggc_alloc) 0: 0.0% 0: 0.0% 34673092: 0.7% 2619412: 1.7% 695444
lto/lto.c:2279 (create_subid_section_table) 275832: 0.0% 0: 0.0% 40363416: 0.8% 8051472: 5.2% 3978
tree-streamer-in.c:895 (lto_input_ts_constructor 171812232: 3.4% 192568640:10.6% 42205992: 0.8% 1425072: 0.9% 947082
ipa-prop.c:3380 (ipa_read_node_info) 0: 0.0% 35825488: 2.0% 58764528: 1.1% 659704: 0.4% 909232
tree-streamer-in.c:488 (streamer_alloc_tree) 129846168: 2.6% 0: 0.0% 75997752: 1.5% 7072: 0.0% 2063753
tree.c:1263 (build_int_cst_wide) 237791264: 4.7% 0: 0.0% 90464320: 1.8% 0: 0.0% 10257987
ipa-inline-analysis.c:3709 (inline_read_section) 0: 0.0% 133938484: 7.4% 101874268: 2.0% 1606480: 1.0% 1099389
lto-section-in.c:361 (lto_new_in_decl_state) 3240: 0.0% 0: 0.0% 107452560: 2.1% 0: 0.0% 895465
cgraph.c:653 (cgraph_create_edge_1) 0: 0.0% 0: 0.0% 135509816: 2.6% 0: 0.0% 1302979
ggc-common.c:253 (ggc_cleared_alloc_ptr_array_tw 2040: 0.0% 866397160:47.6% 190623368: 3.7% 263888: 0.2% 11459
lto/lto.c:267 (lto_read_in_decl_state) 3024: 0.0% 0: 0.0% 225743280: 4.4% 41057176:26.5% 6268255
ipa-inline-analysis.c:931 (inline_summary_alloc) 0: 0.0% 0: 0.0% 268435464: 5.2% 8: 0.0% 1
cgraph.c:362 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 515473640:10.1% 0: 0.0% 1741465
toplev.c:953 (realloc_for_line_map) 0: 0.0% 358955168:19.7% 1074790424:21.0% 184: 0.0% 19
tree-streamer-in.c:499 (streamer_alloc_tree) 3668091656:72.1% 0: 0.0% 1995384408:38.9% 87485792:56.5% 46580224
Total 5089831352 1821058652 5124870115 154815271 91384962
source location Garbage Freed Leak Overhead Times
I.e. 20% are now linemaps, 38% trees read by the streamer, 10% cgraph nodes, 5% inline summaries, 4% streamer table converting UIDs to decls (that can be freed). The trees are distributed as follows:
Kind Nodes Bytes
---------------------------------------
decls 20489087 -1105370640
types 10321297 1733977896
blocks 102012 8160960
stmts 0 0
refs 44297 1806000
exprs 8205133 264995952
constants 11667038 376994197
identifiers 695444 27817760
vecs 325009 626535448
binfos 2063753 205829776
ssa names 0 0
constructors 369886 8877264
random kinds 7039351 281574472
lang_decl kinds 0 0
lang_type kinds 0 0
omp clauses 0 0
---------------------------------------
Total 61322307 -1863768211
---------------------------------------
I think all the blocks read to WPA are bugs. We may also do better on sharing constants.
Code Nodes
----------------------------
identifier_node 695444
tree_list 7039346
tree_vec 325009
block 102012
offset_type 1762
enumeral_type 371554
boolean_type 7097
integer_type 830019
real_type 10054
pointer_type 3089539
reference_type 215629
array_type 204968
record_type 3818337
union_type 77106
void_type 1478
function_type 259759
method_type 1433688
integer_cst 10784917
real_cst 17553
string_cst 864326
function_decl 2736272
label_decl 82077
field_decl 3121989
var_decl 323843
const_decl 2817588
parm_decl 5244428
type_decl 4906573
result_decl 1225435
constructor 369886
pointer_plus_expr 302600
nop_expr 3307128
addr_expr 4592681
tree_binfo 2063753
Honza
Created attachment 28395 [details] Use size_t for tree code book-keeping ...because overflow looks so sloppy.
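The negative byte counts in the tree statistics (decls -1105370640, Total -1863768211) look like a byte total that no longer fits a 32-bit signed counter. The sketch below reproduces that arithmetic. It is not the attached patch; `seen_through_int32` and `check_against_report` are hypothetical names. The "true" figures are inferences: 3189596656 is the printed decls value plus 2^32, and 6726166381 is that figure plus the other per-kind byte counts from the same table. Both are only the smallest values consistent with the report, since further multiples of 2^32 would print the same way.

```c
#include <assert.h>
#include <stdint.h>

/* Return what a byte total looks like once only its low 32 bits are kept
   and read back as a signed value. Done with unsigned arithmetic so the
   wrap-around is well defined. */
static int64_t seen_through_int32(uint64_t true_bytes)
{
    uint32_t low = (uint32_t)(true_bytes & 0xffffffffu);

    if (low & 0x80000000u)
        return (int64_t)low - ((int64_t)1 << 32);
    return (int64_t)low;
}

/* Usage example with values from the tree statistics; the inputs are the
   inferred (assumed) true totals, the expected results are as printed. */
void check_against_report(void)
{
    assert(seen_through_int32(UINT64_C(3189596656)) == INT64_C(-1105370640));
    assert(seen_through_int32(UINT64_C(6726166381)) == INT64_C(-1863768211));
}
```

That the inferred sum wraps to exactly the printed Total suggests the individual rows other than decls are intact and only the accumulation overflowed, which is the kind of problem a wider counter type such as size_t avoids on a 64-bit host.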
It looks like there is an LTO code-size regression on trunk:
(size of libxul.so, build without elfhack):
gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
gcc         : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
clang lto   : size: 35071848 | Kraken bench: 2804.2ms +/- 1.2%
clang       : size: 36797384 | Kraken bench: 2819.6ms +/- 1.4%
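For scale, the size differences between those rows can be worked out with a few lines of integer arithmetic. This is only a convenience sketch; `growth_permille` is a hypothetical helper, and it compares the rows as labelled above without assuming which flags each build used.

```c
#include <assert.h>
#include <stdint.h>

/* Growth of SIZE relative to BASE in per-mille (tenths of a percent),
   rounded toward zero. Integer arithmetic avoids floating-point noise. */
static int64_t growth_permille(int64_t size, int64_t base)
{
    assert(base > 0);
    return (size - base) * 1000 / base;
}
```

With the libxul.so sizes listed, the "gcc lto/pgo" row comes out about 23.8% larger than the "gcc" row (238 per-mille), "clang lto" about 2.9% larger and "clang" about 7.9% larger.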
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375 > > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC --- > It looks like there is a LTO code-size regression on trunk: > (size of libxul.so, build without elfhack): > > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9% About LTO+PGO please be sure that you have the Teresa's fix from this Friday in your tree. > gcc : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6% Is LTO w/o PGO bigger than previous builds? > clang lto : size: 35071848 | Kraken bench: 2804.2ms +/- 1.2% > clang : size: 36797384 | Kraken bench: 2819.6ms +/- 1.4%
(In reply to comment #145)
> > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > It looks like there is a LTO code-size regression on trunk:
> > (size of libxul.so, build without elfhack):
> >
> > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
>
> About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> your tree.

Yes, my tree already included this fix and also the fix from bug 55551.

> > gcc : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
>
> Is LTO w/o PGO bigger than previous builds?

Couldn't tell, because it doesn't link:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: hidden symbol 'pixman_add_triangles' in /var/tmp/moz-build-dir/toolkit/library/../../gfx/cairo/libpixman/src/pixman-trap.o is referenced by DSO /usr/lib64/libcairo.so
/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at runtime; recompile with -fPIC
/tmp/cc0oq4BG.ltrans0.ltrans.o:cc0oq4BG.ltrans0.o:function SharedStub: error: undefined reference to 'PrepareAndDispatch'
/tmp/cc0oq4BG.ltrans1.ltrans.o:cc0oq4BG.ltrans1.o:function SkAnnotation::CreateProc(SkFlattenableReadBuffer&) [clone .local.7828.1055099]: error: undefined reference to 'SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&)'
collect2: error: ld returned 1 exit status

The undefined reference to PrepareAndDispatch is easily fixed by
an __attribute__ ((used)).
Do you have an idea on how to fix the
SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&) issue?
> --- Comment #146 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-02 07:36:02 UTC ---
> (In reply to comment #145)
> > > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > > It looks like there is a LTO code-size regression on trunk:
> > > (size of libxul.so, build without elfhack):
> > >
> > > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
> >
> > About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> > your tree.
>
> Yes, my tree already included this fix and also the fix from bug 55551.

Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some regressions
on some SPEC benchmarks (such as GCC) and this helps. If it doesn't, it would be
nice to know what value is needed for comparable size.

> > > gcc : size: 34072808 | Kraken bench: 2804.3ms +/- 1.6%
> >
> > Is LTO w/o PGO bigger than previous builds?
>
> Couldn't tell, because it doesn't link:
>
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> warning: hidden symbol 'pixman_add_triangles' in
> /var/tmp/moz-build-dir/toolkit/library/../../gfx/cairo/libpixman/src/pixman-trap.o
> is referenced by DSO /usr/lib64/libcairo.so
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc
> against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at
> runtime; recompile with -fPIC
> /tmp/cc0oq4BG.ltrans0.ltrans.o:cc0oq4BG.ltrans0.o:function SharedStub: error:
> undefined reference to 'PrepareAndDispatch'
> /tmp/cc0oq4BG.ltrans1.ltrans.o:cc0oq4BG.ltrans1.o:function
> SkAnnotation::CreateProc(SkFlattenableReadBuffer&) [clone .local.7828.1055099]:
> error: undefined reference to
> 'SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&)'
> collect2: error: ld returned 1 exit status
>
> The undefined reference to PrepareAndDispatch is easily fixed by
> an __attribute__ ((used)).
> Do you have an idea on how to fix the
> SkAnnotation::SkAnnotation(SkFlattenableReadBuffer&) issue?

Hmm, I remember seeing this one, too. I will check.

Honza
(In reply to comment #147)
> > --- Comment #146 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-02 07:36:02 UTC ---
> > (In reply to comment #145)
> > > > --- Comment #144 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-01 12:39:30 UTC ---
> > > > It looks like there is a LTO code-size regression on trunk:
> > > > (size of libxul.so, build without elfhack):
> > > >
> > > > gcc lto/pgo : size: 42204584 | Kraken bench: 2723.9ms +/- 0.9%
> > >
> > > About LTO+PGO please be sure that you have the Teresa's fix from this Friday in
> > > your tree.
> >
> > Yes, my tree already included this fix and also the fix from bug 55551.
>
> Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some
> regressions
> on some SPEC benchmarks (such as GCC) and this helps. If it doesn't it would be
> nice to know what value is needed for comparable size.

Unfortunately it doesn't help much, because with "--param hot-bb-count-ws-permille=990" the size is only 0.25% smaller:

(With --param) : 42098856
(Without     ) : 42204584

I will try smaller values later.
> > Please try to reduce HOT_BB_COUNT_WS_PERMILLE to 990. I also see some
> > regressions
> > on some SPEC benchmarks (such as GCC) and this helps. If it doesn't it would be
> > nice to know what value is needed for comparable size.
>
> Unfortunately it doesn't help much, because with "--param
> hot-bb-count-ws-permille=990" the size is only 0.25% smaller:
> (With --param) : 42098856
> (Without ) : 42204584
>
> I will try smaller values later.

Hmm, that sounds like quite bad news; the histogram code was supposed to help in such cases. I will try to fix the non-PGO case, and let's first see how the PGO and non-PGO builds compare. If you could put the -fdump-ipa-inline dump somewhere, I will check whether there is something obviously wrong.

In the worst case we can resort to combining both heuristics, i.e. keeping the hot_bb_fraction in addition to the histogram code. In fact I had planned to do it this way, but Teresa removed the old code and I did not see a good reason to keep it.

Honza
For comparison, I've just disabled skia and built with LTO only; the size looks good in this case: 31356968
Teresa committed another bugfix just today. So with a bit of luck it will work now? I will try to look deeper into it ASAP, but I am just getting ready for a trip to the USA. Honza
Also, I suppose you don't have a comparison to 4.7 handy? (I am curious because of the inliner heuristic retuning.) Honza
On 2012.12.02 at 21:09 +0000, hubicka at ucw dot cz wrote:
> --- Comment #152 from Jan Hubicka <hubicka at ucw dot cz> 2012-12-02 21:09:24 UTC ---
> Also I suppose you don't have comparsion to 4.7 handy? (I am curious because of
> inliner heuristic re-tunning)

The LTO/PGO sizes were measured with the newest patch from Teresa already applied.

gcc-4.7 lto/pgo: size: 33337456 | Kraken bench: 2706.7ms +/- 1.1%
What was the size of the gcc lto/pgo binary before the change to use the histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a very large increase, ~25%. Markus, could you attach to the bug one of the gcda files so that I can see the program summary and figure out how far off the old hot bb threshold is from the new histogram-based one? Also, it would be good to see the -fdump-ipa-inline dumps before and after the regression (if necessary, the before one could be from 4_7).
(In reply to comment #154)
> What was the size of the gcc lto/pgo binary before the change to use the
> histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a
> very large increase, ~25%.

With revision 193914 (before the change) the lto/pgo size is 42115424 bytes.
So it looks like Teresa is off the hook.

> Markus, could you attach to the bug one of the gcda files so that I can see the
> program summary and figure out how far off the old hot bb threshold is from the
> new histogram-based one? Also, it would be good to see the -fdump-ipa-inline
> dumps before and after the regression (if necessary, the before one could be
> from 4_7).

Will try to post them tomorrow.
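For reference, the size deltas quoted so far can be recomputed from the byte counts posted in this bug. The sketch below is plain arithmetic on those numbers; the helper name and the dictionary labels are mine, not anything from the build.

```python
def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# libxul.so sizes in bytes, as reported in the comments above.
sizes = {
    "gcc-4.7 lto/pgo": 33337456,
    "r193914 lto/pgo": 42115424,
    "trunk lto/pgo": 42204584,
    "trunk lto/pgo, hot-bb-count-ws-permille=990": 42098856,
}

trunk = sizes["trunk lto/pgo"]

for label, size in sizes.items():
    if label == "trunk lto/pgo":
        continue
    print(f"{label}: {relative_change(size, trunk):+.2f}% vs trunk lto/pgo")

growth = relative_change(trunk, sizes["gcc-4.7 lto/pgo"])
print(f"trunk lto/pgo vs gcc-4.7 lto/pgo: {growth:+.1f}%")
```

On these figures the --param run is about a quarter of a percent smaller than trunk, while trunk is roughly a quarter larger than the 4.7 build, which is the "~25%" mentioned above.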
On Tue, Dec 11, 2012 at 2:57 PM, markus at trippelsdorf dot de <gcc-bugzilla@gcc.gnu.org> wrote:
> --- Comment #155 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-11 22:57:14 UTC ---
> (In reply to comment #154)
>> What was the size of the gcc lto/pgo binary before the change to use the
>> histogram? Was it close to the gcc 4.7 lto/pgo size? In that case that is a
>> very large increase, ~25%.
>
> With revision 193914 (before the change) the lto/pgo size is 42115424 bytes.
> So it looks like Theresa is off the hook.

Unfortunately, I am possibly still on the hook, since the main suspect change is r193747 (committed by Honza, with changes made by him and me to use the histogram instead of a hard limit for determining bb hotness). Between then and when I committed fixes for this under LTO (r193999), I would expect that the code size might have been temporarily worse: everything looked hot because the histogram was not being streamed through the LTO files properly, so inlining could have become excessive.

>> Markus, could you attach to the bug one of the gcda files so that I can see the
>> program summary and figure out how far off the old hot bb threshold is from the
>> new histogram-based one? Also, it would be good to see the -fdump-ipa-inline
>> dumps before and after the regression (if necessary, the before one could be
>> from 4_7).
>
> Will try to post them tomorrow .

Ok thanks.
Teresa

> --
> Configure bugmail: http://gcc.gnu.org/bugzilla/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are on the CC list for the bug.

--
Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
With revision 193740 libxul's size is ~34MB, which is OK.

(Unfortunately this new ICE happens with yesterday's gcc when linking libxul:

/var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function ‘CreateRange’:
/var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409

I will open a new PR for this later.)

Here are the requested files:

(I don't know which of the ~3000 gcda files you need, so I've uploaded them all)
http://www.trippelsdorf.de/gcda_before.tar.bz2 (4MB)
http://www.trippelsdorf.de/gcda_after.tar.bz2 (4MB)

(-fdump-ipa-inline output)
http://www.trippelsdorf.de/libxul_before.inline.tar.bz2 (100MB)
http://www.trippelsdorf.de/libxul_after.inline.tar.bz2 (68MB, everything until the ICE hit)
On Wed, Dec 12, 2012 at 3:43 AM, markus at trippelsdorf dot de <gcc-bugzilla@gcc.gnu.org> wrote:
> --- Comment #157 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-12 11:43:27 UTC ---
> With revision 193740 libxul's size is ~34MB, which is OK.
>
> (Unfortunately this new ICE happens with yesterdays gcc when linking libxul:
>
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function
> ‘CreateRange’:
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal
> compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409
>
> I will open a new PR for this later.)
>
> Here are the requested files:
>
> (I don't know which of the ~3000 gcda files you need, so I've uploaded them
> all)
> http://www.trippelsdorf.de/gcda_before.tar.bz2 (4MB)
> http://www.trippelsdorf.de/gcda_after.tar.bz2 (4MB)

Sorry, I should have clarified that any one of them would do (as long as it corresponded to an object file included in the LTO link for the main executable), since the info I need is in the program summary section for the executable, which is duplicated in each of them.

> (-fdump-ipa-inline output)
> http://www.trippelsdorf.de/libxul_before.inline.tar.bz2 (100MB)
> http://www.trippelsdorf.de/libxul_after.inline.tar.bz2 (68MB, everything 'till
> the ICE hit)

With the old heuristics, the hot bb cutoff was:

  profile_info->sum_max / PARAM_VALUE (HOT_BB_COUNT_FRACTION)

In this case, sum_max is 103439951 and HOT_BB_COUNT_FRACTION was 10000, so the cutoff count was 10343. From the working set computed from the histogram, the 99.9% cutoff count is 320. See the end of this email for the full set of histograms and working sets, but here are the top few working sets:

...
hal/Hal.gcda: 96.72%: num counts=30069, min counter=16389
hal/Hal.gcda: 97.50%: num counts=35296, min counter=10241
hal/Hal.gcda: 98.28%: num counts=43669, min counter=6145
hal/Hal.gcda: 99.06%: num counts=59589, min counter=3072
hal/Hal.gcda: 99.90%: num counts=115840, min counter=320

So it looks like you would want a cutoff of 97.5% to get close to what was there before.

(Honza, I just made some changes to enable gcov-dump to optionally compute and dump out the working sets from the histogram. I can send this for upstream review as I have wanted this several times.)

The much smaller cutoff count is why there are fewer calls marked unlikely and more inlining:

$ grep "call is unlikely" before/libxul.so.wpa.049i.inline | wc
442342 4944522 42560600
$ grep "call is unlikely" after/libxul.so.wpa.049i.inline | wc
392683 4349335 37477001
$ grep Inlined before/libxul.so.wpa.049i.inline | grep eliminated
Inlined 60432 calls, eliminated 30986 functions
$ grep Inlined after/libxul.so.wpa.049i.inline | grep eliminated
Inlined 89573 calls, eliminated 28921 functions

One thing that is interesting in the above info, and may be contributing to the larger size now, is that there are more inlines, but fewer functions are being eliminated. I'm not sure why that is offhand. It's possible (probable) that the inlining heuristics need some retuning to make optimal use of the new cutoffs.

We also see additional inlines in some of our large internal apps with the change, but not much increase in binary size, and it sometimes leads to better performance. We are not as much affected, though, because the google branches were already using a much larger HOT_BB_COUNT_FRACTION of 60K in order to get more inlining. In this case, it looks like you are getting more inlines but it is apparently performance-neutral?

Looking at a graph of the working set data, the number of counters starts increasing super-exponentially as the percentages approach 100%.
I've been thinking that it may be useful to find the "knee" of the curve to determine the appropriate cutoff percentage. I'll see if I can make some progress on that. Full histogram/working set data: hal/Hal.gcda: a3000000: 512:PROGRAM_SUMMARY checksum=0x3aa34521 hal/Hal.gcda: counts=2109045, runs=7, sum_all=9749748271, run_max=97136704, sum_max=103439951 hal/Hal.gcda: counter histogram: hal/Hal.gcda: 0: num counts=1824318, min counter=0, cum_counter=0 hal/Hal.gcda: 1: num counts=30727, min counter=1, cum_counter=30727 hal/Hal.gcda: 2: num counts=11646, min counter=2, cum_counter=23292 hal/Hal.gcda: 3: num counts=5414, min counter=3, cum_counter=16242 hal/Hal.gcda: 4: num counts=5156, min counter=4, cum_counter=20624 hal/Hal.gcda: 5: num counts=3379, min counter=5, cum_counter=16895 hal/Hal.gcda: 6: num counts=3674, min counter=6, cum_counter=22044 hal/Hal.gcda: 7: num counts=2310, min counter=7, cum_counter=16170 hal/Hal.gcda: 8: num counts=4756, min counter=8, cum_counter=40330 hal/Hal.gcda: 9: num counts=4725, min counter=10, cum_counter=49265 hal/Hal.gcda: 10: num counts=4256, min counter=12, cum_counter=52450 hal/Hal.gcda: 11: num counts=3424, min counter=14, cum_counter=49760 hal/Hal.gcda: 12: num counts=4936, min counter=16, cum_counter=86713 hal/Hal.gcda: 13: num counts=4025, min counter=20, cum_counter=86217 hal/Hal.gcda: 14: num counts=5271, min counter=24, cum_counter=134994 hal/Hal.gcda: 15: num counts=3052, min counter=28, cum_counter=89797 hal/Hal.gcda: 16: num counts=6812, min counter=32, cum_counter=241575 hal/Hal.gcda: 17: num counts=6269, min counter=40, cum_counter=274778 hal/Hal.gcda: 18: num counts=5652, min counter=48, cum_counter=289677 hal/Hal.gcda: 19: num counts=4240, min counter=56, cum_counter=253391 hal/Hal.gcda: 20: num counts=8321, min counter=64, cum_counter=592920 hal/Hal.gcda: 21: num counts=5824, min counter=80, cum_counter=508559 hal/Hal.gcda: 22: num counts=4846, min counter=96, cum_counter=497364 hal/Hal.gcda: 23: num 
counts=4014, min counter=112, cum_counter=478449 hal/Hal.gcda: 24: num counts=6460, min counter=128, cum_counter=919926 hal/Hal.gcda: 25: num counts=5253, min counter=160, cum_counter=916231 hal/Hal.gcda: 26: num counts=4072, min counter=192, cum_counter=844827 hal/Hal.gcda: 27: num counts=3544, min counter=224, cum_counter=850637 hal/Hal.gcda: 28: num counts=6143, min counter=256, cum_counter=1750280 hal/Hal.gcda: 29: num counts=4690, min counter=320, cum_counter=1648174 hal/Hal.gcda: 30: num counts=3864, min counter=384, cum_counter=1614077 hal/Hal.gcda: 31: num counts=3377, min counter=448, cum_counter=1616477 hal/Hal.gcda: 32: num counts=5986, min counter=512, cum_counter=3426093 hal/Hal.gcda: 33: num counts=4449, min counter=640, cum_counter=3100174 hal/Hal.gcda: 34: num counts=5339, min counter=768, cum_counter=4479538 hal/Hal.gcda: 35: num counts=3402, min counter=896, cum_counter=3264788 hal/Hal.gcda: 36: num counts=6139, min counter=1024, cum_counter=7017454 hal/Hal.gcda: 37: num counts=4224, min counter=1280, cum_counter=5931630 hal/Hal.gcda: 38: num counts=3957, min counter=1536, cum_counter=6576291 hal/Hal.gcda: 39: num counts=2747, min counter=1792, cum_counter=5236457 hal/Hal.gcda: 40: num counts=4640, min counter=2048, cum_counter=10611270 hal/Hal.gcda: 41: num counts=3733, min counter=2560, cum_counter=10510163 hal/Hal.gcda: 42: num counts=3079, min counter=3072, cum_counter=10242287 hal/Hal.gcda: 43: num counts=2651, min counter=3584, cum_counter=10140728 hal/Hal.gcda: 44: num counts=4434, min counter=4096, cum_counter=20361262 hal/Hal.gcda: 45: num counts=3987, min counter=5121, cum_counter=22720940 hal/Hal.gcda: 46: num counts=2943, min counter=6145, cum_counter=19504640 hal/Hal.gcda: 47: num counts=2334, min counter=7169, cum_counter=17826112 hal/Hal.gcda: 48: num counts=2817, min counter=8193, cum_counter=25598488 hal/Hal.gcda: 49: num counts=2779, min counter=10241, cum_counter=31417188 hal/Hal.gcda: 50: num counts=3033, min counter=12290, 
cum_counter=40410833 hal/Hal.gcda: 51: num counts=1853, min counter=14340, cum_counter=28478565 hal/Hal.gcda: 52: num counts=2655, min counter=16389, cum_counter=48690364 hal/Hal.gcda: 53: num counts=2445, min counter=20488, cum_counter=55375590 hal/Hal.gcda: 54: num counts=1691, min counter=24592, cum_counter=44944827 hal/Hal.gcda: 55: num counts=1436, min counter=28719, cum_counter=44036063 hal/Hal.gcda: 56: num counts=2533, min counter=32825, cum_counter=92560194 hal/Hal.gcda: 57: num counts=1974, min counter=41047, cum_counter=88298216 hal/Hal.gcda: 58: num counts=1635, min counter=49329, cum_counter=86653692 hal/Hal.gcda: 59: num counts=1131, min counter=57610, cum_counter=69796538 hal/Hal.gcda: 60: num counts=1638, min counter=65856, cum_counter=120165995 hal/Hal.gcda: 61: num counts=1227, min counter=82393, cum_counter=110414350 hal/Hal.gcda: 62: num counts=1420, min counter=98946, cum_counter=152171465 hal/Hal.gcda: 63: num counts=726, min counter=115741, cum_counter=89865259 hal/Hal.gcda: 64: num counts=1249, min counter=132608, cum_counter=184646974 hal/Hal.gcda: 65: num counts=862, min counter=165900, cum_counter=156618404 hal/Hal.gcda: 66: num counts=930, min counter=198695, cum_counter=199922412 hal/Hal.gcda: 67: num counts=628, min counter=232660, cum_counter=156498665 hal/Hal.gcda: 68: num counts=1136, min counter=266317, cum_counter=338816591 hal/Hal.gcda: 69: num counts=736, min counter=333978, cum_counter=267217317 hal/Hal.gcda: 70: num counts=589, min counter=401495, cum_counter=256810939 hal/Hal.gcda: 71: num counts=431, min counter=469085, cum_counter=216371731 hal/Hal.gcda: 72: num counts=581, min counter=536827, cum_counter=351453204 hal/Hal.gcda: 73: num counts=387, min counter=672090, cum_counter=287503062 hal/Hal.gcda: 74: num counts=345, min counter=811897, cum_counter=302673649 hal/Hal.gcda: 75: num counts=246, min counter=951474, cum_counter=250577118 hal/Hal.gcda: 76: num counts=315, min counter=1084378, cum_counter=382079125 
hal/Hal.gcda: 77: num counts=224, min counter=1362634, cum_counter=336536846 hal/Hal.gcda: 78: num counts=142, min counter=1643302, cum_counter=252854048 hal/Hal.gcda: 79: num counts=104, min counter=1925957, cum_counter=215119385 hal/Hal.gcda: 80: num counts=131, min counter=2211770, cum_counter=321748834 hal/Hal.gcda: 81: num counts=123, min counter=2739896, cum_counter=373169753 hal/Hal.gcda: 82: num counts=72, min counter=3277758, cum_counter=253778382 hal/Hal.gcda: 83: num counts=38, min counter=3853957, cum_counter=158229587 hal/Hal.gcda: 84: num counts=59, min counter=4384565, cum_counter=282974111 hal/Hal.gcda: 85: num counts=56, min counter=5467360, cum_counter=340377441 hal/Hal.gcda: 86: num counts=37, min counter=6569721, cum_counter=254677959 hal/Hal.gcda: 87: num counts=17, min counter=7670909, cum_counter=138198211 hal/Hal.gcda: 88: num counts=31, min counter=8797370, cum_counter=300444212 hal/Hal.gcda: 89: num counts=9, min counter=11064352, cum_counter=104597973 hal/Hal.gcda: 90: num counts=5, min counter=13196116, cum_counter=68483280 hal/Hal.gcda: 91: num counts=25, min counter=15471823, cum_counter=405406333 hal/Hal.gcda: 92: num counts=39, min counter=17739191, cum_counter=769153481 hal/Hal.gcda: 93: num counts=1, min counter=23220597, cum_counter=23248710 hal/Hal.gcda: 94: num counts=1, min counter=26834310, cum_counter=26862423 hal/Hal.gcda: 95: num counts=5, min counter=31885437, cum_counter=169003071 hal/Hal.gcda: 96: num counts=1, min counter=33576018, cum_counter=34881284 hal/Hal.gcda: 99: num counts=1, min counter=60798823, cum_counter=60799245 hal/Hal.gcda: 102: num counts=2, min counter=100714244, cum_counter=204154195 hal/Hal.gcda: counter working sets: hal/Hal.gcda: 0.78%: num counts=1, min counter=100714244 hal/Hal.gcda: 1.56%: num counts=2, min counter=100714244 hal/Hal.gcda: 2.34%: num counts=3, min counter=60798823 hal/Hal.gcda: 3.12%: num counts=5, min counter=31885437 hal/Hal.gcda: 3.90%: num counts=7, min counter=31885437 
hal/Hal.gcda: 4.68%: num counts=9, min counter=31885437 hal/Hal.gcda: 5.46%: num counts=12, min counter=17739191 hal/Hal.gcda: 6.24%: num counts=17, min counter=17739191 hal/Hal.gcda: 7.02%: num counts=21, min counter=17739191 hal/Hal.gcda: 7.80%: num counts=25, min counter=17739191 hal/Hal.gcda: 8.58%: num counts=29, min counter=17739191 hal/Hal.gcda: 9.36%: num counts=34, min counter=17739191 hal/Hal.gcda: 10.14%: num counts=38, min counter=17739191 hal/Hal.gcda: 10.92%: num counts=42, min counter=17739191 hal/Hal.gcda: 11.70%: num counts=47, min counter=17739191 hal/Hal.gcda: 12.48%: num counts=50, min counter=17739191 hal/Hal.gcda: 13.26%: num counts=51, min counter=15471823 hal/Hal.gcda: 14.04%: num counts=56, min counter=15471823 hal/Hal.gcda: 14.82%: num counts=61, min counter=15471823 hal/Hal.gcda: 15.60%: num counts=66, min counter=15471823 hal/Hal.gcda: 16.38%: num counts=71, min counter=15471823 hal/Hal.gcda: 17.16%: num counts=75, min counter=15471823 hal/Hal.gcda: 17.94%: num counts=80, min counter=13196116 hal/Hal.gcda: 18.72%: num counts=86, min counter=11064352 hal/Hal.gcda: 19.50%: num counts=94, min counter=8797370 hal/Hal.gcda: 20.28%: num counts=102, min counter=8797370 hal/Hal.gcda: 21.06%: num counts=111, min counter=8797370 hal/Hal.gcda: 21.84%: num counts=120, min counter=8797370 hal/Hal.gcda: 22.62%: num counts=126, min counter=7670909 hal/Hal.gcda: 23.40%: num counts=136, min counter=7670909 hal/Hal.gcda: 24.18%: num counts=146, min counter=6569721 hal/Hal.gcda: 24.96%: num counts=158, min counter=6569721 hal/Hal.gcda: 25.74%: num counts=169, min counter=6569721 hal/Hal.gcda: 26.52%: num counts=180, min counter=5467360 hal/Hal.gcda: 27.30%: num counts=194, min counter=5467360 hal/Hal.gcda: 28.08%: num counts=208, min counter=5467360 hal/Hal.gcda: 28.86%: num counts=222, min counter=5467360 hal/Hal.gcda: 29.64%: num counts=230, min counter=5467360 hal/Hal.gcda: 30.42%: num counts=247, min counter=4384565 hal/Hal.gcda: 31.20%: num 
counts=264, min counter=4384565 hal/Hal.gcda: 31.98%: num counts=281, min counter=4384565 hal/Hal.gcda: 32.76%: num counts=294, min counter=3853957 hal/Hal.gcda: 33.54%: num counts=313, min counter=3853957 hal/Hal.gcda: 34.32%: num counts=331, min counter=3277758 hal/Hal.gcda: 35.10%: num counts=354, min counter=3277758 hal/Hal.gcda: 35.88%: num counts=377, min counter=3277758 hal/Hal.gcda: 36.66%: num counts=399, min counter=3277758 hal/Hal.gcda: 37.44%: num counts=422, min counter=2739896 hal/Hal.gcda: 38.22%: num counts=450, min counter=2739896 hal/Hal.gcda: 39.00%: num counts=477, min counter=2739896 hal/Hal.gcda: 39.78%: num counts=505, min counter=2739896 hal/Hal.gcda: 40.56%: num counts=522, min counter=2739896 hal/Hal.gcda: 41.34%: num counts=554, min counter=2211770 hal/Hal.gcda: 42.12%: num counts=588, min counter=2211770 hal/Hal.gcda: 42.90%: num counts=622, min counter=2211770 hal/Hal.gcda: 43.68%: num counts=653, min counter=2211770 hal/Hal.gcda: 44.46%: num counts=680, min counter=1925957 hal/Hal.gcda: 45.24%: num counts=720, min counter=1925957 hal/Hal.gcda: 46.02%: num counts=757, min counter=1925957 hal/Hal.gcda: 46.80%: num counts=797, min counter=1643302 hal/Hal.gcda: 47.58%: num counts=843, min counter=1643302 hal/Hal.gcda: 48.36%: num counts=890, min counter=1643302 hal/Hal.gcda: 49.14%: num counts=929, min counter=1362634 hal/Hal.gcda: 49.92%: num counts=985, min counter=1362634 hal/Hal.gcda: 50.70%: num counts=1041, min counter=1362634 hal/Hal.gcda: 51.48%: num counts=1097, min counter=1362634 hal/Hal.gcda: 52.26%: num counts=1132, min counter=1084378 hal/Hal.gcda: 53.04%: num counts=1202, min counter=1084378 hal/Hal.gcda: 53.82%: num counts=1272, min counter=1084378 hal/Hal.gcda: 54.60%: num counts=1342, min counter=1084378 hal/Hal.gcda: 55.38%: num counts=1412, min counter=1084378 hal/Hal.gcda: 56.16%: num counts=1446, min counter=951474 hal/Hal.gcda: 56.94%: num counts=1526, min counter=951474 hal/Hal.gcda: 57.72%: num counts=1606, min 
counter=951474 hal/Hal.gcda: 58.50%: num counts=1684, min counter=951474 hal/Hal.gcda: 59.28%: num counts=1760, min counter=811897 hal/Hal.gcda: 60.06%: num counts=1854, min counter=811897 hal/Hal.gcda: 60.84%: num counts=1948, min counter=811897 hal/Hal.gcda: 61.62%: num counts=2029, min counter=811897 hal/Hal.gcda: 62.40%: num counts=2124, min counter=672090 hal/Hal.gcda: 63.18%: num counts=2237, min counter=672090 hal/Hal.gcda: 63.96%: num counts=2351, min counter=672090 hal/Hal.gcda: 64.74%: num counts=2425, min counter=536827 hal/Hal.gcda: 65.52%: num counts=2567, min counter=536827 hal/Hal.gcda: 66.30%: num counts=2709, min counter=536827 hal/Hal.gcda: 67.08%: num counts=2851, min counter=536827 hal/Hal.gcda: 67.86%: num counts=2993, min counter=536827 hal/Hal.gcda: 68.64%: num counts=3070, min counter=469085 hal/Hal.gcda: 69.42%: num counts=3232, min counter=469085 hal/Hal.gcda: 70.20%: num counts=3395, min counter=469085 hal/Hal.gcda: 70.98%: num counts=3543, min counter=401495 hal/Hal.gcda: 71.76%: num counts=3733, min counter=401495 hal/Hal.gcda: 72.54%: num counts=3923, min counter=401495 hal/Hal.gcda: 73.32%: num counts=4071, min counter=333978 hal/Hal.gcda: 74.10%: num counts=4299, min counter=333978 hal/Hal.gcda: 74.88%: num counts=4527, min counter=333978 hal/Hal.gcda: 75.66%: num counts=4753, min counter=333978 hal/Hal.gcda: 76.44%: num counts=4961, min counter=266317 hal/Hal.gcda: 77.22%: num counts=5247, min counter=266317 hal/Hal.gcda: 78.00%: num counts=5533, min counter=266317 hal/Hal.gcda: 78.78%: num counts=5819, min counter=266317 hal/Hal.gcda: 79.56%: num counts=5980, min counter=232660 hal/Hal.gcda: 80.34%: num counts=6308, min counter=232660 hal/Hal.gcda: 81.12%: num counts=6603, min counter=198695 hal/Hal.gcda: 81.90%: num counts=6986, min counter=198695 hal/Hal.gcda: 82.68%: num counts=7370, min counter=198695 hal/Hal.gcda: 83.46%: num counts=7722, min counter=165900 hal/Hal.gcda: 84.24%: num counts=8181, min counter=165900 
hal/Hal.gcda: 85.02%: num counts=8621, min counter=132608 hal/Hal.gcda: 85.80%: num counts=9195, min counter=132608 hal/Hal.gcda: 86.58%: num counts=9636, min counter=115741 hal/Hal.gcda: 87.36%: num counts=10284, min counter=115741 hal/Hal.gcda: 88.14%: num counts=11007, min counter=98946 hal/Hal.gcda: 88.92%: num counts=11704, min counter=98946 hal/Hal.gcda: 89.70%: num counts=12574, min counter=82393 hal/Hal.gcda: 90.48%: num counts=13499, min counter=65856 hal/Hal.gcda: 91.26%: num counts=14569, min counter=65856 hal/Hal.gcda: 92.04%: num counts=15700, min counter=57610 hal/Hal.gcda: 92.82%: num counts=17240, min counter=49329 hal/Hal.gcda: 93.60%: num counts=18930, min counter=41047 hal/Hal.gcda: 94.38%: num counts=20933, min counter=32825 hal/Hal.gcda: 95.16%: num counts=23128, min counter=28719 hal/Hal.gcda: 95.94%: num counts=26146, min counter=20488 hal/Hal.gcda: 96.72%: num counts=30069, min counter=16389 hal/Hal.gcda: 97.50%: num counts=35296, min counter=10241 hal/Hal.gcda: 98.28%: num counts=43669, min counter=6145 hal/Hal.gcda: 99.06%: num counts=59589, min counter=3072 hal/Hal.gcda: 99.90%: num counts=115840, min counter=320 Teresa
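The bucket numbers in the histogram dump above appear to follow a log2 scale with four linear sub-buckets per power of two, with values 0 to 3 each getting their own bucket. A small Python sketch of that mapping, inferred from the dump rather than copied from the gcov sources (the function name is hypothetical), checked against a few (bucket, min counter) pairs taken from the listing:

```python
def histo_index(value):
    """Map a non-negative counter value to a histogram bucket index.

    Assumed layout: values 0..3 map to themselves; above that, the position
    of the top set bit selects a group of four buckets and the next two bits
    select the bucket within the group.
    """
    top_bit = value.bit_length() - 1  # -1 for value == 0
    if top_bit < 2:
        return value
    next_two_bits = (value >> (top_bit - 2)) & 0x3
    return (top_bit - 1) * 4 + next_two_bits


# (bucket index, min counter) pairs as printed in the dump above.
SAMPLES = [
    (0, 0),
    (1, 1),
    (8, 8),
    (9, 10),
    (12, 16),
    (45, 5121),
    (96, 33576018),
    (99, 60798823),
    (102, 100714244),
]

for bucket, min_counter in SAMPLES:
    print(f"bucket {bucket:3d}: min counter {min_counter:>10d} "
          f"-> computed index {histo_index(min_counter)}")
```

If the assumption holds, the computed index equals the printed bucket number for every pair; the gaps in the dump (e.g. no buckets 97, 98, 100, 101) would then simply be buckets with no counters in them.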
> hal/Hal.gcda: 96.72%: num counts=30069, min counter=16389
> hal/Hal.gcda: 97.50%: num counts=35296, min counter=10241
> hal/Hal.gcda: 98.28%: num counts=43669, min counter=6145
> hal/Hal.gcda: 99.06%: num counts=59589, min counter=3072
> hal/Hal.gcda: 99.90%: num counts=115840, min counter=320
>
> So it looks like you would want a cutoff of 97.5% to get close to what
> was there before.

Setting the default cutoff to something like 95% would sound fine to me. I see I asked to reduce the parameter but suggested 990. Markus, can you try setting HOT_BB_COUNT_WS_PERMILLE to 950?

Honza
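The comparison between the old fraction-based cutoff and the working sets can be reproduced from the numbers quoted in this bug. This is a minimal sketch that only redoes the arithmetic; it makes no claim about how GCC itself looks up a working set for a given permille value, and the function names are mine.

```python
# Values quoted in the thread (hal/Hal.gcda program summary).
SUM_MAX = 103439951
HOT_BB_COUNT_FRACTION = 10000  # old parameter value named in the thread

# (percent of total counter sum, number of counters, min counter value)
WORKING_SETS = [
    (94.38, 20933, 32825),
    (95.16, 23128, 28719),
    (95.94, 26146, 20488),
    (96.72, 30069, 16389),
    (97.50, 35296, 10241),
    (98.28, 43669, 6145),
    (99.06, 59589, 3072),
    (99.90, 115840, 320),
]


def old_cutoff(sum_max, fraction):
    """Old heuristic: a block is hot if its count reaches sum_max / fraction."""
    return sum_max // fraction


def nearest_working_set(cutoff, working_sets):
    """Working set whose min counter is closest to the given cutoff count."""
    return min(working_sets, key=lambda ws: abs(ws[2] - cutoff))


cutoff = old_cutoff(SUM_MAX, HOT_BB_COUNT_FRACTION)
pct, num_counts, min_counter = nearest_working_set(cutoff, WORKING_SETS)

print(f"old cutoff count: {cutoff}")
print(f"nearest working set: {pct:.2f}% "
      f"(min counter {min_counter}, {num_counts} counters)")
```

On this profile the working sets around 95% have min counters well above the old cutoff count, so a 95% default would treat fewer blocks as hot than the old heuristic did.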
(In reply to comment #159)
> > hal/Hal.gcda: 96.72%: num counts=30069, min counter=16389
> > hal/Hal.gcda: 97.50%: num counts=35296, min counter=10241
> > hal/Hal.gcda: 98.28%: num counts=43669, min counter=6145
> > hal/Hal.gcda: 99.06%: num counts=59589, min counter=3072
> > hal/Hal.gcda: 99.90%: num counts=115840, min counter=320
> >
> > So it looks like you would want a cutoff of 97.5% to get close to what
> > was there before.
>
> Setting the default cutoff to something like 95% would sound fine to me. I
> see i asked to reduce the parameter but suggested 990. Markus, can you
> try setting HOT_BB_COUNT_WS_PERMILLE to 950?

It doesn't help:

HOT_BB_COUNT_WS_PERMILLE=950: size of libxul.so: 42149632 bytes

(In reply to comment #157)
> (Unfortunately this new ICE happens with yesterdays gcc when linking libxul:
>
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp: In member function
> ‘CreateRange’:
> /var/tmp/mozilla-central/content/base/src/nsDocument.cpp:4999:0: internal
> compiler error: in cgraph_mark_address_taken_node, at cgraph.c:1409
>
> I will open a new PR for this later.)

See PR55669
I've opened a new bug for the binary size increase issue: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55674
The libxul binary size issue is solved now.

During testing I came across another issue that looks similar to the one in comment 146:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccwu5G98.ltrans4.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN13nsXULDocument14MaybeBroadcastEv.429466' which may overflow at runtime; recompile with -fPIC
/tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
/tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
collect2: error: ld returned 1 exit status

After I deleted both nsXULDocument.o and nsXULDocument.gcda and rebuilt with:

make -f client.mk realbuild MOZ_PROFILE_USE=1

the problem went away.
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45375
>
> --- Comment #162 from Markus Trippelsdorf <markus at trippelsdorf dot de> 2012-12-13 22:25:27 UTC ---
> The libxul binary size issue is solved now.

Good.

> During testing I came across another issue that looks similar
> to the one Comment 146:
> /usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld:
> error: /tmp/ccwu5G98.ltrans4.ltrans.o: requires dynamic R_X86_64_PC32 reloc
> against '_ZN13nsXULDocument14MaybeBroadcastEv.429466' which may overflow at
> runtime; recompile with -fPIC
> /tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function
> nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type*
> NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*,
> void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]:
> error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
> /tmp/ccwu5G98.ltrans4.ltrans.o:ccwu5G98.ltrans4.o:function
> nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type*
> NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*,
> void (nsXULDocument::*)()) [clone .local.39398] [clone .constprop.84952]:
> error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .429466]'
> collect2: error: ld returned 1 exit status
>
> After I deleted both nsXULDocument.o and nsXULDocument.gcda and rebuild with:
> make -f client.mk realbuild MOZ_PROFILE_USE=1
> the problem did go away.

This sounds like an independent problem with partitioning. I am travelling till the 17th, so I will try to check this locally myself. Perhaps you can give details on your setup? (My Mozilla tree got quite dirty with various local hacks I made over time; perhaps I should refresh it to a cleaner state.)

Honza
Some trouble while building LLVM with -flto.

../x86_64-linux-gnu/bin/ld.gold: error: /tmp/cc60XH2F.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against 'X86CompilationCallback2' which may overflow at runtime; recompile with -fPIC

Code:

  extern "C" {
    void X86CompilationCallback(void);
    asm(
      ".text\n"
      ".align 8\n"
      ".globl " ASMPREFIX "X86CompilationCallback\n"
      TYPE_FUNCTION(X86CompilationCallback)
      ASMPREFIX "X86CompilationCallback:\n"
      ...
      "movq 8(%rbp), %rdx\n"
      "call " ASMPREFIX "X86CompilationCallback2\n"
      "addq $32, %rsp\n"
      ...
    );
  }

  void __attribute__((used))
  X86CompilationCallback2(intptr_t *StackPtr, intptr_t RetAddr) {
    intptr_t *RetAddrLoc = &StackPtr[1];
    ...
  }
  }
OK, I tracked down the undefined reference behind

error: /tmp/cc0oq4BG.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' which may overflow at runtime; recompile with -fPIC

It is caused by a bug in Mozilla: it includes a file defining a virtual function that uses '_ZN12SkAnnotationC1ER23SkFlattenableReadBuffer' (in SkPaint), but it never links with the implementation. Normally the function is optimized out. Here it is not, because we never optimize out virtual functions prior to inlining (they are kept for devirtualization), and in the WPA path we forget to remove them when done. Fixed by the following patch:

Index: ipa-inline.c
===================================================================
--- ipa-inline.c (revision 194916)
+++ ipa-inline.c (working copy)
@@ -1793,7 +1793,7 @@
     }
   inline_small_functions ();
-  symtab_remove_unreachable_nodes (true, dump_file);
+  symtab_remove_unreachable_nodes (false, dump_file);
   free (order);
   /* Inline functions with a property that after inlining into all callers the
Index: lto/lto.c
===================================================================
--- lto/lto.c (revision 194916)
+++ lto/lto.c (working copy)
@@ -3215,6 +3215,7 @@
   cgraph_state = CGRAPH_STATE_IPA_SSA;
   execute_ipa_pass_list (all_regular_ipa_passes);
+  symtab_remove_unreachable_nodes (false, dump_file);
   if (cgraph_dump_file)
     {
Index: cgraphclones.c
===================================================================
--- cgraphclones.c (revision 194916)
+++ cgraphclones.c (working copy)
@@ -184,6 +184,7 @@
   new_node->symbol.decl = decl;
   symtab_register_node ((symtab_node)new_node);
   new_node->origin = n->origin;
+  new_node->symbol.lto_file_data = n->symbol.lto_file_data;
   if (new_node->origin)
     {
       new_node->next_nested = new_node->origin->nested;
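To make the effect of the boolean argument in the patch above more concrete, here is a toy reachability sweep. It is only a sketch under stated assumptions: `struct toy_node`, `count_reachable` and the fixed-size graph are invented for illustration and are much simpler than GCC's symbol table. The idea it models is the one described in the comment: while the "before inlining" flag is set, virtual functions are treated as roots, so they and everything they reference survive; once it is cleared, only externally visible roots keep things alive.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16   /* hypothetical limit for this toy graph */

struct toy_node {
    bool externally_visible;   /* must be kept: callable from outside */
    bool is_virtual;           /* possible devirtualization target */
    int  callees[MAX_NODES];   /* indices of referenced nodes */
    int  n_callees;
};

/* Mark everything reachable from the roots and return how many nodes
   survive.  n must not exceed MAX_NODES.  While before_inlining is true,
   virtual functions are roots too. */
size_t count_reachable(const struct toy_node *nodes, size_t n,
                       bool before_inlining, bool *reachable)
{
    int stack[MAX_NODES];
    size_t top = 0, kept = 0;

    for (size_t i = 0; i < n; i++) {
        reachable[i] = nodes[i].externally_visible
                       || (before_inlining && nodes[i].is_virtual);
        if (reachable[i])
            stack[top++] = (int)i;
    }
    /* Each node is pushed at most once, so the stack cannot overflow. */
    while (top > 0) {
        const struct toy_node *cur = &nodes[stack[--top]];
        for (int k = 0; k < cur->n_callees; k++) {
            int callee = cur->callees[k];
            if (!reachable[callee]) {
                reachable[callee] = true;
                stack[top++] = callee;
            }
        }
    }
    for (size_t i = 0; i < n; i++)
        kept += reachable[i];
    return kept;
}
```

With a graph where an unused virtual method references a symbol that has no definition, the sweep with the flag set keeps both, which corresponds to the undefined reference seen at link time; the sweep with the flag cleared drops them.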
Markus, the appearance of the undefined references that I fixed with the patch above is highly sensitive to partitioning and inlining decisions. Can you please check whether the problem with PGO remains? It may be another instance of the same issue.
(In reply to comment #166)
> Markus, the appearance of undefined references I fixed by patch above is highly
> sensitive to partitioning and inlining decision. Can you, please, check if the
> problem with PGO remains? It may be another instance of the same issue.

Just checked it using your patch from comment 165, but the issue from comment 162 is still there:

/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccACx905.ltrans6.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN13nsXULDocument14MaybeBroadcastEv.466048' which may overflow at runtime; recompile with -fPIC
/tmp/ccACx905.ltrans6.ltrans.o:ccACx905.ltrans6.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.42120] [clone .constprop.89117]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .466048]'
/tmp/ccACx905.ltrans6.ltrans.o:ccACx905.ltrans6.o:function nsRunnableMethodTraits<void (nsXULDocument::*)(), true>::base_type* NS_NewRunnableMethod<nsXULDocument*, void (nsXULDocument::*)()>(nsXULDocument*, void (nsXULDocument::*)()) [clone .local.42120] [clone .constprop.89117]: error: undefined reference to 'nsXULDocument::MaybeBroadcast() [clone .466048]'

Also, the memory usage went through the roof (not sure whether this is caused by your patch or by my recent git-pull of mozilla-central): over 9GB of RAM is needed (not much fun on my 8GB test machine). So I will stop testing Firefox for now, until LTO/PGO memory usage gets sane again (hopefully for 4.9).
Too bad :( The patch should reduce memory usage, not increase it, so it must be something else. My build was around 7GB w/o PGO; I will need to try the PGO builds myself. My tree is however somewhat out of date. I will try a fresh checkout and post memory usage stats. Perhaps you can share somewhere the -lm.res and *wpa*cgraph dump of a --save-temps -fdump-ipa-cgraph build? I will try to figure out those symbols.
Author: hubicka Date: Wed Jan 9 21:22:26 2013 New Revision: 195066 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=195066 Log: PR lto/45375 * ipa-inline.c (ipa_inline): Remove extern inlines and virtual functions. * cgraphclones.c (cgraph_clone_node): Cpoy also LTO file data. * lto.c (do_whole_program_analysis): Remove unreachable nodes after IPA. Modified: trunk/gcc/ChangeLog trunk/gcc/cgraphclones.c trunk/gcc/ipa-inline.c trunk/gcc/lto/ChangeLog trunk/gcc/lto/lto.c
OK, here is updated memory use: cgraph.c:863 (cgraph_allocate_init_indirect_info 5905200: 0.1% 0: 0.0% 6020160: 0.1% 0: 0.0% 298134 tree.c:1237 (build_int_cst_wide) 15554272: 0.4% 0: 0.0% 782528: 0.0% 0: 0.0% 510525 tree.c:1559 (build_string) 10685931: 0.2% 0: 0.0% 16715642: 0.4% 2193469: 1.7% 563828 stringpool.c:75 (alloc_node) 0: 0.0% 0: 0.0% 30574880: 0.7% 0: 0.0% 764372 lto/lto.c:2286 (create_subid_section_table) 1522184: 0.0% 0: 0.0% 39117064: 0.8% 8051472: 6.4% 3978 stringpool.c:58 (stringpool_ggc_alloc) 0: 0.0% 0: 0.0% 41092405: 0.9% 2954893: 2.4% 764372 gimple.c:3167 (iterative_hash_canonical_type) 45040752: 1.0% 0: 0.0% 0: 0.0% 0: 0.0% 2815047 lto/lto.c:1222 (iterative_hash_gimple_type) 68276864: 1.6% 0: 0.0% 0: 0.0% 0: 0.0% 4267304 ggc-common.c:249 (ggc_cleared_alloc_ptr_array_tw 91784: 0.0% 487289424:48.8% 71432600: 1.5% 248976: 0.2% 10974 lto/lto.c:1266 (iterative_hash_gimple_type) 75288576: 1.8% 0: 0.0% 0: 0.0% 0: 0.0% 4705536 lto-section-in.c:362 (lto_new_in_decl_state) 694320: 0.0% 0: 0.0% 94861800: 2.0% 0: 0.0% 796301 tree.c:1263 (build_int_cst_wide) 76232736: 1.8% 0: 0.0% 19358880: 0.4% 0: 0.0% 2987238 cgraph.c:794 (cgraph_create_edge_1) 0: 0.0% 0: 0.0% 125510632: 2.7% 0: 0.0% 1206833 vec.h:565 ((null)) 66034564: 1.5% 98716: 0.0% 68500548: 1.5% 3484420: 2.8% 597783 vec.h:695 ((null)) 124654648: 2.9% 122044288:12.2% 63749232: 1.4% 2614800: 2.1% 1590429 tree-streamer-in.c:562 (streamer_alloc_tree) 125829312: 2.9% 0: 0.0% 74222904: 1.6% 7072: 0.0% 2005091 lto/lto.c:267 (lto_read_in_decl_state) 1478720: 0.0% 0: 0.0% 216390688: 4.7% 38247784:30.5% 5574107 vec.h:747 ((null)) 173791988: 4.0% 19565412: 2.0% 68225644: 1.5% 2680332: 2.1% 1396070 vec.h:707 ((null)) 133872480: 3.1% 0: 0.0% 285212728: 6.1% 800360: 0.6% 1059913 cgraph.c:500 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 472831880:10.2% 0: 0.0% 1597405 tree.c:1223 (build_int_cst_wide) 607138944:14.1% 0: 0.0% 10427664: 0.2% 4719336: 3.8% 315034 toplev.c:959 (realloc_for_line_map) 0: 0.0% 358037664:35.8% 
1073872920:23.1% 184: 0.0% 16 tree-streamer-in.c:573 (streamer_alloc_tree) 2762184192:64.2% 0: 0.0% 1861017624:40.0% 59027616:47.1% 34649937 Total 4302007795 999178184 4651003487 125411458 68828967 source location Garbage Freed Leak Overhead Times ------------------------------------------------------- Actually it is a bit of improvement over my past report. Some obvious things 1) we still soak in too many trees (40%) of memory. The per-tree stats are: decls 17310018 -1609736744 types 8983387 1509209016 exprs 2427302 80045744 constants 4079292 135393547 binfos 2005091 200038072 random kinds 5691481 227659664 and counts: tree_list 5691475 pointer_type 2337585 record_type 3702066 function_decl 1856282 field_decl 2812564 const_decl 2739702 parm_decl 3549707 type_decl 4780459 result_decl 1144482 tree_binfo 2005091 2) new linemaps are still a disaster 3) VEC rewrite did break stats. Honza
Created attachment 29182 [details]
Patch to compress line info

This patch removes column information from LTO (so we lose caret diagnostics in warning/error output at LTO time, which seems a reasonable thing to do) and avoids entering duplicate locators into the linemap. The patch reduces linemap usage from 23% to 5% of GGC memory, saving 1-2GB on Mozilla (and also reducing LTO file size).
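The "avoid entering duplicate locators" part can be pictured as a small cache in front of the location table. The sketch below is an assumption-laden illustration, not the attached patch: `struct loc_cache`, `intern_location`, the hash function and the slot count are all hypothetical. It keys on (file, line) only, matching the idea of dropping columns, and hands out a new location id only on a cache miss.

```c
#define LOC_CACHE_SLOTS 1024   /* hypothetical size; must be a power of two */

struct loc_entry {
    int      used;
    unsigned file_id;
    unsigned line;
    unsigned loc;              /* location id previously handed out */
};

struct loc_cache {
    struct loc_entry slots[LOC_CACHE_SLOTS];
    unsigned next_loc;         /* number of locations created so far */
};

static unsigned loc_hash(unsigned file_id, unsigned line)
{
    return (file_id * 2654435761u) ^ (line * 40503u);
}

/* Return the location id for (file_id, line), creating a new one only when
   the pair is not cached.  The cache is direct-mapped: a colliding pair
   evicts the old entry, so an evicted pair may be created again later. */
unsigned intern_location(struct loc_cache *c, unsigned file_id, unsigned line)
{
    struct loc_entry *e =
        &c->slots[loc_hash(file_id, line) & (LOC_CACHE_SLOTS - 1)];

    if (e->used && e->file_id == file_id && e->line == line)
        return e->loc;         /* hit: no new table entry needed */

    e->used = 1;
    e->file_id = file_id;
    e->line = line;
    e->loc = c->next_loc++;
    return e->loc;
}
```

A zero-initialized `struct loc_cache` is a valid empty cache. A larger or shared cache trades memory for fewer re-created entries, which is the tuning question raised later in the thread.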
(In reply to comment #171)
> Created attachment 29182 [details]
> Patch to compress line info
>
> This patch removes column information from LTO (so we lose caret diagnostics
> in warnings/errors output at LTO time that seems reasonable thing to do) and
> avoid entering duplicate locators into the linemap. The patch reduces linemap
> usage from 23% to 5% of GGC memory saving 1-2GB on Mozilla. (also reducing LTO
> file size).

Patch looks incomplete? What does dropping only the columns do to memory use?

Please disable flag_diagnostics_show_caret unconditionally in lto1 if you do that.
> Patch looks incomplete? What does dropping columns only do to memory use?

I will check. I remember that prior to columns there were also some savings from the cache. Just saving 20% out of 23% is cooler than saving 20% out of 5% of memory. Note that we are still over 8GB for Mozilla LTO after the latest Mozilla checkout.

> Please disable flag_diagnostics_show_caret unconditionally in lto1 if you
> do that.

Yeah, I wanted to, but I am not sure where in lto.c the proper place to do so is.
lto_post_options ?
Created attachment 29191 [details]
alternative patch without the compression.

This is an alternative patch that just skips columns without doing the compression. It seems that compression is actually quite effective:

  no compression, no column info:  1073872920 bytes
  compression, no column info:      268566544 bytes
  compression, with column info:   1073872920 bytes

Perhaps I messed up the caching with column info? It strikes me as wrong that the numbers are precisely the same. But perhaps it is just the reallocation strategy. I will also generate fresh numbers for unpatched GCC.
(In reply to comment #175)
> Created attachment 29191 [details]
> alternative patch without the compression.
>
> This is alternative patch just skipping columns but not doing the compression.
> It seems that compression is actually quite effective.
> Non-compressing w/o column info is 1073872920 bytes,
> compression + no column is 268566544 bytes
> compression + column is 1073872920 bytes
>
> Perhaps I messed up the caching with column info? It strikes wrong that the
> numbers are precisely the same. But perhaps it is just reallocation strategy. I
> will also generate fresh numbers for unpatched GCC.

+ linemap_line_start (line_table, data_in->current_line, 0);
- return linemap_position_for_column (line_table, data_in->current_col);
+ return linemap_position_for_column (line_table, 0);

linemap_line_start will already return a location for column 0. So I'd say we want

  if (file_change)
    {
      ...
    }
  return linemap_line_start (line_table, data_in->current_line, 0);

instead. Which hopefully does nothing if nothing changed.

I don't know how you implement caching - you didn't attach a patch to do so.
Created attachment 29192 [details]
caching

Aha, now I see why you asked for the complete patch. I obviously messed up the code. This is how I do caching (in a version that still has columns in it). I removed the final incarnation of the patch, but it should be easy to redo.
The global cache with an arbitrarily large size reduces usage down to 0.3% (16908304 bytes). So it seems that sharing across files is quite an important part of the game. I will try to fiddle with the cache size to see how big a cache is actually needed.

Unpatched mainline needs 1073872920 bytes, which is the same as with dropping columns and/or my initial local caching implementation. This is apparently because of the exponential resizing of the table (i.e. we simply do not save enough to see a difference).

Honza
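The point about exponential resizing can be shown with a few lines of arithmetic. This is a sketch assuming a plain doubling policy; the real growth policy of the line table is not stated in the thread, and `allocated_bytes` is a hypothetical helper. Under doubling, the allocated size only changes when the entry count crosses a capacity boundary, so a moderate reduction in entries can leave the reported bytes unchanged.

```c
#include <stddef.h>

/* Bytes allocated for `entries` items of `entry_size` bytes when the table
   starts at `initial_cap` entries and doubles whenever it is full.
   Assumed policy, for illustration only. */
size_t allocated_bytes(size_t entries, size_t entry_size, size_t initial_cap)
{
    size_t cap = initial_cap ? initial_cap : 1;
    while (cap < entries)
        cap *= 2;
    return cap * entry_size;
}
```

For example, 600 and 1000 entries both land in a 1024-entry table when starting from 256. The figures reported in this thread sit about 128 kB above a power of two (1073872920 = 2^30 + 131096 and 268566544 = 2^28 + 131088), which is at least consistent with this explanation.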
I'm currently (gcc revision 196427, FF changeset 123831:c95439870e05) facing a few ICEs during the compilation phase with the following backtrace: #0 0x0000000000f89a73 in get_location_from_adhoc_loc (set=0x7ffff7ff2000, loc=2947526575) at /home/mjambor/gcc/trunk/src/libcpp/line-map.c:165 #1 0x0000000000c247fe in inlined_function_outer_scope_p (block=0x7fffee4bcb28) at /home/mjambor/gcc/trunk/src/gcc/tree.h:5561 #2 pack_ts_block_value_fields (expr=0x7fffee4bcb28, bp=0x7fffffffd1a0, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:319 #3 streamer_pack_tree_bitfields (ob=0x1c73210, bp=0x7fffffffd1a0, expr=0x7fffee4bcb28) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:417 #4 0x00000000009c3bc9 in lto_write_tree (ref_p=true, expr=0x7fffee4bcb28, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:317 #5 lto_output_tree (ob=0x1c73210, expr=0x7fffee4bcb28, ref_p=true, this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410 #6 0x0000000000c26617 in write_ts_common_tree_pointers (ref_p=true, expr=0x7ffff3f6bc80, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:514 #7 streamer_write_tree_body (ob=0x1c73210, expr=0x7ffff3f6bc80, ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:845 #8 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7ffff3f6bc80, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321 #9 lto_output_tree (ob=ob@entry=0x1c73210, expr=0x7ffff3f6bc80, ref_p=ref_p@entry=true, this_ref_p=this_ref_p@entry=true) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410 #10 0x0000000000c26e62 in write_ts_exp_tree_pointers (ref_p=<optimized out>, expr=<optimized out>, ob=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:747 #11 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc63dc0, ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:884 #12 0x00000000009c3bf7 in 
lto_write_tree (ref_p=true, expr=0x7fffecc63dc0, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321 #13 lto_output_tree (ob=0x1c73210, expr=0x7fffecc63dc0, ref_p=true, this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410 #14 0x0000000000c26df8 in write_ts_exp_tree_pointers (ref_p=<optimized out>, expr=<optimized out>, ob=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:746 #15 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc70078, ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:884 #16 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7fffecc70078, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321 #17 lto_output_tree (ob=ob@entry=0x1c73210, expr=0x7fffecc70078, ref_p=ref_p@entry=true, this_ref_p=this_ref_p@entry=true) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410 #18 0x0000000000c2681d in write_ts_decl_common_tree_pointers (ref_p=true, expr=0x7fffecc6d720, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:584 #19 streamer_write_tree_body (ob=0x1c73210, expr=0x7fffecc6d720, ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/tree-streamer-out.c:857 #20 0x00000000009c3bf7 in lto_write_tree (ref_p=true, expr=0x7fffecc6d720, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:321 #21 lto_output_tree (ob=0x1c73210, expr=0x7fffecc6d720, ref_p=true, this_ref_p=<optimized out>) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:410 #22 0x0000000000ecd118 in output_gimple_stmt (stmt=0x7fffec6206c0, ob=0x1c73210) at /home/mjambor/gcc/trunk/src/gcc/gimple-streamer-out.c:143 #23 output_bb (ob=0x1c73210, bb=0x7fffed130f08, fn=0x7fffef8603f0) at /home/mjambor/gcc/trunk/src/gcc/gimple-streamer-out.c:199 #24 0x00000000009c4f26 in output_function (node=0x7fffef8614a0) at /home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:823 #25 lto_output () at 
/home/mjambor/gcc/trunk/src/gcc/lto-streamer-out.c:987 #26 0x00000000009fa971 in ipa_write_summaries_2 ( pass=0x1618f00 <pass_ipa_lto_gimple_out>, state=0x1ad8c00) at /home/mjambor/gcc/trunk/src/gcc/passes.c:2408 The statement being written is: (gdb) call debug_gimple_stmt ((gimple)0x7fffec6206c0) # DEBUG v => 18444633011384221696 This happens for example during compilation of js/src/ion/shared/CodeGenerator-shared.cpp
Try

Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c (revision 196520)
+++ gcc/tree-inline.c (working copy)
@@ -3929,7 +3929,7 @@ expand_call_inline (basic_block bb, gimp
     {
       id->block = make_node (BLOCK);
       BLOCK_ABSTRACT_ORIGIN (id->block) = fn;
-      BLOCK_SOURCE_LOCATION (id->block) = input_location;
+      BLOCK_SOURCE_LOCATION (id->block) = LOCATION_LOCUS (input_location);
       prepend_lexical_block (gimple_block (stmt), id->block);
     }
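As background for the backtrace in comment #179, the following sketch shows why a value such as loc=2947526575 ends up in get_location_from_adhoc_loc. It rests on an assumption about the location encoding of that era: that ordinary locations fit in 31 bits and anything with the top bit set is decoded as an index into the ad hoc table. The `toy_` names and the constant are illustrative, not copied from libcpp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: ordinary locations fit in 31 bits. */
#define TOY_MAX_SOURCE_LOCATION 0x7FFFFFFFu

/* True when the value is outside the ordinary range and would therefore be
   decoded as an index into the ad hoc (location + block) table. */
bool toy_is_adhoc_loc(uint32_t loc)
{
    return (loc & TOY_MAX_SOURCE_LOCATION) != loc;
}

/* Strip the marker bit, leaving the table index. */
uint32_t toy_adhoc_index(uint32_t loc)
{
    return loc & TOY_MAX_SOURCE_LOCATION;
}
```

Under that assumption, 2947526575 (0xAFAFAFAF in hex) has the top bit set and would be looked up as ad hoc entry 0x2FAFAFAF, which is far larger than any plausible table. Storing only the plain locus in the block, as the patch does with LOCATION_LOCUS, avoids carrying an ad hoc value there in the first place.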
The bug described in comment #179 is now PR 56570.
OK, after a while I should update the stats here. Richard's new tree merging patch makes libxul linking a lot faster and less memory consuming. Peak memory usage (in top) is now just below 10GB; with a bit of incremental improvement I hope to get below 8GB again soon.

Build time is
  real 19m0.355s
  user 56m20.459s
  sys  2m17.533s

GGC memory usage after stream in: 4938399k

Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  72.86 (12%) usr   0.90 ( 3%) sys  75.25 (11%) wall  270952 kB ( 7%) ggc
 phase stream in         : 274.88 (44%) usr   9.01 (26%) sys 294.99 (43%) wall 3478515 kB (93%) ggc
 phase stream out        : 282.18 (45%) usr  24.40 (71%) sys 308.42 (45%) wall    7162 kB ( 0%) ggc
 garbage collection      :  12.99 ( 2%) usr   0.01 ( 0%) sys  13.00 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   1.95 ( 0%) usr   0.00 ( 0%) sys   1.95 ( 0%) wall      32 kB ( 0%) ggc
 ipa cp                  :   9.82 ( 2%) usr   0.39 ( 1%) sys  10.26 ( 2%) wall  418482 kB (11%) ggc
 ipa inlining heuristics :  39.30 ( 6%) usr   1.12 ( 3%) sys  41.52 ( 6%) wall 1353294 kB (36%) ggc
 ipa lto gimple in       :   0.45 ( 0%) usr   0.15 ( 0%) sys   0.62 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  18.24 ( 3%) usr   1.50 ( 4%) sys  19.86 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 200.68 (32%) usr   5.85 (17%) sys 216.44 (32%) wall 3887175 kB (103%) ggc
 ipa lto decl out        : 256.24 (41%) usr  13.44 (39%) sys 271.24 (40%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   7.20 ( 1%) usr   1.61 ( 5%) sys   8.83 ( 1%) wall 2134157 kB (57%) ggc
 ipa lto decl merge      :  27.71 ( 4%) usr   0.01 ( 0%) sys  27.72 ( 4%) wall    8270 kB ( 0%) ggc
 ipa lto cgraph merge    :  17.31 ( 3%) usr   0.07 ( 0%) sys  17.39 ( 3%) wall  142240 kB ( 4%) ggc
 whopr wpa               :   8.82 ( 1%) usr   0.04 ( 0%) sys   8.89 ( 1%) wall    7165 kB ( 0%) ggc
 whopr wpa I/O           :   1.63 ( 0%) usr   9.43 (27%) sys  11.19 ( 2%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   3.21 ( 1%) usr   0.04 ( 0%) sys   3.25 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.56 ( 1%) usr   0.04 ( 0%) sys   5.81 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   1.83 ( 0%) usr   0.02 ( 0%) sys   1.86 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   6.07 ( 1%) usr   0.18 ( 1%) sys   6.26 ( 1%) wall       0 kB ( 0%) ggc
 inline parameters       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      14 kB ( 0%) ggc
 tree copy propagation   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      27 kB ( 0%) ggc
 tree SSA other          :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree CCP                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.14 ( 0%) usr   0.12 ( 0%) sys   0.24 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :  10.69 ( 2%) usr   0.29 ( 1%) sys  11.10 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                   : 629.93            34.31           678.67            3758029 kB

Memory usage seems about the same with -g.

Honza
type merging stats

[WPA] read 43156894 SCCs of average size 2.270660
[WPA] 97994652 tree bodies read in total
[WPA] tree SCC table: size 8388593, 3830511 elements, collision ratio: 0.684487
[WPA] tree SCC max chain length 88 (size 1)
[WPA] Compared 19139975 SCCs, 344923 collisions (0.018021)
[WPA] Merged 19067050 SCCs
[WPA] Merged 58757829 tree bodies
[WPA] Merged 11951381 types
[WPA] 4357267 types prevailed (13278034 associated trees)
[WPA] Old merging code merges an additional 2026163 types of which 140937 are in the same SCC with their prevailing variant (12389865 and 6362266 associated trees)
[WPA] GIMPLE canonical type table: size 131071, 77910 elements, 4357402 searches, 1095104 collisions (ratio: 0.251320)
[WPA] GIMPLE canonical type hash table: size 8388593, 4357346 elements, 15252531 searches, 11817317 collisions (ratio: 0.774777)
[WPA] # of input files: 4918
[WPA] # of input cgraph nodes: 0
[WPA] # of function bodies: 0
[WPA] # of output files: 0
[WPA] # of output symtab nodes: 0
[WPA] # of output tree pickle references: 0
[WPA] # of output tree bodies: 0
[WPA] # callgraph partitions: 0
[WPA] Compression: 1311851796 input bytes, 4153897270 uncompressed bytes (ratio: 3.166438)
[WPA] Size of mmap'd section decls: 1311851796 bytes
[LTRANS] read 314277 SCCs of average size 6.082532
[LTRANS] 1911600 tree bodies read in total
[LTRANS] GIMPLE canonical type table: size 16381, 9653 elements, 453967 searches, 24697 collisions (ratio: 0.054403)
[LTRANS] GIMPLE canonical type hash table: size 1048573, 453913 elements, 1562009 searches, 1517260 collisions (ratio: 0.971352)
[LTRANS] # of input files: 1
[LTRANS] # of input cgraph nodes: 0
[LTRANS] # of function bodies: 0
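The derived figures in the statistics above can be recomputed from the raw counts, which also makes it easy to see how much of the streamed data is thrown away again by merging. The struct and function names below are hypothetical; only the numbers fed into them come from the dump.

```c
/* Raw counters as they appear in the [WPA] lines of the dump. */
struct merge_stats {
    unsigned long long sccs_read, tree_bodies_read;
    unsigned long long sccs_merged, tree_bodies_merged;
    unsigned long long input_bytes, uncompressed_bytes;
};

static double ratio(unsigned long long num, unsigned long long den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Average number of tree bodies per SCC read. */
double avg_scc_size(const struct merge_stats *s)
{
    return ratio(s->tree_bodies_read, s->sccs_read);
}

/* Fraction of tree bodies that were read and then merged away. */
double merged_body_fraction(const struct merge_stats *s)
{
    return ratio(s->tree_bodies_merged, s->tree_bodies_read);
}

/* Uncompressed size divided by on-disk size of the decl sections. */
double compression_ratio(const struct merge_stats *s)
{
    return ratio(s->uncompressed_bytes, s->input_bytes);
}
```

Plugging in the WPA numbers gives an average SCC size of about 2.27 and a compression ratio of about 3.17, matching the dump, and shows that roughly 60% of the tree bodies read (58757829 of 97994652) were merged into an already existing copy.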
New profiles after Richard's changes to remove pointer maps from streaming in.

Stream in:
samples  %        app name        symbol name
36599    12.3464  lto1            inflate_fast
27382     9.2371  lto1            streamer_read_uhwi(lto_input_block*)
19282     6.5047  lto1            streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*)
15807     5.3324  lto1            compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
11385     3.8407  libc-2.11.1.so  msort_with_tmp
9054      3.0543  libc-2.11.1.so  memcpy
8701      2.9352  lto1            htab_find_slot_with_hash
8506      2.8694  lto1            lto_input_tree(lto_input_block*, data_in*)
8405      2.8354  lto1            lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int)
8055      2.7173  lto1            ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option)
6436      2.1711  lto1            streamer_read_tree_body(lto_input_block*, data_in*, tree_node*)
6287      2.1209  lto1            adler32
5891      1.9873  lto1            streamer_get_pickled_tree(lto_input_block*, data_in*)

Stream out:
samples  %        app name        symbol name
19885    14.6837  lto1            DFS_write_tree(output_block*, sccs*, tree_node*, bool, bool)
19285    14.2407  lto1            linemap_lookup(line_maps*, unsigned int)
16192    11.9567  lto1            streamer_write_uhwi_stream(lto_output_stream*, unsigned long)
15926    11.7603  lto1            pointer_map_insert(pointer_map_t*, void const*)
10285     7.5948  lto1            pointer_map_contains(pointer_map_t const*, void const*)
7324      5.4083  lto1            streamer_tree_cache_lookup(streamer_tree_cache_d*, tree_node*, unsigned int*)
5897      4.3545  lto1            streamer_pack_tree_bitfields(output_block*, bitpack_d*, tree_node*)
5374      3.9683  lto1            lto_output_tree(output_block*, tree_node*, bool, bool)
4896      3.6154  lto1            streamer_tree_cache_insert_1(streamer_tree_cache_d*, tree_node*, unsigned int, unsigned int*, bool)
3285      2.4258  libc-2.11.1.so  memset
2669      1.9709  lto1            streamer_write_tree_body(output_block*, tree_node*, bool)
2520      1.8608  libc-2.11.1.so  memcpy
2383      1.7597  lto1            streamer_tree_cache_add_to_node_array(streamer_tree_cache_d*, unsigned int, tree_node*, unsigned int)

linemap_lookup is an easy target, obviously.

Execution times (seconds)
 phase setup             :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1399 kB ( 0%) ggc
 phase opt and generate  :  69.29 (14%) usr   0.82 ( 3%) sys  70.62 (13%) wall  270269 kB (11%) ggc
 phase stream in         : 224.95 (44%) usr   6.23 (22%) sys 236.02 (43%) wall 2174294 kB (89%) ggc
 phase stream out        : 213.26 (42%) usr  21.35 (75%) sys 236.87 (44%) wall    7157 kB ( 0%) ggc
 garbage collection      :   9.92 ( 2%) usr   0.00 ( 0%) sys   9.99 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization  :   1.36 ( 0%) usr   0.00 ( 0%) sys   1.34 ( 0%) wall      32 kB ( 0%) ggc
 ipa cp                  :   7.65 ( 2%) usr   0.32 ( 1%) sys   8.01 ( 1%) wall  418436 kB (17%) ggc
 ipa inlining heuristics :  38.83 ( 8%) usr   0.83 ( 3%) sys  39.99 ( 7%) wall 1352530 kB (55%) ggc
 ipa lto gimple in       :   0.39 ( 0%) usr   0.05 ( 0%) sys   0.53 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :  16.46 ( 3%) usr   1.39 ( 5%) sys  17.93 ( 3%) wall       0 kB ( 0%) ggc
 ipa lto decl in         : 158.55 (31%) usr   3.99 (14%) sys 166.99 (31%) wall 2583106 kB (105%) ggc
 ipa lto decl out        : 191.10 (38%) usr  11.48 (40%) sys 203.47 (37%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   7.07 ( 1%) usr   1.17 ( 4%) sys   8.27 ( 2%) wall 2134131 kB (87%) ggc
 ipa lto decl merge      :  29.94 ( 6%) usr   0.01 ( 0%) sys  30.06 ( 6%) wall    8270 kB ( 0%) ggc
 ipa lto cgraph merge    :  12.02 ( 2%) usr   0.04 ( 0%) sys  12.13 ( 2%) wall  142240 kB ( 6%) ggc
 whopr wpa               :   7.30 ( 1%) usr   0.03 ( 0%) sys   7.39 ( 1%) wall    7160 kB ( 0%) ggc
 whopr wpa I/O           :   1.40 ( 0%) usr   8.46 (30%) sys  11.14 ( 2%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   2.33 ( 0%) usr   0.01 ( 0%) sys   2.36 ( 0%) wall       0 kB ( 0%) ggc
 ipa reference           :   5.44 ( 1%) usr   0.04 ( 0%) sys   5.53 ( 1%) wall       0 kB ( 0%) ggc
 ipa profile             :   1.26 ( 0%) usr   0.04 ( 0%) sys   1.32 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const          :   5.87 ( 1%) usr   0.13 ( 0%) sys   6.03 ( 1%) wall       0 kB ( 0%) ggc
 inline parameters       :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall      14 kB ( 0%) ggc
 tree eh                 :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 tree PTA                :   0.05 ( 0%) usr   0.00 ( 0%) sys   0.05 ( 0%) wall       0 kB ( 0%) ggc
 tree SSA rewrite        :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall      27 kB ( 0%) ggc
 tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 tree FRE                :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 dominance computation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.10 ( 0%) usr   0.18 ( 1%) sys   0.19 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :  10.42 ( 2%) usr   0.23 ( 1%) sys  10.76 ( 2%) wall       0 kB ( 0%) ggc
 TOTAL                   : 507.52            28.40           543.51            2453120 kB
I merged in some patches intended to reduce the memory use of Firefox LTO and also updated the Firefox tree. Some more involved patches are on the way, so here is a summary of where we stand now. WPA usage in top is 10GB now.

1) After streaming in trees, the GGC usage is now 5.1GB:
 - 2.5GB are trees
 - 1GB are linemaps
 - 0.8GB are decl maps (decl states)

 tree_list      12561507
 integer_type    1511296
 pointer_type    4610735
 record_type     8139077
 method_type     2401664
 integer_cst     6677946
 string_cst      2127890
 function_decl   6069299
 label_decl       504859
 field_decl      5104957
 var_decl         596020
 const_decl      5401253
 parm_decl       9002744
 type_decl      10150100
 result_decl     2181250
 addr_expr       4173661
 tree_binfo      4780477

I have a cache that cuts down the linemaps, plus a patch to not stream PARM_DECLs and RESULT_DECLs. With this the usage goes below 3GB.

2) Cgraph streaming now becomes an important factor. GGC usage goes up to 7.7GB.

GGC use:
 - cgraph nodes themselves are 1.5GB
 - inline summaries are 0.5GB
 - cgraph edges are 3.7GB
 - IPA references 2.3GB
 - IPA-prop 0.7GB

Off GGC:
 - IPA-prop 0.6GB
 - Inline summary 0.5GB
 - symtab encoder 0.17GB

Here one can easily
 - compress the vectors recording definitions
 - pull off parts of cgraph nodes that are not really needed by WPA (nested info, etc.)
 - perhaps implement streaming of the merged cgraph.

So the good news is that we now have a lot of interesting low-hanging fruit. The bad news is that tree streaming still feels slow. I suppose we need to dig more into which trees really need to go into WPA...
oprofile of merging 67647 13.0501 lto1 inflate_fast 38682 7.4624 lto1 compare_tree_sccs_1(tree_node*, tree_node*, tree_node***) 32365 6.2437 lto1 streamer_read_uhwi(lto_input_block*) 31198 6.0186 lto1 streamer_read_tree_bitfields(lto_input_block*, data_in*, tree_node*) 21155 4.0811 libc-2.11.1.so msort_with_tmp 19581 3.7775 lto1 ht_lookup_with_hash(ht*, unsigned char const*, unsigned long, unsigned int, ht_lookup_option) 16584 3.1993 lto1 lto_input_tree(lto_input_block*, data_in*) 15203 2.9329 lto1 lto_input_tree_1(lto_input_block*, data_in*, LTO_tags, unsigned int) 15194 2.9312 libc-2.11.1.so memcpy 14823 2.8596 lto1 htab_find_slot_with_hash 12860 2.4809 lto1 streamer_read_tree_body(lto_input_block*, data_in*, tree_node*) 12705 2.4510 lto1 hash_table<tree_scc_hasher, xcallocator>::find_slot_with_hash(tree_scc const*, unsigned int, insert_option) 11773 2.2712 lto1 adler32 11504 2.2193 libc-2.11.1.so _IO_vfscanf 11401 2.1994 lto1 unify_scc(streamer_tree_cache_d*, unsigned int, unsigned int, unsigned int, unsigned int) 9548 1.8420 lto1 streamer_get_pickled_tree(lto_input_block*, data_in*) 9315 1.7970 lto1 inflate IPA 18799 6.2862 lto1 symtab_remove_unreachable_nodes(bool, _IO_FILE*) 11878 3.9719 lto1 cgraph_redirect_edge_callee(cgraph_edge*, cgraph_node*) 11223 3.7528 lto1 do_per_function(void (*)(void*), void*) 10813 3.6157 lto1 pointer_set_lookup(pointer_set_t const*, void const*, unsigned long*) 8415 2.8139 lto1 ipa_reverse_postorder(cgraph_node**) 7689 2.5711 lto1 htab_find_slot_with_hash 7677 2.5671 lto1 do_estimate_growth_1(cgraph_node*, void*) 7477 2.5002 libc-2.11.1.so free 7035 2.3524 libc-2.11.1.so malloc_consolidate Stream out 9440 16.1663 lto1 linemap_lookup(line_maps*, unsigned int) 7663 13.1231 lto1 DFS_write_tree(output_block*, sccs*, tree_node*, bool, bool) 6052 10.3643 lto1 streamer_write_uhwi_stream(lto_output_stream*, unsigned long) 5831 9.9858 lto1 pointer_set_lookup(pointer_set_t const*, void const*, unsigned long*) 3342 5.7233 lto1 
streamer_tree_cache_lookup(streamer_tree_cache_d*, tree_node*, unsigned int*) 2229 3.8172 lto1 pointer_map_insert(pointer_map_t*, void const*) 2196 3.7607 lto1 streamer_pack_tree_bitfields(output_block*, bitpack_d*, tree_node*) 2054 3.5175 lto1 lto_output_tree(output_block*, tree_node*, bool, bool) 1656 2.8360 lto1 inflate_fast 1655 2.8342 lto1 pointer_map<unsigned int>::insert(void const*, bool*)
WPA time report Execution times (seconds) phase setup : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1398 kB ( 0%) ggc phase opt and generate : 80.79 (13%) usr 1.01 ( 3%) sys 81.96 (12%) wall 315727 kB (25%) ggc phase stream in : 283.33 (45%) usr 7.82 (24%) sys 292.12 (44%) wall 940315 kB (74%) ggc phase stream out : 261.66 (42%) usr 23.14 (72%) sys 287.88 (43%) wall 7534 kB ( 1%) ggc garbage collection : 14.45 ( 2%) usr 0.02 ( 0%) sys 14.48 ( 2%) wall 0 kB ( 0%) ggc callgraph optimization : 2.55 ( 0%) usr 0.00 ( 0%) sys 2.55 ( 0%) wall 33 kB ( 0%) ggc ipa cp : 10.45 ( 2%) usr 0.36 ( 1%) sys 10.81 ( 2%) wall 456287 kB (36%) ggc ipa inlining heuristics : 42.12 ( 7%) usr 1.06 ( 3%) sys 43.27 ( 7%) wall 1485346 kB (117%) ggc ipa lto gimple in : 0.56 ( 0%) usr 0.25 ( 1%) sys 0.87 ( 0%) wall 0 kB ( 0%) ggc ipa lto gimple out : 21.77 ( 3%) usr 1.72 ( 5%) sys 23.53 ( 4%) wall 0 kB ( 0%) ggc ipa lto decl in : 183.90 (29%) usr 4.77 (15%) sys 189.46 (29%) wall 959299 kB (76%) ggc ipa lto decl out : 231.70 (37%) usr 10.78 (34%) sys 242.73 (37%) wall 0 kB ( 0%) ggc ipa lto cgraph I/O : 14.38 ( 2%) usr 1.57 ( 5%) sys 15.99 ( 2%) wall 2405760 kB (190%) ggc ipa lto decl merge : 32.16 ( 5%) usr 0.00 ( 0%) sys 32.24 ( 5%) wall 8268 kB ( 1%) ggc ipa lto cgraph merge : 28.72 ( 5%) usr 0.06 ( 0%) sys 28.81 ( 4%) wall 135235 kB (11%) ggc whopr wpa : 9.57 ( 2%) usr 0.05 ( 0%) sys 9.62 ( 1%) wall 7537 kB ( 1%) ggc whopr wpa I/O : 2.07 ( 0%) usr 10.62 (33%) sys 15.49 ( 2%) wall 0 kB ( 0%) ggc whopr partitioning : 3.26 ( 1%) usr 0.03 ( 0%) sys 3.29 ( 0%) wall 0 kB ( 0%) ggc ipa reference : 5.55 ( 1%) usr 0.05 ( 0%) sys 5.62 ( 1%) wall 0 kB ( 0%) ggc ipa profile : 2.82 ( 0%) usr 0.05 ( 0%) sys 2.88 ( 0%) wall 0 kB ( 0%) ggc ipa pure const : 6.25 ( 1%) usr 0.13 ( 0%) sys 6.38 ( 1%) wall 0 kB ( 0%) ggc unaccounted todo : 13.25 ( 2%) usr 0.28 ( 1%) sys 13.58 ( 2%) wall 0 kB ( 0%) ggc TOTAL : 625.79 31.97 661.97 1264976 kB
With the patch to remove unreachable virtual methods early (http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00774.html), the memory usage of Firefox WPA goes down to 3.4GB (from 10GB). Most of the time is still spent on streaming:

phase opt and generate : 48.52 (15%) usr 0.54 ( 3%) sys 49.20 (14%) wall 391219 kB ( 6%) ggc
phase stream in : 87.84 (26%) usr 2.03 (10%) sys 90.15 (25%) wall 5968649 kB (94%) ggc
phase stream out : 197.98 (59%) usr 18.61 (88%) sys 217.58 (61%) wall 7585 kB ( 0%) ggc
garbage collection : 3.10 ( 1%) usr 0.00 ( 0%) sys 3.11 ( 1%) wall 0 kB ( 0%) ggc
ipa unreachable code removal: 5.25 ( 2%) usr 0.12 ( 1%) sys 5.43 ( 2%) wall 0 kB ( 0%) ggc
ipa inheritance graph construction: 0.26 ( 0%) usr 0.00 ( 0%) sys 0.26 ( 0%) wall 1059 kB ( 0%) ggc
ipa virtual call target lookup: 13.76 ( 4%) usr 0.08 ( 0%) sys 13.80 ( 4%) wall 98807 kB ( 2%) ggc
ipa cp : 2.79 ( 1%) usr 0.14 ( 1%) sys 2.95 ( 1%) wall 188635 kB ( 3%) ggc
ipa inlining heuristics : 18.85 ( 6%) usr 0.24 ( 1%) sys 19.16 ( 5%) wall 439913 kB ( 7%) ggc
ipa lto gimple out : 18.80 ( 6%) usr 1.52 ( 7%) sys 20.39 ( 6%) wall 0 kB ( 0%) ggc
ipa lto decl in : 73.72 (22%) usr 1.51 ( 7%) sys 75.49 (21%) wall 5180378 kB (81%) ggc
ipa lto decl out : 173.97 (52%) usr 7.61 (36%) sys 181.91 (51%) wall 0 kB ( 0%) ggc
ipa lto cgraph I/O : 1.73 ( 1%) usr 0.18 ( 1%) sys 1.91 ( 1%) wall 428921 kB ( 7%) ggc
TOTAL : 334.36 21.18 356.94 6368853 kB

Streaming in is rather slow because about 80% of the trees streamed are duplicates. WPA still streams out 4GB of object files, which seems to be the main bottleneck. I have some experiments here.

Most common tree nodes:

tree_list 5707422
integer_type 1064175
pointer_type 2195993
record_type 4539776
integer_cst 4399813
function_decl 1127978
field_decl 3475888
const_decl 3462163
type_decl 5970713
addr_expr 1275696
tree_binfo 2903028

GGC memory is 1.6GB after tree streaming, 2.1GB after IPA streaming.
Vecs: ipa-devirt.c:406 (get_odr_type) 172200: 0.2% 336624 3465: 0.0% ipa-devirt.c:407 (get_odr_type) 330376: 0.3% 655112 8267: 0.0% ipa-devirt.c:835 (ipa_devirt_init) 1386952: 1.5% 2419240 15701: 0.1% ipa-devirt.c:524 (maybe_record_node) 1678248: 1.8% 3094376 21842: 0.1% ipa-reference.c:168 (set_reference_optimization_ 5457952: 5.8% 8672960 11: 0.0% vec.h:1460 (copy) 6814272: 7.2% 34335740 604256: 2.4% ipa-inline-analysis.c:3754 (read_inline_edge_sum 7254040: 7.7% 16934500 849179: 3.4% ipa-ref.c:54 (ipa_record_reference) 11668584:12.3% 34881384 494857: 2.0% passes.c:2208 (execute_one_pass) 24435584:25.8% 41942968 651148: 2.6% ipa-inline-analysis.c:944 (inline_summary_alloc) 35603464:37.6% 58351856 200862: 0.8% Total 94804952 24781481 GGC: cgraph.c:912 (cgraph_allocate_init_indirect_info 0: 0.0% 1487184: 0.0% 7575408: 0.3% 0: 0.0% 188804 tree.c:1263 (build_int_cst_wide) 235456: 0.1% 0: 0.0% 8371392: 0.4% 0: 0.0% 268964 ipa-prop.c:2836 (ipa_set_node_agg_value_chain) 0: 0.0% 0: 0.0% 8388608: 0.4% 0: 0.0% 1 ipa-inline-analysis.c:716 (account_size_time) 0: 0.0% 2140820: 0.1% 9143868: 0.4% 240712: 0.3% 28736 ipa-inline-analysis.c:3820 (inline_read_section) 0: 0.0% 12942208: 0.3% 17905336: 0.8% 1287480: 1.4% 228397 ggc-common.c:244 (ggc_cleared_alloc_ptr_array_tw 61536: 0.0% 211278568: 5.0% 26406128: 1.2% 190280: 0.2% 9549 stringpool.c:74 (alloc_node) 0: 0.0% 0: 0.0% 28859960: 1.3% 0: 0.0% 721499 ipa-ref.c:50 (ipa_record_reference) 0: 0.0% 96510048: 2.3% 36203704: 1.6% 1329136: 1.4% 577500 lto-section-in.c:363 (lto_new_in_decl_state) 343800: 0.1% 0: 0.0% 38477520: 1.7% 0: 0.0% 323511 stringpool.c:57 (stringpool_ggc_alloc) 0: 0.0% 0: 0.0% 44558843: 2.0% 2783411: 3.0% 721499 tree-streamer-in.c:482 (unpack_value_fields) 15732776: 5.4% 0: 0.0% 45589448: 2.1% 292720: 0.3% 157392 tree-streamer-in.c:562 (streamer_alloc_tree) 300256: 0.1% 241332496: 5.7% 48823360: 2.2% 13216: 0.0% 2903028 lto/lto.c:2711 (create_subid_section_table) 1939520: 0.7% 0: 0.0% 49182144: 2.2% 
10096128:10.9% 5008 ipa-inline-analysis.c:3832 (inline_read_section) 0: 0.0% 51338084: 1.2% 55137448: 2.5% 1313100: 1.4% 416983 ipa-inline-analysis.c:942 (inline_summary_alloc) 0: 0.0% 0: 0.0% 67108920: 3.0% 56: 0.0% 1 toplev.c:960 (realloc_for_line_map) 0: 0.0% 22493304: 0.5% 67239960: 3.0% 144: 0.0% 14 vec.h:792 (vec_safe_copy) 1227184: 0.4% 117348220: 2.8% 94975608: 4.3% 5710316: 6.2% 933481 cgraph.c:840 (cgraph_create_edge_1) 0: 0.0% 0: 0.0% 115711128: 5.2% 0: 0.0% 1112607 lto/lto.c:240 (lto_read_in_decl_state) 1103456: 0.4% 0: 0.0% 162786024: 7.4% 30006496:32.4% 2264577 cgraph.c:499 (cgraph_allocate_node) 0: 0.0% 0: 0.0% 207403696: 9.4% 0: 0.0% 682249 tree-streamer-in.c:573 (streamer_alloc_tree) 75348944:25.9% 3354555968:78.9% 1038361448:47.0% 36020064:38.9% 35699328 Total 290565511 4250101446 2211224488 92491589 56817662 source location Garbage Freed Leak Overhead Times
I've encountered problems connected with PGO:

gcc revision: 201894
firefox changeset: 143205:1d6bf2bd4003 (Aug 20, 2013)

I build an instrumented binary without LTO and then use the resulting profile for the LTO build:

MYFLAGS="-flto=9 -fno-fat-lto-objects -ftoplevel-reorder -fprofile-use -Wno-error=coverage-mismatch"

I removed the gcda files that were mentioned earlier in this thread:

jemalloc.gcda (makes sense)
ptsynch.gcda (likewise)
HashFunctions.gcda (?)
sqlite3.gcda (?)

After linking of sqlite3, there are many corrupted profiles, for example:

/ssd/firefox/js/src/gc/Marking.cpp
/ssd/firefox/js/src/frontend/BytecodeEmitter.cpp
/ssd/firefox/js/src/frontend/Interpreter.cpp
...

Example of an error:

/ssd/firefox/js/src/gc/Marking.cpp: In function ‘js::gc::IsAboutToBeFinalized<JSAtom>(JSAtom**)bool [clone .isra.65]’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
 }
 ^
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-6 thought to be -81
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 39667
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘js::gc::IsAboutToBeFinalized<js::UnownedBaseShape>(js::UnownedBaseShape**)bool [clone .isra.52]’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-6 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be 41156
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘MarkInternal<JSAtom>(JSTracer*, JSAtom**)void’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 9-14 thought to be -39
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 9-10 thought to be 180119
/ssd/firefox/js/src/gc/Marking.cpp: In function ‘MarkInternal<JSObject>(JSTracer*, JSObject**)void’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 11-18 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 11-12 thought to be 49007
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::MarkStack<unsigned long>::push(unsigned long)’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 4-6 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 4-5 thought to be 1
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::GCMarker::drainMarkStack(js::SliceBudget&)’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be -7
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-1 thought to be 7
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::ObjectImpl::slotSpan() const’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 5-7 thought to be -1
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 5-6 thought to be 15965

Thank you,
Martin
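To see how widespread the corruption is, the diagnostics above can be tallied per function. The following is a minimal sketch, assuming the build log keeps exactly the wording shown in this comment; `summarize_negative_edges` and the embedded `SAMPLE` are illustrative names, not part of any GCC tooling.

```python
import re
from collections import defaultdict

# Matches the "In function ..." / "In member function ..." header lines.
HEADER_RE = re.compile(r"In (?:member )?function [‘'](.+)[’']:")
# Matches the per-edge diagnostics and captures the edge and its count.
EDGE_RE = re.compile(
    r"number of executions for edge (\d+-\d+) thought to be (-?\d+)"
)


def summarize_negative_edges(log_text):
    """Return {function: [(edge, count), ...]} for edges with negative counts."""
    result = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        header = HEADER_RE.search(line)
        if header:
            current = header.group(1)
            continue
        edge = EDGE_RE.search(line)
        if edge and current is not None and int(edge.group(2)) < 0:
            result[current].append((edge.group(1), int(edge.group(2))))
    return dict(result)


# Excerpt copied from the error output in this comment.
SAMPLE = """\
/ssd/firefox/js/src/gc/Marking.cpp: In member function ‘js::GCMarker::drainMarkStack(js::SliceBudget&)’:
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: profile data is not flow-consistent
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-4 thought to be -7
/ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info: number of executions for edge 3-1 thought to be 7
"""

if __name__ == "__main__":
    for function, edges in summarize_negative_edges(SAMPLE).items():
        print(function, edges)
```

Feeding the whole link log to the function instead of `SAMPLE` would give one entry per affected function, which makes it easier to tell a handful of racy counters from a systematically broken profile.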
> /ssd/firefox/js/src/gc/Marking.cpp: In function
> ‘js::gc::IsAboutToBeFinalized<JSAtom>(JSAtom**)bool [clone .isra.65]’:
> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> profile data is not flow-consistent
> }
> ^
> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> number of executions for edge 3-6 thought to be -81

This actually looks like corruption from concurrent updates (profiling is not thread safe). Do you get many more of these? I can imagine that the garbage collector runs in parallel and often.

> /ssd/firefox/js/src/gc/Marking.cpp:1713:1: error: corrupted profile info:
> number of executions for edge 3-4 thought to be 39667

Perhaps we should fix dumping to dump the full 64bit value.. :)

Honza
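A small worked example may help show why lost counter updates surface as negative edge counts. It is a sketch under stated assumptions: that only some edges carry counters and the rest are derived by flow conservation, and that block 3 has exactly two outgoing edges. The block count of 39586 is hypothetical, chosen only so that the result matches the -81 in the quoted diagnostic, and `derive_edge_count` is an illustrative helper, not a GCC function.

```python
def derive_edge_count(block_count, measured_out_edges):
    """Derive the one uninstrumented outgoing edge from flow conservation.

    Flow conservation says the executions of a basic block equal the sum of
    the executions of its outgoing edges, so the missing edge is whatever is
    left after subtracting the measured ones.
    """
    return block_count - sum(measured_out_edges)


# Consistent profile: the block ran 100 times, 60 of them took the measured edge.
assert derive_edge_count(100, [60]) == 40

# Racy profile: concurrent threads lost some increments on the way into the
# block, so it appears to have run fewer times than its measured outgoing edge.
block_3_count = 39586  # hypothetical, see the note above
edge_3_4 = 39667       # measured count from the quoted diagnostic
print(derive_edge_count(block_3_count, [edge_3_4]))  # prints -81
```

A derived count can never legitimately be negative, which is why the compiler reports the profile as not flow-consistent instead of using it.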
First of all many thanks for your work on reducing memory usage. Peak memory usage is now lower (~3GB) than clang's (~4GB).

However, with -enable-optimize=-O3 on rev202079 I get the errors below. (A default (-Os) build on rev202053 went fine this morning.)

/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans0.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans1.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN16nsInputStreamTee15WriteSegmentFunEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: /tmp/ccd3grW1.ltrans24.ltrans.o: requires dynamic R_X86_64_PC32 reloc against '_ZN16nsInputStreamTee15WriteSegmentFunEP14nsIInputStreamPvPKcjjPj' which may overflow at runtime; recompile with -fPIC
/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: read-only segment has dynamic relocations
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpTransaction::ReadSegments(nsAHttpSegmentReader*, unsigned int, unsigned int*): error: undefined reference to 'nsHttpTransaction::ReadRequestSegment(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpConnection::OnSocketWritable(): error: undefined reference to 'nsHttpTransaction::ReadRequestSegment(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans0.ltrans.o:ccd3grW1.ltrans0.o:function nsHttpPipeline::ReadSegments(nsAHttpSegmentReader*, unsigned int, unsigned int*): error: undefined reference to 'nsHttpPipeline::ReadFromPipe(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans1.ltrans.o:ccd3grW1.ltrans1.o:function imgRequest::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned long, unsigned int): error: undefined reference to 'nsInputStreamTee::WriteSegmentFun(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'
/tmp/ccd3grW1.ltrans24.ltrans.o:ccd3grW1.ltrans24.o:function nsInputStreamTee::ReadSegments(tag_nsresult (*)(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*): error: undefined reference to 'nsInputStreamTee::WriteSegmentFun(nsIInputStream*, void*, char const*, unsigned int, unsigned int, unsigned int*)'

Not sure if -O3 or rev202079 is to blame.
It turned out that -enable-optimize=-O3 is the cause. Rev202079 with -Os links fine.
I am building firefox with -O3 and get no undefined symbols. Can you please relink with -Wl,--no-demangle --save-temps -fdump-ipa-all and look up the missing symbol in the -lm.res file? If it is not UNDEF there, please make the dumps available somewhere. If it is undefined there, it may be a firefox bug..
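The lookup in the resolution file can be scripted. This is a rough sketch, assuming each symbol entry in the file is a whitespace-separated line of the form "slot id RESOLUTION mangled-name" and that any other lines (object count, per-object headers) have fewer fields. That layout is an assumption to check against the real file, and the sample below is hypothetical apart from the mangled name, which comes from the link errors reported in this thread.

```python
def find_resolutions(resolution_text, mangled_name):
    """Return the resolution strings recorded for one mangled symbol name."""
    found = []
    for line in resolution_text.splitlines():
        fields = line.split()
        # Assumed entry layout: slot, id, resolution, mangled name.
        if len(fields) == 4 and fields[3] == mangled_name:
            found.append(fields[2])
    return found


# Hypothetical excerpt: object names, slots and ids are made up for illustration.
SAMPLE = """\
2
nsHttpTransaction.o 1
190 3f2a9c01 PREVAILING_DEF_IRONLY _ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj
nsHttpConnection.o 1
75 8b41d2e7 RESOLVED_IR _ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj
"""

if __name__ == "__main__":
    symbol = "_ZN17nsHttpTransaction18ReadRequestSegmentEP14nsIInputStreamPvPKcjjPj"
    print(find_resolutions(SAMPLE, symbol))
```

An empty result means the symbol does not appear in the file at all, which is a different situation from it being listed as UNDEF and is worth reporting separately.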
(In reply to Jan Hubicka from comment #193) > I am building firefox with -O3 and get no undefined symbols. Can you, > please, relink with -Wl,--no-demangle --save-temps -fdump-ipa-all and try to > look up the missing symbol in -lm.res file and if it not UNDEF there make > somewhere available the dumps? > If it is undefined there, it may be firefox bug.. Hmm, it's strange, because there are five undefined references; one of them does not appear in lm.res at all and the other four are all PREVAILING_DEF_IRONLY. (The whole dump is huge. Please tell me which part you need and I will try to upload it somewhere.)
Today there were two fixes for bugs that produce undefined symbols like the one you see. Does the problem still exist on current mainline? Are you using profile feedback?
(In reply to Jan Hubicka from comment #195) > Today there was two fixes for bugs that produce undefined symbols like one > you see. > Does the problem still exist on current mainline? Are you using profile > feedback? The problem is gone on current mainline. (And yes I'm using profile feedback.)
Created attachment 31876 [details] mozilla-central patch
Created attachment 31877 [details] My local PGO/LTO script
Created attachment 31878 [details] .mozconfig_profile_gen
I currently cannot build Firefox with LTO due to PR 60449 (yeah, I know, using gcc configured with checking makes life hard, sometimes unnecessarily). I get errors like /home/mjambor/mozilla/mzc2/media/libvpx/vp8/encoder/onyx_if.c:4884:5: error: control flow in the middle of basic block 7
With current gcc trunk and mozilla-central trunk Firefox crashes on startup when built with -flto (--enable-optimize=-O3):

0x00007ffff5ce5d8f in nsCOMPtr_base::assign_with_AddRef(nsISupports*) [clone .constprop.13162] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
(gdb) bt
#0 0x00007ffff5ce5d8f in nsCOMPtr_base::assign_with_AddRef(nsISupports*) [clone .constprop.13162] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#1 0x00007ffff3fe60eb in nsSocketTransport::OnSocketDetached(PRFileDesc*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#2 0x00007ffff3eb74ac in nsSocketTransportService::DetachSocket(nsSocketTransportService::SocketContext*, nsSocketTransportService::SocketContext*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#3 0x00007ffff3fff28f in nsSocketTransportService::Run() () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#4 0x00007ffff4059c6a in nsThread::ProcessNextEvent(bool, bool*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#5 0x00007ffff5ce5b39 in NS_ProcessNextEvent(nsIThread*, bool) [clone .constprop.13167] () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#6 0x00007ffff45af7a0 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#7 0x00007ffff3ec649d in MessageLoop::Run() () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#8 0x00007ffff3fe7a56 in nsThread::ThreadFunc(void*) () from /var/tmp/moz-build-dir/dist/bin/libxul.so
#9 0x00007ffff7e7757c in _pt_root () from /var/tmp/moz-build-dir/dist/bin/libnspr4.so
#10 0x00007ffff7bc41e2 in start_thread () from /lib/libpthread.so.0
#11 0x00007ffff74932ad in clone () from /lib/libc.so.6

When I build with PGO/LTO Firefox crashes later (when I close a tab with e.g.: https://github.com/JuliaLang/julia/pull/6018 ):

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff51645ed in PL_DHashTableEnumerate(PLDHashTable*, PLDHashOperator (*)(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*), void*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so (gdb) bt #0 0x00007ffff51645ed in PL_DHashTableEnumerate(PLDHashTable*, PLDHashOperator (*)(PLDHashTable*, PLDHashEntryHdr*, unsigned int, void*), void*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #1 0x00007ffff5754d32 in PresShell::Destroy() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #2 0x00007ffff5754831 in nsDocumentViewer::DestroyPresShell() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #3 0x00007ffff55ee5c4 in nsDocumentViewer::Hide() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #4 0x00007ffff57b72eb in nsDocShell::SetVisibility(bool) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #5 0x00007ffff5a589a4 in nsFrameLoader::Hide() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #6 0x00007ffff5a588f6 in nsHideViewer::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #7 0x00007ffff53b97de in nsContentUtils::RemoveScriptBlocker() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #8 0x00007ffff53cc954 in nsDocument::EndUpdate(unsigned int) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #9 0x00007ffff5651dd6 in mozilla::dom::XULDocument::EndUpdate(unsigned int) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #10 0x00007ffff549673b in nsINode::doRemoveChildAt(unsigned int, bool, nsIContent*, nsAttrAndChildArray&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #11 0x00007ffff5496085 in nsXULElement::RemoveChildAt(unsigned int, bool) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #12 0x00007ffff5494df9 in nsINode::RemoveChild(nsINode&, mozilla::ErrorResult&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #13 
0x00007ffff5494a00 in mozilla::dom::NodeBinding::removeChild(JSContext*, JS::Handle<JSObject*>, nsINode*, JSJitMethodCallArgs const&) [clone .lto_priv.13709] () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #14 0x00007ffff53b01e7 in mozilla::dom::GenericBindingMethod(JSContext*, unsigned int, JS::Value*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #15 0x00007ffff5262744 in js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #16 0x00007ffff524a14c in Interpret(JSContext*, js::RunState&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #17 0x00007ffff5249801 in js::RunScript(JSContext*, js::RunState&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #18 0x00007ffff52627ec in js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #19 0x00007ffff52a574c in js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #20 0x00007ffff55c553d in nsJSEventListener::HandleEvent(nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #21 0x00007ffff5869106 in nsXBLPrototypeHandler::ExecuteHandler(mozilla::dom::EventTarget*, nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #22 0x00007ffff5868554 in nsXBLEventHandler::HandleEvent(nsIDOMEvent*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #23 0x00007ffff5402b6c in nsEventListenerManager::HandleEventInternal(nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent**, mozilla::dom::EventTarget*, nsEventStatus*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #24 0x00007ffff53c38b2 in nsEventTargetChainItem::HandleEventTargetChain(nsTArray<nsEventTargetChainItem>&, nsEventChainPostVisitor&, nsDispatchingCallback*, 
ELMCreationDetector&) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #25 0x00007ffff53c1fe7 in nsEventDispatcher::Dispatch(nsISupports*, nsPresContext*, mozilla::WidgetEvent*, nsIDOMEvent*, nsEventStatus*, nsDispatchingCallback*, nsCOMArray<mozilla::dom::EventTarget>*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #26 0x00007ffff5a686c5 in nsTransitionManager::FlushTransitions(mozilla::css::CommonAnimationManager::FlushFlags) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #27 0x00007ffff563309f in nsRefreshDriver::Tick(long, mozilla::TimeStamp) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #28 0x00007ffff56325ac in mozilla::RefreshDriverTimer::TimerTick(nsITimer*, void*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #29 0x00007ffff54a32f7 in nsTimerEvent::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #30 0x00007ffff5166651 in nsThread::ProcessNextEvent(bool, bool*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #31 0x00007ffff5627914 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #32 0x00007ffff5146183 in MessageLoop::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #33 0x00007ffff562770a in nsBaseAppShell::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #34 0x00007ffff56276be in nsAppStartup::Run() () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #35 0x00007ffff5136f58 in XRE_main () from /var/tmp/firefox-destdir/usr/lib/firefox-30.0a1/libxul.so #36 0x000000000040aa58 in do_main(int, char**, nsIFile*) [clone .lto_priv.18] () #37 0x000000000040a285 in main () A "vanilla" build without PGO or LTO runs fine.
LTO miscompiles 435.gromacs in SPEC CPU 2006 on x32 with -mx32 -O3 -funroll-loops -ffast-math since r208165 (PR 60418). Can you try r208163?
(In reply to H.J. Lu from comment #202) > LTO miscompiles 435.gromacs in SPEC CPU 2006 on x32 with > > -mx32 -O3 -funroll-loops -ffast-math > > since r208165 (PR 60418). Can you try r208163? Yes. Unfortunately with r208163 Firefox still crashes on startup.
Here is a comparison of libxul sizes (in bytes, unstripped) for different compiler options:

gcc (trunk):
-O3              90213016
-O3 -flto        79682648
-O3 -flto / PGO  77250512
-Os              70431584
-Os -flto        62474008

clang (trunk):
-O3              80574784
-O3 -flto        79394992
-Os              72452776
-Os -flto        65111640
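To put these numbers in relative terms, the saving from enabling LTO at each optimization level can be computed directly. This is just arithmetic on the sizes listed above; `percent_smaller` is an illustrative helper name.

```python
# libxul sizes in bytes (unstripped), copied from the comparison above.
SIZES = {
    "gcc": {
        "-O3": 90213016,
        "-O3 -flto": 79682648,
        "-O3 -flto / PGO": 77250512,
        "-Os": 70431584,
        "-Os -flto": 62474008,
    },
    "clang": {
        "-O3": 80574784,
        "-O3 -flto": 79394992,
        "-Os": 72452776,
        "-Os -flto": 65111640,
    },
}


def percent_smaller(baseline, candidate):
    """Return how much smaller candidate is than baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


if __name__ == "__main__":
    for compiler, sizes in SIZES.items():
        for level in ("-O3", "-Os"):
            saving = percent_smaller(sizes[level], sizes[level + " -flto"])
            print(f"{compiler} {level}: LTO build is {saving:.1f}% smaller")
```

For these figures the script reports a saving of a bit over 11% for gcc at both -O3 and -Os, while clang saves only about 1.5% at -O3 and about 10% at -Os.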
I was looking into this recently, too. Curiously enough, for me clang+LTO was winning, but comparing the symbols it seemed that the configure scripts picked a bit more features on the GCC side. I looked briefly at the differences: we can optimize out more vtables, for which I have a patch pending for the next stage1, and optimize out write-only global vars. Still, the differences may be worth further investigation - clang seems to produce noticeably fewer external relocations, too. This seems like an ABI bug on the clang side, though. What I use for my firefox builds is --param inline-unit-growth=5. Our -O3 seems a bit of overkill for an application of the size of Firefox... Honza
Firefox (and chromium) memory reports with -flto=9 and -O2; archive contains also memory usage graph: https://docs.google.com/file/d/0B0pisUJ80pO1bnV5V0RtWXJkaVU/edit
Created attachment 32525 [details] Memory usage graphs for -flto=9, -flto=4, -flto=1 with -O2
Both issues from Comment 201 were fixed by: http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00338.html
(In reply to Markus Trippelsdorf from comment #208) > Both issues from Comment 201 were fixed by: > http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00338.html No, only the first issue is fixed. The second one (LTO/PGO build) still happens unfortunately.
Latest firefox 29.0.1 does not compile with LTO enabled (Gentoo/GCC 4.9.0). It fails in elfhack:

make[5]: Entering directory '/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack'
elfhack
/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/_virtualenv/bin/python /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/config/expandlibs_exec.py --depend .deps/elfhack.pp --target elfhack -- x86_64-pc-linux-gnu-g++ -o elfhack -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -std=gnu++0x -MD -MP -MF .deps/elfhack.pp -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags host_elf.o host_elfhack.o
x86_64-pc-linux-gnu-gcc -o dummy dummy.o -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin -Wl,-rpath-link,/usr/lib
x86_64-pc-linux-gnu-g++ -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Werror=int-to-pointer-cast -Wtype-limits -Wempty-body -Wsign-compare -Wno-invalid-offsetof -Wcast-align -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -fno-strict-aliasing -fno-rtti -fno-math-errno -std=gnu++0x -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -O2 -fomit-frame-pointer -fPIC -shared -Wl,-z,defs -Wl,-h,test-array.so -o test-array.so -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin
-Wl,-rpath-link,/usr/lib test-array.o -nostartfiles x86_64-pc-linux-gnu-g++ -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Werror=int-to-pointer-cast -Wtype-limits -Wempty-body -Wsign-compare -Wno-invalid-offsetof -Wcast-align -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -mno-avx -fno-strict-aliasing -fno-rtti -fno-math-errno -std=gnu++0x -pthread -pipe -fexceptions -DNDEBUG -DTRIMMED -O2 -fomit-frame-pointer -fPIC -shared -Wl,-z,defs -Wl,-h,test-ctors.so -o test-ctors.so -lpthread -Wl,-O1 -Wl,--as-needed -march=native -pipe -ggdb -flto=5 -fuse-linker-plugin -Wl,-znow -Wl,--sort-common -Wl,--hash-style=gnu -Wl,--enable-new-dtags -Wl,-z,noexecstack -Wl,-z,text -Wl,-rpath-link,/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/dist/bin -Wl,-rpath-link,/usr/lib test-ctors.o -nostartfiles === === If you get failures below, please file a bug describing the error === and your environment (compiler and linker versions), and use === --disable-elf-hack until this is fixed. === # Fail if the library doesn't have INIT .dynamic info readelf -d test-ctors.so | grep '(INIT)' 0x000000000000000c (INIT) 0x0 /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/elfhack -b -f test-ctors.so === === If you get failures below, please file a bug describing the error === and your environment (compiler and linker versions), and use === --disable-elf-hack until this is fixed. 
=== # Fail if the library doesn't have INIT_ARRAY .dynamic info test-ctors.so: Reduced by 12096 bytes readelf -d test-array.so | grep '(INIT_ARRAY)' # Fail if the backup file doesn't exist [ -f 'test-ctors.so.bak' ] 0x0000000000000019 (INIT_ARRAY) 0x9790 # Fail if the new library doesn't contain less relocations /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/elfhack -b -f test-array.so test-array.so: [ $(objdump -R test-ctors.so.bak | wc -l) -gt $(objdump -R test-ctors.so | wc -l) ] Reduced by 12088 bytes # Fail if the backup file doesn't exist [ -f 'test-array.so.bak' ] # Fail if the new library doesn't contain less relocations [ $(objdump -R test-array.so.bak | wc -l) -gt $(objdump -R test-array.so | wc -l) ] # Will either crash or return exit code 1 if elfhack is broken LD_PRELOAD=/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/test-array.so /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/dummy PASS LD_PRELOAD=/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/test-ctors.so /home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack/dummy FAIL Makefile:52: recipe for target 'libs' failed make[5]: *** [libs] Error 1 make[5]: Leaving directory '/home/misc/gentoo/tmp/portage/www-client/firefox-29.0.1/work/mozilla-release/obj-x86_64-pc-linux-gnu/build/unix/elfhack' Disabling LTO let firefox successfully compile.
Elfhack is rather sensitive to LTO, but it works for me, so this seems like a binutils issue or some elfhack change that happened recently. I wrote instructions for building firefox with LTO here: http://hubicka.blogspot.ca/2014/04/linktime-optimization-in-gcc-2-firefox.html

Here I am attaching the -ftime-report after the symtab hashtable was removed:

Execution times (seconds)
phase setup : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.02 ( 0%) wall 1536 kB ( 0%) ggc
phase opt and generate : 54.29 (58%) usr 1.28 (18%) sys 55.58 (50%) wall 720779 kB (18%) ggc
phase stream in : 33.54 (36%) usr 1.84 (26%) sys 35.39 (32%) wall 3389310 kB (82%) ggc
phase stream out : 6.00 ( 6%) usr 4.02 (56%) sys 19.99 (18%) wall 0 kB ( 0%) ggc
garbage collection : 1.86 ( 2%) usr 0.00 ( 0%) sys 1.86 ( 2%) wall 0 kB ( 0%) ggc
callgraph optimization : 0.23 ( 0%) usr 0.00 ( 0%) sys 0.24 ( 0%) wall 9 kB ( 0%) ggc
ipa dead code removal : 5.70 ( 6%) usr 0.18 ( 3%) sys 6.15 ( 6%) wall 92 kB ( 0%) ggc
ipa inheritance graph : 0.09 ( 0%) usr 0.00 ( 0%) sys 0.09 ( 0%) wall 883 kB ( 0%) ggc
ipa virtual call target : 5.58 ( 6%) usr 0.06 ( 1%) sys 5.32 ( 5%) wall 0 kB ( 0%) ggc
ipa devirtualization : 0.13 ( 0%) usr 0.00 ( 0%) sys 0.20 ( 0%) wall 9201 kB ( 0%) ggc
ipa cp : 2.34 ( 2%) usr 0.21 ( 3%) sys 2.55 ( 2%) wall 223628 kB ( 5%) ggc
ipa inlining heuristics : 26.97 (29%) usr 0.67 ( 9%) sys 27.66 (25%) wall 865791 kB (21%) ggc
ipa comdats : 0.21 ( 0%) usr 0.00 ( 0%) sys 0.21 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple in : 0.07 ( 0%) usr 0.11 ( 2%) sys 0.21 ( 0%) wall 0 kB ( 0%) ggc
ipa lto gimple out : 0.46 ( 0%) usr 0.19 ( 3%) sys 0.65 ( 1%) wall 0 kB ( 0%) ggc
ipa lto decl in : 24.76 (26%) usr 1.28 (18%) sys 26.08 (23%) wall 2571773 kB (63%) ggc
ipa lto decl out : 5.45 ( 6%) usr 0.28 ( 4%) sys 5.75 ( 5%) wall 0 kB ( 0%) ggc
ipa lto cgraph I/O : 1.13 ( 1%) usr 0.24 ( 3%) sys 1.38 ( 1%) wall 414551 kB (10%) ggc
ipa lto decl merge : 2.57 ( 3%) usr 0.01 ( 0%) sys 2.58 ( 2%) wall 8227 kB ( 0%) ggc
ipa lto cgraph merge : 1.72 ( 2%) usr 0.00 ( 0%) sys 1.72 ( 2%) wall 12166 kB ( 0%) ggc
whopr wpa : 1.04 ( 1%) usr 0.00 ( 0%) sys 1.04 ( 1%) wall 2 kB ( 0%) ggc
whopr wpa I/O : 0.03 ( 0%) usr 3.55 (50%) sys 13.51 (12%) wall 0 kB ( 0%) ggc
whopr partitioning : 4.97 ( 5%) usr 0.06 ( 1%) sys 5.02 ( 5%) wall 3738 kB ( 0%) ggc
ipa reference : 3.62 ( 4%) usr 0.12 ( 2%) sys 3.75 ( 3%) wall 0 kB ( 0%) ggc
ipa profile : 0.33 ( 0%) usr 0.01 ( 0%) sys 0.33 ( 0%) wall 0 kB ( 0%) ggc
ipa pure const : 3.86 ( 4%) usr 0.01 ( 0%) sys 3.88 ( 3%) wall 0 kB ( 0%) ggc
tree eh : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 0 kB ( 0%) ggc
tree CFG cleanup : 0.01 ( 0%) usr 0.00 ( 0%) sys 0.00 ( 0%) wall 0 kB ( 0%) ggc
varconst : 0.05 ( 0%) usr 0.16 ( 2%) sys 0.13 ( 0%) wall 0 kB ( 0%) ggc
unaccounted todo : 0.65 ( 1%) usr 0.00 ( 0%) sys 0.64 ( 1%) wall 0 kB ( 0%) ggc
TOTAL : 93.84 7.14 110.98 4111626 kB

There are some improvements in devirtualization performance, which used quite a few decl->symbol lookups (about 20%).
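Reports like this one are easier to compare across revisions once the rows are parsed and ranked. Below is a minimal sketch, assuming every timer row follows the "name : usr (..%) usr sys (..%) sys wall (..%) wall N kB (..%) ggc" layout seen in this thread; `parse_rows` and `top_by_wall` are illustrative names. The "phase" rows appear to be totals that overlap with the individual passes, so the ranking filters on a name prefix.

```python
import re

# One timer row of the report; the TOTAL row has no percentages and is skipped.
ROW_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*"
    r"(?P<usr>[\d.]+)\s*\(\s*\d+%\) usr\s+"
    r"(?P<sys>[\d.]+)\s*\(\s*\d+%\) sys\s+"
    r"(?P<wall>[\d.]+)\s*\(\s*\d+%\) wall\s+"
    r"(?P<ggc_kb>\d+) kB\s*\(\s*\d+%\) ggc\s*$"
)


def parse_rows(report_text):
    """Parse timer rows into dicts with name, usr, sys, wall and ggc_kb."""
    rows = []
    for line in report_text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append({
                "name": match.group("name"),
                "usr": float(match.group("usr")),
                "sys": float(match.group("sys")),
                "wall": float(match.group("wall")),
                "ggc_kb": int(match.group("ggc_kb")),
            })
    return rows


def top_by_wall(rows, count=3, prefix="ipa "):
    """Return the names of the slowest rows (by wall time) with the given prefix."""
    selected = [row for row in rows if row["name"].startswith(prefix)]
    selected.sort(key=lambda row: row["wall"], reverse=True)
    return [row["name"] for row in selected[:count]]


# Rows copied from the report above.
SAMPLE = """\
phase opt and generate : 54.29 (58%) usr 1.28 (18%) sys 55.58 (50%) wall 720779 kB (18%) ggc
ipa cp : 2.34 ( 2%) usr 0.21 ( 3%) sys 2.55 ( 2%) wall 223628 kB ( 5%) ggc
ipa inlining heuristics : 26.97 (29%) usr 0.67 ( 9%) sys 27.66 (25%) wall 865791 kB (21%) ggc
ipa lto decl in : 24.76 (26%) usr 1.28 (18%) sys 26.08 (23%) wall 2571773 kB (63%) ggc
TOTAL : 93.84 7.14 110.98 4111626 kB
"""

if __name__ == "__main__":
    print(top_by_wall(parse_rows(SAMPLE)))
```

On the sample rows the ranking puts inlining heuristics first and decl streaming-in second, which matches what the percentages in the report suggest.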
Hi Jan, I have binutils version 2.24 with the patch from Markus Trippelsdorf for early plugin loading, so I have no wrappers for ar, nm and ranlib. I've also symlinked liblto_plugin.so into the binutils bfd-plugins directory. I'll try to apply the 3 patches you mentioned in your blog post and see whether they help, but I think they are not relevant for the elfhack portion, which is what fails on my system. Which Firefox version did you successfully compile?
Hi Jan, just a short update: Firefox since version 30 and Thunderbird since version 31 both compile fine with LTO enabled, without the need for any additional patches. The package size was reduced by 51% (Firefox ~420MB -> ~207MB) and 59% (Thunderbird ~480MB -> ~200MB). Both programs work as intended, with no crashes or unexpected behaviour so far. Best regards, Steffen
I've just found an ICE for r217480 with LTO and -O2:

lto1: internal compiler error: in lto_output_node, at lto-cgraph.c:462
0x7ce411 lto_output_node ../../gcc/lto-cgraph.c:462
0x7ce411 output_symtab() ../../gcc/lto-cgraph.c:974
0x7db276 lto_output() ../../gcc/lto-streamer-out.c:2309
0x814671 write_lto ../../gcc/passes.c:2346
0x8177c1 ipa_write_optimization_summaries(lto_symtab_encoder_d*) ../../gcc/passes.c:2545
0x59512a do_stream_out ../../gcc/lto/lto.c:2475
0x59a41f stream_out ../../gcc/lto/lto.c:2538
0x59a41f lto_wpa_write_files ../../gcc/lto/lto.c:2655
0x59a41f do_whole_program_analysis ../../gcc/lto/lto.c:3323
0x59a41f lto_main() ../../gcc/lto/lto.c:3443

The failing assertion is:

  if (tag == LTO_symtab_analyzed_node)
    gcc_assert (clone_of || !node->clone_of);
    ~~~~^
  if (!clone_of)
    streamer_write_hwi_stream (ob->main_stream, LCC_NOT_FOUND);
  else
    streamer_write_hwi_stream (ob->main_stream, ref);

If needed, I will try to reduce the objects that are part of the WPA phase.

Martin
Author: hubicka
Date: Mon Jan 19 23:58:19 2015
New Revision: 219871
URL: https://gcc.gnu.org/viewcvs?rev=219871&root=gcc&view=rev
Log:
PR lto/45375
* i386.c (gate): Check flag_expensive_optimizations and optimize_size.
(ix86_option_override_internal): Drop optimize_size condition on MASK_ACCUMULATE_OUTGOING_ARGS, MASK_VZEROUPPER, MASK_AVX256_SPLIT_UNALIGNED_LOAD, MASK_AVX256_SPLIT_UNALIGNED_STORE, MASK_PREFER_AVX128.
(ix86_avx256_split_vector_move_misalign, ix86_avx256_split_vector_move_misalign): Check optimize_insn_for_speed.
* sse.md (all uses of TARGET_PREFER_AVX128): Add optimize_insn_for_speed_p check.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/sse.md
Author: hubicka
Date: Tue Jan 20 04:39:45 2015
New Revision: 219878
URL: https://gcc.gnu.org/viewcvs?rev=219878&root=gcc&view=rev
Log:
PR lto/45375
* i386.c (ix86_option_override_internal): Use ix86_tune_cost to set branch cost.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c
Author: hubicka
Date: Tue Jan 20 19:48:59 2015
New Revision: 219909
URL: https://gcc.gnu.org/viewcvs?rev=219909&root=gcc&view=rev
Log:
PR lto/45375
* ipa-inline.c: Include lto-streamer.h
(report_inline_failed_reason): Output source file differences and flags on optimization/target node mismatch.
(can_inline_edge_p): Consider caller to be the outer inline function; be less restrictive about matching optimize and optimize_size attributes.
(inline_account_function_p): Break out from ...
(inline_small_functions): ... here.
* ipa-inline-transform.c (clone_inlined_nodes): Use inline_account_function_p.
(inline_call): Use optimize attribution; use inline_account_function_p.
(inline_transform): Use opt_for_fn.
* ipa-inline.h (inline_account_function_p): Declare.
Modified:
trunk/gcc/ChangeLog
trunk/gcc/ipa-inline-transform.c
trunk/gcc/ipa-inline.c
trunk/gcc/ipa-inline.h
Hi. Building Firefox revision:

commit a704d34fb1f9e0f5dbf4113298d885cdb650906c
Author: Matthew Noorenberghe <mozilla@noorenberghe.ca>
Date: Thu Dec 3 17:33:35 2015 -0800

    Bug 1230391 - Disable password visibility toggling in the capture doorhanger outside Nightly. rs=bnicholson, a=lizzard on a CLOSED TREE

    --HG--
    extra : source : aea828e2cdf767a358ebc6ea661dd3b9b4160321
    extra : intermediate-source : 366dd290472633b06f0942d7737c34e942e0916a

This is a minimal set of LTO options for which the built binary can run:

MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-devirtualize"

For more details:

# MYFLAGS="$OPT -march=native -flto=9" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks -fno-devirtualize -fno-strict-aliasing" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-delete-null-pointer-checks -fno-devirtualize" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-devirtualize" FAILED
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse -fno-devirtualize" OK
# MYFLAGS="$OPT -march=native -flto=9 -fno-lifetime-dse" FAILED

Martin
The devirtualization issue is now fixed, so we are down to -fno-lifetime-dse as the only remaining workaround.
A comparison of Firefox and Chromium builds with LTO for GCC 9 and GCC 10 is here: https://gist.github.com/marxin/223890df4d8d8e490b6b2918b77dacad We have a serious regression in WPA time between GCC 9 and GCC 10.
For Chromium with GCC 10, the inliner starts after ~5 minutes, so it's very likely the inliner that takes so long.
(In reply to Martin Liška from comment #221)
> For the chromium with GCC 10, inliner starts after ~5 minutes, so it's very
> likely inliner that takes so long.

45.07% libc-2.31.so [.] __memset_avx2_erms
21.79% [kernel] [k] change_protection_range
3.74% lto1 [.] fibonacci_heap<sreal, cgraph_edge>::consolidate
3.54% lto1 [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node
2.63% [kernel] [k] task_numa_work
(In reply to Martin Liška from comment #222)
> (In reply to Martin Liška from comment #221)
> > For the chromium with GCC 10, inliner starts after ~5 minutes, so it's very
> > likely inliner that takes so long.
>
> 45.07% libc-2.31.so [.] __memset_avx2_erms
> 21.79% [kernel] [k] change_protection_range
> 3.74% lto1 [.] fibonacci_heap<sreal, cgraph_edge>::consolidate
> 3.54% lto1 [.] fibonacci_heap<sreal, cgraph_edge>::extract_minimum_node
> 2.63% [kernel] [k] task_numa_work

Suggested patch for it: https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550662.html
The master branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:7f5c0f328eced560a204bb8e3eae0d45795dd235

commit r11-2338-g7f5c0f328eced560a204bb8e3eae0d45795dd235
Author: Martin Liska <mliska@suse.cz>
Date: Fri Jul 24 14:33:27 2020 +0200

    Use vec::reserve before vec_safe_grow_cleared is called

    gcc/ChangeLog:

    PR lto/45375
    * symbol-summary.h: Call vec_safe_reserve before grow is called in order to grow to a reasonable size.
    * vec.h (vec_safe_reserve): Add missing function for vl_ptr type.
The releases/gcc-10 branch has been updated by Martin Liska <marxin@gcc.gnu.org>:

https://gcc.gnu.org/g:f93ce9ea23e1806ccf9d8cd1640fc14596f54be8

commit r10-8537-gf93ce9ea23e1806ccf9d8cd1640fc14596f54be8
Author: Martin Liska <mliska@suse.cz>
Date: Fri Jul 24 14:33:27 2020 +0200

    Use vec::reserve before vec_safe_grow_cleared is called

    gcc/ChangeLog:

    PR lto/45375
    * symbol-summary.h: Call vec_safe_reserve before grow is called in order to grow to a reasonable size.
    * vec.h (vec_safe_reserve): Add missing function for vl_ptr type.

    (cherry picked from commit 7f5c0f328eced560a204bb8e3eae0d45795dd235)
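
For readers who want to see why reserving before growing matters, here is a rough C++ sketch of the idea described in the ChangeLog entry. It is not GCC code: ToyVec, reallocs_exact_growth and reallocs_reserve_first are hypothetical names made up for this illustration, and the sketch assumes that a "grow cleared" call allocates exactly the requested size while a "reserve" call grows geometrically. Under that assumption, growing a summary vector by one id at a time reallocates on every step, whereas reserving first keeps the number of reallocations logarithmic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical toy container, NOT GCC's vec<>: it only counts how often
// the backing store is reallocated under two growth policies.
struct ToyVec {
  std::unique_ptr<int[]> data;
  std::size_t length = 0;
  std::size_t capacity = 0;
  std::size_t reallocations = 0;

  // Move the contents into a new buffer of exactly new_cap elements.
  void reallocate(std::size_t new_cap) {
    std::unique_ptr<int[]> fresh(new int[new_cap]);
    if (length != 0)
      std::memcpy(fresh.get(), data.get(), length * sizeof(int));
    data = std::move(fresh);
    capacity = new_cap;
    ++reallocations;
  }

  // Ensure room for new_len elements, growing geometrically (at least 2x).
  // Assumption: this mirrors the "reasonable size" growth of a reserve call.
  void reserve(std::size_t new_len) {
    if (new_len > capacity)
      reallocate(std::max(new_len, capacity * 2));
  }

  // Grow to new_len elements and zero the new tail. If the capacity is
  // too small, allocate exactly new_len elements (assumed exact growth).
  void grow_cleared(std::size_t new_len) {
    if (new_len <= length)
      return;
    if (new_len > capacity)
      reallocate(new_len);
    assert(new_len <= capacity);
    std::memset(data.get() + length, 0, (new_len - length) * sizeof(int));
    length = new_len;
  }
};

// Grow one id at a time without reserving: one reallocation per id.
std::size_t reallocs_exact_growth(std::size_t n) {
  ToyVec v;
  for (std::size_t id = 0; id < n; ++id)
    v.grow_cleared(id + 1);
  return v.reallocations;
}

// Reserve before each grow: capacity doubles, so reallocations are rare.
std::size_t reallocs_reserve_first(std::size_t n) {
  ToyVec v;
  for (std::size_t id = 0; id < n; ++id) {
    v.reserve(id + 1);
    v.grow_cleared(id + 1);
  }
  return v.reallocations;
}
```

In this toy model, 1000 ids cost 1000 reallocations with exact growth and 11 with reserve-first (capacities 1, 2, 4, ..., 1024). The real patch may differ in detail; the sketch only shows the general effect of the two growth policies.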